Cracking the 500-Language Problem
1. PLEX program. Then our tools converted this into SDF, which was subsequently fed to a sophisticated parser generator accepting arbitrary context-free grammars. The output was:

program ::= header statement-row END PROGRAM ;

The tools we built automatically recovered most of the 3,000 production rules in an afternoon. Then we tested each sector grammar separately. We used a duplicate detector to weed out production rules that were used in more than one sector grammar, so that we could construct an overall grammar able to parse complete PLEX programs. One assembly-sector parser was hard-coded (see Figure 2), so we had to recover its grammar through reverse engineering. The comments accompanying the code contained reasonably good BNF, so we had no problem with this task. With all the sector grammars combined, we generated a parser to test it with an 8-MLOC PLEX test suite. The only files that did not parse were compiler test files that were not supposed to parse; the rest passed the test. In addition, we generated a Web-enabled version of the BNF description as a basis for a complete and correct manual.

[Syntax-diagram figure: the Cobol SEARCH statement, Format 1 (serial search), with SEARCH identifier-1, optional VARYING identifier-2 or index-name-1, and END-SEARCH around imperative-statement-1.]
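The duplicate detector and the grammar combination step can be pictured with a minimal Python sketch. This is our illustration only: it assumes production rules are represented as plain strings grouped per sector, which is not the actual Ericsson tooling or rule format.

```python
# Sketch: weed out production rules shared by several sector grammars,
# then combine the sector grammars into one overall grammar.

def find_shared_rules(sector_grammars):
    """Map each rule to the sectors it occurs in; report the rules
    occurring in more than one sector grammar."""
    seen = {}  # rule text -> list of sector names
    for sector, rules in sector_grammars.items():
        for rule in rules:
            seen.setdefault(rule, []).append(sector)
    return {rule: sectors for rule, sectors in seen.items()
            if len(sectors) > 1}

def merge_grammars(sector_grammars):
    """Combine sector grammars into one overall grammar,
    keeping a single copy of every shared rule."""
    merged = []
    for rules in sector_grammars.values():
        for rule in rules:
            if rule not in merged:
                merged.append(rule)
    return merged
```

A shared rule such as `statement ::= assignment` then appears once in the merged grammar instead of once per sector.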
2. 6th Int'l Workshop Program Comprehension, IEEE CS Press, Los Alamitos, Calif., 1998, pp. 108-117; www.cs.vu.nl/~x/ref/ref.html (current 20 Sept. 2001).
12. D. Blasband, "Parsing in a Hostile World," Proc. 8th Working Conf. Reverse Eng., IEEE CS Press, Los Alamitos, Calif., 2001, pp. 291-300.
13. J. Brunekreef and B. Diertens, "Towards a User-Controlled Software Renovation Factory," Proc. 3rd European Conf. Maintenance and Reengineering, IEEE CS Press, Los Alamitos, Calif., 1999, pp. 83-90.
14. L. Wall, T. Christiansen, and R.L. Schwartz, Programming Perl, 2nd ed., O'Reilly & Associates, Cambridge, Mass., 1996.
15. M.P.A. Sellink, H.M. Sneed, and C. Verhoef, "Restructuring of Cobol/CICS Legacy Systems," to be published in Science of Computer Programming; www.cs.vu.nl/~x/res/res.html (current 20 Sept. 2001).
16. M.P.A. Sellink and C. Verhoef, "Generation of Software Renovation Factories from Compilers," Proc. Int'l Conf. Software Maintenance, IEEE CS Press, Los Alamitos, Calif., 1999, pp. 245-255; www.cs.vu.nl/~x/com/com.html (current 20 Sept. 2001).
17. M.P.A. Sellink and C. Verhoef, "Development, Assessment, and Reengineering of Language Descriptions," Proc. 4th European Conf. Software Maintenance and Reengineering, IEEE CS Press, Los Alamitos, Calif., 2000, pp. 151-160; www.cs.vu.nl/~x/cale/cale.html (current 20 Sept. 2001).
18. R. Lämmel and C. Verhoef, "VS COBOL II Grammar Version 1.0.3," 1999;
3. In addition, because the manual contains code examples, there is a good chance that the compiler has tested these examples. This means that the manual's formal content could be of a much higher quality than you would expect from such documents. Finally, we must deal with the case in which we have no access to the compiler sources or a reference manual. Capers Jones mailed us that for a significant number of applications with Y2K problems, the compilers may no longer be available, either because the companies that wrote them have gone out of business or for other reasons. He did not come up with actual examples. Recall Wang's bankruptcy: key customers just bought the source code and hence could solve their problems using the upper half of Figure 2. Theoretically, we cannot exclude Jones's case; for instance, responding emotionally, Wang's core developers could have thrown away the sources. You can learn an important lesson from this: contracts between you and the vendor of a business-critical language should include a solution for source access in case of bankruptcy or terminated support, for example, giving the sealed source code to key customers. Summarizing, our coverage diagram shows that you can recover virtually any grammar, whether you have the compiler sources or not. But what about semantics? Some people think you need up-front, in-depth knowledge of a language's semantics to change code. If you
4. year, programming languages would be invented at the rate of one per week, or more if we consider the ones that never make it to the literature, and enormously more if we consider dialects.

Peter de Jager also helped raise awareness of the 500LP. He writes this about the availability of Y2K tools [4]:

There are close to 500 programming languages used to develop applications. Most of these conversion or inventory tools are directed toward a very small subset of those 500 languages. A majority of the tools are focused on Cobol, the most popular business programming language in the world. Very few tools, if any, have been designed to help in the area of APL or JOVIAL, for example.

If everyone were using Cobol and only a few systems were written in uncommon languages, the 500-Language Problem would not be important. So knowing the actual language distribution of installed software is useful. First, there are about 300 Cobol dialects, and each compiler product has a few versions with many patch levels. Also, Cobol often contains embedded languages such as DMS, DML, CICS, and SQL. So there is no such thing as "the Cobol language." It is a polyglot: a confusing mixture of dialects and embedded languages, a 500-Language Problem of its own. Second, according to Jones, the world's installed software is distributed by language as follows:

- Cobol: 30 percent (225 billion LOC)
- C/C++: 20 percent (180 billion LOC)
- A
5. 3. G.M. Weinberg, The Psychology of Computer Programming, Van Nostrand Reinhold, New York, 1971.
4. P. de Jager, "You've Got To Be Kidding," www.year2000.com/archive/NFkidding.html (current 20 Sept. 2001).
5. R. Lämmel and C. Verhoef, "Semi-automatic Grammar Recovery," Software: Practice and Experience, vol. 31, no. 15, Dec. 2001, pp. 1395-1438; www.cs.vu.nl/~x/ge/ge.pdf (current 20 Sept. 2001).
6. P.T. Devanbu, "GENOA: A Customizable, Front-End Retargetable Source Code Analysis Framework," ACM Trans. Software Eng. and Methodology, vol. 8, no. 2, Apr. 1999, pp. 177-212.
7. B. Hall, "Year 2000 Tools and Services," Symp./ITxpo 96: The IT Revolution Continues: Managing Diversity in the 21st Century, Gartner Group, Stamford, Conn., 1996.
8. N. Jones, Year 2000 Market Overview, tech. report, Gartner Group, Stamford, Conn., 1998.
9. M.G.J. van den Brand and E. Visser, "Generation of Formatters for Context-Free Languages," ACM Trans. Software Eng. and Methodology, vol. 5, no. 1, Jan. 1996, pp. 1-41.
10. M.G.J. van den Brand, M.P.A. Sellink, and C. Verhoef, "Generation of Components for Software Renovation Factories from Context-Free Grammars," Science of Computer Programming, vol. 36, nos. 2-3, Mar. 2000, pp. 209-266; www.cs.vu.nl/~x/scp/scp.html (current 20 Sept. 2001).
11. M.G.J. van den Brand, M.P.A. Sellink, and C. Verhoef, "Current Parsing Techniques in Software Renovation Considered Harmful," Proc.
6. compilation, to six executables (2.6 Mbytes each), it took 25 lines of code to coordinate them into a distributed, component-based software renovation factory, which then converted Cobol 85 code to Cobol 74 at a rate of 500,000 LOC per hour using 11 Sun workstations. Measuring this and other projects, it became clear to us that the total effort of writing a grammar by hand is orders of magnitude larger than constructing the renovation tools themselves. So the dominant factor in producing a renovation tool is constructing the parser. Building parsers using our approach reduces the effort to the same order of magnitude as constructing the tools. Building parsers, in turn, is not hard: use a parser generator. But the input for the generator is a grammar description, so complete and correct grammars are the most important artifacts we need to enable tool support. When we find an effective solution for producing grammars quickly for many languages, we have solved the 500LP. But how do we produce grammars quickly? For years, we and many others have been recapturing an existing language's syntax by hand: we took a huge amount of sources, manuals, books, and a parser generator and started working. But then we realized that this handwork is not necessary. Because we are dealing with existing languages, we just steal and massage the underlying grammars according to our needs.

Grammar stealing covers almost all languages
The fo
7. compilers: just tap a compiler's parser output and feed it to a modification tool. Prem Devanbu's programmable GENOA/GENII tool can turn a parser's idiosyncratic output format into a more suitable format for code analysis. There is, however, one major drawback to this approach, as Devanbu points out: the GENOA system does not allow code modification. This is not a surprise: a compiler's parser removes comments, expands macros, includes files, minimizes syntax, and thus irreversibly deforms the original source code. The intermediate format is good enough for analysis in some cases, but the code can never be turned into acceptable text format again. Hence, the approach does not help regarding mass modifications, for which the Gartner Group recommends tool support to handle a larger code volume. Obviously, this concerns Y2K and Euro conversions, code restructuring, language migrations, and so on. Another real limitation of Devanbu's approach is that even if you only want to do code analysis, you often cannot get access to a compiler company's proprietary source.

How we are cracking the 500LP
Recall that Yourdon claimed that the large number of programming languages would "virtually eliminate any chance of a single tool, method, or technique being universally applicable." Nevertheless, there is a single feasible solution for the 500LP. It is cracked when there is a cheap, r
8. such as a recursive-descent algorithm. So the quality of a hard-coded parser implementation is usually good, in which case you can easily recover the grammar from the code, the comments, or both. Except in one case, the Perl language, the quality of the code we worked with was always sufficient to recover the grammar. If the parser is not hard-coded, it is generated (the BNF branch in Figure 2), and some BNF description of it must be in the compiler source code. So, with a simple tool that parses the BNF itself, we can parse the BNF of the language that resides in the compiler in BNF notation and then extract it. When the compiler source code is not accessible, we enter the Language Reference Manual diamond in Figure 2: either a reference manual exists or not. If it is available, it could be either a compiler-vendor manual or an official language standard. The language is explained either by example, through general rules, or by both approaches. If a manual uses general rules, its quality is generally not good: reference manuals and language standards are full of errors. It is our experience that the myriad errors are repairable. As an aside, we once failed to recover a grammar from the manual of a proprietary language for which the compiler source code was also available, so this case is covered in the upper half of Figure 2. As you can see in the coverage diagram, we hav
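A "simple tool that parses the BNF itself" need not be large. The sketch below assumes one hypothetical BNF dialect of the kind found embedded in compiler sources; a real recovery effort writes one such small parser per dialect encountered.

```python
import re

# One hypothetical BNF dialect, as it might appear in compiler sources:
#   <program> ::= <header> <statement-row> END PROGRAM ;
RULE = re.compile(r'<(?P<lhs>[\w-]+)>\s*::=\s*(?P<rhs>.+)')

def extract_bnf(lines):
    """Collect (nonterminal, right-hand-side symbols) pairs from any
    line containing a rule; all other source lines are ignored.
    Everything after the first ';' is dropped, a simplification that
    assumes ';' occurs only as the rule terminator in this dialect."""
    grammar = []
    for line in lines:
        match = RULE.search(line)
        if match:
            rhs = match.group('rhs').split(';')[0].split()
            grammar.append((match.group('lhs'), rhs))
    return grammar
```

Running this over compiler source files yields the raw grammar, which can then be converted to SDF or another formalism.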
9. www.cs.vu.nl/grammars/vs-cobol-ii (current 20 Sept. 2001).
10. PROGRAMMING LANGUAGES

Cracking the 500-Language Problem

Parser implementation effort dominates the construction of software renovation tools for any of the 500 languages in use today. The authors propose a way to rapidly develop suitable parsers by stealing the grammars. They apply this approach to two nontrivial, representative

Ralf Lämmel and Chris Verhoef, Free University of Amsterdam

At least 500 programming languages and dialects are available in commercial form or in the public domain, according to Capers Jones. He also estimates that corporations have developed some 200 proprietary languages for their own use. In his 1998 book on estimating Year 2000 costs, he indicated that systems written in all 700 languages would be affected. His findings inspired many Y2K whistle-blowers to characterize this situation as a major impediment to solving the Y2K problem; this impediment became known as the 500-Language Problem (500LP). In 1998, we realized that we had discovered a breakthrough in solving the 500LP, so we had something to offer regarding the Y2K problem. We immediately informed all the relevant Y2K solution providers and people concerned with the Y2K awareness campaign. In answer to our emails, we received a boilerplate email from Ed Yourdon explaining that the 500LP was a major impediment to solving the Y2K problem, which we knew, of course. Ed was apparently so good at creat
11. a particular language, their implementation uses generic language technology: a parser is produced using a parser generator, a pretty-printer is created using a formatter generator, and tree walkers for analysis or modification are generated similarly [9]. All these generators rely heavily on the grammar. Once you have the grammar and the relevant generators, you can rapidly set up this core for developing software renovation tools. Leading Y2K companies indeed constructed generic Y2K analyzers, so that dealing with a new language would ideally reduce to constructing a parser. The bottleneck is in obtaining complete and correct grammar specifications. The longest arrow in Figure 1 expresses the current situation: it takes a lot of effort to create those grammars. Implementing a high-quality Cobol parser can take two to three years, as Vadim Maslov of Siber Systems posted on the Usenet newsgroup comp.compilers; he has constructed Cobol parsers for about 16 dialects. Adapting an existing Cobol parser to cope with new dialects easily takes three to five months. Moreover, patching existing grammars using mainstream parser technology leads to unmaintainable grammars, significantly increasing the time it takes to adapt parsers. In contrast, Table 1 lists the effort expended on various phases of a typical Cobol renovation project that used our grammar-centric solution. Notice that the grammar part of this pro
12. apid, and reliable method for producing grammars for the myriad languages in use, so that existing code can be analyzed and modified. Cheap is in the US$25,000-$50,000 range, rapid is in the two-week range for one person, and reliable means the parser based on the produced grammar can parse millions of LOC. Why is this a solution? A grammar is hardly a Euro-conversion tool or a Y2K analyzer. It is because the most dominant factor in building renovation tools is constructing the underlying parser.

From grammar to renovation tool
Renovation tools routinely comprise the following main components: preprocessors, parsers, analyzers, transformers, visualizers, pretty-printers, and postprocessors. In many cases, language-parameterized or generic tools are available to construct these components. Think of parser generators, pretty-printer generators, graph visualization packages, rewrite engines, generic dataflow analyzers, and the like. Workbenches providing this functionality include Elegant, Refine, and ASF+SDF, for instance, but there are many more. Figure 1 depicts a grammar-centric approach to enabling rapid development of renovation tools. Arrow length indicates the degree of effort involved: longer arrows imply more effort. As you can see, if you have a generic core and a grammar, it does not take much effort to construct parsers, tree walkers, pretty-printers, and so on. Although these components depend on
13. e not found low-quality language reference manuals containing general rules for cases where we did not have access to the source code. That is, to be successful, compiler vendors must provide accurate and complete documentation even though they do not give away their compilers' source code for economic reasons. We discovered that the quality of those manuals is good enough to recover the grammar. This applies not only to compiler-vendor manuals but also to all kinds of de facto and official language standards. Unusual languages rarely have high-quality manuals: either none exists (for example, if the language is proprietary) or the company has only a few customers. In the proprietary case, a company is using its in-house language and so has access to the source code; in the other case, outsiders can buy the code because its business value is not too high. For instance, when Wang went bankrupt, its key customers bought the source code for its operating system and compilers to create their own platform- and dialect-migration tools. This explains why we do not know of low-quality manuals containing general rules. In one case, that of RPG, the manual explains the language through code examples, and general rules are absent. We can examine this case in more detail if we are asked for an RPG renovation project involving a large amount of RPG code. We think we can systematically extract RPG's general rules from the code examples
14. here are about a trillion lines of installed software written in myriad languages, its solution is a step forward in managing those assets.

Solutions that don't work
A solution sometimes suggested for the 500LP is just to convert from uncommon languages to mainstream ones for which tool support is available. However, you need a full-blown tool suite, including a serious parser, to do this. And obtaining a parser is part of the 500LP. So language conversion will not eliminate the problem; on the contrary, you need a solution for the 500LP to solve conversion problems. A Usenet discussion on comp.compilers offered a second suggestion to solve the 500LP: generating grammars from the source code only, in the same way linguists try to generate a grammar from a piece of natural language. In search of solutions, we studied this idea and consulted the relevant literature. We did not find any successful effort where the linguistic approach helped to create a grammar for a parser in a cost-effective way. We concluded that the linguistic approach does not lead to useful grammar inferences from which you can build parsers.

Figure 1. Effort shift for renovation tool development. The longer the arrow, the more effort is needed. The dashed line represents the greater effort needed if the traditional approach is used.

Another, more reasonable suggestion is to reuse the parser from
15. here were BNF files and a hard-coded parser.
2. Find the majority of the grammars in some BNF dialect.
3. Find a hand-written proprietary assembly parser with BNF in the comments.
4. Write six BNF parsers, one for each BNF dialect used.
5. Extract the plain BNF from the compiler sources and convert it to another syntax definition formalism (SDF) for technical reasons.
6. Find the files containing the lexical analyzer and convert the lexical definitions to SDF.
7. Combine all the converted grammars into one overall grammar.
8. Generate an overall parser with a sophisticated parser generator.
9. Parse the code.

We recovered the PLEX grammar in two weeks, including tool construction, parser generation, and testing with 8 MLOC of PLEX code, at a cost of US$25,000. Ericsson told us that a cutting-edge reengineering company had estimated this task earlier at a few million dollars. When we contacted this company, they told us that US$25,000 was nothing for such a grammar. To illustrate the limited complexity of the work, consider the fragment of raw compiler source code in Figure 4. A PLEX program consists of a header, a list of statements, the phrase END PROGRAM, and a closing semicolon. The other code in the figure deals with semantic actions relevant to the compiler. Our tools converted this to a common BNF while removing the idiosyncratic semantic actions:

program ::= header statement-row END PROGRAM ;
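The step that removes the idiosyncratic semantic actions can be pictured with a toy transformation. The rule below uses a yacc-like notation as a stand-in for the actual PLEX compiler format, which is not reproduced in the article, and it handles only non-nested action blocks.

```python
import re

def strip_actions(rule_text):
    """Remove {...} semantic-action blocks (non-nested, as in the
    simplest yacc-like inputs) and collapse the leftover whitespace,
    leaving only the plain grammar rule."""
    no_actions = re.sub(r'\{[^{}]*\}', ' ', rule_text)
    return ' '.join(no_actions.split())
```

Applied to a rule such as `program : header statement_row 'END' 'PROGRAM' ';' { build_program($1, $2); } ;`, this leaves just the production, ready for conversion to a common BNF.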
16. illion lines of pure VS Cobol II code. As in the PLEX case, we generated a fully Web-enabled version of both the corrected BNF and the syntax diagrams that could serve as the core for a complete and correct language reference manual.

Apart from PLEX and Cobol, we have recovered several other grammars, as have others. From our efforts in solving the 500LP, we learned two interesting lessons. First, the more uncommon a language is, the more likely that you will have direct access to the compiler's source code, an excellent starting place for grammar recovery. Second, the more mainstream a language is, the more likely that you will have direct access to a reasonably good, debugged language reference, also an excellent source for grammar recovery.

Acknowledgments
Thanks to Terry Bollinger, Prem Devanbu, Capers Jones, Tom McCabe, Harry Sneed, Ed Yourdon, and the reviewers for their substantial contributions.

For more information on this or any other computing topic, please visit our Digital Library at http://computer.org/publications/dlib.

Ralf Lämmel is a lecturer at the Free University of Amsterdam and is affiliated with the Dutch Center for Mathematics and Computer Science (CWI). His research interests include program transformation and
17. ing awareness that this had backfired on him: he got 200 to 300 messages a day with Y2K questions and was no longer able to read, interpret, and answer his email other than in write-only mode. Although he presumably missed our input, his response regarding the 500LP is worth quoting:

I recognize that there is always a chance that someone will come up with a brilliant solution that everyone else has overlooked, but at this late date, I think it's highly unlikely. In particular, I think the chances of a silver-bullet solution that will solve ALL y2k problems is virtually zero. If you think you have such a solution, I have two words for you: embedded systems. If that's not enough, I have three words for you: 500 programming languages. The immense variety of programming languages (yes, there really are 500), hardware platforms, operating systems, and environmental conditions virtually eliminates any chance of a single tool, method, or technique being universally applicable.

The number 500 should be taken poetically, like the 1,000 in the preserving process for so-called 1,000-year-old eggs, which last only 100 days. For a start, we should add the 200 proprietary languages. Moreover, other estimates indicate that 700 is rather conservative: in 1971, Gerald Weinberg estimated that by the following

languages, PLEX and VS Cobol II. [displaced tail of the article's abstract]
0740-7459/01/$10.00 © 2001 IEEE
18. iver better quality.

In another two-week effort, we recovered the VS Cobol II grammar from IBM's manual VS COBOL II Reference Summary, version 1.2. For the fully recovered VS Cobol II grammar, see www.cs.vu.nl/grammars/vs-cobol-ii. Again, the process was straightforward:

1. Retrieve the online VS Cobol II manual from www.ibm.com.
2. Extract its syntax diagrams.
3. Write a parser for the syntax diagrams.
4. Extract the BNF from the diagrams.
5. Add 17 lexical rules by hand.
6. Correct the BNF using grammar transformations.
7. Generate an error-detection parser.
8. Incrementally parse 2 million lines of VS Cobol II code.
9. Reiterate steps 6 through 8 until all errors vanish.
10. Convert the BNF to SDF.
11. Generate a production parser.
12. Incrementally parse VS Cobol II code to detect ambiguities.
13. Resolve ambiguities using grammar transformations.
14. Reiterate steps 11 through 13 until you find no more ambiguities.

So, apart from some cycles to correct errors and remove ambiguities, the process is the same as in the earlier case, where we had access to the compiler source. An error-detection parser detects errors in the grammar from which it is generated. In this case, we used an inefficient top-down parser with infinite lookahead. It accepts practically all context-free grammars and does not bother with ambiguities at all. We use this kind of parse
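Steps 6 through 9 form a fixpoint loop: correct the grammar, regenerate the error-detection parser, and reparse until the testbed is clean. A schematic version, with the parser and the corrections abstracted into callables (the real tooling is of course far richer), might look like this:

```python
# Sketch of the correct-and-reparse cycle: apply grammar corrections
# one at a time, re-parsing the whole testbed after each, and stop as
# soon as every file parses.

def recovery_cycle(grammar, testbed, parse, corrections):
    """`parse(grammar, file)` reports whether `file` parses;
    each correction maps (grammar, failing files) to a fixed grammar.
    Returns the grammar once the testbed parses (or corrections
    are exhausted)."""
    for fix in corrections:
        failures = [f for f in testbed if not parse(grammar, f)]
        if not failures:
            break
        grammar = fix(grammar, failures)
    return grammar
```

Because only compilable code goes into the testbed, every parse failure in this loop points at a grammar problem rather than a code problem, which is what makes the iteration converge.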
19. ject took only two weeks of effort, so the team could start developing actual renovation tools much more quickly.

This Cobol renovation project concerned one of the world's largest financial enterprises, which needed an automatic converter from Cobol 85 back to Cobol 74 (the 8574 Project). The Cobol 85 code was machine-generated from a fourth-generation-language tool, so the problem of converting back was fortunately restricted, due to the code generator's limited vocabulary. It took some time to solve intricate problems, such as how to simulate Cobol 85 features like explicit scope terminators (END-IF, END-ADD) and how to express the INITIALIZE statement in the less rich Cobol 74 dialect. The developers discussed solutions with the customer and tested them for equivalence.

Table 1. Effort for the 8574 Project
Phase                         Effort
Extract the grammar           Two weeks
Generate the parser           One day
Build six tools               Five days
Assemble all the components   One hour
Total                         Three weeks

Once they solved these problems, implementing the components was not difficult because they had the generic core assets generated from a recovered Cobol 85 grammar. They cut the problem into six separate tools and then implemented all of them in only five days. The programming by hand was limited (fewer than 500 LOC) but compiled into about 100,000 lines of C code and 5,000 lines of makefile code linking all the generated generic renovation functionality. After
20. llowing exhaustive case distinction shows that our approach covers virtually all languages. Let's look at the coverage diagram for grammar stealing shown in Figure 2. Because the software we want to convert already exists, it can be compiled or interpreted. We first enter the Compiler Sources diamond: there are two possibilities, the source code is or is not available to you. If it is, you just have to find the part that turns the text into an intermediate form. That part contains the grammar in some form. You do this by lexically searching the compiler source code for the language's keywords.

Figure 2. Coverage diagram for grammar stealing.

Compiler constructors implement a parser in one of three ways: they hard-code it, use a parser generator, or do both (in a complex multilanguage compiler, for instance). Figure 2 shows the first two cases; the third is just a combination. If you start with a hard-coded grammar, you must reverse-engineer it from the handwritten code. Fortunately, the comments of such code often include BNF rules (Backus-Naur Forms) indicating what the grammar comprises. Moreover, because compiler construction is well understood, there is a known reference architecture: compilers are often implemented with well-known implementation al
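The lexical search for the grammar-bearing part of a compiler can be as crude as ranking source files by keyword density. The sketch below is our assumption about how such a search might be scripted, not a tool from the article; file contents are passed in as strings for simplicity.

```python
# Locate the parser among compiler sources by lexically searching for
# the subject language's keywords: the file mentioning them most often
# is the best candidate for the part that turns text into an
# intermediate form.

def rank_by_keyword_density(files, keywords):
    """Return (filename, hit count) pairs, most keyword-dense first.
    Crude substring counting: 'END-IF' also counts as an 'IF' hit,
    which is acceptable for a first scan."""
    scores = []
    for name, text in files.items():
        hits = sum(text.count(kw) for kw in keywords)
        scores.append((name, hits))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

In practice one would run such a scan over the whole compiler source tree and inspect the top-ranked files by hand.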
21. nt:

01 A PIC 9999999.
MOVE ALL "123" TO A.
DISPLAY A.

Depending on the compiler flags used to compile this code, the resulting executables display either 3123123 or 1231231. There are hundreds of such problems, so it is also infeasible to capture the semantics in advance for a single compiler. No single semantics is available, and gathering all variants is prohibitively expensive and error-prone, given the semantical differences between compilers, compiler versions, and even the compiler flags used. The good news is that you only need specific, ad hoc elements of the semantics on a per-project basis. We call this demand-driven semantics.

Figure 3. A code segment (a) transformed inappropriately into the code segment in (b). The change in line 3 results in wrong output. [The figure shows two nested-IF fragments, one using NEXT SENTENCE and the other CONTINUE; their DISPLAY statements print "Nested IF passed" and "SENTENCE passed".]

For instance, the NEXT SENTENCE phrase in Cobol directs control to the statement after the next separation period, denoted with a dot. So depending on where people put a dot, the code jum
22. ording to the compiler, more than one statement was correct, whereas the manual insisted on exactly one statement. We repaired this error with a grammar transformation. Next, in a separate phase, we removed ambiguities. For example, the following fragment of a syntax diagram is present in the Cobol CALL statement:

  identifier
  ADDRESS OF identifier
  file-name

This stack of three alternatives can lead to an ambiguity. Namely, both identifier and file-name eventually reduce to the same lexical category. So when we parsed a CALL statement without an occurrence of ADDRESS OF, the parser reported an ambiguity because the other alternatives were both valid. Without using type information, we cannot separate identifier from file-name. This is the ambiguous extracted BNF fragment:

  identifier | ADDRESS OF identifier | file-name

With a grammar transformation, we eliminated the file-name alternative, resulting in:

  identifier | ADDRESS OF identifier

The adapted grammar accepts the same language as before, but an ambiguity is gone. Note that this approach is much simpler than tweaking the parser and scanner to deal with types of names. In this way, we recovered the entire VS Cobol II grammar and tested it with all our Cobol code from earlier software renovation projects and code from colleagues who were curious about the project's outcome. For the final test, we used about two m
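Such a disambiguating grammar transformation is easy to express over a grammar represented as plain data. The representation below (a dict mapping a nonterminal to its list of alternatives) is our illustration, not the transformation framework the authors used.

```python
# The ambiguity fix as a grammar transformation: drop one of two
# alternatives that reduce to the same lexical category. The accepted
# language is unchanged because the remaining alternative already
# covers the same strings.

def remove_alternative(grammar, nonterminal, alternative):
    """Return a copy of `grammar` with one alternative removed from
    `nonterminal`; the input grammar is left untouched."""
    new = dict(grammar)
    new[nonterminal] = [alt for alt in new[nonterminal]
                        if alt != alternative]
    return new

# The ambiguous CALL-statement fragment from the recovered grammar:
call_operand = {
    "call-operand": [
        ["identifier"],
        ["ADDRESS", "OF", "identifier"],
        ["file-name"],  # reduces to the same lexeme as identifier
    ],
}
```

Applying `remove_alternative(call_operand, "call-operand", ["file-name"])` yields the unambiguous two-alternative rule shown above.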
23. programming languages. As a freelancer and consultant, he has designed, implemented, and deployed developer tools, migration tools, and software development application generators based on Cobol and relational databases. He received his PhD in computer science from the University of Rostock, Germany. Contact him at the Free Univ. of Amsterdam, De Boelelaan 1081-A, 1081 HV Amsterdam, Netherlands; ralf@cs.vu.nl; www.cs.vu.nl/~ralf.

Chris Verhoef is a computer science professor at the Free University of Amsterdam and principal external scientific advisor of the Deutsche Bank AG, New York. He is also affiliated with Carnegie Mellon University's Software Engineering Institute and has consulted for hardware companies, telecommunications companies, financial enterprises, software renovation companies, and large service providers. He is an elected Executive Board member and vice chair of conferences of the IEEE Computer Society Technical Council on Software Engineering and a distinguished speaker of the IEEE Computer Society. Contact him at the Free Univ. of Amsterdam, Dept. of Mathematics and Computer Science, De Boelelaan 1081-A, 1081 HV Amsterdam, Netherlands; x@cs.vu.nl; www.cs.vu.nl/~x.

References
1. C. Jones, Estimating Software Costs, McGraw-Hill, New York, 1998.
2. C. Jones, The Year 2000 Software Problem: Quantifying the Costs and Assessing the Consequences, Addison-Wesley, Reading, Mass., 1998.
24. ps directly behind the dot. Omitting a dot can lead to different behavior. One of our customers wanted tools to get rid of this potentially hazardous implicit jump instruction. Luckily, it turned out that for this project, we could replace the implicit jump instruction NEXT SENTENCE with the innocent no-op CONTINUE. So after our semantical investigation, we knew we could use a simple transformation tool to make this change. However, in another project, this transformation might break down, for example, if the NEXT SENTENCE phrase is used in problematic code patterns. The transformation of the code in Figure 3a into the code in Figure 3b has changed the program's meaning: you cannot turn NEXT SENTENCE into CONTINUE in the context of Figure 3. Specifically, assuming both X and Y are equal to 1, the code in Figure 3a prints SENTENCE passed, while the code in Figure 3b prints first Nested IF passed and then SENTENCE passed. As you can see, you must be utterly aware of the semantics to find out whether it is necessary to implement any of it. In most cases we have seen, implementing this type of intricate semantical issue was not necessary, but knowing about the potential problems was necessary, if only to check whether they were present. To give you an idea how far you can go with demand-driven semantics, consider this: we have developed relatively dumb tools for some Cobol systems that can wipe out comple
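A demand-driven check of this kind can stay deliberately dumb. The token-level sketch below encodes one conservative rule: replace NEXT SENTENCE by CONTINUE only when nothing but scope terminators stand between it and the sentence-ending period, because only then do "jump behind the dot" and "do nothing" coincide. The single-token spelling NEXT-SENTENCE and the tokenization itself are our simplifications; real tooling works on parse trees.

```python
# Conservative NEXT SENTENCE -> CONTINUE rewriter over a token list.

HARMLESS = {"END-IF", "END-ADD", "."}  # tokens the jump may skip safely

def safe_to_replace(tokens, i):
    """tokens[i] is 'NEXT-SENTENCE'; scan forward to the sentence dot.
    Any real statement in between would be skipped by the jump but
    executed after CONTINUE, so its presence forbids the rewrite."""
    for tok in tokens[i + 1:]:
        if tok == ".":
            return True
        if tok not in HARMLESS:
            return False
    return False  # no dot found: be conservative

def rewrite(tokens):
    return ["CONTINUE" if tok == "NEXT-SENTENCE" and safe_to_replace(tokens, i)
            else tok
            for i, tok in enumerate(tokens)]
```

On the nested-IF pattern of Figure 3, the checker refuses to rewrite, which is exactly the caution the text calls for.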
r to test the grammar, not to produce parse trees. Because we only used compilable code, all the errors that this parser detects raise potential grammar problems. In this way, we found all the omissions, given our Cobol testbed. When all our test code passed the top-down parser, we converted the grammar to SDF, generated a parser that detects ambiguities, and corrected them. This project also took two weeks of effort, including tool construction and testing. We did this for free so that we could freely publish the grammar on the Internet as a gift for Cobol's 40th birthday. To get an idea of the limited complexity of this technique, consider the syntax diagram shown in Figure 5a, taken from the manual. After conversion to BNF and correction, the diagram looks like the one in Figure 5b. We used grammar transformations to remove the dash between NEXT and SENTENCE and to replace the two occurrences of imperative-statement with statement-list. The diagram was overly restrictive, allowing only one statement. However, from the manual's informal text we learned: "A series of imperative statements can be specified whenever an imperative statement is allowed." Our error-detection parser found these errors: first, the tool flagged code where it found NEXT SENTENCE, that is, without a dash. After inspecting the manual and grammar, we wrote a grammar transformation repairing this error. The error-detection parser also found that acc
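The two repairs described above can be sketched as a toy grammar transformation over a textual BNF rule. This is a Python illustration only: the rule text below paraphrases Figure 5a, and neither the rule format nor the function is the actual SDF tooling.

```python
# A paraphrase of the Figure 5a rule, as a plain BNF-like string (assumed
# notation, for illustration only).
rule_5a = (
    "search-statement : SEARCH identifier "
    "( AT END imperative-statement )? "
    "( WHEN condition ( imperative-statement | NEXT-SENTENCE ) )+ "
    "END-SEARCH"
)

def repair(bnf):
    # Transformation 1: remove the dash between NEXT and SENTENCE.
    bnf = bnf.replace("NEXT-SENTENCE", "NEXT SENTENCE")
    # Transformation 2: a series of statements is allowed wherever one
    # imperative statement was, so generalize the nonterminal.
    bnf = bnf.replace("imperative-statement", "statement-list")
    return bnf

print(repair(rule_5a))
```

Real grammar transformations operate on structured grammar representations rather than strings, but the effect, a small, repeatable, automated rewrite of the language definition, is the same.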
recover the BNF, you can generate a syntax analyzer that produces trees, but the trees are not decorated with extra knowledge such as control flow, data flow, type annotation, name resolution, and so on. Some people also think you need a lot of semantical knowledge to analyze and modify existing software, but this is not true. You can try to capture a language's semantical knowledge on three levels:

- for all the compilers of a language (different dialects),
- for one compiler product, or
- on a project-by-project basis.

Because we are trying to facilitate the construction of tools that work on existing software, there is already a compiler. This has implications for dealing with the semantical knowledge. Consider the following Cobol excerpt:

    A PIC X(5) RIGHT JUSTIFIED VALUE "IEEE".
    DISPLAY A.

The OS/VS Cobol compiled code prints the expected result, namely IEEE, which is right-justified. However, because of a change in the 1974 standard, the same code compiled with a Cobol/370 compiler displays the output IEEE with a trailing space, which is left-justified. This is because the RIGHT JUSTIFIED phrase does not affect VALUE clauses in the case of the Cobol/370 compiler. There are many more such cases, so trying to deal with the semantics of all compilers in advance is not feasible. Even when you restrict yourself to one compiler, this problem does not go away. Consider this Cobol fragme
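The dialect difference for this excerpt comes down to two padding behaviors for a five-character field holding the four-character value IEEE. A minimal Python sketch (the function names are ours; the field width follows the PIC X(5) declaration above):

```python
def os_vs_initial(value, width=5):
    # OS/VS Cobol: RIGHT JUSTIFIED also applies to the VALUE clause,
    # so the initial value is padded on the left.
    return value.rjust(width)

def cobol_370_initial(value, width=5):
    # Cobol/370 (after the 1974 standard change): RIGHT JUSTIFIED does
    # not affect the VALUE clause, so the initial value is left-justified.
    return value.ljust(width)

print(repr(os_vs_initial("IEEE")))      # ' IEEE'  (right-justified)
print(repr(cobol_370_initial("IEEE")))  # 'IEEE '  (trailing space)
```

One source line, two compilers, two observable behaviors: this is why compiler semantics cannot be fixed once and for all in a tool.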
rs in severely limited formats. With powerful parsing technology, you can work with arbitrary context-free grammars and test them regardless of their format. Without automated testing, you cannot find many errors quickly. Without tool support to transform grammar specifications, analyses are inaccurate and corrections are inconsistent; without transformations, you cannot repeat what you have done or change initial decisions easily. So, to steal grammars, you need to know about grammars, powerful parsing techniques, how to set up testing, and automated transformations.

[Figure 4. Raw compiler source code for the PLEX language: the BNF rule <plex program> ::= <program header> <statement row> END PROGRAM, interleaved with semantic-action code such as STATEMENT_ROW, PROGRAM_HEADER, stat_list, as_prog_stat, ix_stat_list_p, and ix_sect_node_p.]

Stealing from compiler source code

Ericsson uses the extremely complex proprietary language PLEX to program public telephone switches. PLEX consists of about 20 sublanguages, called sectors, including high-level programming sectors, assembly sectors, finite-state machine sectors, marshaling sectors, and others. We applied our grammar-stealing approach to PLEX as follows:

1. Reverse-engineer the PLEX compiler (63 Mbytes of source code) on site to look for grammar-related files. We learned that t
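Extracting a grammar from compiler source like that in Figure 4 amounts to recognizing the rule notation and discarding the interleaved semantic actions. A minimal sketch in Python; only the <plex program> rule comes from the figure, and the exact notation of the sample input is an assumption for illustration:

```python
import re

# A fragment shaped like raw compiler source: BNF productions interleaved
# with semantic-action code (the action lines are made up for this sketch).
compiler_source = """
<plex program> ::= <program header> <statement row> END PROGRAM
    as_prog_stat(ix_stat_list_p);
<statement row> ::= <statement row> <statement>
    ix_sect_node_p();
"""

# A production starts at column 0 with a bracketed nonterminal and '::='.
RULE = re.compile(r"^(<[^>]+>)\s*::=\s*(.+)$", re.MULTILINE)

def extract_rules(src):
    """Keep the BNF productions; drop the semantic-action lines."""
    return [(lhs, rhs.strip()) for lhs, rhs in RULE.findall(src)]

for lhs, rhs in extract_rules(compiler_source):
    print(lhs, "->", rhs)
```

A real extractor must of course follow the compiler's actual rule notation, but the principle scales: it is precisely how thousands of production rules can be recovered in an afternoon.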
ssembler: 10 percent (140 to 220 billion LOC);
- less common languages: 40 percent (280 billion LOC).

In contrast, there were Y2K search engines for only about 50 languages and automated Y2K repair engines for about 10 languages. Thus, most languages had no automated modification support, clarifying the concerns of Jones, Yourdon, McCabe, de Jager, and others. These alarming figures underscored the 500LP's importance.

What is the 500-Language Problem?

We entered the new millennium without much trouble, so you might conclude that whatever the 500LP was, it is not relevant now. Of course, the problem existed before the Y2K gurus popularized it, and it has not gone away. Why is the problem still relevant? If you want tools to accurately probe and manipulate source code, you must first convert the code from text format to tree format. To do this, you need a so-called syntactic analyzer, or parser. But constructing a parser is a major effort, and the large up-front investment hampers initiatives for many commercial tool builders. Indeed, a few years ago Tom McCabe told us that his company, McCabe & Associates, had made a huge investment in developing parsers for 23 languages. Noting that 500 would be insurmountable, he dubbed this the number-one problem in software renovation. Thus, the 500LP is the most prominent impediment to constructing tools to analyze and modify existing software assets. Because t
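The text-to-tree step described above is exactly what a parser front end provides. For a language that already ships with one, the step is trivial, as this Python sketch using Python's own ast module shows; for the hundreds of legacy languages behind the 500LP, it is precisely this off-the-shelf support that is missing:

```python
import ast

# Text -> tree, using a parser that already exists for this language.
source = "TOTAL = PRICE + TAX"
tree = ast.parse(source)

# Probe the tree: collect every variable name occurring in it. Analyses
# like this are easy once the code is a tree, and hopeless on raw text.
names = sorted({node.id for node in ast.walk(tree)
                if isinstance(node, ast.Name)})
print(names)  # ['PRICE', 'TAX', 'TOTAL']
```

Building such a front end for one legacy dialect is the "major effort" the article refers to; multiplied by 500 languages, it becomes the impediment McCabe named.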
[Figure 5. (a) The original syntax diagram for the SEARCH statement, whose WHEN branch reads WHEN condition-1 {imperative-statement-2 | NEXT SENTENCE} END-SEARCH (Note: END-SEARCH with NEXT SENTENCE is an IBM extension); (b) the same diagram after conversion to BNF and correction: search-statement ::= SEARCH identifier [VARYING (identifier | index-name)] [AT END statement-list] {WHEN condition (statement-list | NEXT SENTENCE)} [END-SEARCH].]

Stealing from reference manuals

Some of our colleagues felt a little fooled by the PLEX result: "You are not really constructing a parser; you only converted an existing one. We can do that too. Now try it without the compiler." Indeed, at first sight, not having this valuable knowledge source available seemed to make the work more difficult. After all, an earlier effort to recover the PLEX grammar from various online manuals had failed: they were not good enough for reconstructing the language [7]. Later, we discovered that the manuals lacked over half of the language definition, so the recovery process had to be incomplete by definition. We also found that our failure was due not to our tools but to the nature of proprietary manuals: if the language's audience is limited, major omissions can go unnoticed for a long time. When there is a large customer base, the language vendor has to del
x GO TO logic. You do need to know the semantics for many different tasks, but it is not necessary to encode the compiler semantics in advance, in a parse tree or otherwise. So a tool developer can construct mostly syntactic tools, taking semantical knowledge into account on a per-project basis.

Grammar stealing in practice

We and others from industry and academia have applied grammar stealing successfully to a number of languages, including Java, PL/I, Ericsson PLEX, C, Ada 95, VS Cobol II, AT&T SDL, Swift messages, and more. Here we focus on PLEX (Programming Language for Exchanges), a proprietary, nontrivial, real-time embedded-system language for which the compiler source code was accessible to us, and, at the other end of the gamut, VS Cobol II, a well-known business language for which no compiler code was available. Both languages are used in business-critical systems: the AXE 10 public branch exchange uses PLEX, and numerous IBM mainframe systems run VS Cobol II. These two languages represent the two main branches in Figure 2. Our approach uses a unique combination of powerful techniques: automated grammar extraction, sophisticated parsing, automated testing, and automated grammar transformation. If one of these ingredients is missing, the synergy is gone. Extraction by hand is error-prone, and basic parsing technology limits you to work with gramma