Home
Language development case study (Pdf, 133 pages)
Contents
1. b b INTEGER 1 Figure 8 4 minitree compound statement its full derivation tree and reduced derivation tree minitree intermediate form 89 1 int a 1 b 1 2 3 if a 1 then a 0 4 if gt b then 0 else 1 program int if tf RE E 54 x INTEGER 1 INTEGER 1 INTEGER 1 a INTEGER 0 a b a INTEGER 0 a INTEGER 1 statement 9 O O 9 O O x Y Y x INTEGER 1 1 INTEGER 1 5 a b INTEGER 0 Figure 8 5 minitree if statement its full derivation tree and reduced derivation tree 90 MINITREE MULTIPLE PASS COMPILER 1 int a 1 b 10 2 3 while a lt b do 4 begin 5 print a 6 1 T3 end program tint while zg 1 INTEGER 10
2. INTEGER 3 INTEGER 4 Figure 8 1 minitree declaration its full derivation tree and a reduced deriva tion tree 86 A MULTIPLE PASS COMPILER 1 inta b 2 3 a 3 2 Cprogram gt int a b a D INTEGER 2 zi Pm b INTEGER 3 statement INTEGER 2 a b INTEGER 3 Figure 8 2 A minitree expression its full derivation tree and a reduced deriva tion tree minitree intermediate form 87 1 int a 2 3 print a is a n Program tint print Figure 8 3 minitree print statement its full derivation tree and a reduced derivation tree 88 A MULTIPLE PASS COMPILER m m b 1 int a b 2 3 begin 4 5 b2b 1 6 end int begin a b b INTEGER 1 begin statement
3. result result result result lt result gt result result result result result result eO result e10 right result lt gt The mvmasm grammar result gt right result lt right result gt right result lt right result right result right result right result right right right right right right right right amp amp right operators and long int data s s askak ak3k akak kx temp mvmasm cast symbol lookup key mvmasm name NULL if temp NULL 1 if rdp pass 3 text message TEXT ERROR ECHO Undefined symbol s n result 0 else result temp gt val 11 INTEGER result result FALSE result 1 0 C ei result 2 COMMENT LINE STRING ESC result 59 Greater than Less than Greater than or equal to Less than or equal to Equal to Not equal to Logical inclusive OR Logical AND Bitwise exclusive OR Bitwise inclusive OR Bitwise AND x Shift left Shift right Add Subtract Divide
4. RP RP ERR ADA A AF PPP 00 9 CO wo CO 44 45 46 47 CAN Da FF 0l Hn w Implementing 1 81 7 8 eee fe kk koe EEE koe EEEE EE EEE fe ok RDP release 1 50 by Adrian Johnstone A Johnstone rhbnc ac uk 20 December 1997 ml_aux c miniloop one pass compiler semantic routines This file may be freely distributed Please mail improvements to the author X X EEEE E EEEE EEEE EEEE EE EEE EEE EEEE EEE EEEE EEEE EE EEEE EE EEEE EEEE fe include lt stdarg h gt include lt stdio h gt include lt string h gt include textio h include memalloc h include ml_aux h FILE outfile static long unsigned temp_count 0 int emitf const char fmt d int i va list ap argument list walker va start ap fmt pass parameters to vprintf 1 vfprintf outfile fmt ap remember count of characaters printed va end ap end of var args block return i return number of characters printed void emit open char sourcefilename char outfilename d if outfile fopen outfilename w NULL text message TEXT FATAL unable to open output file s
5. is INTEGER 1 statement 6 INTEGER 1 H INTEGER 10 a Figure 8 6 minitree while statement its full derivation tree and a reduced derivation tree statement Statement statement mw ren INTEGER 8 2 Implementing minitree 91 taz Trees AST s There is little agreement on the formal definition of these objects but broadly speaking a Concrete Syntax Tree is either a full derivation tree or a parse tree made up of just the terminal nodes whilst an AST is usually a tree made up of some of the terminals reason for the distinction between concrete and abstract forms is that some terminals in real programming languages are just there to make the pro gram more readable so called syntactic sugar and some are only there to represent the two dimensional nature of programs The concrete form includes all such terminals but they may be dropped in the abstract forms An example of the first case is the parentheses that appear around the con ditional expression in the ANSI C if else and while do statements it is perfectly straightforward to write an unambiguous grammar that does not include the
6. Figure 7 4 Flow of control through compiled if then else statement 7 6 Compiling if statements if then else statement defines three blocks of code a relational expres sion then block and an else block Figure 7 4 illustrates the code template used by miniloop and the allowed forms of control flow through the construct miniloop maintains an internal label counter which is advanced each time a new unique label is required Such labels are needed for labelling the strings used when assembling print statements containing string parameters and whenever a structured statement such as if then or while do is encountered In the case of an if then else statement the start of the statement corresponding to the first assembler instruction in the compiled version of the relational expres sion is labeled __IF_n where n is the current value of the label counter n is called the number of the control statement Similarly the end of the statement is labeled __FI_n and the start of the else block is labeled with __ELSE_n relational expression is compiled first yielding temporary variable which will contain a zero if the expression evaluates to FALSE and a one other wise The compiler then issues the assembler instruction BEQ __temp t ELSE n ifn temp t go to ELSE n gt where is the number of the temporary containing the result of evaluating the relational expression and n is the number of the control
7. If we extend the definition of rule operand we can similarly allow named variables and non integer literal operands such as REALs Deeper problems arise when we consider the evaluation of an expression The string fred 1 2 3 is a legal minicalc statement but what value should actually be assigned to fred We could begin evaluating from the left effectively carrying out the steps fred 1 fred fred 2 fred fred 3 or from the right effectively carrying out the steps fred 3 fred 2 fred fred 1 fred In the first case fred ends up with value 4 and in the second case with value 2 usual convention is to evaluate subtractions from left to right i e using the first of the two choices above and this is called left associativity The pro grammer may wish to override this and other order of evaluation conventions and traditionally this is done using parentheses The following grammar allows expressions of the form 1 2 3 and 3 x y 2 expression monadic_op operand 4 diadic op monadic_op operand operand INTEGER ID expression monadic op 2 2 diadic op gt x Note that we have introduced recursion into the grammar the operand rule accepts a left parenthesis after which it calls the expression rule before accepting a matching closing parenthesis This nested structure incorporates the notion of do first
8. void this_edge graph_get_next_edge root while this_edge NULL walk children printing results rdp_tree_data this node rdp_tree_data graph get edge target this edge if this node token RDP T 18 x emit print S this gt 14 else emit print I expression walk this node this edge graph get next edge this edge break case RDP T if 4 char relation rdp tree data rel stat rdp tree data graph get edge target this edge then stat rdp tree data graph get edge target graph get next edge this edge else stat rdp tree data graph get edge target graph get next edge graph get next edge this edge integer label new label emitf 41lu WMn label relation expression walk rel stat emitf BEQ 4s ELSE luNt ifn s go to tree walk then stat emitf BRA __FI_ lu t go to tree walk else stat emitf FI 4lu WMn label break ELSE Alu Wn relation label relation label _FI_ 1lu n__ELSE_ 1lu n label label label Figure 8 12 ninitree auxiliary functions part 4 print and if 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 98 MINITREE A MULTIPLE PASS COMPILER case RDP T while 1 char relation rdp tree data rel stat rdp tree data
9. FOR ANSI C 9 5 2 pretty print function pretty print function is called after each lexeme read by the parser It is defined in Figures 9 6 and 9 7 lines 90 176 internal state of the pretty printer is maintained between calls in four static integer variables 1 last kind contains the token kind of the lexeme processed the previ ous call to pretty print 2 indentation contains the current indentation level 3 temporary indent is a boolean flag that is set after an indenting keyword such as if or while is seen and 4 printed contains the number of characters output since the last new line character or equivalently the current output column number In lines 97 and 98 the global variables lexeme count and last line are updated These variables are used within function pretty close to output the number of lines number of lexemes and the average lexeme per line count at the end of a run indentation of lines is performed by the code in lines 105 127 which is only executed if the last token seen was a new line so as to ensure that inden tation is only performed at the beginning of a line Lines 109 110 temporarily increment the indentation level if the temporary indent flag is set and we are not processing a new block temporary increment is reset in lines 123 124 Line 113 detects comments that start in column 1 and suppresses their in dentation Lines 114 121 output tab characte
10. output addressing modes if opers gt 0 emit2 oper1 if opers gt 1 emit2 oper2 if opers gt 2 emit2 oper3 Figure 6 6 mvmasm auxiliary functions part 2 main output routines 64 MVMASM ASSEMBLER FOR MVM 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 void emiti unsigned long val 1 emitted emitf 4 21X val location void emit2 unsigned long val emitted emitf 41X val location 2 void current_label void check that there is a valid label on this line if last_label NULL text_message TEXT_ERROR_ECHO Missing label on directive n return amp dummy_label 1 return last label void init char outputfilename 1 if outputfilename objfile stdout else if objfile fopen outputfilename w NULL text message TEXT FATAL Unable to open object file int quit char outputfilename 1 fclose objfile text message TEXT INFO Transfer address 4 91X Wn transfer if execute sim outputfilename 1 define COMMAND mvmsim t v char command char mem 11 1 strlen outputfilename strlen COMMAND
11. 58 2 Subtract 59 60 61 e2 e3 4 29727 Multiply 62 gt gt 777 Divide 63 i 64 65 e3 e4 66 997 e3 Posite note suppression from intermediate form 67 e3 Negate 68 69 70 e4 eb 4 71 72 eb ID name check declared Variable access 73 INTEGER Numeric literal 74 Q e177 7 Parenthesised expression 75 76 comment COMMENT NEST x Comments stripped by lexer 77 String STRING ESC 9459 Strings for print 78 79 End of minitree bnf Figure 8 8 An BNF specification for minitree part 2 expressions f kc kk ke kk koe ke fe ke koe hee je ke koe ke fk khe fe ok ke koe RDP release 1 50 by Adrian Johnstone A Johnstone rhbnc ac uk 20 December 1997 mt aux c Minitree multiple pass compiler semantic routines This file may be freely distributed Please mail improvements to the author CAN Da FF X X EEEE EEEE E EEEE EEEE EE EEE EEEE EEEE EEE EEE 171 10 include lt stdarg h gt 11 include lt stdio h gt 12 include lt string h gt 13 include graph h 14 include memalloc h 15 include textio h 16 include minitree h 17 include ml aux h 18 include mt aux h Figure 8 9 ninitree auxiliary functions part 1
12. rdp make file contains the commands for constructing a syntax checker from mini syn bnf and running it on a test file called testcalc m These commands will be executed if you type make ms test minicalc syntax checker 15 1 Erroneous minicalc input 2 3 int 4 Error 1 error m Scanned ID whilst expecting 2 5 innt b 3 should be int b 1 6 7 b a 2 a used before being initialised 8 9 Error 1 error m Scanned whilst expecting one of ID INTEGER 99 9 3 4 should be a 3 4 x 9 1 10 11 b 3 undeclared variable 12 2 errors and warnings Fatal errors detected in source file Figure 2 3 Error reporting in the syntax checker This make file target is automatically built as part of the standard installation so if you have already built rdp using the make file then the already compiled syntax checker will simply be run on the test file Figure 2 3 shows the output of the syntax checker for an erroneous program and illustrates the syntax checker s limitations The misspelling of int in line 5 and the incorrect orderings of the arithmetic operators in line 9 have been detected but the use of an uninitialised variable in line 7 and the assignment to an undeclared variable in line 11 have been ignored These kinds of errors can only be detected by c
13. switch root gt token 1 case 0 scan root or begin node s children case RDP T begin 1 void this edge graph get next edge root while this edge NULL walk children printing results 1 tree walk rdp tree data graph get edge target this edge this edge graph get next edge this edge break case RDP T 31 emit CPY rdp tree data graph get edge target this edge id expression walk rdp tree data graph get edge target graph next edge this edge NULL break case RDP T int 1 void this edge graph get next edge root while this edge NULL walk children declaring each variable 1 void child edge rdp tree data this node rdp tree data graph get edge target this edge emitf Nn DATA n s WORD 1 CODE n this node id if 114 edge graph get next edge this node NULL emit CPY this node id expression walk rdp tree data graph get edge target child edge NULL this edge graph get next edge this edge break Figure 8 11 minitree auxiliary functions part 3 program assignment and declaration 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 minitree auxiliary functions 97 case RDP print 4
14. Irdp_supp c rdpi c rdpi c cc erdpi exe rdpi obj obj arg obj graph obj memalloc obj scan obj scanner obj set obj symbol obj textio obj copy rdp1 c rdp2 c rdpi F 126 ACQUIRING AND INSTALLING RDP fc rdpi c rdp2 c Comparing files rdpi c and rdp2 c rdpi c Parser generated by RDP on Dec 20 1997 21 05 05 from rdp bnf rdp2 c Parser generated by RDP on Dec 20 1997 21 05 02 from rdp bnf Bibliography J 97a JS97b JS97c PD82a PD82b Adrian Johnstone and Elizabeth Scott rdp a recursive descent com piler compiler user manual for version 1 5 Technical Report TR 97 25 Royal Holloway University of London Computer Science Depart ment December 1997 Adrian Johnstone and Elizabeth Scott rdp_supp support routines for the rdp compiler compiler version 1 5 Technical Report TR 97 26 Royal Holloway University of London Computer Science Department December 1997 Adrian Johnstone and Elizabeth Scott A tutorial guide to rdp for new users Technical Report TR 97 24 Royal Holloway University of London Computer Science Department December 1997 S Pemberton and M C Daniels Pascal implementation compiler and assembler interpreter Ellis Horwood 1982 S Pemberton and M C Daniels Pascal implementation the P4 com piler Ellis Horwood 1982
15. for writing n outfilename emitf 4s generated from s n n outfilename sourcefilename emitf DATA 0x8000 n__MPP_DATA n CODE 0x1000 n__MPP_CODE n void emit_close void d emitf NXn HALT n n DATA n__temp BLOCKW 41 declare array of temporaries n n END __MPP_CODE n temp count fclose outfile Figure 7 11 miniloop auxiliary functions part 1 low level output 82 SINGLE PASS COMPILER FOR MINILOOP 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 void emit char asm op char alg op char dst char src i char src2 1 emitf 4s 4s As asm op dst 1 if src2 NULL emitf 4s src2 Wow output algebraic style emitf t s Ws 4s dst alg op if src2 NULL emitf 4s src2 emitf Nn void emit print char kind char src 1 if kind 25 1 unsigned long label label emitf Nn DATA n__STR_ lu STRING label text print C string file outfile src emitf n n CODE n PRTS __STR_ lu n label 1 1 emitf text print C string file outfile src emitf t print integer n char temporary void char ret char mem malloc 30 sprintf ret temp
16. 48 49 50 51 52 53 54 CAN Da FF A grammar for minicond 25 sk ke kk ke ke of fe khe he ok fe kk he ka ae RDP release 1 50 by Adrian Johnstone A Johnstone rhbnc ac uk 20 December 1997 minicond bnf a decorated mini conditional grammar with interpreter semantics This file may be freely distributed Please mail improvements to the author X X ke 2k ke oe ok oe e ke ke 2k ke ke 2k ok fe 2k 2k TITLE Minicond interpreter V1 50 c Adrian Johnstone 1997 SUFFIX USES math h SYMBOL TABLE minicond 101 31 symbol compare string symbol hash string symbol print string char id integer i program var_dec 1 statement 1 2 05 semantic rules implemented as macros in the C code insert id if interpret symbol insert key minicond amp id sizeof char sizeof minicond_data 1 lookup id ret 1 void sym symbol lookup key minicond amp id NULL if sym NULL not found x 1 text message TEXT ERROR ECHO Undeclared variable s n id sym symbol insert key minicond amp id sizeof char sizeof minicond data ret minicond_cast gt _update id val if
17. Posite Negate Bitwise complement x Logical not integer pow double result double right name Variable Numeric literal Logical TRUE Logical FALSE Parenthesised expression Figure 6 3 An rdp BNF specification for mvmasm part 3 expressions 60 MVMASM ASSEMBLER FOR MVM 1 ke sek ke ke ek ok joke ek koe je ke koe ke ke ek ek ke fe 2 3 RDP release 1 50 by Adrian Johnstone A Johnstone rhbnc ac uk 20 December 1997 4 mvm def h Mini Virtual Machine opcode definitions 6 7 This file may be freely distributed Please mail improvements to the author 8 se fe fe ak 3k kk ese oe ke koe kk ese ek koe ff ke ek fe ok ke hee je ke koe kk e fk khe kk eek khe ke ae fe fe oe ae ke ke 10 enum opcodes 0P_HALT ADD OP SUB MUL DIV OP EXP 11 EQ NE GT OP GE LT LE 12 UP CPY 13 BNE BEQ 14 PRTS Figure 6 4 opcode definitions minicond grammars The alphanumeric label indentifier is stored in the char field id and the value of the label is held in the va1 field 6 4 3 main mvmasm grammar top level rule unit accepts zero or more lines of assembler source unit itself is activated three times once for each pass over the source rdp
18. label if eO left emitf BEQ s __ELSE_ lu t ifn 4 go to __ELSE_ lu n left label left label then statement emitf BRA __FI_ 1u t go to else statement 1 emitf __FI_ lu n label _FI_ 1lu n__ELSE_ lu n label label label Figure 7 9 An rdp BNF specification for miniloop part 1 statements 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 Implementing miniloop 79 integer label new_label while do statement emitf __DO_ lu n label while eO left emitf BEQ s __OD_ lu t ifn 4s go to gt 4 statement emitf BRA UD Alu n left label left label _DO_ 1u t go to __DO_ 1u n__OD_ 1lu n label label label print eO left emit print I left String left emit 25 left 202 99 print statement begin statement compound statement eO char char dst ei left dst new_temporary C el right emit GT gt dst left right Greater than 95 el right emit LT lt dst left right Less than gt gt el right emit GE gt ds
19. 1 R C 32 C ABC D K _ M BOT 33 KR 0 _ E I 0 R CUS 34 C A A I D Y N N A E 35 L 0 AE EWD IAC S TR 36 0P TK EDOULOE TDK SII 37 S E E EW ILI RN EI 38 E NR T TC NI MD T HMCT RN 6 39 40 BLOCK CLOSE 10 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 41 BLOCK OPEN 10 0 0 0 0 1 0 0 0 0 0 0 0 0 0 O 42 CHARACTER 10 0 0 0 0 1 0 0 0 0 0 0 0 0 0 O 43 CLOSE BRACKET 10 0 0 0 0 0 0 O 1 1 1 0 O O 0 O 44 COMMENT 11 1 1 0 1 1 0 1 1 1 1 1 0 1 1 1 45 DIADIC 11 1 1 0 1 1 0 1 1 1 1 1 0 1 1 1 46 EULN 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 O 47 FIELD DELIM 10 0 0 0 0 1 0 0 0 0 0 0 0 0 0 O 48 KEYWORD 10 0 0 0 0 1 0 0 1 1 1 0 0 0 0 0 49 KEYWORD INDENT 0 0 0 0 0 1 0 0 1 1 1 0 1 0 0 0 0 0 0 0 0 1 0 0 1 1 1 0 0 O 0 1 51 MONADIC 1 1 1 0 1 1 0 O 1 1 O 1 O 1 O 1 52 PEN BRACKET 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 O 53 PREPROCSSOR 10 0 0 0 0 1 0 0 0 0 O 1 0 0 0 1 54 PUNCTUATION 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 55 STRING 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 56 57 Figure 9 4 Pretty printer aux
20. 1 decrement temp by 1 SUB temp temp decrement temp by the value of x SUB temp temp x decrement temp by the value of the contents of memory location with address x 52 MVMASM ASSEMBLER FOR MVM Priority Greater than Less than Greater than or equal to Less than or equal to Equal to Not equal to Shift left Divide Monadic negate Monadic posite Bitwise complement Logical not Pror 1 1 1 1 7 8 Table 6 1 Operator priorities in mvmasm Assembler syntax reference 53 CPY temp 300 6 2 assign 612 to temp HALT terminate execution BEQ temp start If temp is zero branch to start otherwise continue execution with the next instruction 6 2 5 Directives File inclusion INCLUDE filename Open filename for reading at this point continuing to read the parent file after the end of filename has been reached This directive works just like the include preprocessor directive ANSI C and the INCLUDE directive in rdp s IBNF source language Assembly pointer manipulation CODE optional ezpression Use the CODE pointer for subsequent assembly If the optional ezpression is present set the CODE pointer to its value oth erwise carry on assembling starting at the most recent value of the CODE pointer The CODE pointer is initialised to zero when the assembler starts DATA optional expression Use the DATA pointer for subsequent assembly If the optional ezpression
21. 1000 switch to assembling code at address 1000 hex 1000 6 start 1000 0C010004007B 7 temp 123 1 temp with decimal 123 1006 101100000004 8 temp print the value of temp as integer 100C 0011 9 HALT terminate the simulator 100E 10 100 1000 11 END start transfer address is code start Transfer address 00001000 Calling simulator mvmsim t v mvmasm out mvmsim vi 5 simulator for 1000 CPY 0004 007B 0000 gt 007B 1006 0000 007 0000 gt 0000 123 100 HALT 0000 0000 0000 gt 0000 Halted 0 030 CPU seconds used 3 instructions executed errors and warnings Assembler syntax reference In this section we describe the features of the mvmasm assembler in terms of the lexical structure the available arithmetic operators the directives and the machine instructions The implementation of mvmasm as an rdp translator spec ification is described in the next section Large examples of mvmasm code which exercise most of the features may be found in the following chapters which describe compilers that translate to mvmasm 6 2 1 Line oriented and free format languages mvmasm syntax follows the tradition of assemblers in being line oriented with only one statement allowed per line Early high level programming languages were line oriented in this way but most programming languages designed since the early 1960
22. 4 ELSE 7 ifn temp 4 go to ELSE 7 80 81 DATA 7420657175616C73 82 STR 8 STRING 2 equals 83 84 CODE 0 010000802 85 PRIS STR 8 0E0110A00000 86 BRA __FI_7 go to __FI_7 87 __ELSE_7 88 89 DATA 7A20646F6573206E 90 __STR_9 STRING z does not equal a n 91 92 CODE OF0100008038 93 PRTS __STR_9 94 __FI_7 0210807980000003 95 SUB __temp 5 3 temp 5 3 0 1180248079 96 2 temp 5 Z temp 5 97 IF 10 0611807480248000 98 EQ _ temp 6 2 __temp 6 z 0 1110 8807 99 __temp 6 ELSE 10 ifn temp 6 go to ELSE 10 100 Figure 7 7 miniloop compiled output for the example program part 2 10 804 8058 8058 10 10C2 10C8 10C8 10C8 8058 806C 806C 10C8 10CE 10CE 10D4 10D4 10DC 10E2 10E2 806C 8072 8072 10E2 10 8 10 10 8072 8074 8074 10 10 4 10 1102 1108 1108 1108 110A 110A 8074 8086 8086 7A20657175616C73 0 010000804 0 0110 0000 7A20646F6573206E 0 0100008058 0C0180000003 0810807B80000000 0E111108807B 612069732000 0 010000806 101100008000 0400 0 0100008072 0210807 80000001 0C118000807C 0E0110D40000 0011 1000 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 1
23. 40 5 7 THE MINI VIRTUAL MACHINE MVM also perform branch calculations automatically so that instructions can be re ferred to by a symbolic label rather than by their numeric address Assemblers do not make machine level programming easy but they do free the programmer from a great deal of bookkeeping work Assemblers are in themselves examples of an interesting class of translator We shall describe the implementation of an assembler for MVM called mvmasm in the next chapter We complete this chapter with the description of a simulator mvsim for MVM code mvmsim a simulator for MVM byte codes The MVM instruction set is designed to be efficiently implemented as a simu lator In this section we look at the design and use of such a simulator 5 7 1 Using mvmsim The function of the simulator is two fold firstly it provides a concrete model of the behaviour of an MVM processor and secondly it allows instruction ex ecution to be traced by printing out each instruction as it is executed The mvmsim executable is built as part of the standard rdp installation so if you have already run make on the supplied makefile you should have a working simulator To check type mvmsim at the command line prompt You should see the following output Fatal No source file specified mvmsim v1 5 simulator for MVM Usage mvmsim options source 1 Show load sequence t Print execution trace v Set verbose mode You can contact the a
24. declarations minitree auxiliary functions 95 20 char expression walk rdp tree data root 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 Postorder expression walk if root gt token SCAN P ID return root id else if root gt token SCAN P INTEGER 4 char result char mem malloc i12 sprintf result i lu root data u return result else void left_edge graph get next edge root void right edge graph get next edge left edge char left expression walk rdp tree data graph edge target left edge if right edge NULL monadic operator 4 char dst new switch root 5token 1 case RDP T 26 emit SUB dst left break default text message TEXT FATAL unexpected monadic operator found in expression walk token number 1 identifier s n root gt token root gt id return dst else 4 char right expression walk rdp tree data graph get edge target right edge char dst new switch root 5token 1 case RDP T 17 case RDP T 22 case RDP T 23 case RDP T
25. interpret 1 void sym symbol lookup key minicond amp id NULL if sym NULL not found x 1 text message TEXT ERROR ECHO Undeclared variable s n id sym symbol insert key minicond amp id sizeof char sizeof minicond data minicond gt val dst b dst a amp amp b integer a and dst a b and not dst a b local int a Figure 4 3 An rdp specification for the minicond interpreter part 1 26 THE MINICOND LANGUAGE INTERPRETATION WITH CONDITIONALS 55 var dec interpret integer 56 int ID name insert name Declaration 57 2 eO val _update name val 1 Initialisation 58 59 60 statement interpret integer 61 ID name eO value _update name value Assignment 62 63 _local_int flag 64 if eO cnd then _and flag cnd interpret statement flag if statement 65 else _and_not flag cnd interpret statement flag 66 67 print 2 eO value if interpret printf 1i value output 68 String str if interpret printf 4s str 69 e 70 9 71 72 eO integer 73 el result gt ei right result result gt right Greater than 74 25 ei right result result lt right Less than 75 gt gt el right result result gt right Greater than
26. it is almost universally agreed by programming language designers that an else clause should bind to the nearest if statement in this case to the if statement at line 3 Whilst rdp is an LL 1 parser generator it is not strictly true that it can only generate parsers for LL 1 grammars When is presented with a non LL 1 grammar it is effectively being asked to parse strings that may provide matches for more than one alternative at some point in the derivation By default rdp simply rejects such grammars with an appropriate error message but if we add a F flag to the rdp command line then rdp will be forced to output a parser that disambiguates such cases by choosing the alternative that is lexically first in the grammar long as the grammar writer is able to achieve the non LL 1 behaviour required by putting the most important alternative first then the generated parser will operate correctly In the case of iterators including the optional bracket 1 here rdp will choose to go into an iterator rather than skipping over it if the currently parsed token is in both the FIRST and FOLLOW sets of the iterator This rule has the effect of parsing the else clause in such a way that the derivation tree shows the else clause as bound to the nearest if There are other techniques for handling the so called dangling else problem Perhaps the simplest is to change the language syntax so that if statements are explicitly terminated The Algol
27. referencing to a source file any fixed number of passes is insufficient In practice this situation is rather artificial and real assemblers typically put an upper limit on the number of passes although it is not hard to simply keep re parsing until all the symbols in the symbol table are determined The mvmasm parser makes three passes so it can in fact handle the situation shown above but no more than two levels of forward referencing are allowed As we shall see in a later section rdp generated parsers can be set to make multiple passes by adding a directive of the form PASSES 3 to the BNF specification file 56 MVMASM ASSEMBLER FOR MVM 6 3 2 scanner primitive noted section 6 2 1 assemblers and some early languages like the original FORTRAN are line oriented in that maximum of one statement per line of source file is allowed rdp generated parsers that we have looked at previously have been free format in that line ends and white space may be introduced arbitrarily between language tokens so as to format the source file for human convenience rdp provides a special scanner primitive denoted by EOLN which matches against the line end marker and a special comment primitive COMMENT LINE which can be used to specify comments which are introduced by a grammar token and terminated by a line end If an rdp grammar does not include any instances of EOLN then line end markers are suppres
28. 24 case RDP T 26 case RDP T 27 case RDP T 29 case RDP 30 case RDP T 32 case RDP T 33 case RDP T 34 ox oi lt gt gt x emit NE dst left right break emit MUL dst left right break emit EXP dst left right break emit ADD dst left right break emit SUB dst left right break emit DIV dst left right break emit LT lt dst left right break emit LE dst left right break emit EQ dst left right break emit GT gt dst left right break emit GE gt dst left right break default text message TEXT FATAL unexpected diadic operator found in expression walk token number 1 identifier s n root gt token root id return dst Figure 8 10 minitree auxiliary functions part 2 expression walker 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 96 MINITREE MULTIPLE PASS COMPILER void tree walk rdp tree data root 4 Preorder tree walk if root NULL return else 4 void this edge graph get next edge root
29. 65536 64K MVM instructions can be executed from any location and operands can also reside anywhere in mem ory but miniloop places code in a single sequence starting at location 100016 and data in a single block starting at location 800016 Within the data block internal temporary variables created during the compilation of expressions are placed at the end This memory map is shown in Figure 7 3 We establish this memory map by setting the CODE and DATA assembly pointers appropriately At the start of each program miniloop issues the following assembler directives DATA 0 8000 CODE 0 1000 MPP CODE This has the effect of initialising the start address for data assembly to 800016 and setting the label MPP DATA to the address of the first data item 70 SINGLE PASS COMPILER FOR MINILOOP Unused data space Internal temporary data User data 800016 Unused code space Code 100 Spare memory Figure 7 3 MVM memory map for programs compiled by miniloop and then initialising the start address for code assembly to 100016 and set ting the label MPP CODE to the address of the first instruction which will subsequently be used as the transfer address At the end of each program miniloop writes out directives of this form DATA __temp BLOCKW 9 declare array of temporaries END MPP CODE Here assembly is switched to the data region and a block of temporaries in this case nine w
30. 68 language for instance uses a rule of this form statement 21 0 then statement else statement fi The closing fi which is if backwards marks the end of each if state ment and removes both the grammatical ambiguity for dangling else s and the LL 1 breach caused by the presence of the keyword e1se in both the FIRST and FOLLOW sets of the optional else clause The modern trend in programming languages is to insist on this kind of explicit termination because it has been observed that programmers are more likely to accidentally leave out tokens than to add in spurious tokens Forcing the programmer to mark the end of com pound statements is a useful discipline However languages designed during the 60 s and 70 s such as Pascal and C typically allow unterminated control statements Next steps 31 4 5 1 Semantic actions for conditional execution 4 6 When we write if statement in a program we think of the computer jump ing over one of the two branches In compiler we can generate instructions that do indeed cause a section of code to be jumped over as we shall see in Chapter 7 but in an interpreter we cannot simply jump over part of the input stream because the whole source text must be checked for syntactic correctness Hence the parser will read and process both the then and the else branches of an if statement executing the semantic actions in both branches as it goes In practice of co
31. 72 78 adds a specification for the six new relational operators at a priority level below that of the operators defined in the minicalc grammar expression tree It is not meaningful to write a sequence of relational operators an expression such as 3 lt 2 lt 4 attempts to compare a boolean value in this case FALSE the result of 3 2 with the integer 4 This issue is confused in some languages such as C where booleans are not directly supported integers being used in their place This expression actually yields TRUE in ANSI C because the subexpression 3 2 yields 0 which is indeed less than 4 So as to avoid this kind of confusion in minicond the rule will only accept individual instances of relational operators that is an expression such as at b gt 3 is legal but gt gt gt gt 3 not This is achieved by using the zero or one 1 construct rather than the zero or many bracket in rule e0 and only allowing bracketed expres sions to contain arithmetic operators as specified on line 101 In other respects the rule is entirely conventional on recognition of an operator subexpression the parser will execute the associated semantic action and in each case we have simply used the equivalent operator in the underlying ANSI C language Using inherited attributes In the last chapter we saw how information about a token could be passed from the scanner to the parser s semantic actions using attr
32. 8 3 1 Use of the graph library 8 3 2 The tree walker pretty printer for ANSI C 9 1 9 2 9 3 9 4 9 5 Using the pretty printer 9 1 1 Command line options i indent spacing comment start column 9 1 2 File usage 9 1 3 Making a listing 9 1 4 Error messages Pretty printer features Pretty printer limitations 9 3 1 Operators which may be monadic or diadic 9 3 2 Consecutive indenting keywords 9 3 3 Continuation lines 9 3 4 Embedded comments 9 3 5 Formatting of lexemes A grammar for a superset of ANSI C Auxiliary routines 9 5 1 The space array 9 5 2 The pretty print function Design projects Acquiring and installing rdp A 1 Installation A 2 Build log CONTENTS 83 84 91 91 92 92 92 101 101 102 102 102 103 103 103 104 105 105 105 106 106 106 107 110 110 116 119 121 121 123 Chapter 1 Translation tools 1 1 rdp is a system for implementing language processors It accepts a specification written an extended Backus Naur Form of source language and produces as output a parser for the language written in C It is possible for the user to specify in C actions which should be taken when fragments of the source language are recognised by the generated parser rdp produces as output a program written in C which parses fragments of the specified language and carries out the specified corresponding actions rdp can produce for example compilers the actions specify
33. Alu temp_count return ret unsigned long new_label void 1 static long unsigned label 0 return label End of ml_aux c Figure 7 12 miniloop auxiliary functions part 2 high level output and house keeping Chapter 8 minitree a multiple pass compiler Some translation tasks are difficult to perform during a parse even if a multi pass parser is employed High quality compilers for instance can perform many different code improvement transformations as part of an optimisation phase Typically optimisations work by relating together widely separated parts of the source text Take for example common sub expression elimination which is one of the most commonly applied optimisations an assignment between array elements in ANSI C such as a i jl b i jl actually contains two identical calculations if the sizes of the a and b arrays are the same In detail i must be multiplied by the width of the array and added to 3 single pass translator has to process each of these identical calculations in isolation and so is unlikely to be able to rearrange the calculations into the equivalent but more efficient form temp j array width i a temp b temp If a multiple pass translator is to be used then it is usual to construct a data structure in memory that represents the user program in a manner which may be efficiently processed Simply storing the original program text is ineffic
34. CALCULATOR WITH DECLARED VARIABLES 2 2 2 3 AN aah Eon HR HR HH c NN N e print N kk sk ke ke e ke ke jk ke je ok fe 2k 2k RDP release 1 50 by Adrian Johnstone A Johnstone rhbnc ac uk 20 December 1997 testcalc m a piece of Mini source to test the Minicalc interpreter This file may be freely distributed Please mail improvements to the author 2 kk sk ok ok ok ok ok ok ok kk oko ok ek ke ok ok kk kk oe ek ke hok kk 2 e int 344 b 1 print a is n b a 2 print b is b b is b n int z 3 cubed is z n End of testcalc m Figure 2 1 An example minicalc program testcalc m minicalc expressions are built up using the four basic diadic left associative arithmetic operators and along with unary and and the diadic right associative exponentiation operator Operands may be either numeric literals or variable names The result of an expression may be assigned to a variable as shown in line 15 or used within a print statement as shown in line 17 print statements take a parenthesised comma delimited list of expressions or strings which are evaluated and printed in left to r
35. CPY b 1 b 1 DATA __STR_O STRING a is CODE PRTS STR 0 PRTI a print integer DATA __STR_1 STRING Wn CODE PRTS __STR_1 MUL __temp 1 a 2 _ temp 1 2 CPY b temp 1 b temp 1 DATA __STR_2 STRING b is CODE PRTS STR 2 PRTI b print integer DATA __STR_3 STRING b is CODE PRTS __STR_3 SUB temp 2 0 b __temp 2 0 b PRTI __temp 2 sprint integer Figure 7 6 miniloop compiled output for the example program part 1 76 1054 801 8010 8010 1054 105A 1060 1060 801D 8028 8028 1060 1066 106E 1074 1074 8028 802A 802A 1074 107A 107A 802A 802 802 107A 1080 1080 1088 108E 108E 802C 8038 8038 108E 1094 109A 109A 109A 8038 804 804 109A 10A0 10A0 10A8 10AE 10AE 10B6 10BC A SINGLE PASS COMPILER FOR MINILOOP 51 DATA 0400 52 STR 4 STRING Mn 53 54 CODE 0 010000801 PRTS STR 4 101100008000 56 a print integer 57 58 DATA 2063756265642069 59 STR STRING cubed is 60 61 CODE 0F010000801D 62 PRTS STR 5 0510807780000003 63 EXP temp 3 a 3 temp 3 a 3 101100008077 64 PRTI temp 3 print integer 65 66 DATA 0400 67 STR 6 STRING Mn 68 69 CODE 0 0100008028 70 PRTS STR 6 71 72 DATA 0001 73 2 WORD 0 74 75 CODE 0C118024A8000 76 CPY z a 77 7 0611807880248000 78 EQ __temp 4 2 a __temp 4 2 0E1110948078 79 BEQ temp
36. Chapter 7 of the tutorial manual JS97c Declaring symbol tables Lines 14 19 of Figure 3 1 specify the creation of a symbol table to hold the variables declared in minicalc program A symbol table is a repository for NONMON ON oH HB HB RRP RB HR HB HH WN FO N N an PPP PP 90 090 ADA 00H program N B N E 18 AN INTERPRETER FOR MINICALC RDP release This file m X SYMBOL_TABLE var_dec statement ok ke ok ok she oe ole ok ok oke ode ole ok ok ok ole ok ok ok ok ok ok ok ole ole ok 1 50 by Adrian Johnstone A Johnstone rhbnc ac uk 20 December 1997 minicalc bnf a decorated mini calculator grammar with interpreter semantics ay be freely distributed Please mail improvements to the author Sk sk ok ok oj ke ke ak ok ke ke 2k ok fe 2k ok oe kk ke jk oe e ok e e ok oe 2k ke ke 2k ok EE ok ae af ke 2k af fe 2k TITLE Minicalc interpreter V1 50 c Adrian Johnstone 1997 SUFFIX USES math h mini 101 31 symbol compare string symbol hash string symbol print string char id integer i i var dec statement int ID
37. Figure 5 5 Extracts from the mvmsim simulator the execute function part 2 46 MINI VIRTUAL MACHINE MVM 223 case GE 224 put memory word dst src1 gt src2 225 display GE dst srci src2 226 8 227 break 228 case LT 229 put memory word dst srci lt src2 230 display LT dst srci src2 231 8 232 break 233 case LE 234 put memory word dst src1 lt src2 235 display LE dst srci src2 236 8 237 break 238 case CPY 239 put memory word dst srci 240 display CPY dst srci src2 241 6 242 break 243 case BNE 244 display dst srci src2 245 if srci 0 246 dst 247 else 248 pe 6 249 break 250 case BEQ 251 display dst srci src2 252 if 1 0 253 dst 254 else 255 pe 6 256 break 257 case OP_PRTS 258 display PRTS dst srci src2 259 printf s memory src1 260 pe 6 261 break 262 case OP_PRTI 263 display PRTI dst srci src2 264 printf i int srci 265 6 266 break 267 case HALT 268 display HALT dst srci src2 269 printf Halted 270 stop 1 271 pe 2 272 break 273 default 274 display dst srci src2 275 text printf Wn 276 text message TEXT FATAL illegal instruction encounter
38. Figure 9 6 Pretty printer auxiliary functions part 3 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 Auxiliary routines 115 Print the lexeme some kinds need special actions switch kind case K_EOLN fprintf outputfile n eoln_count printed 0 break case K_COMMENT comment_count if last kind K_EOLN comments that aren t first on a line move to middle do printed fprintf outputfile while printed lt comment start printed fprintf outputfile s lexeme break case K STRING printed fprintf outputfile printed text print C string file outputfile lexeme printed fprintf outputfile break case K CHARACTER printed fprintf outputfile printed text print C char file outputfile lexeme printed fprintf outputfile break case K_PREPROCESSOR printed fprintf outputfile 5 lexeme break default printed fprintf outputfile As lexeme break if kind K KEYWORD INDENT Set an indent for next line temporary indent 1 last kind kind 177 End of c aux c Figure 9 7 Pretty printer auxiliary functions part 4 116
39. INTEGER or ID tokens In these two cases expression walk simply returns a string corresponding to the lexeme of the token Any other nodes will be operator nodes and the ex pression walker will call their children before emitting an assembler instruction corresponding to the operator The left child is called at line 37 The right child is then examined in line 39 and if it is NULL empty then the node must be a monadic operator so assembler code for the monadic operators only monadic in this case is emitted via the switch statement at lines 43 49 For monadic operators the right child is processed and then the switch statement at lines 57 72 is used to select the assembler instruction corresponding to the operator The tree walker has this following outline form 78 void tree walk rdp tree data root 79 1 80 Preorder tree walk 81 if root NULL 82 return 83 else 84 1 85 void this edge graph get next edge root 86 87 switch root gt token 88 1 89 case 0 scan root or begin node s children 90 case RDP T begin 91 1 92 void this edge graph get next edge root 93 94 while this edge NULL walk children printing results 95 1 96 tree walk rdp tree data graph get edge target this edge 97 this edge graph get next edge this edge 98 99 break 100 case function is designed to be called recursively and at each level
40. VIRTUAL MACHINE MVM copying of a data item from one memory location to another These kinds of instructions are called data manipulation instructions because they allow data to be modified in various ways In addition to the data manipulation instructions Von Neumann processors like MVM have control manipulation instructions which affect the order in which the program s instructions are executed By default the program counter is simply incremented to point to the next instruction after the current one In the absence of any control manipulation instructions therefore all programs would simply be executed once only in strict order by address This is the kind of program that we can write using the simple minicalc language which has no control flow constructs The control manipulation instructions allow sequences of instructions to be jumped over They work by loading a new value directly into the program counter which overrides the simple sequential execution Often the new value is only loaded if some condition is true if condition then action state ment be implemented as a test of condition If condition is false then the program counter is loaded with the address of the instruction after the code corresponding to action and this has the effect of skipping over action without executing it MVM addressing modes Most high level programming languages provide both variables and numeric literals In minicalc for instance th
41. added to support relational expressions and the if statement 24 THE MINICOND LANGUAGE INTERPRETATION WITH CONDITIONALS MODO PO PM DO PO d d oH d dB oonan wne onee 00H EEE EE EEE EEEE EEE EE EEE fe oe ff ke fk E RDP release 1 50 by Adrian Johnstone A Johnstone rhbnc ac uk 20 December 1997 testcond m a piece of Minicond source to test the Minicond interpreter This file may be freely distributed Please mail improvements to the author ok 2k ok 2k oe e ke ke 2k ok ke 2k oe jk oe e ok ae 2k af af fe 2k ok fe 2k fe 2k 2k int 344 b 1 print a is n b a 2 print b is b b is b n print a cubed is a 3 int z a if z a then print z equals else print z does not equal a n z a 3 if z a then print z equals else print z does not equal a n End of testcond m Figure 4 1 An example minicond program testcond m is 7 is 14 b is 14 cubed is 343 equals a does not equal a NNN OO Figure 4 2 minicond output for example program oH HB OR HB HR HR HR HH 0 100 o 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
42. allow control flow to be specified using operators the ANSI C if then else operator is perhaps the most well known example of this feature Syntactically expressions are simply streams of operator and operand to kens For an expression made up of diadic two operand operators we expect to see expressions of the form operand operator operand operand operator operand Monadic single operand operators are distinguished by appearing between diadic operators and operands or next to other monadic operatirs Here is an expression made up of monadic negate and the diadic addition and subtraction operators 3 4 6 2 The following is a simple grammar that generates expressions of this form expression monadic_op operand diadic_op monadic_op operand operand INTEGER monadic op diadic op A grammar for 11 If all we want to do is to check that an expression is syntactically valid then we can simply extend the definitions of op and diadic_op to include all the monadic and diadic operators present in the language being specified Here for instance is a suitable grammar for mini expressions which include monadic and and diadic and exponentiation expression monadic_op operand 4 diadic op monadic_op operand operand INTEGER monadic op 2 2 diadic op 4
43. as x y 2 Similar considerations may be used to decide the relative priorities of oper ators Remember that the higher priority operators need to be evaluated first an expression common usage for instance it is clear that the expres sion 2 2 i e 2 is expected to yield 4 that is it is a shorthand for 2 2 not 2 2 which would yield 4 Hence we must give exponen tiation higher priority than monadic On the other hand it is also clear that 2 2 evaluates to 4 not 4 that is we should interpret the expression as 2 2 from which we deduce that multiplication has lower priority than monadic number of priority levels provided by a language is a fundamental design decision Pascal provides rather few levels and in particular expressions con taining adjacent logical operators must be parenthesised ANSI C goes to the other extreme and provides so many priority levels that many C programmers are unsure of the relative priorities of unusual operators From the perspective of the language user as opposed to designer the golden rule is if in doubt insert explicit parentheses 2 5 minicalc syntax checker As it stands mini syn bnf can be processed by rdp in the normal way to make a syntax checker for mini Such a checker can detect badly spelled keywords and syntactically ill formed expressions but is not able to check that variables have been declared before use
44. emit transfer 1 1 1 1 1 1 1 1 1 1 1 src2 src2 src2 src2 src2 src2 src2 src2 src2 src2 src2 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 1 2 e3 e4 eb e6 7 8 9 e10 integer eili integer e0 string char Comment Express integer integer integer integer integer integer integer integer integer integer ion interpreter using C e2 e2 e2 e2 e2 e2 e2 result 24 gt gm right right right right right right e3 result e3 right e4 result amp amp e4 right eb result 427 eb right e6 result e6 right e7 result 42687 e7 right e8 result 42 lt lt e8 right gt gt gt e8 right e9 result e9 right e9 right e10 result 47 eiO right ei0 result 9 2 e10 result result 972 e10 result result 91 e10 result result eli result mvmasm data temp ID name result result result result result result result result
45. executed by the simulator Implementing mvmasm Assembler syntax is designed to be easy to parse and it is quite straightforward to design an assembler completely by hand However this tutorial manual is about rdp so we shall use rdp to implement mvmasm It turns out that two special rdp features are needed to efficiently implement assemblers support for multiple passes and support for line oriented languages 6 3 1 Multiple pass parsers interesting aspect of nearly all high level languages is that they may be translated in a single pass This requires variables and functions to be declared before they are used a rule rigidly enforced by Pascal and loosely enforced with the help of default behaviour in C In an assembler however such a rule would make writing programs very tedious because of the large number of forward jumps present in real code forward jump has this form BEQ temp done done Recall that labels take the value of the current assembly address at the point of their declaration so declaring a label before it is used is not helpful here On the other hand the actual value of done will be unknown when the BEQ instruction is encountered for the first time There are three solutions to this predicament we can either ban forward references we can use a fizup or we can use a multiple pass assembler The first option is draconian since it means that only backward jumps are allowable Remarkably the standard asse
46. field there is no need for an integer data field as there was for the minicalc and minicond interpreters When a variable is declared it is checked for validity line 39 40 miniloop variable names must not begin with a double underscore __ because these might clash with the internal label names Lines 44 67 show the statement compiler This emits code according to the templates described in the previous sections significant difference between the miniloop grammar and the earlier minicond and minicalc grammars is that here the expression rules return char attributes rather then integer ones In the previous grammars the expression rules formed an interpreter that returned values In miniloop the rules return the labels of locations that will contain the values at run time Each level of the expression tree has this basic form 79 ei char char dst e2 left 1 dst new_temporary 80 e2 right emit ADD dst left right 81 2 e2 right emit SUB dst left right 82 83 left dst 84 3 result left auxiliary function emit outputs one assembler instruction constructed from the supplied opcode and operand parameters new temporary function constructs a string of the form temp n where n is the name of the next available temporary variable This temporary then becomes the destina tion operand for the assembler instruction correspond
47. function part 1 Variable stop declared at line 162 is used as a flag to signal termination of the simulation It is initialised to false and only set to true when a HALT instruction is encountered The main simulation loop comprises lines 164 to 279 The body of the loop comprises the code to fetch the new instruction lines 166 170 the address mode resolution code in lines 174 179 and a large switch statement which decodes the operation code lines 181 278 shown in Figures 5 5 and 5 6 Within the switch statement each of the 17 cases includes a call to the display function which provides the trace output if a t option has been specified on the simulator command line Each case finishes with an increment of the program counter by the length in bytes of the decoded instruction The address mode resolution code examines the mode byte loaded from the instruction If the corresponding mode field see section 5 3 is a one then the source operand is reloaded with the contents of the memory location addressed by the data There is no need to resolve the destination operand because destinations are always assumed to be addresses not literal data The code within the simulation loop provides a detailed specification of the meaning of each instruction in terms of the semantics of ANSI C The BNE instruction for instance lines 243 249 tests the value of the first source operand against zero and if the test succeeds the program counter is loaded
48. input file because under some operating systems such as MS DOS the temporary file will overwrite the input file during processing which results in a corrupted file 9 1 3 Making a listing pretty c can be used to make a line numbered listing of program by using the 1 option However bear in mind that it is the input file that will be listed not the pretty printed file If you run the pretty printer twice on the same file then a listing generated on the second run will show the formatted file 9 1 4 Error messages Although pretty c accepts a very loose grammar it will reject files that contain invalid C lexemes In such cases pretty_c issues the usual syntax error messages In addition one of the following three fatal error messages may appear if pretty_c has difficulty accessing files temporary output filename is the same as the source filename An output file name that is the same as the source file name has been specified It is a fatal error to try and make the temporary output file the same name as the input file because under some operating systems such as MS DOS the temporary file will overwrite the input file during processing which results in a corrupted file Use a different output file name unable to open output file output filename pretty_c was unable to open the temporary output file for writing This may be because there is no disk space left or there may already exist a file of that name that is write protec
49. into the grammar which is exactly what parentheses mean in expressions the parentheses in an expression are used to override the default order of expression evaluation so that the sub expression in parentheses is evaluated first 12 THEMINICALC LANGUAGE A SIMPLE CALCULATOR WITH DECLARED VARIABLES Specifying operator priority We have seen that parentheses can be used to override default evaluation orders in expressions However expressions with lots of parentheses can be hard to read although LISP programmers seem to manage and so the conventions of operator priority have grown up to allow expressions to be written with implicit parentheses Most people use priority rules instinctively because they are taught to us when we first learn arithmetic at school Multiplication and division for instance have higher precedence than addition and subtraction This means that the expression 3 4 5 evaluates to 23 not 35 as would be the case if strict left to right evaluation were used Effectively the priority allows us to write 3 4 5 as a shorthand for 3 4 If we need to force left to right evaluation then we can write 3 4 5 In conventional arithmetic exponentiation has the highest priority followed by negation multiplication and division and finally addition and subtraction We can express these priorities by using the nest of production rules shown in lines 19 27 of Figure 2 2 Specifying operator associativ
50. is present set the DATA pointer to its value oth erwise carry on assembling starting at the most recent value of the DATA pointer The DATA pointer is initialised to zero when the assembler starts Data declaration directives WORD ezpression Reserve a word two bytes of memory and initialise the contents to the value of expression BLOCKW expression Reserve expression words 2 x expression bytes of mem ory No initialisation of the memory is performed STRING character string Reserve sufficient bytes to hold the number of characters character string plus one and initialise them to hold the character string and a terminating zero byte Symbol assignment label EQU expression Force the value of label to be the value of expression This allows symbols to be set to arbitrary values rather than the default behaviour which is for symbols to acquire the value of the assembly pointer at the time they are translated The following are legal uses of EQU large prime EQU 131 large prime 131 top bit EQU 128 top bit 128 54 MVMASM ASSEMBLER FOR MVM length EQU top bit 1 length lt 129 Transfer address specification 6 3 END ezpression Mark the last valid line of the assembler source file and set the transfer address to the value of expression The transfer address is written into the output file and is used by the simulator to specify the address of the first instruction to be
51. more passes over the source file On the first pass a symbol table is loaded with the labels and their values as they are encountered If they are first seen as an operand then the corresponding table entry is loaded with an arbitrary value By the end of the first pass however definitions will have been seen for all the symbols if the source program is syntactically well formed The assembler then repeats the entire process but making use of the label information from the first pass The success of this approach relies on the fact that all instructions use a fixed size field to hold symbol values As a result the position of each instruction and data item in memory is fixed during the first pass so symbol values will not change as a result of other symbols changing their value Are two passes sufficient Well if our only concern is forward references the two passes are enough but consider this use of the EQU directive first EQU second 2 second EQU third 3 third EQU 100 Here we have a chain of forward references On pass one labels first and second are to receive the value of expressions which include unknown data but label third will be correctly set to 100 On pass two label second can be correctly set to 103 but label first is still indeterminate so in this case two passes is not sufficient In general we need as many passes as there are levels of forward referencing plus one Since we can always add another level of forward
52. name 1 1 mini cast symbol insert key mini amp name sizeof char sizeof mini data gt i val if symbol lookup key mini name NULL NULL 1 text message TEXT ERROR Undeclared variable s n name symbol insert key mini name sizeof char sizeof mini data z2 1 1 mini cast symbol lookup amp name NULL gt i val print C ei val print 41i val String str printf 4s str 9 Figure 3 1 rdp specification for the minicalc interpreter part 1 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 Declaring symbol tables 19 ei integer e2 result 47 e2 right result right Add e2 right result right Subtract e2 integer e3 result 47 e3 right result right Multiply gt e3 right if result 0 text message TEXT FATAL ECHO Divide by zero attempted n else result right Divide e3 integer 2 e3 result Posite e3 result result result Negate e4 result e4 integer eb result e4 right result integer pow double result double right Exponentiate eb integer ID name if symbol lookup key mini name
53. shown after a yields sign gt If no value is written back as for instance in the case of the PRTI instruction then a zero is displayed Any output produced by PRTI or PRTS instructions is displayed after the in struction If the t option is not used on the command line then the instruction display is completely suppressed so only program output appears Finally when the simulator encounters a HALT instruction it prints the message Halted and terminates 5 7 4 Implementing nvnsim full source code of the simulator runs to a little over 300 lines of ANSI C which may be found in the file mvmsim c In this section we shall look at the overall structure of the simulator and look in detail at the code corresponding to the MVM instruction execution unit About half of the code in mvmsim c is concerned with processing the com mand line options parsing the input file and then loading of the internal mem ory These functions are easy to understand and we shall not discuss them further here The parts of the code we are interested in are those that model the MVM architecture s memory and program counter and the function that controls the simulated execution of the MVM instructions mvmsim a simulator for MVM byte codes 43 18 define MEM SIZE 655361u 22 unsigned char memory MEM SIZE 23 unsigned long pc 0 39 static int get memory byte unsigned long address 52 static int get memory word unsigned lo
54. specified say n and the operand is understood to be made up of the contents of the addressed cell concatenated with the contents of the next highest address n 1 In the example shown in the figure the 16 bit variable temp resides at locations 2 and 3 If memory 2 contains 910 10013 and memory 3 contains 310 00115 then the value of temp is 0003 concatenated with 0009 which is 319 x 25610 910 77710 11000010013 MVM instructions range in size from two to eight bytes By analogy with the addressing for data items the address of the instruction is taken to be the address of the least significant byte which will be the lowest address of the range of locations occupied by the instruction MVM instruction execution 35 memory data from memory Execution unit It vs address to memory Figure 5 2 MVM internal structure 5 2 MVM instruction execution block diagram of the internal structure of an MVM processor is shown in Figure 5 2 is an example of Von Neumann processor as indeed are most real computers in use today In a Von Neumann machine instructions and data can co exist in the same memory as described in the previous section It is not possible just by looking at the contents of memory to distinguish between instructions and data The processor maintains a pointer to memory that is a register which holds the address of a location in memory called the program counter Before program can be exe
55. statement flag 65 else _and_not flag interpret statement flag 1 66 If we strip out the semantic actions the semantic rules and the attributes from this rule we see that its effect on the language is to define an if statement with this syntax statement 212 e0 then statement else statement 1 This recursive rule allows nesting of minicond statements which leads to an ambiguity in the grammar Consider this fragment of minicond code 1 int x 0 a 10 b 20 c 30 2 if a gt b then 3 if b gt c then 4 1 5 else 6 2 30 THE MINICOND LANGUAGE INTERPRETATION WITH CONDITIONALS Here we see two nested if statements with one optional e1se clause present The ambiguity in the grammar means that we cannot tell just by looking at lines 1 5 whether the else clause at line 5 belongs to the if statement at line 3 or the if statement at line 2 The ambiguity is reflected in the error message that rdp generates when presented with the full grammar Error LL 1 violation rule statement 2 else not statement 1 contains null but first and follow sets both include else Since an else clause may be immediately followed by another else clause and the else clause is optional we have an LL 1 violation because the IBNF phrase else statement is 1 optional and 2 starts with the keyword else and may be followed by the keyword else In practice
56. such as the DEC VAX the Motorola 68000 and the Intel 80x86 family have many exotic address ing modes and Reduced Instruction Set Computer RISC architectures such as the MIPS Sun SPARC and Dec Alpha have essentially only three modes is simple but is not comparable to a real RISC architecture because it does not have any data registers and the efficient use of such registers is perhaps the defining characteristic of a true RISC architecture 5 4 MVM instructions instructions are made up of a string of bytes Depending on the struction the string may be between two and eight bytes long every MVM instruction has an operation code opcode byte followed by an address mode byte and most instructions also contain some operands Each operand is rep resented by a 16 bit number so each operand adds two bytes to the length of an instruction one for the most significant byte MSB and one for the least significant byte LSB The format of a three address instruction therefore is Instructions with less than three operands follow this general format but simply omit the unused operand fields Even zero address instructions such as HALT have a mode byte 5 4 1 Instruction set capabilities The opcode byte could encode up to 256 unique instructions but in fact MVM only uses the first 17 codes numbered 0 16 As we shall see in the next chapter it in more convenient to use hexadecimal base 16 than decimal base 10 to repr
57. the corresponding target code interpreters the actions evaluate the input fragments and pretty printers the actions reformat the input fragments This report describes the design and implementation of a family of lan guage translators based around a simple procedural programming language called mini The tools covered include two different interpreters an assem bler simulator for an idealised processor a naive single pass compiler for that processor and a multiple pass compiler We also include a pretty printer for ANSI C and a list of sugestions for further project work For each tool we look at how the tool is used before covering the design and implementation of the translator All of the tools are included in the rdp standard distribution pack and have been tested on MS DOS Windows 95 and Unix systems If you have not used rdp before we recommend that you read through the accompanying report entitled A tutorial guide to rdp for new users JS97c which is a step by step guide to running rdp and which also describes some of the theoretical underpinnings to parsing and translation of computer languages There are also two reference manuals for the user guide JS97a and the support library manual JS97b These reference guides provide detailed infor mation on rdp s options error messages and support library functions We begin by discussing the roles of interpreters and compilers with some historical background
58. the rules for macro parameters in the ANSI C macro preprocessor minicond interpreter contains six semantic rules _local_int a integer a generates a macro that will de clare a new local variable of type integer The name of the variable will be whatever identifier is supplied as the actual parameter in a call to this semantic production o _and dst a b dst a amp amp b the destination attribute is set to the logical AND of attributes a and b o and not dst a b dst a amp amp b the destination attribute is set to the logical AND of the inverse of attribute a and attribute b o _insert id the identifier id is added to the symbol table minicond o _lookup id ret the attribute ret is set to the integer value field of the symbol table record for identifier id If id is not found in the symbol table then an error message is issued and the symbol is added in so as to suppress subsequent messages o _update id val the integer value field of the symbol table record for identifier id is set to val If id is not found in the symbol table then an error message is issued and the symbol is added in so as to suppress subsequent messages 4 5 Adding conditional execution Our first task when adding an if statement to mini is to define the necessary syntax to the grammar This is done in lines 64 and 65 of minicond bnf 63 e 64 if eO cnd then _and flag interpret
59. to keep track of all typedef statements which would require a much more detailed grammar In fact we would have to implement a complete C preprocessor to perform this task perfectly because it is conceivable that the user defined type mytype had been defined in macro or in an included file pretty c simply ignores these complications and always treats and for that matter amp as a diadic operator with spaces on both sides This is the source of the restrictions described in section 9 3 1 112 PRETTY PRINTER FOR ANSI C f kc kk ke kk koe ke fe ke koe hee je ke koe ke fk khe fe ok ke koe RDP release 1 50 by Adrian Johnstone A Johnstone rhbnc ac uk 20 December 1997 pr_c_aux c pretty printer semantic routines This file may be freely distributed Please mail improvements to the author o include lt stdio h gt include scan h include textio h include pr_c_aux h H HH o static int lexeme count 0 static int eoln count 0 o Static int comment count O0 BR on static int last line 1 Static FILE outputfile 20 unsigned long indent size 21 21 unsigned long comment start 301 22 23 24 static int space table K TOP K TOP 1 25 K 26 27 1 Y 0 P 28 B 0 W P R P 29 L B 5 0 E E U 30 0L CE E R N P N 31
60. were handled in the previous section Instead the compiler keeps count of the number of temporaries used and declares them in a block at the end of the program The temporaries are always referred to as __temp n where n us the number of the temporary This uses the address calculation capability of the assembler to avoid the need for a large number of separate labels Compiling print statements The print statement can take an arbitrary number of parameters of either string or integer type The MVM instruction set provides two opcodes specifi cally for printing strings and integers For an integer parameter code to evaluate the arithmetic expression is is sued which leaves a value in a temporary variable t The compiler then simply issues an instruction of the form temp t print integer For the case of an expression made up of a single variable the expression eval uator returns the name of that variable instead of the name of a temporary so code of the form PRTI v print integer will be issued where v is the name of the variable For a string parameter the compiler outputs code of the form DATA _STR_2 STRING b is CODE PRTS STR 2 The string is stored in data space and given a unique label in this case STR 2 The compiler then switches back to code space and emits a PRTS instruction 72 SINGLE PASS COMPILER FOR MINILOOP IF n relational test then block ELSE n else block
61. 000C 5 CODE 0 1000 switch to assembling code at address 1000 hex 1000 6 start 1000 0C010004007B 7 temp 123 1 temp with decimal 123 1006 101100000004 8 temp print the value of temp as an integer 100C 0011 9 HALT terminate the simulator 100E 10 100 1000 11 END start transfer address is code start Transfer address 00001000 errors and warnings Listing format This listing shows the familiar line numbered source file listing on the right with the assembler generated output on the left first field of the output is the current assembly address that is the MVM memory address to which any data or instructions following on the line will be loaded single space is followed by a string of pairs of hexadecimal digits representing the assembled output You will see that the output is in the same format as the input for the mvmsim Assembly using the DATA and CODE pointers Two internal counters are maintained by mvmasm called the current data ad dress and the current code address Their values are set by the DATA and CODE directives respectively and they keep track of the next available data and code memory locations Since MVM is Von Neumann processor data and code may be loaded at any memory locations but it is conventional to separate them first 49 into blocks Line 2 DATA 0x0004 sets the current data address to hexadecimal 000A and m
62. 1 command strcat command COMMAND command strcat command outputfilename text message TEXT INFO Calling simulator 4s Mn command if system command 0 text message TEXT FATAL Not enough memory or simulator not found Wn return 0 Figure 6 7 mvmasm auxiliary functions part 3 housekeeping functions mvmasm auxiliary functions 65 The emiti and emit2 functions in lines 92 102 output one and two byte two and four hexadecimal digit values and then update the current as sembly location accordingly They are used by the emit op function to out put complete MVM instructions Each instruction comprises an opcode byte and a mode byte constructed from the two mode fields passed into the function Between zero and three 16 bit operands are then output Function current label lines 104 113 returns the value of the last la bel seen At the start of each line of assembler source the parser resets the vari able last label to NULL This flags the error condition for the EQU directive if no label has been seen an error message is issued and a pointer to dummy 1 1 is returned instead This is so as to ensure that subsequent assignments to the label fields of the symbol table record returned by current label do not need to check for a NULL pointer init function in lines 115 121 is straightforward it simply attempts to open the output file and issues an error message if the file open f
63. 24 unsigned long location N unsigned long data location unsigned long code location N o 27 unsigned long transfer 0 28 29 void last label NULL pointer to most recently seen label x 30 void dummy label NULL dummy symbol returned by current label on error 31 32 static int emitted 0 Count of bytes emitted this line w w Figure 6 5 mvmasm auxiliary functions part 1 declarations 6 5 mvmasm auxiliary functions The mvmasm auxiliary functions shown in Figures 6 5 6 7 perform file handling and output to the binary object file Function emitf lines 34 51 forms the heart of the output routines it simulates the behaviour of the ANSI C printf output function by accepting a formatted output string and an arbitrary number of output fields and then using ANSI C vprintf and vfprintf functions to format the output The ANSI C standard library macros va list va start and va end are used to handle the variable number of arguments which emitf may be passed see any good book on ANSI C for an explanation of their use If the emit code flag is false the output is simply discarded but if it is true as it will be on pass 3 then the output is sent to the object file In addition if text echoing is enabled with 1 command line option to construct an assembler listing then up to the first 16 characters are also echoed to the screen The functions emit transfer emi
64. 30 27 28 program enum kinds kind 29 long unsigned line column 30 pretty open rdp sourcefilename rdp outputfilename 31 32 4 33 line scan line number column scan column number 34 35 comment lexeme kind K COMMENT 36 string lexeme kind STRING 37 character lexeme kind CHARACTER 38 block open lexeme kind BLOCK PEN 39 block close lexeme kind K BLOCK CLOSE 40 preprocessor lexeme kind PREPROCESSOR 1 41 monadic lexeme kind MONADIC 42 diadic lexeme kind K DIADIC 43 open bracket lexeme kind PEN BRACKET 44 close bracket lexeme kind K CLOSE BRACKET 45 item lexeme kind K ITEM 46 field delim lexeme kind K FIELD DELIM 47 punctuation lexeme kind K PUNCTUATION 48 keyword lexeme kind K KEYWORD 49 keyword indent lexeme kind K KEYWORD INDENT 50 EOLN lexeme kind K_EOLN 51 52 pretty print lexeme kind column line 53 i 54 pretty close rdp sourcefilename rdp outputfilename 55 Figure 9 1 rdp grammar for pretty printer part 1 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 A gram
65. 40 141 142 STR 11 STRING 2 equals a n DATA CODE PRTS __STR_11 BRA __FI_10 _ELSE_10 DATA 350 to _ Implementing miniloop _FI_10 77 STR 12 STRING 2 does not equal a n CODE PRTS __STR_12 _FI_10 CPY a 3 _DO_13 GT __temp 7 BEQ __temp 7 DATA ja 3 0 0 _13 STR 14 STRING a is CODE PRTS __STR_14 print integer DATA STR 15 STRING Nn CODE PRTS STR 15 SUB temp 8 CPY a BRA __ UD 13 0 13 HALT DATA __temp BLOCKW 9 END __MPP_CODE 1 __temp 8 go to __ 0_13 ifn temp 7 __temp 7 go to temp 8 __temp 8 gt 0 1 declare array of temporaries Figure 7 8 niniloop compiled output for the example program part 3 UD 13 NONMON ON oH HB HB RRP RB HR HB HH WN FO N N an PP wWwwwwwwwwwwh Ne 43 44 45 46 47 48 49 50 51 52 53 54 55 00H B N E var dec 78 SINGLE PASS COMPILER FOR MINILOOP ok ok oe ke ke ek kj ok oe kk ek ke ok oe 2 ok EEEE kj ok ok 2k 2k X RDP release 1 50 by Adrian Johnstone A Johnstone rhbnc ac uk 20 December 1997 miniloop bnf a decorated mini loop grammar with single
66. 5534 Total of 65536 cells addressed as memory 0 65533 to memory 65535 65532 65531 Each cell contains eight bits numbered 0 7 The 16 bit variable temp is stored as two bytes at successive addresses temp LSB The address of temp is 2 the address of the Significant Byte LSB gt O e Figure 5 1 MVM memory structure be used via a software simulator so the many clever devices that hardware designers have introduced into real architectures to aid hardware realisation are irrelevant to our purpose You should bear in mind however that writing a compiler for a real processor is more complex than writing a compiler for MVM the principles remain the same but the large scale design of a real compiler requires much more detail to be handled MVM memory All MVM data and instructions are stored in a single main memory MVM has no special or general purpose registers for data The MVM memory can be regarded as an array of eight bit byte locations individually addressed The size of the MVM memory is fixed at 64K 6553610 bytes This allows all MVM addresses to be specified using 16 bit numbers A diagrammatic representation of the MVM memory is shown in Figure 5 1 Since the memory cells can only hold eight bit numbers and since MVM usually operates on 16 bit quantities in general two adjacent cells are used to hold each data item When accessing 16 bit numbers the address of the least significant byte is
67. ALT instruction which the simulator interprets as an instruction to finish interpreting instructions and return control to the user It is not an accident that the HALT instruction uses opcode number 0 Within the simulator the memory is initialised throughout to zero If a user programming error causes the simulator to try executing from memory that has not been loaded with instructions the simulator will immediately terminate because those zeros will be interpreted as HALT instructions 5 5 1 Data manipulation instructions and address modes In section 5 3 we distinguished between literal and variable addressing Here we look at the MVM instructions that correspond to the minicalc code fragments temp x y and temp x 12 Using an assembler to program MVM 39 If the variable x is resident at location 1019 location 000A in hexadecimal and y resides at location 1210 location 000C in hexadecimal then the instruc tion to add them together and store them in a variable called temp at location 4 is Op Mode Dst Sre1 Src2 01 11 0004 000A 000 we have the operation code for ADD 0116 followed by a mode byte that specifies variable mode addressing for both source operands 1116 Then we have three operands specified as the addresses of the destination 000416 and the two sources 000416 and 000 16 By contrast if we wish to add the number 12 to the contents of x and put the result in temp then the correct instruction is
68. Designing and implementing language translators with rdp a case study Adrian Johnstone Elizabeth Scott Technical Report CSD TR 97 27 December 20 1997 Royal Holloway University of London Department of Computer Science Egham Surrey TW20 0EX England Abstract rdp is a system for implementing language processors It accepts a speci fication written in an extended Backus Naur Form of a source language and produces as output a parser for the language written in C It is possible for the user to specify in C actions which should be taken when fragments of the source language are recognised by the generated parser rdp produces as output a program written in C which parses fragments of the specified language and carries out the specified corresponding actions rdp can produce for example compilers the actions specify the corresponding target code interpreters the actions evaluate the input fragments and pretty printers the actions reformat the input fragments This report describes the design and implementation of a family of lan guage translators based around a simple procedural programming language called mini The tools covered include two different interpreters an assem bler simulator for an idealised processor a naive single pass compiler for that processor and a multiple pass compiler We also include a pretty printer for ANSI C and a list of sugestions for further project work For each tool we look
69. Irdp suppN c rdp_gram c rdp gram c cc Irdp_supp c prnt c rdp prnt c cc Irdp supp c rdp_supp arg c suppNarg c cc Irdp suppN c rdp_supp graph c rdp_supp graph c Irdp_supp c rdp_supp memalloc c rdp_supp memalloc c Irdp_supp c rdp_supp scan c rdp_supp scan c Irdp supp c rdp_supp scanner c rdp_supp scanner c Irdp supp c rdp_supp set c rdp suppNset c cc Irdp supp c supp symbol c suppNsymbol c cc Irdp supp c rdp_supp textio c rdp_supp textio c erdp exe rdp obj rdp_ obj arg obj graph obj memalloc obj scan obj scanner obj set obj symbol obj textio obj rdp F omini_syn mini_syn cc Irdp_supp c mini_syn c mini_syn c emini_syn exe mini syn obj arg obj graph obj memalloc obj scan obj scanner obj set obj symbol obj textio obj mini syn testcalc rdp F ominicalc minicalc cc Irdp supp c minicalc c minicalc c cc eminicalc exe minicalc obj arg obj graph obj memalloc obj scan obj scanner obj set obj symbol obj textio obj minicalc testcalc a is 7 b is 14 b is 14 7 cubed is 343 rdp F ominicond minicond Error LL 1 violation rule rdp statement 2 else not statement 1 124 ACQUIRING AND INSTALLING RDP contains null but first and follow sets both include else Warning Grammar is not LL 1 but F set writing files 1 error and 1 warning cc Irdp_supp c minicond c minic
70. NULL NULL 1 text message TEXT ERROR Undeclared variable s n name symbol insert key mini name sizeof char sizeof mini data result mini cast symbol lookup key mini name NULL gt i Variable INTEGER result Numeric literal C el result Do first comment COMMENT NEST Comments STRING ESC 1 String char End of minicalc bnf Strings for print Figure 3 2 An rdp specification for the minicalc interpreter part 2 20 3 2 3 3 AN INTERPRETER FOR MINICALC records which may be stored and retrieved using a key Typically the key is string corresponding to the name of an identifier but allows keys to be made up of combinations of different fields symbol table package itself is quite general the user must supply a set of routines for comparing hashing and printing keys which effectively tune the package to use tables with a particular kind of key The library itself comes with suitable functions for the very common case in which the single string key field is the first field in the symbol table record and it is those routines that are used here For these routines to work we must be sure to set up the data part of the symbol table record correctly in this case a single char field to hold the variable s identifier is the first field and then integer field is declar
71. Op Mode Dst Sre1 Src2 01 10 0004 000 000C The only difference between these instructions is that the mode field for operand src2 is 1 in the first example and 0 in the second corresponding to variable mode addressing and literal mode addressing respectively 5 5 2 Control manipulation instructions Control manipulation instructions are used in the implementation of if state ments loop statements and goto statements Consider the minicond fragment if temp then a a 1 z z 2j If temp resides at location 000 16 a at location 000 16 and z at location 001616 then the following sequence of instructions based at location 213416 corresponds to the minicond fragment Location Op Mode Dst 571 Src2 2134 OE 10 2142 0004 2134 01 10 001C 001C 0001 2142 02 10 000 000 0001 instruction at location 213416 is BEQ which will restart execution at address 214216 if the variable at location 000A 6 is zero next line adds one to the variable at location 001C16 and the final line subtracts one from the variable at location 000 16 The overall effect of the fragment is to skip over the middle instruction if the value of temp at location 001616 is zero 5 6 Using an assembler to program MVM Writing MVM programs in this numerical code is time consuming and highly error prone assembler is a translator for a very simple language that offers English language like mnemonic names for the machine instructions and can
72. PH src 0 ei src ei label emit BEQ label src 0 1 1 2 ei src ei label emit op 0P BNE label src 0 1 1 2 ei label emit op 0P label 0 0 O 1 2 force immediate int 1 m2 0 m2 0 m2 0 m2 0 m2 0 LI LL LL LL LL m2 0 m2 0 m2 0 m2 0 m2 0 m2 0 LI LL LL LL LL mi 1 2 mode 1 PRTS 1 src emit PRTS 0 src 0 0 1 2 force immediate mode PRTI 1 0 ei src emit op 0P 0 src 0 mi 1 2 HALT emit_op OP_HALT O 0 O 1 1 O INCLUDE string filename 2 if text open filename NULL text message TEXT ERROR ECHO include file s not found n filename else 4 text get scan CODE location amp code location ei n location DATA location amp data location ei n location WORD ei val emit2 val ei val location 2 val string str while str 0 emiti str emiti 0 val mvmasm cast current label val val BLO CKW STRING EQU ei END Figure 6 2 An rdp BNF specification for mvmasm part 2 instructions val transfer val
73. The spectrum of language translators and the limitations of single pass translators rdp can be used to construct many kinds of translator In the tutorial guide JS97c 2 TRANSLATION TOOLS we looked at a single pass interpreter for a very simple language called mini These kinds of translators are limited to reading the source file once and execut ing embedded semantic actions on the fly This makes it hard to implement loop constructs which of course require parts of the source file to be executed over and over again This is the reason why the mini language interpreter described in the tutorial manual does not support looping constructs One approach to handling loops within an interpreter might be to trick the parser by resetting the input pointer to the start of the mini source code loop at the beginning of each loop iteration a rewindable interpreter This technique is feasible but requires a detailed understanding of the internals of rdp It also results in rather slow translation Experiments with the mini interpreter show that when interpreting arithmetic expressions about 9096 of the time is spent performing the parse and only 1096 of the time performing useful computation In fact even this discouraging ratio represents the best case The use of comments and long variable names can significantly increase the proportion of time spent on parsing This is unfortunate as it militates against use of meaningful names and embedd
74. VM ID lab if last_label symbol_lookup_key mvmasm amp lab NULL NULL last label symbol insert key mvmasm amp lab sizeof char sizeof mvmasm_data mvmasm_cast last_label gt val location 1 op 50 op MUL op DIV op EXP op EQ op NE op GT GE CILE int op mi 1 m2 1 ADD SUB MUL DIV EXP ei EQ NE GT 1 GE P LT P LE emit op op dst int mi 1 CPY ei dst mi 0 ei src emit op O0P CPY dst 1 dst dst dst dst dst dst dst dst dst dst dst un PH PH PH PH PH PH PH PH PH PH PH Srci src2 mi m2 diadic copy branch print halt directive 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 3 1 1 1 1 1 1 LL LL LL LL LL LL 1 1 1 1 1 1 1 1 1 1 1 srci srci srci srci srci srci srci srci srci srci srci PH PH PH PH PH PH PH PH PH PH
75. a at address OOOA hex temp W RD 1 declare an integer variable called temp CODE 0 1000 switch to assembling code at address 1000 hex Start CPY temp 123 1 temp with decimal 123 PRTI temp print the value of temp as an integer HALT terminate the simulator END start transfer address is code start 48 MVMASM AN ASSEMBLER FOR MVM Each line of mvmasm source code may contain a label such as temp or start an instruction such temp and a comment which comprises anything between a semicolon and the end of a line All three of these fields are individually optional so lines containing only a label only an instruction or only a comment are valid as indeed are blank lines Most instructions in an assembler program correspond to machine opcodes but some are directives which are instructions to the assembler In the example above the instructions DATA CODE WORD and END are directives 6 1 1 Assembler output The effect of assembling opcodes and executing directives is best seen by ex amining the assembler s output The example source code is available within the rdp distribution as file examples rdp_case mvmsim mvm Executing the command mvmasm 1 examples rdp_case mvmsim produces the following output listing 0000 1 Simulator example file 0000 2 DATA Ox000A start assembling data at address OOOA hex 000 0001 3 temp WORD 1 declare an integer variable called temp 000C 4
76. ails The quit function closes the object file and echoes the transfer address to the screen If the x command line option has been used then the flag execute sim will be true Assuming that the object file was not sent to stdout i e that a file containing the object code exists command is constructed that will run the simulator on the object file and then the ANSI C library function system is called to pass control to the simulator Chapter 7 A single pass compiler for miniloop 7 1 This chapter describes the first of two full compilers for an extended version of the minicond syntax that provides a while loop and compound statement delimited by begin and end keywords The compiler works by recognising compilable fragments of the source code such as an individual assignment or an arithmetic operation and then emitting the corresponding MVM assembler instruction The output of the compiler is a complete assembler program with the same semantics as the miniloop source program and this can then be assembled using mvmasm and executed using mvmsim In this chapter we shall describe the language features added to miniloop give an example of the compiler s output and then describe the assembler code patterns that are used to implement the miniloop high level language con structs We shall then describe in detail the rdp grammar and auxiliary routines that are used to implement miniloop In the next chapter we shall describe an oth
77. akes the data pointer the current assembly address All subsequent instructions will be assembled into succeeding locations until such time as an other DATA or CODE directive is encountered If a DATA or CODE directive appears on its own without an operand then it simply switches the current assembly address to the current data address or current code address respectively Labels Labels have the same syntactic form as C language identifiers that is an al phabetic character or an underscore followed by zero or more alpha numeric characters or underscores label definition must be followed by a colon When a label is encountered it is given the value of the current assembly ad dress Whether a label gets the current data address or the current code address depends upon which of the DATA and CODE directives was most recently encoun tered Hence in the above example temp is given the value 000 16 the current data address and start the value 100016 the current code address Machine instructions and addressing modes Lines 7 8 and 9 show actual machine instructions being assembled Each line comprises one of the operation codes from Table 5 1 followed by between zero and three operands operand may be either an address or literal data which is distinguished by a preceding hash sign Hence the instruc tion CPY temp 123 assembles to OB 01 000A 007B where 000A is the value of the label temp and 007B is the hexadecimal form of
78. al in preprocessor definitions These two commands have quite different meanings define a b b 3 define a b b 3 108 PRETTY PRINTER FOR ANSI C 1 te foe ok 2 fe a kf koe ke ke koe koe he koe koe koe 2 3 RDP release 1 50 by Adrian Johnstone A Johnstone rhbnc ac uk 20 December 1997 4 b pretty c bnf a pretty printer for ANSI C 6 7 This file may be freely distributed Please mail improvements to the author 8 9 This grammar illustrates a rather different approach to writing language 10 parsers Instead of trying to exactly define the language with the 11 grammar we try and find a simple grammar that accepts the language and 12 also allow it to accept lots of incorrect strings The rationale is that 13 a pretty printer does not need to check a program for syntax errors 14 because a conventional compiler will be used subsequently to do that 15 As a result we end up with a very flat loose grammar 16 17 6 ee ke 2 koe ke 18 TITLE C pretty printer V1 50 c Adrian Johnstone 1997 19 SUFFIX c 20 PARSER program 21 OUTPUT_FILE pretty c 22 TEXT 517 100 000 23 USES pr aux h 24 25 ARG NUMERIC i indent size lumber of spaces per indent level O means use tabs default 2 26 NUMERIC c comment start Preferred start column for trailing comments default
79. antic functions as miniloop and so needs to be linked with the functions in m1 aux c In addition three extra functions to handle the tree walking are contained in the auxiliary file mt_aux c We shall look at the grammar first and then the new auxiliary functions 8 2 1 grammar for minitree The starting point for the minitree grammar is the miniloop grammar stripped of its semantic actions apart from those associated with the symbol table We then add promotion operators to terminals and nonterminals so as to prune the derivation tree into the forms shown in Figures 8 1 8 6 The first task is to 92 8 3 8 3 8 3 MINITREE MULTIPLE PASS COMPILER remove nodes that are pure syntactic sugar such as the semicolon and comma nodes These are used to separate items in lists when represented as a linear text but within a tree we can simply represent the list items as siblings under a parent node Hence in line 33 the semicolon node is promoted under its parent and thus effectively deleted from the tree Similarly in line 46 the parentheses in the print statement and the comma separating the parameters to be printed are deleted The if and while statements also contain sugar nodes that are deleted the if statement is represented in the tree as a single if node with two or three children the first being the expression tree for the relational condition and the second and third corresponding to the then block and the optional else bl
80. approach was popularised by the UCSD P system in the 1970 s which was a combined operating system and Pascal compiler that was distributed as P code code PD82b PD82a was in fact the machine language for mythical stack based computer that could be efficiently simulated on real architectures The system was so successful that a microprocessor manufacturer subsequently de signed and marketed hardware implementation of the P code processor On this processor the P code was native machine code so no software based inter pretation was required P code was successful because its only real competitors on the very small microprocessor based systems of the time were interpreters for BASIC These This use of the term virtual machine to denote an architecture that is independent of any physically implemented machine should not be confused with the use of the term in operating systems and computer architecture contexts where it denotes the ability of an architecture to support multiple simultaneously executing processes each of which appears to own the full resources of the host machine 4 1 3 TRANSLATION TOOLS fully interpreted languages were slow compared to the P code simulator microprocessor systems matured true compilers for languages such as C and Turbo Pascal that compiled to the host machine s machine code became widely available and the UCSD P system fell out of favour because it was much slower than these so called nati
81. art void pretty close char sourcefilename char outputfilename void pretty open char sourcefilename char outputfilename void pretty print char lexeme enum kinds kind unsigned long column unsigned long line End of c aux h Figure 9 3 Pretty printer auxiliary functions header file pretty print function remembers in a static variable the token kind of last lexeme seen so at each stage it has access to the token kinds of the previous and current lexemes Line 129 uses the space array to check whether a space should be output 129 if space table last insert space if necessary 130 printed fprintf outputfile Here we see that if there is a one at position last current of the space array then a preceding space will be output Line 54 for instance specifies that a space shall always be output if the last kind was punctuation such as a comma or semicolon This effectively inserts a space after every punctuation character This lookup table mechanism is very flexible and almost sufficiently powerful but it does suffer from some limitations In particular it is not easy to decide whether token is diadic multiplication or a monadic pointer dereference operator especially in contexts such as mytype temp where mytype is a user defined type definition that has been created using a typedef statement To be able to handle such cases we would need
82. ary provides two routines symbol lookup key and symbol insert key to search for and insert keys You should look at Chapter 7 of the support library manual JS97b for a complete description of these routines Lines 30 36 illustrate the use of these functions to look up an identifier in the symbol table 30 statement ID name 31 if symbol lookup key mini name NULL NULL 32 1 33 text message TEXT ERROR Undeclared variable s n name 34 symbol insert key mini amp name sizeof char sizeof mini data 35 36 The ID scanner primitive will accept an alphanumeric identifier whose lexeme is returned in attribute name The semantic action calls symbol lookup key to search the symbol table called mini for the identifier name in any scope region If the symbol table does not contain name then symbol lookup key will return a NULL value in which case the action issues an error message and then inserts the identifier into the table This is done so as to suppress subsequent error messages that might be triggered by later references to the variable Chapter 4 minicond language interpretation 4 1 with conditionals minicalc language discussed in the last chapter is only really as powerful as an integer only pocket calculator with a large number of memories Historically calculators were distinguished from full blown computers on the basis of their control capability to be
83. at how the tool is used before covering the design and implementation of the translator All of the tools are included in the rdp standard distribution pack and have been tested on MS DOS Windows 95 and Unix systems rdp source code is public domain and has been successfully built using Borland C version 3 1 and Microsoft C version 7 on MS DOS Borland C version 5 1 on Windows 95 GNU gcc and g running on OSF 1 Ul trix MS DOS Linux and SunOS and a variety of vendor s own C compilers Users have also reported straightforward ports to the Amiga Macintosh and Archimedes systems This document is Adrian Johnstone and Elizabeth Scott 1997 Permission is given to freely distribute this document electronically and on paper You may not change this document or incorporate parts of it in other documents it must be distributed intact The system itself is Adrian Johnstone but may be freely copied and modified on condition that details of the modifications are sent to the copyright holder with permission to include such modifications in future versions and to discuss them with acknowledgement in future publications The version of rdp described here is version 1 50 dated 20 December 1997 Please send bug reports and copies of modifications to the authors at the address on the title page or electronically to A Johnstone rhbnc ac uk Contents Translation tools 11 spectrum of language translators a
84. at is commonly used by assemblers for real processors although executable file formats used by commercially available processors are usually in pure binary to save space In a pure binary file each location could be represented by a single binary byte but in our case each location requires two bytes each representing a hexadecimal character Therefore MVM executable files are likely to be at least twice as large as their pure binary equivalents On the other hand pure binary files can not easily be read into an editor or printed out As well as specifying memory contents the input file must tell the simula tor which memory location contains the first instruction to be executed The mvmsim input file uses a special format to specify this transfer address compris ing an asterisk followed by the transfer address itself The short example below shows the contents of an mvmsim input file for a program comprising three instructions 1000 OC 01 000A 007B 1006 10 11 0000 000A 100C 00 11 100E 1000 The first instruction based at location 100016 copies the number 12310 7 16 to the memory location 000 16 The next instruction prints out the contents of that location as a decimal number and the third instruction is HALT which will cause the simulator to terminate The final line specifies a transfer address of 100016 the transfer address is denoted with a leading asterisk which warns the assembler not to attempt to load the data on that l
85. auto matically resets the input at the start of each pass but some mvmasm variables need to be re initialised each time as shown in lines 37 40 boolean flag emit code is set to TRUE on pass three and is used to control the output to the binary file code emission is inhibited on passes one and two when this flag is FALSE Both the data and code location counters are zeroed and the code location is set as the default assembly address Finally dummy 1abel is initialised to point to a new symbol table record This record is used to handle errors involving the EQU assembler directive as will be described below Each line of assembler source is processed by the code rule which can match an optional label an optional instruction and a line end EOLN primitive At the start of each line an end of line character is sent to the output by calling emit eoln followed by the value of current location counter The global variable last label is set to NULL to indicate that no label has been seen yet on this line After processing the contents of the line the code rule calls emit fill to pad the binary output to column 16 This ensures that the mixed binary source output listing is properly aligned Labels are processed by rule label string representing the label s name is returned in attribute lab and the semantic action in lines 48 51 first looks up the label in the symbol table inserting it if not already present before loading the
86. bnf rdp c rdp exe rdp h rdp_aux c rdp aux h rdp gram rdp gram rdp_prnt rdp_prnt test c test pas testcalc testcond 5 85B testloop testtree m rdp_doc rdp_case dvi rdp_doc rdp_case ps rdp_doc rdp_supp dvi rdp_doc rdp_supp ps rdp_doc rdp_tut dvi rdp_doc rdp_tut ps rdp_doc rdp_user dvi rdp_doc rdp_user ps rdp_supp arg c rdp_supp arg h rdp_supp graph c rdp_supp graph h rdp_supp memalloc c rdp_supp memalloc h rdp_supp scan c rdp_supp scan h rdp_supp scanner c rdp_supp set c rdp_supp set h rdp_supp symbol c rdp_supp symbol h rdp_supp textio c rdp_supp textio h examples An overview of rdp Main rdp makefile rdp specification for the minicalc interpreter rdp specification for the minicond interpreter rdp specification for the miniloop compiler rdp specification for the minitree compiler rdp specification for the mini syntax checker miniloop auxiliary file miniloop auxiliary header file minitree auxiliary file minitree auxiliary header file rdp specification of the mvmasm assembler source code for the mvmsim simulator auxiliary file for mvmasm auxiliary header file for mvmasm op code definitions for MVM rdp specification for Pascal rdp specification for the ANSI C pretty printer auxiliary file for pretty c auxiliary header file for pretty c rdp specification for rdp itself rdp main source file generated from rdp bnf 32 bit rdp executable for Win 32 zip file only rdp main h
87. cember 1997 testloop m a piece of Miniloop source to test the Miniloop compiler This file may be freely distributed Please mail improvements to the author o 10 11 int 3 4 1 12 13 is n 14 15 b a 2 16 17 print b is b b is b n 18 19 print a cubed is 3 Nn 20 21 int 2 a 22 23 if z a then print z equals else print z does not equal a n 24 25 z a 3 26 27 if z a then print z equals else print z does not equal a n 28 29 a 3 30 31 while a gt 0 do 32 begin 33 print a is a n 34 1 35 end 36 37 End of testloop m Figure 7 1 An example miniloop program testloop m The output produced when this is run through the miniloop compiler and then assembled and simulated by mvmasm and mvmsim is shown in Figure 7 2 The assembler code produced by miniloop is shown in Figures 7 6 7 8 and discussed in section 7 8 7 1 1 The begin end block compound statement It is useful to be able to group statements together into blocks so that a single if statement can control the execution of a list of statements In minicond only a single statement could be placed within the then or else clause of an if statement The begin end brackets allow statements to be grouped and treated as a single compound statement It is worth noti
88. child is a string marked with a node type of RDP T 18 corresponding to the token then emit print is called to emit a print string instruction If not then the expression walker is called on the child and emit print is called to emit a print integer instruc tion code to handle if and while statements is at lines 145 163 and 165 179 respectively The general format of the code is exactly the same as for the semantic actions in the miniloop grammar except that the tree walker is called in lines 158 160 and 176 to generate the code for the then else and do blocks It would be quite straightforward to integrate the expression and statement walker functions together into a single function We have separated them for clarity but the reader may like to consider how to combine them together Further ideas for projects are given in the final chapter Chapter 9 pretty printer for ANSI C 9 1 pretty printer is a tool that rearranges the formatting of a program so as to meet some standard for indentation and comment placement It turns out that ANSI C and its embedded preprocessor present some difficult challenges in the design of pretty printer which we shall explore in this chapter The tool described here is called pretty c and you can see some examples of its output in Figures 9 4 9 7 rdp is usually used to specify parsers that describe a language tightly that is the parser should accept inputs that are in the la
89. current value of the location counter into the symbol s va1 field It should be clear that labels may be redefined in mvmasm that is they may appear more than once in a label field This is primarily intended to allow labels to have new values assigned with the EQU directive but it does mean that a label might accidentally be used several times within a code segment and it is hard to imagine a situation where this would not represent a programming error The reader might like to consider whether it would be appropriate to issue a mvmasm grammar 61 warning message when such doubly declared label is encountered Rule instr on line 53 splits the handling of instructions into six cases the first five handle the five syntactically different classes of MVM instructions and the sixth handles the assembler directives The diadic instructions matched by rule diadic illustrate the general prin ciples used in all MVM instruction processing Three local variables op m1 and m2 are used to collect the operation code and the addressing modes for the two destination operands Upon recognition of the opcode op is set to the corresponding member of the opcodes enumeration shown in Figure 6 4 The addressing modes are set to 1 variable mode addressing by default but are set to 0 constant mode addressing if the corresponding operand starts with a character The expression evaluator which is described in the next sec tion is called for each op
90. cuted it must be loaded into memory and then the program counter initialised with the address of the first instruction to be executed Once the machine starts running the program it reads the instruction pointed to by the program counter into some internal registers collectively called the Instruction Register In this case the instruction register has space for an operation code an address mode and up to three operands called the destina tion source 1 and source 2 operands These registers between them can hold all of the information needed to execute a single instruction the sig nificance of the individual registers will be described in the following sections For now note that the data from memory may be loaded into the instruction register or sent to the execution unit that memory addresses may be supplied by the program counter or by the operand registers that only the execution unit can generate data to be written back into memory and that the contents of the operand registers may connected directly to the execution unit After the instruction register has been loaded with a new instruction ready for execution the program counter is incremented so that it points to the loca tion just past the end of the instruction that has just been read The processor then performs whatever action is specified by the instruction in the instruc tion register which might for instance be the addition of two numbers or the 36 5 3 THE MINI
91. dded to the symbol table and given the value of the assembler s current location counter However the EQU directive can be used to assign arbitrary numeric values to labels by evaluating expressions and label values may of course be used in those expressions Thus labels in mvmasm perform the same r le as variables in the minicalc interpreter and it is perhaps to be expected that the symbol table declaration in lines 28 35 is essentially identical to the symbol table declaration in the minicalc and PPP IS ER ER HR ERR Ne gt 43 44 45 46 mvmasm grammar 57 kk kk ke ke RDP release 1 50 by Adrian Johnstone A Johnstone rhbnc ac uk 20 December 1997 mvmasm bnf an assembler for Mini Virtual Machine assembler language This file may be freely distributed Please mail improvements to the author 2 ak ake ake ake ak ke kk ke ke hok hh heh hohe heh de ak koe kk he ok ake ke ake ake ak ake ake TITLE mvmasm v1 5 absolute assembler for mvm SUFFIX mvm USES aux h USES math h USES mvm def h PASSES 3 PARSER unit name of s
92. dp tree SYMBOL TABLE mini 101 31 symbol compare string symbol hash string symbol print string char id check declared if symbol lookup key mini name NULL NULL 1 text message TEXT ERROR Undeclared variable s n name symbol insert key mini amp name sizeof char sizeof mini data program 1 var dec statement j var dec 2 77 dec_body dec body ID name eO symbol insert key mini amp name sizeof char sizeof mini data if name amp amp name 1 text message TEXT ERROR ECHO variable names must not begin with two underscoresWMn 1 statement ID name check declared eO assignment 914977 eO statement else statement if statement hile eO do statement while do statement print 7 0 String 20 27 print statement begin statement end compound statement Figure 8 7 An rdp BNF specification for minitree part 1 statements 04 A MULTIPLE PASS COMPILER 49 eO Greater than 50 99555 ef Less than 51 2 7 ef Greater than or equal 52 lt 7 Less than or equal 53 j22 Equal x 54 phere el Not equal 5B 56 57 ei e2 59775 2 Add
93. e actions may be separated out from the main part of the grammar whilst still residing in the same source file semantic rule is one which contains only single sequence of semantic actions such these rules do not affect the language generated by the gram mar or equivalently matched by the parser generated from the grammar By convention semantic rule names begin with a leading underscore so that when reading the grammar we can mentally delete them from consideration of the language generated Semantic rules are implemented using ANSI C macros rather than as func tions This is to allow the semantic rule to automatically have access to all of the attributes in the rule that calls the semantic rule but you should be aware that each instance of a semantic rule will result in the complete body of the rule being instantiated into the parent rule so casual use could lead to very large generated parsers Semantic rules can take inherited attributes but may not return synthesized attributes since the semantic rule automatically has access to the complete state of the calling rule it can access the parents attributes directly The in herited attributes are treated slightly differently for semantic rules than for normal rules in that the attributes are made into macro parameters and are Adding conditional execution 29 thus available for textual substitution within the semantic actions As such they do not take a type and they follow
94. e assignments temp x and temp x 12 are both valid In practice they will be compiled into an ADD instruction with three operands a destination and two sources corresponding to the left and right sides of the operator The variables x and y will be stored at specific memory addresses What if variable y were to be stored at location 12 How would an MVM processor distinguish between an instruction to add the number 12 and an instruction to add the contents of a variable stored at location 12 The answer is to provide some extra information called the addressing mode It is the responsibility of the compiler to specify the correct addressing mode when it generates an MVM instruction We shall look at how the modes are specified in the next section provides only two addressing modes literal and variable Real pro cessor architectures often provide many complex addressing modes which for instance might allow an access within a two dimensional array to be specified as a single machine instruction The trend in recent years has been to discour age the use of any but the most straightforward addressing operations because they complicate the use of pipelining in hardware implementations Pipelined processors are very efficient but their execution units are disrupted by the MVM instructions 37 overhead of having to decode complicated addressing modes Broadly speak ing Complex Instruction Set Computer CISC architectures
95. e form and a separate code generation pass traverses the tree and outputs MVM assembler source code which may then be assembled and simulated T pretty printer for ANSI C which illustrates the use of a highly under specified grammar to process a language which will be checked for syn tactic correctness by another tool using a fully specified language We conclude this report by suggesting some design projects based on extensions to the compilers of these tools are included in the rdp distribution and are automatically built and tested as part of the standard installation makefile If you have successfully installed rdp therefore you should already have working versions of the tools and all the source files described here will be found in the main rdp directory Chapter 2 minicalc language simple 2 1 calculator with declared variables In this chapter we give grammar and associated syntax checker for a tiny language minicalc which includes only the features at the core of any proce dural programming language expression evaluation and assignment to named variables In minicalc as in most modern languages variables must be de clared before they are used so as to catch the elementary programming error of assigning to a variable whose name has been misspelled In early high level programming languages a system would quietly make a new variable with the misspelt name and assign the value there Subsequen
96. e src2 DIV dst src src mem dst resolve srci resolve src2 EXP dst src src mem dst resolve srci resolve src2 EQ dst srci src2 mem dst resolve srci resolve src2 NE dst src src mem dst resolve srci resolve src2 GT dst src src mem dst resolve srci resolve src2 GE dst src src mem dst resolve srci gt resolve src2 LT dst src src mem dst resolve srci resolve src2 LE dst src src mem dst resolve srci lt resolve src2 CPY dst src mem dst resolve srci BNE target src if resolve srci 0 then PC target BEQ target src if resolve srci 0 then PC target PRTS 0 src Print resolve srci as string PRTI 0 src Print resolve srci as decimal integer Table 5 1 The MVM instruction set the most significant nibble encodes for 7 and the least significant nibble for src2 For instructions that do not use one or both of the source operands the corresponding mode fields are set to zero No mode field is required for the destination operand because the destination must clearly always be an address it is never meaningful to assign a result to literal On some real architectures multiple destination addressing modes are provided but MVM has no need of them Example MVM instructions programs are made up of sequences of MVM instructions which will usually include both data manipulation and control manipulation instructions Each valid program must finish with a H
97. e that a run of exponentiation operators is evaluated in the reverse order to that in which they are read that is right to left which is what we require for a right associative operator In detail it is clear that the right recursion will absorb all instances in a run of exponentiation operators so even though we have used the zero or many iterator bracket each invocation of e4 can only ever absorb zero or one instances of the operator so in practice we write such rules in this way e4 eb 4 1 Operators that do not combine Some operators yield result values that are incompatible with their operands and therefore can not be used next to each other in expressions It is not math ematically meaningful for instance to write an expression like 3 lt 4 lt 6 because the result of evaluating 3 lt 4 is a boolean truth value and this cannot reasonably be compared to the integer 6 We might loosely call such operators non associative but strictly speaking it is meaningless to speak of the associa tivity of such an operator The arithmetic relationals are the standard examples of such operators and we implement them using rules such as rel expr exp exp 1 The important point here is that the curly braces used for the expression rules 1 and e2 in Figure 2 2 to specify zero or more consecutive instances of an operator are replaced by square brackets 1 which only allow zero or one occurrence
98. eader file generated from rdp bnf rdp auxiliary file rdp auxiliary header file grammar checking routines for rdp grammar checking routines header for rdp parser printing routines for rdp parser printing routines header for rdp ANSI C pretty printer test source file Pascal test source file minicalc test source file minicond test source file miniloop test source file minitree test source file case study dvi file case study Postscript source support manual dvi file support manual Postscript source tutorial manual TEX dvi file tutorial manual Postscript source user manual TEX dvi file user manual Postscript source argument handling routines argument handling header graph handling routines graph handling header memory management routines memory management header scanner support routines scanner support header the rdp scanner set handling routines set handling header symbol handling routines symbol handling header text buffer handling routines text buffer handling header examples from manuals Table A 1 Distribution file list Build log 123 executables If you type make veryclean then the directory is cleaned and the executables are also deleted A 2 Build log The output of a successful makefile build on MS DOS is shown below Note the warning messages from rdp on some commands these are quite normal cc Irdp supp c rdp c rdp c cc Irdp supp c aux c rdp aux c cc
99. ecifies the start address of the unit 6 4 4 The expression evaluator expression evaluator follows the general principles used in the minicalc interpreter In mvmasm a more complete set of operators is available than in minicalc corresponding to the complete set of ANSI C integer operators aug mented with the exponentiation operator Two literal values have also been added TRUE and FALSE which yield 1 and 0 respectively Identifiers are checked for validity on the final pass lines 143 150 undefined labels on earlier passes are simply ignored 62 MVMASM ASSEMBLER FOR MVM sk kk 2 ke ke ke ok ke oe eoe e jeje oj oj oj ok oj ok oj ok je ok EEE EEEE EEE EEEE EEE oe eoe ae ae ae RDP release 1 50 by Adrian Johnstone A Johnstone rhbnc ac uk 20 December 1997 mvm aux c Mini Virtual Machine assembler semantic routines This file may be freely distributed Please mail improvements to the author o ak sk sk jk ks fe ff kk EE EEEE EEEE EE EEE EEEE EEE EEE EEE EEEE EEE E EEE ke fe fe fe ko fe ff ke ee kk include lt stdarg h gt m o 11 include lt stdio h gt 12 include lt stdlib h gt 13 include scan h 14 include memalloc h 15 include textio h 16 include mvmasm h 17 include mvm aux h 18 19 int emit code 0 20 int execute sim 0 21 22 static FILE objfile NULL 23
100. ed documentation leading to cryptic and hard to understand programs Treating loops using the rewinding trick would mean that the loop would be re parsed over and over again and such an interpreter would be slow Nevertheless this kind of trick is used in some real systems in particular BASIC interpreters such as the Visual Basic engine built in some Windows 95 tools work this way To improve the performance a little it is normal for such tools to store the program in a format that strips out comments and white space and replaces keywords with single characters This eases the job of the scanner and helps to improve performance A compiler does not attempt to execute a program in the way that an interpreter does Instead it outputs program in the machine language of some target processor which can be directly executed by that processor The compiler s main task is to identify operations in the source program and map them to code templates in the target processor s language that have the same meaning or semantics Full compilation undoubtedly provides the most efficient way of executing most real programs but a different target program will be required for each kind of target processor that is the generated code is not portable between ar chitectures or in extreme cases not even between different models of computer within the same architectural family One approach to providing a measure of portability is to strictly separate the par
101. ed n 277 break 278 279 280 Figure 5 6 Extracts from the mvmsim simulator the execute function part 3 Chapter 6 mvmasm an assembler for MVM 6 1 Writing MVM programs directly in the binary machine code is very error prone In the early days of computing it was not unusual for programmers to take great pride in their ability to remember all the binary codes for instructions but even if the machine code is easy to remember as indeed it is for the very simple MVM processor it is still hard to keep track of lots of variables if they can only be referred to by their numeric machine addresses Assemblers evolved as the earliest available programming aids Most assem blers provide two basic facilities o a set of mnemonic names for the machine instructions o the ability to label instructions and data locations allowing jump targets and variable addresses to be referred to using symbolic names rather than numeric values In addition assemblers usually allow arithmetic to be performed on symbolic addresses This allows the address calculations associated with array indexing record field selection and jump branch selection to be done by the assembler rather than by the programmer A first example mvmasm source code corresponding to the short example used in section 5 7 3 is shown below a variable is loaded with decimal 123 and then printed Simulator example file DATA OxOOOA start assembling dat
102. ed to hold the working value of a variable Using synthesized attributes synthesized attribute is a value which may be of any C type including primitive types such as characters and integers as well as complex types such as structures and arrays which is passed back up a parse tree or equivalently in rdp terms returned by a grammar rule or scanner primitive simple example may be found in the definition for rule e5 part of which is reproduced here e5 integer INTEGER result When rule e5 matches against an INTEGER the scanner can be asked to return the binary number corresponding to the INTEGER lexeme just recognised The specification here indicates that the return value should be loaded into an at tribute called result At the end of a rule the current value of result is returned to the caller of the rule so the effect of this rdp IBNF fragment is to parse an integer and return the corresponding binary number to the caller Expression evaluation The expression tree rules e1 e5 specified in lines 45 70 evaluates expressions by collecting the values of operands in rule e5 and passing them back up through the tree performing any calculations specified by operators en route Each operator has an attached semantic action which evaluates its operands into the result return value The semantic actions just use the equivalent op erator in the underlying C language except in the case of the exponentiation operat
103. er compiler called minitree which compiles from the same source language to the same MVM assembler code as miniloop The difference between the two compilers is that miniloop emits assembler code during the parse whereas minitree builds an internal representation of the source program a modified derivation tree and then in a separate phase traverses the tree to output the assembler code The two compilers are functionally almost identical as they stand but minitree allows code optimisations such as rearranging the order of instructions to be performed Since miniloop is a single pass compiler it cannot perform code re ordering miniloop features miniloop programs look like minicond programs with some additional fea tures the minicond and minicalc languages are almost strict subsets of the miniloop language so any minicalc or minicond program will be correctly handled by the miniloop compiler The only exception to this rule is that miniloop variable names must not start with two underscore characters This is because miniloop generates internal identifier names with that form and we do not want user identifiers and internal identifiers to clash Figure 7 1 shows an example miniloop program 68 SINGLE PASS COMPILER FOR MINILOOP ok ok ke e ke ke jk fe kk e ke ke je ok fe 2k ok oe 2k ok ek ke EEEE EEEE EEEE EEEE EE ok ok RDP release 1 50 by Adrian Johnstone A Johnstone rhbnc ac uk 20 De
104. erand After the line has been processed the auxiliary function emit op is called to output the binary pattern corresponding to the instruction The other MVM instructions are processed similarly The eight assembler directives are handled by rule directive The INCLUDE directive collects a filename and calls text open to lookup and open the file If the file is not found text open returns NULL and an error message is issued If the file is successfully opened the text handler and scanner are initialised lines 91 92 There is no need to restore the scanner and text handler context when an included file is closed because the text handler performs this task automatically CODE and DATA directives switch the current assembly location to the code or data pointer respectively They also optionally take an expression and update the location accordingly The WORD BLOCKW and STRING directives allocate storage space for data WORD takes an expression which is evaluated and emitted directly The BLOCKW also takes an expression which is then used to update the location counter which has the effect of reserving a block of storage without initialising it The STRING directive accepts a double quote delimited string and then emits along with a terminating zero the ASCII NUL character The EQU directive takes an expression and updates the current label s val field accordingly The END directive marks the end of the assembly unit and sp
105. esent machine level quantities so Table 5 1 which shows the complete MVM instruction set gives the hexadecimal encodings for the instructions The functional description of each instruction in Table 5 1 uses a C like syn tax to explain the actions of the MVM processor on receipt of each instruction Main memory is modeled as an array of locations called mem and the program counter as a variable called PC The function resolve looks at the addressing mode of its corresponding operand and fetches the actual data Later in this chapter we give extracts from the source code of a simulator for MVM instruc tions which shows exactly how these functional descriptions may be turned into executable code 5 4 2 Address mode encoding The mode byte is split into two four bit nibbles called mode fields 1 Each of the two mode fields encode the address mode for one of the two source operands Since we only have two addressing modes to encode we could make do with only a single bit for each field but we wish to leave some capacity so that for instance a register based variant of MVM could be easily defined 38 MINI VIRTUAL MACHINE MVM Opcode 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E OF 10 5 5 Mnemonic Operands Function HALT Stop the processor ADD dst src src mem dst resolve srci resolve src2 SUB dst src src mem dst resolve srci resolve src2 MUL dst src src mem dst resolve srci resolv
106. f the last reported line number This may be different to the number of line endings seen because comment that spans a line ending will be parsed as a single comment lexeme and so some line endings may be hidden The file handling routines pretty open and pretty close are shown in Figure 9 5 and are straightforward 9 5 1 space array The 16 x 16 array of booleans space array is used by the pretty printer to decide whether a space should precede the current lexeme before it is output ON HB RB HR HR HH HR OHO AN DOP o o UNBE Auxiliary routines 111 Sk 2 ee ke aj ke oe ok aj ke ke jk ke jk kk af ake X X RDP release 1 50 by Adrian Johnstone A Johnstone rhbnc ac uk 20 December 1997 pr_c_aux h pretty printer semantic routines This file may be freely distributed Please mail improvements to the author EEEE EE EEE EEEE EEEE EE EEE EEE EEEE EEEE EEEE EEE EEE ke ke af ke kk af ok H enum kinds d K BLOCK CLOSE K BLOCK K CHARACTER K CLOSE BRACKET K COMMENT K DIADIC K_EOLN FIELD DELIM K KEYWORD K KEYWORD INDENT ITEM K MONADIC K BRACKET PREPR CESSOR K PUNCTUATION K STRING extern unsigned long indent size extern unsigned long comment st
107. graph get edge target this edge do stat rdp tree data graph get edge target graph get next edge this edge integer label new label emitf DO Z1lu WMn label relation expression walk rel stat emitf BEQ luNt ifn 4s go to __OD_ lu relation label relation label tree walk do stat emitf BRA __DO_ lu t go to __DO_ lu n__OD_ lu n label label label break default text message TEXT FATAL unexpected tree node found token number 1 identifier V 4s V Mn root gt token root gt id void code generate char source char output void tree root 1 emit open source output tree walk rdp tree data graph get next node tree root emit close End of mt aux c Figure 8 13 minitree auxiliary functions part 5 while and POST PARSE func tion minitree auxiliary functions 99 either one of the scanner primitives such as SCAN P ID the ID primitive or a keyword from the minitree grammar such as RDP T 17 the token The definitions of the primitives may be found supp scan h and the defini tions of the minitree tokens minitree h The expression walker returns at each level the name of the variable containing the result of the calculation per formed at that level in exactly the same way as the expression rules in miniloop transmit the names of locations back up the tree leaf nodes an expression must be either
108. hecking long range relation ships between program symbols variable declaration may occur a long way before that variable is used but the context free grammars used by rdp are es sentially only powerful enough to check local features of the language contezt sensitive grammar may be written in such way as to support type checking but efficient parsing techniques for practical context sensitive grammars are not available Instead we use an external symbol table and embedded semantic ac tions to keep track of the declaration and use of identifiers Our next tool which is a full interpreter for minicalc can check for undeclared variables without any extra overhead the interpreter needs a symbol table anyway to keep track of computed results and it turns out that adding checks for undeclared variables is straightforward Chapter 3 interpreter for minicalc 3 1 primary purpose of is to construct a parser for the language generated by an rdp IBNF specification Such a parser may be used as a syntax checker for the language as we have seen in the previous chapter Syntax checkers are useful but we really want to be able to write translators that perform some useful action as a side effect of performing a parse interpreter is a translator that executes actions specified in the parser whilst a parse is occurring To be suitable for interpretation the language grammar must be designed to be both parsable and exec
109. here are two internal registers used to hold the address and contents of the currently executing instruction MVM is a 16 bit processor in which arithmetic operations take place on 16 bit quantities and in which memory addresses also fit into 16 bit words This limitation to 16 bit memory addresses does constrain the size of the programs that we can write but is sufficient for demonstrating the ideas behind the development of a compiler It also means that the tools can be compiled and run on older 16 bit computers such as ordinary MS DOS machines If you have a 32 bit system and a suitable C compiler it is quite easy to extend the MVM specification and its simulator to support 32 bit operations and addresses Understanding a new processor is made easier if we list the capabilities of the architecture under three main headings 1 the memory resources provided by the architecture 2 the various ways in which operands may be fetched during instruction execution the addressing modes and 3 the collection of operations that may be programmed on that architecture the instruction set In each of these three areas MVM provides very limited facilities This makes MVM easy to understand easy to program and easy to write software simulators for but it does not make MVM a good target for efficient hardware implementation That need not concern us MVM is only really intended to 34 5 1 THE MINI VIRTUAL MACHINE MVM memory 65535 6
110. hich could build into a complete compiler for C targeted at a virtual machine of the MVM form The suggestions here form a coherent path through the tasks but an experienced language implementor would probably coalesce some of the intermediate stages together 120 DESIGN PROJECTS a i j k 1 Define language for the target machine The MVM assembly lan guage used in this manual is suitable as a basic language and has the advantage that an assembler and a simulator already exist for that language extremely ambitious choice would be to use the language of a real processor although this is only recommended for readers that are very familiar with programming the chosen proces sor Define a tree like intermediate data structure to represent the result of parsing the source program student can either decide to build this structure using actions embedded in the parser specification or use the automatic tree building capability of rdp Define subset of C simple subset might correspond to a ver sion of C that only allows integer operations has no pointers no user defined types and no capability to define functions Control structures might also be restricted a simple if then else state ment and a while do statement would suffice in the first instance Write an rdp specification that parses the chosen subset language and test it against a set of test examples illustrating both co
111. ibutes and we also saw how information calculated within a parser rule could be passed back to a calling rule using a similar mechanism These kinds of attributes are called synthesized attributes because the information is synthesized at the lower level either within the scanner or a rule and passed back the chain of produc tion rules Synthesized attributes correspond roughly to the return values of functions in a programming language and as we have seen this is precisely how they are implemented in rdp Sometimes we need to reverse this process and pass information down into production rules We can think of this as a rule inheriting information from the rule which called it and so these kinds of attributes are called inherited attributes They correspond roughly to function parameters in conventional programming languages and that is how they are implemented in rdp You can read more about the use of inherited attributes in Chapter 5 of the user manual 597 and Chapter 6 of the tutorial manual JS97c In the minicond interpreter we use inherited attributes in some rules to pass in a flag called interpret which controls the execution of semantic actions The use of this flag will be explained more fully in section 4 5 below Here we 28 4 4 THE MINICOND LANGUAGE INTERPRETATION WITH CONDITIONALS simply note that an inherited attribute is specified on the left hand side of a rule definition by adding parenthesised parameters
112. ient because discovering a derivation for an input text is so time consuming that is after all the primary function of the parsers that rdp constructs and it would clearly be wasteful to run the process several times Of course just because this is a wasteful process it need not stop us using it where applicable and rdp provides the PASSES directive for precisely this purpose Simple multi pass ap plications such as the implementation of a translator from a machine s assembly language to its machine code may usefully exploit this strategy You can read about the design and implementation of such as assembler in Chapter 6 Leaving aside issues of efficiency making multiple independent passes over the source text does not allow us to make connections between widely separated parts of the text because the parsers generated by rdp only look at a single symbol at a time they do not of themselves keep track of complete sentences or program statements However rdp can be set to build a derivation tree whilst it performs a parse This tree shows explicitly the relationships between 84 8 1 MINITREE MULTIPLE PASS COMPILER symbols in the source program that are only implicitly present in the original text and can be traversed and rearranged efficiently This chapter is about a compiler called minitree that accepts the same source language as the miniloop compiler described in the previous chapter and which outputs almost identical MVM asse
113. ight order much like the Pascal write statement The usual ANSI C escape sequences may be used to output non printing characters such an line end represented by n minicalc limitations minicalc can only perform integer computations and only allows strict se quencing of statements there being no flow of control statements In addition there is no read input statement to accompany the print output statement Enhanced versions of minicalc will be presented in later chapters A grammar for minicalc It is easy to construct an LL 1 grammar for minicalc suitable for input to rdp A parser generated from such a grammar with no semantic actions embedded within it acts as a pure parser syntaz checker for the minicalc language Figure 2 2 shows a suitable grammar from which to generate a mini syntax checker it is supplied with the rdp distribution as file mini_syn bnf and will be discussed in the remainder of this section OIN ON HB HR RB OH RB HR 0 Q o o UNBE 1 N E grammar for minicalc 9 ff hse he ke je ke ok ke he ok eek he koe he fe oe he ok koe ek koe hehe ek koe koe ek RDP release 1 50 by Adrian Johnstone A Johnstone rhbnc ac uk 20 December 1997 mini syn bnf a mini grammar for syntax checking This file may be free
114. iliary functions part 1 8 9 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 76 77 78 79 80 81 82 83 84 85 86 87 88 89 Auxiliary routines 113 void pretty_open char sourcefilename char outputfilename if strcmp sourcefilename outputfilename 0 text message TEXT FATAL temporary output filename is the same as the source filename if outputfilename outputfile stdout else if outputfile fopen outputfilename w NULL text message TEXT FATAL unable to open output file 57055 outputfilename void pretty_close char sourcefilename char outputfilename unsigned long useful_lexeme_count lexeme_count comment_count eoln_count char backup filename text force filetype sourcefilename bak fclose outputfile remove backup filename if rename sourcefilename backup filename O text message TEXT FATAL unable to rename s to s n sourcefilename backup filename if rename outputfilename sourcefilename 0 text message TEXT FATAL unable to rename s to s n outputfilename sourcefilename text printf As 41lu 41u 4 21fNn sourcefilename last line useful lexeme count double useful lexeme count double last line Figure 9 5 Pretty printer auxiliary functions part 2 114 PRETTY PRINTER FOR ANSI C 90 void pretty print char lexeme enum ki
115. ine 9 1 1 Command line options The pretty printer provides the normal rdp generated parser command line options along with the following two pretty printer specific flags indent spacing default indentation spacing is two spaces larger value makes the inden tation clearer whilst making the lines longer and some standards require the use of tab characters to show indentation flag of i0 will force pretty c to use one tab character per indent non zero value such as i4 will set the pretty printer to use that number of spaces per indent c comment start column pretty printer handles comments specially as will be described in the next section pretty c attempts to line up comments by moving them across to the comment start column which is column 30 by default This flag may be used to change the comment start column Using the pretty printer 103 9 1 2 File usage pretty_c is a single pass parser which reads the lexemes in the input file from left to right in the usual way and writes a reformatted version to a temporary file By default this file is called pretty c but a different temporary file name can be specified with the o option At the end of a successful run pretty c renames the source file to a file with the same name but a filetype of bak and then renames the temporary file to the original source file name It is a fatal error to try to make the temporary output file the same name as the
116. ine into memory The file may be found in the standard rdp distribution as examples rdp case mvmsim sim 5 7 3 Running a simulation We can run the simulator on the above test file with all options enabled by issuing the command mvmsim 1 t v examples rdp case mvmsim sim The output of this command is shown below 42 MINI VIRTUAL MACHINE MVM mvmsim v1 5 simulator for mvm Load address 1000 0C01000A007B Load address 1006 101100000004 Load address 100C 0011 Load address 100E 1000 1000 CPY 0004 007B 0000 gt 007B 1006 0000 007B 0000 gt 0000 123 100 HALT 0000 0000 0000 gt 0000 Halted 0 006 CPU seconds used 3 MVM instructions executed After the title line mvmsim echoes to the output the contents of the input file as it is loaded into the internal memory Execution then begins starting at the transfer address As each instruction is executed mvmsim outputs the address of the instruction its mnemonic and then the three operands in the order dst src and src2 If an operand is not used by an opcode then 0000 is output The operands are printed after the addressing mode has been resolved that is the actual data to be operated on is displayed rather than its address Hence when the instruction at location 100616 is being executed its second operand is shown as 7Bjig not as 000 16 which is the address specified in the load file The value written back to memory by the instruction is
117. ing to the operator being processed 7 9 2 miniloop auxiliary functions The miniloop auxiliary functions are shown in Figures 7 11 and 7 12 They perform file handling output to the assembler object file and some housekeeping concerned with the generation of unique labels Function emitf lines 24 31 forms the heart of the output routines it simulates the behaviour of the 0000 0000 0000 8000 8000 1000 1000 1000 8000 8002 8002 1000 1008 100 100 8002 8004 8004 100 1014 1014 8004 800A 800A 1014 101A 1020 1020 800A 800C 800C 1020 1026 102E 1034 1034 800C 8012 8012 1034 103A 1040 1040 8012 801B 801B 1040 1046 104E 1054 0001 0100807 400030004 001180008074 0001 0C0180020001 612069732000 0 0100008004 101100008000 0400 OF010000800A 0310807580000002 001180028075 622069732000 0 010000800 101100008002 2 20206220697320 0 0100008012 0211807600008002 101100008076 Q IQ M IQ N oH d BBB 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 CAN Da FF Implementing miniloop 75 testloop mvm generated from testloop m DATA Ox8000 __MPP_DATA CODE 0x1000 MPP CODE DATA a WORD O CODE ADD __temp 0 3 4 _ temp 0 3 4 CPY a __temp 0 __temp 0 DATA b WORD 0 CODE
118. ion 5 7 4 Implementing mvmsim mvmasm an assembler for MVM 6 1 6 2 6 3 6 4 6 5 A first example 6 1 1 Assembler output 6 1 2 Using the assembler and the simulator together Assembler syntax reference 6 2 1 Line oriented and free format languages 6 2 2 Lexical elements 6 2 3 Expressions 6 2 4 Instructions and addressing modes 6 2 5 Directives Implementing mvmasm 6 3 1 Multiple pass parsers 6 3 2 The EOLN scanner primitive The mvmasm grammar 6 4 1 Directives for setting up the parser 6 4 2 The MVM definition header 6 4 3 main mvmasm grammar 6 4 4 expression evaluator mvmasm auxiliary functions single pass compiler for 1 7 1 7 2 7 3 7 4 7 5 7 6 TT 7 8 7 9 miniloop features 7 11 begin end block compound statement 7 1 2 while loop Arranging data and code in memory Compiling declarations Compiling arithmetic expressions Compiling print statements Compiling if statements Compiling while loops Typical compiler output Implementing miniloop 7 9 1 grammar for miniloop 7 9 2 miniloop auxiliary functions 38 39 39 40 40 41 41 42 47 4T 48 49 50 50 51 51 51 53 54 54 56 56 56 56 60 61 62 67 67 68 69 69 70 71 71 72 73 73 74 74 74 8 minitree a multiple pass compiler 10 8 1 8 2 8 3 minitree intermediate form Implementing minitree 8 2 1 grammar for minitree minitree auxiliary functions
119. ion and we shall now discuss this in detail Left to right evaluation and left associative operators The left associative arithmetic operators and are specified with rules of the form el e2 2 2 2 2 The rule specifying the left associative operators e1 in this case calls its im mediate successor rule in the operator tree e2 on both sides of the operators A grammar for minicalc 13 being recognised in e1 This has the effect of ensuring that higher priority op erators on both sides of a or are evaluated first In addition the use of the zero or many iterator bracket ensures that a run of and or operators is processed in strictly left to right order i e the operators are evaluated in a left associative manner Right to left evaluation and right associative operators Right associative operators such as exponentiation are specified using rules of the form eb 4 The crucial difference between this kind of rule for right associative operators and the previous rule for defining left associative operators is that in this case the rule calls itself on the right hand side of the operator Because of the way the iterator works this ensures that in a run of exponentiation operators all of the operators to the right of the first one will be processed before the first one is processed By extension it is easy to see that the effect of this kind of right recursive rule is to ensur
120. ity In programming languages operators take values of a particular type and return values with a type Integer addition for instance takes two integers and returns an integer and the greater than or equal relational operator gt takes two integers and returns a boolean result If the return type of an operator is the same as the type of its acceptable operands then we can write expressions that contain a run of similar operators such 3 4 5 which we can read as either 3 4 3 4 5 With addition both interpretations evaluate to 12 but if we use subtraction instead then an ambiguity arises 3 4 5 is 6 but 3 4 is 3 1 which is 4 The associativity of an operator specifies which of the two interpretations should be selected subtraction is in fact left associative so 3 4 5 is interpreted as 3 4 5 In general if we have several operators at the same level of priority we need to decide in which order to evaluate the operands In most cases we evaluate from left to right so that 2 3 4 evaluates as 2 3 4notas 2 3 4 and 2 3 4 evaluates as 2 3 4 Evaluating from left to right automatically gives each of the operators left associativity Left to right evaluation is most common but strings of exponents such as 2 3 4 are traditionally evaluated in right to left order The mini grammars demonstrate how to ensure left to right and right to left operand evaluat
121. ks so these must be reinstated on output and 4 K_PREPROCESSOR lexemes do not contain the delimiting token so this must be inserted before output of the body of the preprocessor command final actions of the pretty print function are to set a temporary indent if an indentable keyword has been output and to update the last kind variable ready for the next invocation Chapter 10 Design projects In this chapter we give suggestions for enhancements to the mini languages described in earlier chapters that might reasonably be undertaken as exercises as well as a larger project to build a subset C compiler 1 2 10 11 12 13 14 Add block definition to minicond hint follow the syntax in miniloop bnf Add left and right shift operators to minicalc and its descendent lan guages Add logical operators to minicalc and its descendent languages Add real arithmetic minicalc and its descendent languages Add a switch statement minicond and miniloop Add a for loop to miniloop Add a goto statement to miniloop Add function definition and call to miniloop Implement common mode subexpression elimination for minitree Add registers to MVM and a simple register allocator to minitree Add a graph colouring register allocator to minitree Add conditional assembly to mvmasm Add macro definition and call to mvmasm Implement a subset C compiler with rdp This is an ambitious project w
122. line Pretty printer limitations 105 10 comment that is not the first lexeme on a line is reformatted to begin in the comment start column or two columns to the right of the previous lexeme whichever is the least By default the comment start column is column 30 but this may be changed with the command line option 9 3 Pretty printer limitations conventions listed above are a useful start but it turns out that there are some sequences of C statements that can result in ugly formatting In its present form pretty cis good enough for everyday use all of the source has been formatted using it for instance but in this section we note a series of special cases that are handled poorly In the next chapter we make some suggestions on how to extend the tool to cope with some more esoteric constructions 9 3 1 Operators which may be monadic or diadic Some language tokens serve a dual r le The operator for instance is used to denote multiplication pointer definition and pointer dereferencing Ideally we should like to produce formatted output is the following form char str int a a str 4 Since pretty only ever examines the token to be formatted and its immediate predecessor it is hard to distinguish between the monadic and uses of In the present pretty printer is always treated as a diadic operator with a space on both sides resulting in output of the form char str i
123. lt goto result int result long result register result return result short result signed result sizeof result static result struct result union result unsigned result void result volatile result keyword indent char do result else result for result 2149 result switch result while result Figure 9 2 rdp grammar for pretty printer part 2 110 FOR ANSI C first defines macro called with parameter b which has 3 assigned to it in the body of the macro The second defines a parameterless macro a which expands to the string b b 3 The absence or presence of the space between the macro name and the opening parenthesis is used to decided whether a macro has parameters or not This is an immediate problem for the rdp scanner because spaces are discarded and must be reconstructed from the token stream It would be possible to do this by keeping track of the column numbers for the tokens immediately after define token but fortunately for us there is a simpler solution In the C preprocessor no line endings are allowed within preprocessor directives a result we can make use of the rdp scanner s COMMENT LINE VISIBLE primitive to define a comment that opens with the token and closes with the line end This will cause the complete preprocessor directive to be handled in the
124. lude else 9vumvuNN GCG Y Build log 125 Warning Grammar is not LL 1 but F set writing files 1 error and 1 warning cc Irdp supp c minitree c minitree c cc Irdp supp c mt aux c mt aux c cc eminitree exe minitree obj m aux obj arg obj graph obj memalloc obj scan obj scanner obj set obj symbol obj textio obj minitree otesttree mvm testtree mvmasm otesttree sim testtree Transfer address 00001000 mvmsim testtree sim is 7 is 14 b is 14 cubed is 343 equals a does not equal a is 3 is 2 9mvoNN XGCG is 1 Halted rdp opascal F pascal Error LL 1 violation rule statement 9 else statement 1 contains null but first and follow sets both include else Warning Grammar is not LL 1 but F set writing files 1 error and 1 warning cc Irdp supp c pascal c pascal c epascal exe pascal obj arg obj graph obj memalloc obj scan obj scanner obj set obj symbol obj textio obj pascal test c pretty Irdp_supp c pretty c c pretty c c cc Irdp_supp c pr c aux c pr_c_aux c cc epretty c exe pretty c obj pr_c_aux obj arg obj graph obj memalloc obj scan obj scanner obj set obj symbol obj textio obj pretty c test test c 2133 12267 5 75 fc test c test bak Comparing files test c and test bak FC no differences encountered del test bak F
125. ly distributed Please mail improvements to the author ke ak ok 2k ok 2k ke 2k ke oe ok oe e ke ke jk oe e ok ae 2k af E EE af fe 2k EE EE EEEE EEEE EE af ke af ake 2k TITLE Mini syn V1 50 c Adrian Johnstone 1997 SUFFIX program 4 var dec statement var dec 5 int ID 1 statement ID 1 print C 1 String 9 e2 4 2 2 Add x 2 Subtract 2 e2 3 x Multiply Divide 24 e3 Posite e3 Negate eb 4 1 Exponentiate eb ID Variable INTEGER Numeric literal 2 1 comment COMMENT NEST Comments String STRING_ESC Strings for print End of mini syn bnf Figure 2 2 An rdp grammar specification for 1 mini syn bnf 10 THE MINICALC LANGUAGE A SIMPLE CALCULATOR WITH DECLARED VARIABLES minicalc source program comprises an arbitrary number of variable dec larations and statements each terminated with a semicolon as specified by rule program in line 13 Variable declarations specified in line 15 comprise the keyword int followed by a comma delimited list of variable names which may be optionally followed by an sign and an initialisation expression Statemen
126. mal 10 15 respectively Within a line space and tab characters may be used to format the source Comments are marked by a leading semicolon Any characters between a semicolon and the end of a line are ignored by the assembler A comment may start in any column 6 2 3 Expressions In any mvmasm context requiring a numeric value an expression may be used Expression operands may be identifiers or numbers as defined above or the pre defined identifiers TRUE and FALSE which are synonyms for the values 1 and 0 respectively The full set of ANSI C numeric operators is provided augmented by the operator which stands for exponentiation The supported operators with their priorities on a scale of 1 the lowest to 11 the highest are listed in Table 6 1 Internally all assembler arithmetic is done with the precision of along integer Most C compilers treat this as a 32 bit integer 6 2 4 Instructions and addressing modes The 17 MVM instructions are assembled using the mnemonics listed in Ta ble 5 1 Operands are separated by commas and take the form of an expression as defined above For source but not destination operands the expression may be preceded by a hash sign denoting literal addressing mode The hash has no effect on the value returned by the expression but sets the addressing mode for the operand containing the to literal mode The following are all valid instructions ADD temp x y sum x and y SUB temp temp
127. mar for a superset of ANSI C 109 comment char COMMENT_VISIBLE result preprocessor char COMMENT LINE VISIBLE d result monadic char result result result result diadic char amp amp result amp result result result result 4 result result result result result result gt gt result iresult 2 result result result 2 gt result result 2 result result amp e result result result result z2 i iresult 2 result result result 2 lt lt result 2 result result block open char result block close char j result open bracket char result result close bracket char result result item char result SCAN CAST id INTEGER REAL ID result result string char STRING ESC NMV result character char STRING ESC N NM result field delim char result result punctuation char result result result keyword char auto result break result case result char result const result continue result Idefault result double result enum result extern result float resu
128. mbler code but which uses a tree as an internal data structure During the parse the rdp generated minitree parser automatically constructs the intermediate form and then a POST PARSE function called code generate is called which traverses the tree emitting MVM instructions as it goes In principle optimising phases could be inserted between the parse phase and the code generation phase that would rearrange the tree to create more efficient code although we do not describe such optimi sations here We strongly recommend that before proceeding with this chapter you read Chapters 9 and 10 of the rdp user manual JS97a which describe rdp s tree generation facilities in detail minitree intermediate form When designing a tree based compiler the central decision concerns the infor mation to be retained in the tree after parsing One extreme option is to simply use the entire derivation tree which contains all the terminals matched as well as a node for every rule instance activated during the parse small programs in Figures 8 1 8 6 exercise all of the major syntactic features of minitree including declarations with and without initialisation assignment of expressions to variables print statements both if then and if then else statements a while do statement and a compound begin end statement Each program fragment is accompanied by a full derivation tree and the corresponding reduced derivation tree used as an intermediate fo
129. mbler for at least one real computer the Digico M16 minicomputer a 16 bit machine with an architecture similar to that of the 12 bit DEC PDP 8 did indeed enforce this restriction This machine assembled from paper tape and internal memory was very limited In addition the design of the instruction set meant that forward jumps were less common than on modern machines so the designers thought that only having to feed the source papertape through once was a sufficient advantage to justify banning forward references Implementing mvmasm 55 fixup assembler assembles the source into a buffer in memory When it encounters the BEQ instruction above it assumes a value of zero for the value of the label done but also adds the instruction to a list of references for done When it subsequently assembles the definition of done it then knows the correct value and so can go back through the list for done filling in the correct value wherever it has been used This approach called fixing up the forward references allows assembly to be completed in a single pass of the source file but it requires the assembler to maintain an internal buffer which is as large as the largest possible program that could be assembled and in addition a potentially large number of reference lists In practice a fixup based assembler can be rather complicated and might require a large amount of runtime storage By far the most common solution to the problem is to make two or
130. mbler output file and between them the tree walker is called 1 Use of the graph library 2 The tree walker The expression walker is recursively called once for each node in the expression tree t each call a subtree is passed as a parameter and the function examines the token number of the root node in that subtree The token numbers comprise DOP PPP INO INO INO I2 IO IO ONE ATA PWN gt ON CAN Da FF 0l Hn N ds ds minitree auxiliary functions 93 kk kk kk ek he ke je ke ke EEEE EEEE EEEE EEEE EEE ke ke ke ke ek koe ke fe ek ok koe ok eek RDP release 1 50 by Adrian Johnstone A Johnstone rhbnc ac uk 20 December 1997 minitree bnf a mini parser which builds an intermediate form This file may be freely distributed Please mail improvements to the author X X ke 2k ke oe ok oe e ke ke 2k ke ke 2k ok fe 2k 2k TITLE Minitree compiler V1 50 c Adrian Johnstone 1997 SUFFIX m PARSER program USES m1 aux h USES mt aux h DUTPUT FILE minitree mvm TREE POST PARSE code generate rdp sourcefilename rdp outputfilename r
131. nd the limitations of sin gle pass translators 1 2 Intermediate forms and translation to virtual machine code 1 3 The mini family The minicalc language a simple calculator with declared variables 2 minicalc features 2 2 minicalc limitations 2 3 A grammar for minicalc 2 3 1 Specifying expressions 2 4 Hints on selecting operator priority and associativity 2 5 A minicalc syntax checker An interpreter for minicalc 3 1 Declaring symbol tables 3 2 Using synthesized attributes 3 3 Expression evaluation 3 4 Accessing the symbol table The minicond language interpretation with conditionals 4 1 grammar for minicond 4 2 Adding conditional operators 4 3 Using inherited attributes 4 4 Using semantic rules 4 5 Adding conditional execution 4 5 1 Semantic actions for conditional execution 4 6 Next steps The Mini Virtual Machine MVM 5 1 MVM memory 5 2 MVM instruction execution 5 3 MVM addressing modes 5 4 instructions 5 4 1 Instruction set capabilities 5 4 2 Address mode encoding 5 5 Example MVM instructions gt NN 14 14 17 17 20 20 21 23 23 27 27 28 29 31 31 33 34 35 36 37 37 37 38 5 6 5 6 5 7 5 5 1 Data manipulation instructions and address modes 5 5 2 Control manipulation instructions Using an assembler to program MVM mvmsim a simulator for MVM byte codes 5 7 1 Using mvmsim 5 7 2 The mvmsim input file format 5 7 8 Running a simulat
132. nds kind unsigned long column unsigned long line 91 1 92 static int last kind EOLN 93 Static int indentation 0 94 static int temporary indent 0 95 static int printed 0 96 97 lexeme_count bump lexeme counter for statistics 98 last_line line remember the highest line number seen 99 100 if kind K_BLOCK_CLOSE 101 indentation 102 else if last kind K BLO CK OPEN 103 indentation t 104 105 if last kind K_EOLN do indentation 106 107 int indent count space count 108 109 if temporary indent amp amp kind K BL CK OPEN add an indent of we aren t opening a block 110 indentation t 111 112 for indent count 0 indent count lt indentation indent_count 113 if column 1 amp amp kind K COMMENT Don t indent comments that start in column 1 114 if indent size 0 115 1 116 fprintf outputfile t indent using a tab 117 printed text get tab width 118 119 else 120 for space_count 0 space_count lt indent_size space_count 121 printed fprintf outputfile 122 123 if temporary indent amp amp kind K_BLOCK_OPEN reset temporary indent 124 indentation 125 126 temporary indent 0 127 128 129 if space table last insert space if necessary 130 printed fprintf outputfile 131
133. ng address 65 static void put memory byte unsigned long address int data static void put memory word unsigned long address int data Figure 5 3 Extracts from the mvmsim simulator memory declarations Memory and program counter declarations Figure 5 3 shows the declarations that model the MVM memory Line 18 spec ifies the size of the simulated memory which is restricted to 655361u 65536 as a long unsigned number or 64K bytes in this 16 bit MVM simulator It is possible to reduce the size of the MVM memory by adjusting this figure but of course MVM programs must then ensure that they only work within the available memory The internal memory itself is modeled by an array of unsigned char line 22 and the program counter by an unsigned long integer line 23 We could access the MVM memory by simply reading and writing to the memory array but one of the characteristics of machine level programming is that programs often contain errors bad error might cause the simu lator to run amok and start executing from illegal host addresses as to control this kind of problem all memory access is channeled through the get memory byte get memory word put memory byte and put memory word routines declared in lines 39 76 These routines validate the memory address issuing a fatal error message if the program being simulated tried to access a non existent location The get memory byte and get memory word
134. ng that miniloop is strict about the placement of semicolons which are statement separators not statement terminators as they are in ANSI C The last statement in a begin end ok sk ok 2k oe ae ke e af ke ke le ke jk ek ok e af kj ek k af kk af ek fe fe kc ok ae af fe 2k ke Arranging data and code memory 69 is 7 is 14 b is 14 cubed is 343 equals a does not equal a is 3 is 2 is 1 Halted 9vumvuNN GCG Y Figure 7 2 mvmsim output for assembled output from miniloop for the example program block cannot by definition therefore have a semicolon following it so it is always an error to have a semicolon before an end statement This usage follows that of Pascal and Algol 68 although Pascal does allow an empty statement which in most cases allows spurious semicolons to be accepted 7 1 2 while loop Lines 31 35 of Figure 7 1 illustrate the use of the while loop which is essentially identical to the while loop in Pascal relational expression is repeatedly evaluated and the statement after the do keyword is evaluated as long as the expression is TRUE The statement may be either a simple statement such as print or indeed another while or it may be a compound statement as in the example 7 2 Arranging data and code in memory MVM is limited to 64K bytes of memory because the address fields in the in structions are only 16 bits long and 216
135. nguage and reject inputs that are not It can be very hard to ensure that a parser does have this property and we know that some aspects of language such as type checking are not amenable to specification using just context free grammars In these cases we must use semantic checks to increase the checking power of the parser For our pretty printer we look at a radically different approach to language parsing in which minimilist rdp grammar is constructed that will parse all valid ANSI C programs as well as a large number of syntactically invalid ones rationale here is that an ANSI C programmer who wishes to use the pretty printer will also have access to an ANSI C compiler which will be able to detect syntactically invalid programs so we can reasonably assume that the ANSI C program presented to the pretty printer will already have been checked for validity Therefore we can safely use a parser for a superset of the ANSI C language and not bother to check every detail This allows us to use very significantly simplified grammar but the limitation is that our pretty printer has to make formatting decisions on the basis of the current input lexeme and its immediate predecessor The pretty printer never knows whether it is inside function definition or processing global definitions for instance and as a result it cannot vary formatting according to the kind of construct it is processing Using the pretty printer pre
136. ning the result of evaluating the relational expression and n is the number of the control statement This has the effect of jumping to the end of the while do if the condition was false The code for the do block is then emitted followed by the __OD_n label Typical compiler output Figures 7 6 7 8 show the compiled output for the test program in Figure 7 1 which contains instances of all the constructs described above In particular note the setup and wrapup code in lines 3 6 and 137 142 the declaration in lines 8 9 the arithmetic expression evaluation in line 12 and the assignment 74 SINGLE PASS COMPILER FOR MINILOOP of its result to user variable a in line 13 the if then else statement at lines 97 114 and the while do statement at lines 116 135 7 9 Implementing miniloop The miniloop compiler makes a single pass over the source file emitting MVM instructions as it goes We shall examine the grammar first and then look at the auxiliary functions which perform the actual code output 7 9 1 grammar for miniloop The overall form of the miniloop grammar shown in Figures 7 9 and 7 10 is similar to the minicond grammar with the addition of syntax for a while do loop in lines 56 61 and syntax for the begin end compound statement on line 67 The symbol table declared in lines 17 22 is used only for keeping track of whether a variable has been correctly identified and so the symbol table data specified in line 21 includes only the id
137. nt a str 4 9 3 2 Consecutive indenting keywords convention for indenting keywords is that they should cause a temporary indentation of the following line This is inadequate for the case of a sequence of indenting keywords on neighbouring lines For instance this piece of code if x 0 do 3 while lt x will reformatted as if x 0 do 3 while y lt x 106 FOR ANSI C This is because temporary indentations do not accumulate 9 3 3 Continuation lines Occasionally a long expression or function call will be broken over several lines with significant horizontal formatting pretty does not preserve this format ting Consider 4 long function call first second third Ideally we would like pretty_c to recognise that the open parenthesis marks the start of a new indentation level but in fact pretty c will simply reformat this as 4 long function call first second third 9 34 Embedded comments comment which is not the first lexeme on a line will be moved to the comment start column if possible This is undesirable if the comment is intended to be embedded within a line x func 3 parameter width 45 67 will be reformatted as x func 3 parameter width 45 67 9 3 5 Formatting of lexemes In one place pretty does not even follow its own conventions a string or characte
138. ock expression tree in lines 49 74 uses the techniques described in the user manual to build operator trees with the usual priority and associativity rela tionships built into their structure The promote above operator is used to handle the left associative operators and the natural tree ordering ensures that the operator priorities are correctly implemented We assume that the tree is to be traversed in a depth first left to right manner so that higher priority operators will appear deeper in the tree minitree auxiliary functions minitree makes use of the miniloop auxiliary functions described previously for handling output to the assembler file and opening and closing the file You should refer to the previous chapter for a discussion of these code emission functions The minitree auxiliary file mt aux c shown in Figures 8 9 8 13 contains three extra functions 1 a top level function code generate at lines 188 193 that is called as the POST PARSE function from the grammar 2 a depth first left to right tree traversal function that processes expression trees expression walk at lines 20 76 and 3 a depth first left to right tree traversal function that processes statements and calls the expression walker where appropriate tree walk at lines 78 186 The code generate function is straightforward it calls the emit open and emit close functions used by miniloop to initialise and close the as se
139. of the relational operator so a sequence of such operators in an expression will be rejected by the parser In the next chapter we shall add relational operators to mini using a rule of the form described here It is perhaps worth noting that in some languages this issue is rather obscured by the confusion of boolean values with integer 14 THEMINICALC LANGUAGE A SIMPLE CALCULATOR WITH DECLARED VARIABLES values Languages such as Pascal are strict whereas others such as ANSI C simply use integer values instead of true booleans and may even provide fully left associative relational operators 2 4 Hints on selecting operator priority and associativity Most common operators are left associative since this corresponds to a left to right evaluation rule which is natural for cultures that read left to right Occasionally an operator is given right associativity for special reasons The exponentiation operator is an example of such an operator The reason that exponentiation is traditionally right associative is that when written in the traditional mathematical notation using position rather than a symbol for the operator the expression 27 can be trivially rewritten as 27 Since there is already such a straightforward way of writing left associating exponentiation it makes sense to define the exponent operator as right associative so 27 corre sponds to 9 Thus using the programming language notation y z is interpreted
140. ond c cc eminicond exe minicond obj arg obj graph obj memalloc obj scan obj scanner obj set obj symbol obj textio obj minicond testcond is 7 is 14 b is 14 cubed is 343 equals a does not equal a rdp F ominiloop miniloop Error LL 1 violation rule statement 2 else statement 1 contains null but first and follow sets both include else N N Warning Grammar is not LL 1 but F set writing files 1 error and 1 warning cc Irdp_supp c miniloop c miniloop c cc Irdp supp c ml aux c ml aux c cc eminiloop exe miniloop obj aux obj arg obj graph obj memalloc obj scan obj scanner obj set obj symbol obj textio obj rdp F omvmasm mvmasm cc Irdp supp c mvmasm c mvmasm c cc Irdp supp c mvm aux c mvm aux c cc emvmasm exe mvmasm obj mvm aux obj arg obj graph obj memalloc obj scan obj scanner obj set obj symbol obj textio obj cc Irdp supp c mvmsim c mvmsim c cc emvmsim exe mvmsim obj arg obj graph obj memalloc obj scan obj scanner obj set obj symbol obj textio obj miniloop otestloop mvm testloop mvmasm otestloop sim testloop Transfer address 00001000 mvmsim testloop sim is 7 is 14 b is 14 cubed is 343 equals a does not equal a is 3 is 2 is 1 Halted rdp F ominitree minitree Error LL 1 violation rule statement 2 else statement 1 contains null but first and follow sets both inc
141. ook at fully interpreted versions of mini and the design of an assembler and simulator for MVM The level of presentation is aimed at readers who are fa miliar with the principles of parser generators and the ANSI C programming language If you are completely new to translator design you may find it helpful to read the rdp tutorial manual JS97c and the accompanying user JS97a and support JS97b manuals In detail we will develop the following tools 1 A syntax checker and interpreter for minicalc a language that provides declarations assignment expression evaluation and output 2 An interpreter for minicond which has block statements relational oper ators and an if then else statement in addition to the basic minicalc language 3 A paper architecture called the Mini Virtual Machine MVM and its specification as a simulator for MVM called mvmsim written in ANSI C 4 An assembler called mvmasm that translates MVM assembly language into binary code The implementation of mvmasm illustrates the design mini family 5 issues assemblers which are culturally rather different from high level programming languages 5 A single pass compiler for the language miniloop which adds a while loop construct to minicond and outputs MVM assembler source suitable for translation with mvmasm into MVM binary which may be executed by mvmsim 6 multiple pass compiler in which the parser builds tree based inter mediat
142. or equal 76 lt el right result result lt right Less than or equal TT gt el right result result right Equal 78 1 2 el right result result right 1 Not equal 79 80 ei integer e2 result 47 e2 right result right Add 81 e2 right result right 3 Subtract 82 83 e2 integer e3 result 47 e3 right result right Multiply 84 gt e3 right Divide 85 if result 0 86 text message TEXT FATAL ECHO Divide by zero attempted n 87 else result right 88 89 90 91 e3 integer 2 e3 result Posite 92 e3 result result result Negate 93 e4 result 94 95 e4 integer eb result e4 right Exponentiate 96 result integer pow double result double right 97 l 98 99 eb integer ID name _lookup name result Variable access 100 INTEGER result Numeric literal 101 C ei result Do first 102 103 comment COMMENT NEST Comments 104 105 String char STRING ESC result Strings for print 106 107 End of minicond bnf Figure 4 4 An rdp specification for the minicond interpreter part 2 4 2 4 3 Adding conditional operators 27 Adding conditional operators rule 0 defined in lines
143. or which does not exist in ANSI C Exponentiation is therefore han dled by calling the pow standard library function ensuring that the integer operands supplied by the mini code are re cast as double precision real num bers The header file for the maths library must be added to the list of files which are includeed into the parser and this is specified with the USES di rective in line 12 We must be particularly cautious with the divide operator because an attempt to divide by zero would generate an arithmetic trap on some com puter architectures or even worse quietly generate undefined results on some Accessing the symbol table 21 others The semantic action for the divide operator checks for this condition before attempting to evaluate any divisions and issues a fatal error message if necessary which will abort interpretation You can read about the routine text message which is used to issue error messages in Chapter 8 of the support library manual JS97b Accessing the symbol table When a new variable is declared in a mini program using an int declaration we must create a new symbol table entry which will hold the value of the variable When the corresponding variable identifier appears within an expression we must access the symbol table to retrieve the value and when an identifier appears on the left hand side of an assignment we must access the symbol table to update the variable s value field symbol table libr
144. ords long is specified These form the array of temporary variables used during expression evaluation as described in section 7 4 and represented by the region Internal temporary data in Figure 7 3 Finally the value of __ _ is established as the transfer address by naming it in an END directive 7 3 Compiling declarations declaration in miniloop such as int a reserves space for one integer in memory and makes the identifier a synonym for the address of that variable Whenever the compiler encounters a declaration it switches to the DATA location and assembles a WORD directive DATA a WORD O 7 4 7 5 Compiling arithmetic expressions 71 The WORD directive reserves one word two bytes of memory and initialises them in this case to zero Following such a declaration we can use the identifier a to refer symbolically to the location holding the contents of the variable just as we would in the high level source code Compiling arithmetic expressions The parser breaks expressions down into individual operations taking account of operator priority and associativity as discussed in Chapter 2 Each operation is then compiled into the corresponding MVM instruction with the destination operand being a temporary variable sub expression of the form 3 4 will be compiled to ADD temp 0 3 4 temp 0 3 4 The temporary variables do not need to be separately declared in the way that user variables
145. ot easily be extended to handle looping because our parsers are designed to make complete passes over the source program and a loop construct would require us to skip backwards in the parse This is certainly not impossible to implement but would require detailed knowledge of rdp s internals and is not the recommended approach Instead we shall move to a full compiler for mini which outputs instructions for a very simple computer called the Mini Virtual Machine By providing an assembler and simulator for we build a complete system that models the tasks of a compiler for a real processor Chapter 5 The Mini Virtual Machine MVM MVM is paper architecture designed to support efficient interpretation a host architecture We will use MVM to illustrate the techniques of virtual machine simulation assembly language translation and compilation In this chapter we shall describe the MVM architecture and a simulator for that ar chitecture written in ANSI C As well as providing a means to execute MVM programs the simulator source code can be treated as an exact specification of the architecture We will begin by describing the architecture informally Mini Virtual Machine is a very simple architecture based around a con ventional memory to memory processor This means that all MVM operations execute directly on the contents of memory locations there are no registers or stacks available for storing data although t
146. parser as a single monolithic unit just like a comment In this way the spacing is preserved Of course a side effect of this is that preprocessor lines will never be prettified but given the subtleties of parsing preprocessor directives this conservative design decision is perhaps justified The lexeme and its associated kind value are passed to the auxiliary function pretty print in line 52 along with the line and column numbers for the token This function will be described in the next section we simply note here that the lexeme will be printed possibly with a preceding space to the output file and that line end tokens will be followed by a string of spaces corresponding to the indentation level 9 5 Auxiliary routines auxiliary functions and the kind enumeration are defined in the auxiliary header file aux h shown in Figure 9 3 The two externally visible vari ables indent size and comment start receive the values of the i and c command line arguments The kind enumeration has 17 values the first 16 correspond to the 16 subrules in the pretty grammar and the last one is a dummy value that is set to the number of subrules The source code for the three auxiliary functions is shown in Figures 9 4 9 7 The data declarations are in Figure 9 4 and include variables to keep count of the number of line endings seen the number of lexemes seen and the number of comments We also remember the value o
147. pass compiler semantics This file may be freely distributed Please mail improvements to the author fe sk ok 2k ke 2k ke oe ke ok e ke ke kk fe kc ok fe fe kc ok ae af ok kk ke kk program TITLE Miniloop compiler V1 50 c Adrian Johnstone 1997 SUFFIX PARSER program USES ml_aux h TREE OUTPUT_FILE miniloop mvm SYMBOL_TABLE mini 101 31 symbol_compare_string symbol_hash_string symbol_print_string char id check declared if symbol lookup key mini dst NULL NULL 1 text message TEXT ERROR Undeclared variable s n dst symbol insert key mini dst sizeof char sizeof mini data emit open rdp sourcefilename rdp outputfilename 1 var dec statement emit close int ID dst emitf Nn DATA n s WORD O n n CODE n dst 2 eO left emit CPY dst left NULL symbol insert key mini amp dst sizeof char sizeof mini data if dst amp amp dstt 1 text message TEXT ERROR ECHO variable names must not begin with two underscoresWn Statement Declaration ID dst check_declared gt eO left emit CPY dst left NULL assignment integer label new_label if statement emitf __IF_ lu n
148. r literal containing an octal escape sequence such as NO3 or An embedded control X012 character will be output with the numerical escape sequence reformatted to use hexadec imal notation as in 03 or An embedded control XOA character This minor unpleasantness arises from a limitation of the rdp scanner which only returns the binary version of a string or character literal A grammar for a superset of ANSI C 107 9 4 A grammar for a superset of ANSI C Our aim with the grammar for pretty_c is to accept all valid ANSI C programs but we are not limited to accepting only valid ANSI C The outline form of the grammar is program any valid ANSI C lezeme This will be a string of zero or more ANSI C lexemes in whatever order including sequences that are not syntactically correct ANSI C so for instance pretty_c would accept a program of the form int main void else 3 do case 16 which should certainly be rejected by any real C compiler The full pretty_c grammar specification is shown in Figures 9 1 and 9 2 The top level rule program accepts zero or more matches against one of 16 subrules that between them generate the complete ANSI C lexicon Each of the 16 subrules defines a particular kind of token and each kind has different spacing conventions all of the diadic operators for instance are defined in rule diadic Rule program receives in the synthesized attribute lexeme the string of charac
149. r or an arbitrary non zero number of space characters per indentation level Some lexemes are output with a preceding inter token space Diadic oper ators such as gt gt or 7 for instance are always surrounded by single space characters In detail pretty classifies each language token into one of 16 kinds and maintains a 16 x 16 array of boolean values that spec ify whether an ordered pair of language tokens should be separated by a space or not All line endings are preserved Some pretty printers attempt to ensure that a blank line is inserted after each block of declarations and in some other contexts pretty c preserves whatever convention for vertical spac ing already exists in the file to be formatted the only changes made are within a line opening brace 1 increases the indentation level by one and a closing brace decreases the indentation level by one The keywords do while for if else and switch are indenting key words line after an indenting keyword will have its indentation level increased by one unless it starts with an open brace 4 Subsequent lines will not be affected and will be indented as they would have been if the indenting keyword had not been encountered comment that starts in the first column is never indented comment that does not start in the first column but which is the first lexeme on a line is indented using the current indentation level for that
150. re that it is configured for your needs by uncommenting one of the blocks of macro definitions at the top of the file 3 To build everything go to the directory containing the makefile and type make The default target in the makefile builds rdp the mini syn syn tax analyser the minicalc interpreter the minicond interpreter the miniloop compiler the minitree compiler an assembler called mvmasm and its accompanying simulator mvmsim a parser for the Pascal language and a pretty printer for ANSI C The tools are run on various test files None of these should generate any errors except for LL 1 errors caused by the mini and Pascal if statements and warnings from rdp about un used comment rules which are normal make then builds rdpi a machine generated version of rdp1 is then used to reproduce itself creating a file called rdp2 The two machine generated versions are compared with each other to make sure that the bootstrap has been successful Finally the machine generated versions are deleted 4 If you type make clean all the object files and the machine generated rdp versions will be deleted leaving the distribution files plus the new 122 ACQUIRING AND INSTALLING RDP OO0readme 1 5 makefile minicalc bnf minicond bnf miniloop bnf minitree bnf mini syn bnf ml aux c ml aux h mt aux c mt aux h mvmasm bnf mvmsim c mvm aux c mvm aux h mvm def h pascal bnf pretty c bnf pr c aux c pr c aux h rdp
151. ression comment COMMENT NEST Comments stripped by lexer String char STRING ESC NX result Strings for print End of miniloop bnf Figure 7 10 An rdp BNF specification for miniloop part 2 expressions 80 SINGLE PASS COMPILER FOR MINILOOP ANSI C printf output function by accepting a formatted output string and an arbitrary number of output fields and then using ANSI C vprintf and vfprintf functions to format the output The ANSI C standard library macros va list va start and va end are used to handle the variable number of arguments which emitf may be passed see any good book on ANSI C for an explanation of their use emit open and emit close functions open and close the output file as well as writing the wrapper code that appears at the start and end of every compiled program see section 7 2 The function emit is used to output a single assembler instruction along with a comment that renders the operation in an algebraic form to make reading the output easier for those not used to assembler format emit print function is a specialised output routine for handling the print statement in miniloop It generates the code templates discussed in section 7 5 The new temporary function allocates a block of memory to hold the name of the temporary and then uses the sprintf function to construct the name Qo 2 P2
152. rm by minitree Full derivation trees for a parse grow rapidly with program length putting all the program fragments together into a ten line program yields a tree contain ing 184 nodes The tree is mostly broad and flat with long catkins hanging off of some nodes catkins are generated by the expression rules every time an integer or a variable is referenced the parser must recurse right down to the bottom of the expression tree giving rise to these long vertical chains More than a quarter of the nodes in the derivation tree are of this form and the pro portion would be even higher if the expression tree had more levels that is if we had more priority levels in the expressions as we do in the mvmasm grammar for instance Our reduced derivation trees typically contain only one quarter of the nodes of a full derivation tree and yet the original program may be re constructed from a reduced derivation tree In particular the expression rules no longer generate catkins but are only as deep as they need to be to show the operators actually used in the source expression An efficient intermediate form for a compiler should retain all the informa tion needed to reconstruct the original program but no more Text books compilers often distinguish between Concrete Syntaz Trees and Abstract Syn minitree intermediate form 85 Y Y INTEGER 3 INTEGER 4
153. routines take an address and return either a single byte or a single word which is formed by concatenating the addressed byte with the contents of the location address 1 In this case the addressed byte forms the least significant byte of the returned word The put memory word function takes an address and 16 bit data word The least significant byte of the data word is written into memory at the specified address and the most significant byte is loaded to location address 1 The main execution loop After the simulator has loaded the memory array and set the program counter to the value of the transfer address the function mvmsim execute is called This function loops until a HALT instruction is encountered executing one in struction per iteration The full source of the mvm_execute function is shown in Figures 5 4 5 6 44 MINI VIRTUAL MACHINE MVM 160 static void mvmsim_execute void 161 162 int stop 0 163 164 while stop 165 1 166 unsigned op get memory byte pc 167 mode get memory byte pc 1 168 int dst get memory word pc 2 169 srci get memory word pc 4 170 src2 get memory word pc 6 171 172 exec count t 173 174 do indirections on modes 175 if mode gt gt 4 1 176 src1 get memory word srci 177 178 if mode amp 7 1 179 src2 get memory word src2 180 Figure 5 4 Extracts from the mvmsim simulator the execute
154. rrect and incorrect usage Enhance the rdp specification using either explicit semantic actions or the rdp tree operators to build the intermediate form Write a POST PARSE function that traverses the intermediate form emitting instructions for the target processor Demonstrate correct compilation and execution of test programs us ing simulation or by direct execution on the target architecture Add a full complement of C control structures including switch break goto and for Add support for floating point arithmetic Add support for function definition Add support for user defined type definition Add support for pointers Implement common subexpression elimination Implement register allocation using graph colouring Appendix Acquiring and installing rdp may be fetched using anonymous ftp to ftp dcs rhbnc ac uk If you are a Unix user download pub rdp rdpz y tar or if you are an MS DOS user download pub rdp rdpz y zip In each case 2 y should be the highest number in the directory You can also access the rdp distribution via the rdp Web page at http www dcs rhbnc ac uk research languages rdp shmtl If all else fails try mailing directly to A Johnstone rhbnc ac uk and a tape or disk will be sent to you 1 installation 1 Unpack the distribution kit You should have the files listed in Table A 1 2 The makefile can be used with many different operating systems and compilers Edit it to make su
155. rs or groups of spaces according to the value of the size variable that is set using the i command line option In each case the printed variable is updated to show the col umn number after printing The routine text get tab width is used to get the current value of the tab setting as set using the t command line option Incidentally note the ugly layout of this code which is manifestation of the problem described in section 9 3 2 space table is accessed in lines 129 130 to control the output of a space character before the current lexeme is printed You may dislike the spacing convention used here some people like a space after an opening parenthesis and a space before a closing parenthesis for instance in which case you should experiment with modifications to the space table lexemes are printed out under the control of the switch statement at lines 133 170 Most token kinds receive the default treatment of being simply printed out However the following token kinds require special treatment 1 EOLN must be output as a newline character and the eo1n count and printed variables must be updated at the same time 2 K COMMENT lexemes do not contain the delimiting and brackets so these must be reinstated on output and the comment must be placed as near to the comment start column as possible Auxiliary routines 117 3 K STRING and CHARACTER lexemes do not contain the delimiting quote mar
156. s have been free format allowing whitespace and line breaks to appear between any two language tokens Low level languages such as assem blers have tended to retain the older style not least because it can be simpler to hand write a parser for a line oriented language In particular error recovery is eased if a syntax error is detected on a line then after reporting the error the parser can simply restart at the start of the next line re synchronisation of the parser after an error in a free format language can be much harder and Assembler syntax reference 51 as a result errors in free format languages can generate an avalanche of spurious error messages 6 2 2 Lexical elements Identifiers in mvmasm follow the rules for identifiers in ANSI C that is an identi fier may begin with an alphabetic character or an underscore and continue with zero or more alphabetic characters digits or underscores The length of an iden tifier is limited only by the available memory within the running assembler and is for practical purposes unbounded Numbers start with a digit If the first digit is a zero 0 and this is immedi ately followed by a lower or upper case x character then the rest of the number is assumed to be in hexadecimal format otherwise the number is assumed to be decimal Decimal numbers are made up of the digits 0 9 Hexadecimal numbers can additionally use the letters in either upper or lower case to represent hexadeci
157. se parentheses and in fact the equivalent Pascal statements do not require them They are just there to please the eye and may be omitted from the intermediate form The second case is represented by the many kinds of brackets used in pro gramming languages including parentheses in arithmetic expressions the brack ets around array index expressions and the begin end constructs or in ANSI C These brackets are used to show the nesting in a program but any tree form can show nesting naturally in terms of the parent child relationships between nodes so the bracketing terminals are redundant The intermediate tree forms used by real compilers tend to be rather ad hoc rdp provides a standardised way to build trees by applying promotion operators to nodes within the full derivation tree The user manual 597 contains examples of standard approaches to common language features and we have applied these to the implementation of minitree Implementing minitree One of the advantages of multi pass implementation scheme is that it allows a clean separation between the grammar and the semantics of code generation The only semantic actions left in the grammar file minitree bnf are those that use the symbol table to check that all variables encountered have been correctly declared All of the code generation calls to the various emit auxiliary functions have been shifted to the tree walker code minitree uses the same auxiliary sem
158. sed and treated as whitespace 6 4 mvmasm grammar listing of mvmasm bnf the rdp specification for mvmasm is shown in Fig ures 6 1 6 3 The main body of the mvmasm grammar is shown in Figure 6 2 The third part of the listing Figure 6 3 shows a self contained interpreter for arithmetic expressions based on the C language operators This part of the grammar is a useful starting point for any small language based around expression evaluation We shall look at the three parts in turn 6 4 1 Directives for setting up the parser mvmasm source files have default file type mvm as specified in line 11 Three header files are used by the grammar mvm aux h which contains the function prototypes for the auxiliary functions described in the next section mvm_def h which contains an enumeration listing the MVM operation codes see Fig ure 6 4 and the ANSI C library file math h which is used to implement the exponentiation operator 6 4 2 The MVM definition header The parser uses three passes to resolve forward references in the way described above Output file handling is performed by the PRE and POST PARSE functions init and quit O respectively declared on lines 18 and 19 The x com mand line switch is set up using an ARG BOOLEAN directive in line 21 along with some other additional information for the help message mvmasm uses a symbol table to keep track of labels and their contents When a label is first declared it is a
159. sing stage which is specified by the design of the language to be translated and the generation phase which is keyed to the architecture of the target processor This is usually achieved by allowing the parser to make one or more passes over the source program and by providing embedded semantic actions that translate the program into some simple intermediate form which captures the meaning of the program without requiring the large syntactic overhead of key words and English like syntax that are used in most human readable program ming languages The compilers miniloop and minitree described in Chap ters 7 and 8 respectively are examples of this approach 1 2 Intermediate forms and translation to virtual machine code 3 Intermediate forms and translation to virtual machine code Intermediate forms used in real compilers fall into two basic types a tree like structure closely related to the derivation trees described in JS97a or alterna tively a list of instructions for a paper architecture or virtual machine The virtual machine approach is illustrated by the miniloop compiler in Chapter 7 and the tree based approach by the minitree compiler described in Chapter 8 Virtual machines are superficially similar to real processors but they offer a level of abstraction above that of a real processor For instance it is common in intermediate forms to retain the variable names from the original user s pro gram rather than translating
160. statement as defined above This has the effect of jumping to the else block if the condition was false compiler then emits the code for the then block followed by the assem bler instruction BRA FI n go to FI n 7 7 7 8 Compiling while loops 73 DO_n relational test do block OD_n Figure 7 5 Flow of control through a compiled while do statement which causes control to flow unconditionally to the end of the if statement Finally the compiler emits the __ELSE_n label and the code for the else block which may be empty finishing off with the __FI_n label Compiling while loops A while do statement is similar to an if then statement with no else clause which is followed by a jump back to the relational test There are two blocks of code a relational expression and the do block Figure 7 5 illustrates the code template used by miniloop and the allowable control flow through the construct The compiler emits two labels for each while do loop one of the form __DO_n to mark the start of the statement and one of the form __OD_n to mark the end where n is the number of the control statement The relational expression is compiled first yielding a temporary variable which will contain a zero if the expression evaluates to FALSE and a one other wise The compiler then issues the assembler instruction BEQ __temp t 0D n ifn temp t go to __OD_n gt where is the number of the temporary contai
161. t left right Greater than or equal lt el right emit LE gt dst left right 1 Less than or equal gt el right emit EQ dst left right Equal 1 el right emit NE dst left right Not equal left dst result left ei char char dst e2 left 4 dst new_temporary 5 e2 right emit ADD dst left right Add 2 e2 right emit SUB dst left right Subtract left dst result left e2 char char dst e3 left 4 dst new_temporary 2 e3 right emit MUL dst left right Multiply gt e3 right emit DIV dst left right Divide left dst result left e3 char int negate 0 char dst C gt negate 1 e4 result Posite or negate if negate dst new temporary emit SUB dst result result dst e4 char char dst e5 left dst new_temporary e4 right emit EXP dst left right 1 Exponentiate left dst 1 result left eb char ID dst check declared result dst Variable access INTEGER val result char mem 11 12 sprintf result 1 val C ei result Parenthesised exp
162. t expressions using the value of the correctly spelt variable would then use the old value causing hard to find errors Variable declarations are also used to establish the type of a variable which restricts the kinds of values that may be assigned and the kinds of operations that may be applied to the declared variable Type checking can catch pro gramming errors such as trying to add a number to a string or attempting to use an integer instead of a pointer value minicalc provides constructs for variable declaration for assignment of the results of arithmetic operations to those variables and for the values of those variables to be printed out It is effectively only as powerful as a desktop cal culator with named variables An example minicalc program listing is shown in Figure 2 1 it corresponds to the file testcalc m in the standard rdp dis tribution In later chapters we shall extend mini to include control structures such as loops and if statements minicalc features minicalc programs comprise a sequence of declarations and statements Each statement and declaration must be terminated by a semicolon in much the same way as in an ANSI C program minicalc supports only integer variables Variable declarations look like ANSI C int declarations taking an optional initialisation expression Line 11 in the listing shows an example of two variables being declared both with initialisation expressions 8 THEMINICALC LANGUAGE SIMPLE
163. t loc and emit fill call emitf to print the transfer address and the current value of the location counter and to pad the line with spaces to column 16 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 mvmasm auxiliary functions 63 static int emitf char fmt conditional print to object file int i va_list ap argument list walker va_start ap fmt if emit code no op if not emitting 1 if emitted 16 amp amp text get echo 1 vprintf fmt ap vfprintf objfile fmt ap otherwise pass to fprintf va end ap return i for completeness although not used here void emit eoln void 4 if emit code fprintf objfile void emit transfer void 4 if emit code emitted emitf 74 41X transfer void emit loc void 4 emitted 0 emitf 4 41X location void emit fill void 4 void emit op int op unsigned long operi unsigned long 2 unsigned long oper3 4 if text get 1 while emitted lt 16 printf printf int modei int mode2 int opers emiti unsigned long op output opcode emiti unsigned long mode1 lt lt 4 mode2
164. tart production PRE PARSE init rdp outputfilename POST PARSE quit rdp outputfilename ARG B OLEAN x execute sim execute assembled code using mvmsim simulator ARG BLANK BLANK You can contact the author Adrian Johnstone at ARG BLANK ARG BLANK Computer Science Department Royal Holloway University of London ARG BLANK Egham Surrey TW20 OEX UK Email A Johnstone rhbnc ac uk SYMBOL_TABLE mvmasm 101 31 symbol compare string symbol hash string symbol print string char id integer val unit emit code rdp pass 3 data location code location 0 clear location counters location amp code location make code counter current dummy label symbol new symbol sizeof mvmasm data initialise error symbol code code emit_eoln emit_loc last_label NULL label instr emit fillO EOLN Figure 6 1 An rdp BNF specification for mvmasm part 1 rdp directives and the start production 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 58 label instr diadic copy branch print halt directive MVMASM AN ASSEMBLER FOR M
165. ted unable to rename filename 1 to filename 2 pretty_c was unable to rename the first file to the second file This may be because there is no disk space left or there may already exist a file called filename 2 that is write protected 104 PRETTY PRINTER FOR ANSI C 9 2 Pretty printer features first requirement of a pretty printer is that it should only modify the spac ing of a program and not change its meaning a pretty printer is an interesting example of a translator whose input and output language are the same The particular details of the formatting changes are essentially a matter of taste variety of standards exist for C formatting but there is no universal agree ment on how C program should be indented We choose to follow the format that rdp uses for its machine generated parsers In detail pretty uses the following conventions 1 Each line of a program has an indentation level The indentation level of the first line of a program is 0 All of the original spacing in the file to be pretty printed is discarded except for the contents of comments preprocessor directives and string literals which are preserved Each output line is preceded by a possibly zero length space the length of which is proportional to the indentation level By default each in dentation level is represented by two spaces but the user can specify via the i command line argument the use of a single tab characte
166. ters matched by the scanner and a local attribute kind is as signed one member of the kind enumeration which is defined in the auxiliary file pr_c_aux h shown in Figure 9 3 Rule program also includes calls to the auxiliary functions pretty_open and pretty_close that control the file handling There are two cases where the string returned in lexeme is not necessarily the actual string matched by the scanner In the case of both character and string literals defined in lines 82 and 84 the scanner will process embedded escape sequences to produce a string which may contain binary characters This is the source of the restriction noted in section 9 3 5 in which octal escape sequences will be rewritten as hexadecimal escape sequences on output the scanner does not preserve information on whether a particular escape sequence was octal or hexadecimal so we have arbitrarily decided to output them all as hexadecimal Comments in rdp generated parsers are usually defined using one of the invisible comment scanner primitives and quietly suppressed by the scan ner In this application of course we wish to pass comments from the parser into the pretty printer otherwise the comments would be removed from the formatted output As a result comments are defined in line 56 using the COMMENT_VISIBLE primitive Preprocessor directives in ANSI C present the pretty printer with particular problems Unusually for a high level language spacing is critic
167. the literal decimal con stant 123 As in ANSI C hexadecimal numbers are marked by the prefix Ox Numeric values lacking this prefix are assumed to be decimal Data declaration directives Data may be declared using the WORD directive which specifies that enough space be reserved for a machine word two bytes and in this case also provides an initialisation expression so that temp is initialised to 1 are other data declaration directives which may be used to reserve larger blocks of storage These other directives are described below in section 6 2 5 The END directive Line 11 shows an END directive which both marks the end of the assembler input file and specifies the transfer address that is the address of the first instruction to be executed by the simulator In this case the value of the start label which is 100016 6 1 2 Using the assembler and the simulator together The assembler is usually used to prepare input for the mvmsim simulator and if the assembler is invoked with a x option then the simulator will be automati 50 6 2 MVMASM ASSEMBLER FOR MVM cally run trace mode on the assembler output Hence issuing this command mvmasm 1 x examples rdp case mvmsim mvm produces this output 0000 1 Simulator example file 0000 2 DATA Ox000A start assembling data at address OOOA hex 000 0001 3 temp WORD 1 declare an integer variable called temp 000C 4 000C 5 CODE 0
168. them into machine addresses as would be required for a real machine level program Both kinds of intermediate form allow a variety of optimisations to be ap plied such as the evaluation of constant expressions or the replacement of mul tiplications by powers of two with shift operations In general an optimiser is supposed to take a program in the intermediate form and output another pro gram written in intermediate form that has the same semantics but is faster or more compact or both Sometimes optimisers fail to make improvements and in some cases they may actually make things worse In addition some features of programming languages such as the unrestricted use of pointers can introduce subtle effects that make it hard for the optimiser to guarantee that the semantics are preserved After the optimiser has finished code must be generated for the target pro cessor In general there must be a different code generator for each processor but at least all of the parsing and many of the optimisation components of the compiler can be common between target processors One way of providing the benefits of full portability whilst retaining much of the efficiency of a fully compiled solution is use an intermediate form that can itself be efficiently interpreted In this case no final code generation phase is required Instead a software simulator for the virtual machine which can read and execute the intermediate form is supplied This kind of
169. to look at the type of node in the root of the subtree being processed and act accordingly 100 MINITREE MULTIPLE PASS COMPILER The first case shown here corresponds to begin node or the root node of the reduced derivation tree which has token number 0 begin nodes do not of themselves generate any output code but their children are minitree statements that must be recursively processed hence in line 96 we see tree walk being called on the current node s children Five other statements are handled within the tree walker Assignment lines 102 107 emits a CPY assembler instruction with the first child of the root as the destination operand The source is obtained by calling the expres sion walker on the right child of the root Declarations are denoted by a sub tree with an INT root node which have one or more child nodes containing the names of the variables to be declared while loop at lines 113 123 walks the children outputting a WORD assembler directive for each variable labeled with the name of that variable The optional initialisation expression is represented in the tree as an expression sub tree hanging under the node containing the name of the variable being declared so if this tree is non null then the expression walker is called to generate code to evaluate the initialisation expression print statements are handled in lines 127 143 The while loop in lines 131 141 walks the children of the print node If the
170. to the rule name In line 50 of minicond bnf for instance the original minicalc variable declaration rule is redefined as var dec interpret integer This specifies that rule var dec has a single inherited attribute called interpret of type integer rdp grammar rules can have multiple inherited attributes each of which must be specified with an accompanying type When such a rule is called the parameter values must be filled in using either literal numbers literal strings or the names of other attributes You can see examples in line 21 where the top level rule calls subrules with a literal integer program var_dec 1 statement 1 and in line 64 where the statement rule is called and passed the value of an attribute Using semantic rules In complex translators the semantic actions can become very large and reading a decorated rdp grammar can become difficult as the C language semantic ac tions obscure the underlying form of the grammar One solution to this problem is to parcel all but the most trivial actions into separate C functions that reside in the auxiliary file in which case the semantic actions in the grammar may be reduced to function calls This certainly allows the grammar to show through but then a full understanding of the translator requires two files the rdp IBNF file and the C language auxiliary file to be coordinated Semantic rules are a sort of half way house in which the C languag
171. ts in minicalc may be either assignments or print statements as specified in line 17 The print statement takes parameters by way of a comma delimited list of strings and expressions Comments present particular problems for parser generators because the modern convention is to allow a comment to occur anywhere that whitespace is allowed typically between any two tokens of the language A full specifica tion of this convention using only the grammar rules would require a call to a comment rule after every terminal in the grammar and this would make the grammar very unwieldy The usual solution is to instruct the scanner to detect and filter out comments in exactly the same way as whitespace such as line ends tabs and space characters is discarded rdp offers a range of scanner primitives to support three different com menting conventions You can read more about the use of these primitives and the general problems of comment specification in Chapter 4 of the user manual JS97a In mini_syn bnf comments comprising matching and brackets are specified on line 29 This rule is only used to parameterise the scanner and is never actually called by the main parser so it will be deleted when rdp processes the grammar 2 3 1 Specifying expressions The forms of mini expressions are specified on lines 19 27 Most programming languages provide support for expressions made up of operators for common arithmetic and logical operations Some even
172. tty printer is built during installation of as a side effect of running the command make To check whether all is well type pretty_c and you should receive the following help message 102 PRETTY PRINTER FOR ANSI C Fatal No source file specified C pretty printer V1 50 c Adrian Johnstone 1997 Generated on Dec 20 1997 21 55 41 and compiled on Dec 20 1997 at 21 51 16 Usage 0 lt s gt t lt n gt T lt n gt V lt s gt i lt n gt lt n gt pretty c options source c Filter mode read from stdin and write to stdout Make a listing Write output to filename Echo each scanner symbol as it is read Print summary symbol table statistics Tab expansion width default 8 Text buffer size in bytes for scanner default 20000 Set verbose mode Write derivation tree to filename in VCG format not available in this parser Number of spaces per indent level 0 means use tabs default 2 Preferred start column for trailing comments default 30 These command line options are described below Now type pretty c test c The pretty printer will reformat the file test c which is part of the standard distribution and print out test c 2133 12267 5 75 first field is the name of the file that was formatted the second is the number of lines in the file 2133 and the third is the number of language tokens processed 12267 in this case The final field is the average number of tokens per l
173. urse we only want to execute the semantic actions for one branch or the other so we need to be able to dynamically disable semantic action execution at run time We do this by supplying an inherited attribute to the productions statement and var dec which is a boolean value If the attribute is true then the embedded semantic actions are executed and if not they are skipped over At the top level rule program in line 21 calls the statement and var dec rules with the attribute set to 1 ie TRUE so at the start of a program all semantic actions will be executed When an if statement is encountered the conditional expression is evaluated and the result is logically AND ed with the value of the interpret attribute This new value is then supplied as a parameter to a new nested instance of the statement rule semantic actions in the expression tree not switched on and off in this way because they do not need to be during the evaluation of an expression attributes are calculated and passed back up the tree of expression rules but no changes are made to the variables declared in a minicond program until an assignment is executed The update semantic rule associated with the assignment statement is guarded by a check against the interpret attribute so the result of an expression is simply discarded if interpretation has been switched off Next steps techniques used in this chapter to handle conditional interpretation can n
174. utable in a single linear pass and minicalc is an example of a language which has such grammar interpreter which is very similar to minicalc is described in the tutorial manual JS97c rdp allows us to embed semantic actions into a grammar rdp s semantic actions are written in C and are copied into the generated parser so that as Soon as a fragment of the language to be parsed is recognised the action can be performed For instance on recognition of the minicalc fragment int temp we can make a new symbol table entry for a variable called temp If the fragment is followed by an token then we can go on to parse and evaluate an arithmetic expression placing the result into the symbol table record for temp Figures 3 1 and 3 2 show the specification for a full interpreter for ninicalc that operates in this way This interpreter IBNF specification uses the same grammar as that for the mini syntax checker described in the last chapter The only differences are the addition of semantic actions and synthesised attributes to allow declaration of variables evaluation of arithmetic operations and the printing of results You can read more about semantic actions in Chapter 5 of the user man ual JS97a and Chapter 6 of the tutorial manual JS97c To understand the minicalc interpreter you also need to be familiar with the use of rdp s built in symbol table package which you can read about in Chapter 7 of the support library manual JS97b and
175. uthor Adrian Johnstone at Computer Science Department Royal Holloway University of London Egham Surrey TW20 OEX UK Email JohnstoneOrhbnc ac uk This is the standard mvmsim help message in this case it has been triggered because no source file was specified It tells you about the three optional flags that can be supplied to mvmsim o 1 tells mvmsim to echo the data it is writing into the simulator s memory during the load phase o t switches on trace mode in which instructions are echoed as they are executed o v sets verbose mode which causes mvmsim to print out a title line and then at the end of run the total CPU time along with the number of MVM instructions executed mvmsim a simulator for MVM byte codes 41 5 7 2 mvmsin input file format The MVM program to be simulated is read by mvmsim from an input file which contains binary information rendered as a hexadecimal dump of the required memory contents Each line of the dump file starts with a 16 bit hexadecimal number which specifies the base address to which the rest of the data on that line will be loaded The data is specified as zero or more pairs of hexadecimal digits Each pair specifies the contents of one eight bit memory cell and the corresponding memory locations are loaded in ascending order starting with the base address Spaces are allowed but not required between each pair of digits Blank lines are also allowed This kind of load form
176. ve mode compilers Recently virtual machine based approaches have become popular again be cause of the need to distribute executable programs around the Internet Porta bility between different computer architectures must be absolutely guaranteed even though there are very wide variety of systems connected to the net and the programs must run in an identical fashion wherever they are executed In addition the programs must be run in such way that any suspicious be haviour that might undermine the host system s security can be caught In practice actually allowing arbitrary machine language programs to execute is too dangerous Instead languages like Java compile to an intermediate virtual machine and Web browsers provide an interpreter for that virtual machine that can in prinipal catch illegal memory accesses and attempts to access operating system services that could threaten system integrity The Java virtual machine simultaneously acts as a reasonably efficient portable platform for executing programs and as a filter on the actions of those programs that protects the underlying operating system The nini family In the following chapters we will illustrate the interpreted virtual machine ap proach to compilation by describing the development of single and multiple pass compilers for mini which work in this fashion outputting instructions for a per processor called the Mini Virtual Machine MVM Along the way we will l
177. with the address specified in the destination operand the test fails the program counter is simply incremented in the normal way thus passing control to the next instruction 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 mvmsim a simulator for byte codes switch op 1 case ADD put memory word dst display ADD dst 8 break case SUB put memory word dst display SUB dst 8 break case MUL put memory word dst display MUL dst 8 break case DIV put memory word dst display DIV dst 8 break case EXP put memory word dst display EXP dst 8 break case EQ put memory word dst display EQ dst 8 break case NE put memory word dst display NE dst 8 break case GT put memory word dst display GT dst 8 break src2 src2 src2 src2 src2 src2 src2 src2 int pow double srci double src2 src2 src src2 src2 src src2 src2 gt src2 src2 45
178. worthy of the name a computer must be capable of making decisions A decision in this context usually means conditionally executing some parts of a program on the basis of calculations performed whilst the program is running that is at run time By this definition minicalc has the abilities of calculator not a computer We shall progressively add capabilities to our mini language We start in this chapter by making these additions 1 relational operators gt gt lt and with lower priority than any of the arithmetic operators and 2 an if then else statement which allows conditional interpretation of programs result is a language minicond whose programs look like minicalc programs with some additional features minicalc is a strict subset of the minicond language so any minicalc program will be correctly evaluated by minicond interpreter Figure 4 1 shows an example minicond program The output pro duced when this is run through the minicond interpreter is shown in Figure 4 2 In a later chapter we shall see how to add looping constructs and a facility for grouping statements together in blocks A grammar for The grammar for minicond shown in Figures 4 3 and 4 4 follows the general form of the interpreter presented in the previous chapter except that large se mantic actions have been placed in their own semantic rules and the necessary syntax and semantic actions have been
Download Pdf Manuals
Related Search
Related Contents
bed_anl_DigiBox Beta 2_neu Vorlage.qxd 4668SG - Lega srl costruzioni apistiche 920MHz帯 無線ユニット [Modbus タイプ] ユーザーズマニュアル User Manual CCN–ECM - Sabiana User manual - Hjelpemiddeldatabasen SERVICE MANUAL - Aire Acondicionado Mitsubishi Electric VIRB™ Series Guide mon électricité solaire mode d`emploi, pdf, 5.8 Mo Copyright © All rights reserved.
Failed to retrieve file