Home

The Compiler Generator Coco/R

image

Contents

1. name of the entry public int type type of the entry NONE for procs public Entry next to next entry in same scope public int kind vars procs scopes public int adr address in memory or start of proc public int level nesting level of declaration public Entry locals scopes to locally declared entries public int nextAdr scopes next free address in this scope class TL symbol table handler types public const int NONE 0 public const int INT 1 public const int BOOL 2 entry kinds public const int vars 0 public const int procs 1 public const int scopes 2 public static int curLevel nesting level of current scope static Entry undefEntry entry node for erroneous symbols static Entry topScope topmost procedure scope public static void EnterScope Entry scope new Entry scope name scope type NONE scope locals null scope nextAdr scope next topScope curLevel n Y scope kind scopes 3 topScope scope blic sta topScope tic void LeaveScope topScope next O L curLevel pu public static int DataSpace return topScope nextAdr 3 public static Entry NewEntry string name int kind Entry p entry new Entry entry name name entry type NONE entry kind kind entry level curLevel p topScope locals while p null if p name Equals name Parser SemErr name declared twice p p
2. 19 Code generator public static int progStart address of first instruction of main program public static int pc program counter static bool generatingCode true const int memSize 15000 static byte code new byte memSize public static void Emit int op if generatingCode if pc gt memSize 4 Parser SemErr program too long generatingCode false else code pct byte op public static void Emit2 int op int val Emit op Emit val gt gt 8 Emit val T public static void Emit3 int op int level int val Emit op Emit level Emit val gt gt 8 Emit val public static void Fixup int adr if generatingCode code adr byte pc gt gt 8 code adr 1 byte pc Interpreter const int stackSize 1000 static int stack new int stackSize static int top top of stack static int bp base pointer static int Int if b bool b return 1 else return 0 static int Next return code pc static int Next2 int x y x sbyte code pc y code pc return x lt lt 8 y static void Push int val stack top val static int Pop return stack topl static int Up int level int b bp while level gt 0 b return b stack b level public static void Interpret int val a lev Console Write nData file STDIN InFile data n
3. type TL BOOL int typel op Term lt out typel gt if type TL INT typel TL INT SemErr integer type expected TC Emit op Fy jf A A A A ca Rc et a a E A Term lt out int type gt Factor lt out type gt MulOp lt out op gt int typel op Factor lt out typel gt if type TL INT typel TL INT SemErr integer type expected TC Emit op fs ir aca CR a ag SG a peg N E lE acy E a i EG es Factor lt out int type gt Ident lt out name gt TRUE FALSE number Factor lt out type gt MulOp lt out int op gt fet tous RelOp lt out int op gt t pote po 1 VA END Taste int n Entry entry string name type TL NONE entry TL Find name type entry type if entry kind TL vars TC Emit3 TC LOAD TL curLevel entry level entry adr else SemErr variable name expected TC Emit2 TC LIT 1 type TL BOOL TC Emit2 TC LIT 0 type TL BOOL Convert Tolnt32 token val TC Emit2 TC LIT n type TL INT if type TL INT SemErr integer type expected type TL INT TC Emit TC NEG op TC BAD op TC MUL op TC DIV op TC BAD op TC ADD op TC SUB op TC BAD op TC EQU op TC LSS op TC GTR 37 B 2 TL cs symbol table namespace Taste class Entry symbol table entries public string name
4. Factor ident Factor type cast r t Expr nested expression ident number Since Expr can also start with an ident the conflict can be resolved by checking whether this ident denotes a type or some other object Factor IF IsCast ident Factor type cast Expr nested expression ident number IsCast looks up ident in the symbol table and returns true if it is a type name static bool IsCast Token x Scanner Peek if x kind _ident object obj SymTab Find x val return obj null amp amp obj kind Type else return false Placing resolvers correctly Coco R checks that resolvers are placed correctly The following rules must be obeyed 1 If two alternatives start with the same token the resolver must be placed in front of the first one Otherwise it would never be executed because the parser would always choose the first matching alternative More precisely a resolver must be placed at the earliest possible point where an LL 1 conflict arises 2 A resolver may only be placed in front of an alternative that is in conflict with some other alternative Otherwise it would be illegal Here is an example where resolvers are incorrectly placed 20 A a IF b c misplaced resolver No LL 1 conflict TE sse ab resolver not evaluated Place it at first alt TE lees Bb misplaced resolver No LL 1 c
5. Linz versions of Coco R can be downloaded from http www ssw uni linz ac at Research Projects Coco Both the Java version and the C version of the extended versions of Coco R can be downloaded from http www scifac ru ac za resourcekit C version Copy the following files to appropriate directories Coco exe the executable Scanner frame the frame file from which the scanner is generated Parser frame the frame file from which the parser is generated Driver frame the frame file from which the driver is generated if required Java version Copy the following files to appropriate directories Coco jar an archive containing all classes of Coco R Scanner frame the frame file from which the scanner is generated Parser frame the frame file from which the parser is generated Driver frame the frame file from which the driver is generated if required 3 2 Invocation Coco R can be invoked from the command line as follows Under C Coco fileName Options Under Java java jar Coco jar fileName Options fileName is the name of the file containing the Cocol R compiler description As a convention compiler descriptions have the extension atc for attributed grammar 24 Options The following options may be specified Options namespace namespaceName in Java package packageName frames framesDirectory erace options ptionString The user can specify a namespace or package to
6. and optional parts Coco R translates the productions into a recursive descent parser which is small and efficient Nonterminal symbols can have a number of input and output attributes the Java version allows just one output attribute which may however be an object of a suitable composite class Terminal symbols do not have explicit attributes but the tokens returned by the scanner contain information that can be viewed as attributes All attributes are evaluated during parsing i e the grammar is processed as an L attributed grammar Semantic actions can be placed anywhere in the grammar not just at the end of productions They may contain arbitrary statements or declarations written in the language of the generated parser e g C or Java The special symbol ANY can be used to denote a set of complementary tokens In principle the grammar must be LL 1 However Coco R can also handle non LL 1 grammars by using so called resolvers that make a parsing decision based on a multi symbol lookahead or on semantic information Every production can have its own local variables In addition to these one can declare global variables or methods which are incorporated as static fields and methods of the parser Semantic actions can also access other objects or methods from user written classes or from library classes Coco R checks the grammar for completeness consistency and non redundancy It also reports LL 1 conflicts
7. conflict is when an expression in curly or square brackets is deletable e g A B B a b If the parser tries to recognize a and sees an a it cannot decide whether to enter the deletable symbol s or to skip 3 Therefore Coco R prints the warning LL 1 warning in A contents of or must not be deletable Note that Coco R reports LL 1 conflicts as warnings not as errors Whenever a parser generated by Coco R sees two or more alternatives that can start with the same token it always chooses the first one If this is what the user intends then everything is 16 fine as in the well known example of the dangling else that occurs in many programming languages Statement if Expression Statement else Statement Input for this grammar like if a gt b if a gt c max a else max b is ambiguous does the else belongs to the inner or to the outer if statement The LL 1 conflict arises because first else Statement N follow statement else However this is not a big problem because the parser chooses the first matching alternative which is the else of the inner if statement This is exactly what we want Resolving LL 1 conflicts by grammar transformations If Coco R reports an LL 1 conflict the user should try to eliminate it by transforming the grammar as is shown in the following examples Factorization Most LL 1 conflicts can be resolved by factorization i e by extr
8. e Char ANY Char CHR number TILT char SetDec1 associates a name with a character set Basic character sets are denoted as string a set consisting of all the characters in the string ident a previously declared character set with this name char a set containing the character char charl char2 the set of all characters from char1 tO char2 inclusive ANY the set of all characters in the range 0 255 Character sets may be formed from basic sets using the operators set union set difference Examples CHARACTERS digit 0123456789 the set of all digits hexDigit digit ABCDEF the set of all hexadecimal digits letter ARP cee TOs the set of all upper case letters eol Art the end of line character noDigit ANY digit any character that is not a digit control CHR 0 CHR S1L ASCII control characters 2 3 2 Tokens This is the main section of the scanner specification in which the tokens or terminal symbols of the language are declared Tokens may be divided into literals and token classes Literals such as while or gt have a fixed representation in the source language In the grammar they are written as strings e g while or gt and denote themselves They do not have to be declared in the tokens section but can be implicitly declared at their first use in the productions of the grammar Token classes such as identifiers or numbers
9. eos conato ooo an noe 3 User Guide 3 1 Installation 3 2 Invocation 3 3 Interfaces of the Generated Classes sse 25 3 3 1 Scanner 33 2 Kel Mii ri EE ad ae sic IGI T aE E E A T ERTE 3 3 4 Parser 3 3 5 Errors 3 4 Main Class of the TTT 27 3D Grammar L S rr Eaa R aE E A EEEE TEN ISEE E E VAARASTA ESTERE 28 4 A Sample Compilen serike eiei Ane ETE REN Annin Er E IAEE RA AEEA AEEA 29 3 Applications OF Coco Riiie aiii 31 Y O 32 As Syntax of CocollRerinsiiacia ii ica B Sources of the Sample Compiler B 1 Taste ATG nennti B 2 N symboltable siera neee rao Er aeaa E Ea S EN E E a E EER B 3 VE cs code generator T assets inari einar ETRE di B4 Taste cS mam program sennen ia 1 Overview Coco R is a compiler generator which takes an attributed grammar for a source language and generates a scanner and a recursive descent parser for this language The user has to supply a main class that calls the parser as well as semantic classes e g a symbol table handler and a code generator that are used by semantic actions in the parser This is shown in Figure 1 Main Coco R Parser compiler Scanner description ud en semantic classes Figure 1 Input and output of Coco R 1 1 Sample Production In order to give you an idea of how attributed grammars look like in Coco R let us look at a sample production for variable declarations in a Pascal like language VarDeclar
10. fileName string dir bool merge public static void Summarize public static void StoreError int line int col string msg public static void Warn int line int col string msg The variable count holds the number of errors reported by SynErr SemErr SemError and Error Errors can either be reported with an error number or with an error message Syntax errors are automatically reported by the generated parser which calls the method synErr Semantic errors should be reported by the user by calling either semzrr SemError OF Error from the semantic actions of the attributed grammar Unless a source listing is required error messages are simply reflected to the console using the format errMsgrormat Which can be changed by the user to obtain some custom format of error messages The placeholder 0 is filled with the source file name 1 is filled with the line number 2 is filled with the column number and 3 is filled with the error message If a source listing is required with merged error messages methods synErr SemErr SemError Warn and Error store error messages in an internal data structure The production of the listing is initiated after parsing is completed by calling the method Summarize The method Exception can be called for errors from which the compiler cannot recover In Coco R it is called for example if the frame files cannot be found or are corrupt It prints an error message to the console and term
11. for example using System using System Collections GlobalFieldsAndMethods After the grammar name one may declare additional fields and methods to be incorporated into the generated parser for example static int sum static void Add int x sum sum x These declarations are written in the language of the generated parser i e in CF or in Java and are not checked by Coco R Since all methods of the parser are static the declared fields and methods should also be static They can be used in the semantic actions of the parser specification The remaining parts of the compiler description specify the scanner and the parser that are to be generated They are now described in more detail 2 3 Scanner Specification A scanner has to read source text skip meaningless characters recognize tokens and pass them to the parser The scanner specification consists of six optional parts ScannerSpecification IGNORECASE CHARACTERS SetDecl TOKENS TokenDecl NAMES NameDecl PRAGMAS PragmaDecl CommentDecl WhiteSpaceDecl 2 3 1 Character sets This section allows the user to declare character sets such as letters or digits Their names can then be used in the other sections of the scanner specification Coco R grammars are expressed in an extended ASCII character set 256 characters SetDecl ident Set Set BasicSet BasicSet BasicSet string ident Char
12. for example by counting them or by processing compiler options within them 2 3 5 White space Some characters that appear between tokens such as blanks tabulators or end of line separators are usually considered as white space that should simply be ignored by the scanner Blanks are ignored by default If other inter token characters are to be ignored the user has to specify them in the following way WhiteSpaceDecl IGNORE Set Example IGNORE t r n 2 3 6 Case sensitivity Some languages such as Pascal or XML are case insensitive In Pascal for example one can also write the keyword while as While or WHILE By default Coco R generates scanners that are case sensitive If this is not desired one has to include the directive IGNORECASE at the beginning of the scanner specification The effect of tcnorecass is that all input to the scanner is treated in a case insensitive way The production 11 WhileStatement while Expr Statement would then also recognize while statements that start with while or waILE Similarly the declaration TOKENS float digit digit E digit digit would cause the scanner to recognize not only 1 222 but also 1 2e2 as a float token However the original casing of tokens is preserved in the vai field of every token see Section 3 3 2 so that the lexical value of tokens such as identifiers and strings is delivered exactly as it was written
13. next entry next topScope locals if kind vars entry adr topScope nextAdr topScope nextAdr return entry topScope locals entry public static Entry Find string name Entry entry scope scope topScope while scope null entry scope locals while entry null if entry name Equals name return entry entry entry next scope scope next T Parser SemErr undeclared name return undefEntry T public static void Init topScope null curLevel 0 undefEntry new Entry undefEntry name undefEntry type NONE undefEntry kind vars undefEntry adr 0 undefEntry level 0 undefEntry next null end TL end namespace 38 B 3 TC cs code generator using Library using System using System I0 namespace Taste class TC code generator and interpreter Opcodes public const int ADD 0 public const int SUB 1 public const int MUL 2 public const int DIV 3 public const int EQU 4 public const int LSS 5G public const int GTR 6 public const int LOAD 7 public const int LIT 8 public const int STO 9 public const int CALL 10 public const int RET 11 public const int RES 12 public const int JMP 13 public const int FJMP 14 public const int HALT 15 public const int NEG 16 public const int READ 17 public const int WRITE 18 public const int BAD
14. occur in any other productions derived from the start symbol Coco R prints the message X cannot be reached Derivability If the grammar contains nonterminals that cannot be derived into a sequence of terminals such as in X Y E noe TEED Sra Coco R prints the messages X cannot be derived to terminals Y cannot be derived to terminals Lack of circularity If the grammar contains circular productions i e if nonterminals can be derived into themselves directly or indirectly such as in A a B B G b J C A cc Coco R prints the messages A gt B B gt G C gt A Lack of ambiguity If two or more tokens are declared so that they can have the same structure and thus cannot be distinguished by the scanner as in the following example where the input 123 could either be recognized as an integer OF aS a float TOKENS integer digit digit float digit digit digit Coco R prints the message Tokens integer and float cannot be distinguished In all these cases the compiler specification is erroneous and no scanner and parser is generated 29 Warnings There are also situations in grammars that although legal might lead to problems In such cases Coco R prints a warning but still generates a scanner and a parser The user should carefully check if these situations are acceptable and if not repair the grammar Deletable symbols Sometimes nonterminals can be derived in
15. semicolon Coco R also generates a scanner that reads the input stream and returns a stream of tokens to the parser 1 3 Summary of Features Scanner The scanner is specified by a list of token declarations Literals e g if or while do not have to be explicitly declared as tokens but can be used directly in the productions of the grammar The scanner is implemented as a deterministic finite automaton DFA Therefore the terminal symbols or tokens have to be described by a regular EBNF grammar One can specify one or more kinds of comments that a scanner will ignore Provision is made for allowing comments to be nested Tokens must be made up of characters from the extended ASCII set i e 256 values The scanner can be made case sensitive or case insensitive The scanner can recognize tokens that are distinguishable only on their context in the input stream The scanner can read from any input stream not just from a file However all input must come from a single stream no includes The scanner can handle so called pragmas which are tokens that are not part of the syntax but can occur anywhere in the input stream e g compiler directives or end of line characters The user can suppress the generation of a scanner and can provide a hand written scanner instead Parser The parser is specified by a set of EBNF productions with attributes and semantic actions The productions allow for alternatives repetition
16. the user has to give certain hints in the grammar in order to enable the parser to perform appropriate synchronization 21 Invalid terminal symbols If a certain terminal symbol was expected but not found in the input the parser just reports that this symbol was expected For example if we had a production A a Do Gs but the parser was presented with input like axc the parser would report something like file Grammar atg 11 9 b expected Invalid alternative lists If the lookahead symbol does not match any alternative from a list of expected alternatives in a nonterminal a the parser just reports that a was invalid For example if we had a production A a b c d e but the parser was presented with input like axe the parser would report something like file Grammar atg 11 9 invalid A Obviously this error message can be improved if we turn the alternative list into a separate nonterminal symbol i e A B aBe blilcld In this case the error message would be file Grammar atg 11 9 invalid B which is more precise Synchronization After an error is reported the parser continues until it gets to a so called synchronization point where it tries to synchronize the input with the grammar again Synchronization points have to be specified by the keyword sync They are points in the grammar where particularly safe tokens are expected i e tokens that hardly occur anywhere else and are unlikely to
17. Gough Markus L berbauer Peter Rechenberg Josef Templ Pat Terry and Albrecht Wop References M ss90 Mossenb ck H A Generator for Production Quality Compilers 3rd Intl Workshop on Compiler Compilers CC 90 Schwerin LNCS 477 Springer Verlag 1990 Terry97 Terry P D Compilers and Compiler Generators An Introduction Using C International Thomson Computer Press 1997 Terry04 Terry P D Compiling with C and Java Pearson 2004 Wirth77 Wirth N What Can We Do about the Unnecessary Diversity of Notation for Syntactic Definitions Communications of the ACM November 1977 WLMO03 WoB A L berbauer M M ssenb ck H LL 1 Conflict Resolution in a Recursive Descent Compiler Generator Joint Modular Languages Conference JMLC 03 Klagenfurt 2003 33 A Syntax of Cocol R Cocol ANY using clauses in C and import clauses in Java COMPILER ident ANY global fields and methods ScannerSpecification ParserSpecification END ident ScannerSpecification IGNORECASE CHARACTERS SetDecl TOKENS TokenDecl1 NAMES NameDec1 PRAGMAS PragmaDecl1 CommentDec1 WhiteSpaceDecl SetDecl ident Set Set BasicSet BasicSet BasicSet string ident Char Char ANY Char char CHR number TokenDecl Symbol TokenExpr TokenExpr TokenTerm TokenTerm TokenTerm TokenFactor TokenF
18. MENTS FROM NAMES PRODUCTIONS WEAK Comments are enclosed in and and may be nested EBNF All syntax descriptions in Cocol R are written in Extended Backus Naur Form EBNF Wirth77 By convention identifiers starting with a lower case letter denote terminal symbols identifiers starting with an upper case letter denote nonterminal symbols Strings denote themselves The following meta characters are used symbol meaning example separates the sides of a production A abc y terminates a production A abc separates alternatives ablclde means ab or c or de groups alternatives alb c meansac or bc option a b means ab or b iteration 0 or more times a b means b or ab or aab or Attributes are written between lt and gt or between lt and gt Semantic actions are enclosed in and The operators and are used to form character sets 2 2 Overall Structure A Cocol R compiler description has the following structure Cocol Imports COMPILER ident GlobalFieldsAndMethods ScannerSpecification ParserSpecification END ident The name after the keyword compILER is the grammar name and must match the name after the keyword env The grammar name also denotes the topmost nonterminal symbol the start or goal symbol The parser specification must contain a production for this symbol Imports In front of the keyword comPILER one can import namespaces in C or packages in Java
19. RE cr 1f tab PRODUCTIONS in ee ene a enn ease es a ere O een A ee IS ee ae eet Taste string name progName PROGRAM La TC Init 3 PATRE O E Ident lt out progName gt TC progStart TC pc Body Ident lt out name gt if name Equals progName SemErr name mismatch TC Emit TC HALT mor Ge e me a ey ce a pi ad a we P E Si AS me a ee et en Body int fix string name namel Entry entry A TL EnterScope fix TC pc 1 TC Emit2 TC JMP 0 VAR Ident lt out name gt ne entry TL NewEntry name TL vars Typeld lt out entry type gt tet 7 PROCEDURE Ident lt out name gt Peace entry TL NewEntry name TL procs entry adr TC pc Body Ident lt out namel gt TC Emit TC RET if name Equals namel SemErr mismatched procedure name BEGIN TC Fixup fix TC Emit2 TC RES TL DataSpace StatSeq END TL LeaveScope Type Id lt out int type gt 35 type TL NONE INTEGER type TL INT BOOLEAN type TL BOOL hs A SS i A A a a a a aR Pale A A L E AR A a Na Ident lt out string name gt ident name token val ad A A RN a nd Dard ly gay et Bai nd gl A A gg ed ASS StatSeq Stat Stat A A A A A a a a a a a A A A A E E Na Stat int type L Ident lt out name gt ss Expression lt out type gt ip Expression lt out t
20. The Compiler Generator Coco R Extended User Manual Pat Terry Rhodes University July 2004 This document has been modified with permission from the original by Hanspeter M ssenb ck so as to describe the extensions to Coco R implemented to support the text Compiling with C and Java The important modifications have been clearly marked in red User Manual Hanspeter M ssenbock Johannes Kepler University Linz Institute of System Software June 2004 Coco R is a compiler generator which takes an attributed grammar of a source language and generates a scanner and a parser for this language The scanner works as a deterministic finite automaton The parser uses recursive descent LL 1 conflicts can be resolved by a multi symbol lookahead or by semantic checks Thus the class of accepted grammars is LL k for an arbitrary k There are versions of Coco R for Java C C Delphi Modula 2 Oberon and other languages This manual describes the versions for C and Java implemented at the University of Linz Compiler Generator Coco R Copyright 1990 2004 Hanspeter M ssenb ck University of Linz This program is free software you can redistribute it and or modify it under the terms of the GNU General Public License as published by the Free Software Foundation either version 2 or at your option any later version This program is distributed in the hope that it will be useful but WITHOUT ANY WARRANTY without even the imp
21. The error messages printed by the generated parser can be configured to conform to a user specific format The generated parser and scanner can be specified to belong to a certain namespace or package 2 Input Language This section specifies the compiler description language Cocol R that is used as the input language for Coco R A compiler description consists of a set of grammar rules that describe the lexical and syntactical structure of a language as well as its translation to a target language 2 1 Vocabulary The basic elements of Cocol R are identifiers numbers characters and strings which are defined as follows ident letter letter digit number digit digit string anyButQuotelescape anyButApostrophelescape char anyButQuotelescape re anyButApostrophe escape Tr Upper case letters are distinct from lower case letters Strings may not extend across multiple lines Strings and characters may contain the following escape sequences backslash Vr carriage return NE form feed XT apostrophe n new line a bell y quote NE horizontal tab b backspace XU null character v vertical tab uxxxx hex char value The following identifiers are reserved keywords in the C version of Cocol R the identifier using is also a keyword in the Java version the identifier import ANY COMPILER TE NESTED SYNC CHARACTERS CONTEXT IGNORE out TO CHR END IGNORECASE PRAGMAS TOKENS COM
22. a synchronization point including the end of file symbol The effect is that the parsing of the parameter list would not be terminated prematurely but would get a chance to synchronize with the start of the next parameter after a possibly mistyped separator symbol In order to get good error recovery the user of Coco R should perform some experiments with erroneous inputs and place sync and weax keywords appropriately to recover from the most likely errors 23 2 4 8 Frame Files The scanner and the parser are generated from template files with the names Scanner frame and Parser frame Those files contain fixed code parts as well as textual markers that denote positions at which grammar specific parts are inserted by Coco R In rare situations advanced users may want to modify the fixed parts of the frame files by which they can influence the behavior of the scanner and the parser to a certain degree The extended version of Coco R described here can also generate a main driver class from a template file with the name Grammar frame if this exists where Grammar stands for the name of the goal symbol of the grammar If not the system can generate such a driver from a file Driver frame Such a generic driver frame file is provided with the distribution the idea being that it will act as an example from which a more appropriate Grammar frame can be derived 3 User Guide 3 1 Installation Both the Java version and the C version of the
23. act ing the common parts of conflicting alternatives and moving them to the front For example the production A abclabd can be transformed to A ab ce a Left recursion Left recursion always represents an LL 1 conflict In the production A Ab c both alternatives start with c because first a c However left recursion can always be transformed into an iteration e g the previous production becomes A c b Hard conflicts Some LL 1 conflicts cannot be resolved by grammar transfor mations Consider the following simplified productions from the C grammar Expr Factor Factor Factor YUK ident Factor type cast Expr j nested expression ident number The conflict arises because two alternatives Of Factor start with Even worse Expr can also be derived to an ident There is no way to get rid of this conflict by transforming the grammar The only way to resolve it is to look at the ident following the if it denotes a type the parser has to select the first alternative otherwise the second one We shall deal with this kind of conflict resolution in Section 2 4 6 Readability issues Some grammar transformations can degrade the readability of the grammar Consider the following example again taken from a simplified form of the C grammar UsingClause Qualident using ident Qualident ident ident 17 The conflict is in UsingClause where both iden
24. actor CONTEXT TokenExpr TokenFactor Symbol TokenExpr 1i TokenExpr 4 TokenExpr Symbol ident string char NameDecl ident ident string PragmaDecl TokenDecl SemAction CommentDecl COMMENTS FROM TokenExpr TO TokenExpr NESTED WhiteSpaceDecl IGNORE Set ParserSpecification PRODUCTIONS Production Production ident Attributes SemAction Expression Expression Term Term Term Resolver Factor Factor Factor WEAK Symbol Attributes Expression Expression Expression ANY SYNC SemAction Attributes lt ANY gt lt ANY gt SemAction ANY Resolver TR Cl ANY 3 34 B Sources of the Sample Compiler These have been modified from those in the Linz distribution so as to correspond more closely with other case studies in Compiling with CA and Java Terry04 B 1 Taste ATG COMPILER Taste C Taste Compiler based on the original by Hanspeter Moessenboeck This version for Coco R for C Terry version Modifications by Pat Terry Rhodes University AA A i a ag ea a E E Sag as pe Sn em re eS CHARACTERS letter ABCDEFGHIJKLMNOPORSTUVWXYZabcdefghijklmnopgrstuvwxyz digit 0123456789 er CHR 13 lf CHR 10 tab CHR 9 TOKENS ident letter letter digit number digit digit COMMENTS FROM TO NESTED IGNO
25. ation lt ref int adr gt string name TypeDesc 4 Ident lt out name gt Obj x SymTab Enter name int n 1 Ident lt out name gt Obj y SymTab Enter name x next y X y n t Ys Type lt out typ gt adr n typ size for int a adr x null x x next a typ size x adr a a The core of this specification is the EBNF production VarDeclaration Ident Ident Type It is augmented with attributes and semantic actions The attributes e g lt out name gt specify the parameters of the symbols There are input attributes e g lt x y gt and output attributes e g lt out z gt Or lt ref z gt A semantic action is a piece of code that is written in the target language of Coco R e g in C or Java and is executed by the generated parser at its position in the production 1 2 Sample Parsing Method Every production is translated by Coco R into a parsing method The method for VarDeclaration for example looks like this in C code parts originating from attributes or semantic actions are shown in grey static void VarDeclaration ref int adr string name TypeDesc typ Ident out name Obj x SymTab Enter name int n 1 while la kind comma Get Ident out name Obj y SymTab Enter name X next y X y n Expect colon Type out typ adr n typ size for int a adr x null x x next Expect
26. be mistyped When the parser reaches a synchronization point it skips all input until a token occurs that is expected at this point In many languages good candidates for synchronization points are the beginning of a statement where keywords like if while or for are expected or the beginning of a declaration sequence where keywords like public private or void are expected A semicolon is also a good synchronization point in a statement sequence The following production for example specifies the beginning of a statement as well as the semicolon after an assignment as synchronization points Statement SYNC Designator Expression SYNC if Expression Statement else Statement while Expression Statement Statement s 22 In the generated parser these synchronization points lead to code as follows written in pseudo code here static void Statement while la kind _EOF _ident _if _while _lbrace Report an error Get next token if la kind _ident Designator Expect _eql Expression while la kind _EOF _semicolon Report an error Get next token else if la kind _if Note that the end of file symbol is always included in the set of synchronization symbols This guarantees that the synchronization loop terminates at least at the end of the input In order to avoid a proliferation of error messages during synchro
27. contain the declarations of local variables Every production has its own set of local variables which are retained in recursive productions The optional semantic action on the left hand side of a production Loca1Dec1 is intended for such declarations but variables can also be declared in any other semantic action Here is an example that counts the number of identifiers in an identifier list IdentList ident int n 1 ident ntt Console WriteLine n n As a matter of style it is good practice to write all syntax parts of each production on the left side of a Cocol file with all semantic actions on the right as illustrated above This makes a production better readable because the syntax is separated from its processing Semantic actions can access not only local variables but also the static fields and methods declared at the beginning of the attributed grammar see Section 2 2 as well as fields and methods of other classes 2 4 3 Attributes Productions are considered as and are actually translated to parsing methods The occurrence of a nonterminal on the right hand side of a production can be viewed as a call of that nonterminal s parsing method Nonterminals may have attributes which correspond to parameters of the nontermi nal s parsing method There are input attributes which are used to pass values to the production of a nonterminal and output attributes which are used to return val
28. e actual parser It has to be called by the main class of the compiler see Section 3 4 after initializing the scanner The method semErr msg can be used to report semantic errors during parsing It calls Errors Error see Section 3 3 5 and suppresses error messages that are too close to the position of the previous error thus avoiding spurious error messages see Section 2 4 7 The remaining methods found only in the extended version of Coco R provide compatibility with earlier versions of Coco R which were not object oriented 3 3 5 Errors This class is used to print error messages Coco R distinguishes four kinds of errors syntax errors semantic errors warnings and runtime exceptions The interface of Errors given below is of little direct interest as calls to the methods involved with syntactic errors are incorporated automatically by Coco R while semantic errors are best dealt with by using by using the method Parser SemErr or Parser SemError the 27 methods are equivalent and both are provided simply for compatibility with other versions of Coco R public class Errors public static int count 0 public static string errMsgFormat file 0 1 2 3 public static void SynErr int line int col int n public static void SemErr int line int col int n public static void Error int line int col string msg public static void Exception string msg public static void Init string
29. eter is either a stream or the name of a file from where the tokens should be read It has to be called from the main class of the compiler see Section 3 4 before scanning and parsing starts The method scan is the actual scanner The parser calls it whenever it needs the next token Once the input is exhausted scan returns the end of file token which has the token number o For invalid tokens caused by illegal token syntax or by invalid characters scan returns a special token kind which normally causes the parser to report an error Peek can be used to read one or several tokens ahead without removing them from the input stream With every call of scan i e every time a token has been recognized the peek position is set to the scan position so that the first peek after a Scan Will return the first yet unscanned token The method resetPeex can be used to reset the peek position to the scan position after several calls of Peek 3 3 2 Token Every token returned by the scanner is an object of the following class public class Token public int kind token code EOF has the code 0 public string val token value public int pos position in the source stream starting at 0 public int line line number starting at 1 public int col column number starting at 0 26 3 3 3 Buffer This is an auxiliary class that is used by the scanner and possibly by other classes to read the source stream
30. ew InFile Console ReadLine Console Write nResults file STDOUT OutFile results new OutFile Console ReadLine O pc progStart bp 0 top 3 val 0 for switch Next case LOAD lev Next a Next2 Push stack Up lev a break case LIT Push Next2 break case STO lev Next a Next2 stack Up lev a Pop break Case ADD val Pop Push Pop val break Case SUB val Pop Push Pop val break case DIV val Pop if val 0 Push Pop val break else Console WriteLine Division by zero results Close System Environment Exit 1 break case MUL val Pop Push Pop val break case EQU val Pop Push Int Pop val break case LSS val Pop Push Int Pop lt val break case GTR val Pop Push Int Pop gt val break case CALL Push Up Next Push bp Push pc 2 pc Next2 bp top 3 break case RET top bp bp stack top 1 pc stack top 2 break case RES top top Next2 break case JMP pe Next2 break case FJMP a Next2 if Pop 0 pc a break case HALT results Close return case NEG Push Pop break case READ lev Next a Next2 stack Up lev a data ReadInt break case WRITE results WriteLine Pop break default Console WriteLine Illegal Instruction results Close System Environmen
31. generate source code that uses names for the tokens and terminals test the grammar but do not generate any code AS Ss E With the exception of m these various features may also be selected by pragmas placed within the Grammar ATG file conventionally at the start Pragmas take the form SoptionString This may be exemplified by COMPILER Grammar SCNF and this may be the preferred route to follow for many applications 25 For example the option sasx will cause the states of the automaton the symbol table and a cross reference list to be printed to the file trace txt Output files Coco R translates an attributed grammar into the following files R Scanner cs Or Scanner java containing the classes scanner Token and Buffer R Parser cs OF Parser java containing the classes parser and Errors R Grammar cs Or Grammar java containing the driver class if requested trace txt containing trace output if any listing txt containing a source listing with merged error message if requested All files are generated in the directory that contains the attributed grammar 3 3 Interfaces of the Generated Classes 3 3 1 Scanner The generated scanner has the following interface public class Scanner public static void Init public static void Init public static Token Scan public static Token Peek public static void ResetPeek string sourceFile Stream s Init initializes the scanner Its param
32. have a certain structure that must be explicitly declared by a regular expression in EBNF There are usually many instances of a token class e g many different identifiers which have the same token code but different lexeme values The syntax of token declarations is as follows TokenDecl Symbol TokenExpr TokenExpr TokenTerm TokenTerm TokenTerm TokenFactor TokenFactor CONTEXT TokenExpr TokenFactor Symbol LT TokenExpr ELT TokenExpr J 4 Tokentxpe TT Symbol ident string char A token declaration defines the syntax of a terminal symbol by a regular EBNF ex pression This expression may contain strings or character constants denoting themselves e g gt or as well as names of character sets e g letter denoting an arbitrary character from this set It must not contain other token names which implies that EBNF expressions in token declarations cannot be recursive Examples TOKENS ident letter letter digit _ number digit digit Ox hexDigit hexDigit hexDigit hexDigit float digit digit digit E digit digit The token declarations need not be LL 1 as can be seen in the declaration of number where both alternatives can start with a 0 Coco R automatically resolves any ambiguities and generates a deterministic finite scanner automaton Tokens may be declared in any order However if a token is decla
33. in the following way UsingClause using IF IsAlias ident Qualident IsAlias is a user defined method that reads two tokens ahead It returns true if ident is followed by otherwise it returns false Conflict resolution by a multi symbol lookahead The generated parser remembers the most recently recognized token as well as the current lookahead token in two global variables see also Section 3 3 4 Token token most recently recognized token Token la lookahead token The generated scanner offers a method Peek that can be used to read ahead beyond the lookahead token without removing any tokens from the input stream When normal parsing resumes the scanner will return these tokens again With Peek 1 we can implement Isalias in the following way static bool IsAlias Token next Scanner Peek return la kind _ident amp amp next kind _eql The conflict mentioned at the end of Section 2 4 5 can be resolved by the production A IF FollowedByColon ident x 1 ident x 7 mo ident Foo ident Bar Tea 18 and the following implementation of the function FollowedByColon static bool FollowedByColon Token x la while x kind _comma x kind _ident x Scanner Peek return x kind _colon Token names For peeking as we have illustrated it is convenient to be able to refer to the token ki
34. in the input text 2 3 Named tokens The scanner and parser produced by Coco R use small integer values to distinguish token kinds This makes their code harder to understand by a human reader some would argue that humans should never need to read such code anyway When used with appropriate options typically a n pragma Coco R can generate code that uses names for all the tokens By default these names have a rather stereotyped form for example would be named pointpointpointSym The facility exists to prefer user defined names or to help resolve name clashes for example between the default names that would be chosen for point and UserNames NAMES UserName UserName ident ident string Example NAMES period arn ellipsis ys 2 4 Parser Specification The parser specification is the main part of a compiler description It contains the productions of an attributed grammar which specify the syntax of the language to be parsed as well as its translation ParserSpecification PRODUCTIONS Production Production ident FormalAttributes LocalDecl Expression Expression Term Term Term Resolver Factor Factor Factor WEAK Symbol ActualAttributes Expression TL Expression Expression ANY SYNC SemAction Symbol ident string char SemAction ArbitraryStatements LocalDecl SemActi
35. inates the compiler 3 4 Main Class of the Compiler The main class of a compiler that is generated with Coco R has also to be provided in some manner It has to initialize the scanner call the parser and possibly print a message about the success of the compilation In its simplest form it might look like this public class Compiler public static void Main string arg Scanner Init arg 0 Parser Parse Console WriteLine Errors count errors detected As mentioned previously in the extended version of Coco R a compiler driver class can be generated from an appropriate driver frame file For very simple applications the default priver frame file supplied in the distribution might suffice but for serious applications a customized frame file should be created in the same directory as the Grammar ATG file and given the name Grammar frame It is suggested that this file can 28 be derived fairly simply from the simple Driver frame and some hints for doing so can be found in section 10 7 2 of Compiling with C and Java Terry04 3 5 Grammar Tests Coco R checks if the grammar in the compiler specification is well formed This includes the following tests Completeness For every nonterminal symbol there must be a production If a nonterminal x does not have a production Coco R prints the message No production for X Lack of redundancy If the grammar contains productions for a nonterminal x that does not
36. into a buffer and retrieve portions of it public class Buffer public static void Fill Stream s public static int Read public static int Peek public static int Pos get set public static string GetString int beg int end Fi11 s fills the buffer with the source stream s Read returns the next character or 256 if the input is exhausted peek allows the scanner to read characters ahead without consuming them Pos allows the scanner to get or set the reading position which is initially 0 Getstring a b can be used to retrieve the text interval a b from the input stream 3 3 4 Parser The generated parser has the following interface public class Parser public static Token token most recently recognized token public static Token la lookahead token public static void Parse public static void SemErr string msg public static bool Successful no errors occurred public static void SemError string msg same as SemErr public static void Warning string msg Warning message only public static string LexString Returns token val public static string LookAheadString Returns la val The variable token holds the most recently recognized token It can be used in semantic actions to access the token value or the token position The variable 1a holds the lookahead token i e the first token after token which has not yet been recognized by the parser The method parse is th
37. irst alternative So it returns false and the parser checks the second alternative The function IssecondaAlternative starts peeking again but before that it should reset the peek position to the first symbol after the lookahead token ia This can be done by calling scanner ResetPeek static bool IsSecondAlternative Scanner ResetPeek Token x Scanner Peek returns the first token after the lookahead token again The peek position is reset automatically every time a regular token is recognized by Scanner Scan see Section 3 3 1 19 Translation of conflict resolvers Coco R treats resolvers like semantic actions and simply copies them into the generated parser at the position where they appear in the grammar For example the production UsingClause using IF IsAlias ident Qualident is translated into the equivalent of the following parsing method static void UsingClause Expect _using if IsAlias Expect _ident Expect _eql Qualident Expect _semicolon Conflict resolution by exploiting semantic information A conflict resolver can base its decision not only on lookahead tokens but also on any other information For example it could access a symbol table to find out semantic properties about a token Consider the following LL 1 conflict between type casts and nested expressions which can be found in many programming languages Expr Factor Factor
38. l symbols do not have attributes in Cocol R For every token however the scanner returns the token value i e the token s string representation as well as the line and column number of the token see Section 3 3 4 This information can be viewed as output attributes of that token If users wish to access this data they can wrap a token into a nonterminal with the desired attributes for example Ident lt out string name gt ident name token val Number lt out int value gt number value Convert ToInt32 token val The variable token is the most recently recognized token Its field token vai holds the textual representation of the token see Section 3 3 4 2 4 4 The Symbol ANY In the productions of the grammar the symbol any denotes any token that is not an alternative to that any symbol in the context where it appears It can be used to conveniently parse structures that contain arbitrary text The following production for example processes an attribute list in Cocol R and returns the number of characters between the angle brackets Attributes lt out int len gt Ve int beg token pos 1 ANY gt len token pos beg In this example the token gt is an implicit alternative of the any symbol in curly braces The meaning is that this any matches any token except gt token pos is the source text position of the most recently recognized token see Section 3 3 4 Here is another examp
39. le that counts the number of statements in a block Block lt out int stmts gt a int n ki stmts 0 Wee stmts Block lt out n gt 7 L T stmts n ANY 1 1 x In this example the any matches any token except and which are its alternatives in this context 2 4 5 LL 1 Conflicts Recursive descent parsing requires that the grammar of the parsed language is LL 1 i e parsable from Left to right with Left canonical derivations and 1 lookahead symbol This means that at any point in the grammar the parser must be able to decide on the basis of a single lookahead symbol which of several possible alternatives have to be selected The following production for example is not LL 1 Statement ident Expression ident ActualParameters 15 Both alternatives start with the symbol ident If the parser comes to the beginning of a Statement and finds an ident as the next input token it cannot distinguish between the two alternatives However this production can easily be transformed to Statement ident Expression ActualParameters where all alternatives start with distinct symbols and the LL 1 conflict has dis appeared LL 1 conflicts can arise not only from explicit alternatives like those in the example above but also from implicit alternatives that are hidden in optional or iterative EBNF expressions The following
40. lied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE See the GNU General Public License for more details You should have received a copy of the GNU General Public License along with this program if not write to the Free Software Foundation Inc 59 Temple Place Suite 330 Boston MA 02111 1307 USA As an exception it is allowed to write an extension of Coco R that is used as a plugin in non free software If not otherwise stated any source code generated by Coco R other than Coco R itself does not fall under the GNU General Public License 1 a f Coco R stands for compiler compiler generating recursive descent parsers Contents LOVE cin ii a 1 1 Sample Product On cccsciscsccccscsssestsvcencesavesssacvacgncocesctesuevissnsuscesssdconssereodsdocsseuseceseevegus ciencia 1 2 Sample Parsing Method 1 3 Summary of ESAtUtES armani a odds 2 Input Language 2 1 Vocabulary 2 2 Overall Structure DO SCAMMEL SPECHICAIOM cisne aii ao aia sico eliana sai 7 2 3 1 Character sets AS VOK TT Ds Ded A casdssssdstevacosestadscostsscoetesbssitevesendu sedis svedvrennevssdy cons 2 3 4 Comments Dee Whitespace iii o dana catan 2 3 0 Case SENSI VILY nas dos 10 a A E E EAE deuce ET REEE 10 2A Parser Specific vecinas copian noia nic 2 4 1 Productions 2 4 2 Semantic Actions ZAS AMD dni 2 4 4 The Symbol ANY 24 9 LLY esae hi ti 2 4 6 LE Conflict REsol Vers usina ad 17 2 4 7 Syntax Error Handling 2 48 Frame Piles
41. list shows how to check for LL 1 conflicts in these situations Greek symbols denote arbitrary EBNF expressions such as a bjc first a denotes the set of terminal start symbols of the EBNF expression a follow a denotes the set of terminal symbols that can follow the nonterminal a in any other production Explicit alternatives A a Bly check that first 0 A first B A firsta A first y A first B O first y A O1 B check that firsta A first B A Q check that first a A follow A Options A 0 B check that firsta A first B A 0 check that first a A follow A Iterations A 0 B check that firsta A first B A 0 check that first a A follow A It would be very tedious and error prone to check all these conditions manually for a grammar of a realistic size Fortunately Coco R does that automatically For example processing the grammar A B Cc a BCd b a Je G d will result in the following LL 1 warnings 1 warning in A a is the start of several alternatives 1 warning in C d is the start amp successor of deletable structure Ll The first conflict arises because 8 can start with an a The second conflict comes from the fact that c may be followed by a a and so the parser does not know whether it should do another iteration of a in c or terminate c and continue with the a outside Another situation that leads to a
42. nds by names such as _ident or _comma Coco R generates such names for all tokens declared in the toxens section of the scanner specification For example if the ToKENs section reads TOKENS ident letter letter digit number digit digit eql L es comma colon i then Coco R will add the following constant declarations to the parser const int _EOF 0 const int _ident 1 const int _number 2 const int _eql 3 const int _comma 4 const int _colon 5 These token names are preceded by an underscore in order to avoid conflicts with reserved keywords and other identifiers Normally the toxens section will only contain declarations for token classes like ident Of number leaving literal tokens to be declared implicitly where they appear in productions However if the name of a literal token is needed for peeking it is convenient to introduce the token explicitly in this section In the actions associated with productions of the grammar this token can then be referred to by name Resetting the peek position The scanner makes sure that a sequence of Peek calls will return the tokens following the lookahead token 1a In rare situations however the user has to reset the peek position manually Consider the following grammar A IF IsFirstAlternative IF IsSecondAlternative de Assume that the function IsFirstAlternative starts peeking and finds out that the input does not match the f
43. nization an error is only reported if at least two tokens have been recognized correctly since the last error Normally there are only a handful of synchronization points in a grammar for a real programming language This makes error recovery cheap in parsers generated by Coco R and does not slow down error free parsing Weak tokens Error recovery can further be improved by specifying tokens that are weak in a certain context A weak token is a symbol that is often mistyped or missing such as a comma in a parameter list which is often mistyped as a semicolon A weak token is preceded by the keyword wzax When the parser expects a weak token but does not find it in the input stream it adjusts the input to the next token that is either a legal successor of the weak token or a token expected at any synchroni zation point symbols expected at synchronization points are considered to be particularly strong so that it makes sense to never skip them Weak tokens are often separator symbols that occur at the beginning of an iteration For example if we have the productions ParameterList Parameter WEAK Parameter Parameter ref out Type ident and the parser does not find a ora after the first parameter it reports an error and skips the input until it finds either a legal successor of the weak token i e a legal start of Parameter or a successor of the iteration i e or any symbol expected at
44. of all integers from 1 to n VAR sum INTEGER BEGIN sum 0 WHILE n gt 0 DO sum sum n n n 1 END WRITE sum END SumUp 30 BEGIN READ n WHILE n gt 0 DO SumUp READ n END END Example Of course Taste is too restrictive to be used as a real programming language Its purpose is just to give you a taste of how to write a compiler with Coco R The Taste compiler is a compile and go compiler which means that it reads a source program and translates it into a target program which is executed i e interpreted immediately after the compilation In order to run it type Taste Test TAS The file Test Tas holds the sample program shown above This file is then compiled and immediately executed prompting the user for the names of appropriate input data and output results files Classes Figure 2 shows the classes of the compiler Taste Parser Scanner TL TC Figure 2 Classes of the Taste compiler Taste is the main class It initializes the scanner and calls the parser and the interpreter TL is the symbol table with methods to handle scopes and to store and retrieve object information Finally tc is the code generator with methods to emit instructions It also contains the interpreter and its data structures The source code of all classes as well as the attributed grammar taste atc can be found in Appendix B Target Code We define an abstract stack machine for the interpretation of Taste program
45. on FormalAttributes lt ArbitraryText gt lt ArbitraryText gt ActualAttributes lt ArbitraryText gt lt ArbitraryText gt Resolver IF hy 4 ANY PA 12 2 4 1 Productions A production specifies the syntactical structure of a nonterminal symbol It consists of a left hand side and a right hand side which are separated by an equal sign The left hand side specifies the name of the nonterminal together with its formal attributes and the local variables of the production The right hand side consists of an EBNF expression that specifies the structure of the nonterminal as well as its translation in form of attributes and semantic actions The productions may be given in any order References to as yet undeclared nonterminals are allowed any name that has not previously been declared is considered to be a forward reference to a nonterminal symbol For every nonterminal there must be exactly one production which may specify alternative right hand sides In particular there must be a production for the grammar name which is the start symbol of the grammar 2 4 2 Semantic Actions A semantic action is a piece of code written in the target language of Coco R i e in C or in Java It is executed by the generated parser at the position where it has been specified in the grammar Semantic actions are simply copied to the generated parser without being checked by Coco R A semantic action can also
46. onflict Here is how the resolvers should have been placed in this example A If le ab resolves conflict betw the first two alternatives ae b Ja The following example is also interesting A a D IE ssa D G resolver placed incorrectly b Although the in the second alternative constitutes an LL 1 conflict with the b after the iteration the resolver is placed incorrectly The reason is that it should be called only once in the parser namely in the header of the while loop and not both in the while header and at the beginning of the second alternative If correctly placed at the beginning of the iteration like this A IF AnotherIteration a be b and with function anotherIteration implemented as follows static bool AnotherIteration Token next Scanner Peek return la kind _a la kind _b KK next kind _c Coco R then generates code like static void A while Anotherlteration if la kind _a Expect _a else if la kind _b Expect _b Expect _c Expect _b Remember that the resolver must be placed at the earliest possible point where the LL 1 conflict arises 2 4 7 Syntax Error Handling If a syntax error is detected during parsing the generated parser reports the error and should try to recover by synchronizing the erroneous input with the grammar While error messages are generated automatically
47. perators and statements the nesting of statements and expressions as well as the use of local and global variables to obtain a measure of the program complexity and an indication if the program is well structured 32 A cross reference generator which lists all occurrences of the objects in a program according to their scope together with information where the objects have been assigned a value and where they have been referenced A pretty printer that uses the structure and the length of statements for proper indentation A program that generates an index for books and reports The index is generated from a little language that describes page numbers and the keywords occurring on those pages The front end of a syntax oriented editor A program is translated into a tree representation which is the internal data structure of the editor A program that builds a repository of symbols and their relations in a program The repository is accessed by a case tool A profiler that inserts counters and timers into the source code of a program and evaluates them after the program has been run A white box test tool that inserts counters into the source code of a program to find out which paths of the programs have been executed 6 Acknowledgements The author gratefully acknowledges the help of the following people who contributed ideas and improvements to Coco R or ported it to other programming languages Frankie Arzu John
48. red as a literal that might match an instance of a more general token the literal has to be declared after the more general token Example ident letter letter digit while while Since the string while matches both the tokens while and ident the declaration of while must come after the declaration of ident In principle literal tokens do not have to be declared in the token declarations at all but can simply be introduced directly in the productions of the grammar In some situations however it makes sense to declare them explicitly perhaps to get a token name for them that can be used in resolver methods see Section 2 4 6 Context dependent tokens The context phrase in a TokenTerm means that the term is only recognized if its context i e the characters that follow the term in the input stream matches the TokenExpr specified in brackets Note that the TokenExpr is not part of the token Example number digit digit digit digit CONTEXT float digit digit digit E digit digit The context phrase in this example allows the scanner to distinguish between float tokens e g 1 23 and integer ranges e g 1 2 that could otherwise not be scanned with a single character lookahead This works as follows after having read 1 the scanner still works on both tokens If the next character is a the characters are pushed back to the input stream and a number token
49. returns the base address of the stack frame which is statically n levels up in the stack n 0 means the current frame n 1 means the statically surrounding frame etc The main program is considered as a normal procedure with a stack frame Figure 3 shows the format of a stack frame bp at static link dynamic link return addr local variables top Figure 3 Format of a stack frame For example the source code instruction i i 1 is translated into the code LOAD 0 3 load value of i at address 3 in current frame LIT 1 load constant 1 ADD STO 0 3 store result to i Appendix B contains the source code of the following files which can also be downloaded from http www ssw uni linz ac at Research Projects Coco for the original Linz version or from http www scifac ru ac za resourcekit for a version compatible with the extended version of Coco R but further modified to use some identifiers more like those in other case studies in the book Compiling with C and Java Terry04 Taste ATG the attributed grammar Taste cs the main program TL cs the symbol table TC cs the code generator and interpreter 5 Applications of Coco R Coco R can be used not only to write proper compilers but also to build many kinds of tools that process structured input data Various people have used Coco R for the following applications An analyzer for the static complexity of programs The analyzer evaluates the kind of o
50. s The compiler translates a source program into instructions of that machine which are then interpreted The machine uses the following data structures char code object code filled by the compiler int stack stack with frames for local variables int top stack pointer points to next free stack slot int pc program counter int bp base pointer of current frame The machine instructions are described by the following table the initial values of the registers are bp 0 top 3 LOAD n a Load value Push stack Frame n a LIT Load literal Push 1 STO n a Store stack Frame n a Pop ADD Add j Pop i Pop Push 1 3 SUB Subtract jJ Pop i Pop Push i j DIV Divide jJ Pop i Pop Push i j MUL Multiply j Pop i Pop Push 1 3 EQU Compare if equal j Pop i Pop if i 3 Push 1 else Push 0 LSS Compare if less jJ Pop i Pop if i lt j Push 1 else Push 0 GTR Compare if greater j Pop i Pop if i gt j Push 1 else Push 0 CALL n a Call procedure Push Frame n Push bp Push pc 2 31 pc a p top 3 RET Return from proc top bp 3 pc Pop bp Pop dummy Pop RES i Reserve frame top topti JMP a Jump pc a FJMP a False jump if Pop 1 pc adr HALT End of program Halt NEG Negation Push Pop READ n a Read integer stack Frame n a ReadInt WRITE Write integer WriteLine Pop The function Frame n
51. scanner PragmaDecl SemAction TokenDecl SemAction ArbitraryStatements 10 Example PRAGMAS option S letter foreach char ch in la val if ch A else if ch B This pragma defines a compiler option that can be written for example as sa Whenever it occurs in the input stream it is not forwarded to the parser but immediately processed by executing its associated semantic action Note that 1a val accesses the value of the lookahead token 1a which is in this case the pragma that was just read see Section 3 3 4 2 3 4 Comments Comments are difficult to specify with regular expressions nested comments are quite impossible to specify in that way This makes it necessary to have a special construct to define their structure Comments are declared by specifying their opening and closing brackets The keyword NesTED denotes that they can be nested CommentDecl COMMENTS FROM TokenExpr TO TokenExpr NESTED Comment delimiters must be sequences of 1 or 2 characters which can be specified as literals or as single element character sets They must not be structured for example with alternatives It is possible to declare more than one structure for comments Example COMMENTS FROM TO NESTED COMMENTS FROM TO eol Alternatively if comments are not to be nested one can define them as pragmas This has the advantage that such comments can be processed semantically
52. t and Qualident start with ident Although this conflict could be eliminated by transforming the production to UsingClause using ident ident Qualident lT the readability would clearly deteriorate It is better to resolve this conflict as shown in Section 2 4 6 Semantic issues Finally factorization is sometimes inhibited by the fact that the se mantic processing of conflicting alternatives differs e g A ident x 1 ident x ident Foo ident Bar a The common parts of these two alternatives cannot be factored out because each alternative has to be processed semantically in its own way Again this problem can be solved with the technique explained in Section 2 4 6 2 4 6 LL 1 Conflict Resolvers A conflict resolver is a Boolean expression that is inserted into the grammar at the beginning of the first of two conflicting alternatives and decides using a multi symbol lookahead or a semantic check whether this alternative matches the actual input If the resolver yields true the alternative prefixed by the resolver is selected otherwise the next alternative will be checked A conflict resolver is written as Resolver IF any expression where any Boolean expression can be written between the parentheses In most cases this will be a function call that returns true Or false Thus we can resolve the LL 1 conflict from Section 2 4 5
53. t Exit 1 public static void Init pe 1 end TC end namespace break 40 B 4 Taste cs main program using System using System I0 namespace Taste public class Taste public static void Main string args bool mergeErrors false string inputName null for int i 0 i lt args Length i if args i ToLower 1 mergeErrors true else inputName args i T if inputName null Console WriteLine No input file specified System Environment Exit 1 T int pos inputName LastIndexOf if pos lt 0 pos inputName LastIndexOf string dir inputName Substring 0 pos 1 Scanner Init inputName Errors Init inputName dir mergeErrors Parser Parse Errors Summarize if Errors count 0 TC Interpret 1 end Taste end namespace 41
54. to the empty string such as in the following grammar A B Lal B b In such cases Coco R prints the warnings A deletable B deletable LL 1 conflicts If two or more alternatives start with the same token such as in Statement ident Expression ident Parameters Coco R prints the warning LL 1 warning in Statement ident is the start of several alternatives If the start symbols and the successors of a deletable EBNF expression or are not disjoint as as in Qualid id t ad IdList 1d 1 ad Js Coco R prints the warnings LL 1 warning in Qualld id is the start amp successor of deletable structure LL 1 warning in IdList is the start amp successor of deletable structure The resolution of LL 1 conflicts is discussed in Section 2 4 5 4 ASample Compiler This section shows how to use Coco R for building a compiler for a tiny programming language called Taste Taste bears some similarities with Modula 2 or Oberon It has variables of type INTEGER and BooLEAN as well as nested procedures without parameters It allows assignments procedure calls r and wn TLE statements Integers may be read from a file and written to the console each of them in a single line It has arithmetic expressions and relational expressions lt gt Here is an example of a Taste program MODULE Example VAR n INTEGER PROCEDURE SumUp build the sum
55. ue This return value can be an object of a class that contains multiple members If a nonterminal is to have an output attribute this must be declared as the first attribute It is denoted by the keyword out both in its declaration and in its use The following example shows a nonterminal r with an output attribute x and two input attributes y and z for compatibility with older versions of Coco R the symbol can be substituted for the keyword out T lt out int x char y int z gt This nonterminal might be used as follows a Four a TDS CFS es The production of the nonterminal r is translated to the following parsing method static int T char y int z int x FSE xs Note that if expressions that are passed as actual input attributes both in C and in Java are bracketed by lt and gt they may not contain the operator gt which is the closing attribute bracket Such expressions might be assigned to a temporary variable which can then be passed as an attribute The extended version of Coco R described here also allows for attribute lists to be bracketed by lt and gt Coco R checks that nonterminals with attributes are always used with attributes and that nonterminals without attributes are always used without attributes However it 14 does not check the correspondence between formal and actual attributes which is left to the compiler of the target language Attributes of terminal symbols Termina
56. ues 13 from the production of a nonterminal to its caller i e to the place where this nonterminal occurs in some other production As with parameters we distinguish between formal attributes which are specified at the nonterminal s declaration on the left hand side of a production and actual attributes which are specified at the nonterminal s occurrence on the right hand side of a production Attributes in C A formal attribute looks like a parameter declaration In C output attributes must be preceded by the keyword out or ref The following example declares a nonterminal s with an input attribute x and two output attributes y and z S lt int x out int y ref string z gt An actual attribute looks like an actual parameter Actual input attributes may be expressions which are evaluated and assigned to the corresponding formal attributes In CF actual output attributes must be preceded by the keywords out or ref They are passed by reference like output parameters in C Here is an example a and b are assumed to be of type int c is assumed to be of type string S lt 3 a 1 out b ref gt The production of the nonterminal s is translated to the following parsing method static void S int x out int y ref string z Attributes in Java Java methods cannot have output parameters However the Java version of Coco R provides for a single output attribute which is passed to the caller as a return val
57. which the generated scanner and parser should belong e g at jku ssw Coco In the extended version of Coco R described here if no namespace is specified the generated classes will belong to a namespace matching the name of the goal symbol of the grammar The frames option can be used to specify the directory that contains the frame files Scanner frame Parser frame and Driver frame see Section 2 4 8 If this option is missing Coco R expects the frame files to be in the same directory as the attributed grammar The trace Or options option allows the user to specify a string of switches e g asx that cause internal data structures of Coco R to be dumped to the file trace txt The switches are denoted by the following characters A print the states of the scanner automaton print the first sets and follow sets of all nonterminals print the syntax graph of all productions trace the computation of first sets list the ANY and SYNC sets used in error recovery print statistics about the run of Coco R print the symbol table and the list of declared literals print a cross reference list of all terminals and nonterminals e MO DU E ba G N In the extended version the directives trace optionString options optionString are equivalent Further letters may be introduced to the optionString generate code for a compiler driver class Grammar cs Or Grammar java merge any error messages with the source code to create a listing file
58. with the value 1 is returned to the parser If the next character is not a the scanner continues with the recognition of a float token Hand written scanners If the right hand sides of the token declarations are missing no scanner is generated This gives the user the chance to provide a hand written scanner which must conform to the interface described in Section 3 3 1 Example TOKENS ident number Te f while Tokens are assigned numbers in the order of their declaration The first token gets the number 1 the second the number 2 and so on The number o is reserved for the end of file token The hand written scanner must return the token numbers according to these conventions In particular it must return an end of file token if no more input is available It is hardly ever necessary to supply a hand written scanner because the scanner generated by Coco R is highly optimized A user supplied scanner would be needed for example if the scanner were required to process include directives 2 3 3 Pragmas Pragmas are tokens that may occur anywhere in the input stream for example end of line symbols or compiler directives It would be too tedious to handle all their possible occurrences in the grammar Therefore they are excluded from the token stream that is passed to the parser Pragmas are declared like tokens but they may have a semantic action associated with them that is executed whenever they are recognized by the
59. ype gt THEN StatSeq ELSE StatSeq END WHILE Expression lt out type gt DO StatSeq END READ Ident lt out name gt WRITE Expression lt out type gt gt string name Entry entry int fix fix2 loopstart entry TL Find name if entry kind TL vars SemErr cannot assign to procedure if type entry type SemErr incompatible types TC Emit3 TC STO TL curLevel entry level entry adr if entry kind TL procs SemErr not a procedure TC Emit3 TC CALL TL curLevel entry level entry adr if type TL BOOL SemErr boolean type expected fix TC pc 1 TC Emit2 TC FJMP 0 fix2 TC pc 1 TC Emit2 TC JMP 0 TC Fixup fix fix fix2 C Fixup fix loopstart TC pc if type TL BOOL SemErr boolean type expected fix TC pc 1 TC Emit2 TC FJMP 0 TC Emit2 TC JMP loopstart TC Fixup fix entry TL Find name if entry type TL INT SemErr integer type expected TC Emit3 TC READ TL curLevel entry level entry adr if type TL INT SemErr integer type expected TC Emit TC WRITE Expression lt out int type gt SimExpr lt out type gt RelOp lt out op gt SimExpr lt out typel gt SimExpr lt out int type gt Term lt out type gt AddOp lt out op gt 36 int typel op if type typel SemErr incompatible types TC Emit op

Download Pdf Manuals

image

Related Search

Related Contents

Untitled - Sound Storm Lab  INSPIRE 1  Topcom UBR 624 Network Router User Manual  TOEICvol5-no2-part5-6  取扱説明書  GW-632FW Manual 20140311  REX F-0-9– Standalone or Access Controller Black line www.jantar.si  Benutzerhandbuch  USER MANUAL - The Sharper Image  Oster 137299 User's Manual  

Copyright © All rights reserved.
Failed to retrieve file