Home

UniCC User Manual - Phorward Software Technologies

image

Contents

1. Program Module Parser Generator Description Generator Parser One or multiple Se Description Modules e g c cpp java xml Fig 7 An overview about the UniCC code generators UniCC can not only be seen as a parser generator to compile a parser definition into a piece of program code to implement this parser It is also a parser generator that can be used as the base for different any kinds of parser analyzation code generation and optimization issues This is the reason why UniCC comes with two integrated code generators One code generator that builds program modules expressed in a particular programming language and one code generator to build an independent parser description file that describes the compiled form of the grammar and a transparent representation of the output parser The output of the first code generator can directly be fed to a compiler whereas the output of latter code generator can be analyzed by any type of other program with a specialized purpose 62 3 5 Code generators UniCC Parser Generator 3 5 1 The program module generator The program module generator is the default code generator that is used by UniCC if nothing else is explicitly specified It is used to build a parser module in a specific high level programming language Thanks to its template based approach this code generator is not targetted to one specific programming language All target
2. lt ATTLIST symbol name CDATA REQUIRED gt lt ATTLIST symbol terminal type character class string regular expression system IMPLIED gt lt ATTLIST symbol type non terminal terminal REQUIRED gt lt ATTLIST symbol value type CDATA IMPLIED gt lt ATTLIST symbol value type id NMTOKEN IMPLIED gt lt ELEMENT symbols symbol gt lt ELEMENT transition character class gt lt ATTLIST transition goto NMTOKEN REQUIRED gt lt ELEMENT value type PCDATA gt lt ATTLIST value type c_name CDATA REQUIRED gt lt ATTLIST value type id NMTOKEN REQUIRED gt lt ELEMENT value types value type gt lt ELEMENT variable EMPTY gt lt ATTLIST variable offset NMTOKEN IMPLIED gt lt ATTLIST variable target left hand side right hand side REQUIRED gt lt ATTLIST variable value type CDATA IMPLIED gt lt ATTLIST variable value type id NMTOKEN IMPLIED gt lt ELEMENT version PCDATA gt 7 Appendix Il The UniCC Document Type Definition DTD 157 UniCC Parser Generator 158 7 Appendix Il The UniCC Document Type Definition DTD With greetings from high above Phorward Software Te
3. 105 x Mone E 106 Dv Use and CoD PU THO ee ou dared t EE de up venne ii 106 30 Value types Ini semante BOOMS iit rasa dott da bon eta mein ued Rie Ou LUE RD 106 3 6 1 Semantic terminal selections iiie eee aa ei n De XN Ro buo ba 107 J ete prefix peb parser control Bloc Kesesi o eb ea 108 9 8 Additional data Structure ass a u 111 SR t2 to prep vicypes Value Type SUCUT 4 o eo S poi FERAE ann 111 5 8 2 G prefix tok Stack Token Description Structure eese 111 5 8 3 G prefix syminto symbol Information Table eerte 112 5 8 4 9 O prefix prodinfo production information table eene 113 5 9 RTE Oe GE 114 5 10 Build in syntax tree rette coe reb tree te HERR HE Ier te eine ee 115 3 TT man the build in qat E Te ME 117 5 12 G prelix parse the parser invocation function oit tiesto 119 UniCC Parser Generator Table of Contents 5 The Standard C Parser Template C tlt 3 13 pretix_utl _setcharQ UTP 8 character Tetcb function a sa aaa a aa kanan inginkan 120 53 14 Ge prenix Texemi fake current lexical Valit esse ae 121 2 15 Conk suralion M er Referee oou opere een 122 Soll UICC CEEARIN essen 122 5 15 2 UNICC DEBUG ea a P 122 3 193 UNICC ERROR DELA Tess 124 21535 UNICC EEN CT DN WEE 124 3 13 5 DINIGC MAIN es epit Ett UI Ut SEE ERNEUT IS u 125 5 15 0 UNICC_MALLOCSTEP EE 125 o Io WIC OUTOFME M os
4. S UB MUL 7 DIV extern xpl_fn xpl buildin functions void xpl dump xpl program prog xpl runtime rt int KS xpl cmd cmd for i 0 i lt prog program cnt i cmd amp prog gt program i fprintf stderr s 03d s rt amp amp rt gt ip cmd gt i opcodes cmd gt op if cmd gt op XPL JMP cmd gt op XPL JPC fprintf stderr 03d cmd gt param else if cmd gt op XPL LIT fprintf stderr Sd s s s cmd gt param prog gt literals cmd gt param gt type XPL STRINGVAL UNI sep xpl value get string prog gt literals cmd gt param prog gt literals cmd gt param gt type XPL STRINGVAL ER pH a else if cmd gt op XPL LOD cmd gt op XPL STO fprintf stderr d gt Ss cmd gt param 40 2 6 3 3 xpl program representation of a compiled program UniCC Parser Generator prog gt variables cmd gt param else if cmd gt op XPL_CAL fprintf stderr d gt s cmd gt param xpl_buildin_functions cmd gt param name fprintf stderr n 2 6 3 4 xpl fn Integrating some useful functions The build in functions in XPL are also described in one structure xp1_ fn This structure holds the name of the function as it can be called from an XPL program the minimum and maximum count of parameters and finally a C function callb
5. lt ELEMENT production code left hand side right hand side semantic right hand side lt ATTLIST production defined at NMTOKI IMPLIED gt lt ATTLIST production id NMTOKEN REQUIRED gt lt ATTLIST production length NMTOKEN REQUIRED gt Ij lt ELEMENT productions production gt lt ELEMENT prologue PCDATA gt lt ELEMENT range EMPTY gt lt ATTLIST range from NMTOKEN REQUIRED gt lt ATTLIST range to NMTOKEN REQUIRED gt lt ELEMENT raw PCDATA gt lt ELEMENT regex PCDATA gt lt ELEMENT return value EMPTY gt lt ATTLIST return value value type CDATA REQUIRED gt lt ATTLIST return value value type id NMTOKEN REQUIRED gt lt ELEMENT right hand side EMPTY gt lt ATTLIST right hand side named CDATA IMPLII lt ATTLIST right hand side offset NMTOKEN REQUIRED gt ei V lt ATTLIST right hand side symbol id NMTOKE REQUIRED lt ELEMENT semantic right hand side EMPTY gt lt ATTLIST semantic right hand side named NMTOKEN IMPLIED gt lt ATTLIST semantic right hand side offset NMTOKEN REQUIR
6. To configure a parser template for a specific mode the defines above UNICC_UTF8 UNICC WCHAR must be set to 1 If both are set to 0 ASCII mode will be used The data types of UNICC CHAR and UNICC SCHAR depends on these switches By default UNICC UTFS is switched 1 so that the UTF 8 input processing mode will be used 114 5 9 Unicode support UniCC Parser Generator 5 10 Build in syntax tree generator The UniCC Standard C Parser Template features a build in syntax tree construction automatism This automatism can be enabled by defining UNICC_SYNTAXTRE E to 1 The core aspect of this feature is to get a tree representation of the parsed input which can be used for several purposes One example are the graphical syntax trees that are printed in this manual Most of them had been build automatically using this feature in combination with some SVG rendering tools The syntax tree data structure is made up of elements of type prefix_syntree Parse tree node typedef struct prefix_SYNTREE struct prefix_SYNTREE prefix_tok UNICC_SCHAR prefix_syntree prefix_syntree prefix_syntree prefix_syntree prefix_syntree symbol token parent child prev next Member Type Content child prefix_syntree Pointer to the first child node if element is not a leaf arent 9 G prefix syntree Pointer to the parent nod
7. for i 0 i lt sizeof xpl_buildin_functions sizeof if stremp xpl buildin functions i name name return i xpl fn i 0 return 1 int xpl check function parameters int function int parameter count int line if xpl buildin functions function min gt 1 if parameter count xpl buildin functions function min fprintf stderr line d Too less parameters in call to s d parameters required at minimum line xpl buildin functions function name xpl buildin functions function min return 1 else if xpl buildin functions function max gt 1 if parameter count xpl buildin functions function max fprintf stderr line d Too many parameters in call to s d parameters allowed at maximum line xpl buildin functions function name xpl buildin functions function max return 1 return 0 Build in functions follow xpl value XPL print int argc xpl value argv int AES 6 6 xpl functions c 139 UniCC Parser Generator for i 0 i lt argc itt printf Ss n xpl value get string argv i return xpl_value NULL xpl_value XPL_prompt int argc xpl_value argv char but 256 t In if argc gt 0 printf s xpl value get string argv 0 if fgets buf sizeof buf stdin bufi strlen buf 1 N0 return xpl_value_create_string buf 1 ret
8. 0 variable 0 0 0 0 0 0 0 0 0 i 55 UniCC Parser Generator expression expression xpl emit pcb gt prog XPL DIV O0 expression xpl emit pcb gt prog XPL LIT xpl get literal pcb gt prog xpl value create integer 1 xpl emit pcb gt prog XPL MUL O0 precedence expression real TS xpl emit pcb gt prog XPL LIT xpl get literal pcb gt prog xpl value create float real string Remove the quotation marks from the string St string strlen string 1 0 Generate code f xpl emit pcb gt prog XPL LIT xpl get literal pcb gt prog xpl value create string string 1 1 z variable xpl emit pcb gt prog XPL LOD variable g i function parameter_list Semantic checks if the function parameters are in a valid count Ef if xpl_check_function_parameters function parameter_list pcb gt line 2 6 4 Implementing the compiler UniCC Parser Generator pcb gt error_count t We first push the number of parameters xpl emit pcb gt prog XPL LIT xpl get literal pcb gt prog Xpl value create integer Gparameter list Then we call up the function xpl emit pcb gt prog XPL CAL function parameter list int gt parameter list expression Qparameter list expression
9. 100 4 5 10 lexeme separation 4 6 Special symbols UniCC Parser Generator UniCC grammars may contain some special terminal and nonterminal symbols under various circumstances These special symbols begin with an ampersand amp and cannot be expressed in the grammar itself except the symbol amp error Their meaning and usage is listed in the following table below Special symbol Type Usage amp embedded num amp eof nonterminal terminal Anonymous nonterminals get this name with a consecutive number Specifies the end of file symbol This special symbol exists in every grammar and will be automatically appended to the goal symbol The end of file symbol itself is determined by the parser but is not bound to a special character sequence or value In the UniCC Standard C Parser Template the end of file symbol can be dynamically set to fit to the articular input amp error terminal Specifies the error resynchonization symbol This symbol is used to resynchronize the parser after getting scrambled by a parse error The chapter error recovery covers the use and further meaning of the error resynchonitation terminal amp whitespace amp whitespace amp whitespace nonterminal This group of nonterminals are only inserted in sensitive mode grammars that make use of the whitespaces directive Calls to this symbol are added to appropriate positions relating the lexeme configura
10. End of definition 2 5 2 Precedences 21 UniCC Parser Generator When this grammar is fed to UniCC using the w command line switch we get many lines with the following warnings unicc w conflicts par unicc warning state 16 Shift reduce conflict 1 expression gt expression expression 1 expression gt expression expression 2 expression gt expression expression 3 expression gt expression expression 4 expression gt expression expression unicc warning state 16 Shift reduce conflict 1 expression gt expression expression 1 expression gt expression expression 2 expression gt expression expression 3 expression gt expression expression 4 expression gt expression expression unicc warning state 16 Shift reduce conflict 1 expression gt expression expression 1 expression gt expression expression 2 expression gt expression expression 3 expression gt expression expression 4 expression gt expression expression unicc warning state 16 Shift reduce conflict 1 expression gt expression expression 1 expression gt expression expression 2 expression gt expression expression 3 expression gt expression expression 4 expression gt expression expression unicc warning state 17 Shift reduce conflict 1 expression gt expression expression 2 expression gt
11. GEQ ADD S UB MUL Y DIV hi extern xpl_fn xpl buildin functions void xpl dump xpl program prog xpl runtime rt int ER xpl cmd cmd for i 0 i lt prog program cnt i cmd amp prog gt program i fprintf stderr s 03d s art amp amp ftc2ipo eoomd EN ROT i opcodes cmd gt op if cmd gt op XPL JMP cmd gt op XPL JPC fprintf stderr 03d cmd gt param else if cmd gt op XPL LIT fprintf stderr s s s cmd gt param prog gt literals cmd gt param gt type XPL STRINGVAL wan T xpl value get string prog gt literals cmd gt param prog gt literals cmd gt param gt type XPL STRINGVAL WA wn i else if cmd gt op XPL LOD cmd gt op XPL STO fprintf stderr d gt Ss cmd gt param prog gt variables cmd gt param else if cmd gt op XPL_CAL fprintf stderr xpl buildin functions fprintf stderr 138 Np gt s cmd gt param cmd gt param name 6 5 xpl debug c UniCC Parser Generator 6 6 xpl functions c finclude xpl h xpl fn xpl_buildin_functions exit els 1 XPL exit hy Morante hy 1 ae XPL_print prompt qs dy XPL prompt integer SR d XPL integer float 1 l1 XPL float by String i5 1 XPL string h int xpl get function char name int Sr Try to find function
12. Given the example that an empty string and a nonempty string shall be matched as single symbols this can be handled with a semantic symbol selection Match a string or empty string string empty string lt char gt BR PCM AE Select symbol if strlen gt gt 2 symbol string else symbol empty_string Run below code only when shifting and not only at token detection this is one of the caveats in the standard C parser template when using semantic terminal determination with dynamic memory allocation S dif UNICC_ON_SHIFT strdup gt endif 4 4 1 4 Terminal anomalies When using the whitespace sensitive parser construction mode terminals could be broken down to grammatical constructs parsing lexemes This sometimes raises a problem that is known as a terminal anomaly This terminal anomly can occur between regular expression based terminals these are terminal symbols that are based on strings or regular expressions and grammar constructs build on character classes that may consume the same input sequence like the regular expression based terminal does In such a situation the parser may run into an ambiguitiy conflict that matches both the content of a lexeme that is build of the same characters as the valid regular expression based terminal does 78 4 4 1 3 Regular expressions UniCC Parser Generator The following example defines a grammar that generates this kind of
13. The function s prototype is UNICC STATIC prefix_vtype prefix_parse prefix_pcb pcb The function only requests for one parameter which is a structure of the parser control block The structure must be initialized to zero and user defined values must be filled correctly before the parse function is invoked Else the parser will come into unpredictable states and parse errors or simply cores It is also possible to pass prefix_pcb NULL to the parser invocation function In this case the function will allocate a parser control block structure on its own but this is only useful in validating parsers that do not perform much semantic actions working on pointers of the parser control block The parser function returns a value of kind prefix_vtype which is the return value of the goal symbol To find out if errors occured the variable error_count from the parser control block should be checked 5 12 prefix_parse the parser invocation function 119 UniCC Parser Generator 5 13 prefix_utf8_getchar UTF 8 character fetch function If the parser is compiled with UNICC_UTF8 enabled default the macro UNICC_GETINPUT directs by default to the function prefix_utf8_getchar getchar The function reads characters from a character retrival function by default get char until a valid UTF 8 sequence was matched It can be used by invidividual tasks also The function s prototype is UNICC_STATIC UNICC CHAR pref
14. expression 22 on on on on on on on on on on lookahead amp eof lookahead amp eof lookahead amp eof lookahead amp eof lookahead amp eof lookahead amp eof lookahead amp eof lookahead amp eof lookahead amp eof lookahead rox d rox d IER rx d rox v IER rox d rox v rox d De EES EE igicur FOREN kn quete a E r UM SES SE BE tr EE 2 5 2 Precedences 3 4 unicc This is caused due the ambiguity of the grammar which comes up in the nonterminal definition of expression This ambiguity can be visualized when analyzing the expression 2 3 Because our UniCC Parser Generator expression expression Shift reduce conflict on expression expression expression expression expression Shift reduce conflict on expression expression expression expression expression Shift reduce conflict on expression expression expression expression expression Shift reduce conflict on expression expression expression expression expression Shift reduce conflict on expression expression expression expression expression Shift reduce conflict on xpression gt expression xpression
15. jmp gt 0 pcb gt prog gt program jmp param pcb gt prog gt program_cnt 129 expression 130 UniCC Parser Generator pcb gt prog gt program jpc param jmp 1 else pcb gt prog gt program jpc param pcb gt prog gt program_cnt el while pcb prog program cnt begin expression xpl emit pcb gt prog XPL JPC 0 jPC Statement xpl emit pcb gt prog XPL JMP begin pcb gt prog gt program jpc param pcb prog program cnt statement expression xpl emit pcb gt prog XPL DRP O0 Pat r variable expression xpl emit pcb gt prog XPL DUP O0 xpl emit pcb gt prog XPL STO variable e expression expression xpl emit pcb gt prog XPL EQU O0 expression expression xpl emit pcb gt prog XPL NEQ O0 SI expression lt expression 6 2 xpl par UniCC Parser Generator xpl emit pcb gt prog XPL LOT O0 i expression gt expression xpl emit pcb gt prog XPL GRT O0 expression lt expression xpl emit pcb gt prog XPL LEQ O0 expression gt expression xpl emit pcb gt prog XPL GEQ O0 expression expression xpl emit pcb gt prog XPL ADD O0 expression expression xpl emit pcb gt prog XPL SUB O0 expression
16. print lexem 0 prec 0 assoc N v null 0 print value valu A Z a z 0 9 lexem 0 prec 0 assoc N v null 1 gt ident 2 gt integer ident A Z a z lexem 0 prec 0 assoc N v null 3 gt A Z_a z ident 1 gt A A2 a rz integer 0 9 lexem 0 prec 0 assoc N v null 5 0 9 integer 6 gt 0 9 start print lexem 0 prec 0 assoc N v null 7 start amp eof amp whitespace lexem 1 prec 0 assoc N v null 8 gt o amp whitespace lexem 1 prec 0 assoc N v null 9 gt amp whitespace amp whitespace 10 amp whitespace amp whitespace lexem 1 prec 0 assoc N v null 11 gt amp whitespace 12 gt print print lexem 0 prec 0 assoc N v null 13 print amp whitespace 0 9 0 9 lexem 0 prec 0 assoc N v null 14 0 9 amp whitespace A Z a z A Z a z lexem 0 prec 0 assoc N v null 15 A Z_a z amp whitespace start print lexem 0 prec 0 assoc N v null 16 amp whitespace start 4 5 9 lexeme 97 UniCC Parser Generator The rewritten nonterminals derived from the terminal symbols are extended by their origin symbol name with an appended quote In this case the character class terminal A Za z is rewritten to a new nonterminal with the name A Za z which allows for whitespace behind every character that matches the character class The
17. 4 Generatir b v RET grid a m Ee A universal LALR 1 Parser Generator User Manual for UniCC Version 1 0 Phorward Software Technologies contact lt at gt phorward lt dash gt software lt dot gt com http www phorward software com UniCC A universal LALR 1 Parser Generator User Manual Manual Version 1 0 0 Release Date July 02 2012 Published by Phorward A Software Technologies Phorward Software Technologies Jan Max Meyer e K DorotheenstraBe 6 44137 Dortmund Germany 49 0 231 7251724 contact lt at gt phorward lt dash gt software lt dot gt com http www phorward software com Copyright 2006 2012 by Phorward Software Technologies Jan Max Meyer Modifications on this manual are not permitted All mentioned third party products and solutions are in the respective trademarks and rights of their owner Responsiblity for all web links provided in this manual remains to the publishers of the particular website UniCC Parser Generator Table of Contents 1 Introducing the UniCC parser generator nennen 1 LE Welcome to Umi i EE jl L2 The intention Behind Une nern a 2 1 3 A fow words on this MANUAL ae au 3 1 4 Contributions to UNE ne 4 2 Constructing parsers with Uni C neis ee sen 5 2 1 What does parsme mean see aaa 5 2 2 enne Sei aaa aa a a 6 2 3 Grammar ee EL EE 10 2 0 building a E 12 2 3 Wet S COMPUCTS ns 15 2 3 1 Using semone SOlOHS
18. Calls the function prefix_lexem which returns the semantic content of the scanned lexem as UNICC_SCHAR pointer UNICC_SCHAR is of type wehar_t if UNICC_WCHAR is switched else of type char In UTF 8 mode the memory behind this pointer is allocated but memory allocation is done by the parser template automatically and connected to the exem member of the parser control block The pointer is always a zero terminated string gt UNICC SCHAR Defines the length of the matched string in characters This is in UTF 8 mode lt int also the amount of characters not the amount of bytes To get the number of characters in UTF 8 mode call strlen gt 106 5 4 Contributions UniCC Parser Generator Table 9 Macros in lexer related actions and their behavior in the UniCC Standard C Parser Template The following example shows how to use the macros ident lt char gt A Za z This works in UTF 8 mode and without Unicode support printf ident length d gt s lt n lt gt This works in wide character mode printf ident length d gt S lt n lt gt dif UNICC ON SHIFT Copy current token strdup gt endif Zb 5 6 1 Semantic terminal selections The feature of semantic terminal determination is turned off by default within the UniCC Standard C Parser Template because it may be the source for memory leaks in the sensitive parser mode It can be
19. EOF eof UNICC CHAR but can be explicitly set to any other value This member can be changed and defines the dynamic end of file handing error count int Current number of errors error delay int This is set during error recovery to reduce inherited errors idx int Index of reduced production is_eof UNICC_BOOLEAN Defines if end of file character has already been read len unsigned int Holds the length of the matching string in the current input buffer Retrieves a semantic string pointer to the current input This member is lexem UNICC SCHAR filled by the prefix_lexem function and relates to the input type switching UTF 8 or Unicode Contains the current line number within the input It can be used for error line unsigned int reporting in combination with the column member variable First line begins at 1 Ihs int Left hand side index during reduction next UNICC CHAR Temporary character holding space Holds the ID of the old lookahead symbol during error recovery It will old sym int automatically be re used and reset ret prefix_vtype Last return value of reduction action or symbol to be shifted stack Oprefix tok The parser state and value stack stacksize junsigned int Variable to determine the maximum stack size of stack sym int Holds the ID of the lookahead symbol which is the current token 5 7 Q Q prefix pcb parser control block 109 UniCC Parser Generator test Oprefix
20. There is a downloadabe setup package on the offical UniCC product website hostet at Phorward Software Technologies The setup package contains both the 64 bit or 32 bit binary versions of UniCC which can be installed as wanted The system runs out of the box after installation It only has to be arranged with the favorized compiler and or IDE that is used for productive usage On Linux it is recommended to follow the steps described in the following chapter and compile UniCC from source Currently there is no RPM or similar package available but may follow soon 3 Using UniCC 59 UniCC Parser Generator 3 3 Building UniCC from source UniCC is a product that was entirely established and developed on top of the Phorward Foundation Toolkit to be more exactly most parts of the Phorward Foundation Toolkit are having their origin in earlier development stages of the UniCC Parser Generator but where moved into libraries for usage outside UniCC The Phorward Foundation Toolkit and its library libphorward provide many useful functions for general purpose and extended software development tasks including standard data structures a system independent interface extending data types and regular expression management functions required by UniCC to construct the lexical analyzers The Phorward Foundation Libraries are released under the BSD License more information can be obtained from the official product website at http phorward phorward soft
21. debug s debug current token 7 0 9 debug sym 7 0 9 len 1 tos gt state 0 act reduce idx 13 debug Stack Dump 0 debug lt lt reducing by production 13 amp whitespace gt debug after reduction shifting nonterminal 14 amp whitespace debug current token 7 0 9 debug sym 7 0 9 len 1 tos gt state 2 act shift idx 7 debug Stack Dump 0 2 amp whitespace 122 5 15 Configuration Macro Reference debug debug debug debug debug debug debug debug debug debug debug debug debug debug debug debug debug debug debug debug debug debug debug debug debug debug debug debug debug debug debug debug debug debug debug debug debug debug debug debug debug debug debug debug debug debug debug debug debug debug debug debug debug debug debug debug debug debug debug debug debug debug 5 15 2 shifting terminal 7 lt lt reducing by product after reduction shift current token 1 sym 1 len 1 Stack Dump 0 2 amp whit lt lt reducing by product after reduction shift lt lt reducing by product after reduction shift lt lt reducing by product after reduction shift current token 1 sym 1 len 1 Stack Dump 0 2 amp whit gt gt shifting terminal 1 current token 7 0 9 sym 7 0 9 len 1 Stack Dump 0 2 amp whit lt lt
22. expression 5 88 go variable function int gt identifier LS if xpl_get_function identifier gt 0 symbol function else xpl get variable pcb prog identifier Parser Control Block pcb xpl_program prog FILE input Prologue amp Epilogue prologue finclude xpl h extern xpl_fn xpl buildin functions define UNICC GETINPUT fgetc pcb gt input epilogue int xpl compile xpl program prog FILE input prefix_pcb pcb memset amp pcb 0 sizeof prefix_pcb 2 6 4 Implementing the compiler 57 UniCC Parser Generator pcb prog prog pcb input input pcb eof EOF prefix_parse amp pcb return pcb error count The entire compile and executable source code of the XPL project can be found in appendix 1 of this manual 58 2 6 4 Implementing the compiler 3 Using UniCC 3 1 Overview The UniCC parser generator is a software program that compiles augmented parser definitions into program modules of a higher programming language or optionally into a structured not further targetted parser description file A parser definition analyzed by UniCC is an UTF 8 ASCII formatted textfile that generally contains definitions for terminal symbols nonterminal symbols and productions to describe a context free grammar These definitions are expressed in a Backus Naur
23. expression expression 2 expression gt expression expression 3 expression gt expression expression 4 expression gt expression expression unicc warning state 17 Shift reduce conflict 1 expression gt expression expression 2 expression expression expression 2 expression gt expression expression 3 expression gt expression expression 4 expression gt expression expression unicc warning state 17 Shift reduce conflict 1 expression gt expression expression 2 expression gt expression expression 2 expression gt expression expression 3 expression gt expression expression 4 expression gt expression expression unicc warning state 17 Shift reduce conflict 1 expression gt expression expression 2 expression gt expression expression 2 expression gt expression expression 3 expression gt expression expression 4 expression gt expression expression unicc warning state 18 Shift reduce conflict 1 expression gt expression expression 2 expression gt expression expression 3 expression gt expression expression 3 expression gt expression expression 4 expression gt expression expression unicc warning state 18 Shift reduce conflict 1 expression gt expression expression 2 expression gt expression expression 3 expression gt expression
24. gt expression warning state 18 xpression gt expression xpression gt expression xpression gt expression xpression gt expression xpression gt expression warning state 18 xpression gt expression xpression gt expression xpression gt expression xpression gt expression xpression gt expression warning state 19 xpression gt expression xpression gt expression xpression gt expression xpression gt expression xpression gt expression warning state 19 xpression gt expression xpression gt expression xpression gt expression xpression gt expression xpression gt expression warning state 19 xpression gt expression xpression gt expression xpression gt expression xpression gt expression xpression gt expression warning state 19 xpression gt expression xpression gt expression xpression gt expression xpression gt expression xpression gt expression expression expression expression expression expression amp eof lookahead amp eof lookahead amp eof lookahead amp eof lookahead amp eof lookahead amp eof lookahead amp eof tat tat tx rat ry rox d rox d IER IER rox d IER rox v grammar defines each operator production as expression operator expression and expression itself can exists of such a composition again the grammar
25. gt type XPL STRINGVAL prefer XPL STRINGVAL else if op 0 gt type XPL FLOATVAL op 1 gt type XPL FLOATVAL prefer XPL FLOATVAL else prefer XPL INTEGERVAL switch rt ip gt op case XPL ADD Addition or with Strings concatenation if prefer XPL STRINGVAL char str str xpl malloc char NULL strlen xpl value get string op 0 strlen xpl value get string op 1 1 sizeof char sprintf str s s xpl value get string op 0 xpl value get string op 1 val xpl value create string str 0 else if prefer XPL_FLOATVAL val xpl_value_create_float xpl value get float op 0 xpl value get float op 1 else val xpl_value_create_integer xpl value get integer op 0 xpl value get integer op 1 break case XPL_SUB Substraction if prefer XPL_FLOATVAL val xpl_value_create_float xpl value get float op 0 2 6 3 6 Implementing the virtual machine UniCC Parser Generator xpl_value_get_float op 1 else val xpl_value_create_integer xpl value get integer op 0 xpl value get integer op 1 break case XPL MUL Multiplication if prefer XPL_FLOATVAL val xpl value create float xpl value get float op 0 xpl value get float op 1 else val xpl_
26. lt ATTLIST goto symbol id NMTOKEN REQUIRED gt lt ATTLIST goto to state NMTOKEN IMPLIED gt A lt ELEMENT left hand side EMPTY gt lt ATTLIST left hand side offset 0 1 REQUIRED gt lt ATTLIST left hand side symbol id NMTOKEN REQUIRED gt lt ELEMENT lexer state gt lt ATTLIST lexer id NMTOKEN REQUIRED gt lt ELEMENT lexers lexer gt lt ELEMENT parser version copyright description symbols productions states lexers value types prologue epilogue pcb source gt lt ATTLIST parser basename NMTOKE REQUIRED gt lt ATTLIST parser char max NMTOKE REQUIRED gt lt ATTLIST parser char min NMTOKE REQUIRED gt lt ATTLIST parser mode NMTOKEN REQUIRED gt lt ATTLIST parser name NMTOKEN REQUIRED gt lt ATTLIST parser prefix NMTOKEN REQUIRED gt 7 Appendix Il The UniCC Document Type Definition DTD 155 UniCC Parser Generator lt ATTLIST parser source NMTOKEN REQUIRED gt lt ATTLIST parser target language NMTOKEN REQUIRED gt lt ATTLIST parser unicc version NMTOKEN REQUIRED gt lt ELEMENT pcb PCDATA gt
27. 0 named name gt lt code defined at 55 gt lt raw gt lt raw gt lt variable target left hand side value type int value type id 0 gt raw amp amp variables lt raw gt lt variable target right hand side offset 0 gt raw a lt raw gt code lt production gt which can be easily converted into another representation The entire document type definition DTD of the UniCC Parser Description Files is printed in the appendix 2 of this manual 66 3 5 2 The XML based parser description generator UniCC Parser Generator 3 6 The parser construction modes UniCC is a flexible parser generator that can handle two different methods to construct its parsers and their lexical analyzators The first and defaultly used method is called the sensitive parser construction mode This construction mode is a speciality of UniCC and gives a maximum of flexibility to implement parsers for nearly any type of context free language UniCC analyzes and rewrites the grammar according to several rules influencing whitespace and lexeme detection and separation The lexical analysis including the handling of whitespace is broken down to single input characters enabling full context free grammars on lexem level A lexical analysis is still done silently in the background but with the option that there is no direct cut between lexer and parser required The second method called insensitive pa
28. allows for parsing it as 2 3 and as 2 3 which results in completely different parse trees It is correct that the order of the operators in additions doesn t care but what about 2 3 Here again the grammar from above allows both for 2 3 and 14 2 3 but only latter one is a valid result we want to accept expression expression expression expression integer integer 1 2 2 5 2 Precedences expression integer 3 23 UniCC Parser Generator expression xr Md expression gs expression expression 1 integer et 2 3 Fig 5 Two different parse trees are yielding in ambigous grammars simplyfied For these cases precedence and associativity assignments to terminals can be used to define the correct way of how UniCC should handle these conflicts Using a left directive we give terminal symbols a left sided associativity which means that the left expression of the terminal is resolved first When we do this to addition and subtraction these operators will resolve the left expression first so the parse tree grows left derivative The input string 2 3 will then be parsed as 1 2 3 and 142 4344 will be parsed as 1 2 3 4 Because substraction is on the same precedence level as addition 2 3 will correctly be parsed as 2 3 But what about multiplication and division When we assign the same left bounded associativity to these operators 2 3 will be incorrect
29. at maximum line xpl_buildin_functions function name xpl_buildin_functions function max return 1 return 0 Build in functions follow xpl_value XPL_print int argc xpl_value argv int SE for i 0 i lt argc itt printf Ss n xpl value get string argv i return xpl_value NULL xpl_value XPL_prompt int argc xpl_value argv char but T 2560 s t Ts if argc gt 0 printf Ss xpl value get string argv 0 if fgets buf sizeof buf stdin buf strlen buf 1 N0 return xpl value create string buf 1 return xpl value create string 1 xpl value XPL exit int argc xpl value argv int rc 0 if argc gt 0 rc xpl value get integer argv 0 exit rc return xpl value NULL xpl value XPL integer int argc xpl value argv return xpl_value_create_integer xpl_value_get_integer argv xpl_value XPL_float int argc xpl_value argv 42 return xpl value create float xpl value get float argv 2 6 3 4 xpl fn Integrating some useful functions UniCC Parser Generator xpl_value XPL_string int argc xpl_value argv return xpl_value_create_string xpl_value_get_string argv 1 2 6 3 5 xpl runtime the runtime data structure This implementation of an XPL compiler and interpreter distinguishes between static data describing
30. begin using UniCC and to become familar with the UniCC parser definition language Authoring of this section has been started already a few years ago during earlier development stages of UniCC so this is the reason why it is not up to date with all the technical possibilities UniCC provides right now But it s a good place to start The goal of this quick start guide is the implementation of a small programming language compiler called xpl The second section is the UniCC reference guide relating to the installation or build the use of the UniCC command line interface and the general features of UniCC The third section directs to the grammar definition language and all its features in a detailed way for topic based lookup This chapter also includes practical examples and snippets on feature related problems and their solutions The fourth section provides information about the Standard C Parser Template delivered with the UniCC software package It should be mostly attended by C programmers who want to develop parsers with UniCC targeting the C programming language There will be more sections or separate manuals like this one for other target languages e g C or Java as soon as they are implemented and well tested The manual will be continously be updated and extended with more or detailed information Hopefully it answers all of your questions coming up when UniCC shall become the workhorse of your upcoming compiler project If not
31. bit around with the parser looks like this unicc w dates par cc o dates dates c 2 4 Building a working parser 13 dates el 2010 ax Hell tax ax ax ax ax ax An important error Oo error SLLOrF error error error error ok August 24 ok 23 5 2011 ok Holidays 24 7 line 1 syn ok 29 10 2010 line 1 syn ok line syn ok line syn ok line syn ok line syn ok line syn ok 14 on on on on on on on Birthday of Mr x token Lo Lo Lo Lo Lo Lo Ken Ken Ken Ken Ken Ken UniCC Parser Generator meeting amp eof amp eof 2 4 Building a working parser UniCC Parser Generator 2 5 Writing compilers With above examples we only created parsers that match valid and reject invalid input Based upon the above grammar for parsing dates it can now be simply turned it into a real compiler 2 5 1 Using semantic actions This example is not a compiler in terms of a programming language but some kind of converter to compile an input date into an other output format For this purpose the parser is augmented with semantic operations to be performed during the parse A semantic action is a piece of program code that is executed on a part of the parse tree when a production rule has been entirely matched This causes an internal reduction of the rule to its left hand side the nonterminal
32. can be used everywhere in the grammar definition behind nonterminal and terminal symbols on the right hand side and invokes the so called virtual production feature of UniCC For each of these virtual productions UniCC automatically inserts a virtual nonterminal that implements the desired syntactical element in its correct well formed grammatical structure UniCC provides the following three virtual production operators which are well known from regular expressions for kleene closure multiple or none repetitions for positive one or multuple repetitions e for optional closure one or none repetition For the above shortcut the contract out version would be title gt UM virtuall FH virtuall virtual2 virtual2 Sea virtual2 ai You can see Much lesser efforts in writing the grammar by reaching the same effect right 2 3 Grammar modularity 11 UniCC Parser Generator 2 4 Building a working parser UniCC is a parser generator This means that is generates parsers from such as the above shown grammar definitions into an adequate simplified parser representation UniCC parses grammar definitions performs some revisions on it constructs the parse tables and lexical analyzers which are required by the underlying parser driver to match the defined grammar and produces some output In terms of a compiler writer UniCC is nothing else than a compiler to compile grammars into parsers So this is als
33. don t avoid to drop a mail to get individual support and help with UniCC 1 3 A few words on this manual 3 UniCC Parser Generator 1 4 Contributions to UniCC Altought the UniCC parser generator is under continuous development since 2006 its initial release to the public as an open source project had been in summer 2011 This means that UniCC is a relatively young open source project but also a proven and well tested software saddled and ready for many new creative challenges In order to get even more better and famous any kind of contribution to UniCC is acutely welcome Code fixes enhancements and documentation contributions to the UniCC project are always accepted if they are in favor of the general public and in the interest of Phorward Software Technologies and its product strategies behind UniCC Help is also appreciated in form of support and advertise From the project view the UniCC software package consists currently of two projects e The UniCC LALR 1 Parser Generator which is an open source project released under the Artistic License version 2 It has its product website located under http unicc phorward software com e The UniCC Standard C Parser Template C tlt which is part of the UniCC software package and released under the BSD license This project also has its own website located under http cparser phorward software com UniCC uses its Standard C Parser Template to bootstrap its own parser out of its
34. expression xpl emit pcb gt prog XPL_MUL O0 expression expression xpl emit pcb gt prog XPL DIV O0 K expression xpl emit pcb gt prog XPL LIT xpl get literal pcb gt prog xpl value create integer 1 xpl emit pcb gt prog XPL MUL O0 3 precedence expression real LS xpl emit pcb gt prog XPL LIT 6 2 xpl par 131 parameter_list lt int gt variable function lt int gt gt 132 UniCC Parser Generator xpl_get_literal pcb gt prog xpl_value_create_float real Fa string Remove the quotation marks from the string Ef string strlen string 1 0 Generate code Ay xpl_emit pcb gt prog XPL LIT xpl_get_literal pcb gt prog xpl_value_create_string string 1 1 variable IS xpl emit pcb gt prog XPL LOD Gvariable z function parameter_list Semantic checks if the function parameters are in a valid count y if xpl check function parameters function Gparameter list pcb line pcb error countt We first push the number of parameters xpl emit pcb gt prog XPL LIT xpl get literal pcb gt prog xpl value create integer Gparameter list Then we call up the function xpl emit pcb gt prog XPL CAL function XI parameter_list expression parameter_list expression expres
35. expressions Terminal symbols based on regular expressions are the most flexible style of expressing a terminal symbol They can be made up of a fully fledged type 0 regular language Regular expressions must explicitly be defined in their own definition block within the grammar using the following syntax identifier regular expression term semantic actions for the lexical analysis The character introduces the regular expression both in its definition and wherever the symbol is used The regular expression itself is defined by sequences of the following already known syntactical elements Construct Usage Specifies a character character class or negated character class eene Specifies a static string Specifies a character class standing for any character Using this construct causes the terminal to be configured as non greedy and Parantheses to build sub expressions equal to embedded productions The alternative operator to define multiple expressions at one expression level al Kleene closure none or several of previous expression modifier Positive closure one or several of previous expression modifier 5 Optional closure none or one of previous expression modifier Table 4 Lexical symbols for regular expression construction 74 4 4 1 1 Characters UniCC Parser Generator To get more familar with this syntax a few examples follow A simple identifie
36. for such a terminal To define all characters from a to f as valid input abcdef and a as range definition is possible To define a terminal matching the characters from a to z 0 to 9 and the equal sign a z0 9 can be specified If the range is negative for example z a its still interpretered as a z To define the dash itself as part of the character class it the best to express it as first character of the character class string or in a place where no dash is interpreted as range definition for example or a z UniCC accepts escape sequences to define special characters and unprinable characters UTF 8 characters are directly accepted For example Nn is a valid definition with an escape sequence for newline and the Euro currency sign Character terminals can be negated using an preceding exclamation mark For example a z accepts all characters except the range from a to z this will include all characters in the UniCC character universe from 0x0 to OxFFFF except above range In the UniCC Standard C Parser Template a character terminal is always associated with the data type int containing the character code of the matched character This value can be used in semantic actions digit int 0 9 dig Gdig 0 4 4 Grammars 73 UniCC Parser Generator It depends on the used implementation language and parser template which data type is used for terminals expressed by c
37. if jmp gt 0 pcb gt prog gt program jmp param pcb gt prog gt program_cnt pcb gt prog gt program jpc param Qjmp 1 else pcb prog program jpc param pcb gt prog gt program_cnt while L pcb gt prog gt program_cnt begin expression LS xpl emit pcb gt prog XPL JPC O0 jpc Statement xpl emit pcb gt prog XPL JMP begin pcb prog program jpc param pcb prog program cnt A statement expression xpl emit pcb gt prog XPL DRP O0 2 6 4 Implementing the compiler expression UniCC Parser Generator variable expression LS xpl emit pcb gt prog XPL DUP xpl emit pcb gt prog XPL STO expression expression E xpl emit pcb gt prog XPL EQU bei expression expression LS xpl emit pcb gt prog XPL_NEO expression expression xpl emit pcb gt prog XPL LOT X expression gt expression LS xpl emit pcb gt prog XPL_GRT m expression lt expression xpl emit pcb gt prog XPL LEQ expression gt expression KS xpl emit pcb gt prog XPL_GEO Si expression expression L xpl emit pcb gt prog XPL ADD expression expression ES xpl emit pcb gt prog XPL SUB E expression expression L xpl emit pcb gt prog XPL MUL Ae 2 6 4 Implementing the compiler
38. in XPL We define a virtual execution machine for a stack based virtual machine interpreting 1 address code This simple virtual machine remains entirely independent from the overlying language here XPL It can also be seen as a target platform for any other kind of language e g a simple BASIC dialect 2 6 3 1 xpl value a dynamic value object The stack elements of the simple virtual machine are described in a structure called xpl value defined in xpl h Virtual machine values typedef enum XPL NULLVAL Nothing Undefined XPL INTEGERVAL Integer type XPL FLOATVAL Float type XPL STRINGVAL String type xpl datatype typedef struct xpl_datatype type Value type union 32 2 6 3 The virtual machine UniCC Parser Generator int ase float f char S value Value storage union char strval Temporary string value representation pointer xpl value To work with these value objects a functions library is implemented in xpl value c include xpl h Value Objects xpl value xpl value create void return xpl value xpl malloc char NULL sizeof xpl value xpl value xpl value create integer int i xpl_value val val xpl_value_create xpl_value_set_integer val i return val xpl_value xpl_value_create_float float f xpl_value val val xpl_value_create xpl_valu
39. information table is a structure of type prefix_prodinfo and can be refered as a static global variable called prefix_productions Typedef for production information table typedef struct char definition int length int lhs prefix_prodinfo Member Type Content definition char A string representing the production s definition as expressed in the UniCC parser definition length jint The length of the production This is the number of symbols on the right hand side The id of the default left hand side symbol indexing an entry in the symbol information Ihs int table Table 13 Member variables of prefix_prodinfo 5 8 4 prefix_prodinfo production information table 113 UniCC Parser Generator 5 9 Unicode support The UniCC Standard C Parser Template supports Unicode in various ways Unicode support means that it is capable to consume Unicode characters as input In general the template supports three types of input methods UNICC_CHAR UNICC_SCHAR Type Define switch Desciption internal external Unicode less mode works with ASCH symetric 8 bit ASCH input char char only UTF 8 UNICC_UTE8 UTF 8 mode supporting 8 bit default asymetric input sequences wchar t char Wide character 32 bit symetric input mode Table 14 Input type modes of the UniCC Standard C Parser Template IWide character UNICC_WCHAR wchar t wchar t
40. is empty n void xpl_run xpl_program prog xpl_runtime rt xpl_value val int a Initialize runtime memset amp rt 0 sizeof xpl runtime rt variables xpl value xpl malloc char NULL prog variables cnt sizeof xpl value rt ip prog gt program Program execution loop while rt ip lt prog gt program prog gt program_cnt fprintf stderr IP p n rt ip xpl_dump prog amp rt xpl_stack amp rt K 144 6 9 xpl run c UniCC Parser Generator switch rt ip gt op case XPL_NOP No nothing break case XPL_CAL Calling build in functions int param_cnt Last stack item contains the number of parameters val xpl_pop amp rt param_cnt xpl_value_get_integer val xpl_value_free val Call the function val xpl buildin functions rt ip gt param fn param cnt rt stack rt stack cnt param cnt If no value is returned create a default value if val val xpl value create integer 0 Discard the parameters from stack while param cnt gt 0 xpl value free xpl pop amp rt param cnt Push the return value xpl push amp rt val break case XPL_LIT Load literal and push duplicate xpl_push amp rt xpl_value_dup prog gt literals rt ip gt param break case XPL_LOD Load value from variable and pu
41. it belongs to which is then a part of another production again or at least the goal symbol For this purpose UniCC allows to store values into the different symbols used in each production definition this includes all symbols on the right hand side of the particular production and the left hand side symbol the nonterminal the production belongs to Storing a value to the left hand side means that it is taken over to the right hand side of the next rule reduction Before we drift now into a too complex textual clutter lets assign some semantic actions to the original simple date grammar from above whitespaces E AE lexeme integer default action 1 Defining the grammar dates month month integer day integer year printf 02d 02d 04d n day month year month January 1 February 2 March 3 April 4 May 5 June 6 July 7 August 8 September 9 October 10 November 11 December 12 integer gt 0 9 1 0 integer 0 9 Q0 Q1 10 Q2 0O That s everything required for a simple date format compiler parsing an input date in the format Name of Month Day Year and compiling it into the format Day Month Year 2 5 Writing compilers 15 UniCC Parser Generator In comparison to the grammar dra
42. language related code is read from tags defined in a parser template file which must follow a static structure that is pretended by the program module generator in order to construct the output code The standard C parser template provided with UniCC is a parser template of such kind It gives UniCC the ability to build parsers written in the C programming language which can be compiled after generation without any further modification If more parser templates will exist somewhere in the future UniCC will also be capable to generate program modules for parsers written in other programming languages like C C Java Pascal Fortran or anything else Given the very simple grammar mode insensitive parser Simple start gt Hello World UniCC constructs a C program that consists of more than 1000 lines of code using the program module generator in combination with the standard C parser template This output source can directly be passed to a standard C C compiler like gcc without any further modification 3 5 2 The XML based parser description generator The parser description generator outputs an XML based representation of the generated parser This parser description file contains the LALR 1 parse tables tables for the lexical analyzer conditioned semantic code a structured listing of all symbols and productions with all of its tagged information the original parser definition source and any warning or error messages
43. literals_cnt Number of elements in literals char variables The variable symbol table The index of the variable name represents the address in the variables member of the runtime data structure SCH int variables_cnt Number of elements in variables xpl program 38 2 6 3 3 xpl program representation of a compiled program UniCC Parser Generator To build up the program structure we also require some modification functions in xpl program c anda function xpl emit to emit the virtual machine code finclude xpl h Program structure handling int xpl_get_variable xpl_program prog char name int ase A function name with the same identifier may not exist if xpl get function name gt 1 return 1 Try to find variable index for i 0 i lt prog variables cnt i if stremp prog variables i name 0 return i Else eventually allocate memory for new variables if i XPL MALLOCSTEP prog gt variables char xpl_malloc char prog gt variables i XPL MALLOCSTEP sizeof char Introduce new variable prog gt variables prog gt variables_cnt xpl strdup name return prog gt variables_cnt int xpl get literal xpl program prog xpl value val Else eventually allocate memory for new variables if prog literals cnt XPL MALLOCSTEP prog gt litera
44. of build system but we will run them manually for now 12 2 4 Building a working parser UniCC Parser Generator The first step is the same on all platforms and with all versions of UniCC Store the grammar you want to compile into a format free text file let s use dates par as filename for the above grammar To invoke UniCC from a shell or Makefile simply type unicc w dates par If UniCC does output nothing at all the grammar is valid consistent and has been successfully compiled and generated without any errors and warnings If UniCC reports errors these must be fixed If errors are reported this always causes that no output respective a parser is generated Errors arise if UniCC comes into a situation where a valid result is not possible to generate or the algorithm on generating the parser is initiated with missing or incomplete data This can be e g a parse error in the input where no valid grammar can be recognized or wrong use of left hand side items within a productions which avoids replacing a placeholder within executable code with a valid item access actions Some warnings are normal in the daily use of UniCC Warnings can normally be ignored because the reference to automatically fixed grammar ambiguities or default mechanisms taking place if they where not explicitly defined If you want to suppress all warnings run UniCC without the w or warnings option Using the v or verbose option UniCC outputs some pro
45. on the right hand side Using this definition our grammar allows for integer numbers with one digit If we want to extend the grammar now to allow for one or multiple digits forming an integer we have to add a new production which calls nonterminal integer recursively besides its one digit rule Adding such a new production is done by separating the first production from the second using a pipe character Adding more productions requires even more pipes to separate them integer gt 0 9 integer 0 9 This is now a finished and valid grammar to detect multiple digit integer numbers The concept becomes increasingly clear when we see how the parser moves along the input for a number e g 321 and how it consumes productions each digit Let act as an end of input marker Input 321 21 1 us ert EE i I rox Productions a AN gt EE gt integer gt integer rue ger gt 0 9 integer 0 9 50 97 integer integer integer integer ee Den integer integer integer 1 Sa LL Parse tree integer 2 gt S integer 1 3 integer 2 EFF 3 integer 2 3 3 Fig 1 Construction of parse tree from the string 321 From this parse process and the structure how the input string is absorbed by the parser the following visualized recursion tree is constructed integer integer Sa Sap integer 1 integer 2 3 Fig 2 Parse tree derived from the input 321 simplyfied Th
46. produced by UniCC during the parser construction process Third party programs can work on this information to generate individual parser code or code parts directly interpret analyze modify represent or rewrite the compiled parser for any desired purpose To trigger the parser description generator UniCC must be run with the x or X command line option x runs both the program module generator and the parser description generator X will only run the parser description generator With the very simple grammar from above mode insensitive parser Simple start gt Hello World UniCC causes to build a grammar definition file like this when using the parser description file generator lt xml version 1 0 encoding UTF 8 gt lt DOCTYPE parser SYSTEM unicc dtd gt 3 5 1 The program module generator 63 UniCC Parser Generator lt parser unicc version 1 0 0 mode insensitive name Simple source simple par basename simple target language C char min 0 char max 65534 gt lt symbols gt lt symbol type terminal id 0 name Hello terminal type string defined at 4 gt lt symbol type terminal id 1 name World terminal type string defined at 4 gt symbol type terminal id 2 name amp amp eof terminal type system lt symbol type non terminal id 3 name start defined at 4 gt lt symbol type non terminal id 4 name World deri
47. reducing by production 21 calculator gt amp whitespace calculator debug goal symbol reduced exiting parser debug parse completed with 0 errors 5 15 3 UNICC ERROR DELAY Defines the error delay that is used to mark a parse error as successfully recovered If there raise parse errors within this number of shifts after an inital parse error during error recovery no following error will be reported UNICC ERROR DELAY is set to 3 shifts by default 5 15 4 UNICC GETINPUT Defines a macro that is called when a character is fetched for buffing in the current input The UniCC Standard C Parser Template buffers input according to the needs of the lexer This buffering is done automatically but it is required to provide a way how characters are read from the input By default UNICC GETINPUT is defined as define UNICC GETINPUT prefix_utf8_getchar getchar in UTF 8 mode and define UNICC GETINPUT getchar in all other modes By re defining it a function must be given that returns the next character and moves the input pointer one character to the next one It must be assured that no more input is returned by UNICC_GETINPUT when the end of file marker has been reached One could define UNICC GETINPUT as define UNICC GETINPUT pcb gt inputstring 124 5 15 3 UNICC ERROR DELAY UniCC Parser Generator to read from a buffered string but this could read ov
48. side symbols via they offset Access right hand side via their speaking alias names which can be an identifier or a string with blanks and other special characters when put in quotation marks lt name gt lt name gt lt name gt Specifiy semantic value dependent nonterminal within a production s symbol name o reduction code Table 6 Semantic macros to be used within reduction actions The reduction value also refered as the left hand side value can be accessed with the macro boolean gt true GG false GG 0 ll bi There are several ways to access symbols on the right hand side The simplest is to access them by their offset of appearance This is done with the variables 1 02 3 etc UniCC validates all used semantic macros within reduction code blocks so if there is an access to offset 3 in a production that has only two symbols it will drop an error A simple augmented grammar Get integer from input integer WERE atoi gt Parse tiny expressions example gt expr printf Sd n Q1 expr gt integer integer G1 03 zi integer integer Q1 G4 zl Error UniCC does also support the feature of providing individual reference identifiers for every symbol of the right hand side using the syntax symbol identifier Forexample a z charor name ident would be
49. so far To use it with a parser the top level directive language C must be set and the UniCC target language template file C t 1t must be in the path directed by UNICC_TPLDIR Without any configuration the C standard template results in parsers that are perfectly useable for grammar testing by reading character by character from stdin Errors are printed to stderr To integrate the parser into other execution environments some more configuration is necessary Configuration of the parser is performed with the use of C preprocessor directives only which must be defined in the header code to disable their default declarations All symbols of the parsers which are or could become visible from outside the parser module contain a prefix value which is configured using the prefix directive 5 6 Value types in semantic actions The UniCC Standard C Parser Template sets by default the data type int for all symbols used in the grammar A character class terminal always returns the matched character as int value test 0 9A Za z printf d Sc n G1 char 1 In semantic actions submitted to regular expression terminals the following table shows data types and description Macro C Datatype Usage ES Sie Defines the return value that is associated with the terminal This variable is of ee terminal symbol default is int the same type that is specified for the symbol or the default type
50. start which introduces new virtual nonterminals amp whitespace amp whitespace amp whitespace Hello and start But it uses a fully fledged nonterminal whitespace which could possibly exist of very flexible grammatical constructions The behavior of terminal symbols within the revision can also be influenced by the directives lexeme ffixate and lexeme separation Whitespace becomes part of the grammar but its existence is hidden from the semantic actions of the grammar developer There is no obvious indicator if whitespace handling has been done this can only be indicated on demand 4 5 8 2 whitespaces in insensitive mode In the insensitive parser construction mode the whitespaces directive only accepts terminal symbols Trying to specifiy a nonterminal at whitespaces results in an error because whitespace consumed by the parser can t be handled here The grammar is not rewritten as in sensitive mode All specified terminal symbols are flagged as whitespace and are simply ignored when read by the resuling lexical analyzer The grammar is not rewritten but the grammar developer is limited to only use character classes strings and regular expressions as whitespaces symbols without a flexible grammar beyond Resulting grammars result in much lesser states and will be parsed faster 4 5 9 lexeme The 4 1exeme directive can only be used within the sensitive parser construction mode It configures nonterminal symbo
51. switched on by setting the define UNICC_SEMANTIC_TERM_SEL to 1 It is strongly recommended to use the feature of semantically determined nonterminals instead If the feature of semantically determined terminal recognition is turned on semantic actions may be run multiple times in some cases to correctly identify the current input in combination with semantically determined terminal selection actions For these special cases code blocks that allocate memory for token return should be expressed in blocks of if UNICC_ON_SHIFT endif conditioned directives to avoid multiple memory allocations and lost of allocated pointer addresses The allocation of memory in above example could be performed like this ident name lt char gt A Za z Semantic terminal determination if strcmp gt name 0 symbol name dif UNICC ON SHIFT Copy current token strdup gt endif This causes the block within UNICC_ON_SHIFT only will be executed if the token had been clearly identified before and is stacked 5 6 Value types in semantic actions 107 UniCC Parser Generator 5 7 prefix_pcb parser control block The parser control block is the main data structure which is passed to all parser related functions in order to provide any runtime information on the current parser state input buffering and much more It is always accessable within all configuration macros as well as semantic codes and s
52. the compiled XPL program and dynamic data that is only required at program runtime The compiler creates and modifies the pl program object the interpreter only reads the xp1_program object and works on a xpl_runtime object containing the variable data and stack Runtime structure typedef struct xpl value variables Array of objects representing the variable values The number of objects is in the corresponding xpl program structure in the member variables cnt NA xpl value stack Array representing the value stack int stack cnt Defines the top of stack and the numbers of elements resisting Xy xpl_cmd ip Instruction Pointer to the currently executed cod address 3 xpl_runtime 2 6 3 6 Implementing the virtual machine Finally the virtual machine is implemented into xp1 run c as function xpl_run include xpl h extern xpl_fn xpl_buildin_functions static int xpl push xpl runtime rt xpl value val if rt gt stack_cnt XPL MALLOCSTEP 0 rt gt stack xpl_value xpl_malloc char rt stack rt stack cnt XPL MALLOCSTEP sizeof xpl value rt stack rt 5stack cnt val return rt 5stack cnt t static xpl value xpl pop xpl runtime rt 2 6 3 5 xpl runtime the runtime data structure 43 UniCC Parser Generator if rt gt stack_cnt return xpl_value NULL return rt gt stack rt gt stack_cn
53. token number 1 from the left It is also possible to assign meaningful names to right hand side symbols simply by separating the symbol from the identifying name using a colon This is done in the production dates month integer day integer year printf 02d 02d 04d n day month year Ed from above so we are able to access month day and year over their meaningful names instead of their position offsets There is no label specified for month because UniCC automatically associates the nonterminal s name with an right hand side value specifier This default can be overwritten by defining it manually as its done with day and year By using the position offsets the same result could be reached by writing date month integer integer printf 02d 02d 04d n Q2 Q1 Q4 If names are given the symbols of the right hand side can be accesses both via offset or by name The 16 2 5 1 Using semantic actions UniCC Parser Generator advantage of using identifying names is that no changes in the semantic action code is required if the production symbol order changes e g when a new separation symbol is introduced between two symbols To assign a value to the left hand side which is the upper node in the parse tree an placeholder is used never contains a value it is initialized to zero and is only used to pass a result from one successfully recognized production to another at the time
54. 3 2 Defining a code set 37 UniCC Parser Generator Virtual machine opcodes typedef enum XPL_NOP No operation XPL_CAL Call function XPL_LIT Load literal value XPL_LOD Load value from variable XPL_STO Store value into variable XPL_DUP Duplicate stack item XPL_DRP Drop stack item XPL_JMP Jump to address XPL_JPC Jump to address if false XPL_EQU Check for equal XPL_NEQ Check for not equal XPL_LOT Check for lower than XPL_LEQ Check for lower equal XPL_GRT Check for greater than XPL_GEO Check for greater equal XPL_ADD Add or join two values XPL_SUB Substract two values XPL_MUL Multiply two values XPL_DIV Divide two values xpl op Command description typedef struct xpl_op op int param xpl_cmd 2 6 3 3 xpl_program representation of a compiled program To provide a data representation of a compiled program XPL requires a structure holding a symbol table for variables the used literals and the program code All information in this structure should remain untouched at program runtime and is only build up by the compiler Program structure typedef struct xpl_cmd program Virtual machine program int program_cnt Numer of elements in program xpl_value literals Array of literal objects int
55. 5 11 main the build in main function The UniCC Standard C Parser Template provides an automatically constructed main function This function will be made available if no semantic code is submitted to UniCC s epilogue directive or if the define UNICC_MAIN is manually configured to 1 The build in main function provides a test mode to test the parser s behavior Input is read from stdin When a parser is compiled with the build in main function it is possible to provide the following command line parameters to change the parser s test mode behavior Short N option Long option Usage s silent Run in silent mode so no additional output is printed lc endi ss Runs the parser in an endless loop A detected end of file EOF stops the parser and starts a new parser Runs the parser in line mode where the newline character is configured as l line mode end of file token Table 16 Command line options of the build in main function If UNICC SYNTAXTREE is enabled a syntax tree will be printed to stderr after the input has been correctly parsed Given a grammar Some grammar related directives language Ch whitespaces TAE flexeme integer default action Q1 left rer tat left 1x At Defining the grammar calculators expression printf Sd n expression expression gt expression expression GG Q1 03 zl expression expres
56. ED gt lt ATTLIST semantic right hand side symbol id NMTOKEN REQUIRED gt lt ELEMENT shift EMPTY gt lt ATTLIST shift symbol id NMTOKEN REQUIRED gt lt ATTLIST shift to state NMTOKEN REQUIRED gt lt ELEMENT shift reduce EMPTY gt lt ATTLIST shift reduce by production NMTOKEN REQUIRED gt lt ATTLIST shift reduce symbol id NMTOKEN REQUIRED gt lt ELEMENT source PCDATA gt lt ELEMENT state goto shift shift reduce transition gt lt ATTLIST state accept NMTOKEN IMPLIED gt lt ATTLIST state default production NMTOKEN IMPLIED gt lt ATTLIST state default transition NMTOKEN IMPLIED gt lt ATTLIST state derived from state NMTOKEN IMPLIED gt lt ATTLIST state id NMTOKEN REQUIRED gt lt ATTLIST state lexer NMTOKEN IMPLIED gt lt ELEMENT states stater gt lt ELEMENT symbol character class code dfa regex gt lt ATTLIST symbol defined at NMTOKEN IMPLIED gt lt ATTLIST symbol derived from NMTOKEN IMPLIED gt lt ATTLIST symbol id NMTOKEN REQUIRED gt 156 7 Appendix Il The UniCC Document Type Definition DTD UniCC Parser Generator
57. Form styled notation but with many powerful extensions and the possibility of integrated semantic code segments In most cases a parser definition also contains several configuration directives or simply called directives They are used for the configuration of symbol task generation and augmentation related features There are also a few directives that must be defined before any directive or grammar construction is done because they influence general settings that cannot be changed later on These directives are called top level directives Additionally a parser definition file can contain operational programming code that is revised by UniCC and inserted into appropriate positions within the resulting program module These code fragments have the purpose to fit a particular need within the parsing process Code blocks can be specified to various parser directives to productions and some special terminal definitions Due the target language independency of the UniCC parser generator and its parser definition language a parser definition file can also contain additional information called tags These tags can be defined globally or associated with various grammatical objects Use of this feature is in the interest of subsequent from UniCC detached tasks which perform operations on the output of UniCC and the use this additional information for various purposes or results 3 2 Setting up UniCC Setting up UniCC on Windows is quite simple
58. I tts whitespaces whitespace whitespace gt t n DH lexeme integer integer gt 0 29 DH starts expr DH expr gt expr expr expr expr expr expr expr expr Ft expr Wb integer whitespace is configured as whitespace nonterminal here integer is a lexem that is triggered as a coherent lexical unit where not whitespace is allowed within Using this grammar to parse the input string 78 2 the following syntax tree will be constructed The nonterminals amp whitespace amp whitespace and amp whitespace are automatically inserted by UniCC Also the symbol r n likewise a generated nonternimal symbol has been inserted by UniCC during character class separation Grammars modified by UniCC are heavier to read but they gain this unique features in flexibility start amp whitespace start m 5 m K UN a cm T uS Ans me MC 0 94 amp whitespace amp whitespace 0 9 0 9 0 9 Ee RSA oa 0 9 se me Fig 8 Syntax tree of the string 18 2 using a sensitively constructed parser The underlying rewritten grammar that is generated by UniCC can be dumped by using the G command line option 68 3 6 1 Sensitive mode UniCC Parser Generator 3 6 2 Insensitive mode The whitespace insensitive parser construction mode can be compared to most existing parser generators It strictly separates the part of the lexical analysis from the parser Whitespace is onl
59. L return val value i case XPL FLOATVAL return int val gt value f case XPL STRINGVAL return atoi val value s default break 34 2 6 3 1 xpl value a dynamic value object UniCC Parser Generator return 0 float xpl value get float xpl value val switch val gt type case XPL INTEGERVAL return float val gt value i case XPL FLOATVAL return val value f case XPL STRINGVAL return float atof val gt value s default break return 0 0 char xpl value get string xpl value val char buf 128 1 7 char p val gt strval xpl free val gt strval switch val gt type case XPL_INTEGERVAL sprintf buf d val gt value i val gt strval xpl_strdup buf return val gt strval case XPL_FLOATVAL sprintf buf Sf val value f Remove trailing zeros to make values look nicer for p buf strlen buf 1 p gt buf p if p kp 0 break else if p 0 break kp 0 val gt strval xpl strdup buf return val gt strval case XPL STRINGVAL return val value s default break return 2 6 3 1 xpl value a dynamic value object 35 36 UniCC Parser Generator 2 6 3 1 xpl_value a dynamic value object 2 6 3 2 UniCC Parser Generator Defining a code set The internal assembly language for our virtual machine cons
60. L LOT XXX Se 134 yk Nothing Undefined Integer type Float type String type Value type Value storage union Temporary string value representation pointer Function name Minimal parameter count Maximal parameter count Function callback pointer No operation Call function Load literal value Load value from variable Store value into variable Duplicate stack item Drop stack item Jump to address Jump to address if false Check Check Check Check for for for for equal not equal lower than lower equal 6 3 xpl h XPL_GRT XPL_GEQ XPL_ADD XPL_SUB XPL_MUL XPL_DIV xpl op Command description typedef struct xpl_op Op int param xpl_cmd Program structure typedef struct xpl_cmd program int program_cnt xpl_value literals int literals_cnt char variables int variables_cnt xpl program Runtime structure typedef struct xpl_value variables xpl_value stack int stack_cnt xpl_cmd ip xpl runtime Some general defines define XPL MALLOCSTEP 6 3 xpl h UniCC Parser Generator Check for greater than Check for greater equal Add or join two values Substract two valu
61. NICC WCHAR Switches wide character mode on if configured as 1 If UNICC WCHAR is set UNICC UTFS will be automatically disabled 126 5 15 9 UNICC REDUCE UNICC SHIFT 6 Appendix I The XPL programming language This is the appendix that includes the entire source code of the XPL programming language example from the tutorial To obtain the latest version of the XPL source code presented here access the Mercurial source repository using hg clone http unicc hg sourceforge net 8000 hgroot unicc xpl 6 1 Makefile XPL xpl XPL_PARFILE xpl par XPL_PARSER xpl parser XPL_PARSER_C XPL_PARSER c XPL_PARSER_H S XPL PARSER h XPL PARSER O S XPL PARSER o XPL SOURCE xpl debug c xpl functions c xpl main c xpl program c xpl run c xpl util c N xpl value c XPL PARSER C XPL OBJECTS xpl debug o xpl functions o xpl main o xpl program o xpl run o xpl util o xpl value o XPL_PARSER_O all XPL S XPL XPL_SOURCE Makefile CC o S8 XPL_SOURCE XPL PARSER C XPL PARFILE unicc svwb XPL_PARSER XPL PARFILE clean rm f XPL PARSER C rm f XPL PARSER H rm f S XPL OBJECTS rm f XPL 6 Appendix I The XPL programming language 127 UniCC Parser Generator 6 2 xpl par language CHE Meta information parser XPL desc
62. XPL string int argc xpl value argv parser c program c void xpl reset xpl program prog xpl run c void xpl run xpl program prog xpl util c char xpl malloc char oldptr int size char xpl strdup char str char xpl free char ptr xpl value c xpl value xpl value create void xpl value xpl value create integer int i xpl_value xpl value create float float f void xpl value free xpl value val void xpl value reset xpl value val xpl value xpl value dup xpl value val xpl runtime rt _compile xpl program prog FILE input l get variable xpl program prog char name 1 get literal xpl program prog xpl value val l emit xpl program prog xpl op op int param xpl value xpl value create string char s short duplicate void xpl value set integer xpl value val int i void xpl value set float xpl value val float f void xpl value set string xpl value val char s int xpl value get integer xpl value val float xpl value get float xpl value val char xpl value get string xpl value val endif 6 4 xpl proto h 137 6 5 xpl debug c UniCC Parser Generator finclude xpl h static char opcodes 3 1 NOP CAL IM OD EE DUP DRP JMP DESCH Ve EE NEO 10 Ve EQ GRT
63. a eter Pa Vic Ete ree leise nnd 125 2 15 8 UNICC PARSE ERROR unseren 125 5 139 UNICC REDUCE UNICCG SHIP ss en 126 5 15 10 UNICC _S TACK DEBU Gy entree eebe Ee 126 SS DNIGC S E RE 126 51912 NICS UTES rc 126 S I3 13 UNIEELWCHAR tue u u 126 6 Appendix I The XPL programming language ccssccsscsssssssssscsssscsssccssscssssscssssscsssscesscssssssssessssseecs 127 SUEDE TJI epe ee een 127 5 4 RDLD in uo ob E DRE is d a RD UEM te de Rd cute UM LO iU UE 128 mq T 134 m cp eque O EE 137 meme ilu 138 6 0 suite uiii c ET 139 6 7 Bn c 141 ue d P UI M E 142 DEER ege ege erger 144 oye HU ae ol Py isl Ee TE TED UT 150 6 11 oC d m 151 7 Appendix II The UniCC Document Type Definition DTD cec eeee eere cerne eene a tenant 155 UniCC Parser Generator 1 Introducing the UniCC parser generator 1 1 Welcome to UniCC UniCC standing as an acronym for Universal Compiler Compiler is a powerful LALR 1 Parser Generator and Language Development System for computer professionals Its design goal is to serve as an all round design and build tool assisting compiler writers in any parsing related task including production quality compiler construction and the implementation of domains specific languages UniCC unions an integrated generator for lexical analyzers and a powerful LALR 1 parser
64. a in its logical way If the syntax a parser follows is not matched the information is useless or error prone and cannot be processed by subsequent tasks that rely on the correctness of the parsed information Every compiler as the most common example has a parser for the language it compiles the language must follow the correct syntax to let the compiler produce valid output like an assembly program or a program in a lower programming language In nearly all business and trough the whole bunch of areas where information technology is used languages are created to describe data structures configuration files workflows ways of how information could be accessed all such tasks may require an underlying parser that fetches the information expressed in a domain specific language approach Serving another example a program like an appointment assistant allows to import appointments from a file and accepts date and time values in different human readable formats for example May 5 2008 or 5 5 08 a parser is required to analyze the correct format based on a syntax which describes all possible formats a date can expressed As you can see there are so many different applications requiring a parser that all of them never can be mentioned or grouped The limit is only the creativity of every individual and his or her skills and ideas The UniCC parser generator supports the programmer dealing with any type of parsing issue in two important ste
65. ack pointer requiring for functions that follow the prototype declaration xpl value function int argc xpl value argv All build in functions some service functions and an array holding the functions as a symbol table is stated in xpl functions c finclude xpl h xpl fn xpl_buildin_functions exit San d XPL exit ion print Ty 1 XPL print hy prompt EX 1 XPL prompt integer E dy XPL integer float Ty 1 XPL_float string Ly ie XPL_string int xpl_get_function char name int Se Try to find function for i 0 i lt sizeof xpl_buildin_functions sizeof if strcmp xpl_buildin_functions i name name return i xpl fn rt 0 return 1 int xpl check function parameters int function int parameter count int line if xpl buildin functions function min 1 if parameter count xpl buildin functions function min fprintf stderr line d Too less parameters in call to s d parameters required at minimum line xpl buildin functions function name xpl buildin functions function min return 1 else if xpl buildin functions function max gt 1 2 6 3 4 xpl_fn Integrating some useful functions 41 UniCC Parser Generator if parameter_count gt xpl_buildin_functions function max fprintf stderr line d Too many parameters in call to s d parameters allowed
66. adequate right hand side identifiers Its also possible to provide long strings for example 0 9 A special number In case a nonterminal or regular expression based terminal appears an identifying name for it is automatically associated with the same name as the symbol so no identifier must be provided manually Note that this automatism fits only to the first occurence of the symbol on the particular right hand side If the same symbol appears multiple times on the same right hand side its first occurence can be accessed with this default identifier only 4 4 3 1 Semantic actions 83 UniCC Parser Generator Identified symbols can then by accessed by identifier Identifier String or Identifier String within the reduction code The offset access is always possible and can be mixed as below example gt expr printf Sd n expr zl integer Hello Folks printt ints Fan Hello Folks r expr Qinteger il integer i2 il i3 integer integer i2 GG integer 3 integer GG 1 Some target programming languages like C are strongly typed If there s no special data type given the default type specified by the UniCC standard C parser template is int UniCC provides a way to assign data types to nonterminals and this breaks down to production level and the reduction code If we assume a nonterminal ident is made up of several chara
67. am_cnt XPL_MALLOCSTEP sizeof xpl cmd prog gt program prog program cnt op op prog program prog program cnt param param return prog program cnttt 6 8 xpl program c UniCC Parser Generator void xpl reset xpl program prog int is Variables for i 0 i lt prog variables cnt i xpl free prog gt variables i xpl free char prog gt variables Literals for i 0 i lt prog literals cnt i xpl value fr prog literals i xpl free char prog gt literals Program xpl free char prog gt program memset prog 0 sizeof xpl program 6 8 xpl program c 143 UniCC Parser Generator 6 9 xpl run c finclude xpl h extern xpl_fn xpl buildin functions static int xpl push xpl runtime rt xpl value val if rt gt stack_cnt XPL MALLOCSTEP 0 rt gt stack xpl_value xpl_malloc char rt gt stack rt gt stack_cnt XPL_MALLOCSTEP sizeof xpl_value rt gt stack rt 5stack cnt val return rt stack cnt t static xpl value xpl pop xpl runtime rt if rt 5stack cnt return xpl value NULL return rt 5stack rt stack cnt static void xpl stack xpl runtime rt int i for i 0 i lt rt gt stack_cnt i fprintf stderr 3d s n i xpl value get string rt gt stack i LEME fprintf stderr Stack
68. anonymous nonterminal with an empty production to be used as embedded semantic code x Sp pcb gt current_depth fprintf pcb gt debugfile New procedure gt s lt d n funcname pcb current depth depth statements j This is the usual production s semantic block Do some code generation here then replace th stacked value AY pcb current depth depth 86 4 4 3 3 Anonymous nonterminals UniCC Parser Generator In some parser generators this feature is called embedded actions but in UniCC the generic term anonymous nonterminals is used for both embedded actions and embedded productions 4 4 3 3 Anonymous nonterminals 87 UniCC Parser Generator 4 5 Directives Directives are used to configure and influence various behaviors and settings of UniCC related to the parsing revision or construction of the grammar and its parse tables Directives do always begin with a symbol hash similar to C preprocessor directives A special kind of directive are the top level directives which begin with the sequence Directives may appear anywhere in the grammar definition except top level directives because they modify options and parameters of the fundamental behavior of UniCC or the grammar which affect any subsequential directive or definition They must be specified on top of the grammar but their use is not required so the default settings for these top level directives take plac
69. ansition goto 9 gt lt character class count 1 gt lt range from 108 to 108 gt lt character class gt lt transition gt lt state gt lt state id 8 gt lt transition goto 2 gt lt character class count 1 gt lt range from 100 to 100 gt character class transition state state id 9 gt transition goto 1 gt lt character class count 1 gt lt range from 111 to 111 gt lt character class gt lt transition gt lt state gt lt state id 10 gt lt transition goto 8 gt lt character class count 1 gt lt range from 108 to 108 gt lt character class gt lt transition gt lt state gt lt lexer gt 3 5 2 The XML based parser description generator 65 UniCC Parser Generator lt lexers gt lt value types gt lt prologue gt lt epilogue gt pcb gt lt source gt mode insensitive parser Simple start amp gt Hello World source lt parser gt All semantic code parts and their macros and variables are split into several tags which can be easily adapted modified or enhanced For example the following production definition variable lt int gt gt a z name amp variables name a is compiled into the following XML structure lt production id 20 length 1 defined at 55 gt lt left hand side symbol id 21 offset 0 gt lt right hand side symbol id 37 offset
70. arisons The function call which is part of nonterminal expression first pushes and evaluates all parameter values onto the stack then pushes the number of parameters to be taken by the CAL operation to find out how many parameters shall be used for the function call and to clear them later on Semantic checking if the function call provides too less or too many parameters is done here by the function xpl_check_function_parameters already The more difficult part is the nonterminal statement This nonterminal requires the backpatching of addresses and code generation within its productions The best example for this is the statement while The XPL compiler should compile a while statement while i 0 i i 1 into a virtual machine code like this 000 LOD 0 gt i 001 LIT O 0 002 GRT 003 JPC 011 004 LOD 0 gt i 005 LIT 1 1 006 SUB 007 DUP 008 STO 0 gt i 009 DRP 010 JMP 000 011 NOP The first operations from address 000 to adress 002 are emitted from the comparison The content of variable iand the literal O are pushed and compared for greater than The result of the comparison will then be pushed onto the stack 0 for false 1 for true The JPC instruction at address 003 now compares against this value on top of the stack and if it is false jumps to address 011 If the comparison is true the machine will continue to run at address 004 The opcodes from address 004 to address 009 are the result of
71. arse dateslike August 18 2008 which follows the basic syntax Name of Month Day Year Semantic checks e g if a valid day for the month or year is given will be ignored to simplify matters For the day and year number we can re use the integer grammar from above here to integrate with this new date grammar The date grammar definition will simply be fwhitespaces 5 NUTS flexeme integer date gt month integer integer month January February March April May June July August September October November December integer gt Tt integer 0 9 8 2 2 Defining grammars UniCC Parser Generator This is a valid UniCC grammar definition Instead of the first simple grammar definition this grammar can be fed to UniCC as it is and produces a working parser In the first three lines some parser configuration is performed An end of input symbol is defined as a line break by default this is the zero byte and a character class acting as whitespace symbol is defined How this special kind of symbol is handled will be discussed later You currently just have to know what its meaning is The third line defines nonterminal integer to be a lexeme This must be done to disallow whitespace between the terminals of integer itself to let it act as a syntactical coherent unit If this is not done the input 12 34 would also be valid for one single integer although it
72. ate 3 gt lt state gt state id 3 default production 0 derived from state 2 lt shift reduce symbol id 1 by production 1 gt lt state gt lt states gt lt lexers gt lt lexer gt lt state id 0 gt lt transition goto 3 gt lt character class count 1 gt lt range from 72 to 72 gt lt character class gt lt transition gt lt transition goto 4 gt lt character class count 1 gt lt range from 87 to 87 gt gt 64 3 5 2 The XML based parser description generator UniCC Parser Generator lt character class gt lt transition gt lt state gt state id 1 accept 0 gt state id 2 accept 1 gt lt state id 3 gt lt transition goto 5 gt lt character class count 1 gt lt range from 101 to 101 gt lt character class gt lt transition gt lt state gt lt state id 4 gt lt transition goto 6 gt lt character class count 1 gt lt range from 111 to 111 gt lt character class gt lt transition gt lt state gt lt state id 5 gt lt transition goto 7 gt lt character class count 1 gt lt range from 108 to 108 gt lt character class gt lt transition gt lt state gt lt state id 6 gt lt transition goto 10 gt lt character class count 1 gt lt range from 114 to 114 gt lt character class gt lt transition gt lt state gt lt state id 7 gt lt tr
73. ate C tlt This manual relates to the UniCC Standard C Parser Template version 1 0 and higher 5 1 Overview The UniCC Standard C Parser Template enables the UniCC LALR 1 Parser Generator to support the C programming language in its program module generator Using this template UniCC is capable to generate parsers expressed in the C programming language from a UniCC Grammar Definition The template also provides facilities for further grammar processing and integration of the generated parser modules with other C modules For now there is no other programming language template available Phorward Software Technologies aims to provide more Standard Parser Templates for UniCC targetting other programming languages in future If there is an interest in contributing adapted parser template to make UniCC s taget language range more richer than like now check out the respective chapter The UniCC Standard C Parser Template is also used by UniCC itself for bootstrap meaning that UniCC constructs its own parser out of itself 5 2 Features Every parser template has some special features and specifications provided to the compiler writer so there s no way to get around this section Below is the feature set the UniCC Standard C Parser Template provides Well tested feature proven used by UniCC s own grammar parser Platform and C compiler independent based on the C standard library only ANSI C89 and C99 compliant Thread safe par
74. ations after a regular terminal is shifted or at the beginning of the grammar This grammar revisions comes visible when calling UniCC with the G option to print out the revised grammar that is used to finally construct the parse tables For example the simple grammar mode sensitive whitespaces whitespace COMMENT start gt Hello World whitespace Nt COMMENT yields in a revised grammar UniCC outputs as unicc G w sensitive par GRAMMAR whitespace Nt COMMENT lexem 1 prec 0 assoc N v null Hoe md 2 gt NE 3 COMMENT start Hello lexem 0 prec 0 assoc N v null 0 gt Hello World start Hello lexem 0 prec 0 assoc N v null 4 gt start amp eof amp whitespace Nt COMMENT lexem 1 prec 0 assoc N v null 5 whitespace amp whitespace Nt COMMENT lexem 1 prec 0 assoc N v null 6 gt amp whitespace amp whitespace 7 gt amp whitespace amp whitespace Nt COMMENT lexem 1 prec 0 assoc N v null 8 gt amp whitespacet 9 gt Hello Hello lexem 0 prec 0 assoc N v null 4 5 8 whitespaces 95 UniCC Parser Generator 10 Hello amp whitespace World World lexem 0 prec 0 assoc N v null 11 World amp whitespace start Nt COMMENT Hello lexem 0 prec 0 assoc N v null 12 gt amp whitespace
75. build in The selected parser template describes various parts of the parser broken down to single parse tables cells that are used to construct parsers in nearly any possible problem oriented programming language If UniCC is forced to create a parser description file the 1anguage directive will only be attached as attribute 88 4 5 Directives UniCC Parser Generator Please read more about the UniCC code generation possibilites in the section about the UniCC code generators 4 5 3 case insensitive strings The case insensitive strings directive switches string based terminals to be case insensitive or not The directive can be called multiple times Each time it is called and switched subsequent string based terminal symbol definitions are threatened according to the current case order configuration state The case insensitive strings directive allows for one parameter which is of type boolean and accepts on and of f All subsequent strings are case insensitive case insensitive strings on xample hello test This string can be written in any case order case insensitive strings off test world This one must be in lower case order Default value is of f 4 5 4 default action default epsilon action default value type The default actionand default epsilon action directives define a default semantic action code that is used for productions with no attached semantic block incl
76. c checks to define a symbol s correct meaning This can be the case if a grammar allows for function and variable names that are made up of character sequences There may be a construct within the grammar to recognize the identifier e g a regular expression terminal ident But there is no possibility on the grammar definition level to clearly define wether a parsed identifier refers to a function name or to a variable name For this case UniCC supports semantic nonterminal determinations The semantic code block triggers some kind of selection task for example a symbol table call to find out if the addressed identifier is the name of a variable or a function Then the semantic code selects the kind of nonterminal to be used in the parser The feature of semantic nonterminal determination allows to perform a dynamic manipulation of the parser within semantic actions This mechanism is done by declaring a production with multiple left hand sides The first specified left hand side will always be used as default The semantic selection is specified in the reduction code using a special macro symbol name where lt name gt refers to the name of the left hand side symbol to be set For the problem described above the following grammar would be the adequate solution ident lt char gt A Za z It gt Two productions are used here values variable function Multiple nonterminal definition with semantic sy
77. ce rules defined by the grammar s productions Every symbol both terminal and nonterminal symbol is shifted which means it is consumed and put on a stack when its matched in the input Then if a sequence of symbols resisting on the stack exactly matches to one production it will be reduced In this case the term handle is used so there is a handle to a production on the stack To reduce the production means that the parser validated the input according to a productions rules as valid It then replaces the productions sequence the handle by its left hand side nonterminal which is part of the next subsequent production or the goal symbol which defines the input to be valid and sucessfully terminates the parser All this shifting and reducing is done on an internal stack within the parser which is holding the current parser state The parser is oriented on the parse tables constructed by UniCC They predict which terminal symbols are valid in the current context and which action must be performed next A production may also exist of no symbol sequence These productions are called epsilon productions Their appearance performs an immediate reduction of the nonterminal if no other production mets This is an example for an nonterminal with two productions Latter one is an epsilon production nonterm gt hello 4 4 3 1 Semantic actions The reduction of a production means that a new node in the parse tree is constructed The terminal s
78. cessing informations and grammar statistics unicc v dates par UniCC version 1 0 Parsing grammar Done Parser construction mode sensitive Goal symbol detection Succeeded Setting up single goal symbol Done Rewriting grammar Done Fixing precedences Done Computing FIRST sets Done Validating rule integrity Done Building parse tables Done Terminal anomaly detection Done Constructing lexical analyzer Done Detecting default rules Done Code generation target C default Invoking code generator UniCC Standard C Template v1 0 Done dates par produced 36 states 0 errors 0 warnings 2 files To get an overview about all supported command line options run unicc without any parameters There is also a section abount command line parameters in the reference manual In case of the standard C parser template delivered with UniCC above call will generate two output files which are the program source file dates c and dates h a header file containing some definitions This output can immediatelly be compiled with a suiting C compiler It is not required to write a main function because the default UniCC parser template for C parsers uses a predefined main function if no individual code for the parser s footer is specified Note that this feature is always parser template related Especially third party drivers and own modifications may not support this feature Building compiling and playing a little
79. chnologies Compiler Construction Scripting Languages Software Development Tools www phorward software com
80. configured to take a higher precedence than the ary minus normally takes Defining it expression precedence as production definition the correct syntax tree will be constructed for 2 3 expression expression amp eof expression a expression integer integer Fig 15 Correct syntax tree of 2 3 using a production predecence declaration The precedence directive has to be put right beind the production and its reduction code so expression 2 precedence could be a valid reduction code submit 94 4 5 7 left right nonassoc precedence UniCC Parser Generator 4 5 8 whitespaces The whitespaces directives defines symbols to be handled as whitespace which means they are allowed everywhere between terminals and simply ignored In programming languages comments blanks tabs and in some languages even line breaks are handled as whitespace whitespaces is one of the most influencing UniCC directives and works different in both of the two parser construction modes 4 5 8 1 whitespaces in sensitive mode If whitespaces is used in sensitive mode default it allows for symbols of any kind including nonterminal symbols This means that whitespace can be constructed from sub grammars which are made up of entire grammatical rules and their subsequent rules Using this parser construction mode UniCC rewrites the entire grammar to make whitespace become valid in situ
81. conflict lexeme ident lexeme separation on whitespaces ag programs gt empty or stmt 0 empty or stmt gt An stmt stmt IF empty or stmt ELSE empty or stmt ENDIF ident ident gt If this is compiled with unicc w the following warnings are raising up A Za z A Za z0 9 unicc warning state 5 0 empty or stmt gt 2 empty or stmt gt unicc warning state 5 0 empty or stmt gt 2 empty or stmt gt UniCC detects these conflicts by testing suspicious nonterminals against all regular expression based Terminal anomal empty or stmt empty or stmt Terminal anomal empty or stmt empty or stmt t at shift on empty or stmt Ten at shift on empty or stmt x0 and reduce on and reduce on A Z a z ELSE ENDIF A Z a z ELSE ENDIF LSE ENDIF terminals These messages declare that the input for ELSE and ENDIF is both recognized by the nonterminal ident and their relating string terminals defining them as keywords Due to UniCCs build in terminal precedence leveling the string terminals will always take a higher precedence than the character class terminals but maybe the grammar developer gets an unwanted parse result This warning helps to detect and fix this issue To avoid this warning the reserv terminals directive s
82. create xpl_value_set_integer val i return val xpl_value xpl_value_create_float float f xpl_value val val xpl_value_create xpl_value_set_float val f return val xpl_value xpl_value_create_string char s short duplicate xpl_value val val xpl_value_create xpl value set string val duplicate xpl strdup s s return val void xpl value free xpl value val if val return xpl value reset val free val void xpl value reset xpl value val if val gt type XPL STRINGVAL amp amp val gt value s fr val value s val gt strval xpl free val gt strval memset val 0 sizeof xpl value val gt type XPL NULLVAL 6 11 xpl value c 151 UniCC Parser Generator xpl_value xpl_value_dup xpl_value val xpl_value dup dup xpl value create if val return dup memcpy dup val sizeof xpl value dup gt strval char NULL if dup gt type XPL STRINGVAL dup gt value s xpl_strdup dup gt value s return dup void xpl_value_set_integer xpl_value val int i xpl_value_reset val val gt type XPL INTEGERVAL val value i i void xpl value set float xpl value val float f xpl value reset val val gt type XPL FLOATVAL val gt value f f void xpl value set string xpl value val char s xpl value reset val
83. ct value 0x1 which is then passed to the left hand side In the second step of our parse of the input sequence 17 we first have to multiply the first digit by its base 10 to derive 10 from the 1 and then perform the same procedure as in the first production but with the difference that is will be added to this 10 The result is 17 as a true decimal number to be stored to an int data typed variable This latter step can then be performed for every digit in the integer The already parsed value is multiplied with 10 to be moved one digit to the left again and then added by the next digit Note the recursion of nonterminal integer It can parse one single digit or a chain of digits as one unit As a beginner you can now say that coding parsers in UniCC is hard to understand Well you might be a little right But this is only the first impression on it The more you learn about the techniques the more practical experience you will have things will make sense and the parse trees you want to climb will grow in your mind Don t give up even if you feel so it wouldn t be worthwhile If you re already familiar with parsing this low level looking way of extracting atomic integer numbers from the input will look unconventional to most of you This is done because UniCC allows but not relies to perform lexical analysis by the grammar rather than an upstreamed lexer To let the reader become more familar with grammar definitions in UniCC fi
84. cters its return type should be a char that points to a constructed string buffer Nonterminal ident can them simply be declared to hold data type char string char gt A Za z string create string add char 1 M string A Za z string add char 80 Q1 zl string is always of type char when its used so constructions like statement print string printf Ss n string are possible In other target language templates this must not be a problem it strongly relies on the target language template In some cases a default action to set nonterminal values by default is wanted if no action is assigned to the particular production By using the default action and default epsilon action directives such a default code block can be specified for both production with a sequence and for epsilon productions default action GG Q1 default epsilon action example abc printf Sse lt n SI Ae Za abc gt bas 5s Le H 84 4 4 3 1 Semantic actions UniCC Parser Generator 4 4 3 2 Virtual productions UniCC provides a virtual production feature which is very useful to prototype a grammar quickly Virtual productions are created when the modifiers and are used These modifiers can be assigned to any symbol which is virtually turned into a nonterminal with some productions then Modifier Meaning x Kl
85. d oso dde ehe 15 2 3 2 PRECEMENCES e Reno adn nse ieee een diene 18 2 0 mpl menting a programming TEE ee EE 26 2 6 1 Source Code dawnlbpads s oce ttem c esce u 26 pA cow Dr ltins Ihe erc inest E 28 2 6 2 1 Terminals based on regular expressions iiiter etes e tre kx aan 30 2 6 2 2 Overwpttine production EE 30 2 6 23 Multiple lett hand ds ee an 30 2 6 2 4 The dangle else problemi ann 31 2 6 3 The virtual machine 32 263 4 xpl value adynamic value E 32 2 0 3 2 Denning acode Ret aa a eT a 31 2 6 3 3 xpl program representation of a compiled program 38 2 6 3 4 pl fn Integrating some useful funeti ns uu eston 0eckenienkannen 41 2 635 xXpl runtune he runteme data SITUGLUEE asus 43 2 6 3 6 Implementing the yirtual machine aan ae 43 2 04 Implemenung the Compilers eege nates tle EUR D FD UH IE RU tele 50 3 Um DN ad aaa aaa CUR DNE GRRUK L a 59 SMS E 59 32 SC TE UMCG C ponon 59 3 3 Building BN Ge 60 34 Invocation and command ne OMIS eresse aseina t PR prt ters Patines seca bas rab raae ber tude bus onde 61 e Socii en 62 35 24 Ihe proeramemiodule BetieraloE eii sn 63 3 5 2 The XML based parser description generator in tren eite sn 63 3 0 The parser Ee e sera ee 67 SoG ll SENSI YS MOUS sans un 67 SET KE 69 d UmCC Grammar Delnmolis ass ER 71 c Kommen ee S 71 42 EE 71 4 3 Definiti n blacks nauenensn nn 72 e 73 4 4 1 Terminal ymbols ass en 73 UniCC Parser Generator Table of Con
86. e Directives follow nearly the same syntactic rules than any other definition in UniCC The synopsis is directive name parameterl parameter2 parameterN Because the directive names are build in and constant some of them even contain blanks The syntax of the parameter part depends on the directive itself The various directives are explained in detail in the following sections besides their behavior in the different parsing modes 4 5 1 mode The mode top level directive defines the parser construction mode UniCC should follow Only the values sensitive and insensitive are allowed as parameters This mode selection influences the general way how UniCC constructs or rewrites if necessary the grammar and must be specified before any other directive or grammar construct mode sensitive More about the parser construction modes UniCC provides is explained in a separate chapter 4 5 2 language The language top level directive specifies the implementation language to which the parser should be compiled in which language its semantics are written in It may only be defined once on top of a grammar before any other directive except other top level directives or language construct follows language C The value specified here defines the parser template that is selected by the UniCC program module generator UniCC itself does not really care about its value or meaning due it has no language specific decisions
87. e and then call this higher level nonterminal from expression The resulting grammar of this idea is Some grammar related directives whitespaces Nt lexeme integer Defining the grammar expressions expression term expression term term term term integer term integer integer integer gt 409 9 integer 0 9 End of definition The new nonterminal t erm matches this idea and is called by expression where integer was called before The compiled and run version of this grammar allows to enter any desired expression with mixed additive and multiplicative operators One demand is missing Overwriting precedence rules with bracketing Because brackets may appear in the same positions where our operand nonterminal integer currently appears a replacement for integer must be added to allow for both ways in this uppermost precedence level Rewriting the grammar with a new decider between integer anda call to a new expression enclosed with brackets results in a terminal factor which is then called by term factor integer expression H Finally we have the complete grammar to parse expressions the correct way The only disadvantage We still see no results again So there s just a little bit of augmentation required to this grammar to make it a working expression calculator Some grammar related directives whitespaces PONES flexeme integer defau
88. e if any This value is NULL in the root node rev prefix_syntree Pointer to previous node in the current level A copy of the stack element representing the symbol It contains all the Symbol Oe pees Ik information that can be found in the G prefix tok structure token UNICC_SCHAR A copy of the string that represented the token in the input This is only filled when the symbol is a terminal symbol and a leaf next prefix_syntree Pointer to next node in the current level Table 15 Member variables of prefix syntree An example for a rendering function working on this parse tree void print syntax tree int static int if whil d 5 10 Build in syntax tree generator i node gt symbol symbol gt name prefix_pcb pcb prefix_syntree node ec i i rec node return e node for i 0 i lt r printf printf s if node gt token printf f Ti s node gt token 115 printf stream n rectt print_syntax_tree pcb rec node node gt next results in an output like calculator amp whitespace calculator expression expression integer integer Ys T4 amp whitespace expression expression TI Carera amp whitespace expression amp eof 116 UniCC Parser Generator node gt child 5 10 Build in syntax tree generator UniCC Parser Generator
89. e as operands Let s first implement a grammar for additive calculations Simple terms of the syntax integer integer integer integer shall be valid and parsed A first draft of the new grammar would be Some grammar related directives whitespaces Nt flexeme integer Defining the grammar expressions gt integer integer integer integer H integer gt r PIE integer 0 9 End of definition Compiled and run this works for simple terms with always two operands but terms with just one single operand or even terms with three operand to be added or subtracted are punished with a syntax error Extending nonterminal expression to read as expressions gt expression integer expression integer integer will allow for a recursion which is the correct method to enable terms with variable length But how to add precedence now to this grammar enabling multiplicative operators There are two methods in UniCC to make this possible The first is obvious and the second is for lazy people We select the obvious method first requiring more efforts in writing the grammar and even more parse states in the resulting parser The lazy method is described below when the expression parser is finished 18 2 5 2 Precedences UniCC Parser Generator Our obvious approach is simple Why not copy the definition of nonterminal expression to fit the demands of operators with higher precedenc
90. e equipped with individual semantic code and a data type for semantic augmentation to return an individual semantic value to the reduction actions used in the grammar This semantic code can be used for different purposes also simultaneausly To extract semantic values to modify the input or to perform semantic symbol selections Same as within semantic reducton codes of productions some macros within this semantic code block for termnals can be used to access the return value and the matched input string Their meaning is not the same in all target languages it depends on the target language template Macro Usage ee Defines the return value that is associated with the terminal This variable is of the same type that is specified for the symbol or the default type e gt Defines the start of the string In the C standard template parser this is a char pointer to the first byte of the matched string e lt Defines the end of the string In the C standard template parser this is a char pointer to the last byte of the matched string symbol name Sets the returned symbol to name for semantic symbol selections Table 5 Semantic macros in regular expressions Some examples in C Match an integer integer lt int gt O 9 atoi gt Match a string string lt char gt In LES i E Eea tue eG D In various target languages the semantic action code of regular expres
91. e surrounding code that is required to run the parser in its environment it was implemented for It strongly depends on the implementation language and parser template how these directives are handled exactly Multiple calls of prologue or epilogue are possible and glue all specified code blocks together Please read more annotations about the UniCC standard parser template for the C programming language and its features and interfaces which are mostly switched and modified by C preprocessor directives that are put into prologue and epilogue directives 4 5 7 left right nonassoc precedence The left right and nonassoc directives set associativity and precedence weightings to terminal symbols and are used to resolve conflicts in ambiguous grammars They influence the behavior of UniCC when constructing the parse tables whether to shift right associativity or to reduce left associativity on shift reduce conflicts nonassoc defines terminal symbols not to be associative in any way if this is tried at parser s runtime it will throw a syntax error Depending on the left or right associativity configuration the syntax tree grows left or right leaning This can be visualized with two simple grammars and their related syntax trees on the input expression 2 3 4 mode insensitive left Vat os integer O 9 start expr expr gt expr expr integer start start am
92. e that allows for the detection of integer numbers We have one terminal which is a character class that exists of a digit from 0 to 9 This means that the characters 0 1 2 3 4 5 6 7 8 and 9 are our valid characters forming this specific terminal and we generalize them by defining a character class by saying 0 9 Characters and character classes are enclosed by single quotation marks to identify them as terminals Then we have one nonterminal in our grammar let s call it integer Nonterminals are directly identified by writing their name like integer Using these two elements we can now define productions for them We begin with the requirement that integer should only exist of a single digit Then we write integer gt 0 9 This single line defines all three elements described above One nonterminal called integer one terminal defined as 0 9 and one production that is associated with the nonterminal integer This production defines that one digit is the valid input to match the rule that forms an integer 6 2 2 Defining grammars UniCC Parser Generator The arrow symbol gt separates the nonterminal name from its production definitions semicolon closes the definition block Because of this syntactical structuring of grammar elements note that BNF is even a language with its own grammar to express grammars the nonterminal name is sometimes called as the left hand side where the associated productions resist
93. e_set_float val f return val xpl_value xpl_value_create_string char s short duplicate xpl_value val val xpl_value_create xpl value set string val duplicate xpl strdup s s return val void xpl value free xpl value val if val return xpl value reset val free val void xpl value reset xpl value val if val gt type XPL STRINGVAL amp amp val value s 2 6 3 1 xpl value a dynamic value object UniCC Parser Generator TE val gt value s val gt strval xpl_free val gt strval memset val 0 sizeof xpl value val type XPL NULLVAL xpl value xpl value dup xpl value val xpl_value dup dup xpl value create if val return dup memcpy dup val sizeof xpl value dup gt strval char NULL if dup gt type XPL STRINGVAL dup gt value s xpl strdup dup gt value s return dup void xpl_value_set_integer xpl_value val int i xpl_value_reset val val gt type XPL INTEGERVAL val value i i void xpl value set float xpl value val float f xpl value reset val val gt type XPL FLOATVAL val gt value f f void xpl value set string xpl value val char s xpl value reset val val gt type XPL STRINGVAL val value s s int xpl value get integer xpl value val switch val gt type case XPL INTEGERVA
94. eak case XPL GRT val xpl value create integer res 0 1 0 break case XPL GEO val xpl value create integer res gt 0 1 0 break Free the operands xpl value free op 0 xpl value free op 1 Push the operation or comparison result xpl push amp rt val break Increment instruction pointer rt iptt Clear stack if code was clearly generated this would not be required Wi 48 2 6 3 6 Implementing the virtual machine UniCC Parser Generator for 1 0 ri lt rt stack ent i xpl value free rt stack i 1 xpl free char rt stack Clear variables for i 0 i lt prog variables cnt i xpl_value_free rt variables i xpl free char rt variables 2 6 3 6 Implementing the virtual machine 49 UniCC Parser Generator 2 6 4 Implementing the compiler Now both modules the XPL runtime environment and the parser need to convence The parser must be extended to work as a compiler the code must be executed when the program successfully has been compiled To turn the parser into a compiler semantic actions need to be inserted to appropriate positions to generate the machine code for our virtual machine along the parse process now The easiest part of this implementation is the nonterminal expression The productions of this nonterminal only require some code generation to push values onto the stack and perform operations and comp
95. eene closure none or several of previous expression modifier Positive closure one or several of previous expression modifier Optional closure none or one of previous expression modifier Table 7 Virtual production modifiers Each of these modifiers expand into one or in case of the Kleene closure two automatically derived nonterminal symbols that provide the desired language construct when UniCC compiles the grammar A simple example is to parse integer symbols like integer gt 0 9 To allow for a nonterminal statement in several times one could write statements or null gt statements statements gt statement statements statement or one could say statements or null gt statement Virtual productions have one disadvantage Their expanding content can t be augmented with semantic code This is because their most common use is in language prototyping or in places where no special semantic operations on the input are required All default semantic action blocks for productions and empty productions are automatically assigned and executed 4 4 3 2 Virtual productions 85 UniCC Parser Generator 4 4 3 3 Anonymous nonterminals There are several situations where the grammar developer requires to create a new nonterminal that is only used once to build up a desired grammatical construct Other situations require the creation of a nonterminal with only one empty production just to perform semantic task
96. elf To compile UniCC the newest version of the Phorward Foundation Libraries amp SDK http phorward phorward software com is required The Phorward Foundation Libraries amp SDK serves as a toolchain build environment and C programming framework for software projects written at Phorward Software Technologies There is also an howto guide about how to build UniCC from source within this manual To get more stuff support and information on the UniCC project visit the product homepage at http unicc phorward software com or ask directly via e mail 4 1 4 Contributions to UniCC 2 Constructing parsers with UniCC 2 1 What does parsing mean Parsing is required in many situations of a software developers everyday business Information is fed to a program and must be analyzed sometimes with a more or less logical structure Information can be a data dump in a character separated file the command line parameters of a program a stream of data in various formattings an XML file or another type of file with a logical meaningful structure and input syntax maybe even a program source code written in a programming language This syntax can be analyzed and verified for correctness using a special type of program task A parser This parser is used to analyze a sequence of input data and produces a logical interpretation of this data according to an underlying grammar which describes the data s valid syntax rules that express this dat
97. er on the wall bottles bottles of beer else print One bottle of beer on the wall one bottle of beer print Take one down pass it around if bottles bottles 1 0 print No more bottles of beer on the wall else if bottles print One more bottle of beer on the wall else print bottles more bottles of beer on the wall 2 6 1 Source code downloads All source code presented in the following sections is not covered in detail but it is a working example for a programming language implemented entirely with the UniCC language development system To obtain the latest version of the XPL source code presented here access the Mercurial source repository using hg clone http unicc hg sourceforge net 8000 hgroot unicc xpl 26 2 6 Implementing a programming language UniCC Parser Generator The source code printed here can also be found in the appendix or as a downloadable tarball on the Phorward Software website 2 6 1 Source code downloads 27 UniCC Parser Generator 2 6 2 Drafting the Grammar The plain grammar of the language is described below This intermixes already known grammar constructs like lexemes with some new features that had not been covered for now Meta information parser XPL description eXample Programming Language Draft copyright In the public domain 2011 version ku prefix xpi Precedence and associati
98. er the string s end when UNICC_GETINPUT is called multiple times which can be the case So a definition of define UNICC GETINPUT pcb gt inputstring pcb gt inputstring pcb gt inputstring t would be more adequate and causes no unwanted effects Check out the function G prefix utf8 getchar if UTF 8 input processing is wanted 5 15 5 UNICC MAIN UNICC MAIN controls if the parser features its automatic main function This feature enables to rapidly prototype a grammar and immediatelly test its result The main function calls the parser in an endless loop and allows By default if no semantic code is specified by epi logue UNICC MAIN will be configured to 1 so a main function is generated If an epilogue is given it is defined automatically to 0 when no previous definition is done To specifiy an epilogue and switch the parser s build in main function on define UNICC MAIN 1 must be done previously 5 15 6 UNICC MALLOCSTEP Defines the increment size of units allocated in one row in every allocation task This is done to minimize the amount of malloc realloc calls so reallocation of memory is only done when the next step of this number is reached This macro fits to the reallocation of the input buffering and stacks It is configured to 128 by default so for example 128 bytes for input buffering or 128 elements of the value stack elements in an array are allocated The next reallocation occurs when this li
99. erands pushes integer 1 equal 0 d not equal NEQ Check for non equality of next two stack operands pushes integer 1 non equal 0 equal LOT Take two operands off the stack check if left operand is lower than right operand pushes integer 1 lower 0 greater or equal LEQ Take two operands off the stack check if left operand is lower or equal than right operand pushes integer 1 lower or equal 0 greater GRT Take two operands off the stack check if left operand is greater than right operand ushes integer 1 greater 0 lower or equal GEQ Take two operands off the stack check if left operand is grater or equal than right operand pushes integer 1 greater or equal 0 lower Take two operands off the stack perform addition and push result onto the stack In ADD case of string operands perform string conversion of both operands concatenate and push the concatenated string onto the stack SUB Take two operands off the stack perform substraction and push the result onto the stack Strings will be converted to integer Take two operands off the stack perform mulitplication and push the result onto the MUL E stack Strings will be converted to integer DIV Take two operands off the stack perform division and push the result onto the stack Strings will be converted to integer Table 1 The virtual machine opcodes Defining this in C as enum pl op and structure xp1 cmd results in the following code 2 6
100. erminals generated to parse the whitespace it gets more clear that whitespace related nonterminals are themself some kind of lexeme 98 4 5 9 lexeme UniCC Parser Generator It is obvious that the 1lexeme directive is one of UniCC s most powerful features in the sensitive parser construction mode because it enables for richer grammatical constructions than any regular language allows for with the difference that this constructs are handled as they would lexeme taken from the lexical analyzer If a grammar constructs with lexemes produce shift reduce warnings lexeme can be used in combination with the lexeme separation directive 4 5 9 lexeme 99 UniCC Parser Generator 4 5 10 lexeme separation The lexeme separation directive can only be used within the sensitive parser construction mode and in combination with the 1exeme directive It is closely related to the ixate directive and configures all symbols configured as lexemes to be handled like fixate It is a boolean directive and can be switched on or off Calling it without boolean parameter switches lexeme separation as on Based on the sample grammar from above it should be allowed for multiple values where only one value should be allowed yet This could be done by introducing a new virtual production start gt print value Feeding this to UniCC and with warning reporting switched on this causes a shift reduce conflict in two sta
101. es Multiply two values Divide two values Opcode Parameter Virtual machine program Numer of elements in program Array of literal objects Number of elements in literals The variable symbol table The index of the variable name represents the address in the variables member of the runtime data structure 274 Number of elements in variables Array of objects representing the variable values The number of objects is in the corresponding xpl program structure in the member variables cnt Array representing the value stack Defines the top of stack and the numbers of elements resisting KJ Instruction Pointer to the currently executed cod address iA 256 135 UniCC Parser Generator Import function prototypes include xpl proto h endif 136 6 3 xpl h UniCC Parser Generator 6 4 xpl proto h ifndef XPL PROTO H define XPL PROTO H xpl void xpl dump xpl program prog xpl int xp xpl va xpl va xpl va xpl va xpl va xpl va xpl int xpl xpl int xp int xp int xp debug c functions c l get function char name lue XPL print int argc xpl value argv lue XPL prompt int argc xpl value argv lue XPL exit int argc xpl value argv lue XPL integer int argc xpl value argv lue XPL float int argc xpl value argv lue
102. ferent parse trees and results so that the use of this tool should only be done carefully 4 4 1 3 3 Semantic terminal determination UniCC supports a possibility to perform semantic terminal determination in lexical semantic actions This possibility is well implemented and tested but has some caveats to know about when used in the sensitive parser construction mode in combination with programming languages that don t have an automatic memory management like C It is more secure to use the method of semantic nonterminal determination as described below which has no disadvantages according to the lexical way So this topic is only shortly discussed here and it is advised to use the nonterminal way of semantic terminal determination In some cases there are terminal symbols that are build up from the same input sequence but require some more semantic checks to definitely decide which symbol matches There are also cases where the lexical value of a token has a special meaning to the grammar e g a function name referenced in a compiler s symbol table For such cases UniCC features semantic symbol determinations within the lexical analyzer All symbols that match to a given regular expression are unioned with one expression Then the semantic action code can perform checking tasks and set the matched symbol to one of the associated symbols The association is done with the macro symbol name where name defines the symbol to be returned
103. ft from above this augmented version contains programmed actions which define what the parser should do on the different grammatical elements during the parse and how values are passed trough the parse tree The following visualized parse tree shows how the input string August 17 2008 is parsed including the semantic values stored into every node which is an instance of a nonterminal symbol date date ee month 8 integer 17 integer 2008 ee ee August integer 7 integer 200 8 p mcd EE 1 integer 20 0 mu s ER integer 2 0 2 Fig 4 Parse tree with augmentation simplyfied Every code segment which is enclosed between and is executed when the parser successfully matches a production This is why the semantic code invoked on a rule s reduction is even called as the reduction action in LR and LALR parsers In such a reduction action the compiler writer is able to access all values of the current production and return values to the higher nodes of the parse tree which are based on these semantic values It is also possible to perform code generation within these reduction codes or mixed semantic checks symbol table management and more These are all the things to be done in a real compiler and go beyond the task of parsing For all elements of the reduced production the right hand side values are accessed using an character followed by the number of the desired tokens position which begins with
104. g pattern It is on the language designer s choice which how terminal symbols are made up in the particular implementation Some examples for widely used terminals in programming languages are identifiers for variables and functions operators brackets keywords like F or WHILE floating point or integer numbers The parser will expect these terminals in a valid order according to the position it is during the parse which is in turn defined by the underlying grammatical rules it follows Nonterminal symbols or simply called nonterminals can be seen as variables or function calls within a grammar although they aren t They reference to one or a bunch of the so called productions which means that each production is always associated with one and only one nonterminal but one nonterminal may exist of several productions Productions sometimes even called grammar rules or just rules finally describe the syntax Its better to say that productions define a syntactical part of the grammar which can be replaced by the specific nonterminal each production is associated with This syntactical description is done by defining a sequence in which terminals and nonterminals may occur to form a valid sentence This includes that a nonterminal can reference itself recursively in its own productions which is a very important aspect in non regular languages Let s see an example to get more familiar with these new terms We want to define a languag
105. generator into one software solution The programming interface of UniCC is a rich extendable and innovative EBNF based grammar definition language Extended Backus Naur Form This language gives the compiler developer s task much more comfort and simplicity in implementing parsers than ever before It comes with useful features for both grammar prototyping and design parser optimization semantic augmentation and parser programming Lexical symbols can be directly defined within productions right hand side items can be referenced by meaningful names within semantic actions instead of only their offsets Features like virtual and embedded productions finally help to rapidly build up iterative and optional grammatical structures Standard features like automatic conflict resolution terminal and production precedence association state compression as well as parser trace and behavior modification trough semantic actions round up the whole system By default UniCC constructs whitespace sensitive parsers This paradigm is a speciality of UniCC and causes an internal revision of the grammar according to rules that match whitespace only under certain circumstances Whitespace handling will be performed by the parser rather than the lexer The advantage of this approach is that the entire power of a LALR 1 grammar can be used for creating very complex lexemes or whitespace constructs If this feature is not required by a grammar it can be switched off re
106. hall be used It disables terminal anomaly detection and always assumes that regular expression terminals take precedence over any lexemes Terminal anomalies do not raise up in insensitive mode parsers 4 4 1 4 Terminal anomalies 79 UniCC Parser Generator 4 4 2 Nonterminal symbols Nonterminal symbols nonterminals can be seen as grammar variables They always cause a branch to a subsequent grammatical structure within the parse tree and represent nodes to other nonterminals which are nodes as usual or terminals which are leafs of the tree Nonterminals contain one or more productions Productions form sequences of terminal and nonterminal symbols describing the syntax to which the defined nonterminal can be expanded to Everytime a nonterminal appears within a production all its specific rules are valid and possible at the particular situation in the sequence Both the nonterminal symbol and its productions are described as a grammar definition block and follow a variant of the Backus Naur Form meta language to describe context free languages nonterminal gt productionl production2 productionN The nonterminal s name that should be defined appears on the left hand side Left hand side means left from the arrow sign gt A grammar definition block unions the definition of nonterminals their productions and the symbol sequences within the productions which in turn can be definitions of terminal symbols E
107. haracter classes 4 4 1 2 String sequences String terminals define a sequence of characters which must exactly match to the string specified in the current input They are defined similarly to character terminals but are enclosed by double quotation marks The same escape sequences as in character terminals can be used while is an example for a string definition matching a keyword special operation defines a string with a blank If this blank is not given in the input the string sequence is not matched If two blanks are given the string sequence is even not matched Such a string definition overrides any whitespace definition the blank is mandatory in this situation and handled as part of the lexical analysis rather than the grammar Internally string terminals are handled like terminals based on regular expressions and have the strongest level of specialization within the lexical analysis This means that a string terminal that exactly fits to a current input string is matched before any regular expression terminal matching the same string does String terminals don t have a semantic data type association They stand on their own and represent a static lexical unit in most cases some kind of keyword With the use of the case insensitive strings directive string terminals can be configured to ignore upper lower case order if case conversion is possible with the characters they are made up from 4 4 1 3 Regular
108. he draft version of the XPL parser already runs very well in the UniCC Standard C Parser Template s test mode for some simple expressions Only function calls don t work because the semantically determined nonterminal selection is not implemented yet The parser now always detects a variable and throws a parse error when the opening bracket for the function call are statet and a variable followed by an opening bracket is not valid This problem will be solved after adding some semantic determination code soon 2 6 2 3 Multiple left hand sides 31 UniCC Parser Generator 2 6 3 The virtual machine The topic of intermediate code generation performed by a compiler to build a binary program or directly run it is that huge and complex that it can t be covered in this manual or even in huger compiler textbooks You only have to know that every compiler creates some kind of intermediate representation of its input In most cases some kind of a more or less abstract tree structure abstract syntax tree representing semantics and nesting but there are also compilers that compile into an intermediate language that can be used for further optimization or platform dependent code generation Both methods are used in many of todays existing compiler implementations An abstract syntax tree AST is the representation of a parse tree where atomic syntactical elements are hidden and only the semantic information e g a literal from the code a variable addre
109. her nonterminal avoiding the creation of a special extra nonterminal to fullfill one special task like it is the case here Our anonymous nonterminal right after the closing bracked of the header of the while statement generates a JPC operation with an empty address and returns its own operation address as semantic value Then the following nonterminal statement is parsed and finally the production of the statement while will be reduced In the reduction code we then know the final address where the JPC operation should jump to so we now can backpatch its parameter with the correct address To finally generate the JMP call at the bottom of the loop right behind the last statement code was executed we also have to remember the last program memory address before the code for statement while was generated Another anonymous nonterminal helps out and we get the following exemplary version of the backpatching construct statement gt while stack up current program address expression generate jpc code statement generate jmp to stacked begin address backpatch the jpc call The implemented version of this theory is this one and may look a little bit confusing when used the first time statement while pcb prog program cnt begin expression xpl emit pcb gt prog XPL JPC 0 Jpe statement 2 6 4 Implementing the compiler 51 UniCC Parser Ge
110. hoose the way of how values are passed by default Some sentences above we mentioned that there are two ways of implementing precedences within UniCC grammar definitions We first chose to implement the obvious method as described above but there is also a version for those lazy people among you Lazy means that you write lesser grammar code but take the same precedences effect as you will get with writing an obvious grammars as the one above Another advantage is that UniCC produces lesser states for the same parsing behavior even the same semantic actions can be used The key elements for lazy grammar writers are the parser configuration directives left right and nonassoc whereas for our case of the expression language we only require the 4 1eft directive These directives furnish grammar symbols with precedence and associativity weighting to influence the parse table generator and to resolve parse table conflicts which come up with ambiguous grammars Such an grammar would be the following when we try to compile it Some grammar related directives fwhitespaces NETS lexeme integer Defining the grammar 20 2 5 2 Precedences calculators expression integer UniCC Parser Generator expression gt expression expression expression expression expression expression expression expression expression integer 0 9 integer 0 9
111. imply identified by a variable called pcb The parser control block itself is a typedef that is refered as prefix_pch prefix is replaced by the value specified at the prefix directive in UniCC if not specified it will be replaced by nothing empty string so pcb will be the type name Using the pcb directive as described in the UniCC reference above individual members of the parser control block can be defined which avoids the problem of using global variables Global variables must not be used in the standard C parser This makes it entirely thread safe recursive calls of the parser are possible If there is the need to pass on any values or pointers to subsequent parts of the parse tree that hold global run time information of the parser they should be put into the parser control block using the pcb directive Its also strongly recommended to put the input file or buffer pointer into the parser control block if the parser s input should be read from another source than stdin If this is the case make sure UNICC_GETINPUT is defined correctly Most of the members of the parser control block are used internally but some of them may also be used please as read only by the compiler writer Modification of these initial members during runtime is not recommended Parser Control Block typedef struct Stack prefix_tok stack prefix_tok tos Stack size unsigned int stacksize Values prefix_vty
112. in the prologue semantic code area or at compiler invocation If not submittet the UniCC Standard C Template forces their definition to default values 5 15 1 UNICC_CLEARIN Clears buffered input of the current token This macro requires the parser control block as its parameter and is pre defined as define UNICC_CLEARIN pcb prefix_clear_input pcb so it invokes the build in function clear input It can be re defined by any desired task but its behavior must complain with the one given by clear input and the lexer functions reading the input 5 15 2 UNICC_DEBUG Switches parser trace which is written to stderr The following table gives information about the trace levels to be set by defining UNICC_DEBUG to a desired debug level depth UNICC_DEBUG level depth Description INo debug default Enables parser and error recovery debugging Additionally enables debugging for the lexical analyzer Additionally enables debugging for the input buffering gt WIN IS Additionally enables debugging for single character fetching only in UTF 8 mode Table 17 UniCC Standard C Parser Template debug levels Enable debug mode by defining define UNICC_DEBUG 1 To show stack contents also switch on UNICC_STACKDEBUG Here is an example for a debug mode level 1 session with stack debug enabled unicc w debug par cc o debug DUNICC DEBUG 1 DUNICC STACKDEBUG 1 debug c echo n 1 2 3
113. inal is added behind every terminal symbol and in front of the goal symbol during the grammar revision process To disallow whitespace in particular constructs some nonterminals can be defined as lexemes and will be covered like terminal symbols during grammar revision in order to the whitespace matching but are also made up as part of the context free grammar The advantage of this construction mode is that whitespace and lexemes can be expressed in a much more powerful context free grammar rather than as regular patterns matched by the lexical analyzer The lexical analysis apparently becomes part of the parser with all its possibilites but the grammar is expressed as whitespace would be handled apart from it Additionally the true lexical analyzer can be optionally used to parse the atomic terminals as part of lexemes or on its own To make this parsing approach possible overlapping character terminals are made unique and split up into nonterminals A lexical analyzer is constructed individually for every LALR state to only match symbols that are valid in the sensitive context of the given state Some grammars that use this mode may cause a high number of states and many different lexical analyzers But it enables the ultimative maximum of flexibility ever provided by any LALR 1 parser 3 6 The parser construction modes 67 UniCC Parser Generator As an example then following grammar is given mode sensitive left r left T
114. inition DTD lt xml version 1 0 encoding UTF 8 gt 1 UniCC LALR 1 Parser Generator DTD for Parser Definition Files Copyright C 2006 2012 by Phorward Software Technologies Jan Max Meyer http www phorward software com contact lt lt AT gt gt phorward software lt lt DOT gt gt com All rights reserved S license agreements below File unicc dtd Author Jan Max Meyer Usage Document Type Definitions for Parser Definition Files You may use modify and distribute this software under the terms and conditions of the Artistic License version 2 Please see LICENSE for more information lt ELEMENT begin of match EMPTY gt lt ELEMENT character class range gt lt ATTLIST character class count NMTOKEN REQUIRED gt lt ELEMENT code begin of match command raw return value variable gt lt ATTLIST code defined at NMTOKEN IMPLIED gt lt ELEMENT command EMPTY gt lt ATTLIST command action NMTOKEN REQUIRED gt lt ATTLIST command symbol NMTOKEN REQUIRED gt lt ELEMENT copyright PCDATA gt lt ELEMENT description PCDATA gt lt ELEMENT dfa state gt lt ELEMENT epilogue PCDATA gt lt ELEMENT goto EMPTY gt lt ATTLIST goto by production NMTOKEN IMPLIED gt
115. is structure is called the parse tree of the input where the input is broken down into its grammatical structure created by the parser Due the left recursion of our simple grammar the tree grows left leaning because the leftmost recursive call to nonterminal integer results in a new branch each level As you can see all leafs of the parse tree are terminal symbols It is also possible to write right leaning trees but in the LALR parse algorithm right recursive grammars require much more parse stack usage than leftmost ones while the parser analyzes the input Parser stack usage should always be kept low and this is the case in leftmost grammars 2 2 Defining grammars 7 UniCC Parser Generator A tree of this kind is automatically generated for each input a parser analyzes But its only a fictitious tree a data structure that is constructed virtually from the natural and logical flow of the parse process and the underlying grammar It is only a visualization of how the parser walks along the grammar and maps the input into this structure Like with the above grammar each call to a nonterminal in a production allows the parser to parse this nonterminal s entire underlying structure which includes all its productions the productions of the nonterminals in these productions and so on in any possibility the syntax allows for This can cause branches to giant structures in this virtual parse tree Just imagine what happens when we parse a thirt
116. ists of an opcode defining the operation to be performed and an integer parameter that is only used by some operations as index reference The following table describes the opcodes their meaning and the use of the parameter Opcode Operation Parameter INOP Does nothing Call a build in function Expects the number of arguments as first integer item on the CAL X stack then the arguments Arguments will be dropped off the stack after the function Function Index was called and the function s return value is pushed Load a literal value Literal values are hold as xp1_value objects within a LIT xpl program structure For the literal load the literal value is duplicated and Literal Index ushed onto the stack Load a variable content Variables are hold in a xpl runtime structure during LOD runtime For the variable load the variable s xpl value objectis duplicated and Variable Index ushed onto the stack STO Stores the value on top of the stack into a variable The variable s old value will be Variable Index deleted DUP Duplicates current top of stack value and pushes it DRP Drop current top of stack value and clear its memory JMP Change instruction pointer jump to address Code Address Conditionally change instruction pointer if value of top of stack is false Jump to uis address if Er The top if stack a will be a ode netics EQU Check for equality of next two stack op
117. ix_utf8_getchar int getfn The function expects a function pointer as parameter to a function returning one byte per character prefix_utf8_getchar then automatically calls get fn until the correct number of sequences is taken to match one valid UTF 8 character 120 5 13 prefix_utf8_getchar UTF 8 character fetch function UniCC Parser Generator 5 14 prefix_lexem take current lexical value The function prefix_lexem has the purpose to return the current input token recognized from a terminal symbol as UNICC_SCHAR pointer that can be used in semantic actions The function s prototype is UNICC STATIC UNICC SCHAR prefix_lexem prefix_pcb pcb It should not be called in reduction actions only on shift operations respective in terminal actions The gt macro within semantic terminal actions calls prefix_lexem internally The function allocates memory only in UTF 8 mode all other input modes force only a pointer mapping so no memory allocation is done Memory management will be taken automatically by the parser so it is not required to care about 5 14 prefix_lexem take current lexical value 121 UniCC Parser Generator 5 15 Configuration Macro Reference This short reference should bring an overview of the provided macros that can be pre defined in the UniCC Standard C Parser Template and their behavior All macros within the UniCC C Standard Template can be overridden when defined
118. le quotation mark Ar Double quotation mark Backslash NOOO ASCII character in octal notation O octal digit xHH ASCII character in hexadecimal notation H hexadecimal digit NuHHHH 32 Bit Unicode character in hexadecimal notation H hexadecimal digit UHHHHHHHH 64 Bit Unicode character in hexadecimal notation H hexadecimal digit Table 3 Global escape sequences to be used in UniCC 4 UniCC Grammar Definitions 71 UniCC Parser Generator 4 3 Definition blocks The UniCC grammar definition language provides blocks of several definitions Each block has an introductional symbol e g an identifier for a nonterminal definition or a particular parser configuration directive and ends with a semicolon Confguration directive block lexeme int Nonterminal definition block mt gt fett Terminal definition block for terminals based on a regular expression id name A Za z_ Any other type of definition is invalid and will result in a parse error 72 4 3 Definition blocks UniCC Parser Generator 4 4 Grammars Grammars are expressed using rules that describe in which way symbols may appear The symbols within a grammar define grammatical units in form of terminal symbols terminals which immediatelly are expected in the input and nonterminal symbols nonterminals Nonterminal symbols union one or more grammatical rules so called productions into one fictive symbol to be inserted int
119. ls xpl_value xpl_malloc char prog gt literals prog gt literals_cnt XPL_MALLOCSTEP sizeof xpl_value prog gt literals prog gt literals_cnt val return prog literals cnt int xpl emit xpl program prog xpl op op int param if prog program cnt XPL MALLOCSTEP 0 prog gt program xpl cmd xpl malloc char prog gt program prog gt program_cnt XPL_MALLOCSTEP sizeof xpl_cmd prog gt program prog gt program_cnt op op prog gt program prog gt program_cnt param param return prog program cnttt 2 6 3 3 xpl program representation of a compiled program 39 UniCC Parser Generator void xpl reset xpl program prog int d Variables for i 0 i lt prog variables cnt i xpl free prog gt variables i 1 xpl free char prog gt variables Literals for i 0 i lt prog literals cnt i xpl value fr prog literals i xpl free char prog gt literals Program xpl free char prog gt program memset prog 0 sizeof xpl program And for debug purposes we want to get a visible representation of the compiled program This can be done with a program dump function that is implemented in xpl debug c finclude xpl h static char opcodes 3 1 NOP CAL MT OD ESTOT DUP DRP JMP DOS EQU NEQ 10 ae igo GRT Za GEQ ADD
120. ls including all their subsequent productions and terminal nonterminal calls to be handled as coherent lexical units the so called lexeme 1exeme directly influences the grammar revision process Each as 4 1exeme configured symbol is handled like a terminal symbol within the whitespace grammar revision so whitespace is allowed behind a lexem but not within Given is this example grammar whitespaces start print value value ident integer ident gt A Za z ident A Za z integer gt 0 9 integer 0 9 When this grammar is passed to UniCC it will produce a parser that allows to parse statements like 96 4 5 8 1 whitespaces in sensitive mode UniCC Parser Generator print hello print 4711 print h e 11 o wor ld print 1 675467 9 123 The two last lines are internally parsed as helloworld and 16754679123 but this is not the desired syntax for our language Adding a lexem configuration flexeme ident integer to this grammar yields in a different correct parser which only accepts the lexeme we want to allow for The difference that is done during the automatic revision of the grammar caused by the whitespaces directive in combination with the sensitive mode can be made visible using the G command line flag to let UniCC print out the final grammar The inital version of the revised grammar is this one if no lexeme configuration is done GRAMMAR start
121. lt action G1 Defining the grammar calculators expression printf Sd n expression expression gt expression term Q1 Q3 expression term Q1 Q3 term 2 5 2 Precedences 19 UniCC Parser Generator term term factor G1 Q3 term factor Q1 Q3 factor factor integer expression expression integer gt 0 9 1 0 integer 0 9 integer 10 82 tuo 3 End of definition That s the complete program code which calculates any desired expression for you the correct way Simple isn t it Maybe you recognized the line with default action Q1 which was already present in our first example using semantic actions This parser directive is required to define a default action that should be performed at an nonempty productions reduction code if no code has been provided by the grammar For example in factor integer expression Q2 The first production uses this default action code to assign the return value of the integer nonterminal automatically to the value of factor Just keep this in mind if you feel that your parser is loosing values because you d forgotten to add this directive to your grammar Because UniCC is a target language independent parser generator it was decided to let the grammar designer c
122. ly parsed as 2 3 So another higher precedence level than to addition and subtraction must be added which we do with yet another call to the left directive As deeper a precedence directive in the grammar occurs below other left right and nonassoc directives so higher the precedence will be assigned When this secondary precedence and associativity assignment is done the grammar will behave the correct way even for expressions like 2 3 4 5 6 7 which will then correctly be parsed as 1 2 3 4 5 6 7 o e e L N e e int e e e e 7 A RA e e int int int FT int e e 4 5 6 1 int int 3 Fig 6 The unambigous parse tree of 1 2 3 4 5 6 7 simplyfied Symbol names had been shortened Adding this precedence directives to above grammar we ll get its unambiguous pendant Some grammar related directives language NS fwhitespaces N lexeme integer default action 1 left Ske Mets left EAT 24 2 5 2 Precedences UniCC Parser Generator Defining the grammar calculators expression printf Sd n expression expression gt expression expression GG Q1 03 expression expression Q1 Q3 expression expression Q1 Q3 expression expression G1 3 expression Q2 integer integer c 9 9 ea G1 0 intege
123. mbol determination variable function ident Return symbol function if this is the name of a function in the symbol table if is function ident symbol function A similar mechanism can also be implemented within the use of lexical analyzer terminal determination but is not advised Refer the topic to get more information about related caveats 4 4 2 2 Semantic nonterminal determination 81 UniCC Parser Generator 4 4 3 Productions A production defines a sequence of symbols that is valid in the context the production may appear in Every production is associated with a nonterminal symbol a nonterminal symbol in turn is made up of one or multiple productions see above Productions can only be defined within a grammar definition block as part of a nonterminal definition by using a Backus Naur Form compliant syntax All productions a nonterminal is constructed from are defined on the right of the arrow sign gt which is the reason why a production is also called the right hand side of a grammatical rule In turn the nonterminal it belongs to is called the left hand side Multiple production definitions on the right hand side are separated by pipes or must be stated into a new grammar definition block nonterm gt hello world equals to nonterm gt hello nonterm world When the parser of a defined grammar is executed it reads terminal symbols from the input according the sequen
124. minal 21 integer ion 6 expression gt integer ing nonterminal 11 expression tos gt state tax m space 4 4 act shift reduce expression 7 act reduce idx 13 tos gt state space 4 expression 7 ion 13 amp whitespace gt ing nonterminal 14 amp whitespace ion 14 gt amp whitespace ing nonterminal 15 shift idx 7 11 tos gt state 11 act space 4 expression 0 9 ion 7 integer gt ing nonterminal 9 0 9 integer 13 integer idx 3 tos gt state 3 act reduc space 4 expression 11 ion 13 amp whitespace gt ing nonterminal 14 amp whitespace ion 20 integer gt integer amp whitespace ing nonterminal 21 integer ion 6 expression gt integer ing nonterminal 11 expression shift reduce idx 9 11 16 expression tos gt state 16 act space 4 expression idx 13 16 expression tos gt state 9 act reduc space 4 expression 11 ion 13 amp whitespace gt ing nonterminal 14 amp whitespace ion 16 gt amp whitespace ing nonterminal 17 tos gt state 13 act shift idx 7 space 4 expression 11 16 expression 0 9 ion 7 integer gt 0 9 ing nonterminal 9 integer 0 tos gt state 3 act reduce idx 13 space 4 expression 11 16 exp
125. minals based on regular expressions The first new feature used here is the definition block Regular expressions string Int int cv identifier A Za z A Za z0 9 This definition introduces two terminal symbols st ring and identifier These terminal symbols are made of regular expressions Regular expressions are some kind of formal language to describe a way in which characters or character classes words and sentences may appear in the input to match the corresponding terminal symbol and give this string sequence a semantic meaning All terminal symbols in UniCC the already known character classes and strings are some kind of regular expressions These terminals can be defined where they are used but regular expressions may be defined that complex and powerful that they require their own definition block The first of the above regular expressions describe a terminal st ring that begins with a double quotation mark a sequence of zero or multiple characters except a quotation mark and a closing double quotation mark The second regular expression describes a terminal identifier which is made of at least one character that may be in the upper and lower case letters or underscore A Za z and zero or multiple characters that are made up of upper lower case letters digits and underscores This regular expression pattern matches for strings like a Hello _test or even 122 but not for 22
126. mit of UNICC MALLOCSTEP is reached again 5 15 7 UNICC OUTOFMEM Is a macro that is called when a malloc realloc call fails It can be replaced with a product related error function call or similar Its default definition fprintf s a Memory error to stderr and exits the program with error code 1 5 15 8 UNICC PARSE ERROR This macros is invoked on a parse error It requires the parser control block as its parameter and is pre defined with an error text written to stderr The UNICC PARSE ERROR macro defaults to define UNICC PARSE ERROR pcb fprintf stderr line d syntax error on token s n pcb gt line pcb gt sym gt 0 prefix_symbols pcb gt sym name prefix_lexem pcb but can be replaced by any other function call or macro construct 5 15 4 UNICC_GETINPUT 125 UniCC Parser Generator 5 15 9 UNICC_REDUCE UNICC_SHIFT Internal use only may not be changed or redefined 5 15 10 UNICC STACKDEBUG If switched 1 in comination with UNICC_DEBUG stack content will be printed additionally to stderr 5 15 11 UNICC STATIC A define that holds the keyword static by default It can be configured to be empty or anything else just in case static data should be made accessable outside the parser module The parser invocation function is always not declared to be UNICC STATIC 5 15 12 UNICC UTF8 Switches UTF 8 mode on if configured as 1 default 5 15 13 U
127. n 2011 version Qus prefix xpl default action G1 default epsilon action 0 Precedence and associativity left Vz left nl W DI lt S 1 gt Vel left EN ENS left ME rng Regular expressions string lt char gt rue KWR Int 2 a e 52 2 6 4 Implementing the compiler UniCC Parser Generator identifier lt char gt A Za z A Za z0 9 D Lexemes lexeme real real float gt real_integer real_fraction real_integer real_fraction zl real integer Qreal integer zl real fraction Qreal fraction real integer float gt real integer 0 9 dig 10 Greal integer dig 0 0 9 dig dig 0 real fraction float real fraction 0 9 dig QGreal fraction 0 dig 10 0 0 9 dig dig 0 10 0 Whitespace grammar construct whitespaces whitespace whitespace gt NrXnNEUSE wy fe TENTAK rant Goal symbol programs gt statement statement gt if expression xpl emit pcb gt prog XPL JPC 0 jpc statement 2 6 4 Implementing the compiler 53 54 UniCC Parser Generator else pcb prog program cnt xpl emit pcb gt prog XPL JMP O0 jmp statement GG jmp 1 jmp
128. n diene Fig 12 Syntax tree with association precedences As a comparison to the above correctly parsed syntax tree this is the version without any associativity and precedence weightings start start amp eof oe KAN SC integer 1 expr 4 I expr mem SSC Fig 13 Syntax tree without association precedences UniCC compile time warnings had been ignored The precedence directive is a special case directive It is the only kind of directive that can be used on production level precedence is used to assign a terminal s precedence to a production to let the production take a higher level of precedence in some conflicting situations This feature was adoptet from well known parser generators like bison Given a grammar that allows for both an ary and unary minus Correct precedence associaton has already been done for and left Teron left RE ee integer 0 9 i expression gt expression expression expression expression l expression expression expression expression expression integer Let this parser run with the expression 2 3 it will produce this incorrect parse tree 4 5 7 left right nonassoc precedence 93 UniCC Parser Generator expression expression amp eof expression expression icd en integer integer Fig 14 The incorrect syntax tree of 2 3 By using the precedence directive the production defining the unary minus can be
129. n process The general calling convention of the UniCC parser generator is unicc OPTIONS filename par This command line interface supports various combinable options to invoke modify and specialize the parser generation process or to trigger further tasks Option Long opuon Description name Runs UniCC to print all warnings that come up with the grammar UniCC normaly a all warnings supresses some warning messages that raise up during the parse table constructions according to their importance Defines the specified basename name to be used for the output file s instead of the b basename one derived by the prefix directive or by the name of the input filename This name name basename is used for all output files if the provided parser template causes the construction of multiple files Dumps an overview of the finally constructred grammar to stderr right before the G grammar arse tables are generated h help Prints a short overview about the command line options and exits V version Prints copyright and version information and exists Disables state optimization By default the resulting LALR 1 parse states are optimized during table construction by introducing a special SHIFT REDUCE action Las opt which combines a shift and reduction which is possible when the last symbol of a production is shifted Standard LALR 1 parsers only support SHIFT or REDUCE n
130. nalyzer start expr amp eof TUN expr caeci expr ae expr stent Gives Fig 9 Syntax tree of the string 18 2 using an insensitively constructed parser 3 6 2 Insensitive mode 69 70 UniCC Parser Generator 3 6 2 Insensitive mode 4 UniCC Grammar Definitions 4 1 Comments Commenting grammars is possible everywhere in the grammar code UniCC supports the C styled standard comment forms which are for block comments and for single line comments to ignore all the rest of a line Define goal symbol example PRINT expr IF expr THEN example Now we can do IFs START expr READ var We want to implement this later GOTO line GOSUB line ES It must be annotated that this commenting style is only possible within the grammar code itself Comments in semantic code blocks must be expressed in the commenting style of the respective target programming language that is used These comments will also be copied into the resulting parser program module 4 2 Escape sequences UniCC supports ANSI C styled escape sequences within any string or value that is used in the grammar or any directive The following table provides a listing of all available escape sequences Escape sequence Description Na Bell alert b Backspace Nf Formfeed An New line r Carriage return t Horizontal tab v Vertical tab Sing
131. nerator xpl emit pcb gt prog XPL JMP begin pcb gt prog gt program jpc param pcb gt prog gt program_cnt el But anonymous nonterminals also have their disadvantages They represent individual symbols that are only used in one position but can be made up of a fully fledged nonterminal definition with multiple productions and a return value This is also the reason why the if else statement is implemented a different way in the final XPL implementation Trying to get the drafted version with two productions for if with else and if without else driven this way will fail because the insertion of anonymous nonterminals into the existing productions removes the ambiguity of the grammar and causes the wrong production to be reduced This problem could also be resolved with the assistance of anonymous nonterminals by simply adding an optional else part to the if statement yielding in one production for the entire if and if else construct Please note that this change has its origin in the way how we want to produce our code only In other compiler implementations this could be solved much better and differently than here so the drafted grammar could also be used without anonymous nonterminals The final version of our grammar turning it into an XPL compiler is this one language gus Meta information parser XPL description eXample Programming Language copyright In the public domai
132. nerators or other related software can be used to build or analyze parsers according to their specialized area using UniCC as their parser generator In this and the following chapters we only rely UniCC s build in code generator Because the C parser template which is delivered with UniCC is the most stable and widely used one we implement parsers in this manual only in C You can easily adapt the knowledge from here to other languages when using other parser templates or code generators in combination with UniCC The examples we will deal with are hold simple so even those of you who have no experience in C will understand them with ease What is needed to run the examples in this manual is an installed C or C compiler e g gcc on Linux The general build steps for parsers implemented with UniCC are the following once your grammar is written and stored into a file Compile generate the grammar using UniCC into a parser Compile or run the generated parser with the particular compiler your parser is written for not required if your language is interpretered e Run the resulting program Using specialized code generation tools e g which use the parser definition file feature may use another approach in order they call UniCC or use a subsequent call to a generator which builds the output file for a particular programming language or solution Experienced programmers usually will put the above actions into a Makefile or other kind
133. ng this request is shown next fwhitespaces uoc lexeme integer title appointment date titl title date title ee unt date month integer integer integer integer integer integer integer integer month January February March April May June July August September October November December integer 0 9 integer 0 9 10 2 3 Grammar modularity UniCC Parser Generator You see that we have quickly extended the original grammar for simple dates to match a small language that allows for entering appointments with a date in several formats Maybe you already noticed some new elements used in this grammar e g the production title gt fmi premix nme This production is enabling a title that is enclosed as a string literal The terminal definition defines a terminal that exists of all possible characters instead of the double quotation mark which is specified by the prefixed exclamation mark and this with an automatic repetition of zero or multiple characters invoked by the postfixed star character The first of these two new symbols the exclamation mark is a negation which belongs to the character class it precedes Secondary the star is a modifying operator that causes a Kleene closure of zero or multiple of the preceding symbol in this case the negated quotation mark character class These modifier operators
134. ni End of definition Error resynchronization means that the error is handled by the grammar and its semantic actions rather than the parser In the above example the parser will in case of a parse error try to pop as many items off the parse stack until it finds the error resynchronization token amp error This is shifted then and the parser comes into a new valid state that is covered by the grammar so the production with the error state takes place The parse error had been recovered respective it is not a parse error anymore By modifiying the actions in the semantic action of a error state error messages can still be forced and errors counted as it is done in above example But the parser continues and will output correct values on wrong input e g 2 5 x unicc w error par cc o error error c echo n 2 5 x error s Error Expecting integer on line 1 column 5 102 4 7 Error recovery UniCC Parser Generator 1 errors uro il amp error can be used on several positions so it is also possible to report errors with it or to silently fix them and let the parser continue It must be mentioned that the way how parse errors and the error synchronization token is handled relies on the used target language parser template and is not a direct feature of the UniCC Parser Generator itself 4 7 Error recovery 103 UniCC Parser Generator 104 4 7 Error recovery 5 The Standard C Parser Templ
135. ntions in generated parser modules When one of these directives is specified multiple times their values will be glued together to one huge string parser myBASIC version 023095 description A simple BASIC compiler copyright C 2011 by BasicMan prefix mybasic All these values are only hold by UniCC and can be inserted into the parser template or will be written to the XML output file depending on the desired output code generator In the program code generator their values will expand with the template variables copyright description parser prefix and version 4 5 6 prologue epilogue The prologue and epilogue directives are used to define any program code that is inserted before and behind the parser implementation in the yielding parser module Some includes and variables into the prologue prologue include lt stdlib h gt include lt stdio h gt include lt string h gt static int variables 1000 static int next_var 0 el Defining the parser call in the epilogue epilogue int main prefix_pcb pcb memset amp pcb 0 sizeof pcb return prefix_parse amp pcb 90 4 5 5 copyright description parser prefix version UniCC Parser Generator In many UniCC parsers that had been extended to real compilers these two directives take much nearly the most content of the parser definition files because they contain th
136. o other productions or to form the goal symbol Every symbol both terminal and nonterminal in UniCC is always declared by its first use Its particular description or definition if required can be made elsewhere later in the grammar Symbols without a definition will be reported by UniCC and maybe stop the further compilation process of the grammar with a request for correction 4 4 1 Terminal symbols Terminal symbols terminals define a piece of raw input data that is read from the input stream They can be defined and specified in various ways A single character or a set of valid characters can form a terminal but also a static character sequence or character sequences that matches to a regular type 0 language defined via regular expressions All terminal symbols are identified by an lexical analyzer Every terminal symbol yields in a leaf within the resulting parse tree it has no children 4 4 1 1 Characters Character terminals also called character classes are the simplest definition type of a terminal symbol This kind of terminal symbol defines a character or a set of valid input characters A character class is specified by strings enclosed in single quotation marks and is defined by its use Once a character class terminal is used somewhere in the grammar or any directive it is defined and becomes known to UniCC To match only one individual character for example an a then a is the correct definition
137. o the reason why UniCC is used to compile its own grammar during its build Since the first line of UniCC was written one of its major design goals had been to be universal related to the parsers it outputs UniCC is not bounded to one or a special set of programming language a UniCC compiled parser can be written in UniCC is target language independent which means that the parsers it outputs could be implemented in any programming language a parser template or code generator is provided for UniCC provides two different parser generation target approaches The first approach is a static build in code generator working with a so called parser code template This code generator is a build in part of UniCC and allows to directly turn a UniCC grammar definition into a program written in the syntax of a particular programming language Because UniCC itself is written in C its build in code generator is also optimized to generate parsers in C similar procedural oriented languages Hence there is only one parser template for C parsers yet UniCC s build in code generator can only output C parsers for now But for an really target language independent approach UniCC provides a secondary way for generating a parser which is a parser description file This file is expressed in an XML language and includes much information on the grammar its symbols the parser and lexer states and the parser behavior itself Using this type of output third party code ge
138. on its own which should be recognized as an integer number by the grammar More about the use of regular expressions and their making up is covered in the UniCC reference guide 2 6 2 2 Overwriting production precedence The second new feature that is used here but has not covered yet is the production expression precedence This special precedence directive sets a higher precdecence level to the production to overwrite its default precedence that is associated with the substraction operator to the precedence of the multiplication operator This configuration is done to allow for an unary minus Expressions like 2 4 would be parsed as 2 4 if this configuration is not done a wrong result because substraction is below multiplication in the operator precedence order Setting the production precedence to the level of the multiplication operator or higher it is parsed the correct way as 2 4 2 6 2 3 Multiple left hand sides The last new feature used in this grammar is the nonterminal definition variable function gt identifier This definition associates one production with two nonterminals variable and function This construct allows to determine the final type of the nonterminal at runtime by using UniCC s feature of semantic nonterminal determination This feature allows to first check within the semantic code if an identifier is the name of a function or the name of a variable and then return the appropria
139. on will cause a regular expression terminal reading a comment that begins with and ends with The defines none or several characters of any kind In this special case UniCC switches the terminal QWHITESPACE to be non greedy due to the use of the placeholder Now using the input 3 Hello 4 World 5 One would expect that 3 4 and 5 are returned by the above grammar and the comments are ignored A non greedy configuration of WHITESPACE as it is automatically set by UniCC in this example would do this But a greedy configuration of WHITESPACE would only return 3 and 5 because a comment that is Hello 4 World will be identified This first is also recognized in greedy mode terminals but the lexical analyzer will scan further until it maybe gets a second To enable full control about if such regular expression based terminals are configured to be greedy or non greedy UniCC offers the directives greedy and nongreedy that had been mentioned above Changing above definition of WHITESPACE to be read as WHITESPACE t r n yxu Wx n greedy Would scan up the way in the second greedy case greedy and non greedy are attached right before the definition s end marker but behind possible semantic code Switchting terminals to be greedy or 4 4 1 3 Regular expressions 77 UniCC Parser Generator non greedy may cause entirely dif
140. or with Strings concatenation if prefer XPL STRINGVAL char SL str xpl malloc char NULL strlen xpl value get string op 0 strlen xpl value get string op 1 1 sizeof char sprintf str s s xpl value get string opl xpl value get string opl nC o 146 6 9 xpl run c 6 9 xpl run c UniCC Parser Generator val xpl_value_create_string str 0 else if prefer XPL_FLOATVAL val xpl_value_create_float xpl value get float op 0 xpl value get float op 1 else val xpl value create integer xpl value get integer op 0 xpl value get integer op 1 break case XPL_SUB Substraction if prefer XPL_FLOATVAL val xpl_value_create_float xpl value get float op 0 xpl value get float op 1 else val xpl value create integer xpl value get integer op 0 xpl value get integer op 1 break case XPL_MUL Multiplication if prefer XPL_FLOATVAL val xpl_value_create_float xpl value get float op 0 xpl value get float op 1 else val xpl value create integer xpl value get integer op 0 xpl value get integer op 1 break case XPL_DIV Division if prefer XPL_FLOATVAL val xpl_value_create_float xpl_value_get_float op 0 xpl value get float op 1 else val x
141. ot both operations at the same time When this option is used UniCC produces about 20 30 more LALR 1 states Dumps an overview about the finally produced productions and their semantic P production actions s stats Prints a statistics message to stderr when parser generation has entirely been finished a vw Dumps the generated LALR 1 states that had been generated during the parse table generation process t stdout Prints all generated output to stdout instead of files T symbols Dumps an overview of all used symbols Prints process messages about the specific tasks during parser generation process v verbose Also turns on s W warnings Print relevant warnings Triggers UniCC to run the parser description file generator additionally to the program module generator The parser description file generator outputs an x xml f XML based parser representation of the generated parse tables which can be used by third party code generators or grammar analsys and debugging tools X XML Triggers UniCC to only run the parser description file generator without running the rogram module generator Table 2 The UniCC command line interface options Errors and warnings are printed to STDERR any other kind of output to STDOUT 3 4 Invocation and command line options UniCC Parser Generator 3 5 Code generators Parser Template File tpl Parser Definition par Parse Table Generator
142. p eof expr FAS expr expr m expr expr PET expr expr integer 4 BE Ron NES expr expr integer 3 integer 1 integer 2 Fig 10 Syntax tree of input 1 2 3 4 in a left leaning configuration Now the opposite using the right directive mode insensitive 4 5 6 prologue epilogue 91 UniCC Parser Generator right LEI integer 0 9 start expr expr gt expr expr integer start start amp eof ee Ou Ser Er ie integer 1 expr expr Ec dm mee aieo 3 gineget 4 Fig 11 Syntax tree of input 1 2 3 4 in a right leaning configuration Multiple calls of these directives cause higher precedence leveling so it is necessary to perform multiple calls of left to build up a pecedence associativity matrix for e g expressions All symbols passed to the first left directive call get a precedence level of 1 all symbols passed to the second left call get a precedence level of 2 and so on mode insensitive nonassoc We left K ee left ax v Ey right ee right IA integer O 9 start expr expr expr expr expr expr expr expr expr expr expr expr expr expr expr expr integer 92 4 5 7 left right nonassoc precedence UniCC Parser Generator start start amp eof ep SU iu S expr expr expr expr atea ee
143. pe ret prefix_vtype test State int act int idx int lhs Lookahead int sym int old sym unsigned int len Input buffering UNICC SCHAR lexem UNICC CHAR buf UNICC CHAR bufend UNICC CHAR bufsize Lexical analysis UNICC CHAR next UNICC CHAR eof 108 5 7 Q Q prefix pcb parser control block UNICC_BOOLEAN Error handling int int unsigned int unsigned int UNICC_SYNTAXTREE Syntax tree prefix_syntree UniCC Parser Generator is_eof error_delay error_count line column syntax_tree endif User defined components pcb prefix_pcb Member Type Content act int The current action to perform shift reduce shift amp reduce buf UNICC CHAR Holds the current input buffer This input buffer holds all the characters necessary to identify the current input token bufend UNICC CHAR Pointer to the end of the input buffer for faster data appending operations bufsize UNICC CHAR The pointer to the last character of the input buffer This is used when input buffer reallocation is needed Contains the current column within the input beginning from the current column unsigned int line Line member variable First character is 1 Should be used for error reporting Defines the value of the end of file character This is 1 by default
144. pl value create integer xpl value get integer op 0 xpl value get integer op 1 147 UniCC Parser Generator break default float res Compare by substracting the left operand from the right operand or with the string comparion function strcmp resulting in res 0 equal res 0 not equal res lt 0 lower than res lt 0 lower equal res gt 0 greater than res gt 0 greater equal if prefer XPL_STRINGVAL res float strcmp xpl value get string op 0 xpl value get string op 1 else if prefer XPL FLOATVAL res xpl value get float op 0 xpl value get float op 1 res float xpl value get integer op 0 xpl value get integer op 1 Switch comparison switch rt ip gt op case XPL_EQU val xpl_value_create_integer Ires 1 0 break case XPL_NEQ val xpl_value_create_integer res 1 0 break case XPL_LOT val xpl_value_create_integer res 0 1 0 break case XPL LEO val xpl value create integer res lt 0 1 0 break case XPL GRT val xpl value create integer res gt 0 1 0 break case XPL GEO val xpl value create integer res gt 0 1 0 break 148 6 9 xpl run c UniCC Parser Generator Free the operands xpl value free op 0 xpl value free op 1 Push the operation or comparison res
145. ps A new or existing language can quickly be prototyped and tested This prototype then can be used to implement the parser that constructs an abstract syntax tree out of the given input which is in turn used for the further processing of this fetched information 2 Constructing parsers with UniCC 5 UniCC Parser Generator 2 2 Defining grammars The ways of how data is analyzed their syntax rules can be defined using a so called grammar the grammar of the language the parser accepts Every kind of computer language even simple text matching patterns can be defined in some kind of grammar Grammars for computer languages so called context free languages are expressed in a special notation which is called the Backus Naur Form or shortened just BNF It was invented in 1959 by John Backus and Peter Naur in the course of the ALGOL programming language Grammars expressed in Backus Naur Form exists of three fundamental elements The terminal symbols the nonterminals symbols and the productions Because all of these elements integrate together it is not possible to explain them separately Moreover lets first define what the purpose of these elements is Terminal symbols or simply called terminals are language atomics which are directly read from the input stream A terminal symbol can be a single character a character from a set of possible allowed characters a string sequence or a regular expression that matches a classifyin
146. r identifier A Za z It Extended identifier Gextidentifier A Za z A Za z0 9 A fictive example with sub expressions would be hello Hello t World A Za z A string value string rue QUK UI The order of the definition of terminals based on regular expression also indicates their level of specialization so specialized regular expressions shall be defined first in the grammar It is strongly recommended to describe all regular expressions before the first nonterminal definition is done to avoid unexpected behaviors which are having they origin in wrong definition orders in most cases Note that terminals based on regular expression can always replace a character terminal or string terminal Regular expression terminals are sorted behind string terminals but before character terminals in their specialization order Qabc abc example abc abc this production will never be matched It is obvious that terminals which are defined via regular expressions are a very powerful tool But they are also often the source for unexpected parser behaviors that let some grammar developers become desperate in many situations Especially in the sensitive parser construction mode regular expressions should really be handled with care 4 4 1 3 Regular expressions 75 UniCC Parser Generator 4 4 1 3 1 Semantic actions Terminals defined via regular expressions can also b
147. r 0 9 integer 10 2 Ss t0 si End of definition This will parse and calculate the correct values including all precedence relations 2 5 2 Precedences 25 UniCC Parser Generator 2 6 Implementing a programming language Now that we have learned to know how to implement simple expressions using UniCC it s time to implement an example of real programming language The programming language should be called eXample Programming Language shortened XPL and is a C styled language The language should match the following requirements A typeless language e Arithmetic and conditional expressions e Support of integer floating point and string values e Simple control structures conditionals iterations e Variable assignments Nested calls to build in functions with variable arguments The functions should provide simple data manipulation routines and input output facilities We first don t care about how to execute this programming language First it is only a draft for a syntax describing the language grammar A simple example to rudely define the syntax of XPL shall be the 99 bottles of beer programming example if bottles prompt Enter number of bottles default 99 bottles 99 if integer bottles lt 0 print Sorry but the input bottles is invalid exit 1 while bottles gt 0 if bottles gt 1 print bottles bottles of be
148. re and is the result for the request to an lightweight simple but flexible and platform independent parser development system UniCC is the base platform and compiler for many other compiler related projects launched by Phorward Software Technologies and will hopefully also be used by many other interesting software projects in future To get more information on Phorward Software Technologies and our projects and products visit http www phorward software com on the web 2 1 2 The intention behind UniCC UniCC Parser Generator 1 3 A few words on this manual Parsing and compiler construction is one of the most complex on the first view impenetrable looking but even fascinating and challenging topic of computer sciences The knowledge and experience teached by this topic can be integrated into many software development issues of a programmer s everyday life and opens new possibilities and faster ways to success Some readers of this manual would have already taken some experience on this complex topic maybe from a study on computer sciences a business related project that used some kind of programming or definition language or a private approach of writing a compiler for an domain specific programming task Rather others don t have any knowledge yet but want to know how things work or require knowledge for an upcoming project The UniCC User Manual is as it stands for a user manual for the UniCC parser generator and not a general
149. re contains information about the symbol its semantic value and informations about state and position in the input typedef struct prefix_vtype value prefix_syminfo symbol int state unsigned int line unsigned int column prefix_tok Member Type Content column unsigned int The column of the symbol occurence in the input id int The symbol id which refers to the stack symbol Every symbol terminal or nonterminal has its own id line unsigned int The line of the symbol occurence in the input Pointer to the symbol entry in the parser symbol table This symbol table T ab Poe contains the name and additional informations about the symbol state int The state that was current when the symbol was pushed value prefix vtype The semantic value that is associated with the symbol Table 11 Member variables of prefix tok 5 8 Additional data structures 111 UniCC Parser Generator 5 8 3 prefix_syminfo symbol Information Table Parsers constructed with the UniCC Standard C Parser Template contain a symbol table holding information about the grammatical symbols used in the parser This symbol table is used by the parser itself to get additional informations about symbols e g its type or if its configured as whitespace For debug purposes this table also contains the original name of the symbol as defined in UniCC The grammar symbol table is a structure of t
150. reducing by product after reduction shift lt lt reducing by product after reduction shift current token 7 0 9 sym 7 0 9 len 1 Stack Dump 0 2 amp whit gt gt shifting terminal 7 lt lt reducing by product after reduction shift current token 3 sym 3 len 1 Stack Dump 0 2 amp whit lt lt reducing by product after reduction shift lt lt reducing by product after reduction shift lt lt reducing by product after reduction shift current token 3 sym 3 len 1 Stack Dump 0 2 amp whit gt gt shifting terminal 3 current token 7 0 9 sym 7 0 9 len 1 Stack Dump 0 2 amp whit lt lt reducing by product after reduction shift lt lt reducing by product after reduction shift current token 7 0 9 sym 7 0 9 len 1 Stack Dump 0 2 amp whit gt gt shifting terminal 7 lt lt reducing by product after reduction shift current token 8 amp eof sym 8 amp eof len Stack Dump 0 2 amp whit lt lt reducing by product after reduction shift lt lt reducing by product after reduction shift UNICC_DEBUG UniCC Parser Generator 0 9 ion 7 integer gt ing nonterminal 9 0 9 integer tos state 3 act reduce idx 13 espace 3 integer ion 13 amp whitespace gt ing nonterminal 14 amp whitespace ion 20 integer gt integer amp whitespace ing nonter
151. ression ion 13 amp whitespace gt ing nonterminal 14 amp whitespace ion 20 integer gt integer amp whitespace ing nonterminal 21 integer 9 13 13 3 integer 123 UniCC Parser Generator debug lt lt reducing by production 6 expression gt integer debug after reduction shifting nonterminal 11 expression debug current token 8 amp eof debug sym 8 amp eof len 0 tos gt state 18 act reduce idx 3 debug Stack Dump 0 2 amp whitespace 4 expression 11 16 expression 13 18 expression debug lt lt reducing by production 3 expression expression expression debug after reduction shifting nonterminal 11 expression debug current token 8 amp eof debug sym 8 amp eof len 0 tos gt state 16 act reduce idx 1 debug Stack Dump 0 2 amp whitespace 4 expression 11 16 expression debug lt lt reducing by production 1 expression gt expression expression debug after reduction shifting nonterminal 11 expression debug current token 8 amp eof debug sym 8 amp eof len 0 tos gt state 4 act shift idx 0 debug Stack Dump 0 2 amp whitespace 4 expression debug gt gt shifting terminal 8 amp eof debug lt lt reducing by production 0 calculator gt expression amp eof debug after reduction shifting nonterminal 10 calculator debug lt lt
152. ription eXample Programming Language copyright In the public domain 2011 version T S25 prefix zpi default action G1 default epsilon action 0 Precedence and associativity left WW left Wi 1 n lt S LSC Vet left Letz P o left E MIR Regular expressions string lt char gt Kito E s EUIS D DH identifier lt char gt A Za z A Za z0 9 Hz Lexemes lexeme real fraction real lt float gt real_integer real_integer Greal integer real fraction Qreal fraction real integer float real integer 0 9 dig 128 GQreal integer QGreal fraction 6 2 xpl par real_fraction lt float gt gt UniCC Parser Generator 10 real integer dig 0 0 9 dig dig 0 real fraction 0 9 dig Greal fraction Di dig 10 0 Q 9 sag dig 0 10 0 Whitespace grammar construct whitespaces whitespace whitespace Goal symbol program gt statement gt 6 2 xpl par UNPANNGIE wN LINTI CNTT statement if expression xpl emit pcb gt prog XPL JPC 0 SI jpe statement else pcb prog program cnt xpl emit pcb gt prog XPL JMP O0 A jmp statement jmp GG 1 jmp p if
153. rser construction mode always uses one single lexical analyzer that identifies terminal symbols The difference to the sensitive mode is that lesser states are produced because the grammar is not rewritten and whitespace is directly absorbed within the stage of lexical analysis Overlapping character classes can not be used in this mode This construction mode can be compared to most other parser generators like the one used by the combination of lex and yacc It depends on the requirements of the grammar which construction mode should be used The specialities on the two construction methods are described below The construction mode can be changed with the mode top level directive 3 6 1 Sensitive mode The whitespace sensitive parser construction mode gives a maximum of flexibility on whitespace and lexeme construction and their behavior and is the default if no other construction mode is wanted It is a UniCC speciality that was never provided by any other parser generator before in this way The most common characteristic of this construction mode is that UniCC entirely rewrites the grammar according to whitespace and lexeme definitions to make whitespace only valid in selected situations The definition of whitespace is not limited to one terminal symbol anymore in this mode The symbol defining whitespace can be a nonterminal described as part of the context free grammar itself without any limitations An optional call to this nonterm
154. rst this approach was chosen here All possibilites and facilities on lexical analysis will be introduced later on The advantage of this approach is that full LALR power is available on this character based parsing method so even recursive lexemes can be parsed as true context free grammars not only regular ones Terminal symbol based on regular expressions are even 2 5 1 Using semantic actions 17 UniCC Parser Generator possible in UniCC but this will be discussed later on 2 5 2 Precedences In the last chapters we did some experience on how grammars are written and how a simple parser is furnished and attributed with semantic actions A real compiler compiling a programming language or similar into another representation e g assembly code is nothing else than this but much more effort in writing the grammar and especially the semantic code is required To build grammars for such higher and complex targets you need experience time and the most important thing Patience We will now write a grammar that parses mathematical expressions a four function calculator Nearly every high level programming language supports mathematical expressions so why not to begin here Mathematical expressions have special demands to their grammar Multiplicative operations multiplication and division take precedence over additive operations addition and subtraction but this order can be broken by bracketing terms to take same precedenc
155. s within another production before this production is reduced For this special kind of situation UniCC provides the feature of anonymous nonterminals which can be defined right in place where they are required An anonymous nonterminal is stated on a right hand side by surrounding its productions with brackets and The nonterminal has no naming or alias an automatically generated name will be assigned by UniCC Within the brackets there is the ordinary way of defining productions separated by the already known separator pipe A data type preceding the brackets allows to define a data type for the anonymous nonterminal var type gt lt BOOLEAN gt default TRUE This is empty FALSE default Gident printf var s s n Gdefault default ident Within the semantic action code of productions defined within anonymous nonterminals access to the semantic values of all the right hand side items in front of the appearance of the anonymous nonterminal is possible The values are intermixed with the semantic values defined within the anonymous production A real life example is this one to stack up a value temporarily within a right hand side It defines an anonymous nonterminal with only one empty production and a reduction code procedure_def gt def ident funcname Need to stack the current_depth value lt int gt This is an
156. sers can recursively be called using an extendible parser control block pcb e Wide character and UTF 8 unicode input support Trace and stack trace facilities Build in error recovery Build in syntax tree generator e Symbol and production tables for debug and syntax tree construction Provides a default parser test environment if no semantic code is given Dynamic end of file behavior 5 3 License The UniCC Standard C Parser Template is licensed under the BSD open source license The BSD license differs from the Artistic License that is used by UniCC itself but it is compatible with it This decision has been done to make the use of the parser template and its derivative works every parser generated using this template is a derivate of the original template also possible in other combinations including proprietary software The license terms are printed on top of the UniCC Standard C Parser Template Cd 5 The Standard C Parser Template C tlt 105 UniCC Parser Generator 5 4 Contributions Any contributions to the UniCC standard parser template are welcome Contributions can be send to us so we will integrate them into the distribution version of UniCC and the standard template if they are necessary and useful for everyone 5 5 Use and configuration As the C standard parser template has been developed simultaneously with UniCC itself and results in UniCC s own grammar parser it is the best integrated parser template
157. sh duplicate xpl_push amp rt xpl_value_dup rt variables rt ip gt param break case XPL_STO Store value to variable if rt variables rt ip gt param xpl value free rt variables rt ip gt param rt variables rt ip gt param xpl pop amp rt break case XPL DUP Duplicate stack item val xpl pop amp rt xpl push amp rt val xpl push amp rt xpl value dup val break 6 9 xpl run c 145 UniCC Parser Generator case XPL_DRP Drop stack item xpl value free xpl_pop amp rt break case XPL JMP Jump to address rt ip prog gt program rt ip gt param continue case XPL JPC Jump to address only if stacked value is nonzero if xpl value get integer val xpl pop amp rt xpl value free val rt ip prog gt program rt ip gt param continue xpl_value_free val break default xpl_datatype prefer xpl_value op 2 Pop operands off the stack op 1 xpl pop amp rt op 0 xpl pop amp rt Get best matching type for operation from both operands Ef if op 0 gt type XPL STRINGVAL op 1 gt type XPL STRINGVAL prefer XPL STRINGVAL else if op 0 gt type XPL FLOATVAL op 1 gt type XPL FLOATVAL prefer XPL FLOATVAL else prefer XPL INTEGERVAL switch rt ip gt op case XPL ADD Addition
158. sion AVG GS ee identifier 6 2 xpl par UniCC Parser Generator if xpl_get_function identifier gt 0 symbol function else xpl get variable pcb gt prog identifier Parser Control Block pcb xpl_program prog FILE input Prologue amp Epilogue prologue finclude xpl h extern xpl_fn xpl buildin functions define UNICC GETINPUT fgetc pcb gt input epilogue int xpl compile xpl program prog FILE input prefix_pcb pcb memset amp pcb 0 sizeof prefix_pcb pcb prog prog pcb input input pcb eof EOF prefix_parse amp pcb return pcb error count 6 2 xpl par 133 UniCC Parser Generator 6 3 xpl h ifndef XPL_H define XPL_H lt stdlib h gt lt stdio h gt lt string h gt lt locale h gt include include include include Virtual machine values typedef enum XPL_NULLVAL XPL_INTEGERVAL XPL_FLOATVAL XPL_STRINGVAL xpl datatype typedef struct xpl datatype type union int i float char value S char strval xpl_value Functions typedef struct char int int xpl value xpl fn name min max fn int xpl value Virtual machine opcodes typedef enum XPL NOP x x D XM XM PX X X U 1 EQU PL MEO P
159. sion GG Q1 Q3 expression expression GG Q1 Q3 expression expression G1 Q3 expression Q2 integer integer gt 10 9 Dk BR Ql Ho 0 integer 0 9 integer 10 Q2 0 End of definition and compiling this with 5 11 main the build in main function 117 UniCC Parser Generator unicc w expression par CC o expression expression c the parser test mode can be invoked with expression 1 If 1 is not sumbitted a parse error occurs when input is read via stdin from keyboard and the enter key is pressed Writing the expression to be parsed to a file will be consumed without 1 echo n 1 2 3 gt test expr expression test expr or simply echo n 1 2 3 expression 118 5 11 main the build in main function UniCC Parser Generator 5 12 prefix_parse the parser invocation function The inital parser invocation function of the C standard parser template is called prefix_parse where prefix will be replaced by the prefix specified at the prefix directive If no prefix is given the function is simply called parse If the parser function is called from within a function defined in the parser definition file itself it is still possible to write prefix_parse also in the semantic code The prefix variable will be replaced with its correct content by the code generator when the code is build
160. sion terminals may be the source for memory leaks if they are used faulty Depending on the used parser target language template and the parser mode the semantic part of a regular expression terminal can be executed multiple times to allow for semantic symbol selections but also to extract semantic values from the input as shown above For such cases every parser template sets some variables or pre processor directives to turn off areas within the semantic code which may cause memory problems The documentation of the particular UniCC parser target language template should handle this topic 76 4 4 1 3 Regular expressions UniCC Parser Generator 4 4 1 3 2 greedy non greedy Defining greedyness Terminal symbols based on regular expressions can be defined to work greedy or non greedy relating to their behavior during the process of lexical analysis UniCC automatically defines a terminal that uses the character class for any character as non greedy The greedyness can also be overwritten and explicitly defined using the terminal symbol configuration directives greedy and non greedy To describe the problem the following grammar is used mode insensitive language C whitespaces WHITESPACE WHITESPACE NCAA n on Q Wa fn int r0 printf int s n gt zl start gt int There is a whitespace definition for comments reading nien x xm This definiti
161. ss or a node defining a conditional statement and its nested structure is clearly defined Such an AST representation could already be used for a direct interpretation of the input In C for example such an abstact syntax tree could be expressed as a huge union data structure providing several structures for every language element There could be one structure describing a value one structure describing an assignment one structure describing an IF construct and so on The result is a tree structure that is made up of many of such nodes links and leafs that can be easily interpretered by a recursively executed virtual machine Althought this method is very simple to implement it still requires a lot of coding efforts and saving this tree to a file is not possible without the use of some nested structuring The benefit of this method is that no backpatching is required because the structure is a logical tree The generation of intermediate code could also be done from an abstract syntax tree but it can also directly be done out of the parser in simple languages where the parser serves as the immediate representation of the abstract syntax tree during input recognition Intermediate code in turn can be established on various paradigms 3 address code is very useful to generate optimized code for register machines 1 address code can be used for stack based machines like the Java Virtual Machine This is also the approach we want to implement
162. stdin fclose src return 0 6 7 xpl main c 141 UniCC Parser Generator 6 8 xpl program c finclude xpl h Program structure handling int int int 142 xpl_get_variable xpl_program prog char name int Le A function name with the same identifier may not exist if xpl get function name gt 1 return 1 Try to find variable index for i 0 i lt prog variables cnt i if strcmp prog gt variables i name 0 return i Else eventually allocate memory for new variables if i XPL MALLOCSTEP prog gt variables char xpl malloc char prog gt variables i XPL MALLOCSTEP sizeof char Introduce new variable prog gt variables prog gt variables_cnt xpl_strdup name return prog gt variables_cnt xpl_get_literal xpl_program prog xpl_value val Else eventually allocate memory for new variables if prog literals cnt XPL MALLOCSTEP prog gt literals xpl_value xpl_malloc char prog gt literals prog gt literals_cnt XPL_MALLOCSTEP sizeof xpl_value prog gt literals prog gt literals_cnt val return prog literals cnt xpl emit xpl program prog xpl op op int param if prog program cnt XPL MALLOCSTEP 0 prog gt program xpl cmd xpl malloc char prog gt program prog gt progr
163. sulting in smaller parse tables and faster parsers UniCC is technically a target language independent parser generator which means that it is not bound to a special programming language Currently only support of the C programming language is given due the UniCC Standard C Parser Template More standard templates for other languages like C and Java will be developed and released in future by Phorward Software Technologies but are not available for now Next to the generation of parser program modules in a particular programming language UniCC also offers the possibility to export all the grammar information its parse tables semantic actions and settings extracted from a UniCC parser definition into a target independent XML based output format that can be handled by any individual subsequent tasks or module specialized code generators analyzers or direct interpreters 1 Introducing the UniCC parser generator 1 UniCC Parser Generator 1 2 The intention behind UniCC The UniCC Parser Generator has its origin at Phorward Software Technologies and is initially written and maintained by its 1985 born core developer Jan Max Meyer Phorward Software Technologies is a software company from Dortmund Germany which focuses its business on compiler construction scripting languages and software development tools both in the area of open source and commercial solutions UniCC designates to be as one of the flagship products of Phorward Softwa
164. t static void xpl_stack xpl_runtime rt int i for i 0 i lt rt gt stack_cnt i fprintf stderr 3d s n i xpl value get string rt gt stack i D if i fprintf stderr Stack is empty n void xpl run xpl program prog xpl_runtime rt xpl_value val int qs Initialize runtime memset amp rt 0 sizeof xpl runtime rt variables xpl value xpl malloc char NULL prog variables cnt sizeof xpl value rt ip prog gt program Program execution loop while rt ip lt prog gt program prog gt program_cnt fprintf stderr IP p n rt ip xpl dump prog amp rt xpl stack amp rt ZA switch rt ip gt op case XPL_NOP No nothing break case XPL_CAL Calling build in functions int param_cnt Last stack item contains the number of parameters val xpl_pop amp rt param_cnt xpl_value_get_integer val xpl_value_free val Call the function val xpl buildin functions rt ip gt param fn param cnt rt stack rt stack cnt param cnt If no value is returned create a default value if val 44 2 6 3 6 Implementing the virtual machine UniCC Parser Generator val xpl value create integer 0 Discard the parameters from stack while param cnt gt 0 xpl value free xpl pop amp rt param cnt Push the re
165. te nonterminal symbol This feature could be 30 2 6 2 1 Terminals based on regular expressions UniCC Parser Generator seen as a dynamic change of the way how the grammar is recognized at runtime 2 6 2 4 The dangling else problem When compiling this grammar we will be concerned by this shift reduce conflict warning unicc warning state 81 Shift reduce conflict on lookahead else 22 statement gt if expression statement Seef jif while T 23 statement gt if expression statement else statement This problem is called dangling else and occurs in many LALR LR based grammars The nature of this problem can be obtained from the following pseudo code examples if expression statement else statement The above syntax is clear and not ambigous But what happens when we want to submit if expression if expression statement else statement Now the else is visually part of the first if but the parser would take it as the else part of the second if Both ways could be valid UniCC notices this situation that the grammar is ambigous so the above warning is printed This problem is also found in grammars for wide spread languages like C or Java The solution is to put the second if into a new statement block and everything is clear The warning can be ignored in this case if expression if expression statement else statement Except of this warning t
166. tents 4 UniCC Grammar Definitions 4 4 1 1 UI M Q 43 KAN SINE SOHUSHOOS een cine tie pertes te bats ocupado tip edt epp RE Per ads 74 44 173 Regulat expresStOUR na a eu 74 44 14 T munal dnominli s sii een 78 4 4 2 Nonterminal Symbols titer bit tnter A kn Ree ERE SEE HER ERE EAE EYE AERE du bn da 80 4 4 2 1 Te soSESHDDOD uhi pU OD EDI Gne Se EUR RE 80 4 4 2 2 Semantic nenterminalidetermination uses aa 81 2 53 Productions E 82 4 4 3 1 Semantic actions 82 4 4 32 Virtual producttOfis auis aerea saline ebenen ieh en 85 44 3 3 Anonymous nontermind ls ua 86 WR 88 4 5 1 Ii 88 GENEE 88 9 5 3 Scase LTE eege 89 4 5 4 default action default epsilon action default value pe 89 4 5 5 copyright description parser ffprefix version scc a aaa 90 4 5 5 iprolgene ele E 90 2 5 7 left Yright Enonasspc presedenee ae 91 GE 95 4 5 8 1 dFwhitespaces in sensitive mode 95 4 3 8 2 TTwhirtespaces in le E 96 Eege 96 4 3 10 d GUNG separi NON e ET 100 MU ICI S 100 Enn TE 100 GA Special Svmbuls cts ces vars vt estan vito rer u ee ens rire Ed cine pea i cs ee 101 qi EHOLTEROVER pde n pH Rides ME NI P dum EI MM Md MM UU 102 5 The Standard C Parser Template Ci en 105 MN Re EE 105 9 2 Fester M HT 105 3 94 LICENSE T
167. tes because UniCC detects an ambigous grammar in both lexeme unicc warning state 9 Shift reduce conflict on lookahead A Z a z 5 ident gt A Z a z ident 6 ident gt A Z a z n A Z a z 0 9 unicc warning state 10 Shift reduce conflict on lookahead 0 9 7 integer gt 0 9 integer 8 integer gt 0 9 rr ni A Z a z 0 9 By using lexeme separation on the warnings will be supressed and UniCC always selects the way that input for an lexeme is consumed as much as possible 4 5 11 fixate The 4 ixate directive can only be used within the sensitive parser construction mode It supresses shift reduce warnings that may raise up for some nonterminals caused by the grammar construction and always prefers to shift fixate expects a list of nonterminal symbols to be configured as fixed 4 5 12 reserve terminals The reserve terminals directive switches if terminal anomaly detection is performed by UniCC or not If reserve terminals is switched on UniCC assumes that terminals based on regular expressions take higher precedence in all places and does not perform terminal anomaly detection It is a boolean directive and can be switched on or of f Calling it without boolean parameter switches reserve terminals as on More about terminal anomalies can be found in the section terminal anomalies The directive can only be used in the sensitive parser construction mode
168. textbook on compiler construction So this manual immediately starts into the topic of practical parsing with the assistance of examples and the learning by doing principle Deep knowledge on what s going on behind the grammars how the parser internally works in detail and how it is constructed is not required or even covered here But if there s more interest in these topics e g language theories parsing concepts machine code generation and their optimization or if generally deeper information on the topic of parsing and compiler construction is wanted it is heavily recommended to read some adequate textbooks covering all these topics This manual does only focus on the usage of UniCC itself and how parsers are implemented with it Altought the UniCC parser generator comes with entirely target language independent facilities all programming related examples in this manual are in the C programming language The C standard template for the UniCC program module generator which is delivered with the UniCC program package is currently the only well tested and proven parser framework for UniCC so far In future other implementation languages and frameworks will be made available by Phorward Software Technologies or by third parties The UniCC User Manual is divided into four sections The first section contains a quick start guide into parser development in assistance with UniCC It is written in the style of a tutorial and is advised to quickly
169. the decrementation of variable i At address 010 a JMP operation is performed to begin at address 000 again compare the content of variable i greater than 0 This is the way how an iteration is performed within a 1 address code machine language Our compiler can only emit code sequencially and looking at the production for statement while statement gt while expression statement the code for the JPC call must be inserted already after the expression is parsed Here the technique of backpatching is applied in combination with another feature of UniCC the anonymous nonterminals Anonymous nonterminals provide a way to insert embedded grammar structures right in place into an existing 50 2 6 4 Implementing the compiler UniCC Parser Generator production including semantic actions We are able to insert anonymous nonterminals to perform semantic code generation within an unfinished production Exemplary this is done with the following syntax statement gt while expression generate jpc code statement backpatch the jpc call which is the same as we add a normal nonterminal with an empty right hand side and use it only once ANONYMOUS gt generate jpc code statement gt while expression ANONYMOUS statement backpatch the jpc call Anonymous nonterminals make it possible to define the entire nonterminal with all its productions on the right hand side of anot
170. the input August 17 2008 simplyfied 2 2 Defining grammars 9 UniCC Parser Generator 2 3 Grammar modularity Grammars are modular They can simply be extended and rewritten by re using already defined grammatical constructs as part of other grammatical constructs Due the build up that one nonterminal can be derived from one or multiple productions grammars are defined in a modular structure where existing elements can be replaced or enhanced without rewriting the whole grammar This will be demonstrated by the next example What about extending the appointment assistant to even accept other date formats e g the format lt Day gt lt Month gt lt Year gt and lt Month gt lt Day gt lt Year gt In this case we can reuse the nonterminals of the above grammar and extend the grammar to fit our requirements simply by adding some productions Extending nonterminal date will tune up our grammar to recognize more than one date format dates month integer integer integer integer integer integer integer integer Fine And now we want to extend our appointment assistant grammar to allow for adding an appointment message using the format Date lt Title gt or why noteven Title Date Here the grammar s goal symbol which was date until now must be replace with another goal symbol the appointment which is a more meaningful goal for the input we want match The final grammar fulfilli
171. the production is reduced uncompleted upper lying production calling the current production s nonterminal it is associated with Here this value can be accessed again by a reduction action to compute or output a result from it All values stored to nonterminals are written there by reduction actions The atomic values from the input stream the terminals are the base for these values and obtain their values directly from the input or an lexical analyzer which is introduced later In the standard C parser template delivered with UniCC every character terminal gets the character code of the character it matches in the input Therefore the semantic value that is constructed using nonterminal integer is integer gt 0 9 Gs Rb en EE CA integer 0 9 Q1 10 2 O to result in a true decimal number that is stored into memory as an integer data type This looks a little bit tricky for those who are not familiar with C For the parse of the number 17 for example the first scanned digit which can be a digit between 0 and 9 in the reduction code of the first production only exists in its character coded form from the input which is code 0x31 decimal 049 for the digit 1 in the ASCII character map To easily get an integer number 1 from this coded representation of the character we have to subtract the value of the character code of digit 0 which is 0x30 decimal 048 so the operation 0x31 0x30 returns the corre
172. tion of the grammar More about this topic can be read in chapter about the whitespaces directive 4 6 Special symbols Table 8 Special symbols in UniCC parsers 101 UniCC Parser Generator 4 7 Error recovery Requirements to modern parsers and compilers expect that a parse error on wrong input don t causes a final stop of the parser or compiler As many errors as possible should be reported to the user before a correction of the input and a re compilation process is invoked UniCC parsers normally stop on invalid input respective a parse error To get a parse error resynchronized and continue parsing the special symbol amp error can be used within the grammar Some grammar related directives language TET whitespaces NEN default action 1 left ra left TI ng Defining the grammar calculators expression printf Sd n expression printf d errors n pcb error count expression gt expression expression GG Q1 03 expression expression Q1 Q3 expression expression GG Q1 Q3 expression expression G1 3 expression Q2 integer amp error A printi Error P Expecting integer on line d column d n pcb gt line pcb column pcb error count t RR 0 pil integer 59st Q1 0 integer 0 9 integer 10 Pe
173. turn value xpl push amp rt val break case XPL_LIT Load literal and push duplicate xpl_push amp rt xpl_value_dup prog gt literals rt ip gt param break case XPL_LOD Load value from variable and push duplicate xpl_push amp rt xpl_value_dup rt variables rt ip gt param break case XPL STO Store value to variable if rt variables rt ip gt param xpl value free rt variables rt ip gt param rt variables rt ip gt param xpl pop amp rt break case XPL DUP Duplicate stack item val xpl pop amp rt xpl push amp rt val xpl push amp rt xpl value dup val break case XPL DRP Drop stack item xpl value free xpl pop amp rt break case XPL JMP Jump to address rt ip prog gt program rt ip gt param continue case XPL_JPC Jump to address only if stacked value is nonzero if xpl value get integer val xpl pop amp rt xpl value free val rt ip prog gt program rt ip gt param continue xpl_value_free val break 2 6 3 6 Implementing the virtual machine 45 46 UniCC Parser Generator default xpl_datatype prefer xpl value op ES A Pop operands off the stack op 1 xpl pop amp rt op 0 xpl pop amp rt Get best matching type for operation from both operands X if op 0 gt type XPL STRINGVAL op 1
174. uding generated and virtual productions They expect a code block as parameter default action ea 01 default epsilon action 0 example gt int here the default action will be used float printf f n zl Not here here default epsilon action is inserted default actionand default epsilon action can only be defined once subsequent calls will produce a warning and are ignored The default value type directives allows to define a default value type that is used for every symbol if no individual value type is specified The parser templates for the standard UniCC code generator and maybe subsequent code generators working on the XML output of UniCC may provide their own default value types This directive allows to overide this default by explicitly specifying a value type triggered as default default value type float 4 5 2 language 89 UniCC Parser Generator 4 5 5 copyright description parser prefix version These are the simplest parser directives which only have the purpose to store an informal string value parser description copyright and version should be used to name and describe the parser and its version implemented by the grammar prefix defines a prefix value that can be inserted in function or variable identifiers within the generated parser to allow for various parsers in one input source file or to meet specific symbol naming conve
175. ult xpl push amp rt val break Increment instruction pointer rt iptt Clear stack if code was clearly generated this would not be required for i 0 i rt stack cnt i xpl value free rt stack i xpl free char rt stack Clear variables for i 0 i lt prog variables cnt i xpl_value_free rt variables i xpl free char rt variables 6 9 xpl run c 149 UniCC Parser Generator 6 10 xpl util c finclude xpl h Function for memory allocation char xpl_malloc char oldptr int size char retptr if oldptr retptr char realloc oldptr size else retptr char malloc size if retptr fprintf stderr s d Fatal error XPL ran out of memory n FILE_ LINE exit 1 if oldptr memset retptr 0 size return retptr char xpl strdup char str char nstr nstr xpl malloc char NULL strlen str 1 sizeof char strcpy nst SCH return nstr char xpl_free char ptr if ptr return char NULL free ptr return char NULL 150 6 10 xpl util c UniCC Parser Generator 6 11 xpl value c finclude xpl h Value Objects xpl_value xpl_value_create void return xpl value xpl malloc char NULL sizeof xpl value xpl value xpl value create integer int i xpl_value val val xpl_value_
176. urn xpl_value_create_string 1 xpl_value XPL_exit int argc xpl_value argv xpl_ xpl_ xpl_ 140 int rc 0 if argc gt 0 re exit return value return value return value return xpl value get integer argv 0 rc xpl_value NULL XPL_integer int argc xpl_value argv i i xpl_value_create_integer xpl_value_get_integer argv XPL_float int argc xpl_value argv xpl_value_create_float xpl_value_get_float argv XPL_string int argc xpl_value argv xpl_value_create_string xpl_value_get_string argv i 6 6 xpl functions c UniCC Parser Generator 6 7 xpl main c finclude xpl h int main int argc char argv xpl_program program FILE src Initialize program structure to all zero memset amp program 0 sizeof xpl program Open input file if provided if argc gt 1 if sre fopen argv 1 rb fprintf stderr Can t open file s n argv 1 return 1 else printf Usage s FILENAME n argv return 1 Call the compiler if xpl_compile amp program src gt 0 return 1 Dump program before execution xpl_dump amp program xpl_runtime NULL Run program xpl_run amp program Clean up used memory xpl_reset amp program if sre
177. val gt type XPL STRINGVAL val gt value s s int xpl_value_get_integer xpl value val switch val gt type case XPL_INTEGERVAL return val value i case XPL FLOATVAL return int val gt value f case XPL STRINGVAL return atoi val value s default break return 0 float xpl value get float xpl value val switch val gt type 152 6 11 xpl value c UniCC Parser Generator case XPL_INTEGERVAL return float val gt value i case XPL_FLOATVAL return val value f case XPL STRINGVAL return float atof val gt value s default break return 0 0 char xpl value get string xpl value val char buf 129 LJ char p val gt strval xpl free val gt strval switch val gt type case XPL INTEGERVAL sprintf buf d val value i val gt strval xpl strdup buf return val gt strval case XPL FLOATVAL sprintf buf Sf val value f Remove trailing zeros to make values look nicer for p buf strlen buf 1 p gt buf p if p kp 0 break else if p 0O break kp 0 val gt strval xpl strdup buf return val gt strval case XPL STRINGVAL return val value s default break return i 6 11 xpl value c 153 UniCC Parser Generator 154 6 11 xpl value c 7 Appendix Il The UniCC Document Type Def
178. value_create_integer xpl value get integer op 0 xpl value get integer op 1 break case XPL DIV Division if prefer XPL_FLOATVAL val xpl value create float xpl value get float op 0 xpl value get float op 1 else val xpl_value_create_integer xpl_value_get_integer op 0 xpl_value_get_integer op 1 break default float res Compare by substracting the left operand from the right operand or with the string comparion function strcmp resulting in res 0 equal res 0 not equal res 0 lower than res lt 0 lower equal res gt 0 greater than res 0 greater equal if prefer XPL_STRINGVAL res float strcmp xpl_value_get_string op 0 2 6 3 6 Implementing the virtual machine 47 UniCC Parser Generator xpl_value_get_string op 1 else if prefer XPL_FLOATVAL res xpl_value_get_float op 0 xpl value get float op 1 res float xpl value get integer op 0 xpl value get integer op 1 Switch comparison switch rt ip gt op case XPL_EQU val xpl_value_create_integer tres 1 0 break case XPL_NEQ val xpl_value_create_integer res 1 s 0j break case XPL LOT val xpl value create integer res SE OQ A break case XPL LEO val xpl value create integer res lt 0 1 0 br
179. ved from 1 defined at 4 gt lt symbol type non terminal id 5 name start gt lt symbols gt lt productions gt lt production id 0 length 2 defined at 4 gt lt left hand side symbol id 3 offset 0 gt lt right hand side symbol id 0 offset 0 named Hello gt lt right hand side symbol id 4 offset 1 named World gt lt production gt production id 1 length 2 gt lt left hand side symbol id 4 offset 0 gt lt right hand side symbol id 4 offset 0 named World gt lt right hand side symbol id 1 offset 1 named World gt lt production gt lt production id 2 length 1 gt lt left hand side symbol id 4 offset 0 gt lt right hand side symbol id 1 offset 0 named World gt lt production gt lt production id 3 length 2 gt lt left hand side symbol id 5 offset 0 gt lt right hand side symbol id 3 offset 0 named start gt lt right hand side symbol id 2 offset 1 gt lt production gt lt productions gt lt states gt lt state id 0 gt lt shift symbol id 0 to state 2 gt lt goto symbol id 3 to state 1 gt lt state gt lt state id 1 derived from state 0 gt lt shift reduce symbol id 2 by production 3 gt lt state gt lt state id 2 derived from state 0 gt lt shift reduce symbol id 1 by production 2 gt lt goto symbol id 4 to st
180. version with lexemes configured will be a little shorter because UniCC revises not all terminals of the grammar to allow for whitespace GRAMMAR ident A Z a z lexem 1 prec 0 assoc N v null 3 gt A Z a z ident 4 gt A 2 a z integer 0 9 lexem 1 prec 0 assoc N v null 5 0 9 integer 6 gt 0 9 start print lexem 0 prec 0 assoc N v null 0 gt print value valu A Z a z 0 9 lexem 0 prec 0 assoc N v null 1 gt ident 2 integer start print lexem 0 prec 0 assoc N v null 7 start amp eof amp whitespace lexem 1 prec 0 assoc N v null 8 ze gd amp whitespace lexem 1 prec 0 assoc N v null 9 gt amp whitespace amp whitespace 10 amp whitespace amp whitespace lexem 1 prec 0 assoc N v null 11 gt amp whitespace 12 gt print print lexem 0 prec 0 assoc N v null 13 print amp whitespace ident A Z a z lexem 0 prec 0 assoc N v null 14 ident amp whitespace integer 0 9 lexem 0 prec 0 assoc N v null 15 integer amp whitespace start print lexem 0 prec 0 assoc N v null 16 amp whitespace start Now nonterminals integer and ident stay on their own they are not revised Instead there is now a nonterminal integer and ident that allows for whitespace behind the lexeme With a sharp look to the nont
181. verything which is production related is made on the right hand side so right of the arrow sign gt Here are some examples to follow We want to use one regex terminal name A Za z Defining a goal symbol with one production xample hello empty world hello empty and world are nonterminals declared above Now their definition follows hello Hello world World name Two productions empty gt A nonterminal with an empty production A nonterminal definition can also be split into several grammar definition blocks everytime the nonterminal appears on the left hand side all the defined productions are associated to it 4 4 2 1 The goal symbol A special case is the goal symbol This nonterminal symbol defines the root of the parse tree and must exist in every grammar When the goal symbol is finally reduced the parse tree is completed and input was successfully tested for correctness The goal symbol is defined by appending a dollar sign to the nonterminal to be the goal symbol on the left hand side like below with xample hello empty world Only one goal symbol is allowed per grammar Multiple goal symbol definitions throw an error and no parser will be constructed 80 4 4 2 Nonterminal symbols UniCC Parser Generator 4 4 2 2 Semantic nonterminal determination There may be cases where the distinct determination of a nonterminal requires more semanti
182. vity left n left ne lt S Y veut DH left 4 Pod DH left VEN ur d DH Regular expressions string rue PERNT K rw DH identifier A Za z A Za z0 9 H Lexemes lexeme real DH real gt real_integer real_fraction real_integer tr real fraction DH real_integer gt real_integer 0 9 9 9 DH real_fraction gt real fraction 0 9 0 9 Whitespace grammar construct whitespaces whitespace whitespace gt Nae Nn Vt EE ry LEN An 28 2 6 2 Drafting the Grammar Goal symbol programs x H statement gt l H expression gt H parameter list gt H variable function gt 2 6 2 Drafting the Grammar UniCC Parser Generator statement if expression statement if expression statement else statement while expression statement statement expression Y v variable expression expression expression expression expression expression gt expression expression expression expression expression expression expression expression expression expression expression expression expression expression expression expression expression real string variable function parameter_list precedence parameter_list expression expression identifier 29 UniCC Parser Generator 2 6 2 1 Ter
183. vtype A zero test value tos syntax tree prefix_tok prefix_syntree Top of stack pointer Receives the pointer of the root node of an automatically constructed syntax tree This member only exists if UNICC_SYNTAXTREE is defined Table 10 Member variables of 9 prefix pcb To avoid compile errors names differing from the above ones must be chosen to add members to the parser control block structure UniCC itself does not parse or check this because this is a template related problem 110 5 7 Q Q prefix pcb parser control block UniCC Parser Generator 5 8 Additional data structures There are also some additional data structures used in the UniCC Standard C Parser Template Knowledge on them is only required when interested in the syntax tree construction feature 5 8 1 prefix_vtype Value Type Structure prefix_vtype is the union that will hold a semantic value on the parse stack and within the syntax tree If there is only one data type used in the template prefix_vtype is only an alias for this data type else it is an union containing a member called value_ lt id gt for every value data type where lt id gt is an index counted from 0 for every type This union is automatically constructed by the code generator in UniCC 5 8 2 prefix_tok Stack Token Description Structure The parse stack and the syntax tree generator make use of the prefix_tok structure This structu
184. ware com Before UniCC can be built ensure that the Phorward Foundation Toolkit is installed in its latest version Getting the latest version is simple using the Mercurial SCM with hg clone http phorward hg sourceforge net 8000 hgroot phorward phorward Alternatively download a release package of the desired version from the Phorward Foundation Libraries product website After cloning or extracting change into the created directory and run configure make make install After that clone the following repositories They provide the UniCC Parser Generator and XPL a demonstration of a tiny programming language implementation written with UniCC These packages can also be obtained from the official UniCC website hg clone http unicc hg sourceforge net 8000 hgroot unicc unicc hg clone http unicc hg sourceforge net 8000 hgroot unicc xpl Optionally if hacking the UniCC Standard C Parser Template is wanted clone hg clone http unicc hg sourceforge net 8000 hgroot unicc Cparser also Change into the directory unicc and again run configure make make install If the UniCC bootstrapping toolchain is wanted to be enabled configure with configure with bootstrap this will bootstrap the UniCC grammar parser with multiple generation states 60 3 3 Building UniCC from source UniCC Parser Generator 3 4 Invocation and command line options UniCC primarily provides a command line interface to invoke the parser generatio
185. would be stored as 1234 internally but this isn t our goal nor a syntax that we allow for a valid date Nonterminal date defines the basic syntax of a date The dollar symbol behind date defines date to be the goal symbol In terms of LALR parsing the goal symbol is the last symbol identified by the parser to ensure that the parse is finished complete and valid This is due the bottom up approach of LALR parsers The parse tree is constructed from the leafs the terminal symbols up to all nonterminal symbols to finally match the goal symbol The nonterminal marked as goal is always the root from which all subsequent branches in the yielding parse tree will start Next to date nonterminal month defines all month names in its productions where each month name is represented by a so called string String terminals are enclosed by double quotation marks and are a kind of terminal symbol In comparison to characters and character classes strings require that the character sequence from which they are made up exactly match to the input characters coming up next in the input stream More on this topic will even be discussed later Finally we use the same nonterminal which is integer for day and year number The parse tree of the input string August 17 2008 using this grammar will be date date ua month integer d integer August integer 7 integer 8 ce 1 integer 0 re integer 0 2 Fig 3 Parse tree derived from
186. y handled and consumed by the lexical analyzer Every terminal symbol is concerned as a lexical unit standing on its own The grammar is not rewritten nor modified by UniCC and results in a faster parser with lesser states The disadvantage of this parser construction mode is that the whitespace sensitive aspect enabling the full control of any whitespace situation and speciality gets lost Anyway parser constructed in insensitive mode can be used for most parsing issues Trying to compile above grammar using mode insensitive will throw some errors because this grammar uses features that can only be handled in insensitive mode mode insensitive left Er RSA left Ik Wits whitespaces t n integer O 9 H start expr expr gt expr expr expr expr expr expr expr expr expr integer In insensitive mode it is only allowed to use one terminal symbol as whitespace This terminal symbol can be a character class a string or a regular expression definition Nonterminal symbols can t be configured to be used as whitespace The use of the lexeme directive becomes effectless so the former lexem nonterminal integer must be rewritten to a regular expression terminal integer here Parsing the same expression 8 2 yields in the following much smaller syntax tree The whitespace handling is not covered by the parser anymore and is run silently in the background within the lexical a
187. y thousand digits huge integer number using the above grammar It won t be a problem for the computer to parse this input but it results in a giant logical structure that can be visually mapped into such a parse tree And a thirty thousand digit integer can never be stored to a normal computer variable for further calculation To have a more abstract view on such a parse tree compiler writers are rather dealing with another kind of tree structure which is called the abstract syntax tree Abstract syntax trees are derived from the parse tree and represent only the logical structure of information from the parse tree by hiding or merging syntactical details which are not mandatory to keep up the parsed structure By using these virtually constructed trees compiler writers can perform several actions to be executed on the particular production e g generating output code or building up data structures to be used by subsequent compiler related actions this is the way how compilers are written along the parser Unconsciously compiler writers do excessively make use of abstract tree structures along the parse tree when writing compilers But this will be discussed in the next chapters We now only rely on the definition of the grammars itself This example was really very simple What about if we go back to the idea of the appointment assistant mentioned in the above chapter a simple grammar to detect dates in various formats First we want to p
188. ymbols within the production s rule will become leafs where the nonterminal symbols become nodes to subsequent previously reduced rules Every symbol within the parse tree can be augmented with user defined data For example a terminal symbol integer may hold the integer value of the analyzed integer number When a production is defined as integer integer and matches the current sequence handle a reduction is caused Right before this is done there are three symbols on the stack An integer number the operator and another integer number Because every symbol within the parse tree can be augmented with a value the semantic value behind this sequence can be calculated right when the parse tree is constructed This 82 4 4 3 Productions UniCC Parser Generator means The programmer is able to put individual code to every production which can pass values to the newly constructed node and to upperlying productions This reduction code is written as a code block behind the sequence that defines the production Within each code block UniCC provides a set of macros to access the left hand side the return value of the rule and the right hand side items respective every symbol of the sequence on the reduced right hand side Macro Usage ee Defines the semantic value to be associated with the left hand side It can be seen as the return value of the production lt offset gt Access the semantic values of right hand
189. ype prefix_syminfo and can be accessed via a static global variable prei Fix symbols in within the parser module Typedef for symbol information table typedef struct char int UNICC BOOLEAN UNICC BOOLEAN UNICC BOOLEAN prefix_syminfo name type lexem whitespace greedy Member Type Content Defines if the symbol regular expression terminals only should be scanned in greedy or greedy B non greedy mode 1 if true 0 else lexem int Defines a nonterminal as a lexem 1 if true O else name char The name of the symbol as defined in the UniCC parser definition file Gets the symbol type 0 for nonterminal 1 for character class terminal 2 for type int regular expression terminal this includes strings 3 for special terminal e g the error resync token or end of file symbol whitespacelint Defines a symbol that is whitespace 1 if true 0 else 112 Table 12 Member variables of prefix syminfo 5 8 3 prefix_syminfo symbol Information Table UniCC Parser Generator 5 8 4 prefix_prodinfo production information table All parsers constructred via the UniCC Standard C Parser Template also contain a production information table holding information about the productions used in the parser This production information table is used by the parser itself to get additional informations about its productions The production

Download Pdf Manuals

image

Related Search

Related Contents

Promar Latex Fibrado  Philips F5552/36/  Stratos Pro User Manual  Catalogue de formations    

Copyright © All rights reserved.
Failed to retrieve file