
The Phoenix Parser User Manual

1. Contents (continued)

Data Structures
The Grammar structure
Set of Active Slots
The Chart
Parse Data Structure
Printing Parses
Extracted Form
Pre-Terminals
Running the Parser
Command Line parameters
Required parameters
Flags
Optional Files
Change Default Symbols
Buffer Sizes
Stand Alone Interface
Tutorial Example

Introduction

The Phoenix parser is designed for development of simple, robust Natural Language interfaces to applications, especially spoken language applications. Because spontaneous speech is often ill-formed, and because the recognizer will make recognition errors, it is necessary that the parser be robust to errors in
2. gra

    if ($SingleFile == 0) then
        cat *.gra > xxx
        mv xxx $TASK.gra
    endif
    # append lib grammars to file
    cd $PHOENIX/Grammars
    cat $LIBS >> $DIR/$TASK.gra
    cd $DIR
    # create list of nets to be compiled
    cat $TASK.gra | $PHOENIX/Scripts/mk_nets.perl > nets
    # compile grammar, output messages to file log
    echo "compiling grammar"
    $PHOENIX/ParserLib/compile_grammar -f $TASK > log
    grep WARN log
    # flag leaf nodes for extracts function
    echo "flagging leaf nodes"
    $PHOENIX/ParserLib/concept_leaf -grammar $TASK.net

Figure 5: compile script

.net file: compiled grammar file

These are ascii files representing Recursive Transition Networks and have the following format.

The first line gives the number of nets that were compiled:

    Number of Nets 378

Then come the compiled nets. For each network, the first line gives:

    net name, net number, number of nodes in the net, concept leaf flag

Then the nodes are listed, each followed by the arcs out of it. A node entry in the file has the following format:

    node number, number of arcs out of node, final flag (0 = not a final node, 1 = a final node)

For each net, nodes are numbered sequentially, with the start node being 0. The entries for arcs out of a node follow the entry for the node. Arc entries have the following format:

    word number, net number, to node

An arc may be a:
• word arc, with word word_num and net_number 0
• null arc, indicated by word_num 0 and net_number 0
• call arc, indicated
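The three arc kinds in the compiled net format can be told apart from the word number and net number fields alone. The C sketch below is illustrative only: the NetArc struct and the classify_arc and read_arc functions are hypothetical names, not part of the Phoenix library, and the sketch assumes that the three fields of an arc entry are separated by white space. The rule for call arcs (net number greater than 0, word number ignored) follows the format description in this manual.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical in-memory form of one arc entry:
   word number, net number, to node. */
typedef struct {
    int word_num;
    int net_num;
    int to_node;
} NetArc;

typedef enum { ARC_WORD, ARC_NULL, ARC_CALL } ArcKind;

/* Classify an arc using only its word and net numbers. */
ArcKind classify_arc(const NetArc *arc)
{
    assert(arc != NULL);
    if (arc->net_num > 0)
        return ARC_CALL;    /* call to another net; word_num is ignored */
    if (arc->word_num == 0)
        return ARC_NULL;    /* no word and no net */
    return ARC_WORD;        /* consumes the word numbered word_num */
}

/* Read "word_num net_num to_node" from a text line.
   Returns 1 if all three fields were read, 0 otherwise.
   Assumption: fields are separated by white space. */
int read_arc(const char *line, NetArc *arc)
{
    assert(line != NULL && arc != NULL);
    return sscanf(line, "%d %d %d",
                  &arc->word_num, &arc->net_num, &arc->to_node) == 3;
}
```

Testing the net number first means a call arc is recognized whatever value its word number field holds.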
3. in each parse is in the global variable n_slots. The number of alternative parses is in num_parses.

    SeqNode **parses;

    typedef struct seq_node {
        Edge *edge;
        Fid *frame_id;
        short n_frames;    /* frame count for path */
    } SeqNode;

Printing Parses

The routine print_parse is used to write parses to a text buffer. The parameters are:
• the number of the parse to be printed, starting at 0
• a pointer to the text buffer to be written to
• the extracted form flag: 0 = parsed form, 1 = extracted form
• a pointer to the grammar structure to use

    void print_parse(int parse_num, char *out_str, int extract, Gram *gram)
    {
        int j, frame = -1;
        SeqNode *fptr;
        char *out_ptr = out_str;

        fptr = parses[parse_num];
        for (j = 0; j < n_slots; j++, fptr++) {
            /* if starting a new frame */
            if (fptr->n_act != frame) {
                frame = fptr->n_act;
                /* print frame name and colon */
                sprintf(out_ptr, "%s:", gram->frame_name[frame]);
                out_ptr += strlen(out_ptr);
            }
            if (extract)
                out_ptr = print_extracts(fptr->edge, gram, out_ptr, 0, gram->frame_name[frame]);
            else
                /* print slot edge */
                out_ptr = print_edge(fptr->edge, gram, out_ptr);
        }
    }

As an example of calling print_parse, the parse_text routine uses the following for loop to print out the parse buffer:

    /* print all parses to parse buffer */
    for (i = 0; i < num_parses; i++) {
        sprintf(out_ptr, "PARSE %d\n", i);
        out_ptr += strlen(out_ptr);
        print_parse(i, out_ptr, extract, gram);
        out_ptr += strlen(out_ptr);
        sprintf(out_ptr, "END_PARSE\n");
4. nets in the matching process. Each time a net match is attempted (all nets, not just slots), this is noted in the chart. All matched networks are added to the chart as they are found. Any time a net match is attempted, the chart is first checked to see if the match has been attempted before.

When a slot match is found, it is added to the slot graph. Each sequence of slots in the slot graph is a path. The score for the path is the number of words accounted for by the sequence. Words are not skipped in matching a slot, but words can be skipped between the matched slots. The graph growing process prunes poor scoring paths, just as the acoustic search does. The pruning criteria are first the number of words accounted for and second the degree of fragmentation of the sequence. If two paths cover the same portion of the input and one accounts for more words than the other, the less complete one is pruned. If the two paths account for the same number of words and one uses fewer slots than the other, the one with more slots is pruned. The resulting graph represents all of the sequences found that have a score equal to the best.

The sequences of slots represented by the graph are then grouped into frames. This is done simply by assigning frame labels to the slots. Again, in this grouping, less fragmented parses are preferred. This means that if two parses each have five slots, and one uses two frames and the other uses three, then the parse using two frames is preferred. T
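The preference order described above (more words accounted for, then fewer slots, then fewer frames) can be sketched as a single comparison function. This is a minimal illustration, not the parser's own code: the PathScore struct and compare_paths function are hypothetical names, and the sketch folds the pruning step and the later frame-grouping step into one comparison for brevity. It assumes the two paths being compared cover the same portion of the input.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary of one candidate path through the slot graph. */
typedef struct {
    int words_covered;  /* score: number of input words accounted for */
    int n_slots;        /* number of slots in the sequence */
    int n_frames;       /* number of frames the slots are grouped into */
} PathScore;

/* Returns a positive value if a is preferred, a negative value if b is
   preferred, and 0 if the two paths score equally on all three criteria. */
int compare_paths(const PathScore *a, const PathScore *b)
{
    assert(a != NULL && b != NULL);
    /* first criterion: more words accounted for wins */
    if (a->words_covered != b->words_covered)
        return a->words_covered > b->words_covered ? 1 : -1;
    /* second criterion: fewer slots (less fragmented) wins */
    if (a->n_slots != b->n_slots)
        return a->n_slots < b->n_slots ? 1 : -1;
    /* grouping stage: fewer frames wins */
    if (a->n_frames != b->n_frames)
        return a->n_frames < b->n_frames ? 1 : -1;
    return 0;
}
```

With this ordering, the five-slot example from the text comes out as expected: the parse that uses two frames compares as preferred over the one that uses three.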
5. recognition, grammar and fluency. This parser is designed to enable robust parsing of these types of input.

Phoenix parses each input utterance into a sequence of one or more semantic frames. The developer must define a set of frames and provide grammar rules that specify the word strings that can fill each slot in a frame.

Release Configuration

The Phoenix distribution contains the directories:
• ParserLib: The parse library functions, which are compiled into the archive library libparse.a
• Grammars: A set of task independent semantic grammars. These grammars can be compiled in with the task specific grammars for an application. This is normally done with the compile script (see the Compiling Grammars section).
• Scripts: Scripts to pack grammars into a single file and to generate a list of net names contained in grammar files. These scripts are called by the compile script.
• Example: An example frames file and grammars. See the Tutorial Example section.
• Server: Code for the Galaxy HUB server interface. This code will not compile without the GALAXY architecture installation, but serves as a guide to how to implement a GALAXY parse server.
• Doc: Contains this file.

Theory of Operation

The Phoenix parser maps input word strings onto a sequence of semantic frames. A Phoenix frame is a named set of slots, where the slots represent related pieces of information. Figure 1 shows an example frame for queries for flight informat
6. slot name will be the root of the corresponding semantic parse tree. The syntax for a frame definition is:

    # comment
    FRAME: <frame name>
    NETS: # slot names
        [<slot name>]
        [<slot name>]
        [<slot name>]
    ;

For example, a frame for a Hotel query might be:

    # reserve hotel room
    FRAME: Hotel
    NETS:
        [hotel_request]
        [hotel_name]
        [hotel_period]
        [hotel_location]
        [Room_Type]
        [Arrive Date]
        [want]
    ;

A frame definition is terminated by a ; at the start of the line. DO NOT FORGET TO TERMINATE THE FRAME WITH A SEMICOLON OR IT WON'T BE RECOGNIZED.

The Grammar file

Names of grammar files end with a .gra extension. These files contain the grammar rules. The grammars are context free rules that specify the word patterns corresponding to the token (network name). The syntax for a grammar for a token is:

    # optional comment
    [token_name]
        (<pattern a>)
        (<pattern b>)
    <Macro1>
        (<rewrite rule for Macro1>)
    <Macro2>
        (<rewrite rule for Macro2>)

The token name is enclosed in square brackets. After this follows a set of re-write patterns, one per line, each enclosed in parentheses, with leading white space on the line. After the basic re-write patterns follow the non-terminal rewrites, which have the same format.

Notation used in pattern specification:
• Lower case strings are terminals.
• Upper case strings are macros.
• Names enclosed in [] are non-terminals (calls to
7. string>
    Set the start of utterance symbol to string. Default is <s>.
end_sym <ascii string>
    Set the end of utterance symbol to string. Default is </s>.

Buffer Sizes

The parser mallocs space for buffers. Each of these has a default size. If a buffer runs out of space during execution, the parser will print an error message telling which buffer is out of space and what the current setting is. The corresponding command line parameter can be used to increase or decrease the buffer size. If desired, the PROFILE parameter can be used to report how much space is actually used during parses, in order to optimize the buffer sizes.

    Parameter       Default  Description
    EdgeBufSize     1000     buffer for edge structures
    ChartBufSize    40000    chart structures
    PeBufSize       2000     number of Val slots for trees
    InputBufSize    1000     max words in line of input
    StringBufSize   50000    max words in line of input
    SlotSeqLen      200      max number of slots in a sequence
    FrameBufSize    500      buffer for frame nodes
    SymBufSize      50000    buffer size to hold char strings
    ParseBufSize    200      buffer for parses
    SeqBufSize      500      buffer for sequence nodes
    PriBufSize      2000     buffer for sequence nodes
    FidBufSize      1000     buffer for frame ids

Stand Alone Interface

A stand alone interface, parse_text, is provided. parse_text accepts lines of input from stdin and prints output to stdout. Type quit to exit. A script for running parse_tex
8. The Phoenix Parser User Manual

Contents

Introduction
Release Configuration
Theory of Operation
Writing Semantic Grammars and Frames
The frames file
The Grammar file
Compiling Grammars
9. able LIBS is set to the set of grammars to be loaded from the common library.

The following is the script for compiling the grammar for the tutorial example. It uses two grammars in the local directory, Places.gra and Schedule.gra, and it loads the grammars date_time.gra, number.gra and next.gra from the common library. The individual grammars are concatenated into the file EX.gra, which is then compiled into the file EX.net.

The compile script creates the file nets before compiling the grammar. This file lists the names of nets in the grammar. It is read by compile_grammar and used to assign numbers to nets.

compile_grammar also creates the file base.dic in the local directory. This is the dictionary, a text file containing the words used by the grammar with a number assigned to each. Each line in the file contains one word and the number assigned to it. The word numbers are not sequential; they are assigned by hash code. The word numbers are used by the parser.

    # compile grammars in current directory
    set DIR = `pwd`
    # put compiled nets in file EX.net
    set TASK = "EX"
    # set PHOENIX to point to root of Phoenix system
    set PHOENIX = Phoenix
    # set LIBS to .gra files to load from $PHOENIX/Grammars
    set LIBS = "date_time.gra number.gra next.gra"
    # if SingleFile == 1, then all grammar rules are in file $TASK.gra
    # if SingleFile == 0, then compiles all files in dir with extension .gra
    set SingleFile = 0
    # if separate files, pack into single file $TASK
10. also has dynamic control of the set of slots to use when invoking each parse. The global variables cur_nets (int *cur_nets) and num_active (int num_active) are used to implement this capability. cur_nets points to the set of slots (the net numbers) to be used in the parse, and num_active specifies the number of elements in the array. The main match loop of the parse looks like:

    for (word_position = 1; word_position < script_len; word_position++) {
        for (slot_number = 0; slot_number < num_active; slot_number++) {
            match_net(cur_nets[slot_number], word_position, gram);
        }
    }

If cur_nets is null, the parser will create the set of nets used by frames in the grammar, point cur_nets at it and set num_active accordingly. The system will therefore initialize to this set the first time parse is called. In order to use a subset of all slots in the parse, generate an array of the net numbers to be used, point cur_nets at the array and set num_active to the number of elements in the array. In order to reset back to the set of all slots in a grammar, just set cur_nets or num_active to null.

The Chart

The chart is a data structure used to record all net matches as they are found. This is done so that a match doesn't have to be repeated if it is called from some other point in the grammar. In some cases this saves considerable computation. The chart is a triangular matrix that is indexed by the start word and end word of the matched sequence of words (Figure 5).

Flights f
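Restricting a parse to a subset of slots, and later restoring the full set, comes down to two assignments on cur_nets and num_active. The C sketch below illustrates this under stated assumptions: the two globals are declared locally as stand-ins so that the example is self-contained (when linking against the parser library they would be declared extern instead), and the helper names use_slot_subset and use_all_slots are hypothetical, not part of Phoenix.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the parser globals named in the text. */
int *cur_nets = NULL;
int num_active = 0;

/* Point the parser at a caller-owned array of net numbers.
   The array must stay valid for as long as it is in use. */
void use_slot_subset(int *net_numbers, int count)
{
    assert(net_numbers != NULL && count > 0);
    cur_nets = net_numbers;
    num_active = count;
}

/* Reset to "all slots": a null cur_nets tells the parser to
   rebuild the full set of nets used by frames on the next parse. */
void use_all_slots(void)
{
    cur_nets = NULL;
    num_active = 0;
}
```

A caller would look up the net numbers of the slots it expects at a given point in a dialogue, call use_slot_subset before parse, and call use_all_slots once the restricted context no longer applies.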
11. by net_num > 0; word_num is ignored in this case.

    Airline 27 5 0 050 00 2 20294 0 3 372 372 4 344 344 2 30 30 2 101 10 TP o A 330 18753 0 1 18762 0 1 16707 0 1 430 18753 0 1 18762 0 1 16707 0 1

Data Structures

The Grammar structure

Phoenix is capable of using multiple grammars. The function read_grammar() reads a compiled grammar from a .net file and returns a pointer to an associated gram structure.

    struct gram *read_grammar(dir, dict_file, grammar_file, frames_file);

    typedef struct gram {
        FrameDef *frame_def;    /* pointers to frame definition structures */
        char **frame_name;      /* pointers to frame names */
        int num_frames;         /* number of frames read in */
        char **labels;          /* pointers to names of nets */
        Gnode **Nets;           /* pointers to heads of compiled nets */
        int num_nets;           /* number of nets read in */
        char **wrds;            /* pointers to strings for words */
        int num_words;          /* number of words in lexicon */
        int *node_counts;       /* number of nodes in each net */
        char *leaf;             /* concept leaf flags */
        char *sym_buf;          /* strings for words and names */
    } Gram;

As many separate grammars as desired can be read in, and the pointers to the corresponding gram structures stored. When the parse function is called, it is passed the word string to parse and a pointer to the gram structure to use for the parse.

    parse(word_string, gram);

Set of Active Slots

In addition to having control of which grammar to use in the parse, the developer
12. cd to the Example directory. The file run_parse is a script to run parse_text with the command line parameters specified in config. The file input contains some sample sentences that the example grammar should parse. To process the sentences, type:

    run_parse < input

Edit the config file, change the extract parameter to 1, and run the parser on the input again.
13. es and buffer sizes. This can be used to optimize the sizes of the internal structures allocated. Default is 0.

IGNORE_OOV 0/1
    0 causes out-of-vocabulary words to prevent a rule match. 1 causes out-of-vocabulary words to be ignored and not prevent matches. Default is 1.
ALL_PARSES 0/1
    0 causes the system to print out only one parse. 1 prints out all parses in the final parse forest. 1 is the default.
MAX_PARSES <number>
    Print out at most <number> parses. Default is 10.
USE_HISTORY 0/1
    0: Do not inherit context from the previous sentence. 1: Inherit the final context from the previous sentence for the fragmentation score. 0 is the default.
BIGRAM_PRUNE 0/1
    1 causes more aggressive pruning in the parse search. The dynamic programming search for the best parse only keeps one path for each (end_word, frame_id, slot) triplet. This amounts to bigram level pruning with a Viterbi search. Default is 0.
MAX_PATHS <number>
    Set to 0, this parameter has no effect. If > 0, it specifies the maximum number of paths to be expanded during a match. This is for debugging, in case a runaway rule is causing large amounts of output. Default is 0.

Optional Files

function_wrd_file <filename>
    File containing the stop word list. These are function words that should not be counted when scoring the goodness of a parse.
config_file <filename>
    Read command line parameters from the specified file.

Change Default Symbols

start_sym <ascii
14. he result is a graph of slots, each labeled with one or more frame labels. Each path through the graph (all scoring equally) is a parse.

This mechanism naturally produces partial or fragmented parses. The dynamic programming search produces the most complete, least fragmented parse possible given the grammar and the input. The parser does not require sentence boundaries and is able to process very long input with no newline. It segments the input at natural breakpoints and performs garbage collection on its buffers. This allows it to parse entire reports as single utterances if necessary. The processing speed is generally linear with the length of the input.

How robust or constrained a system is depends on how the frames and grammars are structured. Using one frame with one slot will produce a standard CFG parser. This is efficient, but not very robust to unexpected input. At the other extreme, making each content word a separate slot will produce a key-word parser. Somewhere in between usually gives the best combination of robustness and precision.

Writing Semantic Grammars and Frames

The parser requires two input files: a frames file and a grammar file.

The frames file

This file specifies the frames to be used by the parser. A frame represents some basic type of action or object for the application. Slots in a frame represent information that is relevant to the action or object. Each slot name should have an associated set of grammar rules. The
15. ion. Each slot has an associated Context Free Grammar that specifies word string patterns that match the slot. The grammars are compiled into Recursive Transition Networks (RTNs). An example parsed frame is shown in Figure 2. When filled in by the parser, each slot contains a semantic parse tree for the string of words it spans. The root of the parse tree is the slot name.

    Frame: Air
    Nets: Origin, Destination, Depart Date, Arrive Date, Depart Time, Arrive Time, Airline
    (diagram: the network for Origin, with the words from, leaving and departing leading to city and airport)

Figure 1: Example Frame

    Input: Show fares of flights from Denver to Boston on United

    Parse:
    Frame Air
    Field show _fares fares of flights
    Origin from City Denver
    Destination to City Boston
    airline on AirlineName United

    Extract:
    Frame Air
    Field fares
    Origin City Denver
    Destination City Boston
    AirlineName United

Figure 2: Example Parsed Output

The parsing process is shown in Figure 3. In a search algorithm very similar to the acoustic match producing a word graph, grammars for slots are matched against a word string to produce a slot graph. The set of active frames defines a set of active slots. Each slot points to the root of an associated Recursive Transition Network. These networks are matched against the input word sequence by a top-down Recursive Transiti
16. on Network chart parsing algorithm.

Figure 3: The Parsing Process (diagram: the input "Flights from Boston arriving in Denver before five pm" with words numbered 1 to 9; the set of frames FlightInfo and Transport; the chart of matched nets, with entries field 1 1, origin 2 3, dest 4 6, arr_time 7 9, dep_time 7 9 and number 8 8; and the resulting slot graph)

The parser proceeds left-to-right, attempting to match each slot network starting with each word of the input, as in:

    for each word of input
        for each active slot
            match_net(slot, word)

Figure 4: The Match Process (diagram: the origin and destination networks matched against the input words "Flights from Boston arriving in Denver")

The function match_net is a recursive function that matches an RTN against a word string beginning at the specified word position. The function produces all matches for the network starting at the word position, and may have several different endpoints. The networks are not designed to parse full sentences, just sequences of words. The Recursive Transition Networks for the slots call other
17. other token rules).
• Regular Expressions:
    *item indicates 0 or 1 repetitions of the item
    +item indicates 1 or more repetitions
    +*item indicates 0 or more repetitions (equivalent to a Kleene star)
• #include <filename> reads the file at that point.

A macro has rewrite rules specified later in the same grammar rule. These cause a text substitution of all of the expansions for the macro, but don't cause a non-terminal token to appear in the parse. Macros allow for a simpler expression of the grammar without inserting unwanted tokens in the parse.

Example grammar rules for the tokens [hotel_request] and [want] are:

    [hotel_request]
        ([want] a HOTEL)
    HOTEL
        (hotel)
        (motel)
        (room)
        (accommodations)
        (place to stay)

    [want]
        (I WANT)
    I
        (i)
        (we)
    WANT
        (want)
        (need)
        (would like)

The parse for "I would like a hotel room" would be:

    [hotel_request] ( [want] ( i would like ) a hotel room )

Compiling Grammars

The grammar is compiled by invoking the script compile in the Grammar directory, which in turn calls the routine compile_grammar. A sample compile script is shown in Figure 5. The grammar can be in a single file or spread across multiple files. If the grammar is in multiple files (as indicated by the variable SingleFile set to 0), the compile script concatenates them into a single file before passing it to compile_grammar. It also appends any requested grammars from Phoenix/Grammars to the file before invoking compile_grammar. The vari
18.     out_ptr += strlen(out_ptr);
    }

Extracted Form

The parser provides a mechanism for printing output in a form closer to that required by a dialogue manager. It will print only the tokens and strings that are to be extracted. The format for extracted output is:

    <frame_name> <Node_Name> <Node_Name> <Node_Name> <value>

For example, "Where is American Beauty playing" would generate the following parsed and extracted forms:

    movie_info movie_info Display_Movie _theatre_name where is is Movie AMERICAN BEAUTY showing playing

    movie_info Display_Movie _theatre_name
    movie_info Movie AMERICAN BEAUTY

In order to use this mechanism, the developer must observe a convention when writing grammars. Nets that begin with an uppercase letter will appear in the extracted output (token tree), and the leaves of those nets will be the extracted values. Nets that begin with a lower case letter will not appear in the extracted output. Word strings that have only lower case nets in their parse will not appear in the output.

Pre-Terminals

A second feature of extracted output is pre-terminals. These are nets whose name begins with an underscore (_) character. For these nets, the name of the net, rather than the word string, is used as the value in the extracted output. See _theatre_name in the example. This is a convenient way to put values in canonical form. For example, there could be many way
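The naming convention for extracted output depends only on the first character of a net name, so it can be checked mechanically, for instance when reviewing a grammar. The following C sketch is an illustration only: the NetRole enum and the net_role function are hypothetical names, not part of the Phoenix library, and the sketch assumes the net name is passed without its enclosing square brackets.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

typedef enum {
    NET_HIDDEN,        /* lower case first letter: not in extracted output */
    NET_EXTRACTED,     /* upper case first letter: appears in extracted output */
    NET_PRE_TERMINAL   /* leading underscore: net name is used as the value */
} NetRole;

/* Classify a net by the first character of its name.
   Assumption: name has no surrounding square brackets. */
NetRole net_role(const char *name)
{
    assert(name != NULL);
    if (name[0] == '_')
        return NET_PRE_TERMINAL;
    if (isupper((unsigned char)name[0]))
        return NET_EXTRACTED;
    return NET_HIDDEN;
}
```

Applied to the names in the example above, Movie would be classified as extracted, _theatre_name as a pre-terminal, and a net such as want as hidden.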
19. rom Boston arriving in Denver (words numbered 1 to 6)

    net      sw   ew   score
    origin   2    3    2
    Origin   3    3    1
    City     3    3    1

Figure 5: The Chart (edges recorded for "from Boston": origin spans words 2 to 3 and contains Origin, which contains City)

Since it would be a sparse matrix, the chart is not actually implemented as a matrix, but as a linked list with the start word as the major key and the end word as the minor key.

    /* EdgeLink: chart structure for linking edges into chart */
    typedef struct edge_link {
        int sw;
        struct edge_link *link;
        struct edge *edge;
    } EdgeLink;

    typedef struct edge {
        short netid, sw, ew, score;    /* net number, start word, end word, score */
        char nchld;                    /* number of child edges */
        struct edge **chld,            /* edges called by this edge */
                    *link;             /* link to next edge in list */
    } Edge;

Parse Data Structure

When the final parse structures are created, they are put into a buffer pointed to by the variable parses. A parse is a sequence of SeqNodes, which represent edges, each with a frame id. The SeqNode structures contain a frame id and a pointer to an edge in the chart. A sequence of edges with the same frame id are all in the same instance of a frame. The nodes are written to the buffer in order, with the first slot of a new parse appended after the last slot of the previous parse. All parses for an utterance will have the same number of slots filled, since any more fragmented ones would have been pruned. The number of slots
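Looking up a previous match in a chart stored as a linked list keyed by start word and then end word can be sketched as follows. This is a simplified illustration, not the parser's implementation: ChartEdge and ChartLink are cut-down stand-ins for the Edge and EdgeLink structures (child edges are omitted), find_edge is a hypothetical function name, and the sketch assumes that each link node heads the list of edges sharing one start word.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified edge: one matched net spanning words sw..ew. */
typedef struct chart_edge {
    short netid, sw, ew, score;
    struct chart_edge *link;    /* next edge with the same start word */
} ChartEdge;

/* Simplified link node: heads the edge list for one start word. */
typedef struct chart_link {
    int sw;
    struct chart_link *link;    /* next start word in the chart */
    ChartEdge *edge;
} ChartLink;

/* Return the recorded edge for net netid spanning sw..ew, or NULL
   if no such match has been recorded in the chart. */
const ChartEdge *find_edge(const ChartLink *chart, int netid, int sw, int ew)
{
    const ChartLink *l;
    const ChartEdge *e;

    assert(sw <= ew);
    for (l = chart; l != NULL; l = l->link) {
        if (l->sw != sw)
            continue;                       /* major key: start word */
        for (e = l->edge; e != NULL; e = e->link)
            if (e->netid == netid && e->ew == ew)
                return e;                   /* minor key: end word */
    }
    return NULL;
}
```

Because only start words that have at least one match get a link node, the storage grows with the number of matches found rather than with the square of the input length, which is the motivation given above for not using a full matrix.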
20. s of saying the word yes, but you might need to pass the exact string "yes" to the database. If you define a net [_yes] that specifies all the ways of saying yes, then the string yes, rather than the original word string (for example "certainly"), will appear in the extracted output. This also works for mapping strings like "Philly" to "Philadelphia", etc.

As an example, the following rules:

    [Answer]
        ([_yes])
        ([_no])

    [_yes]
        (yes)
        (fine)
        (sure)
        (that's good)
        (sounds good to me)

    [_no]
        (no way)
        (i don't think so)

would cause "sounds good to me" to be parsed as:

    Answer yes sounds good to me

and extracted as:

    Answer yes

and "I don't think so" to be parsed as:

    Answer no I don't think so

and extracted as:

    Answer no

Running the Parser

Command Line parameters

Required parameters

Flags

dir <grammar directory>
    Specifies the directory which contains all grammar files.
grammar <filename>
    Name of the .net file containing the grammar to be used.
verbose <number>
    Number can have values from 0 to 6. As number increases, progressively more trace information is printed out. 0 prints out nothing. 1 prints out just parses; this is the default. 2 prints out the input string and then the parses. 3 to 6 print out debugging information.
extract 0/1
    0 is the default and prints full bracketed parses. 1 prints extracted forms.
PROFILE 0/1
    1 turns on profiling. This gives information on parse tim
21. t, run_parse, is in the Example directory. It looks like:

    $PHOENIX/ParserLib/parse_text config config parse_config

where parse_config is:

    verbose 1
    extract 1
    dir Grammar
    grammar EX.net

The parse_text routine illustrates the basic setup needed to use the parse functions:

    int main(int argc, char **argv)
    {
        /* set command line or config file parms */
        config(argc, argv);

        /* read grammar, initialize parser, malloc space, etc */
        init_parse(dir, dict_file, grammar_file, frames_file, priority_file);

        /* for each utterance */
        while (fgets(line, LINE_LEN - 1, fp)) {
            /* assign word strings to slots in frames */
            parse(line, gram);

            /* print parses to buffer */
            for (i = 0; i < num_parses; i++) {
                print_parse(i, out_ptr, extract, gram);
            }

            /* clear parser temps */
            reset(num_nets);
        }
    }

Tutorial Example

The Example directory contains a tutorial example. First cd to the Example/Grammar directory and type compile to compile the grammar. All files in this directory, including the compiled EX.net file, are ascii. There are two source grammar files (.gra) in the directory, Places.gra and Schedule.gra. The compile script creates the EX.gra file (after first saving the old one to EX.gra.old) from these two files and the ones specified from the $PHOENIX/Grammars directory. It then compiles the grammar into the EX.net file. The script also creates the nets and base.dic files. The files airports, cities and states are included by the Places.gra grammar. Then
