Unitex User Manual

1. Figure 5.17: Antialiasing example. This effect slows Unitex down. We recommend not using it if your machine is not powerful enough. 5.3.4 Box alignment. To get nice-looking graphs, it is useful to align the boxes both horizontally and vertically. To do this, select the boxes to align and click on "Alignment" in the "Format" sub-menu of the "FSGraph" menu, or press <Ctrl+M>. You will then see the window in figure 5.18. The possibilities for horizontal alignment are: Top (the boxes are aligned with the topmost box), Center (the boxes are centered on the same axis) and Bottom (the boxes are aligned with the bottommost box). Figure 5.18: Alignment window. The possibilities for vertical alignment are: Left (the boxes are aligned with the leftmost box), Center (the boxes are centered on the same axis) and Right (the boxes are aligned with the rightmost box). Figure 5.19 shows an example of alignment. The group of boxes on the right is a copy of the ones on the left that was aligned vertically to the left. The option "Use Grid" in the alignment window shows a grid as the background of the graph, which lets you align the boxes approximately. 5.3.5 Display Options and Colors. You can configure the display style of a graph by pressing <Ctrl+R> or by clicking on "Presentation" in the "Format"
2. TO: order in which the occurrences appear in the text; LC: left context, occurrence; LR: left context, right context; CL: occurrence, left context; CR: occurrence, right context; RL: right context, left context; RC: right context, occurrence; NULL: does not specify any sorting mode. This last option should be used if the text is to be modified instead of constructing a concordance. For details on the sorting modes, see section 4.8.2. mode: indicates in which format the concordance is to be produced. The four possible formats are: html (produces a concordance in HTML format encoded in UTF-8); text (produces a concordance in Unicode text format); glossanet (produces a concordance for GlossaNet in HTML format; the HTML file is encoded in UTF-8); name_of_file (indicates to the program that it is supposed to produce a modified version of the text and save it in a file named name_of_file, see section 6.6.3). alph: alphabet file used for sorting. The value NULL indicates the absence of an alphabet file. thai: this parameter is optional. It indicates to the program that it is processing a Thai text. This option is necessary to ensure the proper functioning of the program in Thai. The result of the application of this program is a file called concord.txt if the concordance was constructed in text mode, a file called concord.html if the mode was html or
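The two-letter sorting modes above can be read as an ordered pair of sort keys. The following is a minimal sketch of that reading, assuming each concordance row is a hypothetical `(left, occurrence, right)` tuple and that plain string comparison is good enough; the real program sorts with an alphabet file, which this sketch ignores.

```python
# Sketch only: rows are hypothetical (left_context, occurrence, right_context) tuples.
# L = left context, C = occurrence, R = right context; TO and NULL keep text order.
FIELD_INDEX = {"L": 0, "C": 1, "R": 2}

def sort_concordance(rows, mode):
    """Return rows ordered according to a two-letter sorting mode."""
    if mode in ("TO", "NULL"):
        return list(rows)  # text order: leave the rows as they are
    if len(mode) != 2 or any(letter not in FIELD_INDEX for letter in mode):
        raise ValueError(f"unknown sorting mode: {mode}")
    first, second = (FIELD_INDEX[letter] for letter in mode)
    # Plain string comparison is an assumption; no alphabet file is used here.
    return sorted(rows, key=lambda row: (row[first], row[second]))

rows = [
    ("the big", "cat", "sat"),
    ("a small", "dog", "ran"),
    ("the old", "ant", "fled"),
]
by_occurrence = sort_concordance(rows, "CR")
by_left = sort_concordance(rows, "LC")
```

Here `by_occurrence` starts with the "ant" row and `by_left` starts with the "a small" row, while `TO` leaves the rows in text order.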
3. Table 10.3: Meaning of the parameters. Parameters in the Config file and the corresponding parameters in the .grf file: BACKGROUND COLOR corresponds to BCOLOR; FOREGROUND COLOR to FCOLOR; AUXILIARY NODES COLOR to ACOLOR; COMMENT NODES COLOR to SCOLOR; SELECTED NODES COLOR to CCOLOR. The parameter ANTIALIASING indicates whether the graphs, as well as the sentence automata, are displayed by default with the antialiasing effect. The parameter HTML VIEWER indicates the name of the browser to use for displaying the concordances. If no browser name is specified, the concordances are displayed in a Unitex window. 10.9.2 The file system_dic.def. The file system_dic.def is a text file that describes the list of system dictionaries that are applied by default. This file can be found in the directory of the current language. Each line corresponds to the name of a .bin file. The system dictionaries are in the system directory, in the sub-directory (current language)/Dela. Here is an example of the file: delacf.bin¶ delaf.bin¶ 10.9.3 The file user_dic.def. The file user_dic.def is a text file that describes the list of dictionaries the user has defined to apply by default. This file is in the directory of the current language and has the same format as the file system_dic.def. The dictionaries need to be in the sub-directory (current language)/Dela of the personal directory of the user. 10.9.4 The file user
4. interpreted like lines of the table. 8.2.3 The template graphs. The template graphs are the graphs containing variables that refer to the columns of a lexicon-grammar table. This mechanism is usually used with syntactic graphs, but nothing prevents the construction of template graphs for inflection, preprocessing or normalization. The variables that refer to columns are formed with the symbol @ followed by the name of the column in capital letters (the columns are named starting with A). Example: @C refers to the third column of the table. Whenever a variable needs to be replaced by a + or a - sign, the - sign corresponds to the removal of the path through that variable. It is possible to carry out the inverse operation by putting an exclamation mark in front of the @ symbol. In that case, whenever the variable refers to the + sign, the path is removed. If the variable refers neither to the + sign nor to the - sign, it is replaced by the contents of the cell. There is also the special variable @%, which is replaced by the number of the line in the table. The fact that its value is different for each line allows its use as a simple characterization of the line. That variable is not affected by an exclamation mark to its left. Figure 8.2 shows an example of a template graph designed to be applied to the lexicon-grammar table 31H presented in figure 8.3.
5. &nbsp;THAT PL&nbsp;<br>¶ STRICT of <a href="61 66 2">merry</a>&nbsp;Engl&nbsp;<br>¶ S IN THAT <a href="40 48 2">PLEASANT</a>&nbsp;D&nbsp;<br>¶ &nbsp;which is <a href="84 91 2">watered</a>&nbsp;by&nbsp;<br>¶ </font>¶ </td></table></body>¶ </html>¶ Figure 10.2 shows the page that corresponds to the file above. Figure 10.2: Example of a concordance. 10.7 Dictionaries. The compression of the DELAF dictionaries done by the program Compress produces two files: a .bin file that represents the minimal automaton of the inflected forms of the dictionaries, and a .inf file that contains the compressed forms allowing the dictionaries to be reconstructed from the inflected forms. This section describes the format of these two file types, as well as the format of the file CHECK_DIC.TXT, which contains the result of the verification of a dictionary. 10.7.1 The .bin files. A .bin file is a binary file that represents an automaton. The first 4 bytes of the file represent a number that indicates the size of the file in bytes. The states of the automaton are encoded in the following way:
6. Transducers; Zoom; Antialiasing; Display Options and Colors; 5.4 Graphs outside of Unitex; 5.4.1 Inserting a graph into a document; 5.4.2 Printing a graph. 6 Advanced use of graphs; 6.1 Types of graphs; 6.1.1 Inflection graphs; 6.1.2 Preprocessing graphs; 6.1.3 Graphs for normalizing the text automaton; 6.1.4 Syntactic graphs; 6.1.5 ELAG grammars; 6.1.6 Template graphs; 6.2 Compilation of a grammar; 6.2.1 Compilation of a graph; 6.2.2 Approximation with a finite state transducer; 6.2.3 Constraints on grammars; 6.2.4 Error detection; 6.3 Exploring grammar paths; 6.4 Graph collections; 6.5 Rules for applying transducers; 6.5.1 Insertion to the left of the matched pattern; 6.5.2 Application while advancing through the text; 6.5.3 Priority of the leftmost match; 6.5.4 Priority of the longest match
7. 6.5.5 Transductions with variables; 6.6 Applying graphs to texts; 6.6.1 Configuration of the search; 6.6.2 Concordance; 6.6.3 Modification of the text. 7 Text automata; 7.1 Displaying text automata; 7.2 Construction; 7.2.1 Construction rules for text automata; 7.2.2 Normalization of ambiguous forms; 7.2.3 Normalization of clitic pronouns in Portuguese; 7.2.4 Keeping the best paths; 7.3 Resolving lexical ambiguities with ELAG; 7.3.1 Grammars for resolving ambiguities; 7.3.2 Compiling ELAG grammars; 7.3.3 Resolving ambiguities; 7.3.4 Grammar collections; 7.3.5 Window for ELAG processing; 7.3.6 Description of the tag sets; 7.3.7 Grammar optimization; 7.4 Manipulation of text automata; 7.4.1 Displaying sentence automata; 7.4.2 Modifying the text automaton; 7.4.3 Parameters of presentation. 8 Lexicon-Grammar; 8.1 Lexicon-grammar tables
8. Each entity i reflects the token with index i in the file tokens.txt. These entities are encoded in four bytes. NOTE: the tokens are numbered starting at 0. 10.4.4 The file tokens.txt. The file tokens.txt is a text file that contains the list of all lexical units of the text. The first line of this file indicates the number of units found in the file. The units are separated by newlines. Whenever a sequence is found in the text with capitalization variants, each variant is encoded as a distinct unit. NOTE: the newlines that might be in the .snt file are encoded like spaces. Therefore there is never a unit encoding the newline. 10.4.5 The files tok_by_alph.txt and tok_by_freq.txt. These two files are text files that contain the list of lexical units sorted alphabetically or by frequency. In the tok_by_alph.txt file, each line is composed of a unit, followed by a tab and the number of occurrences of the unit within the text. The lines of the tok_by_freq.txt file are formed on the same principle, but the number of occurrences comes first, followed by the tab and the unit. 10.4.6 The file enter.pos. This file is a binary file containing the list of positions of the newline symbol in the .snt file. Each position is the index in the file text.cod where a newline has been replaced by a space. These positions are entities that are encoded in 4 bytes. 10.5 Text Automaton. 10.5.1 The file text.fst2. The file text.fst2 is
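The relationship between tokens.txt and the 4-byte token indices can be illustrated with a small sketch. It works on in-memory data rather than real files, and the byte order of the 4-byte integers is an assumption (the excerpt does not state it); the function names are hypothetical.

```python
def parse_tokens_txt(content: str) -> list[str]:
    """Parse the content of a tokens.txt file: a count line, then one unit per line."""
    lines = content.split("\n")
    count = int(lines[0])
    return lines[1:1 + count]

def decode_token_indices(data: bytes, byteorder: str = "little") -> list[int]:
    """Decode a buffer of 4-byte integers into token indices.

    The byte order is an assumption; pass "big" if the data is big-endian.
    """
    if len(data) % 4 != 0:
        raise ValueError("buffer length must be a multiple of 4")
    return [int.from_bytes(data[i:i + 4], byteorder) for i in range(0, len(data), 4)]

# In-memory stand-ins for tokens.txt and the encoded text.
tokens = parse_tokens_txt("5\nA\n \ncat\nis\na\n")
encoded = b"".join(i.to_bytes(4, "little") for i in [0, 1, 2, 1, 3, 1, 4, 1, 2])
indices = decode_token_indices(encoded)
text = "".join(tokens[i] for i in indices)
```

Joining the units selected by the indices gives back the text "A cat is a cat", since tokens are numbered starting at 0.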
9. dictionaries and speeds up the lookup. This operation is done by the Compress program. This program takes a dictionary in text form as input (for example my_dico.dic) and produces two files: my_dico.bin contains the minimal automaton of the inflected forms of the dictionaries; my_dico.inf contains the codes that make it possible to reconstruct the original dictionary from the inflected forms in the my_dico.bin file. The minimal automaton in the my_dico.bin file is a representation of inflected forms in which all common prefixes and suffixes are factorized. For example, the minimal automaton of the words me, te, se, ma, ta and sa can be represented by the graph in figure 3.8. Figure 3.8: Representation of an example of a minimal automaton. To compress a dictionary, open it and click on "Compress into FST" in the "DELA" menu. The compression is independent of the language and of the content of the dictionary. The messages produced by the program are displayed in a window that is not closed automatically. You can see the size of the resulting .bin file, the number of lines read and the number of inflectional codes created. Figure 3.9 shows the result of the compression of a dictionary of simple words. The resulting files are compressed to about 95% for dictionaries containing simple words and 50% for those with compound words. 3.6 Applying dictionaries. Dictionaries can be applied after pre-processing or by
10. <<^([^l]|l[^e])>>: does not begin with l, unless the l is followed by a letter other than e; in other words, any word except the ones starting with le. By default, a morphological filter alone is regarded as applying to the pattern <TOKEN>, that is, any lexical unit. On the other hand, when a filter immediately follows a lexical pattern, it applies to what was recognized by the lexical pattern. Here are some examples of such combinations: <V:K><<i$>>: a past participle ending with i; <CDIC><<->>: a compound word containing a dash; <CDIC><< .* >>: a compound word containing two spaces; <A:fs><<^pro>>: a feminine singular adjective beginning with pro; <DET><<^([^u]|(u[^n])|(un.+))>>: a French determiner different from un; <!DIC><<es$>>: a word which is not in the dictionary and which ends with es; <V:S:T><<uiss>>: a verb in the past or present subjunctive containing uiss. NOTE: by default, morphological filters are subject to the same case variations as the lexical patterns. Thus, the pattern <<^é>> will recognize all the words starting with é, but also those which start with E or É. To force the matcher to respect the exact case of the pattern, add _f_ immediately after it. Example: <A><<^é>>_f_ 4.8 Search. 4.8.1 Configuration of the search. In order
11. line can contain one or more compressed forms. If there are multiple forms, they are separated by commas. Each compressed form is made up of a sequence that makes it possible to recover a canonical form starting from an inflected form, followed by a sequence of grammatical, semantic and inflectional codes that are associated with the entry. The way the canonical form is compressed depends on the inflected form. If the two forms are identical, the compressed form just gives the grammatical, semantic and inflectional information, like this: .N+Hum:ms If the forms are different, the compression program cuts the two forms into units. These units can be a space, a hyphen, or a sequence of characters that contains neither a space nor a hyphen. This way of cutting into units makes it possible to take the inflections of compound words into account efficiently. If the inflected and the canonical form do not have the same number of units, the program encodes the canonical form by the number of characters to remove from the inflected form, followed by the characters to append. Thus, the first line of the file below corresponds to the dictionary line: James Bond,007.N Since the sequence James Bond contains three units and 007 only one, the canonical form is encoded with _10\0\0\7 The _ character indicates that the two forms do not have the same number of units. The following number (here 10) ind
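The "number of characters to remove, then characters to append" idea can be checked with a very small sketch. It only covers the case described above where the two forms have different numbers of units, and it takes the two values as already-parsed arguments; the function name and the second example are hypothetical, not taken from the manual.

```python
def rebuild_canonical(inflected: str, n_remove: int, to_append: str) -> str:
    """Rebuild a canonical form by dropping n_remove trailing characters
    from the inflected form and appending to_append."""
    if n_remove < 0 or n_remove > len(inflected):
        raise ValueError("cannot remove more characters than the form contains")
    return inflected[:len(inflected) - n_remove] + to_append

# Example from the text: "James Bond" has 10 characters, all of which are removed,
# and "007" is appended.
canonical = rebuild_canonical("James Bond", 10, "007")

# Hypothetical second example: remove one character, append nothing.
singular = rebuild_canonical("houses", 1, "")
```

With the values from the James Bond entry, the ten characters of the inflected form are removed and the result is the canonical form 007.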
14. FFFE instead of FEFF, and 000D and 000A are 0D00 and 0A00 respectively. 10.2 Alphabet files. There are two kinds of alphabet files: a file which defines the characters of a language, and a file that indicates the sorting preferences. The first is called the alphabet, the second the sorted alphabet. 10.2.1 Alphabet. The alphabet file is a text file that describes all characters of a language, as well as the correspondences between capitalized and non-capitalized letters. This file is called Alphabet.txt and is found in the root of the directory of a language. Its presence is obligatory for Unitex to function. Example: the English alphabet file has to be in the directory English. Each line of the alphabet file must have one of the following three forms, followed by a newline symbol: (1) a hash symbol followed by two characters X and Y, which indicates that all characters between X and Y are letters. All these characters are considered to be in non-capitalized and capitalized form at the same time. This method is used to define the alphabets of Asian languages like Korean, Chinese or Japanese, where there is no distinction between upper and lower case and where the number of characters makes a complete enumeration very tedious; (2) two characters X and Y, which indicate that X and Y are letters and that X is the capitalized equivalent of the non-capitalized Y; (3) a unique character X, which defines X as a let
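The three line forms of an alphabet file can be sketched as a small parser. This is only an illustration under assumptions: in a two-character line the first character is taken to be the capitalized form of the second, the file has already been decoded to text, and the function and variable names are hypothetical.

```python
def parse_alphabet(lines):
    """Return (letters, upper_of) from the lines of an alphabet file.

    letters  : set of all characters declared as letters
    upper_of : maps a letter to the set of its capitalized equivalents
    """
    letters = set()
    upper_of = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("#") and len(line) == 3:
            # Range form: every character between X and Y is a letter that is
            # both its own capitalized and non-capitalized form.
            for code_point in range(ord(line[1]), ord(line[2]) + 1):
                ch = chr(code_point)
                letters.add(ch)
                upper_of.setdefault(ch, set()).add(ch)
        elif len(line) == 2:
            # Pair form (assumption: first character is the capitalized one).
            upper, lower = line
            letters.update((upper, lower))
            upper_of.setdefault(lower, set()).add(upper)
        elif len(line) == 1:
            # Single character: a letter that is its own capitalized form.
            letters.add(line)
            upper_of.setdefault(line, set()).add(line)
        else:
            raise ValueError(f"unexpected alphabet line: {line!r}")
    return letters, upper_of

letters, upper_of = parse_alphabet(["Aa\n", "#xz\n", "_\n"])
```

In this toy input, `a` has `A` as its capitalized equivalent, while `x`, `y`, `z` and `_` are each their own equivalent.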
15. Figure 8.2: Example of a template graph. 8.2.4 Automatic generation of graphs. In order to generate graphs from a template graph and a table, the table first needs to be opened by clicking on "Open" in the "Lexicon-Grammar" menu (see figure 8.4). The table needs to be in Unicode text format. Figure 8.3: Lexicon-grammar table 31H. Figure 8.4: Menu "Lexicon-Grammar". The selected table is then opened in a window (see figure 8.5). To automatically generate graphs from a template graph, click on "Compile to GRF" in the "Lexicon-Grammar" menu. The window in figure 8.6 is then shown. In the field "Reference Graph (in GRF format)", enter the name of the template graph to be used. In the field "Resulting GRF grammar", enter the name of the main graph that will be g
16. Norwegian and Russian. 3 Dictionaries; 3.1 The DELA dictionaries; 3.1.1 The DELAF format; 3.1.2 The DELAS format; 3.1.3 Dictionary contents; 3.2 Verification of the dictionary format; 3.3 Sorting; 3.4 Automatic inflection; 3.5 Compression; 3.6 Applying dictionaries; 3.6.1 Priorities; 3.6.2 Application rules for dictionaries; 3.7 Bibliography. 4 Searching with regular expressions; 4.1 Definition; 4.2 Lexical units; 4.3 Patterns; 4.3.1 Special symbols; 4.3.2 References to the dictionaries; 4.3.3 Grammatical and semantic constraints; 4.3.4 Inflectional constraints; 4.3.5 Negation of a pattern; 4.4 Concatenation; 4.7 Morphological filters; 4.8 Search; 4.8.1 Configuration of the search; 4.8.2 Presentation of the results. 5 Local grammars; 5.1 The local grammar formalism; 5.1.1 Algebraic grammars; 5.1.2 Extended algebraic grammars; 5.2 Editing graphs; 5.2.1 Import of Intex graphs; 5.2.2 Creating a graph; 5.2.3 Sub-graphs; 5.2.4 Manipulating boxes; 5.2.6 Using variables; 5.2.7 Copying lists; 5.2.8 Special symbols; 5.2.9 Toolbar commands; 5.3 Display options; 5.3.1 Sorting the lines of a box; 5.3.4 Box alignment
17. (End of a CHECK_DIC.TXT sample; the list of characters used is not recoverable.) 2 grammatical/semantic codes used in dictionary: INTI; INTJ (warning: 1 suspect char (1 space): SPACE I N T J). 0 inflectional code used in dictionary. 10.8 ELAG Files. 10.8.1 The tagset.def file: see section 7.3.6, page XXX. 10.8.2 The .lst files. THE .LST FILES ARE NOT ENCODED IN UNICODE. A .lst file contains a list of names of .grf files located in the ELAG directory of the current language. The elag.lst file provided for French looks like this: <get from French version> 10.8.3 The .elg files. The .elg files contain the compiled ELAG rules. These files have the .fst2 format. 10.8.4 The .rul files. These files list the various .elg files that represent a set of ELAG rules. A .rul file consists of as many parts as there are .elg files. Each part consists of the list of ELAG grammars that correspond to a .elg file. Each filename is preceded by a tab, followed by a line that contains the .elg filename in angle brackets. The lines starting with a tab serve as comments and are ignored by the Elag program. The default file elag.rul for French looks like this: <get from French version> 10.9 Configuration files. 10.9.1 The file Config. Whenever the user modifies his preferences for a given language, these modifications are saved in a text
18. This results from the fact that factorized lexical entries (entries which gather several different inflectional interpretations, such as for example {se,.PRO+PpvLE:3ms:3fs:3mp:3fp}) were exploded in order to treat each inflectional interpretation separately. To refactorize these entries, click on the "implode" button. Clicking on the "explode" button shows you an exploded view of the text automaton. If you click on the "replace" button, the resulting automaton will become the new text automaton. Thus, if you use other grammars, they will apply to the already partially disambiguated automaton, which makes it possible to accumulate the effects of several grammars. 7.3.4 Grammar collections. It is possible to gather several ELAG grammars into a grammar collection in order to apply them in one step. The sets of ELAG grammars are described in .lst files. They are managed through the window for compiling ELAG grammars (figure 7.16). The label on the top left indicates the name of the current collection, by default elag.lst. The contents of this collection are displayed in the right part of the window. To modify the name of the collection, click on the "browse" button. In the dialog box that appears, enter the .lst file name for the collection. To add a grammar to the collection, select it in the file explorer in the left frame and click on the ">>" button. Once you have selected all your grammars, compile them by clicking on the
19. This option only has an effect if the option "File Name" is selected. Frame: draw a frame around the graph. Right to Left: invert the reading direction of the graph (see an example in figure 5.22). You can reset the parameters to the default ones by clicking on "Default". If you click on "OK", only the current graph will be modified. In order to modify the default preferences for a language, click on "Preferences" in the "Info" menu and choose the tab "Graph Representation". The preferences configuration window has an extra option concerning antialiasing (see figure 5.23). This option activates antialiasing by default for all graphs in the current language. It is advised not to activate this option if your machine is not very fast. Figure 5.23: Default preferences configuration. 5.4 Graphs outside of Unitex. 5.4.1 Inserting a graph into a document. In order to include a graph in a document, you have to convert it to an image. To do this, activate antialiasing for the graph that interests you (this is not obligatory but results in a better image quality). In Windows: press "Print Screen" on your keyboard. This key should be next to the F12 key. Start the Paint program in the Windows "Utilities" menu. Press <Ctrl+V>. Paint will tell you that the image in the clipboard is too large and ask if you want to enlarge the image. Click on "Yes". You can
20. a cat returns the following list of lexical units: A, SPACE, cat, is, a. You will observe that tokenization is case sensitive (A and a are two distinct tokens) and that each token is listed only once. Numbering these tokens from 0 to 4, the text can be represented by a sequence of numbers, as described in the following table.
   Index:                       0   1      2    1      3   1      4   1      2
   Corresponding lexical unit:  A   SPACE  cat  SPACE  is  SPACE  a   SPACE  cat
   Table 2.1: Representing the text "A cat is a cat"
   For more details, see chapter 10. 2.5.5 Applying dictionaries. Applying dictionaries consists of building the subset of dictionaries consisting only of forms that are present in the text. Thus, the result of applying an English dictionary to the text "Igor's father-in-law is ill" produces a dictionary of the following simple words:
   father,.N+Hum:s
   father,.V:W:P1s:P2s:P1p:P2p:P3p
   ill,.A
   ill,.ADV
   ill,.N:s
   in,.A
   in,.N:s
   in,.PART
   in,.PREP
   is,be.V:P3s
   is,i.N:p
   law,.N:s
   law,.V:W:P1s:P2s:P1p:P2p:P3p
   s,.N:s
   Figure 2.11: Lexical units in an English text sorted by frequency.
   as well as a dictionary of compound words consisting of a single entry:
   father-in-law,.N+NPN+Hum+z1:s
   Since the sequence Igor is neither a simple English word nor a part of a compound word, it is treated as an unknown word. The application of dictionaries is done through the program Dico. The three files produced (dlf for simple
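The numbering of lexical units described for "A cat is a cat" can be reproduced with a short sketch. It is a simplification: letters are recognized with a generic Unicode-letter regular expression instead of a Unitex alphabet file, and every other character is treated as a unit of its own; the function name is hypothetical.

```python
import re

def tokenize(text):
    """Return (units, indices): the distinct lexical units in order of first
    appearance, and the text as a sequence of indices into that list."""
    # A run of letters is one unit; any other single character is its own unit.
    pieces = re.findall(r"[^\W\d_]+|.", text, flags=re.DOTALL)
    units = []
    index_of = {}
    indices = []
    for piece in pieces:
        if piece not in index_of:  # case sensitive: "A" and "a" differ
            index_of[piece] = len(units)
            units.append(piece)
        indices.append(index_of[piece])
    return units, indices

units, indices = tokenize("A cat is a cat")
```

Here `units` holds the five distinct tokens (with the space as the second one) and `indices` holds the nine positions of the text.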
21. a special .fst2 file that represents the text automaton. In that file, each sub-graph represents a sentence automaton. The areas reserved for the names of the sub-graphs are used to store the sentences from which the sentence automata have been constructed. With the exception of the first label, which is always the empty word <E>, the labels have to be either lexical units or entries from DELAF in braces. Example: here is the file that corresponds to the text "He is drinking orange juice" (the state and transition lines are garbled in this excerpt and are left out):
   0000000001¶
   -1 He is drinking orange juice¶
   <E>¶
   {He,he.N:s:p}¶
   {He,he.PRO+Nomin:3ms}¶
   {is,be.V:P3s}¶
   {is,i.N:p}¶
   {drinking,drinking.A}¶
   {drinking,drinking.N:s}¶
   {drinking,drink.V:G}¶
   {orange,orange.A}¶
   {orange,orange.N+Conc:s}¶
   {orange,orange.N:s}¶
   {orange juice,orange juice.N+XN+z1:s}¶
   {juice,juice.N+Conc:s}¶
   {juice,juice.V:W:P1s:P2s:P1p:P2p:P3p}¶
   10.5.2 The file cursentence.grf. The file cursentence.grf is generated by Unitex during the display of a sentence automaton. The program Fst2Grf constructs a .grf file that represents a sentence automaton from the file text.fst2. 10.5.3 The file sentenceN.grf. Whenever the user modifies a sentence automaton, that automaton is saved under the name sentenceN.grf, where N represents the number
22. are presented. Chapter 7 introduces the concept of a text automaton and describes its properties. This chapter also describes the operations on this object, in particular how to disambiguate lexical items with the ELAG program. Chapter 8 contains an introduction to lexicon-grammar tables, followed by a description of the method of constructing grammars based on these tables. Chapter 9 contains a detailed description of the different external programs that make up the Unitex system. Chapter 10 contains descriptions of all file formats used in the system. Chapter 1: Installing Unitex. Unitex is a multi-platform system that runs on Windows as well as on Linux or MacOS. This chapter describes how to install and how to launch Unitex on any of these systems. It also presents the procedures used to add new languages and to uninstall Unitex. 1.1 Licenses. Unitex is free software. This means that the sources of the programs are distributed with the software, and that anyone can modify and redistribute them. The code of the Unitex programs is under the LGPL licence, except for the TRE library for dealing with regular expressions, by Ville Laurikari, which is under the GPL licence. The LGPL licence is more permissive than the GPL licence, because it makes it possible to use LGPL code in non-free software. From the point of view of the user there is no difference, because in both cases the software can freely be u
25. dictionaries. In order to verify the format of a dictionary, you first open it by choosing "Open" in the "DELA" menu. Figure 3.1: DELA menu. Let's load the dictionary shown in figure 3.2. Figure 3.2: Dictionary example. In order to start the automatic verification, click on "Check Format" in the "DELA" menu. A window like the one in figure 3.3 is opened. Figure 3.3: Automatic verification of a dictionary. Figure 3.4: Results of the automatic verification (the window reports "Line 1: no point found" for agreeably,ADV; "Line 2: no comma found" for agreed.INTJ; and "Line 4: no grammatical code", followed by the list of code points of the characters used in the dictionary). In this window you choose the dictionary type you want to verify. The results of verifying the dictionary in figure 3.2 are shown in figure 3.4. The first error is caused by a missing period. The second is caused by the fact that no comma was found after the end of an inflected form. The third error indicates that the program didn't find any grammatical or semantic
26. figure 2.13 in which you can select the list of dictionaries to apply. The list User resources shows all compressed dictionaries present in the user's directory for the current language. The dictionaries installed in the system are listed in the scroll list named System resources. Use <Ctrl+click> to select multiple dictionaries. The button Set Default defines the current selection of dictionaries as the default. This default selection will then be used during preprocessing if you activate the option Apply All default Dictionaries. [Figure 2.13: Parameterizing the application of dictionaries] 2.5.6 Analysis of compound words in German, Norwegian and Russian. In certain languages such as Norwegian and German, new compound words can be formed by concatenating other words. For example, the word aftenblad, meaning evening journal, is obtained by combining the words aften (evening) and blad (journal). The program PolyLex searches the list of unknown words after the application of dictionaries and tries to treat each of these words as a compound word. If a word has at least one analysis as a compound word, it is deleted from the list of unknown words and the lines produced for this word are appended to the text dictionary of simple words. Chapter 3 Dic
27. file named Config, which can be found in the directory of the current language. The file has the following syntax:
    TEXT FONT NAME=Courier New
    TEXT FONT STYLE=0
    TEXT FONT SIZE=10
    CONCORDANCE FONT NAME=Courier new
    CONCORDANCE FONT HTML SIZE=3
    INPUT FONT NAME=Times New Roman
    INPUT FONT STYLE=0
    INPUT FONT SIZE=10
    OUTPUT FONT NAME=Times New Roman
    OUTPUT FONT STYLE=1
    OUTPUT FONT SIZE=12
    DATE=true
    FILE NAME=true
    PATH NAME=false
    FRAME=true
    RIGHT TO LEFT=false
    BACKGROUND COLOR=16777215
    FOREGROUND COLOR=0
    AUXILIARY NODES COLOR=13487565
    COMMENT NODES COLOR=16711680
    SELECTED NODES COLOR=255
    ANTIALIASING=true
    HTML VIEWER=
    MAX TEXT FILE SIZE=1024000
    ICON BAR POSITION=West
The first three lines indicate the name, the style and the size of the font used to display texts, dictionaries, lexical units, sentences in text automata, etc. The parameters CONCORDANCE FONT NAME and CONCORDANCE FONT HTML SIZE define the name and the size of the font used to display concordances in HTML. The font size is a value between 1 and 7. The parameters INPUT FONT ... and OUTPUT FONT ... define the name, the style and the size of the fonts used for displaying the paths and the transductions of the graphs. The following 10 parameters correspond to the parameters given in the headings of the graphs. Table 10.3 describes the correspondences
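The Config syntax described above can be read with a few lines of Python. This is only a sketch under stated assumptions: it assumes each line has the form NAME=value (the separator is not visible in every rendering of the listing), and the helper names parse_config and rgb are hypothetical, not part of Unitex. The interpretation of color values as packed 0xRRGGBB integers is also an assumption, suggested by the value 16777215 for a white background.

```python
def parse_config(text):
    """Parse NAME=value lines into a dict (assumed syntax, not an official parser)."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        name, _, value = line.partition("=")
        name, value = name.strip(), value.strip()
        if value in ("true", "false"):
            settings[name] = (value == "true")   # boolean flags such as DATE
        elif value.isdigit():
            settings[name] = int(value)          # sizes, styles, colors
        else:
            settings[name] = value               # font names, paths, empty values
    return settings


def rgb(color):
    """Split a packed integer into (red, green, blue); assumes 0xRRGGBB packing."""
    return ((color >> 16) & 0xFF, (color >> 8) & 0xFF, color & 0xFF)


sample = """TEXT FONT NAME=Courier New
TEXT FONT SIZE=10
DATE=true
BACKGROUND COLOR=16777215
HTML VIEWER=
"""
config = parse_config(sample)
print(config["TEXT FONT NAME"], config["TEXT FONT SIZE"], rgb(config["BACKGROUND COLOR"]))
```

Keeping the values typed (bool, int, str) makes it easy to validate constraints such as the 1 to 7 range of CONCORDANCE FONT HTML SIZE before writing the file back.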
28. followed by a verb. If an element in an expression is optional, it is sufficient to use the union of this element and the empty word epsilon. Examples: the (little+<E>) cat recognizes the sequences the cat and the little cat; (<E>+Anglo-)(French+Indian) recognizes French, Indian, Anglo-French and Anglo-Indian. 4.6 Kleene star. The Kleene star, represented by the character *, recognizes zero, one or several occurrences of an expression. The star must be placed on the right-hand side of the element in question. The expression this is very* cold recognizes this is cold, this is very cold, this is very very cold, etc. The star has a higher priority than the other operators. You have to use brackets in order to apply the star to a complex expression. The expression 0,(0+1+2+3+4+5+6+7+8+9)* recognizes a zero followed by a comma and by a possibly empty sequence of digits. ATTENTION: It is prohibited to search for the empty word with a regular expression. If you try to search for (0+1+2+3+4+5+6+7+8+9)*, the program will flag an error as shown in figure 4.3. [Figure 4.3: Error message when searching for the empty word. The message ends with: ERROR: the main graph regexp recognizes <E>] 4.7 Morphological Filters. It is po
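The behavior of the Kleene star and the empty-word restriction can be illustrated with Python's re module. This is an analogy only: Python patterns work on characters and use a different syntax from Unitex regular expressions, which work on lexical units, so the patterns below are hand-written equivalents and the function matches_empty_word is a hypothetical helper, not a Unitex feature.

```python
import re

# Rough Python equivalent of the Unitex expression "this is very* cold".
very_star = re.compile(r"this is (?:very )*cold")

# Rough equivalent of "0," followed by a possibly empty sequence of digits.
zero_comma_digits = re.compile(r"0,[0-9]*")


def matches_empty_word(pattern):
    """Return True if the pattern accepts the empty string.

    Mirrors the check that makes Unitex reject an expression
    recognizing the empty word.
    """
    return re.fullmatch(pattern, "") is not None


for sentence in ("this is cold", "this is very cold", "this is very very cold"):
    print(sentence, bool(very_star.fullmatch(sentence)))

print(matches_empty_word(r"[0-9]*"))     # a starred expression alone accepts ""
print(matches_empty_word(r"0,[0-9]*"))   # the leading "0," prevents an empty match
```

The last two calls show why the digit sequence alone is rejected while the same sequence preceded by "0," is accepted: only the first can match without consuming any input.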
29. glossanet, and a text file with the name defined by the user of the program if the program has constructed a modified version of the text. In html mode, the occurrence is coded as a link. The reference associated with this link is of the form <a href="X Y Z">. X and Y represent the beginning and ending positions of the occurrence, in characters, in the file name_of_file.snt. Z represents the number of the sentence in which the occurrence was found. 9.4 Convert. Convert src [dest] mode text_1 text_2 text_3. This program changes the encoding of text files. The src parameter indicates the input encoding. The optional dest parameter indicates the output encoding. By default, the output encoding is LITTLE-ENDIAN. The possible values for these parameters are the following: FRENCH, ENGLISH, GREEK, THAI, CZECH, GERMAN, SPANISH, PORTUGUESE, ITALIAN, NORWEGIAN, LATIN (default); windows-1252: Microsoft Windows 1252, Latin I code page (Western Europe & USA); windows-1250: Microsoft Windows 1250 code page (Central Europe); windows-1257: Microsoft Windows 1257 code page (Baltic countries); windows-1251: Microsoft Windows 1251 code page (Cyrillic); windows-1254: Microsoft Windows 1254 code page (Turkish); windows-1258: Microsoft Windows 1258 code page (Viet Nam); iso-8859-1: ISO 8859-1 code page (Latin 1, Western Europe & USA); iso-8859-15: ISO 8859-15 code page (Latin 9, Western
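As an illustration of the link format described above, the following sketch extracts the X, Y and Z values from an HTML concordance. It assumes that the three values are non-negative integers separated by single spaces inside a double-quoted href attribute; the names Occurrence and parse_occurrences are hypothetical and not part of Unitex.

```python
import re
from typing import List, NamedTuple


class Occurrence(NamedTuple):
    start: int      # X: beginning position of the occurrence, in characters
    end: int        # Y: ending position of the occurrence, in characters
    sentence: int   # Z: number of the sentence containing the occurrence


# Assumed shape of the link: <a href="X Y Z">
_HREF = re.compile(r'<a\s+href="(\d+) (\d+) (\d+)"')


def parse_occurrences(html: str) -> List[Occurrence]:
    """Return one Occurrence per link found in the concordance HTML."""
    return [Occurrence(*(int(g) for g in m.groups())) for m in _HREF.finditer(html)]


sample = 'left context <a href="120 127 3">example</a> right context'
print(parse_occurrences(sample))
```

With the positions available as integers, an occurrence can be located again in the .snt file without re-running the search.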
30. if those codes do not start with this character. The option Remove class numbers replaces the codes with numbers used in the DELAS by codes without numbers. Example: V17 will be replaced by V. [Figure 3.5: Configuration of automatic inflection] Figure 3.6 shows an example of an inflectional grammar. [Figure 3.6: Inflectional grammar N4, producing cheval (ms) and chevaux (mp)] The paths describe the suffixes to add or to remove to get to an inflected form from a canonical form, and the outputs (text in bold under the boxes) are the inflectional codes to add to a dictionary entry. In our example, two paths are possible. The first doesn't modify the canonical form and adds the inflectional code ms. The second deletes a letter with the L operator, then adds the ux suffix and adds the inflectional code mp. Three operators are possible: L (left): remove a letter from the entry; R (right): restore a letter to the entry. In French, many verbs of the first group are conjugated in the third person singular of the present tense by removing the r of the infinitive and changing the 4th letter from the end to è: peler → pèle, acheter → achète, gérer → gère, etc. Instead of describing an inflection
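The L and R operators described above can be sketched in a few lines of Python. This is a minimal model, not the actual Inflect implementation: it assumes that L removes the last letter of the form being built, that R restores the letter of the canonical form at the current position, and that any other character is appended as a suffix letter. The function name inflect and the operator sequence LLLLèRR are illustrative assumptions; only the sequence Lux for cheval comes from the example in the text.

```python
def inflect(lemma: str, path: str) -> str:
    """Apply a sequence of inflection operators to a canonical form.

    Assumed semantics (sketch only):
      L     removes the last letter of the current form
      R     restores the letter of the lemma at the current position
      other the character is appended to the current form
    """
    form = list(lemma)
    for op in path:
        if op == "L":
            if form:
                form.pop()
        elif op == "R":
            if len(form) < len(lemma):
                form.append(lemma[len(form)])
        else:
            form.append(op)
    return "".join(form)


print(inflect("cheval", ""))         # first path: the canonical form is unchanged
print(inflect("cheval", "Lux"))      # second path: delete one letter, add "ux"
print(inflect("peler", "LLLLèRR"))   # hypothetical path for the peler → pèle pattern
```

Under these assumptions a single path such as LLLLèRR covers peler, acheter and gérer alike, which is the point of restoring letters instead of spelling out a separate suffix for each verb.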
31. in REPLACE mode constitutes the end of the variable NOUN and the box containing the transduction If a space is read in REPLACE mode it is erased because it is part of the text analyzed by the grammar In order to avoid the loss of this space it is therefore necessary to reinsert it by putting it into a transduction If the beginning or the end of variable is malformed end of a variable before its be ginning or absence of the beginning or end of a variable it will be ignored during the transductions If you want to respect text spaces the solution consists in making a difference between nouns that are followed by a space and other nouns Figure 6 24 shows such a grammar In the upper path there is a space after SNOUN SADJ Applying this grammar in REPLACE mode builds the concordance shown on Figure 6 25 You can see in this concordance that previous spaces have been left unchanged and that no extra space was inserted Note that 6 5 RULES FOR APPLYING TRANSDUCERS 89 the boxes containing and must immediately follow the lt N gt box Placing them after the SNOUN box would have no effect Figure 6 24 Caption missing There is no limit of the number of possible variables The variables can be nested and even overlapping as is shown in figure 6 26 90 CHAPTER 6 ADVANCED USE OF GRAPHS Figure 6 25 Caption missing 6 6 Applying graphs to texts This section only applies to syntactic graphs 6 6 1 Configuration of
32. list of all syntax errors found in the dictionary: missing inflected or canonical form, missing grammatical code, empty lines, etc. Each error is described by the number of the line it concerns, a message describing the error and the contents of the line. Here is an example of a message: Line 12451: no point found garden,N:s. The second and third parts display the list of grammatical and/or semantic codes and the list of inflectional codes, respectively. In order to prevent coding errors, the program reports codes that contain spaces, tabs or non-ASCII characters. For instance, if a Greek dictionary contains the code ADV in which the Greek character A is used instead of the Latin A, the program reports the following warning: ADV warning: 1 suspect char (1 non ASCII char): (0391 D V). Non-ASCII characters are indicated by their hexadecimal character number. In the example above, the code 0391 represents the Greek A. Spaces are indicated by the sequence SPACE: Km s warning: 1 suspect char (1 space): (K m SPACE s). When a dictionary containing the four entries 1 2 et 3 INTJ, abracadabra INTJ, supercalifragilisticexpialidocious INTJ and damned INTJ is verified, the following file CHECK_DIC.TXT is obtained: Line 1: unprotected comma in lemma (1 2 et 3 INTJ); Line 2: no point found (abracadabra INTJ); followed by the list of all characters used in forms, given by their hexadecimal codes (0020, 0021, 002C,
33. makes them different from strict finite state transducers. The program Flatten transforms an FST2 grammar into a finite state transducer whenever this is possible, and constructs an approximation if not. This function thus yields objects that are easier to manipulate and to which all classical algorithms on automata can be applied. In order to compile and thus transform a grammar, select the command Compile & Flatten FST2 in the Tools submenu of the FSGraph menu. The window of figure 6.5 allows you to configure the approximation. [Figure 6.5: Configuration of approximation of a grammar] The box Flattening depth lets you specify the level of embedding of subgraphs. This value represents the maximum depth up to which calls to subgraphs will be replaced by the subgraphs themselves. The box Expected result grammar format determines the behavior of the program beyond the selected limit. If you select the option Finite State Transducer, the calls to subgraphs will be ignored beyond the maximum depth. This option guarantees that we obtain a finite state transducer, however possibly not equivalent to the origin
34. number, which allows texts to be represented without having to take into account the proprietary codes of different machines and/or operating systems. Unitex uses a two-byte representation of the Unicode 3.0 standard called Unicode Little-Endian (for more details, see [15]). [Figure 2.1: Language selection when starting Unitex] The texts that come with Unitex are already in Unicode format. If you try to open a text that is not in the Unicode format, the program proposes to convert it automatically (see figure 2.2). This conversion is based on the current language: if you are working in French, Unitex proposes to convert your text assuming that it is coded using a French code page. By default, Unitex proposes to either replace the original text or to rename the original file by inserting .old at the beginning of its extension. For example, if one has an ASCII file named balzac.txt, the conversion process will create a copy of this ASCII file named balzac.old.txt and will replace the contents of balzac.txt with its equivalent in Unicode. [Figure 2.2: Automatic conversion of a non-Unicode text] If the encoding suggested by default is not correct or if you wa
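The rename-and-convert behavior described above can be approximated outside Unitex with a short Python function. This is a sketch, not the code Unitex runs: the function name transcode_to_utf16le is hypothetical, cp1252 is assumed as the French code page, and writing a byte order mark (FF FE) at the start of the output is an assumption about what a Unicode Little-Endian file should contain.

```python
import os


def transcode_to_utf16le(path: str, source_encoding: str = "cp1252") -> str:
    """Convert a text file to UTF-16 Little-Endian, keeping the original as *.old.<ext>.

    Returns the path of the backup copy (for example balzac.old.txt).
    """
    # newline="" keeps the original line endings untouched.
    with open(path, "r", encoding=source_encoding, newline="") as source:
        text = source.read()

    root, extension = os.path.splitext(path)
    backup_path = root + ".old" + extension
    os.replace(path, backup_path)  # the original bytes survive under the new name

    with open(path, "wb") as target:
        target.write(b"\xff\xfe")               # byte order mark (assumed to be expected)
        target.write(text.encode("utf-16-le"))  # two bytes per BMP character
    return backup_path
```

Calling transcode_to_utf16le("balzac.txt") would leave the original bytes in balzac.old.txt and the converted text in balzac.txt. Choosing the wrong source_encoding silently produces wrong characters rather than an error for most single-byte code pages, which is why the current language is used to pick the default.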
35. one of which is called App. This directory contains a file called Unitex.jar. This file is the Java executable that launches the graphical interface. You can double-click on its icon to start the program. To make launching Unitex easier, you may want to add a shortcut to this file on the desktop. 1.4 Installation on Linux and Mac OS X. In order to install Unitex on Linux, it is recommended to have system administrator permissions. Decompress the file Unitex_1.2.zip into a directory named Unitex by using the following command: unzip Unitex_1.2.zip -d Unitex. Within the directory Unitex/Src/C++, start the compilation of Unitex with the command: make install. You can then create an alias in the following way: alias unitex='cd Unitex/App ; java -jar Unitex.jar'. 1.5 Using Unitex the first time. If you are working with Windows, the program will ask you to choose a personal working directory, which you can change later. To create a directory, click on the icon showing a file (see figure 1.3). If you are using Linux, the program will automatically create a unitex directory in your $HOME directory. This directory allows you to save your personal data. For each language that you will be using, the program will copy the root directory of that language to your personal work directory, except the dictionaries. You can then modify your copy of the files without risking damage to the system files. 1.6 Adding new languages. There are two di
36. re l ment de ensemble P etre variables LettreMaj partie de sous ensemble cas particuliers variable cf P S Nombres Sentence grf E Thu Dec 19 14 40 48 CET 2002 Le point virgule est toujours un s parateur de phrase sy Figure 2 9 Sentence splitting grammar for French l ments de doivent tre pris en compte By default the space is optional between two boxes If you want to prohibit the pres ence of this separator you have to use the special separator Lower and upper case letters 2 5 PREPROCESSING A TEXT 23 are defined by an alphabet file see chapter 10 For more details on grammars see chap ter 5 The grammar used here is named Sentence fst2 and can be found in the following directory user home directory language Graphs Preprocessing Sentence This grammar is applied to a text with the Fst 2Txt program in MERGE mode This has the effect that the output produced by the grammar in this case the symbol S is inserted into the text This program takes a snt file and modifies it 2 5 3 Normalization of non ambiguous forms Certain forms present in texts can be normalized for example the English sequence I m is equivalent to I am You may want to replace these forms according to your own needs However you have to be careful that the forms normalized are unambiguous or that the removal of ambiguities has no undesirable consequences For instance if you want
37. sentence separators S e tokens total number of lexical units in the text The number preceding diff indi cates the number of different units e simple forms the total number of lexical units in the text that are composed of letters The number in parentheses represents the number of different lexical units that are composed of letters digits the total number of digits used in the text The number in parentheses indi cates the number of different digits used 10 at the most 10 10 4 The file concord n The file concord n is a text file in the directory of the text It contains information on the last search done on the text and looks like the following 6 matches 6 recognized units 0 004 of the text is covered The first line gives the number of found occurrences and the second the number of units covered by these occurrences The third line indicates the ratio between the covered units and the total number of units in the text Index 43 30 41 48 58 _ 114 cat 113 complete 113 discr 113 inflex 112 t 18 1 46 21 44 46 76 61 62 48 sr BU Oe 41 45 30 47 30 61 1 34 2 34 3 34 90 99 lt CDIC gt 44 DIC gt 44 46 E gt 21 44 46 48 57 74 76 MAJ gt 21 44 46 MIN gt 21 44 46 MOT gt 21 44 NB gt 21 44 46 PNC gt 21 PRE gt 21 44 46 lt SDIC gt 44 lt gt 21 74 31 es 125 a 125 A 33 ADV 33 Abst 33 NAAAA A A
38. that the recursive exploration must stop when the graph subname is encountered This parameter can be used several times in order to specify several stop graphs e p s f d s lists the paths of each subgraph of the grammar f default lists the paths of the main grammar d lists the paths by adding the nesting depth of calls to subgraphs e c SS 0xXXXX replaces the symbol SS when it appears between parentheses by a Unicode character 0xXXXX e s L R specifies the left 1 and right r delimiters which will surround the items By default these delimiters are empty e s0 Str if the outputs of the grammar are taken into account this parameter specifies the sequence Str which will separate an input from its output By default there is no separator e f a SS if outputs of the grammar are taken into account this parameter specifies the format of the generated lines in0 inl out0 outl s orin0O out0 inl outl a The default value is s e v this parameter produces the posting of information messages e r s l x L R this parameter specifies how cycles should be presented L and R specify delimiters In the graph in figure 9 1 the results shown were obtained with the delimiter settings L and R x rs il fait tr s CO CO tr s CO specifies a label generated by the rs il fait tres Loc0 LocOtr sLocObeau LocO specifies a label gene rs il fait tr s tr s i
39. the date at the bottom of the graph if x is y not if it is n 10 3 GRAPHS 147 e DFILE x puts the name of the file at the bottom of the graph depending on whether x is y or n em DDIR x puts the complete path of the file at the bottom of the graph depending on whether x is y or n This option takes effect only if the parameter DFILE has the value y e DRIG x draws the graph from right to left or left to right depending on whether x is y Or n e DRST x this line is ignored by Unitex It is conserved to ensure the compatibility with Intex graphs e FITS x this line is ignored by Unitex It is conserved to ensure the compatibility with Intex graphs e PORIENT x this line is ignored by Unitex It is conserved to ensure the compatibility with Intex graphs e this line is ignored by Unitex It serves to specify the end of the header information The following lines give the contents and the position of the boxes in the graph The following lines correspond to a graph recognizing a number 34 lt E gt 84 248 1 2 4 272 248 0 Y s 1 2 3 4 5 6 7 8 9 0 172 248 1 1 The first line indicates the number of boxes in the graph immediately followed by a newline This number can not be lower than 2 since a graph always has an initial and a final state The following lines define the boxes of the graph The boxes are numbered starting at 0 By convention state 0 is the initial state and state 1 is the final state
40. thousands of occurrences, it is advisable to display it in a web browser (Internet Explorer, Mozilla, Netscape, etc.) instead. Check the box Use a web browser to view the concordance (cf. figure 4.6). This option is activated by default if the number of occurrences is greater than 3000. You can configure which web browser to use by clicking on Preferences in the Info menu. Click on the tab Text Presentation and select the program to use in the field Html Viewer (cf. figure 4.7). [Figure 4.7: Selection of a web browser for displaying concordances] [Figure: concordance displayed in a web browser]
41. words dlc for compound words and 2 5 PREPROCESSING A TEXT 27 err for unknown words are placed in the text dictionary The dl f and dic files are called text dictionaries As soon as the dictionary look up is finished Unitex displays the sorted lists of simple compound and unknown words found in a new window Figure 2 12 shows the result for an English text Word Lists in E My Unitex EnglishiCorpustivanhoe_sntl DLF 13286 simple word lexical entries ERR 414 unknown simple words a DET Dind s a Nis Aaron N PR Hum abandoned abandoned abandon V K Iis 1I2s abate V W P1s P2s Pip P2p P3 abated abate V K 11s 12s 13s abbey N Cone s abbot N Hum s abbots abbot N Hum p abide V W Pis P2s Pip P2p P3 abiding amp abiding N s abiding abide V G 4 EEE PR DLC 219 compound lexical entries absolute necessity N XN 21 sja act of violence N NPN z21 s sf agnus castus N XN NX Conc z21 all around i Di z1 all comers N XN z1l p all in i z21 as usual i asi z1l as was AtasV 21 ass s ears ass s ear N NsNt 21 at a loss A 21 banqueting hall N XN 21 s best friend N XN Hum 21 s best man N XN Hum 21 s better acquainted 4i 21 Figure 2 12 Result after applying dictionaries to an English text It is also possible to apply dictionaries without preprocessing the text In order to do this click on Apply Lexical Resources in the Text menu Unitex then opens a window see
42. 15: Result of applying the grammar of figure 7.14. 7.3.2 Compiling ELAG Grammars. Before it can be applied to a text automaton, an ELAG grammar must be compiled into a .rul file. This operation is carried out via the Elag Rules command in the Text menu, which opens the window shown in figure 7.16. [Figure 7.16: ELAG grammar compilation window] If the frame on the right already contains grammars which you don't wish to use, you can remove them with the << button. Then select your grammar in the file explorer located in the left frame and click on the >> button to add it to the list in the right frame. Then click on the compile button. This will launch the program ElagComp, which will compile the selected grammar and create a file named elag.rul. If you selected your grammar in the right frame, you can search for the patterns it recognizes by clicking on the locate button. This opens the window Locate Pattern and automatically enters a gra
43. 44 Lu a ts dame et 153 107 Dictionanes lt 24 2 965 40 S084 A AR ER rs SN A ere 154 1071 We tee ss sens hha ee Gn ee cine Be Be Dk A a eS we 155 107 2 Theni neS esagero ean ce he Se chs at ee a EN 155 10 7 3 The file CHECK_DIC TXT 157 10 8 ELAG Files 158 1081 The SERRES sun cet 6d due tous Dae GR 158 10 82 The Jst files sse nu sui rea lu din a de ne 6 3 158 108 3 The CI GS s ke ho GSAS a esters 159 108 4 Theal eS os aot e Se he ee a es haw balk aw oe DRE Se 159 109 Copiguranon Mes soria sidad ES OG 159 UL Thee CONS e scr bh hee hee Ke se ee RAE ES 159 10 92 Ae Die syst Edel Denda OES Ho ee RES ESS 160 10 9 3 The file user dic def 161 10 94 Ihe ASH Le L 44 ASS sin hs RES rn Res 161 10 10 Various other files 24 4 14 4 4 4 4 qu 4 ua due eue 44e 161 10 10 1 The files dlfn dlcneterrn 161 10102 The tle Stat diens ei se 4 See domine Led Lane da da 161 10103 The Nle SAS sio ON a LR Ed a aa Sp da 161 10 104 The l concorde 4 usant dela nd Ne ses a 162 CONTENTS Introduction Unitex is a collection of programs developped for the analysis of texts in natural languages by using linguistic resources and tools These resources consist of electronic dictionaries grammars and lexical grammar tables initially developed for French by Maurice Gross and his students at the Labora
44. Chapter 6 Advanced use of graphs. 6.1 Types of graphs. Unitex can work with four types of graphs that correspond to the following uses: automatic inflection of dictionaries, preprocessing of texts, normalization of text automata, and search for patterns. These different types of graphs are not interpreted in the same way by Unitex. Certain operations, like transduction, are allowed for some types and forbidden for others. In addition, the special symbols are not the same depending on the type of the graph. This section presents each type of graph and shows its peculiarities. 6.1.1 Inflection graphs. An inflection graph describes the morphological variation that is associated with a word class by assigning inflectional codes to each variant. The paths of such a graph describe the modifications that have to be applied to the canonical forms, while the transductions contain the inflectional information that will be produced. [Figure 6.1: Example of an inflectional grammar (cheval, chevaux)] The paths may contain operators and letters. The possible operators are represented by the characters L, R and C. All letters that are not operators are characters. The only allowed special symbol is the empty word <E>. It is not possible to refer to dictionaries in an inflection graph. It is also impossible to reference subgraphs. Transductions are concatenated in order t
45. Automata finite state 56 text 137 Automate du texte 143 Automatic inflection 37 73 automatic inflection 139 Automaton acyclic 95 minimal 40 of the text 45 95 Text compact form 136 developed form 136 text 75 142 texte 140 Axiom 55 Box Alignement 66 Boxes alignement 66 connecting 58 Creating 57 Deleting 60 Selection 60 sorting lines 65 brackets 48 Case seeRespect of lowercase uppercase 76 Case sensitivity 50 Case sensitivity 44 INDEX Clitics normalisation 141 normalization 98 Collection of graphs 83 Colors Configuration 67 Comment in a dictionary 30 Comments in a graph 58 Compilation of a graph 76 Compilation of graphs 139 Compiling ELAG grammars 103 Compounds free in German 140 free in Norvegian 140 Compressing dictionaries 131 Compression of dictionaries 141 Concatenation of regular expressions 47 concatenation of regular expressions 43 Concordance 51 92 131 Conservation of better paths 99 143 Constraints on grammars 79 Contexts concordance 52 92 132 copy of a list 63 Copy 60 62 64 Copying Lists 62 Corpus see Texts Creating a Box 57 Cut 64 Degree of ambiguity 96 DELA 20 29 DELAC 29 DELACEF 29 DELAF 29 32 156 DELAS 29 32 Derivation 55 Dictionaries application of 134 applying 25 40 automatic inflection 37 139 codes used within 32 165 Comments in 30 compressing 131 compression 141 Contents 32
46. EX Priority of the leftmost match 86 of the longest match 86 Programmes externes ElagComp 135 ExploseFst2 136 Fst2List 137 Rate of ambiguity 105 Rational Expressions 56 Reconstruction of the text automaton 140 Recursive Transition Networks 56 Reference to dictionnaries 76 References to the dictionionaries 44 Regular Expressions 49 Regular expressions 43 141 REPLACE 85 91 138 139 155 Repository text 21 Resolving Ambiguities 103 Respect des minuscules majuscules 75 of lowercase uppercase 74 76 of spaces 76 RTN 56 Rule upper case and lower case letters 41 white space 42 Rules for transducer application 84 rewriting 55 Search for patterns 89 Searching For Patterns 50 Selecting the Language 15 Separator of phrases 47 Separators 20 of sentences 142 sentence 140 152 163 Seperators phrase 21 Shortest matches 50 90 139 Sorting 141 a dictionary 37 169 concordances 132 lines of a box 65 of concordances 52 92 Space obligatory 44 prohibited 44 State Final 57 Init 57 Symbols non terminal 55 special 63 terminal 55 Syntactical properties 123 Syntax Diagrams 56 Text automata 137 automaton 140 142 automaton of the 45 cutting into lexical units 142 directory of 129 modification 92 131 normalisation 140 normalisation of the automaton 75 Normalization 20 normalization of the automaton 97 Phrase Detection 21 preprocessing 19 74 Reposit
47. Europe & USA); iso-8859-2: ISO 8859-2 code page (Latin 2, Eastern and Central Europe); iso-8859-3: ISO 8859-3 code page (Latin 3, Southern Europe); iso-8859-4: ISO 8859-4 code page (Latin 4, Northern Europe); iso-8859-5: ISO 8859-5 code page (Cyrillic); iso-8859-7: ISO 8859-7 code page (Greek); iso-8859-9: ISO 8859-9 code page (Latin 5, Turkish); iso-8859-10: ISO 8859-10 code page (Latin 6, Nordic); next-step: NextStep code page; LITTLE-ENDIAN; BIG-ENDIAN. NOTE: There is an additional mode for the dest parameter with the value UTF-8, which tells the program to convert the files from Unicode Little-Endian into UTF-8 files. The parameter mode specifies how to manage the source and destination file names. The possible values are as follows: -r: the conversion deletes the source files; -ps=PFX: the source files are renamed with the prefix PFX (toto.txt → pfxtoto.txt); -pd=PFX: the destination files are renamed with the prefix PFX; -ss=SFX: the source files are renamed with the suffix SFX (toto.txt → totosfx.txt); -sd=SFX: the destination files are renamed with the suffix SFX. The parameters text_i are the names of the files to be converted. 9.5 Dico. Dico text alphabet dic_1 dic_2. This program applies dictionaries to a text. The text has to be split up into lexical units by the program Tokenize. The dictionaries need to be compressed with the program Compress. text represents t
48. GRAMMARS Derivation 3 (rewriting S to ε): S → aS → aaS → aa. We call the set of words recognized by a grammar the language of the grammar. The languages recognized by algebraic grammars are called algebraic languages. 5.1.2 Extended algebraic grammars. Extended algebraic grammars are algebraic grammars where the members on the right side of the rule are not just sequences of symbols but rational expressions. Thus the grammar that recognizes a sequence of an arbitrary number of a's can be written as a grammar consisting of one rule: S → a*. These grammars, also called recursive transition networks (RTN) or syntax diagrams, are suited for a user-friendly graphical representation. Indeed, the right member of a rule can be represented as a graph whose name is the left member of the rule. However, Unitex grammars are not exactly extended algebraic grammars, since they contain the notion of transduction. This notion, which is derived from the field of finite state automata, enables a grammar to produce some output. For the sake of clarity, we will use the terms grammar or graph. When a grammar produces outputs, we will use the term transducer, as an extension of the definition of a transducer in the area of finite state automata. 5.2 Editing graphs. 5.2.1 Import of Intex graphs. In order to be able to use Intex graphs in Unitex, they have to be converted to Unicode. The conversion procedure is the same as the one for texts s
49. 7.3 Resolving Lexical Ambiguities with ELAG. The ELAG program applies grammars for ambiguity removal to the text automaton. This powerful mechanism lets everyone write rules of their own, independently of already existing rules. This chapter briefly presents the grammar formalism used by ELAG and describes how the program works. For more details the reader may refer to and 7.3.1 Grammars For Resolving Ambiguities. The grammars used by ELAG have a special syntax. They consist of two parts, which we call the if and then parts. The if part of an ELAG grammar is divided into two parts separated by a box containing the <!> symbol. The then part is divided the same way using the <=> symbol. The meaning of a grammar is the following: in the text automaton, if a sequence is recognized by the if part, then it must also be recognized by the then part of the grammar, or it will be removed from the text automaton. [Figure 7.12: Example of an ELAG grammar. Graph comment: if tu appears after a verb in the 2nd person singular followed by a dash, then it is a pronoun and not the past participle of taire.] Figure 7.12 shows an example of a grammar. The if part recognizes a verb in the 2nd person singular followed by a dash and tu, either as a pronoun or as a past participle of the verb taire. The then p
50. [Figure 7.18: Text automaton window split in two] compile button. This will create a .rul file bearing the name indicated at the bottom right (the name of the file is obtained by replacing the .lst extension by .rul). You can now apply your grammar collection. As explained above, click on the elag button in the text automaton window. When the dialog asks for the .rul file to use, click on the browse button and select your collection. The resulting automaton is identical to the one that would have been obtained by applying each grammar successively. 7.3.5 Window For ELAG Processing. At the time of disambiguation, the Elag program is launched in a processing window, which makes it possible to see the messages printed by the program during its execution. For example, when the text automaton contains symbols which do not correspond to the set of ELAG labels (see the following section), a message indicates the nature of the error. In the same way, when a sentence is rejected (all possible analyses were eliminated by grammars), a message indicates the number of the sentence. That makes it possible to locate the source of the problems quickly. Evaluation of
51. [Figure 6.14: Building a Graph Collection] In the field Source Directory, select the root directory which you want to explore (in our example, the directory Dicos). In the field Resulting GRF grammar, enter the name of the produced grammar. CAUTION: Do not place the output grammar in the tree structure which you want to explore, because in this case the program will try to read and write this file simultaneously, which will cause a crash. When you click on OK, the program will copy the graphs to the directory of the output grammar and will create subgraphs corresponding to the various sub-directories, as one can see in figure 6.15, which shows the output graph generated for our example. One can observe that one box contains the calls to subgraphs corresponding to sub-directories (here the directories Banque and Nourriture), and that the other box calls all the graphs which were in the directory (here the graph truc.grf). [Figure 6.15: Main graph of a graph collection, with one box for grammars corresponding to sub-directories and one for grammars corresponding to graphs] 6.5 Rules for applying transducers. This section describes the rules for the application of transducers along with the operations of preprocessing and the search for patterns. The following does not apply to inflection graphs and normalization graphs f
52. containing only the character _ (underscore). For example, consider the following lines extracted from the section describing the verbs: "W" and "K <genre> <nombre>". They make it possible to declare that verbs in base form (indicated by the code W) do not have other inflectional features, while the forms in past participle (code K) are also assigned a gender and a number. Description of the inflectional codes. The principal function of the discr part is to divide the labels into subcategories having similar morphological behavior. These subcategories are then used to facilitate writing the complete part. For the legibility of the ELAG grammars, it is desirable that the elements of the same subcategory all have the same inflectional behavior; in this case the complete part is made up of only one line per subcategory. Let us consider, for example, the following lines, extracted from the pronoun description: "Pdem <genre> <nombre>", "PpvIl <genre> <nombre> <pers>" and "PpvPr". These lines mean: all the demonstrative pronouns (<PRO+Pdem>) have an indication of gender and number, and no other; personal pronouns in the nominative (<PRO+PpvIl>) are labelled morphologically by person, gender and number; the prepositional pronouns (en, y) do not have any inflectional feature. All combinations of inflectional features a
53. This program carries out a lexicographical sorting of the lines of the file text. text represents the complete path of the file to sort. The possible options are: -y: delete duplicates; -n: keep duplicates; -r: sort in descending order; -o file: sort using the alphabet order defined by the file file (if this parameter is missing, the sorting is done according to the order of the Unicode characters); -l file: save the number of lines of the result file in the file file; -thai: option for sorting a Thai text. The sort operation modifies the file text. By default, the sorting is performed in the order of the Unicode characters, removing duplicates. 9.25 Table2Grf. Table2Grf table grf result.grf [pattern]. This program automatically generates graphs from the lexicon-grammar table table and the template graph grf. The name of the produced main graph of the grammar is result.grf. If the parameter pattern is specified, all the produced subgraphs will be named according to this pattern. In order to have unambiguous names, we recommend including @% in the parameter (remember that @% will be replaced by the line number of the entry in the table). For instance, if you set the pattern parameter to subgraph-@%.grf, subgraph names will be of the form subgraph-0013.grf. If the parameter pattern is not specified, subgraph names are of the form result_i.grf, where result.grf specifies t
54. [Figure 6.20: Concordance obtained by the application of the graph TitleName] Transductions with variables can be used to move groups of words. In fact, the application of a transducer in REPLACE mode inserts only the produced sequences into the text. In order to swap two groups of words, it is sufficient to store them in variables and produce a transduction with these variables in the desired order. Thus, the application of the transducer in figure 6.21 in REPLACE mode to the text Ivanhoe results in the concordance of figure 6.22. The presence of a space to the right of each occurrence in the concordance of figure 6.22 is due to the insertion of a space after $NOUN$ $ADJ$ in the transduction. Without this space, the result of the transduction would be glued to the right context (cf. figure 6.23). In fact, the program Locate always considers the possibility of an optional
55. The contents of the final state are always empty. Each box in the graph is defined by a line that has the following format: "contents" X Y N transitions. contents is a sequence of characters enclosed in quotation marks that represents the contents of the box. This sequence can sometimes be preceded by an s if the graph is imported from Intex; this character is then ignored by Unitex. The contents of the sequence is the text that has been entered in the editing line for graphs. The following table shows the encoding of two special sequences that are not encoded in the .grf files in the same way as they are entered. [Table 10.2: Encoding of special sequences (columns: sequence in the graph editor, sequence in the .grf file)] NOTE: The characters between < and > or between { and } are not interpreted as a line separator. Thus, the character + in the sequence "le <A+Conc>" is not interpreted as a line separator, since the pattern <A+Conc> is interpreted with priority. X and Y represent the coordinates of the box in pixels. Figure 10.1 shows how these coordinates are interpreted by Unitex. [Figure 10.1: Interpretation of the coordinates of boxes, with the origin (0,0)] N represents the number of transitions that leave the box. This number is always 0 for the final state. The transitions are defined by the numbers of the boxes at which they point. Every line of the box de
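The box line format described in this snippet can be read with a short parser. The Python sketch below is illustrative only: it assumes that fields are separated by whitespace, that a quotation mark inside the contents is escaped with a backslash, and that transitions are plain box numbers. The names `Box` and `parse_box_line`, and the sample lines in the usage note, are hypothetical and do not come from the manual.

```python
import re
from typing import NamedTuple


class Box(NamedTuple):
    contents: str
    x: int
    y: int
    transitions: list


# Assumed layout: "contents" X Y N t1 ... tN, optionally preceded by an 's'
# (the Intex marker that Unitex ignores, according to the text above).
BOX_LINE = re.compile(
    r'^s?"((?:[^"\\]|\\.)*)"\s+(-?\d+)\s+(-?\d+)\s+(\d+)((?:\s+\d+)*)\s*$'
)


def parse_box_line(line: str) -> Box:
    """Parse one box definition line into its fields."""
    match = BOX_LINE.match(line)
    if match is None:
        raise ValueError(f"not a box line: {line!r}")
    contents, x, y, count, rest = match.groups()
    targets = [int(t) for t in rest.split()]
    # N must agree with the number of listed target boxes.
    if len(targets) != int(count):
        raise ValueError(f"expected {count} transitions, found {len(targets)}")
    return Box(contents, int(x), int(y), targets)
```

For instance, `parse_box_line('"le+la" 100 200 2 3 4 ')` yields a box at (100, 200) with transitions to boxes 3 and 4, and a final state such as `'"" 500 200 0 '` yields an empty transition list.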
56. UNITEX USER MANUAL. Université de Marne-la-Vallée. http://www-igm.univ-mlv.fr/~unitex, unitex@univ-mlv.fr. Sébastien Paumier, January 2003. English translation by the local grammar group at the CIS, Ludwig-Maximilians-Universität, Munich, Oct 2003: Wolfgang Flury, Franz Guenthner, Friederike Malchok, Clemens Marschner, Sebastian Nagel, Johannes Stiehler. http://www.cis.uni-muenchen.de/. Contents: Introduction. 1 Installing Unitex. 1.2 The Java runtime environment. 1.3 Installation on Windows. 1.4 Installation on Linux and MacOSX. 1.5 Using Unitex the first time. 1.6 Adding new languages. 1.7 Uninstalling Unitex. 2 Loading a text. 2.4 Opening a text. 2.5 Preprocessing a text. 2.5.1 Normalization of separators. 2.5.3 Normalization of non-ambiguous forms. 2.5.4 Splitting a text into lexical units. 2.5.5 Applying dictionaries. 2.5.6 Analysis of compound words in German
57. Chapter 8: Lexicon Grammar. The tables of the lexicon grammar are a compact way of representing the syntactical properties of the elements of a language. Using the mechanism of template graphs, it is possible to automatically construct local grammars from these tables. In the first part of the chapter, the formalism of the tables is presented. The second part describes the template graphs and the mechanism of automatically generating graphs starting from a lexicon grammar table. 8.1 The lexicon grammar tables. The lexicon grammar is a methodology developed by Maurice Gross, based on the following principle: every verb has almost unique syntactical properties. Due to this fact, these properties need to be systematically described, since it is impossible to predict the exact behavior of a verb. These descriptions are represented by matrices where the rows correspond to the verbs and the columns to the syntactical properties. The considered properties are formal properties, such as the number and nature of allowed complements of the verb and the different transformations the verb can undergo (passivization, nominalization, extraposition, etc.). The matrices, mostly called tables, are binary: a + sign appears at the intersection of a row and the column of a property if the verb has that property, a - sign if not. This type of description has equally been applied to adjectives, predicative nouns, adverbs, as well as figurative expressions, all in mul
58. Unitex provides a means to copy lists. To use this, select the list in your text editor and copy it using <Ctrl+C> or the copy function integrated in your editor. Then create a box in your graph and press <Ctrl+V>, or use the Paste command in the Edit menu, to paste it into the box. A window as in figure 5.13 opens. [Figure 5.13: Selecting a context for copying a list] This window allows you to define the left and right contexts that will automatically be used for each term of the list. By default, these contexts are empty. If you use the contexts < and .V> with the following list: eat, sleep, drink, play, read, you will get the box in figure 5.14. [Figure 5.14: Box resulting from copying a list and applying contexts; the box shown contains <manger.V>, <dormir.V>, <boire.V>, <jouer.V> and <lire.V>] 5.2.8 Special Symbols. The Unitex graph editor interprets certain symbols in a special manner. Table 5.1 summarizes the meaning of these symbols for Unitex, as well as the places where these characters are recognized in the texts. Table rows: quotation marks (") mark sequences that must not be interpreted by Unitex and whose case must be taken verbatim; + separates different lines within the boxes; : introduces a call to a subgraph. Table 5.1: encoding of spe
59. 9.26 TextAutomaton2Mft. 10 File formats. 10.1 Unicode Little-Endian encoding. 10.2.2 Sort alphabet. 10.4.1 .txt files. 10.4.2 .snt files. 10.4.3 File text.cod. 10.4.4 The file tokens.txt. 10.4.5 The files tok_by_alph.txt and tok_by_freq.txt. 10.4.6 The file enter.pos. 10.5 Text Automaton. 10.5.1 The file text.fst2. 10.5.2 The file cursentence.grf. 10.5.3 The file sentenceN.grf. 10.5.4 The file cursentence.txt. 10.6 Concordances. 10.6.1 The file concord.ind. 10.6.2 The file concord.txt. 10.6.3 The file concord.html
60. al grammar. On the contrary, the option equivalent FST2 indicates that the program should allow for subgraph calls beyond the limited depth. This option guarantees the strict equivalence of the result with the original grammar, but does not necessarily produce a finite state transducer. This option can be used for optimizing certain grammars. A message at the end of the approximation process indicates whether the result is a finite state transducer or an FST2 grammar and, in the case of a transducer, whether it is equivalent to the original grammar (cf. figure 6.6). 6.2.3 Constraints on grammars. With the exception of inflection grammars, a grammar can never have an empty path. This means that the principal path of a grammar must not recognize the empty word, but this does not prevent a subgraph of that grammar from recognizing epsilon. It is not possible to associate a transduction with a call to a subgraph. Such transductions are ignored by Unitex. It is therefore necessary to use an empty box situated to the left of the call to the subgraph in order to specify the transduction (cf. figure 6.7). The grammars must not contain infinite loops, because the Unitex programs cannot terminate the exploration of such a grammar. These infinite loops can originate from transitions that are labeled by the empty word or from recursive calls to subgraphs. The infinite loops due to transitions with the empty word can have two origins, of which the first is illus
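One of the two sources of infinite loops named above, cycles made only of empty-word transitions, can be detected with a depth-first search. The sketch below is a minimal illustration, not the check Unitex itself performs: it assumes a hypothetical in-memory representation where each state maps to a list of `(label, target)` pairs and where the label `<E>` stands for the empty word. It does not cover loops caused by recursive subgraph calls.

```python
EPSILON = "<E>"  # empty-word label, as used in the graph editor


def has_epsilon_loop(transitions):
    """Return True if some cycle uses only empty-word transitions.

    transitions: dict mapping a state to a list of (label, target) pairs.
    This data layout is an assumption made for the example.
    """
    # Keep only the empty-word edges; other edges consume input and
    # therefore cannot be part of a non-terminating loop on their own.
    eps = {
        state: [target for label, target in out if label == EPSILON]
        for state, out in transitions.items()
    }
    WHITE, GREY, BLACK = 0, 1, 2
    color = {}

    def visit(state):
        color[state] = GREY
        for target in eps.get(state, []):
            seen = color.get(target, WHITE)
            if seen == GREY:  # back edge: a cycle of empty-word transitions
                return True
            if seen == WHITE and visit(target):
                return True
        color[state] = BLACK
        return False

    for state in list(eps):
        if color.get(state, WHITE) == WHITE and visit(state):
            return True
    return False
```

A cycle that contains at least one transition with a real label is not reported, since each pass through it consumes a lexical unit.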
61. al suffix for each verb (LLLLèle, LLLLète and LLLLère), the R operator can be used to describe it in one step: LLLLèRR. C (copy) duplicates a letter in the entry and moves everything on its right by one position. For example, let us assume that we want to automatically generate the French adjectives ending with -able from the nouns. In cases like regrettable or réquisitionnable, we see a duplication of the final consonant of the noun. To avoid writing an inflectional graph for every possible final consonant, one can use the C operator to duplicate any final consonant. The inflection program Inflect traverses all paths of the inflectional grammar and tries all possible forms. In order to avoid having to replace the names of inflectional grammars by the real grammatical codes in the dictionary used, the program replaces these names by the longest prefixes made of letters. Thus, N4 is replaced by N. By choosing the inflectional grammar names carefully, one can construct a ready-to-use dictionary. Let's have a look at the dictionary we get after the DELAS inflection in our example. [Figure 3.7: Result of automatic inflection, showing the inflected entries for bocal/bocaux, cheval/chevaux and local/locaux] 3.5 Compression. Unitex applies compressed dictionaries to the text. The compression reduces the size of the
62. ambiguity removal. The evaluation of the rate of ambiguity is not based solely on the average number of interpretations per word. In order to get a more representative measure, the system also takes into account the various combinations of words. While ambiguities are resolved, the Elag program calculates the number of possible analyses in the text automaton before and after the modification, which corresponds to the number of possible paths through the automaton. Based on this value, the program computes the average ambiguity by sentence and by word. It is this last measure which is used to represent the rate of ambiguities of the text, because it varies neither with the size of the corpus nor with the number of sentences within it. The formula applied is: $\text{ambiguity rate} = \exp\left(\frac{\log(\text{number of paths})}{\text{text length}}\right)$. The ratio between the rate of ambiguities before and after applying the grammars gives a measure of their efficiency. All this information is displayed in the ELAG processing window. 7.3.6 Description Of The Tag Sets. The Elag and ElagComp programs require a formal description of the set of labels of the dictionaries used. This description consists, roughly speaking, of an enumeration of all the grammatical categories present in the dictionaries, with, for each of them, the list of syntactic and inflectional codes associated with it and a description of their possible
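The ambiguity-rate formula above can be checked on small numbers. The following Python sketch is only an illustration of the arithmetic: the function names are hypothetical, and taking the efficiency measure as "rate before divided by rate after" is an assumption about the direction of the ratio, which the text does not spell out.

```python
import math


def ambiguity_rate(number_of_paths, text_length):
    """exp(log(number of paths) / text length), as in the formula above."""
    if number_of_paths < 1 or text_length < 1:
        raise ValueError("both arguments must be at least 1")
    return math.exp(math.log(number_of_paths) / text_length)


def efficiency(paths_before, paths_after, text_length):
    """Ratio of the ambiguity rates before and after applying grammars.

    The direction of the ratio (before / after) is an assumption.
    """
    before = ambiguity_rate(paths_before, text_length)
    after = ambiguity_rate(paths_after, text_length)
    return before / after
```

With 8 paths over 3 lexical units the rate is 2, which matches the intuition of two interpretations per unit on average (2 × 2 × 2 = 8 paths); a text with a single path has a rate of 1.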
63. arators that are different from the ones used in western languages. Spaces can be forbidden, optional or mandatory. In order to better cope with these particularities, Unitex splits texts in a language-dependent way. Thus, languages like French are treated as follows. A lexical unit can be: the phrase separator {S}; a lexical tag, e.g. {aujourd'hui,.ADV}; a sequence of letters (the letters are defined in the language alphabet file); a non-word character (if it is a newline, it is replaced by a space). For other languages, splitting is done on a character-by-character basis, except for the phrase separator {S} and lexical tags. This simple splitting is fundamental for the use of Unitex, but limits the optimization of search operations for patterns. Regardless of the mechanism used, the newlines in a text are replaced by spaces. Splitting is done by the Tokenize program. This program creates several files that are saved in the text directory: tokens.txt contains the list of lexical units in the order in which they are found in the text; text.cod contains the position table (every number in this table corresponds to the index of a lexical unit in the file tokens.txt); tok_by_freq.txt contains the list of lexical units sorted by frequency; tok_by_alph.txt contains the list of lexical units in alphabetical order; stats.n contains some statistics about the text. Splitting the text A cat is
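The splitting rules for languages like French can be sketched in a few lines. This is an approximation under stated assumptions, not the Tokenize program: letters are taken to be any Unicode letter rather than the contents of the language alphabet file, anything between braces is treated as one unit ({S} or a lexical tag), and the names `tokenize` and `build_tables` are hypothetical. `build_tables` mirrors the relationship described between tokens.txt and text.cod.

```python
import re

# One lexical unit: a braced unit such as {S} or a lexical tag,
# a run of letters, or any single other character.
TOKEN = re.compile(r"\{[^{}]*\}|[^\W\d_]+|.", re.DOTALL)


def tokenize(text):
    """Split text into lexical units; a newline becomes a space."""
    return [" " if tok == "\n" else tok for tok in TOKEN.findall(text)]


def build_tables(tokens):
    """Return (distinct units in order of first appearance, position table).

    Each entry of the position table is the index of the corresponding
    unit in the list of distinct units.
    """
    vocabulary, positions, index = [], [], {}
    for tok in tokens:
        if tok not in index:
            index[tok] = len(vocabulary)
            vocabulary.append(tok)
        positions.append(index[tok])
    return vocabulary, positions
```

For example, `tokenize("A cat{S}\nis 3")` gives the units `A`, space, `cat`, `{S}`, space, `is`, space and `3`; the space appears once in the vocabulary and three times in the position table.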
64. art imposes that tu is then regarded as a pronoun. Figure 7.13 shows the result of the application of this grammar to the sentence "Feras-tu cela bientôt ?". One can see in the automaton at the bottom that the path corresponding to tu as a past participle was eliminated. [Figure 7.13: Result of applying the grammar of figure 7.12] [Figure 7.14: Use of the synchronization point. The comment in the graph reads: a dash followed by il, elle or on must be preceded by a verb.] Point of synchronization. The if and then parts of an ELAG grammar are divided into two parts by the <!> symbol in the if part and the <=> symbol in the then part. These symbols form a point of synchronization. This makes it possible to write rules in which the if and then constraints are not necessarily aligned, as is the case, for example, in figure 7.14. This grammar is interpreted in the following way: if a dash is found followed by il, elle or on, then this dash must be preceded by a verb, possibly followed by -t. So, if one considers the sentence of figure 7.15, beginning with Est-il, one can see that all non-verb interpretations of Est were removed. [Figure 7.15]
65. 8.2 Conversion of a table into graphs. 8.2.1 Principle of template graphs. 8.2.2 Format of the table. 8.2.4 Automatic generation of graphs. 9 Use of external programs. 9.9 ExplodeFst2. 9.19 MergeTextAutomaton. 9.20 Normalize
66. ated with each of these lines has the following form: <a href="X Y Z">. X and Y represent the start and end position of the occurrence, in characters, in the file name_of_text.snt. Z represents the number of the sentence in which this occurrence appears. All spaces at the left and right edges of lines are encoded as non-breaking spaces (&nbsp; in HTML), which preserves the alignment of the occurrences even if one of them (one that is at the beginning of the file) has a left context with spaces. NOTE: if the concordance has been constructed with the parameter glossanet, the HTML file has the same structure, except for the links. In these concordances, the occurrences are real links pointing at the web server of the GlossaNet application. For more information on GlossaNet, consult the link on the Unitex web site (http www igm univ mlv fr uni). Here is an example of a file:
    <html lang=en>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <title>6 matches</title>
    </head>
    <body>
    <table border="0" width="100%"><td nowrap>
    <font face="Courier new" size=3>
    on there <a href="116 124 2">extended</a>&nbsp;i&nbsp;<br>
    &nbsp;extended <a href="125 127 2">in</a>&nbsp;ancient&nbsp;<br>
    &nbsp;Scott {S}<a href="32 34 2">IN</a>
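The link format described above makes it easy to recover occurrence positions from a concordance file. The following Python sketch assumes that the three numbers are space-separated inside a double-quoted `href` attribute, which is a reading of the garbled example rather than a confirmed detail; the function name `extract_occurrences` is hypothetical.

```python
import re

# Assumed link shape: <a href="X Y Z">occurrence</a>
LINK = re.compile(r'<a href="(\d+) (\d+) (\d+)">(.*?)</a>')


def extract_occurrences(html):
    """Return (start, end, sentence_number, text) for every occurrence link."""
    return [
        (int(start), int(end), int(sentence), text)
        for start, end, sentence, text in LINK.findall(html)
    ]
```

Applied to a line such as `on there <a href="116 124 2">extended</a>&nbsp;i&nbsp;<br>`, the function returns one tuple with start 116, end 124, sentence 2 and the text `extended`.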
67. ation of the results. When the search is finished, the window of figure 4.5 appears, showing the number of matched occurrences, the number of recognized lexical units, and the ratio between this number and the total number of lexical units in the text. After clicking on OK, you will see the window of figure 4.6 appear, which allows you to configure the presentation of the matched occurrences. You can also open this window by clicking on Display Located Sequences in the menu Text. The list of occurrences is called a concordance. The box Modify text offers the possibility to replace the matched occurrences with the generated outputs. This possibility will be examined in chapter 6. [Figure 4.5: Search results; the window shown reports 200 matches and 563 recognized units (0.273% of the text is covered)] [Figure 4.6: Configuration of the presentation of the found occurrences] The box Extract units allows you to create a text file with all the sentences that do or do not contain matched
68. ber between a noun and the adjective that follows it. This grammar will preserve the correct analysis of sentences like "Les personnes de bonne humeur m'insupportent". It is, however, recommended to limit the use of the operator, because it harms the legibility of the grammars. It is preferable to distinguish the labels which accept various inflectional combinations by means of discriminating subcategories defined in the discr part. Optional Codes. The optional syntactic and semantic codes are declared in the cat part. They can be used in ELAG grammars like the other codes. The difference is that these codes do not intervene in deciding whether a label must be rejected or not. This grammar is not completely correct, because it eliminates, for example, the correct analysis of the sentence "J'ai reçu des coups de fil de ma mère hallucinants". [Figure 7.19: ELAG grammar checking gender and number agreement between a noun and the adjective that follows it] In fact, optional codes are independent of the other codes, such as, for example, the attribute of the language level (z1, z2 or z3). In the same manner as for inflectional codes, it is also possible to negate an attribute by writing the character ! right before the name of the attribute. Thus, with our example file, the label <A!gauche:f> recognizes all feminine adjectives which do not have the g All codes which are not declar
69. cessary to adapt the granularity of the dictionaries to the intended use. For each lexical unit of the sentence, Unitex searches for all possible interpretations in the dictionary of the simple words of the text. Afterwards, all lexical units that have an interpretation in the dictionary of the compound words of the text are sought. All the combinations of their interpretations constitute the sentence automaton. [Figure 7.2: Overlap between a compound word and a combination of simple words] NOTE: If the text contains lexical labels (e.g. {out of date,.A+z1}), these labels are reproduced identically in the automaton, without trying to decompose the sequences which they represent. In each box, the first line contains the inflected form found in the text, and the second line contains the canonical form if it is different. The other information is coded below the box (cf. section 7.4.1). The spaces that separate the lexical units are not copied into the automaton, except for the spaces inside compound words. The case of lexical units is preserved. For example, if the word Here is encountered, the capital letter is preserved (cf. figure 7.1). This choice keeps this information during the transition to the text automaton, which could be useful for applications where case is important, such as the recognition of proper names. 7.2.2 Normalization of ambiguous forms. During construction of the automaton, it is possible to eff
70. cfg. Under Linux, Unitex expects the personal directory of the user to be called unitex and expects it to be in his root directory ($HOME). Under Windows, it is not always possible to associate a directory with a user by default. To compensate for that, Unitex creates a .cfg file for each user that contains the path to his personal directory. This file is saved under the name (user login).cfg in the Users sub-directory of the Unitex system directory. ATTENTION: THIS FILE IS NOT IN UNICODE AND THE PATH OF THE PERSONAL DIRECTORY IS NOT FOLLOWED BY A NEWLINE. 10.10 Various other files. For each text, Unitex creates multiple files that contain information that is displayed in the graphical interface. This section describes these files. 10.10.1 The files dlf.n, dlc.n and err.n. These three files are text files that are stored in the text directory. They contain the number of lines of the files dlf, dlc and err respectively. These numbers are followed by a newline. 10.10.2 The file stat_dic.n. This file is a text file in the directory of the text. It has three lines that contain the number of lines of the files dlf, dlc and err. 10.10.3 The file stats.n. This file is in the text directory and contains a line of the following form: 3949 sentence delimiters, 169394 (9428 diff) tokens, 73788 (9399) simple forms, 438 (10) digits. The numbers indicated are interpreted in the following way: sentence delimiters: number of
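A line of the stats.n form shown above can be turned into named counters. The sketch below rests on two assumptions that the truncated text does not confirm: that the line always carries exactly seven integers in the order of the example, and that each number in parentheses is the count of distinct items for the preceding total. The function name `parse_stats` and the dictionary keys are hypothetical.

```python
import re

# Hypothetical field names, in the order the integers appear in the example.
FIELDS = (
    "sentence_delimiters",
    "tokens",
    "distinct_tokens",
    "simple_forms",
    "distinct_simple_forms",
    "digits",
    "distinct_digits",
)


def parse_stats(line):
    """Map the seven integers of a stats.n line to named counters."""
    numbers = [int(n) for n in re.findall(r"\d+", line)]
    if len(numbers) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} numbers, found {len(numbers)}")
    return dict(zip(FIELDS, numbers))
```

Because the parser only looks at the integers, it tolerates differences in punctuation between the numbers, which is convenient given how the example line was recovered.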
71. [Figure 6.17: Concordance obtained in MERGE mode with the transducer of figure 6.16] 6.5.2 Application while advancing through the text. During the preprocessing operations, the text is modified as it is being read. In order to avoid the risk of infinite loops, it is necessary that the sequences produced by a transducer are not re-analyzed by the same transducer. Therefore, whenever a sequence is inserted into the text, the application of the transducer continues after that sequence. This rule only applies to preprocessing transducers, because during the application of syntactic graphs the transductions do not modify the processed text, but a concordance file which is different from the text. 6.5.3 Priority of the leftmost match. During the application of a local grammar, the collected occurrences are all indexed. During the construction of the concordance, all these occurrences are presented (cf. figure 6.18). Figure 6.18: Occurrence
72. cial characters in the graph editor. Last table row: / indicates the start of a transduction within a box. 5.2.9 Toolbar Commands. The toolbar to the left of the graphs contains shortcuts for certain commands and allows you to manipulate the boxes of a graph by using some utilities. This toolbar may be moved by clicking on the rough zone. It may also be dissociated from the graph and appear in a separate window (see figure 5.15). In this case, closing this window puts the toolbar back at its initial position. Each graph has its own toolbar. [Figure 5.15: Toolbar] The first two icons are shortcuts for saving and compiling the graph. The following five correspond to the Copy, Cut, Paste, Redo and Undo operations. The last icon, showing a key, is a shortcut to open the window with the graph display options. The other 6 icons correspond to edit commands for boxes. The first one, a white arrow, corresponds to the boxes' normal edit mode. The 5 others correspond to specific utilities. In order to use a utility, click on the corresponding icon. The mouse cursor changes its form, and mouse clicks are then interpreted in a particular fashion. What follows is a description of these utilities, from left to right: creating boxes: creates a box at the empty place where the mouse was clicked; deleting boxes: deletes the box that you click on; connect boxes to another box: using this utility, you select one o
73. codes. 3.3 Sorting. Unitex uses the dictionaries without having to worry about the order of the entries. When displaying them, it is sometimes preferable to sort the dictionaries. The sorting depends on a number of criteria, first of all on the language of the text. For instance, the sorting of a Thai dictionary is done according to an order different from the alphabetical order, so different in fact that Unitex uses a sorting procedure developed specifically for Thai [9]. For European languages, the sorting is usually done in terms of the lexicographical order, although there are some variants. Certain languages, like French, treat some characters as equivalent. For example, the difference between the characters e and é is ignored if one wants to compare the words manger and mangés, because the contexts r and s are enough to decide the order. The difference is only taken into account when the contexts are identical, as they are when comparing pêche and peche. To allow for such phenomena, the sort program SortTxt uses a file which defines the equivalence of characters. This file is named Alphabet_sort.txt and can be found in the user directory for the current language. By default, the first lines of this file for French look like this: AÀÂÄaàâä, Bb, CÇcç, Dd, EÉÈÊËeéèêë. Characters in the same line are considered equivalent if the context permits. If two equivalent characters must be compared, they are sorted in
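The two-level comparison described above, where equivalent characters only matter when everything else is equal, can be expressed as a sort key. The sketch below is not the SortTxt algorithm: the tie-break rule (position of the character within its line of the alphabet file) is an assumption, since the text is cut off before stating it, and the name `make_sort_key` and the small sample alphabet are hypothetical.

```python
def make_sort_key(alphabet_lines):
    """Build a sort key from lines of equivalent characters.

    Words are compared first by equivalence class (line number) and,
    only if that comparison is a tie, by position within the line.
    The tie-break rule is an assumption made for this example.
    """
    rank = {}
    for class_index, line in enumerate(alphabet_lines):
        for position, char in enumerate(line):
            rank.setdefault(char, (class_index, position))
    fallback = len(alphabet_lines)

    def key(word):
        # Characters missing from the alphabet sort after all listed ones.
        pairs = [rank.get(ch, (fallback + ord(ch), 0)) for ch in word]
        classes = [cls for cls, _ in pairs]
        positions = [pos for _, pos in pairs]
        return (classes, positions)

    return key
```

With a sample alphabet such as `["Aa", "Cc", "Eeéê", "Gg", "Hh", "Mm", "Nn", "Pp", "Rr", "Ss"]`, manger sorts before mangés because r precedes s, while peche and pêche differ only through the tie-break.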
74. combinations. This information is described in the file named tagset.def. tagset.def file. Here is an extract of the tagset.def file used for French: NAME francais POS ADV POS PRO inflex pers 1 2 3 genre m f nombre s p discr subcat Pind Pdem PpvIL PpvLUI PpvLE Ton PpvPR PronQ Dnom Pposs1s complete Pind <genre> <nombre> Pdem <genre> <nombre> Pposs1s <genre> <nombre> Pposs1p <genre> <nombre> Pposs2s <genre> <nombre> Pposs2p <genre> <nombre> Pposs3s <genre> <nombre> Pposs3p <genre> <nombre> PpvIL <genre> <nombre> <pers> PpvLE <genre> <nombre> <pers> PpvLUI <genre> <nombre> <pers> Ton <genre> <nombre> <pers> lui elle moi PpvPR en y PronQ où qui que quoi Dnom rien POS A adjectifs inflex genre m f nombre s p cat gauche g droite d complete <genre> <nombre> pour de bonne humeur A au bord des larmes A par exemple POS V inflex temps CFIJKPSTWYGX pers 1 2 3 genre m f nombre s p complete W G C <pers> <nombre> F <pers> <nombre> I <pers> <nombre> J <pers>
75. e <E> symbol that represents the empty word epsilon. Replace this symbol by the text le+la+l'+les and press the Enter key. You see that the box now contains four lines (see figure 5.4). The character + serves as a separator. The box is displayed in the form of red text lines since it is not connected to another one at the moment. We often use this type of box to insert comments into a graph. [Figure 5.3: Creating a box] [Figure 5.4: Box containing le+la+l'+les] To connect a box to another one, first click on the source box, then click on the target box. If a transition already exists between the two boxes, it is deleted. It is also possible to perform this operation by clicking first on the target box and then on the source box while pressing Shift. In our example, after connecting the box to the initial and the final states of the graph, we get a graph as in figure 5.5. [Figure 5.5: Graph that recognizes pronouns in French] NOTE: If you double-click a box, you connect this box to itself (see figure 5.6). To undo this, double-click on the same box a second time. [Figure 5.6: Box connected to itself] Click on Save as in the FSGraph menu to save the graph. By default Uni
76. <V> matches all entries having the grammatical code V; {am,be.V} or <am,be.V> matches all the entries having am as inflected form, be as canonical form and the grammatical code V. This kind of pattern is only of interest if applied to the text automaton, where all the ambiguities of the words are explicit. When executing a search on the text, that pattern matches the same as the simple lexical unit am. 4.3.3 Grammatical and semantic constraints. The reference to the dictionary (V) in these examples is elementary. It is possible to express more complex patterns by using several grammatical or semantic codes separated by the character +. An entry of the dictionary is then only found if it has all the codes that are present in the pattern. The pattern <N+z1> thus recognizes the entries broderies,broderie.N+z1:fp and capitales européennes,capitale européenne.N+NA+Conc+HumColl+z1:fp but not Descartes,René Descartes.N+Hum+NPropre:ms or habitué,.A+z1:ms. It is possible to exclude codes by preceding them with the character - instead of +. In order to be recognized, an entry has to contain all the codes authorized by the pattern and none of the prohibited codes. The pattern <A-z3> thus recognizes all the adjectives that do not have the code z3 (cf. table 3.2). If you want to refer to a code containing the character -, you have to escape this character by preceding it with a \. Thus the pattern l
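The matching rule for grammatical and semantic codes (all authorized codes present, no prohibited code present) can be illustrated with plain Python sets. This is a minimal sketch and not Unitex code: the function name entry_matches is hypothetical, and the pattern is passed as two already separated sets rather than parsed from the pattern syntax.

```python
def entry_matches(entry_codes, required, forbidden=frozenset()):
    """Return True if the entry has every required code and no forbidden one."""
    codes = set(entry_codes)
    return set(required) <= codes and codes.isdisjoint(forbidden)


# Codes of the dictionary entries quoted in the text.
broderies = {"N", "z1"}
capitales = {"N", "NA", "Conc", "HumColl", "z1"}
descartes = {"N", "Hum", "NPropre"}
habitue = {"A", "z1"}

# Pattern requiring both N and z1.
n_z1 = [e for e in (broderies, capitales, descartes, habitue)
        if entry_matches(e, {"N", "z1"})]
```

Under these assumptions, n_z1 keeps only the first two entries, and entry_matches(habitue, {"A"}, {"z3"}) is true because the adjective does not carry the excluded code z3.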
77. e first two bytes indicate whether the state is final, as well as the number of transitions that leave it. The highest bit is 0 if the state is final, 1 if not. The other 15 bits encode the number of transitions. Example: a non-final state with 17 transitions is encoded by the hexadecimal sequence 8011. If the state is final, the three following bytes encode the index, in the .inf file, of the compressed form to be used to reconstruct the dictionary lines for this inflected form. Example: if the state refers to the compressed form with the index 25133, the corresponding hexadecimal sequence is 00622D. Each outgoing transition is then encoded in 5 bytes. The first 2 bytes encode the character that labels the transition, and the three following bytes encode the byte position of the result state in the .bin file. The transitions of a state are encoded next to each other. Example: a transition labeled with the letter A, pointing at the state whose description starts at byte 50106, is represented by the hexadecimal sequence 004100C3BA. By convention, the first state of the automaton is the initial state. 10.7.2 The .inf files. An .inf file is a text file that describes the compressed forms that are associated with a .bin file. Here is an example of an .inf file: 00000000064 _10 0 0 7 N4 PREPY _3 PREPY PREP _3 PREPY 1 1 N Hum mp4 3er 1 N AN Hum fsY The first line of the file indicates the number of compressed forms that it contains. Each
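The state and transition layout of the .bin file can be checked against the three hexadecimal examples with a short Python sketch. The function names are hypothetical, and the big-endian byte order is an assumption inferred from the examples (8011, 00622D, 004100C3BA); this is not the Compress implementation.

```python
def encode_state(n_transitions, inf_index=None):
    """Encode a state header; inf_index is given only for final states."""
    if not 0 <= n_transitions < (1 << 15):
        raise ValueError("the transition count must fit in 15 bits")
    if inf_index is None:
        # Non-final state: the highest bit of the 2-byte header is 1.
        return (0x8000 | n_transitions).to_bytes(2, "big")
    # Final state: highest bit 0, followed by a 3-byte index into the .inf file.
    return n_transitions.to_bytes(2, "big") + inf_index.to_bytes(3, "big")


def encode_transition(char, target_offset):
    """Encode a transition: 2 bytes for the label, 3 bytes for the target offset."""
    return ord(char).to_bytes(2, "big") + target_offset.to_bytes(3, "big")


header = encode_state(17).hex().upper()
transition = encode_transition("A", 50106).hex().upper()
```

Here header is "8011" and transition is "004100C3BA", which agrees with the examples given in the text.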
78. e personal work directory. Chapter 2: Loading a text. One of the main functionalities of Unitex is searching for expressions within a text. For this, texts have to undergo a set of preprocessing steps that normalize non-ambiguous forms and split the text into sentences. Once these operations are performed, the electronic dictionaries are applied to the texts. One can then search the texts more effectively by using grammars. This chapter describes the different steps of text preprocessing. 2.1 Selecting a language. When starting Unitex, the program asks you to choose the language in which you want to work (see figure 2.1). The languages displayed are the ones that are present in the Unitex system directory and those that are installed in your personal working directory. If you use a language for the first time, Unitex copies the system directory for this language to your personal directory, except for the dictionaries. Choosing the language allows Unitex to find certain files, for example the alphabet file. You can change the language at any time by choosing Change Language in the Text menu. If you change the language, the program closes all windows related to the current text, if there are any. The active language is indicated in the title bar of the graphical interface. 2.2 Text formats. Unitex works with Unicode texts. Unicode is a standard that describes a universal character code. Each character is given a unique
79. e recognize sequences (MERGE mode); R: the transductions have replaced the recognized sequences (REPLACE mode). Each occurrence is described in one line. The lines start with the start and end positions of the occurrence. These positions are given in lexical units. If the file has the heading line 1, the end position of each occurrence is immediately followed by a newline. Otherwise, it is followed by a space and a sequence of characters. In REPLACE mode, that sequence corresponds to the transduction produced for the recognized sequence. In MERGE mode, it represents the recognized sequence into which the transductions have been inserted. In MERGE or REPLACE mode, this sequence is displayed in the concordance. If the transductions have been ignored, the content of the occurrence is extracted from the text file. 10.6.2 The file concord.txt. The file concord.txt is a text file that represents a concordance. Each occurrence is encoded in a line composed of three character sequences separated by a tab, representing the left context, the occurrence (possibly modified by transductions) and the right context. 10.6.3 The file concord.html. The concord.html file is an HTML file that represents a concordance. This file is encoded in UTF-8. The title of the page is the number of occurrences it contains. The lines of the concordance are encoded as lines where the occurrences are considered to be hypertext links. The reference associ
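As an illustration of the concord.txt layout (three character sequences separated by tabs), a line can be split as follows. This is a minimal sketch: the names ConcordLine and parse_concord_line are hypothetical, the sample line is invented for the example, and reading and decoding the file itself is left out.

```python
from typing import NamedTuple


class ConcordLine(NamedTuple):
    left: str
    occurrence: str
    right: str


def parse_concord_line(line):
    """Split one concordance line into left context, occurrence and right context."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 3:
        raise ValueError("expected exactly three tab-separated fields")
    return ConcordLine(*fields)


# Invented sample line, for illustration only.
sample = "answered \tSir Brian\t de Bois\n"
parsed = parse_concord_line(sample)
```

The contexts keep their surrounding spaces, so a concordance line can be rebuilt by simply concatenating the three fields.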
80. e the automaton of that sentence by clicking on the button Reset Sentence Graph (cf. figure 7.24). [Figure 7.24: Modified sentence automaton] During the construction of the text automaton, all the modified sentence graphs in the text file are erased. NOTE: You can reconstruct the text automaton and keep your manual modifications. In order to do that, click on the button Rebuild FST Text. All sentences that have modifications are then replaced in the text automaton with their modified versions. The new text automaton is then automatically reloaded. 7.4.3 Parameters of presentation. The sentence automata are subject to the same presentation options as the graphs. They use the same colors and fonts, as well as the antialiasing effect. In order to configure the appearance of the sentence automata, modify the general configuration by clicking on Preferences in the menu Info. For further details, refer to section 5.3.5. You can also print a sentence automaton by clicking on Print in the menu FSGraph or by pressing <Ctrl+P>. Make sure that the printer's page orientation is set to landscape mode. To configure this parameter, click on Page Setup in the menu FSGraph.
81. ect a normalization of ambiguous forms by applying a normalization grammar. This grammar has to be called Norm.fst2 and must be placed in your personal folder, in the subfolder Graphs/Normalization of the desired language. The normalization grammars for ambiguous forms are described in section 6.1.3. If a sequence of the text is recognized by the normalization grammar, all the interpretations that are described by the grammar are inserted into the text automaton. [Figure 7.3: Double entry for which as a determiner] Figure 7.4 shows the extract of the grammar used for French that makes the ambiguity of the sequence l' explicit. [Figure 7.4: Normalization of the sequence l'] If this grammar is applied to a French sentence containing the sequence l', a sentence automaton similar to the one in figure 7.5 is obtained. You can see that the four rules for rewriting the sequence l' have been applied, which has added four labels to the automaton. These labels compete with the two preexisting paths for the sequence l'. Normalization at the time of the construction of the automaton makes it possible to add paths to the automaton, but not to erase paths. When the disambiguation functionality becomes available, it will make it possible to eliminate the paths that have become superfluous. 7.2.3 Normalization of clitical pronouns in Portugu
82. ection XXX. 6.1.6 Template graphs. The template graphs are meta-graphs that make it possible to generate a family of graphs starting from a lexicon-grammar table. It is possible to construct template graphs for all possible kinds of graphs. The construction and use of template graphs are explained in chapter 8. 6.2 Compilation of a grammar. 6.2.1 Compilation of a graph. Compilation is the operation that converts the .grf format to a format that can be manipulated more easily by the Unitex programs. In order to compile a graph, open it and then click on Compile FST2 in the submenu Tools of the menu FSGraph. Unitex then starts the program Grf2Fst2. You can keep track of its execution in a window (cf. figure 6.4). [Figure 6.4: Compilation window] If the graph references subgraphs, those are automatically compiled. The result is an .fst2 file that contains all the graphs that make up the grammar. The grammar is then ready to be used by the different Unitex programs. 6.2.2 Approximation with a finite state transducer. The FST2 format conserves the architecture in subgraphs of the grammars, which is what
83. ed in the tagset.def file are ignored by ELAG. If a dictionary entry contains such a code, ELAG produces a warning and removes the code from the entry. Consequently, if two concurrent entries differ in the original text automaton only by undeclared codes, these entries become indistinguishable to the programs and are thus unified into a single entry in the resulting automaton. Thus, the set of labels described in the file tagset.def can be enough to reduce the ambiguity, by factorizing words that differ only by undeclared codes, and this independently of the applied grammars. For example, in the most complete version of the French dictionary, each individual use of a verb is characterized by a reference to the lexicon-grammar table that describes it. We have considered until now that this information is more relevant to syntax than to lexical analysis, and we have thus not integrated it into the description of the sets of labels. It is thus automatically eliminated when the text automaton is loaded, which reduces the rate of ambiguities. (Footnote 3: This code indicates that the adjective must appear on the left of the noun to which it refers, as is the case for bel.) In order to distinguish the effects bound to the set of labels from those of the ELAG grammars, it is advised to proceed to a preliminary stage of standardizati
84. ed to construct an inverse dictionary, which is necessary for the program Reconstrucao. 9.3 Concord. Concord index font fontsize left right order mode alph thai. This program takes an index file of the concordance produced by the program Locate and produces a concordance. It can also produce a modified version of the text that takes into account the transductions associated with the occurrences. Here is the description of the parameters: index: name of the concordance file. The entire file path must be specified, since Unitex uses it to determine for which text the concordance is to be constructed; font: name of the typeface if the concordance is in HTML format. This value is ignored if the concordance is not in HTML format; fontsize: size of the typeface if the concordance is in HTML format. Like the parameter font, it is ignored if the concordance is not in HTML format; left: number of characters to the left of the occurrences. In Thai mode, this means the number of non-diacritic characters; right: number of characters (non-diacritic ones in Thai mode) to the right of the occurrences. If the occurrence is shorter than this value, the concordance line is completed so that the right context is equal to right. If the occurrence is longer than the number of characters defined by right, it is nevertheless saved as a whole; order: indicates the mode to be used for sorting the lines of the concordance. The possible values are
85. eds to be enclosed in quotation marks so that it is not considered as multiple parameters. 9.1 CheckDic. CheckDic dictionary type. This program verifies the format of a dictionary of type DELAS or DELAF. The parameter dictionary corresponds to the name of the dictionary to be verified. The parameter type can take the value DELAS or DELAF, depending on the format of the dictionary to be verified. The program checks the syntax of the lines of the dictionary. It also creates a list of all characters occurring in the inflected and canonical forms of words, the list of grammatical and syntactic codes, as well as the list of inflectional codes used. The results of the verification are stored in a file called CHECK_DIC.TXT. 9.2 Compress. Compress dictionary flip. This program takes a DELAF dictionary as a parameter and compresses it. The compression of a dictionary dico.dic produces two files: dico.bin, a binary file containing the minimal automaton of the inflected forms of the dictionary; and dico.inf, a text file containing the compressed forms that allow the dictionary lines to be reconstructed from the inflected forms contained in the automaton. For more details on the format of these files, see chapter 10. The optional parameter flip indicates that the inflected and canonical forms should be swapped in the compressed dictionary. This option is us
86. ee section 2.2. If you are using Microsoft Word to perform this conversion, make sure that the graph still has the .grf extension after the conversion, since a .txt extension is sometimes appended automatically. If a .txt extension was appended, it has to be removed. ATTENTION: A graph converted to Unicode that was used in Unitex cannot be used in Intex any longer. In order to use it again in Intex, you have to convert it to ASCII, for example using the Uni2Asc program. In addition, you have to open the graph in a text editor and replace the first line Unigraph by the following line: FSGraph 4.0. 5.2.2 Creating a graph. In order to create a graph, click on New in the FSGraph menu. You will then see a window like the one in figure 5.2. The symbol in arrow form is the initial state of the graph. The round symbol with a square is the final state of the graph. The grammar only recognizes expressions that are described along the paths between the initial and final states. [Figure 5.2: Empty graph] In order to create a box, click inside the window while pressing the Ctrl key. A blue rectangle appears that symbolizes the empty box that was created (see figure 5.3). After creation, the box is automatically selected. You see the contents of that box in the text field at the top of the window. The newly created box contains th
87. eed to be compiled before using them. 6.1.4 Syntactic graphs. The syntactic graphs, often called local grammars, describe syntactic patterns that can then be searched for in texts. Of all kinds of graphs, these have the greatest expressive power, because they can refer to dictionaries. Lower-case and upper-case variants may be used according to the principle described above. It is still possible to enforce respect of case by enclosing an expression in quotes. Quotes also make it possible to enforce respect of spaces: by default, Unitex assumes that a space is possible between two boxes. In order to enforce the presence of a space, you have to enclose it in quotes. To prohibit the presence of a space, you have to use the special symbol #. The syntactic graphs can reference subgraphs (cf. section 5.2.3). They can also carry transductions, including transductions with variables. The produced sequences are interpreted as strings of characters that are inserted in the concordances, or in the text if you want to modify it (cf. section 6.6.3). The special symbols supported by the syntactic graphs are the same as those usable in regular expressions (cf. section 4.3.1). It is not obligatory to compile syntactic graphs before using them for pattern searching. If a graph is not compiled, the system compiles it automatically. 6.1.5 ELAG grammars. The syntax of grammars for resolving ambiguities is presented in s
88. [Figure 2.3: File conversion] To obtain a text in the right format, you can also use a word processor, such as the free software from OpenOffice.org or Microsoft Word, and save your document in the format Unicode text. By default, the encoding proposed on a PC is always Unicode Little Endian. The texts thus obtained no longer contain any formatting information (fonts, colors, etc.) and are ready to be used with Unitex. 2.3 Editing texts. You can also use the text editor integrated into Unitex, accessible via the Open command in the File Edition menu. This editor offers search and replace functionalities for the texts and dictionaries handled by Unitex. To use it, click on the Find icon. You will then see a window divided into three parts. The Find part corresponds to the usual search operations. If you open a text split into sentences, you can search by sentence number in the Find Sentence part. Lastly, the Search
89. enerated. This main graph is a graph that refers to all the graphs that are going to be generated. When launching a search in a text with that graph, all generated graphs are applied simultaneously. The field Name of produced subgraphs is used to set the name of each graph that will be generated. It is a good idea to enter a name containing @, because for each line of the table @ is replaced by the line number, which guarantees that each graph name is unique. [Figure 8.6: Configuration of the automatic generation of graphs] For example, if the main graph is called TestGraph.grf and the subgraphs are called TestGraph_@.grf, the graph generated from the 16th line of the table will be named TestGraph_0016.grf. Figures 8.7 and 8.8 show two graphs generated by applying the template grap
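The naming scheme for generated subgraphs can be sketched in a few lines of Python. This is only an illustration of the example above (line 16 gives TestGraph_0016.grf): the function name is hypothetical, the placeholder character @ is an assumption here, and the four-digit zero padding is inferred from that single example.

```python
def subgraph_name(pattern, line_number, placeholder="@"):
    """Replace the placeholder with the table line number, padded to 4 digits."""
    return pattern.replace(placeholder, f"{line_number:04d}")


# One distinct name per table line.
names = [subgraph_name("TestGraph_@.grf", n) for n in (1, 16, 250)]
```

Since every table line has a different number, each generated name is distinct, which is the property the text relies on.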
90. [Figure 6.28: Configuration for displaying the encountered occurrences] The concordance is produced in the form of an HTML file. You can configure Unitex so that the concordances are read using a web browser (cf. section 4.8.2). If you display the concordances in the window provided by Unitex, you can access a recognized sequence in the text by clicking on the occurrence. If the text window is not minimized and the text is not too long to be displayed, the selected sequence appears (cf. figure 6.29).
91. es to the variables you use. These names can contain non-accented lower-case and upper-case letters between A and Z, digits and the character _ (underscore). To delimit the zone that is stored in a variable, you have to create a box that contains the name of the variable enclosed in the characters $ and ( ($ and ) for the end of a variable). In order to use a variable in a transduction, its name must be preceded by the character $ (cf. figure 6.19). Variables are global. This means that you can define a variable in a graph and reference it in another, as illustrated in the graphs of figure 6.19. [Figure 6.19: Definition of a variable in a subgraph] If the graph TitleName is applied in MERGE mode to the text Ivanhoe, a concordance is obtained in which each recognized title is followed by TITLE and the stored value, for example: Prince John TITLE Prince.
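The naming rule for variables (unaccented letters, digits and the underscore) can be checked with a regular expression. This is a minimal sketch; the function name is hypothetical, and rejecting the empty string is an assumption, since the text does not mention a minimum length.

```python
import re

# Unaccented letters, digits and underscore only.
_VARIABLE_NAME = re.compile(r"[A-Za-z0-9_]+")


def is_valid_variable_name(name):
    """Return True if the name uses only the characters allowed for variables."""
    return _VARIABLE_NAME.fullmatch(name) is not None


checks = {name: is_valid_variable_name(name)
          for name in ("TITLE", "title_2", "pr\u00e9nom", "my var")}
```

In this sketch, TITLE and title_2 are accepted, while a name with an accented letter or a space is rejected.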
92. ese. In Portuguese, verbs in the future tense and in the conditional can be modified by the insertion of one or two clitic pronouns between the root and the suffix of the verb. For example, the sequence dir-me-ão (they will tell me) corresponds to the complete verbal form dirão, associated with the pronoun me. In order to be able to manipulate this rewritten form, it is necessary to introduce it into the text automaton in parallel to the original form. The user can thus search for either form. Figures 7.6 and 7.7 show the automaton of a sentence before and after the normalization of the clitics. [Figure 7.5: Automaton that has been normalized with the grammar of figure 7.4] [Figure 7.6: Non normalized phrase automaton] Figure 7.7: Normalized phrase au
93. esigned to be applied before the construction of an automaton for a Portuguese text. The parameter alph specifies the alphabet file to use. The file concord represents a concordance, which has to be produced by applying, in MERGE mode, to the text under consideration a grammar that extracts all the forms to normalize. This grammar is called V-Pro-Suf and is stored in the directory Portuguese/Graphs/Normalization. The parameter dic specifies which dictionary to use to find the canonical forms that are associated with the roots of the verbs. reverse_dic specifies the inverse dictionary to use to find the forms in the future and conditional starting from the canonical forms. These two dictionaries have to be in .bin format, and reverse_dic has to be obtained by compressing the dictionary of verbs in the future and conditional with the parameter flip (see section 9.2). The parameter pro specifies the grammar for rewriting the pronouns to use. res specifies the .grf file into which the normalization rules are to be written. 9.23 Reg2Grf. Reg2Grf file. This program constructs a .grf file corresponding to the regular expression in the file file. The parameter file represents the complete path to the file containing the regular expression. This file needs to be a Unicode text file. The program takes into account all characters up to the first newline. The result file is called regexp.grf and is saved in the same directory as file. 9.24 SortTxt. SortTxt text OPTI
94. ex. Chapter 4: Searching with regular expressions. This chapter describes how to search for simple patterns in a text by using regular expressions. 4.1 Definition. The goal of this chapter is not to give an introduction to formal languages, but to show how to use regular expressions in Unitex in order to search for simple patterns. Readers who are interested in a more formal presentation can consult the many works that discuss regular expression patterns. A regular expression can be: a lexical unit (livre) or a pattern (<smoke.V>); the concatenation of two regular expressions (he smokes); the union of two regular expressions (Pierre+Paul); the Kleene star of a regular expression (finish*). 4.2 Lexical units. In a regular expression, a lexical unit is a sequence of letters. The symbols period, plus, star and less than, as well as the opening and closing parentheses, have a special meaning. It is therefore necessary to precede them with an escape character if you want to search for them. Here are some examples of valid lexical units: cat 3 1415 1984 S. By default, Unitex is set up to let lower-case patterns also find upper-case matches. It is possible to enforce case-sensitive matching using quotation marks. Thus, "peter" recognizes only the form peter and not Peter or PETER. NOTE: in order to make a space obligatory, it needs to be enclosed in q
95. explicitly clicking on Apply Lexical Resources in the Text menu (see section 3.6). We will now describe the rules for applying dictionaries in detail. 3.6.1 Priorities. The priority rule says that if a word in a text is found in a dictionary, this word will not be taken into account by dictionaries with lower priority. This makes it possible to eliminate certain ambiguities when applying dictionaries. For example, the word par has a nominal interpretation in the golf domain. If you do not want to use this reading, it is sufficient to create a filter dictionary containing only the entry par,.PREP and to apply it with the highest priority. This way, even if dictionaries of simple words contain a different entry, it will be ignored because of the priority rule. [Figure 3.9: Results of a compression] There are three priority levels. The dictionaries whose names without extension end with - have the highest priority; those that end with + have the lowest one. All other dictionaries are applied with medium priority. The order in which dictionaries with the same priority are applied is not defined. On the command line, the command Dico ex.snt alph.txt countries+.bin cities-.bin rivers.bin regions-.bin will apply the dictionaries
96. extracted; thai: optional parameter, necessary for searching a Thai text; space: optional parameter indicating that the search should be performed beyond spaces. This parameter should only be used to carry out morphological searches. This program saves the references to the found occurrences in a file called concord.ind. The number of occurrences, the number of units covered by those occurrences, and the percentage of recognized units within the text are saved in a file called concord.n. These two files are stored in the directory of the text. 9.19 MergeTextAutomaton. MergeTextAutomaton automaton. This program reconstructs the text automaton automaton, taking into account the manual modifications. If the program finds a file sentenceN.grf in the same directory as automaton, it replaces the automaton of sentence N with the one represented by sentenceN.grf. The file automaton is replaced by the new text automaton. The old text automaton is backed up in a file called text.fst2.bck. 9.20 Normalize. Normalize txt. This program carries out a normalization of text separators. The separators are space, tab and newline. Every sequence of separators that contains at least one newline is replaced by a unique newline. All other sequences of separators are replaced by a single space. This program also verifies the syntax of the lexical tables present in the text. Al
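The separator rule of Normalize (a run of separators becomes a newline if it contains one, a single space otherwise) can be sketched with one regular expression substitution. This is an illustration in Python, not the Normalize program itself; the function name is hypothetical, and only space, tab and newline are treated as separators, as stated in the text.

```python
import re

# A run of one or more separators: space, tab or newline.
_SEPARATORS = re.compile(r"[ \t\n]+")


def normalize_separators(text):
    """Collapse each run of separators to a newline or to a single space."""
    return _SEPARATORS.sub(
        lambda match: "\n" if "\n" in match.group() else " ", text
    )


result = normalize_separators("one \t two  \n\n\t three")
```

With this input, result is "one two\nthree": the first run contains no newline and collapses to a space, the second contains newlines and collapses to a single newline.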
97. f performing the following operations: normalization of separators, identification of lexical units, normalization of non-ambiguous forms, splitting into sentences and the application of dictionaries. If you choose not to preprocess the text, it will nevertheless be normalized and lexical units will be looked up, since these operations are necessary for all further Unitex operations. It is always possible to carry out the preprocessing later by clicking on Preprocess Text in the Text menu. If you choose to preprocess the text, Unitex lets you configure it in the window shown in figure 2.8. [Figure 2.8: Preprocessing window] The option Apply FST2 in MERGE mode is used to split the text into sentences. The option Apply FST2 in REPLACE mode is u
98. factorized lexical entries 104 Fichier conc fst2 103 fst2 135 siat 109 File INDEX rul 103 105 135 fst2 76 bin 131 134 157 162 cfg 163 dic 131 Est 2 51 120 139 150 grf 51 81 120 141 147 html 133 int dol 157 snt 21 140 142 143 145 152 txt 92 133 145 152 Alphabet txt 146 CHECK_DIC TXT 131 159 Config 160 Sentence fst2 22 Unitex jar 12 13 concord html 155 concord ind 140 154 concord n 140 164 concord txt 155 cursentence grf 137 154 cursentence txt 137 154 dic 134 163 dic n 163 dl f 134 163 di n 163 enter pos 143 153 err 134 163 err n 163 regexp grf 141 stat_dic n 134 163 stats n 25 143 163 system _ dic def 162 text cod 24 143 152 text fst2 137 143 153 text fst2 bck 140 tok_by_alph txt 25 143 152 tok_by_freq txt 25 143 152 tokens txt 24 143 152 unitex zip 12 user_dic def 162 alphabet 15 22 24 34 132 138 139 142 143 format of 145 INDEX HTML 53 92 131 text 145 File Conversion 15 File formats 145 File gr 139 Files 1st 104 rul 103 185 tagset def 109 116 117 bin 39 dic 35 39 inf 40 Alphabet_sort txt 37 CHECK_DIC TXT 34 alphabet 42 dic 26 37 dif 26 37 err 26 37 Text largest size 19 texte 19 files tagset def 115 Form canonical 29 inflected 29 GlossaNet 132 155 Grammaires de lev e d ambiguit
99. fferent ways to add languages. If you want to add a language that is to be accessible to all users, you have to copy the corresponding directory to the Unitex system directory, for which you will need to have the access rights (this might mean that you need to ask your system administrator to do it). On the other hand, if the language is only used by a single user, he can also copy the directory to his working directory. He can then work with this language without it being shown to other users. [Figure 1.1: First usage on Windows] [Figure 1.2: First usage on Linux] 1.7 Uninstalling Unitex. No matter which operating system you are working with, it is sufficient to delete the Unitex directory to completely delete all the program files. Under Windows, you may have to delete the shortcut to Unitex.jar if you have created one on your desktop. The same has to be done on Linux if you have created an alias. [Figure 1.3: Creating th
100. finition ends with a newline. 10.3.2 Format .fst2. An .fst2 file is a text file that describes a set of graphs. Here is an example of an .fst2 file: [example .fst2 listing with two graphs, -1 NP and -2 Adj] The first line represents the number of graphs that are encoded in the file. The beginning of each graph is identified by a line that indicates the number and the name of the graph (-1 NP and -2 Adj in the file above). The following lines describe the states of the graph. If the state is final, the line starts with the character t, and with the character : if not. For each state, the list of transitions is a possibly empty sequence of pairs of integers: • the first integer indicates the number of the label or the sub-graph corresponding to the transition. The labels are numbered starting at 0. The sub-graphs are represented by negative integers, which explains why the numbers preceding the names of the graphs are negative; • the second integer represents the number of the result state after the transition. In each graph, the states are numbered starting at 0. By convention, state 0 of a graph is its initial state. Each definition line of a state terminates with a space. The end of each graph is marked by a line containing an f followed by a space. The labels are defined after the last graph. If the line begin
101. flected form of the verb suivre, as opposed to the form of the verb être; • <facteur.N-Hum>: all nominal entries that have facteur as canonical form and that do not have the semantic code Hum; • <!ADV>: all words that are not adverbs; • <!MOT>: all symbols that are not letters, except for the phrase separator (cf. figure 4.2). [Figure: concordance window for 80jours.snt (concord.html), 200 matches]
102. [Concordance figure: occurrences shown with their left and right contexts, from a French text featuring Phileas Fogg]
103. h of figure 8.2 to table 31H. Figure 8.9 shows the resulting main graph. [Figure 8.7: Graph generated for the verb archaïser] [Figure 8.8: Graph generated for the verb badauder] [Figure 8.9: Main graph referring to all generated graphs] Chapter 9: Use of external programs. This chapter presents the use of the different programs of which Unitex is composed. These programs, which can be found in the directory Unitex/App, are automatically called by the interface. It is possible to see the commands that have been executed by clicking on Console in the Info menu. It is also possible to see the options of the different programs by selecting Help on commands in the Info menu. ATTENTION: a number of programs use the text directory my_text_snt. This directory is created by the graphical interface after the normalization of the text. If you work with the command line, you have to create the directory manually before the execution of the program Normalize. ATTENTION 2: whenever a parameter contains spaces, it ne
104. haracters P, 3 and s at the same time. However, the code P3p of E does contain the characters P and 3. The code P3 is included in at least one code of E; the pattern M thus recognizes the entry E. The order of the characters inside an inflectional code is without importance. 4.3.5 Negation of a pattern. It is possible to negate a pattern by placing the character ! immediately after the character <. Negation is possible with the patterns <MOT>, <MIN>, <MAJ>, <PRE>, <DIC>, as well as with the patterns that carry grammatical, semantic or inflectional codes (e.g. <!V-z3:P3>). The patterns # and " " are each the negation of the other. The pattern <!MOT> recognizes all lexical units that do not consist of letters, except for the phrase separator. The negation is interpreted in a special way in the patterns <!DIC>, <!MIN>, <!MAJ> and <!PRE>. Instead of recognizing all forms that are not recognized by the pattern without negation, these patterns find only forms that are sequences of letters. Thus, the pattern <!DIC> finds all unknown words in a text. These unknown forms are mostly proper names, neologisms and spelling errors. Here are some examples of patterns that mix the different types of constraints: • <A-Hum:fs>: a non-human adjective in feminine singular; • <lire.V:P:F>: the verb lire in present tense or future; • <suis,suivre.V>: the word suis as in
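The inclusion test on inflectional codes described at the start of this excerpt can be sketched as a set comparison. This is a minimal illustration under stated assumptions: the function names and the sample entry codes are hypothetical, and only the rule quoted above is modelled (a pattern code is included in an entry code when all of its characters occur in it, in any order).

```python
def code_included(pattern_code: str, entry_code: str) -> bool:
    """True if every character of pattern_code occurs in entry_code.

    Character order is ignored, as stated in the excerpt.
    """
    return set(pattern_code) <= set(entry_code)


def inflection_matches(pattern_codes, entry_codes) -> bool:
    """True if at least one pattern code is included in at least one entry code."""
    return any(
        code_included(p, e) for p in pattern_codes for e in entry_codes
    )


# Hypothetical codes: a pattern carrying P3s and P3, an entry carrying P1s and P3p.
print(inflection_matches(["P3s", "P3"], ["P1s", "P3p"]))
print(inflection_matches(["P3s"], ["P1s", "P3p"]))
```

With these sample codes, the first call succeeds because P3 is included in P3p, while the second fails because no entry code contains P, 3 and s at the same time.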
105. he complete file path, without omitting the extension .snt. dic_i represents the file path of a dictionary. The dictionary must have the extension .bin. It is possible to give priorities to the dictionaries; for details see section 3.6.1. The program Dico produces the following four files and saves them in the directory of the text: • dlf: dictionary of simple words in the text; • dlc: dictionary of compound words in the text; • err: list of unknown words in the text; • stat_dic.n: file containing the number of simple words, the number of compound words and the number of unknown words in the text. NOTE: the files dlf, dlc and err are not sorted. Use the program SortTxt to sort them. 9.6 Elag. Elag txtauto -l lang -g rules -o output -d dir. This program takes a text automaton txtauto and applies disambiguation rules to it. The parameters are as follows: • txtauto: the text automaton in .fst2 format; • lang: the ELAG configuration file for the language considered; • rules: the file of rules compiled in the .rul format; • output: the output text automaton; • dir: this optional parameter indicates the directory in which the ELAG rules are to be found. 9.7 ElagComp. ElagComp -r ruleslist -g grammar -l lang -o output -d rulesdir. This program compiles an ELAG grammar whose name is grammar, or all the grammars specified in the file ruleslist. The result is stored in a file o
106. he result main graph. 9.26 TextAutomaton2Mft. TextAutomaton2Mft text.fst2. This program takes a text automaton text.fst2 as a parameter and constructs the equivalent in the .mft format of Intex. The produced file is called text.mft and is encoded in Unicode. 9.27 Tokenize. Tokenize text alphabet char_by_char. This program cuts the text into lexical units. The parameter text represents the complete path of the text file, without omitting the extension .snt. The parameter alphabet represents the complete path of the alphabet definition file of the language of the text. The optional parameter char_by_char indicates that the program is applied character by character, with the exception of the sentence separator {S}, which is considered to be a single unit. Without this parameter, the program considers a unit to be either a sequence of letters (the letters are defined by the file alphabet), or a character which is not a letter, or the sentence separator {S}, or a lexical label ({aujourd'hui,.ADV}). The program codes each unit as an integer. The list of units is saved in a text file called tokens.txt. The sequence of codes representing the units then allows the coding of the text. This sequence is saved in a binary file named text.cod. The program also produces the following four files: • tok_by_freq.txt: text file containing the units ordered by frequency; • tok_by_alph.txt: text file containing the units ordered alphabetically; • s
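The unit definition given for Tokenize can be approximated with a regular expression. The sketch below is an illustration only and makes several assumptions: the sentence separator and lexical labels are taken to be written in curly brackets, "letter" means any Unicode letter (Unitex reads the letters from the alphabet file), and the function names tokenize and encode are hypothetical.

```python
import re

# A unit is a {...} sequence (sentence separator or lexical label),
# a run of letters, or any single other character.
_UNIT = re.compile(r"\{[^{}]*\}|[^\W\d_]+|.", re.DOTALL)
# Character-by-character mode: only the sentence separator stays whole.
_CHAR_UNIT = re.compile(r"\{S\}|.", re.DOTALL)


def tokenize(text: str, char_by_char: bool = False) -> list:
    """Cut text into lexical units following the rule in the excerpt."""
    pattern = _CHAR_UNIT if char_by_char else _UNIT
    return pattern.findall(text)


def encode(units: list):
    """Return (distinct units in order of first appearance, code sequence)."""
    index = {}
    codes = [index.setdefault(unit, len(index)) for unit in units]
    return list(index), codes


units = tokenize("Hello, world!{S}")
print(units)          # ['Hello', ',', ' ', 'world', '!', '{S}']
print(encode(["a", " ", "a"]))  # (['a', ' '], [0, 1, 0])
```

The two values returned by encode play the roles that the excerpt assigns to tokens.txt (the list of units) and text.cod (the sequence of codes); the actual file layouts are not modelled here.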
107. icates the number of characters to remove. The sequence \0\0\7 indicates that the sequence 007 should be appended. The digits are preceded by the character \ so they will not be confused with the number of characters to remove. Whenever the two forms have the same number of units, the units are compressed two by two. If the two units are composed of a space or a hyphen, the compressed form of the unit is the unit itself, as in the following line: 1 1 N Hum mp. This keeps the .inf file readable whenever the dictionary contains compound words. Whenever at least one of the units is neither a space nor a hyphen, the compressed form is composed of the number of characters to remove, followed by the sequence of characters to append. Thus, the dictionary line première partie,premier parti.N+AN+Hum:fs is encoded by the line 3er 1.N+AN+Hum:fs. The code 3er indicates that 3 characters are to be removed from the sequence première and that the characters er are to be appended to obtain premier. The 1 indicates that only one character needs to be removed from partie to obtain parti. The number 0 is used whenever it needs to be indicated that no letter should be removed. 10.7.3 The file CHECK_DIC.TXT. This file is produced by the dictionary verification program CheckDic. It is a text file that contains information about the analysed dictionary and has four parts. The first part is the possibly empty
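The compression codes described above can be decoded with a few lines of Python. This is a minimal sketch of the rule as stated in the excerpt (remove N characters from the end of the inflected unit, then append the given characters); the function names are hypothetical, only space-separated units are handled, and escaped digits in the appended sequence are not supported.

```python
def apply_code(unit: str, code: str) -> str:
    """Rebuild a canonical unit: drop N trailing characters, append the rest."""
    i = 0
    while i < len(code) and code[i].isdigit():
        i += 1
    n_remove = int(code[:i]) if i else 0
    suffix = code[i:]
    return unit[: len(unit) - n_remove] + suffix


def apply_compound(inflected: str, code: str) -> str:
    """Apply one code per space-separated unit (units are paired two by two)."""
    units = inflected.split(" ")
    codes = code.split(" ")
    return " ".join(apply_code(u, c) for u, c in zip(units, codes))


print(apply_code("première", "3er"))               # premier
print(apply_compound("première partie", "3er 1"))  # premier parti
```

Decoding the example from the excerpt this way yields premier parti from première partie and the code 3er 1, and the code 0 leaves a unit unchanged.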
108. iles that represent the graphs that are constructed by the user. In fact, in a normal graph, the lines of a box are separated by the symbol +. In the graph of a sentence, each box is either a lexical unit without label or a dictionary entry enclosed in curly brackets. If the box only contains an unlabeled lexical unit, this unit appears alone in the box. If the box contains a dictionary entry, the inflected form is displayed, followed by the canonical form if it is different. The grammatical and inflectional information is displayed below the box, as in the transductions. Figure 7.23 shows the graph obtained for the first sentence of Ivanhoe. The words Ivanhoe, Walter and Scott are considered unknown words. The word by corresponds to two entries in the dictionary. The word Sir corresponds to two dictionary entries as well, but since the canonical form of these entries is sir, it is displayed because it differs from the inflected form by a lower-case letter. [Figure 7.23: Automaton of the first sentence of Ivanhoe] 7.4.2 Modify the text automaton. It is possible to manually modify the sentence automaton. You can add or erase boxes or transitions. When a graph is modified, it is saved to the text file sentenceN.grf, where N represents the number of the sentence. When you select a sentence, if a modified graph exists for this sentence, this one is displayed. You can then reinitializ
109. in the following order (ex.snt is the text to which the dictionaries are applied and alph.txt is the alphabet file used): 1. cities.bin 2. regions.bin 3. riverts.bin 4. countries.bin 3.6.2 Application rules for dictionaries. Besides the priority rule, the application of dictionaries respects upper-case letters and spaces. The upper-case rule is as follows: • if there is an upper-case letter in the dictionary, then an upper-case letter has to be in the text; • if a lower-case letter is in the dictionary, there can be either an upper-case or a lower-case letter in the text. Thus, the entry peter,.N:s will match the words peter, Peter and PETER, while Peter,.N+firstName only recognizes Peter and PETER. Lower-case and upper-case letters are defined in the alphabet file passed to the program Dico as a parameter. Respecting white space is a very simple rule: for a sequence in the text to be recognized by a dictionary entry, it has to have exactly the same number of spaces. For example, if the dictionary contains aujourd'hui,.ADV, the sequence Aujourd' hui will not be recognized because of the space that follows the apostrophe. 3.7 Bibliography. Table XXX gives some references for electronic dictionaries with simple and compound words. For more details, see the references page on the Unitex website. XXXcaption: Some bibliographical references for electronic dictionaries. Thttp www igm univ mlv fr unit
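The upper-case and white-space rules above can be sketched as a character-by-character comparison. This is an illustration under an assumption: Python's own notion of lower case stands in for the alphabet file that Unitex actually consults, and the function name entry_matches is hypothetical.

```python
def entry_matches(entry_form: str, text_form: str) -> bool:
    """Check a dictionary form against a text sequence.

    A lower-case letter in the dictionary accepts either case in the text;
    every other character (upper-case letter, space, apostrophe) must be
    identical, so the number of spaces has to be exactly the same.
    """
    if len(entry_form) != len(text_form):
        return False
    for d, t in zip(entry_form, text_form):
        if d.islower():
            if t.lower() != d:
                return False
        elif d != t:
            return False
    return True


for word in ("peter", "Peter", "PETER"):
    print(word, entry_matches("peter", word), entry_matches("Peter", word))
print(entry_matches("aujourd'hui", "Aujourd' hui"))  # False
```

The loop reproduces the example from the excerpt: the form peter accepts all three spellings, while Peter rejects peter.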
110. ine. The following lines define the parameter values of the graph presentation: • SIZE x y: defines the width x and the height y of the graph in pixels; • FONT name:xyz: defines the font used for displaying the contents of the boxes. name represents the name of the font. x indicates whether the text should be in bold face or not: if x is B, the text is bold; for non-bold text, x should be a space. In the same way, y has the value I if the text should be italic, and is a space if not. z represents the size of the text; • OFONT name:xyz: defines the font used for displaying the transductions. The parameters name, x, y and z are defined in the same way as for FONT; • BCOLOR x: defines the background color of the graph. x represents the color in RGB format; • FCOLOR x: defines the drawing color of the graph. x represents the color in RGB format; • ACOLOR x: defines the color used for drawing the lines of the boxes that correspond to calls of sub-graphs. x represents the color in RGB format; • SCOLOR x: defines the color used for writing in comment boxes (the boxes that are not linked to any others). x represents the color in RGB format; • CCOLOR x: defines the color used for drawing the selected boxes. x represents the color in RGB format; • DBOXES x: this line is ignored by Unitex. It is kept to ensure compatibility with Intex graphs; • DFRAME x: draws a frame around the graph if x is y, not if it is n; • DDATE x: puts
111. inflected form of an entry to describe an abbreviation and the canonical form to provide the complete form: DNA,DeoxyriboNucleic Acid.ACRONYM LADL,Laboratoire d'Automatique Documentaire et Linguistique.ACRONYM UN,United Nations.ACRONYM 3.2 Verification of the dictionary format. When dictionaries become larger, it becomes tiresome to verify them by hand. Unitex contains the program CheckDic, which automatically verifies the format of DELAF and DELAS dictionaries. This program verifies the syntax of the entries. For each malformed entry, the program outputs the line number, the contents of the line and the type of error. The results are saved in the file CHECK_DIC.TXT, which is displayed when the verification is finished. In addition to possible error messages, the file also contains the list of all characters used in the inflected and canonical forms, the list of grammatical and semantic codes, and the list of inflectional codes used. The character list makes it possible to verify that the characters used in the dictionary are consistent with those in the alphabet file of the language. Each character is followed by its value in hexadecimal notation. These code lists can be used to verify that there are no typing errors in the codes of the dictionary. The program works with non-compressed dictionaries, i.e. the files in text format. The general convention is to use the .dic extension for these
112. itting the text into sentences. The result of the normalization of a text named my_text.txt is a file in the same directory as the .txt file, named my_text.snt. NOTE: when the text is preprocessed using the graphical interface, a directory named my_text_snt is created immediately after normalization. This directory, called the text directory, contains all the data associated with this text. 2.5.2 Splitting into sentences. Splitting texts into sentences is an important preprocessing step, since it helps in determining the units for linguistic processing. The splitting is used by the text automaton construction program. Contrary to what one might think, detecting sentence boundaries is not a trivial problem. Consider the following text: The family urgently called Dr. Martin. The full stop that follows Dr is followed by a word beginning with a capital letter. Thus it may be considered as the end of a sentence, which would be wrong. To avoid the kind of problems caused by the ambiguous use of punctuation, grammars are used to describe the different contexts for the end of a sentence. Figure 2.9 shows an example grammar for sentence splitting. When a path of the grammar recognizes a sequence in the text, and when this path produces the sentence separator symbol {S}, this symbol is inserted into the text. The path shown at the top of figure 2.9 recognizes the sequence consisting of a question mark and a word beginning with a capi
113. l fait très beau [Figure 9.1: Graph with cycle] 9.14 Fst2Txt. Fst2Txt text fst2 alphabet mode char_by_char. This program applies a transducer to a text at the preprocessing stage, when the text has not yet been split into lexical units. The parameters of the program are the following: • text: the text file to modify, with the extension .snt; • fst2: the transducer to apply; • alphabet: the alphabet file of the language of the text; • mode: the application mode of the transducer. The two possible modes are merge and replace; • char_by_char: this optional parameter applies the transducer in character-by-character mode. This option is used for texts in Asian languages. This program modifies the text file given as a parameter. 9.15 Grf2Fst2. Grf2Fst2 graph [y/n]. This program compiles a grammar into an .fst2 file (for more details see section 6.2). The parameter graph denotes the complete path of the main graph of the grammar, without omitting the extension .grf. The second parameter is optional. It indicates to the program whether the grammar needs to be checked for errors or not. By default, the program carries out this error check. The result is a file that carries the same name as the graph passed to the program as a parameter, but with the extension .fst2. This file is saved in the same folder as graph. 9.16 ImploseFst2. ImploseFst2 textauto -o out. This program calculate
114. l sequences in curly brackets are either the sentence delimiter {S} or a valid line of DELAF ({aujourd'hui,.ADV}). If the program finds curly brackets that are used differently, it gives a warning and replaces them by square brackets ([ and ]). The parameter txt represents the complete path of the text file. The program creates a modified version of the text that is saved in a file with the extension .snt. 9.21 PolyLex. PolyLex lang alph dic list out info. This program takes a file of unknown words, list, and tries to analyze each of the words as a compound obtained by combining simple words. The words that have at least one analysis are removed from the file of unknown words, and the dictionary lines that correspond to the analysis are appended to the file out. The parameter lang determines the language to use. The two possible values are GERMAN and NORWEGIAN. The parameter alph represents the alphabet file to use. The parameter dic specifies which dictionary to consult for the analysis. The parameter out specifies the file to which the produced dictionary lines are to be written; if that file already exists, the produced lines are appended at the end of the file. The optional parameter info specifies a text file in which the information about the analysis will be written. 9.22 Reconstrucao. Reconstrucao alph concord dic reverse_dic pro res. This program generates a normalization grammar d
115. [Figure 4.1: Result of the search for <!DIC>] [Figure: concordance window for ivanhoe.snt (concord.html)]
116. [Figure 4.2: Result of a search for the pattern <!MOT>] 4.4 Concatenation. There are three ways to concatenate regular expressions. The first consists in using the concatenation operator, which is represented by the period. Thus, the expression <DET>.<N> recognizes a determiner followed by a noun. The space can also be used for concatenation. The following expression: the <A> cat recognizes the lexical unit the, followed by an adjective and the lexical unit cat. Finally, it is possible to omit the period and the space before an opening parenthesis or the character <, as well as after a closing parenthesis or after the character >. The parentheses are used as delimiters of a regular expression. All of the following expressions are equivalent: the <A> cat; (the <A>)cat; the.<A>cat; (the)<A>.cat; (the(<A>))(cat). 4.5 Union. The union of regular expressions is expressed by putting the character + between them. The expression (I+you+he+she+it+we+they) <V> recognizes a pronoun
117. le a copy of this graph is constructed where the variables are replaced with the contents of the cell at the intersection of the column and the line in question. If a cell of the table contains the sign +, the corresponding variable is replaced by <E>. If the cell contains the sign -, the box containing the corresponding variable is removed, which at the same time makes the paths through that box unavailable. In all other cases, the variable is replaced by the contents of the cell. 8.2.2 Format of the table. The lexicon-grammar tables are usually represented with the aid of a spreadsheet like OpenOffice.org Calc. To make them usable with Unitex, the tables have to be encoded in Unicode text format in accordance with the following convention: the columns need to be separated by a tab and the lines by a newline. To convert a table with OpenOffice.org Calc, save it in text format (extension .csv). The program then lets you parameterize the saving of the file with a window like the one in figure XXX. Select Unicode and tab as the separator for columns, and leave the field text separator empty. During the generation of the graphs, Unitex skips the first line, considering it to be the headings of the columns. It is therefore necessary to ensure that the headings of the columns occupy exactly one line. If there is no line for the heading, the first line of the table will be ignored anyway, and if there are multiple heading lines, from the second line on they will be
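The table convention and the cell substitution rule can be sketched as follows. This is an illustration only, assuming that the two signs mentioned in the excerpt are + and - (the usual lexicon-grammar notation); the function names and the sample table are hypothetical, and Unicode decoding of the file is left out.

```python
def parse_table(text: str) -> list:
    """Split a tab-separated table into rows, skipping the heading line."""
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # ignore the empty string after a final newline
    rows = [line.split("\t") for line in lines]
    return rows[1:]  # the first line is treated as column headings


def resolve_cell(cell: str):
    """Value substituted for a variable, or None when the box is removed."""
    if cell == "+":
        return "<E>"
    if cell == "-":
        return None
    return cell


sample = "N0\tV\tPrep\n+\tbadauder\tvers\n-\tarchaïser\t-\n"
for row in parse_table(sample):
    print([resolve_cell(cell) for cell in row])
```

Because the first line is always dropped, a table without a heading line loses its first data row, which matches the warning in the excerpt.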
118. le at the moment to search for patterns on the text automaton, nor to use rules in order to eliminate ambiguities. 7.1 Displaying text automata. The text automaton can express all possible lexical interpretations of the words. These different interpretations are the different entries present in the dictionary of the text. Figure 7.1 shows the automaton of the fourth sentence of the text Ivanhoe. You can see in figure 7.1 that the word Here has three interpretations (adjective, adverb and noun), haunted two (adjective and verb), etc. All the possible combinations are expressed, because each interpretation of each word is connected to all the interpretations of the following and preceding words. In case of an overlap between a compound word and a sequence of simple words, the automaton contains a path that is labeled by the compound word, parallel to the paths that express the combinations of simple words. This is illustrated in figure 7.2, where the compound word courts of law overlaps with a combination of simple words. By construction, the automaton of the text does not contain any loops. One says that the text automaton is acyclic. NOTE: the term text automaton is an abuse of language. In fact, there is an automaton for each sentence of the text. The combination of all these automata corresponds to the automaton of the text. Therefore, the term text automaton is used even if this object is not really manipulated f
119. le boxes. When the boxes are selected, you can move them by clicking and dragging the cursor without releasing the button. In order to cancel the selection, click on an empty area of the graph. If you click on a box, all boxes of the selection will be connected to it. You can perform a copy-paste using several boxes. Select them and press <Ctrl+C>, or click on Copy in the Edit menu. The selection is now in the Unitex clipboard. You can then paste this selection by pressing <Ctrl+V> or by selecting Paste in the Edit menu. NOTE: you can paste a multiple selection into a different graph than the one you copied it from. In order to delete boxes, select them and delete the text that they contain: delete the text presented in the text field above the window and press the enter key. The initial and final states cannot be deleted. [Figure 5.9: Copy-paste of a multiple selection] 5.2.5 Transducers. A transduction is an output associated with a box. To insert a transduction, use the special character /. All characters to the right of it will be part of the transduction. Thus, the text un+deux+trois/nombre results in a box like the one in figure 5.10. [Figure 5.10: Example of a transduction] The transduction associated with a box is represented in bold text below it. 5.2.6 Using Variables. It is possible to select parts of a text recognized by a grammar using variab
120. les. To associate a variable var1 with parts of a grammar, use the special symbols $var1( and $var1) to define the beginning and the end of the part to store. Create two boxes, one containing $var1( and the other $var1). These boxes must not contain anything but the variable name preceded by $ and followed by a parenthesis. Then link these boxes to the zone of the grammar to store. In the graph in figure 5.11, you see a sequence beginning with an upper-case letter after Mister or Mr. This sequence will be stored in a variable named var1. [Figure 5.11: Using the variable var1] The variable names may contain letters without accents (upper or lower case), digits, or the _ (underscore) character. Unitex distinguishes between upper-case and lower-case characters. When a variable is defined, you can use it in transductions by preceding its name with $. The grammar in figure 5.12 recognizes a date formed by a month and a year, and produces the same date as an output, but in the order year-month. [Figure 5.12: Inverting month and year in a date] 5.2.7 Copying Lists. It can be practical to perform a copy-paste operation on a list of words or expressions from a text editor to a box in a graph. In order to avoid having to copy every term manually
121. lt nombre gt P lt pers gt lt nombre gt S lt pers gt lt nombre gt T lt pers gt lt nombre gt X ls euss duss puiss fuss Je Y lp Y 2 lt nombre gt K lt genre gt lt nombre gt The symbol indicates that the remainder of the line is a comment A comment can appear at any place in the file The file always starts with the word NAME followed by an identifier francais for example This is followed by the sections POS for Part Of Speech for each grammatical category Each section describes the structure of the labels of the lexical entries belonging to the grammatical category concerned Each section is composed of 4 parts which are all optional e inflex this part enumerates the inflectional codes belonging to the grammatical cat egory For example the codes 1 2 3 which indicate the person of the entry are relevant for pronouns but not for adjectives Each line describes an inflexional attribute gender time etc and is made up of the attribute name followed by the character and the values which it can take For example the following line declares an attribute pers being able to take the values 1 2 or 3 7 3 RESOLVING LEXICAL AMBIGUITIES WITH ELAG 113 pers 1 2 3 e cat this part declares the syntactic and semantic attributes which can be allotted to the entries belonging to the grammatical category concerned Each line describes an at tribute and the value
122. n constructed for all simple verbs in French as a way of describing their relevant properties. Experience has shown that every word has a quasi-unique behavior, and these tables are a way of presenting the grammar of every element in the lexicon, hence the name lexicon-grammar for this linguistic theory. Unitex offers a way to directly construct grammars from these tables. Unitex can be viewed as a tool with which one can put these linguistic resources to use. Its technical characteristics are its portability, its modularity, the possibility of dealing with languages that use special writing systems (e.g. many Asian languages), and its openness, thanks to its open source distribution. Its linguistic characteristics are the ones that have motivated the elaboration of these resources: precision, completeness, and the taking into account of frozen expressions, most notably those which concern the enumeration of compound words.
The first chapter describes how to install and run Unitex. Chapter 2 presents the different steps in the analysis of a text. Chapter 3 describes the formalism of the DELA electronic dictionaries and the different operations that can be applied to them. Chapters 4 and 5 present different means for making text searches more effective. Chapter 5 describes in detail how the graph editor is used. Chapter 6 is concerned with the different possible applications of grammars. The particularities of each type of grammar
123. nd discriminants which appear in the dictionaries must be described in the file tagset.def, or else the corresponding entries will be rejected by ELAG. If words of the same subcategory differ by their inflectional features, it is necessary to write several lines in the complete part. The disadvantage of this method of description is that it becomes difficult to make the distinction between such words in an ELAG grammar. If one considers the description given by the previous example, certain adjectives of French take a gender and a number, whereas others do not have any inflectional feature. This is for example the case with fixed sequences like de bonne humeur, which have a syntactic behavior very close to that of adjectives. Such sequences were thus integrated into the French dictionary as invariable adjectives without inflectional features. The problem is that if one wants to refer exclusively to this type of adjective in a disambiguation grammar, the symbol <A> is not appropriate, since it will recognize all adjectives. To circumvent this difficulty, it is possible to deny an inflectional attribute by writing the character @ right before one of the possible values for this attribute. Thus, the symbol <A:@m@p> recognizes all the adjectives which have neither a gender nor a number. Using this operator, it is now possible to write grammars like those in figure 7.19, which imposes the agreement in gender and num
124. now edit the screen image. Select the area that interests you. To do so, switch to the select mode by clicking on the dashed rectangle symbol in the upper left corner of the window. You can now select the area of the image using the mouse. When you have selected the zone, press <Ctrl+C>. Your selection is now in the clipboard; go to your document and press <Ctrl+V> to paste your image.
In Linux: Take a screen capture (for example using the program xv). Edit your image using a graphic editor (for example The Gimp), and paste your image into your document in the same way as in Windows.
5.4.2 Printing a Graph
You can print a graph by clicking on Print in the FSGraph menu or by pressing <Ctrl+P>. ATTENTION: You should make sure that the page orientation parameter (portrait or landscape) corresponds to the orientation of your graph. You can specify the printing preferences by clicking on Page Setup in the FSGraph menu. You can also print all open graphs by clicking on Print All.
For those who want to get a vector graphic (small and scalable):
1. Use the Unitex Print Graph menu and print the graph to a PostScript file.
2. Clean the PostScript by typing gs -sDEVICE=pswrite -dNOPAUSE -dBATCH -sOutputFile=clean.ps graph.ps in your shell. You now get a smaller file. Have a look at it using gv.
3. Now you can convert the graph with convert into various image formats.
125. nsiderably the compile time of a grammar. Generally, a grammar having many then parts can be rewritten with one or two then parts without a loss of legibility. It is for example the case of the grammar in figure 7.21, which imposes a constraint between a verb and the pronoun which follows it.
Figure 7.21: ELAG grammar checking the agreement between verb and pronoun
As one can see in figure 7.22, one can write an equivalent grammar by factorizing all the then parts into only one. The two grammars will have exactly the same effect on the text automaton, but the second one will be compiled much more quickly.
Figure 7.22: Optimized ELAG grammar checking the agreement between verb and pronoun
Utilizing lexical symbols
It is better to use lemmas only when it is absolutely necessary. That is particularly true for grammatical words, when their subcategories carry almost as much of inf
126. nt to rename the file differently than with the suffix old, you can use the Transcode Files command in the File Edition menu. This command enables you to choose source and target encodings of the documents to be converted (see figure 2.3). By default, the proposed source encoding is that which corresponds to the current language, and the destination encoding is Unicode Little Endian. You can modify these choices by selecting any source and target encoding. Thus, if you wish, you can convert your data into other encodings, for example UTF-8, in order for instance to create web pages. The button Add Files enables you to select the files to be converted. The button Remove Files makes it possible to remove a list of files erroneously selected. The button Transcode will start the conversion of all the selected files. If an error occurs when a file is processed (for example, a file which is already in Unicode), the conversion continues with the next file.
Footnote: Unitex also proposes to automatically convert graphs and dictionaries that are not in Unicode Little Endian.
[Transcode Files window: source encoding, destination encoding, renaming options, and list of selected files]
127. o produce a string of characters. This string is then appended to the line of the produced dictionary (cf. chapter 3.4). Transductions with variables do not make sense in an inflection graph. The contents of an inflection graph are manipulated without a change of case: lowercase letters stay lowercase, and the same holds for uppercase letters. Besides, the connection of two boxes is exactly equivalent to the concatenation of their contents together with the concatenation of their transductions (cf. figure 6.2).
Figure 6.2: Two equivalent paths in an inflection grammar
The inflection graphs have to be compiled before being used by the inflection program.
6.1.2 Preprocessing graphs
Preprocessing graphs are meant to be applied to texts before they are tokenized into lexical units. These graphs can be used for inserting or replacing sequences in the texts. The two normal uses of these graphs are normalization of non-ambiguous forms and sentence boundary recognition. The interpretation of these graphs in Unitex is very close to that of syntactic graphs used by the search for patterns. The differences are the following:
• you can use the special symbol <^> that recognizes a newline;
• it is impossible to refer to dictionaries;
• it is necessary to compile these graphs before they can be used for preprocessing operations.
The figures 2.9 and 2.10 show examples of preprocessing gra
128. of the sentence.
10.5.4 The file cursentence.txt
During the extraction of the sentence automaton, the text of the sentence is saved in the file called cursentence.txt. That file is used by Unitex to display the text of the sentence under the automaton. It contains the text of the sentence, followed by a newline.
10.6 Concordances
10.6.1 The file concord.ind
The file concord.ind is the index of the occurrences found by the program Locate during the application of a grammar. It is a text file that contains the start and end positions of each occurrence, possibly accompanied by a sequence of letters if the concordance has been obtained by taking into account the possible transductions of the grammar. Here is an example of a file:
#M¶
59 63 the ADJ greater part¶
67 71 the beautiful hills¶
87 91 the pleasant town¶
123 127 the noble seats¶
157 161 the fabulous Dragon¶
189 193 the Civil Wars¶
455 459 the feeble interference¶
463 467 the English Council¶
568 572 the national convulsions¶
592 596 the inferior gentry¶
628 632 the English constitution¶
698 702 the petty kings¶
815 819 the certain hazard¶
898 902 the great Barons¶
940 944 the very edge¶
The first line indicates in which transduction mode the concordance has been constructed. The three possible values are:
• I: the transductions have been ignored;
• M: the transductions have been inserted into th
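Reading such an index can be sketched as follows. This is an illustration under stated assumptions, not the parser used by Unitex: the function name is hypothetical, the input is assumed to be already decoded into a Python string, fields are assumed to be separated by single spaces, and the header string "#M" used in the sample is an assumption about the exact spelling of the first line.

```python
def parse_concord_ind(text):
    """Split an index into its header line and (start, end, output) entries."""
    lines = text.splitlines()
    header = lines[0]  # transduction mode line, kept as an opaque string
    entries = []
    for line in lines[1:]:
        if not line.strip():
            continue  # skip empty lines
        # The first two fields are positions; the rest is the recognized text.
        parts = line.split(" ", 2)
        start, end = int(parts[0]), int(parts[1])
        output = parts[2] if len(parts) > 2 else ""
        entries.append((start, end, output))
    return header, entries


# Sample content; "#M" is an assumed spelling of the mode line.
sample = "#M\n59 63 the ADJ greater part\n67 71 the beautiful hills\n"
header, entries = parse_concord_ind(sample)
```

Keeping the header opaque avoids committing to a particular spelling of the mode line.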
129. on without reading any text. The fact that neither of these two graphs has labels between the initial state and the call to the subgraph is crucial. In fact, if there were at least one label different from epsilon between the beginning of the graph Det and the call to DetCompose, this would mean that the Unitex programs exploring the graph Det would have to read the pattern described by that label in the text before calling DetCompose recursively. In this case, the programs would loop infinitely only if they recognized the pattern an infinite number of times in the text, which is impossible.
Figure 6.10: Infinite loop caused by two graphs calling each other
6.2.4 Error detection
In order to keep the programs from blocking or crashing, Unitex automatically detects errors during graph compilation. The graph compiler verifies that the principal graph does not recognize the empty word and searches for all possible forms of infinite loops. When an error is encountered, an error message is displayed in the compilation window. Figure 6.11 shows the message that appears if one tries to compile the graph Det of figure 6.10:
Compiling graph DetCompose
Recursion detection started
Resolving <E> conditions
Checking <E> dependancies
Looking for <E> loops
Looking for infinite recursions
Recursion detection completed
ERROR Det calls DetCompose that recalls the gra
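The kind of infinite recursion described here can be sketched as a cycle search. This is not the algorithm of the Unitex graph compiler, which is not documented in this section; it is a minimal illustration, with hypothetical names, that assumes we already know, for each graph, which subgraphs it can call before reading any text.

```python
def find_infinite_recursion(calls_before_reading):
    """Return a cycle of graph names as a list, or None if there is none.

    calls_before_reading maps a graph name to the subgraphs it can call
    without having read any text (a hypothetical input format).
    """
    visiting = []  # graphs on the current depth-first path
    done = set()   # graphs fully explored without finding a cycle

    def visit(graph):
        if graph in visiting:
            # The graph is reachable from itself without reading any text.
            return visiting[visiting.index(graph):] + [graph]
        if graph in done:
            return None
        visiting.append(graph)
        for callee in calls_before_reading.get(graph, []):
            cycle = visit(callee)
            if cycle is not None:
                return cycle
        visiting.pop()
        done.add(graph)
        return None

    for graph in calls_before_reading:
        cycle = visit(graph)
        if cycle is not None:
            return cycle
    return None
```

For the two graphs of the example, the input {"Det": ["DetCompose"], "DetCompose": ["Det"]} yields the cycle Det, DetCompose, Det.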
130. on for each sentence. If this is not the case, the program arbitrarily cuts the text into sequences of 2000 lexical units and produces an automaton for each of these sequences. The result is a file called text.fst2, which is saved in the directory of the text.
Chapter 10 File formats
This chapter presents the formats of files read or generated by Unitex. The formats of the DELAS and DELAF dictionaries have already been presented in sections 3.1.1 and 3.1.2. NOTE: in this chapter, the symbol ¶ represents the newline symbol. Unless otherwise indicated, all text files described in this chapter are encoded in Unicode Little Endian.
10.1 Unicode Little Endian encoding
All text files processed by Unitex have to be encoded in Unicode Little Endian. This encoding allows the representation of 65536 characters by coding each of them in 2 bytes. In Little Endian, the bytes are in lo-byte hi-byte order. If this order is reversed, we speak of Big Endian. A text file encoded in Unicode Little Endian starts with the special character with the hexadecimal value FEFF. The newline symbols have to be encoded by the two characters 000D and 000A. Consider the following text:
Unitex¶
β-version¶
Here is its representation in Unicode Little Endian:
[Table 10.1: Hexadecimal representation of a Unicode text]
The hi-bytes and lo-bytes have been reversed, which explains why the start character is encoded as
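The byte layout described in section 10.1 can be reproduced with a few lines of Python. This is a sketch for illustration: the function names are hypothetical, and Python's utf-16-le codec is assumed to be an acceptable stand-in for the 2-byte encoding described here (the two agree for the 65536 characters mentioned above).

```python
BOM = "\ufeff"  # start character with hexadecimal value FEFF


def encode_unitex_text(text):
    """Encode text as 2-byte little-endian units, with BOM and CR LF newlines."""
    # Normalize every newline to the pair 000D 000A.
    normalized = text.replace("\r\n", "\n").replace("\n", "\r\n")
    return (BOM + normalized).encode("utf-16-le")


def hex_units(data):
    """Show the encoded bytes two by two, in the order stored in the file."""
    return [data[i:i + 2].hex().upper() for i in range(0, len(data), 2)]


units = hex_units(encode_unitex_text("Un\n"))
```

In the result, the start character FEFF appears as FFFE and the letter U (0055) as 5500, because the lo-byte is stored first.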
131. on of the text automaton before applying disambiguation grammars to it. This normalization is carried out by applying to the text automaton a grammar not imposing any constraint, like that of figure 7.20. Note that this grammar is normally present in the Unitex distribution and precompiled in the file norm.rul.
Figure 7.20: ELAG grammar expressing no constraint
The result of applying this grammar is that the original automaton is cleaned of all the codes which are either not described in the file tagset.def or do not conform to this description (because of unknown grammatical categories or invalid combinations of inflectional features). By then replacing the text automaton by this normalized automaton, one can be sure that later modifications of the automaton will be only due to the effects of ELAG grammars.
7.3.7 Grammar Optimization
The compilation of ELAG grammars, carried out by the elagcomp program, consists in building an automaton whose language is the set of the sequences of lexical entries (or lexical interpretations of a sentence) which are not rejected by the grammars. This task is complex and can take up much time. It is however possible to appreciably accelerate it by observing certain principles at the time of writing grammars.
Limiting the number of branches in the then part
It is recommended to limit the number of then parts of a grammar to a minimum. This can reduce co
132. or ambiguous forms.
6.5.1 Insertion to the left of the matched pattern
When a transducer is applied in REPLACE mode, the output replaces the sequences that have been read in the text. In MERGE mode, the output is inserted to the left of the recognized sequences. Look at the transducer in figure 6.16.
Figure 6.16: Example of a transducer
If this transducer is applied to the novel Ivanhoe by Sir Walter Scott in MERGE mode, the following concordance is obtained:
[Concordance window showing occurrences such as: the ADJ adjacent forest, the ADJ antique Colleges, the ADJ armed rider, the ADJ beautiful hills, the ADJ best Flanders cloth]
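The difference between the two modes can be sketched on a list of tokens. This is a simplified illustration, not the behavior of the Locate program in every case: the function name is hypothetical, and a single output is assumed for the whole recognized sequence.

```python
def apply_output(tokens, start, end, output, mode):
    """Apply one output to the recognized sequence tokens[start:end]."""
    matched = tokens[start:end]
    if mode == "MERGE":
        # The output is inserted to the left of the recognized sequence.
        return tokens[:start] + [output] + matched + tokens[end:]
    if mode == "REPLACE":
        # The output takes the place of the recognized sequence.
        return tokens[:start] + [output] + tokens[end:]
    raise ValueError("unknown mode: " + mode)


tokens = ["the", "adjacent", "forest"]
merged = apply_output(tokens, 1, 2, "ADJ", "MERGE")
replaced = apply_output(tokens, 1, 2, "ADJ", "REPLACE")
```

With the recognized sequence "adjacent", MERGE gives "the ADJ adjacent forest", as in the concordance above, while REPLACE gives "the ADJ forest".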
133. or practical reasons.
[FST-Text window showing the automaton of sentence 47: Here haunted of yore the fabulous Dragon of Wantley]
Figure 7.1: Example of the automaton of a sentence
7.2 Construction
In order to construct the text automaton, open the text, then click on Construct FST-Text in the menu Text. One should first split the text at sentence boundaries and apply the dictionaries. If sentence boundary detection is not applied, the construction program will split the text arbitrarily into sequences of 2000 lexical units instead of constructing one automaton per sentence. If the dictionaries are not applied, the sentence automaton that you obtain will consist of only one path made up of unknown words.
7.2.1 Construction Rules For Text Automata
The sentence automata are constructed starting from the text dictionaries. The obtained degree of ambiguity is therefore directly linked to the granularity of the descriptions of the dictionaries used. From the sentence automaton in figure 7.3, you can conclude that the word "which" has been coded twice as a determiner, in two subcategories of the category DET. This granularity of descriptions will not be of any use if you are not interested in the grammatical category of this word. It is therefore ne
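The arbitrary split into sequences of 2000 lexical units can be pictured as plain fixed-size chunking. This is a sketch with a hypothetical function name; how the construction program actually chooses the cut points is not described here beyond the size of the sequences.

```python
def split_into_sequences(tokens, size=2000):
    """Cut a list of lexical units into consecutive sequences of at most size units."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


# 4500 placeholder units give two full sequences and one shorter one.
sequences = split_into_sequences(list(range(4500)))
lengths = [len(s) for s in sequences]
```

Each resulting sequence would then get its own automaton, which is why applying sentence boundary detection first gives more meaningful units.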
134. ormation than the lemmas themselves. If you, despite everything, use a lemma in a symbol, it is recommended to specify its syntactic, semantic, and inflectional features as much as possible. For example, with the dictionaries provided for French, it is preferable to replace symbols like <je.PRO:1s>, <je.PRO+PpvIL:1s>, and <je.PRO> with the symbol <PRO+PpvIL:1s>. Indeed, all these symbols are identical insofar as they can recognize only the single entry of the dictionary je,.PRO+PpvIL:1ms:1fs. However, as the program cannot deduce this information automatically, if all these features are not specified, the program will consider nonexisting labels such as <je.PRO:3p>, <je.PRO+PronQ>, etc. in vain.
7.4 Manipulation of text automata
7.4.1 Displaying sentence automata
As we have seen above, the text automaton is in fact the collection of the sentence automata of a text. This structure can be represented using the format .fst2 used for representing the compiled grammars. This format does not make it possible to directly display the sentence automata. It is therefore necessary to use the program Fst2Grf for converting the sentence automaton into a graph that can be displayed. This program is called automatically when you select a sentence, in order to generate the corresponding .grf file. The generated .grf files are not interpreted in the same manner as the .grf f
136. ph Det
Figure 6.11: Error message when trying to compile Det
If you have started a pattern search by selecting a graph of the format .grf and Unitex discovers an error, the operation is automatically interrupted.
6.3 Exploring grammar paths
It is possible to generate the paths recognized by a grammar, for example to verify that it correctly generates the expected forms. For that, open the main graph of your grammar, and ensure that the graph window is the active window (the active window has a blue title bar, while the inactive windows have a gray title bar). Now go to the menu FSGraph, then to the Tools menu, and click on Explore Graph paths. The window of figure 6.12 appears.
Figure 6.12: Exploring the paths of a grammar
The upper box contains the name of the main graph of the grammar to be explored. The following options relate to the outputs of the grammar:
• Ignore outputs: the outputs are ignored;
• Separate inputs and outputs: the outputs are displayed after the inputs (a b c A B C);
• Merge inputs and outputs: every output is displayed immediately after the input to which it corresponds (a A b B c C).
If the option Maximum number of sequences is activated, the
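The three output options can be mimicked on a path represented as a list of (input, output) pairs. This is a sketch only: the function name and the mode names are hypothetical, and the single space used as separator is an assumption about the display.

```python
def render_path(path, mode):
    """Render a path given as (input, output) pairs under one of three modes."""
    inputs = [inp for inp, _ in path]
    outputs = [out for _, out in path]
    if mode == "ignore":
        return " ".join(inputs)
    if mode == "separate":
        # All inputs first, then all outputs.
        return " ".join(inputs + outputs)
    if mode == "merge":
        # Each output directly follows the input it belongs to.
        return " ".join(item for pair in path for item in pair)
    raise ValueError("unknown mode: " + mode)


path = [("a", "A"), ("b", "B"), ("c", "C")]
```

For this path, the separate mode gives "a b c A B C" and the merge mode gives "a A b B c C".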
137. ph name ending with conc.fst2. This graph corresponds to the if part of the grammar. You can thus obtain the occurrences of the text to which the grammar will apply. NOTE: The conc.fst2 file used to locate the then part of a grammar is generated at the time when ELAG grammars are compiled by means of the compile button. It is thus necessary to have your grammar compiled before searching using the locate button.
7.3.3 Resolving Ambiguities
Once you have compiled your grammar into an elag.rul file, you can apply it to a text automaton. In the text automaton window, click on the elag button. A dialog box will appear which asks for the .rul file to use (see figure 7.17). As the default file is likely to be elag.rul, simply click on OK. This will launch the Elag program, which will resolve the ambiguities. Once the program has finished, you can view the resulting automaton by clicking on the Elag Frame button. As you can see in figure 7.18, the window is separated into two parts: the original text automaton can be seen at the top, and the result at the bottom.
Figure 7.17: Text automaton window
Don't be surprised if the automaton shown at the bottom seems more complicated
138. phs.
6.1.3 Graphs for normalizing the text automaton
The graphs for normalization of the text automaton make it possible to normalize ambiguous forms. In fact, they can describe several labels for the same form. These labels are then inserted into the text automaton, thus making the ambiguities explicit. Figure 6.3 shows an extract of the normalization graph used for French.
Figure 6.3: Extract of the normalization graph used for French
The paths describe the forms that have to be normalized. Lowercase and uppercase variants are taken into account according to the following principle: uppercase letters in the graph only recognize uppercase letters in the text automaton; lowercase letters can recognize both lowercase and uppercase letters. The transductions represent the sequence of the labels that will be inserted into the text automaton. These labels can be dictionary entries or strings of characters. The labels that represent entries of the dictionary have to respect the format for entries of a DELAF and are enclosed by the symbols { and }. The transductions with variables do not make sense in this kind of graph. It is possible to reference subgraphs. It is not possible to reference dictionaries in order to describe the forms to normalize. The only special symbol that is recognized in this type of graph is the empty word <E>. The graphs for normalizing ambiguous forms n
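The case principle stated above can be written as a character-by-character comparison. This is a minimal sketch with hypothetical function names, assuming that a form in the graph and a form in the text are compared position by position and that Python's notion of upper and lower case is close enough to the one used by Unitex.

```python
def char_matches(graph_char, text_char):
    """Uppercase in the graph matches only itself; lowercase matches both cases."""
    if graph_char.isupper():
        return text_char == graph_char
    return text_char == graph_char or text_char.lower() == graph_char


def form_matches(graph_form, text_form):
    """Compare two forms of equal length character by character."""
    return len(graph_form) == len(text_form) and all(
        char_matches(g, t) for g, t in zip(graph_form, text_form)
    )
```

Under this rule, the graph form de recognizes de, De and DE, while the graph form De recognizes only De.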
139. ptional and may be intro duced by the character These comments are left out when the dictionaries are com pressed IMPORTANT REMARK It is possible to use the full stop and the comma within a dic tionary entry In order to do this they have to be escaped using the character 1 000 one thousand NUMBER United Nations U N ACRONYM ATTENTION Each character is taken into account within a dictionary line For example if you insert spaces they are considered to be a part of the information In the following line hath have V P3s old form of has The space that precedes the character will be considered to be one of the 4 inflectional codes P 3 s and space 3 1 THE DELA DICTIONARIES 31 It is possible to insert comments into a DELAF or DELAS dictionary by starting the line with a character Example in the next entry the backslash escapes the comma 1 000 one thousand NUMBER Compound words with spaces or dashes Certain compound words like acorn shell can be written using spaces or dashes In order to avoid duplicating the entries it is possible to use the character At the time when the dictionary is compressed the Compress program verifies for each line if the inflected or canonical form contains a non escaped character If this is the case the programm replaces this by two entries The one where the character is replaced by a space and one where it is replaced by a dash Thus the following ent
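The replacement of a compound-word entry by a space variant and a dash variant can be sketched as follows. This is an illustration, not the Compress program: the function name is hypothetical, and both the special character (assumed to be "=") and the escape character (assumed to be the backslash) are assumptions of this sketch.

```python
import re

MARKER = "="  # assumption: the special character that stands for space or dash


def expand_compound(form, marker=MARKER):
    """Return the variants of a form that may contain the special character.

    A non-escaped marker yields two variants (space and dash); an escaped
    marker is kept as a plain character.
    """
    unescaped = re.compile(r"(?<!\\)" + re.escape(marker))
    escaped = "\\" + marker
    if not unescaped.search(form):
        # Nothing to expand: only remove the escape character.
        return [form.replace(escaped, marker)]
    variants = [unescaped.sub(" ", form), unescaped.sub("-", form)]
    return [variant.replace(escaped, marker) for variant in variants]
```

Under these assumptions, acorn=shell expands to "acorn shell" and "acorn-shell", whereas the escaped form E\=mc2 stays a single form, E=mc2.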
140. r is no, the program extracts all sentences that do not contain any occurrences in the concordance. The parameter text represents the complete path of the text file, without omitting the extension .snt. The parameter concordance represents the complete path of the concordance file, without omitting the extension .ind. The parameter result represents the name of the file in which the extracted sentences are to be saved. The result file is a text file that contains all extracted sentences, one sentence per line.
9.11 Flatten
Flatten fst2 type depth
This program takes any grammar as its parameter and tries to transform it into a finite state transducer. The parameter fst2 specifies the grammar to transform. The parameter type specifies which kind of grammar the result grammar should be. If this parameter is FST, the grammar is unfolded to maximum depth and is truncated if there are further calls to sub-graphs. The result is a grammar in .fst2 format that contains only a single finite state transducer. If the parameter is RTN, the calls to sub-graphs that could remain after the transformation are left as they are. The result is therefore a finite state transducer in the best case, and an optimized grammar strictly equivalent to the original grammar otherwise. The optional parameter depth specifies the maximum nesting depth of the sub-graphs that are generated by the program. The default value is 10.
9.12 Fst2Grf
Fs
141. r more boxes and connect it or them to another one. In contrast to the normal mode, the connections are inserted to the box where the mouse button was released;
• connect boxes to another box in the opposite direction: this utility performs the same operation as the one described above, but connects the boxes to the one clicked on in the opposite direction;
• open a sub-graph: opens a sub-graph when you click on a grey line within a box.
5.3 Display options
5.3.1 Sorting the lines of a box
You can sort the contents of a box by selecting it and clicking on Sort Node Label in the Tools submenu of the FSGraph menu. This sort operation doesn't use the SortTxt program. It uses a basic sort mechanism that sorts the lines of the box according to the order of the characters in the Unicode encoding.
5.3.2 Zoom
The Zoom submenu allows you to choose the zoom scale that is applied to display the graph.
Figure 5.16: Zoom Sub-Menu
The option Fit in screen stretches or shrinks the graph in order to fit it into the screen. The option Fit in window adjusts the graph for it to be displayed completely in the window.
5.3.3 Antialiasing
Antialiasing is a shading effect that avoids pixelisation effects. You can activate this effect by clicking on Antialiasing in the Format sub-menu. Figure 5.17 shows one graph displayed normally (the graph on top) and with antialiasing (the graph at the bottom).
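Sorting by the order of the characters in the Unicode encoding behaves differently from an alphabetical sort, which a short sketch makes visible. The function name is hypothetical; Python compares strings by code point, which is assumed here to be equivalent to the basic mechanism described for the characters concerned.

```python
def sort_box_lines(lines):
    """Sort lines by plain Unicode code point order, with no alphabet file."""
    return sorted(lines)


sorted_lines = sort_box_lines(["été", "zoo", "Zoo", "abc"])
```

All uppercase letters come before lowercase ones, and accented letters such as é come after z, so Zoo is placed first and été last.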
142. rdance. In order to do that, choose a file name in the field Modify text in the window of figure 6.28. This file has to have the extension .txt. If you want to modify the current text, you have to choose the corresponding .txt file. If you choose another file name, the current text will not be affected. Click on the GO button to start the modification of the text. The precedence rules that are applied during these operations are described in section 3.6.2. After this operation, the resulting file is a copy of the text in which all transductions have been taken into account. The normalization operations and the splitting into lexical units are automatically applied to this text file. The existing text dictionaries are not modified. Thus, if you have chosen to modify the current text, the modifications will be effective immediately. You can then start new searches on the text. ATTENTION: if you have chosen to apply your graph ignoring the transductions, all occurrences will be erased from the text.
Chapter 7 Text automata
Natural languages contain lots of lexical ambiguities. The text automaton is an effective and visual means of representing these ambiguities. Each sentence of a text is represented by an automaton, the paths of which express all possible interpretations. This chapter presents the text automata, the details of their construction, and the operations that can be applied. It is not possib
143. [Concordance window listing occurrences from a French text, sorted by right context]
Figure 4.8: Example concordance
Chapter 5 Local grammars
Local grammars are a powerful tool to represen
144. [Text window showing an excerpt of Ivanhoe in which the selected occurrence is highlighted]
Figure 6.29: Selection of an occurrence in the text
Furthermore, if the text automaton has been constructed, and if the corresponding window is not iconified, clicking on an occurrence selects the automaton of the sentence that contains this occurrence.
6.6.3 Modification of the text
You can choose to modify the text instead of constructing a conco
145. roduced sequences. The third mode ignores all transductions. This latter mode is used by default. After you have selected the parameters, click on SEARCH to start the search. 6.6.2 Concordance. The result of a search is an index file that contains the positions of all encountered occurrences. The window of figure 6.28 lets you choose whether to construct a concordance or modify the text. [Figure residue: Locate Pattern window.] Figure 6.27: Window for pattern search. In order to display a concordance, you have to click on the button Build concordance. You can set the size of the left and right contexts in characters. You can also choose the sorting mode that will be applied to the lines of the concordance in the menu Sort According to. For further details on the parameters of concordance construction, refer to section 4.8.2. [Figure residue: window for displaying indexed sequences, modifying the text and extracting units.]
146. ry acorn=shells,acorn=shell.N:p is replaced by the following entries: acorn shells,acorn shell.N:p and acorn-shells,acorn-shell.N:p. NOTE: If you want to keep an entry that includes the character =, escape it using \ as in the following example: E\=mc2,.FORMULA. This replacement is done when the dictionary is compressed. In the compressed dictionary, the escaped characters \= are replaced by simple =. As such, if a dictionary containing the following lines is compressed: E\=mc2,.FORMULA and acorn=shell,.N:s, and if the dictionary is applied to the following text: Formulas like E=mc2 have nothing to do with acorn-shells. you will get the following lines in the dictionary of compound words of the text: E=mc2,.FORMULA and acorn-shells,.N:p. Entry factorization. Several entries containing the same inflected and canonical forms can be combined into a single one if they have the same grammatical and semantic codes. Among other things, this allows us to combine identical conjugations for a verb: bottle,.V:W:P1s:P2s:P1p:P2p:P3p. If the grammatical and semantic information differ, one has to create distinct entries: bottle,.N+Conc:s and bottle,.V:W:P1s:P2s:P1p:P2p:P3p. Certain entries that have the same grammatical and semantic codes can have different senses, as is the case for the French word poêle, which describes a stove or a net in the masculine sense and a kitchen instr
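The replacement performed at compression time can be illustrated with a short Python sketch. It is not the Compress program: the function name `expand_entry` is hypothetical, the special character is assumed to be `=` and the escape character `\`, and the sketch rewrites the whole line rather than only the inflected and canonical forms.

```python
import re

# An "=" that is not preceded by a backslash (assumed escape character).
UNESCAPED_EQ = re.compile(r"(?<!\\)=")

def expand_entry(line):
    """Return the entries produced from one dictionary line.

    A line with an unescaped "=" yields two variants, one with spaces
    and one with dashes. Escaped "\\=" sequences become plain "=".
    """
    if UNESCAPED_EQ.search(line):
        variants = [UNESCAPED_EQ.sub(" ", line), UNESCAPED_EQ.sub("-", line)]
    else:
        variants = [line]
    return [v.replace("\\=", "=") for v in variants]

if __name__ == "__main__":
    for entry in expand_entry("acorn=shells,acorn=shell.N:p"):
        print(entry)
    print(expand_entry(r"E\=mc2,.FORMULA"))
```

A line without any unescaped `=` comes back unchanged apart from the unescaping, which mirrors the E=mc2 example.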
147. s are collected into the concordance. On the other hand, if you modify a text instead of constructing a concordance, it is necessary to choose among these occurrences the one that will be taken into account. Unitex applies the following priority rule for that purpose: the leftmost sequence is used. If this rule is applied to the three occurrences of the preceding concordance, the occurrence in ancient overlaps with ancient times. The first is retained because it is the leftmost occurrence, and ancient times is eliminated. The following occurrence, times a, is then no longer in conflict with ancient times and can therefore appear in the result: Don there extended in ancient times a large forest. The rule of priority of the leftmost match is applied only when the text is modified, be it during preprocessing or after the application of a syntactic graph (cf. section 6.6.3). 6.5.4 Priority of the longest match. During the application of a syntactic graph, you can choose whether priority is given to the shortest or the longest sequences, or whether all sequences are retained. During preprocessing, priority is always given to the longest sequences. 6.5.5 Transductions with variables. As we have seen in section 5.2.6, it is possible to use variables to store the text that has been analyzed by a grammar. These variables can be used in the preprocessing graphs and in the syntactic graphs. You have to give nam
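The leftmost-match rule can be sketched in a few lines of Python. This is an illustration only, not the Unitex implementation: occurrences are assumed to be (start, end) token spans with an exclusive end, and the function name `keep_leftmost` is hypothetical. How ties between occurrences starting at the same position are resolved is not covered here, since that depends on the shortest/longest match setting.

```python
def keep_leftmost(spans):
    """Keep non-overlapping occurrences, giving priority to the leftmost one.

    spans: iterable of (start, end) token positions, end exclusive.
    """
    kept = []
    next_free = 0  # first token position not yet covered by a kept span
    for start, end in sorted(spans):
        if start >= next_free:
            kept.append((start, end))
            next_free = end
    return kept

if __name__ == "__main__":
    tokens = "Don there extended in ancient times a large forest".split()
    # "in ancient" = (3, 5), "ancient times" = (4, 6), "times a" = (5, 7)
    occurrences = [(3, 5), (4, 6), (5, 7)]
    for start, end in keep_leftmost(occurrences):
        print(" ".join(tokens[start:end]))
```

With the three occurrences of the example, "ancient times" is dropped because it overlaps the retained "in ancient", after which "times a" no longer conflicts with anything.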
148. s the compact form of the text automaton and stores it in out. 9.17 Inflect. Inflect delas result dir. This program carries out the automatic inflection of a DELAS dictionary. The parameter delas indicates the name of the dictionary to inflect. The parameter result indicates the name of the dictionary to be generated. The parameter dir indicates the complete path of the directory containing the inflection transducers that the delas dictionary refers to. The result of the inflection is a DELAF dictionary saved under the name indicated by the parameter result. 9.18 Locate. Locate text fst2 alphabet s/l/a i/m/r n thai space. This program applies a grammar to a text and constructs an index file of the occurrences found. The following are its parameters: • text: complete path of the text file, without omitting the extension .snt • fst2: complete path of the grammar, without omitting the extension .fst2 • alphabet: complete path of the alphabet file • s/l/a: parameter indicating whether the search should be carried out in shortest matches (s), longest matches (l) or all matches (a) mode • i/m/r: parameter indicating the application mode of the transductions: MERGE mode (m) or REPLACE mode (r); i indicates that the program should not take transductions into account • n: parameter indicating how many occurrences to search for. The value all indicates that all occurrences need to be
149. s which it can take. The codes declared for the same attribute must be exclusive. In other words, an entry cannot take more than one value for the same attribute. On the other hand, labels can exist which do not take any value for a given attribute. For example, to define the attribute niveau_de_langue, which can take the values z1, z2 and z3, the following line can be written: niveau_de_langue z1 z2 z3 • discr: this part consists of the declaration of a unique attribute. The syntax is the same as in the cat part, and the attribute described here must not be repeated. This part allows for dividing the grammatical category into discriminating sub-categories in which the entries have similar inflectional attributes. For pronouns, for example, a person indicator is assigned to entries that are part of the personal pronoun sub-category, but not to relative pronouns. These dependencies are described in the complete part • complete: in this part, the morphological tags of the words in the current grammatical category are described. Each line describes a valid combination of inflectional codes by their discriminating sub-category, if such a category was declared. If an attribute name is specified in angle brackets (< and >), this signifies that any value of this attribute may occur. It is also possible to declare that an entry does not take any inflectional feature by means of a line
150. s with the character [...], the contents of the label is to be searched regardless of case. This information is only useful if the label is a word. If the line starts with a [...], case variants are taken into account. If a label carries a transduction, the input and output sequences are separated by the character /, for example: the/DET. By convention, the first label is always the empty word <E>, even if that label is never used for any transition. The end of the file is indicated by a line containing the character f followed by a newline. 10.4 Texts. This section presents the different files used to represent texts. 10.4.1 .txt files. The .txt files are text files encoded in Unicode Little-Endian. These files should not contain any opening or closing braces, except for those used to mark a sentence separator ({S}) or a valid lexical label ({aujourd'hui,.ADV}). The newline needs to be encoded with the two special characters with the hexadecimal values 000D and 000A. 10.4.2 .snt files. The .snt files are .txt files that have been processed by Unitex. These files should not contain any tabs. They should also not contain multiple consecutive spaces or newlines. The only allowed braces in the .snt files are those of the sentence separator {S} and those of lexical labels ({aujourd'hui,.ADV}). 10.4.3 File text.cod. The file text.cod is a binary file containing a sequence of entities that represent the text
151. [Figure residue: file type list of the Microsoft Word save dialog.] Figure 2.4: Saving in Unicode with Microsoft Word. The Dictionary part, visible in figure 2.5, enables you to carry out operations concerning the electronic dictionaries. In particular, you can restrict a search to inflected forms, lemmas, grammatical and semantic codes and/or inflectional codes. Thus, if you want to search for all the verbs which have the semantic feature t, which indicates transitivity, it is enough to search for t by clicking on Grammatical code. You will get the matching entries without confusion with all the other occurrences of the letter t. [Figure residue: Find and Replace window, Dictionary Search tab.] Figure 2.5: Searching for the semantic feature t in an electronic dictionary. 2.4 Opening a text. Unitex deals with two types of text files. The files with the extension .snt are text files preprocessed by Unitex, which are ready to be manipulated by the different system functions. The files ending with .txt are raw files. To use a text, open the
152. sed and distributed. 1.2 The Java runtime environment. Unitex consists of a graphical interface written in Java and external programs written in C/C++. This mixture of programming languages makes for a fast and portable application that runs on different operating systems. Before you can use the graphical interface, you first have to install the runtime environment, usually called the virtual machine or JRE (Java Runtime Environment). For the graphical mode, Unitex needs Java version 1.4 or later. If you have an older version of Java, Unitex will stop once you have selected the working language. You can download the virtual machine for your operating system for free from the Sun Microsystems web site at the following address: http://java.sun.com. If you are working on Linux, or if you are using a Windows version with personal user accounts, you will have to ask your system administrator to install Java. 1.3 Installation on Windows. If Unitex is to be installed on a multi-user Windows machine, it is recommended that the system administrator perform the installation. If you are the only user on your machine, you can perform the installation yourself. Decompress the file unitex_1.2.zip (you can download this file from the following address: http://www-igm.univ-mlv.fr/~unitex) into a directory Unitex that should preferably be created within the Program Files folder. After decompressing the file, the Unitex directory contains several subdirectories
153. sed to make replacements in the text, especially for the normalization of non-ambiguous forms. With the option Apply All default Dictionaries, you can apply dictionaries in the DELA format (Dictionnaires Electroniques du LADL). The option Analyze unknown words as free compound words is used in Norwegian for correctly analyzing compound words constructed via concatenation of simple forms. Finally, the option Construct Text Automaton is used to build the text automaton. This option is deactivated by default because it consumes a large amount of memory and disk space if the text is too large. The construction of the text automaton is described in chapter 7. NOTE: If you click on Cancel but tokenize text, the program will carry out the normalization of separators and look up the lexical units. Click on Cancel and close text to cancel the operation. 2.5.1 Normalization of separators. The standard separators are the space, the tab and the newline characters. There can be several separators following each other, but since this is not useful for linguistic analyses, separators are normalized according to the following rules: • separators that contain at least one newline are replaced by a single newline • all other sequences of separators are replaced by a space. The distinction between space and newline is maintained at this point because the presence of newlines may have an effect on the process of spl
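The two normalization rules can be sketched as a single regular-expression substitution. This is a minimal Python illustration, not the Unitex normalization program; it assumes that the separators are exactly the space, tab and line-feed characters, and the function name `normalize_separators` is hypothetical.

```python
import re

# A run of one or more standard separators: space, tab, newline.
SEPARATOR_RUN = re.compile(r"[ \t\n]+")

def normalize_separators(text):
    """Replace each separator run by a newline if it contains one, else by a space."""
    def replacement(match):
        return "\n" if "\n" in match.group() else " "
    return SEPARATOR_RUN.sub(replacement, text)

if __name__ == "__main__":
    raw = "First   line \t here\n\n   second line"
    print(repr(normalize_separators(raw)))
```

Keeping the newline whenever a run contains one preserves the information that later steps, such as sentence splitting, may rely on.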
154. space between two boxes. In the present case, the program tries to read a space between the box that [Figure 6.21: Inversion of words using two variables.] [Figure residue: concordance lines from the English corpus.] Figure 6.23: Spacing problem
155. specified number will be the maximum number of generated paths. If the option is not selected, all paths will be generated. Here is what is created for the graph in figure 6.13 with the default settings (ignoring outputs, limit of 100 paths): <NB> <boule> de glace à la pistache / <NB> <boule> de glace à la fraise / <NB> <boule> de glace à la vanille / <NB> <boule> de glace vanille / <NB> <boule> de glace fraise / <NB> <boule> de glace pistache / <NB> <boule> de pistache / <NB> <boule> de fraise / <NB> <boule> de vanille / glace à la pistache / glace à la fraise / glace à la vanille / glace vanille / glace fraise / glace pistache. Figure 6.13: Sample graph. 6.4 Graph collections. It can happen that one wants to apply several grammars located in the same directory. For that, it is possible to automatically build a grammar starting from a tree structure of files. Let us suppose, for example, that one has the following tree structure: • Dicos: Banque: carte.grf; Nourriture: eau.grf, pain.grf; truc.grf. If one wants to gather all these grammars into a single one, one can do it with the Build Graph Collection command in the FSGraph Tools sub-menu. One configures this operation by means of the window shown in figure 6.14.
156. ssible to apply morphological filters to the lexemes found. For that, it is necessary to immediately follow the lexeme found by a pattern in double angle brackets: lexical pattern<<morphological pattern>>. The morphological patterns are expressed as regular expressions in POSIX format (see [...] for the detailed syntax). Here are some examples of elementary filters: • <<ss>>: contains ss • <<^a>>: begins with a • <<ez$>>: ends with ez • <<a.s>>: contains a followed by any character, followed by s • <<a.*s>>: contains a followed by a sequence of any characters, followed by s • <<ss|tt>>: contains ss or tt • <<[aeiouy]>>: contains an unaccented vowel • <<[aeiouy]{3,5}>>: contains a sequence of unaccented vowels whose length is between 3 and 5 • <<ée?>>: contains é followed by an optional e • <<ss[^e]?>>: contains ss followed by an optional character which is not e. It is possible to combine these elementary filters to form more complex filters: • <<[ai]ble$>>: ends with able or ible • <<^(anti|pro)-?>>: begins with anti or pro, followed by an optional dash • <<^([rst][aeiouy]){2,}$>>: a word formed by 2 or more sequences beginning with r, s or t followed by an unaccented vowel • << 1 1
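The behaviour of such filters can be tried out with Python's `re` module. This is only an approximation under stated assumptions: Unitex uses POSIX regular expressions, whose syntax coincides with Python's for the simple patterns below, and the helper name `passes_filter` is hypothetical.

```python
import re

def passes_filter(word, pattern):
    """Return True if the pattern matches somewhere in the word."""
    return re.search(pattern, word) is not None

# Descriptions follow the elementary filters listed in the text.
EXAMPLES = [
    ("contains ss", r"ss", "passer"),
    ("begins with a", r"^a", "avion"),
    ("ends with ez", r"ez$", "chantez"),
    ("ends with able or ible", r"[ai]ble$", "possible"),
    ("3 to 5 unaccented vowels in a row", r"[aeiouy]{3,5}", "beau"),
]

if __name__ == "__main__":
    for description, pattern, word in EXAMPLES:
        print(f"{word!r} {description}: {passes_filter(word, pattern)}")
```

Because the filter is searched anywhere in the word, anchoring with `^` and `$` is what turns "contains" into "begins with" or "ends with".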
157. sub-menu of the FSGraph menu, which opens the window shown in figure 5.21. [Figure 5.20: Example of using the grid.] The font parameters are: • Input: font used within the boxes and in the text area where the contents of the boxes are edited • Output: font used for the attached transductions. The color parameters are: • Background: the background color • Foreground: the color used for the text and for the box display • Auxiliary Nodes: the color used for calls to sub-graphs • Selected Nodes: the color used for selected boxes • Comment Nodes: the color used for boxes that are not connected to others. [Figure 5.21: Configuring the display options of a graph.] The other parameters are: • Date: display of the current date in the lower left corner of the graph • File Name: display of the graph name in the lower left corner of the graph • Pathname: display of the graph name along with its complete path in the lower left corner of the graph
158. t <N+faux-ami> could recognize all entries of the dictionaries containing the codes N and faux-ami. The order in which the codes appear in the pattern is not important. The three following patterns are equivalent: <N+Hum+z1>, <z1+N+Hum>, <Hum+z1+N>. NOTE: it is not possible to use a pattern that only has prohibited codes. <-N> and <-A-z1> are thus incorrect patterns. 4.3.4 Inflectional constraints. It is also possible to specify constraints on the inflectional codes. These constraints have to be preceded by at least one grammatical or semantic code. They are written like the inflectional codes present in the dictionaries. Here are some examples of patterns using inflectional constraints: • <A:m> recognizes a masculine adjective • <A:mp:f> recognizes a masculine plural or a feminine adjective • <V:2:3> recognizes a verb in the 2nd or 3rd person; this excludes all tenses that have neither a 2nd nor a 3rd person (infinitive, past participle and present participle) as well as the forms that are conjugated in the first person. In order for a dictionary entry E to be recognized by a pattern M, it is necessary that at least one inflectional code of E contains all the characters of an inflectional code of M. Consider the following example: E = pretext,.V:W:P1s:P2s:P1p:P2p:P3p and M = <V:P3s:P3>. No inflectional code of E contains the c
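The matching rule for inflectional constraints can be written down directly. The following Python sketch is an illustration of the rule as stated, not Unitex code; inflectional codes are assumed to be given as lists of strings, and the function names are hypothetical.

```python
def code_contains(entry_code, pattern_code):
    """True if every character of the pattern code occurs in the entry code."""
    return all(ch in entry_code for ch in pattern_code)

def inflection_matches(entry_codes, pattern_codes):
    """True if at least one entry code contains all characters of a pattern code."""
    return any(
        code_contains(entry_code, pattern_code)
        for entry_code in entry_codes
        for pattern_code in pattern_codes
    )

if __name__ == "__main__":
    entry = ["W", "P1s", "P2s", "P1p", "P2p", "P3p"]  # codes of the entry E
    pattern = ["P3s", "P3"]                            # codes of the pattern M
    print(inflection_matches(entry, pattern))
```

For the example in the text, no code of E contains P, 3 and s together, but P3p contains both P and 3, so the second pattern code is satisfied and the entry is recognized.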
159. t the majority of linguistic phenomena. The first section presents the formalism in which these grammars are represented. Then we will see how to construct and present grammars using Unitex. 5.1 The local grammar formalism. 5.1.1 Algebraic grammars. Unitex grammars are variants of algebraic grammars, also known as context-free grammars. An algebraic grammar consists of rewriting rules. Below you see a grammar that matches any number of a characters: S → aS, S → ε. The symbols to the left of the rules are called non-terminal symbols since they can be replaced. Symbols that cannot be replaced by other rules are called terminal symbols. The items on the right side are sequences of non-terminal and terminal symbols. The epsilon symbol ε designates the empty word. In the grammar above, S is a non-terminal symbol and a is a terminal symbol. S can be rewritten either as an a followed by an S or as the empty word. The operation of rewriting by applying a rule is called a derivation. We say that a grammar recognizes a word if there exists a sequence of derivations that produces that word. The non-terminal that is the starting point of the first derivation is called the axiom. The grammar above recognizes the word aa, since we can derive this word from the axiom S by applying the following derivations: Derivation 1: rewriting the axiom to aS (S → aS). Derivation 2: rewriting S on the right side of aS (aS → aaS).
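The derivation process can be made concrete with a small Python sketch for this particular grammar (S rewritten as aS or as the empty word). It is an illustration only; the function names `derive` and `recognizes` are hypothetical and the code is specific to this two-rule grammar.

```python
def derive(n):
    """List the sentential forms obtained when deriving the word made of n a's."""
    form = "S"
    steps = [form]
    for _ in range(n):
        form = form.replace("S", "aS")  # apply the rule S -> aS
        steps.append(form)
    steps.append(form.replace("S", ""))  # apply the rule S -> empty word
    return steps

def recognizes(word):
    """True if the grammar derives the word, i.e. it consists only of a's."""
    if word == "":
        return True                      # S -> empty word
    return word[0] == "a" and recognizes(word[1:])  # S -> aS

if __name__ == "__main__":
    print(" => ".join(step or "(empty)" for step in derive(2)))
    print(recognizes("aa"), recognizes("ab"))
```

For the word aa, the listed steps are S, aS, aaS and finally aa, which is the same sequence of derivations as in the text followed by one last application of the empty-word rule.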
160. txt file by clicking on Open in the Text menu. [Figure residue: contents of the Text menu.] Figure 2.6: Text menu. Choose the file type Raw Unicode Texts and select your text. [Figure residue: file selection dialog.] Figure 2.7: Opening a Unicode text. Files larger than 5 MBytes are not displayed. The message "This file is too large to be displayed. Use a word processor to view it." is displayed in the window. This limit applies to all open text files (the list of lexical units, dictionaries, etc.). To modify this limit, use the menu Info > Preferences and set the new value for Maximum Text File Size in the tab Text Presentation (see 4.7, page 53). 2.5 Preprocessing a text. After a text is selected, Unitex offers to preprocess it. Text preprocessing consists o
161. t2Grf text_automaton sentence. This program extracts the automaton of a sentence in .grf format from the automaton of a text. The parameter text_automaton represents the complete path of the automaton file of the text from which a sentence is to be extracted. This file is called text.fst2 and is stored in the directory of the text. The parameter sentence indicates the number of the sentence to extract. The program produces the following two files and saves them in the directory of the text: • cursentence.grf: graph representing the automaton of the sentence • cursentence.txt: text file containing the sentence. 9.13 Fst2List. Fst2List o out p s f d a t s m f s a s L R sO Str v r s l x L IR M 1 line I subname x c SS 0xxxx fname. This program takes a .fst2 file and produces the list of the sequences recognized by this grammar. The parameters are as follows: • fname: name of the grammar, with the extension .fst2 • o out: the name of the output file. By default, this file is named lst.txt • a/t s/m: specifies whether the outputs of the grammar are taken into account (a) or not (t); s indicates that there is one initial state, while m indicates that there are several (this mode is useful in Korean). By default, this parameter is set to a s • l line: maximum number of lines to be written in the output file • i subname: indicates
162. tal letter and inserts the symbol {S} between the question mark and the following word. The following text: What time is it? Eight o'clock. will be converted to: What time is it? {S} Eight o'clock. A grammar for splitting can use the following special symbols: • <E>: empty word, or epsilon. Recognizes the empty sequence • <MOT>: recognizes any sequence of letters • <MIN>: recognizes any sequence of letters in lower case • <MAJ>: recognizes any sequence of letters in upper case • <PRE>: recognizes any sequence of letters that begins with an upper case letter • <NB>: recognizes any sequence of digits (1234 is recognized, but not 1 234) • <PNC>: recognizes the punctuation symbols and the inverted exclamation points and question marks in Spanish, and some Asian punctuation letters • <^>: recognizes a newline • #: prohibits the presence of a space. [Figure residue: sentence-splitting graph for French.]
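The letter and digit symbols can be approximated with Python string predicates. This is a rough sketch, not the Unitex tokenizer: Unitex decides what counts as a letter from the alphabet file of the language, whereas the code below relies on Python's Unicode character classes, and the function name `symbols_for` is hypothetical.

```python
def symbols_for(token):
    """Return the special symbols (among MOT, MIN, MAJ, PRE, NB) matching a token."""
    symbols = set()
    if token.isascii() and token.isdigit():
        symbols.add("<NB>")       # sequence of digits only
    if token.isalpha():
        symbols.add("<MOT>")      # any sequence of letters
        if token.islower():
            symbols.add("<MIN>")  # all lower case
        if token.isupper():
            symbols.add("<MAJ>")  # all upper case
        if token[0].isupper():
            symbols.add("<PRE>")  # begins with an upper case letter
    return symbols

if __name__ == "__main__":
    for token in ["time", "Dupont", "SNCF", "1234", "1 234"]:
        print(repr(token), sorted(symbols_for(token)))
```

A token such as 1 234 matches none of the symbols because the space breaks the digit sequence, in line with the remark on <NB>.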
163. tats.n: text file containing information on the number of sentence separators, the number of units, the number of simple words and the number of numbers • enter.pos: binary file containing the list of newline positions in the text. The coded representation of the text does not contain newlines, but spaces. Since a newline counts for two characters and a space for a single one, it is necessary to know where the newlines are in the text if the positions of the occurrences calculated by the program Locate are to be synchronized with the text file. For this, the file enter.pos is used by the program Concord. Thanks to this, when clicking on an occurrence in a concordance, it is correctly selected in the text. All produced files are saved in the directory of the text. 9.28 Txt2Fst2. Txt2Fst2 text alphabet clean norm. This program constructs the automaton of a text. The parameter text represents the complete path of a text file, without omitting the extension .snt. The parameter alphabet represents the complete path of the alphabet file of the language of the text. The optional parameter clean indicates whether the principle of conservation of the best paths (see section 7.2.4) should be applied. If the parameter norm is specified, it is interpreted as the name of a normalization grammar that is to be applied to the text automaton. If the text is split into sentences, the program constructs an automat
164. ter in capitalized and non-capitalized form. This form is used to define an Asian punctuation mark. For certain languages like French, it is possible that a lower case letter corresponds to multiple upper case letters, like for example é, which can have the upper case form E or É. To express this, it suffices to use multiple lines. The inverse is equally true: a capitalized letter can correspond to multiple lower case letters. Thus, E can be the capitalization of e, é or [...]. Here is an excerpt of the French alphabet file which defines the different letters e: [excerpt garbled in extraction]. 10.2.2 Sorted alphabet. The sorted alphabet text file defines the sorting priorities of the letters of a language, which are used by the program SortTxt. Each line of that file defines a group of letters. If a group of letters A is defined before a group of letters B, every letter of group A is of lower priority than every letter in group B. The letters of a group are only distinguished if necessary. For example, if a group containing the letters e and é has been defined, the word ébahi should be considered smaller than estuaire, and also smaller than été. Since the letters that follow e and é allow a classification of the words, it is not necessary to compare the letters e and é, since they are of the same group. On the other hand, if the words chantés and chantes are to be sorted, chantes should be considered as smaller. It is
165. tex proposes to save the graph in the sub-directory Graphs in your personal folder. You can see whether the graph was modified after the last save: in that case the title contains the text (Unsaved). 5.2.3 Sub-graphs. In order to call a sub-graph, its name is inserted into a box and preceded by the character :. If you enter the text alpha+:beta+gamma+:E:\greek\delta.grf into a box, you get a box similar to the one in figure 5.7. You can indicate the complete name of the graph (E:\grec\delta.grf) or simply the name without the access path (beta); in the latter case, the sub-graph is expected to be in the same directory as the graph that references it. [Figure residue: box calling two sub-graphs.] Figure 5.7: Graph that calls sub-graphs beta and delta. Calls to these sub-graphs are represented in the boxes by gray lines. On Windows, you can open a sub-graph by clicking on the gray line while pressing the Alt key. On Linux, the combination <Alt+Click> is intercepted by the system. In order to open a sub-graph there, click on its name while pressing the left and right mouse buttons simultaneously. 5.2.4 Manipulating boxes. You can select several boxes using the mouse. In order to do so, click and drag the mouse without releasing the button. When you release the button, all boxes touched by the selection rectangle will be selected and displayed in white on a blue background. [Figure residue: selected boxes.] Figure 5.8: Selecting multip
166. the order they appear in, from left to right. As can be seen from the extract above, there is no difference between lower and upper case. Accents and the cedilla character are ignored as well. To sort a dictionary, open it and then click on Sort Dictionary in the DELA menu. By default, the program always looks for the file Alphabet_sort.txt. If that file does not exist, the sorting is done according to the character indices in the Unicode encoding. By modifying that file, you can define your own sorting order. Remark: After applying the dictionaries to a text, the files dlf, dlc and err are automatically sorted using this program. 3.4 Automatic inflection. As described in section 3.1.2, a line in a DELAS consists of a canonical form and a sequence of grammatical or semantic codes: aviatrix,N4+Hum / matrix,N4+Math / radix,N4. The first code is interpreted as the name of the grammar used to inflect the canonical form. These inflectional grammars have to be compiled (see chapter 5). In the example above, all entries will be inflected by a grammar named N4. In order to inflect a dictionary, click on Inflect in the DELA menu. The window in figure 3.5 lets you specify the directory in which the inflectional grammars are found. By default, the subdirectory Inflection of the directory for the current language is used. The option Add : before inflectional code if necessary automatically inserts a : character before the inflectional codes
167. the search. In order to apply a graph to a text, open the text, then click on Locate Pattern in the Text menu, or press <Ctrl+L>. You can then configure your search in the window shown in figure 6.27. In the field Locate pattern in the form of, choose Graph and select your graph by clicking on the Set button. [Figure residue: date graph with nested variables.] Figure 6.26: Nesting of variables. You can choose a graph in .grf format (Unicode Graphs) or a compiled graph in .fst2 format (Unicode Compiled Graphs). If your graph is in .grf format, Unitex will compile it automatically before starting the search. The Index field lets you select the recognition mode: • Shortest matches: give precedence to the shortest matches • Longest matches: give precedence to the longest sequences. This is the default mode • All matches: output all recognized sequences. The field Search limitation lets you limit the search to a certain number of occurrences. By default, the search is limited to the first 200 occurrences. The field Grammar outputs concerns the use of the transductions. The mode Merge with input text inserts the sequences that are produced by the transductions. The mode Replace recognized sequences replaces the recognized sequences with the p
168. therefore necessary to compare the letters e and é to distinguish these words. Since the letter e appears first in the group containing e and é, it is considered to be smaller than é. The word chantes should therefore be considered smaller than the word chantés. The sorted alphabet file allows the definition of equivalent characters. It is therefore possible to ignore the different accents as well as capitalization. For example, if the letters b, c and d are to be ordered without considering capitalization and the cedilla, it is possible to write the following lines: Bb / CcÇç / Dd. This file is optional. If no sorted alphabet file is specified, the program SortTxt sorts in the order of the Unicode encoding. 10.3 Graphs. This section presents the two graph formats: the graphic format .grf and the compiled format .fst2. 10.3.1 Format .grf. A .grf file is a text file that contains presentation information in addition to information representing the contents of the boxes and the transitions of the graph. A .grf file begins with the following lines: #Unigraph / SIZE 1313 950 / FONT Times New Roman: 12 / OFONT Times New Roman:B 12 / BCOLOR 16777215 / FCOLOR 0 / ACOLOR 12632256 / SCOLOR 16711680 / CCOLOR 255 / DBOXES y / FRAME y / DATE y / FILE y / DIR y / RIG n / RST n / FITS 100 / PORIENT L. The first line, #Unigraph, is a comment l
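The two-level comparison described for the sorted alphabet (groups first, letters inside a group only to break ties) can be sketched as a Python sort key. This is not the SortTxt program: the group list below is a small hypothetical alphabet chosen to cover the example words, and `make_sort_key` is an illustrative name.

```python
def make_sort_key(groups):
    """Build a sort key from letter groups, listed from lowest to highest priority."""
    rank = {}
    for group_index, letters in enumerate(groups):
        for position, letter in enumerate(letters):
            rank[letter] = (group_index, position)

    def key(word):
        pairs = [rank[letter] for letter in word]
        # Compare group indices first; positions inside groups only break ties.
        return ([g for g, _ in pairs], [p for _, p in pairs])

    return key

if __name__ == "__main__":
    # Hypothetical groups; each string plays the role of one line of the file.
    groups = ["a", "b", "cç", "d", "eéèêë", "h", "i", "n", "r", "s", "t", "u"]
    words = ["été", "chantés", "estuaire", "chantes", "ébahi"]
    print(sorted(words, key=make_sort_key(groups)))
```

With these groups, ébahi sorts before estuaire and été because b precedes s and t, while chantes precedes chantés only because e comes before é inside their shared group.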
169. tic information. The following tables give an overview of some of the different codes used in the Unitex dictionaries. These codes are the same for almost all languages, though some of them are specific to certain languages (e.g. the code for neuter nouns).
Table 3.1: Frequent grammatical codes
Table 3.2: Some semantic codes
NOTE: The descriptions of tense in table 3.3 correspond to French. Nonetheless, the majority of these definitions can be found in other languages (infinitive, present, past participle, etc.).
In spite of a common base in the majority of languages, the dictionaries contain encoding particularities that are specific to each language. Thus, as the declension codes vary a lot between languages, they are not described here. For a complete description of all codes used within a dictionary, we recommend that you contact the author of the dictionary directly.
Table 3.3: Common inflectional codes
However, these codes are not exclusive. Users can introduce their own codes and create their own dictionaries. For example, for educational purposes, one could use a marker "faux-ami" in an English dictionary:
    bless,.V+faux-ami/bénir
    cask,.N+faux-ami/tonneau
    journey,.N+faux-ami/voyage
It is equally possible to use dictionaries to add extra information. Thus, you can use the
170. tionaries
3.1 The DELA dictionaries
The electronic dictionaries distributed with Unitex use the DELA syntax (Dictionnaires Electroniques du LADL, LADL electronic dictionaries). This syntax describes the simple and compound lexical entries of a language with their grammatical, semantic and inflectional information. We distinguish two kinds of electronic dictionaries. The one that is used most often is the dictionary of inflected forms: DELAF (DELA de formes Fléchies, DELA of inflected forms) or DELACF (DELA de formes Composées Fléchies, DELA of compound inflected forms) in the case of compound forms. The second type is a dictionary of non-inflected forms, called DELAS (DELA de formes simples, simple forms DELA) or DELAC (DELA de formes composées, compound forms DELA). Unitex programs make no distinction between simple and compound form dictionaries. We will use the terms DELAF and DELAS to distinguish the two kinds of dictionaries, whose entries are simple, compound or mixed forms.
3.1.1 The DELAF format
Entry syntax
An entry of a DELAF is a line of text terminated by a newline that conforms to the following syntax:
    apples,apple.N+conc:p/this is an example
The different elements of this line are:
• apples is the inflected form of the entry; it is mandatory.
• apple is the canonical form of the entry. For nouns and adjectives (in French), it is usually the masculine singular form; for verbs, it is the infinitive. This information ma
171. tiple languages. Figure 8.1 shows an example of a lexicon-grammar table. The table concerns verbs that take a numerical complement.
Figure 8.1: Lexicon-grammar table 32NM
8.2 Conversion of a table into graphs
8.2.1 Principle of template graphs
The conversion of a table into graphs is carried out by the mechanism of template graphs. The principle is the following: a graph that describes the possible constructions is constructed. This graph refers to the columns of the table in the form of variables. Afterwards, for each line of the tab
172. to normalize "O'clock" to "of the clock", it would be a bad idea to replace "O'" by "of the", because a sentence like:
    John O'Connor said: "it's 8 O'clock"
would be replaced by the following incorrect sentence:
    John of the Connor said: "it's 8 of the clock"
Thus, one needs to be very careful when using the normalization grammar. One needs to pay attention to spaces as well. For example, if one replaces "'re" by "are", the sentence:
    You're stronger than him.
will be replaced by:
    Youare stronger than him.
To avoid this problem, one should explicitly insert a space, i.e. replace "'re" by " are". The accepted symbols for the normalization grammar are the same as the ones allowed for the sentence splitting grammar. The normalization grammar is called Replace.fst2 and can be found in the following directory:
    (home directory)/(active language)/Graphs/Preprocessing/Replace
As in the case of sentence splitting, this grammar is applied using the Fst2Txt program, but in REPLACE mode, which means that input sequences recognized by the grammar are replaced by the output sequences that are produced. Figure 2.10 shows a grammar that normalizes verbal contractions in French.
Figure 2.10: Normalization grammar for some elisions in French
2.5.4 Splitting a text into lexical units
Some languages, in particular Asian languages, use sep
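The two pitfalls described in this excerpt (rules that are too general, and rules that swallow a space) can be reproduced with a minimal sketch. It uses plain Python substring replacement, not Unitex's Fst2Txt program or the Replace.fst2 grammar; the function name `normalize` and the rule lists are assumptions made for illustration only.

```python
def normalize(text, rules):
    """Apply each (pattern, replacement) pair in order as a plain substring replacement."""
    for pattern, replacement in rules:
        text = text.replace(pattern, replacement)
    return text


sentence = "John O'Connor said: it's 8 O'clock"

# Too general: every "O'" is rewritten, including the one in the surname.
too_general = normalize(sentence, [("O'", "of the ")])

# Narrower rule: only the full sequence "O'clock" is rewritten.
narrow = normalize(sentence, [("O'clock", "of the clock")])

# Space handling: "'re" -> "are" glues the words together,
# while "'re" -> " are" keeps them apart.
glued = normalize("You're stronger than him", [("'re", "are")])
spaced = normalize("You're stronger than him", [("'re", " are")])

print(too_general)
print(narrow)
print(glued)
print(spaced)
```

The narrower the input sequence of a rule, the smaller the risk of rewriting text that merely looks like the target.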
173. to search for an expression, first open a text (cf. chapter 2). Then click on "Locate Pattern" in the "Text" menu. The window of figure 4.4 appears. The box "Locate pattern in the form of" lets you select regular expression or grammar. Click on "Regular expression". The box "Index" lets you select the recognition mode:
• "Shortest matches": prefer short matches.
• "Longest matches": prefer longer matches. This is the default.
• "All matches": output all recognized sequences.
Figure 4.4: Expression search window
The box "Search limitation" is used to limit the number of results to a certain number of occurrences. By default, the search is limited to the first 200 occurrences. The options of the box "Grammar outputs" do not concern regular expressions. They are described in section 6.6. Enter an expression and click on "Search" in order to start the search. Unitex will transform the expression into a grammar in the .grf format. This grammar will then be compiled into a grammar in the .fst2 format, which will be used for the search.
4.8.2 Present
174. toire d'Automatique Documentaire et Linguistique (LADL). Similar resources have been developed for other languages in the context of the RELEX laboratory network. The electronic dictionaries specify the simple and compound words of a language, together with their lemmas and a set of grammatical, semantic and inflectional codes. The availability of these dictionaries is a major advantage compared to the usual utilities for pattern searching, as the information they contain can be used for searching and matching, thus describing large classes of words using very simple patterns. The dictionaries are presented in the DELA formalism and were constructed by teams of linguists for several languages (French, English, Greek, Italian, Spanish, German, Thai, Korean, Polish, Norwegian, Portuguese, etc.). The grammars deployed here are representations of linguistic phenomena on the basis of recursive transition networks (RTN), a formalism closely related to finite state automata. Numerous studies have underscored the adequacy of automata for linguistic problems at all descriptive levels, from morphology and syntax to phonetic issues. The grammars created with Unitex carry this approach further by using a formalism even more powerful than automata. These grammars are represented as graphs that the user can easily create and update. The tables built in the context of lexicon-grammar are matrices describing properties of certain words. Many such tables have bee
175. tomaton. The program Reconstrucao dynamically builds a normalization grammar for these forms for each text. The grammar produced in this way can then be used for normalizing the text automaton. The configuration window of the automaton construction offers an option "Build clitic normalization grammar" (cf. figure 7.10). This option automatically starts the construction of the normalization grammar, which is then used to construct the text automaton if you have selected the option "Apply the Normalization grammar".
7.2.4 Keeping the best paths
It is possible that an unknown word jeopardizes the text automaton by overlapping with a completely labeled sequence. Thus, in the automaton of figure 7.8, it can be seen that the adverb aujourd'hui overlaps with the unknown word aujourd, followed by an apostrophe and the past participle of the verb huir.
Figure 7.9: Automaton of a Thai sentence
This phenomenon can also be found in the treatment of certain Asian languages like Thai. When the words are not delimited, there is no other solution than to consider all possible combinations, which ca
176. trated by figure 6.8. This type of loop is due to the fact that a transition with the empty word cannot be eliminated automatically by Unitex because it is associated with a transduction. Thus, the transition with the empty word of figure 6.8 will not be suppressed and will cause an infinite loop.
Figure 6.6: Result of the approximation of a grammar
Figure 6.7: How to associate a transduction with a call to a subgraph
The second category of epsilon loops concerns calls to subgraphs that can recognize the empty word. This case is illustrated in figure 6.9: if the subgraph Adj recognizes epsilon, there is an infinite loop that Unitex cannot detect. The third possibility of infinite loops is related to recursive calls to subgraphs. Look at the graphs Det and DetCompose in figure 6.10. Each of these graphs can call the other
Figure 6.8: Infinite loop due to a transition by the empty word with a transduction
Figure 6.9: Infinite loop due to a call to a subgraph that recognizes epsil
177. ument in the feminine sense. You can thus distinguish the entries in this case:
    poêle,.N+z1:fs/poêle à frire
    poêle,.N+z1:ms/voile, linceul; appareil de chauffage
NOTE: In practice, the only consequence of this distinction is that the number of entries in the dictionary increases. In the different programs that make up Unitex, these entries are reduced to:
    poêle,.N+z1:fs:ms
Whether this distinction is made is thus left to the maintainers of the dictionaries.
3.1.2 The DELAS Format
The DELAS format is very similar to the one used in the DELAF. The only difference is that there is only one canonical form, followed by grammatical and/or semantic codes. The canonical form is separated from the different codes by a comma. Here is an example:
    horse,N4+Anl
The first grammatical or semantic code will be interpreted by the inflection program as the name of the grammar used to inflect the entry. The entry of the example above indicates that the word horse has to be inflected using the grammar named N4. It is possible to add inflectional codes to the entries, but the nature of the inflection operation limits the usefulness of this possibility. For more details, see below in section 3.4.
3.1.3 Dictionary Contents
The dictionaries provided with Unitex contain descriptions of simple and compound words. These descriptions indicate the grammatical category of each entry, optionally their inflectional codes, and diverse seman
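The reduction mentioned in the NOTE above, where entries that differ only in their inflectional codes collapse into one, can be sketched as follows. This is an illustration under simplifying assumptions, not the code used by the Unitex programs: the helper names `split_entry` and `merge_entries` are hypothetical, comments are simply dropped, and protected (escaped) characters inside entries are not handled.

```python
def split_entry(line):
    """Split 'inflected,canonical.CODES:infl1:infl2/comment' into (head, inflectional codes)."""
    body = line.split("/", 1)[0]          # drop the optional comment
    head, *inflections = body.split(":")  # head is 'inflected,canonical.CODES'
    return head, inflections


def merge_entries(lines):
    """Merge entries sharing the same forms and codes, keeping inflectional codes in order."""
    merged = {}  # dicts preserve insertion order in Python 3.7+
    for line in lines:
        head, inflections = split_entry(line)
        codes = merged.setdefault(head, [])
        for code in inflections:
            if code not in codes:
                codes.append(code)
    return [":".join([head] + codes) for head, codes in merged.items()]


entries = [
    "poêle,.N+z1:fs/poêle à frire",
    "poêle,.N+z1:ms/voile, linceul; appareil de chauffage",
]
print(merge_entries(entries))
```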
178. units. With the button "Set File" you can select the output file. Then click on "Extract matching units" or "Extract unmatching units", depending on whether you are interested in sentences with or without matching units. In the box "Show Matching Sequences in Context" you can select the length, in characters, of the left and right contexts of the occurrences that will be presented in the concordance. If an occurrence has fewer characters than its right context, the line will be completed with the necessary number of characters. If an occurrence has a length greater than that of the right context, it will be displayed completely.
NOTE: In Thai, the size of the contexts is measured in displayable characters and not in real characters. This makes it possible to keep the line alignment in the concordance despite the presence of diacritics that combine with other letters instead of being displayed as normal characters.
You can choose the sort order in the list "Sort According to". The mode "Text Order" displays the occurrences in the order of their appearance in the text. The six other modes allow sorting in columns. The three zones of a line are the left context, the occurrence and the right context. The occurrences and the right contexts are sorted from left to right. The left contexts are sorted from right to left. The default mode is "Center, Left Col.". The concordance is generated in the form of an HTML file. If a concordance reaches several
179. uotation marks.
4.3 Patterns
4.3.1 Special symbols
There are two kinds of patterns. The first category contains all symbols that have been introduced in section 2.5.2, except for the symbol <^>, which matches a line feed. Since all line feeds have been replaced by spaces, this symbol can no longer be useful when searching for patterns. These symbols, also called meta-symbols, are the following:
• <E>: the empty word, or epsilon. Matches the empty string.
• <TOKEN>: matches any lexical unit.
• <MOT>: matches any lexical unit that consists of letters.
• <MIN>: matches any lower case lexical unit.
• <MAJ>: matches any upper case lexical unit.
• <PRE>: matches any lexical unit that consists of letters and starts with a capital letter.
• <DIC>: matches any word that is present in the dictionaries of the text.
• <SDIC>: matches any simple word in the text dictionaries.
• <CDIC>: matches any compound word in the dictionaries of the text.
• <NB>: matches any contiguous sequence of digits (1234 is matched but not 1 234).
• #: prohibits the presence of space.
4.3.2 References to the dictionaries
The second kind of patterns refers to the information in the dictionaries of the text. The four possible forms are:
• <be>: matches all the entries that have be as canonical form.
• <be.V>: matches all entries having be as canonical form and the grammatical code V.
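As a rough illustration of the character-based meta-symbols listed in section 4.3.1, the sketch below classifies a single lexical unit. It is an approximation built on Python's Unicode string predicates, whereas Unitex relies on the alphabet file of the language; the function name `meta_symbols` is hypothetical, and the dictionary-based symbols (<DIC>, <SDIC>, <CDIC>) are left out because they require the text dictionaries.

```python
def meta_symbols(token):
    """Return the character-based meta-symbols that a single lexical unit satisfies."""
    matched = {"<TOKEN>"}
    if token.isalpha():                  # made of letters only
        matched.add("<MOT>")
        if token.islower():
            matched.add("<MIN>")
        if token.isupper():
            matched.add("<MAJ>")
        if token[0].isupper():           # starts with a capital letter
            matched.add("<PRE>")
    if token and all("0" <= c <= "9" for c in token):
        matched.add("<NB>")              # contiguous sequence of digits
    return matched


for unit in ["cat", "Paris", "NATO", "1234"]:
    print(unit, sorted(meta_symbols(unit)))
```

Because the function looks at one lexical unit at a time, a sequence such as 1 234 can never be classified as a single <NB>: the space splits it into separate units.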
180. uses the creation of numerous paths carrying unknown words that are mixed with the labeled paths. Figure 7.9 shows an example of such an automaton of a Thai sentence. It is possible to suppress parasite paths. You have to select the option "Clean Text FST" in the configuration window for the construction of the text automaton (cf. figure 7.10). This option indicates to the automaton construction program that it should clean up each sentence automaton.
Figure 7.10: Configuration of the construction of the text automaton
This cleaning is carried out according to the following principle: if several paths are concurrent in the automaton, the program keeps those that contain the fewest unknown words. Figure 7.11 shows the automaton of figure 7.9 after cleaning.
Figure 7.11: Automaton of figure 7.9 after cleaning
7.3 RESOLVING LEX
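The cleaning principle stated above (among competing paths, keep those with the fewest unknown words) can be sketched on a simplified representation. The real program works on sentence automata; here, as an assumption made only for illustration, each competing path is a plain list of labels, and the caller supplies a hypothetical predicate `is_unknown`.

```python
def clean_paths(paths, is_unknown):
    """Keep only the competing paths that contain the fewest unknown words."""
    if not paths:
        return []
    counts = [sum(1 for label in path if is_unknown(label)) for path in paths]
    best = min(counts)
    return [path for path, count in zip(paths, counts) if count == best]


# Two competing analyses of the same stretch of text (labels are simplified).
paths = [
    ["aujourd'hui"],         # adverb found in the dictionaries
    ["aujourd", "'", "hui"]  # unknown word + apostrophe + past participle of huir
]
unknown_words = {"aujourd"}

print(clean_paths(paths, lambda label: label in unknown_words))
```

Paths that tie on the number of unknown words are all kept, which matches the wording "keeps those that contain the fewest unknown words".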
181. utput which could be used by the program elag.
• ruleslist: listing file of ELAG grammars.
• lang: the ELAG configuration file for the chosen language.
• output: optional name of the output file. By default, the output file is identical to ruleslist except for the extension, which is .rul.
• rulesdir: this optional parameter indicates the directory in which ELAG rules are to be found.
9.8 Evamb
    Evamb [-imp|-exp] -o fstname [-n sentenceno]
This program calculates an average rate of ambiguity for the whole text automaton fstname, or only for the sentence specified by sentenceno. If the parameter -imp is specified, the program carries out the calculation on the so-called compact form of the automaton, in which inflectional ambiguities are not taken into account. If the parameter -exp is specified, all inflectional ambiguities are considered. We then speak of the developed form of the text automaton. Thus, the entry aimable,.A:ms:fs will count only once with -imp and twice with -exp. The text automaton is not modified by this program.
9.9 ExploseFst2
    ExploseFst2 txtauto -o out
This program calculates the expanded form of the text automaton txtauto and stores it in out.
9.10 Extract
    Extract yes/no text concordance result
This program takes a text and a concordance as parameters. If the first parameter is yes, the program extracts all sentences from the text that have at least one occurrence in the concordance. If the paramete
183. y be left out, as in the following example:
    apple,.N+conc:s
This means that the canonical form is the same as the inflected form. The canonical form is separated from the inflected form by a comma.
• N+conc is the sequence of grammatical and semantic information. In our example, N designates a noun and conc indicates that this noun designates a concrete object (see table 3.2). Each entry must have at least one grammatical or semantic code, separated from the canonical form by a period. If there are more codes, these are separated by the character +.
• :p is an inflectional code which indicates that the noun is plural. Inflectional codes are used to describe gender, number, declension and conjugation. This information is optional. An inflectional code is made up of one or more characters, each of which represents one piece of information. Inflectional codes have to be separated by the character :, for instance in an entry like the following:
    hang,.V:W:P1s:P2s:P1p:P2p:P3p
The character : is interpreted with OR semantics. Thus, :W:P1s:P2s:P1p:P2p:P3p means "infinitive", "1st person singular present", "2nd person singular present", etc. (see table 3.3). Since each character represents one piece of information, it is not necessary to use the same character more than once. In this way, encoding the past participle using the code :PP would be exactly equivalent to using :P alone.
• /this is an example is a comment. Comments are o
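The entry syntax described in this excerpt (inflected form, comma, optional canonical form, period, codes joined by +, inflectional codes introduced by :, optional comment after /) can be sketched as a small parser. This is an illustrative sketch, not the parser used by Unitex: the function name `parse_delaf` and the dictionary keys are hypothetical, and protected (escaped) characters inside forms are not handled.

```python
def parse_delaf(line):
    """Parse one DELAF line of the form 'inflected,canonical.CODES:infl/comment'."""
    entry, _, comment = line.partition("/")         # optional comment
    forms, _, info = entry.partition(".")           # forms vs. codes
    inflected, _, canonical = forms.partition(",")  # inflected vs. canonical form
    codes, *inflections = info.split(":")           # inflectional codes follow each ':'
    return {
        "inflected": inflected,
        # An empty canonical form means "same as the inflected form".
        "canonical": canonical or inflected,
        "codes": codes.split("+"),
        "inflections": inflections,
        "comment": comment or None,
    }


print(parse_delaf("apples,apple.N+conc:p/this is an example"))
print(parse_delaf("apple,.N+conc:s"))
print(parse_delaf("hang,.V:W:P1s:P2s:P1p:P2p:P3p"))
```

Returning the inflectional codes as a list reflects the OR semantics of the : separator: each element is one alternative analysis of the same form.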
