Unitex User Manual - Institut d'électronique et d'informatique
Contents
1. [Screenshot residue: CasSys transducer list window with the columns Disabled, Name, Merge, Replace and Iter, a list of .fst2 graphs, and a file explorer. The remaining text is not recoverable.]
2. [p. 318, Chapter 14: File formats] The first two lines are comment lines. The following three lines indicate the name, the style and the size of the font used to display texts, dictionaries, lexical units, sentences in text automata, etc. The CONCORDANCE FONT NAME and CONCORDANCE FONT HTML SIZE parameters define the name and the size of the font to be used when displaying concordances in HTML. The size of the font has a value between 1 and 7. The INPUT FONT and OUTPUT FONT parameters define the name, the style and the size of the fonts used for displaying the paths and the transducer outputs of the graphs. The following 10 parameters correspond to the parameters given in the headings of the graphs. Table 14.5 describes the correspondences.

   Parameters in the Config file    Parameters in the .grf file
   DATE                             DDATE
   FILE NAME                        DFILE
   PATH NAME                        DDIR
   FRAME                            DFRAME
   RIGHT TO LEFT                    DRIG
   BACKGROUND COLOR                 BCOLOR
   FOREGROUND COLOR                 FCOLOR
   AUXILIARY NODES COLOR            ACOLOR
   COMMENT NODES COLOR              SCOLOR
   SELECTED NODES COLOR             CCOLOR

   Table 14.5: Meaning of the parameters

   The PACKAGE NODES parameter defines the color to be used for displaying calls to subgraphs located in the repository. The CONTEXT NODES
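The correspondence in Table 14.5 can be sketched as a simple lookup. This is only an illustration: the function name `to_grf_header` and the idea of passing settings as a dictionary are assumptions made for the example, not part of Unitex.

```python
# Correspondence from Table 14.5: Config-file parameter -> .grf header parameter.
CONFIG_TO_GRF = {
    "DATE": "DDATE",
    "FILE NAME": "DFILE",
    "PATH NAME": "DDIR",
    "FRAME": "DFRAME",
    "RIGHT TO LEFT": "DRIG",
    "BACKGROUND COLOR": "BCOLOR",
    "FOREGROUND COLOR": "FCOLOR",
    "AUXILIARY NODES COLOR": "ACOLOR",
    "COMMENT NODES COLOR": "SCOLOR",
    "SELECTED NODES COLOR": "CCOLOR",
}


def to_grf_header(config):
    """Rename Config-style keys to .grf header keys (hypothetical helper).

    Keys that are not listed in Table 14.5 are dropped.
    """
    return {CONFIG_TO_GRF[key]: value
            for key, value in config.items()
            if key in CONFIG_TO_GRF}
```

For example, `to_grf_header({"DATE": "y", "FRAME": "n"})` yields a dictionary keyed by `DDATE` and `DFRAME`.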
3. Figure 5.2: Empty graph. [...] displayed in the form of red text lines since it is not connected to another one at the moment. We often use this type of box to insert comments into a graph. If you intend to insert comments into a graph, you can create a box starting with /. The text in this box will be displayed in green and may contain empty lines. This box cannot have any incoming or outgoing transitions (see figure 5.5). To connect a box to another one, first click on the source box, then click on the target box. [p. 92, Chapter 5: Local grammars] If there already exists a transition between two boxes, it is deleted. It is also possible to do that by clicking first on the target box and then on the source box while pressing Shift. In our example, after connecting the box to the initial and final states of the graph, we get a graph as in figure 5.6. Figure 5.3: Creating a box. Figure 5.4: Box containing I, you, he, she, it, we, they. [p. 93, 5.2 Editing graphs] Figure 5.5: Box containing comments. Figure 5.6: Graph that recognizes English pronouns. NOTE: If you double click a box, you connect this b
4. Figure 7.17: Text automaton frame. Figure 7.18: Split text automaton frame (example sentence: "La porte du car se ferme automatiquement"). [p. 168, Chapter 7: Text automaton] Don't be surprised if the automaton shown at the bottom seems more complicated. This results from the fact that factorized lexical entries were exploded in order to treat each inflectional interpretation separately. To refactorize these entries, click on the Implode button. Clicking on the Explode button shows you an exploded view of the text automaton. If you click on the Replace button, the resulting automaton will become the new text automaton. Thus, if you use other grammars, they will apply to the already partially disambiguated automaton, which makes it possible to accumulate the effects of several grammars. 7.3.4 Grammar collections. It is possible to gather several ELAG grammars into a grammar collection in order to compile and apply them in one step. The sets of ELAG grammars are described in .lst files. They are managed through the window for compiling ELAG grammars (figure 7.16). The label on the top left indicates the name of the current collection (by default elag.lst). The contents of this collection are displayed in the right part
5. Figure 7.2: Overlap between a compound word and a combination of simple words. [p. 155, 7.2 Construction] 7.2 Construction. In order to construct the text automaton, open the text, then click on Construct FST-Text in the menu Text. One should first split the text into sentences and apply dictionaries. If sentence boundary detection is not applied, the construction program will arbitrarily split the text into sequences of 2000 lexical units instead of constructing one automaton per sentence. If no dictionaries are applied, the text automaton that you obtain will consist of only one path, made up of unknown words, per sentence. 7.2.1 Construction rules for text automata. Sentence automata are constructed from text dictionaries. The resulting degree of ambiguity is therefore directly linked to the granularity of the descriptions of dictionaries. From the sentence automaton in figure 7.3, you can conclude that the word which has been coded twice as a determiner, in two subcategories of the category DET. This granularity of descriptions will not be of any use if you are only interested in the grammatical category of this word. It is therefore necessary to adapt the granularity of the dictionaries to the intended use. Figure 7.3: Double entry for which as a determiner. For each lexical unit of the s
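The fallback described above, where a text without sentence boundaries is cut into runs of 2000 lexical units, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `split_into_chunks` is hypothetical and the tokens are plain list items, not Unitex's internal representation.

```python
def split_into_chunks(tokens, size=2000):
    """Cut a token list into consecutive runs of at most `size` tokens.

    Mimics the arbitrary splitting applied when no sentence boundaries
    are available; the last chunk may be shorter.
    """
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]
```

A text of 4500 lexical units would thus give three pieces, the last one shorter than the others, instead of one automaton per sentence.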
6. Figure 6.56: Single output for the noble. The Variable error policy option allows you to specify what Locate/LocateTfst is supposed to do when an output is found that contains a reference to a variable that has not been correctly defined. [p. 147, 6.10 Applying graphs to texts] Note that this parameter has no effect if outputs are to be ignored. For instance, let us consider the graph shown on Figure 6.57. Figure 6.57: A variable A that may be undefined. With the Ignore variable errors option, A will just be ignored, as if it had an empty content, as shown on Figure 6.58. [Figure residue: concordance on Ivanhoe in which the ADJ output is empty before "necks", "castles", "pomp" and "court", and filled for "feudal chains" and "great nobles".] Figure 6.58: variable A that may be undefined. With the Exit on variable error option, Locate/LocateTfst will exit with an error message, as shown on Figure 6.59. With the Backtrack on variable error option, Locate/LocateTfst will stop exploring the current path in the grammar. Thus, variables play the role of switches that cut paths when variables are undefined. For instance, the application of grammar 6.57 will only produce matches c
7. • Compile: recompile all the graphs of the cascade;
   • Disable all: to disable all the graphs of the cascade;
   • Enable all: to enable all the graphs of the cascade;
   • Close: to close the current window.
   Figure 12.3: The table/list of transducers (columns Disabled, Name, Merge, Replace and Iter).
   12.1.3 Applying a cascade
   In the Text menu, you can select the submenu Apply CasSys cascade (Figure 12.4), which will open the CasSys window. T
8. [...] a lexical mask immediately, it applies to what was recognized by the lexical mask. Here are some examples of such combinations:
   • <V:K><<i$>>: past participle ending with i;
   • <CDIC><<->>: compound word containing a dash;
   • <CDIC><< .* >>: a compound word containing at least two spaces;
   • <A:fs><<^pro>>: a feminine singular adjective beginning with pro;
   • <DET><<^([^u]|(u[^n])|(un.+))>>: a French determiner different from un;
   • <!DIC><<es$>>: a word which is not in the dictionary and which ends with es;
   • <V:S:T><<uiss>>: a verb in the past or present subjunctive, and containing uiss.
   NOTE: By default, morphological filters are subject to the same variations of case as lexical masks. Thus, the filter <<^b>> will recognize all the words starting with b, but also those which start with B. To force the matcher to respect case, add _f_ immediately after the filter.
   4.8 Search
   4.8.1 Search configuration
   In order to search for an expression, first open a text (cf. chapter 2). Then click on Locate Pattern in the Text menu. The window of figure 4.4 appears. The Locate pattern in the form of box allows you to select regular expression or grammar. Click on Regular expression. The Index box allows you to select the recognition mode: Shortest matches p
9. Figure 5.31: Example of using the grid. 5.3.5 Display options, fonts and colors. You can configure the display style of a graph by pressing <Ctrl+R> or by clicking on Presentation in the Format sub-menu of the FSGraph menu, which opens the window as in figure 5.32. The font parameters are:
   • Input: font used within the boxes and in the text area where the contents of the boxes are edited;
   • Output: font used for the attached transducer outputs.
   [p. 110, Chapter 5: Local grammars] Figure 5.32: Configuring the display options of a graph.
   The color parameters are:
   • Background: the background color;
   • Foreground: the color used for the text and for the box display;
   • Auxiliary Nodes: the color used for calls to sub-graphs;
   • Selected Nodes: the color used for selected boxes;
   • Comment Nodes: the color used for boxes that are not connected to others.
   The other parameters are:
   • Date: display of the current date in the lower left corner of the graph
10. • --case-sensitive: all letter tokens are protected with double quotes (default);
    • --case-insensitive: letter tokens are not protected with double quotes;
    • -w x: number of wildcards;
    • -i x: number of insertions;
    • -r x: number of replacements;
    • -d x: number of deletions.
    Constructs the sequences automaton: one single automaton that recognizes all the sequences from the SNT. The sequences must be delimited with the special tag {STOP}. The produced .grf file is stored in the user's Graphs directory. The other files, named text.tfst and text.tind, are stored in the text directory.
    [p. 279, 13.34 SortTxt]
    13.34 SortTxt
    SortTxt [OPTIONS] <txt>
    This program carries out a lexicographical sorting of the lines of file <txt>. <txt> represents the complete path of the file to be sorted.
    OPTIONS:
    • -n/--no_duplicates: remove duplicate lines (default);
    • -d/--duplicates: do not remove duplicate lines;
    • -r/--reverse: sort in descending order;
    • -o XXX/--sort_order=XXX: sorts using the alphabet order defined by file XXX. If this parameter is missing, the sorting is done according to the order of Unicode characters;
    • -l XXX/--line_info=XXX: backup the number of lines of the result file in file XXX;
    • -t/--thai: option for sorting Thai text;
    • -f/--factorize_inflectional_codes: makes two entries XXX,YYY.ZZZ:A and XXX,YYY.ZZZ:B become a single entry XXX,YYY.ZZZ:A:B.
    The input text file is modified
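The sorting behaviour described for SortTxt (duplicate removal by default, optional descending order, optional custom alphabet order) can be approximated with a short sketch. This is not the SortTxt implementation: the function `sort_lines` is hypothetical, and the custom order is passed as a plain string of characters rather than in the real sort-order file format, which is not modelled here.

```python
def sort_lines(lines, alphabet=None, reverse=False, keep_duplicates=False):
    """Sort text lines, loosely following the SortTxt options.

    alphabet: optional string giving the character order (an assumption
              standing in for the sort-order file); without it, lines are
              compared by Unicode code point.
    """
    if not keep_duplicates:
        # dict.fromkeys keeps the first occurrence of each line
        lines = list(dict.fromkeys(lines))
    if alphabet is None:
        key = None
    else:
        rank = {ch: i for i, ch in enumerate(alphabet)}

        def key(line):
            # Characters missing from the alphabet sort after listed ones.
            return [(0, rank[c]) if c in rank else (1, ord(c)) for c in line]

    return sorted(lines, key=key, reverse=reverse)
```

With the default Unicode order an accented letter such as "é" sorts after "z", whereas an alphabet string such as "aéz" places it between "a" and "z".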
11. • -p XXX/--param_file=XXX: load a parameters file like unitex_logging_parameters.txt. Incompatible with all other options;
    • -d XXX/--directory=XXX: directory where the log file is created;
    • -l XXX/--log_file=XXX: filename of the log file to create;
    • -i/--store_input_file: store input file in log (default);
    • -n/--no_store_input_file: don't store input file in log (prevents rerunning the logfile);
    • -o/--store_output_file: store output file in log;
    • -u/--no_store_output_file: don't store output file in log (default);
    • -s/--store_list_input_file: store list of input files in log (default);
    • --no_store_list_input_file: don't store list of input files in log;
    • -r/--store_list_output_file: store list of output files in log (default);
    • --no_store_list_output_file: don't store list of output files in log.
    13.49 Unxmlize
    This program removes all xml tags from the given .xml or .html file to produce a text file that can be processed by Unitex.
    Unxmlize [OPTIONS] <file>
    OPTIONS:
    • -o TXT/--output=TXT: output file. By default, foo.xml => foo.txt;
    • --output_offsets=XXX: specifies the offset file to be produced;
    [p. 289, 13.50 XMLizer]
    • --PRLG=XXX: extracts to file XXX special information used in the PRLG project on ancient Greek (requires --output_offsets);
    • -t/--html: consider the file as an html file (disregard extension);
    • -x/--xml: consider the file as an xml file (disregard extension);
    • -l/--toler
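To make the idea behind Unxmlize concrete, here is a rough sketch of tag removal using Python's standard `html.parser` module. It is an illustration only and does not reproduce Unxmlize: the names `TagStripper` and `strip_tags` are hypothetical, no offset file is produced, and the contents of script or style elements would be kept as text.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect only the character data found between tags."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(markup):
    """Return the text content of an XML/HTML string (hypothetical helper)."""
    parser = TagStripper()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)
```

Calling `strip_tags` on a small HTML fragment returns the running text with tags removed and character references decoded.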
12. [64, 83] based on the following principle: every verb has an almost unique set of syntactical properties. Due to this fact, these properties need to be systematically described, since it is impossible to predict the exact behavior of a verb. These descriptions are represented by matrices where rows correspond to verbs and columns to syntactical properties. The considered properties are formal properties such as the number and nature of allowed complements of the verb and the different transformations the verb can undergo (passivization, nominalisation, extraposition, etc.). The matrices, or tables, are mostly binary: a + sign occurs at the intersection of a row and a column of a property if the verb has that property, a - sign if not. More information can be found at http://infolingu.univ-mlv.fr, including some lexicon-grammar tables that you can freely download. This type of description has also been applied to adjectives [67], predicative nouns [33, 24, 32, 39, 80], adverbs [45, 69], as well as frozen expressions, in many languages [114, 26, 27, 73, 74, 77, 87, 88, 89, 81, 78, 46]. Figure 9.1 shows an example of a lexicon-grammar table. The table contains verbs that, among other definitional properties, do not admit passivization. [pp. 195-196, Chapter 9: Lexicon-grammar] [Screenshot residue: spreadsheet window showing the table; text not recoverable.]
13. [Screenshot residue: Preferences window with the options Analyze this language char by char, Enable morphological use of space, Semitic language, Right to left rendering for text, Right to left rendering for graphs, Text Font, Concordance Font, Html Viewer and Graph configuration.] Figure 4.7: Selection of a web browser for displaying concordances. [p. 84, Chapter 4: Searching with regular expressions] [Figure residue: rotated concordance page; text not recoverable.]
14. Figure 10.9: Displaying matched sentences and sentences they are linked to. [p. 210, Chapter 10: Text alignment] Chapter 11: Compound word inflection. MULTIFLEX is a multi-lingual, Unicode-compatible platform for automatic inflection of multi-word units (MWUs), also known as compound words. It is meant in particular for the creation of morphological dictionaries of MWUs. It implements a unification-based formalism [85] for the description of inflectional behavior of MWUs, which supposes the existence of a module for the inflectional morphology of simple words. In this chapter, we present the notion of multi-word unit and we describe the method to inflect them with MULTIFLEX. This chapter is derived from the MULTIFLEX manual written by Agata Savary, the author of MULTIFLEX. 11.1 Multi-Word Units. Multi-word units (MWUs) encompass a bunch of hard-to-define and controversial linguistic objects (cf. [52, 18]). Their numerous linguistic and pragmatic definitions [5, 22, 65, 4, 56, 3, 86, 37, 13] invoke three major points:
    • they are composed of two or more words;
    • they show some degree of morphological, distributional or semantic non-compositionality;
    • they have unique and constant references.
    However, the basic notions (a word, a reference, the non-compositionality) and measures (degree of non-compositionality) used in those definitions are themselves
15. [Table of contents excerpt; entries whose titles were destroyed in extraction are marked.]
    [...] MWU [...] 217
    [title not recoverable] 222
    11.3.1 Complete Example in English 223
    11.3.2 Complete Example in French 226
    11.3.3 Complete Example in Serbian 229
    12 Cascade of Transducers 239
    12.1 Applying a cascade of transducers with CasSys 239
    12.1.1 Creating the list of transducers 239
    12.1.2 [title garbled: "Balance list or transducers"] 240
    12.1.3 Applying a cascade 242
    12.1.4 Displaying the results of a cascade 242
    12.1.5 Sharing a cascade transducer list file 244
    12.2 [title not recoverable] 244
    12.2.1 [title not recoverable] 244
    12.2.2 Apply while concordance behaviour 244
    12.2.3 An xml-like output text for lexical tags 245
    12.2.4 The Unitex rules used for the cascade 246
    12.2.5 A special way to mark up patterns with CasSys 246
    [p. 8, Contents]
    12.2.6 Interest of a cascade of transducer 248
    12.2.7 The longest pattern 248
    12.2.8 Files resulting from CasSys 249
    13 Use of external programs 251
    13.1 Creating [rest of title not recoverable] 252
    13.2 The console 252
    13.3 [title not recoverable]
16. [...] @% will be replaced by the line number, which guarantees that each graph name will be unique. For example, if the main graph is called TestGraph.grf and if subgraphs are called TestGraph_@%.grf, the graph generated from the 16th line of the table will be named TestGraph_0016.grf. Figures 9.8 and 9.9 show two graphs generated by applying the parameterized graph of figure 9.3 to table 31H. [p. 200, Chapter 9: Lexicon-grammar] Figure 9.7: Configuration of the automatic generation of graphs (fields: Reference Graph in GRF format, Resulting GRF grammar, Name of produced subgraphs). Figure 9.10 shows the resulting main graph. Figure 9.8: Graph generated for the verb archaïser. Figure 9.9: Graph generated for the verb badauder. [p. 201, 9.2 Conversion of a table into graphs] Figure 9.10: Main graph referring to all the generated graphs (TestGraph_0119 to TestGraph_0131 are visible). [p. 202, Chapter 9: Lexicon-grammar]
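The naming scheme above can be illustrated with a small helper that builds the subgraph name from a line number. This is a sketch under stated assumptions: the function `subgraph_name` is hypothetical, and the four-digit zero padding is inferred from the examples TestGraph_0016 and TestGraph_0119 rather than from a documented rule.

```python
def subgraph_name(prefix, line_number, extension=".grf", width=4):
    """Build a unique subgraph name from a table line number.

    prefix:      the part of the name before the line number, e.g. "TestGraph_"
    line_number: 1-based line of the lexicon-grammar table
    width:       zero-padding width (assumed from the manual's examples)
    """
    return f"{prefix}{line_number:0{width}d}{extension}"
```

Because every table line has a distinct number, two lines can never produce the same graph name, which is the uniqueness guarantee the text refers to.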
17. [22] Pamela DOWNING. On the Creation and Use of English Compound Nouns. In Proceedings of CICLING 2002, volume 53, pages 810-842. Linguistic Society of America, 1977. (Cited in 11.1.)
    [23] Dana-Marina DUMITRIU and Sébastien PAUMIER. Requêtes linguistiques sur alignements multilingues. In Directia Terminologie si Inginerie Lingvistica DTIL'08, February 2008. ISBN 978-9-291220-37-3. (Cited in 10.)
    [24] Inkscape. Vector Graphics Editor. http://www.inkscape.org. (Cited in 5.4.1.)
    [25] Samuel ELEUTERIO, Elisabete RANCHHOD, Helena FREIRE, and Jorge BAPTISTA. A system of electronic dictionaries of Portuguese. Lingvisticae Investigationes, 19(1):57-82, 1995. Amsterdam-Philadelphia: John Benjamins Publishing Company. (Cited in 3.8.)
    [p. 343, Bibliography]
    [26] Anibale ELIA. Le verbe italien. Les complétives dans les phrases à un complément. Schena/Nizet, Fasano/Paris, 1984. (Cited in 9.1.)
    [27] Anibale ELIA. Lessico-grammatica dei verbi italiani a completiva. Tavole e indice generale. Liguori, Napoli, 1984. (Cited in 9.1.)
    [28] Anibale ELIA and Simoneta VIETRI. Electronic dictionaries and linguistic analysis of Italian large corpora. In Actes des 5es Journées internationales d'Analyse statistique des Données Textuelles. École Polytechnique fédérale de Lausanne, 2000. (Cited in 3.8.)
    [29] Anibale ELIA and Simoneta VIETRI. L'analisi automatica dei testi e i dizionari elettronici. In E. Burattini and R. Cordeschi, editors, Manuale di Intelligenza Artificiale per le Scienze Umane. Roma: Carocci, 2002. (Cited in 3.8.)
    [30] Na
18. [Table of contents excerpt; entries whose titles were destroyed in extraction are marked.]
    [title not recoverable] 129
    6.4.1 [title not recoverable] 129
    6.4.2 [title not recoverable] 129
    6.4.3 Morphological dictionaries 130
    6.4.4 Dictionary entry variables 131
    6.5 Exploring grammar paths 132
    6.6 [title not recoverable] 134
    6.7 Rules for applying transducers 135
    6.7.1 Insertion to the left of the matched pattern 135
    6.7.2 Application while advancing through the text 136
    6.7.3 Priority of the leftmost match 136
    [p. 6, Contents]
    6.7.4 Priority of the longest match 137
    6.7.5 Transducer outputs with variables 137
    6.8 [title not recoverable] 141
    6.9 Operations on variables 142
    6.9.1 Testing variables 142
    6.9.2 [first word not recoverable] variables 143
    6.10 Applying graphs to texts 143
    6.10.1 Configuration of the search 143
    6.10.2 Advanced search options 145
    6.10.3 Concordance 148
    6.10.4 Modification of the text 149
    6.10.5 Extracting occurrences 150
    6.10.6 Comparing concordance 150
    6.10.7 Debug mode 151
    7 Text automaton 1
19. Chapter 10: Text alignment. The principle of text alignment is simple: aligning two or more texts, one supposed to be the source and the other(s) supposed to be its translation(s). The alignment is made at the sentence level, because word alignment is not possible yet, and certainly not relevant. Then, one can look for an expression A in one of the texts and look for its translations in the sentences aligned with those containing occurrences of A. To include such a functionality into Unitex, Patrick Watrin integrated the Open Source text alignment tool XAlign, developed at the LORIA [66]. In this chapter, we will explain how to use the alignment module. The reader interested in details about the integration of XAlign can consult [23], or [75] and [91] for an illustration of what can be done with this module. 10.1 Loading texts. First, you need to select your 2 texts. To do that, go into XAlign > Open files, and you will see the frame shown on Figure 10.1. You can provide texts under two formats: raw Unicode text, as you do for your corpus, or TEI-encoded texts (an XML format, see [54]). In the last text field, you can select an XML alignment file, if you have already built one. If you select a raw text, Unitex will need to build a basic TEI version of it (for more details, see section 13.50 about the XMLizer program). So, when you click on OK, you will be asked to provide an XML file name, as shown on Figure 10.2. Then, Unitex builds the XML
20. Figure 4.3: Error message when searching for the empty string.
    • <<ss>>: contains ss;
    • <<^a>>: begins with a;
    • <<ez$>>: ends with ez;
    • <<a.s>>: contains a, followed by any character, followed by s;
    • <<a.*s>>: contains a, followed by a sequence of any characters, followed by s;
    • <<ss|tt>>: contains ss or tt;
    • <<[aeiouy]>>: contains a non-accentuated vowel;
    • <<[aeiouy]{3,5}>>: contains a sequence of non-accentuated vowels whose length is between 3 and 5;
    • <<es?>>: contains e followed by an optional s;
    • <<ss[^e]?>>: contains ss followed by an optional character which is not e.
    It is possible to combine these elementary filters to form more complex filters:
    • <<[ai]ble$>>: ends with able or ible;
    • <<^(anti|pro)-?>>: begins with anti or pro, followed by an optional dash;
    • <<^([rst][aeiouy]){2,}$>>: a word formed by 2 or more sequences beginning with r, s or t, followed by a non-accentuated vowel;
    [p. 79, 4.8 Search]
    • <<^([^l]|l[^e])>>: does not begin with l, unless the second letter is not an e; in other words, any word except the ones starting with le. Such constraints are better described using contexts (see section 6.3).
    By default, a morphological filter alone is regarded as applying to the lexical mask <TOKEN>, which means any token except space and {STOP}. On the other hand, when a filter follow
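The filters listed above are ordinary regular expressions, so their behaviour can be explored with any regex engine. The sketch below uses Python's `re` module as a stand-in; Unitex's own engine may differ in details, and the helper `matches` with its `respect_case` flag is a hypothetical name that only imitates the default case-insensitive behaviour described in the manual.

```python
import re

# A few of the filters from the list, without the surrounding << >>.
FILTERS = {
    "contains ss": r"ss",
    "begins with a": r"^a",
    "ends with ez": r"ez$",
    "ends with able or ible": r"[ai]ble$",
    "begins with anti or pro, optional dash": r"^(anti|pro)-?",
    "does not start with le": r"^([^l]|l[^e])",
}


def matches(pattern, word, respect_case=False):
    """Return True if the filter pattern matches somewhere in the word."""
    flags = 0 if respect_case else re.IGNORECASE
    return re.search(pattern, word, flags) is not None
```

For instance, the last filter accepts "lit" and "arbre" but rejects "lettre", and a filter such as `^b` accepts "Bonjour" unless case is respected.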
21. [...] courante inflect in the same way, in the sense that in both cases we need to put the first and the last constituent in the plural in order to obtain the plural form of the whole MWU. That is why another type of instantiation for unification variables has been introduced. It is marked by a double equal sign, as opposed to the single equal sign (as for n on Figure 11.5). If a unification variable is assigned to a category by this symbol, then it inherits the value of this category from the corresponding constituent as it appears in the lemma of the MWU. For instance, Figure 11.6 contains a graph describing the inflected forms for both masculine and feminine French compounds of types Noun Noun and Noun Adjective. Its first box contains the double assignment of the gender to variable g, which means that this variable has its value fixed to the gender value of the first constituent. For bateau mouche it is fixed to masculine, because bateau is masculine, while for main courante it is fixed to feminine. Figure 11.6: Inflection graph for bateau mouche with two types of instantiation. Note that the double assignment, contrary to the single assignment, no longer means that the variable is to be instantiated to all values of the corresponding category domain. It has a unique value all through the path on which it appears, even if it is concerned by another single assignment somewhere else on t
22. • <^> recognizes a newline;
    • # prohibits the presence of a space.
    By default, the space is optional between two boxes. If you want to prohibit the presence of the space, you have to use the special character #. Conversely, if you want to force the presence of the space, you must use the sequence " ". Lower and upper case letters are defined by an alphabet file (see chapter 14). For more details on grammars, see chapter 5. For more information about sentence boundary detection, see [21]. The grammar used here is named Sentence.fst2 and can be found in the following directory: (user home directory)/(language)/Graphs/Preprocessing/Sentence. This grammar is applied to a text with the Fst2Txt program in MERGE mode. This has the effect that the output produced by the grammar, in this case the symbol {S}, is inserted into the text. This program takes a .snt file and modifies it. [p. 39, 2.5 Preprocessing a text] [Figure residue: the Sentence graph, whose French comments read "Placement of sentence boundary marks {S}", "General case", "Punctuation (parentheses, brackets)", "Punctuation followed by special cases (acronyms, names, symbols)", "Acronyms, first names, anthroponyms", "Compound words or words followed by an upper case letter, symbols" and "Special cases: abbreviations"; graph credited to Nathalie Friburger (LI, Tours) and Anne Dister (Un
23. • File Name: display of the graph name in the lower left corner of the graph;
    • Pathname: display of the graph name along with its complete path in the lower left corner of the graph. This option only has an effect if the option File Name is selected;
    • Frame: draw a frame around the graph;
    • Right to Left: invert the reading direction of the graph (see an example in figure 5.33).
    You can reset the parameters to the default ones by clicking on Default. If you click on OK, only the current graph will be modified. In order to modify the preferences for a language as a default, click on Preferences in the Info menu and click on the Graph configuration button in the Language & Presentation tab. [p. 111, 5.4 Exporting graphs] Figure 5.33: Graph with reading direction set to right to left.
    5.4 Exporting graphs
    5.4.1 Inserting a graph into a document
    In order to include a graph into a document, you have to convert it to an image. To do this, save your graph as a PNG image. Click on Save as in the FSGraph menu and select the PNG file format. You will get an image ready to be inserted into a document or to be edited with an image editor. You should activate antialiasing for the graph that interests you (this is not obligatory, but results in a better image quality). Another solution consists of making a screenshot. On Windows: press Print Screen on your keyboard. This key should be next to the F12 k
24. • test_info/std_out.txt: content of standard console output;
    • test_info/std_err.txt: content of error console output;
    • src/xxx: a copy of a file read by the tool, needed to run the log again;
    • dest/xxx: a copy of a file created by the tool.
    If the second line of unitex_logging_parameters.txt contains 0, these files are not recorded; if this line contains 1, they are recorded.
    14.13.9 Arabic typographic rules: arabic_typo_rules.txt
    For Arabic, dictionary lookups can be parameterized with a file that describes whether some typographic variations are allowed or not. This file is made of lines like the following:
    fatha omission=YES
    where fatha omission is the name of the rule. For a complete description of all the available rules, you have to consult the Arabic.h file in the program sources.
    Appendix A: GNU Lesser General Public License
    This license can also be found in [35].
    GNU LESSER GENERAL PUBLIC LICENSE, Version 2.1, February 1999
    Copyright (C) 1991, 1999 Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA. Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed. [This is the first released version of the Lesser GPL. It also counts as the successor of the GNU Library Public License, version 2, hence the version number 2.1.]
    Preamble
    The licenses for most software are designed to take away your freedom to share and change it
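A rule file of the kind described for arabic_typo_rules.txt could be read as shown below. This is a sketch only: it assumes one `name=YES` or `name=NO` entry per line, the function `parse_typo_rules` is hypothetical, and the second rule name used in the usage note is invented for illustration (the real rule names are listed in Arabic.h).

```python
def parse_typo_rules(text):
    """Parse 'rule name=YES/NO' lines into a dict of booleans.

    Blank lines and lines without '=' are skipped. The exact file syntax
    is an assumption made for this example.
    """
    rules = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        name, value = line.split("=", 1)
        rules[name.strip()] = value.strip().upper() == "YES"
    return rules
```

Given the text "fatha omission=YES" followed by a line such as "some other rule=NO", the result maps the first name to True and the second to False.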
25. [...]piled in a .rul file. This operation is carried out via the Elag Rules command in the Text menu, which opens the window shown in figure 7.16. If the frame on the right already contains grammars which you don't wish to use, you can withdraw them with the [...] button. Then select your grammar(s) in the file explorer located [p. 165, 7.3 Resolving lexical ambiguities with ELAG] Figure 7.14: Use of the synchronization point (graph comment: a dash followed by il, elle or on must be preceded by a verb). Figure 7.15: Result of the application of the grammar in figure 7.14 (example sentence: "Est-il gentil ?"). [p. 166, Chapter 7: Text automaton] Figure 7.16: ELAG grammars compilation frame. in the left frame, and click on the [...] button to add them to the list in the right frame. Then click on the Compile button. This will launch the ElagComp program, which will co
26. 7300 6900 6F00 6E00 0D00 0A00
Table 14.1: Hexadecimal representation of a Unicode Little-Endian text
Here is its representation in Unicode Big-Endian:
BOM header: FEFF
U n i t e x ¶ β: 0055 006E 0069 0074 0065 0078 000D 000A 03B2
- v e r s i o n ¶: 002D 0076 0065 0072 0073 0069 006F 006E 000D 000A
Table 14.2: Hexadecimal representation of a Unicode Big-Endian text
Here is its representation in Unicode UTF-8:
BOM header: EF BB BF
U n i t e x ¶ β: 55 6E 69 74 65 78 0D 0A CE B2
- v e r s i o n ¶: 2D 76 65 72 73 69 6F 6E 0D 0A
Table 14.3: Hexadecimal representation of a Unicode UTF-8 text
On Unicode Little-Endian, the hi-bytes and lo-bytes have been reversed, which explains why the start character is encoded as FF FE instead of FE FF, and why 00 0D and 00 0A are 0D 00 and 0A 00 respectively. 14.2 Alphabet files There are two kinds of alphabet files: a file which defines the characters of a language and a file that indicates the sorting preferences. The first is designated by the name alphabet, the second by the name sorted alphabet. 14.2.1 Alphabet The alphabet file is a text file that describes all characters of a language, as well as the correspondences between capitalized and non-capitalized letters. This file is called Alph
27. By choosing the inflectional grammar names carefully, one can construct a ready-to-use dictionary. Figure 3.9 shows the dictionary we get after the inflection of our DELAS example. Semantic codes: In some languages, there are inflectional features that actually correspond to semantic ones, like for instance markers for the passive form. Such codes may not appear as inflectional ones, but rather as semantic ones. To do that and produce semantic codes, you have to insert a plus sign at the beginning of the output of a box. The box must only contain the semantic code preceded by a plus, as shown on Figure 3.10. Figure 3.10: An inflection grammar with a semantic code (the figure contrasts an invalid path with a good path). 3.5.2 Inflection of compound words See chapter 11. 3.5.3 Inflection of semitic languages Semitic languages like Arabic or Hebrew are not inflected in the same way as other kinds of languages, since their morphology obeys a different logic. In fact, in such languages, words are inflected according to consonant skeletons. A lemma is made of consonants and the inflection process is supposed to enrich this skeleton with vowels. First, let us see what a semitic entry is supposed to be: ktb,$V31-123 The $ sign before the grammatical code indicates that this is a semitic entry, and the lemma, here ktb, is the consonant skeleton. Figure 3.11 shows the toy grammar V31-12
28. By contrast, the GNU General Public Licenses are intended to guarantee your freedom to share and change free software, to make sure the software is free for all its users. This license, the Lesser General Public License, applies to some specially designated software packages (typically libraries) of the Free Software Foundation and other authors who decide to use it. You can use it too, but we suggest you first think carefully about whether this license or the ordinary General Public License is the better strategy to use in any particular case, based on the explanations below. When we speak of free software, we are referring to freedom of use, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for this service if you wish); that you receive source code or can get it if you want it; that you can change the software and use pieces of it in new free programs; and that you are informed that you can do these things. To protect your rights, we need to make restrictions that forbid distributors to deny you these rights or to ask you to surrender these rights. These restrictions translate to certain responsibilities for you if you distribute copies of the library or if you modify it. For example, if you distribute copies of the library, whether gratis or for a fee, you must give the recipients all the rights that we gave you. Y
29. If this rule is applied to the three occurrences of the preceding concordance, the occurrence "in ancient" overlaps with "ancient times". The first is retained because it is the leftmost occurrence, and "ancient times" is eliminated. The following occurrence, "times a", is no longer in conflict with "ancient times" and can therefore appear in the result: Don there extended in ancient times a large forest. The rule of priority of the leftmost match is applied only when the text is modified, be it during preprocessing or after the application of a syntactic graph (cf. section 6.10.4). 6.7.4 Priority of the longest match During the application of a syntactic graph, it is possible to choose whether priority should be given to the shortest or the longest sequences, or whether all sequences should be retained. During preprocessing, priority is always given to the longest sequences. 6.7.5 Transducer outputs with variables As we have seen in section 5.2.5, it is possible to use variables to store some text that has been analyzed by a grammar. These variables can be used in preprocessing graphs and in syntactic graphs. You have to give names to the variables you use. These names can contain unaccented lower case and upper case letters (a to z and A to Z), digits and the character _ (underscore). In order to define the bounds of the zone to be stored in a variable, you have to create two boxes that contain the name of the variable enc
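The leftmost-match rule described in this excerpt can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used by Unitex: matches are represented as hypothetical (start, end) token spans with an exclusive end, and the function name keep_leftmost is invented for the example.

```python
def keep_leftmost(matches):
    """Resolve overlapping matches by keeping the leftmost one.

    `matches` is a list of (start, end) token spans, end exclusive.
    This is an illustrative sketch, not the Unitex implementation.
    """
    kept = []
    last_end = 0
    for start, end in sorted(matches):
        if start >= last_end:  # no overlap with the last retained match
            kept.append((start, end))
            last_end = end
    return kept


tokens = "Don there extended in ancient times a large forest".split()
# Token spans for "in ancient", "ancient times" and "times a"
spans = [(3, 5), (4, 6), (5, 7)]
for start, end in keep_leftmost(spans):
    print(" ".join(tokens[start:end]))
```

With these spans, "ancient times" is dropped because it overlaps the retained "in ancient", which in turn frees "times a" to be kept.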
30. Inflection graph NC_XXNs for English MWUs 226 CHAPTER 11 COMPOUND WORD INFLECTION e g head of government lt Nb p gt Figure 11 18 Inflection graph NC_NofNs for English MWUs lt Nb n gt Figure 11 20 Inflection graph NC Ns N for English MWUs 11 3 2 Complete Example in French Let us assume that the description of morphological features of French is given by the fol lowing Morphology txt file French CATEGORIES Nb s p Gen m f lt CLASSES gt noun Nb lt var gt Gen lt var gt adj Nb lt var gt Gen lt var gt adv and that the equivalences between these features and their corresponding codes in DELA 11 3 INTEGRATION IN UNITEX 227 dictionaries are given by the following Equivalences txt file French s Nb s p Nb p m Gen m f Gen f Consider the following sample French DELAC file the DELAS inflection codes may vary from those present in UNITEX avant garde garde N21 fs NC XXN bateau bateau N3 ms mouche mouche N21 fs NC NN caf caf N1 ms au lait NC_NXXXX carte carte N21 fs postale postal A8 fs NC NN cousin cousin N8 ms germain germain A8 ms NC NNmf franc franc A47 ms ma on ma on N41 ms NC_ANI m moire m moire N21 fs vive vif A48 fs NC_NN microscope microscope Nl ms porte serviette serviette N21 effet tunnel NC_NXXXXXX fs NC_VNm The corresponding inflection graphs for MWUs are shown on f
31. Figure 5.22: Inverting month and year in a date The default behavior of Locate and LocateTfst is to consider variables that have not been defined as being empty. You can modify this behavior as shown in section 6.10.2. Moreover, it is possible to test whether a variable has been defined or not, as shown in section 6.7.5. 5.2.6 Copying lists It can be practical to perform a copy-paste operation on a list of words or expressions from a text editor to a box in a graph. In order to avoid having to copy every term manually, Unitex provides a means to copy lists. To use this, select the list in your text editor and copy it using <Ctrl+C> or the copy function integrated in your editor. Then create a box in your graph and press <Ctrl+V>, or use the Paste command in the Edit menu, to paste it into the box. A window as in Figure 5.23 opens. Figure 5.23: Selecting a context for copying a list This window allows you to define the left and right contexts that will automatically be used for each term of the list. By default, these contexts are empty. If you use the contexts < and .V> with the following list: eat, sleep, drink, play, read, you will get the box in figure 5.24: <eat.V> <sleep.V> drink V
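The effect of the left and right contexts on a pasted list can be illustrated with a short Python sketch. The function name apply_contexts is hypothetical, and the contexts `<` and `.V>` are an assumption about what the garbled excerpt intended; the sketch only shows the string wrapping, not how the graph editor builds the box.

```python
def apply_contexts(terms, left="", right=""):
    """Wrap every term of a list with the same left and right context."""
    return [f"{left}{term}{right}" for term in terms]


words = ["eat", "sleep", "drink", "play", "read"]
for line in apply_contexts(words, left="<", right=".V>"):
    print(line)
```

With empty contexts (the default), the list is returned unchanged, which matches the default behavior described above.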
32. The formats of the DELAS and DELAF dictionaries have already been presented in sections 3.1.1 and 3.1.2. NOTE: In this chapter, the symbol ¶ represents the newline symbol. Unless otherwise indicated, all text files described in this chapter are encoded in Unicode Little-Endian. 14.1 Unicode encoding By default, text files processed by Unitex have to be encoded in Unicode Little-Endian. Unitex also accepts Unicode Big-Endian or UTF-8 files. The default encoding allows the representation of 65536 characters by coding each of them in 2 bytes. In Little-Endian, the bytes are in lo-byte hi-byte order. If this order is reversed, we speak of Big-Endian. A text file encoded in Unicode Little-Endian, Big-Endian or UTF-8 starts with the special character Unicode Byte Order Mark (BOM), with the hexadecimal value FF FE (Little-Endian), FE FF (Big-Endian) or EF BB BF (UTF-8). Because UTF-8 has no byte order, adding a UTF-8 BOM is optional; for UTF-16 it is required. The newline symbols have to be encoded by the two characters 0D 00 and 0A 00 (Little-Endian), 00 0D and 00 0A (Big-Endian), or 0D and 0A (UTF-8). Consider the following text: Unitex¶β-version¶ Here is its representation in Unicode Little-Endian:
BOM header: FF FE
U n i t e x ¶ β: 5500 6E00 6900 7400 6500 7800 0D00 0A00 B203
- v e r s i o n ¶: 2D00 7600 6500 7200
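The byte layouts described in this excerpt can be reproduced with Python's standard codecs module. This is a minimal sketch, assuming the sample text is "Unitex", a newline, "β-version" and a newline, with each newline written as the pair 0D 0A; the helper names hexdump and sniff_encoding are invented for the illustration.

```python
import codecs

# Sample text: beta is U+03B2, and each newline is the pair CR LF (0D 0A).
text = "Unitex\r\n\u03b2-version\r\n"


def hexdump(data):
    """Render bytes as space-separated uppercase hexadecimal pairs."""
    return " ".join(f"{byte:02X}" for byte in data)


def sniff_encoding(data):
    """Guess the encoding of a byte string from its Byte Order Mark."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return None  # no BOM found


little = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
big = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
utf8 = codecs.BOM_UTF8 + text.encode("utf-8")

for name, data in (("Little-Endian", little), ("Big-Endian", big), ("UTF-8", utf8)):
    print(name, sniff_encoding(data), hexdump(data))
```

Comparing the three dumps shows the byte swap between Little-Endian and Big-Endian, and the two-byte UTF-8 sequence used for β.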
33. The standard separators are the space, the tab and the newline characters. There can be several separators following each other, but since this isn't useful for linguistic analyses, separators are normalized according to the following rules: • a sequence of separators that contains at least one newline is replaced by a single newline; • all other sequences of separators are replaced by a single space. The distinction between space and newline is maintained at this point because the presence of newlines may have an effect on the process of splitting the text into sentences. The result of the normalization of a text named my_text.txt is a file in the same directory as the .txt file, named my_text.snt. NOTE: When the text is preprocessed using the graphical interface, a directory named my_text_snt is created immediately after normalization. This directory, called text directory, contains all the data associated with this text. 2.5.2 Splitting into sentences Splitting texts into sentences is an important preprocessing step, since this helps in determining the units for linguistic processing. The splitting is used by the text automaton construction program. In contrast to what one might think, detecting sentence boundaries is not a trivial problem. Consider the following text: The family has urgently called Dr. Martin. The full stop that follows Dr is followed by a word beginning with a capital letter. Thus it may be considere
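The two normalization rules for separators can be sketched with a regular expression. This is a minimal illustration, assuming that newlines are represented by "\n" and that only space, tab and newline count as separators; it is not the code of the Normalize program.

```python
import re

# A run of one or more separators: space, tab or newline.
_SEPARATORS = re.compile(r"[ \t\n]+")


def normalize_separators(text):
    """Collapse each run of separators into one newline or one space."""
    def collapse(match):
        # A run containing at least one newline becomes a single newline;
        # every other run becomes a single space.
        return "\n" if "\n" in match.group(0) else " "

    return _SEPARATORS.sub(collapse, text)


print(repr(normalize_separators("a  \t b \n\n c")))
```

Keeping the newline in the first rule preserves the information that the sentence splitter may rely on later.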
34. and is quite different from the ordinary General Public License. We use this license for certain libraries in order to permit linking those libraries into non-free programs. When a program is linked with a library, whether statically or using a shared library, the combination of the two is legally speaking a combined work, a derivative of the original library. The ordinary General Public License therefore permits such linking only if the entire combination fits its criteria of freedom. The Lesser General Public License permits more lax criteria for linking other code with the library. We call this license the "Lesser" General Public License because it does Less to protect the user's freedom than the ordinary General Public License. It also provides other free software developers Less of an advantage over competing non-free programs. These disadvantages are the reason we use the ordinary General Public License for many libraries. However, the Lesser license provides advantages in certain special circumstances. For example, on rare occasions, there may be a special need to encourage the widest possible use of a certain library, so that it becomes a de-facto standard. To achieve this, non-free programs must be allowed to use the library. A more frequent case is that a free library does the same job as widely used non-free libraries. In this case, there is little to gain by limiting the free library to free software only, so we use the Lesser General
35. • Some Unitex programs can run without memory leak when compiled with the preprocessor macro UNITEX_RELEASE_MEMORY_AT_EXIT or UNITEX_LIBRARY. This includes Concord, Convert, Dico, Elag, Evamb, Extract, Flatten, Fst2Txt, Grf2Fst2, ImplodeTfst, LocateTfst, MultiFlex, Normalize, PolyLex, Reg2Grf, SortTxt, TagsetNormTfst, TEI2Txt, Tfst2Grf, Tfst2Unambig, Tokenize, Txt2Tfst and XMLizer. Unitex code is thread-safe. Encoding of Unicode text files can be specified (13.4). Introduction of UnitexTool, which can be used to launch several Unitex programs in one command to make scripting easier (13.47). Introduction of UnitexToolLogger, which can be used to create and replay logs of Unitex program executions (13.48). Introduction of Seq2Grf, which automatically produces a local grammar from raw text or a TEILite XML document (8). You can look up a word in an open dictionary or in several dictionaries from the User or System resources (3.2). • You can access the subgraphs called in the current graph, or the graphs in which the current graph is a subgraph (5.2.2), automatically or manually reload the latest version on disk of the current graph, compare two versions of the same graph, and insert context delimiters in the new extended toolbar (5.2.8). • If you're using a Macintosh device, you can use every shortcut involving the Ctrl key by pressing the Command key instead. Introduction of a contextual menu for graph edition, accessible by right-clicking in the grap
36. • if there is an upper case letter in the dictionary, then an upper case letter has to be in the text; • if a lower case letter is in the dictionary, there can be either an upper or lower case letter in the text. Thus, the entry peter,.N:fs will match the words peter, Peter and PETER, while Peter,.N+firstName only recognizes Peter and PETER. Lower and upper case letters are defined in the alphabet file passed to the Dico program as a parameter. Respecting white space is a very simple rule: for each sequence in the text to be recognized by a dictionary entry, it has to have exactly the same number of spaces. For example, if the dictionary contains aujourd'hui,.ADV, the sequence Aujourd' hui will not be recognized because of the space that follows the apostrophe. 3.7.3 Dictionary graphs The Dico program can also apply dictionary graphs. Dictionary graphs conform to the following rule: if applied by Locate in MERGE mode, they must produce output sequences that are valid DELAF lines. Figure 3.14 shows a graph that recognizes chemical elements. We can observe a first advantage of graphs over usual dictionaries: we can force case with double quotes. Thus, this graph will correctly match Fe, but not FE, while this restriction cannot be specified in a normal DELAF. Another advantage of dictionary graphs is that they can use results given by previous dictionaries. Thus, it is possible to apply the standard dictionary and then
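The two case rules and the white space rule can be checked with a small Python sketch. It is only an approximation under a stated assumption: the real Dico program takes its case equivalences from the alphabet file, whereas this sketch uses Python's built-in str.isupper and str.upper as a stand-in. The function name case_compatible is invented for the example.

```python
def case_compatible(dic_form, text_form):
    """Tell whether a text sequence may be matched by a dictionary form.

    Upper case in the dictionary requires the same upper case letter in
    the text; lower case in the dictionary accepts either case. All other
    characters, spaces included, must be identical.
    """
    if len(dic_form) != len(text_form):
        return False
    for d, t in zip(dic_form, text_form):
        if d.isupper():
            if t != d:
                return False
        elif t != d and t != d.upper():
            return False
    return True


for word in ("peter", "Peter", "PETER"):
    print(word, case_compatible("peter", word), case_compatible("Peter", word))
```

The length check is what rejects Aujourd' hui against aujourd'hui: the extra space makes the two sequences differ in length.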
37. for each of them the list of syntactic and inflectional codes compatible with it and a description of their possible combinations This description must be contained in a file called tagset def and placed in your personal folder in the Elag subfolder of the desired language tagset def file Here is an extract of the tagset def file used for French NAME french POS ADV POS PRO flex pers 123 genre nombre s diser Il DT th 170 CHAPTER 7 TEXT AUTOMATON Pind Pdem PpvIL PpvLUI PpvLE Ton PpvPR PronQ Dnom Ppossls subcat complete Pind genre n Pdem genre n Ppossls genre lt n Pposslp genre n Pposs2s genre n Pposs2p genre n Pposs3s genre n Pposs3p genre n PpvIL genre n PpvLE genre n PpvLUI genre n Ton genre n PpvPR PronQ Dnom POS A adjectifs flex genre m f nombre s cat gauche g droite d complete lt genre gt lt nombre gt POS V flex temps pers genre nombre complete lt pers gt lt pers gt lt pers gt lt pers gt Ei CX 763 SS od oen nombre nombre nombre nombre E 25 53 f p om om om om om om om om om om om om bre bre bre bre bre bre bre bre bre bre bre bre pers pers pers pers pour de bonne humeur A IJKPSTWYGKX Sse de dB db SHE
38. hungry as a wolf gladnim kao vukovi gladan kao vuk AC A3XN2 p6fgea hungry as a wolf gladnima kao vuk gladan kao vuk AC A3XN2 p6ngea hungry as a wolf gladnima kao vuci gladan kao vuk AC A3XN2 p6ngea hungry as a wolf gladnima kao vukovi gladan kao vuk AC A3XN2 p6ngea hungry as a wolf gladnim kao vuk gladan kao vuk AC A3XN2 p6ngea hungry as a wolf gladnim kao vuci gladan kao vuk AC_A3XN2 p6ngea hungry as a wolf gladnim kao vukovi gladan kao vuk AC A3XN2 p6ngea hungry as a wolf gladnima kao vuk gladan kao vuk AC A3XN2 p7mgea hungry as a wolf gladnima kao vuci gladan kao vuk AC A3XN2 p7mgea hungry as a wolf gladnima kao vukovi gladan kao vuk AC_A3XN2 p7mgea hungry as a wolf gladnim kao vuk gladan kao vuk AC A3XN2 p7mgea hungry as a wolf gladnim kao vuci gladan kao vuk AC A3XN2 p7mgea hungry as a wolf gladnim kao vukovi gladan kao vuk AC A3XN2 p7mgea hungry as a wolf gladnima kao vuk gladan kao vuk AC A3XN2 p7fgea hungry as a wolf gladnima kao vuci gladan kao vuk AC A3XN2 p7fgea hungry as a wolf gladnima kao vukovi gladan kao vuk AC A3XN2 p7fgea hungry as a wolf gladnim kao vuk gladan kao vuk AC A3XN2 p7fgea hungry as a wolf gladnim kao vuci gladan kao vuk AC A3XN2 p7fgea hungry as a wolf gladnim kao vukovi gladan kao vuk AC A3XN2 p7fgea hungry as a wolf gladnima kao vuk gladan kao vuk AC A3XN2 p7ngea hungry as a wolf gladnima kao vuci gladan kao vuk AC A3XN2 p7ngea hungry as a wolf gladnima kao vukovi gladan kao vuk AC A3XN2 p7ngea hun
39. in London</p> <p id="3">These meetings will be held at least <seg type="sequence">twice a month</seg></p> <p id="4">We will bring forward an amended proposal <seg type="sequence">as soon as possible</seg></p> <p id="5">We will have to decide <seg type="sequence">in the next few days</seg> how we take all this together</p> </body> </text> </TEI.2> Figure 8.3: TEILite 8.2 Usage In order to create a sequence automaton, click on Construct Sequence Automaton in the Text menu. You will then see the window shown in figure 8.4. This window allows you to set the parameters to produce a sequence automaton. You have to follow these three steps: • choose the sequence corpus, which can be a file whose format is one of the three described in the previous section. The file format is automatically detected according to the file extension; • set the specific options. Applying the beautifying algorithm will place each box so that the resulting graph is smaller and as easily readable as possible. The exact case matching will put literal tokens into braces in the graph so that the graph doesn't match tokens with the same letters but with case differences. You can set more options to produce a graph that allows approximate matching: you can set the number of jokers to be used to produce new sequences derived from the seque
40. in morphological mode you can extract information from a lexical tag contained in the text 132 CHAPTER 6 ADVANCED USE OF GRAPHS gn of Stephen e of Henry the nry the Second ond had scarce jection to the to the crown crown had now their ancient ost extent 5 ference of the Figure 6 34 Results of grammar of Figure 6 33 applied in MERGE mode HD gt 9 Inflected form a INFLECTED Lemma a LEMMAS Codes a CODES Figure 6 35 Using a morphological variable in normal mode automaton and capture it into a dictionary entry variable in a grammar In your grammar you have to set the output of a box with xxx where xxx is a valid variable name In the rest of the paths that contain the box you can use xxx as a dictionary entry variable in the same way as described above for the morphological mode If a semantic code is of the form xxx yyy you can query the attribute value xxx This matches the inflected form lemma or codes of the entry variables when the semantic code come from a dictionary or any other attribute value pair 6 5 Exploring grammar paths It is possible to generate the paths recognized by a grammar if they are in finite number for example to check that it correctly generates the expected forms For that open the main graph of your grammar and ensure that the graph window is the active window the active window has a blue title bar while the inactive windows have a gray title bar
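The idea of listing the paths recognized by a grammar, when they are in finite number, can be sketched as a depth-first walk over a labeled graph. This is a generic illustration and not the algorithm used by Unitex: the graph is a hypothetical dictionary mapping each state to a list of (label, target) pairs, and any cycle is conservatively treated as a sign that the paths cannot be enumerated.

```python
def enumerate_paths(graph, start, final):
    """List the label sequences leading from `start` to `final`.

    `graph` maps a state to a list of (label, target) pairs.
    Raises ValueError when a cycle is met, since the number of paths
    may then be infinite.
    """
    paths = []

    def walk(state, labels, visiting):
        if state == final:
            paths.append(" ".join(labels))
            return
        for label, target in graph.get(state, []):
            if target in visiting:
                raise ValueError("cycle found: the paths cannot be enumerated")
            walk(target, labels + [label], visiting | {target})

    walk(start, [], {start})
    return paths


toy_grammar = {
    0: [("the", 1), ("a", 1)],
    1: [("cat", 2), ("dog", 2)],
}
for path in enumerate_paths(toy_grammar, start=0, final=2):
    print(path)
```

Printing the generated sequences is a quick way to check that a grammar produces the expected forms and nothing else.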
41. logical system for single words In the Unitex interfaced version of MULTIFLEX we would generate the plural of royal due to the fact that its lemma is known as having the inflection code N1 represented on Figure 11 3 In an inflection paradigm of a MWU each constituent is accompanied only by those mor phological categories which it should inflect for The categories that remain unchanged don t have to be mentioned For instance in bateau mouche in French a Paris style river boat both noun constituents have their gender set but they inflect in number bateaux mouches That s why on Figure 11 4 containing the inflection graph for this MWU the cor responding boxes contain value assignments for number only Note that both constituents may or may not agree in gender here bateau is masculine while mouche is feminine 11 2 FORMALISM FOR THE COMPUTATIONAL MORPHOLOGY OF MWUS 219 AUS XD D P Figure 11 3 Inflection graph N1 for simple words inflecting like royal si lt 2 gt jse lt 1No p gt lt s2 gt 3o e g bateau mouche lt Gen m Nb p gt Figure 11 4 Inflection graph for MWUs inflection like bateau mouche Unification Variables An important feature of our formalism are unification variables They are introduced by the dollar sign followed by an identifier which may contain any number of characters e g 21 num_10 c etc For example Figure 11 5 shows a graph roughly equivalent to the o
42. lt tfst gt This program normalizes the specified t st text automaton according to a tagset description file discarding undeclared dictionary codes and incoherent lexical en tries Inflectional features are unfactorized so that rouge A fs ms will be di vided into the 2 tags rouge A s and rouge A ms OPTIONS e o OUT output OUT output text automaton By default the input text automaton is modified e t TAGSET tagset TAGSET name of the tagset description file 13 39 TEI2Txt TEI2Txt OPTIONS xml Produces a raw text file from the given lt xm1 gt TEI file OPTIONS e o TXT output TXT name of the output text file By default the output file has the same name than the input one replacing xml by txt 13 40 Tfst2Grf Tfst2Grf OPTIONS lt tfst gt This program extracts a sentence automaton in grf format from the given text automaton OPTIONS e s N sentence N the number of the sentence to be extracted 282 CHAPTER 13 USE OF EXTERNAL PROGRAMS e o XXX output XXX pattern used to name output files XXX gr f XXX txt and XXX tok default cursentence e f FONT font FONT sets the font to be used in the output gr default Times new Roman e z N fontsize n sets the font size default 10 The program produces the following files and saves them in the directory of the text e cursentence grf graph representing the automaton of the sentenc
43. matches all entries having be as canonical form and the grammatical code V e V matches all entries having the grammatical code V This pattern is as ambiguous as the first one To remove the ambiguity you can use either lt V or lt V gt e am be V or am be V matches all the entries having am as inflected form be as canonical form and the grammatical code V This kind of lexical mask is only of in terest if applied to the text automaton where all the ambiguity of the words is explicit While executing a search on the text that lexical mask matches the same as the simple token am 4 3 3 Grammatical and semantic constraints The references to dictionary information be V in these examples are basic It is possible to express more complex lexical masks by using several grammatical or semantic codes sepa rated by the character An entry of the dictionary is then only found if it has all the codes that are present in the mask The mask lt N z1 gt thus recognizes the entries broderies broderie Nt zl fp capitales europ ennes capitale europ enne N NA Conc HumColl zl fp but not Descartes Ren Descartes N Hum NPropre ms habitu A z1 ms It is possible to exclude codes by preceding them with the character instead of In order to be recognized an entry has to contain all the codes required by the lexical mask and none of the prohibited ones The mask lt A z3 gt thus recognizes all the adjectives that
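The constraint described here, where an entry must carry every required code and none of the prohibited ones, amounts to two set tests. The following Python sketch assumes that the codes of an entry are available as a plain list of strings; the function name mask_matches and the way codes are passed are hypothetical and do not reflect the lexical mask syntax itself.

```python
def mask_matches(entry_codes, required, prohibited=()):
    """Check an entry's codes against required and prohibited codes."""
    codes = set(entry_codes)
    # All required codes must be present, and no prohibited code may be.
    return set(required) <= codes and codes.isdisjoint(prohibited)


# Code lists loosely modeled on the entries quoted in this excerpt.
print(mask_matches(["N", "z1"], required=["N", "z1"]))
print(mask_matches(["N", "Hum", "NPropre"], required=["N", "z1"]))
print(mask_matches(["A", "z1"], required=["A"], prohibited=["z3"]))
```

An entry with extra codes still matches, since the mask only states what must and must not be present.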
44. modify, sublicense, link with, or distribute the Library is void, and will automatically terminate your rights under this License. However, parties who have received copies, or rights, from you under this License will not have their licenses terminated so long as such parties remain in full compliance. 9. You are not required to accept this License, since you have not signed it. However, nothing else grants you permission to modify or distribute the Library or its derivative works. These actions are prohibited by law if you do not accept this License. Therefore, by modifying or distributing the Library (or any work based on the Library), you indicate your acceptance of this License to do so, and all its terms and conditions for copying, distributing or modifying the Library or works based on it. 10. Each time you redistribute the Library (or any work based on the Library), the recipient automatically receives a license from the original licensor to copy, distribute, link with or modify the Library subject to these terms and conditions. You may not impose any further restrictions on the recipients' exercise of the rights granted herein. You are not responsible for enforcing compliance by third parties with this License. 11. If, as a consequence of a court judgment or allegation of patent infringement or for any other reason (not limited to patent issues), conditions are imposed on you (whether by court order, agreement or otherwise) that contr
45. where the variables are replaced with the contents of the cell at the intersection of line and the column that corresponds to the variable If a cell of the table contains the sign the corresponding variable is replaced by lt E gt If the cell contains the sign the box containing the corresponding variable is removed interrupting the paths through that box In all other cases the variable is replaced by the contents of the cell 9 2 2 Format of the table The lexicon grammar tables are usually encoded with the aid of a spreadsheet like OpenOf fice org Calc 72 To make them usable with Unitex the tables have to be encoded in Unicode text format in accordance with the following convention the columns need to be 9 2 CONVERSION OF A TABLE INTO GRAPHS 197 separated by a tab and the lines by a newline In order to convert a table with OpenOffice org Calc save it in text format csv extension You can then parameterize the output format with a window as shown on Figure 9 2 Choose Unicode select tabulation as column separator and do not set any text delimiter Export de texte E x Options de champ Jeu de caract res Unicode y L S d Annuler S parateur de champ Pen S parateur de texte v Aide Largeur de colonne fixe Figure 9 2 Saving a table with OpenOffice org Calc During the generation of the graphs Unitex skips the first line considering that it contains the headings of the columns It
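The table convention described above (columns separated by a tab, lines by a newline, first line holding the column headings) can be read with a few lines of Python. This is a minimal sketch that works on an already decoded string; the function name read_table and the sample rows are invented for the illustration, and the sketch does not perform the variable substitution itself.

```python
def read_table(text):
    """Split a tab-separated table into its heading line and data rows."""
    lines = [line for line in text.splitlines() if line]
    header, *rows = [line.split("\t") for line in lines]
    return header, rows


# Hypothetical two-row table with a heading line.
sample = "A\tB\tC\nabandonner\tPaul\t1\nabuser\tMax\t2\n"
header, rows = read_table(sample)
print(header)
for row in rows:
    print(row)
```

Separating the heading line first mirrors the behavior described above, where the first line is skipped during graph generation.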
46. 1254 Turkish windows 1258 ISO 8859 1 Latin 1 Europe de l ouest amp USA ISO 8859 15 Latin 9 Western Europe amp USA ISO 8859 2 Latin 2 Eastern and Central Europe ISO 8859 3 Latin 3 Southern Europe ISO 8859 4 Latin 4 Northern Europe ISO 8859 5 Cyrillic ISO 8859 7 Greek ISO 8859 9 Latin 5 Turkish ISO 8859 10 Latin 6 Nordic NextStep code page windows 1250 windows 1257 windows 1251 windows 1254 iso 8859 1 iso 8859 15 iso 8859 2 iso 8859 3 iso 8859 4 iso 8859 5 iso 8859 7 iso 8859 9 iso 8859 10 next step CHAPTER 13 USE OF EXTERNAL PROGRAMS Microsoft Windows 1252 Latin I Western Europe amp USA Microsoft Windows 1258 Viet Nam LITTLE ENDIAN BIG ENDIAN UTF8 13 12 Dico Dico OPTIONS lt dic_1 gt lt dic_2 gt lt dic_3 gt This program applies dictionaries to a text The text must have been cut up into lexical units by the Tokenize program OPTIONS e t TXT text TXT complete snt text file name e a ALPH alphabet ALPH the alphabet file to use e m DICS morpho DICS this optional parameter indicates which morpho logical dictionaries are to be used if needed by some st 2 dictionaries DICS represents a list of bin files with full paths separated with semi colons e K korean tells Dico that it works on Korean e s semitic tells Dico that i
47. 2 n1 Case c Anim a gt 4 5 Gen g5 Nb Gen 9g1 Nb n Case c Anim a Figure 11 32 Inflection graph NC N3XN for Serbian MWUs istrazxni sudija a gt 1 Gen g Nb s Case c Anim g Det e 2 3 Gen g Nb s Case c Anim Gen g Nb s lt 1 Gen g Nb s Case c Anim g Det d gt lt 3 Gen f Nb w Case c Anim a gt lt 1 Gen f Nb w Case c Anim g Det e gt lt 1 Gen f Nb p Case c Anim g Det e gt lt 3 Gen f Nb p Case c Anim a gt lt 1 Gen mNb s Case 4 Anim a Det e gt lt 3 Gen m Nb s Case 4 Anim a gt Gen m Nb s Case 4 Anim a lt 1 Gen m Nb s Case 4 Anim a Det d gt Figure 11 33 Inflection graph NC AXNF for Serbian MWUs 238 CHAPTER 11 COMPOUND WORD INFLECTION feminin name first name surname Katarina Jovanovic lt 1 Anim a Gen f Case c Nb s gt lt 2 gt lt 3 Nb s Anim a Gen g1 Case 1 gt lt 3 Nb s Anim a Gen g1 Case 1 gt lt 2 gt lt 1 Anim a Gen f Case c Nb feminine name surname first name Jovanovic Katarina 7 lt Nb s Case c Anim a Gen f gt masculine name first name surname Ljuba Popovic ib s Case fc Animz a Gen m 1 Anim a Gen m Case c Nb s lt 2 gt lt 3 Nb s Anim a Gen gl Case c gt H masculine name surname first name Popovic Ljuba lt 3 Nb
48. 3 Text alignment frame 10 2 Aligning texts Once you have loaded your texts you can align them by clicking on the Align button You will be asked to provide the name of the XML file that will contain all the information about the alignment Then Unitex launches the XAl ign program and you will visualize the alignment under the form of red links between aligned sentences as shown on Figure 10 4 You can edit the alignment links with the mouse Clicking on a link removes it To add a link or remove it if it already exists click on one sentence in the text you want source or destination and then move your mouse over the corresponding sentence in the other text The link about to be added will appear in yellow as shown on Figure 10 5 When you click the link is actually added and becomes red When you have made all your corrections you can save your modified alignment using the Save alignment and Save alignment as buttons An interesting feature of XAlign is that it is reentrant It means that you can take an existing alignment as a set of mandatory links in input of the alignment process This can be useful if you want to work with cognates For more details about cognates and XAlign see discussion in 75 206 D My UnitexiXAlign funtana xml 78 s entre d chirent n crivez pas cela s il ous plait on pourrait me le a je ne suis ici que depuis quelques minutes un quart d heure tout au plus a
49. 4.9 Statistics panel In the Mode panel, you can select the kind of statistics you want: • contexts by frequency: shows matches with left and right contexts (see below); count is the number of occurrences of a given match with its context; • collocates by frequency: shows the tokens that co-occur in the match context; • collocates by z-score: the previous one, plus some additional information: number of occurrences of the collocate in the match context and in the whole corpus, and z-score of the collocate. In the second panel, you can set the length of left and right contexts to be used, in non-space tokens. NOTE: this notion of context has nothing to do with contexts in grammars. In the last panel, you can allow or not case variations. If you allow case variations, the and THE will be considered as the same token, and the count will be the sum of the counts of the and THE. The following figures show the statistics computed in each mode for the query <have> on ivanhoe.snt. Figure 4.10: left match right count
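The "collocates by frequency" mode and the effect of allowing case variations can be sketched with collections.Counter. This is an illustration under simplifying assumptions, not the Stats program: the text is a list of non-space tokens, each match is a single token given by its position, and the function name collocates_by_frequency is invented for the example.

```python
from collections import Counter


def collocates_by_frequency(tokens, match_positions, left, right, fold_case=True):
    """Count the tokens found in the left and right contexts of matches.

    `left` and `right` are context lengths in tokens. When `fold_case`
    is true, case variants such as "the" and "THE" are counted together.
    """
    counts = Counter()
    for pos in match_positions:
        window = tokens[max(0, pos - left):pos] + tokens[pos + 1:pos + 1 + right]
        for token in window:
            counts[token.lower() if fold_case else token] += 1
    return counts


tokens = "The knights have been seen and the squires have been heard".split()
matches = [i for i, token in enumerate(tokens) if token == "have"]

print(collocates_by_frequency(tokens, matches, left=2, right=1))
print(collocates_by_frequency(tokens, matches, left=2, right=1, fold_case=False))
```

Comparing the two printed counters shows how allowing case variations merges the counts of "The" and "the" into one entry.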
50. 9.3 shows an example of a parameterized graph designed to be applied to the lexicon-grammar table 31H presented in figure 9.4. Figure 9.3: Example of parameterized graph. Figure 9.4: Lexicon-grammar table 31H. 9.2.4 Automatic generation of graphs In order to generate graphs from a parameterized graph and a table, the table must first be opened by clicking on "Open" in the "Lexicon-Grammar" menu (see figure 9.5). The table must be in Unicode text format. The selected table is then displayed in a window (see figure 9.6). If it does not appear
51. By default, the sorting is performed in the order of Unicode characters, removing duplicate lines. 13.35 Stats Stats [OPTIONS] <ind> This program computes some statistics from the <ind> concordance index file. OPTIONS: • -m MODE/--mode=MODE: specifies the output to be produced: 0 = matches with left and right contexts + number of occurrences; 1 = collocates + number of occurrences; 2 = collocates + number of occurrences + z-score; • -a ALPH/--alphabet=ALPH: alphabet file to use; • -o OUT/--output=OUT: output file; • -l N/--left=N: length of left contexts in tokens; • -r N/--right=N: length of right contexts in tokens; • -c N/--case=N: case policy: 0 = case insensitive, 1 = case sensitive (default). 13.36 Table2Grf Table2Grf [OPTIONS] <table> This program automatically generates graphs from a lexicon-grammar table <table> and a template graph. OPTIONS: • -r GRF/--reference_graph=GRF: name of the template graph; • -o OUT/--output=OUT: name of the result main graph; • -s XXX/--subgraph_pattern=XXX: if this optional parameter is specified, all the produced subgraphs are named according to this pattern. In order to have unambiguous names, we recommend including @% in the parameter (recall that @% is replaced by the line number of the entry in the table). For instance, if you set the pattern parameter to 'subgraph-@%.grf', subgraph names wil
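The subgraph naming rule described for Table2Grf can be sketched in Python. This is only an illustration, not Unitex code: the helper name, the 1-based numbering and the absence of zero padding are assumptions, and the line-number placeholder is assumed to be written `@%`.

```python
def subgraph_names(pattern, n_entries, placeholder="@%"):
    """Return one subgraph name per table entry.

    The placeholder spelling, 1-based numbering and unpadded numbers are
    assumptions made for this sketch.
    """
    if placeholder not in pattern:
        # Without the placeholder every entry would get the same name.
        raise ValueError("pattern should contain the line-number placeholder")
    return [pattern.replace(placeholder, str(i)) for i in range(1, n_entries + 1)]


names = subgraph_names("subgraph-@%.grf", 3)
print(names)
```

Including the placeholder is what makes the names unambiguous: each table line yields a distinct file name.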
52. Figure 6.16: Advanced use of right contexts. You can use nested contexts. For instance, the graph shown in figure 6.17 matches a number that is not followed by a dot, except for a dot followed by a number. Thus, in the sequence 5.0+7.=12, this graph will match 5, 0 and 12. Figure 6.17: Nested contexts. If a right context contains boxes with transducer outputs, the outputs are ignored. However, it is possible to use a variable that was defined inside a right context (cf. figure 6.18). If you apply this graph in MERGE mode to the text the cat is white, you will obtain: the <pet name="cat" color="white"/> is white. Figure 6.18: Variable defined inside a right context. 6.3.2 Left contexts It is also possible to look for an expression X only if it occurs after an expression Y. Of course, it was already possible to do that with a grammar like the one shown in Figure 6.19. However, with such a grammar, the context part on the left is included in the match, as shown in Figure 6.20. Figure 6.19: Matching a noun that occurs after a numerical determiner.
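The nested-context rule (a number not followed by a dot, unless that dot is itself followed by a number) has a close analogue in ordinary regular expressions, where nested negative lookaheads play the role of nested right contexts. The Python sketch below is an analogy, not a Unitex graph; the sample string is an assumption chosen to exercise each case.

```python
import re

# \d+            a whole number
# (?!\d)         do not stop in the middle of a number
# (?!\.(?!\d))   reject if followed by a dot that is NOT followed by a digit
NUMBER = re.compile(r"\d+(?!\d)(?!\.(?!\d))")


def find_numbers(text):
    """Return the numbers accepted by the nested lookahead rule."""
    return NUMBER.findall(text)


matches = find_numbers("5.0+7.=12")
print(matches)
```

The inner lookahead is the exception to the outer one, which mirrors how a context nested inside another context refines it.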
53. Public License. In other cases, permission to use a particular library in non-free programs enables a greater number of people to use a large body of free software. For example, permission to use the GNU C Library in non-free programs enables many more people to use the whole GNU operating system, as well as its variant, the GNU/Linux operating system. Although the Lesser General Public License is Less protective of the users' freedom, it does ensure that the user of a program that is linked with the Library has the freedom and the wherewithal to run that program using a modified version of the Library. The precise terms and conditions for copying, distribution and modification follow. Pay close attention to the difference between a "work based on the library" and a "work that uses the library". The former contains code derived from the library, whereas the latter must be combined with the library in order to run. GNU LESSER GENERAL PUBLIC LICENSE TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION 0. This License Agreement applies to any software library or other program which contains a notice placed by the copyright holder or other authorized party saying it may be distributed under the terms of this Lesser General Public License (also called "this License"). Each licensee is addressed as "you". A "library" means a collection of software functions and/or data prepared so as to be conven
54. Public License instead of this License to a given copy of the Library. To do this, you must alter all the notices that refer to this License, so that they refer to the ordinary GNU General Public License, version 2, instead of to this License. (If a newer version than version 2 of the ordinary GNU General Public License has appeared, then you can specify that version instead if you wish.) Do not make any other change in these notices. Once this change is made in a given copy, it is irreversible for that copy, so the ordinary GNU General Public License applies to all subsequent copies and derivative works made from that copy. This option is useful when you wish to copy part of the code of the Library into a program that is not a library. 4. You may copy and distribute the Library (or a portion or derivative of it, under Section 2) in object code or executable form under the terms of Sections 1 and 2 above provided that you accompany it with the complete corresponding machine-readable source code, which must be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange. If distribution of object code is made by offering access to copy from a designated place, then offering equivalent access to copy the source code from the same place satisfies the requirement to distribute the source code, even though third parties are not compelled to copy the source along with the o
55. Figure 7.36: Filtered table display. feature that may change in the future. Here is an example of output: Je N ms mp Je PRO PpvIL 1fs 1ms suis V P1s suis V Y2s P2s Pl1s M N mp ms Mdiba de DET Dind fp mp fs ms de PREP de PREP z1 de la DET Dind z1 fs de PREP z1 des DET Dind zl1l mp fp de PREP z1 du DET Dind z1 ms de la DET Dind z1 fs des DET Dind z1 mp fp du DET Dind zl ms LG ville N fs S 7.9 The special case of Korean Korean is an agglutinative language with a very particular morphological system: words are made of Hangul syllabic characters, but one Hangul character corresponds to several Jamo alphabetic characters. For instance, Figure 7.37 shows two examples of Hangul characters followed by their equivalent Jamo letter sequences. Figure 7.37: Hangul characters and their equivalent Jamo sequences. Moreover, morphemes do not necessarily correspond to Hangul characters. For instance Figure 7.38
56. Figure 3.18: Path added by a morphological dictionary graph. 3.8 Bibliography Table 3.4 gives some references for electronic dictionaries with simple and compound words. For more details, see the references page on the Unitex website (http://www-igm.univ-mlv.fr/~unitex/). Language, simple words, compound words: English 56 70 15 84; French 19 20 61 20 37 86 45; Modern Greek 2 17 58 59 60; Italian 28 29 90; Spanish 8 7; Portuguese 25 82 79 71 78 79. Table 3.4: Some bibliographical references for electronic dictionaries. Chapter 4 Searching with regular expressions This chapter describes how to search a text for simple patterns by using regular expressions. 4.1 Definition The goal of this chapter is not to give an introduction to formal languages but to show how to use regular expressions in Unitex in order to search for simple patterns. Readers who are interested in a more formal presentation can consult the many works that discuss regular expression patterns. A regular expression can be: • a token (book) or a lexical mask (<smoke
57. "Text" menu of Unitex and submenu "Apply CasSys Cascade". Figure 12.5: CasSys window to launch a cascade of transducers. concordance resulting from a cascade recognizing named entities. To create the file containing all the modifications made by the cascade to the text, click on "Modify text" in the "Located Sequences" window. The resulting file is a copy of the text in which the transducer outputs appear. In fact, such a file is always created by CasSys.
58. a sentence automaton, 182. Priority: of dictionaries, 64; of the leftmost match, 136; of the longest match, 137. Reconstruction of the text automaton, 277. Recursive Transition Networks, 90. Reentrant alignment, 205. Reference to information in the dictionaries, 73, 116. Regular expressions, 71, 77, 90, 277. REPLACE, 135, 144, 305. Resolving ambiguity, 166. Respect: of lowercase/uppercase, 114, 116; of spaces, 116. RTN, 90. Rules: for transducer application, 135; rewriting, 89; upper case and lower case letters, 65; white space, 65. Russian compound words, 45. Scripting Unitex programs, 286. Search for patterns, 79, 143. Selecting a language, 31. Semitic languages, 62. Sentence delimiter, 275. Sentence separator, 38, 76, 283, 299, 321. Sentenceseparator, 315. Separators, 37. Sequence Automaton, 189. Shortest matches, 79, 143. Sorting, 278, 279: a dictionary, 55; concordances, 258; lines of a box, 106; of concordances, 81, 148. SoyLatte, 20. Space: obligatory, 72; prohibited, 72. Splitting into sentences, 37. State: final, 90; initial, 90. Statistics, 279. SVG graph export, 112. Symbols: non terminal, 89; special, 103; terminal, 89. Synchronization point, 163. Syntactical properties, 195. Syntax diagrams, 90. Testing variables, 142. Text: automata, 281; automaton, 277; automaton of the, 73, 284; directory, 37; directory of, 252; formats, 31; modification, 149, 257; normalisation of the automaton, 115; normalization, 37, 275; normalization of the
59. application that runs on different operating systems. Before you can use the graphical interface, you first have to install the runtime environment, usually called the Java virtual machine or JRE (Java Runtime Environment). For the graphical mode, Unitex needs Java version 1.6 (or newer). If you have an older version of Java, Unitex will stop after you have chosen the working language. You can download the virtual machine for your operating system for free from the Sun Microsystems web site [68] at the following address: http://java.sun.com. If you are working under Linux or MacOS, or if you are using a Windows version with personal user accounts, you have to ask your system administrator to install Java. 1.3 Installation on Windows If Unitex is to be installed on a multi-user Windows machine, it is recommended that the system administrator perform the installation. If you are the only user on your machine, you can perform the installation yourself. Decompress the file Unitex3.0.zip (you can download this file from the following address: http://www-igm.univ-mlv.fr/~unitex/) into a directory Unitex3.0, which should preferably be created outside the Program Files folder. Indeed, if you run Unitex under Windows 7 from that folder, you will have trouble with your Unitex configuration file, because Unitex tries to write it in the Users subdirectory and Windows 7 forbids it. After decompress
60. automaton. The main advantages are that you can: • benefit from ambiguity removal; • benefit from the application of a normalization grammar (see below); • work at several morphological levels (multi-word units, simple words, morphemes). This is particularly interesting since you can now easily manipulate agglutinative languages like Korean (for Korean, see section 7.9). The rules are very similar to the ones that apply to classical searches with Locate. Here are the differences: • you cannot capture sequences with variables inside right contexts, as is possible with Locate (see Figure 6.18, page 125); • you cannot match things that are not in the text automaton: if the text automaton only contains a compound word tag and not its competing simple word tags, you will not be able to match simple words. For instance, in the sentence automaton shown in Figure 7.33, it is not possible to match soixante or huit, since there are no such paths (Figure 7.33: Sentence automaton that cannot match with pattern huit); • matched sequences can differ from the sequences that appear in concordances. In fact, the text automaton may contain tags that do not correspond to the raw input text, in particular when a normalization grammar has been applied. For instance, if you look for the pattern <le.DET> in 80jours's text automaton, you will obtain 7703 matches w
61. by the and symbols. Outputs with variables do not make sense in this kind of graph. You cannot use morphological filters, morphological mode or contexts. It is possible to reference subgraphs. It is not possible to reference information in dictionaries in order to describe the forms to normalize. The only special symbol that is recognized in this type of graph is the empty word <E>. The graphs for normalizing ambiguous forms need to be compiled before they are used. 6.1.4 Syntactic graphs Syntactic graphs, often called local grammars, allow you to describe syntactic patterns that can then be searched for in texts. Of all kinds of graphs, these have the greatest expressive power, because they allow you to refer to information in dictionaries. Lower case/upper case variants may be used according to the principle described above. It is still possible to enforce respect of case by enclosing an expression in double quotes. The use of double quotes also allows you to enforce the respect of spaces. In fact, Unitex by default assumes that a space is possible between two boxes. To enforce the presence of a space, you have to enclose it in double quotes. To prohibit the presence of a space, you have to use the special symbol #. Syntactic graphs can reference subgraphs (cf. section 5.2.2). They can also have outputs, including outputs with variables. The produced sequences are interpreted as strings of characters that will be inserted in the
62. concordances or in the text, if you want to modify it (cf. section 6.10.4). Syntactic graphs can use contexts (see section 6.3). Syntactic graphs can use morphological filters (see section 4.7). Syntactic graphs can use morphological mode (see section 6.4). The special symbols that are supported by syntactic graphs are the same as those that are usable in regular expressions (cf. section 4.3.1). It is not obligatory to compile syntactic graphs before using them for pattern matching. If a graph is not compiled, the system compiles it automatically. 6.1.5 ELAG grammars ELAG grammars for disambiguation between lexical symbols in text automata are described in section 7.3.1, page 163. 6.1.6 Parameterized graphs Parameterized graphs are meta-graphs that allow you to generate a family of graphs using a lexicon-grammar table. It is possible to construct parameterized graphs for all possible kinds of graphs. The construction and use of parameterized graphs are explained in chapter 9. 6.2 Compilation of a grammar 6.2.1 Compilation of a graph Compilation is the operation that converts the .grf format to a format that can be manipulated more easily by Unitex programs. In order to compile a graph, you must open it and then click on "Compile FST2" in the "Tools" submenu of the menu "FSGraph". Unitex then launches the Grf2Fst2 program. You can keep track of its execution in a window (cf. Figure 6.4
63. default, the first lines of this file for French look like this: AÀÂÄaàâä Bb CÇcç Dd EÉÈÊËeéèêë. Figure 3.6: Results of checking. Characters in the same line are considered equivalent if the context permits. If two equivalent characters must be compared, they are sorted in the order in which they appear, from left to right. As can be seen from the extract above, there is no difference between lower and upper case. Accents and the cedilla are ignored as well. To sort a dictionary, open it and then click on "Sort Dictionary" in the "DELA" menu. By default, the program always looks for the file Alphabet_sort.txt. If that file does not exist, the sorting is done according to the character indices in the Unicode encoding. By modifying that file, you can define your own sorting order. NOTE: After applying the dictionaries to a text, the files dlf, dlc and err are automatically sorted using this program. 3.5 Automatic inflection 3.5.1 Inflection of simple words As described in section 3.1.2, a line in a DELAS consists of a canonical form and a
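The sorting rule described in this excerpt (characters on the same line of the sort alphabet are equivalent, and ties between equivalent characters are broken by their left-to-right position on that line) can be sketched as a two-level sort key in Python. This is an illustration under assumptions, not the Unitex implementation: the alphabet lines below are made up for the example, and placing unknown characters after known ones in code point order is a choice made for this sketch.

```python
def build_tables(alphabet_lines):
    """Map each character to its equivalence class (line index) and its rank in the line."""
    cls, rank = {}, {}
    for line_index, line in enumerate(alphabet_lines):
        for position, ch in enumerate(line):
            cls.setdefault(ch, line_index)
            rank.setdefault(ch, position)
    return cls, rank


def sort_key(word, cls, rank):
    # Primary key: equivalence classes only, so case and accents are ignored.
    # Characters absent from the alphabet sort after known ones (an assumption).
    primary = tuple((0, cls[c]) if c in cls else (1, ord(c)) for c in word)
    # Secondary key: left-to-right rank inside the class, used only to break ties.
    secondary = tuple(rank.get(c, 0) for c in word)
    return primary, secondary


# Illustrative lines only, not the real Alphabet_sort.txt.
lines = ["AÀaà", "Bb", "CÇcç", "Dd", "EÉeé"]
cls, rank = build_tables(lines)
words = ["bac", "Abc", "àbc", "abc", "cab"]
result = sorted(words, key=lambda w: sort_key(w, cls, rank))
print(result)
```

Editing the alphabet lines changes the resulting order, which is the point made in the excerpt about defining your own sorting order.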
64. do not have the code z3 (cf. table 3.2). If you want to refer to a code containing the character ~, you have to escape this character by preceding it with a \. CHANGE NOTE: before version 2.1, the negation operator was the minus sign. If you want to preserve backward compatibility without modifying your graphs, you have to call Locate by hand with the -g minus option. The order in which the codes appear in the mask is not important. The three following patterns are equivalent: <N+Hum+z1> <z1+N+Hum> <Hum+z1+N> NOTE: it is not possible to use a lexical mask that only has prohibited codes. <~N> and <~A~z1> are thus incorrect masks. However, you can express such constraints using contexts (see section 6.3). 4.3.4 Inflectional constraints It is also possible to specify constraints on inflectional codes. These constraints have to be preceded by at least one grammatical or semantic code. They are written in the same way as the inflectional codes in the dictionaries. Here are some examples of lexical masks using inflectional constraints: • <A:m> recognizes a masculine adjective; • <A:mp:f> recognizes a masculine plural or a feminine adjective; • <V:2:3> recognizes a verb in the 2nd or 3rd person; this excludes all tenses that have neither a 2nd nor a 3rd person (infinitive, past participle and present participle) as well as the tenses that are conjugate
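The statement that the order of codes in a lexical mask does not matter amounts to treating the mask as a set of required codes. The Python sketch below illustrates this under assumptions: codes are taken to be joined with `+` inside angle brackets, only required codes are handled (no negation, no inflectional constraints), and the function names are hypothetical.

```python
def parse_mask(mask):
    """Return the set of codes required by a mask such as '<N+Hum+z1>'.

    The '+' separator and the angle brackets are assumptions of this sketch.
    """
    body = mask.strip("<>")
    return frozenset(body.split("+"))


def mask_matches(mask, entry_codes):
    """A dictionary entry matches when it carries every code required by the mask."""
    return parse_mask(mask) <= set(entry_codes)


masks = ["<N+Hum+z1>", "<z1+N+Hum>", "<Hum+z1+N>"]
parsed = {parse_mask(m) for m in masks}
print(len(parsed))
```

Because the three masks parse to the same set, they accept exactly the same entries.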
65. • -a N/--random=N: select N times a random log in the list (in each thread); • -f N/--break-after=N: user cancel after N runs (with one thread only); • -u PATH/--unfound-location=PATH: take dictionary and FST2 from PATH if not found in the logfile. Another usage of UnitexToolLogger is the MzRepairUlp option, which repairs a corrupted ulp file (often a crashing log): UnitexToolLogger MzRepairUlp [OPTIONS] <ulpfile> OPTIONS (after MzRepairUlp): • -t X/--temp=X: uses X as filename for the temporary file (<ulpfile>.build by default); • -o X/--output=X: uses X as filename for the fixed ulp file (<ulpfile>.repair by default); • -m/--quiet: do not emit messages when running; • -v/--verbose: emit messages when running. Another usage of UnitexToolLogger is the CreateLog option, with round brackets, to create a log file of a running Unitex program, like: UnitexToolLogger CreateLog OPTIONS cmd args; UnitexToolLogger CreateLog OPTIONS cmd 1 args cmd 2 args. For example: UnitexToolLogger CreateLog log file my run normalize ulp Normalize C My Unitex French Corpus 80jours txt; UnitexToolLogger CreateLog directory c logs Compress c dela mydela dic CheckDic delaf c dela mydela inf. OPTIONS (after CreateLog): • -g/--no_create_log: do not create any log file. Incompatible with all other options
66. en grec moderne. 1990. Thèse de doctorat, Université Paris 8. (Cited in 3.8.) [59] Tita KYRIACOPOULOU. Un système d'analyse de textes en grec moderne : représentation des noms composés. In Actes du 5ème Colloque International de Linguistique Grecque, 13-15 septembre 2001, Sorbonne, Paris, 2002. (Cited in 3.8.) [60] Tita KYRIACOPOULOU, Safia MRABTI and Anastasia YANNACOPOULOU. Le dictionnaire électronique des noms composés en grec moderne. Lingvisticæ Investigationes, 25(1):7-28, 2002. Amsterdam-Philadelphia: John Benjamins Publishing Company. (Cited in 3.8.) [61] Jacques LABELLE. Le traitement automatique des variantes linguistiques en français : l'exemple des concrets. Lingvisticæ Investigationes, 19(1):137-152, 1995. Amsterdam-Philadelphia: John Benjamins Publishing Company. (Cited in 3.8.) [62] Eric LAPORTE and Anne MONCEAUX. Elimination of lexical ambiguities by grammars: The ELAG system. Lingvisticæ Investigationes, 22:341-367, 1998. Amsterdam-Philadelphia: John Benjamins Publishing Company. (Cited in 7.3.) [63] Ville LAURIKARI. TRE home page. http://laurikari.net/tre/. (Cited in 1.1, 4.7.) [64] Christian LECLÈRE. The lexicon-grammar of French verbs: a syntactic database. In Kawaguchi Y. et alii, editor, Linguistic Informatics: State of the Art and the Future, pages 29-45. Amsterdam-Philadelphia: Benjamins, 2005. (Cited in 9.1.) [65] Judith N. LEVI. The Syntax and Semantics of Complex Nominals. Academic Press, New York-London, 1978. (Cited in 11.1.) [66] XAlign (Alignement multilingue). LORIA, 2006. http led loria
67. gladni gladni gladni gladni gladni kao vuk gl N Kao Kao Kao Kao Kao Kao D D vd vd d vd D vd n 11 3 INTEGRATION IN UNITEX gladni kao vuci gladan kao vuk AC_A3XN2 p5mgea hungry as a wolf gladni kao vukovi gladan kao vuk AC_A3XN2 p5mgea hungry as a wolf gladne kao vuk gladan kao vuk AC_A3XN2 p5fgea hungry as a wolf gladne kao vuci gladan kao vuk AC_A3XN2 p5fgea hungry as a wolf gladne kao vukovi gladan kao vuk AC_A3XN2 p5fgea hungry as a wolf gladna kao vuk gladan kao vuk AC_A3XN2 p5ngea hungry as a wolf gladna kao vuci gladan kao vuk AC_A3XN2 p5ngea hungry as a wolf gladna kao vukovi gladan kao vuk AC_A3XN2 p5ngea hungry as a wolf gladnima kao vuk gladan kao vuk AC A3XN2 p6mgea hungry as a wolf gladnima kao vuci gladan kao vuk AC A3XN2 p6mgea hungry as a wolf gladnima kao vukovi gladan kao vuk AC A3XN2 p6mgea hungry as a wolf gladnim kao vuk gladan kao vuk AC A3XN2 p6mgea hungry as a wolf gladnim kao vuci gladan kao vuk AC A3XN2 p6mgea hungry as a wolf gladnim kao vukovi gladan kao vuk AC A3XN2 p6mgea hungry as a wolf gladnima kao vuk gladan kao vuk AC A3XN2 p6fgea hungry as a wolf gladnima kao vuci gladan kao vuk AC A3XN2 p6fgea hungry as a wolf gladnima kao vukovi gladan kao vuk AC A3XN2 p6fgea hungry as a wolf gladnim kao vuk gladan kao vuk AC A3XN2 p6fgea hungry as a wolf gladnim kao vuci gladan kao vuk AC A3XN2 p6fgea
68. gram enters in graph subname. This parameter can be used several times in order to specify several stop graphs; • -p s/f/d: s displays paths graph by graph; f (default) displays global paths; d displays global paths with information on nested graph calls; • -c SS=0xXXXX: replaces symbol SS, when it appears between angle brackets, by the Unicode character whose hexadecimal number is 0xXXXX; • -s L,R: specifies the left (L) and right (R) delimiters that will enclose items. By default, no delimiters are specified; • -s0 Str: if the program must take outputs into account, this parameter specifies the sequence Str that will be inserted between input and output. By default, there is no separator; • -f a/s: if the program must take outputs into account, this parameter specifies the format of the lines that will be generated: in0 in1 out0 out1 (s) or in0 out0 in1 out1 (a). The default value is s; • -ss stop: sets str as the mark of stop exploitation at <stop>. The default value is null; • -v: prints information during the process (verbose mode); • -m: special mode for description with alphabet; • -rx L,R: specifies how cycles must be displayed. L and R are delimiters. If we consider the graph shown in Figure 13.4, here are the results for L and R: il fait très très; il fait très beau. Figure 13.4: Graph with a cycle. 13.20 Fst2Txt Fst2Txt [OPTIONS] <fst2
69. > This program applies a transducer to a text in longest match mode at the preprocessing stage, when the text has not been cut into lexical units yet. OPTIONS: • -t TXT/--text=TXT: the text file to be modified, with extension .snt; • -a ALPH/--alphabet=ALPH: the alphabet file of the language of the text; • -s/--start_on_space: this parameter indicates that the search will start at any position in the text, even before a space. This parameter should only be used to carry out morphological searches; • -x/--dont_start_on_space: forbids the program to match expressions that start with a space (default); • -c/--char_by_char: works in character by character tokenization mode. This is useful for languages like Thai; • -w/--word_by_word: works in word by word tokenization mode (default). Output options: • -M/--merge: merge transducer outputs with text inputs (default); • -R/--replace: replace text inputs with the corresponding transducer outputs. This program modifies the input text file. 13.21 Grf2Fst2 Grf2Fst2 [OPTIONS] <grf> This program compiles a grammar into a .fst2 file (for more details, see section 6.2). The parameter <grf> denotes the complete path of the main graph of the grammar, without omitting the extension .grf. OPTIONS: • -y/--loop_check: enables error checking (loop detection); • -n/--no_loop_check: disables error checking (default); • -a ALPH/--alphabet
70. in the text and appear in the resulting concordance or modified text; • Replace the recognized sequence to modify the text. These two operations transform the text: they add information inside the text for different purposes, or modify it. The cascade can then be used for syntactic analysis, chunking, information extraction, etc. The main advantage of a cascade is that it is a good way to manage priority between the patterns that you want to find in a text. If you know two ambiguous patterns, you can apply the less ambiguous pattern first. 12.2.7 The longest pattern The longest-pattern matching heuristic is applied to each transducer of the cascade. When a graph is applied to a text, several paths can be recognized by the graph. If the graph reaches its final state through several paths, the path that recognizes the longest pattern is chosen. The longer the pattern, the less ambiguous it is. If the transducer does not reach its final state, the recognition step restarts at the next word of the text. The longest-pattern heuristic is useful, but a problem remains when several paths of the same length are recognized: one of the paths is chosen without any control by the user, so the worst path might well be chosen. A solution to that problem can be the creation of a cascade of transducers giving priority to a transducer among the list of tra
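The longest-pattern heuristic can be sketched in Python with a deliberately simplified model. Everything here is an assumption made for illustration: fixed token sequences stand in for the paths of a graph, the function names are hypothetical, and resuming the scan right after a match is a choice of this sketch (the excerpt only states that recognition restarts at the next word when nothing is recognized).

```python
def longest_match(tokens, start, patterns):
    """Return the longest pattern recognized at position start, or None.

    Patterns must be non-empty token tuples. Among patterns of equal length
    the first one seen wins, which is as arbitrary as the tie case in the text.
    """
    best = None
    for pattern in patterns:
        end = start + len(pattern)
        if tuple(tokens[start:end]) == pattern:
            if best is None or len(pattern) > len(best):
                best = pattern
    return best


def apply_patterns(tokens, patterns):
    """Scan the text left to right, keeping the longest match at each position."""
    position, found = 0, []
    while position < len(tokens):
        match = longest_match(tokens, position, patterns)
        if match is None:
            position += 1  # nothing recognized: restart at the next word
        else:
            found.append((position, match))
            position += len(match)  # resume after the match (assumption)
    return found


tokens = "the prime minister of France spoke".split()
patterns = [("prime",), ("prime", "minister"), ("minister",)]
found = apply_patterns(tokens, patterns)
print(found)
```

The two-token pattern wins over both one-token patterns, so neither "prime" nor "minister" is reported on its own.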
71. no_ambiguous_outputs: forbids ambiguous outputs. In case of ambiguous outputs, one will be arbitrarily kept, depending on the internal state of the program. Variable error options: These options have no effect if the output mode is set with --ignore; otherwise, they rule the behavior of the Locate program when an output is found that contains a reference to a variable that is not correctly defined. • -X/--exit_on_variable_error: kills the program; • -Y/--ignore_variable_errors: acts as if the variable has an empty content (default); • -Z/--backtrack_on_variable_errors: stop exploring the current path of the grammar. Variable injection: • -v X=Y/--variable=X=Y: sets an output variable named X with content Y. Note that Y must be ASCII. Tagging option: • --tagging: indicates that the concordance must be a tagging one, containing additional information on the start and end states of each match. This program saves the references to the found occurrences in a file called concord.ind. The number of occurrences and the number of produced outputs are saved in a file called concord_tfst.n. These two files are stored in the directory of the text. 13.27 MultiFlex MultiFlex [OPTIONS] <dela> This program carries out the automatic inflection of a DELA dictionary containing simple (see section 3.1.2) or compound word lemmas (see chapter 11). OPTIONS: • -o DELAF/--output=DELAF: output DELAF file; • -a ALPH
72. not. For each state, the list of transitions is a (possibly empty) sequence of pairs of integers: • the first integer indicates the number of the label or sub-graph that corresponds to the transition. Labels are numbered starting at 0. Sub-graphs are represented by negative integers, which explains why the numbers preceding the names of the graphs are negative; • the second integer represents the number of the result state after the transition. In each graph, the states are numbered starting at 0. By convention, state 0 is the initial state. Each state definition line terminates with a space. The end of each graph is marked by a line containing an f followed by a space and a newline. Labels are defined after the last graph. If the line begins with the character @, the contents of the label are to be searched without allowing case variations. This information is not used if the label is not a word. If the line starts with a %, capitalization variants are authorized. If a label carries a transducer output sequence, the input and output sequences are separated by the / character (example: the/DET). By convention, the first label is always the empty word (<E>), even if that label is never used for any transition. The end of the file is indicated by a line containing the f character followed by a newline. 14.4 Texts This section presents the different files used to represent texts. 14.4.1 .txt files .txt files are
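The transition encoding described here (a possibly empty sequence of integer pairs, with negative first integers denoting sub-graph calls) can be read with a few lines of Python. This is a sketch under assumptions: the excerpt is cut off before it describes what starts a state line, so the first field is returned as an opaque marker, and the sample lines and function names are hypothetical.

```python
def parse_state_line(line):
    """Split a state line into its leading marker and its (label, target) pairs."""
    fields = line.split()
    if not fields:
        raise ValueError("empty state line")
    marker, numbers = fields[0], [int(x) for x in fields[1:]]
    if len(numbers) % 2 != 0:
        raise ValueError("transitions must come in (label, target) pairs")
    # Pair up consecutive integers: label or sub-graph number, then result state.
    transitions = list(zip(numbers[0::2], numbers[1::2]))
    return marker, transitions


def is_subgraph_call(label_number):
    """Sub-graphs are represented by negative integers."""
    return label_number < 0


# Hypothetical state line: two transitions, the second one calling a sub-graph.
marker, transitions = parse_state_line(": 3 1 -2 4 ")
print(marker, transitions)
```

Splitting on whitespace makes the trailing space that ends each state definition line harmless.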
73. 14.3.1 Format .grf; 14.3.2 Format .fst2; 14.4 Texts; 14.4.1 .txt files; 14.4.2 .snt files; 14.4.3 File text.cod; 14.4.4 The tokens.txt file; 14.4.5 The tok_by_alph.txt and tok_by_freq.txt files; 14.4.6 The enter.pos file; 14.5 Text Automaton; 14.5.1 The text.tfst file; 14.5.2 The text.tind file; 14.5.3 The cursentence.grf file; 14.5.4 The sentenceN.grf file; 14.5.5 The cursentence.txt file; 14.5.6 The cursentence.tok file; 14.5.7 The tfst_tags_by_freq.txt and tfst_tags_by_alph.txt files; 14.6 Concordances; 14.6.1 The concord.ind file; 14.6.2 The concord.txt file; 14.6.3 The concord.html file; 14.6.4 The diff.html file; 14.7 Text dictionaries
74. of frozen expressions, most notably those which concern the enumeration of compound words.
What's new from version 2.0?
Here are the main new features:
• Introduction of LocateTfst, which performs locate operations on the text automaton (7.7, 13.26)
• LocateTfst fully supports Korean (7.9)
• New search options (6.10.2)
• Introduction of Stats, which can compute statistics from a concordance file (4.8.3)
• Introduction of a statistical tagger that can trim text automata to make them linear (7.4)
• Introduction of the transducer cascade system CasSys (chapter 12)
• Introduction of output variables that can be used to catch outputs emitted by grammars (6.8)
• Introduction of operators to test and compare variables (6.9)
• Support for Semitic inflection (3.5.3), with full support of Arabic typographical variations (14.13.9)
• New options to apply dictionary graphs (3.7.3)
• Introduction of Uncompress, which can rebuild a .dic dictionary from a .bin compressed one (13.45)
• Introduction of Untokenize, which can rebuild a .snt text file from text.cod and tokens.txt (13.46)
• Introduction of ancient Georgian
• The console has changed: you can view error messages emitted by previous commands (13.2)
• There is a tutorial for installing Unitex on MacOS X (1.5)
• New text automaton format .tfst (14.5.1)
• Unitex programs have Unix-style options
• Unitex is now pure LGPL (1.1)
• Unitex can be compiled as a dynamic library (.dll or .so) (1.9)
75. of the window. To modify the name of the collection, click on the Browse button. In the dialog box that appears, enter the .lst file name for the collection.
To add a grammar to the collection, select it in the file explorer in the left frame and click on the button. Once you have selected all your grammars, compile them by clicking on the Compile button. This will create a .rul file bearing the name indicated at the bottom right (the name of the file is obtained by replacing .lst by .rul).
You can now apply your grammar collection. As explained above, click on the Apply Elag Rule button in the text automaton window. When the dialog asks for the .rul file to use, click on the Browse button and select your collection. The resulting automaton is identical to that which would have been obtained by applying each grammar successively.
7.3.5 Window For ELAG Processing
At the time of disambiguation, the Elag program is launched in a processing window which displays the messages printed by the program during its execution.
For example, when the text automaton contains symbols which do not correspond to the set of ELAG labels (see the following section), a message indicates the nature of the error. In the same way, when a sentence is rejected (all possible analyses were eliminated by grammars), a message indicates the number of the sentence. That makes it possible to locate the source of the problems quickly.
Entries which gathe
76. output. In MERGE mode, the output is inserted to the left of the recognized sequences.
Figure 6.40: Example of a transducer
Look at the transducer in Figure 6.40. If this transducer is applied to the novel Ivanhoe by Sir Walter Scott in MERGE mode, the following concordance is obtained:
Figure 6.41: Concordance obtained in MERGE mode with the transducer of figure 6.40
6.7.2 Application while advancing through the text
During the preproces
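The MERGE behaviour described above (the output is inserted to the left of the recognized sequence) can be illustrated on a list of tokens. This is a minimal sketch, not the Locate implementation: the function name, the half-open span convention and the bracketed output string are assumptions made for the example.

```python
from typing import List


def merge_output(tokens: List[str], start: int, end: int, output: str) -> List[str]:
    """Insert `output` to the left of the recognized span tokens[start:end].

    Assumption: the match is given as a half-open token span.
    """
    if not 0 <= start <= end <= len(tokens):
        raise ValueError("invalid match span")
    # The recognized tokens are kept; only the output is added before them.
    return tokens[:start] + [output] + tokens[start:]


if __name__ == "__main__":
    sentence = "the adjacent forest".split()
    print(" ".join(merge_output(sentence, 1, 2, "[Adj]")))
```

The recognized tokens themselves are left untouched, which is what distinguishes merging from replacing.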
77. parameter defines the color to be used for displaying boxes that correspond to context bounds.
The CHAR BY CHAR parameter indicates if the current language must be tokenized character by character or not.
The ANTIALIASING parameter indicates whether graphs as well as sentence automata are displayed by default with the antialiasing effect.
The HTML VIEWER parameter indicates the name of the navigator to be used for displaying concordances. If no navigator name is defined, concordances are displayed in a Unitex window.
The MAX TEXT FILE SIZE parameter is deprecated.
The ICON BAR POSITION parameter indicates the default position of icon bars in graph frames.
The PACKAGE PATH parameter specifies the location of the repository.
The MORPHOLOGICAL DICTIONARY parameter specifies the list of morphological dictionaries to use, separated with semicolons.
The MORPHOLOGICAL NODES COLOR parameter specifies the color to use to render the < and > tags.
The MORPHOLOGICAL USE OF SPACE parameter indicates if the Locate program is allowed to start matching on spaces. Default is false.
14.11.2 The system_dic.def file
The system_dic.def file is a text file that describes the list of system dictionaries that are applied by default. This file can be found in the directory of the current language. Each line corresponds to a name of a .bin fi
78. path that matches it. This window is described with more details in section 6.10.7.
The Index field allows you to select the recognition mode:
• Shortest matches: give precedence to the shortest matches;
• Longest matches: give precedence to the longest sequences. This is the default mode;
• All matches: give out all recognized sequences.
The Search limitation field allows you to limit the search to a certain number of occurrences. By default, the search is limited to the first 200 occurrences.
Figure 6.52: Locate pattern window
The Grammar outputs field concerns transducers. The Merge with input text mode allows you to insert the output sequences in input sequences. The Replace recognized sequences mode allows you to replace the recognized sequences with the produced sequences. The third mode ignores all outputs. This l
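The three recognition modes and the search limitation can be sketched as a filter over candidate matches. This is a deliberate simplification and an assumption about the behaviour, not the actual Locate algorithm: it only arbitrates between matches that start at the same position, and the function name and span convention are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Match = Tuple[int, int]  # (start, end) token span, half-open


def select_matches(matches: List[Match], mode: str = "longest",
                   limit: Optional[int] = 200) -> List[Match]:
    """Keep the shortest, the longest or all matches for each start position."""
    by_start: Dict[int, List[int]] = {}
    for start, end in matches:
        by_start.setdefault(start, []).append(end)

    selected: List[Match] = []
    for start in sorted(by_start):
        ends = sorted(set(by_start[start]))
        if mode == "shortest":
            chosen = ends[:1]
        elif mode == "longest":
            chosen = ends[-1:]
        elif mode == "all":
            chosen = ends
        else:
            raise ValueError(f"unknown mode: {mode}")
        selected.extend((start, end) for end in chosen)

    # limit=None stands for "index all utterances in text".
    return selected if limit is None else selected[:limit]
```

With the default arguments, the sketch mirrors the defaults given in the text: longest matches, stopping after 200 occurrences.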
79. Figure 12.6: Concordance of Cassys under Unitex
12.1.5 Sharing a cascade transducer list file
In order to ease collaborative work within CasSys, a simple export/import system for transducer list files is provided. This possibility is offered in the Text/Apply cascade menu. To share a cascade list file, the following steps have to be completed:
1. Export: select a cascade file and click the export button. A ready-to-share file is created in the Cassys/Share repository.
2. Send the shared file to your colleague.
3. Import: select the import file and click the import button. A ready-to-use file is created in the Cassys repository.
12.2 Details on Cassys
In this section, we present details concerning the functioning of Cassys.
12.2.1 Type of graphs used
C
80. produce a modified executable containing the modified Library. (It is understood that the user who changes the contents of definitions files in the Library will not necessarily be able to recompile the application to use the modified definitions.)
b) Use a suitable shared library mechanism for linking with the Library. A suitable mechanism is one that (1) uses at run time a copy of the library already present on the user's computer system, rather than copying library functions into the executable, and (2) will operate properly with a modified version of the library, if the user installs one, as long as the modified version is interface-compatible with the version that the work was made with.
c) Accompany the work with a written offer, valid for at least three years, to give the same user the materials specified in Subsection 6a, above, for a charge no more than the cost of performing this distribution.
d) If distribution of the work is made by offering access to copy from a designated place, offer equivalent access to copy the above specified materials from the same place.
e) Verify that the user has already received a copy of these materials or that you have already sent this user a copy.
For an executable, the required form of the "work that uses the Library" must include any data and utility programs needed for reproducing the executable from it. However, as a special exception, the materials to be distributed need not include anything
81. required form. E.g. <$3:Nb=p> means that the plural form of royal is needed.
In order to generate all inflected forms of the MWU, we have to explore all the paths existing in the graph. Each path starts at the leftmost (right) arrow and ends at the final (encircled) box. Each time we come to a node, we perform the action contained in the box (a recopy or an inflection of a constituent) and we accumulate the morphological features contained under the box. The total of the accumulated node outputs should result in the complete morphological description of the inflected form.
For example, in the graph on Figure 11.1, if we follow the intermediate path shown on Figure 11.2, we recopy battle ($1) and the space ($2), and we put royal into plural, which yields the plural form battle royals of the whole MWU.
Figure 11.2: One path of the inflection graph for battle royal
As the graph on Figure 11.1 contains three different paths, the whole set of inflected forms generated for battle royal would be:
battle royal <Nb=s>
battle royals <Nb=p>
battles royal <Nb=p>
After rewriting these forms into the Unitex DELACF format, we obtain the following entries:
battle royal,battle royal.N:s
battle royals,battle royal.N:p
battles royal,battle royal.N:p
Note that this description is independent of the way we generate inflected forms of single words, because we suppose that this problem is handled by an existing external morpho
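The path exploration described above can be sketched as a depth-first traversal that concatenates the text produced by each box and accumulates the features written under it. The toy graph below is a hypothetical stand-in for Figure 11.1 (which is not reproduced here), with constituents already inflected by hand; in MULTIFLEX the inflection of single words is delegated to an external morphological module.

```python
from typing import Dict, Iterator, List, Tuple

Step = Tuple[str, Dict[str, str]]           # (text produced, features under the box)
Graph = Dict[int, List[Tuple[Step, int]]]   # node -> [(step, next node)]


def explore(graph: Graph, node: int, final: int,
            prefix: Tuple[Step, ...] = ()) -> Iterator[Tuple[Step, ...]]:
    """Yield every path from `node` to `final` as a tuple of steps."""
    if node == final:
        yield prefix
        return
    for step, nxt in graph.get(node, []):
        yield from explore(graph, nxt, final, prefix + (step,))


def delacf_entries(graph: Graph, lemma: str, code: str,
                   start: int, final: int) -> List[str]:
    """Turn each path into a DELACF line: inflected,lemma.code:features."""
    entries = []
    for path in explore(graph, start, final):
        form = "".join(text for text, _ in path)
        features: Dict[str, str] = {}
        for _, feats in path:
            features.update(feats)          # accumulate node outputs
        entries.append(f"{form},{lemma}.{code}:{''.join(features.values())}")
    return entries


# Hypothetical three-path graph for "battle royal".
TOY_GRAPH: Graph = {
    0: [(("battle", {}), 1), (("battles", {}), 4)],
    1: [((" ", {}), 2)],
    2: [(("royal", {"Nb": "s"}), 3), (("royals", {"Nb": "p"}), 3)],
    4: [((" ", {}), 5)],
    5: [(("royal", {"Nb": "p"}), 3)],
}
```

Calling `delacf_entries(TOY_GRAPH, "battle royal", "N", 0, 3)` walks the three paths and produces one entry per path.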
82. Figure 11.34: Inflection graph NC ImePrezime for Serbian MWUs
Figure 11.35: Inflection graph AC_A3XN2 for Serbian MWUs
Chapter 12 Cascade of Transducers
This chapter presents the tool Cassys, which gives users the possibility to create Unitex cascades of transducers and offers new opportunities to work on natural language with Finite State Graphs. A cascade of transducers applies several FSGraphs (also called automata or transducers), one after the other, onto a text: each graph modifies the text, and changes can be useful for further processing with the next graphs. Such a system is typically used for syntactic analysis, chunking, information extraction, recognizing named entities, etc. To do that, CasSys uses a succession of locate patterns, to which special options and behaviors were added.
The first prototype of the CasSys system was created in 2002 at the LI (Computer Science Laboratory) of Université François Rabelais, Tours, France [30]. This prototype was totally dedicated to named entity recognition. Later, CasSys was generalized to allow any sort of work needing a cascade. Throughout the years it was improved, but never really integrated in Unitex, until a recent project which resulted in the complete integration of CasSys in Unitex.
In this chapter we will explain how to create/modify cascades of transduce
83. semantic code Hum;
• <!ADV>: all words that are not adverbs;
• <!MOT>: all tokens that are not made of letters (cf. figure 4.2). This mask does not recognize the sentence separator {S} and the special tag {STOP}.
[Figure: concordance window on the Ivanhoe corpus (ivanhoe_snt/concord.html)]
84. .snt extension.
OPTIONS:
• -a ALPH/--alphabet=ALPH: alphabet file;
• -c/--clean: indicates whether the rule of conservation of the best paths (see section 7.2.4) should be applied;
• -n XXX/--normalization_grammar=XXX: name of a normalization grammar that is to be applied to the text automaton;
• -t TAGSET/--tagset=TAGSET: Elag tagset file to use to normalize dictionary entries;
• -K/--korean: tells Txt2Tfst that it works on Korean.
If the text is separated into sentences, the program constructs an automaton for each sentence. If this is not the case, the program arbitrarily cuts the text into sequences of 2000 lexical units and produces an automaton for each of these sequences.
The result is a file called text.tfst which is saved in the directory of the text. Another file named text.tind is also produced.
NOTE: The program will also try to use the tags.ind file, if any (see section 14.7.4).
13.45 Uncompress
Uncompress [OPTIONS] <bin>
This program uncompresses a .bin dictionary into a text file .dic one.
OPTIONS:
• -o OUT/--output=OUT: optional output file name (default: file.bin > file.dic).
13.46 Untokenize
Untokenize [OPTIONS] <txt>
Untokenizes and rebuilds the original text. The token list is stored into tokens.txt and the coded text is stored into text.cod. The file enter.pos contains the position in tokens of all the carriage return sequences. These f
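The fallback cutting rule of Txt2Tfst (sequences of 2000 lexical units when the text is not separated into sentences) amounts to fixed-size chunking. The sketch below only illustrates that rule on a Python list; the function name is hypothetical and the real program works on the tokenized text files.

```python
from typing import List, Sequence


def cut_into_sequences(units: Sequence[str], size: int = 2000) -> List[List[str]]:
    """Cut a list of lexical units into consecutive sequences of `size` units.

    The last sequence may be shorter than `size`.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    return [list(units[i:i + size]) for i in range(0, len(units), size)]
```

One automaton would then be built for each returned sequence.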
85. strict syntax checking against unprotected dot and comma;
• -t/--tolerate: tolerates unprotected dot and comma (default);
• -n/--no_space_warning: tolerates spaces in grammatical/semantic/inflectional codes;
• -p/--skip_path: does not display the full path of the dictionary (useful for consistent log files across several systems);
• -a ALPH/--alphabet=ALPH: specifies the alphabet file to use.
The program checks the syntax of the lines of the dictionary. It also creates a list of all characters occurring in the inflected and canonical forms of words in the text, the list of grammatical codes and syntax, as well as the list of inflection codes used. The results of the verification are stored in a file called CHECK_DIC.TXT.
Selecting strict syntax checking detects the use of an unprotected dot in an inflected form or an unprotected comma in a lemma. The tolerate option acts like Unitex 2.0 and lower and does not detect them.
13.8 Compress
Compress [OPTIONS] dictionary
OPTIONS:
• -o BIN/--output=BIN: sets the output file. By default, a file xxx.dic will produce a file xxx.bin;
• -f/--flip: indicates that the inflected and canonical forms should be swapped in the compressed dictionary. This option is used to construct an inverse dictionary, which is necessary for the program Reconstrucao;
• -s/--semitic: indicates that the semitic compression algorithm should be used. Setting this option with semitic languages li
86. tagger for your language, you need to launch the TrainingTagger program on your own annotated corpus. The format of the annotated corpus is described in section 14.10.1.
Figure 7.26: Text automaton linearized with morph tagger data
As we discuss in section 7.4.1, you need to pay attention to tagset and morphology. Before computing a statistical model, you have to decide which dictionaries and normalization graphs you will use to construct the text automaton. Then you will have to modify the annotated corpus if word forms or tagset do not match completely. For example, if the normalization graph transforms the word jusqu' into jusque, the corresponding word in the annotated corpus must be jusque.
A French tagger is distributed with Unitex. It has been created with an annotated corpus composed of tags without semantic and syntactic codes.
7.5 Manipulation of text automata
7.5.1 Displaying sentence automata
As we have seen above, the text automaton is in fact the collection of the sentence automata of a text. This structure can be represented using the format .fst2, also used for representing the compiled grammars. This format does not allow the system to directly display the sentence automata. Instead, the system uses the Fst2Grf program to convert the sentence automaton into a graph that can b
87. text files encoded in Unicode Little-Endian. These files should not contain any opening or closing braces, except for those used to mark a sentence separator ({S}) or a valid lexical tag ({aujourd'hui,.ADV}). The newline needs to be encoded with the two special characters with hexadecimal values 000D and 000A.
14.4.2 .snt files
.snt files are .txt files that have been processed by Unitex. These files should not contain any tabs. They should also not contain multiple consecutive spaces or newlines. The only allowed braces in .snt files are those of the sentence delimiter {S} and those of lexical labels ({aujourd'hui,.ADV}).
14.4.3 File text.cod
The text.cod file is a binary file containing a sequence of integers that represent the text. Each integer i reflects the token with index i in the tokens.txt file. These integers are encoded in four bytes.
NOTE: Tokens are numbered starting at 0.
14.4.4 The tokens.txt file
The tokens.txt file is a text file that contains the list of all lexical units of the text. The first line of this file indicates the number of units found in the file. Units are separated by a newline. Whenever a sequence is found in the text with capitalization variants, each variant is encoded as a distinct unit.
NOTE: Newlines that might be in the .snt file are encoded like spaces. Therefore there is no unit encoding the newline.
14.4.5 The tok_by_alph.txt and tok_by_freq.txt
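The relationship between text.cod and tokens.txt described above can be sketched as follows. Several points are assumptions rather than facts taken from the text: that tokens.txt uses the same Unicode Little-Endian encoding as .txt files (possibly with a byte order mark), that the four-byte integers of text.cod are little-endian and signed, and the function names themselves.

```python
import struct
from typing import List


def parse_tokens_txt(raw: bytes) -> List[str]:
    """Return the list of lexical units stored in a tokens.txt byte string.

    Assumption: UTF-16 little-endian, optional BOM, CR LF newlines.
    """
    text = raw.decode("utf-16-le").lstrip("\ufeff")
    lines = text.replace("\r\n", "\n").split("\n")
    count = int(lines[0])             # first line: number of units
    return lines[1:1 + count]         # token i is the (i+1)-th following line


def decode_text_cod(raw: bytes, tokens: List[str]) -> List[str]:
    """Map each four-byte integer of text.cod to its token (numbered from 0).

    Assumption: little-endian signed integers.
    """
    n = len(raw) // 4
    indices = struct.unpack(f"<{n}i", raw[:4 * n])
    return [tokens[i] for i in indices]


if __name__ == "__main__":
    tokens_raw = "\ufeff3\r\nHello\r\n \r\nworld\r\n".encode("utf-16-le")
    cod_raw = struct.pack("<4i", 0, 1, 2, 1)
    print("".join(decode_text_cod(cod_raw, parse_tokens_txt(tokens_raw))))
```

Joining the decoded tokens gives back the text, with newlines of the original rendered as spaces, as noted in the text.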
88. that allows you to execute an .ulp logfile again. It can also record a running session of a UnitexTool and create an .ulp logfile. If UnitexToolLogger is used like UnitexTool, with just the command line parameters for a Unitex external program, and a file named unitex_logging_parameters_count.txt in the current directory contains a path, an .ulp logfile will be created for the running session. The .ulp file is an uncompressed zip file (compatible with unzip) and can be useful for debugging.
UnitexToolLogger RunLog [OPTIONS] <ulp>
OPTIONS after RunLog:
• -m/--quiet: do not emit messages when running;
• -v/--verbose: emit messages when running;
• -d DIR/--rundir=DIR: path where the log is executed;
• -r newfile.ulp/--result=newfile.ulp: name of the result .ulp created;
• -c/--clean: remove work files after execution;
• -k/--keep: keep work files after execution;
• -s file.txt/--summary=file.txt: summary file with log compare result to be created;
• -e file.txt/--summary-error=file.txt: summary file with error compare result to be created;
• -b/--no-benchmark: do not store execution time in the result log;
• -n/--cleanlog: remove the result .ulp after execution;
• -l/--keeplog: keep the result .ulp after execution;
• -o NameTool/--tool=NameTool: run only the log for NameTool;
• -i N/--increment=N: increment filename .ulp by 0 to N;
• -t N/--thread=N: create N thread
89. the Free Software Foundation, write to the Free Software Foundation; we sometimes make exceptions for this. Our decision will be guided by the two goals of preserving the free status of all derivatives of our free software and of promoting the sharing and reuse of software generally.
NO WARRANTY
15. BECAUSE THE LIBRARY IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE LIBRARY, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE LIBRARY "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE LIBRARY IS WITH YOU. SHOULD THE LIBRARY PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
16. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE LIBRARY AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE LIBRARY (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE LIBRARY TO OPERATE WITH ANY OTHER SOFTWARE
90. the state is final, the line starts with t. Otherwise, the line starts with :. All transitions are written as pairs x y, x being the number of the tag and y being the number of the destination state. Note that, unlike the .fst2 format, lines do not have to end with a space. The end of state lines is marked by a line containing f.
Finally, all tags are encoded. By convention, the first tag is always the epsilon one:
@<E>
.
Other labels have to be either lexical units or entries in the DELAF format in braces. They are encoded as follows:
@STD
@content
@a.b.c-x.y.z
.
content is the tag content. The a.b.c-x.y.z information describes the zone in text covered by the tag:
• a: start offset in tokens from the beginning of the sentence;
• b: start offset in characters from the beginning of the first token of the tag;
• c: start offset in logical letters from the first character of the tag. This information is useful for Korean, because a tag can represent a Jamo sequence that occurs inside a Hangul character. Thus, the character offset is not precise enough;
• x: end offset in tokens from the beginning of the sentence;
• y: end offset in characters from the beginning of the last token of the tag;
• z: end offset in logical letters from the last character of the tag. In Korean sentence automata, empty surface forms can occur that correspond to the empty word in the text. In such cases, z has
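The six offsets of a tag zone can be held in a small record and parsed from their textual form. The sketch assumes the zone is written as a.b.c-x.y.z, optionally preceded by @; the class and function names are hypothetical, and the field names simply spell out the descriptions given in the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TagZone:
    start_token: int    # a: offset in tokens from the beginning of the sentence
    start_char: int     # b: offset in characters within the first token
    start_letter: int   # c: offset in logical letters from the first character
    end_token: int      # x
    end_char: int       # y
    end_letter: int     # z


def parse_zone(spec: str) -> TagZone:
    """Parse a zone written as 'a.b.c-x.y.z' (a leading '@' is ignored)."""
    spec = spec.strip().lstrip("@")
    # Split on the first hyphen only, so that a sign on a later field
    # does not break the parsing.
    start, end = spec.split("-", 1)
    a, b, c = (int(v) for v in start.split("."))
    x, y, z = (int(v) for v in end.split("."))
    return TagZone(a, b, c, x, y, z)
```

For instance, `parse_zone("@0.0.0-0.1.1")` describes a tag that starts and ends inside token 0 of the sentence.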
91. to make thoroughly clear what is believed to be a consequence of the rest of this License.
12. If the distribution and/or use of the Library is restricted in certain countries either by patents or by copyrighted interfaces, the original copyright holder who places the Library under this License may add an explicit geographical distribution limitation excluding those countries, so that distribution is permitted only in or among countries not thus excluded. In such case, this License incorporates the limitation as if written in the body of this License.
13. The Free Software Foundation may publish revised and/or new versions of the Lesser General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns.
Each version is given a distinguishing version number. If the Library specifies a version number of this License which applies to it and "any later version", you have the option of following the terms and conditions either of that version or of any later version published by the Free Software Foundation. If the Library does not specify a license version number, you may choose any version ever published by the Free Software Foundation.
14. If you wish to incorporate parts of the Library into other free programs whose distribution conditions are incompatible with these, write to the author to ask for permission. For software which is copyrighted by
92. used by the Tagger program in order to compute probabilities and linearize the text automaton. The tagged corpus file must follow the format described in section 14.10.1. Those files contain tuples (unigrams, bigrams and trigrams) formed by tags and words. In the first data file, tags are "cat" tags (i.e. grammatical, syntactic and semantic codes). In the second data file, tags are "morph" tags (i.e. grammatical, syntactic, semantic and inflectional codes).
OPTIONS:
• -a/--all: indicates whether the program should produce all data files (default);
• -c/--cat: indicates whether the program should produce only the data file with "cat" tags;
• -m/--morph: indicates whether the program should produce only the data file with "morph" tags;
• -n/--no_binaries: indicates whether the program should not compress data files into .bin files (in this case only .dic data files are generated);
• -b/--binaries: indicates whether the program should compress data files into .bin files (default);
• -o XXX/--output=XXX: pattern used to name output tagger data files XXX_data_cat.bin and XXX_data_morph.bin (default: filename of the text corpus without extension);
• -s/--semitic: indicates that the semitic compression algorithm should be used.
13.44 Txt2Tfst
Txt2Tfst [OPTIONS] txt
This program constructs an automaton of a text. txt represents the complete path of a text file without omitting the
93. versions of your texts if needed, and displays the frame shown on Figure 10.3. As you can see, each text is presented as a list, each cell representing a sentence.
Figure 10.1: Text alignment selection frame
Figure 10.2: Warning about raw texts
Figure 10
94. with different range, i.e. the two matches only overlap partially. Green indicates an utterance that appears in only one concordance.
Figure 6.62 gives an example. If you have no previous concordance, the button is deactivated.
[Figure 6.62 legend: Violet: identical sequences with different outputs. Red: similar but different sequences. Green: sequences that occur in only one of the two concordances. Grey background: previous matches. White background: new matches.]
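The colour legend above can be read as a small decision procedure on a pair of matches, one from the previous concordance and one from the new one. The sketch below is an interpretation, not the actual comparison program: matches are assumed to be (start, end, output) triples with half-open spans, and the label returned for fully identical matches is a placeholder because this excerpt does not say how they are displayed.

```python
from typing import Tuple

Match = Tuple[int, int, str]  # (start, end, output), half-open span


def classify(previous: Match, new: Match) -> str:
    """Return the legend colour that applies to a pair of matches."""
    p_start, p_end, p_out = previous
    n_start, n_end, n_out = new
    if (p_start, p_end) == (n_start, n_end):
        # Same sequence: only the outputs can differ.
        return "identical" if p_out == n_out else "violet"
    if p_start < n_end and n_start < p_end:
        # Different ranges that still overlap.
        return "red"
    # No overlap: each match occurs in only one concordance.
    return "green"
```

In a real comparison, each match would first have to be paired with its closest counterpart in the other concordance; that pairing step is left out here.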
95. z: Gen=masc_anim
r: Gen=masc_inanim
f: Gen=fem
n: Gen=neu
describe the equivalences between the previous Morphology.txt file (for Polish and French, respectively) and the single-character features that might be used in DELA dictionaries for those languages under Unitex.
11.2.2 Decomposition of a MWU into Units
The notion of an elementary graphical unit is controversial and varies across languages and NLP systems. For instance, in Unitex, an alphabet, i.e. a set of characters, is first defined for each language. Each non-alphabet character is called a separator. A graphical unit is then either a single separator (usually a punctuation mark, a digit, etc.) or a contiguous sequence of alphabet characters (e.g. aujourd'hui in French consists, according to this definition, of 3 units). In other systems, a graphical unit may contain a punctuation mark (e.g. c'est-à-dire) or a limit between two graphical units may occur within a sequence of alphabet characters (widział bym, cf. [76]).
This variety of possible definitions of a graphical unit obviously has an impact on the definition of a multi-word unit. However, we wish our formalism for MWUs to be adaptable to different morphological systems for simple words. Thus, the definition of a graphical unit is a parameter to our system: each time MULTIFLEX is used with an external module for single units, this module has to decide how a sequence of characte
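The Unitex-style definition of a graphical unit given above (a single separator, or a contiguous sequence of alphabet characters) translates directly into a small splitting function. This is a sketch only: the function name is hypothetical and the alphabet is passed in as a plain set of characters rather than read from an alphabet file.

```python
import string
from typing import List, Set


def graphical_units(text: str, alphabet: Set[str]) -> List[str]:
    """Split `text` into units: runs of alphabet characters or single separators."""
    units: List[str] = []
    current = ""
    for ch in text:
        if ch in alphabet:
            current += ch               # extend the current run of letters
        else:
            if current:
                units.append(current)
                current = ""
            units.append(ch)            # every separator is a unit on its own
    if current:
        units.append(current)
    return units


if __name__ == "__main__":
    letters = set(string.ascii_letters)
    print(graphical_units("aujourd'hui", letters))
```

With an alphabet restricted to unaccented Latin letters, the apostrophe acts as a separator and aujourd'hui is cut into three units, as stated in the text.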
96. 0. This License Agreement applies to any Linguistic Resource which contains a notice placed by the copyright holder or other authorized party saying it may be distributed under the terms of this Lesser General Public License for Linguistic Resources (also called "this License"). Each licensee is addressed as "you".
A "linguistic resource" means a collection of data about language prepared so as to be used with application programs.
The "Linguistic Resource", below, refers to any such work which has been distributed under these terms. A "work based on the Linguistic Resource" means either the Linguistic Resource or any derivative work under copyright law: that is to say, a work containing the Linguistic Resource or a portion of it, either verbatim or with modifications and/or translated straightforwardly into another language. (Hereinafter, translation is included without limitation in the term "modification".)
"Legible form" for a linguistic resource means the preferred form of the resource for making modifications to it.
Activities other than copying, distribution and modification are not covered by this License; they are outside its scope. The act of running a program using the Linguistic Resource is not restricted, and output from such a program is covered only if its contents constitute a work based on the Linguistic Resource (independent of the use of the Linguistic Resource in a to
97. 0 Warning about a non portable graph name For portability you should not use or as separator in graph path names Use instead 96 CHAPTER 5 LOCAL GRAMMARS which is understood as a system independent separator In figure 5 10 and are internally converted by the graph compiler to E greek delta grf Graph repository When you need to call a grammar X inside a grammar Y a simple method is to copy all the graphs of X into the directory that contains the graphs of Y This method raises two problems e the number of graphs in the directory grows quickly e two graphs cannot share the same name To avoid that you can store the grammar X in a special directory called the graph repository This directory is a kind of library where you can store graphs and then call them using instead of To use this mechanism you first need to set the path to the graph repository Go into the Info Preferences Directories menu and select your directory in the Graph repository frame see Figure 5 11 There is one graph repository per language so feel free to share or not the same directory for all the languages you work with E Preferences for English Morphological dictionaries Directories Language amp Presentation Private Unitex directory where all user s dat home paumier unitex Set Graph repository C Produce log information in directory hom
98. 03 528 ToolbarCommands esos cane ew SE ab due Os 104 2o Display Options ea es s sararae ko AAA RARA AN 106 bal Sorting thelmnesofa BOE 23 E m Ea el Ede 106 5 92 QOO Lune de ew eS he HOW AE one de Dun e Us 106 A ok So oie oe Wk eee eme de ee res RESP open 107 504 Boxalignment uuo BGs m o SEERA sed ete 108 5 3 5 Display options fonts and colors cose Ex 109 BE Exporting graphs e ec boron ne e ee de JO C EORR eee AA 111 54 1 Inserting a graph ina document lt a a RE s 111 DAT Pinon Se A Gab eee KOR AK ARE Denk eo ac ec n eS 112 6 Advanced use of graphs 113 bl TINEO SEND vx Bak A he OS CRIAM NUR eme a OAS Eo eS 113 6 11 Inflection transd cerS cocos om eS be E 113 6 1 2 EEN Lis uoc oc x PES 6 Se eH SSeS Eee 114 6 13 Graphs for normalizing the text automaton 115 Ska sers x Rx ee ed pa EES Oe AE 116 619 BELAG pgammArS cases eRe E E Bw Sea med A 116 6 16 Parameterized graphs 2 rerea nea da ba Re SENS EX ES 117 52 Compilation of a grammat v s ss seces A RR ame AA 117 621 Compilatorotagraph s secs x uox aT a OO RE EE OS 117 6 2 2 Approximation with a finite state transducer 117 613 Consitamts OD GIANNA o se o E ere due RU ES we Ew 118 6 24 aca elo zu sermar eneak veces SE UE ee s 122 63 Contek sra ud ea a A die Abs BONO EUR IDE T 122 601 GIN GONE Lu cb oie sels E op eee e ER Ee AE 122 Go o v rd ake ad ce xps we awe qood ew PD we e 125 64 Themorphological mode
99. 10 Creating the personal work directory. directory for which you will need to have the access rights (this might mean that you need to ask your system administrator to do it). On the other hand, if the language is only used by a single user, that user can also copy the directory to his or her working directory and work with this language without it being shown to other users.
1.8 Uninstalling Unitex
No matter which operating system you are working with, it is sufficient to delete the Unitex directory to completely delete all the program files. Under Windows, you may have to delete the shortcut to Unitex.jar if you have created one on your desktop. The same has to be done on Linux if you have created an alias.
1.9 Unitex for developers
If you are a programmer, you may be interested in linking your code with Unitex C++ sources. To facilitate such an operation, you can compile Unitex as a dynamic library that contains all Unitex functions (except mains, of course). Under Linux/MacOS, type: make LIBRARY=yes and you will obtain a library named libunitex.so. If you want to produce a Windows DLL named unitex.dll, use the following commands. Windows: make SYSTEM=windows LIBRARY=yes. Cross-compiling with mingw32: make SYSTEM=mingw32 LIBRARY=yes. In all cases, you will also obtain a program named Test_lib(.exe). If everything worked fine, this program should display the following: Expression conv
100. encodings, as for example UTF-8, in order for instance to create web pages. The button Add Files enables you to select the files to be converted. The button Remove Files makes it possible to remove a list of files erroneously selected. The button Transcode will start the conversion of all the selected files. If an error occurs while a file is processed (for example, a file which is already in Unicode), the conversion continues with the next file.
[Figure 2.3: Transcoding files]
To obtain a text in the right format, you can also use a text processor like the free software from OpenOffice.org [72] or Microsoft Word, and save your document with the format "Unicode text". In OpenOffice Writer, you have to choose the "Coded Text (.txt)" format and then select the Unicode encoding in the configuration window, as shown on figure 2.5.
Figure 2.4: Saving in Unicode with OpenOffice Write
101. 2XN2 C_2XN2 k NC 2XN NC 2XN2 NC 2XN2 NC 2XN2 C 2XN2 C 2XN2 NC 2XN2 C 2XN2 C_2XN2 N N Comp s2vm N Comp s3vm N Comp s4vm 2 N Comp s5vm 2 N Comp s6vm N Comp s7vm N Comp plvm N Comp p2vm N2 N Comp p3vm N Comp p4vm N Comp p5vm N2 N Comp p6vm N2 N Comp p7vm N Comp w2vm N Comp w4vm Comp sivm Comp s2vm N Comp s3vm NC_2XN2 NC 2XN2 N Comp s4vm N Comp s5vm N Comp s6vm Comp s7vm Comp plvm N Comp p2vm 2 N Comp p3vm N Comp p4vm Comp p5vm 231 232 CHAPTER 11 COMPOUND WORD INFLECTION avioprevoznicima avio prevoznik NC_2XN2 N Comp p6vm avioprevoznicima avio prevoznik NC_2XN2 N Comp p7vm avioprevoznika avio prevoznik NC_2XN2 N Comp w2vm avioprevoznika avio prevoznik NC_2XN2 N Comp w4vm predsednik drzxave predsednik drzxave NC_N2X1 N Comp slvm predsednika drzxave predsednik drzxave NC_N2X1 N Comp s2vm predsedniku drzxave predsednik drzxave NC_N2X1 N Comp s3vm predsednika drzxave predsednik drzxave NC_N2X1 N Comp s4vm predsednicye drzxave predsednik drzxave NC_N2X1 N Comp s5vm predsednikom drzxave predsednik drzxave NC_N2X1 N Comp s6vm predsedniku drzxave predsednik drzxave NC_N2X1 N Comp s7vm predsednici drzxave predsednik drzxave N2X1 N Comp plvm C E predsednici drzxava predsednik drzxave NC_N2X1 N Comp plvm C C predsednika drzxave predsednik drzxave N2X1 N Comp p2vm predsednika drzxava predsedni
102. 3 grf that illustrates how the semitic inflection process works.
[Figure 3.11: A toy semitic inflection grammar, with a box containing ya1o2o3u and the output aP3ms]
Such a grammar obeys the following rules:
1. All standard inflection operators can be used (L, R, etc.).
2. A digit stands for a consonant of the skeleton: 1 for the first, 2 for the second, etc. In our example, 1, 2 and 3 will respectively stand for k, t and b. If you want to access a letter after the ninth one, you must protect its index with angles, like <10>.
The DELAF output for this grammar is: yakotobu,ktb.V:aP3ms
3.6 Compression
Unitex applies compressed dictionaries to the text. The compression reduces the size of the dictionaries and speeds up the lookup. This operation is done by the Compress program. This program takes a dictionary in text form as input (for example my_dico.dic) and produces two files:
• my_dico.bin contains the minimal automaton of the inflected forms of the dictionary;
• my_dico.inf contains the codes extracted from the original dictionary.
The minimal automaton in the my_dico.bin file is a representation of inflected forms in which all common prefixes and suffixes are factorized. For example, the minimal automaton of the words me, te, se, ma, ta and sa can be represented by the graph shown in Figure 3.12.
Figure 3.12: Representation of a minimal automaton
To compress a dictionary, open it and click on Compress into FST in th
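The digit rule for semitic inflection can be illustrated outside Unitex. The following Python sketch is not Unitex code and does not call any Unitex API: the function name apply_skeleton is hypothetical, and the standard inflection operators (L, R, etc.) are deliberately left out. It only shows how the digits of a pattern are replaced by the consonants of a root, including the angle-protected form for indexes above nine.

```python
import re


def apply_skeleton(pattern: str, root: str) -> str:
    """Replace each consonant index found in `pattern` by a letter of `root`.

    A bare digit 1-9 or an angle-protected index such as <10> selects the
    consonant at that 1-based position in the root. Every other character
    of the pattern is copied unchanged.
    """

    def substitute(match: re.Match) -> str:
        # group(1) is set for <N>, group(2) for a bare digit.
        index = int(match.group(1) or match.group(2))
        if not 1 <= index <= len(root):
            raise ValueError(f"index {index} is outside the root {root!r}")
        return root[index - 1]

    return re.sub(r"<(\d+)>|([1-9])", substitute, pattern)


if __name__ == "__main__":
    # Root k-t-b with the pattern of the toy grammar.
    print(apply_skeleton("ya1o2o3u", "ktb"))  # yakotobu
```

With the root ktb, the pattern ya1o2o3u yields the inflected form given in the DELAF line of the manual. The grammatical codes (V, aP3ms) are not produced by this sketch.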
103. 314 ElagComp 264 Equivalences txt 215 Evamb 264 Extract 264 F 52 Flatten 117 265 348 INDEX Fst2Check 265 Fst2Grf 179 Fst2List 266 Fst2Txt 38 40 267 G 52 Grf2Fst2 117 268 Hum 51 HumCo11 51 1 02 INTJ 51 ImplodeTfst 270 J 52 58 K 52 L 58 113 Locate 65 208 270 LocateTfst 273 Morphology txt 214 215 MultiFlex 274 N 51 Normalize 252 275 P 52 58 113 PREP 51 PRO 51 PolyLex 45 276 R 58 113 RebuildTfst 277 Reconstrucao 159 277 Reg2Grf 277 S 52 Seq2Grf 278 SortIxt 53 279 293 Stats 279 T 32 TEI2Txt 281 Table2Grf 280 Tagger 280 TagsetNormT fst 281 Tfst2Grf 281 Tfst2Unambig 182 282 Tokenize 40 282 TrainingTagger 283 Txt2Tfst 284 U 58 113 349 Uncompress 285 UnitexTool 286 UnitexToolLogger 286 Untokenize 285 Unxmlize 288 v 51 w 52 58 113 XMLizer 289 Y B2 48 71 48 48 49 401 en 51 f 52 1 94 m o2 n 52 ne 51 norm rul 174 Pp 04 s 52 se 51 t 51 tags ind 67 tokens txt 185 z1 51 z2 51 23 94 STOP 72 76 S 38 76 275 253 299 315 321 73 64 Acyclic automaton 153 Adding languages 27 Advanced search options 145 Algebraic languages 90 All matches 79 143 Alphabet 38 259 267 270 273 282 284 292 Korean 187 sort 55 350 sorted 293 Ambiguity rate 169 Ambiguity removal 163 Ambiguou
104. 3XN2 plmgea hungry as a wolf kao vuci gladan kao vuk AC_A3XN2 plmgea hungry as a wolf kao vukovi gladan kao vuk AC A3XN2 plmgea hungry as a wol kao vuk gladan kao vuk AC A3XN2 plfgea hungry as a wolf kao vuci gladan kao vuk AC A3XN2 plfgea hungry as a wolf kao vukovi gladan kao vuk AC A3XN2 plfgea hungry as a wol kao vuk gladan kao vuk AC A3XN2 plngea hungry as a wolf kao vuci gladan kao vuk AC_A3XN2 plngea hungry as a wolf kao vukovi gladan kao vuk AC A3XN2 plngea hungry as a wol vuk gladan kao vuk AC A3XN2 p2mgea hungry as a wolf vuci gladan kao vuk AC A3XN2 p2mgea hungry as a wolf vukovi gladan kao vuk AC A3XN2 p2mgea hungry as a wo vuk gladan kao vuk AC A3XN2 p2fgea hungry as a wolf vuci gladan kao vuk AC A3XN2 p2fgea hungry as a wolf vukovi gladan kao vuk AC A3XN2 p2fgea hungry as a wo gladnih kao vuk gladan kao vuk AC A3XN2 p2ngea hungry as a wolf gladnih kao vuci gladan kao vuk AC A3XN2 p2ngea hungry as a wolf gladnih kao vukovi gladan kao vuk AC A3XN2 p2ngea hungry as a wo gladnima kao vuk gladan kao vuk AC A3XN2 p3mgea hungry as a wolf gladnima kao vuci gladan kao vuk AC A3XN2 p3mgea hungry as a wol gladnima kao vukovi gladan kao vuk AC A3XN2 p3mgea hungry as a gladnim kao vuk gladan kao vuk AC A3XN2 p3mgea hungry as a wolf gladnim kao vuci gladan kao vuk AC A3XN2 p3mgea hungry as a wolf gladnim kao vukovi gladan kao vuk AC A3XN2 p3mgea hungry as a wo gladnima kao vuk gladan kao vuk AC A3XN2 p3fgea hungry as a wolf gl
105. 53
7.1 Displaying text automata 153
7.2 Construction 155
7.2.1 Construction rules for text automata 155
7.2.2 Normalization of ambiguous forms 156
7.2.3 Normalization of clitical pronouns in Portuguese 157
7.2.4 Keeping the best paths 159
7.3 Resolving Lexical Ambiguities with ELAG 163
7.3.1 Grammars For Resolving Ambiguities 163
7.3.2 Compiling ELAG Grammars 164
7.3.3 Resolving Ambiguities 166
7.3.4 Grammar collections 168
7.3.5 Window For ELAG Processing 168
7.3.6 Description of the tagsets 169
7.3.7 Grammar Optimization 175
7.4 Linearizing text automaton with the tagger 176
7.4.1 Compatibility of the tagset 177
7.4.2 Use of the Tagger 178
7.4.3 Creation of a new tagger 178
7.5 Manipulation of text automata 179
7.5.1 Displaying sentence automata 179
7.5.2 Modifying the text automaton 180
7.5.3 Display configuration 182
7.6 Converting the text automaton into linear text 182
7.7 Searching patterns in the text auto
106. ALPH: specifies the alphabet file to be used for tokenizing the content of the grammar boxes into lexical units;
• -c/--char_by_char: tokenization will be done character by character. If neither the -c nor the -a option is used, lexical units will be sequences of any Unicode letters;
• -d DIR/--pkgdir=DIR: specifies the repository directory to use (see section 5.2.2, page 96);
• -e/--no_empty_graph_warning: no warning will be emitted when a graph matches the empty word. This option is used by MultiFlex in order not to scare users with meaningless error messages when they design an inflection grammar that matches the empty word;
• --tfst_check: checks whether the given graph can be considered as a valid sentence automaton or not;
• -s/--silent_grf_name: does not print the graph names (needed for consistent log files across several systems);
• -r XXX/--named_repositories=XXX: declaration of named repositories. XXX is made of one or more X=Y sequences, separated by ;, where X is the name of the repository denoted by the pathname Y. You can use this option several times;
• --debug: compile graphs in debug mode;
• -v/--check_variables: check output validity to avoid malformed variable expressions.
The result is a file with the same name as the graph passed to the program as a parameter, but with extension .fst2. This file is saved in the same folder as <grf>.
13.22 GrfDiff
GrfDiff <grf1> <
107. CF dictionary resulting from the inflection via MULTIFLEX of the above DELAC is as follows zxiro racyun zxiro racyun NC_2XN1 N Comp slqm zxiro racyuna zxiro racyun NC_2XN1 N Comp s2qm zxiro racyunu zxiro racyun NC_2XN1 N Comp s3qm zxiro racyun zxiro racyun NC_2XN1 N Comp s4qm zxiro racyune zxiro racyun NC_2XN1 N Comp s5qm 11 3 INTEGRATION IN UNITEX zxiro racyunom zxiro racyun NC_2XN1 N zxiro racyunu zxiro racyuni zxiro racyuna zxiro racyune ZXiro racyun NC_ ZXiro racyun NC_ ZXiro racyun NC_ zxiro racyunima zxiro racyun NC 2 Zxiro racyun NC_ 2XN1 N C 2XN1 N C 2XN1 N C 2XN1 N C zxiro racyuni zxiro racyun NC_2XN1 N C Comp s6qm omp s7qm omp plqm omp p2qm 1 N Comp p3qm omp p4qm omp p5qm zxiro racyunima zxiro racyun NC_2XN1 N Comp p7qm zxiro racyuna zxiro racyun NC_2XN1 N C zxiro racyuna zxiro racyun NC_2XN1 N C zxiro racyun zxiro racyun NC_2XN1 N Co zxiro racyuna zxiro racyun NC_2XN1 N C zxiro racyunu zxiro racyun NC_2XN1 N C zxiro racyun zxiro racyun NC_2XN1 N Co zxiro racyune zxiro racyun NC_2XN1 N C zxiro racyunom zxiro racyun NC_2XN1 N zxiro racyunu zxiro racyun NC_2XN1 N C zxiro racyuni zxiro racyun NC_2XN1 N C zxiro racyuna ZXiro racyun C 2XN1 N C zxiro racyunima zxiro racyun NC_2XN1 N zxiro racyune zxiro racyun NC_2XN1 N C zxiro racyuni ZXiro racyun C 2XN1 N C zxiro racyunima zxiro racyun NC_2XN1 N zxiro racyunima zxiro rac
108. CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Appendix C
Lesser General Public License For Linguistic Resources
This license was designed by the University of Marne-la-Vallée, and it has received the approval of the Free Software Foundation [1].
Preamble
The licenses for most data are designed to take away your freedom to share and change it. By contrast, this License is intended to guarantee your freedom to share and change free data, to make sure the data are free for all their users. This license, the Lesser General Public License for Linguistic Resources, applies to some specially designated linguistic resources, typically lexicons, grammars, thesauri and textual corpora.
TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
109. [Figure 9.1: Lexicon-grammar Table 32NM, a spreadsheet of French verbs (accepter, accueillir, accuser, admettre, affecter, afficher, aimer, approcher, arpenter, atteindre, avoir, avoisiner, battre, cacher) with one example sentence per verb]
9.2 Conversion of a table into graphs
9.2.1 Principle of parameterized graphs
The conversion of a table into graphs is carried out by a mechanism involving parameterized graphs. The principle is the following: a graph that describes the possible constructions is constructed manually. That graph refers to the columns of the table in the form of parameters or variables. Afterwards, for each line of the table, a copy of this graph is constructed
110. EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
END OF TERMS AND CONDITIONS
How to Apply These Terms to Your New Libraries
If you develop a new library, and you want it to be of the greatest possible use to the public, we recommend making it free software that everyone can redistribute and change. You can do so by permitting redistribution under these terms (or, alternatively, under the terms of the ordinary General Public License). To apply these terms, attach the following notices to the library. It is safest to attach them to the start of each source file to most effectively convey the exclusion of warranty; and each file should have at least the "copyright" line and a pointer to where the full notice is found.
<one line to give the library's name and a brief idea of what it does.>
Copyright (C) <year> <name of author>
This library is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version.
This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.
You should have received a copy of the GNU Lesser General Public License along
111. [Figure 12.2: Cassys configuration window with a list of transducers on the right-hand side]
• A file explorer at the left of the frame lets you select the transducers to place in the cascade. The file explorer only displays fst2 files (all the graphs you want to place in the list of transducers must be compiled in fst2 format). To edit the cascade, select the graphs in the file explorer at the left and drag and drop them into the right frame of the window.
• The table at the right displays the list of transducers in the current cascade. This table is empty for a new cascade. The list of transducers is ordered: the graphs are applied in the order of the list. The columns of the table give information on each graph to apply and let you choose its behavior:
- Rank of the transducer in the cascade (the resulting files of each graph have the number).
- Disabled: checkbox to disable the current graph ("disabled" meaning not applied in the cascade). Disabled graphs appear unnumbered, in light grey and struck out.
- Name: the name of the transducer (graph with extension fst2).
- The other columns, for different behaviors, are explained later.
Graphs which source file
112. For example, a noun inflects for number and case, and has a fixed gender. The presence of such a file is necessary if we wish to express the fact that a certain word inflects for number, gender or case without having to explicitly enumerate each time which inflectional values (singular, plural, masculine, etc.) it can take. Similarly, for French the Morphology.txt file may be as follows:
French
<CATEGORIES>
Nb: s, p
Gen: m, f
<CLASSES>
noun: (Nb,<var>),(Gen,<var>)
adj: (Nb,<var>),(Gen,<var>)
adv:
However, in the existing systems for computational morphology, such a description of classes, categories and values is not always present. For example, according to the DELA conventions [20], the morphological values of each simple word are plain sequences of characters (e.g. ms for masculine singular) without any explicit mention of their corresponding categories. In order for the program to be compatible with such systems, we use a list, contained in a file called Equivalences.txt, that describes which foreign inflectional feature corresponds to which category-value pair in our description. For example, the following lists:
Polish | French
s : Nb=sing | s : Nb=s
p : Nb=pl | p : Nb=p
M : Case=Nom | f : Gen=f
D : Case=Gen | m : Gen=m
C : Case=Dat |
B : Case=Acc |
I : Case=Inst |
L : Case=Loc |
V : Case=Voc |
0 : Gen=masc pers
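To make the role of Equivalences.txt concrete, here is a minimal Python sketch of the lookup it describes. It is an illustration only, not MultiFlex code: the dictionary FRENCH_EQUIVALENCES is a hand-written copy of the French entries listed above, and the function name decode_features is hypothetical. It turns a plain DELA-style code such as ms into explicit category-value pairs.

```python
# Hand-written copy of the French equivalences listed in the manual:
# one DELA character maps to one (category, value) pair.
FRENCH_EQUIVALENCES = {
    "s": ("Nb", "s"),
    "p": ("Nb", "p"),
    "f": ("Gen", "f"),
    "m": ("Gen", "m"),
}


def decode_features(code, equivalences=FRENCH_EQUIVALENCES):
    """Translate a DELA feature string into a {category: value} mapping."""
    features = {}
    for char in code:
        if char not in equivalences:
            raise KeyError(f"no equivalence declared for {char!r}")
        category, value = equivalences[char]
        if category in features:
            # Two values for the same category in one code is inconsistent.
            raise ValueError(f"category {category} is set twice in {code!r}")
        features[category] = value
    return features


if __name__ == "__main__":
    print(decode_features("ms"))  # {'Gen': 'm', 'Nb': 's'}
```

The same function would work for the Polish column if the corresponding entries were supplied as a second dictionary.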
113. G, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE LINGUISTIC RESOURCE IS WITH YOU. SHOULD THE LINGUISTIC RESOURCE PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE LINGUISTIC RESOURCE AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE LINGUISTIC RESOURCE (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE LINGUISTIC RESOURCE TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
END OF TERMS AND CONDITIONS
Bibliography
[1] Free Software Foundation. http://www.fsf.org/.
[2] Anna ANASTASSIADIS-SYMEONIDIS, Tita KYRIACOPOULOU, Elsa SKLAVOUNOU, Iasson THILIKOS, and Rania VOSKAKI. A system for analysing texts in modern greek: representing and solving ambiguities. In Proceedings of COMLEX 2000, Workshop on Computational Lexicography and Multimedia Dictionaries. Patras, 2000.
[3] Jean-Claude ANSCOMBRE. Pourquoi un moulin à vent n'est pas un ventilateur. La
114. [Figure 5.20: weights in graphs]
variable name preceded by $ and followed by a parenthesis. Then link these boxes to the zone of the grammar to store. In the graph in figure 5.21, you see a sequence of digits before dollar or dollars. This sequence will be stored in a variable named var1.
[Figure 5.21: Using the variable var1]
Variable names may contain latin letters (without accents), upper or lower case, numbers, or the _ (underscore) character. Unitex distinguishes between uppercase and lowercase characters. When a variable is defined, you can use it in transducer outputs by surrounding its name with $. The grammar in figure 5.22 recognizes a date formed by a month and a year, and produces the same date as an output, but in the order year-month. If you want to use the character $ in the output of a box, you have to double it, as shown on figure 5.21.
January February March April May June
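For readers more familiar with regular expressions than with graph variables, the month-year reordering can be mimicked with named groups. The Python sketch below is only an analogy, not Unitex syntax: the names MONTHS, DATE and year_first are hypothetical, and the assumption that a year is exactly four digits is ours, since the figure itself is not reproduced here. The named groups play the role of the variables, and the replacement template plays the role of the transducer output.

```python
import re

# Assumption: the grammar accepts English month names and a four-digit year.
MONTHS = (
    "January|February|March|April|May|June|"
    "July|August|September|October|November|December"
)

# Each named group stores a zone of the match, much like a graph variable.
DATE = re.compile(rf"\b(?P<month>{MONTHS}) (?P<year>\d{{4}})\b")


def year_first(text: str) -> str:
    """Rewrite every 'month year' date of `text` in the order 'year month'."""
    return DATE.sub(r"\g<year> \g<month>", text)


if __name__ == "__main__":
    print(year_first("published in March 2011"))  # published in 2011 March
```

Text that does not match the pattern is left unchanged, which corresponds to applying the grammar in a mode where unmatched text is copied to the output.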
115. Lebewesen ne os X ue rene e 3 74 Morphological dictionary graphs 25 Bibliography oss sa ue sue mue se note Rire d Searching with regular expressions 4 Defiiloh 4 444444 444 ua us x d ys S2 TORAS Sod d A d 3 edd de nouns qM nter Som wh owns 43 Lexical masks 31 Special symbols o Rs 4 3 2 References to information in the dictionaries 4 3 3 Grammatical and semantic constraints 43 4 Inflectional constraints 4 3 5 Negation of a lexical mask 2 2 dd Conatena looo a RE a X dre Z5 UNO 29 99 9 EUG Swe e we OS eS 46 Kleenestar 47 Morphological filters ses eoo E sacia ewes S3 DOS ed Xj id dan a de ae dien dius 451 Search configuration s ors se mad noi ows 4 8 2 Presentation of the results 45 3 SAUSU S Lions iaa 34 XE E Local grammars 5 1 The local grammar formalism so raw Sli Algebrai Grammars 226 x Re AC 5 1 2 Extended algebraic grammars Ee E 52 BOI TP se a A Aa ee Y ok CONTENTS Mb as E o 45 CONTENTS 5 DI CAMARO era a pid orales ERED 90 or GUP GrAPIS oc se sa loue exe xxu qe A ADA 95 525 Manipulating DOxeS sc mons d ue er dede Bed OG EN E 98 524 noc 9 dee Ros die onc wu ke Ree E ein aapi 99 e Using Variables sunu ds Roe gehe mom ELGAR EYES 100 526 Copying 224299 4 34 AGERE AGREGUE C RR e dos 102 Su Sp BUB S x ocv Oe eh She Sur RF v Re eed 1
116. Looking up a word in a dictionary 53 Lowercase see Respect of lowercase uppercase 116 Matrices 195 MERGE 38 65 135 144 305 Meta characters 103 Meta symbols 38 72 Modification of the text 149 257 Morphological dictionaries 130 Morphological dictionary graphs 67 Morphological filters 65 77 Morphological mode 65 129 Moving word groups 139 Multi word units 211 Multiple selection 98 copy paste 99 MWU 211 Negation 74 Negation of a lexical mask 74 Non terminal symbols 89 Normalization clitics in Portuguese 277 of ambiguous forms 284 INDEX of ambiguous forms 115 156 of clitics in Portuguese 157 of non ambiguous forms 39 of separators 37 275 of the text automaton 115 156 284 Normalization rule file 321 Norwegian compound words 45 Occurrences extraction 150 number of 80 144 Operator DK lt I gt 58 lt R gt 58 lt X n gt 58 C 98 113 D 58 115 J 58 L 98 113 P 58 113 Roe 113 U 98 113 W 58 113 concatenation 76 disjunction 77 Kleene star 77 Optimizing ELAG Grammars 175 Options configuration 109 Output associated to a subgraph call 119 Output variables 141 Overlapping occurrences 136 Parameterized graphs 196 Parenthesis 76 Paste 99 102 104 Pattern search 270 273 Pixellisation 107 PNG graph export 111 Portuguese normalization of clitics 157 277 POSIX 77 INDEX Preferences 110 Printing a graph 112
117. [Figure 6.4: Compilation window, showing the messages of the graph compiler. Messages with a colored background are generated by the interface, not by the external programs.]
If the graph references subgraphs, those are automatically compiled. The result is a .fst2 file that contains all the graphs that make up a grammar. The grammar is then ready to be used by Unitex programs.
6.2.2 Approximation with a finite state transducer
The FST2 format conserves the architecture in subgraphs of the grammars, which is what makes them different from strict finite state transducers. The Flatten program allows you to turn a FST2 grammar into a finite state transducer whenever this is possible, and to construct an approximation if not. This function thus makes it possible to obtain objects that are easier to manipulate and to which all classical algorithms on automata can be applied. In order to compile and thus transform a grammar, select the command Compile & Flatten FST2 in the Tools submenu of the FSGraph menu. The window of Figure 6.5 allows you to configure the approximation process.
Compile & Flatten x 2 Expected resul
118. NC_AXN3 N Comp NProp Org fp6q Ujedinxenima nacijama Ujedinxene nacije NC_AXN3 N Comp NProp Org fp7q Ujedinxenim nacijama Ujedinxene nacije NC_AXN3 N Comp NProp Org fp7q Kosovo i Metohija Kosovo i Metohija NC_N3XN N Comp NProp ToptReg nslq Kosova i Metohije Kosovo i Metohija NC_N3XN N Comp NProp Top Reg ns2q Kosovu i Metohiji Kosovo i Metohija NC_N3XN N Comp NProp Top Reg ns3q Kosovo i Metohiju Kosovo i Metohija NC_N3XN N Comp NProp ToptReg ns4q Kosovo i Metohijo Kosovo i Metohija NC_N3XN N Comp NProp Top Reg ns5q Kosovom i Metohijom Kosovo i Metohija NC_N3XN N Comp NProp ToptReg ns6q Kosovu i Metohiji Kosovo i Metohija NC_N3XN N Comp NProp Top Reg ns7q istrazxne sudije istrazxni sudija NC_AXNF N Comp lvfp istrazxnih sudija istrazxni sudija NC_AXNF N Comp 2vfp istrazxnima sudijama istrazxni sudija NC_AXNF N Comp 3vfp istrazxnim sudijama istrazxni sudija NC_AXNF N Comp 3vfp istrazxne sudije istrazxni sudija NC_AXNF N Comp 4vfp istrazxne sudije istrazxni sudija NC_AXNF N Comp 5vfp istrazxnima sudijama istrazxni sudija NC_AXNF N Comp 6vfp istrazxnim sudijama istrazxni sudija NC_AXNF N Comp 6vfp istrazxnima sudijama istrazxni sudija NC_AXNF N Comp 7vfp 11 3 INTEGRATION IN UNITEX is is is is is is is is is is is is US is istr e is dS Ly dis i LS Ly LS trazxni azxni su trazxni razxni s trazxni trazxni razx
119. [Figure 5.25: Toolbar]
The first two icons are shortcuts for saving and compiling the graph. The following five correspond to the Copy, Cut, Paste, Redo and Undo operations. The following six icons correspond to edit commands for boxes. The first one, a white arrow, corresponds to the boxes' normal edit mode. The next five icons correspond to specific tools. In order to use a tool, click on the corresponding icon. The mouse cursor changes its form, and mouse clicks are then interpreted in a particular fashion. What follows is a description of these tools, from left to right:
• creating boxes: creates a box at the empty place where the mouse was clicked;
• deleting boxes: deletes the box that you click on;
• connect boxes to another box: using this utility, you select one or more boxes and connect it or them to another one. In contrast to the normal mode, the connections are inserted to the box where the mouse button was released;
• connect boxes to another box in the opposite direction: this utility performs the same operation as the one described above, but connects the boxes to the one clicked on in the opposite direction;
• open a sub-graph: opens a sub-graph when you click on a grey line within a box.
The next icon, showing a key, is a shortcut to open the window with the graph display options. The following two icons allow you to view lists of graphs that are related to the current graph by a
120. Now go to the FSGraph menu, then to the Tools menu, and click on Explore Graph paths. The window of figure 6.36 appears. The upper box contains the name of the main graph of the grammar to be explored. The following options are connected to the outputs of the grammar and to subgraph calls:
• Ignore outputs: outputs are ignored;
• Separate inputs and outputs: outputs are displayed after inputs (a b c A B C);
• Merge inputs and outputs: each output is emitted immediately after the input to which it corresponds (a A b B c C);
• Only paths: calls to subgraphs are explored recursively;
• Do not explore subgraphs recursively: calls to subgraphs are printed but not explored recursively.
[Figure 6.36: Exploring the paths of a grammar]
If the option Maximum number of sequences is activated, the specified number will be the maximum number of generated paths. If the option is not selected, all paths will be generated, if they are in finite number. Here you see what is created for the graph shown on Figure 6.37 with default settings (ignoring outputs, limit: 100 paths):
z B boule B boule B boule B bou
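The three output options can be summarized with a small formatting function. This Python sketch is not the code of the Unitex path explorer; the function name format_path and the mode names are hypothetical, and joining tokens with single spaces is an assumption, since the exact separators used by Unitex are not visible in the text above. A path is represented as a list of (input, output) pairs.

```python
def format_path(path, mode="ignore"):
    """Render one grammar path according to an output option.

    `path` is a list of (input, output) pairs.
    mode "ignore":   inputs only
    mode "separate": all inputs, then all outputs
    mode "merge":    each output right after its input
    """
    inputs = [inp for inp, _ in path]
    outputs = [out for _, out in path]
    if mode == "ignore":
        return " ".join(inputs)
    if mode == "separate":
        return " ".join(inputs + outputs)
    if mode == "merge":
        return " ".join(token for pair in path for token in pair)
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    path = [("a", "A"), ("b", "B"), ("c", "C")]
    print(format_path(path, "separate"))  # a b c A B C
    print(format_path(path, "merge"))     # a A b B c C
```

The two printed lines reproduce the examples given in the option list for "Separate inputs and outputs" and "Merge inputs and outputs".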
121. PTER 1 INSTALLATION OF UNITEX
[Figure 1.1: Checking and modify Java Preferences. The Java Preferences panel lists the installed Java versions (Java SE 6 64-bit, J2SE 5.0 32-bit, J2SE 5.0 64-bit, J2SE 1.4.2 32-bit); drag them to change the preferred version for applets and for Java applications.]
1.5.2 SoyLatte
SoyLatte is a functional, X11-based port of the FreeBSD Java 1.6 patchset to Mac OS X Intel machines. SoyLatte is initially focused on supporting Java 6 development; however, the long-term view is far more captivating: open development of Java 7 for Mac OS X, with a release available in concert with the official Sun release, supported on all recent versions of Mac OS X.
Before you start: try it at your own risk! This tutorial tells you what I have done to have Unitex running on my MacBook Pro (Mac OS X 10.5.7), but it offers no guarantee that it will work on your computer, and it does not even guarantee that you ca
122. "Set Default" allows you to define the current selection of dictionaries as the default. This default selection will then be used during preprocessing if you activate the option "Apply All default Dictionaries". If you right-click on a dictionary name, the associated documentation, if any, will be displayed in the lower frame of the window.

[Figure 2.14: Parameterizing the application of dictionaries]

2.5.6 Analysis of compound words in Dutch, German, Norwegian and Russian

In certain languages like Norwegian, German and others, it is possible to form new compound words by concatenating together other words. For example, the word aftenblad, meaning evening journal, is obtained by combin
123. Some sequences close to those of the sequence corpus might appear in the text and be ignored because they are not in the sequence corpus. These sequences should be included in the sequence automaton. In order to find them, you can produce a graph that recognizes all the sequences from the sequence corpus, plus derived sequences that result from the application of three kinds of wildcards. Each wildcard makes it possible to apply an operation that generates new sequences:
• insertion: for each sequence, add to the automaton all the sequences where <TOKEN> was inserted between two words of the original sequence;
• replacement: for each sequence, add to the automaton all the sequences where i tokens have been replaced by <TOKEN>;
• deletion: for each sequence, add to the automaton all the sequences where a token has been deleted.
Each of these operations can be applied several times to the original sequences. Applying this grammar to a text will introduce approximations in the search of the sequences in the text. When wildcards are used, the produced graphs follow these rules:
• both original and derived sequences are included in the automaton;
• no empty sequence nor sequence made only of wildcards will be added to this graph (such sequences could be produced by deletions or replacements on short sequences);
• no insertion of a wildcard at the head or tail of a sequence;
• every toke
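The wildcard operations listed in this excerpt can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Unitex implementation: each operation is applied once, replacement touches a single token, and the helper names (`insertions`, `replacements`, `deletions`, `derive`) are hypothetical.

```python
TOKEN = "<TOKEN>"

def insertions(seq):
    # Insert the wildcard between two words only, never at head or tail.
    return {seq[:i] + (TOKEN,) + seq[i:] for i in range(1, len(seq))}

def replacements(seq):
    # Replace exactly one token by the wildcard (simplifying assumption).
    return {seq[:i] + (TOKEN,) + seq[i + 1:] for i in range(len(seq))}

def deletions(seq):
    # Delete exactly one token.
    return {seq[:i] + seq[i + 1:] for i in range(len(seq))}

def is_valid(seq):
    # Reject empty sequences and sequences made only of wildcards.
    return len(seq) > 0 and any(tok != TOKEN for tok in seq)

def derive(words):
    """Return the original sequence plus its valid one-step derivations."""
    seq = tuple(words)
    candidates = {seq} | insertions(seq) | replacements(seq) | deletions(seq)
    return {s for s in candidates if is_valid(s)}

for s in sorted(derive("this week".split())):
    print(" ".join(s))
```

Applying `derive` again to each result would model several applications of the operations; the validity filter then has to be applied after every round.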
124. The operator sequence is LLDRRn:
• Step 1 (L): the cursor is moved one position to the left;
• Step 2 (L): the cursor is moved one position to the left again;
• Step 3 (D): one character is deleted; everything to the right of the cursor is shifted one position to the left;
• Step 4 (R): the cursor is moved to the right;
• Step 5 (R): and to the right again;
• Step 6 (n): the character n is pushed on the stack.
When all operations have been fulfilled, the inflected form consists of all letters before the cursor (here chosen). The inflection program explores all paths of the inflectional grammar and tries all possible forms. In order to avoid having to replace the names of inflectional grammars by the real grammatical codes in the dictionary used, the program replaces these names by the longest prefixes made of letters if you have selected the "Remove class numbers" button. Thus, N4 is replaced by N.

    aviatrices,aviatrix.N+Hum:p
    aviatrix,aviatrix.N+Hum:s
    matrices,matrix.N+Math:p
    matrix,matrix.N+Math:s
    radices,radix.N:p
    radix,radix.N:s

Figure 3.9: Result of automatic inflection
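The six steps above can be replayed with a small Python model. It is a sketch, assuming the semantics read off the steps in this excerpt: L and R move the cursor, D deletes the character just before the cursor, and any other character is written at the cursor position. The function name `apply_inflection_ops` is hypothetical, and other operators that Unitex may support are not modelled.

```python
def apply_inflection_ops(lemma, ops):
    """Apply a string of inflection operators to a lemma (simplified model)."""
    buf = list(lemma)
    cursor = len(buf)              # the cursor starts after the last letter
    for op in ops:
        if op == "L":              # move the cursor one position to the left
            cursor = max(cursor - 1, 0)
        elif op == "R":            # move the cursor one position to the right
            cursor = min(cursor + 1, len(buf))
        elif op == "D":            # delete the character before the cursor
            if cursor > 0:
                del buf[cursor - 1]
                cursor -= 1
        else:                      # push a character at the cursor position
            if cursor < len(buf):
                buf[cursor] = op
            else:
                buf.append(op)
            cursor += 1
    # The inflected form consists of all letters before the cursor.
    return "".join(buf[:cursor])

print(apply_inflection_ops("choose", "LLDRRn"))
```

Keeping the letters to the right of the cursor in the buffer is what lets R restore them after a deletion, which is why the e of choose survives in the result.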
125. TOP tag
• TEILite files, in which sequences are delimited by the following XML tag: <seg type="sequence">example</seg>

Since the corpus contains specific sequences, it must be built by hand. This means that you have to either write all the sequences in a raw text file and separate them by an end of line (figure 8.1), or insert the specific XML tag in an existing TEILite document (figure 8.3). The preprocessing of TXT or XML documents will produce an SNT file that is used to build the sequence automaton (figure 8.2). This file can be used as an input. The produced graph will only recognize the sequences that are correctly delimited. Production of local grammars is automatic only from a corpus of well-defined sequences. If you have such a corpus, then the time saved is considerable.

Figure 8.1: TXT

    Tomorrow
    this week
    twice a month
    as soon as possible
    in the next few days

Figure 8.2: SNT

    Tomorrow {STOP}
    this week {STOP}
    twice a month {STOP}
    as soon as possible {STOP}
    in the next few days

    <?xml version="1.0" encoding="UTF-16LE"?>
    <!DOCTYPE xml SYSTEM "teilite.dtd">
    <TEI.2 lang="fr">
    <teiHeader></teiHeader>
    <text>
    <body>
    <p id="1">I am going to see three of them <seg type="sequence">tomorrow</seg></p>
    <p id="2">Here are suggestions of things to do <seg type="sequence">this week</seg>
126. UNITEX 3 0 USER MANUAL Universit Paris Est Marne la Vall e http www igm univ m lw tt unitex unitex univ m LV fr S bastien Paumier English translation of version 1 2 by the local grammar group at the CIS Ludwig Maximilians Universit t Munich Oct 2003 Wolfgang Flury Franz Guenthner Friederike Ma Nagel Johannes St Ichok Clemens Marschner Sebastian iehler http www cis uni muenchen de Contents Introduction Content cud amp a OS RS Roe AR FEES Berea ri Be o S Unitex contributors lf you use Unitex in research Projects cs odo CA 9 RES RET EY GO s Installation of Unitex CH eeneg 5 2 o GS eine a SR a PUS AIS we Races 3 12 Java runtime environment 13 Installation on Windows LA Installation OB Linu ui uuu oso Dieu da bom Bde TS Pos 1 5 Installation on MacOSX ece uw Roe Rank rS A ee es 151 UsnsgtheApplelavalorunUme gt co lY 884423 0 RR GULMBE e eb Bde od A OS eS s 15 3 How to compile Unitex C programs ona Macintosh 15 4 How to makes all files visible on MacOS li Pei coccion risa han ra E a OA eX eae ues then De SOIN MMS Es es ero Net doe ere OR bete 18 Uninstalling LIRE lt lt ou ae e doe er AAA Da aq DTI Loading a text 21 Selec ngalajguage c ss s ode ok BE PEE RE qo RR Uh CUP Ce dee 22 o e x hc ce 4 modo erm RC tes tg He
127. V>
• a particular position in the text (the beginning or the end);
• the concatenation of two regular expressions (he smokes);
• the union of two regular expressions (Pierre+Paul);
• the Kleene star of a regular expression (bye*).

4.2 Tokens

In a regular expression, a token is defined as in section 2.5.4 (page 40). Note that the symbols dot, plus, star, less than, opening and closing parentheses, and double quotes have a special meaning. It is therefore necessary to precede them with the escape character \ if you want to search for them. Here are some examples of valid tokens: cat Mus <N:ms> {S}

By default, Unitex is set up to let lower-case patterns also find upper-case matches. It is possible to enforce case-sensitive matching using quotation marks. Thus, "peter" recognizes only the form peter and not Peter or PETER.

NOTE: in order to make a space obligatory, it needs to be enclosed in quotation marks.

4.3 Lexical masks

A lexical mask is a search query that matches tokens or sequences of tokens.

4.3.1 Special symbols

There are two kinds of lexical masks. The first category contains all symbols that have been introduced in section 2.5.2, except for the symbol <PNC>, which matches punctuation signs, and <^>, which matches a line feed. Since all line feeds have been replaced by spaces, this symbol can no longer be useful when searching for lexical mas
128. [Figure 2.13: Result after applying dictionaries to an English text]

As soon as the dictionary look-up is finished, Unitex displays the sorted lists of simple, compound and unknown words found in a new window. Figure 2.13 shows the result for an English text.

It is also possible to apply dictionaries without preprocessing the text. In order to do this, click on "Apply Lexical Resources" in the Text menu. Unitex then opens a window (see figure 2.14) in which you can select the list of dictionaries to apply.

The list "User resources" lists all dictionaries present in the directory (current language)/Dela of the user. The dictionaries installed in the system are listed in the scroll list named "System resources". Use the <Ctrl+click> combination to select several dictionaries. System dictionaries will be applied prior to user dictionaries. Within the system or user list, you can fix the order of dictionaries using the up and down arrows, as shown on figure 2.14. The button
129. [Figure 4.8: Example concordance]

4.8.3 Statistics

If you select the "Statistics" tab in the "Located sequences" frame, you will see the panel shown on figure 4.9. This panel allows you to get some statistics from the previously indexed sequences.

[Statistics panel: Mode (collocates by z-score, collocates by frequency, contexts by frequency); Sizes of contexts in non-space tokens (Left, Right); Case sensitivity (case sensitive, case insensitive); "Compute statistics" button]

Figure
130. a hungry as a wolf gladnu kao vuk gladan kao vuk AC A3XN2 s3mgka hungry as a wolf gladnoj kao vuk gladan kao vuk AC A3XN2 s3fgea hungry as a wolf gladnome kao vuk gladan kao vuk AC A3XN2 s3ngda hungry as a wolf gladnom kao vuk gladan kao vuk AC A3XN2 s3ngda hungry as a wolf gladnu kao vuk gladan kao vuk AC A3XN2 s3ngka hungry as a wolf gladnu kao vuk gladan kao vuk AC A3XN2 s4fgea hungry as a wolf gladno kao vuk gladan kao vuk AC A3XN2 s4ngea hungry as a wolf gladni kao vuk gladan kao vuk AC A3XN2 s5mgea hungry as a wolf gladna kao vuk gladan kao vuk AC A3XN2 s5fgea hungry as a wolf gladno kao vuk gladan kao vuk AC A3XN2 s5ngea hungry as a wolf gladnim kao vuk gladan kao vuk AC A3XN2 s6mgea hungry as a wolf 234 gladnom gladnim gladnome gladnom kao vu gladnu kao vuk gladnoj gladnome gladnom Kao vu Kao vu Kao vu Kao vu k gladan kao vu k gladan kao vu k gladan kao vu gladan kao vuk k gladan kao vu k gladan kao vu CHAPTER 11 COMPOUND WORD INFLECTION k AC A3X k AC A3X 2 s6fgea 2 s6ngea k AC A3XN2 s7mgda k AC A3XN2 s7fgea k AC A3XN2 s7ngda hungry as a wol hungry as a wol hungry as a wol AC A3XN2 s7mgka hungry as a wolf hungry as a wol hungry as a wol LE Lf kao vuk gladan kao vuk AC_A3XN2 s7mgda hungry as a wolf Lf Lf kao vuk gladan kao vuk AC_A3XN2 s7ngda hungry as a wolf Lf AC_A3XN2 s7ngka hungry as a wolf kao vuk gladan kao vuk AC_A
131. a, since we can derive this word according to the axiom S by applying the following derivations:

Derivation 1: rewriting the axiom to aS
    S → aS
Derivation 2: rewriting S at the right side of aS
    S → aS → aaS
Derivation 3: rewriting S to ε
    S → aS → aaS → aa

We call the set of words generated by a grammar the language generated by the grammar. The languages generated by algebraic grammars are called algebraic languages or context-free languages.

5.1.2 Extended algebraic grammars

Extended algebraic grammars are algebraic grammars where the members on the right side of the rule are not just sequences of symbols but regular expressions. Thus, the grammar that generates a sequence of an arbitrary number of a's can be written as a grammar consisting of one rule:
    S → a*
These grammars, also called recursive transition networks (RTN) or syntax diagrams, are suited for a user-friendly graphical representation. Indeed, the right member of a rule can be represented as a graph whose name is the left member of the rule.

However, Unitex grammars are not exactly extended algebraic grammars, since they contain the notion of transduction. This notion, which is derived from the field of finite state automata, enables a grammar to produce some output. With an eye towards clarity, we will use the terms grammar or graph. When a grammar produces outputs, we will use the term transducer, as an ext
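The chain of derivations at the start of this excerpt can be reproduced mechanically. The Python sketch below assumes the grammar has the two rules used in those derivations (S rewritten to aS, and S rewritten to the empty word); the function name `derive_word` is hypothetical.

```python
def derive_word(n):
    """List the sentential forms that derive a word made of n letters a.

    Assumes the two rules S -> aS and S -> (empty word).
    """
    current = "S"
    forms = [current]
    for _ in range(n):
        current = current.replace("S", "aS", 1)   # apply S -> aS
        forms.append(current)
    forms.append(current.replace("S", "", 1))     # apply S -> empty word
    return forms

print(" -> ".join(derive_word(2)))
```

With the extended rule that uses a Kleene star, the same language is described without any recursion, which is what makes the graphical representation convenient.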
132. [Figure 4.1: Result of the search for <DIC>]

Here are some examples of lexical masks with the different types of constraints:
• <A~Hum:fs>: a non-human adjective in the feminine singular;
• <lire.V:P:F>: the verb lire in the present or future tense;
• <suis,suivre.V>: the word suis as inflected form of the verb suivre (as opposed to the form of the verb être);
• <facteur.N~Hum>: all nominal entries that have facteur as canonical form and that do not have the
133. abet.txt and is found in the root of the directory of a language. Its presence is obligatory for Unitex to function. Example: the English alphabet file has to be in the directory English. Each line of the alphabet file must have one of the following three forms, followed by a newline symbol:
• #XY: a hash symbol followed by two characters X and Y, which indicates that all characters between X and Y are letters. All these characters are considered to be in non-capitalized and capitalized form at the same time. This method is used to define the alphabets of Asian languages like Korean, Chinese or Japanese, where there is no distinction between upper and lower case and where the number of characters makes a complete enumeration tedious;
• XY (for example Aa): two characters X and Y indicate that X and Y are letters and that X is a capitalized equivalent of the non-capitalized Y form;
• X: a unique character X defines X as a letter in capitalized and non-capitalized form. This form is used to define a single Asian character.
For certain languages like French, it is possible that a lower-case letter corresponds to multiple upper-case letters. For example, é can in practice have the upper-case form E or É. To express this, it suffices to use multiple lines. The reverse is equally true: a capitalized letter can correspond to multiple lower-case letters. Thus, E can be the capitalization of e, é, è, ê or ë. Here is an excerpt of the French alphabet file wh
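The three line forms described in this excerpt are simple enough to parse by hand. The following Python sketch is an illustration only: the function name `parse_alphabet` and the returned data structures are assumptions, and the real Unitex loader may handle edge cases differently.

```python
def parse_alphabet(lines):
    """Parse alphabet lines into (letters, upper_of).

    letters  : set of all characters declared as letters
    upper_of : maps a letter to the set of its upper-case equivalents
    """
    letters = set()
    upper_of = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        if len(line) == 3 and line[0] == "#":
            # "#XY": every character from X to Y is its own upper/lower form
            for code in range(ord(line[1]), ord(line[2]) + 1):
                ch = chr(code)
                letters.add(ch)
                upper_of.setdefault(ch, set()).add(ch)
        elif len(line) == 2:
            # "XY": X is an upper-case equivalent of the lower-case Y
            upper, lower = line[0], line[1]
            letters.update((upper, lower))
            upper_of.setdefault(lower, set()).add(upper)
        elif len(line) == 1:
            # "X": a single caseless letter
            letters.add(line)
            upper_of.setdefault(line, set()).add(line)
        else:
            raise ValueError(f"invalid alphabet line: {line!r}")
    return letters, upper_of

# e-acute declared on two lines, so it has two upper-case forms
letters, upper_of = parse_alphabet(["Ee", "E\u00e9", "\u00c9\u00e9"])
print(sorted(upper_of["\u00e9"]))
```

Storing a set of upper-case forms per letter is what allows several lines to declare alternative capitalizations of the same lower-case letter.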
134. [Figure 1.3: Running Unitex on Mac; the screenshot credits the SoyLatte port of Java 6 (landonf.bikemonkey.org/static/soylatte)]

1.5.3 How to compile Unitex C++ programs on a Macintosh

In order to install Unitex on Mac OS, you will have to compile the Unitex C++ sources. This is not a problem if you already have the common gcc development tools installed on your computer, but they are not part of the standard installation. If you don't know whether these tools are present, don't think too long about it, just give it a try: open a shell window and move to the Unitex sources directory:
    cd path_to/Src/C++/build
then run the command that starts the compilation:
    make install
If the compilation does not start and you get an error message stating that the command make cannot be found, you probably need to install development programs. On Macintosh, the all-inclusive development bundle is Xcode (current version is 2.2). You can download it from the Apple developer Web site. It includes a lot of stuff, among which:

[Figure 1.4: Compiling Unitex C++ programs]

The Xcode application includes a full-featured code editor, a debugger, compilers, and a linker. The Xcode application provides a user interface to many industry-standard and open-source tools, including gcc, the Java compilers javac and jikes, and gdb. It provides all of
135. adabra INTJ Line 5 duplicate semantic code Paul N Hum HumY Line 6 an inflectional code is a subset of another eat V W P1ls Ps Plp P2p P3pf Stats 1 1 4 File D My Unitex English Dela axe dic Type DELAF 6 lines read 2 simple entries for 2 distinct lemmas O compound entry for 0 distinct lemma 4 ess All chars used in forms 4 0061 q 0063 4 0064 q 0065 q 0066 4 0067 q 0069 4 006C 4 006D 4 006E 4 006F 4 0070 4 0072 q HD Oo D 2 tat HD OO py 14 9 ELAG FILES 313 s 0073 t 0074 u 0075 4 x 0078 4 2 grammatical semantic codes used in dictionary 4 INTJS INTJ warning 1 suspect char 1 space SPACE IN T J 4 pues 0 inflectional code used in dictionary 4 Note that the inflectional codes of eat are not reported since an error occurred in this line 14 9 ELAG files 14 9 1 tagset def file See section 7 3 6 page 169 14 9 2 Ist files LST FILES ARE NOT UNICODE FILES A 1st file contains a list of grf file names If a file s path is not absolute it is relative to the location of the elag 1st file Here is the elag 1st file used for French PPVs PpvIL grff PPVs PpvLE grff PPVs PpvLUI grff PPVs PpvPR grff PPVs PpvSeq grff PPVs SE grff PPVs postpos grff 14 9 3 ele files elg files contain compiled ELAG rules These fi
136. adict the conditions of this License, they do not excuse you from the conditions of this License. If you cannot distribute so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not distribute the Library at all. For example, if a patent license would not permit royalty-free redistribution of the Library by all those who receive copies directly or indirectly through you, then the only way you could satisfy both it and this License would be to refrain entirely from distribution of the Library.

If any portion of this section is held invalid or unenforceable under any particular circumstance, the balance of the section is intended to apply, and the section as a whole is intended to apply in other circumstances.

It is not the purpose of this section to induce you to infringe any patents or other property right claims or to contest validity of any such claims; this section has the sole purpose of protecting the integrity of the free software distribution system which is implemented by public license practices. Many people have made generous contributions to the wide range of software distributed through that system in reliance on consistent application of that system; it is up to the author/donor to decide if he or she is willing to distribute software through any other system and a licensee cannot impose that choice.

This section is intended
137. adnima kao vuci gladan kao vuk AC A3XN2 p3fgea hungry as a wol gladnima kao vukovi gladan kao vuk AC A3XN2 p3fgea hungry as a gladnim kao vuk gladan kao vuk AC A3XN2 p3fgea hungry as a wolf gladnim kao vuci gladan kao vuk AC A3XN2 p3fgea hungry as a wolf gladnim kao vukovi gladan kao vuk AC A3XN2 p3fgea hungry as a wo gladnima kao vuk gladan kao vuk AC A3XN2 p3ngea hungry as a wolf gladnima kao vuci gladan kao vuk AC A3XN2 p3ngea hungry as a wol gladnima kao vukovi gladan kao vuk AC A3XN2 p3ngea hungry as a gladnim kao vuk gladan kao vuk AC A3XN2 p3ngea hungry as a wolf gladnim kao vuci gladan kao vuk AC A3XN2 p3ngea hungry as a wolf gladnim kao vukovi gladan kao vuk AC A3XN2 p3ngea hungry as a wol gladne kao vuk gladan kao vuk AC A3XN2 p4mgea hungry as a wolf gladne kao vuci gladan kao vuk AC A3XN2 p4mgea hungry as a wolf gladne kao vukovi gladan kao vuk AC A3XN2 p4mgea hungry as a wolf gladne kao vuk gladan kao vuk AC A3XN2 p4fgea hungry as a wolf gladne kao vuci gladan kao vuk AC A3XN2 p4fgea hungry as a wolf gladne kao vukovi gladan kao vuk AC A3XN2 p4fgea hungry as a wolf gladna kao vuk gladan kao vuk AC A3XN2 p4ngea hungry as a wolf gladna kao vuci gladan kao vuk AC A3XN2 p4ngea hungry as a wolf gladna kao vukovi gladan kao vuk AC A3XN2 p4ngea hungry as a wolf gladni kao vuk gladan kao vuk AC A3XN2 p5mgea hungry as a wolf gladnu adan kao vuk gladni gladni gladni gladne gladne gladne gladna gladna gladna gladni
138. age if you are working in French, Unitex proposes to convert your text assuming that it is coded using a French code page. By default, Unitex proposes to either replace the original text or to rename the original file by inserting .old at the beginning of its extension. For example, if one has an ASCII file named biniou.txt, the conversion process will create a copy of this ASCII file named biniou.old.txt, and will replace the contents of biniou.txt with its equivalent in Unicode.

[Figure 2.2: Automatic conversion of a non-Unicode text]

If the encoding suggested by default is not correct, or if you want to rename the file differently than with the suffix .old, you must use the "More options" button. This allows you to choose source and target encodings of the documents to be converted (see figure 2.3). By default, the selected source encoding is that which corresponds to the current language, and the destination encoding is Unicode Little-Endian. You can modify these choices by selecting any source and target encodings. Thus, if you wish, you can convert your data into other encodings.

Unitex also proposes to automatically convert graphs and dictionaries that are not in Unicode Little-Endian
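The rename-and-convert behaviour described in this excerpt can be mimicked outside Unitex with a short Python sketch. It is not the Unitex transcoder: the function name `transcode_to_utf16le` is hypothetical, the default source encoding cp1252 is an assumption standing in for "a French code page", and writing a byte order mark at the start of the output is also an assumption.

```python
from pathlib import Path

def transcode_to_utf16le(path, source_encoding="cp1252", keep_backup=True):
    """Rewrite a text file as UTF-16 Little-Endian.

    When keep_backup is true, the original bytes are first copied to a
    file whose extension starts with .old (biniou.txt -> biniou.old.txt).
    """
    path = Path(path)
    raw = path.read_bytes()
    text = raw.decode(source_encoding)
    if keep_backup:
        backup = path.with_suffix(".old" + path.suffix)
        backup.write_bytes(raw)
    # Assumption: the output starts with the UTF-16LE byte order mark.
    path.write_bytes(b"\xff\xfe" + text.encode("utf-16-le"))
    return path
```

Decoding before the backup is written means that a wrong source encoding raises an error before any file is created or modified.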
139. al stage, TrainingTagger compresses these two data files into the .bin format.

14.11 Configuration files

14.11.1 The Config file

Whenever the user modifies his preferences for a given language, these modifications are saved in a text file named Config, which can be found in the directory of the current language. The file has the following syntax (the order of lines can vary):

    #Unitex configuration file of 'paumier' for 'English'
    #Fri Oct 10 15:18:06 CEST 2008
    TEXT\ FONT\ NAME=Courier New
    TEXT\ FONT\ STYLE=0
    TEXT\ FONT\ SIZE=10
    CONCORDANCE\ FONT\ NAME=Courier new
    CONCORDANCE\ FONT\ HTML\ SIZE=12
    INPUT\ FONT\ NAME=Times New Roman
    INPUT\ FONT\ STYLE=0
    INPUT\ FONT\ SIZE=10
    OUTPUT\ FONT\ NAME=Arial Unicode MS
    OUTPUT\ FONT\ STYLE=1
    OUTPUT\ FONT\ SIZE=12
    DATE=true
    FILE\ NAME=true
    PATH\ NAME=false
    FRAME=true
    RIGHT\ TO\ LEFT=false
    BACKGROUND\ COLOR=-1
    FOREGROUND\ COLOR=-16777216
    AUXILIARY\ NODES\ COLOR=-3289651
    COMMENT\ NODES\ COLOR=-65536
    SELECTED\ NODES\ COLOR=-16776961
    PACKAGE\ NODES\ COLOR=-2302976
    CONTEXT\ NODES\ COLOR=-16711936
    CHAR\ BY\ CHAR=false
    ANTIALIASING=false
    HTML\ VIEWER=
    MAX\ TEXT\ FILE\ SIZE=2097152
    ICON\ BAR\ POSITION=West
    PACKAGE\ PATH=D\:\\repository
    MORPHOLOGICAL\ DICTIONARY=D\:\\MyUnitex\\English\\Dela\\zz.bin
    MORPHOLOGICAL\ NODES\ COLOR=-3911728
    MORPHOLOGICAL\ USE\ OF\ SPACE=false
140. all these features are not specified, the program will consider nonexisting labels, such as <je.PRO:3p>, <je.PRO+PronQ>, etc., in vain.

7.4 Linearizing text automaton with the tagger

By default, the text automaton contains many paths of tags because of lexical ambiguity. The linearization process consists in selecting a single path (a sequence of tags with one tag per token) and removing the others. The output of the process is a text automaton with a single path (see section 7.6 for converting a linear automaton into linear text). The selection of a path depends on its score. The path with the best score is chosen and the others are removed. The score of a path is calculated using a statistical model trained on an annotated corpus. This model uses tagger data files generated by the TrainingTagger program (see section 13.43). For instance, you can see on Figure 7.23 the original text automaton of the French sentence "Les insectes nuisibles envahissent la maison". The corresponding text automaton after linearization is shown on Figure 7.24.

[Figure 7.24: Text automaton linearized]

7.4.1 Compatibility of the tagset

The tagset of the tagger is identical to that of the training corpus, or is a variant (see below). However in order to use the tagger
141. alphabet=ALPH: alphabet file;
• -d DIR/--directory=DIR: the directory containing Morphology and Equivalences files and inflection graphs for single and compound words;
• -K/--korean: tells MultiFlex that it works on Korean;
• -s/--only-simple-words: the program will consider compound words as errors;
• -c/--only-compound-words: the program will consider simple words as errors;
• -p DIR/--pkgdir=DIR: specifies the graph repository;
• -rXXX/--named_repositories=XXX: declaration of named repositories. XXX is made of one or more X=Y sequences, separated by ;, where X is the name of the repository denoted by the pathname Y. You can use this option several times.
Note that .fst2 inflection transducers will automatically be built from corresponding .grf files if absent or older than .grf files.

13.28 Normalize

Normalize [OPTIONS] <text>

This program carries out a normalization of text separators. The separators are space, tab, and newline. Every sequence of separators that contains at least one newline is replaced by a unique newline. All other sequences of separators are replaced by a single space.

This program also checks the syntax of lexical tags found in the text. All sequences in curly brackets should be either the sentence delimiter {S}, the stop marker {STOP}, or valid entries in the DELAF format ({aujourd'hui,.ADV}).

Parameter text represents the comp
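The separator rule stated for Normalize is easy to express with a regular expression. The Python sketch below only illustrates that rule and is not the Unitex program: the function name `normalize_separators` is hypothetical, carriage returns are not treated as separators, and the lexical tag check is left out.

```python
import re

# A run of separators: spaces, tabs and newlines.
_SEPARATORS = re.compile(r"[ \t\n]+")

def normalize_separators(text):
    """Collapse each run of separators into one newline or one space."""
    def collapse(match):
        # A run containing at least one newline becomes a single newline;
        # any other run becomes a single space.
        return "\n" if "\n" in match.group(0) else " "
    return _SEPARATORS.sub(collapse, text)

print(repr(normalize_separators("a  \t b \n\n  c\td")))
```

Passing a function as the replacement lets a single pass decide, run by run, which of the two separators to emit.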
142. an see, two outputs have been produced for the input sequence "the noble".

[Figure 6.55: Ambiguous outputs for "the noble"]

Conversely, with the "Forbid ambiguous outputs" option, we will obtain the text order concordance shown on Figure 6.56, with only one arbitrarily chosen output for the input sequence "the noble".

[Figure 6.56]
143. an take. The codes declared for the same attribute must be exclusive. In other words, an entry cannot take more than one value for the same attribute. On the other hand, all the tags in a given part of speech don't necessarily take values for all the attributes of the part of speech. For example, to define the attribute niveau_de_langue, which can take the values z1, z2 and z3, the following line can be written:
    niveau_de_langue = z1 z2 z3
but this attribute is not necessarily present in all words.
• discr: this part consists of a declaration of a unique attribute. The syntax is the same as in the cat part, and the attribute described here must not be repeated there. This part allows for dividing the grammatical category into discriminating sub-categories in which the entries have similar inflectional attributes. For pronouns, for example, a person feature is assigned to entries that are part of the personal pronoun sub-category but not to relative pronouns. These dependencies are described in the complete part;
• complete: this part describes the inflectional part of the tags of the words in the current part of speech. Each line describes a valid combination of inflectional codes by their discriminating sub-category (if such a category was declared). If an attribute name is specified in angle brackets (< and >), this signifies that any value of this attribute may occur. It is possible as well to declare that a
144. and semantic properties exclude their processing merely in terms of the properties of their constituents. For example, in both examples below:
• chief justice
• lord justice
there are few automatically accessible hints indicating that the former one is morphologically a standard English Noun Noun phrase, taking an "s" at its last constituent in plural, while the plural of the latter has three variants:
• chief justices
• lord justices, lords justice, lords justices
Thus, at least one of the above examples has to be considered as lexicalized in order for the automatic morphological processing to be reliable.

MULTIFLEX implements a unification-based formalism for the description of the inflectional behavior of MWUs, presented in [85]. Its features are described in section 11.2. This formalism requires the description to be fully lexicalized: each MWU listed in a dictionary obtains a code (e.g. NC_NN, NC_NN2, etc.) representing its inflectional paradigm, for instance in the DELA-like format:
    aircraft carrier(carrier.N1:s),NC_NN
    chief justice(justice.N1:s),NC_NN
    lord(lord.N1:s) justice(justice.N1:s),NC_NN2
However, only a few codes, which can be seen as a phrase grammar of the language, represent the big majority of all MWUs. Thus, the lexicalization of the description mainly consists of pointing out the MWUs which respect or don't respect the grammar.

11.2 Formalism for the Computatio
146. [Figures: inflection graphs for Serbian MWUs. The graph contents are not recoverable from the extraction; only the captions survive.]
Figure 11.28: Inflection graph NC_2XN1 for Serbian MWUs
Figure 11.29: Inflection graph NC_2XN2 for Serbian MWUs
Figure 11.30: Inflection graph NC_N2X1 for Serbian MWUs
Figure 11.31: Inflection graph NC_AXN3 for Serbian MWUs
147. aph antialiasing 107 approximation through a final state transducer 265 approximation with a finite state trans ducer 117 box alignment 108 calling a sub graph 95 comments in 91 compilation 117 268 connecting boxes 91 creating a box 90 deleting boxes 99 detection of errors 122 display 106 353 display options fonts and colors 109 error detection 265 268 format 294 including into a document 111 main 280 283 parameterized 117 196 printing 112 saving 94 syntactic 116 types of 113 variables in a 100 zoom 106 Graph repository 96 Graphical units 211 Grid 108 Hangul 59 275 301 4 Including a graph into a document 111 Inflectional codes 172 Inflectional constraints 74 Information grammatical 48 inflectional 48 semantic 48 Installation on Linux 18 on MacOS X 19 on Windows 18 Integrated text editor 34 Intervals 121 Jamo 59 301 Java Runtime Environment 17 Java virtual machine 17 JRE 17 Keeping the best paths 159 Kleene star 71 77 Korean MWU dictionary 254 LADL 11 47 195 Language selection 31 Lemma 48 Lexical 354 entries 47 labels 73 155 275 283 299 315 mask 72 resources see Dictionaries symbols 176 units 285 Lexical units 282 Lexicon grammar 195 tables 195 280 283 LGPL 17 323 LGPLLR 17 335 License BSD 333 LGPL 17 323 LGPLLR 335 Log file 322 Log Unitex programs 286 288 Longest matches 79 143
148. ar is normally present in the Unitex distribution and precompiled in the file norm.rul.
Figure 7.20: ELAG grammar without any constraint
(Footnote 3: This code indicates that the adjective must appear on the left of the noun to which it refers, as is the case for bel.)
The result of applying such a grammar is that the original is cleaned of all the codes which either are not described in the tagset.def file or do not conform to this description (because of unknown grammatical categories or invalid combinations of inflectional features). By then replacing the text automaton by this normalized automaton, one can be sure that later modifications of the automaton will only be effects of ELAG grammars.
7.3.7 Grammar Optimization
Compilation of ELAG grammars by the ElagComp program consists in building an automaton whose language is the set of the sequences of lexical tags (or lexical analyses of a sentence) which are not accepted by the grammars. This task is complex and can take a lot of time. It is however possible to speed it up appreciably by observing certain principles when writing grammars.
Limiting the number of branches in the then part
It is recommended to limit the number of then parts of a grammar to a minimum. This can considerably reduce the compile time of a grammar. Generally, a grammar having many then parts can be rewritten with one or two then parts without a loss o
149. are not found appear in italic red font style. Different options for each graph (Figure 12.3):
- merge: whether the transducer should be applied in merge mode, in the sense of the Unitex locate pattern.
- replace: whether the transducer should be applied in replace mode, in the sense of the Unitex locate pattern.
- iter: whether the transducer should be applied once or re-applied several times until no change occurs in the text (see 12.2.2).
Several buttons in the middle serve different needs:
- Up, Down, Top and Bottom modify the order of the transducers in the list. Up and Down move the selected transducer one line up or down; Top and Bottom move the selection to the top or to the end of the list.
- Delete removes the selected transducer from the list of transducers.
- Add adds a transducer previously selected in the explorer to the list. It replaces the drag-and-drop actions described above.
- View opens the selected graph, either in the file explorer or in the list of transducers of the window. It gives quick access to any transducer, either to take a quick look at its content or to modify it.
- Save and Save as save the list of transducers. By default, the lists of transducers are stored in the CasSys folder of the current language (e.g. English/Cassys).
CHAPTER 12. CASCADE OF TRANSDUCERS
150. aries provided with Unitex contain descriptions of simple and compound words. These descriptions indicate the grammatical category of each entry, optionally their inflectional codes, and various semantic information. The following tables give an overview of some of the different codes used in the Unitex dictionaries. These codes are the same for almost all languages, though some of them are specific to certain languages (e.g. the code for neuter nouns).

Code     Description                  Examples
A        adjective                    fabulous, broken-down
ADV      adverb                       actually, years ago
CONJC    coordinating conjunction     but
CONJS    subordinating conjunction    because
DET      determiner                   each
INTJ     interjection                 eureka
N        noun                         evidence, group theory
PREP     preposition                  without
PRO      pronoun                      you
V        verb                         overeat, plug-and-play
Table 3.1: Frequent grammatical codes

Code      Description                  Example
z1        general language             joke
z2        specialized language         floppy disk
z3        very specialized language    serialization
Abst      abstract                     patricide
Anl       animal                       horse
AnlColl   collective animal            flock
Conc      concrete                     chair
ConcColl  collective concrete          rubble
Hum       human                        teacher
HumColl   collective human             parliament
t         transitive verb              kill
i         intransitive verb            agree
Table 3.2: Some semantic codes

NOTE: The descriptions of tense in table 3.3 correspond to French. Nonetheless, the majority of these definitions can be foun
151. assys uses the compiled version of the graphs (the .fst2 files). Cassys can handle the local grammars (section 6.1, syntactic graphs) presented in Chapter 6. These grammars can use subgraphs, morphological filters and mode, and can refer to information in dictionaries. The grammars used in the cascade must follow the constraints of the grammars used in Unitex.
12.2.2 Apply while concordance behaviour
Cassys may apply a transducer on a text while concordances are found. For instance, consider the graph of Figure 12.7, which recognizes BA and replaces it with A. Consider the text B B B A A A.
Figure 12.7: Transducer which modifies BA in A
Applying the graph of Figure 12.7 on this text with the apply while concordance behaviour will have the following result:
initial text    B B B A A A
iteration 1     B B A A A      1 concordance
iteration 2     B A A A        1 concordance
iteration 3     A A A          1 concordance
iteration 4     A A A          0 concordance
During the first three iterations, a concordance is found, so the graph is re-applied on the resulting text. At the fourth iteration, no concordance is found, so the graph is not re-applied.
Warning: be aware of the risk of livelock when applying this option. For example, a transducer which recognizes A and replaces it with A would be caught in a livelock if applied on the example text.
12.2.3 An xml-like output text for lexical tags
As an output the lex
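The iteration described in section 12.2.2 can be sketched in a few lines of Python. This is only an illustration of the stopping rule, not the way Cassys is implemented: the function name, the use of plain string replacement instead of a transducer, and the `max_iterations` guard against livelock are all assumptions of this sketch, and the example text is written without spaces.

```python
def apply_while_concordance(text, pattern, replacement, max_iterations=50):
    """Re-apply a replace rule until it no longer matches.

    Returns the final text and the list of intermediate texts.
    `max_iterations` is a safety guard invented for this sketch: it stops
    rules that would otherwise loop forever (the livelock case).
    """
    history = [text]
    for _ in range(max_iterations):
        if pattern not in text:
            break  # 0 concordance: the rule is not re-applied
        text = text.replace(pattern, replacement)
        history.append(text)
    else:
        # The rule matched on every allowed iteration.
        raise RuntimeError("possible livelock: the rule still matches")
    return text, history


final_text, steps = apply_while_concordance("BBBAAA", "BA", "A")
print(steps)
```

With the rule BA -> A the loop stops by itself after three rewrites, whereas a rule such as A -> A never stops matching and is only interrupted by the guard.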
152. [Figure: concordance lines from the Ivanhoe corpus; the lines are truncated and not recoverable in full.]
Figure 6.22: Results of the application of the grammar shown on Figure 6.21
All the outputs produced in the left context are ignored, as you can see in the concordance of Figure 6.24, showing the results obtained with the grammar of Figure 6.23.
[Figure: graph whose boxes contain the numerals seven, eight, nine and ten, with the output N.]
Figure 6.23: Ignored output in a left context
[Figure: concordance window for the Ivanhoe corpus, with N marking each match.]
Figure 6.24: Results of the application of the grammar shown on Figure 6.23
However you can
153. ate and birth dates, while the lower one represents the syntactic variants of the previous forms (date of birth and dates of birth).
[Figure: inflection graph; the graph contents are not recoverable from the extraction.]
Figure 11.9: Inflection graph for birth date
Interface with the Morphological System for Simple Words
MULTIFLEX is an implementation of the formalism for the inflectional morphology of MWUs presented above. It supposes the existence of a morphological system for single words which satisfies the following interface constraints:
- For a given sequence of characters, it returns its segmentation into indivisible graphical units (tokens, cf. section 11.2.2). For instance, with the Unitex definition of a token, the sequence Athens 04 is to be divided into 5 tokens [token list garbled in the extraction].
- For a given simple inflected form, it returns all its possible morphological identifications. A morphological identification has to allow the generation of any other inflected form of the same lemma, on demand, by the same morphological module. For instance, in the case of Unitex, the form porte yields 7 morphological identifications, 6 of which are factorized with respect to their inflection code:
porte,porte.N21:s
porte,porter.V3:P1s:P3s:S1s:S3s:Y2s
In case of ambiguity, as above, the proper identification has to be done, for the time being, by the user during the edition of t
154. ate: try to tolerate some markup language malformation;
- --comments=IGNORE: every comment is removed (default);
- --comments=SPACE: every comment is replaced by a single space;
- --scripts=IGNORE: every script block is removed;
- --scripts=SPACE: every script block is replaced by a single space (default for .html);
- by default, script tags are handled as normal tags (default for .xml);
- --normal_tags=IGNORE: every other tag is removed (default for .xml);
- --normal_tags=SPACE: every other tag is replaced by a single space (default for .html).
13.50 XMLizer
XMLizer [OPTIONS] <txt>
This program takes the raw text file <txt> and produces a corresponding basic TEI or XML file. The difference between TEI and XML is that TEI files contain a TEI header.
OPTIONS:
- -x/--xml: produces an XML file;
- -t/--tei: produces a TEI file (default);
- -n XXX/--normalization=XXX: specifies the normalization rule file to be used (see section 14.13.6);
- -o OUT/--output=OUT: optional output file name (default: file.txt > file.xml);
- -a ALPH/--alphabet=ALPH: alphabet file;
- -s SEG/--segmentation_grammar=SEG: sentence delimitation grammar to be used. This grammar should be like the Sentence.grf one used during the preprocessing of a corpus, but it can include the special tag {P} to indicate paragraph bounds.
CHAPTER 13. USE OF EXTERNAL PROGRAMS
Chapter 14 File formats
This chapter presents the formats of files read or generated by Unitex
155. ath of the automaton. Go into the "Text" menu and click on "Convert FST-Text to Text". You can set the output text file in the window as shown on Figure 7.31.
[Figure: "Convert Text Automaton to Text" dialog with an output text file field and a Cancel button.]
Figure 7.31: Setting output file for linearization of the text automaton
If the automaton is not linear, an error message will give you the number of the first sentence that contains ambiguity. Otherwise, the Tfst2Unambig program will build the output file according to the following rules:
- the output file contains one line per sentence;
- every line but the last is ended by {S};
- for each box, the program writes its content followed by a space.
[Figure: sentence automaton with one box per token, in the text automaton window.]
Figure 7.32: Example of a linear text automaton
NOTE: correcting spaces in the output text can only be done manually. If the original text is the one of the text automaton shown on Figure 7.32, the output text will be:
2 3 {cats,cat.N+Anl:p} {are,be.V:P2s:P1p:P2p:P3p} {white,white.A}
7.7 Searching patterns in the text automaton
With the LocateTfst program, Unitex can perform search operations on the text
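The three output rules for linearization can be illustrated with a short Python sketch. It is not the Tfst2Unambig implementation: the function name and the input representation (one list of box contents per sentence) are hypothetical, and the sketch assumes that the sentence delimiter is written `{S}` and that each sentence automaton has already been checked to be linear.

```python
SENTENCE_DELIMITER = "{S}"  # assumed spelling of the sentence delimiter


def linearize(sentences):
    """Build the output text from linear sentence automata.

    `sentences` is a list of sentences; each sentence is the list of box
    contents found along its single path (a representation chosen for
    this sketch only).
    """
    lines = []
    for index, boxes in enumerate(sentences):
        # Rule 3: each box content is followed by a space.
        line = "".join(box + " " for box in boxes)
        # Rule 2: every line but the last ends with the delimiter.
        if index < len(sentences) - 1:
            line += SENTENCE_DELIMITER
        # Rule 1: one line per sentence.
        lines.append(line)
    return "\n".join(lines)


print(linearize([["2", "3", "{cats,cat.N+Anl:p}"], ["{white,white.A}"]]))
```

The trailing space after the last box of each line is what the NOTE about manual space correction refers to.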
156. atter mode is used by default. In the "Search algorithm" frame, you can specify whether you want to perform the locate operation on the text using the Locate program, or on the text automaton with LocateTfst. By default, the search is done with the Locate program, as Unitex always did until now. If you want to use LocateTfst, please read the dedicated section 7.7. After you have selected the parameters, click on "SEARCH" to start the search.
6.10.2 Advanced search options
If you select the "Advanced options" tab, you will see the frame shown on Figure 6.53.
[Figure: "Locate Pattern" window, "Advanced options" tab. Ambiguous output policy: Allow ambiguous outputs / Forbid ambiguous outputs. Variable error policy (these options have no effect if outputs are ignored): Ignore variable errors / Exit on variable error / Backtrack on variable error.]
Figure 6.53: Advanced search options
The "Ambiguous output policy" option can be illustrated with the graph shown on Figure 6.54. When a determiner is followed by a word that can be either adjective or noun, it can produce two distinct outputs for the same text input sequence.
Figure 6.54: A graph with ambiguous outputs
CHAPTER 6. ADVANCED USE OF GRAPHS
If we apply this graph on Ivanhoe with the "Allow ambiguous outputs" option (the default one), we will obtain the text order concordance shown on Figure 6.55. As you c
157. automaton 156 preprocessing 35 114 splitting into sentences 37 splitting into tokens 40 tokenization 40 282 Text alignment 203 Text automaton conversion into linear text 182 282 Text file encoding parameters 253 Token 40 71 Tokenization 40 Toolbar 104 Transducer 90 inflection 113 rules for application 135 with variables 100 unknown 43 75 Transducer output 109 with variables 137 X11 app 22 Transducers 99 Xcode 24 Transduction 90 Types of graphs 113 Underscore 101 137 Unicode 31 106 260 291 Unification variables 219 Union of regular expressions 71 77 Unitex JNI 253 Uppercase see Respect of lowercase uppercase 116 Using the Apple Java 1 6 19 UTF 8 258 306 307 Variable error policy 146 Variable names 101 Variables comparison 143 Dictionary entry 131 in graphs 137 in parameterized graphs 197 test 142 within graphs 100 Verification of the dictionary format 256 Void loops 119 Web browser 81 148 Window for ELAG Processing 168 Words compound 43 72 free in Dutch 276 free in German 276 free in Norwegian 276 free in Russian 276 in Dutch 45 in German 45 in Norwegian 45 in Russian 45 with space or dash 49 simple 42 72
158. ave to select the "Table" tab in the text automaton frame. You will then see a table as shown on Figure 7.35.
[Figure: table view of a French sentence with the columns Form, POS sequence #1 and POS sequence #2. Options shown above the table: Filter grammatical/semantic codes (All / Only POS category / Use filter), Always show POS category regardless filtering, Export all text as POS list. The cell contents are not reliably recoverable from the extraction.]
Figure 7.35: Table display
This table is not fully equivalent to the sentence automaton, since it only displays all possible POS for each simple or multiple word unit. It should be considered as an approximate compact view of information contained in the automaton. Y
159. bin dictionary to use to find forms in the future and conditional, given their canonical forms. It has to be obtained by compressing the dictionary of verbs in the future and conditional with the parameter --flip (see section 13.8);
- -d BIN/--dictionary=BIN: the .bin dictionary to use;
- -p PRO/--pronoun_rules=PRO: the .fst2 grammar describing pronoun rewriting rules;
- -n PRO/--nasal_pronoun_rules=PRO: the .fst2 grammar describing nasal pronoun rewriting rules;
- -o OUT/--output=OUT: the name of the .grf graph to be generated.
13.32 Reg2Grf
Reg2Grf <txt>
This program constructs a .grf file corresponding to the regular expression written in file <txt>. The parameter <txt> represents the complete path to the file containing the regular expression. This file needs to be a Unicode text file. The program takes into account all characters up to the first newline. The result file is called regexp.grf and is saved in the same directory as <txt>.
13.33 Seq2Grf
Seq2Grf [OPTIONS] <snt>
This program constructs a .grf file corresponding to the sequences contained in file <snt>.
OPTIONS:
- -a ALPH/--alphabet=ALPH: the alphabet file to use;
- -o XXX/--output=XXX: output GRF file;
- -s/--only-stop: only consider STOP-separated sequences;
- -b/--beautify: apply the grf beautifying algorithm;
- -n/--no_beautify: do not apply the grf beautifying algorithm (default)
160. bject code.
5. A program that contains no derivative of any portion of the Library, but is designed to work with the Library by being compiled or linked with it, is called a "work that uses the Library". Such a work, in isolation, is not a derivative work of the Library, and therefore falls outside the scope of this License.
However, linking a "work that uses the Library" with the Library creates an executable that is a derivative of the Library (because it contains portions of the Library), rather than a "work that uses the library". The executable is therefore covered by this License. Section 6 states terms for distribution of such executables.
When a "work that uses the Library" uses material from a header file that is part of the Library, the object code for the work may be a derivative work of the Library even though the source code is not. Whether this is true is especially significant if the work can be linked without the Library, or if the work is itself a library. The threshold for this to be true is not precisely defined by law.
If such an object file uses only numerical parameters, data structure layouts and accessors, and small macros and small inline functions (ten lines or less in length), then the use of the object file is unrestricted, regardless of whether it is legally a derivative work. (Executables containing this object code plus portions of the Library will still fall under Section 6.)
Otherwise, if the work is a derivative
161. ble to see the commands that have been executed by clicking on "Info > Console". It is also possible to see the options of the different programs on "Info > Help on commands" (see Figure 13.1). Note that all Unitex programs support the -h/--help option.
[Figure: "Help on commands" window showing the usage message of the Convert program.]
Figure 13.1: Help on commands
WARNING: many programs use the text directory (my_text_snt). This directory is created by the graphical interface after the normalization of the text. If you work with the command line, you have to create the directory manually before the execution of the program Normalize.
WARNING (2): whenever a parameter contains spaces, it needs to be enclosed in quotation marks so it will not be considered as multiple parameters.
WARNING (3): many programs need an Alphabet.txt file. For all those programs, this
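The first two warnings can be handled by a small wrapper when Unitex programs are driven from a script. The following Python sketch is only an illustration: the helper names are hypothetical, the `_snt` naming rule is inferred from the `my_text_snt` example, and `shlex.quote` produces POSIX-shell quoting (on Windows, double quotes would be used instead). No Unitex program is actually called.

```python
import shlex
from pathlib import Path


def snt_directory(text_path):
    """Return the text directory assumed for a text file (my_text.txt -> my_text_snt)."""
    path = Path(text_path)
    return path.with_name(path.stem + "_snt")


def ensure_snt_directory(text_path):
    """Create the text directory by hand, as the graphical interface would."""
    directory = snt_directory(text_path)
    directory.mkdir(parents=True, exist_ok=True)
    return directory


def build_command(program, *arguments):
    """Join a program name and its arguments, quoting any that contain spaces."""
    return " ".join(shlex.quote(part) for part in (program, *arguments))


print(snt_directory("corpus/my_text.txt"))
print(build_command("Normalize", "My Unitex/English/Corpus/my text.txt"))
```

Passing the arguments as a list to `subprocess.run` would avoid shell quoting altogether; the string form is shown here only because the warning is about command lines typed by hand.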
162. browser (cf. section 4.8.2).
If you display concordances with the window provided by Unitex, you can access a recognized sequence in the text by clicking on the occurrence. If the text window is not iconified and the text is not too long to be displayed, you see the selected sequence appear (cf. Figure 6.62). Furthermore, if the text automaton has been constructed and if the corresponding window is not iconified, clicking on an occurrence selects the automaton of the sentence that contains this occurrence.
[Figure: "Located sequences" window with the frames Concordance, Statistics, Modify text, Extract units, Concordance presentation and the "Build concordance" button.]
Figure 6.61: Configuration for displaying the encountered occurrences
6.10.4 Modification of the text
You can choose to modify the text instead of constructing a concordance. In order to do that, type a file name in the "Modify text" field in the window of Figure 6.61. This file has to have the extension .txt. If you want to modify the current text, you have to choose the corresponding
163. c Acid.ACRONYM
LADL,Laboratoire d'Automatique Documentaire et Linguistique.ACRONYM
UN,United Nations.ACRONYM
3.2 Looking up a word in a dictionary
You can look up a word in one or several dictionaries by two means.
[Figure: DELA menu of the Unitex 3.0beta (February 10, 2011) main window, with the entries Check Format, Transliterate, Sort Dictionary, Inflect, Compress into FST and Build Korean MWU dic graph.]
Figure 3.1: DELA Menu
If you have opened a dictionary, the displayed window contains a field where you can enter a word to search. If the word appears in the dictionary, the "Find" button will highlight the first entry that matches it. If there are several entries for this word, you can browse all matches by clicking on the two arrow buttons.
[Figure: dictionary window showing French entries around "phtirius"; the entries are not reliably recoverable from the extraction.]
164. c entries can have different meanings, as is the case for the French word poêle, which describes a stove or a type of sheet in the masculine sense and a kitchen instrument in the feminine sense. You can thus distinguish the entries in this case:
poêle,.N+z1:fs/ poêle à frire
poêle,.N+z1:ms/ voile, linceul; appareil de chauffage
NOTE: In practice, this distinction has the only consequence that the number of entries in the dictionary increases. For the different programs that make up Unitex, these entries are equivalent to:
poêle,.N+z1:fs:ms
Whether this distinction is made is thus left to the maintainers of the dictionaries.
3.1.2 The DELAS Format
The DELAS format is very similar to the one used in the DELAF. The only difference is that there is only a canonical form followed by grammatical and/or semantic codes. The canonical form is separated from the different codes by a comma. Here is an example:
horse,N4+Anl
The first grammatical or semantic code will be interpreted by the inflection program as the name of the grammar used to inflect the entry. The entry of the example above indicates that the word horse has to be inflected using the grammar named N4. It is possible to add inflectional codes to the entries, but the nature of the inflection operation limits the usefulness of this possibility. For more details, see below in section 3.5.
3.1.3 Dictionary Contents
The diction
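The DELAS convention described above (canonical form, comma, then codes, the first of which names the inflection grammar) can be sketched as a small Python parser. This is an illustration only, not the code of the Unitex inflection program: the function name and the returned dictionary are hypothetical, and the sketch assumes that codes are separated by `+` and ignores escaped characters and inflectional codes.

```python
def parse_delas_entry(line):
    """Split a simplified DELAS line such as 'horse,N4+Anl'.

    Assumptions of this sketch: no escaped commas, no inflectional codes,
    and codes separated by '+'.
    """
    lemma, separator, codes = line.strip().partition(",")
    if not separator or not lemma or not codes:
        raise ValueError(f"not a DELAS entry: {line!r}")
    code_list = codes.split("+")
    return {
        "lemma": lemma,
        # The first code is read as the name of the inflection grammar.
        "inflection_grammar": code_list[0],
        "codes": code_list,
    }


print(parse_delas_entry("horse,N4+Anl"))
```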
165. case, the programs would loop infinitely only if they recognized the pattern an infinite number of times in the text, which is impossible.
[Figure: two graph windows, Det.grf and DetCompose.grf, each calling the other.]
Figure 6.10: Void loop caused by two graphs calling each other
In order to recognize token sequences in which one word or one token of a specific grammatical category appears once, several times or never, you can set an interval on a box. This means, if you set the interval [m,M] on a box containing <ADJ>, that this path will match sequences with at least m consecutive adjectives and no more than M. Intervals can be defined according to the following rules:
- [m,M]: at least m consecutive terms and no more than M;
- [,M]: 0 to M;
- [m,]: at least m.
[Figure: graph and concordance of French text; the contents are not recoverable from the extraction.]
Fig
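The three interval rules can be made concrete with a short Python sketch. The bracketed notation `[m,M]`, `[,M]`, `[m,]` used here is an assumption about how the interval is written, and the function names are hypothetical; the sketch only checks whether a run of consecutive matching tokens has an acceptable length.

```python
import re

_INTERVAL = re.compile(r"^\[(\d*),(\d*)\]$")


def parse_interval(spec):
    """Return (minimum, maximum) for an interval; maximum is None if unbounded."""
    match = _INTERVAL.match(spec)
    if match is None:
        raise ValueError(f"not an interval: {spec!r}")
    low, high = match.groups()
    minimum = int(low) if low else 0          # missing lower bound means 0
    maximum = int(high) if high else None     # missing upper bound means no limit
    if maximum is not None and minimum > maximum:
        raise ValueError(f"empty interval: {spec!r}")
    return minimum, maximum


def run_is_accepted(run_length, spec):
    """Tell whether a run of `run_length` consecutive terms fits the interval."""
    minimum, maximum = parse_interval(spec)
    return run_length >= minimum and (maximum is None or run_length <= maximum)


print(parse_interval("[2,4]"), parse_interval("[,3]"), parse_interval("[1,]"))
```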
166. catch things with variables (see section 6.7.5) and use them outside the left context, as shown on the grammar of Figure 6.25.
So, with left and right contexts, you can make a distinction between the pattern used to match something and the thing you want to extract in your results. For instance, the grammar shown on Figure 6.27 looks for expressions like "the animal's", but only extracts nouns, as you can see on Figure 6.28.
[Figure: graph storing a numeral in the variable num and producing the output Det=$num$.]
Figure 6.25: Using a variable in a left context
[Figure: concordance window for the Ivanhoe corpus; each match is followed by an output of the form Det=three.]
Figure 6.26: Results of the application of the grammar shown on Figure 6.25
Figure 6.27: A grammar with both left and right contexts
[Figure: concordance window for the Ivanhoe corpus; truncated in the extraction.]
167. cer outputs.
- -p/--protect_dic_chars: when -M or -R mode is used, -p protects some input characters with a backslash. This is useful when Locate is called by Dico, in order to avoid producing bad lines like: 3,14,.PI.NUM
- -v X=Y/--variable=X=Y: sets an output variable named X with content Y. Note that Y must be ASCII.
Ambiguous output options:
- -b/--ambiguous_outputs: allows the production of several matches with the same input but different outputs (default);
- -z/--no_ambiguous_outputs: forbids ambiguous outputs. In case of ambiguous outputs, one will be arbitrarily kept, depending on the internal state of the program.
Variable error options:
These options have no effect if the output mode is set with --ignore; otherwise, they rule the behavior of the Locate program when an output is found that contains a reference to a variable that is not correctly defined.
- -X/--exit_on_variable_error: kills the program;
- -Y/--ignore_variable_errors: acts as if the variable has an empty content (default);
- -Z/--backtrack_on_variable_errors: stops exploring the current path of the grammar.
Variable injection:
- -v X=Y/--variable=X=Y: sets an output variable named X with content Y. Note that Y must be ASCII.
This program saves the references to the found occurrences in a file called concord.ind. The number of occurrences, the number of units belonging to those occurrences, as well as the percentage of recogn
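The three variable error policies differ only in what happens when an output refers to a variable that is not defined. The Python sketch below mimics that difference on a single output string. It is not the Locate implementation: the function, the exception class and the policy names are hypothetical, and the `$name$` reference syntax is assumed here for illustration.

```python
import re


class VariableError(Exception):
    """Raised in this sketch where the real program would stop."""


_REFERENCE = re.compile(r"\$(\w+)\$")


def expand_output(output, variables, policy="ignore"):
    """Expand $name$ references in `output` under one of three policies.

    'ignore'    -> an undefined variable is replaced by the empty string
    'backtrack' -> the path is abandoned, signalled here by returning None
    'exit'      -> VariableError is raised
    """
    if policy not in ("ignore", "backtrack", "exit"):
        raise ValueError(f"unknown policy: {policy!r}")

    def substitute(match):
        name = match.group(1)
        if name in variables:
            return variables[name]
        if policy == "ignore":
            return ""
        raise VariableError(name)

    try:
        return _REFERENCE.sub(substitute, output)
    except VariableError:
        if policy == "backtrack":
            return None
        raise


print(expand_output("Det=$num$", {"num": "three"}))
print(repr(expand_output("Det=$num$", {}, policy="ignore")))
print(expand_output("Det=$num$", {}, policy="backtrack"))
```

Returning `None` for the backtrack case lets a caller simply skip the match, which is the closest analogue in this sketch to abandoning the current path of the grammar.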
168. [Figure: text automaton window showing a sentence automaton.]
Figure 7.11: Automaton of figure 7.9 after cleaning
7.3 Resolving Lexical Ambiguities with ELAG
The ELAG program allows for applying grammars for ambiguity removal to the text automaton. This powerful mechanism makes it possible to write rules independently from already existing rules. This chapter briefly presents the grammar formalism used by ELAG and describes how the program works. For more details, the reader may refer to [6] and [62].
7.3.1 Grammars For Resolving Ambiguities
The grammars used by ELAG have a special syntax. They consist of two parts, which we call the if and then parts. The if part of an ELAG grammar is divided in two parts, which are separated by a box containing the <!> symbol. The then part is divided the same way using the <=> symbol. The meaning of a grammar is the following: in the text automaton, if a path of the if part is recognized, then it must also be recognized by the then part of the grammar, or it will be withdrawn from the text automaton.
[Figure: graph with the comment "If tu follows a verb in the 2nd person singular and a dash, then it is a pronoun and not the past participle of taire".]
Figure 7.12: ELAG grammar elag-tu.grf
Figure 7.12 shows an example of a grammar. The if part recognizes a verb in the 2nd person singular followed by a dash and tu, either as a pronoun
169. ch malformed entry, the program outputs the line number, the content of the line and an error message. Results are saved in the file CHECK_DIC.TXT, which is displayed when the verification is finished. In addition to any error messages, the file also contains the list of all characters used in the inflected and canonical forms, the list of grammatical and semantic codes, and the list of inflectional codes that appear in the dictionary. The character list makes it possible to verify that the characters used in the dictionary are consistent with those in the alphabet file of the language. Each character is followed by its value in hexadecimal notation. The code lists can be used to check that there are no typing errors in the codes of the dictionary.
The CheckDic program works with non-compressed dictionaries, i.e. the files in text format. The general convention is to use the .dic extension for these dictionaries. In order to check the format of a dictionary, you first open it by choosing "Open" in the "DELA" menu.
Let's load the dictionary as in figure 3.4. Then, click on "Check Format" in the "DELA" menu. A window like in figure 3.5 is opened. You must select the type of dictionary you want to check. After checking the dictionary in Figure 3.4, results are presented as shown in Figure 3.6.
Figure 3.4: Dictionary example
The first error is caused by a missing period. The second by th
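The kind of report described above can be imitated with a toy Python checker. It is far simpler than CheckDic and is not based on its source code: the function names are hypothetical, the only rule tested is that an inflected-form entry contains a comma followed later by a period, and the output layout of the character list is invented for this sketch.

```python
def check_entry(line):
    """Return an error message for a malformed line, or None if it looks fine.

    Only one simplified rule is checked: 'form,lemma.codes' needs a comma
    and, after it, a period.
    """
    comma = line.find(",")
    if comma == -1:
        return "missing comma"
    if "." not in line[comma + 1:]:
        return "missing period"
    return None


def character_inventory(forms):
    """List each distinct character with its code point in hexadecimal."""
    characters = sorted(set("".join(forms)))
    return [f"{char} ({ord(char):04X})" for char in characters]


def check_dictionary(lines):
    """Yield (line number, line, message) for every malformed line."""
    for number, line in enumerate(lines, start=1):
        message = check_entry(line)
        if message is not None:
            yield number, line, message


print(list(check_dictionary(["horses,horse.N+Anl:p", "horses,horse N:p"])))
print(character_inventory(["cab", "abc"]))
```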
170. conflicts, if any, in X. --only-cosmetic: reports a conflict for any change that is not purely cosmetic. Tries to merge <mine> and <other>. In case of success, the result is printed on the standard output and 0 is returned. In case of unresolved conflicts, 1 is returned and nothing is printed. 2 is returned in case of error. 13.24 ImplodeTfst. ImplodeTfst [OPTIONS] <tfst>. This program implodes the specified text automaton by merging together lexical entries which only differ in their inflectional features. OPTIONS: -o OUT/--output=OUT: output file. By default, the input text automaton is modified. 13.25 Locate. Locate [OPTIONS] <fst2>. This program applies a grammar to a text and constructs an index of the occurrences found. OPTIONS: -t TXT/--text=TXT: complete path of the text file, without omitting the .snt extension; -a ALPH/--alphabet=ALPH: complete path of the alphabet file; -m DICS/--morpho=DICS: this optional parameter indicates which morphological dictionaries are to be used, if needed by some .fst2 dictionaries. DICS represents a list of .bin files (with full paths) separated with semi-colons; -s/--start_on_space: this parameter indicates that the search will start at any position in the text, even before a space. This parameter should only be used to carry out morphological searches; -x/--dont_start_on_space: forbids the
171. controversial. Pragmatically, we consider a MWU as a contiguous sequence of graphical units which, for some application-dependent reasons, has to be listed, described (morphologically, syntactically, semantically, etc.) and processed as a unit. 11.1.4 Formal Description of the Inflectional Behavior of Multi-word Units. The main issue in MULTIFLEX is the inflectional morphology of MWUs. This phenomenon has been linguistically analyzed for English, Polish and French in [84]. Obviously, a reliable inflection processing of single words is a necessary condition for the inflection processing of MWUs. However, this condition is rarely a sufficient one. For example, in order to obtain the plural forms of battle cry, battle royal and battle of nerves in English, not only do we need to know how to generate the plural of battle, royal and cry, but also how the different inflected forms of these constituents combine: battle cries, battle royals or battles royal, battles of nerves, but not battles cries, battles royals or battles of nerve. Formally, a fully explicit description of the inflectional paradigms of MWUs requires an answer to the following questions: What is the MWU's morphological class (noun, adjective, etc.), and thus what inflection categories (number, gender, case, etc.) are relevant to it? [76] argue for a morphosyntactically motivated definition of mo
172. cribed in section 3.7.3. 3.7.1 Priorities. The priority rule says that if a word in a text is found in a dictionary, this word will not be taken into account by dictionaries with lower priority. This makes it possible to eliminate part of the ambiguity when applying dictionaries. For example, the French word par has a nominal interpretation in the golf domain. If you don't want to use this meaning, it is sufficient to create a filter dictionary containing only the entry par,.PREP and to apply it with the highest priority. This way, even if simple word dictionaries contain different entries, they will be ignored, given the priority rule. There are three priority levels. The dictionaries whose names (without extension) end with - have the highest priority; those that end with + have the lowest one. All other dictionaries are applied with medium priority. The order in which dictionaries with the same priority are applied does not matter. On the command line, the command Dico ex.snt alph.txt ctr+.bin cities-.bin rivers.bin regions.bin (where ex.snt is the text to which the dictionaries are applied and alph.txt is the alphabet file used) will apply the dictionaries in the following order: 1. cities-.bin 2. regions.bin 3. rivers.bin 4. ctr+.bin. 3.7.2 Application rules for dictionaries. Besides the priority rule, the application of dictionaries respects upper case letters and spaces. The upper case rule is as follows
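The priority rule above can be illustrated with a small Python sketch. It is not Unitex code: it assumes that the high-priority and low-priority suffixes are `-` and `+`, and the helper names `priority` and `application_order` are hypothetical.

```python
from pathlib import PurePath

def priority(dictionary: str) -> int:
    """Return 0 (highest), 1 (medium) or 2 (lowest) for a dictionary file name."""
    stem = PurePath(dictionary).stem  # file name without its extension
    if stem.endswith("-"):
        return 0  # assumed marker for highest priority
    if stem.endswith("+"):
        return 2  # assumed marker for lowest priority
    return 1

def application_order(dictionaries):
    """Sort by priority level; the sort is stable, so ties keep command-line order."""
    return sorted(dictionaries, key=priority)

order = application_order(["ctr+.bin", "cities-.bin", "rivers.bin", "regions.bin"])
print(order)
```

Since the order within one priority level does not matter, the two medium-priority dictionaries may appear in either order without changing the result of the lookup.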
173. ct the dictionary lines for this inflected form. Example: if the state refers to the compressed form with index 25133, the corresponding hexadecimal sequence is 00622D. Each leaving transition is then encoded in 5 bytes. The first 2 bytes encode the character that labels the transition, and the three following bytes encode the byte position of the result state in the .bin file. The transitions of a state are encoded next to each other. Example: a transition that is labeled with the letter A and goes to the state whose description starts at byte 50106 is represented by the hexadecimal sequence 0041 00C3BA. By convention, the first state of the automaton is the initial state. 14.8.2 The .inf files. An .inf file is a text file that describes the compressed forms that are associated with a .bin file. Here is an example of an .inf file: 00000000064 _10 0 0 7 N4 PREPY _3 PREPY PREP _3 PREPY 1 1 N Hum mpY 3er 1 N AN Hum fsY. The first line of the file indicates the number of compressed forms that it contains. Each line can contain one or more compressed forms. If there are multiple forms, they are separated by commas. Each compressed form is made up of a sequence required to reconstruct a canonical form from an inflected form, followed by a sequence of grammatical, semantic and inflectional codes that are associated with the entry. The mode of compression of the canonical form varies in function of t
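The two hexadecimal examples above can be reproduced with a short Python sketch. The big-endian byte order is inferred from those examples rather than stated in the text, and the function names are hypothetical.

```python
def encode_offset(value: int) -> bytes:
    """Encode an index or a byte position on 3 bytes (big-endian, as the examples suggest)."""
    return value.to_bytes(3, "big")

def encode_transition(label: str, target_offset: int) -> bytes:
    """Encode one transition on 5 bytes: 2 bytes for the character, 3 for the target state."""
    return ord(label).to_bytes(2, "big") + encode_offset(target_offset)

print(encode_offset(25133).hex().upper())           # 00622D
print(encode_transition("A", 50106).hex().upper())  # 004100C3BA
```

A 2-byte label limits this sketch to characters whose code point fits in 16 bits; `to_bytes` raises `OverflowError` for anything larger.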
174. cting lexicon-grammars. In Atkins and Zampolli, editors, Computational Approaches to the Lexicon, pages 213-263. Oxford Univ. Press, 1994. 9.1 [50] Maurice GROSS. The lexicon-grammar of a language: Application to French. In R. E. Asher, editor, The Encyclopedia of Language and Linguistics, volume 4, pages 2195-2205. Oxford/New York/Seoul/Tokyo, Pergamon, 1994. 9.1 [51] Alain GUILLET and Christian LECLÈRE. La structure des phrases simples en français: les constructions transitives locatives. Droz, Genève, 1992. 9.1 [52] Benoît HABERT and Christian JACQUEMIN. Noms composés, termes, dénominations complexes: problématiques linguistiques et traitements automatiques. Traitement Automatique des Langues, 2:5-41, 1993. 11.1 [53] IGM. Lesser General Public License for Linguistic Resources. http://igm.univ-mlv.fr/~unitex/lgpllr.html. 1.1 [54] Text Encoding Initiative. http://www.tei-c.org. 10.1 [55] Christian JACQUEMIN. Spotting and Discovering Terms through Natural Language Processing. MIT Press, 2001. 11.2.3 [56] Gaby KLARSFELD and Mary HAMMANI-MC CARTHY. Dictionnaire électronique du LADL pour les mots simples de l'anglais (DELASa). Technical report, LADL, Université Paris 7, 1991. 3.8 [57] Cvetana KRSTEV, Duško VITAS and Agata SAVARY. Prerequisites for a Comprehensive Dictionary of Serbian Compounds. LNCS, 4139:552-563, 2006. 11.2 [58] Tita KYRIACOPOULOU. Les dictionnaires électroniques: la flexion verbale
175. Figure 6.32: Configuration of morphological dictionaries. 6.4.4 Dictionary entry variables. You can associate variables with patterns that refer to the morphological dictionaries (except <DIC>). To do that, you must set the output of the box to $xxx$, where xxx is a valid variable name. That defines a special variable named xxx that represents the dictionary entry that has matched your pattern. Now you can get the inflected form, lemma and codes of the entry with $xxx.INFLECTED$, $xxx.LEMMA$ and $xxx.CODE$, as shown on Figure 6.33. You can also use the following patterns: $xxx.CODE.GRAM$ provides only the first grammatical code, supposed to be the POS category; $xxx.CODE.SEM$ provides all remaining grammatical codes, if any, separated with +; $xxx.CODE.FLEX$ provides all inflectional codes, if any, separated with :. Moreover, such variables can be used even after the end of the morphological mode, as shown on Figure 6.35. They can also be tested, as explained in section 6.7.5. Figure 6.33: Using a morphological variable. Dictionary variables in LocateTfst. In grammars to be applied with LocateTfst, you have an extra feature. Even if you are not
176. d as the end of the sentence, which would be wrong. To avoid the kind of problems caused by the ambiguous use of punctuation, grammars are used to describe the different contexts for the end of a sentence. Figure 2.10 shows an example grammar for sentence splitting for French sentences. When a path of the grammar recognizes a sequence in the text, and when this path produces the sentence delimiter symbol {S}, this symbol is inserted into the text. The path shown at the top of figure 2.10 recognizes the sequence consisting of a question mark and a word beginning with a capital letter, and inserts the symbol {S} between the question mark and the following word. The following text: What time is it? Eight o'clock. will be converted to: What time is it?{S} Eight o'clock. A grammar for end of sentence detection may use the following special symbols: <E>: empty word, or epsilon. Recognizes the empty sequence; <MOT>: recognizes any sequence of letters; <MIN>: recognizes any sequence of letters in lower case; <MAJ>: recognizes any sequence of letters in upper case; <PRE>: recognizes any sequence of letters that begins with an upper case letter; <NB>: recognizes any sequence of digits (1234 is recognized, but not 1 234); <PNC>: recognizes the punctuation symbols, as well as the inverted exclamation points and question marks in Spanish and some Asian punctuation letters
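As a rough illustration of the question-mark path described above, here is a minimal Python sketch that inserts the {S} delimiter with a regular expression. Unitex applies a compiled grammar, not a regex, so this only mimics the effect of that single path; the function name is hypothetical and the capital-letter test is restricted to ASCII for brevity.

```python
import re

def mark_question_boundaries(text: str) -> str:
    """Insert {S} after a question mark that is followed by a capitalized word."""
    # The lookahead leaves the whitespace and the next word untouched.
    return re.sub(r"\?(?=\s+[A-Z])", "?{S}", text)

print(mark_question_boundaries("What time is it? Eight o'clock."))
```

A question mark followed by a lower-case word is left alone, which is the point of describing contexts in a grammar instead of splitting on every punctuation mark.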
177. d in other languages (infinitive, present, past participle, etc.). In spite of a common base in the majority of languages, the dictionaries contain encoding particularities that are specific to each language. Thus, as the inflection codes vary a lot between different languages, they are not described here. For a complete description of all codes used within a dictionary, we recommend that you contact the author of the dictionary directly. Code / Description: m masculine; f feminine; n neuter; s singular; p plural; 1, 2, 3 1st, 2nd, 3rd person; P present indicative; I imperfect indicative; S present subjunctive; T imperfect subjunctive; Y present imperative; C present conditional; J simple past; W infinitive; G present participle; K past participle; F future indicative. Table 3.3: Common inflectional codes. However, these codes are not exclusive. A user can introduce his own codes and create his own dictionaries. For example, for educational purposes, one could use a marker faux-ami (false friend) in a French dictionary: blesser,.V+faux-ami/injure casque,.N+faux-ami/helmet journée,.N+faux-ami/day. It is equally possible to use dictionaries to add extra information. Thus, you can use the inflected form of an entry to describe an abbreviation, and the canonical form to provide the complete form: DNA DeoxyriboNuclei
178. d in the first person. In order to let a dictionary entry E be recognized by mask M, it is necessary that at least one inflectional code of E contains all the characters of an inflectional code of M. Consider the following example: E = pretext,.V:W:P1s:P2s:P1p:P2p:P3p and M = <V:P3s:P3>. No inflectional code of E contains the characters P, 3 and s at the same time. However, the code P3p of E does contain both characters P and 3. The code P3 is thus included in at least one code of E, so mask M recognizes entry E. The order of the characters inside an inflectional code is without importance. 4.3.5 Negation of a lexical mask. It is possible to negate a lexical mask by placing the character ! immediately after the character <. Negation is possible with the masks <MOT>, <MIN>, <MAJ>, <PRE>, <DIC>, as well as with the masks that carry grammatical, semantic or inflectional codes (i.e. <!V~z3:P3>). The masks # and " " are the negation of each other. The mask <!MOT> recognizes all tokens that do not consist of letters, except for the sentence separator {S} and the {STOP} marker. Negation has no effect on <NB>, <SDIC>, <CDIC>, <TDIC> and <TOKEN>. The negation is interpreted in a special way in the lexical masks <!DIC>, <!MIN>, <!MAJ> and <!PRE>. Instead of recognizing all forms that are not recognized by the mask without negation, these masks fi
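The inclusion rule for inflectional codes can be sketched in a few lines of Python. Only the inflectional part of the comparison is modeled (the grammatical code V is assumed to have been checked separately), and both function names are hypothetical.

```python
def code_matches(entry_code: str, mask_code: str) -> bool:
    """True if every character of the mask code occurs in the entry code (order is ignored)."""
    return set(mask_code) <= set(entry_code)

def mask_recognizes(entry_codes, mask_codes) -> bool:
    """True if at least one entry code contains all characters of at least one mask code."""
    return any(code_matches(e, m) for e in entry_codes for m in mask_codes)

entry = ["W", "P1s", "P2s", "P1p", "P2p", "P3p"]  # codes of pretext,.V:W:P1s:P2s:P1p:P2p:P3p
print(mask_recognizes(entry, ["P3s", "P3"]))  # True, because P3 is included in P3p
print(mask_recognizes(entry, ["P3s"]))        # False, no code holds P, 3 and s together
```

Using sets makes the order of characters irrelevant, which matches the remark that P3s and sP3 are equivalent codes.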
179. d the option Apply the Normalization grammar. 7.2.4 Keeping the best paths. An unknown word can perturb the text automaton by overlapping with a completely labeled sequence. Thus, in the automaton of figure 7.8, it can be seen that the adverb aujourd'hui overlaps with the unknown word aujourd, followed by an apostrophe and the past participle of the verb huir. Figure 7.8: Ambiguity due to a sentence containing an unknown word. This phenomenon can also take place in the treatment of certain Asian languages like Thai. When words are not delimited, there is no other solution than to consider all possible combinations, which causes the creation of numerous paths carrying unknown words that are mixed with the labeled paths. Figure 7.9 shows an example of such an automaton of a Thai sentence. Figure 7.9: Automaton of a Thai sentence. It is possible to suppress parasite paths. You have to select the option Clean Text FST in the configuration window f
180. Figure 5.8: Contextual menu. 5.2.2 Sub-Graphs. In order to call a sub-graph, its name is inserted into a box and preceded by the character :. If you enter the text alpha+:beta+gamma+:E:\greek\delta.grf into a box, you get a box similar to the one in figure 5.9. Figure 5.9: Graph that calls sub-graphs beta and delta. You can indicate the full name of the graph (E:\greek\delta.grf) or simply the base name without the path (beta); in this case, the sub-graph is expected to be in the same directory as the graph that references it. References to absolute path names should as a rule be avoided, since such calls are not portable. If you use such an absolute path name, the graph compiler will emit a warning (see figure 5.10), for example: Absolute path name detected (Windows): E:\greek\delta.grf. Absolute path names are not portable. Figure 5.1
181. distributing the Linguistic Resource (or any work based on the Linguistic Resource), you indicate your acceptance of this License to do so, and all its terms and conditions for copying, distributing or modifying the Linguistic Resource or works based on it. 10. Each time you redistribute the Linguistic Resource (or any work based on the Linguistic Resource), the recipient automatically receives a license from the original licensor to copy, distribute, link with or modify the Linguistic Resource subject to these terms and conditions. You may not impose any further restrictions on the recipients' exercise of the rights granted herein. You are not responsible for enforcing compliance by third parties with this License. 11. If, as a consequence of a court judgment or allegation of patent infringement or for any other reason (not limited to patent issues), conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License. If you cannot distribute so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not distribute the Linguistic Resource at all. For example, if a patent license would not permit royalty-free redistribution of the Linguistic Resource by all those who receive copies directly or indirectly
182. Figure 2.9: Preprocessing Window (in the window, the option Analyse unknown words as free compound words is marked as available only for Dutch, German, Norwegian and Russian). If you choose to preprocess the text, Unitex proposes to parameterize it as in the window shown in figure 2.9. The option Apply FST2 in MERGE mode is used to split the text into sentences. The option Apply FST2 in REPLACE mode is used to make replacements in the text, especially for the normalization of non-ambiguous forms. With the option Apply All default Dictionaries, you can apply dictionaries in the DELA format (Dictionnaires Electroniques du LADL). The option Analyze unknown words as free compound words is used in Norwegian for correctly analyzing compound words constructed via concatenation of simple forms. Finally, the option Construct Text Automaton is used to build the text automaton. This option is deactivated by default, because it consumes a large amount of memory and disk space if the text is too large. The construction of the text automaton is described in chapter 7. NOTE: If you click on Cancel but tokenize text, the program will carry out the normalization of separators and split the text into tokens. Click on Cancel and close text to cancel the operation. 2.5.1 Normalization of separators
183. do mkdir /usr/local (only if it does not exist), then: sudo mv soylatte16-i386-1.0.3.tar.bz2 /usr/local. Uncompress the archive: sudo tar jxvf /usr/local/soylatte16-i386-1.0.3.tar.bz2. Or, using the graphical interface (Finder): move (drag and drop) the archive soylatte16-i386-1.0.3.tar.bz2 into the Applications folder, double-click on it and wait until it is done. Configure your system. There are two approaches: you could either decide to use SoyLatte only with Unitex and keep using the former version of Java that is already installed on your computer for all the other Java applications, or decide to make SoyLatte the default Java distribution on your machine. As I have no indication that SoyLatte is bug-free and fully functional with all applications, I chose to use SoyLatte only with Unitex and kept my Java, and I recommend the same for you. If you want to follow this advice, DO NOT modify your PATH variable as suggested in the SoyLatte installation procedure. Instead, add an ALIAS in the configuration file of your shell. Here is how to do that: Edit the bash configuration file, which is in your home directory (/Users/<your dir name>). The easiest way to do it is to use a command line editor like vim or pico (note: .bashrc is a hidden file, so normally you will not see this file in the Finder): pico .bashrc. In the file, type the following lines: alias java6 usr local soylatt
184. dy of idioms in Italian. In Sintassi e morfologia della lingua italiana, Congresso internazionale della Società di Linguistica Italiana, Roma, Bulzoni, 1984. 3.8 [91] Duško VITAS, Svetla KOEVA, Cvetana KRSTEV and Ivan OBRADOVIĆ. Tour du monde through the dictionaries. In Matthieu Constant, Takuya Nakamura, Michele De Gioia and Sara Vecchiato, editors, 27th International Conference on Lexis and Grammar (LGC'08), pages 249-256, September 2008. 10
185. cursentence.txt: text file containing the sentence; cursentence.tok: text file containing the numbers of the tokens that compose the sentence. 13.41 Tfst2Unambig. Tfst2Unambig [OPTIONS] <tfst>. This program takes a .tfst text automaton and produces an equivalent text file if the automaton is linear (i.e. with no ambiguity). See section 7.6, page 182. OPTIONS: -o TXT/--out=TXT: the output text file. 13.42 Tokenize. Tokenize [OPTIONS] <txt>. This program tokenizes a text into lexical units. <txt>: the complete path of the text file, without omitting the .snt extension. OPTIONS: -a ALPH/--alphabet=ALPH: alphabet file; -c/--char_by_char: indicates that the program is applied character by character, with the exceptions of the sentence delimiter {S}, the stop marker {STOP} and lexical tags like {today,.ADV}, which are considered to be single units; -w/--word_by_word: with this option, the program considers a unit to be either a sequence of letters (the letters are defined by the alphabet file), or a character which is not a letter, or the sentence separator {S}, or a lexical label like {aujourd'hui,.ADV}. This is the default mode; -t TOKENS/--tokens=TOKENS: specifies a tokens.txt file to load and modify, instead of creating a new one from scratch. Offsets options: --input_offsets: base offset file to be used; --output_offsets: offset file t
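The word-by-word mode described above can be approximated with a regular expression. This is a hedged sketch, not the Tokenize implementation: it assumes that "letter" means any Unicode letter, whereas Unitex reads the letters from the alphabet file, and the names `TOKEN` and `tokenize` are hypothetical.

```python
import re

# Alternatives are tried in order: {S}, {STOP}, a lexical tag in braces,
# a run of letters, and finally any single character (spaces included).
TOKEN = re.compile(r"\{S\}|\{STOP\}|\{[^{}]+\}|[^\W\d_]+|.", re.DOTALL)

def tokenize(text: str):
    """Split text into lexical units, word by word."""
    return TOKEN.findall(text)

print(tokenize("Hi{S} a1"))
print(tokenize("{aujourd'hui,.ADV} ok"))
```

A character-by-character variant would replace the letter-run alternative `[^\W\d_]+` with a single letter, while still keeping the three brace alternatives so that {S}, {STOP} and lexical tags stay whole.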
186. e DELA menu. The compression is independent of the language and of the content of the dictionary. The messages produced by the program are displayed in a window that is not closed automatically. You can see the size of the resulting .bin file, the number of lines read and the number of inflectional codes created. Figure 3.13 shows the result of the compression of a dictionary of simple words. Figure 3.13: Results of a compression (binary file: 111437 bytes; 13976 lines read; 2179 INF entries created; 11358 states, 16340 transitions). The resulting files are compressed to about 95% for dictionaries containing simple words and 50% for those with compound words. NOTE: for Semitic languages, a special compression algorithm is used to reduce the size of the output .bin and .inf files. Whether a language is considered a Semitic one can be configured in the global preferences. 3.7 Applying dictionaries. Dictionaries can be applied 1) after pre-processing or 2) by explicitly clicking on Apply Lexical Resources... in the Text menu (see section 2.5.5). Unitex can manipulate compressed dictionaries (.bin) and dictionary graphs (.fst2). We will now describe the rules for applying dictionaries in detail. Dictionary graphs will be des
187. e displayed. This program is called automatically when you select a sentence, in order to generate the corresponding .grf file. The generated .grf files are not interpreted in the same manner as the .grf files that represent graphs constructed by the user. In fact, in a normal graph, the lines of a box are separated by the + symbol. In the graph of a sentence, each box represents either a lexical unit without a tag or a dictionary entry enclosed by curly brackets. If the box only represents an unlabeled lexical unit, this unit appears alone in the box. If the box represents a dictionary entry, the inflected form is displayed, followed on another line by the canonical form if it is different. The grammatical and inflectional information is displayed below the box as a transducer output. Figure 7.27 shows the graph obtained for the first sentence of Ivanhoe. The words Ivanhoe, Walter and Scott are considered unknown words. The word by corresponds to two entries in the dictionary. The word Sir corresponds to two dictionary entries as well, but since the canonical form of these entries is sir, it is displayed because it differs from the inflected form by a lower case letter. Figure 7.27: Automaton of the first sentence of Ivanhoe. 7.5.2 Modifying the text automaton. It is possible to manually modify the sentence automaton. You can add or erase boxes or transitions. When a graph is modified, it is saved to the t
188. 13.4 Text file encoding parameters; 13.5 BuildKrMwuDic; 13.6 Cassys; 13.7 CheckDic; 13.8 Compress; 13.9 Concord; 13.10 ConcorDiff; 13.11 Convert; 13.12 Dico; 13.13 Elag; 13.18 Fst2Check; 13.24 ImplodeTfst; 13.25 Locate; 13.27 MultiFlex; 13.28 Normalize
189. e fact that no comma was found after the end of an inflected form. The third error indicates that the program didn't find any grammatical or semantic codes. Figure 3.5: Checking a dictionary. 3.4 Sorting. Unitex uses the dictionaries without having to worry about the order of the entries. When displaying them, it is sometimes preferable to sort the dictionaries. The sorting depends on a number of criteria, first of all on the language of the text. For instance, the sorting of a Thai dictionary is done according to an order different from the alphabetical order. So different, in fact, that Unitex uses a sorting procedure developed specifically for Thai (see chapter 13). For European languages, the sorting is usually done according to the lexicographical order, although there are some variants. Certain languages like French treat some characters as equivalent. For example, the difference between the characters e and é is ignored if one wants to compare the words manger and mangés, because the contexts r and s make it possible to decide the order. The difference is only taken into account when the contexts are identical, as they are when comparing pêche and pèche. To allow for such an effect, the SortTxt program uses a file which defines the equivalence of characters. This file is named Alphabet_sort.txt and can be found in the user directory for the current language. By
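The idea of treating accented and unaccented letters as equivalent unless everything else is equal can be sketched in Python. This is only an approximation: SortTxt reads its equivalence classes from Alphabet_sort.txt, whereas the sketch below simply strips combining accents with the standard `unicodedata` module; the word mangez is an illustrative addition, and the function names are hypothetical.

```python
import unicodedata

def strip_accents(word: str) -> str:
    """Remove combining marks so that e and é compare as equal."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def sort_key(word: str):
    """Compare on the accent-free form first; the raw word only breaks ties."""
    return (strip_accents(word), word)

print(sorted(["mangez", "mangés", "manger"], key=sort_key))
```

With a plain code-point sort, mangez would come before mangés because é has a higher code point than z; with the key above, the letters s and z decide the order, as the equivalence rule requires.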
190. e first line indicates in which transduction mode the concordance has been constructed. The three possible values are: #I: transducer outputs have been ignored; #M: transducer outputs have been inserted before the corresponding inputs (MERGE mode); #R: transducer outputs have replaced the recognized sequences (REPLACE mode). Each occurrence is described in one line. The lines start with the start and end positions of the occurrence. These positions correspond to the offsets defined in .tfst tags (see 14.5.1). If the file has the heading line #I, the end position of each occurrence is immediately followed by a newline. Otherwise, it is followed by a space and a sequence of characters. In REPLACE mode, that sequence corresponds to the output produced for the recognized sequence. In MERGE mode, it represents the recognized sequences into which the outputs have been inserted. In MERGE or REPLACE mode, this sequence is displayed in the concordance. If the outputs have been ignored, the content of the occurrence is extracted from the text file. 14.6.2 The concord.txt file. The concord.txt file is a text file that represents a concordance. Each occurrence is encoded in a line that is composed of three character sequences separated by a tab, representing the left context, the occurrence (possibly modified by transducer outputs) and the right context. 14.6.3 The concord.html file. The concord.h
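Based on the description of concord.txt above (three tab-separated sequences per line), a line can be split as in the following minimal Python sketch. The `Occurrence` type and `parse_concord_line` function are hypothetical names, and file encoding and error handling are deliberately left out.

```python
from typing import NamedTuple

class Occurrence(NamedTuple):
    left: str   # left context
    match: str  # the occurrence, possibly modified by transducer outputs
    right: str  # right context

def parse_concord_line(line: str) -> Occurrence:
    """Split one concord.txt line into its three tab-separated parts."""
    left, match, right = line.rstrip("\r\n").split("\t")
    return Occurrence(left, match, right)

print(parse_concord_line("the cat \tsat\t on the mat\n"))
```

The unpacking raises `ValueError` if a line does not contain exactly two tabs, which is a cheap way to detect a malformed line.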
191. Figure 11.21: Inflection graph NC_XXN for French MWUs (e.g. avant-garde). Figure 11.22: Inflection graph NC_NN for French MWUs (e.g. bateau-mouche). Figure 11.23: Inflection graph NC_NXXXX for French MWUs (e.g. pomme de terre). Figure 11.24: Inflection graph NC_NNmf for French MWUs (e.g. assistant-approvisionneur). Figure 11.25: Inflection graph NC_ANT for French MWUs (e.g. franc-maçon). Figure 11.26: Inflection graph NC_NXXXXXX for French MWUs (e.g. microscope à effet tunnel). Figure 11.27: Inflection graph NC_VNm for French MWUs. 11.3.3 Complete Example in Serbian. Let us assume that the description of morphological features of Serbian is given by the following Morphology.txt file:
Serbian
<CATEGORIES>
Nb: s, p, w
Case: 1, 2, 3, 4, 5, 6, 7
Gen: m, f, n
Anim: v, q, g
Comp: a, b, c
Det: d, k, e
<CLASSES>
noun: (Nb,<var>),(Case,<var>),(Gen,<var>),(Anim,<fixed>)
adj: (Nb,<var>),(Case,<var>),(Gen,<var>),(Anim,<var>),(Comp,<var>),(Det,<var>)
adv:
The particularity of this morphological model is not only its richness but also the existence of no-care features like Anim=g or Det=e. These features agree with all other features in the same category. They are used
192. e it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice and disclaimer of warranty; keep intact all the notices that refer to this License and to the absence of any warranty; and distribute a copy of this License along with the Library. You may charge a fee for the physical act of transferring a copy, and you may at your option offer warranty protection in exchange for a fee. 2. You may modify your copy or copies of the Library or any portion of it, thus forming a work based on the Library, and copy and distribute such modifications or work under the terms of Section 1 above, provided that you also meet all of these conditions: a) The modified work must itself be a software library. b) You must cause the files modified to carry prominent notices stating that you changed the files and the date of any change. c) You must cause the whole of the work to be licensed at no charge to all third parties under the terms of this License. d) If a facility in the modified Library refers to a function or a table of data to be supplied by an application program that uses the facility, other than as an argument passed when the facility is invoked, then you must make a good faith effort to ensure that, in the event an application does not supply such function or table, the facility still operates, and performs whatever part of its purpose remains meaningful. (For example, a function in a library t
193. e outputs of your grammars, if any. For instance, let us look up the pattern <manger> (to eat) in the French text of our example. At first we see no result, because we have not yet changed the display mode for the French text, which by default is All sentences/Plain text. Clicking on Matched sentences, we only see sentences that contain occurrences, highlighted as usual in blue, as shown on Figure 10.8. Clicking on All sentences/HTML will display all sentences, highlighting occurrences in blue.
194. [Figure 5.11: Setting the path to the graph repository] Let us assume that we have a repository tree as on Figure 5.12. If we want to call the graph named DET that is located in sub-directory Johnson, we must use the call Det Johnson DET (see Figure 5.13). [Figure 5.12: Graph repository example] [Figure 5.13: Call to a graph located in the repository] TRICK: If you want to avoid long path names like Det Johnson DET, you can create a graph named DET and put it in the repository root (here D:\repository\DET.grf). In this graph, just put a call to Det Johnson DET. Then you can just call DET in your own graphs. This has two advantages: (1) you do not have long path names; (2) you can modify the graphs in your repository with no constraint on your own graphs, because the only graph that will have to be modified is the one located at the repository root. [Figure 5.14: Missing called sub-graphs appear in red] Calls to sub-graphs are represented in the boxes by grey lines, or brown lines in the case of graphs located in the repository. If the .grf file of the sub-graph is not found at the path you indicated, Unitex will try to find a .fst2 file of the same name. If Unitex cannot find either the .grf or the .fst2 file, the call to the missing subgraph will be rep
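The lookup order described above (first the .grf file, then a .fst2 file of the same name, otherwise a missing sub-graph) can be sketched in Python. This is only an illustration: the function name `resolve_subgraph` and the convention of passing the called graph as a path without extension are hypothetical, and the sketch does not claim to reproduce how Unitex performs the lookup internally.

```python
from pathlib import Path
from typing import Optional


def resolve_subgraph(base: Path) -> Optional[Path]:
    """Return the file backing a sub-graph call, or None if it is missing.

    `base` is the path of the called graph without extension (a convention
    chosen for this sketch). The .grf file is preferred; a .fst2 file of
    the same name is used as a fallback.
    """
    for extension in (".grf", ".fst2"):
        # Append the extension to the file name instead of using
        # with_suffix(), so that names containing dots are left intact.
        candidate = base.with_name(base.name + extension)
        if candidate.is_file():
            return candidate
    # Neither file exists: such a call would be displayed as missing.
    return None
```

A caller could use the `None` result to decide that the call must be drawn as a missing sub-graph.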
195. e program or on the text automaton with LocateTfst. By default, the search is done with the Locate program, as Unitex has always done until now. If you want to use LocateTfst, please read the dedicated section 7.7. Enter an expression and click on "Search" in order to start the search. Unitex will transform the expression into a grammar in the .grf format. This grammar will then be compiled into a grammar in the .fst2 format that will be used for the search. 4.8.2 Presentation of the results. When the search is finished, the window of figure 4.5 appears, showing the number of matched occurrences, the number of recognized tokens and the ratio between this number and the total number of tokens in the text. [Figure 4.5: Search results (200 matches, 644 recognized units, 0.345% of the text is covered)] After having clicked on "OK", you will see window 4.6 appear, which allows you to configure the presentation of the matched occurrences. You can also open this window by clicking on "Located Sequences" in the "Text" menu. The list of occurrences is called a concordance. The "Modify text" box offers the possibility to replace the matched occurrences with the generated outputs. This possibility will be examined in chapter 6. The "Extract units" box allows you to create a text file with all the sentences that do or do not contain matched units. With the button "Set File", you can select the output file. Then click on "Extract matching un
196. e text. The number preceding "diff" indicates the number of different units; • simple forms: the total number of lexical units in the text that are composed of letters. The number in parentheses represents the number of different lexical units that are composed of letters; • digits: the total number of digits used in the text. The number in parentheses indicates the number of different digits used (10 at most). 14.13.4 The concord.n file. The concord.n file is a text file in the directory of the text. It contains information on the latest search of the text and looks like the following three lines: "6 matches", "6 recognized units", "0.004% of the text is covered". The first line gives the number of found occurrences, and the second the number of units covered by these occurrences. The third line indicates the ratio between the covered units and the total number of units in the text. 14.13.5 The concord_tfst.n file. The concord_tfst.n file is a text file in the directory of the text. It contains information on the latest search on the text automaton and looks like the following two lines: "23 matches", "45 outputs". 14.13.6 Normalization rule file. This file is used by the Normalization and XMLizer programs. It represents replacement rules. Each line stands for a rule, made of an input sequence and an output sequence separated by a tabulation character: input sequence <TAB> output sequence. If you want to use the tabulation or the new line, you must protect them with a back
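As an illustration of the concord.n layout described above, here is a minimal Python sketch that extracts the three figures. It assumes that the file content has already been decoded into a string and that each line carries its number first; the names `ConcordInfo` and `parse_concord_n` are hypothetical, and the exact punctuation around the percentage is deliberately not relied upon.

```python
import re
from typing import NamedTuple


class ConcordInfo(NamedTuple):
    matches: int
    recognized_units: int
    coverage: float  # the ratio exactly as written in the file, e.g. 0.004


def _first_number(line: str) -> str:
    """Return the first integer or decimal number found in `line`."""
    found = re.search(r"\d+(?:\.\d+)?", line)
    if found is None:
        raise ValueError(f"no number found in line: {line!r}")
    return found.group()


def parse_concord_n(text: str) -> ConcordInfo:
    """Parse the three informative lines of a concord.n file."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) < 3:
        raise ValueError("expected three non-empty lines")
    return ConcordInfo(
        matches=int(_first_number(lines[0])),
        recognized_units=int(_first_number(lines[1])),
        coverage=float(_first_number(lines[2])),
    )
```

The regular expression tolerates optional parentheses or a percent sign on the third line, since only the first number of each line is read.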
197. e16-i386-1.0.3/bin/java alias unitex='cd /Applications/Unitex3.0/App/ ; java6 -jar Unitex.jar' The first line creates an alias that makes it easy to run SoyLatte Java, and the second line creates an alias that makes it easy to start Unitex. • Then quit pico: type CTRL+x, answer "yes" to the question "Do you want to save?" and then press Enter to proceed. Changes will be active the next time the .bashrc file is loaded: quit and re-launch Terminal.app or, more simply, type bash to open a new bash session. Once you have a new bash session, type the command unitex, and let's start using Unitex on your Mac! [Figure: screenshot of Unitex 2.1beta running under Mac OS X]
198. ecial symbol <L> that recognizes one letter as defined in the alphabet file; • it is impossible to refer to information in dictionaries; • it is impossible to use morphological filters; • it is impossible to use morphological mode; • it is impossible to use contexts. The figures 2.10 (page 39) and 2.11 (page 41) show examples of preprocessing graphs. 6.1.3 Graphs for normalizing the text automaton. Graphs for normalizing the text automaton allow you to normalize ambiguous forms. They can describe several labels for the same form. These labels are then inserted into the text automaton, thus making the ambiguity explicit. Figure 6.3 shows an extract of the normalization graph used by default for French. [Figure 6.3: Extract of the normalization graph used for French] The paths describe the forms that have to be normalized. Lower-case and upper-case variants are taken into account according to the following principle: uppercase letters in the graph only recognize uppercase letters in the text automaton; lowercase letters can recognize both lowercase and uppercase letters. The transducer outputs represent the sequences of labels that will be inserted into the text automaton. These labels can be dictionary entries or strings of characters. The labels that represent dictionary entries have to respect the DELAF format and must be enclosed
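The case principle stated above (an uppercase letter in the graph only recognizes the same uppercase letter, while a lowercase letter recognizes both variants) can be sketched as follows. This is an illustration only: the function names are hypothetical, and Python's built-in case conversion is used as a stand-in for the case equivalences that Unitex takes from the alphabet file.

```python
def graph_char_matches(graph_char: str, text_char: str) -> bool:
    """Apply the case principle to a single character."""
    if graph_char.isupper():
        # Uppercase in the graph: only the same uppercase letter matches.
        return text_char == graph_char
    # Lowercase (or non-letter) in the graph: both case variants match.
    return text_char.lower() == graph_char


def graph_form_matches(graph_form: str, text_form: str) -> bool:
    """Compare two forms character by character with the same principle."""
    return len(graph_form) == len(text_form) and all(
        graph_char_matches(g, t) for g, t in zip(graph_form, text_form)
    )
```

With this rule, a path written de in the graph covers de, De and DE in the text, whereas a path written De only covers De.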
199. ed by Unitex should be like in 12.10. For programming reasons (ambiguities between characters in the curly brackets of the lexical tags), we have no option but to place backslashes before all ambiguous characters; that is why these symbols are protected with \ in the concordance, to avoid problems in Unitex. [Figure 12.10: The concordance resulting from this cascade (4 matches)] This example shows that the writing of graphs using the lexical tags created by preceding graphs is very simple. The tags can overlap each other. 12.2.6 Interest of a cascade of transducers. Unitex grammars are known as context-free grammars and contain the notion of transduction derived from the field of finite state automata. A grammar with transduction (a transducer) is able to produce some output. Cassys is dedicated to the application of transducers in the form of a cascade. Transducers are interesting because they allow the association of a recognized sequence with information found in the outputs of the graphs. These outputs can: • Be merged into the recognized sequence found
200. […] 13.31 Reconstrucao […] 13.38 TagsetNormTfst […] 13.42 Tokenize. 13.43 TrainingTagger […] 13.47 UnitexTool […] 14 File formats. 14.1 Unicode encoding. 14.2 Alphabets. 14.2.1 Alphabet. 14.2.2 Sorted alphabet. 14.3 Graphs
201. ension of the definition of a transducer in the area of finite state automata. 5.2 Editing graphs. 5.2.1 Creating a graph. In order to create a graph, click on "New" in the "FSGraph" menu. You will then see the window coming up as in figure 5.2. The symbol in arrow form is the initial state of the graph. The round symbol with a square is the final state of the graph. The grammar only recognizes expressions that are described along the paths between initial and final states. In order to create a box, click inside the window while pressing the Ctrl key. A blue rectangle will appear that symbolizes the empty box that was created (see figure 5.3). After creating the box, it is automatically selected. If you use Unitex on a Macintosh device, you must press the Command key instead of Ctrl in every action involving the Ctrl key. You see the contents of that box in the text field at the top of the window. The newly created box contains the <E> symbol that represents the empty word epsilon. Replace this symbol by the text I+you+he+she+it+we+they and press the Enter key. You see that the box now contains seven lines (see figure 5.4). The + character serves as a separator. The box is [Figure 5.1: FSGraph menu]
202. [Figure 9.5: Menu "Lexicon-Grammar"] on your screen; it may be hidden by other Unitex windows. [Figure 9.6: Displaying a table] To automatically generate graphs from a parameterized graph, click on "Compile to GRF" in the "Lexicon-Grammar" menu. The window in figure 9.7 shows this. In the "Reference Graph (in GRF format)" frame, indicate the name of the parameterized graph to be used. In the "Resulting GRF grammar" frame, indicate the name of the main graph that will be generated. This main graph is a graph that invokes all the graphs that are going to be generated. When launching a search in a text with that graph, all the generated graphs are simultaneously applied. The "Name of produced subgraphs" frame is used to set the name of each graph that will be generated. Enter a name containing […], because for each line of the table
203. entence, Unitex searches the dictionary of the simple words of the text for all possible interpretations. Afterwards, all combinations of lexical units that have an interpretation in the dictionary of the compound words of the text are taken into account. All the combinations of this information constitute the sentence automaton. NOTE: If the text contains lexical labels (e.g. {out of date,.A+z1}), these labels are reproduced identically in the automaton, without trying to decompose them. In each box, the first line contains the inflected form found in the text, and the second line contains the canonical form if it is different. The other information is coded below the box (cf. section 7.5.1). The spaces that separate the lexical units are not copied into the automaton, except for the spaces inside compound words. The case of lexical units is retained. For example, if the word Here is encountered, the capital letter is preserved (cf. figure 7.1). This choice allows you to keep this information during the transition to the text automaton, which could be useful for applications where case is important, such as the recognition of proper names. 7.2 Normalization of ambiguous forms. During construction of the automaton, it is possible to normalize ambiguous forms by applying a normalization grammar. This grammar has to be called Norm.fst2 and must be placed in your personal folder, in the subfolder Graph
204. eously in this file, which will cause a crash. When you click on "OK", the program will copy the graphs to the directory of the output grammar, and will create subgraphs corresponding to the various sub-directories, as one can see in figure 6.39, which shows the output graph generated for our example. [Figure 6.38: Building a graph collection] One can observe that one box contains the calls to subgraphs corresponding to sub-directories (here directories Banque and Nourriture), and that the other box calls all the graphs which were in the directory (here the graph truc.grf). [Figure 6.39: Main graph of a graph collection] 6.7 Rules for applying transducers. This section describes the rules for the application of transducers along with the operations of preprocessing and the search for patterns. The following does not apply to inflection graphs and normalization graphs for ambiguous forms. 6.7.1 Insertion to the left of the matched pattern. When a transducer is applied in REPLACE mode, the output replaces the sequences that have been read in the text. When a box in a transducer has no output, it is processed as if it had an <E>
205. er 14. 2.5.5 Applying dictionaries. Applying dictionaries consists of building the subset of dictionaries consisting only of forms that are present in the text. Thus, the result of applying an English dictionary to the text Igor's father-in-law is ill produces a dictionary of the following simple words: father,.N+Hum:s; father,.V:W:P1s:P2s:P1p:P2p:P3p; ill,.A; ill,.ADV; ill,.N:s; in,.A; in,.N:s; in,.PART; in,.PREP; is,be.V:P3s; […]; law,.N:s; law,.V:W:P1s:P2s:P1p:P2p:P3p; […] as well as a dictionary of compound words consisting of a single entry: father-in-law,.N+NPN+Hum+z1:s. Since the sequence Igor is neither a simple English word nor a part of a compound word, it is treated as an unknown word. The application of dictionaries is done through the program Dico. The three files produced (dlf for simple words, dlc for compound words and err for unknown words) are placed in the text directory. The dlf and dlc files are called text dictionaries. [Figure 2.12: Tokens of an English text sorted by frequency] [Figure: Word Lists window showing the DLF entries of ivanhoe.snt]
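The idea of keeping only the dictionary entries whose form occurs in the text can be sketched in a few lines of Python. This is a simplified illustration and not the behaviour of the Dico program: it handles simple words only, splits the text into letter sequences with a regular expression, ignores case variants and compound words, and uses a small toy dictionary whose entries are abridged from the example above (the entry given for s is a hypothetical placeholder). The names `apply_simple_dictionary` and `TOY_DICTIONARY` are made up for the sketch.

```python
import re
from typing import Dict, List, Tuple

# Toy dictionary: inflected form -> DELAF-style lines (abridged, illustrative).
TOY_DICTIONARY: Dict[str, List[str]] = {
    "father": ["father,.N+Hum:s"],
    "ill": ["ill,.A", "ill,.ADV"],
    "in": ["in,.PREP"],
    "is": ["is,be.V:P3s"],
    "law": ["law,.N:s"],
    "s": ["s,.N:s"],  # hypothetical placeholder entry
}


def apply_simple_dictionary(
    text: str, dictionary: Dict[str, List[str]]
) -> Tuple[List[str], List[str]]:
    """Return (entries found in the text, unknown words), both sorted."""
    # Letter sequences only: digits, punctuation and spaces act as separators.
    words = set(re.findall(r"[^\W\d_]+", text))
    found = sorted(line for word in words for line in dictionary.get(word, []))
    unknown = sorted(word for word in words if word not in dictionary)
    return found, unknown
```

Applied to the sentence of the example, the sketch keeps the entries for father, ill, in, is, law and s, and reports Igor as the only unknown word.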
206. er * allows you to recognize zero, one or several occurrences of an expression. The star must be placed on the right-hand side of the element in question. The expression this is very* cold recognizes this is cold, this is very cold, this is very very cold, etc. The star has a higher priority than the other operators. You have to use brackets in order to apply the star to a complex expression. The expression 0,(0+1+2+3+4+5+6+7+8+9)* recognizes a zero followed by a comma and by a possibly empty sequence of digits. WARNING: It is prohibited to search for the empty word with a regular expression. If you try to search for (0+1+2+3+4+5+6+7+8+9)*, the program will raise an error as shown in figure 4.3. 4.7 Morphological filters. It is possible to apply morphological filters to the lexemes found. For that, it is necessary to immediately follow the lexeme found by a filter in double angle brackets: lexical mask<<morphological pattern>>. The morphological filters are expressed as regular expressions in POSIX format (see [63] for the detailed syntax). Here are some examples of elementary filters: [Figure 4.3: console output of the error raised when the expression recognizes the empty word]
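The behaviour of the star, and the reason why an expression made only of a starred element is rejected, can be illustrated with Python's `re` module. This is an analogy under assumptions, not Unitex syntax: Python patterns work on characters rather than on lexical units, the union written with + in Unitex is written here as a character class, and the helper name `recognizes_empty_word` is hypothetical.

```python
import re

# Python analogue of "this is very* cold": zero or more "very".
very_star = re.compile(r"this is (?:very )*cold")

# Python analogue of a zero, a comma and a possibly empty sequence of digits.
zero_comma_digits = re.compile(r"0,[0-9]*")

# Python analogue of the starred union of digits alone.
digits_star = re.compile(r"[0-9]*")


def recognizes_empty_word(pattern: "re.Pattern[str]") -> bool:
    """True if the pattern accepts the empty string as a complete match."""
    return pattern.fullmatch("") is not None
```

The first two patterns need at least some material in order to match, whereas the last one accepts the empty string, which is the situation the warning above describes.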
207. ernet Explorer, etc.) instead. Check "Use a web browser to view the concordance" (cf. figure 4.6). This option is activated by default if the number of occurrences is greater than 2000. You can configure which web browser to use by clicking on "Preferences" in the menu "Info". Click on the tab "Language & Presentation" and select the program to use in the field "Html Viewer" (cf. figure 4.7). [Figure 4.6: Result display configuration] If you choose to open the concordance in Unitex, you will see a window as shown on Figure 4.8. Occurrences behave as hyperlinks. If you click on an occurrence, the text frame is opened and the corresponding sequence is highlighted. Moreover, if the text automaton is available and if this window is not iconified, the sentence automaton that contains the occurrence will be shown. [Figure 4.7: Preferences window, "Language & Presentation" tab]
208. erted. Reg2Grf exit code: 0. [Listing: header of the generated .grf file (Unigraph, SIZE 1313 950, FONT Times New Roman, OFONT Times New Roman, BCOLOR 16777215, FCOLOR 0, ACOLOR 12632256, SCOLOR 16711680, CCOLOR 255, ...), followed by the box definitions] Chapter 2 Loading a text. One of the main functionalities of Unitex is to search a text for expressions. To do that, texts have to undergo a set of preprocessing steps that normalize non-ambiguous forms and split the text into sentences. Once these operations are performed, the electronic dictionaries are applied to the texts. Then one can search more effectively in the texts by using grammars. This chapter describes the different steps for text preprocessing. 2.1 Selecting a language. When starting Unitex, the program asks you to choose the language in which you want to work (see figure 2.1). The languages displayed are the ones that are present in the Unitex system directory and those that are installed in your personal working directory. If you use a language for the first time, Unitex copies the system directory for this language to your personal directory, except for the dictionaries, in order to save disk space. WARNING: If you a
209. es and produces an HTML page that shows their differences (see section 6.10.6, page 150). The <concor1> and <concor2> concordance index files must have absolute names, because Unitex uses these names to deduce on which text they were computed. OPTIONS: • -o X/--out=X: output HTML page; • -f FONT/--font=FONT: name of the font to use in output HTML page; • -s N/--size=N: font size to use in output HTML page; • -d/--diff_only: don't show identical sequences. 13.11 Convert. Convert [OPTIONS] <text_1> [<text_2> <text_3> ...] With this program you can transcode text files. OPTIONS: • -s X/--src=X: input encoding; • -d X/--dest=X: output encoding (default=LITTLE-ENDIAN). Transliteration options (only for Arabic): • -F/--delaf: the input is a DELAF and we only want to transliterate the inflected form and the lemma; • -S/--delas: the input is a DELAS and we only want to transliterate the lemma. Output options: • -r/--replace: input files are overwritten (default); • -o file/--output=file: name of destination file (only one file to convert); • --ps=PFX: input files are renamed with the PFX prefix (toto.txt becomes PFXtoto.txt); • --pd=PFX: output files are renamed with the PFX prefix; • --ss=SFX: input files are named with the SFX suffix (toto.txt becomes totoSFX.txt); • --sd=SFX: output files are named with the SFX suffix. HTML options: Convert offers s
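To make the notion of transcoding concrete, here is a minimal Python sketch that rewrites a file from one encoding to another. It is not a substitute for the Convert program and does not call it: the function name `transcode` is hypothetical, mapping the LITTLE-ENDIAN default to Python's "utf-16-le" codec is an assumption, and the sketch neither writes nor removes a byte order mark.

```python
from pathlib import Path


def transcode(
    src: Path,
    dest: Path,
    src_encoding: str,
    dest_encoding: str = "utf-16-le",  # assumed equivalent of LITTLE-ENDIAN
) -> None:
    """Decode `src` with `src_encoding` and write it to `dest` re-encoded.

    The file is handled as raw bytes so that line endings are preserved
    exactly; no byte order mark is added by this sketch.
    """
    text = src.read_bytes().decode(src_encoding)
    dest.write_bytes(text.encode(dest_encoding))
```

Writing to a separate destination file mirrors the idea of the output options listed above, where overwriting the input is only one of several choices.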
210. ext file sentenceN.grf, where N represents the number of the sentence. When you select a sentence, if a modified graph exists for this sentence, this one is displayed. You can then reset the automaton of that sentence by clicking on the button "Reset Sentence Graph" (cf. figure 7.28). [Figure 7.28: Modified sentence automaton] During the construction of the text automaton, all the modified sentence graphs in the text file are erased. NOTE: After you reconstruct the text automaton, you can save your manual modifications. In order to do that, click on the button "Rebuild FST-Text". All sentences that have been modified are then replaced in the text automaton by their modified versions. The new text automaton is then automatically reloaded. Manually resolve ambiguities. The text automaton may contain many paths of tags because of lexical ambiguity. You can resolve ambiguities with ELAG grammars or manually select the right paths for one or each graph of the sentence automaton. To do so, you can right-click on the box you want to keep when several boxes with different tags are proposed. The edges of the selected box will become bolder, while the other boxes wil
211. ey. Start the Paint program in the Windows "Utilities" menu. Press <Ctrl+V>. Paint will tell you that the image in the clipboard is too large and ask if you want to enlarge the image. Click on "Yes". You can now edit the screen image. Select the area that interests you. To do so, switch to the select mode by clicking on the dashed rectangle symbol in the upper left corner of the window. You can now select the area of the image using the mouse. When you have selected the zone, press <Ctrl+C>. Your selection is now in the clipboard; you can now just go to your document and press <Ctrl+V> to paste your image. On Linux: Take a screen capture (for example, using the program xv). Edit your image at once using a graphic editor (for example, TheGimp), and paste your image in your document in the same way as in Windows. Vector graphics: If you prefer vector graphics, you can save your graph in the SVG file format, which is editable with software like the open source Inkscape [24]. With this software, you can obtain PostScript exports ready to use in pretty LaTeX documents. 5.4.2 Printing a Graph. You can print a graph by clicking on "Print" in the "FSGraph" menu or by pressing <Ctrl+P>. WARNING: You should make sure that the page orientation parameter (portrait or landscape) corresponds to the orientation of your graph. You can set up the printing preferences by clicking on "Page Setup"
212. f legibility. It is for example the case of the grammar in figure 7.21, which imposes a constraint between a verb and the pronoun which follows it. [Figure 7.21: ELAG grammar checking verb-pronoun agreement] As one can see in figure 7.22, one can write an equivalent grammar by factorizing all the then-parts into only one. The two grammars will have exactly the same effect on the text automaton, but the second one will be compiled much more quickly. [Figure 7.22: Optimized ELAG grammar checking verb-pronoun agreement] Using lexical symbols. It is better to use lemmas only when it is necessary. That is particularly true for some grammatical words, when their subcategories carry almost as much information as the lemmas themselves. In any case, it is recommended to specify their syntactic, semantic and inflectional features as much as possible. For example, with the dictionaries provided for French, it is preferable to replace symbols like <je.PRO:1s>, <je.PRO+PpvIL:1s> and <je.PRO> with the symbol <PRO+PpvIL:1s>. Indeed, all these symbols are identical insofar as they can recognize only the single entry of the dictionary {je,.PRO+PpvIL:1ms:1fs}. However, as the program does not deduce this information automatically, if
213. file is encoded in UTF-8; • -p SCRIPT/--script=SCRIPT: produces a HTML concordance file where occurrences are links described by SCRIPT. For instance, if you use -phttp://www.google.com/search?q=, you will obtain a HTML concordance file where occurrences are hyperlinks to Google queries; • -i/--index: produces an index of the concordance, made of the content of the occurrences (with the grammar outputs, if any), preceded by the positions of the occurrences in the text file given in characters; • -u offsets/--uima=offsets: produces an index of the concordance relative to the original text file, before any Unitex operation. offsets is supposed to be the file produced by Tokenize's --output_offsets option; • --PRLG=X,Y: produces a concordance for PRLG corpora where each line is prefixed by information extracted with Unxmlize's --PRLG option. X is the file produced by Unxmlize's --PRLG option and Y is the file produced by Tokenize's --output_offsets option. Note that if this option is used in addition to -u, the Y argument overrides the argument of -u; • -e/--xml: produces an xml index of the concordance; • -w/--xml-with-header: produces an xml index of the concordance with full xml header; • -A/--axis: quite the same as --index, but the numbers represent the median character of each occurrence. For more information, see [31]; • -x/--xalign: another index file, used by the text alignment module. Each li
214. files. These two files are text files that contain the list of lexical units sorted alphabetically or by frequency. In the tok_by_alph.txt file, each line is composed of a unit, followed by a tab and the number of occurrences of the unit within the text. The lines of the tok_by_freq.txt file are formed after the same principle, but the number of occurrences is placed after the tab and the unit. 14.4.6 The enter.pos file. This file is a binary file containing the list of positions of the newline symbol in the .snt file. Each position is the index in the text.cod file where a newline has been replaced by a space. These positions are integers that are encoded in 4 bytes. 14.5 Text Automaton. 14.5.1 The text.tfst file. The text.tfst file represents the text automaton. It is a text file that starts with a ten-digit line indicating the number of sentence automata it contains. Then, for each sentence automaton, you have the following header lines: • $XXX: XXX is the number of the sentence; • foo foo foo: text of the sentence; • a/b c/d e/f g/h: for each token of the sentence, we have a pair x/y: x is the token index in file tokens.txt, y is the length of the token in characters; • X_Y: X is the offset of the first token of the sentence in tokens from the beginning of the text; Y is the same, but the offset is in characters from the beginning of the text. Then, all states of the automaton are encoded, one per line. If
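Since the enter.pos file is described above as a plain sequence of integers encoded in 4 bytes, reading it can be sketched with Python's `struct` module. The byte order and the signedness are not stated in the description, so the sketch takes the byte order as a parameter and assumes little-endian signed integers by default; the function name `read_enter_pos` is hypothetical.

```python
import struct
from typing import List


def read_enter_pos(data: bytes, byte_order: str = "<") -> List[int]:
    """Decode the content of an enter.pos file into a list of positions.

    `byte_order` uses the struct notation: "<" for little-endian (the
    assumption made here) or ">" for big-endian.
    """
    if len(data) % 4 != 0:
        raise ValueError("size is not a multiple of 4 bytes")
    count = len(data) // 4
    # "i" is a 4-byte signed integer when a byte order prefix is given.
    return list(struct.unpack(f"{byte_order}{count}i", data))
```

Each returned value would then be interpreted as an index into text.cod where a newline was replaced by a space.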
215. [Figure 2.15: Preprocessing a tagged text] Chapter 3 Dictionaries. 3.1 The DELA dictionaries. The electronic dictionaries distributed with Unitex use the DELA syntax (Dictionnaires Electroniques du LADL, LADL electronic dictionaries). This syntax describes the simple and compound lexical entries of a language with their grammatical, semantic and inflectional information. We distinguish two kinds of electronic dictionaries. The one that is used most often is the dictionary of inflected forms DELAF (DELA de formes Fléchies, DELA of inflected forms) or DELACF (DELA de formes Composées Fléchies, DELA of compound inflected forms) in the case of compound forms. The second type is a dictionary of non-inflected forms called DELAS (DELA de formes simples, simple forms DELA) or DELAC (DELA de formes composées, compound forms DELA). Unitex programs make no distinction between simple and compound form dictionaries. We will use the terms DELAF and DELAS to distinguish the inflected and non-inflected dictionaries, no matter whether they contain simple words, compound words or both. 3.1.1 The DELAF format. Entry syntax. An entry of a DELAF is a line of text terminated by a newline that conforms to the following syntax: apples,apple.N+conc:p/this is an example. The different elements of this line are: • apples is the inflected form of the entry; it is mandat
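The structure of a DELAF entry can be made explicit with a small Python parser. This is a simplified sketch under stated assumptions: it relies on the usual DELAF punctuation (a comma after the inflected form, a period after the lemma, + before semantic codes, : before inflectional codes, / before the comment), it does not handle characters protected with a backslash, and it applies the convention that an empty lemma means the lemma is identical to the inflected form. The names `DelafEntry` and `parse_delaf_line` are hypothetical.

```python
from typing import List, NamedTuple


class DelafEntry(NamedTuple):
    inflected: str
    lemma: str
    grammatical_code: str
    semantic_codes: List[str]
    inflectional_codes: List[str]
    comment: str


def parse_delaf_line(line: str) -> DelafEntry:
    """Split one DELAF line into its elements (no escape handling)."""
    body, _, comment = line.rstrip("\n").partition("/")
    inflected, comma, rest = body.partition(",")
    if not comma:
        raise ValueError("missing ',' after the inflected form")
    lemma, period, codes = rest.partition(".")
    if not period:
        raise ValueError("missing '.' after the lemma")
    head, *inflectional = codes.split(":")
    grammatical, *semantic = head.split("+")
    # An empty lemma stands for a lemma equal to the inflected form.
    return DelafEntry(
        inflected, lemma or inflected, grammatical, semantic, inflectional, comment
    )
```

For the example entry, the parser yields the inflected form apples, the lemma apple, the grammatical code N, the semantic code conc, the inflectional code p and the comment text.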
216. fr/outils/ALIGN/align.html. Cited in: 10. [67] Annie MEUNIER. Nominalisation d'adjectifs par verbes supports. Thèse de doctorat, Université Paris 7, 1981. Cited in: 9.1. [68] Sun Microsystems. Java. http://java.sun.com. Cited in: 1.2. [69] Christian MOLINIER and Françoise LEVRIER. Grammaire des adverbes: description des formes en -ment. Droz, Genève, 2000. Cited in: 9.1. [70] Anne MONCEAUX. Le dictionnaire des mots simples anglais: mots nouveaux et variantes orthographiques. Technical Report 15, IGM, Université de Marne-la-Vallée, 1995. Cited in: 3.8. [71] Marcello C. M. MUNIZ, Maria das Graças V. NUNES, and Eric LAPORTE. UNITEX-PB, a set of flexible language resources for Brazilian Portuguese. In Proceedings of the Workshop on Technology of Information and Human Language, São Leopoldo, Brazil, 2005. Unisinos. Cited in: 3.8. [72] OpenOffice.org. http://www.openoffice.org. Cited in: 2.2, 9.2.2. [73] Dong-Ho PAK. Lexique-grammaire comparé français-coréen. Syntaxe des constructions complétives. PhD thesis, UQAM, Montréal, 1996. Cited in: 9.1. [74] Soun-Nam PARK. La construction des verbes neutres en coréen. Thèse de doctorat, Université Paris 7, 1996. Cited in: 9.1. [75] Sébastien PAUMIER and Dana-Marina DUMITRIU. Editable text alignments and powerful linguistic queries. In Matthieu Constant, Takuya Nakamura, Michele De Gioia, and Sara Vecchiato, editors, 27th International Conference on Lexis and Grammar (LGC'08), pages 117-125, September 2008. Cited in: 10, 10.2. [76] Adam PRZEPIÓRKOWSKI and Ma
217. g and installing files. Install X11.app: X11 is available as an optional install on the Mac OS X v10.3 Panther and Mac OS X v10.4 Tiger install disks (the disks you received when you bought your computer). Run the Installer, select the X11 option and follow the instructions. After installation, X11 will be available in /Applications/Utilities. Download and install Unitex as usual: Download Unitex and uncompress it in the same way as for Linux. See section 1.5.3 for instructions on compiling C programs. Download SoyLatte (the Java 1.6 port): It is available from http://landonf.bikemonkey.org/static/soylatte/. Unless you know why you need another package, you will choose the 32-bit binaries distribution (32-bit JDK for Mac OS X 10.4 and 10.5). When prompted, enter the username and password. Username: jr1. Password: I am a Licensee in good standing. Install SoyLatte: Either in command line mode: • Open the terminal (/Applications/Utilities/Terminal); • Find the SoyLatte archive. If it is on your desktop, change the current directory by typing the following command in the terminal (the ~ character means home directory): cd ~/Desktop; • Move (mv) the archive anywhere on your file system where you want to install it. If you chose /usr/local like me, it requires administrative privileges (use sudo and type your password when prompted). Note: if the directory does not exist, you can create it before you move the file: su
218. g with Regular Expressions in section 4 3 1 In Cassys we use the lexical tag in a special way A cascade of transducers is inter esting to locate the island of certainty first It is necessary for such a system to avoid that previously recognized patterns be ambiguous with patterns recognized by the following graphs To do that you can tag the patterns of your graphs surrounding 12 2 DETAILS ON CASSYS 247 them by and fag1 tag2 tagn in the outputs of the graph where tag1 tag2 etc are your own tags To explain this behavior here is a very simple example The text on which we work is bac a bc cc a b b ba ab a b bca a b c abaabc The graph grfAB recognizes the sequence ab in the text and tags this sequence with the lexical tag a b AB This graph is merged with the text and adds its outputs and AB to the text Figure 12 8 The graph grfAB The resulting text is bac a b AB c cc a b AB b ba ab a b AB bca a b AB c abaabc Now the pattern a b is tagged AB A part a or b alone of this pattern cannot be recognized because of the tagging of a b After that graph the cascade applies another graph named tagAB 12 9 contain ing the lexical masks AB It recognizes all the sequences lexically tagged by the previous graph Figure 12 9 The graph tagAB The resulting text is bac a b AB c ABC cc la b AB b ba ab a b AB bca BCA la b AB c ABC abaabc The concordance display
219. gnment Save alignment as Locate Figure 10 8 Displaying matched sentences To exploit parallel texts it is then interesting to retrieve sentences aligned with matched sentences This can be done by selecting for the other text the display mode Aligned with source concordance In this mode Unitex filters sentences that are not linked to matched sentences in the source text So it is easy to lookup for an expression in one text and to find the corresponding sentences in the other as shown on Figure 10 9 10 3 PATTERN MATCHING D My Unitex XAlign funtana xml mais nous assassinons tour de bras comme nous mangeons comme nous respirons comme nous accomplissons les gestes les plus quotidiens Apr s avoir mang le sien l un d entre nous commen ait Tante donne moi le dessus s il niait Elle All sentences Plain text amp Matched sentences All sentences HTML Aligned with target concordance sugrumam dar noi asasinam cu at ta nongalant de parca am minca am respira am face un gest de zi cu zi Dup ce isi manca portia unul dintre noi ncepea Tanti d mi te rog partea de deasupra M tuga detaga partea de sus ornat de zahar si buc ti de ciocolata si i o dadea ea multumindu se sa gi linga degetele murdare de zah r 209 All sentences Plain text Matched sentences All sentences HTML Aligned with source concordance 8 Locate
220. graph subgraph relation e The first displays a list of graphs called by the current graph e The second button shows the list of all the graphs calling the current graph as a sub graph The two green arrows button will refresh the current graph to load the latest version of the current graph If any graph has its grf file changed by any operation while displayed in an Unitex window a window will pop up to warn you and invite you to refresh its window 5 2 EDITING GRAPHS 105 The balance button allows you to compare the current graph to another graph or another version of the same graph This will display a new window as in Figure 5 26 containing both graphs with colours pointing out the different types of changes between the two graphs insertion removal moves of each state of the graph and change of the content of a state respectively in green red purple and yellow Graph Diff oct added removed M moved M content changed avocats Figure 5 26 DIFF The last six buttons are shortcuts to use variables morphological mode or insert contexts around one or several selected states These buttons will be clickable only when one or several states are currently selected e input variable see section 5 2 5 e output variable see section 6 8 e lt gt morphological mode see section 6 4 e left context see section 6 3 e 5 r
221. gry as a wolf gladnim kao vuk gladan kao vuk AC A3XN2 p7ngea hungry as a wolf gladnim kao vuci gladan kao vuk AC A3XN2 p7ngea hungry as a wolf gladnim kao vukovi gladan kao vuk AC A3XN2 p7ngea hungry as a wolf gladna kao vuk gladan kao vuk AC A3XN2 w2mgea hungry as a wolf gladna kao vuci gladan kao vuk AC A3XN2 w2mgea hungry as a wolf gladna kao vukovi gladan kao vuk AC A3XN2 w2mgea hungry as a wolf gladne kao vuk gladan kao vuk AC A3XN2 w2fgea hungry as a wolf gladne kao vuci gladan kao vuk AC A3XN2 w2fgea hungry as a wolf gladne kao vukovi gladan kao vuk AC A3XN2 w2fgea hungry as a wolf gladna kao vuk gladan kao vuk AC A3XN2 w2ngea hungry as a wolf gladna kao vuci gladan kao vuk AC A3XN2 w2ngea hungry as a wolf gladna kao vukovi gladan kao vuk AC A3XN2 w2ngea hungry as a wolf gladna kao vuk gladan kao vuk AC A3XN2 w4mgea hungry as a wolf gladna kao vuci gladan kao vuk AC A3XN2 w4mgea hungry as a wolf 235 236 gladna gladne gladne gladne gladna gladna gladna CHAPTER 11 COMPOUND WORD INFLECTION kao vukovi gladan kao vuk AC_A3XN2 w4mgea hungry as a wolf kao vuk gladan kao vuk AC_A3XN2 w4fgea hungry as a wolf kao vuci gladan kao vuk AC_A3XN2 w4fgea hungry as a wolf kao vukovi gladan kao vuk AC_A3XN2 w4fgea hungry as a wolf kao vuk gladan kao vuk AC_A3XN2 w4ngea hungry as a wolf kao vuci gladan kao vuk AC_A3XN2 w4ngea hungry as a wolf kao vukovi gladan k
222. gt lt h4 gt lt font color blue gt Blue lt font gt identical sequences lt br gt lt font color red gt Red lt font gt similar but different sequences lt br gt lt font color green gt Green lt font gt sequences that occur in only one of the two concordances lt br gt lt table border 1 cellpadding 0 style font family Courier new font size 12 gt lt tr gt lt td width 450 gt lt font color blue gt ed in ancient times lt u gt a large forest lt u gt covering the greater par lt font gt lt td gt lt td width 450 gt lt font color blue gt ed in ancient times lt u gt a largeforest lt u gt covering the greater par lt font gt lt td gt lt tr gt lt tr gt lt td width 450 gt lt font color green gt ge forest covering lt u gt the greater part lt u gt amp nbsp of the beautiful hills lt font gt lt td gt lt td width 450 gt lt font color green gt lt font gt lt td gt 308 CHAPTER 14 FILE FORMATS lt tr gt lt table gt lt body gt lt html gt 14 7 Text dictionaries The Dico program produces several files that represent text dictionaries 14 7 1 dif and dlc dif and dlc are simple and compound word dictionaries in the DELAF format see section 3 1 1 14 7 2 err This file is made of unkown words one per line 14 7 3 tags err This file is made of unkown words one per line The difference with the err file is that in this o
223. h window (5.2)
• New window for concordance in Debug mode, with the graph used for the Locate and the list of boxes that match each token of the found sequences

IMPORTANT: as some file formats changed and some new files were introduced, we recommend that you preprocess your existing text files again, especially if you work with text automata.

Content

Chapter 1 describes how to install and run Unitex.
Chapter 2 presents the different steps in the analysis of a text.
Chapter 3 describes the formalism of the DELA electronic dictionaries and the different operations that can be applied to them.
Chapters 4 and 5 present different means for making text searches more effective. Chapter 5 describes in detail how to use the graph editor.
Chapter 6 is concerned with the different possible applications of grammars. The particularities of each type of grammar are presented.
Chapter 7 introduces the concept of text automaton and describes the properties of this notion. This chapter also describes operations on this object, in particular how to disambiguate lexical items with the ELAG program.
Chapter 8 describes the sequence automaton module, the file formats that are accepted as input, the user interface, and introduces the search by approximation.
Chapter 9 contains an introduction to lexicon-grammar tables, followed by a description of the method of constructing grammars based on these tables.
Chapter 10 describes the te
224. he MWU lemma to be inflected in future this task will be partly automated For instance in case of porte fen tre the first constituent has to be identified by the user as a noun rather than a verb e For a given morphological identification and a set of inflectional values it returns all corresponding inflected forms For instance in Polish if the instrumental forms of the word reka are to be produced three forms should be returned rekq singular instru mental rekami and rekoma two variants of the plural instrumental reka lt Case Inst gt reka lt Nb sing Gen fem Case Inst gt rekami lt Nb pl Gen fem Case Inst gt rekoma lt Nb pl Gen fem Case Inst gt Such definition of an interface between the morphological system for simple words and the one for MWUs allows a better modularity and independence of one another The latter doesn t need to know how inflected forms of simple words are described analyzed and gen erated It only requires a set of correct inflected forms of a MWU s constituents Conversely the former system knows nothing about how the latter one combines the provided forms to produce multi word sequences 113 Integration in Unitex One of the major design principles of MULTIFLEX is to be as independent as possible of the morphological system for simple words However the existence of such a system is inevitable because MWUs consist of simple words which we need to be able to inflect in order to
225. he dictionaries must be described in the tagset def file otherwise the information in the corresponding entries will be discarded by ELAG 7 3 RESOLVING LEXICAL AMBIGUITIES WITH ELAG 173 If words of the same subcategory differ by their inflectional profile it is necessary to write several lines into the complete part The disadvantage of this method of description is that it becomes difficult to make the distinction between such words in an ELAG grammar If one considers the description given by the previous example of a tagset def file certain adjectives of French take a gender and a number whereas others to not have any inflectional feature This allows for coding fixed sequences like de bonne humeur as adjective on the basis of their syntactic behavior Consider a French dictionary with such sequences as invariable adjectives without inflec tional features The problem is that if one wants to refer exclusively to this type of adjectives in a disambiguation grammar the lt A gt symbol is not appropriate since it will recognize all adjectives To circumvent this difficulty it is possible to deny an inflectional attribute by writing the character right before one of the possible values for this attribute Thus the lt A m p gt symbol recognizes all the adjectives which have neither a gender nor a number Using this operator it is possible to write grammars like those in figure 7 19 which imposes agreement in gender and number bet
226. he inflected form.

If the two forms are identical, the compressed form contains only the grammatical, semantic and inflectional information, as in N+Hum:ms.

If the forms are different, the compression program cuts up the two forms into units. These units can be a space, a hyphen, or a sequence of characters that contains neither a space nor a hyphen. This way of cutting up units allows the program to efficiently take into account the inflected forms of compound words.

If the inflected and the canonical forms do not have the same number of units, the program encodes the canonical form by the number of characters to be removed from the inflected form, followed by the characters to append. For instance, the line below is a line in the initial dictionary:

James Bond,007.N

Since the sequence James Bond contains three units and 007 only one, the canonical form is encoded with _10\0\0\7. The _ character indicates that the two forms do not have the same number of units. The following number (here 10) indicates the number of characters to be removed. The sequence \0\0\7 indicates that the sequence 007 should be appended. The digits are preceded by the \ character so that they will not be confused with the number of characters to be removed.

Whenever the two forms have the same number of units, the units are compressed two by two. Each pair consists of a unit of the inflected form and the corresponding unit in the canonical form. If each of the two unit
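The encoding used when the two forms have a different number of units can be sketched in a few lines of Python. This is only an illustration of the rule described above, not the code of the Unitex compression program: the function names are hypothetical, the escape character placed before digits is assumed to be a backslash, and keeping a common prefix (instead of always removing the whole inflected form) is an assumption as well.

```python
import re

def split_units(form):
    """Cut a form into units: a space, a hyphen, or a run of other characters."""
    return re.findall(r"[ -]|[^ -]+", form)

def encode_canonical(inflected, canonical):
    """Encode the canonical form relative to the inflected form.

    Only the case where the two forms have a different number of units
    is handled here (hypothetical helper, for illustration only).
    """
    if len(split_units(inflected)) == len(split_units(canonical)):
        raise ValueError("same number of units: the pairwise scheme applies")
    # Assumption: characters shared at the start of both forms are kept.
    prefix = 0
    while (prefix < min(len(inflected), len(canonical))
           and inflected[prefix] == canonical[prefix]):
        prefix += 1
    to_remove = len(inflected) - prefix
    # Digits to append are escaped so they cannot be read as part of the count.
    to_append = re.sub(r"(\d)", r"\\\1", canonical[prefix:])
    return f"_{to_remove}{to_append}"

print(split_units("James Bond"))               # ['James', ' ', 'Bond']
print(encode_canonical("James Bond", "007"))   # _10\0\0\7
```

With the James Bond entry, the inflected form has three units and the canonical form one, so the whole inflected form (10 characters) is removed and the escaped digits are appended.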
227. he same path. For example, on Figure 11.6 the final output contains Gen, but g may only take one value, determined by the first constituent.

Unification variables are particularly useful in highly inflected languages. For example, in Polish most nouns inflect for number (2 values) and case (7 values), which implies at least 14 different forms if variants and syncretic forms are distinguished. This count is even higher for adjectives, which inflect for number, case and gender (3 to 9 values, according to different approaches). If no unification mechanism were available, each of these numerous forms would have to be described by a separate path in the graph. The use of unification variables makes it possible to dramatically reduce the size of the graph, to one path only in most cases.

For example, Figure 11.7 shows the graph for Polish compounds that inflect like pranie mózgu (brainwashing) or powożenie koniem (horse coaching). Their third constituent has its case fixed, most often to genitive or instrumental. Their first and third constituents inflect in number independently from each other (pranie mózgów, prania mózgu, prania mózgów, etc.). That is why each of them has a different unification variable for number inflection (n1 and n2). The three variables n1, n2 and c may be instantiated to any value from their respective domains (sing, pl; sing, pl; and Nom, Gen, Dat, Acc, Inst, Loc, Voc; cf. the Morphology.txt file in section 11.2.1). The whole MWU in
228. help Cassys applies a list of grammar to a text and saves the matching sequence index in a file named Concord indstored in the text directory The target text file has to be a preprocessed snt file with its _snt directory The transducer list file is a file in which each line contains the path to a transducer followed by the output policy to be applied to this transducer Instead a list file you can specify each file and each output policy by a set of couple of s transducer file and m transducer policy argument to enumerate the list The policy may be MERGE or REPLACE The file option the alphabet option and the transducer list file option are mandatory As the locate pattern program this program saves the references to the found oc currences in a file called concord ind stored in the _snt directory of the text The file concord ind produced is in the same format as described in the chapter 14 but the cascade may be constituted of graphs applied in merge or replace mode so the M or R at the first line of the file concord ind has no sense in this context 256 CHAPTER 13 USE OF EXTERNAL PROGRAMS 13 7 CheckDic CheckDic OPTIONS dic This program carries out the verification of the format of a dictionary of DELAS or DELAF type dic corresponds to the name of the dictionary that is to be verified OPTIONS delaf checks an inflected dictionary s delas checks a non inflected dictionary r strict
229. herits its gender number and case from its first con stituent Its gender is fixed Gen g while its number and case are instantiated to any of the 14 possible combinations The single path in this graph would have to be replaced by 28 different ones if the use of unification variables were not allowed 11 2 FORMALISM FOR THE COMPUTATIONAL MORPHOLOGY OF MWUS 221 H 1 Gen g Nb nl Case c gt 2 3 Nb n2 O Gen g Nb n1 Case c gt e g pranie mozgu Figure 11 7 Inflection graph for pranie m zgu Orthographic and Other Variants Our formalism allows for any constituent to be omitted or moved within different inflected forms if there is a need for that It also enables the insertion of extra graphical units which do not appear in the base form of the MWU This allows to extend an inflection paradigm to a more general variation description e g orthographic or partly syntactic variation see 55 for an extensive study on term variation For example in English student union appears in corpus also as students union and students union in singular or plural in each case Our formalism allows to include both types of variation in one description cf Figure 11 8 lt Nb n gt Figure 11 8 Inflection graph for student union Figure 11 9 shows an example in which additionally to the insertion of a new constituent the order of constituents may be reverted The upper path allows to generate e g birth d
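The figure of 28 paths quoted for the pranie mózgu graph can be checked by enumerating the possible values of the three unification variables. The sketch below is only a counting aid: the variable names n1, n2 and c and their domains come from the text, while the function name is hypothetical.

```python
from itertools import product

# Domains of the unification variables, as listed in the text.
NUMBER = ["sing", "pl"]
CASE = ["Nom", "Gen", "Dat", "Acc", "Inst", "Loc", "Voc"]

def instantiations():
    """All bindings of n1 (number of the first constituent),
    n2 (number of the third constituent) and c (case)."""
    return [{"n1": n1, "n2": n2, "c": c}
            for n1, n2, c in product(NUMBER, NUMBER, CASE)]

bindings = instantiations()
print(len(bindings))                                # 28 explicit paths without variables
print(len({(b["n1"], b["c"]) for b in bindings}))   # 14 number/case values for the whole MWU
```

Each binding corresponds to one path that would have to be drawn by hand without unification, while the number and case of the whole compound only take 14 distinct combinations.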
230. hile Locate only finds 5763 matches This is because some words have been normalized like au le or du de le So when you look for lt le DET gt LocateTfst matches those tags that were added to the text automaton by the nor malization grammar and Concord uses the original sequence in the text to produce the concordance file as shown on Figure 7 34 Concordance D My Unitex FrenchiCorpus 80jours snticoncord html m laient quelques jeunes Anglais qui le million en poche allaient fond i le million en poche allaient fonder au loin des comptoirs de commerce llion en poche allaient fonder au loin des comptoirs de commerce Le r au loin des comptoirs de commerce 5 Le purser l homme de confianc omptoirs de commerce 5 Le purser l homme de confiance de la Compagn Le purser l homme de confiance de la Compagnie l gal du capitaine l homme de confiance de la Compagnie l gal du capitaine bord faisai Figure 7 34 A surprising concordance for pattern lt le DET 7 8 TABLE DISPLAY 185 e lt TOKEN gt does not match tokens as defined in tokens txt It matches any tag of the text automaton Matched tags can be either longer than text tokens if they are compound word tags or even shorter if the text automaton contains morphological analysis like un as shown on Figure 3 18 page 69 7 8 Table display Sentence automata can be displayed in a table format To do that you just h
231. his submenu Apply CasSys cascade is active only if a text has previously been opened. The CasSys window (Figure 12.5) displays the content of the CasSys folder of the current language. It lets you choose the file containing the list of transducers to apply to the text. The displayed files have a special file type: CaSCade configuration file (.csc). When this list is chosen, you can click on the Launch button to apply the cascade. Any morphological dictionaries added in your preferences are applied to your graphs. These preferences may be edited from the main Unitex frame (Info > Preferences > Morphological dictionaries).

12.1.4 Displaying the results of a cascade

The result of a cascade is an index file (concord.ind), just as for the Locate pattern operation. This index file contains all the sequences recognized using the restrictions imposed by the rules of Unitex. In order to display a concordance, you have to click on the Build concordance button, as described in Chapter 6, in the menu Text > Located sequences. Figure 12.6 presents a sample of

Figure 12.4
232. hoe_snticoncord html said Athelstane upon whose memory the Abbot s good ale for Burton was ala mounted some by the dexterity of their adversary s lance some by the s ES The javelin inflicted a wound upon the animal s shoulder and narrowly mis the Templar aimed at the centre of his antagonist s shield and struck it r is not yet very far spent let the archer s shoot a few rounds at the he back of which was decorated with two ass s ears and which was placed taking their directions more from the Baron s eye and his hand than his Figure 6 28 Results of the application of the grammar shown on Figure 6 27 6 4 THE MORPHOLOGICAL MODE 129 6 4 The morphological mode 6 4 1 Why As Unitex works on a tokenized version of the text it is not possible to perform queries that need to enter inside tokens except with morphological filters see section 4 7 as shown on Figure 6 29 ors 2 This does not work We should use the following morphological filter lt lt un able gt gt Figure 6 29 Matching morphological things However even morphological filters cannot allow any query since they cannot refer to dic tionaries Thus it is impossible to formulate this way a query like a word made of the prefix un followed by an adjective suffixed with able To overcome this difficulty we introduced a morphological mode in the Locate program It consists of bounding a part of your grammar with the special symbol
233. iable before its beginning or absence of the beginning or end of a variable by default it will be ignored during the emission of outputs See section 6 10 2 for other variable error policies There is no limit to the number of possible variables The variables can be nested and even overlap as is shown in figure 6 47 140 CHAPTER 6 ADVANCED USE OF GRAPHS Monday Tuesday Wednesday Thursday lt NB gt P Friday Saturday Sunday November December DayAndNumber NumberAndMonth DayAndNumber NumberAndMonth Figure 6 47 Overlapping variables 6 8 OUTPUT VARIABLES 141 6 8 Output variables Normal variables declared with xxx and xxx capture portions of the input text It is also possible to capture portions of the outputs produced by your grammar with output variables Such variables are declared with xxx and xxx Those tags appear in blue as shown on Figure 6 48 Note that when an output variable is being declared the outputs are not emitted in the output occurrence they are just stored into the pending output vari able s Note that outputs are processed first so that if an output string contains something like SA LEMMAS the output variable will not contain this raw string but rather the lemma associated to variable A Moreover output variables only capture explicit outputs produced by your grammar Thus even if you work in MERGE mode output variables never capture the input text For instance thi
234. ical tag format is transformed into an xml-like format. This change is done in order to provide the end user with a text that is easier to manipulate. From this format, it is easier to apply transducers to get whatever output is wanted. More precisely, the lexical tag is made of the following fields, in this order:

forme lemme code1 code2 flex1 flex2

The xml-like output of Cassys has the following format:

<csc><form>forme</form><lem>lemme</lem><code>code1</code><code>code2</code><inflect>flex1</inflect><inflect>flex2</inflect></csc>

12.2.4 The Unitex rules used for the cascade

In the cascade, each successive graph is applied following the Unitex rules:

• Insertion to the left of the matched patterns: in the merge mode, the output is inserted to the left of the recognized sequence.
• Priority of the leftmost match: during the application of a local grammar, overlapping occurrences are all indexed. During the construction of a concordance, all these overlapping occurrences are presented, but CasSys modifies the text with each graph of the cascade, so it is necessary to choose among these occurrences the one that will be taken into account. To do that, priority is given to the leftmost sequence.
• Priority of the longest match: in CasSys, during the application of a graph, it is the longest sequence
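The conversion of a lexical tag into the xml-like format of section 12.2.3 can be sketched as follows. This is not the Cassys implementation: the function name is hypothetical, and the delimiters of the lexical tag (braces around the tag, a comma after the form, a dot after the lemma, + between codes and : before each inflectional code) are an assumption based on the usual Unitex dictionary notation. Escaping of special characters is ignored.

```python
def lexical_tag_to_xml(tag):
    """Turn a tag such as {forme,lemme.code1+code2:flex1:flex2}
    into the xml-like form described in the text (illustrative sketch)."""
    body = tag.strip("{}")
    form, rest = body.split(",", 1)          # assumed: comma separates form and lemma
    lemma, info = rest.split(".", 1)         # assumed: dot separates lemma and codes
    codes_part, *inflections = info.split(":")
    codes = codes_part.split("+")

    parts = ["<csc>", f"<form>{form}</form>", f"<lem>{lemma}</lem>"]
    parts += [f"<code>{code}</code>" for code in codes]
    parts += [f"<inflect>{flex}</inflect>" for flex in inflections]
    parts.append("</csc>")
    return "".join(parts)

print(lexical_tag_to_xml("{forme,lemme.code1+code2:flex1:flex2}"))
```

A tag without inflectional codes simply produces no inflect elements, since the list of inflections is then empty.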
235. ich defines different properties of letter e Ee

14.2.2 Sorted alphabet

The sorted alphabet file defines the sorting priorities of the letters of a language. It is used by the SortTxt program. Each line of that file defines a group of letters. If a group of letters A is defined before a group of letters B, every letter of group A is inferior to every letter in group B.

The letters of a group are only distinguished if necessary. For example, if the group of letters eéèêë has been defined, the word ébahi should be considered smaller than estuaire, and also smaller than été. Since the letters that follow e and é determine the order of the words, it is not necessary to compare the letters e and é, since they are of the same group. On the other hand, if the words chantés and chantes are to be sorted, chantes should be considered as smaller. It is therefore necessary to compare the letters e and é to distinguish these words. Since the letter e appears first in the group eéèêë, it is considered to be smaller than é. The word chantes should therefore be considered to be smaller than the word chantés.

The sorted alphabet file allows the definition of equivalent characters. It is therefore possible to ignore the different accents as well as capitalization. For example, if the letters b c and d are to be ordered without considering capitalization and the cedilla it is possible to
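The two-level comparison described for the sorted alphabet can be sketched with a Python sort key. This is not the SortTxt algorithm itself: the groups below are a hypothetical sorted alphabet, the example words (ébahi, estuaire, été, chantés, chantes) are assumed spellings of the words discussed above, and characters outside the alphabet are simply placed after every group.

```python
import string

# Hypothetical sorted alphabet: one group of equivalent letters per entry.
GROUPS = [low + low.upper() for low in string.ascii_lowercase]
GROUPS[2] = "cCçÇ"           # the cedilla is ignored at the first level
GROUPS[4] = "eEéÉèÈêÊëË"     # accents on e are ignored at the first level

# Map each character to (group index, position inside its group).
RANK = {ch: (g, i) for g, group in enumerate(GROUPS) for i, ch in enumerate(group)}

def sort_key(word):
    # Unknown characters sort after all groups, ordered by code point.
    ranks = [RANK.get(ch, (len(GROUPS), ord(ch))) for ch in word]
    primary = tuple(g for g, _ in ranks)     # groups decide first
    secondary = tuple(i for _, i in ranks)   # positions in the group break ties
    return primary, secondary

words = ["été", "estuaire", "ébahi", "chantés", "chantes"]
print(sorted(words, key=sort_key))
# ['chantes', 'chantés', 'ébahi', 'estuaire', 'été']
```

The letters of a group are compared only when the group-level comparison finds the two words equal, which is what makes chantes come before chantés while ébahi still comes before estuaire.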
236. ics Collocate

Figure 4.12: collocate count and other information

Chapter 5 Local grammars

Local grammars are a powerful tool to represent the majority of linguistic phenomena. The first section presents the formalism in which these grammars are represented. Then we will see how to construct and present grammars using Unitex.

5.1 The local grammar formalism

5.1.1 Algebraic grammars

Unitex grammars are variants of algebraic grammars, also known as context-free grammars. An algebraic grammar consists of rewriting rules. The grammar below matches any number of a characters:

S → aS
S → ε

The symbols to the left of the rules are called non-terminal symbols, since they can be replaced. Symbols that cannot be replaced by other rules are called terminal symbols. The items at the right side are sequences of non-terminal and terminal symbols. The epsilon symbol ε designates the empty word. In the grammar above, S is a non-terminal symbol and a is a terminal symbol. S can be rewritten as either an a followed by an S, or as the empty word. The operation of rewriting by applying a rule is called derivation. We say that a grammar generates a word if there exists a sequence of derivations that produces that word. The non-terminal that is the starting point of the first derivation is called an axiom. The grammar above also generates the word a
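The notion of derivation can be illustrated with a small Python sketch for the grammar with the two rules S → aS and S → ε. The representation of the rules and the function names are hypothetical and chosen for this example only; they are not part of Unitex.

```python
# Toy model of the grammar S -> aS | epsilon discussed above.
RULES = {"S": ["aS", ""]}   # "" stands for the empty word (epsilon)

def derive(n):
    """Return the successive derivation steps that produce n 'a' characters."""
    current = "S"
    steps = [current]
    for _ in range(n):
        current = current.replace("S", RULES["S"][0], 1)   # apply S -> aS
        steps.append(current)
    current = current.replace("S", RULES["S"][1], 1)       # apply S -> epsilon
    steps.append(current)
    return steps

def generates(word):
    """True if the grammar generates `word` (a possibly empty run of 'a')."""
    return derive(len(word))[-1] == word

print(derive(2))          # ['S', 'aS', 'aaS', 'aa']
print(generates("aa"))    # True
print(generates("ab"))    # False
```

Starting from the axiom S, each application of the first rule adds one a, and the second rule ends the derivation by erasing the remaining non-terminal.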
237. icx NC_ImePrezime N Comp Hum PersName s6vf Dinkicx Mirosinki Mirosinka Dinkicx NC_ImePrezime N Comp Hum PersName s7 vf Mirosinka Dinkicx Mirosinka Dinkicx NC_ImePrezime N Comp Hum PersName slvf Mirosinke Dinkicx Mirosinka Dinkicx NC_ImePrezime N Comp Hum PersName s2vf Mirosinki Dinkicx Mirosinka Dinkicx NC_ImePrezime N Comp Hum PersName s3vf Mirosinku Dinkicx Mirosinka Dinkicx NC_ImePrezime N Comp Hum PersName s4vf Mirosinka Dinkicx Mirosinka Dinkicx NC_ImePrezime N Comp Hum PersName s5vf Mirosinkom Dinkicx Mirosinka Dinkicx NC_ImePrezime N Comp Hum PersName s6vf Mirosinki Dinkicx Mirosinka Dinkicx NC_ImePrezime N Comp Hum PersName s7vf gladni kao vuk gladan kao vuk AC_A3XN2 slmgda hungry as a wolf gladan kao vuk gladan kao vuk AC_A3XN2 slmgka hungry as a wolf gladna kao vuk gladan kao vuk AC_A3XN2 slfgea hungry as a wolf gladno kao vuk gladan kao vuk AC A3XN2 s1ngea hungry as a wolf gladnoga kao vuk gladan kao vuk AC A3XN2 s2mgda hungry as a wolf gladnog kao vuk gladan kao vuk AC A3XN2 s2mgda hungry as a wolf gladna kao vuk gladan kao vuk AC A3XN2 s2mgka hungry as a wolf gladne kao vuk gladan kao vuk AC A3XN2 s2fgea hungry as a wolf gladnoga kao vuk gladan kao vuk AC A3XN2 s2ngda hungry as a wolf gladnog kao vuk gladan kao vuk AC A3XN2 s2ngda hungry as a wolf gladna kao vuk gladan kao vuk AC A3XN2 s2ngka hungry as a wolf gladnome kao vuk gladan kao vuk AC A3XN2 s3mgda hungry as a wolf gladnom kao vuk gladan kao vuk AC A3XN2 s3mgd
238. iently linked with application programs which use some of those functions and data to form executables The Library below refers to any such software library or work which has been distributed under these terms A work based on the Library means either the Li brary or any derivative work under copyright law that is to say a work containing the Library or a portion of it either verbatim or with modifications and or trans lated straightforwardly into another language Hereinafter translation is included without limitation in the term modification Source code for a work means the preferred form of the work for making mod ifications to it For a library complete source code means all the source code for all modules it contains plus any associated interface definition files plus the scripts used to control compilation and installation of the library Activities other than copying distribution and modification are not covered by this License they are outside its scope The act of running a program using the Library is not restricted and output from such a program is covered only if its con tents constitute a work based on the Library independent of the use of the Library in a tool for writing it Whether that is true depends on what the Library does and what the program that uses the Library does 326 CHAPTER 14 FILE FORMATS 1 You may copy and distribute verbatim copies of the Library s complete source code as you receiv
239. if it concerns inflected forms, lemmas, grammatical and semantic and/or inflectional codes. Thus, if you want to search for all the verbs which have the semantic feature t, which indicates transitivity, you just have to search for t by clicking on Grammatical code. You will get the matching entries, without confusion with all the other occurrences of the letter t.

2.4 Opening a text

Unitex deals with several types of documents. The files with the extension .snt are text files preprocessed by Unitex, which are ready to be manipulated by the different system functions. You can also load raw files ending with .txt, or XML and HTML files. To open any of these files, click on Open in the Text menu. You can then choose the file type (Raw Unicode Texts, XML files, HTML files, Unitex Texts). If you open an XML or HTML file (foo.xml, for example), it will be preprocessed in order to remove non-textual content. This will produce a foo.xml.txt file containing only the textual content of the original file. The resulting .txt file will be processed to produce a .snt file.
240. ight context (see section 6.3)
• negative right context (see section 6.3)

5.3 Display options

5.3.1 Sorting the lines of a box

You can sort the content of a box by selecting it and clicking on Sort Node Label in the Tools submenu of the FSGraph menu. This sort operation does not use the SortTxt program. It uses a basic sort mechanism that sorts the lines of the box according to the order of the characters in the Unicode encoding.

Figure 5.27: Zoom sub-menu

5.3.2 Zoom

The Zoom submenu allows you to choose the zoom scale that is applied to display the graph. The Fit in screen option stretches or shrinks the graph in order to fit it into the screen. The Fit in window option adjusts the graph so that it is displayed entirely in the window.

5.3.3 Antialiasing

Antialiasing is a shading effect that avoids pixelization effects. You can activate this effect by clicking on Antialiasing in the Format sub-menu. Figure 5.28 shows one graph displayed normally (the graph on top) and with antialiasing (the graph at the bottom).

Figure 5.28: Antialiasing example

This effect slows Unitex down. We recommend not to use
241. igures 11 21 through 11 27 The DELACF dictionary resulting from the inflection via MULTIFLEX of the above DELAC is as follows avant garde avant garde NC_XXN SES avant gardes avant garde NC XXN fp bateau mouche bateau mouche NC NN ms bateaux mouches bateau mouche NC NN mp caf au lait caf au lait NC NXXXX ms caf s au lait caf au lait NC_NXXXX mp carte postale carte postale NC _NN fs cartes postales carte postale NC_NN fp cousin germain cousin germain NC_NNmf ms cousins germains cousin germain NC NNmf mp X cousine germaine cousin germain NC NNmf fs cousines germaines cousin germain NC NNmf fp franc ma on franc magon NC AN1 ms franc ma onne franc ma on NC_ANl fs franc macon franc magon NC_AN1 ms franc maconne franc ma on NC_ANl fs francs macons franc ma on NC_ANl mp francs maconnes franc magon NC_AN1 fp 228 CHAPTER 11 COMPOUND WORD INFLECTION E francs ma ons franc ma on NC_AN1 mp francs ma onnes franc ma on NC_AN1 fp n moire vive m moire vive NC_NN fs n moires vives m moire vive NC_NN fp nicroscope effet tunnel microscope effet tunnel NC_NXXXXXX ms nicroscopes effet tunnel microscope effet tunnel NC NXXXXXX mp porte serviette porte serviette NC VNm ms porte serviettes porte serviette NC VNm ms porte serviettes porte serviette NC VNm mp n n n n
242. ile are discarded by ELAG. If a dictionary entry contains such a code, ELAG will produce a warning and will withdraw the code from the entry. Consequently, if two concurrent entries differ in the original text automaton only by undeclared codes, these entries will become indistinguishable by the programs and will thus be unified into only one entry in the resulting automaton. Thus, the set of labels described in the tagset.def file is compatible with the dictionaries distributed with Unitex, by factorizing words which differ only by undeclared codes, and this independently of the applied grammars.

For example, in the most complete version of the French dictionary, each individual use of a verb is characterized by a reference to the lexicon-grammar table which contains it. We have considered until now that this information is more relevant to syntax than to lexical analysis, and we have thus not integrated it into the description of the tagset. It is thus automatically eliminated when the text automaton is loaded, which reduces the rate of ambiguity.

In order to distinguish the effects bound to the tagset from those of the ELAG grammars, it is advised to proceed to a preliminary stage of normalization of the text automaton before applying disambiguation grammars to it. This normalization is carried out by applying to the text automaton a grammar not imposing any constraint, like that of figure 7.20. Note that this gramm
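The unification effect described above can be pictured with a small Python sketch. It is only an illustration under stated assumptions: entries are modelled as (form, lemma, codes) tuples, the set of declared codes and the table references `t1` and `t32` are invented for the example, and nothing here reproduces the actual ELAG implementation.

```python
def strip_undeclared(entry, declared):
    """Drop every code that is not declared in the tagset (simplified model)."""
    form, lemma, codes = entry
    return (form, lemma, tuple(c for c in codes if c in declared))


def unify(entries, declared):
    """Normalize entries and merge those that became identical."""
    result = []
    for entry in entries:
        normalized = strip_undeclared(entry, declared)
        if normalized not in result:
            result.append(normalized)
    return result


if __name__ == "__main__":
    declared = {"V", "N"}  # hypothetical tagset content
    entries = [
        ("mange", "manger", ("V", "t1")),   # 't1' and 't32' stand for
        ("mange", "manger", ("V", "t32")),  # lexicon-grammar table references
    ]
    print(unify(entries, declared))
```

The two verb entries differ only by undeclared codes, so they collapse into a single entry, which is the reduction in ambiguity the text mentions.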
243. iles are located in the XXX_snt directory, where XXX is txt without its extension.

OPTIONS
• -d X/--sntdir=X: uses directory X instead of the text directory; note that X must be (back)slash terminated
• -n N/--number_token=N: adds the token number every N tokens
• -r N/--range=N: emits only tokens from number N to the end
• -r N,M/--range=N,M: emits only tokens from number N to M

13.47 UnitexTool

UnitexTool <utilities>

This program is a super-program that allows you to invoke all Unitex external programs. With it, you can chain commands so that they will be invoked within the same system process, in order to speed up processing. This can be done by invoking commands nested in curly brackets, as follows:

UnitexTool { cmd#1+args } { cmd#2+args } etc.

For instance, if you want to join a locate operation and the construction of the concordance, you can use the following command:

UnitexTool { Locate "-tD:\My Unitex\English\Corpus\ivanhoe.snt" "D:\My Unitex\English\regexp.fst2" "-aD:\My Unitex\English\Alphabet.txt" -L -I -n200 "--morpho=D:\Unitex2.0\English\Dela\dela-en-public.bin" -b -Y } { Concord "D:\My Unitex\English\Corpus\ivanhoe_snt\concord.ind" "-fCourier new" -s12 -l40 -r55 --CL --html "-aD:\My Unitex\English\Alphabet_sort.txt" }

13.48 UnitexToolLogger

UnitexToolLogger <utilities>

This program is a superset of UnitexTool
244. in the "FSGraph" menu. You can also print all open graphs by clicking on "Print All".

Chapter 6 Advanced use of graphs

6.1 Types of graphs

Unitex can handle several types of graphs that correspond to the following uses: automatic inflection of dictionaries, preprocessing of texts, normalization of text automata, dictionary graphs, search for patterns, disambiguation and automatic graph generation. These different types of graphs are not interpreted in the same way by Unitex. Certain operations, like transduction, are allowed for some types and forbidden for others. In addition, special symbols are not the same depending on the type of graph. This section presents each type of graph and shows their peculiarities.

6.1.1 Inflection transducers

An inflection transducer describes the morphological variation that is associated with a word class by assigning inflectional codes to each variant. The paths of such a transducer describe the modifications that have to be applied to the canonical forms, and the corresponding outputs contain the inflectional information that will be produced.

Figure 6.1: Example of an inflectional grammar

The paths may contain operators and letters. The possible operators are represented by the characters L, R, C, D, U, P and W. All letters that are not operators are characters. The only allowed special symbol is the empty word <E>
245. in the text automaton using the morphological mode (see section 6.4). This functionality will be helpful for agglutinative languages like Korean. The rule is simple: any output of a dictionary graph that begins with a slash will be added to the file tags.ind located in the text directory. This file is used by the Txt2Fst2 program in order to add interpretations into the text automaton.

Figure 3.16: Dictionary graph of roman numerals

Let us consider the grammar shown in Figure 3.17, which matches words made of the prefix un followed by an adjective. If we apply this grammar as a dictionary graph, we obtain new paths in the text automaton, as shown in Figure 3.18. Note that when two tags correspond to analyses within the same token, the link between them is displayed with a dashed line.

Figure 3.17: Example of morphological dictionary graph
246. inearize each sentence automaton. You must also select the tagger data file (with the .bin extension) by clicking on the "Set" button. The tagger data file suffixed by "morph" is the first variant of the tagger, with inflectional codes, and the one suffixed by "cat" is the second variant, without inflectional codes. If you want to use the "morph" data, you also need to click on "Normalize according to Elag tagset.def" (for more details, see section 13.37 about the Tagger program).

Figure 7.25: Configuration of the linearization of the text automaton

For instance, the text automaton shown in Figure 7.24 is the output of the linearization of the text automaton shown in Figure 7.23 with "cat" tagger data. Linearization of the automaton with "morph" tagger data is shown in Figure 7.26.

7.4.3 Creation of a new tagger

In order to create a new
247. inflect a MWU as a whole. In its present version, MULTIFLEX relies on the Unitex simple word inflection system.
• MULTIFLEX uses the same character encoding standards as Unitex, i.e. Unicode 3.0.
• MULTIFLEX uses the Unitex graph editor for the representation of inflectional paradigms of MWUs.
• MULTIFLEX admits principles of morphological description similar to those admitted in the DELA system implemented in Unitex. Thus, an inflection paradigm is a set of actions to be performed on the lemma in order to generate its inflected forms, and of corresponding inflection features to be attached to each generated form.
• MULTIFLEX allows you to extend the Unitex dictionary treatment to the inflection of a DELAC (DELA electronic dictionary of compounds) into a DELACF (DELA electronic dictionary of compounds' inflected forms). The format of the generated DELACF is compatible with Unitex, while the format of the DELAC is novel but inspired by that of the DELAS (DELA electronic dictionary of simple words).

The following sections present, for several languages, complete examples of a DELAC into DELACF inflection within the MULTIFLEX/Unitex interface.

11.3.1 Complete Example in English

Let us assume that the description of morphological features of English is given by the following Morphology.txt file:

    English
    <CATEGORIES>
    Nb: s, p
    <CLASSES>
    noun: (Nb,<var>)
    adj:

and that the e
248. information can be omitted. In that case, a default definition of letters is used (see u_is_letter in the Unicode.cpp source file).

13.1 Creating log files

Figure 13.2: Logging configuration

You can create log files of external program launches. These log files can be useful for debugging or regression tests. You just need to enable this feature in the "Preferences" frame. You have to choose a log directory, where all log files will be stored, and to select the "Produce log" check box. Clicking on the "Clear all logs" button will remove all log files contained in this directory, if any. Then, any further program execution will produce a unitex-log-XXX.ulp file located in the log directory. XXX stands for the log number, which can be found in the console (see next section).

13.2 The console

When Unitex launches an external program, the invoked command line is stored in the console. To see it, click on "Info>Console". When a command emits no error message, it is displayed with a green icon. Otherwise, the icon is a red triangle that yo
249. ing loop detection e n no loop check disables error checking default e t tfst check checks wether the given graph can be considered as a valid sentence automaton or not 266 CHAPTER 13 USE OF EXTERNAL PROGRAMS e e no_empty_graph_ warning no warning will be emitted when a graph matches the empty word This option is used by Mult iFlex in order not to scare users with meaningless error messages when they design an inflection grammar that matches the empty word Output options e o file output file output file for error message e a append opens the message output file in append mode e s statistics displays statistics about fst2 file 13 19 Fst2List Fst2List o out p s f d a t s m m s a s 0s Str r s 1 Str 1 line i subname c SS O0xxxx fname This program takes a st 2 file and lists the sequences recognized by this grammar The parameters are e fname grammar name including st2 e o out specifies the output file 1st txt by default e a t s m indicates if the program must take into account t or not a the outputs of the grammars if any s indicates that there is only one initial state whereas m indicates that there are several ones this mode is useful in Korean The default value is a s e 1 line maximum number of lines to be printed in the output file e i subname indicates that the recursive exploration must end when the pro
250. ing the file, the Unitex3.0 directory contains several subdirectories, one of which is called App. This directory contains a file called Unitex.jar. This file is the Java executable that launches the graphical interface. You can double-click on this icon to start the program. To facilitate launching Unitex, you may want to add a shortcut to this file on the desktop.

1.4 Installation on Linux

In order to install Unitex on Linux, it is recommended to have system administrator permissions. Uncompress the file Unitex3.0.zip in a directory named Unitex by using the following command:

    unzip Unitex3.0.zip -d Unitex

Within the directory Unitex/Src/C++/build, start the compilation of Unitex with the command:

    make install

or with the following, if you have a 64-bit computer:

    make install 64BITS=yes

You can then create an alias in the following way:

    alias unitex='cd Unitex/App/ ; java -jar Unitex.jar'

1.5 Installation on MacOS X

NOTE: this short tutorial will tell you how to install and run Unitex on Mac OS X. Your questions, comments, suggestions and corrections are more than welcome. Contact: cedrick.fairon@uclouvain.be

There is an official Java 1.6 for MacOS X 10.5 64-bit Intel (Core 2 Duo), but there is no official solution for older OS X (10.4) or older PowerPC and 32-bit Intel (Core Duo). So:

1. if you have OS X 10.5 and 64-bit Intel MacOS, you can just get the Apple JRE 1.6. The only p
251. ing the words aften evening et blad journal. The PolyLex program parses the list of unknown words after the application of dictionaries and tries to analyze each of these words as a compound word. If a word has at least one analysis as a compound word, it is removed from the list of unknown words, and the lines produced for this word are appended to the simple word text dictionary.

2.6 Opening a tagged text

A tagged text is a text containing words with lexical tags enclosed in curly brackets:

    I do not like the {square bracket,.N} sign {S}

Such tags can be used to avoid ambiguities. In the previous example, it will be impossible to match square bracket as the combination of two simple words. However, the presence of these tags can alter the application of preprocessing graphs. To avoid complications, you can use the "Open Tagged Text" command in the "Text" menu. With it, you can open a tagged text and skip the application of preprocessing graphs, as shown in Figure 2.15.
252. ions of linguistic phenomena on the basis of recursive transition networks (RTN), a formalism closely related to finite state automata. Numerous studies have shown the adequacy of automata for linguistic problems at all descriptive levels, from morphology and syntax to phonetic issues. Grammars created with Unitex carry this approach further by using a formalism even more powerful than automata. These grammars are represented as graphs that the user can easily create and update.

Lexicon-grammar tables are matrices describing properties of some words. Many such tables have been constructed for all simple verbs in French as a way of describing their relevant syntactic properties. Experience has shown that every word has a quasi-unique behavior, and these tables are a way to present the grammar of every element in the lexicon, hence the name lexicon-grammar for this linguistic theory. Unitex offers a way to automatically build grammars from lexicon-grammar tables.

Unitex can be viewed as a tool in which one can put linguistic resources and use them. Its technical characteristics are its portability, modularity, the possibility of dealing with languages that use special writing systems (e.g. many Asian languages), and its openness, thanks to its open source distribution. Its linguistic characteristics are the ones that have motivated the elaboration of these resources: precision, completeness, and the taking into account
253. is the possibly empty list of all syntax errors found in the dictionary: absence of the inflected or the canonical form, the grammatical code, empty lines, etc. Each error is described by the number of the line, a message describing the error, and the contents of the line. Here is an example of a message:

    Line 12451: unexpected end of line
    garden,N:s

The second and third parts display the list of grammatical codes and/or semantic and inflectional codes respectively. In order to prevent coding errors, the program reports encodings that contain spaces, tabs, or non-ASCII characters. For instance, if a Greek dictionary contains the ADV code where the Greek Α character is used instead of the Latin A character, the program reports the following warning:

    ΑDV warning: 1 suspect char (1 non ASCII char): (0391 D V)

Non-ASCII characters are indicated by their hexadecimal character number. In the example above, the code 0391 represents Greek Α. Spaces are indicated by the SPACE sequence:

    Km s warning: 1 suspect char (1 space): (K m SPACE s)

When the following dictionary is checked:

    1 2 et 3 INTJ abracadabra INTJ supercalifragilisticexpialidocious damned INTJ Paul N Hum Hum eat V W Pls Ps Plp P2p P3p

the following CHECK_DIC.TXT file is obtained:

    INTJ Line 1 unprotected comma in lemmaY 1 2 et 3l INTJ 9 Line 2 unexpected end of line abrac
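The suspect-character report described above (spaces, tabs and non-ASCII characters inside a code) can be sketched in a few lines of Python. This is only an illustration of the idea, not the CheckDic implementation: the function name `suspect_chars` and the `TAB` marker are assumptions, while the `SPACE` marker and the four-digit hexadecimal rendering follow the examples in the text.

```python
def suspect_chars(code):
    """Return (number of suspect characters, readable rendering of the code).

    Spaces, tabs and non-ASCII characters are counted as suspect.
    Hypothetical helper; the 'TAB' marker is an assumption.
    """
    rendering = []
    suspects = 0
    for ch in code:
        if ch == " ":
            rendering.append("SPACE")
            suspects += 1
        elif ch == "\t":
            rendering.append("TAB")
            suspects += 1
        elif ord(ch) > 127:
            rendering.append(f"{ord(ch):04X}")  # hexadecimal character number
            suspects += 1
        else:
            rendering.append(ch)
    return suspects, rendering


if __name__ == "__main__":
    print(suspect_chars("\u0391DV"))  # Greek capital alpha instead of Latin A
    print(suspect_chars("Km s"))
```

Rendering the character number instead of the glyph is what makes a Greek Α distinguishable from a Latin A in the report.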
254. is therefore necessary to ensure that the headings of the columns occupy exactly one line. If there is no line for the heading, the first line of a table will be ignored anyway, and if there are multiple heading lines, from the second line on they will be interpreted as lines of the table.

9.2.3 Parameterized graphs

Parameterized graphs are graphs with variables referring to the columns of a lexicon-grammar table. This mechanism is usually used with syntactical graphs, but nothing prevents the construction of parameterized graphs for inflection, preprocessing, or normalization.

Variables that refer to columns are formed with the @ symbol followed by the name of the column in capital letters (the columns are named starting with A). Example: @C refers to the third column of the table.

Whenever a variable takes the value of a + or - sign, the - sign corresponds to the removal of a path through that variable. It is possible to swap the meaning of these signs by typing an exclamation mark in front of the @ symbol. In that case, the path is removed when there is a + sign and kept where there is a - one. In all other cases, the variable is replaced by the content of the table cell.

The special variable @% is replaced by the number of the line in the table. The fact that its value is different for each line allows for its use as a simple characterization of a line. That variable is not affected by an exclamation point to the left of it. Figure
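The substitution rules for parameterized graphs can be summarized with a short Python sketch. It assumes that variables are written @A, @B, ... (one capital letter per column), that @% stands for the line number, and that a kept path is rendered with the empty word `<E>`; that last point and the function name `instantiate` are assumptions made for the illustration, not a description of how Unitex generates graphs internally.

```python
import re

# Optional '!' followed by '@' and either a column letter or '%'.
VARIABLE = re.compile(r"(!?)@([A-Z]|%)")


def instantiate(label, row, line_number):
    """Instantiate one box label for one table row.

    Returns the new label, or None when the path must be removed.
    Simplified model: single-letter column names only.
    """
    keep_path = True

    def replace(match):
        nonlocal keep_path
        negated = match.group(1) == "!"
        name = match.group(2)
        if name == "%":
            return str(line_number)  # not affected by '!'
        cell = row[ord(name) - ord("A")]
        if cell in ("+", "-"):
            # '!' swaps the meaning of '+' and '-'
            if (cell == "+") == negated:
                keep_path = False
            return "<E>"  # assumption: a kept path becomes the empty word
        return cell

    result = VARIABLE.sub(replace, label)
    return result if keep_path else None


if __name__ == "__main__":
    row = ["manger", "+", "-"]
    for label in ["@A", "@B", "@C", "!@C", "line @%"]:
        print(label, "->", instantiate(label, row, 3))
```

Returning None for a removed path keeps the caller's job simple: boxes whose label instantiates to None are dropped along with the paths going through them.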
255. it if your machine is not powerful enough.

5.3.4 Box alignment

In order to get nice-looking graphs, it is useful to align the boxes, both horizontally and vertically. To do this, select the boxes to align and click on "Alignment" in the "Format" sub-menu of the "FSGraph" menu, or press <Ctrl+M>. You will then see the window in Figure 5.29.

The possibilities for horizontal alignment are:
• Top: boxes are aligned with the top-most box;
• Center: boxes are centered on the same axis;
• Bottom: boxes are aligned with the bottom-most box.

Figure 5.29: Alignment window

The possibilities for vertical alignment are:
• Left: boxes are aligned with the left-most box;
• Center: boxes are centered on the same axis;
• Right: boxes are aligned with the right-most box.

Figure 5.30 shows an example of alignment. The group of boxes on the right is a copy of the group on the left after alignment.

The option "Use Grid" in the alignment window shows a grid as the background of the graph. This allows you to approximately align the boxes.

Figure 5.30: Example of box alignment
256. its or "Extract unmatching units", depending on whether you are interested in sentences with or without matching units.

In the "Show matching sequences in context" box, you can select the length, in characters, of the left and right contexts of the occurrences that will be presented in the concordance. If an occurrence has fewer characters than its right context, the line will be completed with the necessary number of characters. If an occurrence has a length greater than that of the right context, it will be displayed completely.

NOTE: in Thai, the size of the contexts is measured in displayable characters and not in real characters. This makes it possible to keep the line alignment in the concordance despite the presence of diacritics that combine with other letters instead of being displayed as normal characters.

You can choose the sort order in the list "Sort According to". The mode "Text Order" displays the occurrences in the order of their appearance in the text. The other six modes allow you to sort in columns. The three zones of a line are the left context, the occurrence and the right context. The occurrences and the right contexts are sorted from left to right. The left contexts are sorted from right to left. The default mode is "Center, Left Col.".

The concordance is generated in the form of an HTML file. If a concordance reaches several thousands of occurrences, it is advisable to display it in a web browser (Firefox [11], Netscape [12], Int
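The column sort can be made concrete with a small Python sketch of the "Center, Left Col." mode: the occurrence is the primary key, read left to right, and the left context is the secondary key, read right to left. The sketch compares raw code points, whereas a real concordance sort would rely on the language's sorting alphabet; the function name `sort_center_left` and the sample lines are invented for the illustration.

```python
def sort_center_left(lines):
    """Sort concordance lines by occurrence, then by reversed left context.

    Each line is a (left_context, occurrence, right_context) tuple.
    Reversing the left context makes the comparison start from the
    character closest to the occurrence (hypothetical helper).
    """
    return sorted(lines, key=lambda line: (line[1], line[0][::-1]))


if __name__ == "__main__":
    lines = [
        ("he had ", "ridden", " to"),
        ("she was ", "joined", " by"),
        ("we had ", "joined", " him"),
    ]
    for left, occurrence, right in sort_center_left(lines):
        print(f"{left:>10}|{occurrence}|{right}")
```

With these sample lines, the two "joined" occurrences come first, and "we had" precedes "she was" because the left contexts are compared from their last character backwards.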
257. Figure 2.10: Sentence splitting grammar for French

2.5.3 Normalization of non-ambiguous forms

Certain forms present in texts can be normalized (for example, the English sequence "I'm" is equivalent to "I am"). You may want to replace these forms according to your own needs. However, you have to be careful that the normalized forms are unambiguous, or that the removal of ambiguity has no undesirable consequences.

For instance, if you want to normalize "O'clock" to "on the clock", it would be a bad idea to replace "O'" by "on the ", because a sentence like:

    John O'Connor said it's 8 O'clock

would be replaced by the following incorrect sentence:

    John on the Connor said it's 8 on the clock

Thus, one needs to be very careful when using the normalization grammar. One needs to pay attention to spaces as well. For example, if one replaces "'re" by "are", the sentence:

    You're stronger than him

will be replaced by:

    Youare stronger than him

To avoid this problem, one should explicitly insert a space, i.e. replace "'re" by " are".

The accepted symbols for the normalization grammar are the same as the ones allowed for the sentence splitting grammar. The normalization grammar is called Replace.fst2 and can be found in the following directory:

    /(home directory)/(active language)/Graphs/Preprocessing/Replace

As in the case of sentence splitting, this grammar is a
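The two pitfalls above are easy to reproduce with plain string replacement. The Python sketch below is only an analogy: Unitex applies the Replace.fst2 transducer rather than literal substitutions, and the function name `apply_rules` is hypothetical.

```python
def apply_rules(text, rules):
    """Apply literal (old, new) replacements in order (simplified analogy)."""
    for old, new in rules:
        text = text.replace(old, new)
    return text


if __name__ == "__main__":
    # An over-general rule rewrites the proper noun as well.
    print(apply_rules("John O'Connor said it's 8 O'clock", [("O'", "on the ")]))
    # A forgotten space glues the two words together.
    print(apply_rules("You're stronger than him", [("'re", "are")]))
    # The explicit leading space gives the intended result.
    print(apply_rules("You're stronger than him", [("'re", " are")]))
```

The first call shows why a rule should target the whole form (here "O'clock") instead of an ambiguous prefix.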
258. ized units within the text are saved in a file called concord.n. These two files are stored in the directory of the text.

13.26 LocateTfst

LocateTfst [OPTIONS] <fst2>

Applies a grammar to a text automaton, and saves the matching sequence index in a file named concord.ind, just as Locate does.

OPTIONS
• -t TFST/--text=TFST: complete path of the text automaton without omitting the .tfst extension;
• -a ALPH/--alphabet=ALPH: complete path of the alphabet file;
• -K/--korean: tells LocateTfst that it works on Korean;
• -g X/--negation_operator=X: specifies the negation operator to be used in Locate patterns. The two legal values for X are minus and tilde (default). Using minus provides backward compatibility with previous versions of Unitex.

Search limit options:
• -l/--all: looks for all matches (default);
• -n N/--number_of_matches=N: stops after the first N matches.

Matching mode options:
• -S/--shortest_matches;
• -L/--longest_matches (default);
• -A/--all_matches.

Output options:
• -I/--ignore: ignore transducer outputs (default);
• -M/--merge: merge transducer outputs with text inputs;
• -R/--replace: replace text inputs with corresponding transducer outputs.

Ambiguous output options:
• -b/--ambiguous_outputs: allows the production of several matches with same input but different outputs (default);
• -z
259. k drzxave N2X1 N Comp p2vm predsednicima drzxave predsednik drzxave NC_N2X1 N Comp p3vm predsednicima drzxava predsednik drzxave NC_N2X1 N Comp p3vm predsednike drzxave predsednik drzxave NC_N2X1 N Comp p4vm predsednike drzxava predsednik drzxave NC_N2X1 N Comp p4vm predsednici drzxave predsednik drzxave NC_N2X1 N Comp p5vm predsednici drzxava predsednik drzxave NC_N2X1 N Comp p5vm predsednicima drzxave predsednik drzxave NC_N2X1 N Comp p6vm predsednicima drzxava predsednik drzxave NC_N2X1 N Comp p6vm predsednicima drzxave predsednik drzxave NC_N2X1 N Comp p7vm predsednicima drzxava predsednik drzxave NC_N2X1 N Comp p7vm predsednika drzxave predsednik drzxave NC_N2X1 N Comp w2vm predsednika drzxava predsednik drzxave NC_N2X1 N Comp w2vm predsednika drzxave predsednik drzxave NC_N2X1 N Comp w4vm predsednika drzxava predsednik drzxave NC_N2X1 N Comp w4vm Ujedinxene nacije Ujedinxene nacije NC_AXN3 N Comp NProp Org fplq Ujedinxenih nacija Ujedinxene nacije NC_AXN3 N Comp NProp Org fp2q Ujedinxenima nacijama Ujedinxene nacije NC_AXN3 N Comp NProp Org fp3q Ujedinxenim nacijama Ujedinxene nacije NC_AXN3 N Comp NProp Org fp3q Ujedinxene nacije Ujedinxene nacije NC_AXN3 N Comp NProp Org fp4q Ujedinxene nacije Ujedinxene nacije NC_AXN3 N Comp NProp Org fp5q Ujedinxenima nacijama Ujedinxene nacije NC_AXN3 N Comp NProp tOrg fp6q Ujedinxenim nacijama Ujedinxene nacije
260. k the current matching operation if the condition is not verified. This is done by inserting the sequence $xxx.SET$ in the output of a graph box. Then, if a variable named xxx has been defined, this sequence will be ignored in the output and the matching process will go on; otherwise, matching will be stopped and the program will backtrack. This operates on normal variables as well as on output ones and dictionary entry variables defined in morphological mode. You can check whether a variable has not been defined in the same way, using $xxx.UNSET$. Figure 6.50 shows a graph that uses such a variable test. Figure 6.51 shows results obtained with this graph in MERGE mode.

Figure 6.50: Testing a variable

Figure 6.51: Results of a variable test

6.9.2 Comparing variables

Another kind of test you can perform consists of variable comparison. You can compare a variable (normal one, output one or dictionary one) against a constant string or another variable. To do that, you have to use the follo
261. ke Arabic significantly reduces the size of the output dictionary.
• --v1: produces an old style .bin file.
• --v2: produces a new style .bin file, with no file size limitation to 16 Mb and a smaller size (default).

This program takes a DELAF dictionary as a parameter and compresses it. The compression of a dictionary dico.dic produces two files:
• dico.bin: a binary file containing the minimum automaton of the inflected forms of the dictionary;
• dico.inf: a text file containing the compressed forms required for the reconstruction of the dictionary lines from the inflected forms contained in the automaton.

For more details on the format of these files, see chapter 14.

13.9 Concord

Concord [OPTIONS] <index>

This program takes a concordance index file produced by the program Locate and produces a concordance. It is also possible to produce a modified text version taking into account the transducer outputs associated to the occurrences. Here is the description of the parameters:

OPTIONS
• -f FONT/--font=FONT: the name of the font to use if the output is an HTML file;
• -s N/--fontsize=N: the font size to use if the output is an HTML file. The font parameters are required if the output is an HTML file;
• --only_ambiguous: only displays identical occurrences with ambiguous outputs, in text order;
• --only_matches: this option will force empty right and left contexts. Moreover, if used with t
262. ks. These symbols, also called meta-symbols, are the following:
• <E>: the empty word or epsilon. Matches the empty string;
• <TOKEN>: matches any token, except the space; used by default for morphological filters;
• <MOT>: matches any token that consists of letters;
• <MIN>: matches any lower case token;
• <MAJ>: matches any upper case token;
• <PRE>: matches any token that consists of letters and starts with a capital letter;
• <DIC>: matches any word that is present in the dictionaries of the text;
• <SDIC>: matches any simple word in the text dictionaries;
• <CDIC>: matches any compound word in the dictionaries of the text;
• <TDIC>: matches any tagged token like {XXX,XXX.XXX};
• <NB>: matches any contiguous sequence of digits (1234 is matched but not 1 234);
• #: prohibits the presence of space.

NOTE: as described in section 2.5.4, NO meta can be used to match the {STOP} marker, not even <TOKEN>.

4.3.2 References to information in the dictionaries

The second kind of lexical masks refers to the information in the text dictionaries. The four possible forms are:
• <be>: matches all the entries that have be as canonical form. Note that this pattern is ambiguous if be is also a grammatical or semantic code;
• <be.>: matches all the entries that have be as canonical form. This pattern is not ambiguous, unlike the previous one;
• <be.V>
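The token-shape meta-symbols can be approximated with a few Python predicates. This is a rough sketch only: Unitex decides what a letter is from the alphabet file of the current language, whereas the code below relies on Python's Unicode character classes, and the function names are invented for the illustration.

```python
def is_mot(token):
    """Approximation of <MOT>: the token consists of letters only."""
    return token.isalpha()


def is_min(token):
    """Approximation of <MIN>: a lower case token."""
    return token.isalpha() and token.islower()


def is_maj(token):
    """Approximation of <MAJ>: an upper case token."""
    return token.isalpha() and token.isupper()


def is_pre(token):
    """Approximation of <PRE>: letters only, starting with a capital."""
    return token.isalpha() and token[0].isupper()


def is_nb(token):
    """Approximation of <NB>: a contiguous sequence of ASCII digits."""
    return token.isascii() and token.isdigit()


if __name__ == "__main__":
    for token in ["Hello", "hello", "HELLO", "1234", "1 234"]:
        print(token, is_mot(token), is_min(token), is_maj(token),
              is_pre(token), is_nb(token))
```

Note that "1 234" fails the digit test because of the space, which mirrors the remark that 1234 is matched but 1 234 is not.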
263. l Af assessment N 5s4 of PREPY the DET Ddef s4 14 10 TAGGER FILES 315 behavior N sY of PREPY foreign_countries N pY PONCTY 4 She PRO Nomin 3fsf closed V 13s4 easily ADVY her DET Poss3fs pY eyes N pY when CONJY some DET 4Dadj pf infractions N pf might V 13p appear V W4 justified V K4 against PREPY higher Aq interests N pf PONCTY 4 NOTE Sentences must be delimited by empty lines The txt file format can also be used see section 14 4 1 Each word of the text must be represented by a valid lexical label aujourd hui ADV and sentences are delimited by 5 Here is the previous example in the t xt file format The DET Ddef s GATT N s had V I3s formerly ADV la DET Dind s political A assessment N s of PREP the DET Ddef s behavior N s of PREP foreign countries PONCT S She PRO Nomin 3fs closed V 13s easily ADV her DET Poss3fs p eyes N p when CONJ some DET Dadj p infraction N p might V 13p appear V W justified V K against PREP higher A interests N p PONCT S 14 10 2 The tagger data file The TrainingTagger program generates two data files by default used by the Tagger program in order to compute a second order hidden Markov model These files contain unigram bigram and trigram tuples extracted from the
264. l appear grayed (see Figure 7.29).

Figure 7.29: Manually resolve ambiguities in sentence automaton

You can then click on the "Remove greyed states" button to keep only the selected boxes, as in Figure 7.30.

Figure 7.30: Ambiguous boxes removed in sentence automaton

7.5.3 Display configuration

Sentence automata are subject to the same presentation options as the graphs. They use the same colors and fonts, as well as the antialiasing effect. In order to configure the appearance of the sentence automata, you modify the general configuration by clicking on "Preferences" in the "Info" menu. For further details, refer to section 5.3.5.

You can also print a sentence automaton by clicking on "Print" in the "FSGraph" menu or by pressing <Ctrl+P>. Make sure that the printer's page orientation is set to landscape mode. To configure this parameter, click on "Page Setup" in the "FSGraph" menu.

7.6 Converting the text automaton into linear text

If the text automaton does not contain any lexical ambiguity, it is possible to build a text file corresponding to the unique p
265. l be such as subgraph_0013.grf. By default, subgraph names look like result_0013.grf, where result.grf designates the result main graph.

13.37 Tagger

Tagger [OPTIONS] <tfst>

The input of this program is the text automaton in the specified .tfst. The program applies the Viterbi path algorithm to it and produces a linear automaton. The automaton is pruned in a probabilistic way based on a second-order hidden Markov model. If the specified tagger data file contains tuples of "cat" tags, the tagger prunes transitions on the basis of grammatical, syntactic and semantic codes (for example, that (DET+Ddem) versus that (PRO+Pdem)). Otherwise, if it contains tuples of "morph" tags, the tagger prunes transitions on grammatical, semantic, syntactic and inflectional codes (the (DET+Ddef:s) versus the (DET+Ddef:p)). In that case, the automaton needs to be exploded before applying the tagging process, and a tagset file must be specified with the -t option below.

OPTIONS
• -a ALPH/--alphabet=ALPH: alphabet file;
• -o OUT/--output=OUT: output text automaton;
• -t TAGSET/--tagset=TAGSET: name of the tagset description file;
• -d DATA/--data=DATA: a .bin tagger data file that contains occurrence counts for unigrams, bigrams and trigrams in order to compute probabilities. This file is obtained with the TrainingTagger program (see section 14.10.2).

13.38 TagsetNormTfst

TagsetNormTfst [OPTIONS]
266. [Figure 6.37: Sample graph] 6.6 Graph collections. It can happen that one wants to apply several grammars located in the same directory. For that purpose, it is possible to automatically build a grammar from a file tree structure. Suppose, for example, that one has the following tree structure:
• Dicos:
  – Banque:
    * carte.grf
  – Nourriture:
    * eau.grf
    * pain.grf
  – truc.grf
To gather all these grammars into a single one, use the "Build Graph Collection" command in the "FSGraph > Tools" sub-menu. This operation is configured by means of the window shown in Figure 6.38. In the "Source Directory" field, select the root directory that you want to explore (in our example, the directory Dicos). In the "Resulting GRF grammar" field, enter the name of the produced grammar. WARNING: Do not place the output grammar in the tree structure that you want to explore, because in this case the program will try to read and to write simultan
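The exploration described above can be pictured with a small Python sketch. It is only an illustration under stated assumptions, not how Unitex implements "Build Graph Collection": the helper name `collect_graphs` is hypothetical, and the check on the output path simply mirrors the warning about not writing the result inside the explored tree.

```python
import os

def collect_graphs(source_dir, output_grf):
    """Return the sorted relative paths of all .grf files under source_dir.

    Hypothetical helper: raises ValueError when output_grf lies inside
    source_dir, because the tree would then be read and written at once.
    """
    source = os.path.abspath(source_dir)
    output = os.path.abspath(output_grf)
    if os.path.commonpath([source, output]) == source:
        raise ValueError("output grammar must be outside the explored tree")
    found = []
    for root, _dirs, files in os.walk(source):
        for name in files:
            if name.lower().endswith(".grf"):
                full = os.path.join(root, name)
                found.append(os.path.relpath(full, source))
    return sorted(found)
```

With the Dicos example, such a helper would list the four .grf files of the tree, and would refuse an output path such as Dicos/all.grf.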
267. le. The system dictionaries are in the system directory, in the (current language)/Dela sub-directory. Here is an example of this file:
delacf.bin
delaf.bin
14.11.3 The user_dic.def file. The user_dic.def file is a text file that describes the list of dictionaries the user has defined to be applied by default. This file is in the directory of the current language and has the same format as the system_dic.def file. The dictionaries need to be in the (current language)/Dela sub-directory of the personal directory of the user. 14.11.4 The (user).cfg and .unitex.cfg files. Under Linux, Unitex expects the personal directory of the user to be called unitex and to be located in the user's home directory ($HOME). If you want to change this default location, a .unitex.cfg file is created in your home directory; it contains the path to your private Unitex directory. This file is encoded in UTF-8. Under Windows, it is not always possible to associate a directory with a user by default. To compensate for that, Unitex creates a .cfg file for each user that contains the path to the user's personal directory. This file is saved under the name (user login).cfg in the Unitex/Users system sub-directory. WARNING: THIS FILE IS NOT IN UNICODE. WARNING 2: THE PATH OF THE PERSONAL DIRECTORY IS NOT FOLLOWED BY A NEWLINE. 14.12 Cassys files. 14.12.1 Cassys configuration files (.csc). To memorize the list
268. les are in the .fst2 format. 14.9.4 .rul files. RUL FILES ARE NOT UNICODE FILES. A .rul file contains the different .elg files that compose an ELAG rule set. It contains one part per .elg file. Each part lists the ELAG grammars that correspond to a given .elg file. The .elg file names are surrounded with angle brackets. The lines that start with a tabulation are considered as comments by the Elag program. Here is the elag.rul file used for French:
	PPVs/PpvIL.elg
	PPVs/PpvLE.elg
	PPVs/PpvLUI.elg
<elag.rul-0.elg>
	PPVs/PpvPR.elg
	PPVs/PpvSeq.elg
	PPVs/SE.elg
	PPVs/postpos.elg
<elag.rul-1.elg>
14.10 Tagger files. This section presents the files produced and used by the TrainingTagger and Tagger programs. 14.10.1 The corpus.txt file. This file is used by the TrainingTagger program in order to compute statistics for the Tagger program. It contains sentences in which each word is represented on a separate line. Each line representing a word is composed of a word (simple or compound), followed by a slash and the tag of the word. This tag is composed of a grammatical code, sometimes followed by a + and syntactic or semantic codes. Inflectional codes are specified after a : character. If the word is a compound, the simple words contained in it must be separated by a _ character. Here is an example of a corpus.txt file:
The/DET+Ddef:s
GATT/N:s
had/V:I3s
formerly/ADV
a/DET+Dind:s
politica
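To make the corpus.txt line layout concrete, here is a minimal Python sketch that splits one such line into its components. It assumes the layout word/GRAM+CODE...:INFLECTION, with + introducing syntactic or semantic codes, : introducing inflectional codes and _ joining the parts of a compound; the function name `parse_corpus_line` and the returned dictionary keys are hypothetical and not part of Unitex.

```python
def parse_corpus_line(line):
    """Split a 'word/TAG' line into its assumed components.

    Assumed tag layout: GRAM[+CODE...][:INFLECTION]
    """
    # The tag follows the last slash on the line.
    word, _, tag = line.strip().rpartition("/")
    # Inflectional codes, if any, come after the first colon.
    codes, _, inflection = tag.partition(":")
    # The first code is the grammatical category, the rest are extra codes.
    gram, *extra = codes.split("+")
    return {
        "parts": word.split("_"),      # simple words of a compound
        "gram": gram,
        "codes": extra,
        "inflection": inflection or None,
    }
```

For instance, a line such as in_front_of/PREP would be read as a three-part compound with grammatical code PREP and no inflectional code.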
269. lete path of the text file, without omitting the extension .snt. OPTIONS:
• -y/--yes: extracts all sentences containing matching units (default);
• -n/--no: extracts all sentences that don't contain matching units;
• -o OUT/--output=OUT: output text file;
• -i X/--index=X: the .ind file that describes the concordance. By default, X is the concord.ind file located in the text directory.
The result file is a text file that contains all extracted sentences, one sentence per line. 13.17 Flatten. Flatten [OPTIONS] <fst2>. This program takes a .fst2 grammar as its parameter and tries to transform it into a finite-state transducer. OPTIONS:
• -f/--fst: the grammar is unfolded to the maximum depth and is truncated if there are calls to sub-graphs. Truncated calls are replaced by void transitions. The result is a .fst2 grammar that only contains a single finite-state transducer;
• -r/--rtn: calls to sub-graphs that remain after the transformation are left as they are. The result is therefore a finite-state transducer in the favorable case, and an optimized grammar strictly equivalent to the original grammar if not (default);
• -d N/--depth=N: maximum depth to which graph calls should be unfolded. The default value is 10.
13.18 Fst2Check. Fst2Check [OPTIONS] <fst2>. This program checks that a .fst2 file contains no error for Locate. OPTIONS:
• -y/--loop_check: enables error check
270. lete path of the text file. The program creates a modified version of the text that is saved in a file with the extension .snt. OPTIONS:
• -n/--no_carriage_return: every separator sequence will be turned into a single space;
• --input_offsets=XXX: base offset file to be used;
• --output_offsets=XXX: offset file to be produced;
• -r XXX/--replacement_rules=XXX: specifies the normalization rule file to be used. See section 14.13.6 for details about the format of this file. By default, the program only replaces { and } by [ and ];
• --no_separator_normalization: only applies replacement rules specified with -r.
WARNING: if you specify a normalization rule file, its rules will be applied prior to anything else. So, you have to be very careful if you manipulate separators in such rules. 13.29 PolyLex. PolyLex [OPTIONS] <list>. This program takes a file containing unknown words <list> and tries to analyse each of the words as a compound obtained by concatenating simple words. The words that have at least one analysis are removed from the file of unknown words, and the dictionary lines that correspond to the analysis are appended to the file OUT. OPTIONS:
• -a ALPH/--alphabet=ALPH: the alphabet file to use;
• -d BIN/--dictionary=BIN: the .bin dictionary to use;
• -o OUT/--output=OUT: designates the file in which the produced dictionary lines are to be printed; if
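The effect of turning every separator sequence into a single space can be sketched in a few lines of Python. This is an illustration only: the set of separator characters (space, tab, carriage return, newline) is an assumption, and `collapse_separators` is a hypothetical name, not a Unitex function.

```python
import re

# Assumed separator set: space, tab, carriage return and newline.
SEPARATORS = re.compile(r"[ \t\r\n]+")

def collapse_separators(text):
    """Replace each run of separator characters with one space."""
    return SEPARATORS.sub(" ", text)
```

Under that assumption, a line break surrounded by spaces and tabs collapses to exactly one space, so the resulting text contains no carriage return at all.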
271. ll ask you to build a working version of your text, as shown in Figure 10.6. This text version will be preprocessed according to the text language; in particular, the default dictionaries will be applied. WARNING: the text language is determined on the basis of the path name. For instance, if your text file is located in MyUnitex/Klingon/Corpus, the language will be considered to be Klingon. So, if your text is not in a subdirectory of your personal Unitex directory, its language will not be identified. [Figure 10.6: Unitex needs to build a working version of your text] [Figure 10.7: Pattern matching frame for aligned texts] Once Unitex has created and preprocessed the working version of the text, you can perform your query using the frame shown in Figure 10.7. As the matching operation is performed by the Locate program, you can perform the same queries as you would on a normal corpus. The only restriction is that you cannot exploit th
272. ll boxes of the selection will be connected to it. You can perform a copy-paste with several boxes. Select them and press <Ctrl+C> or click on "Copy" in the "Edit" menu. The selection is now in the Unitex clipboard. You can then paste this selection by pressing <Ctrl+V> or by selecting "Paste" in the "Edit" menu. NOTE: You can paste a multiple selection into a different graph from the one you copied it from. To delete boxes, select them, delete the text that they contain (i.e. the text presented in the text field above the window) and press the Enter key. The initial and final states cannot be deleted. 5.2.4 Transducers. A transducer is a graph in which outputs can be associated with boxes. To insert an output, use the special character /. All characters to the right of it will be part of the output. Thus, the text one+two+three/number results in a box like the one in Figure 5.19. The output associated with a box is represented in bold text below it. [Figure 5.19: Example of a transducer] Weights. You can assign weights to the boxes of a transducer. Thus, when a sequence of tokens is matched by several paths, only the one with the highest weight will produce an output. After a locate, this will affect the concordance, in which the matched sequences of words will appear only
273. lly applied to this text file. The existing text dictionaries are not modified. Thus, if you have chosen to modify the current text, the modifications will be effective immediately. You can then start new searches on the text. WARNING: if you have chosen to apply your graph ignoring the transducer outputs, all occurrences will be erased from the text. 6.10.5 Extracting occurrences. To extract from a text all sentences containing matches, set the name of your output text file using the "Set File" button in the "Extract units" frame (Figure 6.61). Then click on "Extract matching units". Conversely, if you click on "Extract unmatching units", all sentences that do not contain any match will be extracted. 6.10.6 Comparing concordances. With the "Show differences with previous concordance" option, you can compare the current concordance with the previous one. The ConcorDiff program builds both concordances in text order and compares them line by line. The result is an HTML page that presents lines from the two concordances alternately, leaving an empty line when a match appears in only one concordance. Lines are greyed for the previous concordance and left with a white background for the current one. In each line, only matched tokens are coloured. You can click on each match to open the text at its position. Blue indicates that an utterance is common to the two concordances. Red indicates that a match is common to both concordances but
274. losed in the characters $ and ( ($ and ) for the end of a variable). In order to use a variable in a transducer output, its name must be surrounded by the character $ (cf. Figure 6.43). Variables are global. This means that you can define a variable in a graph and reference it in another, as illustrated in the graphs of Figure 6.43. [Figure 6.43: Definition of a variable in a subgraph] [Figure 6.44: Concordance obtained by application of graph TitleName] If the graph TitleName is ap
275. lready have a personal directory for a given language, Unitex won't try to copy system data into it. So, if an update has modified a resource file other than a dictionary, you will have to copy this file yourself, or delete your personal directory for this language and let Unitex rebuild it properly. Choosing the language allows Unitex to find certain files, for example the alphabet file. You can change the language at any time by choosing "Change Language" in the "Text" menu. If you change the language, the program will close all windows related to the current text, if there are any. The active language is indicated in the title bar of the graphical interface. 2.2 Text formats. Unitex works with Unicode texts. Unicode is a standard that describes a universal character code. Each character is given a unique number, which allows for representing texts without having to take into account the proprietary codes on different machines and/or operating systems. [Figure 2.1: Language selection when starting Unitex] Unitex uses a two-byte representation of the Unicode 3.0 standard, called Unicode Little-Endian (for more details, see [16]). Texts that come with Unitex are already in Unicode format. If you try to open a text that is not in Unicode, the program proposes to convert it (see Figure 2.2). This conversion is based on the current langu
276. <a href="116 124 2">extended</a>&nbsp;i&nbsp;<br>
&nbsp;extended <a href="125 127 2">in</a>&nbsp;ancient&nbsp;<br>
&nbsp;Scott S<a href="32 34 2">IN</a>&nbsp;THAT PL&nbsp;<br>
STRICT of <a href="61 66 2">merry</a>&nbsp;Engl&nbsp;<br>
S IN THAT <a href="40 48 2">PLEASANT</a>&nbsp;D&nbsp;<br>
&nbsp;which is <a href="84 91 2">watered</a>&nbsp;by&nbsp;<br>
</font></td></table></body>
</html>
Figure 14.2 shows the page that corresponds to this file. [Figure 14.2: Example of a concordance] 14.6.4 The diff.html file. The diff.html file is an HTML file that presents the differences between two concordances. This file is encoded in UTF-8. Here is an example of such a file (new lines have been introduced for presentation convenience):
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<style type="text/css">
a.blue {color:blue; text-decoration:underline;}
a.red {color:red; text-decoration:underline;}
a.green {color:green; text-decoration:underline;}
</style>
</head>
<body
277. [Figure 5.24: Box resulting from copying a list and applying contexts] 5.2.7 Special Symbols. The Unitex graph editor interprets the following symbols in a special manner: " + : / < > # \ Table 5.1 summarizes the meaning of these symbols for Unitex, as well as the ways to recognize these characters in texts.
Character | Meaning | Escape
" | quotation marks mark sequences that must not be interpreted by Unitex and whose case must be taken verbatim | \"
+ | separates different lines within the boxes | "+"
: | introduces a call to a subgraph | ":" or \:
/ | indicates the start of a transduction within a box | \/
< | indicates the start of a pattern or a meta | "<" or \<
> | indicates the end of a pattern or a meta | ">" or \>
# | prohibits the presence of a space | "#"
\ | escapes most of the special characters | \\
Table 5.1: Encoding of special characters in the graph editor. 5.2.8 Toolbar Commands. The toolbar above a graph contains shortcuts for certain commands and allows you to manipulate the boxes of a graph by using some tools. This toolbar may be moved by clicking on the rough zone. It may also be dissociated from the graph and appear in a separate window (see Figure 5.25). In this case, closing this window puts the toolbar back at its initial position. Each graph has its own toolbar.
278. lui elle moi en y ou qui que quoi rien au bord des larmes A par exemple AKK KH 0 CU <pers> <nombre> <pers> <nombre> <pers> <nombre> ls eusse dusse puisse fusse je L p 2 nombre genre nombre The # symbol indicates that the remainder of the line is a comment. A comment can appear at any place in the file. The file always starts with the word NAME followed by an identifier (french, for example). This is followed by the POS sections, one for each part of speech. Each section describes the structure of the lexical tags of the lexical entries belonging to the part of speech concerned. Each section is composed of 4 parts, which are all optional:
• flex: this part enumerates the inflectional codes belonging to the grammatical category. For example, the codes 1, 2, 3, which indicate the person of the entry, are relevant for pronouns but not for adjectives. Each line describes an inflectional attribute (gender, time, etc.) and is made up of the attribute name, followed by the character = and the values that it can take. For example, the following line declares an attribute pers that can take the values 1, 2 or 3: pers = 1 2 3
• cat: this part declares the syntactic and semantic attributes that can be assigned to the entries belonging to the part of speech concerned. Each line describes an attribute and the values which it c
279. 14.8 Dictionaries
14.8.1 The .bin files
14.8.2 The .inf files
14.8.3 Dictionary information file
14.8.4 The CHECK_DIC.TXT file
14.9 ELAG files
14.9.3 .elg files
14.9.4 .rul files
14.10 Tagger files
14.10.1 The corpus.txt file
14.10.2 The tagger data file
14.11.2 The system_dic.def file
14.11.3 The user_dic.def file
14.11.4 The (user).cfg and .unitex.cfg files
14.12 Cassys files
14.12.1 Cassys configuration files (.csc)
14.13 Various other files
14.13.1 The dlf.n, dlc.n, err.n and tags_err.n files
14.13.2 The stat_dic.n file
14.13.3 The stats.n file
14.13.4 The concord.n file
14.13.5 The concord_tfst.n file
14.13.6 Normalization rule file
14.13.7 Forbidden w
280. m recursive calls to subgraphs. Void loops due to transitions with the empty word can have two origins, of which the first is illustrated by Figure 6.8. This type of loop is due to the fact that a transition with the empty word cannot be eliminated automatically by Unitex, because it is associated with an output. Thus, the transition with the empty word of Figure 6.8 will not be suppressed and will cause a void loop. [Figure 6.8: Void loop due to a transition by the empty word with a transduction] The second category of loop by epsilon concerns calls to subgraphs that can recognize the empty word. This case is illustrated in Figure 6.9: if the subgraph Adj recognizes epsilon, there is a void loop that Unitex cannot detect. [Figure 6.9: Void loop due to a call to a subgraph that recognizes epsilon] The third possibility of void loops is related to recursive calls to subgraphs. Look at the graphs Det and DetCompose in Figure 6.10. Each of these graphs can call the other without reading any text. The fact that neither of these two graphs has labels between the initial state and the call to the subgraph is crucial. In fact, if there were at least one label different from epsilon between the beginning of the graph Det and the call to DetCompose, this would mean that the Unitex programs exploring the graph Det would have to read the pattern described by that label in the text before calling DetCompose recursively. In this
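The recursive case (Det calling DetCompose and vice versa without reading any text) amounts to a cycle in the relation "graph A can call graph B before consuming any input". The Python sketch below finds such a cycle with a depth-first search. It is a generic cycle-detection illustration, not the check that Unitex performs; the input dictionary and the name `find_void_cycle` are hypothetical.

```python
def find_void_cycle(calls_without_input):
    """Return one cycle of graph names as a list, or None.

    calls_without_input maps a graph name to the subgraphs it can call
    before reading any token (an assumed, precomputed relation).
    """
    WHITE, GREY, BLACK = 0, 1, 2   # unvisited, on the current path, done
    color = {}
    stack = []

    def visit(graph):
        color[graph] = GREY
        stack.append(graph)
        for callee in calls_without_input.get(graph, ()):
            state = color.get(callee, WHITE)
            if state == GREY:
                # callee is already on the current path: a void loop.
                return stack[stack.index(callee):] + [callee]
            if state == WHITE:
                cycle = visit(callee)
                if cycle:
                    return cycle
        stack.pop()
        color[graph] = BLACK
        return None

    for graph in list(calls_without_input):
        if color.get(graph, WHITE) == WHITE:
            cycle = visit(graph)
            if cycle:
                return cycle
    return None
```

If a label other than epsilon sat between the start of Det and its call to DetCompose, that call would not appear in the input relation, and no cycle would be reported.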
281. matches numbers that are not followed by th. The difference with positive right contexts is that, when Locate tries to match the expression described inside the context, reaching the context stop is considered a failure, because it would mean that a forbidden sequence has been matched. Conversely, if the context stop cannot be reached, Locate rewinds to the position pos and goes on exploring the grammar after the context end. [Figure 6.14: Using a negative right context] Right contexts can appear anywhere in the graph, including the beginning of the graph. Figure 6.15 shows a graph that matches an adjective in the right context of something that is not a past participle. In other words, this graph matches adjectives that are not ambiguous with past participles. [Figure 6.15: Matching an adjective that is not ambiguous with a past participle] This mechanism allows you to formulate complex patterns. For instance, the graph of Figure 6.16 matches a sequence of two simple nouns that is not ambiguous with a compound word. In fact, the <CDIC> pattern with its <<...>> filter matches a compound word with exactly one space, and the <N> pattern with its <<...>> filter matches a noun without space, that is to say, a simple noun. Thus, in the sentence "Black cats should like the town hall", this graph will match "Black cats", but not "town hall", which is a compound word.
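For readers more familiar with regular expressions, a negative right context behaves much like a negative lookahead. The Python sketch below is only an analogy at the character level (Unitex graphs work on tokens, and this is not how Locate is implemented); it finds numbers that are not followed by "th". The extra `\d` alternative in the lookahead prevents the engine from backtracking to a shorter prefix of the same number.

```python
import re

# A run of digits that is followed neither by another digit
# (which would mean we stopped inside a number) nor by "th".
NUMBER_NOT_TH = re.compile(r"\d+(?!\d|th)")

def numbers_not_followed_by_th(text):
    """Return the numbers in text that are not followed by 'th'."""
    return NUMBER_NOT_TH.findall(text)
```

As in the graph, the material inside the lookahead is tested but never consumed, so matching resumes right after the number itself.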
282. maton
7.9 The special case of Korean
8 Sequence Automaton
8.1 Sequences corpus
8.3 Search by approximation
9 Lexicon-grammar
9.1 Lexicon-grammar tables
9.2 Conversion of a table into graphs
9.2.1 Principle of parameterized graphs
9.2.2 Format of the table
9.2.4 Automatic generation of graphs
10 Text alignment
10.2 Aligning texts
10.3 Pattern matching
11 Compound word inflection
11.1 Multi-Word Units
11.1.1 Formal Description of the Inflectional Behavior of Multi-word Units
11.1.2 Lexicalized vs. Grammar-Based Approach to Morphological Descrip
11.2 Formalism for the Computational Morphology of MWUs
11.2.1 Morphological Features of the Language
11.2.2 Decomposition of a MWU into Units
11.2.3 Inflection paradigm of
283. maton. The text automaton makes explicit all possible lexical interpretations of the words. These different interpretations are the different entries present in the dictionary of the text. Figure 7.1 shows the automaton of the fourth sentence of the text Ivanhoe. You can see in Figure 7.1 that the word "Here" has three interpretations (adjective, adverb and noun), "haunted" two (adjective and verb), etc. All the possible combinations are expressed, because each interpretation of each word is connected to all the interpretations of the following and preceding words. In case of an overlap between a compound word and a sequence of simple words, the automaton contains a path that is labeled by the compound word, parallel to the paths that express the combinations of simple words. This is illustrated in Figure 7.2, where the compound word "courts of law" overlaps with a combination of simple words. By construction, the text automaton does not contain any loop. One says that the text automaton is acyclic. NOTE: The term "text automaton" is an abuse of language. In fact, there is an automaton for each sentence of the text. Therefore, the combination of all these automata corresponds to the automaton of the text. This is why we use the term "text automaton", even if this object is not manipulated as a global automaton, for practical reasons.
284. me character more than once. In this way, encoding the past participle using the code :PP would be exactly equivalent to using :P alone.
• /this is an example is a comment. Comments are optional and are introduced by the character /. These comments are left out when the dictionaries are compressed.
IMPORTANT REMARK: It is possible to use the full stop and the comma within a dictionary entry. In order to do this, they have to be escaped using the character \:
1\,000,one thousand.NUMBER
United Nations,U\.N\..ACRONYM
WARNING: Each character is taken into account within a dictionary line. For example, if you insert spaces, they are considered to be part of the information. In the following line:
hath,have.V:P3s /old form of has
the space that precedes the character / will be considered to be part of a 4-character inflectional code. It is possible to insert comments into a DELAF or DELAS dictionary by starting the line with a / character. Example:
/ 'English' designates a pool spin
English,.N+z3:s
Compound words with spaces or dashes. Certain compound words like acorn-shell can be written using spaces or dashes. In order to avoid duplicating the entries, it is possible to use the character =. When the dictionary is compressed, the Compress program checks, for each line, whether the inflected or canonical form contains a non-escaped = character. If this is the case, the program
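The escaping and whitespace rules above are easier to see in code. The Python sketch below parses one DELAF-style line of the assumed form inflected,lemma.GRAM+CODE:INFLECTION/comment, honouring backslash escapes. It is a simplified illustration, not the parser used by Unitex; the function names and the returned keys are hypothetical, and treating an empty lemma as identical to the inflected form follows the usual DELAF convention.

```python
def split_unescaped(text, sep):
    """Split text at the first sep that is not escaped by a backslash."""
    i = 0
    while i < len(text):
        if text[i] == "\\":
            i += 2                      # skip the escaped character
            continue
        if text[i] == sep:
            return text[:i], text[i + 1:]
        i += 1
    return text, None

def unescape(text):
    """Remove the backslash in front of each escaped character."""
    out, i = [], 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            i += 1
        out.append(text[i])
        i += 1
    return "".join(out)

def parse_delaf(line):
    """Parse 'inflected,lemma.GRAM+CODE:INFL/comment' (assumed layout)."""
    inflected, rest = split_unescaped(line, ",")
    if rest is None:
        raise ValueError("missing comma")
    lemma, rest = split_unescaped(rest, ".")
    if rest is None:
        raise ValueError("missing full stop")
    codes, comment = split_unescaped(rest, "/")
    gram_part, *inflections = codes.split(":")
    gram, *extra = gram_part.split("+")
    inflected = unescape(inflected)
    return {
        "inflected": inflected,
        # An empty lemma means it is identical to the inflected form.
        "lemma": unescape(lemma) or inflected,
        "gram": gram,
        "codes": extra,
        "inflections": inflections,     # kept verbatim, spaces included
        "comment": comment,
    }
```

Because nothing is stripped, the "hath" line yields the 4-character inflectional code "P3s " with its trailing space, which is exactly the pitfall described in the warning.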
285. [Figure 1.5: Xcode] Once on your computer, the Xcode package looks like the following. Double-click on the XCodeTools.mpkg icon to install all the programs. [Figure 1.6: Xcode package] 1.5.4 How to make all files visible on Mac OS. See http://www.macworld.com/article/51830/2006/07/showallfinder.html. Or try it right away. Type:
defaults write com.apple.Finder AppleShowAllFiles ON
Then restart the Finder:
killall Finder
[Figure 1.7: Restarting the Finder] To get back to the original configuration, type:
defaults write com.apple.Finder AppleShowAllFiles OFF
1.6 First use. If you are working on Windows, the program will ask you to choose a personal working directory, which you can change later in I
286. [Figure 10.4: Aligned sentences] [Figure 10.5: Adding a link] 10.3 Pattern matching. You can perform pattern matching queries on any of your texts by clicking on its "Locate" button. The first time you click, Unitex wi
287. [Figure 3.2: Looking up a word in a dictionary] You can also look up a word in several dictionaries by clicking on the "Lookup" button of the "DELA" menu. You can then select the dictionaries in which you want to look up the word you have entered. [Figure 3.3: Looking up a word in several dictionaries] 3.3 Checking dictionary format. When dictionaries become large, it becomes tiresome to check them by hand. Unitex contains the program CheckDic, which automatically checks the format of DELAF and DELAS dictionaries. This program verifies the syntax of the entries. For ea
288. mpile the selected grammars and create a file named elag.rul by default. If you have selected grammars in the right frame, you can search for patterns with them by clicking on the "Locate" button. This opens the "Locate Pattern" window and automatically enters a graph name ending with -conc.fst2. This graph corresponds to the if part of the grammar. You can thus obtain the occurrences of the text to which the grammar will apply. NOTE: The -conc.fst2 file used to locate the if part of a grammar is automatically generated when ELAG grammars are compiled by means of the "Compile" button. It is thus necessary to have your grammar compiled before searching using the "Locate" button. 7.3.3 Resolving Ambiguities. Once you have compiled your grammar into an elag.rul file, you can apply it to a text automaton. In the text automaton window, click on the "Apply Elag Rule" button. A dialog box will appear, asking for the .rul file to be used (see Figure 7.17). The default file is elag.rul. This will launch the Elag program, which will try to resolve the ambiguity. Once the program has finished, you can view the resulting automaton by clicking on the "Open Elag Frame" button. As you can see in Figure 7.18, the window is separated into two parts. The original text automaton can be seen at the top and the result at the bottom.
289. must include any data and utility programs needed for reproducing the package from it. However, as a special exception, the materials to be distributed need not include anything that is normally distributed (in either source or binary form) with the major components (compiler, kernel, and so on) of the operating system on which the executable runs, unless that component itself accompanies the executable. It may happen that this requirement contradicts the license restrictions of proprietary libraries that do not normally accompany the operating system. Such a contradiction means you cannot use both them and the Linguistic Resource together in a package that you distribute. You may not copy, modify, sublicense, link with, or distribute the Linguistic Resource except as expressly provided under this License. Any attempt otherwise to copy, modify, sublicense, link with, or distribute the Linguistic Resource is void, and will automatically terminate your rights under this License. However, parties who have received copies, or rights, from you under this License will not have their licenses terminated so long as such parties remain in full compliance. You are not required to accept this License, since you have not signed it. However, nothing else grants you permission to modify or distribute the Linguistic Resource or its derivative works. These actions are prohibited by law if you do not accept this License. Therefore, by modifying or
290. n characters. Here is the content of this file for the first sentence of Ivanhoe: 0 74 Ivanhoe 1 1 2524 by 1 1 3 34 Sir 1 1 4 64 Walter 1 1 5 5 Scott 1 1
14.5.7 The tfst_tags_by_freq.txt and tfst_tags_by_alph.txt files
These files contain all the tags that appear in the text automaton, sorted by frequency and in alphabetical order respectively.
14.6 Concordances
14.6.1 The concord.ind file
The concord.ind file is the index of the occurrences found by either Locate or LocateTfst during the application of a grammar. It is a text file that contains the starting and ending positions of each occurrence, possibly accompanied by a sequence of letters if the construction of the concordance took into account the possible transducer outputs of the grammar. Here is an example of such a file:
#M
59.0.0 63.3.0 the[ADJ= greater] part
67.0.0 71.4.0 the beautiful hills
87.0.0 91.3.0 the pleasant town
123.0.0 127.4.0 the noble seats
157.0.0 161.5.0 the fabulous Dragon
189.0.0 193.3.0 the Civil Wars
455.0.0 459.11.0 the feeble interference
463.0.0 467.6.0 the English Council
566.0.0 570.10.0 the national convulsions
590.0.0 594.5.0 the inferior gentry
626.0.0 630.11.0 the English constitution
696.0.0 700.4.0 the petty kings
813.0.0 817.5.0 the certain hazard
896.0.0 900.5.0 the great Barons
938.0.0 942.3.0 the very edge
Th
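As an illustration of the index layout described above, here is a minimal Python sketch that reads one line of such a file. It assumes that each position is written as three dot-separated integers (for example 59.0.0) and that any transducer output follows the second position after a single space; that punctuation is an assumption, and the names `IndexEntry` and `parse_index_line` are hypothetical, not part of Unitex.

```python
import re
from typing import NamedTuple, Optional, Tuple

Position = Tuple[int, int, int]


class IndexEntry(NamedTuple):
    start: Position        # three integers of the starting position
    end: Position          # three integers of the ending position
    output: Optional[str]  # text after the positions, if any


# Assumed layout: "A.B.C D.E.F optional output text"
_LINE = re.compile(r"^(\d+)\.(\d+)\.(\d+) (\d+)\.(\d+)\.(\d+)(?: (.*))?$")


def parse_index_line(line: str) -> Optional[IndexEntry]:
    """Return the parsed entry, or None for a header or unrecognized line."""
    match = _LINE.match(line.rstrip("\n"))
    if match is None:
        return None
    numbers = [int(group) for group in match.groups()[:6]]
    return IndexEntry(tuple(numbers[:3]), tuple(numbers[3:]), match.group(7))
```

The sketch deliberately does not interpret the three integers; it only separates the two positions from the text that follows them.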
291. n entry does not take any inflectional feature by means of a line containing only the _ character (underscore). Consider, for example, the following lines extracted from the section describing the verbs: W K <genre> <nombre> They declare that verbs in the infinitive (indicated by the W code) do not have other inflectional features, while forms in the past participle (K code) are also assigned a gender and a number.
Description of the inflectional codes
The principal function of the discr part is to divide a part of speech into subcategories having similar inflectional behavior. These subcategories are then used to facilitate writing the complete part. For the legibility of ELAG grammars, it is desirable that the elements of the same subcategory all have the same inflectional behavior; in this case, the complete part is made up of only one line per subcategory. Consider, for example, the following lines from the pronoun description: Pdem genre nombre Ppvll genre nombre pers PpvPr These lines mean that:
- all the demonstrative pronouns (<PRO+Pdem>) have only a gender and a number;
- clitic pronouns in the nominative (<PRO+Ppv11>) are labelled grammatically in person, gender and number;
- the prepositional pronouns (en, y) do not have any inflectional feature.
All combinations of inflectional features and discriminant subcategories which appear in t
292. n install it safely on your computer. If you are not familiar with software installation, you should ask for help. As Windows would put it: "Contact your Administrator". It works only on Intel-based Macintosh computers. If you do not know whether your Macintosh is an Intel machine or not, select the item About this Macintosh in the Apple menu, then click on More info, and in the next panel click on Hardware, as shown in Figure 1.2.
[Figure 1.2: Getting information on your computer]
Installation
Some of the following installation steps require the use of the Terminal application. On MacOS, the Terminal application is located in /Applications/Utilities/Terminal.app. When you start this application (double-click), a window is displayed. You will have to type in this window some commands for moving editin
293. n is fundamental for the use of Unitex, but limits the optimization of search operations for patterns. Regardless of the tokenization mode, newlines in a text are replaced by spaces. Tokenization is done by the Tokenize program. This program creates several files that are saved in the text directory:
- tokens.txt contains the list of tokens in the order in which they are found in the text;
- text.cod contains an integer array; every integer corresponds to the index of a token in the file tokens.txt;
- tok_by_freq.txt contains the list of tokens sorted by frequency;
- tok_by_alph.txt contains the list of tokens in alphabetical order;
- stats.n contains some statistics about the text.
[Figure 2.11: Normalization of English verbal contractions]
Tokenizing the text "A cat is a cat." returns the following list of tokens: A, SPACE, cat, is, a, and the final period. You will observe that tokenization is case-sensitive (A and a are two distinct tokens) and that each token is listed only once. Numbering these tokens from 0 to 5, the text can be represented by a sequence of integers, as described in the following table:
Token number:         0  1      2    1      3   1      4  1      2    5
Corresponding token:  A  SPACE  cat  SPACE  is  SPACE  a  SPACE  cat  .
Table 2.1: Representation of the text "A cat is a cat."
For more details see chapt
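The token list and integer sequence of Table 2.1 can be reproduced with a short Python sketch. It is not the Tokenize program: it assumes a simplified rule (a token is either a run of letters or one single other character) and ignores alphabet files and language-specific modes. The name `encode` is hypothetical.

```python
import re

# Simplifying assumption: a token is a run of letters, or any single
# character that is not a letter (space, punctuation, digit, ...).
_TOKEN = re.compile(r"[^\W\d_]+|.", re.DOTALL)


def encode(text):
    """Return (tokens, codes): distinct tokens in order of first
    appearance, and the text as a sequence of token indices."""
    # Newlines are replaced by spaces, as stated in the manual.
    text = text.replace("\r\n", " ").replace("\n", " ")
    tokens = []   # plays the role of tokens.txt
    index = {}    # token -> position in `tokens`
    codes = []    # plays the role of text.cod
    for token in _TOKEN.findall(text):
        if token not in index:
            index[token] = len(tokens)
            tokens.append(token)
        codes.append(index[token])
    return tokens, codes


tokens, codes = encode("A cat is a cat.")
print(tokens)  # ['A', ' ', 'cat', 'is', 'a', '.']
print(codes)   # [0, 1, 2, 1, 3, 1, 4, 1, 2, 5]
```

Because the lookup is case-sensitive, "A" and "a" receive different indices, which matches the observation made for the example sentence.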
294. n of a sequence, including the first and last, can be replaced by a wildcard. The graphs produced using wildcards contain many erroneous sequences and must be checked against corpora with a locate operation in order to keep only the relevant sequences. These sequences can then be used to produce a new graph that you may want to keep. The graph in figure 8.8 was produced with replacement of 1 token and with the beautify option activated.
[Figure 8.8: Automaton with one replacement allowed]
Chapter 9 Lexicon-grammar
The tables of lexicon-grammar are a compact way of representing syntactic properties of the elements of a language. It is possible to automatically construct local grammars from such tables by means of parameterized graphs. The first part of this chapter presents the formalism of tables. The second part describes parameterized graphs and a mechanism for automatically lexicalizing them with lexicon-grammar tables.
9.1 Lexicon-grammar tables
Lexicon-grammar is a methodology developed by Maurice Gross and the LADL team ([9], [10], [38], [51], [49], [50], [48], [47], [44], [43], [42], [41], [40]
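The wildcard replacement described at the start of this excerpt (any token of a sequence, first and last included, may be replaced) can be sketched in Python as follows. This only enumerates the variants of one sequence; it does not build or beautify a graph as Seq2Grf does, and the function name `wildcard_variants` is hypothetical.

```python
from itertools import combinations


def wildcard_variants(tokens, max_replacements=1, wildcard="<TOKEN>"):
    """Return every variant of `tokens` in which at most
    `max_replacements` tokens are replaced by the wildcard."""
    variants = []
    for count in range(max_replacements + 1):
        # Choose which positions to replace, first and last included.
        for positions in combinations(range(len(tokens)), count):
            variant = list(tokens)
            for position in positions:
                variant[position] = wildcard
            variants.append(variant)
    return variants


for variant in wildcard_variants(["soon", "as", "possible"]):
    print(" ".join(variant))
```

With one replacement allowed, a sequence of n tokens yields n + 1 variants, which gives a feel for why the resulting graphs match many sequences that were never in the original corpus and need to be checked against a text.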
295. nal Morphology of MWUs
In [85], a formalism was proposed for describing the morphological paradigms of MWUs. It is based on studies of English, Polish and French, and was further tested for Serbian ([57]). It consists of a language-independent kernel, which is to be completed by a set of morphological elements characteristic of the given language. In this section we give an in-depth description of this formalism.
11.2.1 Morphological Features of the Language
When processing MWUs of a given language, we have to provide some general data about that language. These data are included in two text files. The Morphology.txt file gives the morphological classes (noun, adjective, ...), categories (number, gender, case, ...) and values (masculine, feminine, singular, nominative, ...). Consider the following example:
Polish
<CATEGORIES>
Nb: sing, pl
Case: Nom, Gen, Dat, Acc, Inst, Loc, Voc
Gen: masc_pers, masc_anim, masc_inanim, fem, neu
<CLASSES>
noun: (Nb,<var>),(Case,<var>),(Gen,<fixed>)
adj: (Nb,<var>),(Case,<var>),(Gen,<var>)
adv:
The above file says that for Polish three inflection categories are considered: the number (Nb), the case (Case) and the gender (Gen). Each category is given an exhaustive list of its possible values (singular and plural for number, etc.). Further, each morphological class is described with respect to the categories it inflects for and those that are fixed for it
296. [Figure 2.7: Text Menu]
2.5 Preprocessing a text
After a text is selected, Unitex offers to preprocess it. Text preprocessing consists of the following operations: normalization of separators, splitting into sentences, normalization of non-ambiguous forms, tokenization and application of dictionaries. If you choose not to preprocess the text, it will nevertheless be normalized and tokenized, since these operations are necessary for all further Unitex operations. It is always possible to carry out the preprocessing later by clicking on Preprocess Text in the Text menu.
[Figure 2.8: Opening a Unicode text]
[Figure: Preprocessing & Lexical parsing dialog]
297. nces of the original corpus and what kind of joker can be used. The use of jokers is detailed in section 8.3;
- choose the directory where the graph will be saved.
[Figure 8.4: The sequence automaton menu]
[Figure 8.5: Options of the sequence automaton menu]
Figures 8.6 and 8.7 show the graphs without wildcards, produced without and with the beautify option.
[Figure 8.6: Automaton without beautify option]
[Figure 8.7: Automaton with beautify option]
8.3 Search by approximation
When you perform a locate operation on a text using a graph produced with the Seq2Grf program, the matched occurrences contain only sequences present in the original sequence corpus
298. ncordance 81 148 257 comparison 150 Concordance frame 82 Conservation of better paths 284 Console 252 Consonant skeleton 62 Constraints on grammars 118 Context 65 Context free languages 90 Contexts 122 concordance 81 148 257 copy of a list 102 INDEX Copy 99 102 104 Copying lists 102 Corpus see Text Creating a Box 90 Creating log files 252 Cut 104 Degree of ambiguity 155 DELA 37 47 DELAC 47 DELACE 47 DELAF 47 50 65 308 DELAS 47 50 Derivation 89 Dictionaries application of 262 applying 42 64 automatic inflection 57 274 checking 54 codes used within 51 comments in 48 compressing 256 compression 62 277 contents 51 default selection 44 DELAC 47 DELACE 47 DELAF 47 50 65 256 308 DELAS 47 50 filters 64 format 47 granularity 155 lookup 53 of the text 73 153 priority 64 refer to 73 reference to information in the 116 sorting 55 text 43 verification 256 Dictionary entry variables 131 Dictionary graphs 65 Dictionary information file 311 Directory 351 personal working 27 text 37 251 ELAG 116 163 ELAG tag sets 169 Epsilon see lt E gt Equivalent characters 55 Error detection in graphs 122 265 268 Errors in graphs 122 265 268 Evaluation of the ambiguity rate 169 Exclusion of grammatical and semantic codes 73 Exploring the paths of a grammar 132 External programs BuildKrMwuDic 254 Cassys 254 CheckDic 54 256 311 C
299. nd only forms that are sequences of letters. Thus, the mask <!DIC> allows you to find all unknown words in a text. These unknown forms are mostly proper names, neologisms and spelling errors. The negation of a dictionary mask like <V:G> will match any word except those that are matched by this mask. For instance, <!V:G> will not match the word being, even if there are homonymic non-verbal entries in the dictionaries:
being,.A
being,.N+Abst:s
being,.N+Hum:s
[Figure: concordance window for the Ivanhoe corpus (concord.html)]
300. ne on Figure 11.4, in the sense that it generates the same inflected forms for the same MWUs. However, this time a single path represents both the singular and the plural form. That is possible due to the unification variable $n, which may be instantiated to any value of the domain of its category (Nb); here $n=s or $n=p. The instantiation is unique for all elements on a path: if we fix the singular value for the first constituent, the same value has to be set for the third one, as well as for the whole MWU. Similarly, if we fix $n to p while processing the first node, it has to remain p until the end of the path.
[Figure 11.5: Inflection graph for bateau mouche with a unification variable; the final output is <Gen=m; Nb=$n>]
The inflection graph on Figure 11.5 applies to most kinds of French compounds of types Noun Noun and Noun Adjective (bateau mouche, ange gardien, circuit séquentiel, etc.) which are of masculine gender. That is because the output of the final node contains Gen=m. For all compounds of the same types but of feminine gender (e.g. main courante, moissonneuse batteuse, etc.), a new graph has to be created which is identical to Figure 11.5 up to the final output containing <Gen=f; Nb=$n>. That is not very intuitive, since circuit séquentiel and main
(Footnote: Up to the case when single constituents appearing in the lemma of a MWU are already in plural, as in cross-roads.)
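The effect of a unification variable can be illustrated with a small Python sketch: the variable is bound once per traversal of the path, and every constituent as well as the output of the whole compound reuses that single binding. This is not the MultiFlex implementation. The table `FORMS`, the function `inflect_compound` and the output string layout are hypothetical, and the toy lexicon contains only the forms needed for bateau mouche.

```python
# Domain of the category Nb, as in the example: singular and plural.
NB_DOMAIN = ["s", "p"]

# Hypothetical toy lexicon: (lemma, number) -> inflected form.
FORMS = {
    ("bateau", "s"): "bateau",
    ("bateau", "p"): "bateaux",
    ("mouche", "s"): "mouche",
    ("mouche", "p"): "mouches",
}


def inflect_compound(constituents, separator, gender):
    """Generate one inflected form per value of the shared variable n."""
    results = []
    for n in NB_DOMAIN:
        # n is fixed for the whole path: every constituent uses the
        # same value, and so does the output of the final node.
        parts = [FORMS[(lemma, n)] for lemma in constituents]
        results.append((separator.join(parts), f"Gen={gender};Nb={n}"))
    return results


for form, features in inflect_compound(["bateau", "mouche"], " ", "m"):
    print(form, features)
```

Since the binding is shared, mixed forms such as a singular first constituent with a plural last one are never produced. Passing the gender as a parameter is only a convenience of this sketch; it does not model the way inflection graphs themselves handle gender.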
301. ne do not appear simple words that have been matched in the tags.ind file.
14.7.4 tags.ind
This file has the same format as a concord.ind file obtained in MERGE or REPLACE mode, but its header is #T. Note that the outputs DO NOT BEGIN with a slash.
14.8 Dictionaries
The compression of the DELAF dictionaries by the Compress program produces two files: a .bin file that represents the minimal automaton of the inflected forms of the dictionaries, and a .inf file that contains the compressed forms required for the construction of the dictionaries from the inflected forms. This section describes the format of these two file types, as well as the format of the CHECK_DIC.TXT file, which contains the result of the verification of a dictionary.
14.8.1 The .bin files
A .bin file is a binary file that represents an automaton. The first 4 bytes of the file represent an integer that indicates the size of the file in bytes. The states of the automaton are encoded in the following way:
- the first two bytes indicate whether the state is final, as well as the number of its outgoing transitions. The highest bit is 0 if the state is final, 1 if not. The other 15 bits encode the number of transitions. Example: a non-final state with 17 transitions is encoded by the hexadecimal sequence 8011;
- if the state is final, the three following bytes encode the index in the .inf file of the compressed form to be used to reconstru
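The state header described above can be decoded with a few lines of Python. The sketch assumes big-endian byte order, which is what the example sequence 8011 suggests but is not stated explicitly in this excerpt, and it stops before the transitions themselves, whose encoding is not covered here. The function name `decode_state_header` is hypothetical.

```python
def decode_state_header(data: bytes, offset: int = 0):
    """Decode the header of one automaton state.

    Returns (is_final, n_transitions, inf_index, next_offset), where
    inf_index is None for a non-final state.
    Assumption: multi-byte integers are stored big-endian.
    """
    if len(data) < offset + 2:
        raise ValueError("truncated state header")
    word = int.from_bytes(data[offset:offset + 2], "big")
    is_final = (word & 0x8000) == 0   # highest bit 0 means final
    n_transitions = word & 0x7FFF     # remaining 15 bits
    next_offset = offset + 2
    inf_index = None
    if is_final:
        if len(data) < next_offset + 3:
            raise ValueError("truncated .inf index")
        # Three bytes: index of the compressed form in the .inf file.
        inf_index = int.from_bytes(data[next_offset:next_offset + 3], "big")
        next_offset += 3
    return is_final, n_transitions, inf_index, next_offset


# The manual's example: a non-final state with 17 transitions.
print(decode_state_header(bytes.fromhex("8011")))  # (False, 17, None, 2)
```

Returning the offset of the next unread byte makes it possible to continue decoding from the same buffer once the layout of the transitions is known.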
302. ne is made of 3 integers X Y Z followed by the content of the occurrence. X is the sentence number, starting from 1. Y and Z are the starting and ending positions of the occurrence in the sentence, given in characters;
- -m TXT/--merge=TXT: indicates to the program that it is supposed to produce a modified version of the text and save it in a file named TXT (see section 6.10.4).
Other options:
- -d DIR/--directory=DIR: indicates to the program that it must not work in the same directory as <index> but in DIR;
- -a ALPH/--alphabet=ALPH: alphabet file used for sorting;
- -T/--thai: option to use for Thai concordances.
The result of the application of this program is a file called concord.txt if the concordance was constructed in text mode, a file called concord.html if the output mode was html, glossanet or script, and a text file with the name defined by the user of the program if the program has constructed a modified version of the text. In html mode, the occurrence is coded as a hypertext link. The reference associated with this link is of the form <a href="X Y Z">. X and Y represent the beginning and ending positions of the occurrence in characters in the file text_name.snt. Z represents the number of the sentence in which the occurrence was found.
13.10 ConcorDiff
ConcorDiff [OPTIONS] <concor1> <concor2>
This program takes two concordance fil
303. nfo > Preferences > Directories. To create a directory, click on the icon showing a file (see figure 1.10). If you are using Linux or MacOS, the program will automatically create a unitex directory in your HOME directory. This directory allows you to save your personal data. For each language that you use, the program copies the root directory of that language to your personal directory, except the dictionaries. You can then modify your copy of the files without risking damage to the system files.
[Figure 1.8: First use under Windows]
[Figure 1.9: First use under Linux]
1.7 Adding new languages
There are two different ways to add languages. If you want to add a language that is to be accessible by all users, you have to copy the corresponding directory to the Unitex system
[Screenshot: Choose your private directory] Figure 1
304. ngue Française, 86, 1990. 11.1
[4] Laurie BAUER. English Word Formation. Cambridge University Press, 1983. 11.1
[5] Emile BENVENISTE. Fondements syntaxiques de la composition nominale. Formes nouvelles de la composition nominale. Pages 145-176. Gallimard, Paris, 1974. 11.1
[6] Olivier BLANC and Anne DISTER. Automates lexicaux avec structure de traits. In Actes RECITAL 2004, 2004. 7.3
[7] Xavier BLANCO. Noms composés et traduction français-espagnol. Lingvisticæ Investigationes, 21(1), 1997. Amsterdam-Philadelphia: John Benjamins Publishing Company. 3.8
[8] Xavier BLANCO. Les dictionnaires électroniques de l'espagnol (DELASs et DELACs). Lingvisticæ Investigationes, 23(2), 2000. Amsterdam-Philadelphia: John Benjamins Publishing Company. 3.8
[9] Jean-Paul BOONS, Alain GUILLET, and Christian LECLÈRE. La structure des phrases simples en français : classes de constructions transitives. Technical report, LADL, Paris, 1976. 9.1
[10] Jean-Paul BOONS, Alain GUILLET, and Christian LECLÈRE. La structure des phrases simples en français : constructions intransitives. Droz, Genève, 1976. 9.1
[11] Firefox. Web browser. http://www.mozilla.com/firefox 4.8.2
[12] Netscape. Web browser. http://www.netscape.com 4.8.2
[13] Pierre CADIOT. À entre deux noms : vers la composition nominale. Lexique, 11:193-240, 1992. 11.1
[14] Folker CAROLI. Les verbes transitifs à complément de lieu en allemand. Lingvisticæ Investigati
305. ni s trazxni trazxni sudija NC_AXNF dija NC_AXNF N sudija NC_AXNF udija NC_AXNF sudija NC_AXNF sudija NC_AXNF udija NC_AXNF sudija NC_AXNF sudija NC_AXNF trazxnim sudijama istrazxni sudija NC_AXNF N Comp vfp trazxne sudije istrazxni sudija NC_AXNF N trazxne sudije istrazxni sudija NC_AXNF N trazxnoga sudiju is trazxnog sudiju istrazxni sudija NC_AXNF trazxni sudija trazxnoga sudij trazxnog sudije ist trazxnomu sudij trazxnome sudij trazxnom sudiji ist trazxnomu sudij trazxnome sudij trazxnom sudiji istrazxni sudija NC_AXNF Comp 2vfw Comp 4vfw Comp ms4v N Comp ms4v Comp 1vms Comp 2vms Comp 2vms Comp 3vms Comp 3vms N Comp 3vms Comp 7vms Comp 7vms Comp 7vms 233 istrazxni sudijo istrazxni sudija NC_AXNF N Comp 5vms istrazxni sudija istrazxni sudija NC_AXNF N Comp 5vms istrazxnim sudijom istrazxni sudija NC_AXNF N Comp 6vms Dinkicx Mirosinka Mirosinka Dinkicx NC_ImePrezime N Comp Hum PersName slvf Dinkicx Mirosinke Mirosinka Dinkicx NC_ImePrezime N Comp Hum PersName s2vf Dinkicx Mirosinki Mirosinka Dinkicx NC_ImePrezime N Comp Hum PersName s3vf Dinkicx Mirosinku Mirosinka Dinkicx NC_ImePrezime N Comp Hum PersName s4vf Dinkicx Mirosinka Mirosinka Dinkicx NC_ImePrezime N Comp Hum PersName s5vf Dinkicx Mirosinkom Mirosinka Dink
306. nsducers. If a graph contains two paths that are ambiguous, one can create two graphs containing one path each. The first graph will contain the safest path, the second graph the least safe path. Cassys keeps all the texts created by each graph of the cascade. This can be useful to test, debug or check the different results of the cascade. It makes it possible to correct errors in the order of the graphs or to find errors in the writing of the graphs. A good idea is to place the name of the transducer recognizing a pattern in the outputs of the graphs; you can then see in the final results the name of the graph by which a pattern was recognized.
12.2.8 Files resulting from CasSys
If you apply a cascade on the text named example.txt, two folders are created: example_snt and example_csc. The most important files produced in example_csc are the results obtained by each graph. These files are named according to the number of the graph which produced them: if the third graph finds a pattern, the result will be the file named example_3.snt.
Chapter 13 Use of external programs
This chapter presents the use of the different programs of which Unitex is composed. These programs, which can be found in the Unitex/App folder, are automatically called by the interface (in fact, UnitexToolLogger is actually called, in order to significantly reduce the size of the downloadable zip file). It is possi
307. ntries not thus excluded. In such case, this License incorporates the limitation as if written in the body of this License. The Free Software Foundation may publish revised and/or new versions of the Lesser General Public License for Linguistic Resources from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. Each version is given a distinguishing version number. If the Linguistic Resource specifies a version number of this License which applies to it and "any later version", you have the option of following the terms and conditions either of that version or of any later version published by the Free Software Foundation. If the Linguistic Resource does not specify a license version number, you may choose any version ever published by the Free Software Foundation. If you wish to incorporate parts of the Linguistic Resource into other free programs whose distribution conditions are incompatible with these, write to the author to ask for permission.
NO WARRANTY
BECAUSE THE LINGUISTIC RESOURCE IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE LINGUISTIC RESOURCE, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE LINGUISTIC RESOURCE "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDIN
308. o be produced. The program codes each unit as a whole. The list of units is saved in a text file called tokens.txt. The sequence of codes representing the units now allows the coding of the text. This sequence is saved in a binary file named text.cod. The program also produces the following four files:
- tok_by_freq.txt: text file containing the units sorted by frequency;
- tok_by_alph.txt: text file containing the units sorted alphabetically;
- stats.n: text file containing information on the number of sentence separators, the number of units, the number of simple words and the number of numbers;
- enter.pos: binary file containing the list of newline positions in the text. The coded representation of the text does not contain newlines, but spaces. Since a newline counts as two characters and a space as a single one, it is necessary to know where newlines occur in the text when the positions of occurrences located by the Locate program are to be synchronized with the text file. The Concord program uses enter.pos for this purpose, so that an occurrence clicked in a concordance is correctly selected in the text.
All produced files are saved in the text directory.
13.43 TrainingTagger
TrainingTagger [OPTIONS] <txt>
This program automatically generates two tagger data files from a tagged corpus text file. They are
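The synchronization problem that enter.pos solves can be sketched in Python. The sketch makes simplifying assumptions that are not taken from the manual: newline positions are given as sorted character offsets in the space-normalized text, and every newline in the original file is the two-character sequence CR LF. The actual layout of enter.pos is not described in this excerpt, and the name `to_file_offset` is hypothetical.

```python
from bisect import bisect_left


def to_file_offset(position, newline_positions):
    """Map an offset in the space-normalized text to an offset in the
    original file.

    newline_positions: sorted offsets, in the normalized text, of the
    spaces that stand for a newline (assumed here to be CR LF).
    """
    # Each newline located before `position` occupied two characters in
    # the file but only one in the normalized text, so it shifts the
    # file offset by one.
    return position + bisect_left(newline_positions, position)


original = "ab\r\ncd\r\nef"
normalized = original.replace("\r\n", " ")   # "ab cd ef"
newlines = [2, 5]                            # offsets of the two spaces
print(to_file_offset(3, newlines))           # 4, the offset of "c"
print(to_file_offset(6, newlines))           # 8, the offset of "e"
```

A binary search keeps each lookup logarithmic in the number of newlines, which matters when many occurrences have to be mapped back to the file.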
309. o compute square roots has a purpose that is entirely well-defined independent of the application. Therefore, Subsection 2d requires that any application-supplied function or table used by this function must be optional: if the application does not supply it, the square root function must still compute square roots. These requirements apply to the modified work as a whole. If identifiable sections of that work are not derived from the Library, and can be reasonably considered independent and separate works in themselves, then this License, and its terms, do not apply to those sections when you distribute them as separate works. But when you distribute the same sections as part of a whole which is a work based on the Library, the distribution of the whole must be on the terms of this License, whose permissions for other licensees extend to the entire whole, and thus to each and every part regardless of who wrote it. Thus, it is not the intent of this section to claim rights or contest your rights to work written entirely by you; rather, the intent is to exercise the right to control the distribution of derivative or collective works based on the Library. In addition, mere aggregation of another work not based on the Library with the Library (or with a work based on the Library) on a volume of a storage or distribution medium does not bring the other work under the scope of this License. 3. You may opt to apply the terms of the ordinary GNU General
310. of the Library, you may distribute the object code for the work under the terms of Section 6. Any executables containing that work also fall under Section 6, whether or not they are linked directly with the Library itself. 6. As an exception to the Sections above, you may also combine or link a "work that uses the Library" with the Library to produce a work containing portions of the Library, and distribute that work under terms of your choice, provided that the terms permit modification of the work for the customer's own use and reverse engineering for debugging such modifications. You must give prominent notice with each copy of the work that the Library is used in it and that the Library and its use are covered by this License. You must supply a copy of this License. If the work during execution displays copyright notices, you must include the copyright notice for the Library among them, as well as a reference directing the user to the copy of this License. Also, you must do one of these things: a) Accompany the work with the complete corresponding machine-readable source code for the Library including whatever changes were used in the work (which must be distributed under Sections 1 and 2 above); and, if the work is an executable linked with the Library, with the complete machine-readable "work that uses the Library", as object code and/or source code, so that the user can modify the Library and then relink to
311. of transducer of a CasSys cascade, we use a text file (.csc file) in which each line contains the path to a transducer followed by the output policy (merge or replace) to be applied to this transducer. The generic format of a line of a csc file is:
Name_and_path_of_transducer Merge
Here is an example of a cascade file (csc):
C:\apps\my_unitex\French\Graphs\grf1.fst2 Merge
C:\apps\my_unitex\French\Graphs\grf2.fst2 Replace
14.13 Various other files
For each text, Unitex creates multiple files containing information designed to be displayed in the graphical interface. This section describes these files and some others.
14.13.1 The dlf.n, dlc.n, err.n and tags_err.n files
These files are text files that are stored in the text directory. They contain the number of lines of the dlf, dlc, err and tags_err files respectively. These numbers are followed by a newline.
14.13.2 The stat_dic.n file
This file is a text file in the directory of the text. It has three lines that contain the number of lines of the dlf, dlc and err files.
14.13.3 The stats.n file
This file is in the text directory and contains a line of the following form:
3949 sentence delimiters, 169394 (9428 diff) tokens, 73788 (9399) simple forms, 438 (10) digits
The numbers indicated are interpreted in the following way:
- sentence delimiters: number of sentence separators ({S});
- tokens: total number of lexical units in th
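The line format of a cascade file (a path followed by an output policy) lends itself to a short Python sketch. It assumes that the policy is the last whitespace-separated word of the line and that the path may optionally be wrapped in double quotes; both points are assumptions, and the function name `parse_cascade_line` and the example paths are hypothetical.

```python
def parse_cascade_line(line):
    """Split one line of a cascade file into (path, policy).

    Returns None for a blank line. The policy is normalized to
    "merge" or "replace".
    """
    stripped = line.strip()
    if not stripped:
        return None
    # The policy is taken to be the last word, so that a path
    # containing spaces is still read correctly.
    parts = stripped.rsplit(None, 1)
    if len(parts) != 2:
        raise ValueError(f"missing output policy: {line!r}")
    path, policy = parts[0].strip().strip('"'), parts[1].lower()
    if policy not in ("merge", "replace"):
        raise ValueError(f"unknown output policy: {parts[1]!r}")
    return path, policy


print(parse_cascade_line("Graphs/first.fst2 Merge"))
print(parse_cascade_line('"My Graphs/second.fst2" Replace'))
```

Splitting from the right is a design choice of this sketch: it avoids having to decide how spaces inside a path are escaped, which this excerpt does not specify.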
312. ol for writing it. Whether that is true depends on what the program that uses the Linguistic Resource does. You may copy and distribute verbatim copies of the Linguistic Resource as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice and disclaimer of warranty; keep intact all the notices that refer to this License and to the absence of any warranty; and distribute a copy of this License along with the Linguistic Resource. You may charge a fee for the physical act of transferring a copy, and you may at your option offer warranty protection in exchange for a fee. You may modify your copy or copies of the Linguistic Resource or any portion of it, thus forming a work based on the Linguistic Resource, and copy and distribute such modifications or work under the terms of Section 1 above, provided that you also meet all of these conditions: a) The modified work must itself be a linguistic resource. b) You must cause the files modified to carry prominent notices stating that you changed the files and the date of any change. c) You must cause the whole of the work to be licensed at no charge to all third parties under the terms of this License. These requirements apply to the modified work as a whole. If identifiable sections of that work are not derived from the Linguistic Resource, and can be reasonably considered independent and separate works in them
313. ome special options dedicated to HTML files. You can use a combination of the following options:
- --dnc (Decode Normal Chars): things like &eacute; &#120; and &#xF8; will be decoded as the single equivalent unicode character, except if it represents an HTML control character;
- --dcc (Decode Control Chars): &lt; &gt; &amp; and &quot; will be decoded as <, >, & and the quote (the same for their decimal and hexadecimal representations);
- --eac (Encode All Chars): every character that is not supported by the output encoding will be encoded as a string like &#457;
- --ecc (Encode Control Chars): <, >, & and the quote will be encoded by &lt; &gt; &amp; and &quot;
All HTML options are deactivated by default.
Other options:
- -m/--main-names: prints the list of the encoding main names;
- -a/--aliases: prints the list of the encoding aliases;
- -A/--all-infos: prints all the information about all the encodings;
- -i X/--info=X: prints all the information about the encoding X.
The encodings can take values in the following list (non-exhaustive, see below): FRENCH, ENGLISH, GREEK, THAI, CZECH, GERMAN, SPANISH, PORTUGUESE, ITALIAN, NORWEGIAN, LATIN (default latin code page), windows-1252, Microsoft Windows 1250 Central Europe, Microsoft Windows 1257 Baltic, Microsoft Windows 1251 Cyrillic, Microsoft Windows
315. on a text automaton, one should pay attention to tagset and morphology. The tagset of the model must be identical to that of the text automaton. For example, if the statistical model has been computed with the tag DET for the word "the", the corresponding tag in the text automaton must be DET. Unitex provides functionality to modify word forms in the text, for example to normalize "doesn't" into "does not". Applying replacing or normalization graphs could cause some morphological modifications on words. If such processing is applied to the text, it must have been applied to the training corpus as well. If these rules are not respected, the tagger might not be able to select the correct path in the text automaton.
The TrainingTagger program produces two variants of the tagger. The first one prunes transitions on the basis of grammatical, semantic, syntactic and inflectional codes (for example, the DET+Ddef:s versus the DET+Ddef:p). The second one prunes transitions on the basis of grammatical, semantic and syntactic codes only (that DET+Ddem versus that PRO+Pdem). This second option makes the training quicker, and inflectional features are not needed for all applications.
CHAPTER 7 TEXT AUTOMATON
7.4.2 Use of the Tagger
In order to linearize the text automaton, you have to select the option "Linearize with the Tagger" in the configuration window for the construction of the text automaton (cf. figure 7.25). With this option the program will l
316. on of Fred W Householder s paper analysis synthesis and improvisation In Sture Allen editor Text Processing Proceedings of Nobel Symposium 51 pages 297 315 Stockholm Almqvist Wiksell 1982 9 1 42 Maurice GROSS On structuring the lexicon Quaderni di Semantica 4 1 107 120 1983 9 1 43 Maurice GROSS Lexicon grammar and the syntactic analysis of french In Pro ceedings of the 10 th International Conference on Computational Linguistics COL ING 84 Stanford California 1984 9 1 44 Maurice GROSS A linguistic environment for comparative romance syntax In Ph Baldi editor Papers from the XIIth Linguistic Symposium on Romance Lan guages volume IV 26 of Amsterdam studies in the theory and history of linguistic science pages 373 446 Amsterdam Philadelphia Benjamins 1984 9 1 45 Maurice GROSS Grammaire transformationnelle du francais 3 Syntaxe de l adverbe ASSTRIL Paris 1986 3 8 9 1 46 Maurice GROSS Lexicon grammar the representation of compound words In COLING 1986 Proceedings pages 1 6 Bonn 1986 9 1 47 Maurice GROSS Methods and tactics in the construction of a lexicon grammar In Linguistics in the Morning Calm 2 Selected papers from SICOL pages 177 197 Seoul Hanshin 1986 9 1 48 Maurice GROSS Linguistic representations and text analysis In Linguistic Unity and Linguistic Diversity in Europe pages 31 61 London Academia Europaea 1991 9 1 49 Maurice GROSS Constru
317. once with the appropriate output (Figure 5.20).
5.2.5 Using Variables
It is possible to select parts of a text sequence recognized by a grammar using variables. To associate a variable var1 with parts of a grammar, use the special symbols $var1( and $var1) to define the beginning and the end of the part to store. Create two boxes, one containing $var1( and the second $var1). These boxes must not contain anything but the
5.2 EDITING GRAPHS
[Figure: Concordance window (17191 matches) in which the matched sequences carry the outputs "cas général" or "cas particulier"]
318. ones 8 2 225 267 1984 Amsterdam Philadelphia John Benjamins Publishing Company 9 1 15 A CHROBOT B COURTOIS M HAMMANI Mc CARTHY M GROSS and K ZELLAGUI Dictionnaire electronique DELAC anglais noms compos s Technical Report 59 LADL Universit Paris 7 1999 3 8 16 Unicode Consortium http www unicode org 2 2 17 Matthieu CONSTANT and Anastasia YANNACOPOULOU Le dictionnaire lec tronique du grec moderne Conception et d veloppement d outils pour son enrichissement et sa validation In Studies in Greek Linguistics Proceedings of the 23rd annual meeting of the Department of Linguistics Faculty of Philosophy Aristotle University of Thessaloniki 2002 3 8 18 Danielle CORBIN Hypoth ses sur les fronti res de la composition nominale Cahiers de grammaire 17 26 55 1992 Universit de Toulouse Le Mirail 11 1 19 Blandine COURTOIS Formes ambigu s de la langue fran aise Lingvistice In vestigationes 20 1 167 202 1996 Amsterdam Philadelphia John Benjamins Publishing Company 3 8 20 Blandine Courtois and Max Silberztein editors Les dictionnaires lectroniques du fran ais Larousse Langue fran aise vol 87 1990 3 8 11 2 1 11 2 2 21 Anne DISTER Nathalie FRIBURGER and Denis MAUREL Am liorer le d coupage en phrases sous INTEX In Anne Dister editor Revue Informatique et Statistique dans les Sciences Humaines volume Actes des 3 mes Journ es INTEX pages 181 199 2000 2 5 2
319. only for some particular sublasses of nouns or adjectives and are necessary for a better compactness of the inflection paradigms of simple words which are already considerably huge and would be even larger if no no care symbols were used Let us assume that the equivalences between the above features and their corresponding 230 CHAPTER 11 COMPOUND WORD INFLECTION codes in DELA dictionaries are given by the following Equivalences txt file Serbian s Nb s p Nb p w Nb w 1 Case 1 2 Case 2 3 Case 3 4 Case 4 5 Case 5 6 Case 6 7 Case 7 m Gen m f Gen f n Gen n v Anim v q Anim q g Anim g a Comp a b Comp b c Comp c d Det d k Det k e Det e Consider the following sample Serbian DELAC file the DELAS inflection codes may vary from those present in Unitex zxiro racyun racyun N1 mslq NC_2XN1 N Comp avio prevoznik prevoznik N10 mslv NC_2XN2 N Comp predsednik predsednik N10 mslv drzxave drzxava N600 fs2q NC_N2X1 N Comp Ujedinxene Ujedinxen Al aefplg nacije nacija N600 fplq NC_AXN3 N Comp NProp Org Kosovo Kosovo N308 nslq i Metohija Metohija N623 fslq NC_N3XN N Comp NProp Top Reg istrazxni istrazxni A2 admslg sudija sudija N679 mslv NC_AXNF N Comp Mirosinka Mirosinka N1637 fslv Dinkicx Dinkicx N1028 mslv NC_ImePrezime N Comp Hum PersName gladan gladan Al8 akmslg kao vuk vuk N128 mslv AC_A3XN2 hungry as a wolf The corresponding inflection graphs for MWUs are shown on figures 11 28 through 11 35 The DELA
320. ontaining an adjective, as shown on Figure 6.60.
CHAPTER 6 ADVANCED USE OF GRAPHS
Figure 6.59: Exiting on variable error
Figure 6.60: Backtracking on variable error
6.10.3 Concordance
The result of a search is an index file that contains the positions of all encountered occurrences. The window of Figure 6.61 lets you choose whether to construct a concordance or modify the text. In order to display a concordance, you have to click on the "Build concordance" button. You can set the size of the left and right contexts in characters. You can also choose the sorting mode that will be applied to the lines of the concordance in the "Sort According to" menu. For further details on the parameters of concordance construction, refer to section 4.8.2. The concordance is produced in the form of an HTML file. You can parameterize Unitex so that concordance files can be read using a web
321. or as a past participle of the verb taire. The then part imposes that tu is then regarded as a pronoun. Figure 7.13 shows the result of the application of this grammar on the sentence "Feras-tu cela bientôt". One can see in the automaton at the bottom that the path corresponding to tu as a past participle was eliminated.
Synchronization point
The if and then parts of an ELAG grammar are each divided into two parts, by <!> in the if part and by <=> in the then part. These symbols form a synchronization point. This makes it possible to write rules in which the if and then constraints are not necessarily aligned, as is the case, for example, in figure 7.14. This grammar is interpreted in the following way: if a dash is found followed by il, elle or on, then this dash must be preceded by a verb, possibly followed by -t. So, if one considers the sentence of figure 7.15 beginning with Est-il, one can see that all non-verb interpretations of Est were removed.
Figure 7.13: Result of applying the grammar in figure 7.12
7.3.2 Compiling ELAG Grammars
Before an ELAG grammar can be applied to a text automaton, the grammar must be com
322. or the construction of the text automaton (cf. figure 7.10). This option indicates to the automaton construction program that it should clean up each sentence automaton. This cleaning is carried out according to the following principle: if several paths compete in the automaton, the program keeps those that contain the fewest unlabeled tokens. For instance, the compound adverb aujourd'hui is preferred to the sequence made of aujourd followed by a quote and hui, because aujourd and the quote are both unlabeled tokens, while the compound adverb path does not contain any unknown word. Figure 7.11 shows the automaton of figure 7.9 after cleaning.
7.2 CONSTRUCTION
Figure 7.10: Configuration of the construction of the text automaton
323. ord file
14.13.8 Log file
14.13.9 Arabic typographic rules: arabic_typo_rules.txt
Appendix A - GNU Lesser General Public License
Appendix B - TRE's 2-clause BSD License
Appendix C - Lesser General Public License For Linguistic Resources
Introduction
Unitex is a collection of programs developed for the analysis of texts in natural language by using linguistic resources and tools. These resources consist of electronic dictionaries, grammars and lexicon-grammar tables, initially developed for French by Maurice Gross and his students at the Laboratoire d'Automatique Documentaire et Linguistique (LADL). Similar resources have been developed for other languages in the context of the RELEX laboratory network.
The electronic dictionaries specify the simple and compound words of a language together with their lemmas and a set of grammatical (semantic and inflectional) codes. The availability of these dictionaries is a major advantage compared to the usual utilities for pattern searching, as the information they contain can be used for searching and matching, thus describing large classes of words using very simple patterns. The dictionaries are presented in the DELA formalism and were constructed by teams of linguists for several languages (French, English, Greek, Italian, Spanish, German, Thai, Korean, Polish, Norwegian, Portuguese, etc.).
The grammars used here are representat
324. ory
CHAPTER 3 DICTIONARIES
- apple is the canonical form (lemma) of the entry. For nouns and adjectives (in French), it is usually the masculine singular form; for verbs, it is the infinitive. This information may be left out, as in the following example:
apple,.N+Conc:s
This means that the canonical form is the same as the inflected form. The canonical form is separated from the inflected form by a comma.
- N+Conc is the sequence of grammatical and semantic information. In our example, N designates a noun, and Conc indicates that this noun designates a concrete object (see table 3.2). Each entry must have at least one grammatical or semantic code, separated from the canonical form by a period. If there are more codes, these are separated by the + character.
- :p is an inflectional code which indicates that the noun is plural. Inflectional codes are used to describe gender, number, declension and conjugation. This information is optional. An inflectional code is made up of one or more characters, each of which represents one piece of information. Inflectional codes have to be separated by the : character, for instance in an entry like the following:
hang,.V:W:P1s:P2s:P1p:P2p:P3p
The : character is interpreted in OR semantics. Thus, :W:P1s:P2s:P1p:P2p:P3p means "infinitive", or "1st person singular present", or "2nd person singular present", etc. (see table 3.3). Since each character represents one piece of information, you must not use the sa
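The entry layout described above can be illustrated with a small parser. This is only a sketch, assuming the layout `inflected,lemma.CODE+CODE:infl:infl`; the names `DelafEntry` and `parse_delaf_line` are hypothetical and not part of Unitex, and escaped characters inside entries are not handled.

```python
from dataclasses import dataclass


@dataclass
class DelafEntry:
    """Hypothetical container for one dictionary line (not a Unitex type)."""
    inflected: str
    lemma: str
    codes: list        # grammatical/semantic codes, e.g. ["N", "Conc"]
    inflections: list  # inflectional codes, e.g. ["p"]; may be empty


def parse_delaf_line(line: str) -> DelafEntry:
    """Split 'inflected,lemma.CODES:infl:...' into its parts.

    Simplification: commas, periods, '+' and ':' inside the word forms
    (which would need escaping) are not supported.
    """
    inflected, _, rest = line.strip().partition(",")
    lemma, _, tags = rest.partition(".")
    if not tags:
        raise ValueError(f"no grammatical code in entry: {line!r}")
    code_part, *inflections = tags.split(":")
    codes = code_part.split("+")
    # An empty lemma means "same as the inflected form".
    return DelafEntry(inflected, lemma or inflected, codes, inflections)


entry = parse_delaf_line("hang,.V:W:P1s:P2s:P1p:P2p:P3p")
print(entry.lemma, entry.codes, entry.inflections)
```

Keeping the inflectional codes as a list mirrors the OR semantics of the `:` separator: each element is one alternative reading of the same form.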
325. ou can also filter the grammatical/semantic codes to be displayed. Select "All" and you will see all codes. Select "Only POS category" and only the first codes, supposed to represent the POS category, will be displayed. If you select "Use filter" and set a regular expression X, codes that do not contain something matched by X will be discarded. Any POSIX regular expression is accepted as filter. Check "Always show POS category" and, as said, the POS category will be kept even if it is not matched by the filter, if any. For instance, Figure 7.36 shows a filtering result obtained with the filter ^[A-Z], which matches any code starting with an uppercase letter, thus discarding codes like EAR. The "Export all text as POS list" button can be used to export this table display of the whole text automaton as a text file following a special format. Currently it is only an experimental
[Figure: Automaton Table window with the filter options and the columns Form, POS sequence 1, POS sequence 2]
326. ou must make sure that they, too, receive or can get the source code. If you link other code with the library, you must provide complete object files to the recipients, so that they can relink them with the library after making changes to the library and recompiling it. And you must show them these terms so they know their rights.
We protect your rights with a two-step method: (1) we copyright the library, and (2) we offer you this license, which gives you legal permission to copy, distribute and/or modify the library.
To protect each distributor, we want to make it very clear that there is no warranty for the free library. Also, if the library is modified by someone else and passed on, the recipients should know that what they have is not the original version, so that the original author's reputation will not be affected by problems that might be introduced by others.
Finally, software patents pose a constant threat to the existence of any free program. We wish to make sure that a company cannot effectively restrict the users of a free program by obtaining a restrictive license from a patent holder. Therefore, we insist that any patent license obtained for a version of the library must be consistent with the full freedom of use specified in this license.
Most GNU software, including some libraries, is covered by the ordinary GNU General Public License. This license, the GNU Lesser General Public License, applies to certain designated libraries
327. ox to itself (see figure 5.7). To undo this, double-click on the same box a second time, or use the "Undo" button.
Figure 5.7: Box connected to itself
CHAPTER 5 LOCAL GRAMMARS
Click on "Save as" in the FSGraph menu to save the graph. By default, Unitex proposes to save the graph in the sub-directory Graphs in your personal folder. You can see whether the graph was modified after the last save by checking if the title contains the text (Unsaved).
When editing a graph, you can bring up a contextual menu to perform standard graph editing operations by right-clicking in the background of the graph window. This menu offers several operations that are frequently used when editing a graph:
- create a new box;
- save or print the current graph, or set up the page parameters;
- the usual Tools, Format and Zoom menus, also accessible in the FSGraph menu.
If one or several boxes are currently selected, the following menus are accessible, allowing you to apply specific operations to these sets of boxes. Otherwise, these menus are useless and therefore disabled:
- surround selected boxes with input or output variables, with contexts, or with Morphological mode delimiters. These operations are also accessible via the toolbar of the graph editing window (see section 5.2.8);
- merge selected boxes;
- export as a new graph.
[Figure: contextual menu of the graph editor (Create box, Surround with, Merge boxes, Export as new graph)]
328. peni, you must use the sequence LDRi: L will move the cursor on the a, D will delete the a, shifting the n to the left, and then Ri will restore the n and add an i.
- U (unaccent) removes the accent of the current character, if any. For instance, the sequence LLUx applied to the word mangés produces the inflected form mangex, since U has turned the é into an e.
- P (uppercase) uppercases the initial letter of the stack. For instance, the sequence Px will turn foo into Foox.
- W (lowercase) lowercases the initial letter of the stack.
- <R=?> replaces the initial letter of the stack by the letter ?.
- <I=?> inserts the letter ? before the initial letter of the stack.
- <X=n> removes the first n letters of the stack.
There are also two operators dedicated to Korean:
- J removes a Jamo letter. If the current character is a Hangul syllable character, this character is first replaced by the equivalent Jamo sequence, and then the last Jamo letter is removed. If the current character is neither a Jamo nor a Hangul, an error is raised.
- . (latin dot) inserts a syllable boundary. As a side effect, if the top of the stack contains Jamo letters, they are first recombined into a Hangul character.
In the example below, the inflection of choose is shown. The sequence LLDRRn describes the form chosen:
- Step 0: the canonical form is copied on the stack, and the cursor is set behind the last letter.
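The behaviour of the cursor operators can be sketched in a few lines. This is a simplified model inferred from the examples in this section (choose with LLDRRn, mangés with LLUx), not the Unitex implementation: only L, R, D, U and literal letters are handled, and `apply_operators` is a hypothetical name.

```python
import unicodedata


def apply_operators(lemma: str, sequence: str) -> str:
    """Apply a subset of inflection operators to a canonical form.

    Assumed model: the letters before the cursor form the stack, and the
    letters after it can be brought back with R.
    """
    buf = list(lemma)
    pos = len(buf)  # cursor starts behind the last letter
    for op in sequence:
        if op == "L":            # move the cursor one letter to the left
            pos = max(pos - 1, 0)
        elif op == "R":          # restore one letter to the right
            pos = min(pos + 1, len(buf))
        elif op == "D":          # delete the letter before the cursor
            if pos > 0:
                del buf[pos - 1]
                pos -= 1
        elif op == "U":          # unaccent the letter at the cursor, then advance
            if pos < len(buf):
                buf[pos] = unicodedata.normalize("NFD", buf[pos])[0]
                pos += 1
        else:                    # literal letter: write it at the cursor
            if pos < len(buf):
                buf[pos] = op
            else:
                buf.append(op)
            pos += 1
    # Letters left after the cursor are not part of the produced form.
    return "".join(buf[:pos])


print(apply_operators("choose", "LLDRRn"))       # chosen
print(apply_operators("mang\u00e9s", "LLUx"))    # mangex
```

Treating every character outside the operator set as a literal is a shortcut: uppercase letters that Unitex reserves as operators (P, W, J and others) would need their own branches.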
329. plied in MERGE mode to the text Ivanhoe, the concordance in Figure 6.44 is obtained.
Outputs with variables can be used to move word groups. In fact, the application of a transducer in REPLACE mode inserts only the produced sequences into the text. In order to invert two word groups, you just have to store them into variables and produce an output with these variables in the desired order. Thus, the application of the transducer in Figure 6.45 in REPLACE mode to the text Ivanhoe results in the concordance of Figure 6.46.
Figure 6.45: Inversion of words using two variables
Figure 6.46: Result of the application of the transducer in figure 6.45
If the beginning or the end of variable is malformed end of a var
330. pplied using the Fst2Txt program, but in REPLACE mode, which means that input sequences recognized by the grammar are replaced by the output sequences that are produced. Figure 2.11 shows a grammar that normalizes verbal contractions in English.
2.5.4 Splitting a text into tokens
Some languages, in particular Asian languages, use separators that are different from the ones used in western languages. Spaces can be forbidden, optional or mandatory. In order to better cope with these particularities, Unitex splits texts in a language-dependent way. Thus, languages like English are treated as follows. A token can be:
- the sentence delimiter {S};
- the stop marker {STOP}. This token is a special one that can NEVER be matched in any way by a grammar. It can be used to bound elements in a corpus. For instance, if a corpus is made of news items separated by {STOP}, no grammar can match a sequence that overlaps the end of one news item and the beginning of the next;
- a lexical tag, such as {aujourd'hui,.ADV};
- a contiguous sequence of letters (the letters are defined in the language alphabet file);
- one (and only one) non-letter character, i.e. any character not defined in the alphabet file of the current language; if it is a newline, it is replaced by a space.
For other languages, tokenization is done on a character by character basis, except for the sentence delimiter {S}, the {STOP} marker and lexical tags. This simple tokenizatio
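The word-by-word tokenization rules listed above can be approximated with a single regular expression. This is a minimal sketch, not the Tokenize program: it assumes that "letter" means any Unicode letter, whereas Unitex reads the letters from the alphabet file of the language, and the name `tokenize` is hypothetical.

```python
import re

# Alternatives are tried in order: sentence delimiter, stop marker,
# lexical tag in braces, run of letters, then any single character.
TOKEN_RE = re.compile(r"\{S\}|\{STOP\}|\{[^{}]+\}|[^\W\d_]+|.", re.DOTALL)


def tokenize(text: str) -> list:
    """Split text into tokens; a newline token becomes a space."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        token = match.group(0)
        tokens.append(" " if token == "\n" else token)
    return tokens


print(tokenize("It's 42{S}"))
```

Digits and punctuation come out as one token per character, which matches the "one and only one non-letter character" rule above.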
331. program to match expressions that start with a space (default);
- -c/--char_by_char: works in character by character tokenization mode. This is useful for languages like Thai;
- -w/--word_by_word: works in word by word tokenization mode (default);
- -d DIR/--sntdir=DIR: puts produced files in DIR instead of the text directory. Note that DIR must end with a file separator (\ or /);
- -K/--korean: tells Locate that it works on Korean;
- -u X/--arabic_rules=X: Arabic typographic rule configuration file;
- -g X/--negation_operator=X: specifies the negation operator to be used in Locate patterns. The two legal values for X are minus and tilde (default). Using minus provides backward compatibility with previous versions of Unitex.
Search limit options:
- -l/--all: looks for all matches (default);
- -n N/--number_of_matches=N: stops after the first N matches.
Maximum iterations per token options:
- -o N/--stop_token_count=N: stops after N iterations on a token;
- -o N,M/--stop_token_count=N,M: emits a warning after N iterations on a token and stops after M iterations.
Matching mode options:
- -S/--shortest_matches;
- -L/--longest_matches (default);
- -A/--all_matches.
CHAPTER 13 USE OF EXTERNAL PROGRAMS
Output options:
- -I/--ignore: ignore transducer outputs (default);
- -M/--merge: merge transducer outputs with text inputs;
- -R/--replace: replace texts inputs with corresponding transdu
332. quivalences between these features and their corresponding codes in DELA dictionaries are given by the following Equivalences txt file English s Nb s p Nb p Consider the following sample English DELAC file angle angle N1 s of reflection NC_NXXXX Adam s apple apple Nl s NC_XXXXN air brake brake N1 s NC_XXN birth date date N1 s NC NN NofN criminal police NC XXXinv cross roads NC XXNs head head N1 s of government government N1 s NC NofNs notary notary N3 s public public Nl s NC_NsNs rolling stone stone N1 s NC XXN student student N1 s union union N1 s NC Ns N 224 CHAPTER 11 COMPOUND WORD INFLECTION The corresponding inflection graphs N1 and N3 for simple words are represented on fig ures 11 10 and 11 11 while those for compounds are shown on figures 11 12 through 11 20 The DELACF dictionary resulting from the inflection via MULTIFLEX of the above DELAC is as follows angle of reflection angle of reflection NC NXXXX s angles of reflection angle of reflection NC NXXXX p Adam s apple Adam s apple NC XXXXN s Adam s apples Adam s apple NC XXXXN p air brake air brake NC XXN s air brakes air brake NC XXN p date of birth birth date NC NN NofN s dates of birth birth date NC NN NofN p birth date birth date NC NN NofN s birth dates birth date NC NN NofN p criminal police criminal police NC XXXinv p cross roads cross roads NC XXNs s cross roads cross roads NC XXNs p eads of go
333. r. By default, the encoding proposed on a PC is always Unicode Little Endian. The texts thus obtained no longer contain any formatting information (fonts, colors, etc.) and are ready to be used with Unitex.
CHAPTER 2 LOADING A TEXT
You can change the default encoding to UTF16LE, UTF16BE or UTF8 in the Encoding tab, via the Preferences command in the Info menu. This encoding is valid for the current language only.
Figure 2.5: Setting the default encoding for current language
2.3 Editing text files
For small texts, you also have the possibility of using the text editor integrated into Unitex, accessible via the "Open" command in the "File Edition" menu. This editor offers search and replace functionalities for the texts and dictionaries handled by Unitex. To use it, click on the "Find" icon. You will then see a window divided into three parts. The "Find" part corresponds to the usual search operations. If you open a text split into sentences, you can base your search on sentence numbers in the "Find Sentence" part. Lastly, the "Search Dictionary" part, visible in figure 2.6, enables you to carry out operations concerning the electronic dictionaries. In particular, you can search by specifying
334. r several different inflectional interpretations, such as for example se,.PRO+PpvLE:3ms:3fs:3mp:3fp
7.3 RESOLVING LEXICAL AMBIGUITIES WITH ELAG
Evaluation of ambiguity removal
The evaluation of the ambiguity rate is not based solely on the average number of interpretations per word. In order to get a more representative measure, the system also takes into account the various combinations of words. While instances of ambiguities are resolved, the Elag program calculates the number of possible analyses in the text automaton before and after the modification, which corresponds to the number of possible paths through the automaton. On the basis of this value, the program computes the average ambiguity by sentence and by word. It is this last measure which is used to represent the ambiguity rate of the text, because it does not vary with the size of the corpus, nor with the number of sentences within. The formula applied is:
$$\text{lexical ambiguity rate} = \exp\left(\frac{\log(\text{number of paths})}{\text{text length}}\right)$$
The relationship between the ambiguity rate before and after applying the grammars gives a measure of their efficiency. All this information is displayed in the ELAG processing window.
7.3.6 Description of the tag sets
The Elag and ElagComp programs require a formal description of the tag set to be used in dictionaries. This description consists essentially of an enumeration of all the parts of speech present in the dictionaries with
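The ambiguity rate formula is easy to check numerically. The sketch below is an illustration only, with a hypothetical function name; it is not how the Elag program computes its statistics internally. The rate is the geometric mean of the number of readings per word, that is, the number of paths raised to the power 1 / text length.

```python
import math


def lexical_ambiguity_rate(number_of_paths: int, text_length: int) -> float:
    """Return exp(log(number_of_paths) / text_length)."""
    if number_of_paths < 1 or text_length < 1:
        raise ValueError("both arguments must be at least 1")
    return math.exp(math.log(number_of_paths) / text_length)


# A 3-word sentence with 2 readings per word has 2 * 2 * 2 = 8 paths.
before = lexical_ambiguity_rate(8, 3)
# Suppose disambiguation leaves 2 paths (an illustrative figure).
after = lexical_ambiguity_rate(2, 3)
print(before, after, before / after)
```

Because the logarithm of the path count is divided by the text length, a fully disambiguated text (one path) always has a rate of 1, whatever its size.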
335. raph names with -r. This can be combined with the + and - priority marks:
bagpipe-r.fst2
McAdam-r+.fst2
phtirius-r-.fst2
Exporting produced entries as a morphological dictionary
Dictionary entries produced by dictionary graphs are of course taken into account by the Locate program. However, you cannot refer to them in the morphological mode, since they do not belong to a morphological dictionary. If you want to do so, you just have to add -b at the end of the graph name. If you add -z instead, then the morphological dictionary will be compressed immediately, thus being usable by the next dictionary graph to be applied.
Naming conventions
The whole naming scheme for dictionary graphs is as follows: name-XYZ.fst2, where:
- X is in {r,R,m,M}: r means REPLACE mode, m means MERGE mode (default);
- Y is in {b,B,z,Z}: option that rules the production of a morphological dictionary (see previous section);
- Z is in {a,A,l,L,s,S}: a means that the graph will be applied in All matches mode, l means Longest matches mode (default), s means Shortest matches mode.
3.7.4 Morphological dictionary graphs
In addition to dictionary graphs that produce new entries in the text dictionaries, you can design morphological dictionary graphs. The output of such graphs will be used as special input for the construction of the text automaton. We call them morphological dictionary graphs because their main utility is to introduce new morphological analysis
336. rcin WOLINSKI The Unbearable Lightness of Tagging A Case Study in Morphosyntactic Tagging of Polish In Proceedings of the 4th International Workshop on Linguistically Interpreted Corpora EACL 2003 2003 11 1 1 11 22 77 Roger Bruno RABENNILAINA Le verbe malgache AUPELF UREF et Universit Paris 13 Paris 1991 9 1 78 Elisabete RANCHHOD Frozen adverbs comparative forms como c in portuguese Lingvistice Investigationes 15 1 141 170 1991 Amsterdam Philadelphia John Benjamins Publishing Company 3 8 9 1 79 Elisabete RANCHHOD Ressources linguistiques du portugais impl ment es sous intex In C Fairon editor Analyse Lexicale et Syntaxique Le syst me INTEX Lingvisticae Investigationes pages 263 277 Amsterdam Philadelphia John Benjamins Publishing Company 1998 3 8 80 Elisabete RANCHHOD Probl mes de traduction automatique des constructions verbes supports Lingvistice Investigationes 23 2 253 267 2001 Amsterdam Philadelphia John Benjamins Publishing Company 9 1 81 Elisabete RANCHHOD and Michele DE GIOIA Comparative romance syntax frozen adverbs in italian and in portuguese Lingvistice Investigationes 20 1 33 85 1996 Amsterdam Philadelphia John Benjamins Publishing Company 9 1 BIBLIOGRAPHY 347 82 Elisabete RANCHHOD and Samuel ELEUTERIO Constru o de dicion rios elec tr nicos do portugu s problemas te ricos e metodol gicos In Actas do Con gresso Internacional sobre o Po
337. red by Unitex. It is kept to ensure compatibility with Intex graphs;
- PORIENT x: this line is ignored by Unitex. It is kept to ensure compatibility with Intex graphs;
- #: this line is ignored by Unitex. It serves to indicate the end of the header information.
The lines after the header give the contents and the position of the boxes in the graph. The following example corresponds to a graph recognizing a number:
3
"<E>" 84 248 1 2
"" 272 248 0
s"1+2+3+4+5+6+7+8+9+0" 172 248 1 1
The first line after the header indicates the number of boxes in the graph, immediately followed by a newline. This number cannot be lower than 2, since a graph always has an initial and a final state.
The following lines define the boxes of the graph. The boxes are numbered starting at 0. By convention, state 0 is the initial state and state 1 is the final state. The contents of the final state is always empty.
Each box in the graph is defined by a line that has the following format:
contents X Y N transitions
contents is a sequence of characters enclosed in quotation marks that represents the contents of the box. This sequence can sometimes be preceded by an s if the graph is imported from Intex; this character is then ignored by Unitex. The contents of the sequence is the text that has been entered in the editing line of the graph editor. Table 14.4 shows the encoding of two special sequences that are not encoded in
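The box line format (contents, X, Y, N, then N transition targets) can be read with a short parser. This is a sketch under the assumption that fields are separated by whitespace and that the contents string is delimited by double quotes; `parse_box_line` is a hypothetical helper, not a Unitex function.

```python
import re

# Optional Intex 's' prefix, quoted contents (backslash escapes allowed),
# then the numeric fields.
BOX_RE = re.compile(r'^s?"(?P<contents>(?:\\.|[^"\\])*)"\s+(?P<rest>.*)$')


def parse_box_line(line: str) -> dict:
    """Parse one box definition into contents, position and transitions."""
    match = BOX_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a box line: {line!r}")
    numbers = [int(field) for field in match.group("rest").split()]
    if len(numbers) < 3:
        raise ValueError(f"missing X, Y or N in: {line!r}")
    x, y, count = numbers[:3]
    transitions = numbers[3:]
    if len(transitions) != count:
        raise ValueError(f"expected {count} transitions in: {line!r}")
    return {"contents": match.group("contents"), "x": x, "y": y,
            "transitions": transitions}


print(parse_box_line('"<E>" 84 248 1 2'))
```

Checking the transition count against N catches truncated lines early, which is useful when a graph file has been edited by hand.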
338. refers shortest matches in case of nested sequences. For instance, if your grammar can recognize the sequences a very hot chili and very hot, the first one will be discarded. • Longest matches: prefers longest matches (a very hot chili in our example). This is the default. • All matches: outputs all recognized sequences. Figure 4.4: Locate pattern window The Search limitation box is used to limit the number of results to a certain number of occurrences. By default, the search is limited to the first 200 occurrences. The options of the Grammar outputs box do not concern regular expressions. They are described in section 6.10. The same holds for the options of the Advanced options tab (see section 6.10.2). In the Search algorithm frame, you can specify whether you want to perform the locate operation on the text using the Locat
339. replaces this by two entries one where the character is replaced by a space and one where it is replaced by a dash Thus the following entry acorn shells acorn shell N p is replaced by the following entries acorn shells acorn shell N p acorn shells acorn shell N p NOTE If you want to keep an entry that includes the character escape it using as in the following example EN mc2 FORMULA This replacement is done when the dictionary is compressed In the compressed dictionary the escaped characters are replaced by simple As such if a dictionary containing the following lines is compressed E mc2 FORMULA acorn shell N s and if the dictionary is applied to the following text Formulas like E mc2 have nothing to do with acorn shells you will get the following lines in the dictionary of compound words of the text E mc2 FORMULA acorn shells N p 50 CHAPTER 3 DICTIONARIES Entry Factorization Several entries containing the same inflected and canonical forms can be combined into a single one if they also share the same grammatical and semantic codes Among other things this allows us to combine identical conjugations for a verb bottle V W Pls P2s Plp P2p P3p If the grammatical and semantic information differ one has to create distinct entries bottle N Conc s bottle V W Pls P2s Plp P2p P3p Some entries that have the same grammatical and semanti
340. resented by red line. On Windows, you can open a sub-graph by clicking on the grey line while pressing the Alt key. (Footnote 1: To avoid confusion, graph calls that refer to the repository are displayed in brown instead of grey.) On Linux, the combination <Alt+Click> is intercepted by the system. In order to open a sub-graph, click on its name while pressing the left and the right mouse buttons simultaneously. The list of subgraphs called from the current graph, and the list of graphs in which the current graph is called, can be displayed by clicking on the second and third buttons of the fourth set of buttons in the toolbar (see Figure 5.15 and Figure 5.25 in section 5.2.8). In these lists of subgraphs: • sub-graphs directly called from the current graph appear with their simple filename; • sub-graphs indirectly called from one of the graphs called by the current graph appear with an arrow before their filename; • sub-graphs that appear in one of the graphs that are called from the current one, but that are unplugged and never processed, appear in orange; • sub-graphs that are not found (neither .grf nor .fst2) appear in red. [Figure: Called graphs list for Sentence.grf]
341. Figure 6.63: Example of a concordance comparison 6.10.7 Debug mode When you apply a graph to a text with the Locate menu in the window shown in figure 6.52, if you activate the debug mode in the Locate pattern in the form of field, the concordance will be displayed in a special window, such as in figure 6.64, divided into three parts. In the top right part of the window is the concordance. It is identical to the classical concordance, in which the sequences matched by the graph appear in blue. In the bottom right, you will find the graph used for the locate operation. On the left side of the window is a table of 3 columns: Tag, output and matched. Each token of the matched sequence appears in the matched column; the Tag column contains what is in the box of the automaton that matched it; and if this box has any output, it will appear in the output column. For each matched sequence in the concordance, if you click on its line in the concordance, the table on the left will be updated, and clicking on a row of the table will colour the corresponding box in the graph. This will help you to see, for each occurrence of a matched sequence in the text, which path in the automaton recognized it. A red number above each box indicates the number of sequences in the text in which that box matched a token.
342. roblem is that this version does not start by default See section Java for Mac OS X 10 5 Update 2 at http developer apple com java 2 if you have an older OS X 32 bit Intel or a PowerPC you must try SoyLatte see below How to know if my processor is a 32 or 64 bit one In the Apple menu click on About this Mac If you see something like Processor x xx Ghz Intel Core Duo your processor is a 32 bit one If you see Processor x xx Ghz Intel Core 2 Duo or if you processor is another Intel one like Xeon then you have a 64 bit processor 15 1 Using the Apple Java 1 6 runtime If you are running Mac OS X 10 5 or later on 64 Bits Intel processor you can just use the Java 1 6 from Apple You can get it from http www apple com support downloads javaformacosxl05updatel html You can just start Application gt Utilities gt Java Preferences to verify the status of Java 1 6 First be sure that Java SE 6 is on Java Applications list Option 1 modify the default runtime for Java Applications If you don t use other Java application that need Java 1 5 you can just put Java SE 6 at the top of the Java Applications list on Java Preference Utility Option 2 Create an alias to start Java 1 6 If you don t want modify Java global parameters you can create an alias alias jre6 System Library Frameworks JavaVM framework Versions 1 6 Commands jre6 jar Unitex jar Then just run Unitex from Terminal 20 CHA
343. rphological classes (a morphological class should fully determine the inflection categories the word inflects for, as well as those that are lexically fixed for the word; e.g. in Polish, a noun has a gender and inflects for number and case). • What are the exceptions to the inflection categories determined above? E.g. in Polish, wybory powszechne (general election) is a compound noun, but it doesn't have a singular form, although its head word wybory does. • What are the inflectional characteristics (base form, morphological class, inflection paradigm, etc.) of the single constituents of the MWU? E.g. in French, porte (door) is an uninflected verb in porte-avion (aircraft carrier), while it is an inflected noun in porte-fenêtre (French window), which takes an s in plural (portes-fenêtres). • How should we combine the inflected forms of the single constituents in order to generate the inflected forms of the whole compound? E.g. to inflect battle of nerves and battle cry in number, we need to inflect the first and the last constituent, respectively. 11.1.2 Lexicalized vs. Grammar-Based Approach to Morphological Description A previous study [84] has confirmed the status of MWUs as units on the frontier between morphology and syntax. Their compound structure suggests productivity, which can hardly be processed without a grammar-based approach. However, some of their morphological syntactic
344. rs and how to apply them. Then we deal with the options and behaviors offered by CasSys. 12.1 Applying a cascade of transducers with CasSys 12.1.1 Creating the list of transducers The menu FSGraph proposes two submenus, New cascade and Edit cascade (Figure 12.1). In order to create the list of transducers, choose New cascade. If you want to modify an existing cascade, choose Edit cascade; a file explorer then lets you choose the cascade to open. The list of transducers of a cascade is saved in a text file with the extension .csc (e.g. mycascade.csc). (Footnote 1: Feder-Région Centre Entités nommées et nommables, managed by Denis Maurel, LI, Tours, France; integration carried out by Nathalie Friburger and David Nott.) Figure 12.1: FSGraph menu of Unitex and submenus New cascade and Edit cascade 12.1.2 Editing the list of transducers The Cassys configuration window (Figure 12.2) is divided into three parts.
345. rs is to be divided into units. In our formalism, units are referred to by numerical variables ($1, $2, $3, etc.). For example, with Unitex, a sequence like Athens '04 consists of five constituents, referred to in MULTIFLEX as: $1 = Athens, $2 = <space>, $3 = ', $4 = 0, $5 = 4. Each simple unit subject to inflection within a MWU has to be morphologically identified. The identification means providing sufficient data so that any inflected form of the same item may be generated on demand. For instance, in mémoire vive we need to know that vive is the feminine singular form of a lemma, and we have to be able to generate the feminine plural form of the same lemma (vives). We suppose that the external module for single units working with MULTIFLEX is responsible for such identification and generation of inflected forms of single units. In Unitex, the generation of forms is strongly inspired by the DELA system [20]. In order to be able to generate one or more inflected forms of a word, we have to know: • its lemma; • its inflection paradigm, called inflection code; • the inflection features of forms to be generated. Thus, within the Unitex-MULTIFLEX interface, the description of a single unit is done as follows: vive(vif.A54:fs), where A54 is the inflection code of vif, and fs is the DELA-style description using morphological features appearing in Equivalences
346. rtuguês, pages 265-282, 1996. Lisboa: Colibri. (Cited in 3.8) [83] Morris SALKOFF. Verbs of mental states. In Lexique, syntaxe et lexique-grammaire. Papers in honour of Maurice Gross, volume 24 of Lingvisticae Investigationes Supplementa, pages 561-571. Amsterdam-Philadelphia: Benjamins, 2004. (Cited in 9.1) [84] Agata SAVARY. Recensement et description des mots composés: méthodes et applications, 2000. Thèse de doctorat, Université de Marne-la-Vallée. (Cited in 3.8, 11.1.1, 11.1.2) [85] Agata SAVARY. A formalism for the computational morphology of multi-word units. Archives of Control Sciences, 15(3):437-449, 2005. (Cited in 11, 11.1.2, 11.2) [86] Max SILBERZTEIN. Les groupes nominaux productifs et les noms composés lexicalisés. Lingvisticae Investigationes, 27(2):405-426, 1999. Amsterdam-Philadelphia: John Benjamins Publishing Company. (Cited in 3.8, 11.1) [87] Carlos SUBIRATS-RÜGGEBERG. Sentential complementation in Spanish. A lexico-grammatical study of three classes of verbs. John Benjamins, Amsterdam-Philadelphia, 1987. (Cited in 9.1) [88] Thomas TREIG. Complétives en allemand: classification. Technical Report 7, LADL, 1977. (Cited in 9.1) [89] Lidia VARGA. Classification syntaxique des verbes de mouvement en hongrois dans l'optique d'un traitement automatique. In F. Kiefer, G. Kiss, and J. Pajzs, editors, Papers in Computational Lexicography (COMPLEX), pages 257-265, Budapest, 1996. Research Institute for Linguistics, Hungarian Academy of Sciences. (Cited in 9.1) [90] Simoneta VIETRI. On the stu
347. s $< and $>. Within this zone, things are matched letter by letter, as shown on Figure 6.30. Figure 6.30: Example of morphological zone in a grammar 6.4.2 The rules In this mode, the content of the graph is not interpreted as it is in the normal way: 1. There is no implicit space between boxes. So, if you want to match a space, you have to make it explicit with a space between double quotes. 2. You can still use subgraphs, but the end of the morphological zone must occur in the same graph as its beginning. 3. You can use morphological filters on <DIC> and on patterns referring to dictionaries, like <be>, <N:ms>, etc. 4. You can use morphological filters alone or on <TOKEN>, but note that your filters will only apply to the current character. As a consequence, filters like <<[1-9][0-9]>> that try to match more than one character will never match anything. In fact, in morphological mode, morphological filters should only be used to express negations like <<[^aeiouy]>> (any character that is not a vowel). 5. Left and right contexts are forbidden. 6. You can use outputs. 7. <MOT> will match any letter, as defined in the alphabet file. 8. <MIN> will match any lowercase letter, as defined in the alphabet file. 9. <MAJ> will match any uppercase letter, as defined in the alphabet file. 10. <DIC> will match any
348. s Normalization of the desired language. The normalization grammars for ambiguous forms are described in section 6.1.3. If a sequence of the text is recognized by the normalization grammar, all the interpretations that are described by the grammar are inserted into the text automaton. Figure 7.4 shows the part of the grammar used for the ambiguity of the sequence l' in French. Figure 7.4: Normalization of the sequence l' If this grammar is applied to a French sentence containing the sequence l', a sentence automaton that is similar to the one in figure 7.5 is obtained. You can see that the four rules for rewriting the sequence l' have been applied, which has added four labels to the automaton. These labels are not concurrent with the two preexisting paths for the sequence l' because of the keep best paths heuristic (see section 7.2.4). Figure 7.5: Automaton that has been normalized with the grammar of figure 7.4 The normalization at the time of the construction of the automaton allows you to add paths to the automaton, but not to remove any. Removing paths will be partially done by the keep best paths heuristic, if enabled. To go further, you will need to use the ELAG disambiguation functionality. 7.2.3 Normalization of clitic pronouns in Portuguese In Portug
349. s example grammar applied to Ivanhoe will produce, in MERGE mode, the concordance shown on Figure 6.49. Thus, you can see that the outputs ADJ and NOUN have not been inserted to the left of the input text, as one may have expected. Figure 6.48: Output variables Figure 6.49: Concordance obtained with grammar of Figure 6.48 6.9 Operations on variables 6.9.1 Testing variables It is possible to test whether a variable has been defined or not, in order to bloc
350. s is a space or a hyphen, the compressed form of the unit is the unit itself, as in the following line: 0-1.N:p which is the output for battle-axes,battle-axe.N:p. This maintains a certain readability of the .inf file when the dictionary contains compound words. Whenever at least one of the units in a pair is neither a space nor a hyphen, the compressed form is composed of the number of characters to be removed, followed by the sequence of characters to be appended. Thus, the dictionary line première partie,premier parti.N+AN+Hum:fs is encoded by the line 3er 1.N+AN+Hum:fs. The 3er code indicates that 3 characters are to be removed from the sequence première and the characters er are to be appended to obtain premier. The 1 indicates that only one character needs to be removed from partie to obtain parti. The number 0 is used whenever it needs to be indicated that no letter should be removed. 14.8.3 Dictionary information file In the Apply lexical resources frame, it is possible for some dictionaries to get some information with a right-click. Such information is attached to a biniou.bin or biniou.fst2 dictionary by means of a raw text file named biniou.txt, located in the same directory. 14.8.4 The CHECK_DIC.TXT file This file is produced by the dictionary verification program CheckDic. It is a text file that contains information about the analysed dictionary and has four parts. The first part
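The compressed form described in this entry (a count of characters to remove, followed by the characters to append) can be sketched in a few lines. This is a minimal illustration, not the Unitex decompression code: the function names are hypothetical, and passing the units and their codes as already-split lists is an assumption, since the entry does not spell out how a full .inf line is tokenized.

```python
import re
from typing import List

# A numeric code is "<number of characters to remove><characters to append>".
_CODE_RE = re.compile(r"^(\d+)(.*)$")


def apply_unit_code(inflected_unit: str, code: str) -> str:
    """Rebuild the canonical unit from an inflected unit and its code."""
    match = _CODE_RE.match(code)
    if match is None:
        # A separator (space or hyphen) is its own compressed form.
        return code
    remove = int(match.group(1))
    if remove > len(inflected_unit):
        raise ValueError("cannot remove more characters than the unit has")
    # Slice by length so that a count of 0 keeps the whole unit.
    return inflected_unit[: len(inflected_unit) - remove] + match.group(2)


def rebuild_canonical(units: List[str], codes: List[str]) -> str:
    """Apply one code per unit and join the results."""
    if len(units) != len(codes):
        raise ValueError("one code is expected per unit")
    return "".join(apply_unit_code(u, c) for u, c in zip(units, codes))


if __name__ == "__main__":
    print(rebuild_canonical(["première", " ", "partie"], ["3er", " ", "1"]))
    print(rebuild_canonical(["battle", "-", "axes"], ["0", "-", "1"]))
```

Slicing by `len(unit) - remove` rather than by a negative index avoids the pitfall that `unit[:-0]` is the empty string in Python.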
351. s outputs 145 Analysis of compound words in German Analysis of compound words in Norwe gian 45 Analysis of compound words in Russian 45 Analysis of free compounds in Dutch 276 Analysis of free compounds in German 276 Analysis of free compounds in Norwe gian 276 Analysis of free compounds in Russian 276 Antialiasing 107 318 Approximation of a grammar through a final state transducer 265 Approximation of a grammar with a fi nite state transducer 117 Arabic typographic rules 322 Automata finite state 90 text 281 Automatic inflection 57 113 274 Automaton acyclic 153 minimal 63 of the text 73 153 284 text 115 Axiom 89 Box alignement 108 Boxes alignement 108 connecting 91 creating 90 deleting 99 selection 98 sorting lines 106 BSD 333 INDEX cascade of transducer 239 Cascade of transducers 254 Case see Respect of lowercase uppercase 116 Case respect 115 Case sensitivity 72 79 CasSys 239 Checking dictionary format 54 Chinese characters 187 Clitics normalization 157 277 Cognates 205 Collections of graphs 134 Colors configuration 109 Comment in a dictionary 48 Comments in a graph 91 Comparing concordances 150 Comparing variables 143 Compilation of a graph 117 Compilation of graphs 268 Compiling ELAG grammars 164 Compound words 211 Compressing dictionaries 256 Compression of dictionaries 277 Concatenation of regular expressions 71 76 Co
352. s to it ambiguity removal rules. OPTIONS: • -l LANG/--language=LANG: ELAG configuration file for the language of the text; • -r RULES/--rules=RULES: rule file compiled in the .rul format; • -o OUT/--output=OUT: output text automaton. 13.14 ElagComp ElagComp [OPTIONS] This program compiles the ELAG grammar named GRAMMAR, or all the grammars specified in the RULES file. The result is stored in the OUT file that will be used by the Elag program. OPTIONS: • -r RULES/--rules=RULES: file listing ELAG grammars; • -g GRAMMAR/--grammar=GRAMMAR: single ELAG grammar; • -l LANG/--language=LANG: ELAG configuration file for the language of the grammar(s); • -o OUT/--output=OUT: output file. By default, the output file name is the same as RULES, except for the extension that is .rul. 13.15 Evamb Evamb [OPTIONS] <tfst> This program computes an average lexical ambiguity rate on the text automaton <tfst>, or just on the sentence whose number is specified by N. The results of the computation are displayed on the standard output. The text automaton is not modified. OPTIONS: • -o OUT/--output=OUT: optional output filename; • -s N/--sentence=N: sentence number. 13.16 Extract Extract [OPTIONS] <text> This program extracts from the given text all sentences that contain at least one occurrence from the concordance. The parameter <text> represents the comp
353. selves, then this License, and its terms, do not apply to those sections when you distribute them as separate works. But when you distribute the same sections as part of a whole which is a work based on the Linguistic Resource, the distribution of the whole must be on the terms of this License, whose permissions for other licensees extend to the entire whole, and thus to each and every part regardless of who wrote it. Thus, it is not the intent of this section to claim rights or contest your rights to work written entirely by you; rather, the intent is to exercise the right to control the distribution of derivative or collective works based on the Linguistic Resource. In addition, mere aggregation of another work not based on the Linguistic Resource with the Linguistic Resource (or with a work based on the Linguistic Resource) on a volume of a storage or distribution medium does not bring the other work under the scope of this License. 3. A program that contains no derivative of any portion of the Linguistic Resource, but is designed to work with the Linguistic Resource (or an encrypted form of the Linguistic Resource) by reading it or being compiled or linked with it, is called a "work that uses the Linguistic Resource". Such a work, in isolation, is not a derivative work of the Linguistic Resource, and therefore falls outside the scope of this License. However, combining a "work that uses the Linguistic Reso
354. sequence of grammatical or semantic codes: aviatrix,N4+Hum matrix,N4+Math radix,N4 The first code is used to determine the grammatical code of the entry, as well as the name of the grammar used to inflect the canonical form. There are two possible forms: • V32: grammar name = V32.fst2, grammatical code = V (longest letter prefix); • N(NC_XXX): grammar name = NC_XXX.fst2, grammatical code = N. These inflectional grammars will automatically be compiled if needed. In the example above, all entries will be inflected by a grammar named N4. In order to inflect a dictionary, click on Inflect in the DELA menu. The window in figure 3.7 allows you to specify the directory in which inflectional grammars are found. By default, the subdirectory Inflection of the directory for the current language is used. You can also specify the kind of words your DELAS is supposed to contain. If an entry is found that does not correspond to your choice, an error message will be displayed. Figure 3.7: Configuration of automatic inflection Figure 3.8: Inflectional grammar N4 Figure 3.8 shows an example of an inflectional grammar. The paths describe the suffi
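The two forms of the first code described in this entry can be illustrated with a short function that derives the grammar file name and the grammatical code. This is a sketch under assumptions, not the logic of the Unitex inflection program: the function name is hypothetical, and the parenthesized spelling `N(NC_XXX)` of the second form is assumed, because the punctuation of that form is not reliably preserved in this copy.

```python
import re
from typing import Tuple


def split_inflection_code(code: str) -> Tuple[str, str]:
    """Return (grammar file name, grammatical code) for a first code."""
    # Assumed second form: CODE(GRAMMAR), e.g. N(NC_XXX).
    explicit = re.fullmatch(r"([^()]+)\(([^()]+)\)", code)
    if explicit is not None:
        return explicit.group(2) + ".fst2", explicit.group(1)
    # First form: the grammatical code is the longest letter prefix.
    prefix = re.match(r"[A-Za-z]+", code)
    if prefix is None:
        raise ValueError("code does not start with a letter: %r" % code)
    return code + ".fst2", prefix.group(0)


if __name__ == "__main__":
    for sample in ("V32", "N4", "N(NC_XXX)"):
        print(sample, "->", split_inflection_code(sample))
```

Only ASCII letters are treated as the letter prefix here; whether other letters may occur in codes is not stated in the entry.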
355. shows that a given token, shown in green, must be analyzed as a combination of two elements: a verb and a modifier. The point is that the modifier is only made of one Jamo letter that combines with the last Hangul character of the verb in order to give the last Hangul character of the whole word (in green). The green tokens correspond to untagged tokens. Untagged tokens are not highlighted in green for other languages. Figure 7.38: Decomposition of a Hangul character As a consequence, it can be convenient for Korean users to write grammars with mixes of Hangul and Jamo characters. Thus, a grammar like the one shown on Figure 7.39 will match sequences like the one shown on Figure 7.40. Figure 7.39: A grammar with two Jamo letters REMARKS: 1. Jamo letters are not in the Korean alphabet file Alphabet.txt. DO NOT ADD THEM TO THIS FILE, because it would cause programs to malfunction. 2. This alphabet file contains equivalences between some Chinese characters and some Hangul ones. In practice, if a grammar contains a Chinese character that has such an equivalent Hangul, it will match this Hangul in the text automaton. For instance, the grammar shown on Figure 7.41 will match the sentence of Figure 7.40, because the Korean alphabet file contains an equivalence for that character, as shown on Figure 7.42. Figure 7.40: Sentence automa
356. sing operations, the text is modified as it is being read. In order to avoid the risk of infinite loops, it is necessary that the sequences that are produced by a transducer are not re-analyzed by the same one. Therefore, whenever a sequence is inserted into the text, the application of the transducer is continued after that sequence. This rule only applies to preprocessing transducers, because during the application of syntactic graphs, the transductions do not modify the processed text but a concordance file, which is distinct from the text. 6.7.3 Priority of the leftmost match During the application of a local grammar, overlapping occurrences are all indexed. Note that we talk about real overlapping occurrences like abc and bcd, not nested occurrences like abc and bc. During the construction of the concordance, all these overlapping occurrences are presented (cf. Figure 6.42). Figure 6.42: Overlapping occurrences in concordance On the other hand, if you modify a text instead of constructing a concordance, it is necessary to choose among these occurrences the one that will be taken into account. Unitex applies the following priority rule for that purpose: the leftmost sequence is used.
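The leftmost-match priority rule can be sketched as a selection over overlapping occurrences. This is an illustration only, not the Unitex implementation: the function name is hypothetical, occurrences are represented as (start, end) offsets with an exclusive end, and the handling of occurrences that start at the same position is an arbitrary choice here because this entry does not cover it.

```python
from typing import List, Tuple

Occurrence = Tuple[int, int]  # (start, end) offsets, end exclusive


def select_leftmost(occurrences: List[Occurrence]) -> List[Occurrence]:
    """Keep, among overlapping occurrences, the one that starts leftmost."""
    selected: List[Occurrence] = []
    next_free = 0  # first position not covered by a selected occurrence
    # Sorting puts leftmost starts first. For equal starts, the shorter
    # occurrence happens to come first; that tie-break is an assumption.
    for start, end in sorted(occurrences):
        if start >= next_free:
            selected.append((start, end))
            next_free = end
    return selected


if __name__ == "__main__":
    # "abc" at 0..3 and "bcd" at 1..4 overlap: only the leftmost is kept.
    print(select_leftmost([(1, 4), (0, 3)]))
    print(select_leftmost([(0, 3), (1, 4), (4, 6)]))
```

With the text abcd and the two occurrences abc and bcd, only abc is kept, which matches the rule that the leftmost sequence is used when the text is modified.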
357. slash like this 123N ONE TWO THREE NEW LINE 14.13.7 Forbidden word file The PolyLex program requires a forbidden word file for Dutch and Norwegian. This raw text file is supposed to be named ForbiddenWords.txt. It must be in the user's Dela directory corresponding to the language to work on. Each line is supposed to contain one forbidden word. 14.13.8 Log file The UnitexToolLogger program, when a unitex_logging_parameters.txt file is found with a path to store log files, creates a .ulp file with a log of the Unitex tool that was run. It also creates a unitex_logging_parameters_count.txt file, which contains only the number of the latest log file created. Log files with the .ulp extension are uncompressed zip files, compatible with unzip and all standard unzip tools. They contain these files: • test_info/command_line.txt: a list of the parameters of the command line used to run the tool. There is one parameter on each line. The first line contains the return value, the second line the number of parameters; • test_info/command_line_synth.txt: a single line with a summary of the command line used to run the tool; • test_info/list_file_in.txt: a list of the files read by the tool. The first column is the file size, the second column is the crc32, the third is the filename; • test_info/list_file_out.txt: a list of the files created by the tool. The first column is the file size, the second column is the crc32, the third is the filename.
358. special value 1 The and of tag definitions is marked by a line containing Example Here is the file that corresponds to the text He is drinking orange juice 00000000014 1 He is drinking orange juice 0 2 1 1 2 2 1 1 3 8 1 1 4 6 1 1 5 5 6 1 LL AL 0 0 E 302 CHAPTER 14 FILE FORMATS fT Q E 4 STD4 He he N s p f amp 0 0 0 0 1 0 Y STD4 He he PRO Nomin 3ms 4 0 0 0 0 1 09 4 STD4 is be V P3s 4 2 0 0 2 1 0 Y STD4 Q is i N p f 2 0 0 2 1 0 Y STD4 drinking drinking A 40 0 4 7 04 4 STD4 drinking drinking N s 4 94 20 0954 7 08 4 STD4 drinking drink V G 4 4 0 0 4 7 04 Y STD4 orange orange A 4 6 0 0 6 5 04 Y STD4 orange orange N s 4 6650705645404 4 STD4 orange juice orange juice N XN z1 s 4 6 0 0 8 4 04 Y 14 5 TEXT AUTOMATON 303 STD4 juice juice N Conc s 4 88 0 0 8 4 0 1 STD4 juice juice V W P1s P2s P1p P2p P3p 4 88 0 0 8 4 0 1 STD4 gd 89 0 0 9 0 0 Y 4 14 5 2 The text tind file The text t ind file is an index file used to jump at correct byte offset in the text t fst file when we want to load a given sentence It is a binary file that contains 4 x N bytes where N is the number of sentences It gives the start offset of each sentence as a 4 byte little endian sequence 14 53 The cursentence grf file The cursentence grf file is generated b
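The layout of the text.tind index described in this entry (4 × N bytes, one 4-byte little-endian start offset per sentence) can be read with the standard `struct` module. This is a minimal sketch and not Unitex code: the function name is hypothetical, and treating the offsets as unsigned integers is an assumption, since the entry only gives their size and byte order.

```python
import struct
from typing import List


def read_sentence_offsets(data: bytes) -> List[int]:
    """Decode the start offset of each sentence from the index bytes."""
    if len(data) % 4 != 0:
        raise ValueError("index size must be a multiple of 4 bytes")
    count = len(data) // 4
    # "<" selects little-endian, "I" a 4-byte unsigned integer (assumed).
    return list(struct.unpack("<%dI" % count, data))


if __name__ == "__main__":
    # Hypothetical index for three sentences, built in memory for the demo.
    sample = struct.pack("<3I", 0, 120, 4096)
    offsets = read_sentence_offsets(sample)
    print(len(offsets), "sentences, offsets:", offsets)
```

To load sentence k, a reader would seek to the k-th offset in the text automaton file; how sentences are numbered (from 0 or from 1) is not stated in this entry.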
359. t. It is not possible to refer to information in dictionaries in an inflection transducer, but it is possible to reference subgraphs. Transducer outputs are concatenated in order to produce a string of characters. This string is then appended to the produced dictionary entry. Outputs with variables do not make sense in an inflection transducer. Case of letters is respected: lowercase letters stay lowercase, and the same holds for uppercase letters. Besides, the connection of two boxes is exactly equivalent to the concatenation of their contents, together with the concatenation of their outputs (cf. figure 6.2). Figure 6.2: Two equivalent paths in an inflection grammar Inflection transducers may be compiled before being used by the inflection program. If not, the inflection program will compile them on the fly. For more details, see section 3.5. 6.1.2 Preprocessing graphs Preprocessing graphs are meant to be applied to texts before they are tokenized into lexical units. These graphs can be used for inserting or replacing sequences in the texts. The two customary uses of these graphs are normalization of non-ambiguous forms and sentence boundary recognition. The interpretation of these graphs in Unitex is very close to that of syntactic graphs used by the search for patterns. The differences are the following: • you can use the special symbol <^> that recognizes a newline; • if you work in character by character mode, you can use the sp
360. t 322 concord html 306 concord ind 272 274 304 concorden 272 321 concord txt 305 concord tfst n 2 4 321 corpus txt 314 cursentence grf 282 303 cursentence tok 282 304 cursentence txt 282 304 diff html 307 dic 43 57 263 308 320 dlc n 320 d1f 43 57 263 308 320 dif n 320 enter pos 283 300 err 43 57 263 308 320 err n 320 norm rul 174 regexp grf 277 stat_dic n 263 320 stats n 42 283 320 system die det 319 tags ind 308 tags_err 308 320 tags_err n 320 tagset def 313 text cod 42 283 299 text tfst 285 300 text tind 285 303 tfst tags by_alph txt 304 tfst tags by_freq txt 304 tok_by_alph txt 42 283 300 tok_by_freq txt 42 283 300 tokens txt 42 283 299 train_dict 315 user dic def 319 alphabet 31 38 40 54 259 267 270 273 282 284 formats 291 HTML 81 148 text 34 291 INDEX transcoding 32 Forbidden word file 322 Form canonical 48 inflected 47 Generation of Korean MWU dictionary 254 German compound words 45 GlossaNet 258 306 Grammars ambiguity removal 163 collection 168 constraints 118 context free 89 ELAG 116 extended algebraic 90 for phrase boundary recognitions 11 formalism 89 inflectional 57 local 116 normalisation of non ambiguous forms 114 of the text automaton 115 normalization of non ambiguous forms 39 splitting into sentences 37 Granularity of dictionaries 155 Gr
361. t fr umlv unitex x This will allow you to load bin fst2 and alphabet files and to keep them in memory persistently You use the filename created by loadPersistent function String persistentAlphabet UnitexJni loadPersistentAlphabet unit String persistentFst2 UnitexJni loadPersistentFst2 unitex Frenc String persistentDictionary UnitexJni loadPersistentDictionary unitex French Dela communesFR bin 13 4 Text file encoding parameters Unitex uses Unicode for text file14 1 All program which read or write text file share same encoding parameters Possible format are utf16le bom utf16le no bom 254 CHAPTER 13 USE OF EXTERNAL PROGRAMS utf16be bom utf16be no bom utf8 bom utf8 no bom for Unicode Big Endian Little Endian and UTF 8 with or without Unicode byte order mark at the beginning of the file For the input format you can specify several bom encoding separated by comma but only one no bom encoding OPTIONS e k ENCODING input encoding ENCODING input text file format Can contain several value separated by a comma e q ENCODING output encoding ENCODING output text file format By default value are input encoding utfl61le bom utfl6be bom utf8 bom oi 13 5 BuildKrMwuDic BuildKrMwuDic OPTIONS dic This program generates a MWU dictionary graph from a text table dic describing each component of each MWU OPTIONS e o GRF outp
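The remark in section 13.4 that several bom encodings may be listed for input, but only one no bom encoding, can be illustrated by how a byte order mark is detected. This is a sketch and not the Unitex implementation: the function name is hypothetical, the format names are spelled with hyphens as an assumption (the punctuation is lost in this copy), and the explanation that a file without a mark cannot be reliably attributed to one encoding is offered as a plausible reading rather than a statement from the manual.

```python
import codecs
from typing import Optional

# Byte order marks for the three Unicode formats listed in section 13.4.
# The hyphenated names are an assumed spelling.
_BOMS = [
    (codecs.BOM_UTF8, "utf8-bom"),
    (codecs.BOM_UTF16_LE, "utf16le-bom"),
    (codecs.BOM_UTF16_BE, "utf16be-bom"),
]


def detect_bom_format(data: bytes) -> Optional[str]:
    """Return the format announced by the byte order mark, if any."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # No mark: the bytes alone do not say which encoding was used, so a
    # single fallback encoding has to be chosen by the caller.
    return None


if __name__ == "__main__":
    with_bom = codecs.BOM_UTF16_LE + "He is drinking".encode("utf-16-le")
    print(detect_bom_format(with_bom))
    print(detect_bom_format("He is drinking".encode("utf-8")))
```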
362. Figure 6.5: Configuration of approximation of a grammar (options: expected result grammar format, either "equivalent FST2 (subgraph calls may remain)" or "Finite State Transducer (can be just an approximation)"; maximum flattening depth, 10 in the figure)
The box "Flattening depth" lets you specify the level of embedding of subgraphs. This value represents the maximum depth up to which the calls to subgraphs will be replaced by the subgraphs themselves.
The "Expected result grammar format" box allows you to determine the behavior of the program beyond the selected limit. If you select the "Finite State Transducer" option, the calls to subgraphs will be replaced by <E> beyond the maximum depth. This option guarantees that we obtain a finite state transducer, however possibly not equivalent to the original grammar. On the contrary, the "equivalent FST2" option indicates that the program should allow for subgraph calls beyond the limited depth. This option guarantees the strict equivalence of the result with the original grammar, but does not necessarily produce a finite state transducer. This option can be used for optimizing certain grammars.
A message indicates at the end of the approximation process if the result is a finite state transducer or an FST2 grammar, and in the case of a transducer, if it is equivalent to the original grammar (cf. Figure 6.6).
6.2.3 Constraints on grammars
With the exception of inflection grammars, a grammar can ne
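The difference between the two result formats can be sketched on a toy model. The Python code below is an illustration under strong assumptions, not the algorithm used by Unitex: each graph is reduced to a flat list of symbols, a symbol starting with ":" is treated as a subgraph call, and the names flatten and keep_calls are hypothetical.

```python
EPSILON = "<E>"


def flatten(grammar, name, depth, keep_calls):
    """Inline subgraph calls of graph `name` up to `depth` levels.

    Beyond the depth limit, a call is either kept as is (keep_calls=True,
    the "equivalent FST2" behavior) or replaced by the empty word
    (keep_calls=False, the "Finite State Transducer" behavior).
    """
    result = []
    for symbol in grammar[name]:
        if not symbol.startswith(":"):
            result.append(symbol)  # ordinary symbol, copied unchanged
        elif depth > 0:
            # Replace the call by the content of the called graph.
            result.extend(flatten(grammar, symbol[1:], depth - 1, keep_calls))
        elif keep_calls:
            result.append(symbol)  # the call remains in the result
        else:
            result.append(EPSILON)  # approximation: the call is dropped
    return result
```

With the recursive toy grammar {"S": ["a", ":S", "b"]} and depth 1, the approximated result is a, a, <E>, b, b. It no longer contains any call, but it also no longer describes the same language as the original grammar, which is the trade-off described above.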
363. … <grf2>: .grf files to be compared
OPTIONS:
• --output X: saves the result, if any, in X instead of printing it on the output.
Compares the given .grf files and prints their difference on the standard output. Returns 0 if they are identical modulo box and transition reordering, 1 if there are differences, 2 in case of error. Here are the diff indications that can be emitted:
• P name: a presentation property has changed. name = property name (SIZE, FONT, ...)
• M a b: box moved. a = box number in <grf1>, b = box number in <grf2>
• C a b: box content changed. a = box number in <grf1>, b = box number in <grf2>
• A x: box added. x = box number in <grf2>
• R x: box removed. x = box number in <grf1>
• T a b x y: transition added. a, b = src and dst box numbers in <grf1>; x, y = src and dst box numbers in <grf2>
• X a b x y: transition removed. a, b = src and dst box numbers in <grf1>; x, y = src and dst box numbers in <grf2>
Note that transition modifications related to boxes that have been added or removed are not reported.
13.23 GrfDiff3
GrfDiff3 <mine> <base> <other>
<mine>: my .grf file
<other>: the other .grf file that may be conflicting
<base>: the common ancestor .grf file
OPTIONS:
• --output X: saves the result, if any, in X instead of printing it on the output
• --conflicts X: saves the description of the
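A script that consumes these indications could decode them as in the Python sketch below. It assumes that each indication is printed on its own line with fields separated by whitespace; the exact separators are not recoverable from this text, so this is an assumption, and parse_indication is a hypothetical helper, not a Unitex function.

```python
# Meaning of each indication letter, as listed above.
INDICATIONS = {
    "P": "presentation property changed",
    "M": "box moved",
    "C": "box content changed",
    "A": "box added",
    "R": "box removed",
    "T": "transition added",
    "X": "transition removed",
}

# Meaning of the values returned by the program, as listed above.
RETURN_CODES = {0: "identical", 1: "different", 2: "error"}


def parse_indication(line):
    """Split one indication line into (letter, arguments).

    Box numbers are converted to integers; the property name of a
    "P" indication is kept as text.
    """
    code, *args = line.split()
    if code not in INDICATIONS:
        raise ValueError("unknown indication: " + code)
    if code != "P":
        args = [int(arg) for arg in args]
    return code, args
```

For instance, the line "M 3 5" is decoded as ("M", [3, 5]): box 3 of the first graph is box 5 of the second one.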
364. … not followed by of the republic. However, you can draw graphs with positive or negative contexts. In that case, graphs are no longer equivalent to algebraic grammars but to context-sensitive grammars, which do not have the same theoretical properties.
6.3.1 Right contexts
To define a right context, you must bound a zone of the graph with boxes containing $[ and $], which indicate the start and the end of the right context. These bounds appear in the graph as green square brackets. Both bounds of a right context must be located in the same graph.
Figure 6.13: Using a right context
Figure 6.13 shows a simple right context. The graph matches numbers followed by a currency symbol, but this symbol will not appear in matched sequences, i.e. in the concordance.
Right contexts are interpreted as follows. During the application of a grammar on a text, let us assume that a right context start is found. Let pos be the current position in the text at this time. Now, the Locate program tries to match the expression described inside the right context. If it fails, then there will be no match. If it matches the whole right context, that is to say if Locate reaches the right context end, then the program will rewind to the position pos and go on exploring the grammar after the right context end.
You can also define negative right contexts, using $![ to indicate the right context start. Figure 6.14 shows a graph that
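The rewind behavior described above is close to lookahead assertions in regular expression engines. The Python sketch below is only an analogy built on the standard re module, not a model of Locate: the two patterns and the currency symbols are chosen for illustration.

```python
import re

# Positive context: a number is matched only if a currency symbol follows,
# and the symbol itself is not part of the match.
POSITIVE = re.compile(r"\d+(?= ?[€$])")

# Negative context: a whole number is matched only if no currency symbol follows.
# The \b prevents the engine from matching a shorter prefix of the number.
NEGATIVE = re.compile(r"\d+\b(?! ?[€$])")

TEXT = "pay 100 € or 20 apples"
print(POSITIVE.findall(TEXT))
print(NEGATIVE.findall(TEXT))
```

In both patterns the engine checks what follows the number and then resumes at the position reached before the check, which is what "rewind to the position pos" means for a right context.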
365. … works on a Semitic language (needed if Dico has to compress a morphological dictionary);
• -u X/--arabic_rules=X: specifies the Arabic typographic rule configuration file;
• -r X/--raw=X: indicates that Dico should just produce one output file X containing both simple and compound words, without requiring a text directory. If X is omitted, results are displayed on the standard output.
dic_i represents the path and name of a dictionary. The dictionary must be a .bin dictionary (obtained with the Compress program) or a dictionary graph in the .fst2 format (see section 3.7, page 64). It is possible to give priorities to the dictionaries. For details, see section 3.7.1.
The program Dico produces the following files, and saves them in the directory of the text:
• dlf: dictionary of simple words in the text;
• dlc: dictionary of compound words in the text;
• err: list of unknown words in the text;
• tags_err: unrecognized simple words that are not matched by the tags.ind file;
• tags.ind: sequences to be inserted in the text automaton (see section 3.7.3, page 65);
• stat_dic.n: file containing the number of simple words, the number of compound words, and the number of unknown words in the text.
NOTE: Files dlf, dlc, err and tags_err are not sorted. Use the program SortTxt to sort them.
13.13 Elag
Elag [OPTIONS] <tfst>
This program takes a .tfst text automaton <tfst> and applie
366. tag as proper names all the unknown words that begin with an uppercase letter, thanks to the graph NPr- shown in figure 3.15. The - in the graph name gives it a low priority, so that it will be applied after the standard dictionary. This graph works with words that are still unknown after the application of the standard dictionary. Square brackets stand for a context definition. For more information about contexts, see section 6.3.
Since dictionary graphs are applied using the engine of Locate, they have exactly the same properties as syntactic graphs. So, you can use morphological filters and/or morphological mode. For instance, the graph shown on Figure 3.16 uses morphological filters to recognize roman numerals. Note that it also uses contexts in order to avoid recognizing uppercase letters in some contexts.
Figure 3.14: Dictionary graph of chemical elements
Figure 3.15: Dictionary graph that tags unknown words beginning with an uppercase letter as proper names
By default, dictionary graphs are applied in MERGE mode. If you want to apply them in REPLACE mode, you must suffix g
367. tagged corpus txt file Tuples are composed of either a sequence of 2 or 3 tags to compute transition probability or a word preceded by 0 or 1 tag to compute emit probability Units in a tuple must be separated by a tabulation These tuples are followed by the sequence N pj 316 CHAPTER 14 FILE FORMATS FE of delimiters and then an integer representing the number of occurrences of this tuple in the corpus file Filenames are suffixed by cat or morph In the first one tuples are composed of tags formed of grammatical syntactic and semantic codes In the second one tuples consist in tags formed of grammatical syntactic and semantic codes and sometimes followed bya and inflectional codes Here is an example of a data file with cat tags the 9630 those 2364 eyes 324 DET Ddef the 96304 DET Ddem those 1404 PRO Pdem those 964 N eyes 324 DET N 62541 PREP DET N 258374 4 Here is an example of a data file with morph tags the 9630 4 those 2364 eyes 324 DET Ddef s the 44374 DET Ddef p the 51934 DET Ddem p those 1409 PRO Pdem p those 964 N p eyes 324 DET s N s 184894 PREP DET s Noe 2 69774 4 A special line is added to data files in order to identify whether the file contains cat or morph tags This line contains CODE FEATURES followed by either the integer 0 for cat tags or 1 for morph tags NOTE At the fin
368. … text: Concord will not surround matches with tabulations;
• -l X/--left=X: number of characters on the left of the occurrences (default=0). In Thai mode, this means the number of non-diacritic characters;
• -r X/--right=X: number of characters (non-diacritic ones in Thai mode) on the right of the occurrences (default=0). If the occurrence is shorter than this value, the concordance line is completed up to right. If the occurrence is longer than the length defined by right, it is nevertheless saved as a whole.
NOTE: For both --left and --right, you can add the s character to stop at the first {S} tag. For instance, if you set 40s for the left value, the left context will end at 40 characters at most, less if the {S} tag is found before.
Sort order options:
• --TO: order in which the occurrences appear in the text (default);
• --LC: left context for primary sort, then occurrence for secondary sort;
• --LR: left context, then right context;
• --CL: occurrence, then left context;
• --CR: occurrence, then right context;
• --RL: right context, then left context;
• --RC: right context, then occurrence.
For details on the sorting modes, see section 4.8.2.
Output options:
• -H/--html: produces a concordance in HTML format encoded in UTF-8 (default);
• -t/--text: produces a concordance in Unicode text format;
• -g SCRIPT/--glossanet=SCRIPT: produces a concordance for GlossaNet in HTML format. The HTML
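The meaning of the s suffix on the left and right values can be sketched in Python as follows. This is a simplified illustration, not the code of Concord: the helper names are hypothetical, lengths are counted in plain characters (the Thai diacritic rule is ignored), and the sentence tag is assumed to be written {S}.

```python
def parse_context_length(value):
    """Split a context value such as "40s" into (length, stop_at_sentence)."""
    stop_at_sentence = value.endswith("s")
    digits = value[:-1] if stop_at_sentence else value
    if not digits.isdigit():
        raise ValueError("invalid context length: " + repr(value))
    return int(digits), stop_at_sentence


def left_context(text, start, value, marker="{S}"):
    """Return the left context of an occurrence that begins at index `start`."""
    length, stop_at_sentence = parse_context_length(value)
    context = text[max(0, start - length):start]
    if stop_at_sentence and marker in context:
        # Keep only what follows the sentence tag closest to the occurrence.
        context = context[context.rfind(marker) + len(marker):]
    return context
```

With the value 40s, a sentence tag found within the 40 characters shortens the context; with the plain value 40, the tag is kept as ordinary text.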
369. thalie FRIBURGER Reconnaissance automatique des noms propres application la classification automatique de textes journalistiques 2002 Th se de doctorat Universit de Tours 12 31 A Simple English Axis Generator http nlp cs nyu edu GMA docs HOWTO axis 13 9 32 Jacqueline GIRY SCHNEIDER Syntax and lexicon Blessure wound noeud knot caresse caress SMIL Journal of Linguistic Calculus 3 4 55 72 1978 9 1 33 Jacqueline GIRY SCHNEIDER Les nominalisations en fran ais L op rateur faire dans le lexique Droz Gen ve Paris 1978 9 1 34 Jacqueline GIRY SCHNEIDER Les pr dicats nominaux en fran ais Les phrases sim ples verbe support Droz Gen ve Paris 1987 9 1 35 GNU Lesser General Public License http www gnu org licenses lgpl html 1 1 14 13 9 36 Gaston GRoss D finition des noms compos s dans un lexique grammaire Langue Francaise 87 1990 11 1 37 Gaston GROSS Les expressions fig es en fran ais Noms compos s et autres locutions Ophrys Paris 1996 3 8 11 1 38 Maurice GROSS M thodes en syntaxe Hermann Paris 1975 9 1 39 Maurice GROSS Sur quelques groupes nominaux complexes In J C Cheva lier et M Gross editor M thodes en grammaire francaise pages 97 119 Paris Klincksieck 1976 9 1 40 Maurice GRoss Taxonomy in syntax SMIL Journal of Linguistic Calculus 3 4 73 96 1978 9 1 344 BIBLIOGRAPHY 41 Maurice GROSS Simple sentences Discussi
370. … that file already exists, the produced lines are appended at the end of the file;
• -i INFO/--info=INFO: designates a text file in which the information about the analysis has been produced.
Language options:
• -D/--dutch
• -G/--german
• -N/--norwegian
• -R/--russian
NOTE: for Dutch or Norwegian words, the program tries to read a text file containing a list of forbidden words. This file is supposed to be named ForbiddenWords.txt (see section 14.13.7) and stored in the same directory as BIN.
13.30 RebuildTfst
RebuildTfst <tfst>
This program reconstructs the text automaton <tfst>, taking into account the manual modifications. If the program finds a file sentenceN.grf in the same directory as <tfst>, it replaces the automaton of sentence N with the one represented by sentenceN.grf. The input text automaton is modified.
13.31 Reconstrucao
Reconstrucao [OPTIONS] <index>
This program generates a normalization grammar designed to be applied before the construction of an automaton for a Portuguese text. The <index> file represents a concordance which has to be produced by applying in MERGE mode to the considered text a grammar that extracts all forms to be normalized. This grammar is called V-Pro-Suf, and is stored in the /Portuguese/Graphs/Normalization directory.
OPTIONS:
• -a ALPH/--alphabet=ALPH: the alphabet file to use;
• -r ROOT/--root=ROOT: the inverse
371. that is normally distributed in either source or binary form with the ma jor components compiler kernel and so on of the operating system on which the executable runs unless that component itself accompanies the executable It may happen that this requirement contradicts the license restrictions of other proprietary libraries that do not normally accompany the operating system Such a contradiction means you cannot use both them and the Library together in an executable that you distribute 7 You may place library facilities that are a work based on the Library side by side in a single library together with other library facilities not covered by this License and distribute such a combined library provided that the separate distribu 14 13 VARIOUS OTHER FILES 329 tion of the work based on the Library and of the other library facilities is otherwise permitted and provided that you do these two things a Accompany the combined library with a copy of the same work based on the Library uncombined with any other library facilities This must be distributed under the terms of the Sections above b Give prominent notice with the combined library of the fact that part of it is a work based on the Library and explaining where to find the accompanying uncombined form of the same work 8 You may not copy modify sublicense link with or distribute the Library except as expressly provided under this License Any attempt otherwise to copy
372. … that will be kept.
• Search limitation to a certain number of occurrences: in CasSys, this search is not limited, because such a limitation makes no sense in a cascade. We always index all utterances in the text.
12.2.5 A special way to mark up patterns with CasSys
The output of the transducers can be used to insert special information into texts, particularly to mark up the recognized patterns. It is possible to use any marks you want, or XML tags such as <xxx> </xxx>, but CasSys proposes a special way to mark up patterns that offers some advantages and that we present here.
Unitex splits texts into different sorts of tokens, like the sentence delimiter {S}, the stop marker {STOP}, contiguous sequences of letters, lexical tags such as {aujourd'hui,.ADV}, etc. The lexical tag is used by CasSys in a special way.
The lexical tag (between curly brackets) is normally used to avoid ambiguities (see explanation in section 2.5.4 and in section 7.5.1). For example, in a text, if you have the token {curly brackets,.N}, neither "curly" nor "brackets" will be recognized, but the whole sequence "curly brackets". A lexical tag can contain complex lexical information, like N+Pers+Hum:fs. In a graph, you can look for a lexical token using the lexical information it contains: for example, you can write <N> to search for a noun, <Pers+Hum> for a human person, or <Pers>. These lexical masks are explained in the Chapter Searchin
373. the same way as they are entered into the grf files 14 3 GRAPHS 297 Sequence in the graph editor Sequence in the gr f file N x N N Table 14 4 Encoding of special sequences NOTE The characters between lt and gt or between and are not interpreted Thus the character in sequence le lt A Conc gt is not interpreted as a line separator since the pattern lt A Conc gt is interpreted with priority X and Y represent the coordinates of the box in pixels Figure 14 1 shows how these coordinates are interpreted by Unitex 0 0 x y Figure 14 1 Interpretation of the coordinates of boxes N represents the number of outgoing transitions of the box This number is always 0 for the final state The transitions are defined by the number of their target box Every line of the box definition ends with a newline 14 3 2 Format fst2 An fst2 file is a text file that describes a set of graphs Here is an example of an fst2 file 00000000024 298 CHAPTER 14 FILE FORMATS Sthe DETY A gt ADJY pretty small fq The first line represents the number of graphs that are encoded in the file The beginning of each graph is identified by a line that indicates the number and the name of the graph 1 NP and 2 Adj in the file above The following lines describe the states of the graph If the state is final the line starts with the t character and with the character if
374. … the facilities you need to build a program for Mac OS X, whether it's an application, kernel extension or command-line tool. The only problem with Xcode is that it is huge to download (800 Mb). Of course, you will not need everything included in the package, but everything you need is in it.
[Figure: screenshot of the Apple Developer Connection "Tools" web page, from which Xcode 2.2 can be downloaded]
375. … the mode used for displaying transducer outputs. Parameters name, x, y and z are defined in the same way as for FONT.
• BCOLOR x: defines the background color of the graph. x represents the color in RGB format;
• FCOLOR x: defines the foreground color of the graph. x represents the color in RGB format;
• ACOLOR x: defines the color inside the boxes that correspond to the calls of subgraphs. x represents the color in RGB format;
• SCOLOR x: defines the color used for writing in comment boxes (boxes that are not linked up with any others). x represents the color in RGB format;
• CCOLOR x: defines the color used for drawing selected boxes. x represents the color in RGB format;
• DBOXES x: this line is ignored by Unitex. It is kept to ensure compatibility with Intex graphs;
• DFRAME x: there will be a frame around the graph if x is y, not if it is n;
• DDATE x: puts the date at the bottom of the graph if x is y, not if it is n;
• DFILE x: puts the name of the file at the bottom of the graph, depending on whether x is y or n;
• DDIR x: prints the complete path of the graph, depending on whether x is y or n. This option has no effect if the DFILE option is set to n;
• DRIG x: displays the graph from right to left or left to right, depending on whether x is y or n;
• DRST x: this line is ignored by Unitex. It is kept to ensure compatibility with Intex graphs;
• FITS x: this line is igno
376. through you then the only way you could satisfy both it and this License would be to refrain entirely from distribution of the Linguistic Resource If any portion of this section is held invalid or unenforceable under any par ticular circumstance the balance of the section is intended to apply and the section as a whole is intended to apply in other circumstances It is not the purpose of this section to induce you to infringe any patents or other property right claims or to contest validity of any such claims this section has the sole purpose of protecting the integrity of the free resource distribution system which is implemented by public license practices Many people have made generous contributions to the wide range of data distributed through that system in reliance on consistent application of that system it is up to the author donor to decide if he or she is willing to distribute resources through any other system and a licensee cannot impose that choice This section is intended to make thoroughly clear what is believed to be a con sequence of the rest of this License If the distribution and or use of the Linguistic Resource is restricted in certain countries either by patents or by copyrighted interfaces the original copyright holder who places the Linguistic Resource under this License may add an ex plicit geographical distribution limitation excluding those countries so that distribution is permitted only in or among cou
377. …tml file is an HTML file that represents a concordance. This file is encoded in UTF-8.
The title of the page is the number of occurrences it describes. The lines of the concordance are encoded as lines where the occurrences are considered to be hypertext links. The reference associated with each of these links has the following form: <a href="X Y Z">. X and Y represent the start and end positions of the occurrence, in characters, in the file name_of_text.snt. Z represents the number of the sentence in which this occurrence appears.
All spaces that are at the left and right edges of lines are encoded by a non-breaking space (&nbsp; in HTML), which preserves the alignment of the utterances even if one of them has a left context with spaces.
NOTE: If the concordance has been constructed with the glossanet parameter, the HTML file has the same structure, except for the links. In these concordances, the occurrences are real links pointing at the web server of the GlossaNet application. For more information on GlossaNet, consult the link on the Unitex web site.
Here is an example of a file:
<html lang=en>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>6 matches</title>
</head>
<body>
<table border="0" cellpadding="0" width="100%" style="font-family: 'Arial Unicode MS'; font-size: 12">
<font face="Courier new" size=3>
on there
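The X Y Z references described above can be extracted with a few lines of Python. This sketch assumes that the three values are integers separated by single spaces inside the href attribute, as the form given above suggests; the function name occurrence_positions is hypothetical and real files may contain variations that this pattern does not cover.

```python
import re

# Matches the opening tag of an occurrence link: <a href="X Y Z">
HREF_PATTERN = re.compile(r'<a href="(\d+) (\d+) (\d+)">')


def occurrence_positions(html):
    """Return a list of (start, end, sentence) tuples found in the page."""
    return [tuple(int(group) for group in match.groups())
            for match in HREF_PATTERN.finditer(html)]
```

Each tuple gives the start and end positions of one occurrence in the .snt file, followed by the number of the sentence that contains it.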
378. … to be referenced in the web site, just send a mail to unitex@univ-mlv.fr. The more visible, the more cited.
Chapter 1: Installation of Unitex
Unitex is a multi-platform system that runs on Windows as well as on Linux or MacOS. This chapter describes how to install and how to launch Unitex on any of these systems. It also presents the procedures used to add new languages and to uninstall Unitex.
1.1 Licenses
Unitex is free software. This means that the sources of the programs are distributed with the software, and that anyone can modify and redistribute them. The code of the Unitex programs is under the LGPL licence [35], except for the TRE library for dealing with regular expressions from Ville Laurikari [63], which is under the 2-clause BSD licence, which is more permissive than the LGPL. The LGPL licence is more permissive than the GPL licence, because it makes it possible to use LGPL code in nonfree software. From the point of view of the user, there is no difference, because in both cases the software can freely be used and distributed.
All the data that go with Unitex are distributed under the LGPLLR license [53].
Full text versions of LGPL, 2-clause BSD and LGPLLR can be found in the appendices of this manual.
1.2 Java runtime environment
Unitex consists of a graphical interface written in Java and external programs written in C/C++. This mixture of programming languages is responsible for a fast and portable
379. …ton matched by grammar of Figure 7.39
Figure 7.41: A grammar with a Chinese character
Figure 7.42: Extract of Korean alphabet file
Chapter 8: Sequence Automaton
The construction of local grammars can be a long process during which the linguist repeats the same operations many times. The aim of the Seq2Grf program is to produce local grammars quickly and automatically.
This program can be used in command line mode or by clicking on "Construct Sequences Automaton" in the Text menu. The use of the command Seq2Grf is described in section 13.33. For a given document (TEILite or txt format files, or SNT when preprocessed for this task, with {STOP} tags), this program builds a single automaton that recognizes all the sequences contained in the document.
Special attention should be paid to the establishment of the list of sequences that are recognized by the graph. This chapter presents the file formats supported by the Seq2Grf program, the construction of the sequence automaton and the use of wildcards.
8.1 Sequences Corpus
We call sequences corpus, or qualified corpus, a list of sequences of one or several words that we want to be recognized by only one local grammar graph.
This sequences corpus is stored in one single file, which must be in one of the following formats:
• raw text files, in which sequences are delimited by end of line;
• SNT files already processed with this menu: sequences will be delimited by the S
380. …tor
• Claude Devis: introduction of morphological filters, based on the TRE library
• Nathalie Friburger: author of CasSys
• Hyun-Gue Huh: author of the tools used to generate Korean dictionaries
• Claude Martineau: has worked on the simple word inflection part of MultiFlex
• Sebastian Nagel: has optimized many parts of the code, has also adapted PolyLex for German and Russian
• Alexis Neme: has optimized Dico and Tokenize, has also merged Locate into Dico in order to allow dictionary graphs
• Aljosa Obuljen: author of Stats
• Sébastien Paumier: main developer
• Agata Savary: author of MultiFlex
• Anthony Sigogne: author of Tagger and TrainingTagger
• Gilles Vollant: author of UnitexTool, has optimized many aspects of the Unitex code (memory, speed, multi-compiler compliance, etc.)
• Patrick Watrin: author of XMLizer, has worked on the integration of XAlign
Moreover, Unitex would be useless without all the precious linguistic resources it contains. All those resources are the result of hard work done by people who shall not be forgotten. Some are mentioned in the disclaimers that come with the dictionaries, and complete information is available on: http://igm.univ-mlv.fr/~unitex/linguistic_data_bib.html
If you use Unitex in research projects
Unitex has been used in several research projects. Some are listed in the "Related works" section of the Unitex home page. If you did some work with Unitex (resources, project, paper, thesis, ...) and if you want it
381. … .txt file. If you choose another file name, the current text will not be affected. Click on the "GO" button to start the modification of the text. The precedence rules that are applied during these operations are described in section 6.7.
After this operation, the resulting file is a copy of the text in which transducer outputs have been taken into account. Normalization operations and splitting into lexical units are automatica
Figure 6.62: Selection of an occurrence in the text
382. … .txt file (cf. section 11.2.1). Knowing that vive is a feminine singular form of vif, we may demand the generation of its plural without having to explicitly indicate which gender of the plural we are interested in: since we only wish to change the number, the gender remains as in the original word vive, i.e. feminine.
11.2.3 Inflection paradigm of a MWU
The morphological description of MWUs in our formalism is inspired by the DELA system, in the sense that:
• each MWU is attributed an inflection code;
• a MWU's inflection code explicitly describes each inflected form of a MWU in terms of actions to be performed on the lemma and inflectional features to be attached to each form.
In the Unitex-interfaced version, MULTIFLEX uses inflection codes represented as Unitex graphs compiled into the .fst2 format. For example, Figure 11.1 contains the inflection graph for battle royal.
Figure 11.1: Inflection graph for battle royal
According to the Unitex convention, three constituents are present in battle royal: battle referred to as $1, a space referred to as $2, and royal referred to as $3. If a variable appears alone in a box, the constituent has to be the same as in the lemma of the MWU. For instance, <$3> in the uppermost path means that the unit royal is to be recopied as such. If the variable is accompanied by a set of category-feature equations, the constituent has to be inflected to the
383. …u can click on to see the error messages, as shown on Figure 13.3. This is useful when an error message occurs so fast that you cannot read it. If a command has been logged, its log number appears in the second column. Note that you can export all the commands displayed in the console to the clipboard with Ctrl+C.
Figure 13.3: Console
13.3 Unitex JNI
You can use Unitex through the Java Native Interface by including the following imports:
import fr.umlv.unitex.jni.UnitexJni;
import java.io…
impor
384. …uese verbs in the future tense and in the conditional can be modified by the insertion of one or two clitic pronouns between the root and the suffix of the verb. For example, the sequence dir-me-ão (they will tell me) corresponds to the complete verbal form dirão, associated with the pronoun me. In order to be able to manipulate this rewritten form, it is necessary to introduce it into the text automaton in parallel to the original form. Thus, the user can search for one or the other form. Figures 7.6 and 7.7 show the automaton of a sentence after normalization of the clitics.
Figure 7.7: Normalized phrase automaton
The Reconstrucao program allows you to dynamically construct a normalization grammar for these forms for each text. The grammar thus produced can then be used for normalizing the text automaton. The configuration window of the automaton construction offers an option "Build clitic normalization grammar" (cf. figure 7.10). This option automatically starts the construction of the normalization grammar, which is then used to construct the text automaton, if you have selecte
385. uistic Resource including whatever changes were used in the package which must be distributed under Sections 1 and 2 above and if the package contains an encrypted form of the Linguistic Resource with the complete machine readable work that uses the Lin guistic Resource as object code and or source code so that the user can modify the Linguistic Resource and then encrypt it to produce a modified package containing the modified Linguistic Resource b Use a suitable mechanism for combining with the Linguistic Resource A suitable mechanism is one that will operate properly with a modified ver sion of the Linguistic Resource if the user installs one as long as the mod ified version is interface compatible with the version that the package was made with c Accompany the package with a written offer valid for at least three years to give the same user the materials specified in Subsection 4a above for a charge no more than the cost of performing this distribution 338 CHAPTER 14 FILE FORMATS d If distribution of the package is made by offering access to copy from a designated place offer equivalent access to copy the above specified ma terials from the same place e Verify that the user has already received a copy of these materials or that you have already sent this user a copy If the package includes an encrypted form of the Linguistic Resource the re quired form of the work that uses the Linguistic Resource
386. [Figure 5.15: Display the list of all called graphs]
5.2.3 Manipulating boxes
You can select several boxes using the mouse. To do so, click and drag the mouse without releasing the button. When you release the button, all boxes touched by the selection rectangle will be selected and displayed in white on a blue background, as shown on Figure 5.17. (Footnote: if you are working on KDE, you can deactivate <Alt+Click> in kcontrol.)
[Figure 5.16: Selecting several boxes]
You can also select several boxes by holding down the <CTRL> and <SHIFT> keys simultaneously and clicking on every box you want to add to your current selection. This way, you can select several boxes without selecting all the boxes located in their area.
[Figure 5.17: Selecting distant boxes]
When boxes are selected, you can move them by clicking and dragging the cursor without releasing the button. In order to cancel the selection, click on an empty area of the graph. If you click on a box a
387. urand Mes documents UNITEXEnglish Corpusivanhoe_snticoncord htmi ag p Tag output Matched 7 matches bon with he slightest shade of selfishness and instead of dividing yet farther his weakened nation by 2 he the ious than prepossessing especially as instead of doffing his bonnet he pulled it still deepe purpose 1 homage and the kiss of peace S But instead of receiving their salutations with courtesy J ES dric who dried his hands with a towel instead of suffering the moisture to exhale by waving t KSC buying hither by his father Henry the Second with the purpose of buying golden opinions of the inhab reyhound which ran limping about as if with the purpose of seconding his master in collecting a ic jouble click to open the graph Figure 6 64 The Concordance window in debug mode Chapter 7 Text automaton Natural languages contain much lexical ambiguity The text automaton is an effective and visual way of representing such ambiguity Each sentence of a text is represented by an automaton whose paths represent all possible interpretations This chapter presents the concept of text automaton the details of their construction and the operations that can be applied in particular ambiguity removal and linearization Since version 2 1 it is possible to search the text automaton for patterns see section 7 7 7 1 Displaying text auto
388. urce with the Lin guistic Resource or an encrypted form of the Linguistic Resource creates a package that is a derivative of the Linguistic Resource because it contains por tions of the Linguistic Resource rather than a work that uses the Linguistic Resource If the package is a derivative of the Linguistic Resource you may distribute the package under the terms of Section 4 Any works containing that package also fall under Section 4 4 As an exception to the Sections above you may also combine a work that uses the Linguistic Resource with the Linguistic Resource or an encrypted form of the Linguistic Resource to produce a package containing portions of the Linguistic Resource and distribute that package under terms of your choice provided that the terms permit modification of the package for the customer s own use and reverse engineering for debugging such modifications You must give prominent notice with each copy of the package that the Lin guistic Resource is used in it and that the Linguistic Resource and its use are covered by this License You must supply a copy of this License If the package during execution displays copyright notices you must include the copyright notice for the Linguistic Resource among them as well as a reference directing the user to the copy of this License Also you must do one of these things a Accompany the package with the complete corresponding machine readable legible form of the Ling
389. ure 6.11: use of interval to match several consecutive tokens
6.2.4 Error detection
In order to keep the programs from blocking or crashing, Unitex automatically detects errors during graph compilation. The graph compiler checks that the main graph does not recognize the empty word and searches for all possible forms of void loops. When an error is encountered, an error message is displayed in the compilation window. Figure 6.12 shows the message that appears if one tries to compile the graph Det of Figure 6.10.
[Figure 6.12: Error message when trying to compile Det. The window notes that messages with a colored background are generated by the interface, not by the external programs, and reports: Compiling graph Det; Compiling graph DetCompose; Recursion detection started; Resolving <E> conditions; Looking for <E> loops; Looking for infinite recursions; Recursion detection completed; ERROR: Det calls DetCompose that recalls the graph Det]
When you start a pattern search with a .grf graph, if Unitex detects an error during graph compilation, the locate operation is automatically interrupted.
6.3 Contexts
Unitex graphs, as we have described them so far, are equivalent to algebraic grammars. These are also known as context-free grammars, because if you want to match a sequence A, the context of A is irrelevant. Thus, you cannot use a context-free graph for matching occurrences of presiden
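The error reported for Det ("Det calls DetCompose that recalls the graph Det") can be pictured as a cycle search over subgraph calls. The following Python sketch is only an illustration under stated assumptions, not the algorithm used by the Unitex graph compiler: the dictionary `immediate_calls` is a hypothetical input that maps each graph name to the subgraphs it can call before consuming any token, and the function name is made up for this example.

```python
def find_infinite_recursion(immediate_calls, start):
    """Return a chain of graph names that loops back on itself, or None.

    immediate_calls is a hypothetical mapping: graph name -> subgraphs that
    can be called before any input token has been consumed.
    """
    path = []        # current chain of calls, in order
    on_path = set()  # same content as path, for fast membership tests
    done = set()     # graphs already explored without finding a loop

    def visit(graph):
        if graph in on_path:
            # The chain came back to a graph it is still exploring.
            return path[path.index(graph):] + [graph]
        if graph in done:
            return None
        on_path.add(graph)
        path.append(graph)
        for callee in immediate_calls.get(graph, ()):
            cycle = visit(callee)
            if cycle:
                return cycle
        path.pop()
        on_path.remove(graph)
        done.add(graph)
        return None

    return visit(start)


# Assumed call structure mirroring the error message quoted in the text.
calls = {"Det": ["DetCompose"], "DetCompose": ["Det"]}
cycle = find_infinite_recursion(calls, "Det")
if cycle:
    print("ERROR: " + " calls ".join(cycle))
```

A depth-first search is used here because it keeps the current call chain available, which makes it easy to report which graphs are involved, much as the compilation window names both Det and DetCompose.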
390. ut=GRF: .grf file to produce;
• -d DIR/--directory=DIR: inflection directory containing the inflection graphs required to produce morphological variants of roots;
• -a ALPH/--alphabet=ALPH: alphabet file to use;
• -b BIN/--binary=BIN: .bin simple word dictionary to use.
13.6 Cassys
Cassys [OPTIONS] <snt>
This program applies an ordered list of grammars to a text and constructs an index of the occurrences found.
OPTIONS:
• -a ALPH/--alphabet=ALPH: the language alphabet file;
• -r X/--transducer_dir=X: take transducers from directory X, so that you do not have to specify the full path for each transducer (note that X must be (back)slash terminated);
• -l TRANSDUCERS_LIST/--transducers_list=TRANSDUCERS_LIST: the transducers list file with their output policy;
• -s transducer.fst2/--transducer_file=transducer.fst2: a transducer to apply;
• -m output_policy/--transducer_policy=output_policy: the output policy of the transducer specified;
• -t TXT/--text=TXT: the text file to be modified, with extension .snt;
• -i/--in_place: use the same csc/snt directories for each transducer;
• -d/--no_create_directory: all snt/csc directories already exist and do not need to be created;
• -g minus/--negation_operator=minus: uses minus as negation operator for Unitex 2.0 graphs;
• -g tilde/--negation_operator=tilde: uses tilde as negation operator (default);
• -h/--help: display this
391. ve oe de dede ee E OR d 20 EE eu dr tere querer ros p E E ee oe ELA XR ed Roy ah i Opening CINE cecce e Rr ART Ro x oe owe E ew tee e ERS Z0 EU IONE AA oed oe EO A Le Pie E xw 251 Normalization of separators 62425 244224 20h 4 etes OS 252 Spht ng See reso E ore ded ee OS Eee ES RO 2 5 3 Normalization of non ambiguous forms 254 Splutlinga tex CIO tokens ius soc Re eR Rome ERR kom Se de 255 Applying dictionaries 4229299 escasas E Gd Ros 2 5 6 Analysis of compound words in Dutch German Norwegian and Rus O ROTE ee we OP HR cg ep e ure es 11 12 13 14 15 17 17 17 18 18 19 19 20 24 26 27 27 28 28 L5 penne atagged text Lie Luis 8339 M see Dictionaries 3 1 The DELA dictionaries 3 1 1 The DELAF format 3 1 2 The DELAS Format lt 4 ou m t o bande 313 Dictionary Contents lt o ssc RE ds bea a 3 2 Looking up a word in a dictionary 33 CheddogdicHonaty format e sca sarana eem DUE GONDE ca cote m e WHEE e pA RE COR dede 3 5 Automatic inflection lt s bbe ed dee baw Pda 251 Inflection of simple words 4444 445 3 5 2 Inflection of compound words 3 5 3 Inflection of semitic languages AD COMPreSSiON uuu ede Livenet ERO EE ENEE 9 7 Applying dich naries sed boe EE eee EE m Sed o Di de eo oor RD aie Rad 3 72 Application rules for dictionaries 94 9
392. ver have an empty path. This means that the paths of a main graph must not recognize the empty word, but this does not prevent a subgraph of that grammar from recognizing epsilon.
[Figure 6.6: Result of the approximation of a grammar. The compilation window reports: Compiling graph loop; Recursion detection started; Resolving <E> conditions; Looking for <E> loops; Looking for infinite recursions; Recursion detection completed; Compilation has succeeded; Loading loop.fst2; Computing grammar dependencies; Flattening; Cleaning graph; Minimization; Writing grammar; Saving tags; The resulting grammar is an equivalent finite state transducer]
It is not possible to associate a transducer output with a call to a subgraph. Such outputs are ignored by Unitex. It is therefore necessary to use an empty box situated to the left of the call to the subgraph in order to specify the output (cf. Figure 6.7).
[Figure 6.7: How to associate an output with a call to a subgraph. The output DET is ignored on one path but not on the other]
The grammars must not contain void loops, because the Unitex programs cannot terminate the exploration of such a grammar. A void loop is a configuration that causes the Locate program to enter an infinite loop. Void loops can originate from transitions that are labeled by the empty word or fro
393. vernment head of government NC NofNs p h heads of governments head of government NC NofNs p head of government head of government NC NofNs s n n n otaries public notary public NC_NsNs p otary public notary public NC NsNs s otary publics notary public NC_NsNs p rolling stone rolling stone NC XXN s rolling stones rolling stone NC XXN p tudents union student union NC Ns N s tudents unions student union NC Ns N p tudents union student union NC_Ns N s tudents unions student union NC Ns N p tudent union student union NC Ns N s tudent unions student union NC Ns N p Figure 11 10 Inflection graph N1 for En Figure 11 11 Inflection graph N3 for English glish simple words simple words 11 3 INTEGRATION IN UNITEX 225 e g angle of reflection lt Nb n gt Figure 11 12 Inflection graph NC_NXXXX for English MWUs e g advance booking office Figure 11 13 Inflection graph NC_XXXXN for English MWUs sa 0 e g air brake lt Nb n gt Figure 11 14 Inflection graph NC_XXN for English MWUs e g birth date gt sa Figure 11 15 Inflection graph NC_NN_NofN for English MWUs lt lt s1 gt H 2 H 2 lt Nb p gt e g criminal police Figure 11 16 Inflection graph NC XXXinv for English MWUs Hehe HHO lt Nb n gt e g cross roads Figure 11 17
394. [Figure 4.2: Result of a search for the pattern <MOT>]
4.4 Concatenation
There are three ways to concatenate regular expressions. The first consists in using the concatenation operator, which is represented by the dot. Thus, the expression <DET>.<N> recognizes a determiner followed by a noun. The space can also be used for concatenation, as well as the empty string. The following expressions: the <A> cat and the<A>cat recognize the token the, followed by an adjective and the token cat. Parentheses are used as delimiters of a regular expression. All of the following expressions are equivalent: the <A> cat, (the <A>)cat, the.(<A>cat), (the)<A>.cat, (the(<A>))(cat)
4.5 Union
The union of regular expressions is expressed by typing the + character between them. The expression (I+you+he+she+it+we+they) <V> recognizes a pronoun followed by a verb. If an element in an expression is optional, it is sufficient to use the union of this element and the empty word epsilon. Examples: the (little+<E>) cat recognizes the sequences the cat and the little cat; (<E>+Anglo-)(French+Indian) recognizes French, Indian, Anglo-French and Anglo-Indian.
4.6 Kleene star
The Kleene star, represented by the charact
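The way a union with the empty word makes an element optional can be checked by enumerating the sequences an expression recognizes. The Python sketch below is only an illustration and has nothing to do with how Unitex evaluates expressions: the empty word is modelled as an empty string, each union is written as a list of alternatives, and the helper name `concat` is hypothetical.

```python
from itertools import product

EPSILON = ""  # stands in for the empty word of the manual


def concat(*unions):
    """Enumerate the token sequences recognized by a concatenation of unions.

    Each argument is a list of alternatives (one union); alternatives equal
    to EPSILON contribute no token to the resulting sequence.
    """
    sequences = []
    for combination in product(*unions):
        tokens = [token for token in combination if token != EPSILON]
        sequences.append(" ".join(tokens))
    return sequences


# "the", then the union of "little" and the empty word, then "cat"
optional_adjective = concat(["the"], ["little", EPSILON], ["cat"])
print(optional_adjective)
```

Listing the alternatives explicitly keeps the correspondence with the expression visible: one list per union, and one call per concatenation.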
395. ween a name and an adjective which suits This grammar will preserve the correct analysis of sentences like Les personnes de bonne humeur m insupportent Is is however recommended to limit the use of the operator because it harms the legibility of the grammars It is preferable to distinguish the labels which accept various inflectional combinations by means of discriminating subcategories defined in the discr part Figure 7 19 ELAG grammar that verifies gender and number agreement Optional Codes The optional syntactic and semantic codes are declared in the cat part They can be used in ELAG grammars like other codes The difference is that these codes do not intervene to This grammar is not completely correct because it eliminates for example the correct analysis of the sen tence J ai re u des coups de fil de ma m re hallucinants 174 CHAPTER 7 TEXT AUTOMATON decide if a label must be rejected as an invalid one while loading of the text autmaton In fact optional codes are independent of other codes such as for example the attribute of the language level z1 z2 or z3 In the same manner as for inflectional codes it is possible to deny an inflectional attribute by writing the character right before the name of the attribute Thus with our example file the lt A gauche symbol recognizes all adjectives in the feminine which do not have the gauche code All codes which are not declared in the t agset def f
396. which betwixt sun and sun he baptized five hundred heathen Danes and Britons At length the barriers were opened and five knights chosen by lot advanced urse of spectators fixed upon them the five knights advanced up the platform n a champion that could bear down these five knights in one day s jousting 5 et and black the chosen colours of the five knights challengers The cords hed their vow by each of them breaking five lances the Prince was to declare Figure 6 20 Results of the application of the grammar shown on Figure 6 19 To avoid that you can use the special symbol to indicate the end of the left context of the expression you want to match This symbol will be represented by a green star in the graph 126 CHAPTER 6 ADVANCED USE OF GRAPHS as shown on Figure 6 21 The effect of such a context is to use this part of the grammar for computing matches but to ignore it in the results as shown on Figure 6 22 Figure 6 21 Matching a noun after a left context Concordance D My Unitex English Corpus ivanhoe_snticoncord html e courses and cast to the ground three utes to keep at sword s point his three entinels to give the alarm when any one omanlike and bravely 5 Of twenty four started up and bent their bows 5 Six he back of which was decorated with two These two squires were followed by two ber with a grave pace followed by four ake part 5 and being divided into two antagonists 5 I add th
397. wing syntax: $abc.EQUAL=xyz$
This acts like a switch that will block the grammar exploration if the value of variable abc is different from the value of variable xyz. Note that for dictionary variables, it is the inflected form as found in the dictionary (beware of case variations) that is used in the test. If you want to compare variable abc against the constant string JKL, use the following test: $abc.EQUAL=#JKL$
You can also test if contents differ with UNEQUAL. If you want to compare variables so that case variations are ignored, you can use the following tests: $abc.EQUALcC=xyz$ or $abc.UNEQUALcC=xyz$
6.10 Applying graphs to texts
This section only applies to syntactic graphs.
6.10.1 Configuration of the search
In order to apply a graph to a text, open the text, then click on "Locate Pattern..." in the "Text" menu, or press <Ctrl+L>. You can then configure your search in the window shown in figure 6.52. In the "Locate pattern in the form of" field, choose "Graph" and select your graph by clicking on the "Set" button. You can choose a graph in .grf format (Unicode Graphs) or a compiled graph in .fst2 format (Unicode Compiled Graphs). If your graph is a .grf one, Unitex will compile it automatically before starting the search. If you click on "Activate debug mode", the concordance will be displayed in a window in which you will also find the automaton and, for each match, the list of states of the
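The behaviour of the equality tests on variables (EQUAL, UNEQUAL and their case-insensitive variants) can be summarized with a small Python sketch. It is only a model of the semantics described in the text, not Unitex code: the function name and its parameters are hypothetical, and the use of `str.casefold` for ignoring case is an assumption about what "case variations are ignored" means.

```python
def comparison_passes(left, right, ignore_case=False, negate=False):
    """Model of a variable comparison acting as a switch in a grammar.

    left, right: the two values being compared (variable contents or a
                 constant string).
    ignore_case: True for the case-insensitive variants (assumed here to
                 behave like Python's casefold).
    negate:      True for the UNEQUAL variants.
    Returns True when grammar exploration may continue.
    """
    if ignore_case:
        left, right = left.casefold(), right.casefold()
    equal = left == right
    return not equal if negate else equal


# A dictionary variable holds the inflected form, so case matters by default.
print(comparison_passes("Paris", "paris"))                     # exact test
print(comparison_passes("Paris", "paris", ignore_case=True))   # case ignored
print(comparison_passes("Paris", "London", negate=True))       # contents differ
```

Keeping the negation and the case option as independent flags mirrors the four test names, which combine the same two choices.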
398. with this library if not write to the Free Software Foundation Inc 59 Temple Place Suite 330 Boston MA 02111 1307 USA Also add information on how to contact you by electronic and paper mail You should also get your employer if you work as a programmer or your school if any to sign a copyright disclaimer for the library if necessary Here is a sample alter the names Yoyodyne Inc hereby disclaims all copyright interest in the library Frob a library for tweaking knobs written by James Random Hacker lt signature of Ty Coon gt 1 April 1990 Ty Coon President of Vice That s all there is to it 332 CHAPTER 14 FILE FORMATS Appendix B TRE s 2 clause BSD License This is the license copyright notice and disclaimer for TRE a regex matching pack age library and tools with support for approximate matching Copyright c 2001 2009 Ville Laurikari lt vl iki fi gt All rights reserved Redistribution and use in source and binary forms with or without modification are permitted provided that the following conditions are met 1 Redistributions of source code must retain the above copyright notice this list of conditions and the following disclaimer 2 Redistributions in binary form must reproduce the above copyright notice this list of conditions and the following disclaimer in the documentation and or other materials provided with the distribution THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDER AND
399. word present in the morphological dictionaries (see below).
11. You can use patterns that refer to the morphological dictionaries, like have, <V:K>, etc.
12. The meta <PRE>, <NB>, <SDIC> and <CDIC> are forbidden.
13. If you reach the end of the morphological zone and you are not at the end of a token, the match will fail. For instance, if the text contains enabled, you cannot match only enable.
6.4.3 Morphological dictionaries
In morphological mode, you can perform queries using dictionaries. For instance, you can ask for every word made of the prefix un followed by an adjective with the grammar shown on Figure 6.31.
[Figure 6.31: Matching words made of un + adjective ending with able]
However, if we want to match the word unaware with this grammar, we must know that aware is an adjective. But aware may not be present in the text, so we cannot rely on the text dictionaries. This is the reason why we must define a list of dictionaries to look up in morphological mode. To do that, go to Info > Preferences > Morphological dictionaries, as shown on Figure 6.32. You can select as many dictionaries as you want, but they MUST be .bin ones. Once done, you can apply your grammar and get results.
[Figure 6.32: Preferences window for English, "Morphological dictionaries" tab]
400. write the following lines Bb ale elos Das This file is optional If no sorted alphabet file is specified the SortTxt program sorts in the order of the Unicode encoding 143 Graphs This section presents the two graph formats the graphic format grf and the com piled format fst2 14 3 1 Format grf A grf file is a text file that contains presentation information in addition to infor mation representing the contents of the boxes and the transitions of the graph A grf file begins with the following lines Unigraph SIZE 1313 9504 FONT Times New Roman 124 OFONT Times New Roman B 124 BCOLOR 167772154 FCOLOR 04 ACOLOR 126322564 SCOLOR 167116809 CCOLOR 2554 DBOXES y DFRAME y 14 3 yu D F GRAPHS 295 DATE yf DFILE y DIR vgl DRIG nf DRST nf ITS 1004 PORIENT 14 9 The first line Unigraph is a comment line The following lines define the parame ter values of the graph presentation SIZE x y defines the width x and the hight y of a graph in pixels FONT name xyz defines the font used for displaying the contents of the boxes name represents the name of the mode x indicates if the text should be in bold face or not If x is B it indicates that it should be bold For non bold face x should be a space In the same way y has value 1 if the text should be italic a space if not z represents the size of the text OFONT name xyz defines
401. xes to add or to remove to get to an inflected form from a canonical form, and the outputs (text in bold under the boxes) are the inflectional codes to add to a dictionary entry. In our example, two paths are possible. The first does not modify the canonical form and adds the inflectional code s. The second deletes a letter with the L operator, then adds the ux suffix and adds the inflectional code mp. Here are the operators that can be used:
• L (left) removes a letter from the entry.
• R (right) restores a letter to the entry. In French, many verbs of the first group are conjugated in the present singular of the third person form by removing the r of the infinitive and changing the 4th letter from the end to è: peler → pèle, acheter → achète, gérer → gère, etc. Instead of describing an inflectional suffix for each verb (LLLLèle, LLLLète and LLLLère), the R operator can be used to describe it in one way: LLLLèRR.
• C (copy) duplicates a letter in the entry and moves everything on its right by one position. In cases like permitted or hopped, we see a duplication of the final consonant of the verb. To avoid writing an inflectional graph for every possible final consonant, one can use the C operator to duplicate any final consonant.
• D (delete) deletes a letter, shifting anything located on the right of this letter. For instance, if you want to inflect the Romanian word european into euro
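The effect of the L, R, C and D operators can be tried out with a small stack-based model. The Python sketch below is a simplified interpretation written for this illustration, not the Unitex inflection code: it handles only these four operators, it assumes that a literal letter written after some L operators overwrites the most recently removed letter (which is what the peler example requires), and the function name and the "abcd" example are hypothetical.

```python
def apply_inflection(lemma, suffix):
    """Apply a suffix made of L, R, C, D operators and literal letters.

    Simplified model: 'word' holds the letters kept so far and 'removed'
    holds the letters taken off by L, most recent last.
    """
    word = list(lemma)
    removed = []
    for symbol in suffix:
        if symbol == "L":            # remove a letter from the entry
            if word:
                removed.append(word.pop())
        elif symbol == "R":          # restore the most recently removed letter
            if removed:
                word.append(removed.pop())
        elif symbol == "C":          # duplicate the last kept letter
            if word:
                word.append(word[-1])
        elif symbol == "D":          # delete the last kept letter
            if word:
                word.pop()
        else:                        # literal letter (assumed to overwrite)
            if removed:
                removed.pop()
            word.append(symbol)
    return "".join(word)


print(apply_inflection("peler", "LLLLèRR"))    # one suffix for peler, acheter, gérer
print(apply_inflection("acheter", "LLLLèRR"))
print(apply_inflection("hop", "Ced"))          # doubling of the final consonant
print(apply_inflection("abcd", "LDR"))         # made-up example for D
```

Modelling the removed letters as a stack is a design choice that makes R the exact inverse of L, which is what allows a single suffix to serve several verbs whose endings differ.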
402. xt alignment module based on the XAlign tool Chapter 11 describes the compound word inflection module as a complement of the simple word inflection mechanism presented in chapter 3 Chapter 12 describes the CasSys cascade of transducer system Chapter 13 contains a detailed description of the external programs that make up the Unitex system Chapter 14 contains descriptions of all file formats used in the system The reader will find in appendix the LGPL license under which the Unitex source code is released as well as the LGPLLR license which applies for the linguistic data distributed with Unitex There is also the 2 clause BSD licence that applies to the TRE library used by Unitex for morphological filters Unitex contributors Unitex was born as a bet on the power of Open Source philosophy in the academic world seehttp igm univ mlv fr unitex why unitex html relying on the assump tion that people would be interested in sharing their knowledge and skill into such an open project The following list sounds like Open Source is good for science e Olivier Blanc has integrated the ELAG system into Unitex originally designed by Eric Laporte Anne Monceaux and some of their students has also written RebuildTfst previously known as MergeTextAutomaton CONTENTS 15 Matthieu Constant author of Grf2Fst2 Julien Decreton author of the text editor integrated in Unitex has also designed the undo functionality in the graph edi
403. y Unitex during the display of a sentence automaton The Fst 2Grf program constructs a grf file from the text fst2 file that represents a sentence automaton NOTE outputs of graph boxes are used to encode offsets as defined in t fst tags Offsets are separated with spaces For instance here are some lines of the graph representing the first sentence of Ivanhoe Ivanhoe 0 0 0 0 6 0 100 200 2 3 4 Y by bY PART 2 0 0 2 1 0 220 150 2 5 6 Y Iby by PBEPIR S O0 0 2 1 0 220 50 2 56 4 Sir sir N4Hum s 4 0 0 4 2 0 310 200 1 74 14 54 The sentenceN grf file Whenever the user modifies a sentence automaton that automaton is saved under the name sentenceN grf where N represents the number of the sentence Such a graph contains offsets in graph box outputs see note in section 14 5 3 304 CHAPTER 14 FILE FORMATS 14 5 5 The cursentence txt file During the extraction of the sentence automaton the text of the sentence is saved in the file called cursentence txt That file is used by Unitex to display the text of the sentence under the automaton That file contains the text of the sentence followed by a newline 14 5 6 The cursentence tok file During the extraction of the sentence automaton the numbers of the tokens that compose the sentence are stored in a file named cursentence tok This file con tains one line per token each line being made of 2 integers x y x is the token number y is the length of the token i
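The description of cursentence.tok (one line per token, each line made of two integers) suggests a very small parser. The Python sketch below works on text that has already been read and decoded; it is only an illustration under assumptions: the two integers are taken to be separated by whitespace, the file encoding is deliberately left out, the function name is hypothetical, and the sample values are invented.

```python
def parse_token_lines(text):
    """Parse lines of the form 'x y' into (token_number, length) pairs.

    Assumes two whitespace-separated integers per non-empty line.
    """
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate a trailing blank line
        token_number, length = (int(field) for field in line.split())
        pairs.append((token_number, length))
    return pairs


# Invented sample content, for illustration only.
sample = "12 7\n3 1\n45 2\n"
print(parse_token_lines(sample))
```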
404. yun NC_2XN1 N zxiro racyuna zxiro racyuna avio prevozni avio prevozni avio prevozni avio prevozni avio prevoznicye avio prevoznik NC 2XN avio prevozni avio prevozni avio prevoznici avio prevoznik avio prevozni avio prevoznicima avio prevoznik NC 2X ZXiro racyun ZXiro racyun C C 2XN1 N C 2XN1 N C X 1 _2XN1 N Comp p6qm X 1 omp w2qm omp w4qm mp siqm omp s2qm omp s3qm mp s4qm omp s5qm Comp s6qm omp s7qm omp plqm omp p2qm Comp p3qm omp p4qm omp p5qm Comp p6qm Comp p7qm omp w2qm omp w4qm k avio prevoznik NC_2XN2 N Comp slvm ka avio prevoznik ku avio prevoznik ka avio prevoznik kom avio prevozni ku avio prevoznik ka avio prevoznik avio prevozni avio prevoznici avio prevoznik avio prevoznicima avio prevoznik NC 2X avio prevoznicima avio prevoznik NC 2X avio prevoznika avio prevoznik avio prevoznika avio prevoznik avio prevoznik avioprevoznika avio prevoznik NC_2XN2 avioprevozniku avio prevoznik NC_2XN2 avioprevoznika avio prevoznik NC 2XN2 avioprevoznik ke avio prevoznik avioprevoznicye avio prevoznik avioprevoznikom avio prevoznik avioprevozniku avio prevoznik NC_2XN2 avioprevoznici avio prevoznik NC_2XN2 avioprevoznika avio prevoznik NC 2XN2 avioprevoznicima avio prevoznik NC 2XN avioprevoznik avio prevoznik NC 2XN2 avioprevoznici avio prevoznik NC_2XN2 C_2XN2 NC_