Outilex platform Graphical Interface - user guide
Contents
1.
    <attrtype name="collective" type="bool">
      <true alias="Coll"/>
    </attrtype>
    <attrtype name="proper" type="bool">
      <true alias="pr Pr Prenom 2"/>
    </attrtype>
    <pos name="noun" cutename="N">
      <attribute name="subcat" type="nounsubcat" shortcut="yes"/>
      <attribute name="gender" type="gender" shortcut="yes"/>
      <attribute name="number" type="number" shortcut="yes"/>
      <attribute name="proper" type="proper" default="false" shortcut="yes"/>
      <attribute name="coll" type="collective" shortcut="yes"/>
    </pos>

3.3 Operations

3.3.1 Editing a dictionary

To edit a new or existing dictionary, click on the item "new" or "open" in the menu Dictionary. A text editor containing the dictionary then appears in the content part of the UI. After editing the dictionary, you can save it by clicking on "save" or "save as" in the same menu. Dictionaries are saved in UTF-8.

3.3.2 Converting DELA in XML

Converting a DELA dictionary into an XML one requires the file delaf.corresp, which defines the correspondence between the tags used in the DELA dictionary and the tags used in the XML dictionary. This file is located in the directory <language> (e.g. french) of the directory lingdef. Below is a sample of this file:

    POS A adj
    POS
2. boxes filled with text and transitions that relate these boxes. There are two obligatory special boxes: an initial one (here in the form of an arrow) and a final one (in the form of a circle). Graphs are generally read from left to right, from the initial box to the final box. An example is given in figure 4.1. The text of a box is set in an input text field. For instance, the first box on the left of the example graph is defined with the following text:

    lundi+mardi+mercredi+jeudi+vendredi+samedi+dimanche

where the symbol + stands for the union operator of a regular expression and is displayed in the box as a line break. This graph therefore recognizes a text sequence like "mardi prochain". For technical details on graph editing, cf. the Unitex manual.

Figure 4.1: A simple graph

4.2 Subgraphs

A call to a subgraph is made in a graph box by typing the name of the subgraph (without extension), preceded by the symbol :, in the input text field. The subgraph must be in the same directory. Boxes calling a subgraph are greyed to distinguish them from normal boxes, as illustrated in figure 4.2. For instance, the box defined by the input text :subgraph is a call to the graph subgraph.xgrf.

Figure 4.2: A call to a subgraph

4.3 Outputs and weights

To associate an output with a box, use the output text field of the box. As in the input sym
3.
    ="number" value="singular"/>
      </inflected>
      <inflected>
        <form>abaissables</form>
        <feat name="gender" value="masculine"/>
        <feat name="number" value="plural"/>
      </inflected>
      <inflected>
        <form>abaissables</form>
        <feat name="gender" value="feminine"/>
        <feat name="number" value="plural"/>
      </inflected>
    </entry>

To save space, XML dictionaries are compressed; their file extension is .dic.xml.gz.

3.2.2 Lingdef

The Lingdef file defines the tagset that can be used in Outilex XML dictionaries. It is also encoded in XML. This file is located in the directory <language> (e.g. french) of the directory lingdef. Here is an example of the definition of the tagset used for nouns:

    <nouns>
    <attrtype name="nounsubcat" type="enum">
      <value name="pred" alias="Pred"/>
      <value name="conc" alias="Conc concret"/>
      <value name="abst" alias="Abst abstract abs"/>
      <value name="hum" alias="Hum human"/>
      <value name="anl" alias="Anl animal"/>
      <value name="tps" alias="Tps temporal"/>
      <value name="top" alias="Top toponym"/>
      <value name="unit" alias="Unit"/>
      <value name="num" alias="Nnum numeral"/>
      <value name="dnom" alias="Dnom detnom"/>
    </attrtype>
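The role of the alias lists in the nounsubcat definition above can be illustrated with a short Python sketch. It is only an illustration, not part of Outilex: the fragment is a hypothetical, shortened copy of the example, the attribute quoting and the separator between aliases are assumptions (the sketch accepts both commas and whitespace), and the function name build_alias_table is invented for the occasion.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical fragment modelled on the nounsubcat example; the quoting and
# the whitespace-separated alias lists are assumptions.
LINGDEF_FRAGMENT = """
<attrtype name="nounsubcat" type="enum">
  <value name="conc" alias="Conc concret"/>
  <value name="abst" alias="Abst abstract abs"/>
  <value name="hum" alias="Hum human"/>
  <value name="unit" alias="Unit"/>
</attrtype>
"""


def build_alias_table(attrtype_xml):
    """Map every alias (and every canonical name) to its canonical value name."""
    root = ET.fromstring(attrtype_xml)
    table = {}
    for value in root.iter("value"):
        canonical = value.get("name")
        table[canonical] = canonical
        # Accept commas or whitespace between aliases (separator is assumed).
        for alias in re.split(r"[,\s]+", value.get("alias", "")):
            if alias:
                table[alias] = canonical
    return table


if __name__ == "__main__":
    table = build_alias_table(LINGDEF_FRAGMENT)
    print(table["Hum"], table["abs"])
```

With such a table, a DELA code like Hum and a spelled-out tag like human both resolve to the single canonical value hum.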
4. Outilex platform Graphical Interface user guide
Olivier Blanc and Matthieu Constant
Oct 11 2006

Contents

1 Introduction
2 Getting started
    2.1 System Requirements
    2.2 Outilex directory
    2.3 Installation procedure
    2.4 Launch Outilex
        2.4.1 Starting command
        2.4.2 UI general description
        2.4.3 Projects
3 Dictionaries
    3.1 DELA format
    3.2 XML format
        3.2.1 Description
        3.2.2 Lingdef
    3.3 Operations
        3.3.1 Editing a dictionary
        3.3.2 Converting DELA in XML
        3.3.3 Indexing dictionaries
        3.3.4 Add selected dictionary to project
4 Grammars
    4.1 A simple graph
    4.2 Subgraphs
    4.3 Outputs and weights
    4.4 Normalization graphs
    4.5 Decoration graphs
5 Text FSA
6 Text processing
    6.1 Text segmentation
    6.2 Dictionary application
    6.3 Normalize text automaton
    6.4 Grammar applic
5.
    <q id="0" pos="0"><tr lbl="60" to="1"/><tr lbl="61" to="1"/></q>
    <q id="1" pos="0"><tr lbl="49" to="2"/><tr lbl="50" to="2"/><tr lbl="51" to="2"/><tr lbl="52" to="2"/><tr lbl="53" to="2"/></q>
    <q id="2" pos="0"><tr lbl="54" to="3"/><tr lbl="55" to="3"/><tr lbl="56" to="3"/></q>
    <q id="3" pos="0"><tr lbl="62" to="4"/><tr lbl="63" to="4"/><tr lbl="64" to="4"/><tr lbl="65" to="4"/><tr lbl="66" to="4"/></q>
    <q id="4" pos="0"><tr lbl="67" to="5"/><tr lbl="68" to="5"/><tr lbl="69" to="5"/></q>
    <q id="5" pos="0"><tr lbl="70" to="6"/><tr lbl="71" to="6"/><tr lbl="72" to="6"/><tr lbl="73" to="6"/><tr lbl="74" to="6"/><tr lbl="75" to="6"/></q>
    <q id="6" pos="0"><tr lbl="76" to="7"/><tr lbl="77" to="7"/><tr lbl="78" to="7"/></q>
    <q id="7" pos="0"><tr lbl="59" to="8"/></q>
    <q id="8" pos="0" f="1"/>
    </sfsa>
    </text-fsa>
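To make the structure of a sentence automaton more concrete, here is a minimal Python sketch that counts how many distinct analyses (paths from the initial state to a final state) an sfsa element encodes. It is a sketch under stated assumptions, not an Outilex tool: the attribute names lbl, to and f are read off the listing above, the initial state is assumed to have id 0, the automaton is assumed to be acyclic, and both SAMPLE and the function count_analyses are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical three-state automaton in the style of the appendix listing.
SAMPLE = """
<sfsa sz="3">
  <txt>Il a</txt>
  <q id="0" pos="0"><tr lbl="60" to="1"/><tr lbl="61" to="1"/></q>
  <q id="1" pos="0"><tr lbl="49" to="2"/><tr lbl="50" to="2"/><tr lbl="51" to="2"/></q>
  <q id="2" pos="0" f="1"/>
</sfsa>
"""


def count_analyses(sfsa_xml):
    """Count the paths from state 0 to any final state (acyclic automaton assumed)."""
    root = ET.fromstring(sfsa_xml)
    transitions = {}
    finals = set()
    for q in root.iter("q"):
        state = int(q.get("id"))
        transitions[state] = [int(tr.get("to")) for tr in q.iter("tr")]
        if q.get("f") == "1":  # assumed marker of a final state
            finals.add(state)

    memo = {}

    def paths(state):
        # Number of ways to reach a final state from `state`, memoized.
        if state not in memo:
            total = 1 if state in finals else 0
            total += sum(paths(target) for target in transitions.get(state, []))
            memo[state] = total
        return memo[state]

    return paths(0)


if __name__ == "__main__":
    print(count_analyses(SAMPLE))  # 2 * 3 = 6 analyses
```

Because every state of the listing only has transitions to the next state, the number of analyses is the product of the transition counts per state, which shows how quickly lexical ambiguity multiplies.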
6. This program converts a UTF-8 DELA dictionary <dela> (extension .dic) into a compressed XML dictionary <dela>.xml, using the tag correspondence file <corresp>.

7.8 dic-index

    dic-index -validate -ratio <r> <dicofile>

This program compresses the dictionary <dicofile> (extension .dic.xml.gz) into an IDX dictionary. The output file is the name of <dicofile> with the extension idx. For instance, if <dicofile> is dico.dic.xml.gz, the output would be dico.dic.idx.

7.9 make-concord-html

    make-concord-html <concordidx> -left <left size> -right <right size> -o <res> -dontsort

It constructs an HTML concordance from the index concordance file <concordidx>, with a left context of <left size> characters (default 50) and a right context of <right size> characters (default 80). Optionally, the concordance can be put in text order (option dontsort); by default it is sorted. The result is put in the file <res> (default concord.html).

7.10 make-wrtn

    make-wrtn -l <lingdef> <axiom>

This program compiles the grammar defined by the main XGRF graph <axiom> and its subgraphs into a single XML file representing a Weighted Recursive Transition Network (WRTN), with the extension wrtn. It uses the lingdef file <lingdef> to interpret the symbols of the graphs semantically. For instanc
7.
    form><lem>cinq</lem><pos v="det"/><f n="gender" v="m"/><f n="number" v="p"/></lex>
    <lex><form>cinq</form><lem>cinq</lem><pos v="noun"/><f n="subcat" v="num"/><f n="gender" v="m"/><f n="proper" v="false"/><f n="compound" v="false"/></lex>
    <lex><form>marins</form><lem>marins</lem><pos v="lex"/><f n="case" v="min"/></lex>
    </lexic>
    <sfsa sz="19">
    <txt>Cinq marins sont toujours portés disparus, seul un membre de l'équipage a été secouru.</txt>
    <q id="0" pos="0"><tr lbl="0" to="1"/><tr lbl="1" to="1"/><tr lbl="2" to="1"/><tr lbl="3" to="1"/></q>
    <q id="1" pos="0"><tr lbl="4" to="2"/><tr lbl="5" to="2"/><tr lbl="6" to="2"/></q>
    <q id="2" pos="0"><tr lbl="7" to="3"/><tr lbl="8" to="3"/></q>
    <q id="18" pos="0" f="1"/>
    </sfsa>
    </text-fsa>

Chapter 6 Text processing

The Outilex platform allows users to process texts using linguistic resources such as dictionaries and grammars. The left part of the UI lets you build your own processing chain. Processing a text consists of:
• segmenting the text into tokens and sentences (check box "segmentation")
• applying dictionaries on the segmented text and obtaining a text automa
8. ADV adverb
    POS DET det
    POS N noun
    flex f PREPDET DET PRO A N V form gender e feminine
    flex m PREPDET DET PRO A N V form gender e masculine
    flex p PREPDET DET PRO A N V form number e plural
    flex s PREPDET DET PRO A N V form number e singular
    synt d A lemma postpos b true
    synt g A lemma antepos b true

Each line defines a correspondence for a given code used in the DELA. It contains several fields. The first field defines the type of code that is described: POS for part of speech, flex for inflection code, synt for syntactic or semantic code. The line

    POS N noun

indicates that the part of speech N in the DELA dictionary (meaning noun) will be written as noun in the XML one. The line

    flex s PREPDET DET PRO A N V form number e singular

means that s is an inflection code that is only valid for the parts of speech PREPDET, DET, PRO, A, N and V. The sequence "form number e singular" shows that it is defined as an attribute of the inflected form (form), where number is the name of the attribute, e is the type of the attribute (e for enumeration) and singular is the value of the attribute.

To convert a DELA dictionary into an XML dictionary, click on the item "Transcode DELA -> XML" in the menu Dictionary. The selected DELA dictionary is then converted into a compressed XML format and displayed in the content part of the UI.

Program used: delaf2xml.sh (cf. section 7.7)

3.3.3 Indexing dictionaries

To b
9. index file (default to concord.idx)
    -longest-match   keep only longest matching sequences
    -tags            display morpho-syntactic tags
    -tree            display syntactic tree
    -w               display weights of matching sequences
    -m               merge grammar's outputs into concordances
    -all             shortcut for -tags -tree -w -m
    -ipath           keep only one concordance for the same text segment (can be a lot faster for ambiguous grammars)
    -v               verbose mode, for debugging

It applies a compiled wrtn grammar <gram> (extension .wrtn) to a text fsa <txtfsa> and saves the index of matching sequences into a file <outputres> (default to concord.idx), which can be processed by make-concord-html. The available options are described above.

7.3 concordancer2

    concordancer -l <lingdef> -gram <gram> -v -longest-match -tags -tree -w -m -ipath -iout -o <outputres> <txtfsa>

    <txtfsa>         input text fsa
    -l <lingdef>     lingdef to be used
    -gram <gram>     wrtn grammar to apply
    -o <concord>     name of the resulting concordance index file (default to concord.xml)
    -longest-match   keep only longest matching sequences
    -tags            display morpho-syntactic tags
    -tree            display syntactic tree
    -w               display weights of matching sequences
    -m               merge grammar's outputs into concordances
    -all             shortcut for -tags -tree -w -m
    -timeout <s>     specify a maximum amount of time (in seconds) to spend to parse a sentenc
10.
    </lem><pos v="X"/></lex>
    <lex><form>entendu</form><lem>entendu</lem><pos v="adj"/><f n="postpos" v="true"/><f n="antepos" v="false"/><f n="gender" v="m"/><f n="number" v="s"/></lex>
    <lex><form>entendu</form><lem>entendu</lem><pos v="noun"/><f n="subcat" v="hum"/><f n="gender" v="m"/><f n="number" v="s"/><f n="proper" v="false"/><f n="compound" v="false"/></lex>
    <lex><form>entendu</form><lem>entendre</lem><pos v="verb"/><f n="mode" v="ppast"/><f n="gender" v="m"/><f n="number" v="s"/></lex>
    <lex><form>par</form><lem>par</lem><pos v="lex"/><f n="case" v="min"/></lex>
    <lex><form>par</form><lem>par</lem><pos v="noun"/><f n="gender" v="m"/><f n="number" v="s"/><f n="proper" v="false"/><f n="compound" v="false"/></lex>
    <lex><form>par</form><lem>par</lem><pos v="prep"/></lex>
    <lex><form>les</form><lem>les</lem><pos v="lex"/><f n="case" v="min"/></lex>
    <lex><form>les</for
11.
    <f n="number" v="p"/><f n="pers" v="3"/></lex>
    <lex><form>toujours</form><lem>toujours</lem><pos v="lex"/><f n="case" v="min"/></lex>
    <lex><form>toujours</form><lem>toujours</lem><pos v="adv"/></lex>
    <lex><form>portés</form><lem>portés</lem><pos v="lex"/><f n="case" v="min"/></lex>
    <lex><form>portés</form><lem>porté</lem><pos v="adj"/><f n="postpos" v="true"/><f n="antepos" v="false"/><f n="gender" v="m"/><f n="number" v="p"/></lex>
    <lex><form>portés</form><lem>porté</lem><pos v="noun"/><f n="gender" v="m"/><f n="number" v="p"/><f n="proper" v="false"/><f n="compound" v="false"/></lex>
    <lex><form>portés</form><lem>porter</lem><pos v="verb"/><f n="mode" v="ppast"/><f n="gender" v="m"/><f n="number" v="p"/></lex>
    <lex><form>disparus</form><lem>disparus</lem><pos v="lex"/><f n="case" v="min"/></lex>
    <lex><form>disparus</form><lem>disparu</lem><pos v="adj"/><f n="pos
12. lt pos v lex gt 35 36 CHAPTER 8 APPENDICE A TEXT AUTOMATON lt f n case v min gt lt lex gt lt lex gt lt form gt un lt form gt lt lem gt un lt lem gt lt pos v det gt lt f n gender v m gt lt f n number v s gt lt lex gt lt lex gt lt form gt un lt form gt lt lem gt un lt lem gt lt pos v det gt lt f n subcat v ind gt lt f n gender v m gt lt f n number v s gt lt lex gt lt lex gt lt form gt un lt form gt lt lem gt un lt lem gt lt pos v noun gt lt f n subcat v num gt lt f n gender v m gt lt f n proper v false gt lt f n compound v false gt lt lex gt lt lex gt lt form gt membre lt form gt lt lem gt membre lt lem gt lt pos v lex gt lt f n case v min gt lt lex gt lt lex gt lt form gt membr lt form gt lt lem gt membr lt lem gt lt pos v adj gt lt f n postpos v false gt lt f n antepos v false gt lt f n gender v m gt lt f n number v s gt lt lex gt lt lex gt 37 lt form gt membre lt form gt lt lem gt membre lt lem gt lt pos v adj gt lt f n postpos v true gt lt f n antepos v false gt lt f n number lt lex gt lt lex gt v s gt lt form gt membre lt form gt lt lem gt membre lt lem gt lt pos v noun gt lt f n subcat lt f n gender lt f n numb
13.
    subcat" v="tps"/><f n="gender" v="m"/><f n="number" v="s"/><f n="proper" v="false"/><f n="compound" v="false"/></lex>
    <lex><form>été</form><lem>être</lem><pos v="verb"/><f n="mode" v="ppast"/><f n="gender" v="m"/><f n="number" v="s"/></lex>
    <lex><form>secouru</form><lem>secouru</lem><pos v="lex"/><f n="case" v="min"/></lex>
    <lex><form>secouru</form><lem>secourir</lem><pos v="verb"/><f n="mode" v="ppast"/><f n="gender" v="m"/><f n="number" v="s"/></lex>
    <lex><form>.</form><lem>.</lem><pos v="punc"/></lex>
    <lex><form>Il</form><lem>il</lem><pos v="lex"/><f n="case" v="cap"/></lex>
    <lex><form>il</form><lem>il</lem><pos v="pro"/><f n="procat" v="ppv"/><f n="case" v="nom"/><f n="gender" v="m"/><f n="pers" v="3"/><f n="number" v="s"/></lex>
    <lex><form>entendu</form><lem>entendu</lem><pos v="lex"/><f n="case" v="min"/></lex>
    <lex><form>entendu</form><lem>entendu
14. ton representing the possible analyses for each sentence (check box "apply dictionaries")
• normalizing the text fsa by applying normalization graphs (check box "normalize")
• applying a cascade of grammars, in the form of graphs, on the text automaton, resulting in a new text automaton (check box "apply graph cascade")
• applying a grammar to the text automaton to obtain a concordance or a modified text, e.g. an annotated text (check box "locate pattern")

6.1 Text segmentation

The text segmentation process creates a directory associated with the text (<text>.dir) and outputs a segmented text (<text>.segmentation) put in this directory.

6.2 Dictionary application

You must insert and select the dictionaries you need (click on the button "more"; click on the button "less" to close). The process generates a text automaton (<text>.fsa) in the text directory. A copy of it is also made (file <text>.0.fsa).

6.3 Normalize text automaton

This operation must be run after dictionary application and text automaton construction. It applies the graph Norm.xgrf in the directory lingdef/<language>, where <language> is the current language. This process generates a new normalized text automaton (file <text>.norm.fsa).

6.4 Grammar application

You need to define a list of graphs that will be applied in cascade on <text>.fsa. Each iteration j will generate a new text automaton <tex
15.
    9"/></q>
    <q id="9" pos="0"><tr lbl="28" to="10"/><tr lbl="29" to="10"/><tr lbl="30" to="10"/><tr lbl="31" to="10"/><tr lbl="32" to="10"/><tr lbl="33" to="10"/><tr lbl="34" to="10"/><tr lbl="35" to="10"/><tr lbl="36" to="10"/></q>
    <q id="10" pos="0"><tr lbl="37" to="11"/><tr lbl="38" to="11"/><tr lbl="39" to="11"/><tr lbl="40" to="11"/><tr lbl="41" to="11"/></q>
    <q id="11" pos="0"><tr lbl="42" to="12"/><tr lbl="43" to="12"/><tr lbl="44" to="12"/><tr lbl="45" to="12"/></q>
    <q id="12" pos="0"><tr lbl="46" to="13"/></q>
    <q id="13" pos="0"><tr lbl="47" to="14"/><tr lbl="48" to="14"/></q>
    <q id="14" pos="0"><tr lbl="49" to="15"/><tr lbl="50" to="15"/><tr lbl="51" to="15"/><tr lbl="52" to="15"/><tr lbl="53" to="15"/></q>
    <q id="15" pos="0"><tr lbl="54" to="16"/><tr lbl="55" to="16"/><tr lbl="56" to="16"/></q>
    <q id="16" pos="0"><tr lbl="57" to="17"/><tr lbl="58" to="17"/></q>
    <q id="17" pos="0"><tr lbl="59" to="18"/></q>
    <q id="18" pos="0" f="1"/>
    </sfsa>
    <sfsa sz="9">
    <txt>Il a été entendu par les enquêteurs.</txt>
16.
    <f n="subcat" v="hum"/><f n="gender" v="m"/><f n="number" v="p"/><f n="proper" v="false"/><f n="compound" v="false"/></lex>
    </lexic>
    <sfsa sz="19">
    <txt>Cinq marins sont toujours portés disparus, seul un membre de l'équipage a été secouru.</txt>
    <q id="0" pos="0"><tr lbl="0" to="1"/><tr lbl="1" to="1"/><tr lbl="2" to="1"/><tr lbl="3" to="1"/></q>
    <q id="1" pos="0"><tr lbl="4" to="2"/><tr lbl="5" to="2"/><tr lbl="6" to="2"/></q>
    <q id="2" pos="0"><tr lbl="7" to="3"/><tr lbl="8" to="3"/></q>
    <q id="3" pos="0"><tr lbl="9" to="4"/><tr lbl="10" to="4"/></q>
    <q id="4" pos="0"><tr lbl="11" to="5"/><tr lbl="12" to="5"/><tr lbl="13" to="5"/><tr lbl="14" to="5"/></q>
    <q id="5" pos="0"><tr lbl="15" to="6"/><tr lbl="16" to="6"/><tr lbl="17" to="6"/><tr lbl="18" to="6"/><tr lbl="19" to="6"/></q>
    <q id="6" pos="0"><tr lbl="20" to="7"/></q>
    <q id="7" pos="0"><tr lbl="21" to="8"/><tr lbl="22" to="8"/><tr lbl="23" to="8"/></q>
    <q id="8" pos="0"><tr lbl="24" to="9"/><tr lbl="25" to="9"/><tr lbl="26" to="9"/><tr lbl="27" to="
17. ation
    6.5 Locate pattern
7 C++ Programs
    7.1 apply-dic
    7.2 concordancer
    7.3 concordancer2
    7.4 decore-fsa
    7.5 dela-index
    7.6 dic-lookup
    7.7 delaf2xml.sh
    7.8 dic-index
    7.9 make-concord-html
    7.10 make-wrtn
    7.11 tfsa-copy
    7.12 tfsa2dot
    7.13 tokenization
    7.14 transduct-fsa
    7.15 wrtn-flatten
    7.16 u82i
    7.17 u82u16
    7.18 u1628
    7.19 wrtn-txt-transduct
8 Appendice: a text automaton

Chapter 1 Introduction

Outilex is a 4-year research project funded by the French Industry Ministry. It gathers 4 academic institutions and 6 industrial organizations: Centre de l'Energie Atomique (CEA), Institut Gaspard Monge (IGM), Université de Marne-la-Vallée (coordinator), Laboratoire de Psychologie et Neurosciences de la Cognition de l'Université de Rouen, Laboratoire d'Informatique de Paris 6 (LIP6), Langage Communicat
18. bol + can play the role of a line breaker. Weights can also be added, with the following syntax in the output text field:

    <output>/<weight>

where <output> is a string defining the output associated with an input and <weight> is a positive real number. For instance: Noun/3.0. The weight of a path of a graph is the sum of all its weights. The weight is optional: by default, if the weight of an input label is missing, its value is 0. The output text field can then simply be defined by the text <output>.

Footnote (section 4.2): future improvements should allow calls to subgraphs in other directories.

For instance, the graph in figure 4.3 recognizes sequences noun adj. But when a compound noun (noun comp) is also recognized on the same sequence of text, priority is given to this last analysis because it is assigned a weight of 1 (0 for the other).

As a consequence, the symbol / is a reserved symbol that delimits <output> and <weight>. So, when such a symbol is used in the <output> string, the field <weight> is obligatory and the user should put an actual weight or an empty string after the last symbol /. For instance, in figure 4.4, the output of the last box is defined with the following text:

    </date>/

where the symbol / is present at the end because the field <output> contains such a symbol; the associated weight is the default weight 0 because l
19. ducer. Whenever this is not possible, it makes an approximation by limiting the maximum depth to <N>.

7.16 u82i

    u82i

This program converts the UTF-8 encoded input into ASCII output.

7.17 u82u16

    u82u16

This program converts the UTF-8 encoded input into little-endian Unicode output.

7.18 u1628

    u1628

This program converts input encoded in little-endian Unicode into UTF-8 output.

7.19 wrtn-txt-transduct

    wrtn-txt-transduct -l <lingdef> -gram <fst> -ipath -iout -html -txt -m -r -i -o <outputres> <txtfsa>

This program applies the wrtn transducer <fst> to a text fsa <txtfsa> and generates a new text from the original one, depending on the chosen option:
• -m for merging outputs in the text when finding matching sequences
• -r for replacing matching sequences by the associated outputs in the new text
• -i for ignoring outputs (the new text is the original one; useless)
The output text <outputres> can be either in HTML (-html) or in TXT (-txt). The program requires the lingdef file <lingdef>.

Chapter 8 Appendice: a text automaton

    <?xml version="1.0"?>
    <text-fsa sz="2">
    <lexic sz="79">
    <lex><form>Cinq</form><lem>cinq</lem><pos v="lex"/><f n="case" v="cap"/></lex>
    <lex><form>cinq</form><lem>cinq</lem><p
20. e
    -ipath           keep only one concordance for the same text segment (can be a lot faster for ambiguous grammars)
    -v               verbose mode, for debugging

It applies a wrtn grammar to a text fsa and saves the index of matching sequences into a file (default to concord.xml), which can be processed by make-concord-html.

7.4 decore-fsa

    decore-fsa -l <lingdef> -rtn <fst> -v -ipath -iout -o <outputres> <txtfsa>

This program applies a compiled decoration grammar <fst> (extension .wrtn) to the text fsa <txtfsa>. It outputs a new version of <txtfsa>, <outputres>, with new transitions wherever new analyses have been found by applying the grammar <fst>. It requires the lingdef file <lingdef>.

7.5 dela-index

    dela-index <dela> -corresp <corresp> -r <ratio> -o <index>

This program compresses a UTF-8 DELA dictionary <dela> into an IDX dictionary, using the tag correspondence file <corresp>. The output file is <index>.

7.6 dic-lookup

    dic-lookup -imaj -imark -icase -dic <diconame> reg1 reg2 ...

This program searches for the entries of the words reg1, reg2, ... in the dictionary <diconame>. The options are:
• -imaj: ignore case in words but not in dictionaries
• -icase: ignore case in dictionaries and in words
• -imark: ignore diacritics in words and in dictionaries

7.7 delaf2xml.sh

    delaf2xml.sh -c <corresp> <dela>
21. e, if <axiom> is main.xgrf, the output would be main.wrtn.

7.11 tfsa-copy

    tfsa-copy -o <out> -gz -f <oformat> <txtfsa>

This program converts a binary text fsa <txtfsa> into XML if option -f is xml. Option -gz indicates that the result will be gzipped.

7.12 tfsa2dot

    tfsa2dot -l <lingdef> -o <output> <txtfsa> -n <sentenceno>

This program converts the <sentenceno>-th sentence of the text fsa <txtfsa> (extension .fsa) into the file <output>, which describes an automaton in the DOT format, using the lingdef file <lingdef>. By default, the output file is named sentence<sentenceno>.dot.

7.13 tokenization

    tokenization <text>

This program segments a text <text> into tokens, sentences and paragraphs. <text> is the input text file name; the file can be in either TXT or HTML format. The outputs are <text>.segmentation, <text>.tokenization and <text>.postfilter.

7.14 transduct-fsa

    transduct-fsa -l <lingdef> -gram <fst> -lgst -ipath -iout -dontsurf -o <outputres> <txtfsa>

This program applies a normalization wrtn transducer <fst> to a text fsa <txtfsa>. It produces a new text fsa <outputres>.

7.15 wrtn-flatten

    wrtn-flatten <rtn> -maxdepth <N>

This program flattens a compiled grammar <rtn> into a finite-state automaton or trans
22. e commands of the different programs.

2.3 Installation procedure

To install the Outilex platform, follow the steps described below:
• Go to the Outilex directory.
• Run the compilation by typing

    install-outilex

  Warning: you might have to change some environment parameters depending on your local system.
• Set the LINGDEF environment variable by typing

    LINGDEF=OUTILEX_HOME/lingdef/french/lingdef.xml
    export LINGDEF

  where OUTILEX_HOME is the path of the Outilex directory. To avoid repeating this operation every time you want to run the Outilex platform, put these commands in your .bashrc file. Note: this operation is temporary and should be removed in the next versions of the Outilex platform.

2.4 Launch Outilex

2.4.1 Starting command

The Outilex platform User Interface (UI) can be launched by typing

    java -jar outilexUI.jar

2.4.2 UI general description

The UI is composed of:
• a menu on top (tool bar coming soon)
• a process/resource panel on the left, to create a personal processing chain with the available linguistic resources
• a content panel on the right, to display linguistic resources and processing results

Important note: many functionalities can be run via popup menus (right click on the mouse). Double-clicking on a resource in the left panel displays it in the right panel.

2.4.3 Projects

The Outilex platform works with a system of projects. Each project is composed of a
23. e used in the Outilex text processes, the dictionaries must be indexed into binary files that represent the dictionaries in the form of minimized finite-state transducers. These files have the extension idx.

To do so, click on the item "indexing" in the menu Dictionary. The C++ program that is launched is either dic-index for compressed XML dictionaries (cf. section 7.8) or dela-index for DELA dictionaries (cf. section 7.5). Run through the interface, this process produces a file <name>.idx if the input dictionaries are named <name>.dic or <name>.dic.xml.gz.

3.3.4 Add selected dictionary to project

You can add the selected dictionary to your current project by clicking on the item "Add to project" in the menu Dictionary, ONLY IF IT HAS ALREADY BEEN INDEXED.

Chapter 4 Grammars

Grammars used in Outilex are equivalent to Recursive Transition Networks. They take the form of graphs. The symbols they use can be lexical values, lexical masks or calls to subgraphs. They can also contain outputs and weights. The Outilex platform includes a graph editor developed from the Unitex sources. This graph editor generates graphs encoded in XML, with the extension xgrf. Some example graphs are provided with the platform. Temporary: for more details on graph editing, see the Unitex manual (file manuelunitex.pdf).

4.1 A simple graph

Outilex graphs are oriented graphs represented with oriented rectangular
24. er lt f n proper v abst gt ve m y s gt v false gt lt f n compound v false gt lt lex gt lt lex gt lt form gt membre lt form gt lt lem gt membre lt lem gt lt pos v noun gt lt f lt f n subcat n gender lt f n number lt f n proper lt f lt lex gt lt lex gt v conc gt v m gt v s gt v false gt n compound v false gt lt form gt membre lt form gt lt lem gt membre lt lem gt lt pos v noun gt lt f n subcat lt f n gender lt f n number lt f n proper lt f n lt lex gt lt lex gt v hum gt v m gt v s gt v false gt compound v false gt 38 CHAPTER 8 APPENDICE A TEXT AUTOMATON lt form gt membr lt form gt lt lem gt membrer lt lem gt lt pos v verb gt lt f n mode v ppast gt lt f n gender v m gt lt f n number v s gt lt lex gt lt lex gt lt form gt membre lt form gt lt lem gt membrer lt lem gt lt pos v verb gt lt f n mode v ind subj gt lt f n number v s gt lt f n pers v 1 3 gt lt lex gt lt lex gt lt form gt membre lt form gt lt lem gt membrer lt lem gt lt pos v verb gt lt f n mode v imp gt lt f n number v s gt lt f n pers v 2 gt lt lex gt lt lex gt lt form gt de lt form gt lt lem gt de lt lem gt lt pos v lex gt lt f n case v mi
25. ion Information (LCI), Lingway, LORIA, Systran, Thales Com, Thales R&D. Started in October 2002, this project aims at developing a platform devoted to Natural Language Processing.

Chapter 2 Getting started

2.1 System Requirements

To compile the C++ programs you need:
• compiler gcc/g++ with version 4
• tool jam from Perforce
• library libxml (standard XML library)
• library ICU from IBM
• library C++ Boost (www.boost.org) compiled with ICU

To compile the Systran C++ tokenization module, you need an old version of flex (2.5.4, package flex-old). Newer versions do not work.

To run the User Interface, you need the Java Runtime Environment 1.5.

2.2 Outilex directory

The Outilex directory includes the following files and directories:
• file README.txt, to get started
• file install-outilex, script to compile and install the C++ programs
• file outilexUI.jar, to run the user interface
• file clean-outilex, script to clean the compiled programs
• directory bin: compiled C++ programs
• directory data: some linguistic data provided with the platform
• directory docs: documentation
• directory lingdef: linguistic definitions of the set of tags used in dictionaries and graphs

This directory also contains a log file, outilex.log, that includes all the commands that have been launched from the interface. This can help users get used to the syntax of th
26. ities must be put just after the opening square brackets. Attributes can also be attached to this part of speech by adding outputs with the following syntax: <attribute> (e.g. Npr means proper name in figure 4.6). For instance, the graph in figure 4.6 recognizes named entities, which are tagged N Npr.

A complex entity can inherit attributes from its elements, e.g. lemma, mode, gender and so on. This can be indicated in the decoration graph as an output with the following syntax: <attribute>. Such an example is shown in figure 4.7: when recognized in a text, the pattern <avoir verb> <verb ppast> is analysed as a CV (verbal complex) whose mode is the mode of avoir and whose lemma is that of the verb in the past participle. For instance, the sequence "a lu" would be analysed as a CV whose mode is ind (for indicative) and whose lemma is lire (the verb avoir followed by a verb in the past participle).

Chapter 5 Text FSA

The Text FSA is used to represent text ambiguity. For each sentence, there is an automaton that represents its possible analyses. Grammar application programs all use it as input. Take a text containing two sentences:

    Cinq marins sont toujours portés disparus, seul un membre de l'équipage a été secouru. Il a été entendu par les enquêteurs.

After application of the French dictionary provided for the project by IGM, the text is represented as a set of two a
27. m><lem>le</lem><pos v="det"/><f n="subcat" v="def"/><f n="gender" v="f"/><f n="number" v="p"/></lex>
<lex><form>les</form><lem>le</lem><pos v="det"/><f n="subcat" v="def"/><f n="gender" v="m"/><f n="number" v="p"/></lex>
<lex><form>l s</form><lem>1</lem><pos v="noun"/><f n="subcat" v="conc"/><f n="gender" v="m"/><f n="number" v="p"/><f n="proper" v="false"/><f n="compound" v="false"/></lex>
<lex><form>1 s</form><lem>1 s</lem><pos v="prep"/></lex>
<lex><form>les</form><lem>le</lem><pos v="pro"/><f n="procat" v="ppv"/><f n="case" v="acc"/><f n="pers" v="3"/><f n="number" v="p"/></lex>
<lex><form>enquêteurs</form><lem>enquêteurs</lem><pos v="lex"/><f n="case" v="min"/></lex>
<lex><form>enquêteurs</form><lem>enquêteur</lem><pos v="adj"/><f n="postpos" v="true"/><f n="antepos" v="false"/><f n="gender" v="m"/><f n="number" v="p"/></lex>
<lex><form>enquêteurs</form><lem>enquêteur</lem><pos v="noun"/>
28. ma is considered to be the same as the inflected form (e.g. car).
• The third element is the part of speech and is obligatory (e.g. ADV for adverb, N for noun).
• The elements following the symbol + are syntactic and semantic information and are optional (e.g. Conc for concrete, Npr for proper name, Hum for human).
• The character sequence following the symbol : is a set of morphological information and is optional; each character stands for a piece of information (e.g. P3s stands for present (P), third person (3), singular (s)).
• The sequence following the symbol / is an optional comment (e.g. this is an example).

The tagset is free as long as the writer follows the syntax defined above. These editable dictionaries are encoded in UTF-8 and their file extension is .dic.

3.2 XML format

3.2.1 Description

Outilex uses a UTF-8 XML exchange format (file extension .dic.xml). The tagset is defined in the lingdef file (cf. section 3.2.2). All tags used must be defined in the lingdef file. Here is an example of an entry:

<entry>
<lemma>abaissable</lemma>
<pos name="adj"/>
<inflected>
<form>abaissable</form>
<feat name="gender" value="masculine"/>
<feat name="number" value="singular"/>
</inflected>
<inflected>
<form>abaissable</form>
<feat name="gender" value="feminine"/>
<feat name
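The DELA entry syntax described in this fragment can be illustrated with a small parser. This is only a sketch, not part of Outilex: it assumes the usual DELA punctuation (comma after the inflected form, period before the part of speech, + before each syntactic or semantic trait, : before each inflection code, / before the comment), and it ignores escaped characters. The names DelaEntry and parse_dela_line are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DelaEntry:
    """One DELA line, split into its fields (hypothetical helper type)."""
    form: str
    lemma: str
    pos: str
    traits: list = field(default_factory=list)
    inflections: list = field(default_factory=list)
    comment: str = ""


def parse_dela_line(line):
    """Parse 'form,lemma.POS+trait:infl/comment' (no escape handling)."""
    body, _, comment = line.rstrip("\n").partition("/")
    form, comma, rest = body.partition(",")
    if not comma or not form:
        raise ValueError("missing inflected form or ',' in: " + line)
    lemma, dot, tags = rest.partition(".")
    if not dot:
        raise ValueError("missing '.' before part of speech in: " + line)
    grammar, *inflections = tags.split(":")
    pos, *traits = grammar.split("+")
    if not pos:
        raise ValueError("missing part of speech in: " + line)
    # An empty lemma means the lemma is the inflected form itself.
    return DelaEntry(form, lemma or form, pos, traits, inflections, comment)


if __name__ == "__main__":
    for sample in ("car,.N+Conc:s/this is an example",
                   "eats,eat.V:P3s",
                   "Tony Blair,.N+Npr+Hum:ms",
                   "sincerely,.ADV"):
        print(parse_dela_line(sample))
```

Splitting on the first comma and the first period keeps the sketch short; a real reader would also need to handle forms that contain those characters.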
29. (Chapter 8, Appendix A: Text automaton)
n></lex>
<lex><form>de</form><lem>de</lem><pos v="XI"/></lex>
<lex><form>de</form><lem>de</lem><pos v="det"/><f n="subcat" v="ind"/></lex>
<lex><form>d</form><lem>d</lem><pos v="noun"/><f n="subcat" v="conc"/><f n="gender" v="m"/><f n="number" v="s"/><f n="proper" v="false"/><f n="compound" v="false"/></lex>
<lex><form>de</form><lem>de</lem><pos v="prep"/></lex>
<lex><form>l</form><lem>l</lem><pos v="lex"/><f n="case" v="min"/></lex>
<lex><form>l</form><lem>le</lem><pos v="det"/><f n="subcat" v="def"/><f n="gender" v="f"/><f n="number" v="s"/></lex>
<lex><form>l</form><lem>le</lem><pos v="det"/><f n="subcat" v="def"/><f n="gender" v="m"/><f n="number" v="s"/></lex>
<lex><form>l</form><lem>l</lem><pos v="noun"/><f n="gender" v="m"/><f n="proper" v="false"/><f n="compound" v="false"/></lex>
<lex><form></form><lem>
30. os v="det"/><f n="gender" v="f"/><f n="number" v="p"/></lex>
<lex><form>cinq</form><lem>cinq</lem><pos v="det"/><f n="gender" v="m"/><f n="number" v="p"/></lex>
<lex><form>cinq</form><lem>cinq</lem><pos v="noun"/><f n="subcat" v="num"/><f n="gender" v="m"/><f n="proper" v="false"/><f n="compound" v="false"/></lex>
<lex><form>marins</form><lem>marins</lem><pos v="lex"/><f n="case" v="min"/></lex>
<lex><form>marins</form><lem>marin</lem><pos v="adj"/><f n="postpos" v="true"/><f n="antepos" v="false"/><f n="gender" v="m"/><f n="number" v="p"/></lex>
<lex><form>marins</form><lem>marin</lem><pos v="noun"/><f n="subcat" v="hum"/><f n="gender" v="m"/><f n="number" v="p"/><f n="proper" v="false"/><f n="compound" v="false"/></lex>
<lex><form>sont</form><lem>sont</lem><pos v="lex"/><f n="case" v="min"/></lex>
<lex><form>sont</form><lem>être</lem><pos v="verb"/><f n="mode" v="ind"/>
31. set of resources (texts, dictionaries and grammars). The Project menu allows users to create, open and save projects. A project is associated with one language. This language selects a linguistic definition file. For example, if french is the project language, the set of linguistic tags used in processing is defined in the file lingdef/french/lingdef.xml.

Chapter 3 Dictionaries

Dictionaries are sets of lexical entries associated with morphological, syntactic and semantic information. Lexical entries are either simple words or multiword units. The Outilex platform allows users to edit their own dictionaries. There are several formats:
• an editable textual format (DELA format), encoded in UTF-8 (extension .dic)
• an exchange format in XML (extension .dic.xml.gz)
• a binary format used by programs (extension .idx)

The different operations on dictionaries are gathered in the Dictionary menu of the platform.

3.1 DELA format

The DELA format has been defined in

Syntax of an entry

An entry is defined on a single line, as shown in the following examples:

car,.N+Conc:s/this is an example
eats,eat.V:P3s
Tony Blair,.N+Npr+Hum:ms
sincerely,.ADV

• The first element is the inflected form and is obligatory (car and eats).
• The second element, between the symbols , and ., is the lemma and is optional (e.g. eat). If it is not present, the lem
32. </lem><pos v="punc"/></lex>
<lex><form>équipage</form><lem>équipage</lem><pos v="lex"/><f n="case" v="min"/></lex>
<lex><form>équipage</form><lem>équipage</lem><pos v="noun"/><f n="subcat" v="conc"/><f n="gender" v="m"/><f n="number" v="s"/><f n="proper" v="false"/><f n="compound" v="false"/></lex>
<lex><form>a</form><lem>a</lem><pos v="lex"/><f n="case" v="min"/></lex>
<lex><form>a</form><lem>a</lem><pos v="XI"/></lex>
<lex><form>a</form><lem>a</lem><pos v="noun"/><f n="gender" v="m"/><f n="proper" v="false"/><f n="compound" v="false"/></lex>
<lex><form>a</form><lem>a</lem><pos v="prep"/></lex>
<lex><form>a</form><lem>avoir</lem><pos v="verb"/><f n="mode" v="ind"/><f n="number" v="s"/><f n="pers" v="3"/></lex>
<lex><form>été</form><lem>été</lem><pos v="lex"/><f n="case" v="min"/></lex>
<lex><form>été</form><lem>été</lem><pos v="noun"/><f n
33. t> j fsa. The final automaton is <text> final fsa. A copy of it is made in the file <text>.fsa. <text>.fsa is actually the current text automaton to be processed.

6.5 Locate pattern

You need to select a graph to be applied and the type of result you want.

Chapter 7 C++ Programs

The Outilex platform is made of a set of independent C++ programs. This chapter defines their different prototypes.

7.1 apply dic

apply dic dic <dic1> <prio1> dic <dic2> <prio2> imaj icase imark l <lingdef> o <out> <tokfile>

This program applies a set of dictionaries <dicj> (extension .idx) with different priorities <prioj> (real numbers; 10 by default) to a segmented text <tokfile>. It outputs a text fsa <out> (by default, the name of the text file with the extension .fsa). It uses a lingdef file <lingdef>. The options are:
• imaj: ignore case in texts but not in dictionaries
• icase: ignore case in dictionaries and in texts
• imark: ignore diacritics in texts and in dictionaries

7.2 concordancer

concordancer l <lingdef> gram <gram> v longest match tags tree w m ipath iout o <outputres> <txtfsa>

with options:
• <txtfsa>: input text fsa
• gram <gram>: wrtn grammar to apply
• o <concord>: name of the resulting concordance
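The three matching options listed for apply dic (imaj, icase, imark) can be pictured as different normalizations applied before comparing a text token with a dictionary form. The sketch below is an illustration only, not the program's actual implementation: the function names are hypothetical, and the reading of imaj (a token may be lowercased to match an entry, but the case of dictionary entries is kept as written) is an assumption based on the one-line description above.

```python
import unicodedata


def strip_marks(text):
    """Remove diacritics by dropping combining marks after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def token_matches(token, entry, mode=None):
    """Compare a text token with a dictionary form under one assumed option."""
    if mode is None:
        return token == entry
    if mode == "imaj":
        # Assumed reading: case is ignored on the text side only.
        return token == entry or token.lower() == entry
    if mode == "icase":
        # Case is ignored on both sides.
        return token.lower() == entry.lower()
    if mode == "imark":
        # Diacritics are ignored on both sides.
        return strip_marks(token) == strip_marks(entry)
    raise ValueError("unknown mode: " + mode)


if __name__ == "__main__":
    print(token_matches("Car", "car", "imaj"))      # text-side case ignored
    print(token_matches("paris", "Paris", "imaj"))  # dictionary case kept
    print(token_matches("paris", "Paris", "icase"))
    print(token_matches("ete", "été", "imark"))
```

Under this reading, imaj lets a capitalized sentence-initial word reach a lowercase entry while still keeping proper-name entries distinct from common words.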
34. t weight> is the empty string.

Figure 4.3: Use of weights

Figure 4.4: Special outputs

4.4 Normalization graphs

A normalization graph is a graph that, when applied to a text automaton, normalizes some sequences like de, as shown in the following example.

Figure 4.5: Normalization graph

4.5 Decoration graphs

A decoration graph is a graph that, when applied to a text automaton, adds new transitions corresponding to new analyses to the initial text automaton. Below are two examples.

Figure 4.6: Recognition of named entities

Figure 4.7: Recognition of verbal complexes

These graphs have a special format. Linguistic entities that have to be analysed must be delimited in the graph by square brackets in the output, as in the graphs shown above. The part of speech (e.g. N for noun in figure 4.6, CV for verbal complex in figure 4.7, PPV for preverbal pronoun in figure 4.7) to be assigned to these ent
35. tpos" v="true"/><f n="antepos" v="false"/><f n="gender" v="m"/><f n="number" v="p"/></lex>
<lex><form>disparus</form><lem>disparu</lem><pos v="noun"/><f n="subcat" v="hum"/><f n="gender" v="m"/><f n="number" v="p"/><f n="proper" v="false"/><f n="compound" v="false"/></lex>
<lex><form>disparus</form><lem>disparaitre</lem><pos v="verb"/><f n="mode" v="ppast"/><f n="gender" v="m"/><f n="number" v="p"/></lex>
<lex><form>disparus</form><lem>disparaitre</lem><pos v="verb"/><f n="mode" v="ind"/><f n="number" v="s"/><f n="pers" v="[illegible]"/></lex>
<lex><form></form><lem></lem><pos v="punc"/></lex>
<lex><form>seul</form><lem>seul</lem><pos v="lex"/><f n="case" v="min"/></lex>
<lex><form>seul</form><lem>seul</lem><pos v="adj"/><f n="postpos" v="true"/><f n="antepos" v="false"/><f n="gender" v="m"/><f n="number" v="s"/></lex>
<lex><form>seul</form><lem>seul</lem><pos v="adv"/></lex>
<lex><form>un</form><lem>un</lem>
36. utomata, each automaton being associated with a sentence. A graphical example of the second sentence is given in figure 5.1.

Figure 5.1: A sentence automaton

By default, the resulting text automaton is a binary file. Nevertheless, Outilex provides a converter into an XML file. Extracts of such a file are provided below (tfsa copy). The complete file is given in the appendix. The element text fsa contains first an element lexic and then a set of elements sfsa. lexic includes the set of lexical entries that will be used in the labels of the automata transitions. Lexical entries are represented in the same way as defined in the lingdef file. Elements sfsa represent sentence automata, which are defined as a set of states q from which some transitions tr start.

<?xml version="1.0"?>
<text fsa sz="2">
<lexic sz="79">
<lex><form>Cinq</form><lem>cinq</lem><pos v="lex"/><f n="case" v="cap"/></lex>
<lex><form>cinq</form><lem>cinq</lem><pos v="det"/><f n="gender" v="f"/><f n="number" v="p"/></lex>
<lex><form>cinq<
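The lexic part of the XML text automaton described in this fragment can be read with a standard XML parser. The following is a minimal sketch under stated assumptions, not the converter shipped with Outilex: the root element name text-fsa, the quoting of attribute values and the self-closing pos and f elements are guesses made for the sample, and read_lexic is a hypothetical helper. Only the lex, form, lem, pos and f names and the n and v attributes come from the extract.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written sample; the root name and attribute quoting are assumed.
SAMPLE = """<?xml version="1.0"?>
<text-fsa sz="2">
  <lexic sz="2">
    <lex><form>Cinq</form><lem>cinq</lem><pos v="lex"/><f n="case" v="cap"/></lex>
    <lex><form>cinq</form><lem>cinq</lem><pos v="det"/><f n="gender" v="f"/><f n="number" v="p"/></lex>
  </lexic>
</text-fsa>"""


def read_lexic(xml_text):
    """Return the lexical entries of the lexic element as plain dictionaries."""
    root = ET.fromstring(xml_text)
    lexic = root.find("lexic")
    entries = []
    for lex in lexic.findall("lex"):
        entries.append({
            "form": lex.findtext("form"),
            "lemma": lex.findtext("lem"),
            "pos": lex.find("pos").get("v"),
            # Each f element carries one feature as a name/value pair.
            "feats": {f.get("n"): f.get("v") for f in lex.findall("f")},
        })
    return entries


if __name__ == "__main__":
    for entry in read_lexic(SAMPLE):
        print(entry)
```

Keeping features in a dictionary keyed by name makes it easy to compare two analyses of the same form, such as the feminine and masculine readings of a determiner.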