

1. Trento, Italy, 1992.
Kenneth W. Church. A stochastic parts program and noun phrase parser for unrestricted text. In IEEE Proceedings of the ICASSP, pages 695-698, Glasgow, 1989.
Steve DeRose. Grammatical category disambiguation by statistical optimization. Computational Linguistics, 14(1):31-39, 1988.
Alain Duval et al. Robert Encyclopedic Dictionary (CD-ROM). Hachette, Paris, 1992.
Judith Klavans and Evelyne Tzoukermann. Dictionaries and corpora: Combining corpus and machine-readable dictionary data for building bilingual lexicons. Computational Linguistics, to appear (under review).
S. Klein and R. F. Simmons. A grammatical approach to grammatical tagging coding of English words. JACM, 10:334-347, 1963.
Geoffrey Leech, Roger Garside, and Erik Atwell. Automatic grammatical tagging of the LOB corpus. ICAME News, 7:13-33, 1983.
Bernard Merialdo. Tagging English text with a probabilistic model. Computational Linguistics, 20(2):155-172, 1994.
D. S. Moore and G. P. McCabe. Introduction to the Practice of Statistics. W. H. Freeman, New York, 1989.
O. Soumoy, E. Tzoukermann, and J. P. H. van Santen. Duration in French text-to-speech synthesis. Technical Memorandum 11222-941202-18TM, AT&T Bell Laboratories, Murray Hill, N.J., USA, 1994.
Evelyne Tzoukermann and Mark Y. Liberman. A finite-state morphological processor for Spanish. In Proceedings of the 13th International Conference on Computational Linguistics, Helsinki, Fi
2. • Word splitting: when all preprocessing has completed, the corpus is split into words and translated from 8-bit characters to 7-bit ASCII characters. Accents are expressed by diacritic symbols that follow the unaccented letter. For example, être is represented as e, followed by the diacritic symbol for the circumflex, followed by tre.

5.2 Morphological processing

We use finite-state transducers (FST) for the morphological analysis. Our FST is built on the model developed for Spanish morphology [14] and handles mainly inflectional morphology, as well as some derivational affixes such as anti- (anti-) in anti-iranien (anti-Iranian) and arrière- (great-) in arrière-grand-père (great-grandfather). The finite-state dictionary was originally built using the Robert Encyclopedic Dictionary [7] and is extended through acquisition of proper nouns from unrestricted texts. The FST used in the morphological stage of the tagger can consist of up to 4 distinct sub-FSTs:

1. main (non-proper-noun) FST
2. proper noun FST compiled from various sources
3. proper noun FST dynamically generated from the training corpus
4. proper noun FST generated heuristically from the current test corpus

Nearly complete conjugations for French verbs are included in the main FST.

5.3 Tagset choice and hand tagging

We believed that a flexible tagset would be of benefit for the diverse applications that could make use of the tagger. Thus we h
3. président, est and couvent have a different pronunciation when they are an inflected verb or a noun. Also, at the duration level, studies such as [2] and [13] have shown that the duration of function words tends to be shorter than that of non-function words; therefore a part-of-speech tagger can help find these function words.
• Querying tagged corpora can be very useful for studying collocations or bilingual correspondences [8]. For example, in [8] a tagger for English 44 is utilized to disambiguate English text in order to determine verbs and non-verbs. As the study is focused on correspondences between French and English motion verbs, the tagger marks the English verbs so that the corresponding French sentence is selected as a candidate for the analysis of bilingual correspondence.

Table 7: French sentence with pronunciation varying with the part of speech (rows: sentence, p.o.s., pronunciation, translations).

9 Conclusion

We described a part-of-speech tagger that correctly tags over 91% of unrestricted text with a very small amount of training data. When the correct
4. verb 3rd person singular imperfect indicative verb 3rd person singular simple past indicative verb 3rd person singular imperfect subjunctive verb infinitive numeral demonstrative pronoun demonstrative pronoun feminine plural demonstrative pronoun feminine singular demonstrative pronoun masculine plural demonstrative pronoun masculine singular demonstrative adjective feminine plural demonstrative adjective feminine singular demonstrative adjective masculine plural demonstrative adjective masculine singular demonstrative adjective plural demonstrative adjective singular possessive adjective feminine plural possessive adjective feminine singular possessive adjective masculine plural possessive adjective masculine singular possessive adjective plural possessive adjective singular punctuation acronym sentence beginning sentence end NIL ERROR

References

Lalit R. Bahl and Robert L. Mercer. Part of speech assignment by a statistical decision algorithm. IEEE International Symposium on Information Theory, pages 88-89, 1976.
K. Bartkova and C. Sorin. A model of segmental duration for speech synthesis in French. Speech Communication, 6:245-260, 1987.
G. E. P. Box and G. C. Tiao. Bayesian Inference in Statistical Analysis. Addison-Wesley, Reading, Mass., 1973.
Eric Brill. A simple rule-based part of speech tagger. In Third Conference on Applied Computational Linguistics
5. The strength score assumes that f results from a binomial distribution B(p, n). This is the distribution which results when n independent trials are made, each having probability p of the decision and probability 1 - p of any other member of the tag genotype. We do not know p, but must make an estimate from the data. When p is estimated as the proportion f/n of the decision in the tag genotype, then the theory of the binomial distribution ([12], page 398) gives

$$\mathrm{sd}(\hat{p}) = \sqrt{p(1-p)/n}.$$

We take

$$\hat{p} = \frac{f + 0.5}{n + 1},$$

so that neither $\hat{p}$ nor $1 - \hat{p}$ will be zero. This procedure is explained in [3], pages 34-36. We can estimate the uncertainty of $\hat{p}$ by $\sqrt{\hat{p}(1-\hat{p})/n}$. We estimate p and we use the strength

$$\mathrm{strength} = \left(\hat{p} - \sqrt{\hat{p}(1-\hat{p})/n}\right) \times 100$$

for the decision. This score represents our estimate of the probability less our estimate of the uncertainty. Notice in the above table that 25/25 has a lower strength than 30/30, which in turn has a lower strength than 82/82. The strength measure is designed to give lower values for the same f/n the smaller n is. Several examples of genotype decisions obtained through statistical means are shown in Table 5.

5.6 Application of the genotype resolutions

We do not necessarily want to use all genotype decisions. One can observe that by varying the number of decisions made on a genotype basis, we can obtain significantly different results. Therefore we have esta
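The strength score described above can be sketched in a few lines. This is only an illustration in Python (the toolkit itself is written in Perl and shell script); the function name `strength` is hypothetical, and the formula follows the reconstruction p = (f + 0.5) / (n + 1) with uncertainty sqrt(p(1 - p) / n).

```python
import math


def strength(f, n):
    """Strength of a genotype decision seen f times out of n occurrences.

    Hypothetical helper: estimated probability minus estimated
    uncertainty, expressed on a 0-100 scale.
    """
    # Smoothed estimate, so that neither p nor 1 - p is ever zero.
    p = (f + 0.5) / (n + 1)
    # Binomial standard deviation used as the uncertainty of the estimate.
    uncertainty = math.sqrt(p * (1 - p) / n)
    return 100 * (p - uncertainty)


# Same proportion f/n = 1, increasing n: the strength increases with n.
scores = [strength(25, 25), strength(30, 30), strength(82, 82)]
```

With these definitions, a decision observed 25 times out of 25 scores lower than one observed 30 out of 30, which in turn scores lower than 82 out of 82, matching the ordering stated in the text.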
6. about the part of speech as well as some morphological features such as mood, tense, person, gender, and number. Each tag is actually an acronym carrying morphological information. Example: V refers to verbs in general, V3S refers to third person singular verbs of any mood or tense, V3SPI refers to third person singular verbs in the present indicative, and V S refers to all singular verbs. This terminology has several advantages. When negative constraints are applied, they can be invoked at several levels of the tag using all the available combinations: in the above example, a rule can apply to the part of speech (p.o.s.) only, the p.o.s. and the number, the p.o.s. and the person, the p.o.s. and the tense, or the p.o.s. with its mood, tense, person, and number.
• The accuracy function c(TA) refers to the accuracy of the current tag assignment TA when compared to the correct tag assignment.
• The inaccuracy function i(TA) refers to the percentage of incorrect tags in a given TA.
• The ambiguity function a(TA) refers to the percentage of words that are still ambiguous in a given TA.
If TA has 1000 words, and 700 of them are tagged correctly in TA, 10 of them are tagged incorrectly, and the remaining 290 are still ambiguous at this stage, then c(TA) = 70.0%, i(TA) = 1.0%, and a(TA) = 29.0%, so the three values sum to 100%.
• A genotype is the list of all possible tags that a given word can inherit from the mo
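As a small illustration of the accuracy, inaccuracy, and ambiguity functions defined above, the following Python sketch computes all three percentages for a tag assignment. The function name `assignment_stats` and the data layout (a list of word and tag-set pairs, compared with a list of word and correct-tag pairs) are assumptions made for this example, not the toolkit's own format.

```python
def assignment_stats(current, correct):
    """Return (accuracy, inaccuracy, ambiguity) as percentages.

    current: list of (word, set of tags) for the tag assignment under test
    correct: list of (word, tag) for the manually tagged reference
    """
    n_correct = n_incorrect = n_ambiguous = 0
    for (_, tags), (_, gold) in zip(current, correct):
        if len(tags) > 1:
            n_ambiguous += 1      # more than one tag left: still ambiguous
        elif tags == {gold}:
            n_correct += 1        # exactly one tag, and it is the right one
        else:
            n_incorrect += 1      # one wrong tag (or no tag at all)
    total = len(current)
    return tuple(100.0 * k / total
                 for k in (n_correct, n_incorrect, n_ambiguous))


# The 1000-word example from the text: 700 correct, 10 incorrect, 290 ambiguous.
gold = [("w", "NFS")] * 1000
ta = ([("w", {"NFS"})] * 700
      + [("w", {"V3SPI"})] * 10
      + [("w", {"NFS", "V3SPI"})] * 290)
c, i, a = assignment_stats(ta, gold)
```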
7. conditional auxiliary 2nd person plural present subjunctive auxiliary 2nd person plural future indicative auxiliary 2nd person plural imperfect indicative auxiliary 2nd person plural simple past indicative auxiliary 2nd person plural imperfect subjunctive auxiliary 3rd person plural present indicative auxiliary 3rd person plural present conditional auxiliary 3rd person plural present subjunctive auxiliary 3rd person plural future indicative auxiliary 3rd person plural imperfect indicative auxiliary 3rd person plural simple past indicative auxiliary 3rd person plural imperfect subjunctive auxiliary 1st person singular present indicative auxiliary 1st person singular present imperative auxiliary 1st person singular present conditional auxiliary 1st person singular present subjunctive auxiliary 1st person singular future indicative auxiliary 1st person singular imperfect indicative auxiliary 1st person singular simple past indicative auxiliary 1st person singular imperfect subjunctive auxiliary 2nd person singular present indicative auxiliary 2nd person singular present imperative auxiliary 2nd person singular present conditional auxiliary 2nd person singular present subjunctive auxiliary 2nd person singular future indicative auxiliary 2nd person singular imperfect indicative auxiliary 2nd person singular simple past indicative auxiliary 2nd person singular imperfect subjunctive auxiliary 3rd person singular present indicative auxiliary 3rd person
8. 1 Introduction

The purpose of this work is to produce a part-of-speech tagger for French using morphological analysis provided by a finite-state transducer. The tagger also utilizes a combination of statistical learning and linguistic knowledge, and is built in a pipelined architecture. All modules, except for preprocessing and morphological analysis, can be ordered in various ways. Part-of-speech tagging consists of applying several disambiguation modules to a list of ambiguous words until a single tag remains for each word. We propose and evaluate a sequencing strategy for the various modules and identify the best sequencing available. Several experiments were performed to determine the best order of the different modules. They showed that optimal results are obtained when morphological analysis is applied first, followed, in that order, by the application of linguistic constraints, the statistical stage, and finally the mapping of the large tagset to a smaller one.

2 Background

French is an inherently ambiguous language when it comes to morphological analysis. For example, the word mise can have as many as eight morphological analyses:

mise  mis    adjective, feminine singular
mise  mis    noun, feminine singular
mise  miser  past participle, feminine singular
mise  miser  verb, 1st person singular present indicative
mise  miser  verb, 1st person singular present subj
9. 7 Translation between the large set of tags and the small set of tags
6 Analysis and evaluation of the method
  6.1 Training and test corpora
  6.2 Cross-validation
  6.3 Technical characteristics of the system
7 Results
  7.1 Optimal Tagging Scheme
  7.2 Analysis by sequence
  7.3 Analysis by threshold
  7.4 Analysis by tagset
8 Applications
9 Conclusion
10 Acknowledgments
A User's Manual: description of the MT toolset
  A.1 System library files
  A.2 Morphological analyzer
  A.3 Makefile
  A.4 Filters which are part of the tagger itself
  A.5 Other tools
B Choosing a Tagset
REFERENCES

SHORT SET / LARGE SET / MEANING OF THE TAG
auxiliary 1st person plural present indicative auxiliary 1st person plural present imperative auxiliary 1st person plural present conditional auxiliary 1st person plural present subjunctive auxiliary 1st person plural future indicative auxiliary 1st person plural imperfect indicative auxiliary 1st person plural simple past indicative auxiliary 1st person plural imperfect subjunctive auxiliary 2nd person plural present indicative auxiliary 2nd person plural present imperative auxiliary 2nd person plural present
10. a, (4) automated learning. In our current work we have used the first three methods only. During each iteration of the deterministic stage, anchors are identified. An anchor is a word which, in the current tag assignment, has only one possible tag. If a word is left with only one tag after the application of a negative rule, this word is consequently used as an anchor for the next iteration. If the neighboring words and the anchor itself follow some pattern which is disallowed by a negative constraint, action is then taken. We have determined empirically that three iterations are sufficient for disambiguation of the sentence. The user can change the number of iterations if this becomes necessary. In the future, we might consider an alternative approach to the propagation of negative constraints. It is interesting to note that the list of negative constraints could be expanded much further if we were to ignore the fact that some negative constraints fail in only a limited number of cases. For example, the negative constraint N N (noun followed by another noun) would fail only in a few special situations, namely dimanche soir and similar temporal constructs in French. For proper nouns and acronyms we have adopted a heuristic approach: if we encounter a word with an initial uppercase letter, we assume that it is a possible proper noun and add a proper noun tag to its genotype. Similarly, if the
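The anchor-based application of negative constraints described above can be sketched as follows. This is a minimal Python illustration under several assumptions that are not stated in the manual: constraints are bigrams only, a constraint element matches a tag when it is a prefix of that tag (reflecting the hierarchical tag names), and a word is never stripped of its last remaining tag. The function name `apply_negative_constraints` is hypothetical.

```python
def apply_negative_constraints(assignment, constraints, iterations=3):
    """Remove neighbor tags that would form a forbidden bigram with an anchor.

    assignment:  list of (word, set of tags)
    constraints: list of (left_prefix, right_prefix) forbidden tag bigrams
    """
    tags = [set(t) for _, t in assignment]

    def forbidden(left, right):
        return any(left.startswith(l) and right.startswith(r)
                   for l, r in constraints)

    for _ in range(iterations):
        for i, current in enumerate(tags):
            if len(current) != 1:
                continue                      # only anchors trigger removals
            anchor = next(iter(current))
            if i + 1 < len(tags):             # anchor followed by neighbor
                bad = {t for t in tags[i + 1] if forbidden(anchor, t)}
                if bad != tags[i + 1]:        # assumption: keep at least one tag
                    tags[i + 1] -= bad
            if i > 0:                         # neighbor followed by anchor
                bad = {t for t in tags[i - 1] if forbidden(t, anchor)}
                if bad != tags[i - 1]:
                    tags[i - 1] -= bad
    return [(word, t) for (word, _), t in zip(assignment, tags)]


# "il nous faut": the constraint BS3 BD1 removes BD1P from the tags of nous.
phrase = [("il", {"BS3MS"}),
          ("nous", {"BD1P", "BI1P", "BJ1P", "BR1P", "BS1P"}),
          ("faut", {"V3SPI"})]
result = apply_negative_constraints(phrase, [("BS3", "BD1")])

# "l'appelle": the constraint R V removes the article readings of l'.
clitic = apply_negative_constraints(
    [("l'", {"BD3S", "RDF", "RDM"}), ("appelle", {"V3SPI"})], [("R", "V")])
```

Because a word reduced to one tag becomes an anchor itself, later iterations can propagate the disambiguation to its own neighbors, which is why the number of iterations is a parameter.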
11. ailable large tagged corpora exist for French, so other techniques had to be discovered to tackle this problem.

4 Theoretical Principles

This section contains a formal description of the tagging scheme. A list of definitions of terms used in this work is also provided.

4.1 Definitions

• The initial tag assignment is the tag assignment after preprocessing and morphology.
• A tag assignment TA is a list of lexemes along with a set of tags (correct or not) assigned at a particular stage to each of the words in the corpus. The following is an example of a TA data structure:

  l'     BD3S RDF RDM
  usine  NFS V1SPI V1SPS V2SPM V3SPI V3SPS

  The word to be tagged is in the left-hand column, whereas the right-hand column lists the tags associated with this word.
• The correct tag assignment is a tag assignment in which each word has been assigned one tag only, the correct one. A training corpus of 10,000 words has been manually tagged and used as a basis for evaluating newly tagged corpora.
• The tagsets TS: two tagsets have been considered, a large one consisting of 253 tags and a smaller one consisting of 67 tags. In addition, the user can specify any generalized subset of tags occurring in the large tagset. The tagsets are shown in Appendix A, Section C. The tags within each tagset have a hierarchical structure. They contain information
12. alues of some genotype decisions.

6.2 Cross-validation

In order to evaluate the statistical consistency of our results, we performed a validation consisting of the following: we split the test corpus into 11 slices of equal length; 10 of them were extracted from one corpus and the 11th one was extracted from a different corpus (different source, different subject material). We performed a series of training experiments, each time using 10 of the 11 slices for training and the remaining slice for testing. It was statistically impossible to distinguish the performance of the tagger in the special case, when training occurred on the 10-slice corpus and testing on the 11th slice, from the other 10 experiments. More precisely, the performance of the tagger in the special case ranked 4th among the 11 experiments.

6.3 Technical characteristics of the system

1. Time complexity: all filters run in time linear in the size of the test corpus.
2. System requirements: all software included in the tagger toolkit is written in Perl (version 4), as well as in the Bourne and C shell script languages. The tagger should work on most Unix platforms.

7 Results

We have analyzed 43 tagging schemes, ranging from the morphology stage only (M) to a complex series of procedures (morphology, deterministic, statistical with a threshold of 30, deterministic, tagset reduction, or M D A30 D T).

7.1 Optimal Ta
13. answer is not certain, the tagger keeps the remaining ambiguities. The combined use of linguistic knowledge and statistical learning is an original contribution to the disambiguation problem. A flexible tagset allows adaptation of the tagger for various natural language applications. Several questions, such as tagging unknown words and handling typographical errors, remain to be solved. We are in the process of collecting more training data to improve the system performance, as well as trying the tagger on other languages.

10 Acknowledgments

We would like to thank Ido Dagan and Diane Lambert for the comments, suggestions, and support that they provided throughout the work.

A User's Manual: description of the MT toolset

We have developed a series of tools which can be reused in other similar problem set-ups. Each tool is a stand-alone utility, and pipelines of such tools can be designed to perform various tasks. There are 4 directories where the tagger and the corpora reside: TAGGERDIR, TRAININGDIR, TESTDIR, TEMPDIR. In order to tag a corpus, the user needs to perform the following steps:

• know where the system files are located
• create a directory and put the corpus file in it. The extension of the file should be .cor
• copy the system makefile into the directory where the corpus is located and modify it so that the values of the directories are set properly
• modify the environment variable CORPUS to designate the name of the corpus fil
14. ave provided a facility to translate between our original large tagset and the tagset in use for a specific application. We perform the deterministic stage (see below) on the large tagset in order to be able to disambiguate as many words as possible, and allow for a tagset switch at any time after the last deterministic operator in the tagging scheme. It turns out that whereas deterministic operators work better on the large tagset, it is unclear whether the statistical learning performs better on the small tagset. Manual tagging was done on 10,000 words used as the training corpus for learning, and on the test corpus for evaluation. We have provided a tool which prompts the user with a list of all possible tags for a given word and lets the user either choose the correct tag or specify some additional tags if necessary.

5.4 Application of deterministic rules

Linguistic knowledge was utilized in the tagger in terms of negative constraints. It is more feasible for the computational linguist to predict forbidden transitions between tags than to anticipate all the possible transitions in the given language. The constraints are read from left to right and disallow a particular bigram or trigram of tags. Example: Article Verb states that a verb cannot follow an article. Negative constraints can be gathered using four methods: (1) the literature, (2) linguistic knowledge, (3) manual analysis of tagged corpor
15. blished a parameter for the A stage which determines which decisions to use. A certain genotype decision will be applied only if its strength is above the threshold. We have made evaluations using the following values for the threshold: from 5 (practically all decisions) to 30, 45, 60, 75, 90, and 100 (no decisions at all). The results summarizing the effect of the thresholds are shown in the next section. This stage can preserve some ambiguous words if not all possible genotypes were present in the training corpus.

5.7 Translation between the large set of tags and the small set of tags

Since we use an internal large tagset for most of the disambiguation, we can apply at some point a tagset reduction operator which collapses the large tagset into a smaller set of tags. The smaller set of tags is either the one predefined in the system or a tagset given by the user of the system.

6 Analysis and evaluation of the method

6.1 Training and test corpora

We have chosen the following as our corpora:
• Training: 10,000 words from the ECI (European Corpus Initiative) corpus
• Test: 1,000 words from randomly chosen sentences in the AFP (Agence France Presse) corpus
These corpora have a significant number of typographical errors and misprints. Typos can cause problems for two reasons:
• at the deterministic stage, if they become anchors, they can trigger incorrect removals of neighboring tags
• at the statistical stage, they can lead to incorrect v
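The threshold rule for the A stage stated earlier in this section (a genotype decision is applied only if its strength is above the threshold) can be illustrated with a short Python sketch. The function name `apply_decisions`, the representation of a genotype as a frozen set of tags, and the sample strengths are all assumptions made for this example.

```python
def apply_decisions(assignment, decisions, threshold):
    """Resolve ambiguous words whose genotype has a strong enough decision.

    assignment: list of (word, set of tags)
    decisions:  dict mapping a genotype (frozenset of tags) to (tag, strength)
    threshold:  a decision is used only if its strength exceeds this value
    """
    resolved = []
    for word, tags in assignment:
        decision = decisions.get(frozenset(tags))
        if len(tags) > 1 and decision is not None and decision[1] > threshold:
            resolved.append((word, {decision[0]}))   # apply the decision
        else:
            resolved.append((word, set(tags)))       # keep the ambiguity
    return resolved


# Made-up decision table and tag assignment, for illustration only.
decisions = {frozenset({"NFS", "V3SPI"}): ("NFS", 80.0)}
assignment = [("usine", {"NFS", "V3SPI"}), ("qui", {"E", "K"})]
low = apply_decisions(assignment, decisions, threshold=5)
high = apply_decisions(assignment, decisions, threshold=100)
```

In this sketch a threshold of 100 applies no decision at all, and a genotype that never occurred in training (here the one for qui) stays ambiguous at any threshold.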
16. ducer. The tagger also utilizes a combination of statistical learning and linguistic knowledge, and is built in a pipelined architecture. All modules, except for preprocessing and morphological analysis, can be ordered in various ways. Part-of-speech tagging consists of applying several disambiguation modules to a list of ambiguous words until a single tag remains for each word. We propose and evaluate a sequencing strategy for the various modules and identify the best sequencing available. Several experiments were performed to determine the best order of the different modules. They showed that optimal results are obtained when morphological analysis is applied first, followed, in that order, by the application of linguistic constraints, the statistical stage, and finally the mapping of the large tagset to a smaller one. The system works on unrestricted text.

Pages of Text: 1. Other Pages: 15. Total: 16. No. Figs.: 0. No. Tables: 7. No. Refs.: 0.

AT&T BELL LABORATORIES
Initial Distribution Specifications, 11222-950726-03TM (page ii of ii)
Complete Copy / Cover Sheet Only: DH 1122, Arno Penzias, MTS 11222, 1122 MTS, Kenneth W. Church, Cathy Cohen, Eileen Fitzpatrick, Julia Hirschberg, Donald Hindle, James Hieronymus, Mark Jones, Diane Lambert, David Lewis, Fernando Pereira, Lawrence R. Rabiner, Thomas Restaino, David Yarowsky
Future AT&T Distribution by ITDS: Release to any AT&T employee (excludin
17. e. E.g., if the corpus file is called MYCORPUS.cor, then the user has to set CORPUS to MYCORPUS
• type make MYCORPUS DA5T for the best tagging sequence. Any other tagging sequence can be obtained by replacing DA5T in the previous command by the corresponding tagging sequence acronym.

A.1 System library files
NCONSS: list of negative constraints
NOSP: list of compound words
TAGS: mapping between the large and small tagsets
arclistd: finite-state transducer for morphological analysis
MAINPROPERS arclist: finite-state transducer that contains many proper nouns

A.2 Morphological analyzer
dictionary: finite-state transducer driver

A.3 Makefile
makefile: script that is used for tagging

A.4 Filters which are part of the tagger itself
mtapply: puts together the tags resulting from applying 1-gram and bigram statistical decisions
mtback: translates the output of mtiter into the normal tag assignment format. Example: P NP NMS NFS becomes P NP followed on the next line by NMS NFS
mtcompound: this filter checks for compounds in the input and outputs them as a single token. Example: if de, facon, and que appear in the input, the output will contain de_facon_que
mtconcise: this filter translates the verbose morphological features and parts of speech from the FST into concise tags from the tagset. Example
18. • The rules are simple: P and M are applied first; also, there must be an L stage before the A stages. Example: the tag scheme D A5 D T means the composition T(D(A5(D(TA0)))).
• Negative constraints: negative constraints are examples of deterministic knowledge that express linguistic relationships between the features of the words in a given n-gram, thereby performing some contextual disambiguation over strings of tags. It seemed natural to use human expertise to partly disambiguate text through rules. Of course, one could argue that the machine would eventually learn them, but generalities that capture gender and number agreement are straightforward to state: they are available to the human without effort and they are easy to implement. Each of the linguistic constraints is applied several times over the anchors of the corpus. This way, anchors can create new anchors and thus enlarge the islands of disambiguated words.
Example: BS3 BD1 (3rd person subject personal pronoun, 1st person indirect personal pronoun). In the phrase il nous faut (we need; literally, it is necessary to us), the tags are BS3MS for il and BD1P, BI1P, BJ1P, BR1P, BS1P for nous. The negative constraint BS3 BD1 rules out BD1P and thus reduces the alternatives from 5 to 4 for the word nous.
N K (noun, interrogative pronoun). In the phrase fleuve qui riv
19. e accents. Namely, an initial uppercase vowel will get an accent if it precedes a consonant in the following configuration: if the word starts with the pattern ECV, where E is the character E, C is one of the consonants b, bl, br, c, ch, cl, cr, d, dl, dr, f, fl, fr, g, gl, gr, h, j, l, m, n, p, ph, pl, pr, q, r, s, sl, sr, t, tl, tr, v, vl, vr, z, and V is one of the vowels {a, e, i, o, u, y}, the acute accent is recovered; if the observed word is A or Etre, the accent will be grave or circumflex, respectively, in order to produce à and être.
• Acronyms: similarly to the case for proper nouns, an initial guess that a certain word might be an acronym will be validated only if there are no other tags available from the morphology lookup.
• Compound words: compound words, or non-compositional words in French, are to be tagged as a single entity and not as a sequence of two or three different words. They are recognized at this stage by looking them up in a dictionary of lexical compounds, and are considered as a single lexical unit. For example, locutions such as a priori, top secret, or raz de marée will be treated as unique lexical entries.
• Personal pronouns: if two words are connected by a dash and the second word is a personal pronoun, the two words are considered individually. For example, the compound dit-elle (said she) is analyzed as two words, dit and elle.
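The compound-word step above can be sketched as a greedy longest-match lookup. This is a minimal Python illustration, not the toolkit's own filter: the function name `merge_compounds`, the representation of the compound dictionary as a set of word tuples, and the choice of joining the words with underscores are assumptions for this example.

```python
def merge_compounds(tokens, compounds, max_len=3):
    """Join known multi-word compounds into single underscore-separated tokens.

    tokens:    list of words in corpus order
    compounds: set of tuples of words, e.g. {("a", "priori")}
    max_len:   longest compound considered (two or three words in the text)
    """
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so that longer compounds win.
        for size in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + size])
            if candidate in compounds:
                merged.append("_".join(candidate))
                i += size
                break
        else:
            # No compound starts here: keep the word as it is.
            merged.append(tokens[i])
            i += 1
    return merged


lexicon = {("a", "priori"), ("top", "secret"), ("raz", "de", "marée")}
out = merge_compounds(["un", "raz", "de", "marée", "a", "priori"], lexicon)
```

Trying the three-word window before the two-word one is a design choice: it prevents a shorter compound from consuming the first words of a longer one.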
20. er that, qui can be tagged both as E (relative pronoun) and as K (interrogative pronoun); the E will be chosen by the tagger since K cannot follow a noun (N).
R V (article, verb): for example, l'appelle (call him/her). The word appelle can only be a verb, but l' can be either an article or a personal pronoun. Thus, the rule will eliminate the article tag, giving preference to the pronoun.

4.2 Formulation of the tagging problem

An initial tag assignment is given, on which a tagging scheme is applied through processing operators P1, P2, ..., Pn. The goal is to obtain a new tag assignment TA with maximal accuracy; that is, one wants c(TA) to be as large as possible. Since there are many possible tagging schemes, one objective is to determine which one of them is the best. This will be the optimal tagging scheme, which will be kept for tagging.

5 Implementation

We have developed all the software tools necessary for preprocessing and tagging the text, as well as additional utility programs. Most of the tools are implemented in Perl, with a few shell scripts; they execute the different operators described above and also include additional bookkeeping filters, utilities, and other programs. These tools are described in detail in Appendix A. The different tools are used to implement the processing operators mentioned in the previous
21. file are in the right column. Linguistic knowledge about possible sequencing of parts of speech is very powerful, since several types of restrictions can be expressed for words and tags in context. For example, an article cannot be followed by a verb in French, as well as in many other languages. Given that many words have unique tags, restriction rules could use such words as anchors to disambiguate surrounding words.

Table 2: Sample output of the tagger for the first sentence from the text in Table 1 (columns: Word, p.o.s. tag, Meaning).
Words (partly recoverable): ... qui devrait être implantée à Eloyes (Vosges) représente un investissement d'environ ... milliards de yens (148 milliards de francs).
Meanings, in order: beginning of sentence; definite feminine article; feminine singular noun; punctuation; relative pronoun; verb 3rd person singular present conditional; auxiliary infinitive; past participle feminine singular; preposition; proper noun; punctuation; proper noun; punctuation; verb 3rd person singular present indicative; indefinite masculine singular article; masculine singular noun; preposition; preposition; numeral; punctuation; numeral; numeral; preposition; masculine plural noun; punctuation; numeral; numeral; preposition; masculine plural noun; punctuation; end of sentence.

On the other hand statistical lea
22. g contract employees)
Author Signatures: Dragomir R. Radev, Evelyne Tzoukermann, William A. Gale
Organizational Approval: Department Head Steve E. Levinson

Contents
1 Introduction
2 Background
3 Related Work
4 Theoretical Principles
  4.1 Definitions
  4.2 Formulation of the tagging problem
5 Implementation
  5.1 Text preprocessing
  5.2 Morphological processing
  5.3 Tagset choice and hand tagging
  5.4 Application of deterministic rules
  5.5 Statistical learning of genotype resolutions
  5.6 Application of the genotype resolutions
  5
23. gging Scheme

We have determined empirically that, under the current model, the best tagging scheme is M D A5 T, i.e., morphology, deterministic, statistical with a threshold of 5, tagset reduction, as shown in Table 4. In the following subsections we identify the factors that influence the accuracy of the tagging scheme.

7.2 Analysis by sequence

Table 4 demonstrates that at the end of the morphological stage, 53% of the corpus has a single, unique, and correct tag, 1% of the words are incorrectly tagged, and 47% are still ambiguous. The deterministic stage increases the percentage of correct tags by almost 7%, while the statistical stage with the maximum coverage (i.e., a threshold of 5) provides almost 90% of correct tags. Various tagging schemes have quite different performance, as Table 4 shows.

Table 4: Results of the different tagging schemes (columns: tagging scheme, % correct, % incorrect, % ambiguous). The best scheme is the one that applies sequentially Morphology (M), Negative Constraints with 3 iterations (D), Statistical Decisions with maximal coverage (A5), and Tag Reduction (T).

7.3 Analysis by threshold

Table 5 reflects the differences in performance of the tagger when only the threshold of the statistical operator varies. A lower value of the threshold represents more possibly incorrect s
24. ing
• mtshow allstages: visualization utility
• mtshow detstage: visualization utility
• mtshow disambig: visualization utility
• mtshow wrong: visualization utility
• mttop: shows the most frequent words in a corpus

B Choosing a Tagset

The following list shows the tagsets that are used in the system. The first column indicates the restricted set of tags and the second column indicates the extended set of tags. Notice that the user can specify any subset of tags contained in the large set. In order to specify a different set, map the new tag to the large one and write the change in the first column.

AT&T Document Cover Sheet for Technical Memorandum

Title: Part of Speech Tagger for French: a User's Manual

Authors (Electronic Address, Location, Ext.; Company if other than AT&T BL):
Dragomir R. Radev, radev@research.att.com, MH 2D-468, 4078
Evelyne Tzoukermann, evelyne@research.att.com, MH 2D-448, 2924
William A. Gale, gale@research.att.com, MH 2C-278, 2520

Document No. / Filing Case No. / Work Project No.:
11222-950726-03TM / 60011 / 311402-2228
11215-950727-08TM / 20878 / 311401-1503

Keywords: Text-to-Speech Synthesis; French Text Analysis; Part-of-Speech Tagging; Computational Morphology

MERCURY Announcement Bulletin Sections: CMM (Communications), CMP (Computing), CFS (Life Sciences)

Abstract

The purpose of this work is to produce a part-of-speech tagger for French using morphological analysis provided by a finite state trans
25. l pronoun indirect 2nd person plural; personal pronoun indirect 2nd person singular; personal pronoun indirect 2nd person plural; personal pronoun indirect 2nd person singular; personal pronoun disjoint feminine 3rd person plural; personal pronoun disjoint feminine 3rd person singular; personal pronoun disjoint masculine 3rd person plural; personal pronoun disjoint masculine 3rd person singular; personal pronoun disjoint 1st person plural; personal pronoun disjoint 1st person singular; personal pronoun disjoint 2nd person plural; personal pronoun disjoint 2nd person singular; personal pronoun disjoint 2nd person plural; personal pronoun disjoint 2nd person singular; personal pronoun reflexive feminine 3rd person plural; personal pronoun reflexive feminine 3rd person singular; personal pronoun reflexive masculine 3rd person plural; personal pronoun reflexive masculine 3rd person singular; personal pronoun reflexive 1st person plural; personal pronoun reflexive 1st person singular; personal pronoun reflexive 2nd person plural; personal pronoun reflexive 2nd person singular; personal pronoun reflexive 3rd person plural; personal pronoun reflexive 3rd person singular; personal pronoun subject feminine 3rd person plural; personal pronoun subject feminine 3rd person singular; personal pronoun subject masculine 3rd person plural; personal pronoun subject masculine 3rd person singular; personal pronoun subject 1st person plural; personal pronoun subject 1st person singular; perso
26. n; feminine singular noun; feminine noun invariable in number; masculine noun; masculine plural noun; masculine singular noun; masculine noun invariable in number; singular noun invariable in gender; plural noun invariable in gender; invariable noun; onomat; preposition; present participle; present participle masculine singular; past participle; past participle feminine plural; past participle feminine singular

SHORT SET    LARGE SET    MEANING OF THE TAG

[Tag columns not recoverable from the extraction; the only legible codes are V1P (eight times) and V2P.]

past participle masculine plural; past participle masculine singular; article; definite article; definite feminine article; definite masculine article; definite masculine plural article; definite masculine singular article; definite partitive article; indefinite article; indefinite feminine singular article; indefinite feminine plural article; indefinite masculine plural article; indefinite masculine singular article; partitive article; particle; nominal; proper noun; verb 1st person plural present indicative; verb 1st person plural present imperative; verb 1st person plural present conditional; verb 1st person plural present subjunctive; verb 1st person plural future indicative; verb 1st person plural imperfect indicative; verb 1st person plural simple past indicative; verb 1st person plural imperfect subjunctive; verb 2nd person plural present indicative; verb 2nd person plural present conditional; verb 2nd pe
27. nal pronoun subject 2nd person plural; personal pronoun subject 2nd person singular; coordinating conjunction; subordinating conjunction; indefinite pronoun; indefinite pronoun feminine singular; indefinite pronoun feminine plural; indefinite pronoun masculine singular; indefinite pronoun masculine singular; indefinite pronoun plural; relative pronoun; relative pronoun feminine; relative pronoun feminine plural; relative pronoun masculine; relative pronoun masculine plural; possessive pronoun feminine singular; possessive pronoun feminine plural; possessive pronoun masculine plural; possessive pronoun masculine singular; possessive pronoun plural; possessive pronoun singular; interjection; feminine plural adjective; feminine singular adjective; masculine plural adjective; masculine singular adjective; masculine adjective invariable in number; plural adjective invariable in gender; singular adjective invariable in gender; invariable adjective; plural adjective; singular adjective; interrogative pronoun; interrogative pronoun feminine; interrogative pronoun feminine plural; interrogative pronoun masculine; interrogative pronoun masculine plural; pronoun; pronoun 3rd person singular; pronoun feminine plural; pronoun feminine singular; pronoun masculine plural; pronoun masculine singular; pronoun plural invariable in gender; pronoun singular invariable in gender; feminine noun; feminine plural nou
28. nland, 1990. International Conference on Computational Linguistics. Atro Voutilainen. NPtool, a detector of English noun phrases. In Proceedings of the Workshop on Very Large Corpora, Columbus, Ohio, 1993.
29. noun masc plur becomes NMP
• mthsuniq: removes duplicate tags
• mtiter: applies the negative constraints on a tag assignment
• mtlearn: statistically computes the best statistical decisions from the training corpus
• mtnop: removes the proper noun and acronym tags if other tags are present for the same word
• mtnosgml: removes SGML tags from the input corpus
• mtpn: handles pronouns in constructions such as dit elle
• mtprintl: prints all tags for a given word on the same line
• mtrestore: recovers tags that have been ruled out at some stage
• mtsplit: splits the corpus into a list of words
• mtstat: applies the statistical decisions
• mttest2: computes the accuracy of the tagging when given the correct tagging
• mttrans: translates the large tagset into the small tagset

A.5 Other tools

The following tools are used mostly for debugging:
• mtasc: changes 7-bit French text to 8-bit text
• mtbatch: batch mode utility
• m count: counts the ambiguities in a given tag assignment
• mteval: batch mode utility
• mtex: batch mode utility
• mthuniq: same as mthsuniq, but assumes that the tags on each line of the input are sorted
• mtlearn2: same as mtlearn, but it also uses genotype bigrams
• mtlc: converts the input into lowercase
• mtnop s: same as mtnop, but works on the small tagset
• mtrun: batch mode utility
• mtselect: utility for manual tagg
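The behaviour described for mtnop (drop the proper noun and acronym tags when a word has other candidates) can be pictured with a few lines of Python. This is an illustrative sketch only: the real tool's input format and the actual tag codes for proper nouns and acronyms are not given in this part of the manual, so the names PROPER and ACRONYM below are placeholders.

```python
from typing import Set

# Placeholder tag names (assumption); the tagger's real codes may differ.
PROPER_NOUN_TAG = "PROPER"
ACRONYM_TAG = "ACRONYM"


def drop_fallback_tags(tags: Set[str]) -> Set[str]:
    """Remove proper noun / acronym tags if any other tag is present."""
    fallback = {PROPER_NOUN_TAG, ACRONYM_TAG}
    others = tags - fallback
    # Keep the fallback tags only when nothing else would remain.
    return others if others else set(tags)


print(drop_fallback_tags({"PROPER", "NFS"}))   # other tag present
print(drop_fallback_tags({"PROPER"}))          # nothing else to fall back on
```

The design point is that the proper noun and acronym readings act as a safety net: they survive only when the word has no other candidate tag.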
30. rning is used as follows: given a manually tagged training corpus, the most frequent tags from each combination of tags can easily be learned. When the statistical knowledge is applied, the best decisions based on the disambiguated data are made.

We look at the morphological analysis, the deterministic stage and the statistical stage as operators which modify the current tag assignment of the corpus and produce a new and more accurate tag assignment. There are additional modules, such as the preprocessing and morphological stages, that are applied in a fixed order. The whole process of tagging can be looked at as the composition of these processing operators. Since the operators are compositional, they can in theory be ordered in many different ways. We want to find out which sequence of operators leads to an overall improvement of the tagging accuracy.

3 Related Work

A number of taggers and tagging methods have become available over the last decades. Work in part-of-speech tagging has generally followed either a rule-based approach [..., 4, 5] or a statistical one [...]. Statistical approaches often use Hidden Markov Models for estimating lexical and contextual probabilities, while rule-based systems capture linguistic generalities to express contextual rules. Most of this work has benefited from large tagged corpora, which make the training and testing procedures feasible. However no publicly av
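The idea of tagging as a composition of operators, and the fact that the order of application matters, can be sketched as follows. The two operators here are toy stand-ins invented for the example (they are not the tagger's deterministic or statistical stages), and the type names are assumptions.

```python
from functools import reduce
from typing import Callable, FrozenSet, List

TagAssignment = List[FrozenSet[str]]          # one set of candidate tags per word
Operator = Callable[[TagAssignment], TagAssignment]


def compose(*operators: Operator) -> Operator:
    """Build a tagging scheme that applies the operators left to right."""
    def scheme(assignment: TagAssignment) -> TagAssignment:
        return reduce(lambda current, op: op(current), operators, assignment)
    return scheme


def drop_tag(tag: str) -> Operator:
    """Toy operator: rule out `tag` wherever another candidate remains."""
    def op(assignment: TagAssignment) -> TagAssignment:
        return [tags - {tag} if len(tags) > 1 else tags for tags in assignment]
    return op


def keep_first(assignment: TagAssignment) -> TagAssignment:
    """Toy operator: keep only the alphabetically first candidate."""
    return [frozenset({min(tags)}) for tags in assignment]


initial = [frozenset({"BD3S", "RDF", "RDM"}), frozenset({"NFS", "V1SPI"})]
scheme_a = compose(drop_tag("BD3S"), keep_first)
scheme_b = compose(keep_first, drop_tag("BD3S"))
print(scheme_a(initial))   # first word resolves to RDF
print(scheme_b(initial))   # first word resolves to BD3S: order matters
```

Because different orders can give different final assignments, each candidate ordering has to be evaluated empirically against a hand-tagged corpus.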
31. rphological module. Example: the word mise has the following genotype: JFS NFS QSFS V1SPI V1SPS V2SPM V3SPI V3SPS.

• A statistical decision consists of a genotype, its most likely (predominant) resolution in the training corpus, and the likelihood of that resolution. Example: if PN P are possible tags, then P is selected in 96.85% of the cases (768 out of 793).

• Processing operators are essentially functions that take a tag assignment as argument and produce another tag assignment. Operators are explained in more detail in the next section. Example: if $P$ is a processing operator and $TA$ a tag assignment, then $P(TA) = TA'$, which means that $TA'$ is the resulting tag assignment:

  word     before P                             after P
  L        BD3S RDF RDM                         RDF
  usine    NFS V1SPI V1SPS V2SPM V3SPI V3SPS    NFS

• A tagging scheme is a composition of $n$ processing operators which, when applied to the initial tag assignment $TA_0$, returns another tag assignment $TA_n$. To keep our notation consistent, we use the concatenation of the symbols representing the operators in the composition to refer to a given tagging scheme. For example, we use DAT to express that 3 operators, deterministic (D), application of n-gram statistical decisions (A) and tagset reduction (T), have been applied to the initial TA. For simplicity, the P, M and L stages (preprocessing, morphology and learning, see next section) will be omitted when referring to a particular tagging schem
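A statistical decision as defined above (genotype, predominant resolution, likelihood) can be represented and applied as in the sketch below. The counts 768 and 793 come from the example in the text; the two-tag genotype {N, P}, the class and function names, and the use of a minimum likelihood as the acceptance test are assumptions made for illustration.

```python
from typing import Dict, FrozenSet, Iterable, NamedTuple


class Decision(NamedTuple):
    tag: str      # predominant resolution of the genotype
    hits: int     # times this tag was correct in the training corpus
    total: int    # times the genotype was observed

    @property
    def likelihood(self) -> float:
        return self.hits / self.total


def resolve(candidates: Iterable[str],
            decisions: Dict[FrozenSet[str], Decision],
            min_likelihood: float) -> FrozenSet[str]:
    """Replace a genotype by its predominant tag if the decision is reliable."""
    genotype = frozenset(candidates)
    decision = decisions.get(genotype)
    if decision is None or decision.likelihood < min_likelihood:
        return genotype                      # leave the word ambiguous
    return frozenset({decision.tag})


# Hypothetical genotype; only the counts are taken from the text.
DECISIONS = {frozenset({"N", "P"}): Decision("P", 768, 793)}

print(round(100 * DECISIONS[frozenset({"N", "P"})].likelihood, 2))
print(resolve({"N", "P"}, DECISIONS, min_likelihood=0.95))
```

Words whose genotype has no recorded decision, or only an unreliable one, are passed through unchanged so that a later operator can still act on them.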
32. rson plural present subjunctive; verb 2nd person plural future indicative; verb 2nd person plural imperfect indicative; verb 2nd person plural simple past indicative; verb 2nd person plural imperfect subjunctive; verb 3rd person plural present indicative; verb 3rd person plural present conditional; verb 3rd person plural present subjunctive; verb 3rd person plural future indicative; verb 3rd person plural imperfect indicative; verb 3rd person plural simple past indicative; verb 3rd person plural imperfect subjunctive; verb 1st person singular present indicative; verb 1st person singular present imperative; verb 1st person singular present conditional; verb 1st person singular present subjunctive; verb 1st person singular future indicative; verb 1st person singular imperfect indicative; verb 1st person singular simple past indicative; verb 1st person singular imperfect subjunctive; verb 2nd person singular present indicative; verb 2nd person singular present imperative; verb 2nd person singular present conditional; verb 2nd person singular present subjunctive; verb 2nd person singular future indicative; verb 2nd person singular imperfect indicative; verb 2nd person singular simple past indicative; verb 2nd person singular imperfect subjunctive; verb 3rd person singular present indicative; verb 3rd person singular present conditional; verb 3rd person singular present subjunctive; verb 3rd person singular future indicative
33. section

5.1 Text preprocessing

A raw corpus of text is the input to the preprocessor. Several filters need to be applied in order to normalize the text. The following steps are applied:

• Sentence boundaries: places where sentences begin and end are identified and replaced by appropriate SGML tags. Punctuation symbols are also assigned special tags.

• Proper nouns: the morphological dictionary contains common nouns and proper nouns, but the productivity of proper nouns is very high. Therefore each word starting a sentence needs to be identified and recognized as either a common or a proper noun. These words undergo special treatment: each word starting a sentence is given the PROPER noun tag; after morphological analysis, if the word inherits a new analysis, the latter prevails; if not, the word is identified as a PROPER noun and is dynamically added to the PROPER NAMES dictionary (see Section 5.2). If a word with an initial uppercase letter is found in the middle of a sentence, it immediately inherits the PROPER noun tag. The accents cause an additional difficulty: in continental French, accented characters lose their accents when they are capitalized. This holds both in sentence-initial position and in the middle of the sentence. Therefore many words in the text will be missing their accents. A phonology-based recovery technique is applied in order to attempt to recover thes
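The treatment of capitalized words described in the proper noun step can be summarized in a short sketch. The function name, the lookup callback `analyze` (standing in for the morphological analysis), the lowercasing before lookup, and the tag code BS3FS in the toy lexicon are all assumptions made for illustration; accent recovery is not modelled.

```python
from typing import Callable, Set

PROPER = "PROPER"  # label used in the text for the proper noun tag


def tag_capitalized_word(word: str,
                         sentence_initial: bool,
                         analyze: Callable[[str], Set[str]],
                         proper_names: Set[str]) -> Set[str]:
    """Apply the sentence-initial / mid-sentence capitalization rule."""
    if not word[:1].isupper():
        return analyze(word)                 # ordinary word: plain lookup
    if not sentence_initial:
        return {PROPER}                      # mid-sentence capital: proper noun
    analyses = analyze(word.lower())         # assumption: lookup is lowercase
    if analyses:
        return analyses                      # a real analysis prevails
    proper_names.add(word)                   # grow the proper names dictionary
    return {PROPER}


# Toy lexicon standing in for the morphological dictionary (hypothetical tags).
lexicon = {"elle": {"BS3FS"}}
lookup = lambda w: set(lexicon.get(w, ()))
names: Set[str] = set()

print(tag_capitalized_word("Elle", True, lookup, names))
print(tag_capitalized_word("Minolta", True, lookup, names), names)
```

Passing the proper names set in explicitly mirrors the text's point that the dictionary is extended dynamically as unknown sentence-initial words are encountered.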
34. singular present conditional; auxiliary 3rd person singular present subjunctive; auxiliary 3rd person singular future indicative; auxiliary 3rd person singular imperfect indicative; auxiliary 3rd person singular simple past indicative; auxiliary 3rd person singular imperfect subjunctive; auxiliary infinitive; auxiliary present participle; auxiliary present participle masculine singular; auxiliary past participle; auxiliary past participle feminine plural; auxiliary past participle feminine singular; auxiliary past participle masculine plural; auxiliary past participle masculine singular; adverb; indefinite personal pronoun; personal pronoun direct feminine 3rd person plural; personal pronoun direct feminine 3rd person singular; personal pronoun direct masculine 3rd person plural; personal pronoun direct masculine 3rd person singular; personal pronoun direct 1st person plural; personal pronoun direct 1st person singular; personal pronoun direct 2nd person plural; personal pronoun direct 2nd person singular; personal pronoun direct 3rd person plural; personal pronoun direct 3rd person singular; personal pronoun indirect feminine 3rd person plural; personal pronoun indirect feminine 3rd person singular; personal pronoun indirect masculine 3rd person plural; personal pronoun indirect masculine 3rd person singular; personal pronoun indirect 1st person plural; personal pronoun indirect 1st person singular; persona
35. tatistical decisions; a higher value, fewer but more reliable decisions.

[Table 5 body not recoverable from the extraction. Columns: tagging scheme, % correct, % incorrect, % ambiguous.]

Table 5: Analysis of statistical decisions

7.4 Analysis by tagset

Table 6 presents the different tagging schemes with reduction to the small set of tags at different levels. Because of the large difference in size between the large tagset (253 tags) and the small one (67 tags), we hypothesized that there might be a significant difference each time the tagset was reduced. The numbers in Table 6 do not support this hypothesis; in fact they show that the difference in performance is small when using different versions of the tagset.

  tagging scheme    % correct    % incorrect    % ambiguous
  M                 53.5         1.0            45.7
  M A5              89.1         9.7            1.2
  M D Ago           73.3         1.6            25.3
  M A D             88.7         10.3           1.0

[Only these rows are legible; scheme labels are given as extracted and their subscripts may be garbled. The remaining rows are not recoverable.]

Table 6: Comparison between the two tagsets

8 Applications

There are several ways one can think of using a part of speech tagger:

• text-to-speech synthesis, at several levels of the text-to-speech process: at the grapheme-to-phoneme level, knowing the part of speech of a word can determine its pronunciation; for example, in the French sentence presented in Table 7, the words
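The effect of tagset reduction on ambiguity, as analysed in Section 7.4, can be illustrated with a small sketch. The mapping below is hypothetical: the small-set code V1S is invented for the example and is not taken from the tagger's tagset tables, and the manual's own tool for this step is mttrans.

```python
from typing import Dict, List, Set


def reduce_tagset(assignment: List[Set[str]],
                  large_to_small: Dict[str, str]) -> List[Set[str]]:
    """Map each candidate tag to its small-set tag; duplicates collapse."""
    return [{large_to_small.get(tag, tag) for tag in tags}
            for tags in assignment]


# Hypothetical large-to-small mapping, for illustration only.
LARGE_TO_SMALL = {"V1SPI": "V1S", "V1SPS": "V1S", "NFS": "NFS"}

words = [{"V1SPI", "V1SPS"}, {"NFS", "V1SPI"}]
reduced = reduce_tagset(words, LARGE_TO_SMALL)
print(reduced)
```

In this toy case the first word becomes unambiguous after reduction because both of its large-set tags map to the same small-set tag, while the second word stays ambiguous. Reduction only removes ambiguity among tags that fall into the same small-set class.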
36. unctive
  mise    miser    verb 2nd person singular present imperative
  mise    miser    verb 3rd person singular present indicative
  mise    miser    verb 3rd person singular present subjunctive

The goal of a part-of-speech tagger is to reduce the number of part-of-speech ambiguities. This is achieved by using a combination of linguistic knowledge and statistical rules that progressively reduce the number of possible tags for a given word. A tag contains information about the part of speech as well as about certain grammatical categories such as tense, mood, number and gender.

The input to the system is a French text with 8-bit encoded accents. Table 1 shows an example of text data.

  L'usine qui devrait être implantée à Eloyes (Vosges) représente un investissement d'environ 3,7 milliards de yens (148 milliards de francs). Elle fabriquera dans un premier temps le produit liquide qui entre dans le processus des photocopies, ainsi que des pièces détachées pour la filiale de Minolta en RFA.

Table 1: Corpus sample of newswire compiled by the French embassy in Washington, D.C.

The goal is to obtain an output text where a single part of speech is associated with each word. Table 2 shows the output of the first sentence of the text in Table 1, disambiguated at the word level. In the left column are the words corresponding to the French corpus the part of speech tags corresponding to the words tag
37. word has all uppercase characters, the word is a possible acronym and is given the appropriate tag. Later, after applying the deterministic operator, it is possible that a given tag (other than proper noun and acronym) is ruled out due to negative constraints. Then the proper noun or acronym tag will remain.

5.5 Statistical learning of genotype resolutions

At this stage we try to identify linguistic phenomena according to which a certain genotype has a predominant gene (tag). It turns out that most of the genotypes have predominant genes. Thus it is possible to resolve some ambiguities using the genotype decision for the genotype of the word, by looking up a table of the most likely tags for certain genotypes. Such a table can be compiled from the training corpus. A measure of confidence is used to apply decisions under a certain threshold. Table 3 shows the decisions made upon the application of the threshold.

[Table 3 body not recoverable from the extraction.]

Table 3: Best decisions that can be made according to unigram distributions

We use a strength score for each statistical rule, based on the frequency $f$ of the decision among $n$ observations of the tag genotype. For instance, Table 2 gives $f = 195$ and $n = 199$ for the decision RDM from the tag genotype BD3S RDM
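Compiling such a table of genotype decisions from a hand-tagged corpus can be sketched as below. The manual's actual strength score is not reproduced in this excerpt, so the plain ratio f/n is used here as a stand-in, which is an assumption; the function name and data layout are likewise illustrative. The counts 195 and 199 for the genotype BD3S RDM come from the text, and the second genotype's counts are invented.

```python
from collections import Counter, defaultdict
from typing import Dict, FrozenSet, Iterable, Set, Tuple


def learn_decisions(observations: Iterable[Tuple[Set[str], str]],
                    min_ratio: float
                    ) -> Dict[FrozenSet[str], Tuple[str, int, int]]:
    """Return {genotype: (predominant tag, f, n)} for reliable genotypes.

    Each observation pairs a word's genotype with its correct tag.
    f / n is a stand-in for the manual's strength score (assumption).
    """
    counts = defaultdict(Counter)
    for genotype, tag in observations:
        counts[frozenset(genotype)][tag] += 1
    decisions = {}
    for genotype, tag_counts in counts.items():
        tag, f = tag_counts.most_common(1)[0]   # predominant gene
        n = sum(tag_counts.values())            # observations of the genotype
        if f / n >= min_ratio:
            decisions[genotype] = (tag, f, n)
    return decisions


training = ([({"BD3S", "RDM"}, "RDM")] * 195 + [({"BD3S", "RDM"}, "BD3S")] * 4
            + [({"NFS", "V1SPI"}, "NFS")] * 3 + [({"NFS", "V1SPI"}, "V1SPI")] * 2)
print(learn_decisions(training, min_ratio=0.9))
```

With a minimum ratio of 0.9 only the BD3S RDM genotype yields a decision; the second genotype has no sufficiently predominant tag and is left for other operators.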
