Text categorization using lexical chains

1. basic 1 practice 1 student 1 grade 1 rank 1 series 1 development 1 spiritual 1 founder 1 principal 3 general 1 South Korean 1 defense 1 form 1 art 4 art 41 b1ow 1 counter 1 punch 1 kick 1 jump 1 standing 1 combat 1 kicking 11 In 1 lower 1 proficiency 2 high 71 Korean 2 Korean 2 Korean 2 Title China warns Taiwan about making two states theory legal Running time 55 minutes on Intel Pentium II democracy 1 mainland 1 society 1 formula 2 Taiwan 1 capitalist 1 run 1 Beijing 1 system 1 country 1 formula 2 China 1 part 1 Taiwan 1 state 1 back 1 Taiwan 1 policy 1 48 official 1 leadership 1 senior 1 member 1 people 1 r People 1 Taiwanese 1 safety 2 property 1 1life 1 fire 1 action 1 people 1 Iprospect 1 Taiwan 1 constitution 1 state 1 force 1 Taiwan 1 China 1 Taiwan 1 split 1 feature 1 state 1 people 1 state 1 treat 1 lIside 1 I President 1 Taiwanese 1 China 1 voter 1 party 1 moderate 1 voter 11 separatist 1 candidate 1 threat 1 China 1 Taiwan 1 people 1 threat 1 premier 2 Taiwan 1 China 1 threat 1 1eader 1 Chinese 1 top 21 Communist Party 2 Ipeople 1 fire 1
2. pot 1 plant 1 stem 1 change 1 leve1 1 plant 1 banana 2 plant 1 soil 1 peat 1 mixture 1 soi1 1 soi1 1 plant 1 Plant 1 root 2 plant 1 Danana 2 plant 111 flowering 2 present 1 long 1 month 2 month 2 temperature 211 winter 1 week 3 daily 1 1 E summer 11 heavy 2 heavy 21 E normal 1 L yellow 11 turn 121 resumption 1 1 1ateral 1 times 21 soak 1 planting 1 shipping 1 planting 1 damage 11 season 3 days 1 coloration 1 growing 1 prior 1 fungicide 1 spray 6 dip 6 50
3. Swedish Institute of Computer Science SICStus Prolog User s Manual Septem ber 1998 Juha Takkinen An adaptive approach to text categorization and understanding a preliminary study Presented at the Fifth IDA Graduate Conference on Computer and Information Science Link ping November 1995 Ellen M Voorhees Using wordnet for text retrieval In Fellbaum Fel98 chap ter 12 pages 285 303 Yiming Yang An evaluation of statistical approaches to text categorization 1999 25 Appendix A Source code File main pl Main file for the lexical chainer and linker system yA By Tue Haste Andersen lt haste diku dk gt February 1999 Tested in Sicstus Prolog 3 7 1 use module library objects consult util consult wordnet consult morph consult article consult sentence consult chain consult lexchain consult lexlink test lexlink new example example print chains 26 File lexlink pl Part of the lexical chainer and linker system yA By Tue Haste Andersen lt haste diku dk gt February 1999 Tested in Sicstus Prolog 3 7 1 lexlink 4 super lexchain amp dynamic chain_rel 2 use module library ordsets list to ord set 2 ord_member 2 ord del element 3 ord union 2 amp new Instance super lt new Instance amp write Linking nl Instance get chain_ids X Instance chain_synsets X Cha
4. for categorizing data. This approach could be made far more dynamic than rule induction. In this report I will focus on a browsable hypertext system similar to those found on Internet portals such as the WWW Virtual Library (http://vlib.org). Here the presented links are selected and categorized by humans, so more people must be employed to find, select and categorize the information as the Internet grows. If part of this process could be automated, enormous amounts of resources could be saved. It might be difficult to build a system that ensures high quality in the selected documents, but categorizing documents automatically seems more feasible, because complete knowledge of the documents is not required. Instead, a rather coarse-grained view of the topics present in the documents will do.
The categories on a typical Internet portal are stored in a hierarchy or network structure of topic labels. Today documents are placed in these categories manually, and the whole structure of category labels is also extended manually. The problem is therefore not only to assign a predefined label to a given document, but also to place a document into a hierarchy or network of topic labels and to extend this structure whenever needed. For this purpose the semantic network approach mentioned above can provide the necessary background information. The fact that such background information bases are publicly available
5. S3b Sx S2b Delete S2b S3b lt retract synset rel S2b S3b Try to find other relations 34 lt prune synrel W2 W3 lt prune_synrel W1 W2 prune_synrel _ _ amp not X X fail true After prune synrel prune synsets removes synsets with no relations to other synsets yo Sn prune synsets amp prune synsetsC W R1 C1 W R2 1C2 lt prune synsetsi R1 R2 lt prune synsets C1 C2 amp prune synsetsi amp prune synsetsi X R1 XIR2 lt syn rel X lt prune_synsets1 R1 R2 amp prune synsetsi X R1 XIR2 lt syn rel X lt prune_synsets1 R1 R2 amp prune synsetsi X R1 R2 lt prune_synsets1 R1 R2 amp Given two words with corresponding synsets succedes if there is a relation between the two words and return a list of synset relations of the form rel Weight Synset idi Synset id2 The words are equal relation extrastrong W W S Rel lt relation list extrastrong S S Rel amp The words share one or several synsets relation strong RecencyNo SentenceNo 91 92 Rel Temp is RecencyNo 7 Temp gt SentenceNo ord_intersection S1 82 83 93 1 35 lt relation_list strong S3 S3 Rel amp 4 Horizontal link exists between the two relation strong RecencyNo SentenceNo 91 92 Rel Temp is RecencyNo 7 T
6. Therefore the running time of the strong and medium analysis will probably be somewhere near that of the extra-strong search. It is quite difficult to give an average-time analysis, since it depends on which of the WordNet relations are used for constructing the medium relation search.

Name          Direction
Antonymy      Horizontal
Attribute     Horizontal
Cause         Down
Entailment    Down
Group         Horizontal
Holonymy      Down
Hypernymy     Up
Hyponymy      Down
Meronymy      Up
Participle    Horizontal
Pertain       Horizontal
See also      Horizontal
Synonymy      Horizontal
Table 3.1: Directions assigned to the WordNet relations, as done by St-Onge.

3.4 Lexical chain linker

The linking of the identified chains is done in the lexlink class. The process is very similar to what is done in the chain class; only the allowed relation types differ. Each synset of a chain is matched against the synsets of the other chains using the relation method. Here relation is only allowed to use upward or horizontal links to find a path between the two synsets, and the path must be no longer than ten relations. It is important to note that the object identity of both the lexchain and the chain classes is of crucial importance for its success. Both classes use value assignment through specialized versions of the get/set predicates. Another solution could be to pass the object values as parameters to the objects in a frame-like manner. This would keep the declarative semantics intact but instead it m
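The link constraint described for the lexical linker (only upward or horizontal relations, at most ten relations per path) can be sketched in plain Prolog. This is a minimal illustration and not the lexlink code itself: the directions are copied from Table 3.1, while the predicate names direction/2 and allowed_link_path/1, the lowercase relation atoms, and the representation of a path as a list of relation names are assumptions made for the example.

```prolog
% Directions from Table 3.1 (relation names written as lowercase atoms).
direction(antonymy,   horizontal).
direction(attribute,  horizontal).
direction(cause,      down).
direction(entailment, down).
direction(group,      horizontal).
direction(holonymy,   down).
direction(hypernymy,  up).
direction(hyponymy,   down).
direction(meronymy,   up).
direction(participle, horizontal).
direction(pertain,    horizontal).
direction(see_also,   horizontal).
direction(synonymy,   horizontal).

% allowed_link_path(+Path): Path is a list of relation names with between
% one and ten relations, all of them upward or horizontal.
allowed_link_path(Path) :-
    length(Path, N),
    N >= 1,
    N =< 10,
    all_up_or_horizontal(Path).

% all_up_or_horizontal(+Path): no relation in Path points downward.
all_up_or_horizontal([]).
all_up_or_horizontal([Rel|Rest]) :-
    direction(Rel, Dir),
    ( Dir == up ; Dir == horizontal ),
    all_up_or_horizontal(Rest).
```

Under these assumptions the query allowed_link_path([hypernymy, synonymy, meronymy]) succeeds, while allowed_link_path([hypernymy, hyponymy]) fails because hyponymy is a downward relation.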
7. chain pl Part of the lexical chainer and linker system By Tue Haste Andersen lt haste diku dk gt February 1999 Tested in Sicstus Prolog 3 7 1 This class holds the words of a chain along with the relations between them chain 1 super wordnet amp dynamic word rel 3 amp dynamic syn rel 3 use module library lists member 2 non member 2 amp use module library ordsets list to ord set 2 ord intersection 3 amp attributes chain Chain containing word id s recency 0 amp Holds the sentence number of the most recent added word in the chain new Instance Word super lt instance Instance synsets Word Synsets Instance set chain Word Synsets amp Adds the word Word to the chain and returns the relation type for the added word Nope A cec EE SES d ne E me add Word SentenceNo RelationType lt synsets Word Synsets get chain C1 get recency RecencyNo member w2 52 C1 lt relation RelationType RecencyNo SentenceNo Word Synsets w2 521 Re1 32 Update chain recency chain recency RecencyNo SentenceNo Update chain set chain Word Synsets C1 assert word rel RelationType Word W2 assert synrel Rel prune synrel Word W2 prune synsets Word Synsets C1 C2 set chain C2 amp chain recency RecencyNo SentenceNo SentenceNo gt RecencyNo set recency Sen
8. makes it possible to rapidly develop systems that can be used for experimentation on real-life data. Furthermore, the learning phase is not limited to symbol manipulation of known data, as is the case in the two other examples mentioned above (in the first case pre-categorized texts, in the user model recorded user actions).
In the following chapters I will proceed by introducing background information on the lexicographic database WordNet, along with an overview of research in text retrieval problems. This is followed by a description of the theory of lexical chaining, which is used to disambiguate the meanings of words as represented in WordNet. Finally, a method for categorizing texts is presented, along with the implementation of the prototype system and experiments showing initial results. The main goal is to examine whether a good method for text categorization using lexical chaining can be found, and to test the usability of the lexical chaining technique in practice.
WordNet can be downloaded from http://www.cogsci.princeton.edu/~wn. The system will soon be made available for download from http://www.diku.dk/students/haste/textcat.

Chapter 1 Background

The system described in this project relies heavily on a lexicographic database called WordNet [Fel98]. Other databases are available which contain information about language and semantics, but WordNet is unique in that it covers a large part of the target language, English, and has well de
9. End ground Base ground End name Base BaseList name End EndList append BaseList EndList WordList name Word WordList amp beforedelimeter WordIn Word0ut ground WordIn name WordIn WordInList append WordOutList WordInList name WordOut WordQutList 41 File wordnet pl WordNet access for the lexical chainer and linker system yA By Tue Haste Andersen lt haste diku dk gt February 1999 Tested in Sicstus Prolog 3 7 1 use module library db use module library objects databases s db s sim db sim ant db ant stop db stopl exc db exc mm_db mm mp_db mp hyp db hypl sa db sa at db at per_db per cs_db cs lent db ent g db gl ppl db pp111 open database databases DB open database DB open database open database DB Name Rest db open DB read Ref assert db ref Name Ref open database Rest 7 open database wordnet is a static object containing methods for accessing the WordNet database in a convenient way y A RE SE E E ee eee wordnet 4 super object amp use module library lists non member 2 amp use module library ordsets list to ord set 2 Y DataBase Consult Consults the given term in the associated external database 4 dbc Term functor Term Head db ref Head R 42 db fetc
10. Open Directory Project and then search for links between the categories. The lexical chaining implementation from this system could easily be adapted for this use. In this way it may be possible to discover properties of the relational structure in the network. This information could then be used to insert new nodes automatically.
However, due to time limitations I have chosen to do another test. By implementing the lexical linker from figure 2.1, it will be possible to examine whether links between the chains can be found. However, since the lexical chainer has already found relations between the words of the individual chains, the relation paths between chains probably exhibit a different structure. In the prototype system I will allow longer paths, and only paths containing upward or horizontal directions. In this way I hope to find a path which has some resemblance to a category hierarchy.
See http://www.yahoo.com and http://dmoz.org respectively.

Figure 2.2 (panels: Chains with relations, Category network): The categorization system surrounding the lexical chaining and linking system is searching for a path in the category hierarchy similar to the one found between the chains. An alternative could be to find relations between chains and category labels and hereafter search for a path through the identified labels.

Chapter 3 Implementation

The protot
11. a term found in wordnet and the rest of the words The term is made by first concatenating the first three words of the list the the two first and finally the first If none is found the first word is ignored get next word List Word Rest lt compound word List WordTemp Rest lt checkword WordTemp Word amp get next word List Word Rest lt compound word List WordTempi Rest lower case WordTempi WordTemp2 lt checkword WordTemp2 Word amp get next word List Word Rest lt get next word List Word Rest amp cic ecu ec EM A pt e era Concatenation of terms by the way described above SB dero uec ct c AA etes tsz eze compound word X Y Z Rest Word Rest atom concat X T1 atom concat T1 Y T2 atom concat T2 T3 atom concat T3 Z Word amp 39 compound word X Y Rest Word Rest atom concat X T atom concat T Y Word amp compound word Word Rest Word Rest amp checkword takes as input a noun and transforms it into the baseform If the predicate fails it is either because the input word is not a noun or it was not identified as such 4 Check the word as it is checkword Word Word lt validate Word amp Try to look it up in the exception list of irregular forms checkword WordIn WordOut wordnet dbc exc noun WordIn Word0ut lt validate WordOut amp Try morphing of word re
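The compound-word lookup described in this listing (try the first three words joined together, then the first two, then the first alone, and skip the first word if nothing is found) can be sketched as follows. This is a simplified stand-in, not the morph class: the predicate names candidate/3, join/3, known/1 and next_word/3 are hypothetical, the underscore separator is an assumption because the separator is not legible in the extracted listing, and known/1 is a two-entry toy lexicon replacing the WordNet noun index.

```prolog
% candidate(+Words, -Term, -Rest): on backtracking, yields the three-word,
% two-word and one-word candidates at the head of Words, in that order.
candidate([X, Y, Z|Rest], Term, Rest) :-
    join(X, Y, XY),
    join(XY, Z, Term).
candidate([X, Y|Rest], Term, Rest) :-
    join(X, Y, Term).
candidate([X|Rest], X, Rest).

% join(+A, +B, -AB): AB is A and B joined by an underscore (assumed separator).
join(A, B, AB) :-
    atom_concat(A, '_', A1),
    atom_concat(A1, B, AB).

% known(+Term): toy lexicon standing in for the WordNet noun index.
known(hong_kong).
known(autonomy).

% next_word(+Words, -Word, -Rest): the longest candidate found in the
% lexicon; if no candidate matches, the first word is skipped.
next_word(Words, Word, Rest) :-
    candidate(Words, Word, Rest),
    known(Word),
    !.
next_word([_|Tail], Word, Rest) :-
    next_word(Tail, Word, Rest).
```

With this toy lexicon, next_word([hong, kong, autonomy], W, R) binds W to hong_kong and R to [autonomy]. The sketch relies on the standard atom_concat/3; the util.pl file in this appendix supplies a replacement for Prolog systems that lack it.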
12. bearing oranges
110758147 Any pigment producing the orange color
Table 1.1: Example of the synsets representing the different senses of the word Orange.

1.1 WordNet

In WordNet every word is represented by a set of synonym sets, called synsets (see table 1.1). Each synset represents one meaning of the word. Each synset has a unique id, and most of them have a gloss assigned: a phrase of informal English text describing the meaning of the synset. Apart from this gloss, synsets are defined in terms of links to other synsets. This means that the information in WordNet is built around sets of synonyms, as they are the primary keys for representing the semantics of words.
Figure 1.1 shows a search on the word orange in WordNet. In the left window we see that the word can be used both as an adjective and as a noun. Furthermore, it shows the different types of links from the word. For example, following "is a kind of orange" (hyponyms) will show the synsets which are specializations of orange. The right window shows the four different senses found for the word and displays the gloss associated with each.
A number of different relations exist between synsets in WordNet. They are listed in table 1.2 along with their type. A semantic relation holds between meanings of words (more specifically, synsets), whereas a lexical relation holds between word forms, not meanings of words. The most importa
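The word-to-synset mapping can be illustrated with a few Prolog facts. The sketch below is an assumption-laden toy, not the WordNet Prolog distribution: the fact layout synset(Id, Word, Gloss) and the predicate names are hypothetical, and only the four ids and glosses of Table 1.1 are taken from the text.

```prolog
% synset(Id, Word, Gloss): the four senses of "orange" listed in Table 1.1.
synset(103880945, orange, 'Any of a range of colors between red and yellow').
synset(105783752, orange, 'Round yellow to orange fruit of any of several citrus trees').
synset(109007985, orange, 'Any citrus tree bearing oranges').
synset(110758147, orange, 'Any pigment producing the orange color').

% synsets(+Word, -Ids): all synset ids recorded for Word; fails if there are none.
synsets(Word, Ids) :-
    findall(Id, synset(Id, Word, _), Ids),
    Ids \== [].

% gloss(+Id, -Gloss): the informal description attached to a synset id.
gloss(Id, Gloss) :-
    synset(Id, _, Gloss).
```

Here synsets(orange, Ids) collects the four ids in the order of the facts, and synsets/2 fails for a word with no entry, mirroring the behaviour the wordnet class describes for words without noun synsets.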
13. chain_extract_synsets S1 82 list_to_ord_set S2 83 chain_synsets R1 R2 chain synsets R1 R2 chain synsets R1 R2 28 File lezchain pl Part of the lexical chainer and linker system By Tue Haste Andersen lt haste diku dk gt February 1999 Tested in Sicstus Prolog 3 7 1 The lexchain class implements the chaining of an article by reading each sentence and construct lexical chains lexchain super object amp use module library lists delete 3 amp attributes chain ids ID s of all chains sentences 1 1 amp ID s of all sentence objects new Instance super lt instance Instance Instance chainer amp aos o sols des oe duce cod eccL oe Print out the values of the identified chains 4 print chains get chain ids IDs lt print chains IDs Y print chains Y print chains X Rest X get chain Y lt chain words Y 2 write Z nl lt print chains Rest 4 Given a chain list with each element being Word Synsets returns a list of Word No of synsets elements chain words chain words C W S R11 W N IR2 29 length S N lt chain_words R1 R2 amp Given a list of Word Synsets pairs returns a list of Synsets elements Geta ua a s TB rt LUE T EISE chain extract synsets l amp chain extract syns
14. the lexical linker it is allowed to construct chains of length 10. However, if it is not possible at all to construct chains of that length using the selected relations, it is likely that no relations are found. This is because relations of length 5 have already been sought when constructing the chains. This problem could perhaps be solved by using other relations than those used in the chaining process. An examination of which ones should be used, along with which constraints, could be a possible continuation of this project.

Chapter 4 Discussion

Compared to the traditional ways of analyzing natural language, the use of lexical chains can be described as a hacker's approach. Instead of trying to analyze a whole sentence by some predefined rules, everything which can immediately be assigned a meaning is used and everything else is thrown away. The structure of sentences is not considered, and only nouns are used. However, as the testing of the prototype system shows, the model is in fact quite robust at identifying words of similar meaning.
In text categorization the only thing of interest is to assign one, maybe two, labels to the text, thereby placing it in a network of category labels. For this purpose it is not necessary to understand every single sentence of the text to be analyzed. Instead, the rather coarse-grained view of the lexical chains might be adequate.
An interesting property is the chaining algorithm's ability to specialize when analyzing
15. words of the chains have good relations to each other. However, I have found the most crucial parameter for its success to be which relations are used. In the tests presented here I have restricted the analysis to the use of meronymy, holonymy, hypernymy, hyponymy and antonymy. When trying to include some of the other upward/downward relations, the search time increased dramatically without any better results. Also, when trying to relax the constraints on the use of horizontal relations, only one or two chains were found, containing words without any close semantic relations. These experiments indicate that the restrictions on the medium-strength relation search are of crucial importance for the algorithm's success.
The results of the experiments show that a number of lexical chains are found for each article. In all three tests one chain grows very big and covers about 30% of the article. However, the words in the chains are all naturally related somehow. Whether they are naturally related is of course my personal opinion and not an objective measure. A better measure could be to have a large number of English-speaking persons group the words of the documents. By comparing the results with the lexical chains found by the developed system, a better measure of its success may be found.
Applying the lexical linker to the chains found for each article showed that no relations between the chains could be found. In
16. Llisland 1 Taiwanese 1 China 1 Taiwan 1 embrace 3 wait 2 1 unification 2 Macau 1 Hong_Kong 1 autonomy 2 theory 3 theory 3 theory 3 editorial 1 editorial i1 daily 1 daily 1 daily 1 daily 1 newspaper 1 daily 1 editorial 1 separate 1 Sunday 11 March 1 days 1 Sunday 11 playing 1 election 2 warning 1 warning 1 playing 11 reunification 1 basis 1 law 1 sentiment 1 law 1 claim 1 independence 1 independence 1 tension 2 tension 2 independence 1 scare 1 independence 1 status 1 put 11 attempt 21 Lee 1 Lee 1 Lee 11 p1ot 41 true 11 equa1 1 equa1 11 July 111 fear 2 stance 1 Chen 1 vice 2 war 21 ruling 1 wage 1 1eaders 1 Title Bananas Cultural directions 49 result 1 leave 2 leave 2 leave 2 receipt 1 blena 3 b1end 3 blend 3 foliage 2 fertilizer 1 liquid 1 green 1 ground 1 air 4 growth 1 plant 1 banana 2 water 2 production 1 fruit 1 growth 1 fa11 1 In 1 drinker 2 feeder 1 plant 1 banana 2 soil 1 plant 1 banana 2 sun 1 fruit 1 produce 1 banana 2 plant 1 banana 2 rapid 1 growth 1 plant 1 banana 2 shoot 1 Iproduce 1 Lcorm 1 dry 1 water plant 1 ground 1
17. R1 R2 mm R2 R1 amp link ms R1 R2 ms R1 R2 amp link hs R1 R2 ms R2 R1 amp link mp R1 R2 mp R1 R2 amp link hp R1 R2 mp R2 R1 44 File sentence pl Part of the lexical chainer and linker system By Tue Haste Andersen lt haste diku dk gt February 1999 Tested in Sicstus Prolog 3 7 1 Class holding the words of a sentence When the object is created amp sentence number is given as paramter The words of the sentence is morphologically trasformed and verified to be represented in the nuon index of WordNet A11 the words satisfiying these conditions are stored in this object A Er er ELEME LO E ee I E sentence super article amp attributes number 0 amp Sentence number get words Number Words get number X Y is X41 lt sentence Y SentenceWords lt normalize words SentencelWords Words set number Y Number Y set number Y get words Number Words amp normalize words normalize words Words VerifiedWord Rest2 morph get next word Words VerifiedWord Resti normalize_words Rest1 Rest2 45 File util pl By Tue Haste Andersen lt haste diku dk gt February 1999 Tested in Sicstus Prolog 3 7 1 This predicate is included in Sicstus 3 8 but to run the program under 3 7 it is needed atom concat A B AB ground A ground B name A Alist name B Blist append Alist Blist ABlist name AB ABlis
18. Text categorization using lexical chains
Tue Haste Andersen
Supervisor: László Béla Kovács
Department of Computer Science, Copenhagen University
February 28, 2000

Abstract

In this report I present a prototype system for use in dynamic text categorization research. The system implements lexical chaining as described in recent literature. On top of this is built a simple extension for automatically identifying one or several categories in which to place a given text. The initial tests presented in this report do not give any useful results; however, they give rise to new questions and possible directions for future research on lexical chaining and its uses in text categorization. Along with the implementation, previous research and the lexicographic database WordNet are discussed.

Keywords: Text categorization, dynamic, lexical chaining, WordNet, text retrieval

Contents

Introduction
1 Background
1.1 WordNet
1.2 Previous research
1.2.1 Text Retrieval
1.2.2 Text Categorization
2 Categorization
2.1 Lexical chains
2.2 Categorization
3 Implementation
3.1 Accessing the WordNet database
3.2 Morphologic analysis
3.3 Lexical chainer
19. a text. A text describing animals, humans and plants may place all these references in the same chain, whereas a document describing humans in one context and animals in another may be able to distinguish these things by placing the words in different chains. Although not tested here, this should be possible because of the dynamic word sense resolution. In other words, a system based on this resolution mechanism can be applied to many different areas without further adaptation, because it is able to set a level of detail.
The system shows robustness in the quality of its output, but another aspect is the running time. As discussed in the tests, the result depends on the number of relations used to find medium-strength relations. However, I believe the running time can be reduced by better knowledge of which combinations are likely to give a valid relation, and also by the inclusion of verbs in the analysis. To give a better understanding of which paths are worth searching for, a statistical analysis could be performed.
The lexical linking process is obviously not a success in its present implementation. However, I believe that experiments on finding relations in an existing category network will give more insight into what constraints should be used when searching for relations between chains. But a better lexical linker is not the only thing that is needed. Also, matching against the existing network is not a trivial proces
20. ambiguating hoods automatically, the vector space model can be applied to the hood categories instead of the word symbols. Unfortunately, the results are at best no better than the symbol-based approach. For this reason I have chosen to base the developed system on another technique for automatic word sense resolution, namely lexical chains.

1.2.2 Text Categorization

In the previous section a system using hoods derived from WordNet as categories was discussed. These category labels were used for matching queries against texts to find documents related to the query. The prototype system I am developing is to be used for a category browsing system, which is quite different from the use of the hoods. Finding a hood of a word or synset does not indicate anything about the structure or the context the word is used in.
An attempt to use WordNet and lexical chaining for categorization is Al-Halimi and Kazman's lexical trees [AHK98]. These trees are built in a manner like lexical chains, but instead of finding several chains only one is built, preserving a tree structure between the words.
Another algorithm is QUESCOT by Stairmand [Sta97], also based on lexical chains. The chains are identified using Morris and Hirst's algorithm, but a new additional concept is introduced, namely lexical clusters, which should correspond to the context at fixed points in the analyzed documents. Here they define context as: context can be specified by a word set cons
21. Figure 1.1: Screen shot from WordNet TreeWalk showing a search on Orange. WordNet TreeWalk is available for download from http://www.ac-toulouse.fr/wordnet

Name                   Type       Word forms
Attribute              Semantic   Adjectives, nouns
Synonymy               Semantic   Adjectives
Antonymy               Lexical
Hyponymy/Hypernymy     Semantic   Nouns and verbs
Meronymy/Holonymy      Semantic   Nouns
Entailment             Semantic   Verbs
See also               Lexical    Adjectives and verbs
Cause                  Semantic   Verbs
Participle             Lexical    Verbs, adjectives
Pertain                Lexical    Adjectives, adverbs, adjectives, nouns
Group                  Semantic   Verbs
Table 1.2: Relations in WordNet. The column Abbreviation shows the relation names as they are called in the Prolog distribution of the WordNet database files. Type shows the type of relation, and Word Forms tells possible constraints on which word forms the relation hold
22. 3.4 Lexical chain linker
3.5 Front end
3.6 Testing
4 Discussion
Bibliography
A Source code
B Test results

Introduction

Categorization is important when large amounts of information need to be stored for later retrieval. Examples of uses are the categorization of books and articles, web pages and email messages, and message routing. Today it is of growing importance to find good ways to do this automatically, as the amount of information available electronically is growing fast.
Automatic categorization of text can be done in many ways, depending on the use of the categorized texts. Below are listed three approaches, grouped by what kind of information is available prior to categorization.

Pre-categorized texts: If pre-categorized material is available, it is possible to deduce simple yet fast and effective rules for categorizing new material of the same type. Studies in rule induction have been done by Quinlan [Qui96] and Cohen [Coh95a], primarily focusing on efficiency in terms of speed.

Semantic networks: When a semantic network relating words to actions, concepts etc. is available, material can be categorized in a number of ways relying on the network.

User model: In a system based on a user model it is possible to use the information collected about the user's behavior, preferences etc.
23. eight S1 92 Rest relation list Weight R1 R2 Rest amp analyze path returns a valid relation type to apply to the given path Type eds uta te da analyze_path Type Dir lt direction Type Dir amp analyze_path Path Type Dir lt upward direction Path lt direction Type Dir amp analyze path Path Type Dir lt one direction Path lt direction Type Dir Dir up analyze path Dir l Type Dir lt direction Type Dir amp analyze path horizontal l Type down lt direction Type down amp Ensures that the given path only contains upward directions upward direction up amp upward direction up Rest upward direction Rest amp Ensures that only one direction is taken in the given path 4 2 2 2 2 22222 222 222 2222 22 222 2 one direction Path lt one direction Path amp 37 one_direction Dir Dir one direction Dir Rest Dir lt one direction Rest Dir 38 File morph pl Part of the lexical chainer and linker system By Tue Haste Andersen lt haste diku dk gt February 1999 Tested in Sicstus Prolog 3 7 1 This class performs lookup in wordnet by trying to make compound words transform case from upper to lower inflection and derivation morph 4 super object amp use module library lists append 3 amp 4 Given a list of words returns
24. emp gt SentenceNo lt relation strong S1 82 Rel Rel amp relation strong relation strong S1 R1 R2 syn rel strong S1 82 Re1 lt direction Type horizontal lt link Type S1 82 Pred lt dbc Pred member S2 R2 lt relation strong R1 R2 Rel amp relation strong R1 R2 Rel lt relation strong R1 R2 Rel amp Medium strenght relations relation medium RecencyNo SentenceNo _ S1 _ S2 Rel Temp is RecencyNo 3 Temp gt SentenceNo lt relation_med S1 S2 Rel Rel amp relation med l amp relation med S1 R1 R2 syn rel medium 1 82 Re1 lt relation med S1 1 R2 S2 lt relation med R1 R2 Rel amp relation med R1 R2 Rel lt relation med R1 R2 Rel amp Perform breath first search to ensure that shortest path is found first relation med R1 R2 Path S3 member TestPath R1 length TestPath L L lt 5 self Self setof S2 Self breath rel R1 S2 Set member S3 R2 member S3 Path Set lt relation med Set R2 Path S3 amp breath rel R1 S2 Dir Path member S1 Path R1 lt analyze path Path Type Dir 36 lt link Type S1 52 Pred lt dbc Pred amp Given a weight two lists of synsets returns a list of relations for the synsets relation list 1 amp relation list Weight S1 R1 S2 R2 syn rel W
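The recency tests guarding the strong and medium relation clauses in this listing can be isolated in a small sketch. The constants 7 and 3 and the variable names RecencyNo and SentenceNo come from the listing; the arithmetic operator is not legible in the extracted text, so reading the test as RecencyNo + Span > SentenceNo is an assumption, and the predicate names window/2 and within_window/3 are hypothetical.

```prolog
% window(Strength, Span): how many sentences back a relation of the given
% strength may reach (7 for strong, 3 for medium, as in the listing).
window(strong, 7).
window(medium, 3).

% within_window(+Strength, +RecencyNo, +SentenceNo): the chain was last
% extended in sentence RecencyNo, and the word in sentence SentenceNo is
% still close enough for a relation of this strength.
% Assumption: the lost operator in the listing is '+'.
within_window(Strength, RecencyNo, SentenceNo) :-
    window(Strength, Span),
    Limit is RecencyNo + Span,
    Limit > SentenceNo.
```

Under this reading, a chain last touched in sentence 2 can still accept a strongly related word from sentence 8, but a medium-strength relation would be rejected at that distance.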
25. ets S IR1 S R2 ichain extract synsets R1 R2 amp chainer sentence get_words SentenceNo Words get chain ids C1 lt chainer Words SentenceNo C1 C2 set chain ids C2 lt chainer amp chainer write Done nl Given a list of valid sentence words updates chains chainer _ C C amp chainer Word Rest Sentence Cin Cout write Searching relation to write Word write lt addword Word Sentence Cin ID nl delete Cin ID CTemp lt chainer Rest Sentence ID CTemp Cout amp chainer Word Rest Sentence Cin Cout ls write fail 7 chain new ID Word write Creating chain write ID write write Word nl lt chainer Rest Sentence ID Cin Cout 4 Given a word sentence no and a list of chain id s addword tries to add the word to one of the chains 30 addword Word Sentence Chain ID lt addword Word Sentence Chain ID extrastrong amp addword Word Sentence Chain ID lt addword Word Sentence Chain ID strong amp addword Word Sentence Chain ID lt addword Word Sentence Chain ID medium amp addword Word Sentence ID ID RelationType ID add Word Sentence RelationType write found write RelationType write to 7 write ID amp addword Word Sentence Rest ID RelationType lt addword Word Sentence Rest ID RelationType 31 File
26. fined concepts and relations. Another database that probably contains more information than WordNet is Cyc, available from Cycorp (see http://www.cycorp.com). However, this database has been criticized for not being well structured and therefore difficult to use in text retrieval problems [KMF96]. In [KMF96] it is demonstrated that a number of problems exist when Cyc is used in solving classic text retrieval problems. These include:
• Incomplete and non-uniform coverage of knowledge concepts. WordNet has also been criticized for incomplete coverage, but it has a uniform structure, whereas Cyc lacks selectional constraints on the knowledge.
• Inefficient accessibility of knowledge, because the whole database has to be searched to find everything about a given concept.
There are other electronically available ontologies, including Pangloss, Mikrokosmos and EDR [KMF96]. However, none of these has as wide a coverage of a language as WordNet or Cyc. Among others, these are the reasons why I have chosen to base this project on WordNet.
In this chapter I will give an overview of what WordNet contains and how it is structured. Secondly, I will present an overview of the field of text retrieval in the context of text categorization.

Word     Synset id   Synset gloss
Orange   103880945   Any of a range of colors between red and yellow
         105783752   Round yellow to orange fruit of any of several citrus trees
         109007985   Any citrus tree
27. formance of the output. Another aspect of the algorithm is its running time, which is not discussed in the literature I know of. When comparing applications based on other IR methods, it is important also to compare running times. Many of these tools are used where large amounts of information need to be processed, and therefore performance matters. The issue is discussed in sections 3.3 and 3.6.
2.2 Categorization
The idea is to match lexical chains against a network of categories. As an example, this could be the network of categories you can browse through at Yahoo's portal. Instead of representing the nodes by words, the system should use synsets from WordNet. When the chains of a document have been identified, they can be matched against the category network in several ways (see figure 2.2):
• Find relations between chains and category nodes. Then try to find a path through the matched categories, thereby selecting the one that is semantically closest to the text.
• Find a path of relations between chains and then search for a similar path in the category network.
Both methods give the possibility of adding new categories to the network. This can be done whenever a chain cannot be matched against a node in the network. Of course, it is not a trivial process to connect the chain to the right nodes in the network. A way to examine whether this is a good idea or not could be to extract the category networks from e.g. Yahoo or the
28. h R Term amp Given a word return a list of corresponding synsets Fails if no noun synsets is found synsets Word List lt synsets Word Temp length Temp X gt X 0 list to ord set Temp List amp synsets Word Sid Rest Retreived lt dbc s Sid Word n No non member No Retreived lt synsets Word Rest No Retreived amp synsets amp direction ant horizontal amp direction at horizontal amp direction per horizontal amp direction sim horizontal amp direction sa horizontal amp direction per2 horizontal amp direction hyper up direction mm up amp direction ms up Y direction mp up Y direction cs2 up amp direction hypo down amp direction mh down amp direction hs down amp direction hp down amp 4direction cs down amp direction ent down amp Templates for different relation types Instantiating R1 or R2 to a synset returns a template for use 43 with wordnet dbc link sa R1 R2 sa R1 R2 link ant R1 R2 ant R1 R2 link at R1 R2 at R1 R2 amp link per R1 R2 per R1 R2 link per2 R1 R2 per R2 R1 amp link sim R1 R2 sim R1 R2 amp link cs R1 R2 cs R1 R2 amp link cs2 R1 R2 cs R2 R1 amp link ent R1 R2 ent R1 R2 amp link hyper R1 R2 hyp R1 R2 amp link hypo R1 R2 hyp R2 R1 amp link mm R1 R2 mm R1 R2 amp link mh
29. haining is done in chain and lexchain by performing the morphologic analysis on the words and allocating chain objects containing the actual chains. When an add message is sent to a chain object, the object responds by either accepting the word or failing. If it fails, the lexchain object backtracks and tries another chain object. If no chain will accept the word, a new chain is created by sending a new message to the chain class. The implementation differs from St-Onge's in the way medium-strength relations are found. Instead of finding all of them, calculating a weight for each and selecting the optimal one, I perform a breadth-first search so that the relation with the shortest path is found first. When the first relation is found, it is selected and the search stops. The disambiguation is done in chain by first removing superfluous relations between synsets (prune_synrel/2) and then removing the unconnected synsets (prune_synsets/2). The class uses assert and retract to store the relations between the words and synsets. As described in section 2.1, extra-strong relations are sought throughout all chains before strong or medium relations are sought. Therefore the running time must be at least O(N log N), where N is the size of the text. Strong and medium relations are only sought in a limited scope of chains. However, as will be seen from the next section, a typical analysis has one or two longer chains which continue to be updated throughout the whole analysis.
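The breadth-first strategy described here can be illustrated with a small sketch. It is written in Python rather than the Object Prolog of the prototype, and the toy graph, the function name shortest_relation and the limit of five links (taken from the medium-relation rule in section 2.1) are assumptions made for illustration only; it does not reproduce the prototype's relation predicate.

```python
from collections import deque


def shortest_relation(graph, start, goal, max_links=5):
    """Breadth-first search over synset links.

    Returns the first path found from start to goal as a list of
    (relation, synset) pairs, which is also a shortest one, or None.
    """
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal and path:
            return path          # first hit is selected, search stops
        if len(path) == max_links:
            continue             # do not extend paths beyond the limit
        for relation, neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, path + [(relation, neighbour)]))
    return None


# Hypothetical miniature of WordNet: synset -> [(relation type, synset)].
toy = {
    "apple": [("hyper", "edible_fruit")],
    "pear": [("hyper", "edible_fruit")],
    "edible_fruit": [("hypo", "apple"), ("hypo", "pear"), ("hyper", "produce")],
    "produce": [("hypo", "edible_fruit")],
}

path = shortest_relation(toy, "apple", "pear")
print(path)
```

Because the queue is processed level by level, no weighting step is needed: the first relation reached is already one with the fewest links, which is the trade-off the text describes against St-Onge's weighted selection.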
30. ical chaining and then continue with how the identified chains can be used for categorizing text.
2.1 Lexical chaining
The lexical chaining algorithm was developed by Morris and Hirst. The algorithm groups the words of a given text so that the words of each group have a close semantic relation. The purpose of the algorithm was to correctly disambiguate the meanings of words, but also to give an indication of the text structure [SO95]. Because chains are limited in scope, they tend to indicate the structure of the text [MH91]. However, as stated by Stairmand, it is not clear whether or not there is a relation between the identified chains and the concept of text structure [Sta97]. Morris and Hirst [MH91] based the algorithm on Roget's International Thesaurus, which is a classification of words and phrases into ideas and concepts. However, at the time of development Roget's was not available electronically, and therefore the algorithm was never implemented. This was later done by St-Onge [SO95, HSO98] and Stairmand [Sta97], both adapted for WordNet.
Figure 2.1: Overview of the prototype system. The lexical chainer and linker are the modules that are already developed and presented in section 3. (Figure labels: Information and knowledge, WordNet, Category hierarchy.)
St-Onge's algorithm is based on the original algorithm along with the notion of salience, which was introduced by Okumura and Honda (see [SO95] for further refe
31. ight be confusing to keep passing long lists of parameters between the objects.
3.5 Front end
The front end consists of the predicate test. The article to be analyzed should be placed in the file article.pl before loading. The system gives a progress indication while searching for the chains and finally prints out the chains and their interrelations.
China warns Taiwan about making two states theory legal    CNN web, Jan 31 2000                              383
Bananas Cultural directions                                 From www.plants.com                               330
Tae kwon do                                                 Encyclopedia Britannica 1999 electronic edition   224
Table 3.2: The articles used in the testing of the developed prototype.
3.6 Testing
The system is tested by applying three different articles as input. A description of the articles is shown in table 3.2. The results are printed in appendix B. An initial test of the chainer was first done using the following words: pear, apple, carrot, melon, tree, apples, blue, red, green and yellow. Here we would expect two chains as a result: one containing fruits and vegetables, and one containing colors. In fact, the result is (apple, melon tree, carrot, apple, pear) and (yellow, green, red, blue). Furthermore, we see that the morph class seems to work, in that the word apples has been transformed to apple, and melon and tree have been made a compound word. Looking at the chains produced from the example texts also shows that the
32. inSynsets Instance find_chain_rel ChainSynsets ChainSynsets amp Initiates the process of finding links between the lexical chains Gh eee Seat See ee tee eee tee ete WAMA find_chain_rel _ amp find chain rel X Rest ChainSynsets ord del element ChainSynsets X C1 ord_union C1 C2 synset_rel X C2 find_chain_rel Rest ChainSynsets amp 4 Succeeds if a relation exists between S1 and S2 where S1 is a synset of C1 and S2 is a synset of C2 V A synset rel gt synset rel C C1 C2 member 1 C relation S1 82 27 member S2 C2 write chain rel write S1 write S2 nl assert chain rel S1 S2 amp synset rel C1 C2 synset_rel C1 C2 relation is true if there exists an upward link between the two synsets o O ee dees eee te eae eS eee oe eS relation From To relation From To amp relation From To 01d length 01d L L lt 10 wordnet direction Type Dir member Dir up horizontal wordnet link Type From Temp Pred wordnet dbc Pred Temp To relation From To Temp 01d Gyo SL Se a A EI 4 Given a list of chain id s returns a list of ordered synset sets Only sets of lengths greater than one are included in the output list ya Se eae A eo chain_synsets amp chain_synsets ID R1 S3 R21 ID get chain S1 length S1 L L gt 1 lt
33. isting of keywords of the context. This is done by considering the distribution of the terms found in the lexical chains throughout a document. It is assumed that there exists a tight relation between the contexts in the document and the main topic. Therefore the most dominant context is selected as the topic of the document. A system for categorization of email messages is described in [Tak95]. It is not based on WordNet, but describes a dynamic system based on the semantic distance of the documents in a given category. When the distance becomes too big, the category is split into several sub-categories. In this way the system is able to extend its structure while adding new documents. Unfortunately, the paper only covers a preliminary study.
Chapter 2 Categorization
As there is very little published research in the field of automatic categorization, I have chosen to select a technique successfully applied to other applications of text retrieval, namely lexical chaining. I have then made some initial experiments on how to use it for text categorization. The general idea is first to identify all lexical chains by using an adapted version of Morris and Hirst's original algorithm [MH91, SO95]. Thereafter I will try to identify relations between the found chains to find out which are general and which are specific in meaning. Finally, it may be possible to place the article in one or several sub-categories (see figure 2.1). I will start with a description of lex
34. m, editor. WordNet: An Electronic Lexical Database. MIT Press, 1998.
[Hir00] Graeme Hirst. Context as a spurious concept. In Alexander F. Gelbukh, editor, CICLing-2000, pages 273-285, 2000.
[HSO98] Graeme Hirst and David St-Onge. Lexical chains as representations of context for the detection and correction of malapropisms. In Fellbaum [Fel98], chapter 13, pages 305-332.
[KMF96] Jim Cowie, Kavi Mahesh, Sergei Nirenburg and David Farwell. An assessment of Cyc for natural language processing. Technical Report MCCS-96-302, Computing Research Laboratory, New Mexico State University, 1996.
[MH91] J. Morris and Graeme Hirst. Lexical cohesion computed by thesaural relations as an indicator of the structure of text. Computational Linguistics, 17(1):21-48, 1991.
[MSS] Marti Hearst, Mehran Sahami and Eric Saund. Applying the multiple cause mixture model to text categorization.
[Qui96] J. R. Quinlan. Learning first-order definitions of functions. Journal of Artificial Intelligence Research, 5:139-161, 1996.
[SO95] David St-Onge. Detecting and correcting malapropisms with lexical chains. Master's thesis, Department of Computer Science, University of Toronto, 1995.
[Sta97] Mark A. Stairmand. Textual context analysis for information retrieval. In SIGIR, Philadelphia, 1997.
[Sti99] Rune R. Stilling. Using natural language to search the internet. Master's thesis, Computer Science, Roskilde University, 1999.
35. move ful ending before doing so checkword WordIn Word0ut lt ending WordIn Base ful lt morphword Base WordTemp lt ending WordOut WordTemp ful lt validate WordOut amp Try morphing words which is not ending on ss checkword WordIn Word0ut lt ending WordIn _ ss fail Stop if word g ends on ss lt morphword WordIn WordOut Otherwise try to morph lt validate Word0ut Again try morphing of word by removing what is after possible characters in the word checkword WordIn Word0ut lt beforedelimeter WordIn WordTemp lt morphword WordTemp Word ut 40 lt validate WordOut amp Succeds if the word is a noun represented in wordnet and the word is not in the stoplist validate Word wordnet dbc stop Word fail amp validate Word wordnet dbc s Word n morphword WordIn Word0ut lt suffix noun End01d EndNew lt ending WordIn Base End01d lt ending WordOut Base EndNew amp suffix noun s amp suffix noun ses s amp suffix noun xes x amp suffix noun zes z amp suffix noun ches ch amp suffix noun shes sh amp ending Word Base End ground Word ground End name Word WordList name End EndList append BaseList EndList WordList name Base Baselist amp ending Word Base
36. nary and how much should be handled by a program. In the case of English there are many regular inflections of words, and therefore it would be possible to write an inflection algorithm that can handle a large part of the morphologic transformation. On the other hand, it would be hard to cover all inflections without listing some of them in a lexicon. For example, there is no rule telling that the plural of child is children [Cov94]. The front end of the native distribution of WordNet handles the problem by doing inflection and, if that fails, looking the word up in a list of irregular word forms. This list is not supplied in the Prolog distribution, so I have chosen to convert the files from the native version to Prolog and then implement my own inflectional transformation system. This is done in the morph object. Sending a checkword message to the morph object with a word as input instantiates the second parameter with the same word, but in a possibly different form which can be found in the WordNet database. validate is used by checkword to ensure that the derivation can be found in WordNet's noun index. Furthermore, the morph object searches for compound words and transforms words from upper to lower case when needed.
3.3 Lexical chainer
The lexical chainer is implemented using the two dynamic classes chain and lexchain. When used, lexchain fetches one sentence at a time and then tries to add each word to one of the chains. The lexical c
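The interplay of checkword and validate described in section 3.2 can be sketched as follows. This is a Python illustration, not the Object Prolog of the morph object: the suffix pairs follow the suffix rules listed in the morph source in appendix A, while the tiny noun index, stop list and irregular-form table are hypothetical stand-ins for the WordNet files, and the order in which the steps are tried is an assumption.

```python
# Suffix rewrite rules (old ending, new ending); longer endings first.
SUFFIX_RULES = [("shes", "sh"), ("ches", "ch"), ("ses", "s"),
                ("xes", "x"), ("zes", "z"), ("s", "")]

# Hypothetical stand-ins for WordNet's noun index, the stop list and
# the list of irregular word forms.
NOUN_INDEX = {"apple", "child", "box", "glass", "tree", "thing"}
STOP_LIST = {"thing"}
IRREGULAR = {"children": "child"}


def validate(word):
    """Succeed only for nouns in the index that are not stop words."""
    return word in NOUN_INDEX and word not in STOP_LIST


def checkword(word):
    """Return a valid base form of word, or None if none is found."""
    word = word.lower()
    if validate(word):
        return word
    if not word.endswith("ss"):          # words ending in "ss" are not morphed
        for old, new in SUFFIX_RULES:
            if word.endswith(old):
                candidate = word[:len(word) - len(old)] + new
                if validate(candidate):
                    return candidate
    base = IRREGULAR.get(word)           # fall back to the irregular forms
    if base is not None and validate(base):
        return base
    return None


print(checkword("Apples"), checkword("glasses"), checkword("children"))
```

Keeping validate separate from the rewriting step means every candidate form, regular or irregular, passes through the same noun-index and stop-list check before it is accepted.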
37. nt relation of table 1.2 is synonymy. Another important relation represented in WordNet is antonymy, a lexical relation between words of opposite meaning; e.g. ascend and descend are antonyms. Hyponymy and hypernymy form a semantic relation describing a hierarchy between word meanings. Here tool is a hypernym of fork, and fork is a hyponym of tool. Meronymy and holonymy can be described by the has-a or part-of relation. E.g. a motor is part of a car; therefore motor is a meronym of car, and car is a holonym of motor. In WordNet the actual representation of the car-motor relation consists of several intermediate levels: motor is a hypernym of engine, which is a hypernym of automobile engine, which is a part meronym of car. Furthermore, meronymy is divided into three types of relations in WordNet, namely member
[Figure: WordNet synsets and relations for the word orange; the figure text is not recoverable from the extraction.]
38. rence. Salience means that both the recency and the length of a chain are taken into account when building chains. The algorithm works by looking up one word at a time and then trying to establish relations between the word and one of the chains. The relations can either be between the word symbols or the synsets assigned to the words.
Read word.
Skip word if found in stop list.
Find semantically equal term in WordNet by performing morphologic analysis.
Find relation between word and one of the words in the already initialized chains.
If none is found, make a new chain and initialize it with the new word.
If a relation is found, disambiguate synset senses of words in the chain by pruning all senses not used to find the relations between the words.
This continues until no more words are available. To use the salience concept, the chains are searched in order of recency, i.e. the read word is compared to the words in the chains most recently updated. If a chain has not been updated within a special threshold value, the chain is not updated any more. Finding relations between words is done by applying one of several rules, each assigned a weight. Here each relation in WordNet has been given a direction (see table 3.1).
Extra strong: The words are equal.
Strong: There exist one or more direct horizontal relations between the synsets of the two words.
Medium: A relation can be made using the following rules:
• The number of rela
39. s between substance and part meronymy. An ER diagram of WordNet is shown in figure 1.2. Even though the participle relation connects verbs and adjectives, it only consists of 90 relations. This weak connection between different syntactic categories makes it difficult to use all the information stored in WordNet.
Figure 1.2: ER diagram showing relations between different syntactic categories in WordNet. (Figure labels: Adjectives, Adverbs.)
1.2 Previous research
1.2.1 Text Retrieval
In Information Retrieval, many underlying techniques have been tried on a wide variety of applications. Most common are models based on symbol manipulation [Yan99, MSS]. This type of model relies on a correlation between words as symbols and their meaning, which is present only to some extent. In other words, the models do not take into account the different meanings of the words, which is also reflected in their performance. The problem of text retrieval is traditionally viewed as matching a query string against a set of texts and thereby finding the documents that are relevant to the query [Voo98]. One common way of doing this is by using a vector space model [Voo98]. Here each text is represented by a T-dimensional vector, where T is the number of different words in the text. The length of the vector in a given direction is given by the number of times a particular word occurs in the text. Given a query, the vector for that quer
40. s to implement. This part should preferably also include feedback to the category network in the form of added categories. An overall view of the subject shows that there are great possibilities in a database like WordNet, covering a wide variety of the English language. Although the database needs to be extended with better connections between the different syntactic categories, more knowledge on how to use these relations is also needed. The developed prototype makes it possible to investigate these properties of WordNet. I think that what is done today using lexicographic databases like WordNet is only the tip of the iceberg. By using a system based on e.g. lexical chaining, I believe it is possible to make advanced language-based interfaces without a complete knowledge of the theory of natural language.
Bibliography
[AHK98] Reem Al-Halimi and Rick Kazman. Temporal indexing through lexical chains. In Fellbaum [Fel98], chapter 14, pages 333-351.
[Coh95a] William W. Cohen. Fast effective rule induction. In Machine Learning: Proceedings of the Twelfth International Conference, 1995.
[Coh95b] William W. Cohen. Text categorization and relational learning. In Machine Learning: Proceedings of the Twelfth International Conference, 1995.
[Cov94] Michael A. Covington. Natural Language Processing for Prolog Programmers. Prentice Hall, 1994.
[Fel98] Christiane Fellbau
41. t Converts the term T1 to a term T2 only consisting of lower 4 cases lower case T1 T2 name T1 Upcase convert case Upcase Lowcase name T2 Lowcase Given a list of character codes convert those of upper case to lower case A a Aa A E E convert case convert case U R1 L R2 U gt 64 U 01 Verify that the char is uppercase Lis U 32 convert case R1 R2 convert case L R1 L R2 convert case R1 R2 46 File article pl Test data for the lexical chainer and linker system yA By Tue Haste Andersen lt haste diku dk gt February 1999 Tested in Sicstus Prolog 3 7 1 use module library objects article 1 super object amp name einstein amp sentence 1 pear apple carrot melon tree apples blue red green yellow 4T Appendix B Test results The following shows the found lexical chains for the three examples see table 3 2 For each word the number of associated synsets is shown next to it Because the example texts are copyrighted they are not included here Title Tae kwon do running time 54 minutes on Intel Pentium 100MHz Chains martial_art 1 tae_kwon_do 1 tae_kwon_do 1 martial_art 1 tae kwon do 1 karate 1 tae kwon do 111 contact 1 short 1 opponent 1 sparring 1 free 1 practice 1 student 1 attack 1 sequence 1 set 1 sparring 1 step 1 id 1 combination 1 sparring i1
42. tenceNo amp chain recency Y SSE el MA ZS SZESZ SS AUC AE EE aa Given a word of chain return associated synsets This method differs from wordnet synsets in that only the synsets which have not been removed by the word sense disambiguation is returned pea e ua A uem chain synsets W S get chain C member W S C amp assert synrel amp assert synrel X R assert X lt assert synrel R amp synset rel X Y syn rel X Y amp 33 synset_rel X Y c syn_rel _ Y X retract_synset_rel X Y lt syn_rel _ X Y retract syn rel X Y retract_synset_rel X Y lt syn_rel _ Y X retract syn_rel _ Y X When a chain is updated with new relations prune synsets is called to remove superfluous synsets prune synrel Wi W2 Find a relation from W2 to W3 using a synset 2 4 which also connects Wi S2a S3a lt chain synsets W1 R1 lt chain synsets W2 R2 member Sia R1 member S2a R2 lt synset_rel Sla S2a lt chain_synsets W3 R3 non member W3 W1 W2 member S3a R3 lt synset_rel S2a S3a Find a relation between a synset in W2 and W3 not connected to Wi S2b S3b member S2b R2 not synset rel S1b S2b member S1b R1 lt synset rel S2b S3b member S3b R3 Ensure that another relation than S2b S3b exists from W3 member S3c R3 c synset rel S3c Sx S3c
43. tions between the two synset pairs must be no greater than five.
• No more than one change in direction is allowed, unless it is horizontal.
• Upward relations are only allowed if no other direction changes have been made before.
The medium relations are further weighted by their length and number of changes in direction. When a relation is found, synsets that are not directly connected to other synsets in the chain are pruned. In this way disambiguation is done incrementally whenever new words are added and more information about which senses of the words apply is available. As mentioned earlier, it can be difficult to make tight relations between lexical chains and other semantic entities. The word disambiguation might be coarse grained, but lexical chaining still has many useful purposes in information retrieval. Compared to other methods in this field, lexical chaining has been applied with success to the detection of malapropisms [HSO98], and initial experiments in categorization have also been made [Sta97]. St-Onge's implementation is based on version 1.4 of WordNet. The most important difference between this version and version 1.6, at least when lexical chaining is considered, is that version 1.4 lacks relations between verbs and nouns. This led St-Onge to limit the analysis to nouns. When a word was found that could not be looked up in WordNet's noun database, it was skipped. With the use of version 1.6 it may therefore be possible to improve per
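The three rules for medium-strength relations can be made concrete with a small check over the sequence of link directions along a candidate path. This Python sketch encodes one reading of the rules, in which the horizontal exception means a horizontal run may sit between an upward and a downward run; that reading, the function name is_medium_path and the direction labels are assumptions, not taken from the prototype's code.

```python
from itertools import groupby


def is_medium_path(directions, max_links=5):
    """Check a path, given as one direction per link, against the rules.

    directions holds 'up', 'down' or 'horizontal' for each link.
    """
    if not directions or len(directions) > max_links:
        return False                      # rule 1: at most five links
    # Collapse consecutive equal directions into runs, e.g.
    # ['up', 'up', 'down'] -> ['up', 'down'].
    runs = [direction for direction, _ in groupby(directions)]
    if "up" in runs[1:]:
        return False                      # rule 3: nothing may precede 'up'
    if len(runs) <= 2:
        return True                       # rule 2: at most one change
    # Assumed exception: horizontal links bridging upward and downward runs.
    return runs == ["up", "horizontal", "down"]


print(is_medium_path(["up", "up", "down"]))
print(is_medium_path(["down", "up"]))
```

Checking runs rather than individual links keeps the three rules independent of path length, so the same function could serve as a filter inside a search that enumerates paths of up to five links.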
44. void this I converted the files to external Prolog database files. In this way the programs using the database load faster, and I could easily adjust the indexing mechanism to my needs. The convert program relies on the files from the WordNet distribution being available in the same directory as the program runs from. Also, some of the files from the native distribution require preprocessing before they can be read by the converter program. For accessing the WordNet database from Prolog I have written a static class called wordnet (appendix A). References to the databases are opened when the file containing the class definition is loaded. The class contains two specialized methods:
dbc: This method takes as argument a WordNet relation (see table 1.2 for predicate names) and instantiates its parameters.
synsets: Given a word, returns a list of valid synsets for that particular word. If the word cannot be found in the database, the method fails.
Furthermore, the class contains template descriptions and definitions of the relations used in lexical chaining.
3.2 Morphologic analysis
Morphologic analysis is in general divided into two different processes [Cov94]:
• Inflection, which is the process of transforming between different forms of a word, and
• Derivation, which is a transformation between words of different syntactic categories.
For practical implementation purposes there is always a tradeoff between how much you want to list in the dictio
45. y is calculated and matched against each vector of the text collection. Only the text vectors that are similar to the query vector are returned as results. Obvious advantages of using simple symbol-based approaches like this are:
• They are fast, robust and well known [Coh95a, Coh95b, Qui96].
• They can be made general, i.e. they do not need supervised configuration for use in different areas [MSS].
The biggest disadvantage of this kind of model is its limited accuracy: the best results achieved find about half of the relevant documents. Using a precision/recall evaluation scheme, this means that no more than 50% of the found articles of a test set are in fact relevant, and no more than 50% of the truly relevant documents are found [Voo98]. A way to enable the vector space model to take into account the different senses of words is to apply WordNet for resolving the senses of the words. George Miller has done exactly that by adding a new WordNet construct called a hood [Voo98]. The idea is to assign a number of categories to each word and then to disambiguate the categories so that only one is left for each word. The category labeling is defined by the hood, which itself is defined by the hyponymy relation. A hood is the synset that the word synset is a descendant of, as long as no other relations exist from the hood down to the word synset. By assigning and dis
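The vector space matching described in section 1.2.1 can be sketched in a few lines. The text does not say how similarity between vectors is measured; cosine similarity is assumed here as a common choice, and the function names and the threshold value are hypothetical.

```python
from collections import Counter
from math import sqrt


def term_vector(text):
    """Map each distinct word to the number of times it occurs."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, texts, threshold=0.2):
    """Return the texts whose vectors are similar enough to the query vector."""
    query_vector = term_vector(query)
    return [t for t in texts if cosine(query_vector, term_vector(t)) >= threshold]


texts = ["banana plant soil", "taiwan china state"]
result = retrieve("banana soil", texts)
print(result)
```

The sketch treats words purely as symbols, so two texts using different words for the same meaning score zero, which is the weakness of symbol-based models that the surrounding text points out.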
46. ype system implements the following components:
• Database access to WordNet
• Morphologic analysis
• Lexical chainer
• Lexical chain linker
• Front end
The system is developed in SICStus Prolog [Swe98] using the Object Prolog extension of SICStus. Object Prolog is an object-oriented extension of the Edinburgh dialect of Prolog and is very similar to Prolog [Swe98]. The reason why I chose to develop the system in Object Prolog is primarily that the description of lexical chaining made by St-Onge relied on object-oriented concepts. Furthermore, I wanted to develop a program in an object-oriented extension to Prolog, as I have never tried this before. The source code for the lexical chainers developed by St-Onge and Stairmand is not publicly available. My implementation is primarily based on [SO95], although some relations used in the chaining process might be different. The following description of the implementation primarily discusses implementation details that are important only for the developed system.
3.1 Accessing the WordNet database
WordNet is distributed both as databases with native browsing interfaces for Windows, Unix and MacOS, and as Prolog-readable files. I used the Prolog distribution for this project, along with some files from the native distribution which were not included in the Prolog distribution. The total size of the files is about 19 MB. Loading and compiling them requires heavy computation. To a
