JAPANESE KANJI SUGGESTION TOOL: A Writing Project
[Figure 2: Google search keyword suggestions for the sample Japanese word]

Yahoo search works similarly to Google. Yahoo search also gives suggestions in katakana, but they are different from the ones suggested by Google.

[Figure 3: Yahoo search keyword suggestions for the sample Japanese word]
The Bing search engine provides keyword suggestions that contain combinations of all three letters of the query string.
Here, New P(q→r) is the new probability from state q to state r, Total P(q→r) is the total probability from state q to state r, and Σ Total P is the sum of all total probabilities.

My purpose in calculating total probabilities is to find the new probability that will be greater than the previous probability. When the difference between the new probability and the old probability is almost zero, I stop iterating through the HMM, and the result is my new re-estimated emission probability matrix.

4 N-gram approach to text parsing

The n-gram is one of the methods that can be used for Japanese text parsing. It is used in the TANGO algorithm that I explained earlier.

4.1 N-gram definition and its use

An n-gram is a probabilistic model which predicts the next item in a sequence by looking at the previous N-1 items. The n-gram model is one of the most important tools in speech and language processing. An n-gram of size 1 is referred to as a unigram, size 2 a bigram, size 3 a trigram, and size 4 or more is simply called an n-gram [4, 20]. The n-gram approach is widely used in statistical natural language processing. For text parsing, words are modeled such that each n-gram is composed of n words. In the earlier Hidden Markov Model example, the sentence is "eye drops off shelf"; the sequence of words in trigrams would be "eye drops off" and "drops off shelf".
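As an illustration, here is a minimal Java sketch of extracting word n-grams from a sentence. This is illustrative only, not the project's code; the class and method names are my own.

```java
import java.util.ArrayList;
import java.util.List;

public class NGramDemo {
    // Return all n-grams of the given size from a whitespace-separated sentence.
    static List<String> ngrams(String sentence, int n) {
        String[] words = sentence.split("\\s+");
        List<String> result = new ArrayList<String>();
        for (int i = 0; i + n <= words.length; i++) {
            StringBuilder gram = new StringBuilder();
            for (int j = 0; j < n; j++) {
                if (j > 0) gram.append(' ');
                gram.append(words[i + j]);
            }
            result.add(gram.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints: [eye drops off, drops off shelf]
        System.out.println(ngrams("eye drops off shelf", 3));
    }
}
```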
Emission probability: for state q: eye 0.4, drops 0.3, off 0.2, shelf 0.2; for state r: eye 0.2, drops 0.4, off 0.1, shelf 0.3.

[Figure 5: Example of the HMM [1]]
[Figure 6: An alternate view of the search for the best path [2]]

3.3 The Hidden Markov Model training

The sub-problems involved with the given Hidden Markov Model are as follows:

1. Given the Hidden Markov Model λ = (A, B, π) and a sequence of observations O, find P(O | λ). Here we want to determine the likelihood of the observed sequence, given the model [11]. This problem can be solved using the Forward algorithm. The Forward algorithm is similar to the Viterbi algorithm, except that the purpose of the Forward algorithm is to find the sum over all possible ways to reach some end state. Thus, in the Forward algorithm, the probability of reaching some end state is the sum, over all paths leading to that end state, of the product of the probabilities along each path.

2. Given the Hidden Markov Model λ = (A, B, π) and an observation sequence O, find an optimal state sequence for the underlying Markov process. In other words, we want to uncover the hidden part of the Hidden Markov Model [11]. This problem can be solved using the Backward algorithm. Calculations for the Forward and the Backward algorithms are explained in the tables below (Tables 1 and 2).
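To make the Forward recursion concrete, here is a compact Java sketch for a two-state HMM. It is an illustration, not the project's code; the parameter values are assumptions adapted from the example above (each emission row normalized to sum to 1).

```java
public class ForwardDemo {
    // alpha[t][s]: probability of emitting observations 0..t and ending in state s.
    static double forward(double[] start, double[][] trans, double[][] emit, int[] obs) {
        int nStates = start.length;
        double[][] alpha = new double[obs.length][nStates];
        for (int s = 0; s < nStates; s++) {
            alpha[0][s] = start[s] * emit[s][obs[0]];        // initialization
        }
        for (int t = 1; t < obs.length; t++) {
            for (int s = 0; s < nStates; s++) {
                double sum = 0.0;
                for (int prev = 0; prev < nStates; prev++) {
                    sum += alpha[t - 1][prev] * trans[prev][s];  // sum over incoming paths
                }
                alpha[t][s] = sum * emit[s][obs[t]];
            }
        }
        double total = 0.0;                                   // P(O | model): sum over end states
        for (int s = 0; s < nStates; s++) total += alpha[obs.length - 1][s];
        return total;
    }

    public static void main(String[] args) {
        double[] start = {1.0, 0.0};                          // states q, r
        double[][] trans = {{0.7, 0.3}, {0.6, 0.4}};          // q->q, q->r, r->q, r->r
        double[][] emit = {{0.4, 0.3, 0.2, 0.1},              // q: eye, drops, off, shelf
                           {0.2, 0.4, 0.1, 0.3}};             // r: eye, drops, off, shelf
        int[] obs = {0, 1, 2, 3};                             // "eye drops off shelf"
        System.out.println(forward(start, trans, emit, obs));
    }
}
```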
6.4 Precision and recall

6.4.1 Goal

To verify the correctness of the outputs given by the HMM program, I decided to run a precision and recall experiment on my program. The goal of this experiment is to evaluate how accurately the HMM program works. I received help from Seichiro Inaba, a Japanese professor at San José State University, and my Japanese friend Tomoko Oshimo in conducting this experiment. I gave them 20 strings of length two and asked them to write down the most commonly used word starting with the two-lettered string given to them. In this experiment of checking the accuracy of the HMM program, I followed two approaches. I used the usual way of determining the likelihood of the observed sequence, where I used the Forward algorithm to determine the probability. I also used another way of calculating the likelihood of the sequence, which is the Viterbi algorithm. All the results are recorded in Table 4.

6.4.2 Results

The responses are summarized in Table 4.
In the HMM, I know what the observable sequence is, but I do not know what states it travelled through to produce this sequence. In the HMM, the states are not observable. We only know some probabilities with which the model passes through them. The HMM works like any other finite state machine, having a set of hidden states, observable states, transition probabilities, initial state probabilities, and output probabilities. The current state is not observable, but each state produces an output with a certain probability. The HMM consists of a set of states, as in any other finite automata model (e.g., states q and r), transition probabilities (i.e., traveling from state q to r), a set of observations (such as the string "aabb"), and emission probabilities (e.g., the probability of seeing a symbol b in state r). The problem with the HMM is how to find the state sequence that best explains the given set of observations, because it is possible to have more than one state sequence that leads to the same observation sequence. This problem can be solved with the Viterbi algorithm. The Viterbi algorithm uses dynamic programming techniques; it can be viewed as the Forward algorithm where the sum is replaced by the maximum [11]. Hence, in the Viterbi algorithm, the purpose is to find the probability of the best overall path [11]. Our example of the Viterbi algorithm for the given HMM is as follows.
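A minimal Java sketch of the Viterbi recursion for this kind of two-state HMM follows. It is an illustration under assumed parameter values, not the project's implementation, and it omits the back-pointers that would recover the best path itself.

```java
public class ViterbiDemo {
    // Returns the probability of the single best state path for the observation sequence.
    static double viterbi(double[] start, double[][] trans, double[][] emit, int[] obs) {
        int nStates = start.length;
        double[] best = new double[nStates];
        for (int s = 0; s < nStates; s++) {
            best[s] = start[s] * emit[s][obs[0]];
        }
        for (int t = 1; t < obs.length; t++) {
            double[] next = new double[nStates];
            for (int s = 0; s < nStates; s++) {
                double max = 0.0;
                for (int prev = 0; prev < nStates; prev++) {
                    // Same recursion as Forward, but max replaces the sum.
                    max = Math.max(max, best[prev] * trans[prev][s]);
                }
                next[s] = max * emit[s][obs[t]];
            }
            best = next;
        }
        double answer = 0.0;
        for (int s = 0; s < nStates; s++) answer = Math.max(answer, best[s]);
        return answer;
    }

    public static void main(String[] args) {
        double[] start = {1.0, 0.0};
        double[][] trans = {{0.7, 0.3}, {0.6, 0.4}};
        double[][] emit = {{0.4, 0.3, 0.2, 0.1}, {0.2, 0.4, 0.1, 0.3}};
        System.out.println(viterbi(start, trans, emit, new int[]{0, 1, 2, 3}));
    }
}
```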
Emission probabilities (selected values, as extracted): 4.95E-05, 0.447183784076656, 0.00730490429186804, 0.0750625828636172, 0.0018879778794974, 0.0404963564466616, 0.055859237706221, 0.0166509314934374, 0.124259191281704, 1.15E-04, 0.018560375450746, 0.0187819026052506, 0.0013685645829742, 0.0315193025907701, 3.61E-12, 0.026045103838904, 0.00518549978425972, 8.69E-05, 0.0232374381120533, 2.00E-13, 0.00112923903996145, 6.40E-42, 3.52E-05, 8.36E-49, 8.47E-04, 3.50E-04, 9.68E-55

Log Prob = -239038.55238082897

Appendix 2: HMM program results for the English text

HMM parsing on English text. No. of observations from file: 50,000. No. of states: 2. No. of observations: 27. Iteration: 100.

Start probabilities: 0.999999999999999, 1.07E-15

Transition probabilities:
0.224583895172636  0.77541610482723
0.709229231887198  0.290770768113053
5 Design and implementation
5.1 Japanese language processing
5.2 Hidden Markov Model program description
5.2.1 Number of iterations and observations
5.2.2 Number of states
5.2.3 Japanese corpus
5.2.4 The software
5.2.5 The GUI: the Nutch web crawler
6 Experiments and results
6.1 The HMM for English text
6.1.1 Goal
6.1.2 Results
6.2 The HMM for Japanese text
6.2.1 Goal
6.2.2 Results
6.3 Tanaka Corpus programs for parsing
6.3.1 Goal
6.3.2 Results
6.4 Precision and recall
6.4.1 Goal
6.4.2 Results
7 Conclusion
8 References
Appendix 1: HMM program results for the Japanese text
Appendix 2: HMM program results for the English text

List of Figures
1. Example of morphological analysis
2. Google search keyword suggestions for the sample Japanese word
3. Yahoo search keyword suggestions for the sample Japanese word
4. Bing search keyword suggestions for the sample Japanese word
5. Example of HMM
6. An alternate view of the search for the best path
7. Guessed HMM
8. Instant query suggestion
9. Link displayed for "Did you mean"
10. Search results after user clicks on "Did you mean" link
11. Binary Search Tree experiment
12. Dictionary experiment with window of length three
13. Experiment of creating Japanese word dictionary
Hiragana characters have a Unicode range between 0x3040 and 0x309F [17]. Katakana characters have a Unicode range between 0x30A0 and 0x30FF [18].

5.2 Hidden Markov Model program description

The HMM program is the heart of this project; it is written in Java using JDK 1.6.

5.2.1 Number of iterations and observations

In my program, I used 100 iterations to build the final probability matrix of the HMM. I decided on this number because after 100 iterations I could see convergence in the HMM probabilities, with a slight convergence at each iteration. First, I used a total of 50,000 observations for the HMM. From the convergence of the probabilities, I thought it would be better to use more observations, so that I could see how drastically the HMM converges from the first iteration to the last. Hence, I changed the total to 100,000 observations.

5.2.2 Number of states

With more states, more time was required to run the HMM. I decided to build the HMM for two, three, and four states and checked the difference in the probabilities. With the 15 MB Tanaka Corpus, the memory requirement to run the HMM on four states increases significantly. The final probability matrices for two, three, and four states are almost the same. Thus, I decided to use the HMM with two states to carry out all my experiments. All the resulting matrices for two and three states are listed in Appendix 1.

5.2.3 Japanese corpus

For carrying out all my experiments, I used the Tanaka Corpus.
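Based on the Unicode ranges above, here is a small illustrative Java helper for classifying characters while reading a UTF-8 corpus. The method names and the kanji range are my own assumptions (a common simplified check), not the project's code.

```java
public class JapaneseChars {
    static boolean isHiragana(char c) {
        return c >= 0x3040 && c <= 0x309F;   // Hiragana block [17]
    }

    static boolean isKatakana(char c) {
        return c >= 0x30A0 && c <= 0x30FF;   // Katakana block [18]
    }

    // CJK Unified Ideographs; a common (simplified) range check for kanji.
    static boolean isKanji(char c) {
        return c >= 0x4E00 && c <= 0x9FFF;
    }

    public static void main(String[] args) {
        System.out.println(isHiragana('あ'));  // true
        System.out.println(isKatakana('ア'));  // true
        System.out.println(isKanji('字'));     // true
    }
}
```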
Sentence: eye drops off shelf. Set of states: {q, r}. Set of observations: eye, drops, off, shelf. As the Viterbi algorithm follows the dynamic programming approach, it is required to store the best path and its probability for each possible state [1].

The purpose of the HMM training is to find the HMM that maximizes the probability of the given observation sequence. If we know the observation sequence, we have to find the three parameters: initial state probabilities, transition probabilities, and emission probabilities. To achieve this, I used the Forward-Backward (Baum-Welch) algorithm. I followed the hill-climbing approach: I started with some random HMM model and iteratively made small changes to the solution, improving it a little each time. When no more improvement is possible, the training stops. I do not know what the model is, but I can work out the probability of the observation sequence using some randomly chosen model. By looking at it, I can see which state transitions and emissions were probably used the most. By increasing the probability of those, I can choose a better revised model which gives a higher probability to the observation sequence. Thus, I am following a process of maximization, which is referred to as training the model. Hence, I guess random transition probabilities from state q to state r. This is how I found the new re-estimated probability:

New P(q→r) = α_q(t) × P(q→r) × β_r(t+1)
In the following section, I will briefly explain the working of these two models.

3.1 Markov model definition

The Markov model is a finite automata model which is used to find out what the next state of the model will be, depending on the current state. Hence, to predict the future state, you do not have to look at the history of states of the model. For example, consider the sentence "The monster swallowed ...". In this sentence, after the verb "swallowed", it is less likely that the next word will be a verb again. Hence, by looking at the current word, which is "swallowed", I can say that the next word will most likely be some article such as "a", followed by a noun such as "boy". The purpose of the Markov model is to find the next word with the highest probability. A transition probability is assigned to each path, and it is better to follow the path with the highest probability to traverse from one state to the other.

3.2 Property of the Hidden Markov Model

In the HMM, the states that the model passes through are unknown to us. We only know the observation sequence, the probabilities of transitioning from one state to the other, and the emission probability for the current state (the chances that a given observation will be generated).

HMM example. States: q, r. Observations: eye, drops, off, shelf. Start probability: q 1.0, r 0.0. Transition probability: q→q 0.7, q→r 0.3, r→q 0.6, r→r 0.4.
Table 1. Forward probabilities for "eye drops off shelf" (observations eye, drops, off, shelf over timestamps 1 to 4; values as extracted)

Forward probability of state q at timestamp t: 0.0876, 0.0149, 0.0023
Forward probability of state r at timestamp t: 0.12, 0.0444, 0.007, 0.0012
Total probability: 1.0, 0.4, 0.132, 0.0219

For a given observation "eye", what is the total probability? It is the sum of the probabilities of all paths that end up in state q or in state r. Hence, in the above example:

α_q(2) = α_q(1) × P(q→q) × b_q(eye) + α_r(1) × P(r→q) × b_r(eye)
α_q(2) = 1.0 × 0.7 × 0.4 + 0.0 × 0.6 × 0.2
α_q(2) = 0.28

Where α_q(2) is the forward probability of state q at time 2, α_q(1) is the forward probability of state q at time 1, P(q→q) is the transition probability from state q to next state q, b_q(eye) is the emission probability for observation "eye" from state q, α_r(1) is the forward probability of state r at time 1, P(r→q) is the transition probability from state r to next state q, and b_r(eye) is the emission probability for observation "eye" from state r.

Similarly, I can calculate the forward probability for the other states and at different timestamps. As the sequence moves forward, these are called forward probabilities. We can also use the Backward algorithm instead of the Forward algorithm. The difference is that in the Backward algorithm I calculate the probability of the sequence by working backward.
Table 2. Backward probabilities for "eye drops off shelf" (columns are the remaining observations at each timestamp: "eye drops off shelf", "drops off shelf", "off shelf", "shelf"; values as extracted)

Backward probability of state q at timestamp t: 0.0083, 0.032, 0.1, 1
Backward probability of state r at timestamp t: 0.0105, 0.018

Hence, in the above example:

β_q(2) = β_q(3) × P(q→q) × b_q(drops) + β_r(3) × P(q→r) × b_r(drops)
β_q(2) = 0.032 × 0.7 × 0.3 + 0.018 × 0.3 × 0.3
β_q(2) = 0.0083

Where β_q(2) is the backward probability of state q at time 2, β_q(3) is the backward probability of state q at time 3, P(q→q) is the transition probability from state q to next state q, b_q(drops) is the emission probability for observation "drops" from state q, β_r(3) is the backward probability of state r at time 3, and P(q→r) is the transition probability from state q to state r.

3. Given an output sequence, what HMM parameters maximize the likelihood of the sequence? That is, when we have the model and a set of observations, how can I adjust the parameters to increase the probability of the observations given the model? This problem can be solved using the Forward-Backward algorithm, also known as the Baum-Welch algorithm. Starting from an initial parameter instantiation, the Forward-Backward algorithm iteratively re-estimates the parameters, improving the probability that the given observations are generated by the new parameters. I need to re-estimate three parameters: the initial state distribution, the transition probabilities, and the emission probabilities.

3.4 Working of the Hidden Markov Model
If, in the future, anyone wants to run the HMM on a different number of states, he or she should change the value in this Constants.java file. The final matrix of start, transition, and emission probabilities is stored as a serialized object in one file; all these file names are also stored in the Constants.java file. For the rest of the experiments, this serialized HMM is used so that execution time is saved. If the HMM is already built and stored, then when a user enters a string to search, showing the corrected string will not take much time, and the user will not have to wait long to get the search results.

5.2.5 The GUI: the Nutch web crawler

The Nutch web crawler is used in my project as the GUI. Nutch is an open-source web crawler [14]. It maintains a database of pages and links. I implemented Nutch on the Windows 7 platform using Tomcat 5.5 and JDK 1.6. We also used Cygwin to create and deploy jar files in the Tomcat folder. We want to use Nutch so that when a user enters a string, it will first call my HMM program to get the suggested string and then use this corrected string to display the search results from the crawled pages. After downloading and installing Nutch on my system, I made the following changes. To crawl Japanese websites, I used google.co.jp in the crawl-urlfilter.txt file; this is my domain name to crawl. The nutch-default.xml file is changed to add the agent name as "google". The nutch-site.xml file is changed to mention the searcher.dir property tag.
JAPANESE KANJI SUGGESTION TOOL

A Writing Project Presented to The Faculty of the Department of Computer Science, San José State University, in Partial Fulfillment of the Requirements for the Degree Master of Computer Science, by Sujata Dongre, December 2010.

© 2010 Sujata Dongre. ALL RIGHTS RESERVED.

SAN JOSÉ STATE UNIVERSITY

The Undersigned Writing Project Committee Approves the Writing Project Titled JAPANESE KANJI SUGGESTION TOOL by Sujata Dongre.

APPROVED FOR THE DEPARTMENT OF COMPUTER SCIENCE:
Dr. Chris Pollett, Department of Computer Science, 11/30/2010
Dr. Robert Chun, Department of Computer Science, 11/30/2010
Dr. Mark Stamp, Department of Computer Science, 11/30/2010

ABSTRACT

JAPANESE KANJI SUGGESTION TOOL, by Sujata Dongre

Many times we can see that if we enter a misspelled search term in any of the search engines, like Google, it will provide some help with "Did you mean:". In my project, I am providing suggestions for incorrect Japanese text entered by a user. The Japanese language has three types of writing styles: hiragana, katakana, and the most difficult, kanji. A single kanji symbol may be used to write one or more different compound words; from the point of view of the reader, kanji symbols are said to have one or more different readings. Hence, sometimes it becomes very difficult to understand what the kanji symbols are and how to read them, even if you know the Japanese language.
Here, New P(q→r) is the newly estimated probability from state q to state r, α_q(t) is the forward probability of state q at timestamp t, β_r(t+1) is the backward probability of state r at timestamp t+1, and P(q→r) is the transition probability from state q to state r [11]. We will use this newly calculated probability for the start probabilities, transition probabilities, and emission probabilities. Consider the following diagram for the guessed HMM.

[Figure 7: Guessed HMM]

Thus, in the above example:

Total P(q→r) = α_q(t1) × P(q→r) × β_r(t2)
Total P(q→r) = 1.0 × 0.3 × 0.0105
Total P(q→r) = 0.00315

Where Total P(q→r) is the total probability from state q to state r, α_q(t1) is the forward probability of state q at time t1, P(q→r) is the transition probability from state q to state r, and β_r(t2) is the backward probability of state r at time t2.

[Table 3: New probabilities for "eye drops off shelf"; rows as extracted: q→r for observation "eye", q→r for observation "drops", r→r for observation "off", r→q for observation "shelf"]

Similarly, I can calculate the new probability for the other observations as well. After calculating all the total probabilities for the remaining transitions, I calculate their sum, Σ Total P. This is nothing but the expected number of transitions from one state to the other. Then:

New P(q→r) = Total P(q→r) / Σ Total P

With the values above, New P(q→r) = 0.003 / Σ Total P = 0.23.
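For concreteness, here is a sketch of this transition re-estimation step in Java. The array names and layout are my own assumptions; alpha and beta are the forward and backward tables computed earlier. Note that this follows the standard Baum-Welch form, which also multiplies in the emission probability of the next observation; the report's simplified formula omits that factor.

```java
public class ReestimateDemo {
    // Re-estimate trans[q][r] from forward (alpha) and backward (beta) tables.
    // alpha[t][s], beta[t][s]: forward/backward probabilities; obs: observation indices.
    static double[][] reestimateTransitions(double[][] alpha, double[][] beta,
                                            double[][] trans, double[][] emit, int[] obs) {
        int nStates = trans.length;
        double[][] newTrans = new double[nStates][nStates];
        for (int q = 0; q < nStates; q++) {
            double total = 0.0;
            for (int r = 0; r < nStates; r++) {
                double sum = 0.0;
                for (int t = 0; t + 1 < obs.length; t++) {
                    // Expected use of the q->r transition at time t:
                    // alpha_q(t) * P(q->r) * b_r(o_{t+1}) * beta_r(t+1)
                    sum += alpha[t][q] * trans[q][r] * emit[r][obs[t + 1]] * beta[t + 1][r];
                }
                newTrans[q][r] = sum;
                total += sum;
            }
            for (int r = 0; r < nStates; r++) {
                newTrans[q][r] /= total;   // normalize so each row sums to 1
            }
        }
        return newTrans;
    }
}
```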
[Figure 11: Binary Search Tree experiment (Eclipse console run of BinaryTreeExp.java; the program prompts "Please enter input >", reads a line with Scanner, and echoes the user string)]
There are existing Japanese text analyzers, such as the JUMAN morphological analyzer and the TANGO algorithm.

2.1 JUMAN Morphological Analyzer

JUMAN is a morphological analyzer for the Japanese language [6]. It was developed by Yuji Matsumoto, Kazuma Takaoka, and Masayuki Asahara at the Matsumoto laboratory, Nara Institute of Science and Technology [8]. JUMAN is a rule-based Japanese morphological analyzer, where rules are represented as a cost for each lexical entry and a cost for pairs of adjacent parts of speech, assigned manually. The probability of the occurrence of a word depends on the cost of its lexical entry, while the connectivity cost of a pair of parts of speech reflects the probability of the pair occurring adjacently [7]. However, this technique is labor-intensive and vulnerable to the unknown-word problem. If terms from the domain are not present in the lexicon, JUMAN will not work properly; hence it is not a very robust analyzer to work with. If the domain text changes, then re-estimation of the costs requires much effort. This analyzer provides word segmentation with some additional information, such as parts of speech. Similarly, there is also the JTAG morphological analyzer [13].
[Figure 8: Instant query suggestion (Nutch search page showing find-as-you-type help)]
This approach is extremely useful in any task in which word identification is necessary in noisy, ambiguous input [9]. In speech recognition, the input speech sounds are very confusable, and many words sound similar. The n-gram approach is also a part of the statistical machine translation process: if I have a set of potential rough translations, using n-grams I can tell which translation is used most of the time.

4.2 N-gram model

The n-gram model is a generalized Markov model. As seen in the previous section, the Markov property assumes that we can predict the future probabilities without knowing anything about the history. In a bigram, we assume that the probability of a future word depends only on the previous word. Similarly, a trigram looks at the past two words to predict the probability, and an n-gram looks at the past N-1 words to predict future probabilities.

5 Design and implementation

In this section, I will explain how I implemented the project. The project mainly involves two parts: the HMM program and the Nutch web crawler.

5.1 Japanese language processing

Japanese is a language spoken by over 130 million people in Japan [13]. The Japanese language uses a combination of three writing styles: hiragana, katakana, and kanji symbols. Hiragana is used for native Japanese words. Katakana is used for foreign words. Kanji symbols are a more pictorial representation of words. Here is an example that uses all three scripts.
2.3 Search engine suggestions

I also studied how existing search engines like Google, Yahoo, and Bing handle keyword suggestions in the Japanese language. I will first briefly explain how Google provides suggestions. I think that Google search works better for suggestions in katakana than for suggestions in hiragana. Consider that I start typing the query string for the word "you" in Japanese. As a Japanese language beginner, I do not remember the Japanese characters and mistakenly type one character in place of another. The suggestions then given by the Google search engine are katakana words, although I am expecting it to show some simple suggestions for the most frequently used word in the Japanese language.
While learning the Japanese language or reading Japanese text, I always feel that the translation tool or the search engine should suggest a better word. The motive behind the development of this tool is to encourage Japanese language learners to get a quick grasp of Japanese words.

8 References

[1] Viterbi Algorithm. Retrieved November 23, 2010, from http://en.wikipedia.org/wiki/Viterbi_algorithm
[2] Eugene Charniak. Statistical Language Learning. MIT Press.
[3] The Tanaka Corpus. Retrieved November 23, 2010, from http://www.csse.monash.edu.au/~jwb/tanakacorpus.html
[4] n-gram. Retrieved November 23, 2010, from http://en.wikipedia.org/wiki/N-gram
[5] Kanji. Retrieved November 23, 2010, from http://en.wikipedia.org/wiki/Kanji
[6] JUMAN information. Retrieved November 23, 2010, from http://www.lab25.kuee.kyoto-u.ac.jp/nl-resource/juman-eng.html
[7] Koichi Takeuchi, Yuji Matsumoto. HMM Parameter Learning for Japanese Morphological Analyzer. Retrieved November 23, 2010, from http://www.aclweb.org/anthology/Y/Y95/Y95-1022.pdf
[8] ChaSen Morphological Analyzer User's Manual. Retrieved November 23, 2010, from http://sourceforge.jp/projects/chasen-legacy/docs/chasen-2.4.0-manual-en.pdf
[9] N-GRAMS. Retrieved November 23, 2010, from http://www.mit.edu/~6.863/spring2010/readings/ngrampages.pdf
There are different translation tools available nowadays that provide translation help. However, none of them provides good suggestions for incorrect Japanese words. In my project, I am developing a tool that will help users correct their Japanese words by suggesting the correct word.

Keywords: Japanese, Hidden Markov Model, suggestions, natural language processing, find-as-you-type

ACKNOWLEDGEMENTS

I would like to thank my project advisor, Dr. Chris Pollett, for his support and guidance throughout this project. I would also like to thank my committee members, Dr. Robert Chun and Dr. Mark Stamp, for their valuable suggestions on my project work. I would also like to acknowledge my Japanese language professor, Inaba Sensei, and my Japanese friend, Tomoko-san, for their inputs on my project. I am also very thankful to my parents, my husband, and all my friends for their encouragement throughout my master's degree program.

Table of Contents

1 Introduction
2 Prior work on keyword suggestions in Japanese
2.1 JUMAN Morphological Analyzer
2.2 TANGO algorithm
2.3 Search engine suggestions
3 Hidden Markov Model for text parsing
3.1 Markov model definition
3.2 Property of the Hidden Markov Model
3.3 The Hidden Markov Model training
3.4 Working of the Hidden Markov Model
4 N-gram approach to text parsing
4.1 N-gram definition and its use
4.2 N-gram model
The suggested string is given to the Nutch web crawler, which outputs the list of search results for that search string. This project is divided into two main deliverables. The first deliverable was to develop a Japanese word segmenter using an HMM. This deliverable includes an HMM program that makes use of the Tanaka Corpus to provide probabilities for observations. Connected with the first deliverable, I also tested the working of the HMM and Viterbi programs with different numbers of states and iterations. Then I compared the results of the HMM with the n-gram binary tree program. The second deliverable was to study and install the Nutch web crawler and incorporate the existing HMM into it. This report describes how the programs were developed and how the experiments were conducted. It is structured as follows: Section 2 describes prior work on keyword suggestions for Japanese text. Section 3 gives information about HMMs for parsing Japanese text, followed by Section 4, which presents the n-gram approach to text parsing. In Section 5, I explain the design and implementation of the project. Section 6 describes the various experiments that were carried out and their results. At the end, Section 7 gives the conclusion.

2 Prior work on keyword suggestions in Japanese

In Japanese language processing, the first key step is to find accurate word segmentation, because Japanese is written without delimiters between words. Many previously proposed segmentation methods for Japanese text make use of either a pre-existing lexicon or a substantial amount of pre-segmented training data.
[Figure 4: Bing search keyword suggestions for the sample Japanese word]

3 Hidden Markov Model for text parsing

There are different approaches to text parsing and segmentation. The most commonly used approaches are the Hidden Markov Model and the n-gram. In my project, I am using the Hidden Markov Model approach to parse the Japanese text, as it maximizes the probability by considering all previous states.
If you do not know the right combination of characters, you will probably end up getting some error message. The JUMAN morphological analyzer [6] is one tool used for Japanese text segmentation. Rie Kubota Ando and Lillian Lee [10] tried an n-gram approach to segmentation. Each of these approaches has some pros and cons [10]. I will explain later how these analyzers and other algorithms for Japanese language segmentation work. I will also look at how existing search engines provide help for the Japanese language. My goal in developing this tool is to provide suggestions to people who have a very basic knowledge of the Japanese language. Even if they make mistakes while typing the Japanese characters, my tool will suggest the string that they might be looking for. This tool makes use of a Hidden Markov Model (HMM), which is trained on a corpus of Japanese text. The most commonly used Japanese corpus is the Tanaka Corpus [3], which was compiled by Professor Yasuhito Tanaka at Hyogo University and his students [3]. The Hidden Markov Model suggests the character with the highest probability for a given input. Suppose a user enters the string C1C2C3; the HMM may suggest the string C1C2C4, assuming the user entered the last character incorrectly. It might also suggest C5C2C3, assuming the user entered the first character incorrectly. Once the HMM suggests a string, it is given to the Nutch web crawler.
[Figure 9: Link displayed for "Did you mean"]
[Figure 10: Search results after the user clicks on the "Did you mean" link]

6 Experiments and Results

6.1 The HMM for English text

6.1.1 Goal

The main goal of testing the HMM program on English text is to understand how the HMM converges.
The Tanaka Corpus has 212,000 sentence pairs [3]. These sentence pairs are collected not only from textbooks used by Japanese students of English, but also from lines of songs, books, etc. The format of the Tanaka Corpus is as follows. It contains pairs of lines starting with A and B, where sentence A is a Japanese sentence together with its English translation, and sentence B contains a list of the Japanese words found in sentence A. In sentence A, the Japanese sentence and its English translation are separated by a TAB. In sentence B, a line may contain the reading of a kanji symbol in hiragana, included in round brackets, and in square brackets you will find some IDs. These IDs indicate a sense number: if a word has multiple senses, the sense number indicates which sense applies in the sentence. As an example of the format, an A line holds the Japanese sentence, a TAB, and the English translation, while the corresponding B line lists the words, possibly with readings in round brackets and sense IDs such as [01] in square brackets.

As my project revolves only around Japanese text, I have written a program that collects only the Japanese sentences from the corpus file. Then, for my experiments, I used this new corpus file.

5.2.4 The software

For my experiments with two states, three states, and four states in the HMM program, all the parameters for the number of states and the number of iterations are handled from the Constants.java file.
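A minimal sketch of the extraction step described above, keeping only the Japanese half of each A line, might look as follows. The file names, the class name, and the exact "A:" line prefix are illustrative assumptions, not the project's code.

```java
import java.io.*;
import java.nio.charset.Charset;

public class TanakaExtract {
    public static void main(String[] args) throws IOException {
        // Read the corpus as UTF-8; A lines look like: "A: <Japanese>\t<English>..."
        BufferedReader in = new BufferedReader(new InputStreamReader(
                new FileInputStream("TanakaCorpus.txt"), Charset.forName("UTF-8")));
        PrintWriter out = new PrintWriter(new OutputStreamWriter(
                new FileOutputStream("TanakaCorpus_JP.txt"), Charset.forName("UTF-8")));
        String line;
        while ((line = in.readLine()) != null) {
            if (line.startsWith("A:")) {
                int tab = line.indexOf('\t');          // Japanese and English are TAB-separated
                String japanese = (tab >= 0) ? line.substring(2, tab) : line.substring(2);
                out.println(japanese.trim());          // keep only the Japanese sentence
            }                                          // B lines are skipped
        }
        in.close();
        out.close();
    }
}
```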
Appendix 1: HMM program results for the Japanese text

HMM parsing on Japanese text. No. of observations read: 50,000. No. of states: 2. Language: Japanese. Emission matrix size: 191. Iteration: 100.

Start probabilities: 0.959001807619696, 0.0409981923803042

Transition probabilities:
0.500624298487603  0.499375701512407
0.500671848129107  0.499328151870889

Emission probabilities (selected values, as extracted): 0.255653541151235, 0.255676715913046, 0.0260044515787143, 0.0259164712050492, 0.0486462157973132, 0.0483152884871585, 0.0146813851378645, 0.0146391457565325, 0.0181651721889989, 0.0182356484560436, 0.00854211981250269, 0.00853821666422648, 0.00523848204864668, 0.00520167838263065

Log Prob = -252466.21402948367

No. of observations read: 50,000. No. of states: 3. Language: Japanese. Emission matrix size: 191. Iteration: 100.

Start probabilities: 0.999073535209081, 9.26E-04, 1.63E-40

Transition probabilities:
Emission probabilities (selected values, as extracted): 0.176598231360123, 0.0939315356517259, 0.00133177077919124, 0.00403692431521916, 0.00471250371119518, 0.0201595377731672, 1.13E-05, 0.0680381522273984, 0.126060906030029, 0.00115546233898728, 0.0128782247292518, 0.00594229848285901, 0.0288288025406763, 0.0043332127675367, 0.0015438465078242

Log Prob = -198637.03686732217

Observation symbols: the letters a through z and space (27 observations).

No. of observations from file: 50,000. No. of states: 3. No. of observations: 27. Iteration: 100.

Start probabilities: 0.0, 1.0, 0.0

Transition probabilities:
0.13533663202  0.8646467711752  1.66E-05
0.50662789237  0.0515983099995569  0.441773797634675
0.64826047702  0.00153714039511984  0.35020238168674

Emission probabilities (selected values, as extracted): 2.256E-04, 0.00242891950022893, 4.93E-04, 0.27102021648372, 8.45E-04, 0.148000620332349

Log Prob = -193690.1452632959
List of Tables
1. Forward probabilities for "eye drops off shelf"
2. Backward probabilities for "eye drops off shelf"
3. New probabilities for "eye drops off shelf"
4. Precision and recall experiment results

1 Introduction

In this multilingual world, everyone comes across different words in different languages. Japanese is a language spoken by over 130 million people in Japan and in Japanese emigrant communities [13]. There are several websites that provide online help with Japanese text, translating given Japanese text to another language and vice versa. The problem with these websites is that if you enter an incorrect combination of Japanese words, you will get either a meaningless translation or a "no search results found" message. The Japanese language is written with a combination of three scripts: hiragana characters, katakana characters, and kanji symbols [12]. The total number of hiragana and katakana characters is around 90, while the total number of kanji characters is much larger; the kanji symbols in common use number almost 2,000. Like many other Asian languages, Japanese text is written without any spaces in between, which makes it hard to recognize the boundaries between words. Therefore, it becomes necessary to figure out which words are present in the text and where these words begin and end.
[10] Rie Kubota Ando, Lillian Lee. Mostly-Unsupervised Statistical Segmentation of Japanese Kanji Sequences. Retrieved November 23, 2010, from http://www.cs.cornell.edu/home/llee/papers/segmentjnle.pdf
[11] Dr. Mark Stamp. A Revealing Introduction to Hidden Markov Models. Retrieved November 23, 2010, from http://www.cs.sjsu.edu/faculty/stamp/RUA/HMM.pdf
[12] Japanese. Retrieved November 23, 2010, from http://en.wikipedia.org/wiki/Japanese_language
[13] Kenji Imamura, Kuniko Saito, and Hisako Asano. Basic Japanese Text Analysis Technology as a Platform for Knowledge Extraction. Retrieved November 23, 2010, from https://www.ntt-review.jp/archive/ntttechnical.php?contents=ntr200809sf4.pdf&mode=show_pdf
[14] Nutch Tutorial. Retrieved November 23, 2010, from http://nutch.sourceforge.net/docs/en/tutorial.html
[15] Precision and recall. Retrieved November 23, 2010, from http://en.wikipedia.org/wiki/Precision_and_recall
[16] Brown Corpus Manual. Retrieved November 23, 2010, from http://icame.uib.no/brown/bcm.html
[17] Hiragana characters chart. Retrieved November 23, 2010, from http://www.unicode.org/charts/PDF/U3040.pdf
[18] Katakana characters chart. Retrieved November 23, 2010, from http://www.unicode.org/charts/PDF/U30A0.pdf
[19] Nutch wiki. Retrieved November 23, 2010, from http://nutch.apache.org/about.html
[20] Information Retrieval. Retrieved November 23, 2010, from http://en.wikipedia.org/wiki/Information_retrieval
If the string ends with a special character (any character other than hiragana, katakana, and kanji symbols), the number-of-occurrences variable is decremented; otherwise, the number-of-occurrences variable is incremented. The aim of adding or subtracting the count is to find words having a negative number of occurrences.

The third experiment is about creating a dictionary of Japanese words using the corpus file. This experiment also stores words and the number of occurrences of each word. The difference is that the previous experiment uses strings of length three, whereas this experiment generates a set of complete Japanese words, which could be used in security to create a dictionary of passwords in Japanese.

6.3.2 Results

After comparing the results of these experiments with the HMM program, I saw that these experiments give more accurate suggestions of strings than the HMM program. The binary tree program gives a list of the first three most common words that begin with the user-entered string.
[Figure 12: Dictionary experiment with window of length three (Eclipse run of the dictionary-window program, which loads a HashMap<String, Integer> and prompts "Please enter input of length 2 >")]
[Figure 13: Experiment of creating Japanese word dictionary]
The most difficult parts in the development of the HMM program were handling the large observation set for Japanese text and generating the start and transition probabilities randomly. To handle the large observation set, I mapped all kanji characters to one symbol. In the future, I could modify the existing HMM code to recognize all kanji symbols and output suggestions for different kanji symbols. To generate probabilities randomly, I used the Random class from Java.

The experimental results for precision and recall demonstrate that the n-gram approach with a binary search tree is more reliable than the HMM approach. The binary tree program was written with an iteration window of size three, based on the Tanaka Corpus. Hence, if the user enters a wrong set of kanji and hiragana characters, the binary tree program fails to return any results, as no such string exists in the corpus file, giving a precision of 0.53. The HMM program gives results for any characters typed by the user, but those results are sometimes irrelevant; hence, the precision of the HMM program is lower than the precision of the binary tree program. The search engines such as Google, Yahoo, and Bing also suggest keywords, but most of the time they are irrelevant. In the future, I could modify the existing binary search tree program by changing the iteration window size and running it on more than one corpus file for Japanese text. With thousands of online translation tools and search engines, none of the tools is efficient enough to provide suggestions for wrong Japanese words.
In these cases, we can say that even native Japanese speakers were confused about the correct sequence of Japanese characters.

From the above results, I can calculate the precision and recall values for the binary tree program, the Viterbi and Forward HMM programs, and the search engines.

Precision = Number of correct results / Number of returned results

Precision for the Viterbi HMM: 8/20 = 0.4
Precision for the HMM (Forward): 8/20 = 0.4
Precision for the binary tree: 8/15 = 0.53
Precision for the Google search: 4/17 = 0.23

Recall = Number of correct results / Number of results that should have been returned (20)

Recall for the Viterbi HMM: 8/20 = 0.4
Recall for the HMM (Forward): 8/20 = 0.4
Recall for the binary tree: 8/20 = 0.4
Recall for the Google search: 4/20 = 0.2

Hence, from the above experiment, I concluded that the binary tree program gives more relevant results compared to the HMM, because the precision value of the HMM is less than the precision value of the binary tree. With a set of 20 strings, there is no difference between the precision and recall values for the HMM using the Forward algorithm and the HMM using the Viterbi algorithm. However, the recall value is the same for the HMM and the binary tree programs. The Google search has the lowest precision and recall.

7 Conclusion

The HMM technique and a binary tree are used to find the correct string with the highest probability.
This experiment assumed that the last character is wrong. Similarly, I did experiments changing the first character. For this project, I am assuming a query string of length three.

6.3 Tanaka Corpus programs for parsing

6.3.1 Goal

I did a few experiments using the Tanaka Corpus. In these experiments, I followed the n-gram approach. The first experiment involved iterating through each character in the corpus file and creating a binary search tree; a sketch of this structure is given after this paragraph. A binary tree node consists of a key (a word) and a value (its number of occurrences). Each word that is stored as a key in a binary tree node has length three. When any special character (a character other than hiragana, katakana, and kanji symbols) is found, that special character is stored as EOW (End Of Word). After the experiment is completed, a binary tree with all the words and their numbers of occurrences, based on the corpus file, has been created. When a user inputs a query string, the program looks into the binary search tree for the words starting with the user input and having the highest number of occurrences. The goal of this experiment is to come up with suggestions for a possible next character for the given user query string.

The second experiment is about finding word boundaries using the corpus file. In this experiment, the program I developed iterates through every letter in the corpus file, reading strings of length three. For each string of length three, its corresponding number of occurrences is updated.
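Here is a sketch of the first experiment's data structure, using Java's TreeMap (a red-black tree) as a stand-in for the hand-written binary search tree. The names and the fixed window size are assumptions based on the description above, not the project's code.

```java
import java.util.Map;
import java.util.TreeMap;

public class WordCountTree {
    // Count every length-3 window in the corpus text, keyed in sorted (tree) order.
    static TreeMap<String, Integer> build(String corpus, int window) {
        TreeMap<String, Integer> counts = new TreeMap<String, Integer>();
        for (int i = 0; i + window <= corpus.length(); i++) {
            String w = corpus.substring(i, i + window);
            Integer c = counts.get(w);
            counts.put(w, (c == null) ? 1 : c + 1);
        }
        return counts;
    }

    // Return the most frequent stored word that starts with the user's prefix.
    static String suggest(TreeMap<String, Integer> counts, String prefix) {
        String best = null;
        int bestCount = -1;
        // Subtree of keys in [prefix, prefix + '\uffff'): exactly the keys with this prefix.
        for (Map.Entry<String, Integer> e
                : counts.subMap(prefix, prefix + '\uffff').entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }
}
```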
[Table 4: Precision and recall experiment results. For each of the 20 input strings, the table records the responses from Inaba Sensei and Tomoko, the Viterbi HMM, the HMM (Forward), the binary tree program, and the Google, Yahoo, and Bing search engines; many cells read "No suggestions" or "No results found".]

Precision is used as a measure of exactness, whereas recall is a measure of completeness [15]. The program is more precise if it gives a smaller number of irrelevant results. A precision score of 1.0 means that the results returned from the program are all correct, and a recall score of 1.0 means that the program returns all the correct strings for all input strings. For some of the input strings, we can see that the input is either changed to a katakana word, remains unchanged, or is completely converted to some different word. Consider, for example, the responses given by Inaba Sensei and Tomoko for certain words.
[Screenshot: Eclipse run of DictionaryTanaka.java, which reads TanakaCorpus_JP.txt as UTF-8 and stores words and their numbers of occurrences in a HashMap<String, Integer>]
For this experiment, I used the Brown Corpus [16].

6.1.2 Results

My experiments were run with 100 iterations and 50,000 observations on the Brown Corpus [16], with two, three, and four states. After 100 iterations, the matrix shows the letters with the highest probability, and these are found in the first state, which indicates that these letters appear more frequently at the beginning of a word. Hence, I could use the HMM to distinguish between vowels and consonants. Also, the significant change after 100 iterations is in the observation space, namely which character has the highest probability among all 27 observations. The HMM with three and four states is not as helpful as it is with two states. In the two-state HMM, I can clearly see the separation between different states and characters. All the results for the HMM on English text with two and three states are listed in Appendix 2.

6.2 The HMM for Japanese text

6.2.1 Goal

After testing the HMM on English text, I decided to run the same HMM program on the Tanaka Corpus. My goal for this experiment is to check whether I can make a clear distinction between Japanese letters so as to identify spaces. If I can locate word boundaries, then it will help with Japanese text segmentation.

6.2.2 Results

I ran this experiment on the HMM with two states, three states, and four states. It does not give the kind of useful information about highest-probability letters that I found for the English text.
This searcher.dir path is the path of my crawled directory. The command to crawl is: bin/nutch crawl urls -dir crawljp -depth 3 -topN 10. Once the pages are crawled, I changed the existing search.html, search.jsp, and nutch-1.0.jar files to incorporate the HMM program. I included the class files of the HMM program under the WEB-INF/classes folder of the Nutch project in the Tomcat server. Also, a new jar file was created including the HMM class files, and the existing nutch-1.0.jar file was replaced with this new jar file. This helped me keep my code clean and easy to maintain. From the search.jsp page, I created an instance of the HMM program and called the right function to retrieve the suggested string. I also implemented instant search help for the user using AJAX: once a user starts to type in a query string, a suggested string is shown to him or her instantly.
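To give a flavor of how the serialized HMM from Section 5.2.4 might be loaded once and reused from search.jsp, here is a self-contained sketch. The holder class, field layout, and file name are my assumptions, not the project's actual classes.

```java
import java.io.*;

public class SuggestionHelper {
    // Assumed shape of the serialized model: start/transition/emission arrays
    // wrapped in a simple serializable holder (illustrative only).
    public static class HmmModel implements Serializable {
        private static final long serialVersionUID = 1L;
        public double[] start;
        public double[][] trans;
        public double[][] emit;
    }

    private static HmmModel model;

    // Load the serialized HMM once, so per-query suggestion stays fast.
    public static synchronized HmmModel getModel(String path)
            throws IOException, ClassNotFoundException {
        if (model == null) {
            ObjectInputStream in = new ObjectInputStream(new FileInputStream(path));
            model = (HmmModel) in.readObject();
            in.close();
        }
        return model;
    }
}
```

Caching the deserialized object this way matches the report's stated design goal: the expensive training step happens offline, and the search page only pays the cost of one file read on its first query.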
[Figure 1: Example of morphological analysis [13], showing a sample sentence segmented into words with annotations (location, department store, manager, particle, surname), connection rules, a lexicon, and a probabilistic model]

JUMAN is available on the internet for download and installation. Generally, morphological analysis is used to separate the text into words, and these words are obtained from the dictionary. It also verifies whether the words are connected to each other; if they are, it evaluates the cost of the word connection.

2.2 TANGO algorithm

"Tango bunkatsu" is Japanese for word segmentation. This algorithm was developed by Rie Kubota Ando and Lillian Lee [10]. In this algorithm, the authors used an n-gram approach, which provides counts from an unsegmented corpus, to make segmentation decisions. The authors used a 4-gram approach to decide the word boundaries [10]. If a given sentence of eight kanji symbols is C1C2C3C4C5C6C7C8, and the word boundary occurs between C4 and C5, they pose a series of questions like: is the number of occurrences of C1C2C3C4 greater than the number of occurrences of C2C3C4C5? Each affirmative answer makes it more reasonable to place a segment boundary at the location under consideration [10]. In their paper, the authors present this simple TANGO algorithm that segments Japanese sequences into words based on statistics drawn from a large unsegmented corpus [10]. This algorithm is more robust than the JUMAN analyzer, as it can be ported to other domains and applications.
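A small illustrative sketch of that boundary test follows. This is a simplified reading of TANGO, not the authors' full algorithm (which aggregates many such questions across n-gram orders); the count map is assumed to be precomputed from an unsegmented corpus.

```java
import java.util.HashMap;
import java.util.Map;

public class TangoBoundaryDemo {
    // Count all substrings of length n in the (unsegmented) corpus.
    static Map<String, Integer> countNGrams(String corpus, int n) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (int i = 0; i + n <= corpus.length(); i++) {
            String g = corpus.substring(i, i + n);
            Integer c = counts.get(g);
            counts.put(g, (c == null) ? 1 : c + 1);
        }
        return counts;
    }

    // Evidence for a boundary before position k in the sentence: does the
    // 4-gram lying entirely on one side (e.g., C1C2C3C4) occur more often
    // than the 4-gram straddling the candidate boundary (e.g., C2C3C4C5)?
    static boolean favorsBoundary(Map<String, Integer> counts, String sentence, int k) {
        if (k < 4 || k + 1 > sentence.length()) return false;
        String inside = sentence.substring(k - 4, k);
        String straddle = sentence.substring(k - 3, k + 1);
        Integer in = counts.get(inside), st = counts.get(straddle);
        return (in == null ? 0 : in) > (st == null ? 0 : st);
    }
}
```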
Marking kanji as K, hiragana as H, and katakana as KT, the example sentence follows the pattern K H KT K H K H. This line can be translated into English as "I am a student of San Jose University." Kanji symbols are frequently used for nouns and for the stems of adjectives and verbs [5]. Hiragana is mostly used for the endings of verbs and for grammatical particles, while foreign borrowings are normally written using katakana. The most commonly used writing styles are kanji and hiragana. In general, a Japanese-English dictionary has about 150,000 entries of Japanese words. There are approximately 90 hiragana characters and 90 katakana characters. The total number of possible kanji symbols is disputed, but approximately 2,000 to 3,000 characters are used commonly in Japan [5]. Because of the huge number of kanji characters, in this project I am going to map all kanji symbols to one symbol. The set of English characters is very small compared to the total number of Japanese characters; hence, the Japanese language cannot be encoded using only one byte and has to be encoded using two-byte characters. To deal with this problem, many encoding formats, such as Shift-JIS, EUC, and UTF-8, are available to handle Japanese characters. In this project, I decided to use UTF-8 character encoding to read the corpus file, as it is the most commonly used encoding and works consistently across operating systems and web browsers.
We can see from the experimental results in Appendix 1 that certain characters have slightly higher probabilities than the other characters. I decided to use this HMM final probability matrix and the Viterbi algorithm program along with the Tanaka Corpus. If a user enters a query string C1C2C3, then, using the HMM final probability matrix, the Viterbi program checks all the possible combinations C1C2H1 to C1C2Hn, and the string with the highest probability is returned. Once I found the character with the highest probability from the Viterbi program, I verified that the string exists in the Tanaka Corpus. If no such string exists in the Tanaka Corpus, then I discarded that particular string and searched for the string with the second-highest probability from the Viterbi program. Once the suggested string is also found in the Tanaka Corpus, that string is displayed on the screen for the user. For example, the user enters a string; the Viterbi program checks all the combinations and returns the string with the highest probability; if this string also exists in the Tanaka Corpus, it is displayed to the user. In the above case, the suggested string is あなた, which means "you" in English.
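A sketch of this suggestion loop is shown below. The names are hypothetical: viterbiScore and corpusContains stand in for the project's actual Viterbi computation and corpus lookup, so this is an outline under those assumptions rather than the real implementation.

```java
import java.util.*;

public class SuggestDemo {
    // Stand-in for the real Viterbi probability of a candidate string.
    static double viterbiScore(String candidate) {
        return Math.random();   // placeholder so the sketch compiles and runs
    }

    // Stand-in for the real membership test against the Tanaka Corpus.
    static boolean corpusContains(Set<String> corpusWords, String s) {
        return corpusWords.contains(s);
    }

    // Replace the last character of a three-character query with every hiragana
    // character, rank candidates by Viterbi score, and return the best-scoring
    // candidate that actually occurs in the corpus.
    static String suggest(String query, Set<String> corpusWords) {
        TreeMap<Double, String> ranked =
                new TreeMap<Double, String>(Collections.reverseOrder());
        for (char h = 0x3040; h <= 0x309F; h++) {        // hiragana block [17]
            String candidate = query.substring(0, 2) + h;
            ranked.put(viterbiScore(candidate), candidate);  // ties overwrite; fine for a sketch
        }
        for (String candidate : ranked.values()) {        // best score first
            if (corpusContains(corpusWords, candidate)) {
                return candidate;
            }
        }
        return query;                                     // nothing better found
    }
}
```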