        ChaSen Morphological Analyzer version 2.4.0 User's Manual
         Contents
1.                                                                                        RAIA  A  FREE  DARA  RER  TRAGBRESMN Y ATA JUMAN  EA  version 2 0    NAIST Technical Report  NAIST IS TR94025  1994           LL  PERE  RARA  TERESA LY AT A ViJUMAN version 1 0 EHAE    NAIST Technical  Report  NAIST IS TR96005  1996                 LIFE  RARA  TERESA ROMA ATA ViJUMAN   OF HBREL  HUES  SERE 96 NL 115  pp 29 34  September 1996                       Vi eh  THEON ZAJA U 7 RE BA ARA LANA O EREM REREN   ARA  SHR EBACE EAN E AOL  NAIST IS MT9551092  March 1997                                                        heme  EL ERE TI MORA T EORR   RR AmE DONA ARO   NAIST IS MT9551119  March 1997     PEKRE  MARIA  TIA REUNIR LE REE TI VOM DEBERAN  TERA RE  96 NL 119  May 1997                       AEA E  UT ERE  PAK RA      A ARR Y AT AAOWMA RRM OSR    AA  RP  ERK tC  pp 437 440  1997                          ABRES 7A SAA RIN ASE  ER EE  TANYA VII  T  BANAL SHU EE TERE  1998     ARE a  mA PALS I NAPE ERO RS  AR ER EBT AS eK EE ttt   NAIST IS MT9851103  March 1999                                                                 HA  AYE GUE  PAS R  RO BOR MERIC ES AAR OPER ET VEE    ERU Si CHE Vol  40  No  5  p p 2325 2337  May 1999                                      REA Sa  Well AAA  LOR ES  Be E  PARRA  BO ERE A OR NAD MEARE  Hj  ALLA AAA 99 NL 134  p p 23 30  Nov  1999                             Masayuki Asahara  Extended Statistical Model for Morphological Analysis  R R mE At KS be
2. %M      surface form (base form)
   %y      first reading candidate (conjugated form)
   %Y      first reading candidate (base form)
   %y0     all readings (conjugated form)
   %Y0     all readings (base form)
   %a      first pronunciation candidate (conjugated form)
   %A      first pronunciation candidate (base form)
   %a0     all pronunciation (conjugated form)
   %A0     all pronunciation (base form)
   %rABC   surface form with ruby, i.e. "A Kanji B kana C" (※1)
   %i      first semantic information candidate
   %i0     all semantic information
   %Ic     semantic information (if "NIL", print c) (※1)
   %Pc     part of speech (name) of all layers in the part-of-speech hierarchy, joined together by c
   %Pnc    part of speech (name) of the first n layers (n: 1-9) in the part-of-speech hierarchy, joined together by c
   %h      part of speech (code)
   %H      part of speech (name)
   %Hn     part of speech (name) at the nth layer (n: 1-9), or the deepest layer
   %b      0 (only for backwards compatibility)
   %BB     sub part of speech (name) (if "NIL", print POS)
   %Bc     sub part of speech (code) (if "NIL", print c) (※1)
   %t      conjugated type (code)
   %Tc     conjugated type (name) (if "NIL", print c) (※1)
   %f      conjugated form (code)
   %Fc     conjugated form (name) (if "NIL", print c) (※1)
   %c      cost of morpheme
   %S      the input sentence
   %pb     "*" if optimal path, " " otherwise
   %pi     the index of the path of the output lattice
   %ps     starting position of the path's morpheme
   %pe     ending position of the path's morpheme + 1
   %pc     cost of path
   %ppiC   indices of the elements in the preceding path, joined together by C
   %ppcC   costs of the elements in the preceding path, joined tog
3. -d            Represent each morpheme as a Prolog compound term and output them as a list
   -v            Detailed display mode for VisualMorphs
   -F format     Output in the format specified in the format string
   -Fh           Display the help for output formatting options
   -j            Treat full stops and empty lines as sentence boundaries
   -o file       Specify output file
   -w width      Manually set cost threshold
   -r rc_file    Use rc_file as the chasenrc file
   -R            Read the default chasenrc file (PREFIX/etc/chasenrc)
   -L lang       Specify input language
   -lp           Show a list of POS category codes and their names
   -lt           Show a list of inflection category codes and their names
   -lf           Show a list of inflection type (code), inflected form (code), inflected form (name)
   -i encoding   Select the input encoding (e: EUC-JP, s: Shift_JIS, w: UTF-8, u: UTF-8, a: ISO-8859-1)
   -h            Show the help message
   -V            Show the version number
   -s            Restricted analysis

About the -j option

Normally ChaSen treats the end of a line as the end of an input sentence. Because of this, a file in which newlines appear in the middle of sentences is often analyzed incorrectly. In such cases, adding the -j option causes full stops and other sentence-final punctuation, or empty lines, to be used for identifying sentence boundaries.

The characters used to split sentences with the -j option can also be specified by setting the "punctuation characters" value appropriately in chasenrc.

1.4 Output Formats

The output format of analysis results can be chang
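The sentence splitting described for the -j option can be sketched in C as follows. This is only an illustration of the idea, not ChaSen's implementation: the function name sentence_length is hypothetical, delimiters are matched byte by byte (so only single-byte delimiters such as "." work here, whereas ChaSen's own delimiters include full-width characters), and an "empty line" is taken to mean two consecutive newline characters.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: return the length in bytes of the first sentence
 * in text. A sentence ends after any byte listed in delims, or after an
 * empty line (two consecutive newlines). A single newline does NOT end
 * the sentence, which is the point of the -j behaviour.
 */
size_t sentence_length(const char *text, const char *delims)
{
    size_t i;

    assert(text != NULL && delims != NULL);
    for (i = 0; text[i] != '\0'; i++) {
        if (strchr(delims, text[i]) != NULL)
            return i + 1;               /* include the delimiter itself */
        if (text[i] == '\n' && text[i + 1] == '\n')
            return i + 2;               /* include both newlines */
    }
    return i;                           /* no boundary found: rest of text */
}
```

Calling the function repeatedly, each time advancing the text pointer by the returned length, walks through the input one sentence at a time.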
4. 's author, Taku Kudoh, and we will simultaneously release versions for use with both ChaSen and MeCab.

One problem that none of JUMAN⁵, ChaSen, or MeCab has addressed is unknown word processing, i.e., the handling of words not in the dictionary. Machine learning models to solve this problem are currently under development at NAIST [32, 33]. In the future we would like to release a morphological analyzer that is built on a different framework from ChaSen and supports unknown word processing.

³ http://mecab.sourceforge.net/
⁴ http://mecab.sourceforge.net/soft.html
⁵ http://nlp.kuee.kyoto-u.ac.jp/nl-resource/juman.html
5. Cost threshold

In the process of morphological analysis, users may want to allow all analyses that fall within a given cost width of the beam search. This setting specifies that cost width. To output all solutions within the cost width, use the -m and -p options.

    (COST_WIDTH 0)        ; cost width (default value)

The cost width can also be specified with the -w option, overriding the value set in the chasenrc file.

Undefined connectivity cost

This setting specifies the connectivity cost for morpheme sequences that are not defined in the connection rule file. If an undefined connectivity cost is not given, or if it is set to 0, morpheme sequences that are not in the connection rule file will never be permitted. The default value is 0.

    (DEF_CONN_COST 500)   ; undefined connectivity cost of 500

Output format

This setting lets users change the output format of ChaSen's results.

    (OUTPUT_FORMAT "%m\t%y\t%P-\n")

The output format can also be specified using the -F flag, overriding any value set in chasenrc. For more information on formatting, see Section 1.4.

BOS string

This setting specifies the string to display at the beginning of the results for a sentence. Using "%S" will display the entire input sentence. The default is the empty string.

    (BOS_STRING "Input sentence: %S\n")   ; BOS string is "Input sentence: %S"

EOS string

This setting specifies the string to display at the end of the results for
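The two cost settings described above (cost width and undefined connectivity cost) can be illustrated with a short C sketch. It follows one reading of the manual text and is not taken from ChaSen's source: the function names are hypothetical, "within the cost width" is assumed to mean that a path costs at most the best path's cost plus the width, and a negative rule_cost is our own convention for "this pair is not in the connection rule file".

```c
#include <assert.h>

/*
 * Hypothetical helper: nonzero if a candidate path should be kept.
 * With cost_width == 0 (the default) only paths that tie with the
 * best path survive.
 */
int within_cost_width(int path_cost, int best_cost, int cost_width)
{
    assert(cost_width >= 0);
    return path_cost <= best_cost + cost_width;
}

/*
 * Hypothetical helper: cost of connecting two morphemes.
 * rule_cost < 0 means "not defined in the connection rule file"
 * (a convention of this sketch). Returns -1 when the connection is
 * not permitted, which per the manual is the case when the undefined
 * connectivity cost is missing or 0.
 */
int connection_cost(int rule_cost, int undef_conn_cost)
{
    if (rule_cost >= 0)
        return rule_cost;           /* defined in the rule file */
    if (undef_conn_cost <= 0)
        return -1;                  /* undefined pairs are forbidden */
    return undef_conn_cost;         /* e.g. 500 with (DEF_CONN_COST 500) */
}
```

With the default settings, within_cost_width(120, 100, 0) is 0 and connection_cost(-1, 0) is -1; raising the width to 50 or the undefined cost to 500 makes the same candidates acceptable.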
6. [...] NAIST-IS-MT9851001, March 2000.

[...] VisualMorphs [...], 2000-NL-137, p. 98, June 2000.

[...] HMM [...], 2000-NL-137, pp. 39-46, June 2000.

Masayuki Asahara, Yuji Matsumoto. Extended Models and Tools for High-performance Part-of-Speech Tagger. Proceedings of COLING 2000, July 2000.

[...] 2000-NL-139, pp. 25-32, Sep. 2000.

[...] Vol. 41, No. 11, pp. 1208-1214, Nov. 2000.

[...] Feb. 2001.

[...] pp. 39-46, Feb. 2002.

[...] SIGNL-154, pp. 47-54, 2003.

[...] Support Vector Machine [...], Vol. 44, No. 5, pp. 1354-1367, May 2003.

Taku Kudo, Kaoru Yamamoto, Yuji Matsumoto. "Applying Conditional Random Fields to Japanese Morphological Analysis". EMNLP 2004, 2004.

[...] Vol. 19, No. 3, pp. 334-339, 2004.
7. sions of their file names.

Multiple dictionary sets may also be specified. Relative paths, i.e. paths not starting with "/", are assumed to start in the same directory as the grammar files. Here is an example:

    (DADIC chadic /home/rikyu/mydic/chadic)

In this example, two sets of dictionaries are read in:

(a) chadic.{da,lex,dat} in the grammar file directory
(b) chadic.{da,lex,dat} in /home/rikyu/mydic

When dictionary lookups are done, both of the above dictionary sets will be used.

2. The setting DADIC is used to specify a double-array dictionary for Darts.

    (DADIC chadic)

In the above example, chadic.da, chadic.lex, and chadic.dat in the same directory as the grammar files will be read.

The maximum number of usable dictionaries is 32.

3. Unknown word part of speech

When an unknown word is detected, this setting indicates which part of speech to treat it as while applying ChaSen's connection rules. If multiple parts of speech are given, the connection rules for each part of speech are applied.

    (UNKNOWN_POS (名詞 サ変接続))              ; one part of speech
    (UNKNOWN_POS (名詞 サ変接続) (名詞 一般))  ; multiple parts of speech

4. Part of speech cost

The morphological analyzer expresses the precedence of analyses as costs. When the analysis is ambiguous, the result with the lowest total cost is given precedence. The part of speech cost setting is used to define the magnitude of cost ass
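The rule for relative dictionary paths given above (a path that does not start with "/" is looked up in the grammar file directory) can be sketched in C. The function resolve_dic_path and the directory /opt/grammar used below are hypothetical names chosen for illustration; ChaSen's own lookup code may differ.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write the resolved dictionary path into out.
 * Absolute paths (starting with '/') are used unchanged; any other
 * path is taken relative to grammar_dir.
 * Returns 0 on success, -1 if out is too small.
 */
int resolve_dic_path(const char *grammar_dir, const char *dic,
                     char *out, size_t out_size)
{
    int n;

    assert(grammar_dir != NULL && dic != NULL && out != NULL);
    if (dic[0] == '/')
        n = snprintf(out, out_size, "%s", dic);
    else
        n = snprintf(out, out_size, "%s/%s", grammar_dir, dic);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```

For the two entries of the DADIC example, "chadic" would resolve to a path inside the grammar directory, while "/home/rikyu/mydic/chadic" would be returned as is.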
8. Please send any inquiries regarding ChaSen to the following address:

    Computational Linguistics Laboratory
    Graduate School of Information Science
    Nara Institute of Science and Technology
    8916-5 Takayama, Ikoma, Nara 630-0192, Japan
    Tel: +81-743-72-5240  Fax: +81-0743-72-5249
    E-mail: chasen@is.naist.jp
    URL: http://chasen-legacy.sourceforge.jp/

1 ChaSen User's Manual

1.1 Installation

1. Install the necessary tools. The following tools are necessary to compile ChaSen:

   - Darts (version 0.3 or later)¹
   - libiconv (if not part of your system's standard installation)

2. Run "configure".

       % ./configure

   - When specifying the location of Darts' header files:

       % ./configure --with-darts=/usr/local/include

   - When using libiconv:

       % ./configure --with-libiconv=yes

   - When specifying the location of libiconv:

       % ./configure --with-libiconv=/usr/local

   The compiler and options will be determined automatically. For more information on how to use configure, consult INSTALL or the output of "./configure --help".

3. Run "make".

       % make

   ChaSen's executable is created in chasen/chasen, the libraries in lib/, and the dictionary creation program in mkchadic/. Compilation sometimes fails with the operating system's standard make. In that case, GNU make should be used.

4. Run "make install".

       % make install

   The installation directory
9. we decided to fork into separate projects, and Kyoto University's expanded version was soon released as JUMAN 3.0 beta in June of 1996.

NAIST's fork was renamed ChaSen, and version 1.0 was released in February of 1997. The planned improvements to JUMAN were made through the release of versions 1.5 through 2.3, and with the release of ChaSen 2.4 almost all of the planned features had been added. Development progressed on the following schedule:

(1) ChaSen 1.0: Development of system-independent dictionaries (replacement of NDBM with binary trees)
(2) ChaSen 1.0: Refactored and improved performance of system
(3) ChaSen 1.0: Support for undefined connectivity costs, compound parts of speech, and user definition of output formatting
(4) ChaSen 1.0: Support for JIS encoding
(5) ChaSen 1.0: Definitions for readings of inflectional endings
(6) WinCha 1.0: Support for Windows
(7) ChaSen 1.5: Converted to library
(8) ChaSen 1.5: Converted to server
(9) ChaSen 2.0: Stratification of POS definitions
(10) ChaSen 2.0: Variable-length connection rules
(11) ChaSen 2.0: Created a dictionary for words with half-width characters (dictionary using SUFARY)
(12) ChaSen 2.0: Expansion of output formats
(13) ChaSen 2.0: Training of models using variable-length connection costs
(14) ChaSen 2.4: Restricted analysis

C The Future of Morphological Analyzers

A morphological analyzer called "MeCab" has been released by Taku Kudoh³. MeCab us
10. Chooi Ling Goh, Masayuki Asahara, and Yuji Matsumoto. "Chinese Word Segmentation by Classification of Characters". International Journal of Computational Linguistics and Chinese Language Processing, Vol. 10, No. 3, pp. 381-396, September 2005.

Chooi Ling Goh, Masayuki Asahara, and Yuji Matsumoto. "Training Multi-Classifiers for Chinese Unknown Word Detection". Journal of Chinese Language and Computing, Vol. 15, No. 1, pp. 1-12, 2005.

[...] pp. 245-248, 2005.

[...] pp. 604-607, 2005.

[...] 2005.

Chooi Ling Goh, Masayuki Asahara, and Yuji Matsumoto. "Machine Learning-based Methods to Chinese Unknown Word Detection and POS Tag Guessing". Journal of Chinese Language and Computing, Vol. 16, No. 4, pp. 185-206, 2006.

[...] SIGNL-173, pp. 67-74, 2006.

[...] SIGNL-179, 2007.

Appendix

A Regarding Copyright and Usage Restrictions

The ChaSen morphological analyzer was developed as free software to widely aid research on natural language processing. ChaSen's copyright is held by Computational Li
11. 「にわにはにわにわとりがいる」 while specifying that 「はにわ」 is a noun, or that 「にわとり」 should be treated as a single morpheme. In this case, analysis candidates that violate the constraints, such as the fourth character 「は」 being treated as an independent morpheme, or 「にわとり」 being split into 「にわ」 and 「とり」, will be rejected.

Input format

The input for constrained analysis is the same as ChaSen's standard output format, but reading and lemmatization information is ignored. In the following examples, tabs are represented by "\t".

    [Example input: tab-separated segments, two of them marked UNSPEC, terminated by EOS]

Each line consists of a "segment". A segment can be one of the following:

    - morpheme specification
    - sentence fragment
    - end of sentence
    - comment

- morpheme specification

  This segment represents a single morpheme, a unit that will not be split any further. Morpheme specification segments have part-of-speech information from the fourth column onward. The format is the same as ChaSen's standard output.
  If you write "UNSPEC" instead of part-of-speech information, ChaSen will look up the segment in its dictionary and use the corresponding entry as its result. If there is no entry, the segment will be labeled as an unknown word.

- sentence fragment

  A segment without any part-of-speech information represents a sentence fragment. The co
12. ARE DISCLAIMED. IN NO EVENT SHALL THE Nara Institute of Science and Technology BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

JUMAN
    version 0.6      17 February 1992
    version 0.8      14 April 1992
    version 1.0      25 February 1993
    version 2.0      11 July 1994
ChaSen
    version 1.0      19 February 1997
    version 1.5      7 July 1997
    version 2.0      15 December 1999
    version 2.2.0    06 December 2000
    version 2.3.0    16 February 2003
    version 2.4.0    30 March 2007
ChaSen for Windows
    version 1.0      29 March 1997
    version 2.0      15 December 1999
    version 2.4.0    30 March 2007
NAIST Technical Report
    1st edition (NAIST-IS-TR99008)    20 April 1999
    2nd edition (NAIST-IS-TR99012)    15 December 1999

Contents

1 ChaSen User's Manual
  1.1 Installation
  1.2 Running ChaSen
  1.3 Runtime Options
  1.4 Output Formats
  1.5 Constrained Analysis
2 The chasenrc Resource File
3 
13. ChaSen Morphological Analyzer version 2.4.0
User's Manual

Yuji Matsumoto and Kazuma Takaoka and Masayuki Asahara

2007-03-19

Copyright (c) 2007 Computational Linguistics Laboratory, Graduate School of Information Science, Nara Institute of Science and Technology

Morphological Analysis System ChaSen 2.4.0 User's Manual
Yuji Matsumoto, Kazuma Takaoka and Masayuki Asahara

This translation of the ChaSen user's manual was made by Eric Nichols with support from the non-profit organization GSK.

Copyright (c) 2007 Nara Institute of Science and Technology. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
3. The name Nara Institute of Science and Technology may not be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY Nara Institute of Science and Technology "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
14. The ChaSen Library
4 Using ChaSen from Other Systems
  4.1 Using ChaSen from Perl
Bibliography
Appendix
  A Regarding Copyright and Usage Restrictions
  B The Connection between JUMAN 3.0 and ChaSen
  C The Future of Morphological Analyzers

Introduction

The computational analysis of Japanese, unlike that of American and European languages, faces two problems from the outset. The first is morphological analysis. The spread of word processors has removed a major obstacle to entering Japanese text, but before Japanese can be analyzed computationally, the individual morphemes in an input sentence must first be recognized. This requires a dictionary as large as can practically be supported, which in turn raises the problem of how to maintain that dictionary. The second problem is that Japanese has no widely accepted grammar or grammatical terminology. The grammars taught in school do provide word classifications and grammatical terminology, but researchers do not hold them in high regard, and they are not well suited to computers.

Morphological analyzers are the most basic tool for analyzing Japanese. Many research groups have developed them and brought many technical problems to light, yet no common tool is in general circulation. This
15. a sentence. Using "%S" will display the entire input sentence. The default is "EOS\n".

    (EOS_STRING "END\n")   ; EOS string is "END"

11. Whitespace part of speech

ChaSen treats the half-width space character (ASCII code 32) and tab (ASCII code 9) as whitespace and ignores them during analysis. Normally whitespace information is not included in ChaSen's output, but this can be changed with the "SPACE_POS" setting. For example, the setting given below will output "punct-whitespace" for whitespace.

    (SPACE_POS (punct whitespace))   ; whitespace part of speech is "punct-whitespace"

Furthermore, by setting the output format to "%m" and specifying a whitespace part of speech, users can get output that corresponds exactly to the input sentence, whitespace included.

12. Annotations

This setting allows strings that begin and end with a certain sequence to be treated as annotations and ignored during morphological analysis. In the results, the annotation string is output as a single morpheme.

Each annotation definition consists of a list containing a start string and a stop string, followed by optional part-of-speech information or a formatting string. The stop string can be omitted, in which case the start string itself is treated as the annotation. If both the part-of-speech information and the format string are omitted, no information at all about the annotation's morpheme will be output.
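The start-string and stop-string matching described for annotations can be sketched in C as below. This is an illustration under stated assumptions, not ChaSen's code: the function name annotation_length is hypothetical, matching is done byte by byte, and text with a start string but no matching stop string is assumed not to be an annotation (the manual does not say how that case is handled).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: if text begins with an annotation delimited by
 * start and stop, return the annotation's length in bytes (delimiters
 * included); otherwise return 0. When stop is NULL or empty, the start
 * string itself is the whole annotation, as the manual describes.
 */
size_t annotation_length(const char *text, const char *start,
                         const char *stop)
{
    size_t start_len;
    const char *end;

    assert(text != NULL && start != NULL);
    start_len = strlen(start);
    if (start_len == 0 || strncmp(text, start, start_len) != 0)
        return 0;                       /* does not begin with start */
    if (stop == NULL || stop[0] == '\0')
        return start_len;               /* stop string omitted */
    end = strstr(text + start_len, stop);
    if (end == NULL)
        return 0;                       /* assumption: unterminated */
    return (size_t)(end - text) + strlen(stop);
}
```

An analyzer loop could call this at each input position for every configured annotation definition and, on a nonzero result, emit that span as one morpheme instead of analyzing it.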
16. atenated together, and displayed as "punctuation" (記号).

Compound word output

ChaSen can be configured to treat compound words defined in the morphological dictionary files (*.dic) in two different ways:

(a) compound (複合語): the morphological information for the entire compound word is output.
(b) compositional (構成語): the compound word is decomposed into individual words, and the morphological information for each word is output.

The default setting is "compound" (複合語).

    (OUTPUT_COMPOUND 複合語)   ; output compound morphological information

Compound word output can also be controlled by the -Oc and -Os options.

Delimiters

This setting allows users to define the characters that are used as sentence delimiters when the -j option is set (see 1.3). Both half-width and full-width characters can be used as delimiters. For example, a definition of the following form treats a set of full-width characters, a set of half-width characters, and whitespace as sentence delimiters.

    (DELIMITER "...")

Encodings

The character encoding that ChaSen supports can be changed by re-encoding the morphological file and recompiling ChaSen. The ENCODE setting is used to indicate the encoding that ChaSen will use. For example, the following definition denotes Unicode.

    (ENCODE u)

The supported encodings are e: EUC-JP, s: Shift_JIS, w: UTF-8, u: UTF-8, a: ISO-8859-1.

3 The ChaSen Library

The ChaSen module can be i
17. e_tostr(char *str_in);

These functions perform morphological analysis on the input. If ChaSen has not been initialized, it is initialized before proceeding. There are four functions, differing in whether the input and output are strings or file pointers.

chasen_fparse() and chasen_fparse_tostr() perform morphological analysis on strings read from a file pointer. When the -j option has been set with chasen_getopt_argv(), ChaSen splits the input into sentences at the delimiters before parsing.

chasen_sparse() and chasen_sparse_tostr() perform morphological analysis on the string str_in.

chasen_fparse() and chasen_sparse() output the results of morphological analysis to the file pointer fp_out. The return value of these functions is 0.

chasen_fparse_tostr() and chasen_sparse_tostr() store the results of morphological analysis in ChaSen's internal memory and return a pointer to that region of memory. The region remains valid until chasen_fparse_tostr() or chasen_sparse_tostr() is called again.

4 Using ChaSen from Other Systems

4.1 Using ChaSen from Perl

ChaSen can be called from Perl by using the perl/ChaSen.pm Perl module. Consult the perl/README file for information on installation and usage.

Bibliography

[...] 1992.

[...] 1991.
18. ed by using the -F option or by setting the value of "output format" in chasenrc.

If there is a "\n" at the end of the output format string, a newline will be inserted at the end of each piece of morphological information, and "EOS" will be output at the end of each sentence. If there is no "\n" at the end of the output format string, the morphological information for one sentence will be output on one line with a newline at the end.

Also, if the output format string contains "-f", "-e", or "-c", output will match that of the corresponding option.

Here are some examples of output format string usage:

- Same as default (-f option):
      "%m\t%y\t%M\t%U(%P-)\t%T \t%F \n" or "-f"
- Input word, reading, POS delimited by tabs:
      "%m\t%y\t%P-\n"
- Only the input word:
      "%m\n"
- "Wakatigaki mode" (input words divided by spaces):
      "%m "
- Kanji-kana conversion:
      "%y"
- Ruby mode (output in the form of "Kanji(kana)"):
      "%r ()"

Below we give a list of all output format conversion strings and their meaning.

Conversion strings: %m, %M, %y, %Y, %y0, %Y0, %a, %A, %a0, %A0, %rABC, %i, %i0, %Ic, %Pc, %Pnc, %h, %H, %Hn, %b, %BB, %Bc, %t, %Tc, %f, %Fc, %c, %S, %pb, %pi, %ps, %pe, %pc, %ppiC, %ppcC, %?B/STR1/STR2/, %?I/STR1/STR2/, %?T/STR1/STR2/, %?F/STR1/STR2/, %?U/STR1/STR2/, %U/STR/, %%

Conversion string   Function
%m                  surface form (conjugated form)
19. es a discriminative training model known as Conditional Random Fields⁴, as opposed to the generative Hidden Markov Model used by ChaSen. In [24], MeCab's model is shown to have better accuracy than ChaSen's. MeCab's other characteristic is that it can output "Soft Wakatigaki" [30]. In ChaSen's current framework, it is not possible to support new models of analysis or to freely design training features as in MeCab.

Recently there have also been various improvements relating to dictionaries. For the new JUMAN dictionary, information about orthographical variant forms is being prepared together with the selection of a fundamental lexicon of Japanese. UniDic, a recently released dictionary developed by Professor Den's group at Chiba University, is said to be easy to use not just for natural language processing researchers, but also for researchers in the arts and humanities and in speech processing. At NAIST, we plan to screen the entries in IPADIC and release a Japanese dictionary annotated with information about orthographical variations and compound words. We plan to rename the new dictionary and to remove the ICOT entries that were a pending problem for IPADIC. We are also planning to release a dictionary for Chinese morphological analysis with Penn Chinese Treebank part-of-speech information once issues regarding usage rights have been settled. We have discussed the release of the Chinese morphological analysis dictionary with MeCab
20. ether by C
   %?B/STR1/STR2/   STR1 if detailed POS category, STR2 otherwise (※2)
   %?I/STR1/STR2/   STR1 if the semantic information is neither "NIL" nor the empty string, STR2 otherwise (※2)
   %?T/STR1/STR2/   STR1 if conjugated, STR2 otherwise (※2)
   %?F/STR1/STR2/   Same as %?T/STR1/STR2/
   %?U/STR1/STR2/   STR1 if unknown word, STR2 otherwise (※2)
   %U/STR/          "未知語" if unknown word, STR otherwise (※2)
   %%               percent sign

   Conversion string   Function
   .                   specifies field width
   -                   specifies field width
   1-9                 specifies field width
   \n                  newline
   \t                  tab
   \\                  backslash
   \'                  single quotation mark
   \"                  double quotation mark

※ In ipadic, when a morpheme has multiple readings, as in the case of 「行く」 (いく/ゆく), the readings are written with half-width braces and slashes, like so: 「{イ/ユ}ク」. In the standard output format (i.e. that of %y), the word's first reading candidate 「イク」 is output, and with the output format %y0 all of the readings 「{イ/ユ}ク」 are output.

※1 When A, B, C, c are empty strings, nothing is displayed.

※2 The string divider "/" can be an arbitrary string. Brackets like "()", "[]", "{}", "<>" are also usable. For example:

   - %?T#STR1#STR2#
   - %?B(STR1)(STR2)
   - %?U[STR1]/STR2/
   - %U[STR]

1.5 Constrained Analysis

"Constrained analysis" refers to a special kind of analysis that satisfies given constraints. It is used when the morphological information or the boundaries for a portion of the input sentence are already known.

For example, it is possible to analyze the sentence
21. (ANNOTATION (("<" ">")) "%m\n")              ; output as is
    (ANNOTATION (("「")) (記号 一般))              ; punctuation
    (ANNOTATION (("」")) (記号 一般))              ; punctuation
    (ANNOTATION (("\"" "\"")) (名詞 引用文字列))   ; noun-quotation string
    (ANNOTATION (("[" "]")))                      ; nothing will be output

For example, when using the above annotation definitions, ChaSen will output its results in the following format:

- text starting with "<" and ending with ">", such as <img src="cha.gif">, will be output as is
- "記号-一般" will be output for "「" and "」"
- "名詞-引用文字列" will be output for strings in double quotes like "hello, again"
- strings enclosed in square brackets like [ChaSen] will be ignored in morphological analysis, and no information about them will be included in the output

13. Part of speech concatenation

This setting is used to concatenate morphemes of certain parts of speech that appear in succession and output them as a single morpheme.

    (COMPOSIT_POS ((複合名詞) (名詞) (接頭詞 名詞接続) (接頭詞 数接続))
                  ((記号)))

For example, with the above declaration of COMPOSIT_POS, parts of speech are concatenated in the following manner:

(a) Consecutive nouns (名詞), noun prefixes (接頭詞-名詞接続), and numeric prefixes (接頭詞-数接続) are concatenated together and displayed as "compound noun" (複合名詞). However, this part of speech must be defined in the part-of-speech definition file grammar.cha.

(b) Consecutive punctuation (記号) is conc
22. has changed from version 2.1 onward. ChaSen now installs to the locations below by default. PREFIX can be specified with ./configure --prefix (the default is /usr/local).

    PREFIX/bin/chasen          the ChaSen executable
    PREFIX/libexec/chasen/     dictionary construction programs
    PREFIX/lib/libchasen.*     the ChaSen libraries
    PREFIX/include/chasen.h    ChaSen's header file
    PREFIX/share/chasen/doc/   documentation

¹ http://cl.aist-nara.ac.jp/~taku-ku/software/darts/

However, the following is not installed:

    perl/ChaSen.pm             Perl module

chasenrc is not installed with ChaSen. Instead, when the dictionary (ipadic 2.6.0 or above) is installed, chasenrc's path is taken from chasen-config, and if there is no chasenrc in PREFIX/etc, a copy is made automatically. When PREFIX/etc already contains a chasenrc file, it is not overwritten and must be updated manually.

1.2 Running ChaSen

The morphological analyzer's executable is installed into PREFIX/bin/chasen by the "make install" command.

- Running the morphological analyzer

  ChaSen is started by running the chasen command in the following manner:

      % chasen [options] [filename...]

  ChaSen reads its input, from standard input or from the files given as command line arguments, one line at a time and conducts morphological analysis on each sentence.

- Processing details

  ChaSen finds the lowest cost solution (the solution where each morpheme's boundary has a variation from the minimum cost
23. is true of machine readable Japanese dictionaries as well. This system was developed to offer the many researchers aiming at computational analysis of Japanese a commonly usable morphological analyzer. Under these circumstances, we took into account the above two problems and gave special consideration to making it easy for users to change the definition of the grammar and the connective relations between words. This system was developed at a university by a small number of people, and there are still areas that are not perfect, but we plan to keep improving it as far as we can. We hope that you will bear this in mind when using ChaSen. The ChaSen system is based on the Japanese morphological analyzer JUMAN version 2.0, developed at Nagao Laboratory of Kyoto University and the Graduate School of Information Science at Nara Institute of Science and Technology. JUMAN was made with the cooperation of many students and staff of Kyoto University and NAIST. For the dictionary, we used the dictionary from the Kana-Kanji conversion system Wnn and a Japanese dictionary publicly released by ICOT, adding our own modifications. We are especially grateful to Sadao Kurohashi of Tokyo University, with whom we developed JUMAN 2.0, and Yutaka Myoki, who is currently working at Canon. First, we would like to thank Professor Makoto Nagao for creating the opportunity to develop JUMAN. We are also grateful to Takehito Utsuro of Tsukuba Uni
24. ncluded in other programs using the ChaSen libraries libchasen.a and libchasen.so. To do so, include the header file chasen.h. The following library functions and variables are accessible.
        #include <chasen.h>
        int chasen_getopt_argv(char **argv, FILE *fp);
        extern int Cha_optind;
    Pass ChaSen options. If ChaSen has not been initialized, it is initialized before the options are set. If ChaSen's default options are acceptable, the call to this function can be omitted.
    argv is a NULL-terminated array of strings containing the command line options for ChaSen. argv[0] always contains the program name. When there is an error in the options, an error message is written to the file at file pointer fp. No output is produced when fp is set to NULL.
    When there are no errors in the option settings, 0 is returned. When there is an error, 1 is returned. The number of processed options, including argv[0], is stored in the external variable Cha_optind.
    The following is a usage example. In the program chawan, the options "-r /home/rikyu/chasenrc-proj -j" are passed to ChaSen. After chasen_getopt_argv() is called, Cha_optind is assigned 4.
        char *option[] = {"chawan", "-r", "/home/rikyu/chasenrc-proj", "-j", NULL};
        chasen_getopt_argv(option, stderr);

        #include <chasen.h>
        int chasen_fparse(FILE *fp_in, FILE *fp_out);
        int chasen_sparse(char *str_in, FILE *fp_out);
        char *chasen_fparse_tostr(FILE *fp_in);
        char *chasen_spars
25. nguistics Laboratory, Graduate School of Information Science, Nara Institute of Science and Technology. There are no particular restrictions on the use and modification of this software; however, the following conditions apply to its redistribution.
    B The Connection between JUMAN 3.0 and ChaSen
    Since JUMAN 2.0 was released in July of 1994, Nagao Laboratory at Kyoto University and Matsumoto Laboratory at Nara Institute of Science and Technology have been trying different approaches to its expansion. At Kyoto University, researchers have been working on adding functionality for processing multi-word expressions and parsing bracketed expressions in order to describe connective relations that cannot be represented by existing bi-gram models, and they have produced expanded versions of the grammar files and morphological dictionaries with large scale updates. At NAIST, anticipating the accumulation of a large amount of tagged Japanese data, we focused on adding functionality for automatically learning connection rules that go beyond bi-grams, including word and part of speech label tagging, and on the development of dictionaries that do not depend on the NDBM Unix hash database. The latter improvement was aimed at addressing requests to use the software on operating systems other than UNIX and at improving the compilation time and search speed of the dictionaries. Because the two approaches to connectivity rules going beyond bi-grams are fairly different
26. ntents of this segment will be processed without any constraints. However, no candidates that cross the segment boundaries will be generated.
    - End of sentence: Lines starting with "EOS", "BOS EOS", or "XX", and lines containing nothing but a newline, mark the end of a sentence.
    - Annotations: Putting "ANNO" in the part of speech information column will make that segment an annotation. Annotations are displayed in ChaSen's output, but they are not used in its analysis. The display is determined in chasenrc.
    Example analysis: An example of restricted analysis is given below.
    Input: [...]
    Output: [...]
    Areas of caution in restricted analysis
    - During restricted analysis, even if "ANNO" is set, no output will be displayed unless comments are enabled in chasenrc.
    - During restricted analysis, whitespace part of speech tagging and whitespace skipping are disabled (this is to support comments).
    2 The chasenrc Resource File
    The chasenrc resource file is used to define the various options necessary for running the ChaSen morphological analyzer. These definition
27. ociated with each part of speech as well as set the cost of unknown words. Costs must be integer values.
    POS_COST [...] (comments on the entries, in order: any part of speech, default cost 1x; unknown words, cost 500x; nouns, cost 2x; proper nouns, cost 3x)
    When multiple costs are defined for a part of speech, the last cost is given precedence. In the above example, the cost of nouns is 2, but the morpheme cost of proper nouns increases to 3. The "*" setting at the top indicates that the morpheme cost for parts of speech not explicitly defined should be set to 1 (i.e. no change in the total cost of the path). The cost of unknown words is set to 500.
    Footnote 2: The same morpheme cannot be registered in a single dictionary set multiple times, but a given morpheme may appear in multiple dictionary sets. In this case, there will be duplicates of a morpheme.
    10. Relative weights of connectivity and morpheme costs
    The cost in morphological analysis is calculated as the sum of morpheme cost and connectivity cost. This setting lets users assign weights to these two kinds of costs. The cost of an analysis result will be calculated as the sum of each cost multiplied by its weight. If this setting is omitted, it defaults to 1.
        (CONN_WEIGHT 1)     connectivity cost of 1
        (MORPH_WEIGHT 1)    morpheme cost of 1
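The cost rules in this excerpt can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not ChaSen's implementation: the function names, the tuple representation of a part of speech, the English labels, and the "*" wildcard handling are all hypothetical. The sketch treats each POS_COST entry as a multiplier matched against the leading layers of the part-of-speech hierarchy, lets the last matching entry win, and computes a path cost as the weighted sum of morpheme and connectivity costs.

```python
# Illustrative sketch only; names and data layout are hypothetical, not ChaSen's API.

def pos_multiplier(pos_path, pos_costs):
    """Return the multiplier of the last POS_COST entry matching pos_path.

    pos_path  -- tuple of part-of-speech layers, e.g. ("noun", "proper")
    pos_costs -- list of (pattern, multiplier) pairs in definition order;
                 the pattern ("*",) is assumed to match any part of speech
    """
    result = None
    for pattern, multiplier in pos_costs:
        if pattern == ("*",) or pos_path[:len(pattern)] == pattern:
            result = multiplier  # a later definition overrides an earlier one
    if result is None:
        raise ValueError("no POS_COST entry matches %r" % (pos_path,))
    return result


def path_cost(morpheme_costs, connection_costs, conn_weight=1, morph_weight=1):
    """Weighted sum of the morpheme and connectivity costs along one path."""
    return (morph_weight * sum(morpheme_costs)
            + conn_weight * sum(connection_costs))


# Mirrors the example in the text: default 1x, unknown 500x, nouns 2x, proper nouns 3x.
POS_COSTS = [
    (("*",), 1),
    (("unknown",), 500),
    (("noun",), 2),
    (("noun", "proper"), 3),
]

if __name__ == "__main__":
    print(pos_multiplier(("verb",), POS_COSTS))
    print(pos_multiplier(("noun", "proper"), POS_COSTS))
    print(path_cost([10, 20], [5]))
```

Because the entries are scanned in definition order and every match overwrites the previous one, a more specific entry only takes effect if it is listed after the general one, which is how the text describes the noun and proper noun case.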
28. s are usually kept in PREFIX/etc/chasenrc, but they can also be stored in the file .chasenrc in the user's home directory. The chasenrc file can also be specified by an option when chasen is initialized. The following precedence order will be used to determine which chasenrc file is loaded when ChaSen is run:
    1. (Unix, Windows) the file specified by the -r option at initialization time
    2. (Unix, Windows) the file set in the CHASENRC environment variable
    3. (Windows) the chasenrc set in the registry key chasenrc in HKEY_CURRENT_USER\Software\NAIST\ChaSen
    4. (Unix) the .chasen2rc file in the user's home directory
    5. (Unix) the file .chasenrc in the user's home directory
    6. (Unix) PREFIX/etc/chasenrc (not installed by default)
    A list of settings is given below. Of these settings, "DADIC", "UNKNOWN_POS", and "POS_COST" absolutely must be defined.
    1. The grammar file directory setting
    This setting specifies the directory where the grammar files (grammar.cha, ctypes.cha, cforms.cha, connect.cha) reside.
        (GRAMMAR /usr/local/lib/chasen/ipadic/dic)
    This setting can be omitted, in which case the directory is assumed to be the one in which the chasenrc file resides. In the chasenrc file distributed with version 1.01 or later of ChaSen's dictionary (ipadic), "GRAMMAR" is omitted.
    2. System dictionaries
    This setting is used to specify double array dictionaries (chadic.{da,lex,dat}), omitting the exten
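The chasenrc precedence order listed in this excerpt can be expressed as a simple first-match search. The Python sketch below is an illustration only: the function name, its parameters, and the registry_value stand-in for the Windows registry lookup are hypothetical, and the excerpt does not say whether ChaSen falls through to the next candidate when an explicitly named file is missing, so the sketch simply assumes that it does.

```python
# Illustrative sketch only; function and parameter names are hypothetical.
import os


def find_chasenrc(rc_option=None, env=None, home=None, prefix="/usr/local",
                  is_windows=False, registry_value=None):
    """Return the first chasenrc candidate that exists as a file, or None.

    rc_option      -- path given with the -r option, if any
    env            -- mapping of environment variables (defaults to os.environ)
    registry_value -- stand-in for the path stored in the Windows registry;
                      reading the registry itself is out of scope here
    """
    env = os.environ if env is None else env
    # Steps 1 and 2 apply on both Unix and Windows.
    candidates = [rc_option, env.get("CHASENRC")]
    if is_windows:
        candidates.append(registry_value)                         # step 3
    else:
        if home is not None:
            candidates.append(os.path.join(home, ".chasen2rc"))   # step 4
            candidates.append(os.path.join(home, ".chasenrc"))    # step 5
        candidates.append(os.path.join(prefix, "etc", "chasenrc"))  # step 6
    for path in candidates:
        if path and os.path.isfile(path):
            return path
    return None
```

Passing env, home, and prefix explicitly keeps the function easy to exercise against a temporary directory instead of the real home directory.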
29. that is within the established cost threshold, and display the results following the formatting options. The meaning of each option is summarized in the next section.
    - Example usage: The input file can be given as arguments to ChaSen. For example:
          cat temp
          [...]
          chasen temp
          [...]
          EOS
    1.3 Runtime Options
    ChaSen supports several runtime options. They are summarized below. For options that take arguments, such as -r, the argument may optionally be separated from the option with whitespace.
    - Display options for ambiguous input (all display methods display in the same format for unambiguous results):
          -b    Show only the solution with the rightmost longest match (default)
          -m    Show multiple morphemes for only the ambiguous areas
          -p    Expand the ambiguous combinations and show all possible solutions separately
    - Display options for individual morphemes: [...] format
    - Other options: [...] file, width, rc_file, lang
    Display arranged in columns (default). Show all morphological information by category name. Show all morphological information by category code
30. versity who helped us in many ways with JUMAN's development. Ken'ichi Chinen gave us many suggestions about the ChaSen system's development while he was at NAIST. We received a variety of assistance from Osamu Imaichi, Tomoaki Imamura, and Akira Kitauchi while they were at NAIST during the development of ChaSen versions 1.0 and 2 0 8, and from Tatsuo Yamashita, Yoshitaka Hirano, and Hiroshi Matsuda during the development of versions 2.0 and 2.2. We are extremely grateful to both of these groups and to all of the other members of Matsumoto Laboratory who helped with ChaSen's development. The "Japanese Speech Dictation Software Development Group", whose representative member is Professor Kiyohiro Shikano of NAIST, carried out large scale maintenance of the IPA POS dictionary. In particular, we would like to thank Katunobu Itou of Housei University and Kaoru Yamada from ASTEM for their assistance. We are grateful to Yasuharu Den of Chiba University for his advice on dictionary maintenance focusing on the analysis of spoken language. We also received a lot of advice about converting ChaSen to autoconf/automake and making RPM packages from Tetsu Takabayashi and Taku Kudoh while they were at NAIST. Chooi Ling Goh, Cheng Yuchang, and Jia Lu helped with the maintenance of the Chinese dictionary. Finally, although there are too many to name individually, we would like to thank all of ChaSen's users for their many comments and questions.
    