The Czech Academic Corpus 2.0 Guide

1. AF      Description
Pred    predicate, a node not depending on another node; depends on #
Sb      subject
Obj     object
Adv     adverbial
Atv     complement (so-called determining), technically hung on a non-verbal element
AtvV    complement (so-called determining) hung on a verb, no 2nd gov. node
Atr     attribute
Pnom    nominal predicate, or nom. part of predicate with copula be
AuxV    auxiliary verb be
Coord   coord. node
Apos    apposition (main node)
AuxT    reflexive tantum
AuxR    passive reflexive
AuxP    primary preposition, parts of a secondary preposition
AuxC    conjunction (subord.)
AuxO    redundant or emotional item, coreferential pronoun
AuxZ    emphasizing word
AuxX    comma (not serving as a coordinating conjunction)
AuxG    other graphic symbols, not terminal
AuxY    adverbs, particles not classed elsewhere
AuxS    root of the tree (#)
AuxK    terminal punctuation of a sentence
ExD     a technical value for a deleted item; also for the main element of a sentence without predicate (externally dependent)
AtrAtr  an attribute of any of several preceding (syntactic) nouns
AtrAdv  structural ambiguity between adverbial and adnominal (hung on a name/noun) dependency without a semantic difference
AdvAtr  ditto with reverse preference
AtrObj  structural ambiguity between object and adnominal dependency without a semantic difference
ObjAtr  ditto with reverse preference
2. P  Plural, e.g. nohami
S  Singular, e.g. noha
W  Singular for feminine gender, plural with neuter; can only appear in participle or nominal adjective form with gender value Q
X  Any

Table C.5. Case
Value  Description
1  Nominative, e.g. žena
2  Genitive, e.g. ženy
3  Dative, e.g. ženě
4  Accusative, e.g. ženu
5  Vocative, e.g. ženo
6  Locative, e.g. ženě
7  Instrumental, e.g. ženou
X  Any

Table C.6. Possessive gender
Value  Description
F  Feminine, e.g. matčin, její
M  Masculine animate (adjectives only), e.g. otců
X  Any
Z  {M, I, N} Not feminine, e.g. jeho

Table C.7. Possessive number
Value  Description
P  Plural, e.g. náš
S  Singular, e.g. můj
X  Any, e.g. your

Table C.8. Person
Value  Description
1  1st person, e.g. píšu, píšeme
2  2nd person, e.g. píšeš, píšete
3  3rd person, e.g. píše, píšou
X  Any person

Table C.9. Tense
Value  Description
F  Future
H  {R, P} Past or Present
P  Present
R  Past
X  Any

Table C.10. Grade
Value  Description
1  Positive, e
3. [Figure: screenshot of the editor's main window showing a Czech sample text; the text inside the screenshot is not recoverable.]

1. Navigator. For navigating through the words of the document that have been filtered by different criteria, and for selecting words for disambiguation.

2. Da Panels. For displaying and disambiguating the morphological information (lemmas, tags) of a word. A panel consists of two windows: a grouping list and a list of items. The list of items displays all the lemma-tag pairs associated with the current word on the particular m-layer. The grouping list makes it possible to restrict the items to a particular group, e.g. items with a particular lemma, detailed POS or gender. One of the panels is always defined as primary; certain actions apply to that panel only (e.g. Ctrl+T activates the list of lemmas and tags in the main panel).

3. Context Windows. Contain various context information, e.g. the plain text of the document, syntactic structures, etc.
4. s42w  Přirozený jazyk v informačních systémech
s43w  Česká literatura
s44w  NA
s45w  Vědeckotechnická revoluce a socialismus
s46w  Zesilovače se zpětnou vazbou
s47w  Teorie a počítače v geofyzice
s48w  Výzkum hlubinné geologické stavby Československa
s49w  Podstata hypnózy a spánek
s50w  Nukleární medicína
s51w  Hutnictví a strojírenství
s52w  Záruční lhůty potravinářských výrobků
s53w  Mineralogie
s54w  Ptáci
s55w  Elektronický obzor 6 1974
s56w  Teplárenství
s57w  Vědecko-technický rozvoj za socialismu
s58w  Jak na práce se stavebninami
s59w  NA
s60w  Obkládáme interiéry a fasády
s61w  Alpinkářův svět
s62w  Opravujeme a modernizujeme rodinný domek
s63w  Jak na práce s kovem
s64w  Astronomie
s65w  Pokroky matematiky, fyziky a astronomie
s66w  Elektrotechnický obzor
s67w  Hvězdářská ročenka
s68w  Lékařská fyzika

Appendix B. Description of lemmas

In the CAC 1.0, a lemma has the form of a string lemma P1 P2 P3 ..., where lemma is the lemma proper and P1, P2, P3, ... stand for optional additional information.

Table B.1. Additional information of the lemmas
Labelling  Separator  Description  Notes
P1  morpho-syntactic flag: part of speech or its detailed specification
P2  semantic flag: common semantic clasifi
5. 3.3.3 TrEd: Editor for syntactical annotations
3.3.4 Corpus viewer Netgraph
3.3.5 The automatic processing of texts
4 Bonus material
4.1 The STYX electronic exercise book
4.2 Voice control of the TrEd editor via the TrEdVoice module
5 Tutorials
6 Installation
7 Distribution and license information
8 Project VIPs
9 Financial Support
10 Bibliography
A Sources of the texts
B Description of lemmas
C Description of tags
D Analytical function description
E World Wide Web links

List of Figures
2.1 Example of an a-layer annotation
2.2 Technical interconnectio
6. [Figure 3.7, continued: parse tree of the sample sentence "Fantastickým finišem si však Neumannová doběhla pro vytoužené olympijské zlato", with lemmas, tags and analytical functions on the nodes; the graphic is not recoverable.]

We recommend that users test the tools by running the script tool chain tA on an arbitrary Czech text. The results of the script can be opened in the LAW tool, which also enables the disambiguation of the assigned tags. Then run the script tool chain P on the manually disambiguated file. The result of the script can be opened in the TrEd tool, which also enables correcting the dependencies and analytical functions.

Chapter 4. Bonus material

4.1 The STYX electronic exercise book

The bonus material is aimed at advanced students in primary and high schools and their teachers. The bonus material section labelled STYX [36] presents the user with an electronic exercise book for practising Czech morphology and syntax. The most noteworthy feature of this material is the number of sentences offered: more than 11,000 sentences have been compiled, along with the corresponding annotations in the PDT, to facilitate effective training. In addition to this large vocabulary the application provides immediate verificat
7. 3.3.2.2 The usual workflow

The usual annotation work proceeds as follows:

1. Open the desired m-file (File, Open; Ctrl+O). The associated w-file opens automatically.
2. In the Navigator, switch to the ambi list (Ambi: name of the m-file), which displays the ambiguous words (words with more than one result of the morphological analysis), and select the first word.
3. Press Enter. The cursor moves to the primary Da Panel. Select the correct lemma and tag and press Enter again. The cursor moves to the next ambiguous word. If you make a mistake, switch to the list of all entries in the Navigator (All), find the word you want to review and select it. The Da Panel then displays the corresponding annotation. You can now select the correct lemma and tag and switch back to the Ambi list.
4. Save the annotations (File, Save; Ctrl+S).

3.3.3 TrEd: Editor for syntactical annotations

The Tree Editor TrEd [37] is a fully integrated environment primarily designed for the syntactical annotation of tree structures assigned to sentences. The editor can also be used for data viewing and searching with the help of several kinds of search functions. TrEd supports the PML and CSTS formats for input and output. More details on these formats can be found in Section 3.2.1. The TrEd system is highly modular, which means that support for other formats can be easily plugged in. TrEd offers various possibilities for custom settings. User defined mac
8. Value  Description  POS
P  Personal pronoun já, ty, on (lit. I, you, he), incl. forms with the enclitic -s, e.g. tys (lit. you're); the gender position is used for the third person to distinguish on/ona/ono (lit. he/she/it), and number for all three persons. POS: P (pronoun)
Q  Pronoun relative/interrogative co, copak, cožpak (lit. what, isn't-it-true-that). POS: P (pronoun)
R  Preposition (general, without vocalization). POS: R (preposition)
S  Pronoun possessive můj, tvůj, jeho (lit. my, your, his); the gender position is used for the third person to distinguish jeho, její, jeho (lit. his, her, its), and number for all three pronouns. POS: P (pronoun)
T  Particle. POS: T (particle)
U  Adjective possessive (with the masculine ending -ův as well as the feminine -in). POS: A (adjective)
V  Preposition (with vocalization -e or -u): ve, pode, ku (lit. in, under, to). POS: R (preposition)
W  Pronoun negative (nic, nikdo, nijaký, žádný; lit. nothing, nobody, not-worth-mentioning, no/none). POS: P (pronoun)
X  (temporary) Word form recognized, but the tag is missing in the dictionary due to delays in (asynchronous) dictionary creation
Y  Pronoun relative/interrogative co as an enclitic (after a preposition): oč, nač, zač (lit. about what, on/onto what,
9. See Table 3.7 for an example of the complete annotation of the sentence Váš boj je i naším bojem (lit. Your fight is our fight too) in the CSTS format.

Table 3.7. An example of sentence annotation in CSTS format

<s id="n01w-s14">
<f id="n01w-s14W1">Váš<l>tvůj_^(přivlast.)<t>PSYS1-P2-------<r>1<g>2<A>Atr
<f id="n01w-s14W2">boj<l>boj<t>NNIS1-----A----<r>2<g>3<A>Sb
<f id="n01w-s14W3">je<l>být<t>VB-S---3P-AA---<r>3<g>0<A>Pred
<f id="n01w-s14W4">i<l>i<t>J^-------------<r>4<g>6<A>AuxZ
<f id="n01w-s14W5">naším<l>můj_^(přivlast.)<t>PSZS7-P1-------<r>5<g>6<A>Atr
<f id="n01w-s14W6">bojem<l>boj<t>NNIS7-----A----<r>6<g>3<A>Pnom
<D>
<d id="n01w-s14W7">.<l>.<t>Z:-------------<r>7<g>0<A>AuxK

The DTD file for the CSTS format can be found in the directory data/schemas. For more detailed information on this format, see the PDT 2.0 documentation [13]. The directories tools/tool_chain/csts2pml and tools/tool_chain/pml2csts provide conversion scripts for the two formats.

File naming conventions

Each data file used in the CAC 2.0 relates to one annotated document. The base of the file name contains a single letter that classifies the subject of the text contained in the file. Namely, n indicates newspaper articles, s marks scientific texts and a denotes administrat
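The naming convention described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `describe_base_name` and the dictionary labels are hypothetical, and the reading of the trailing letter (`w` for written, `s` for transcription) is inferred from names such as s01w and s69s in the source tables of Appendix A, not from an explicit rule in this excerpt.

```python
import re

# First letter: subject of the text, as described in the guide.
STYLE = {"n": "newspaper articles", "s": "scientific texts", "a": "administrative texts"}

# Assumption: the trailing letter marks the form of the document
# (inferred from names such as s01w and s69s in Appendix A).
FORM = {"w": "written", "s": "transcription"}

# Base name = style letter + document number + form letter, e.g. "n01w".
BASE_NAME = re.compile(r"^([nsa])(\d+)([ws])$")


def describe_base_name(base):
    """Split a base file name such as 'n01w' into style, number and form."""
    match = BASE_NAME.match(base)
    if match is None:
        raise ValueError(f"not a recognised base name: {base!r}")
    style, number, form = match.groups()
    return {"style": STYLE[style], "number": int(number), "form": FORM[form]}


if __name__ == "__main__":
    print(describe_base_name("n01w"))
    print(describe_base_name("s69s"))
```

A helper like this is mainly useful for grouping experiment results by style and form without hard-coding lists of documents.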
10. The Czech Academic Corpus 2.0 Guide

by Barbora Vidová Hladká, Jan Hajič, Jirka Hana, Jaroslava Hlaváčová, Jiří Mírovský and Jan Raab

Table of Contents
1 Preface
2 Introduction
2.1 Introducing the Czech Academic Corpus (CAC) 2.0
2.2 Sources of the texts
2.3 Annotation layers
2.4 The project's progress
2.4.1 On the road to the CAC 2.0: Morphological annotation
2.4.2 On the road to the CAC 2.0: Syntactical annotation
3 The Czech Academic Corpus 2.0 CD-ROM
3.1 Directory structure
3.2 Data
3.2.1 Data formats
3.2.2 File naming conventions
3.2.3 Data size
3.3 Tools
3.3.1 Corpus manager Bonito
3.3.2 LAW: Editor for morphological annotation
11. after/for what). POS: P (pronoun)
Z  Pronoun indefinite (nějaký, některý, číkoli, cosi; lit. some, some, anybody's, something). POS: P (pronoun)
a  Numeral indefinite (mnoho, málo, tolik, několik, kdovíkolik; lit. much/many, little/few, that much/many, some (number of), who-knows-how-much/many). POS: C (numeral)
b  Adverb without a possibility to form negation and degrees of comparison, e.g. pozadu, naplocho (lit. behind, flatly); i.e. both the C.11 and the C.10 attributes in the same tag are marked by Not applicable. POS: D (adverb)
c  Conditional (of the verb být, lit. to be, only): by, bych, bys, bychom, byste (lit. would). POS: V (verb)
d  Numeral generic with adjectival declension (dvojí, desaterý; lit. two-kinds, ten-...). POS: C (numeral)
e  Verb, transgressive present (endings -e/-ě, -íc, -íce). POS: V (verb)
f  Verb, infinitive. POS: V (verb)
g  Adverb forming negation (C.11 set to A/N) and degrees of comparison (C.10 set to 1/2/3, comparative/superlative), e.g. velký, zajímavý (lit. big, interesting)
h  Numeral generic; only jedny and nejedny (lit. one-kind/sort-of, not-only-one-kind/sort-of). POS: C (numeral)
i  Verb, imperative form. POS: V (verb)
j  Numeral generic greater than or equal to 4 used as a syntactic noun (t
12. and that were manually inserted into the CAC to replace missing words and numbers written as digits.

Table 3.9. Quantitative characteristics of the CAC 2.0: the two replacement characters

For each style and form, the columns give: the number of occurrences of the first replacement character and the number of sentences containing it; the same two counts for the second replacement character; the same two counts for either character; and the number of sentences not containing replacement symbols.

Style           Form           1st char.  sentences  2nd char.  sentences  either  sentences  sentences without
Journalism      Written        1,769      1,187      925        680        2,694   1,563      8,671
Journalism      Transcription  5          5          25         25         30      30         1,403
Scientific      Written        2,149      1,222      2,230      1,418      4,379   2,030      9,083
Scientific      Transcription  9          9          131        108        140     113        4,463
Administrative  Written        901        611        635        476        1,536   915        2,447
Administrative  Transcription  0          0          16         15         16      15         974

Every published experiment conducted on the CAC 2.0 data should state which data was used to obtain the reported results.

The annotation of the CAC 2.0 is divided into three layers: the w-layer (word layer), the m-layer (morphological layer) and the a-layer (analytical layer). Each of these layers has its own PML schema, located in the directory data/schemas (files wdata_schema.xml, mdata_schema.xml, adata_schema.xml). The directory data/pml is composed of a total of 496 files: 180 w-files, 180 m-files, a
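As a quick sanity check on a local copy, the per-layer file counts quoted above can be recomputed. The sketch below is a minimal illustration, assuming that layer files carry the suffixes .w, .m and .a (as in the instance names n01w.w, n01w.m and n01w.a used elsewhere in this guide) and that they sit directly in one directory; the function name `count_layer_files` is hypothetical.

```python
from collections import Counter
from pathlib import Path

# Assumption: the file suffix identifies the annotation layer.
LAYER_SUFFIXES = {".w": "w-layer", ".m": "m-layer", ".a": "a-layer"}


def count_layer_files(directory):
    """Count w-, m- and a-layer files found directly in `directory`."""
    counts = Counter()
    for path in Path(directory).iterdir():
        layer = LAYER_SUFFIXES.get(path.suffix)
        if path.is_file() and layer is not None:
            counts[layer] += 1
    return counts


if __name__ == "__main__":
    # "data/pml" is the directory named in the guide; adjust to the local copy.
    data_dir = Path("data/pml")
    if data_dir.is_dir():
        for layer, number in sorted(count_layer_files(data_dir).items()):
            print(layer, number)
```

Reporting these counts alongside experimental results is one simple way to document which part of the data was used.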
13. dementi t natural sciences foreign word FFUK Faculty of Arts _ B K abbreviation institution culture Charles University u Filozof fakulta Univerzity Karlovy education abbreviation description l n lazy 1y derivation remove one character from the end ie add character y l n

Appendix C. Description of tags

Table C.1. Part of speech
Value  Description
A  Adjective
C  Numeral
D  Adverb
I  Interjection
J  Conjunction
N  Noun
P  Pronoun
V  Verb
R  Preposition
T  Particle
X  Unknown, Not Determined, Unclassifiable
Z  Punctuation (also used for the Sentence Boundary token)

Table C.2. Sub-part of speech
Value  Description  POS
#  Sentence boundary. POS: Z (punctuation)
%  Author's signature, e.g. haš-99_:B_;S. POS: N (noun)
*  Word krát (lit. times). POS: C (numeral)
,  Conjunction subordinate (incl. aby, kdyby in all forms). POS: J (conjunction)
}  Numeral, written using Roman numerals (XIV). POS: C (numeral)
:  Punctuation (except for the virtual sentence boundary word, which uses the C.2 value #). POS: Z (punctuation)
=  Number written using digits. POS: C (numeral)
?  Numeral kolik (lit. how many/how much). POS: C (numeral)
@  Unrecognized word form. POS: X (unknown)
^  Conjunction (co
14. [30] Tutorial on TrEd: http://lectures.ms.mff.cuni.cz/video/recordshow/index/2/23
[31] XML: http://www.w3.org/XML

Name, description and location

TOOLS
[32] Bonito, graphical user interface of the Manatee corpus manager: http://nlp.fi.muni.cz/projekty/bonito
[33] LAW, morphological annotation editor: http://www.ling.ohio-state.edu/~hana/law.html
[34] Morče, morphological tagger of Czech: http://ufal.mff.cuni.cz/morce
[35] Netgraph, tool for searching dependency corpora: http://quest.ms.mff.cuni.cz/netgraph
[36] STYX, electronic exercise book of Czech based on PDT: http://ufal.mff.cuni.cz/styx
[37] TrEd, syntactical annotation editor: http://ufal.mff.cuni.cz/~pajas/tred
[38] TnT (Trigrams'n'Tags) tagger: http://www.coli.uni-saarland.de/~thorsten/tnt
15. [Table 3.13: An example of text treated with morphological analysis and tagging. For each word form of the sample sentence, the table lists the lemmas and tags proposed by the morphological analysis next to the lemma and tag selected by the tagger; the cell contents are not recoverable.]

Figure 3.7. An example of sentence parsing (sample sentence). [Parse tree graphic not recoverable.]
16. All three examples were chosen from the CAC 2.0 deliberately, so that the user can view the instances directly; the name of the document and the number of the sentence are provided for every sentence. Figure 2.2 illustrates the 1:1 ratio of the layers: the layers do not differ except for the final punctuation. Figure 2.3 exemplifies the situation where a word token is inserted into the text because the year information was clearly missing. Since it is almost impossible for the corrector to add the missing year, a replacement symbol is used; this symbol has no counterpart on the w-layer. In contrast, Figure 2.4 illustrates the situation where several m-layer units correspond to the same w-layer unit: the word unit pedagogicko-psychologické (psychological-pedagogical) has been divided into three separate units.

Figure 2.2. Technical interconnection of the w-layer and m-layer: no changes other than the final sentence punctuation (document n08w, sentence No. 155). Sentence: Prodavačka je hbitá a usměvavá (The salesgirl is brisk and smiling). [Graphic not recoverable.]

Figure 2.3. Technical interconnection of the w-layer and m-layer: the insertion of a word token (document n46w, sentence No. 100). Sentence: Bylo objeveno roku (It has been discovered in the year). [Graphic not recoverable.]
17. Users always use the client side of the Netgraph application. The client connects to the public server quest.ms.mff.cuni.cz through port 2001. Alternatively, the user can install the server part of the application and then search the corpus offline.

3.3.5 The automatic processing of texts

The data and applications for the morphological and syntactical analysis of Czech texts were developed simultaneously. The CD-ROM contains two fundamental morphological applications (morphological analysis and tagging) and one syntactical application (parsing). The procedure for tokenisation is also included.

Tokenisation is the process of splitting the given text into word tokens. Its result is the so-called vertical, a file containing each word or punctuation mark on a separate line. The term tokenisation is often used both for splitting the text into words and for segmentation, i.e. marking sentence and paragraph boundaries. Our tokenisation procedure also segments the text. However, we understand tokenisation even more broadly: the procedure also converts the vertical into the CSTS format (see Section 3.2.1.2). This conversion includes adding the file header to the beginning of the vertical and marking each word with a simple tag distinguishing the word properties that are clear straight from the orthographic form of the word. Punctuation, digits
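The idea of a vertical can be illustrated with a very small Python sketch. It is not the tokeniser shipped on the CD-ROM: the regular expression, the function names `to_vertical` and `coarse_class`, and the class labels are all assumptions made for this illustration, and the sketch performs no sentence segmentation and no CSTS conversion.

```python
import re

# A token is either a run of word characters or a single non-space symbol.
TOKEN = re.compile(r"\w+|[^\w\s]")


def to_vertical(text):
    """Return the text as a 'vertical': one token per line."""
    return "\n".join(TOKEN.findall(text))


def coarse_class(token):
    """Guess a property that is visible from the orthographic form alone.

    The labels are hypothetical; they only illustrate the kind of simple
    marking described in the guide.
    """
    if token.isdigit():
        return "digits"
    if not any(ch.isalnum() for ch in token):
        return "punctuation"
    if token[0].isupper():
        return "capitalized"
    return "word"


if __name__ == "__main__":
    vertical = to_vertical("Váš boj je i naším bojem.")
    for line in vertical.splitlines():
        print(line, coarse_class(line), sep="\t")
```

Because `\w` is Unicode-aware in Python 3, Czech letters with diacritics stay inside their tokens.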
18. ROM. Each tool is described by its main features and its intended use. The following sections describe the tools in more detail.

Table 3.11. Tools outline
Tool  Description  Purpose
Bonito: corpus manager. Searching within the CAC 2.0 texts; searching within the morphological annotations of the CAC 2.0; searching within the analytical functions assigned to words in the CAC 2.0 as a part of the a-layer; basic statistics on the CAC 2.0.
LAW: morphological annotations editor. Morphological annotation (manual disambiguation of morphological analysis results).
TrEd: syntactical annotations editor. Syntactical annotation (assigning analytical functions and syntactical dependencies).
Netgraph: corpus viewer. Searching within the trees in the CAC 2.0.
tool chain: automatic procedure for processing Czech texts. Tokenisation; morphological analysis; tagging (automatic disambiguation of morphological analysis results); parsing (automatic syntactical analysis with analytical functions assignment).

3.3.1 Corpus manager Bonito

The graphic tool Bonito [32] simplifies tasks commonly associated with language corpora, especially searching within them and calculating basic statistics on the search results. Bonito is a graphical front end to the corpus manager Manatee, which conducts various operations on corpus data. A detailed documentation
19. TrEd tool can only be used in MS Windows OS. Installing TrEd in MS Windows using the installation package distributed with the CAC 2.0 (tools TrEd tred wininst en zip) also installs the TrEdVoice tool. Please note that even though TrEdVoice is offered as bonus material, its user manual is placed in the directory tools/TrEd/docs, not in bonus tracks, due to TrEdVoice's close interconnection with TrEd.

Chapter 7. Distribution and license information

The full distribution of the CAC 2.0 CD-ROM can be ordered from the Linguistic Data Consortium [10] publishing house. During the ordering process you will be redirected to the license agreement web page; see the license agreement text at http://ufal.mff.cuni.cz/corp-lic/cac20-reg-cs.html and http://ufal.mff.cuni.cz/corp-lic/cac20-reg-en.html. To complete the order, the user must fill in the license agreement form.

Some of the distributed tools are covered by the GPL (GNU General Public License). This fact is always explicitly stated in the README EN txt file of the tool, which is placed in the home directory of the tool on the CAC 2.0 CD-ROM. In these cases, the GPL takes precedence over the CAC 2.0 license.

Chapter 8. Project VIPs

All the people who contributed to the CAC 2.0 are introduced by name.

Czech Academic Corpus version 2.0
Morphological annotations checking: Jiří Mírovský
Syntactical annotations checking: Alla Bémová
20. a possibly surprising conclusion. They decided to ignore the original annotation completely and to process the manually morphologically annotated texts of the CAC 1.0 by an automatic procedure (a parser). This procedure assigns a dependency tree to each sentence and an analytical function to each node. These automatically assigned trees were then manually verified (annotated). The maximum spanning tree parser (MST parser) described below was used. For details see Section 3.3.5.

Professional linguists conducted the analytical annotation of the Prague Dependency Treebank (PDT). Two annotators from the PDT group became the main arbiters for our project. Among the other annotators were two students of philology and two Czech and three Slovak annotators experienced in annotating the Slovak National Corpus [21] under the leadership of Prague linguists trained in the PDT annotations.

The CAC annotation therefore had two phases: annotation and arbitration. In the beginning, each document was annotated by two annotators working in parallel. The two annotations were automatically compared and the result proceeded to the arbiter. As soon as the arbiter agreed that the work of the annotators was fluent enough, each document was annotated only once. During this second stage of annotation, the arbiter reviewed the complete documents, not only the differences between parallel annotations. The documents were then processed by the automatic scripts verifying th
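The automatic comparison of two parallel annotations can be pictured roughly as follows. The guide does not describe the actual comparison script, so this Python sketch is only an assumption about the general idea: each annotation is represented as a list of (governor, analytical function) pairs in word order, and the function `disagreements` (a hypothetical name) reports the positions an arbiter would need to look at. The second annotation in the example is invented.

```python
def disagreements(first, second):
    """List the word positions where two annotations of one sentence differ.

    Each annotation is a list of (governor, analytical_function) pairs,
    one pair per word, in word order. Positions are counted from 1.
    """
    if len(first) != len(second):
        raise ValueError("annotations cover a different number of words")
    differences = []
    for position, (a, b) in enumerate(zip(first, second), start=1):
        if a != b:
            differences.append((position, a, b))
    return differences


if __name__ == "__main__":
    # First three words of "Váš boj je ..."; the second annotation is invented.
    annotator_1 = [(2, "Atr"), (3, "Sb"), (0, "Pred")]
    annotator_2 = [(2, "Atr"), (3, "Obj"), (0, "Pred")]
    for position, a, b in disagreements(annotator_1, annotator_2):
        print(f"word {position}: {a} vs {b}")
```

Only the differing positions are passed on, which mirrors the first phase described above, where the arbiter reviewed the differences rather than the complete documents.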
21. call this annotation an analytical layer. While creating the CAC 1.0, the omitted words and numerical expressions were manually replaced by wildcard symbols. These corrections, and the reasons why they were deemed necessary, are described in detail in the CAC 1.0 Guide (Vidová Hladká a kol. 2007). The wildcard symbols were not further processed during the creation of the CAC 2.0.

The CAC 2.0 offers:

For linguists: language material reflecting the real usage of the language.
For computational linguists: the tools and a considerable amount of data that can help improve applications that work with natural language and are not feasible without morphological and syntactical text processing.
For TrEd annotation tool users: the possibility to use voice control for the tool.
For teachers and their students: an interesting didactic tool for practising Czech morphology and syntax.

2.2 Sources of the texts

The CAC contains mostly unabridged articles taken from a wide range of media: newspapers, magazines and transcripts of spoken language from radio and TV programmes, covering administration, journalism and scientific fields. The texts date from the 1970s and 1980s, and the selection of texts is therefore influenced by the political and cultural climate of that period. A complete list of resources can be found in Appendix A.

2.3 Annotation layers
22. 3.2 The PML schema of the w-layer in the CAC 2.0
3.3 Part of the header of the m-layer instance n01w.m
3.4 Part of the header of the a-layer instance n01w.a
3.5 An example of sentence m-layer annotation in the PML format
3.6 An example of sentence a-layer annotation in the PML format
3.7 An example of sentence annotation in CSTS format
3.8 Size of the CAC 2.0 parts according to style and form
3.9 Quantitative characteristics of the CAC 2.0: replacement characters
3.10 A comparison of the CAC 2.0 and the PDT 2.0
3.11 Tools outline
3.12 Script tool chain
3.13 An example of text treated with morphological analysis and tagging
5.1 Data tutorials
5.2 Tool tutorials
6.1 Tools compatibility with Linux and MS Windows operating systems
A.1 Administrative documents
A.2 Documents covering journalism
A.3 Docu
23. em česky. Academia, Praha, 2006.

Ribarov 2004: Kiril Ribarov. Automatic Building of a Dependency Tree: The Rule-Based Approach and Beyond. Doktorská práce, MFF UK, Praha, 2004.

Ribarov, Bémová, Hladká 2006: Kiril Ribarov, Alla Bémová, Barbora Hladká. When a statistically oriented parser was more efficient than a linguist: A case of treebank conversion. Prague Bulletin of Mathematical Linguistics 86, pp. 21-38, 2006.

Savický, Hlaváčová 2002: Petr Savický, Jaroslava Hlaváčová. Measures of Word Commonness. Journal of Quantitative Linguistics, Swets & Zeitlinger, Vol. 9, No. 3, pp. 215-231, 2002.

Šmilauer 1972: Vladimír Šmilauer. Nauka o českém jazyku. Praha, 1972.

Vidová Hladká a kol. 2007: Barbora Vidová Hladká, Jan Hajič, Jiří Hana, Jaroslava Hlaváčová, Jiří Mírovský, Jan Votrubec. Průvodce Českým akademickým korpusem 1.0. Karolinum, Praha, 2007.

Votrubec 2005: Jan Votrubec. Volba vhodné sady rysů pro morfologické značkování češtiny (Selecting an Optimal Set of Features for the Morphological Tagging of Czech). Master thesis, MFF UK, Prague, Czech Republic, 2005.

Hana, Zeman 2005: Jiří Hana, Daniel Zeman, Jan Hajič, Hana Hanová, Barbora Hladká, Emil Jeřábek. Manual for Morphological Annotation. TR-2005-27, Institute of Formal and Applied Linguistics, MFF UK, Prague, Czech Republic, 2005.

Appendix A. Sources of the texts

Ta
24. g. velký
2  Comparative, e.g. větší
3  Superlative, e.g. největší

Table C.11. Negation
Value  Description
A  Affirmative (not negated), e.g. možný
N  Negated, e.g. nemožný

Table C.12. Voice
Value  Description
A  Active, e.g. píšící
P  Passive, e.g. psaný

Table C.13. Reserve 1
Value  Description
-  not applicable

Table C.14. Reserve 2
Value  Description
-  not applicable

Table C.15. Variant
Value  Description
-  Basic variant, standard contemporary style; also used for standard forms allowed for use in writing by the Czech Standard Orthography Rules despite being marked there as colloquial
1  Variant, second most used (less frequent), still standard
2  Variant, rarely used, bookish, or archaic
3  Very archaic, also archaic + colloquial
4  Very archaic or bookish, but standard at the time
5  Colloquial, but (almost) tolerated even in public
6  Colloquial (standard in spoken Czech)
7  Colloquial (standard in spoken Czech), less frequent variant
8  Abbreviations
9  Special uses, e.g. personal pronouns after prepositions etc.

Appendix D. Analytical function description

Table D.1. Analytical functions (AF) in the CAC 2.0
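Since every tag is a fixed-length string with one character per category, a tag can be unpacked mechanically. The Python sketch below is a minimal illustration, assuming that a tag has 15 positions in the order of Tables C.1 to C.15 and that a hyphen marks a position that does not apply; the names in `POSITIONS` and the function `decode_tag` are hypothetical labels chosen for this example (the captions of Tables C.3 and C.4 are not part of this excerpt, so "Gender" and "Number" are assumed from the surrounding descriptions).

```python
# One name per tag position, in the order of Tables C.1 to C.15.
POSITIONS = [
    "POS", "SubPOS", "Gender", "Number", "Case",
    "PossGender", "PossNumber", "Person", "Tense", "Grade",
    "Negation", "Voice", "Reserve1", "Reserve2", "Variant",
]


def decode_tag(tag):
    """Map a 15-character positional tag to {category: value}.

    Positions holding '-' are left out of the result.
    """
    if len(tag) != len(POSITIONS):
        raise ValueError(f"expected {len(POSITIONS)} characters, got {len(tag)}")
    return {name: value for name, value in zip(POSITIONS, tag) if value != "-"}


if __name__ == "__main__":
    # Noun, masculine inanimate, singular, nominative, affirmative.
    print(decode_tag("NNIS1-----A----"))
```

The decoded values can then be looked up in the tables of this appendix, for example Case 1 (Nominative) in Table C.5.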
25. <lemma>tvůj_^(přivlast.)</lemma>
<tag>PSYS1-P2-------</tag>
</m>
<m id="m-n01w-s14W2">
<src.rf>manual</src.rf>
<w.rf>w#w-n01w-s14W2</w.rf>
<form>boj</form>
<lemma>boj</lemma>
<tag>NNIS1-----A----</tag>
</m>
<m id="m-n01w-s14W3">
<src.rf>manual</src.rf>
<w.rf>w#w-n01w-s14W3</w.rf>
<form>je</form>
<lemma>být</lemma>
<tag>VB-S---3P-AA---</tag>
</m>
...
<m id="m-n01w-s14W7">
<src.rf>manual</src.rf>
<form_change>insert</form_change>
<form>.</form>
<lemma>.</lemma>
</m>
</s>

Table 3.6 shows an example of the analytical annotation of the sentence Váš boj je i naším bojem (lit. Your fight is our fight too). The less important elements have been left out to make the example more transparent. The dependency structure of the sentence is represented by nested elements. Daughter nodes are enveloped by the element children. Furthermore, each node is enveloped in the element LM, with the identifier of the node as an attribute; lists consisting of a single node are the only exception, as this element can be omitted for them. In that case the identifier of the node becomes an attribute of the element children. The element m.rf links to the corresponding element of the lower layer, which contains the particular word form. The element afun contains the analytical function of the node. The element
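An m-layer fragment of this shape can be read with the XML parser from the Python standard library. The sketch below is a simplified illustration only: the inline sample omits the XML namespace and the file header that real PML files carry (an assumption made to keep the example short), and the function name `read_m_units` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified sample modelled on Table 3.5; real files also have a header
# and an XML namespace, both left out here.
SAMPLE = """<s id="m-n01w-s14">
  <m id="m-n01w-s14W2">
    <src.rf>manual</src.rf>
    <w.rf>w#w-n01w-s14W2</w.rf>
    <form>boj</form>
    <lemma>boj</lemma>
    <tag>NNIS1-----A----</tag>
  </m>
  <m id="m-n01w-s14W3">
    <src.rf>manual</src.rf>
    <w.rf>w#w-n01w-s14W3</w.rf>
    <form>je</form>
    <lemma>být</lemma>
    <tag>VB-S---3P-AA---</tag>
  </m>
</s>"""


def read_m_units(xml_text):
    """Return one dictionary per <m> element of a sentence."""
    sentence = ET.fromstring(xml_text)
    units = []
    for m in sentence.iter("m"):
        # Child element names (form, lemma, tag, w.rf, ...) become keys.
        fields = {child.tag: (child.text or "") for child in m}
        fields["id"] = m.get("id")
        units.append(fields)
    return units


if __name__ == "__main__":
    for unit in read_m_units(SAMPLE):
        print(unit["id"], unit["form"], unit["lemma"], unit["tag"], sep="\t")
```

The value of w.rf is kept as a plain string here; following it into the w-layer file would be the next step in a fuller reader.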
26. n12w  Týdeník aktualit
n13w  Zemědělské noviny
n14w  Gramorevue G 73
n15w  Tribuna
n16w  Záběr
n17w  Úder
n18w  Svoboda
n19w  Služba lidu
n20w  Zpravodaj TIBY
n21w  Nové Hradecko
n22w  Pochodeň
n23w  Technický týdeník
n24w  Horník a energetik
n25w  Sázavan
n26w  Čelákovický zpravodaj
n27w  Nové Klatovsko
n28w  Pravda
n29w  Průboj
n30w  Zpravodaj TIBY
n31w  Krkonošská pravda
n32w  Školství a věda
n33w  Stráž lidu
n34w  Zbrojovák
n35w  Nová svoboda
n36w  Vlasta
n37w  Mladý svět
n38w  Naše rodina
n39w  Ahoj na sobotu
n40w  Květy
n41w  Signál
n42w  Zahrádkář
n43w  Film a doba
n44w  Melodie
n45w  Stadión
n46w  Věda a technika mládeži
n47w  Haló sobota
n48w  Svět socialismu
n49w  Zahradnické listy
n50w  Kino
n51w  Chovatel
n52w  Zápisník Z 73

Table A.3. Documents covering the scientific field
File  Written form                   File  Transcription
s01w  Dějiny české hudební kultury   s69s  Divadelní přehlídka
s02w  Motivace lidského chování      s70s  Výklad Zákoníku práce
s03w  Škola opora socialismu         s71s  Opera o Bratrech Karamazových pr
27. tools

Table 5.1. Data tutorials
Video clip
m-layer [23]
a-layer [22]
PML [15]

Table 5.2. Tool tutorials
Video clip      Demo                                   Text
Bonito [24]     Bonito tutorials bonito en htm         Bonito tutorials bonito text en htm
LAW [25]        LAW tutorials law en htm
TrEd [30]       TrEd tutorials tred en htm
bTrEd [12]
Netgraph [26]   Netgraph tutorials netgraph en htm
STYX [29]       STYX tutorials styx en htm
TrEdVoice       tutorials tredVoice_cs htm

Chapter 6. Installation

To streamline your work with the CAC 2.0, we provide installation programs for the Linux and MS Windows operating systems. Please note that in both operating systems the components of the CD-ROM are copied to the hard drive, not installed. Users must install the selected tools themselves; a README EN txt file is available for every tool in its home directory within the CD directory. This file contains the system requirements, documentation references and installation instructions. Most parts of the CAC 2.0 can also be used directly from the distributed CD-ROM or its copies.

Table 6.1 summarises all tools contained on the CD-ROM and the possibility to run them in the Linux and MS Windows operating systems.

Table 6.1. Tools compatibility with Linux and MS Windows operating systems
Tool    Linux  MS Windows
Bonito  yes    yes
LAW     yes    yes
STYX    yes    yes
Tr
28. 2005
Ryan McDonald, Fernando Pereira, Kiril Ribarov, Jan Hajič: Non-projective Dependency Parsing using Spanning Tree Algorithms. Proceedings of HLT/EMNLP 2005, pp. 523-530, Vancouver, Canada, 2005.

Mikulová a kol. 2006
Marie Mikulová, Alevtina Bémová, Jan Hajič, Eva Hajičová, Jiří Havelka, Veronika Kolářová, Lucie Kučová, Markéta Lopatková, Petr Pajas, Jarmila Panevová, Magda Razímová, Petr Sgall, Jan Štěpánek, Zdeňka Urešová, Kateřina Veselá, Zdeněk Žabokrtský: Anotace na tektogramatické rovině Pražského závislostního korpusu. Anotátorská příručka. TR-2005-28, Institute of Formal and Applied Linguistics, MFF UK, Prague, Czech Republic, 2005.

Müller, Psutka, Šmídl 2000
Luděk Müller, Josef Psutka, Luboš Šmídl: Design of Speech Recognition Engine. TSD 2000, Lecture Notes in Artificial Intelligence, Springer-Verlag, Berlin, Heidelberg, pp. 259-264, 2000.

Pajas, Štěpánek 2005
Petr Pajas, Jan Štěpánek: A Generic XML-based Format for Structured Linguistic Annotation and its Application to the Prague Dependency Treebank 2.0. TR-2005-29, Institute of Formal and Applied Linguistics, MFF UK, Prague, Czech Republic, 2005.

Přikryl 2007
Leoš Přikryl: Rozhraní v mluveném jazyce pro korpusové anotační nástroje. Diplomová práce, MFF UK, Praha, 2007.

Psutka, Müller, Matoušek, Radová 2006
Josef Psutka, Luděk Müller, Jindřich Matoušek, Vlasta Radová: Mluvíme s počíta
29. 4 similarly shows the referential part of the header of the a-layer instance n01w.a, which refers to the PML schema of that instance (adata_schema.xml) and to the corresponding m-layer instance (n01w.m) and w-layer instance (n01w.w).

Table 3.4 Part of the header of the a-layer instance n01w.a

    <head>
      <schema href="adata_schema.xml" />
      <references>
        <reffile id="m" href="n01w.m" name="mdata" />
        <reffile id="w" href="n01w.w" name="wdata" />
      </references>
    </head>

The annotation is expressed using XML elements and attributes named and used according to their corresponding PML schema. Table 3.5 illustrates the morphological annotation of a part of the sentence "Váš boj je i naším bojem" (lit. "Your fight is our fight too"). The opening tag of the element s contains an identifier of the whole sentence. It is followed by the opening tag of the element m, which contains an identifier of the annotation; the w-layer token that the annotation corresponds to is referred to from the element w.rf. The other elements contain the form (form) and the morphological tag (tag), and src.rf gives the source of the annotation, in this case a manual annotation.

Table 3.5 An example of sentence m-layer annotation in the PML format

    <s id="m-n01w-s14">
      <m id="m-n01w-s14W1">
        <src.rf>manual</src.rf>
        <w.rf>w#w-n01w-s14W1</w.rf>
        <form>Váš</form>
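The header layout of Table 3.4 can be read with any XML library. The following is a minimal sketch using Python's standard library, not a tool shipped on the CD-ROM. The enclosing root element name (adata), the sample string and the function names are assumptions made for illustration; tag names are compared without their namespace so that the sketch does not depend on how the namespace is declared.

```python
import xml.etree.ElementTree as ET

# Illustrative header modelled on Table 3.4; the root element name is an assumption.
SAMPLE_HEADER = """<adata>
  <head>
    <schema href="adata_schema.xml"/>
    <references>
      <reffile id="m" href="n01w.m" name="mdata"/>
      <reffile id="w" href="n01w.w" name="wdata"/>
    </references>
  </head>
</adata>"""


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def read_header(xml_text):
    """Return (schema href, {reffile name: href}) found in a PML-style header."""
    root = ET.fromstring(xml_text)
    schema = None
    references = {}
    for element in root.iter():
        name = local_name(element.tag)
        if name == "schema":
            schema = element.get("href")
        elif name == "reffile":
            references[element.get("name")] = element.get("href")
    return schema, references


schema, references = read_header(SAMPLE_HEADER)
print(schema, references)
```

Keying the references by their name attribute (mdata, wdata) rather than by id is a design choice of this sketch; either attribute would serve for resolving links between layers.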
30. 9s Dlouhodobé skladování masa
s22w Česká literatura    s90s Personalistika
s23w Československá informatika    s91s Archeologické nálezy v Toušeni, Jaroslav Špaček
s24w Národopisné aktuality    s92s Přednáška o geografii
s25w Vlastivědný sborník moravský    s93s Úvod do dějin feudalismu
s26w Český lid    s94s Filosofie fyziky, RNDr. Jiří Mrázek, CSc.
s27w Otázky lexikální statistiky    s95s O vývoji knihovnictví
s28w Památková péče 4/1974    s96s Základní podmínky pro pěstování zeleniny
s29w Základní a rekreační tělesná výchova 10/1974    s97s O výchově socialistické inteligence
s30w Společenské vědy ve škole 2/1974    s98s Petrologie sedimentů a reziduálních hornin
s31w Hospodářské právo    s99s Organizace a řízení vnitřního obchodu
s32w Sociální jistoty včera a dnes    s00s Rozbor situace v JZD
s33w Arbitrážní praxe
s34w Filosofický časopis 5/1974
s35w Československá psychologie
s36w Společenská struktura a revoluce
s37w Humanismus v naší filosofické tradici
s38w Společnost, vzdělání, jedinec
s39w Rozvoj osobnosti a slovesné umění
s40w Ke kritice buržoasních teorií společnosti
s41w Spisovný jazyk v současné komunikaci
31. CAC 2.0, tags are designed according to the PDT as strings of fixed length (15 positions), where each position corresponds to a single category. Appendix C contains the complete list of these morphological positional tags and their detailed description.

Example: The word form Prahu (a form of Praha, Prague) is analysed as an affirmative (11th position) noun (1st and 2nd position), feminine (3rd position), singular (4th position) and accusative (5th position). All of the other positions are filled with the symbol "-", which marks a morphological category that is irrelevant for the given part of speech. For example, person and tense (8th and 9th position) are not determined for nouns.

Table 2.1 Examples of lemmas and tags of particular word forms

Word token  Lemma  Tag              Description
Prahu       Praha  NNFS4-----A----  Noun, feminine, singular, accusative, affirmative
123         123    C=-------------  Digit token
)           )      Z:-------------  Punctuation mark (right parenthesis)

An a-layer annotation assigns to each word unit data characterising its syntactical features, that is, its relation to the other sentence elements and its function in the sentence. Formally, the sentence relations are represented by a dependency tree. The functions of word units in the sentence are represented by so-called analytic functions, which are listed and described in Appendix D.

Example: Figure 2.1 shows the syntactical an
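The positional design of the tags described above lends itself to a simple lookup by position. The sketch below is an illustration only, not part of the CAC 2.0 tools. The function name and the English category labels are assumptions: positions 1 to 5, 8, 9 and 11 are labelled as in the example for Prahu, positions 6, 7 and 10 follow the table captions of Appendix C, positions 12 to 15 are left unlabelled, and "-" is assumed to be the filler for an irrelevant category.

```python
TAG_LENGTH = 15
IRRELEVANT = "-"  # assumed filler for categories that do not apply

# 1-based position -> category label (labels are illustrative, see the lead-in).
POSITIONS = {
    1: "part of speech",
    2: "detailed part of speech",
    3: "gender",
    4: "number",
    5: "case",
    6: "possessive gender",
    7: "possessive number",
    8: "person",
    9: "tense",
    10: "grade",
    11: "negation",
}


def decode_tag(tag):
    """Map each labelled, relevant position of a 15-character tag to its value."""
    if len(tag) != TAG_LENGTH:
        raise ValueError(f"expected {TAG_LENGTH} positions, got {len(tag)}")
    return {
        POSITIONS[index]: value
        for index, value in enumerate(tag, start=1)
        if index in POSITIONS and value != IRRELEVANT
    }


decoded = decode_tag("NNFS4-----A----")
print(decoded)
```

For Prahu the decoded dictionary has no entries for person or tense, which mirrors the remark that these categories are not determined for nouns.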
32. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. Proceedings of EMNLP 2002, University of Pennsylvania, Philadelphia, USA, 2002.

Čermák, Blatná 2005
František Čermák, Renata Blatná: Jak využívat Český národní korpus. Nakladatelství Lidové noviny, Praha, 2005.

Hajič 2004
Jan Hajič: Disambiguation of Rich Inflection (Computational Morphology of Czech). Karolinum, Praha, 2004.

Hajič et al. 2004
Jan Hajič, Jarmila Panevová, Eva Buráňová, Alevtina Bémová, Jan Štěpánek, Petr Pajas, Jiří Kárník: Anotace na analytické rovině. Návod pro anotátory. Institute of Formal and Applied Linguistics, MFF UK, Prague, Czech Republic, 2004.

Hladká, Králík 2006
Barbora Hladká, Jan Králík: Proměny Českého akademického korpusu. Slovo a slovesnost 67, pp. 179-194, 2006.

Jelínek, Bečka, Těšitelová 1961
Jaroslav Jelínek, Josef Václav Bečka, Marie Těšitelová: Frekvence slov, slovních druhů a tvarů v českém jazyce (FSSDTCJ). SPN, Praha, 1961.

Kopřivová, Kocek 2000
Marie Kopřivová, Jan Kocek: Český národní korpus, úvod a příručka uživatele. FF UK, Prague, Czech Republic, 2000.

Kučera 2006
Ondřej Kučera: Pražský závislostní korpus jako cvičebnice jazyka českého (Prague Dependency Treebank as an Exercise Book of Czech). Master thesis, MFF UK, Prague, Czech Republic, 2006.

McDonald, Pereira, Ribarov, Hajič
33. E World Wide Web links

Name (description): Location

PROJECTS
[1] Resources and tools for information systems: http://ufal.mff.cuni.cz/rest
[2] Morphological tagging of Czech (a complete guide): http://ufal.mff.cuni.cz/czech-tagging
[3] Parsing of Czech (a complete guide): http://ufal.mff.cuni.cz/czech-parsing

INSTITUTIONS
[4] Academy of Sciences of the Czech Republic: http://www.av.cz
[5] Grant Agency of the Academy of Sciences of the Czech Republic: http://www.gaav.cz
[6] Department of Cybernetics of the University of West Bohemia in Plzeň, Czech Republic: http://www.kky.zcu.cz
[7] Ministry of Education, Youth and Sports of the Czech Republic: http://www.msmt.cz
[8] Charles University in Prague, Czech Republic: http://www.cuni.cz
[9] Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University in Prague, Czech Republic: http://ufal.mff.cuni.cz
[10] Linguistic Data Consortium, Philadelphia, PA, USA: http://www.ldc.upenn.edu
[11] Institute of Czech Language, Academy of Sciences of the Czech Republic: http://ujc.cas.cz

DATA RESOURCES, GUIDELINES, TUTORIALS
[12] bTrEd and nTrEd tutorial (tutorial on bTrEd and nTrEd): http://ufal.mff.cuni.cz/pdt2.0/doc/tools/tred/bn-tutorial.html
[13] csts DTD (an internal data format based on SGML): http://ufal.mff.cuni.cz/pdt2.0/doc/pdt-guide/cz/html/ch03.html#a-data-formats-csts
[14] Czech National Corpus: http://ucnk.f
34. Ed        yes    yes
TrEdVoice   no     yes
Netgraph    yes    yes
tool_chain  yes    no

Use the following commands to run the installation:

- Installation in Linux: run the program Install-on-Linux.pl from the root directory of the CD-ROM.
- Installation in MS Windows: launch the installation program by double-clicking the Install-on-Windows.exe icon in the root directory of the distribution.

The installation process starts with a choice between two types of installation. The user is then prompted to enter the destination folder; the structure of the destination folder follows the directory structure of the CD-ROM.

- Basic: copies the documentation, the tutorials and the installation packages of the Bonito, TrEd (including the TrEdVoice module for voice control in MS Windows) and STYX tools.
- Custom: copies all components selected by the user from the CD-ROM.

Warning for CD-ROM CAC 1.0 users: The installation programs contained on the CAC 2.0 CD-ROM are independent of the CAC 1.0 installation. We recommend installing all the tools that were part of the CAC 1.0 installation again from the CAC 2.0 CD-ROM, because the CAC 2.0 distribution contains updated versions of the tools.

Warning for Bonito tool users: To search within the CAC 2.0 using the Bonito tool, it is not necessary to copy the CAC 2.0 in XML format from the data/pml directory.

Warning for TrEd and TrEdVoice tool users: The TrEdVoice module for the voice control of the
35. Figure 2.5 illustrates the operations performed on the data from the CAC 1.0 release up to the CAC 2.0 release.

Figure 2.5 CAC 2.0 preparation: data processing

Chapter 3. The Czech Academic Corpus 2.0 CD-ROM

3.1 Directory structure

This section describes the directory structure of the CD-ROM down to its second or third tier (see Table 3.1). Any reference to content that resides deeper within the tree gives the full path to the file.

Table 3.1 CAC 2.0 CD-ROM: Directory structure

index.html                CAC 2.0 Guide in Czech (html)
index-en.html             CAC 2.0 Guide in English (html)
Install-on-Linux.pl       Installation script for Linux (English)
Install-on-Windows.exe    Installation program for MS Windows (English)
Instaluj-na-Linuxu.pl     Installation script for Linux (Czech)
Instaluj-na-Windows.exe   Installation program for MS Windows (Czech)
bonus-tracks/             Bonus material
  STYX/                   Electronic exercise book of Czech language
data/                     Data component
  csts/                   CAC 2.0 in CSTS format; files [ans][0-9][0-9][sw].csts
  pml/                    CAC 2.0 in PML format; files [ans][0-9][0-9][sw].[amw]
  schemas/                PML schemas and DTD of the CSTS format
doc/                      Documentation
  cac-guide/              CAC 2.0 Guide in Czech and English (pdf)
tools/                    Tools
  Bonito/                 Corpus manager
  Java/                   Java Runtime Environment 6 Update 3 for Linux and MS Windows
  LAW/                    Editor of morphological annotati
36. Katarína Gajdošová, Katarina Kandračová, Ivana Klímová, Kiril Ribarov, Zdeňka Urešová, Miroslav Zumrík

Tools
- Bonito: Pavel Rychlý, Oldřich Krůza
- LAW: Jirka Hana
- TrEd: Petr Pajas
- Netgraph: Jiří Mírovský
- Segmentation and tokenization of Czech texts: Jan Hajič, Michal Křen
- Czech morphological analyser: Jan Hajič, Jaroslava Hlaváčová, David Kolovratník, Pavel Květoň
- Tagger: Jan Raab
- Parser: Ryan McDonald, Václav Novák, Kiril Ribarov
- Automatic morphological and syntactical processing of Czech texts: Michal Kebrt

Bonus material
- STYX: Ondřej Kučera
- TrEdVoice: Leoš Přikryl

CD-ROM
- Web page, installation script: Ondřej Bojar
- CD booklet, web page: Michal Sotkovsky

CAC Guide
- Technical editor: Jan Raab
- Czech language corrections: Magda Ševčíková
- English translation: Alena Chrastová
- Proofreading: Sezin Rajandran

Chapter 9. Financial support

The development of the Czech Academic Corpus version 2.0 has been supported by the following organizations and projects:
- Grant Agency of the Czech Academy of Sciences, grants no. 1ET101120413 and 1ET101120503
- Grant Agency of the Charles University, grant no. 207-10/257559
- Ministry of Education, Youth and Sports, grant no. MSM0021620838
- Faculty of Mathematics and Physics of the Charles University in Prague
- Charles University in Prague

Chapter 10. Bibliography

Collins 2002
Michael Collins:
37. [Figure 4.3, continued: TrEd window showing the dependency tree of the sentence "Celní unie v ohrožení" and the voice control log of recognised and unrecognised commands.]

Chapter 5. Tutorials

We provide two kinds of tutorials to introduce the data and the tools to the user. The first kind consists of videos and handouts of the lectures given at the PDT tutorial "Prague Treebanking for Everyone: A two-day tutorial" [28], held in the autumn of 2006. The videos and the text documents provided are in English. The second kind consists of demos guiding the user through the graphical interface controls of the provided tools. The demos are placed directly on the CD-ROM, while the videos are linked from an external source.

Table 5.1 lists all tutorials (videos) concerning the data: the tutorials on the annotation layers (m-layer, a-layer) and the tutorial on the inner data representation (PML format). Table 5.2 lists all tutorials (videos and demos or texts) concerning the
38. We cannot call a corpus annotated without specifying what kind of annotation it contains. In other words, from the viewpoint of linguistic theory, one must first characterise the so-called layers of annotation.

This text contains both bibliographic references (e.g. Vidová Hladká a kol. 2007) and Internet references in the form of a number in brackets (e.g. [1]), referring to the list of Internet URLs in Appendix E.

The annotation of the CAC 2.0 covers two layers: morphological and analytical. To be precise, we also operate on a third layer, the layer of words. Strictly speaking, the word layer is not an annotation layer, as it consists of the original text divided into word tokens (words, numbers written in digits, and punctuation). For convenience, however, we will refer to it as an annotation layer. Henceforth we refer to the word, morphological and analytical layers as the w-layer, m-layer and a-layer, respectively.

A morphological layer of annotation provides each word token with further data characterising its morphological properties: the lemma (the canonical form of a lexeme), the part of speech and the morphological categories (case, number, tense, person, etc.). Formally, part-of-speech classes combine with the values of morphological categories to form morphological tags, or simply tags. In the
39. D.1 Analytical functions (AF) in the CAC 2.0

Chapter 1. Preface

The Prague family of annotated corpora has a new member: the Czech Academic Corpus version 2.0 (CAC 2.0), a manually annotated morphological and syntactical corpus of the Czech language. More precisely, the CAC 2.0 is a "new and old" member, as one version preceded the current one. The first version contained only morphological annotations; it was published a year ago and can therefore be regarded as outdated. What the CAC 2.0 newly brings is syntactical annotation, so we can characterise our corpus by another Praguian attribute: dependency.

The CAC 2.0 Guide is a guide to the CD-ROM, just like the previous CAC 1.0 Guide. The Guide provides all the necessary information about the project, so the user does not need to be familiar with the CAC 1.0 Guide. The CAC 1.0 Guide can be consulted for details of the history of the CAC project and of its preparation. If you are already familiar with the CAC 1.0 Guide, navigating this Guide will be easy, as we have kept the organisation of its chapters into three main units. The first unit, Chapter 2, describes the main characteristics of the Czech Academic Corpus 2.0, the structure of its annotations and the documentation of the partial steps of the syntactical annotations. The second unit, Ch
40. a summary of the main characteristics of the PML format; detailed information has been published in a technical report (Pajas, Štěpánek 2005). Part 3.2.1.2 contains a summary of the main characteristics of the CSTS format. For more detailed information, see the PDT 2.0 documentation [13].

3.2.1.1 The PML format

In the PML, the layers of annotation can overlap or be linked together, as well as with other data sources, in a consistent way. Each layer of annotation is described in a PML schema file, which can be seen as the formalisation of an abstract annotation scheme for that particular layer. The PML schema file describes which elements occur in the layer, how they are nested and structured, what the types of the attribute values are, and what role they play in the annotation scheme. This PML role information can also be used by applications to determine an adequate way of presenting a PML instance to the user. Other schemata (e.g. Relax NG [19]) can be generated automatically from the PML schema, which means that data consistency can be checked by common XML tools. Both versions of the schemata are available in the directory data/schemas.

An example of the w-layer part of the PML schema of the CAC can be found in Table 3.2 (data/schemas/wdata_schema.xml). In the illustrated example, the paragraph type para (the whole document in the case of the CAC 2.0) consists of an array of elements of the w-node type. This type is closely defin
41. apters 3 through 6, contains the CD-ROM information and the documentation of the data component, tools, bonus material and tutorials. Part 3.2 introduces the corpus as a data file with an inner representation. A considerable amount of information concerns the corpus viewing tools Bonito (part 3.3.1) and Netgraph (part 3.3.4), the annotation editors LAW (part 3.3.2) and TrEd (part 3.3.3), and the tools for morpho-syntactical processing of texts (part 3.3.5). Chapter 4 adds two bonuses: the STYX Czech electronic exercise book (part 4.1) and the TrEdVoice module for the voice control of TrEd (part 4.2). All the tools provided and their graphical interfaces are documented and accompanied by tutorials in the form of demos (see Chapter 5 for the complete list). Chapter 6 contains the installation instructions for the CD-ROM components. Chapter 7 summarises the information on the distribution of the CD-ROM.

Chapters 8 and 9 form the third unit of the Guide. They cover the personal and financial aspects of the project.

The Guide has five appendices. Appendix A enumerates the sources of the corpus texts. Appendix B describes the structure of lemmas, for easy orientation in the morphological annotations. Appendix C describes the structure of a morphological tag. Appendix D guides the user through the syntactical annotations. Appendix E completes the Guide with web links.

This CD is being published in the final year of the project Resources and Tool
42. ble A.1 Administrative documents

File, written form    File, transcription
a01w Vyhláška 100    a16s Zelená vlna
a02w Hospodaření s domovním bytovým majetkem    a17s Zprávy o počasí
a03w Pracovní řád    a18s Přehled rozhlasových pořadů
a04w Národní pojištění 12/1977    a19s Hlášení v metru
a05w Kolektivní smlouvy TIBA
a06w Materiál TIBA
a07w Zpráva o činnosti Ústavu pro jazyk český
a08w Metodické pokyny
a09w Zápisy z porad
a10w Závazky
a11w Zápisy ze schůzí
a12w Pokyny SURPMO
a13w Pracovní návody, pokyny
a14w Oběžníky Ústavu pro jazyk český
a15w Zpráva o činnosti oddělení matematické lingvistiky
a20w Hlášení v obchodním domě

Table A.2 Documents covering journalism

File, written form    File, transcription
n01w Rudé právo    n53s Rozhlasové reportáže a rozhovory
n02w Svět práce    n54s Televizní komentáře
n03w Práce    n55s Zprávy s rozhlasu
n04w Československý rozhlas I    n56s Televizní diskuse
n05w Mladá fronta    n57s Televizní zprávy a reportáže
n06w Československý rozhlas II    n58s Rozhlasové diskuse
n07w Večerní Praha    n59s Televizní zprávy a lekce
n08w Československý sport    n60s Televizní diskuse a komentáře
n09w Svobodné slovo
n10w Lidová demokracie
n11w Obrana lidu
43. cation P3 style flag stylistical classification K comment explanatory note derivational comments other comments Table B 2 Morpho syntactic flags of the lemmas Value Description B abbreviation T imperfect verb W perfect verb Table B 3 Semantic flags of the lemmas Value Description member of a particular nation inhabitant of a particular territory geographical name chemistry company organization institution natural sciences product surname family name medicine given name S KC U RT RFE an economy finances o computers and electronics technology in general oo justice B other proper name color indication o politcs government military culture education arts other sciences sports hobby leisure travelling Nl Is e we ecology environment 42 Description of lemmas Table B 4 Style flags of the lemmas Value Description a archaic e expressive h colloquial slang argot n dialect S bookish t foreign word v vulgar x outdated spellimg or misspeling Table B 5 Examples of lemmas Lemma Additional info Description Abchaz Abkhazian _ E member of a particular nation Agned JY t given name foreign word dobromysl oregano L
44. context. Both contexts offer only a single display style, PML_A. To view the list of all defined macros and the hotkeys assigned to them for the currently used context, choose View -> List of Named Macros from the menu.

Corpus viewer Netgraph

Netgraph [35] is a client-server application for searching through and viewing the CAC 2.0. Several users can view the corpus online at the same time. Netgraph has been designed for simple and intuitive searching while maintaining the high search power of the query language.

A query in Netgraph is formulated as a node or tree with defined characteristics that should match the required trees in the corpus. Searching the corpus therefore means searching for sentences (annotated in the form of trees) that contain the given node or tree. The user's queries can range from the very simple (e.g. searching for all trees in the corpus containing a desired word) to the more advanced (e.g. searching for all sentences containing a verb with a dependent object, where the object is not in the dative and there is at least one dependent adverbial of direction). So-called meta-attributes enable searching for even more complex structures.

The Netgraph tool offers a user-friendly graphical interface for query formulation; see Figure 3.5 for an example. This simple query searches for all the trees containing a node marked as the predicate that has at least two dependent
45. e different phenomena between the annotation stages.

The automatic verification scripts were inspired by the scripts used in the PDT 2.0 preparations, as was the case for the morphological annotations. The scripts marked suspicious positions in the data. The relations of the nodes on the analytical layer were checked for their grammatical permissibility, and the possible combinations of the morphological tag and the analytical function of each node were checked as well. In the next stage, the marked suspicious positions were highlighted and a brief description of the possible problem was displayed on the annotator's screen. The problem could occur either in the morphological or in the analytical annotation. All of the verifications conformed to the rules of the PDT analytical annotation [18].

As an example of a script verifying the analytical and morphological annotation together, we describe the script that checks the annotation of the word form "se". For each node with the word form "se", the script checked the following condition: the node is either a reflexive pronoun with the analytical function AuxT or AuxR, or a vocalised preposition with the analytical function AuxP.

Other scripts reviewed the agreement of morphological tag categories, the permissibility of the combination of the analytical functions of the governing and dependent nodes (e.g. the preposition and its dependent noun), or the permissibility of the position of a node marked as subject (Subj
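The condition checked for the word form "se" can be sketched as follows. This is an illustration of the rule as stated, not the script used by the authors. The node representation (a dictionary with the keys form, tag and afun), the function names and the sample tags are assumptions; the part of speech is read from the first tag position only, with P standing for a pronoun and R for a preposition, so the sketch does not verify that the pronoun is reflexive or that the preposition is vocalised.

```python
PRONOUN_FUNCTIONS = {"AuxT", "AuxR"}
PREPOSITION_FUNCTIONS = {"AuxP"}


def passes_se_rule(node):
    """Return True if the node is not 'se' or satisfies the rule for 'se'."""
    if node["form"].lower() != "se":
        return True
    part_of_speech = node["tag"][:1]
    if part_of_speech == "P":  # pronoun
        return node["afun"] in PRONOUN_FUNCTIONS
    if part_of_speech == "R":  # preposition
        return node["afun"] in PREPOSITION_FUNCTIONS
    return False


def suspicious_nodes(nodes):
    """Collect the nodes that an annotator should review."""
    return [node for node in nodes if not passes_se_rule(node)]


# Hypothetical nodes; the tag strings are illustrative only.
sample = [
    {"form": "se", "tag": "P7-X4----------", "afun": "AuxT"},
    {"form": "se", "tag": "RV--7----------", "afun": "AuxP"},
    {"form": "se", "tag": "P7-X4----------", "afun": "Obj"},
    {"form": "boj", "tag": "NNIS1-----A----", "afun": "Sb"},
]
flagged = suspicious_nodes(sample)
print(flagged)
```

As in the workflow described above, the sketch only marks suspicious positions; deciding whether the tag or the analytical function is wrong is left to the annotator.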
46. ed as a structure that contains, among others, the obligatory members id (an unambiguous identifier with the role of ID) and token (the word unit).

Table 3.2 The PML schema of the w-layer in the CAC 2.0

    <type name="w-para.type">
      <sequence>
        <element name="w" type="w-node.type" />
      </sequence>
    </type>
    <type name="w-node.type">
      <structure name="w-node">
        <member as_attribute="1" name="id" role="#ID" required="1">
          <cdata format="ID" />
        </member>
        <member name="token" required="1">
          <cdata format="any" />
        </member>
        <member name="no_space_after" type="bool.type" />
      </structure>
    </type>

Every PML instance begins with a header referring to the PML schema. The header contains references to all external sources that are referred to from this instance, together with some additional information necessary for the correct resolution of the links. The rest of the instance is dedicated to the annotation itself. Table 3.3 shows the part of the header of the m-layer instance n01w.m that refers to the PML schema (mdata_schema.xml) and to the corresponding w-layer instance (n01w.w).

Table 3.3 Part of the header of the m-layer instance n01w.m

    <head>
      <schema href="mdata_schema.xml" />
      <references>
        <reffile id="w" href="n01w.w" name="wdata" />
      </references>
    </head>

Table 3
47. ed namespace http://ufal.mff.cuni.cz/pdt/pml/ (this is not a real link; it is just the name of the namespace).

The PML format offers unified representations for the most common annotation constructs, such as attribute-value structures, lists of alternative values of a certain type (either atomic or further structured), references within a PML instance, links among various PML instances (used in the CAC 2.0 to create links across layers) and links to other external XML-based resources.

3.2.1.2 CSTS format

A single file in the CSTS format can contain all layers of annotation. A CSTS format file opens with an optional header (element h), followed by at least one doc element. The element doc consists of a header (element a) and contents (element c). The element c is formed by a sequence of paragraphs (element p) and the sentences of those paragraphs (element s). Each word token of the sentence is placed on a separate line in the file (element f, or d for punctuation). The line continues with the annotations of this word token on all layers. The element l is filled with the lemma, and the element t contains its morphological tag. The element A is filled with the analytical function of the word token. The unique identifier of the word token in the sentence is stored in the element r. The element g contains a link to the governing node of the word, in the form of the identifier of that governing node.
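The one-token-per-line layout described above can be read with a small amount of code. The sketch below is a simplified illustration, not a CSTS parser: the sample line is hypothetical, real CSTS files may carry attributes and further elements that are ignored or not handled here, and the function name is an assumption. It relies only on the element names given in the text (f or d, l, t, A, r, g).

```python
import re

# An opening tag, optionally with attributes, followed by its text content.
ELEMENT = re.compile(r"<([A-Za-z]+)[^>]*>([^<]*)")


def parse_token_line(line):
    """Return {element name: text} for one token line; the first occurrence wins."""
    fields = {}
    for name, text in ELEMENT.findall(line):
        fields.setdefault(name, text.strip())
    return fields


# Hypothetical token line modelled on the description in the text.
sample_line = "<f>Prahu<l>Praha<t>NNFS4-----A----<A>Obj<r>3<g>2"
token = parse_token_line(sample_line)
print(token)
```

With the fields in a dictionary, the dependency tree of a sentence can be rebuilt by linking each token's r value to the g value of its dependents.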
48. em that consists of a large number of hotkeys is also hard for the user to remember. One way to relieve the user of these complications is a voice control system, which is still quite rarely used in application programs. That is why we have developed the TrEdVoice module (Přikryl 2007). The purpose of this module was not to provide complete voice control of all TrEd functions, allowing TrEd to be fully controlled without the keyboard and mouse. Rather, it is a useful accessory extending the original control possibilities (menus, hotkeys and mouse). Figure 4.3 shows the main TrEd screen with voice control enabled.

Voice commands are recognised by the automatic speech recognition module (the so-called ASR module) created by the team of the Department of Cybernetics of the University of West Bohemia in Plzeň [6] (Müller, Psutka, Šmídl 2000). The ASR module is not built into TrEdVoice; it runs independently as an ASR server and communicates with TrEdVoice over the TCP/IP network protocol. The ASR module is based on statistical methods and is speaker independent, which means that it can recognise the voice of an arbitrary speaker. For more details on voice recognition, see (Psutka, Müller, Matoušek, Radová 2006).

Figure 4.3 The TrEd editor screen with the TrEdVoice module enabled
49. en in a scientific style. The file links to the s17w.m and s17w.w files; the file s17w.m links to the s17w.w file. The name s17w.csts denotes a CSTS file containing all layers (w-layer, m-layer, a-layer) of annotation of a document written in a scientific style.

Data size

The CAC 2.0 is composed of 180 manually annotated documents containing 31,707 sentences and 652,132 tokens (as calculated from the m-files). Tokens without punctuation total 570,761, and tokens without punctuation and digit tokens total 565,928. Table 3.8 states the sizes of the individual parts of the data according to style and form.

Table 3.8 Size of the CAC 2.0 parts according to style and form
(word token counts: all word tokens; without punctuation; without punctuation and digit tokens)

Style           Form                       Docs  Sentences  Word tokens  W/o punct.  W/o punct. and digits
Journalism      Written                      52     10,234      189,435     165,469                163,700
Journalism      Transcription                 8      1,433       28,737      24,864                 24,859
Scientific      Written                      68     11,113      245,175     216,281                214,132
Scientific      Transcription                32      4,576      115,853     100,281                100,272
Administrative  Written                      16      3,362       58,697      51,431                 50,530
Administrative  Transcription                 4        989       14,235      12,435                 12,435
Total           Written                     136     24,709      493,307     433,181                428,362
Total           Transcription                44      6,998      158,825     137,580                137,566
Total           Written and transcription   180     31,707      652,132     570,761                565,928

Table 3.9 contains separate quantitative data for the characters
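The totals in Table 3.8 can be re-derived from the six per-style rows, which is a convenient consistency check when the figures are copied elsewhere. The following sketch uses only the numbers printed in the table; the data layout and the function name are assumptions made for illustration.

```python
# (documents, sentences, word tokens, tokens w/o punctuation,
#  tokens w/o punctuation and digit tokens)
ROWS = {
    ("Journalism", "Written"): (52, 10234, 189435, 165469, 163700),
    ("Journalism", "Transcription"): (8, 1433, 28737, 24864, 24859),
    ("Scientific", "Written"): (68, 11113, 245175, 216281, 214132),
    ("Scientific", "Transcription"): (32, 4576, 115853, 100281, 100272),
    ("Administrative", "Written"): (16, 3362, 58697, 51431, 50530),
    ("Administrative", "Transcription"): (4, 989, 14235, 12435, 12435),
}


def column_sums(form=None):
    """Sum every column over all rows, or only over rows of the given form."""
    selected = [
        values for (_style, row_form), values in ROWS.items()
        if form is None or row_form == form
    ]
    return tuple(sum(column) for column in zip(*selected))


written = column_sums("Written")
transcription = column_sums("Transcription")
overall = column_sums()
print(written, transcription, overall)
```

The sums agree with the three Total rows of the table, including 137,580 transcribed word tokens without punctuation.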
50. es for viewing the syntactical annotations; another context might enable changing the annotations (e.g. the PML_A_Edit context allows for editing the annotations). To change the context, click on the current context name and choose another context from the pop-up list.
5. Current display style. The display style can be changed in the same way as the context.
6. Editing the display style.
7. Viewing the list of all sentences in the open file.
8. Buttons for opening, saving and re-opening a file.
9. Buttons for moving to the previous or following tree in the open file and for window management.

Figure 3.4 TrEd: Main screen
[Screenshot: TrEd in the PML_A_View context showing the dependency tree of the sentence "Problémy motivace jsou tak staré jako lidstvo" from the file s02w.a.]

The CAC 2.0 files open in the PML_A_View context by default. In this context the user can view the trees, and editing is disabled. If you wish to edit the trees, switch to the PML_A_Edit
51. f.cuni.cz
[15] Prague Markup Language (an internal data format based on XML): http://ufal.mff.cuni.cz/jazz/pml
[16] Prague Dependency Treebank: http://ufal.mff.cuni.cz/pdt
[17] Manual for Morphological Annotation of PDT: http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/m-layer/html/index.html
[18] Manual for Analytical Annotation of PDT: http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/a-layer/html/index.html
[19] Relax NG (XML scheme): http://www.relaxng.org
[20] SGML: http://www.w3.org/MarkUp/SGML
[21] Slovak National Corpus: http://korpus.juls.savba.sk/index.en.html
[22] Tutorial on the a-layer: http://lectures.ms.mff.cuni.cz/video/recordshow/index/17/29
[23] Tutorial on the m-layer: http://lectures.ms.mff.cuni.cz/video/recordshow/index/17/28
[24] Tutorial on Bonito: http://lectures.ms.mff.cuni.cz/video/recordshow/index/2/24
[25] Tutorial on LAW: http://lectures.ms.mff.cuni.cz/video/recordshow/index/2/22
[26] Tutorial on Netgraph: http://lectures.ms.mff.cuni.cz/video/recordshow/index/2/25
[27] Tutorial on PML format: http://lectures.ms.mff.cuni.cz/video/recordshow/index/17/34
[28] Tutorial on the Prague Dependency Treebanks (Prague Treebanking for Everyone): http://lectures.ms.mff.cuni.cz/video/categoryshow/index/1
[29] Tutorial on STYX: http://lectures.ms.mff.cuni.cz/video/recordshow/index/2
52. for the Bonito tool is included in the application itself and can be launched from the main Help menu. Figure 3.1 illustrates the Bonito main screen. The use of the tool is demonstrated in the following examples.

Figure 3.1 Bonito: Main screen
[Screenshot: Bonito main window (menus Manager, Corpus, Query, Concordance, View, Select, Help) with a "New query" field and a concordance listing for forms of "jaro" and "jarní".]
53. he morphological annotation; extensive semi-automatic checks had already been run during the CAC 1.0 preparations. These checks were motivated by similar processes used during the building of the Prague Dependency Treebank 2.0. Detailed descriptions can be found in the CAC 1.0 Guide. Automatic scripts went through the corpus and marked suspicious positions; the annotators then checked the marked sentences and corrected them where needed. The main point of this work was to ensure that the morphological categories of the original tag in the CAC matched those of the positional morphological tag in the CAC 1.0. For example, for the case category of nouns the scripts marked 1,258 suspicious tags; the annotator found 332 of them to be wrong and corrected them. For the case category of adjectives there were 177 suspicious instances, of which the annotator corrected 41. All of the verifications conformed to the rules of the PDT morphological annotation [17].

2.4.2 On the road to the CAC 2.0: Syntactical annotation

The analytical annotation of the corpus raised the question of how to map the original annotation to the annotation style of the Prague Dependency Treebank. Based on our experience with the morphological annotation, we split this question into three sub-questions: Automatically? Semi-automatically? Manually? The article by Ribarov, Bémová and Hladká (2006) describes our search for the answers in detail. The authors have reached
54. ion of user's parsing accuracy. It is important to stress that the academic notion of Czech syntax presented in the PDT 2.0 differs in some ways from the concepts traditionally taught in the school system. These differences are closely documented (Kučera 2006).

Each exercise processes an arbitrary number of sentences. Each word in the sentence is analysed morphologically, and the entire sentence is then parsed, which includes determining the constituents of the sentence. To avoid overloading the user, only a small subset of the 11,000 sentences (50 sentences) is available on the CD-ROM; see bonus-tracks/STYX/sample.styx.

The steps for using STYX are illustrated in Figure 4.1. First, the user selects the part of speech of each word and then determines the morphological analysis and the appropriate morphological categories (upper part of the right window). At the beginning of the parsing, the word nodes are placed side by side, and each node is removed once it has been successfully parsed. The next step is determining the constituents of the sentence, including the basic clause elements: predicate and subject.

Figure 4.2 demonstrates the parsing evaluation process. In our example, the user analysed the word předměty (E. subjects) morphologically correctly; the syntax and analytical function analysis is correct as well. The top tree has been constructed by the user; the lower tree serves for evaluation pur
55. [Figure 2.4. Technical interconnection of the w-layer and m-layer: The division of a word token. Document n46w, sentence No. 227: pedagogicko-psychologické poradny (E. pedagogic-psychological service).]

The interconnection between the a-layer and the m-layer means that each m-layer word unit corresponds to exactly one node of the dependency tree on the a-layer and vice versa. The only exception is the technical root, which has no counterpart on the m-layer. Figure 2.1 illustrates the interconnection described above.

2.4 The project's progress

The Czech Academic Corpus project has a long history, which we have described in detail in the article (Hladká, Králík 2006). We will not retrace here the long journey of the CAC leading to its first published version; the CAC 1.0 Guide (Vidová Hladká et al. 2007) contains all of that information. Here we would like to summarise the process of building up the layers of the second version of the CAC.

2.4.1 On the road to the CAC 2.0: Morphological annotation

The data preparation of the CAC 2.0 involved further semi-automatic checks of t
56. ive texts. Next, the file name specifies a two-digit ordinal number of the document within the group of documents of the same style. Following this two-digit number, a letter indicates whether the text is derived from a written text (letter w) or is a transcript of spoken language (letter s). The file names of the documents are included in the identifiers of sentences and of the elements in these sentences, e.g. <m id="m-n01w-s1W1"> in Table 3.5. See Appendix A for the file names of all documents.

Example: Instances named according to the template a[0-9][0-9]s contain transcripts of spoken language in the administrative style.

In the PML format, the file extension indicates the annotation layer of the document. The extension of w-layer files is .w, .m denotes the m-layer and .a denotes the a-layer. Below they are referred to as w-files, m-files and a-files. Each a-file corresponds to exactly one m-file and one w-file. Each a-file contains links to the corresponding m-file and w-file, and each m-file contains links to the corresponding w-file (see above). Because of this dependency, it is critical that the files not be renamed. There are no links from w-files to m-files or a-files, and no links from m-files to a-files.

In the CSTS format, all files have the extension .csts.

Example: The code s17w.a defines a PML instance containing the a-layer annotations of a document writt
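The naming rules above can be sketched as a small parser. This is a minimal illustration, not part of the CAC tools: the function name `parse_pml_name` and the `CacFile` record are hypothetical, and of the style letters only "a" (administrative) is stated in the text; the meanings given for "n" and "s" are assumptions made for this sketch.

```python
import re
from typing import NamedTuple

# Style letters: "a" (administrative) is stated in the text; the meanings of
# "n" and "s" are assumptions made for this sketch.
STYLES = {"a": "administrative", "n": "journalistic", "s": "scientific"}
FORMS = {"w": "written", "s": "spoken"}                    # third part of the name
LAYERS = {"w": "w-layer", "m": "m-layer", "a": "a-layer"}  # PML file extension

# style letter, two-digit ordinal number, written/spoken letter, layer extension
NAME_RE = re.compile(r"([ans])([0-9]{2})([ws])\.([wma])")


class CacFile(NamedTuple):
    style: str
    number: int
    form: str
    layer: str


def parse_pml_name(name: str) -> CacFile:
    """Split a PML file name such as 's17w.a' into its components."""
    match = NAME_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not a CAC PML file name: {name!r}")
    style, number, form, layer = match.groups()
    return CacFile(STYLES[style], int(number), FORMS[form], LAYERS[layer])
```

For instance, `parse_pml_name("s17w.a")` yields a record for document number 17, written form, a-layer. Names with the `.csts` extension are deliberately rejected here, because in that format the extension does not encode a layer.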
57. Figure 3.1 description:

1. Main menu
2. Corpus selection button
3. Query line
4. Main window displaying query results
5. Column of the appropriate query results
6. Concordance lines
7. Selected concordance lines
8. Window displaying query history and broader context
9. Status line

Bonito makes it possible to run the Czech morphological analyser directly through the menu Manager, Morphology. This command opens a new window, which the user can keep open while working with the corpus tool. It can be used to run morphological analysis or synthesis (generation). The morphological analysis of a given word lists all possible lemmas and tags corresponding to the entered word form. If synthesis is selected, the tool generates all word forms that can be generated from the given lemma, together with the corresponding tags. See Figure 3.2.

[Figure 3.2. Bonito: Running the morphological analyser]

The Bonito tutorial in the tutorials directory contains more detailed information on how to master Bonito.

3.3.2 LAW: Editor for morphological annotation

The Lexical Annotation Workbench (LAW) [33] is an i
58. m-file, CSTS (morphological analysis output) | PML m-file, CSTS
P | Parsing | PML m-file, CSTS | PML a-file, CSTS

Example: Let us have a look at the analysis of the sentence Fantastickým finišem si však Neumannová doběhla pro vytoužené olympijské zlato. (E. Neumannová powered down the final straight to win the longed-for gold.) The results of the morphological analysis (run by the command tool_chain -tA) and of the tagging (run by the command tool_chain -T) are summarised in Table 3.13. If more than one lemma exists for a given word form (e.g. the word form si is analysed either as the verb být, to be, or as the reflexive particle se), the possibilities are separated by the pipe symbol (|). To spare the reader from searching for errors made by the tagger, we confirm that there are no errors in this output.

Figure 3.7 shows the parsing result (parsing run by the command tool_chain -P). Each node of the tree displays a word form, the disambiguated lemma, the disambiguated morphological tag and the analytic function. To spare the reader from searching for errors made by the parser, we confirm that there are no errors in this output.

Table 3.13. An example of text treated with morphological analysis and tagging

Text: Fantastickým
Morphological analysis: fantastický AAFP3----1A----|AAIP3----1A----|AAIS6----1A
Tagging: fantastický AAIS7----1A----
59. matical differences into account, the final accuracy will be 93.1 percent.

After tagging, the next step of text processing is parsing. The parsing procedure assigns each word in the sentence its syntactical dependency on another word along with its analytical function. The program carrying out the parsing is called a parser. The parser included on the CD-ROM is based on the same methodology as the tagger. The input of the parser is a text consisting of words, each labelled by a single pair (lemma, tag). The output is, for each sentence, a tree structure labelled by analytical functions. The parser has been trained on the PDT 2.0 training data, and its accuracy on the CAC 2.0 data is TBA.

The script tool_chain is provided for the user's convenience. It uses simple switches to run the required tool; the switches are documented in Table 3.12. Concatenating several switches runs several tools in sequence.

Example: The following command morphologically analyses raw text: tool_chain -tA

Note: When working with files in the PML format, the directory containing the input file of the tool_chain script must contain all files linked from the processed file. If an m-file serves as input, it has to be accompanied by the corresponding w-file.

Table 3.12. Script tool_chain

Parameter | Processing type | Input file format | Output file format
t | Tokenisation | Raw text | CSTS
A | Morphological analysis | CSTS | PML m-file, CSTS
T | Tagging | PML
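The way concatenated switches map to a sequence of processing steps can be sketched as follows. This is only an illustration of the switch letters documented for the script (t, A, T, P); it does not call the real script, the function name `describe_switches` is hypothetical, and whether the real script accepts switches in any order is not stated in the guide.

```python
from typing import List

# Switch letters and processing types as documented for the tool_chain script.
STAGES = {
    "t": "Tokenisation",
    "A": "Morphological analysis",
    "T": "Tagging",
    "P": "Parsing",
}


def describe_switches(switches: str) -> List[str]:
    """Return the processing steps named by a concatenated switch string."""
    unknown = [letter for letter in switches if letter not in STAGES]
    if unknown:
        raise ValueError(f"unknown switch(es): {''.join(unknown)}")
    # Steps are listed in the order in which the switches were given.
    return [STAGES[letter] for letter in switches]
```

With this helper, `describe_switches("tA")` lists tokenisation followed by morphological analysis, which corresponds to the example command above.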
60. ments covering the scientific field
B.1 Additional information of the lemmas
B.2 Morpho-syntactic flags of the lemmas
B.3 Semantic flags of the lemmas
B.4 Style flags of the lemmas
B.5 Examples of lemmas
C.1 Part of speech
C.2 Sub-part of speech
C.3 Gender
C.4 Number
C.5 Case
C.6 Possessive gender
C.7 Possessive number
C.8 Person
C.9 Tense
C.10 Grade
C.11 Negation
C.12 Voice
C.13 Reserve 1
C.14 Reserve 2
C.15 Variant
61. n of the w-layer and m-layer: No changes other than the final sentence punctuation
2.3 Technical interconnection of the w-layer and m-layer: The insertion of a word token
2.4 Technical interconnection of the w-layer and m-layer: The division of a word token
2.5 CAC 2.0 preparation: data processing
3.1 Bonito: Main screen
3.2 Bonito: Running the morphological analyser
3.3 LAW: Main screen
3.4 TrEd: Main screen
3.5 Netgraph: Query formulation
3.6 Netgraph: Query result
3.7 An example of sentence parsing
4.1 STYX: Exercises
4.2 STYX: Exercise evaluation
4.3 The TrEd editor screen with the TrEdVoice module enabled

List of Tables

2.1 Examples of lemmas and tags of particular word forms
3.1 CAC 2.0 CD-ROM: Directory structure
62. nd 136 a-files. The spoken texts (transcriptions) have not been annotated on the a-layer: the guidelines for the syntactical annotation of written texts cannot be applied to the annotation of spoken texts. The directory data/csts contains 180 files with the same data in the CSTS format.

With regard to the aim of integrating the CAC into the PDT, Table 3.10 compares the basic characteristics of both corpora. The CAC 2.0 will be integrated into the PDT when the next version of the PDT is published.

Table 3.10. A comparison of the CAC 2.0 and the PDT 2.0 (all numbers in thousands)

Characteristics | PDT 2.0 words | PDT 2.0 sentences | CAC 2.0 words | CAC 2.0 sentences
Morphological annotation | 2,000 | 116 | 650 | 32
Analytical annotation | 1,500 | 88 | 488 | 24
Written form | 2,000 | 116 | 488 | 24
Transcriptions | - | - | 162 | 8
Journalistic style | 1,620 | 94 | 214 | 11
Administrative style | - | - | 71 | 3
Scientific style | 380 | 22 | 365 | 18

3.3 Tools

We provide a whole range of tools for data annotation, annotation correction, searching within the annotated data and automatic data processing. Since the CAC 2.0 is annotated on the m-layer and the a-layer, we provide tools for working with the CAC and other data on these two layers. Table 3.11 gives an overview of the tools contained on this CD
63. nnecting main clauses, not subordinate) | J (conjunction)

Value | Description | POS
4 | Relative/interrogative pronoun with adjectival declension of both types, soft and hard (jaký, který, čí, ..., lit. what, which, whose, ...) | P (pronoun)
5 | The pronoun he in forms requested after any preposition, with prefix n- (něj, něho, ..., lit. him in various cases) | P (pronoun)
6 | Reflexive pronoun se in long forms (sebe, sobě, sebou, lit. myself / yourself / herself / himself in various cases; se is personless) | P (pronoun)
7 | Reflexive pronouns se (C.5 = 4), si (C.5 = 3), plus the same two forms with contracted -s: ses, sis (distinguished by C.8 = 2; also number is singular only). This should be done somehow more consistently; virtually any word can have this contracted -s (cos, polívkus, ...) | P (pronoun)
8 | Possessive reflexive pronoun svůj (lit. my / your / her / his when the possessor is the subject of the sentence) | P (pronoun)
9 | Relative pronoun jenž, již, ... after a preposition (n-: něhož, niž, ..., lit. who) | P (pronoun)
A | Adjective, general | A (adjective)
B | Verb, present or future form | V (verb)
C | Adjective, nominal (short, participial) form (rád, schopen, ...) | A (adjective)
D | Pronoun, demonstrative (ten, onen, ..., lit. this, that, that over | P
64. nodes marked as subject and object. The order of these dependent nodes is not specified in the query.

[Figure 3.5. Netgraph: Query formulation. The query tree has a node with afun=Pred and two dependent nodes with afun=Sb and afun=Obj.]

The tree in Figure 3.6 could be one of the results the server returns.

[Figure 3.6. Netgraph: Query result]
65. notation of the sentence Obecná odpověď na tuto otázku je sotva možná. (Lit.: A general response to this question is hardly possible.) Each word unit (word, number, punctuation mark) is represented by a single node in the resulting tree. Note that, for technical reasons, each tree is rooted by one extra node; the tree in our example therefore consists of 9 nodes.

The annotation approach builds on the tradition of the Prague linguistic school, where the predicate (usually a verb) is understood to be the centre of the sentence. Therefore the predicate is placed as a direct daughter of the root. The final punctuation is also placed as a daughter of the root node. Two constituents of the sentence depend on the predicate: odpověď (answer) and možná (possible). Please note that each node in the tree is annotated with the word form, lemma, morphological tag and analytic function. Looking at the node representing the word odpověď (answer), we can see that its form is a feminine noun in the nominative singular and that this unit has the role of the subject of the sentence, which is expressed by the analytic function Sb.

[Figure 2.1. Example of an a-layer annotation]
66. ntegrated environment for morphological annotation. It supports simple morphological annotation (assigning a lemma and tag to a word), the comparison of different annotations of the same text, and searching for a particular word, tag, etc. The workbench runs on all operating systems supporting Java, including Windows and Linux. It is an open system, extensible via external modules, e.g. for different data views, import/export filters and assistants. The LAW editor supports the PML [15], CSTS [13] and TNT [38] formats.

3.3.2.1 Major components

The application consists of three major components, as shown in Figure 3.3.

[Figure 3.3. LAW: Main screen]
67. of dr. Václav Holzknecht

s04w Jak rozumíme chemickým vzorcům a rovnicím | s72s Zpráva o cestě do Belgie, PhDr. Marie Těšitelová, DrSc.
s05w Konflikty mezi lidmi | s73s Obecné otázky jazykové kultury
s06w Škoda 1000 | s74s Provozní kontrola potrubí
s07w Pražský vodovod | s75s Modelování diod
s08w Nauka o materiálu | s76s Přenosové parametry
s09w Tranzistory řízené elektrickým polem | s77s O počtu koster jednoho grafu
s10w Pro půvab a eleganci | s78s Streptokoky
s11w Tisíciletý vývoj architektury | s79s Statické zajištění domu U Rytířů
s12w Polovodičová technika | s80s Problémy aerodynamiky závodních vozů
s13w Plazma, čtvrté skupenství hmoty | s81s Schůze vědecké rady ČSTV
s14w Nadhodnota a její formy | s82s Plenární schůze ROH. Pauzy v h n
s15w Určování efektivnosti za socialismu | s83s Seminář o houbách
s16w Stažlivost myokardu | s84s Česká filharmonie hraje a hovoří, Václav Neumann
s17w K biologickým a psychologickým zřetelům výchovy | s85s Seminář o fotografii
s18w Poetika | s86s Působení hromadných sdělovacích prostředků
s19w Slovo a slovesnost 4/1973 | s87s Ochrany v průmyslových závodech
s20w Sociologický časopis 3/1973 | s88s Práce se čtenářem
s21w Teorie a empirie | s8
68. ons
TrEd: Editor of syntactical annotations, including the TrEdVoice module for voice control
Netgraph: Corpus viewing and searching tool
tool_chain: Tools for the automatic processing of Czech texts
tool_chain: Script running the tokenisation and/or morphological analysis and/or tagging and/or parsing
tutorials: Tutorials for the data and the tools

3.2 Data

This section describes the inner representation of the files themselves, the rules used to name the files, and the organisation of the CAC 2.0 corpus into files.

3.2.1 Data formats

We used the Prague Markup Language (PML) as the main data format. The PML is a generic XML-based [31] data format designed for the representation of rich linguistic annotation of text. Each of the annotation layers is represented by a single PML instance. The PML was developed concurrently with the annotation of the PDT 2.0.

A secondary data format used in the CAC 2.0 is a format named CSTS. This is an SGML-based [20] format used in the PDT 1.0 annotation and also in the Czech National Corpus [14]. We use this secondary format for the CAC 2.0 because it is easier for humans to read, it is easy to process with simple tools, and some of the tools developed for the CAC 2.0 can only work with the CSTS format. A conversion tool for these two formats is also available. In the following chapter you will find
69. or words containing digits are specially marked. Upper-case words and words beginning with an upper-case letter are marked with special tags too. The resulting vertical column in the CSTS format serves as the input for further processing.

The morphological analysis evaluates individual word forms and determines the lemmas as well as the possible morphological interpretations of each word form. The morphological analysis is based on a morphological dictionary containing part-of-speech information on Czech word forms. Each word form is assigned a morphological tag describing its morphological characteristics. The morphological dictionary used for the analysis contains additional information for many lemmas: style, semantics or derivational information. The lemmas of abbreviations are often enriched by comments referring to the explanatory text in Appendix B.

Due to the high homonymy of the Czech language, most word forms can be assigned several morphological tags or even several lemmas. For example, the word form pekla has two lemmas: the noun peklo (hell) and the verb péci (to bake). Both lemmas generate several tags for the given word form. The morphological analysis compares the word forms from the whole corpus with the word forms contained in the morphological dictionary. If they match, the corresponding lemmas and tags are assigned to the given word form. Therefore a set of pair
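The dictionary lookup described above can be sketched with a toy example. This is a minimal illustration only: the dictionary content, the function names `analyse` and `lemmas`, and the particular tags listed for pekla are assumptions chosen for this sketch, not the actual output of the analyser shipped on the CD-ROM.

```python
from typing import Dict, List, Tuple

# Toy morphological dictionary: word form -> list of (lemma, positional tag).
# The entries are illustrative; the real dictionary is far larger and may
# list different or additional tags for this word form.
TOY_DICTIONARY: Dict[str, List[Tuple[str, str]]] = {
    "pekla": [
        ("peklo", "NNNS2-----A----"),  # noun, genitive singular
        ("peklo", "NNNP1-----A----"),  # noun, nominative plural
        ("peklo", "NNNP4-----A----"),  # noun, accusative plural
        ("péci", "VpQW---XR-AA---"),   # verb, past participle
    ],
}


def analyse(form: str) -> List[Tuple[str, str]]:
    """Return every (lemma, tag) pair the dictionary offers for a word form."""
    return TOY_DICTIONARY.get(form.lower(), [])


def lemmas(form: str) -> List[str]:
    """Return the distinct lemmas of a word form, sorted by code point."""
    return sorted({lemma for lemma, _tag in analyse(form)})
```

A word form missing from the dictionary simply yields an empty list here; choosing one pair from a non-empty list is the job of the tagging step that follows the analysis.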
70. ord contains the sequential number of the node in the tree in left-to-right order. This number is equal to the word order in the sentence.

Table 3.6. An example of sentence a-layer annotation in the PML format

<LM id="a-n01w-s14">
  <s.rf>m#m-n01w-s14</s.rf>
  <afun>AuxS</afun>
  <ord>0</ord>
  <children>
    <LM id="a-n01w-s14W3">
      <afun>Pred</afun>
      <m.rf>m#m-n01w-s14W3</m.rf>
      <ord>3</ord>
      <children>
        <LM id="a-n01w-s14W2">
          <afun>Sb</afun>
          <m.rf>m#m-n01w-s14W2</m.rf>
          <ord>2</ord>
          <children id="a-n01w-s14W1">
            <afun>Atr</afun>
            <m.rf>m#m-n01w-s14W1</m.rf>
            <ord>1</ord>
          </children>
        </LM>
        <LM id="a-n01w-s14W6">
          <afun>Pnom</afun>
          <m.rf>m#m-n01w-s14W6</m.rf>
          <ord>6</ord>
          <children id="a-n01w-s14W5">
            <afun>Atr</afun>
            <m.rf>m#m-n01w-s14W5</m.rf>
            <ord>5</ord>
            <children id="a-n01w-s14W4">
              <afun>AuxZ</afun>
              <m.rf>m#m-n01w-s14W4</m.rf>
              <ord>4</ord>
            </children>
          </children>
        </LM>
      </children>
    </LM>
    <LM id="a-n01w-s14W7">
      <afun>AuxK</afun>
      <m.rf>m#m-n01w-s14W7</m.rf>
      <ord>7</ord>
    </LM>
  </children>
</LM>

XML elements of a PML instance occupy a dedicat
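A tree stored this way can be read with a short recursive walk. The sketch below uses Python's standard xml.etree.ElementTree on a shortened fragment modelled on the table above; the function names are hypothetical, and the sample omits the XML namespace and the m.rf links. Real PML files declare a namespace, so the tag names would need a namespace prefix there. The sketch also assumes, following the table, that a list with a single member may drop the LM wrapper, so that the children element itself carries the id.

```python
import xml.etree.ElementTree as ET

# Shortened sample without namespace and without m.rf links (assumption:
# real files need namespace-aware tag names).
SAMPLE = """
<LM id="a-n01w-s14">
  <afun>AuxS</afun>
  <ord>0</ord>
  <children>
    <LM id="a-n01w-s14W3">
      <afun>Pred</afun>
      <ord>3</ord>
      <children id="a-n01w-s14W2">
        <afun>Sb</afun>
        <ord>2</ord>
      </children>
    </LM>
    <LM id="a-n01w-s14W7">
      <afun>AuxK</afun>
      <ord>7</ord>
    </LM>
  </children>
</LM>
"""


def child_nodes(node):
    """Return the daughter nodes of a tree node."""
    children = node.find("children")
    if children is None:
        return []
    members = children.findall("LM")
    if members:
        return members
    # Singleton list: the <children> element itself is the only daughter.
    return [children] if children.get("id") is not None else []


def walk(node, parent_id=None):
    """Yield (ord, id, afun, parent id) for a node and all its descendants."""
    yield (int(node.findtext("ord")), node.get("id"), node.findtext("afun"), parent_id)
    for child in child_nodes(node):
        yield from walk(child, node.get("id"))


def nodes_in_word_order(xml_text):
    """Parse one tree and list its nodes sorted by the ord attribute."""
    return sorted(walk(ET.fromstring(xml_text)))
```

Sorting by ord restores the surface word order, since ord equals the position of the word in the sentence, while the last element of each tuple records the governing node.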
71. poses.

[Figure 4.1. STYX: Exercises]

[Figure 4.2. STYX: Exercise evaluation]

4.2 Voice control of the TrEd editor via the TrEdVoice module

The TrEd annotation editor is the essential annotation tool used to annotate the CAC 2.0 on the analytical layer (see Chapter 3.3.3). From the very beginning, TrEd was equipped with many complex functions and macros, and their number has increased over time. Most of the functions are assigned hotkeys, as it would be extremely time-consuming to call all the functions from the menu system each time. Nevertheless the syst
72. pronoun) there, ...)

Value | Description | POS
E | Relative pronoun což (corresponding to English which in subordinate clauses referring to a part of the preceding text) | P (pronoun)
F | Preposition, part of; never appears isolated, always in a phrase (nehledě (na), vzhledem (k), ..., lit. regardless, because of) | R (preposition)
G | Adjective derived from present transgressive form of a verb | A (adjective)
H | Personal pronoun, clitical (short) form (mě, mi, ti, mu, ...); these forms are used in the second position in a clause (lit. me, you, her, him), even though some of them (mě) might be regularly used anywhere as well | P (pronoun)
I | Interjections | I (interjection)
J | Relative pronoun jenž, již, ... not after a preposition (lit. who, whom) | P (pronoun)
K | Relative/interrogative pronoun kdo (lit. who), incl. forms with affixes -ž and -s (affixes are distinguished by the category C.15 for -ž and C.8 for -s) | P (pronoun)
L | Pronoun, indefinite: všechen, sám (lit. all, alone) | P (pronoun)
M | Adjective derived from verbal past transgressive form | A (adjective)
N | Noun (general) | N (noun)
O | Pronoun svůj, nesvůj, tentam alone (lit. own self, not in mood, gone) | P (pronoun)
73. 're (are)) | V (verb)

Value | Description | POS
t | Verb, present or future tense, with the enclitic -ť (lit. (perhaps) could-you-imagine-that? or but-because, both archaic) | V (verb)
u | Numeral, interrogative: kolikrát (lit. how many times?) | C (numeral)
v | Numeral, multiplicative, definite (-krát, lit. times: pětkrát, lit. five times) | C (numeral)
w | Numeral, indefinite, adjectival declension (nejeden, tolikátý, ..., lit. not only one, so many times repeated) | C (numeral)
y | Numeral, fraction ending at -ina; used as a noun (pětina, lit. one fifth) | C (numeral)
z | Numeral, interrogative: kolikátý (lit. what, at what position or place in a sequence) | C (numeral)

Table C.3. Gender

Value | Description
F | Feminine
H | {F, N}: Feminine or Neuter
I | Masculine inanimate
M | Masculine animate
N | Neuter
Q | Feminine (with singular only) or Neuter (with plural only); used only with participles and nominal forms of adjectives
T | Masculine inanimate or Feminine (plural only); used only with participles and nominal forms of adjectives
X | Any
Y | {M, I}: Masculine (either animate or inanimate)
Z | {M, I, N}: Not feminine (i.e., Masculine animate/inanimate or Neuter); only for some pronoun forms and certain numerals

Table C.4. Number

Value | Description
D | Dual, e.g. nohama
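Because each category occupies a fixed position in the tag, a tag can be decoded by simple indexing. The following is a minimal sketch, not a tool from the CD-ROM: the position names follow the order of Tables C.1 to C.15, the gender and number values are taken from Tables C.3 and C.4 of this guide, and the reading of "-" as "not applicable" is an assumption. The function names are hypothetical.

```python
# Position names follow the order of Tables C.1 to C.15.
POSITIONS = (
    "POS", "SubPOS", "Gender", "Number", "Case", "PossGender", "PossNumber",
    "Person", "Tense", "Grade", "Negation", "Voice", "Reserve1", "Reserve2",
    "Variant",
)

GENDER = {
    "F": "Feminine",
    "H": "Feminine or Neuter",
    "I": "Masculine inanimate",
    "M": "Masculine animate",
    "N": "Neuter",
    "Q": "Feminine (singular only) or Neuter (plural only)",
    "T": "Masculine inanimate or Feminine (plural only)",
    "X": "Any",
    "Y": "Masculine (animate or inanimate)",
    "Z": "Not feminine",
    "-": "not applicable",  # assumption, see the lead-in
}

NUMBER = {
    "D": "Dual",
    "P": "Plural",
    "S": "Singular",
    "W": "Singular for feminine gender, plural with neuter",
    "X": "Any",
    "-": "not applicable",  # assumption, see the lead-in
}


def split_tag(tag):
    """Map each of the 15 positions of a positional tag to its value."""
    if len(tag) != len(POSITIONS):
        raise ValueError(f"expected {len(POSITIONS)} positions, got {len(tag)}")
    return dict(zip(POSITIONS, tag))


def gender_and_number(tag):
    """Describe the gender and number positions of a positional tag."""
    fields = split_tag(tag)
    return (
        GENDER.get(fields["Gender"], "unknown"),
        NUMBER.get(fields["Number"], "unknown"),
    )
```

For the noun tag NNFS1-----A----, this gives the descriptions Feminine and Singular; the other positions can be looked up in the same way with their own value tables.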
74. ros in the Perl language can extend its functionality. Macros are called from menus or through assigned hotkeys. Users familiar with programming can use the version of TrEd without a graphical user interface, called btred, for batch data processing (the Batch-mode Tree Editor). The NTrEd tool is another add-on to the editor; it makes it possible to parallelise btred processes and to distribute them over several machines.

To open a file in TrEd, use the menu command File, Open and choose a file with the extension .a or .csts. The file opens in TrEd and the first sentence of the file is displayed on the screen.

Figure 3.4 shows a typical TrEd screen with the sentence Problémy motivace jsou tak staré jako lidstvo. (E. The motivational problems are as old as the human race.) The explanatory notes follow:

1. A window showing the tree that represents the syntactical annotation of the sentence.
2. The represented sentence.
3. Status line. The status line shows various information on the selected word (the highlighted node, in our case Problémy). In our example, the ID number of the node, its lemma and its tag are displayed.
4. Current context. The environment for working with the annotations is called the context. There is a context which only allows the user to view the annotations (e.g. the PML_A_View context serv
75. s (lemma, morphological tag) is the result of the morphological analysis for each word form.

The morphological analysis is followed by tagging (also called disambiguation). In this phase, the right combination of lemma and tag for the given context is selected from the set of all possible lemmas and tags. Given the character of the task, it is impossible to devise a tagging method that would work with 100 percent accuracy. The program carrying out the tagging is called a tagger.

The tagger application included on the CD-ROM is based on the Hidden Markov Model (HMM) and uses the averaged perceptron, a statistical method (Collins 2002). The input for the tagger is a text that contains the set of all possible morphological tags and lemmas for every word (the output of the morphological analysis). In its output, the tagger assigns each word a single, unambiguously determined tag and the corresponding lemma. The tagger was trained on the PDT 2.0 data, and its accuracy (percentage of correct tags) on the CAC 2.0 is 91.8 percent. However, some errors are caused by differences between the PDT and the CAC, because of which the morphological analysis does not always offer the correct tag for some words. This happens systematically with numbers written in digits (represented by a wildcard symbol in the CAC) and with unknown words (also represented by a wildcard symbol). If we do not take these syste
76. s for Information Systems (No. 1ET101120413), financed by the Grant Agency of the Academy of Sciences of the Czech Republic. The CD completes the comprehensive presentation of the results of five years of work on the project.

Chapter 2. Introduction

2.1 Introducing the Czech Academic Corpus (CAC) 2.0

The Czech Academic Corpus 2.0 is a morphologically and syntactically annotated corpus of 650,000 words.

The Czech Academic Corpus (CAC) was created between 1971 and 1985 by a team from the Institute of the Czech Language of the ASCR led by Marie Těšitelová [11]. The original purpose of the corpus was to build a frequency dictionary of the Czech language, and the original name of the corpus was Korpus věcného stylu (Practical Corpus). The corpus was annotated morphologically and syntactically by hand.

Independently of the CAC, the annotation of the Prague Dependency Treebank (PDT) was launched in 1996. The idea of transferring the internal format and annotation scheme of the CAC into those of the PDT emerged during the work on the second version of the PDT [16]. The main goal was to make the CAC and the PDT fully compatible and thus enable the integration of the CAC into the PDT. After converting the inner format and the morphological annotation scheme, we published the first version of the CAC (Vidová Hladká et al. 2007). The second version presented here enriches the CAC 1.0 by adding the surface syntax annotation (in the terminology of the PDT we
77. The main internal format of the CAC 2.0, the PML format (see Chapter 3.2.1), treats the annotation layers separately: each layer of annotation of a document corresponds to one file. In the CSTS format, all layers of annotation are contained in one file. For the CAC 2.0 this means that there are three instances (files) for every document: one for the w-layer, one for the m-layer and a third one for the a-layer. However, the separation of the layers does not prevent interconnection between the layers of annotation. In fact the opposite is true, as will be demonstrated later in this section.

The word layer does not reflect the segmentation of the text into sentences; this segmentation occurs on the m-layer. This means that, unlike the w-layer, the m-layer contains final punctuation. Additionally, the number of word tokens in the two layers may differ. The differences arise when an incorrectly split word is joined into one word or, conversely, when incorrectly joined words are divided into several units. The m-layer should contain the correctly written text.

Example: The three following figures illustrate the interconnection of the w-layer and the m-layer. The correspondence between the word units in the two files is denoted by arrows
78. vero, desatero, ..., lit. four kinds/sorts of, ten ...) | C (numeral)

Value | Description | POS
k | Numeral, generic; greater than or equal to 4, used as a syntactic adjective, short form (čtvery, ..., lit. four kinds/sorts of) | C (numeral)
l | Numeral, cardinal: jeden, dva, tři, čtyři, půl, ... (lit. one, two, three, four); also sto and tisíc (lit. hundred, thousand) if noun declension is not used | C (numeral)
m | Verb, past transgressive; also archaic present transgressive of perfective verbs (ex.: udělav, lit. (he) having done; arch. also udělaje (C.15 = 4), lit. (he) having done) | V (verb)
n | Numeral, cardinal, greater than or equal to 5 | C (numeral)
o | Numeral, multiplicative, indefinite (-krát, lit. times: mnohokrát, tolikrát, ..., lit. many times, that many times) | C (numeral)
p | Verb, past participle, active (including forms with the enclitic -s, lit. 're (are)) | V (verb)
q | Verb, past participle, active, with the enclitic -ť (lit. (perhaps) could-you-imagine-that? or but-because, both archaic) | V (verb)
r | Numeral, ordinal (adjective declension without degrees of comparison) | C (numeral)
s | Verb, past participle, passive (including forms with the enclitic -s, lit
