
Dutch Parallel Corpus User Manual

ENGLISH

Tag    Description                                   Frequency
CC     conjunction, coordinating                       105 174
CD     numeral, cardinal                                74 206
DT     determiner                                      321 793
EX     existential there                                 4 519
FW     foreign word                                      2 499
IN     preposition or conjunction, subordinating       371 991
JJ     adjective or numeral, ordinal                   225 432
JJR    adjective, comparative                           10 755
JJS    adjective, superlative                            4 179
LS     list item marker                                  6 087
MD     modal auxiliary                                  37 866
NN     noun, common, singular or mass                  516 755
NNP    noun, proper, singular                          249 016
NNPS   noun, proper, plural                             11 424
NNS    noun, common, plural                            188 261
PDT    pre-determiner                                    2 089
POS    genitive marker                                  11 050
PRP    pronoun, personal                                69 287
PRP$   pronoun, possessive                              30 951
RB     adverb                                          112 449
RBR    adverb, comparative                               6 058
RBS    adverb, superlative                               1 544
RP     particle                                          8 890
SYM    symbol                                           19 785
TO     to as preposition or infinitive marker           68 962
UH     interjection                                        937
VB     verb, base form                                 103 882
VBD    verb, past tense                                 57 842
VBG    verb, present participle or gerund               51 929
VBN    verb, past participle                            92 999
VBP    verb, present tense, not 3rd person singular     51 052
VBZ    verb, present tense, 3rd person singular         77 792
WDT    WH-determiner                                    15 412
WP     WH-pronoun                                        6 505
WP$    WH-pronoun, possessive                              380
WRB    WH-adverb                                        10 118
Total                                                2 929 870

DUTCH

ADJ    Adjective                                       472 962
BW     Adverb                                          252 115
LET    List item                                       75
[Screenshot: bilingual search form with word form, lemma and part-of-speech fields for French and Dutch]

Step 3: Results

The results are shown below your search specifications, and the matching words are highlighted in red. These results can also be exported to an Excel sheet.

[Screenshot: results list (21 results, all 21 shown) with a link to export the results to Excel]
[Screenshot: aligned French and Dutch result sentences]

Next to each sentence there are two small icons. Hovering over the i (information) icon shows the metadata of the sentence.

Clicking the c (context) icon opens another window in which the sentence can be viewed in its context, ranging from 1 to 50 sentences. The sentence shown in yellow is the original one.

[Screenshot: context window showing a French sentence together with its surrounding sentences]

Appendices

Appendix one: Design in numbers

In accordance with the DPC design principles, the corpus is balanced in two ways: it contains five text types that each account for 2 000 000 words, and each translation direction within a text type contains 500 000 words. This leads to a corpus that can be summarized in the following table. For each text type we see how many words are included per translation direction.

Text Type Administrative Texts External Communication Instructive Texts Journalistic Texts Literature Grand Total SRC TGT EN DU FR DU DU EN DU F
            Nijgh & Van Ditmar > Editions du Seuil     Extracts of novels (fictional)

Journalistic texts
EN > DU     The Independent > De Standaard             News articles, comment articles (columns, editorials)
FR > DU     Roularta                                   News articles, columns
DU > EN     Campuskrant                                Comment articles
DU > FR     Roularta                                   News articles, columns

Instructive texts
            IBM                                        Manuals
            Bosch                                      Manuals

Administrative texts
EN > DU     Melexis                                    Yearly reports
FR > DU     RIZIV                                      Yearly reports, minutes of meetings
DU > EN     Vlaamse Overheid                           Yearly reports, minutes
DU > FR     FOD Sociale Zekerheid                      Yearly reports, correspondence

External communication
EN > DU     Ablynx                                     Press releases
FR > DU     NMBS                                       Press releases
DU > EN     Arcelor Mittal                             Promotion and advertising material
DU > FR     Transmed                                   Informative documents of a general nature

Table 2: Selection of text providers

2.5 Metadata

All the text material included in the corpus is annotated with additional metadata at different levels. This allows the user to retrieve relevant information from the corpus. The DPC metadata are of two kinds: (1) text-related data and (2) translation-related data. Finally, some statistics are added.

1. The first kind includes information on the text language, the author and/or translator, publishing information and the intended outcome of the text. The text is also characterized accordi
                           Novels
                           Essayistic texts
Literature non-fictional   (Auto)biographies
                           Expository works of a general nature
Journalistic texts         News articles
                           Comment articles: background articles
                           Comment articles: columns
                           Comment articles: editorials
Instructive texts          Manuals
                           Internal legal documents
                           Procedure descriptions
Administrative texts       Legislation
                           Proceedings of parliamentary debates
                           Minutes of meetings
                           Yearly reports
                           Official speeches
External communication     Self-presentations of organisations, projects, events
                           Informative documents of a general nature
                           Promotion and advertising material
                           Yearly reports
                           Press releases and newsletters
                           Scientific texts

8 Domain / 9 Keywords
Communication, ICT, Internet, Consumption, Household appliances, Culture, Museum, Architecture, Arts, Languages, Economy, Business, Environment, Conservation, Pollution, Threats, Nature, Finance, Banking, Investment, Foreign affairs, EU, Institutions, Management, Policy, Legal Documents, Justice, Legislation, Leisure, Tourism, Sports, Science, Linguistics, Oceanography, Zoology, Botany, Medicine, Technology, Welfare state, Social security, Public health, Working conditio
"U bent wijnbouwer in Waals-Brabant" until "Ik denk dat ik niet zo'n slecht tacticus ben, want ik handel snel en doeltreffend en heb een goed zicht op wat er binnen vijf jaar op het spel staat" were deleted because they contained text not dealing with the financial domain.

The resulting term lists consist of three or four fields delimited by a tab. The first field contains the Dutch term and the last field contains the reference codes; the one or two fields in between contain the translations. An overview of the codes for the reference books is given below.

CODE     SOURCE                                                                  ONLINE
AZWIKI   Gezondheid van A-Z (Wikipedia)                                          http nl wikipedia org wiki Gezon
DGF      Dictionnaire de la comptabilité et de la gestion financière, Louis
DICH     Dictionnaire Médical Manuila, Lewalle, Nicoulin
DMF      Dictionnaire Médical Flammarion
DMFI     Dictionnaire des marchés financiers, Antoine & Capiau-Huart
DTM      Dictionnaire français des termes de medicine
ELSE     Elsevier's dictionary of financial terms, English-French et al., Marie-Claude Bignaud
EUR      Euramis terminologiebank, Europese Commissie
EURFR    EurekaSanté                                                             http www eurekasante fr lexique medical html
FELNE    Financieel Economisch Lexicon N-E, A.J. de Keizer
FINCAN   Glossaire Financier Canadees                                            http www lautorite gc ca userfile s File Publications Consommateu
aligner requires paragraph-aligned data, 10% of the corpus was also manually checked at paragraph level. Afterwards, this manually verified output was compared with the combined output of three aligners, namely the Vanilla aligner, the Microsoft aligner and the GMA aligner, so as to be able to retrain the tools and to work out spot-check heuristics. The remaining 90% of the sentence-aligned data was verified using spot checks.

3.4 Sub-sentential alignment

For more than 25 000 words of the Dutch-English part of the corpus, manual alignments at the sub-sentential level were created. Reference corpora in which sub-sentential translational correspondences are indicated manually, also called Gold Standards, are used as an objective means of testing word alignment systems. The reference corpus consists of journalistic texts, newsletters and medical European Public Assessment Reports (EPARs). We assume that a different translation style was adopted for each of the three text types. Table 3 summarizes the formal characteristics of the corpus: the total number of words, the average sentence length of source and target sentences, and the ratio of source to target sentences. In total, the Gold Standard contains more than 25 000 words.

Text type            Total words   Avg. sentence length (source)   Avg. sentence length (target)
Journalistic texts         7 706                            22.0                            22.0
Newsletters               10 480                            15.0                            15.4
EPARs                      7 536                            17.2                            17.7

Table 3: Sub-sententially aligned corpus

To account for a w
can be stored in bilingual glossaries, which are already a valuable aid for technical translators. If the aim is the creation of a term bank, the extracted terms are structured in concept-oriented databases in the terminology management phase.

The Gold Standard contains texts from two different domains:
• Medical domain: trilingual texts (Dutch, French, English)
• Financial domain: bilingual texts (Dutch-French and Dutch-English)

In the Gold Standard, all terms (single-word and multiword terms) were manually indicated. As we had no domain experts at our disposal, all terms were looked up in several reference books. The reference books in which the terms were found are included in the data set. Details on the texts of the extraction corpus are presented in the following table.

Domain      Texts   Lang. pairs   Dutch words   Terms
Medical     ELI     DU-EN-FR           11 365     469
Financial   ING     DU-EN               9 458     400
            QTY     DU-FR               8 954     338

Table 4: Extracted terms for the gold standard

The following texts were included in the extraction corpus:

ELI   dpc-eli-000937, dpc-eli-000938, dpc-eli-000939, dpc-eli-000940, dpc-eli-000941, dpc-eli-000942, dpc-eli-000943, dpc-eli-000944, dpc-eli-000945, dpc-eli-000946, dpc-eli-000947, dpc-eli-000948
ING   dpc-ing-001878, dpc-ing-001879, dpc-ing-001888
QTY   dpc-qty-000928, dpc-qty-000930, dpc-qty-000932, dpc-qty-000933, dpc-qty-000935, dpc-qty-000936 (7)

7 The text of dpc-qty-000936 was shortened: the sections from
in DPC, the user can consult Appendix One.

In order to maximize research potential, copyright clearance was obtained for all texts. DPC is made available to the research community through the Dutch Agency for Human Language Technologies, the TST-centrale. The next chapter will expand on quality control and other data processing steps.

3 Data processing

3.1 Quality control

Since one of the explicit objectives of DPC was to obtain high quality, a quality control system was put in place for each step in compiling, aligning and annotating the corpus. Three forms of quality control were envisaged: manual verification, a spot-checking module and automatic control.

Manual verification, traditionally the best guarantee of high-quality data, was performed by qualified linguists with native or near-native language proficiency. Ten percent of the whole corpus (1 million words) was manually verified for each processing step. The exact composition of this manually verified 1-million-word corpus can be found on http www kuleuven kortrijk be dpc xtra G1 G1 html

The second step was to develop a spot-checking module on the basis of an error analysis of the manually verified data. This was only done for those processing steps whose output could be improved considerably using simple spot-checking heuristics. Finally, the other data processing steps were verified with automatic control procedures. Each step of the data processing will now be
one sentence in a target language
1-many      one sentence in a source language is aligned with two or more sentences in a target language
many-1      two or more sentences in a source language are aligned with one sentence in a target language
many-many   two or more sentences in a source language are aligned with two or more sentences in a target language
0-1         no alignment link for a sentence in a target language
1-0         no alignment link for a sentence in a source language

Zero alignments were only accepted if no translation could be found for a sentence in either the source or the target language, in other words when the corresponding part of the text was missing in the other language. Many-to-many alignments were legitimate in two cases: overlapping alignments and crossing alignments. In all other cases smaller links were used. For example, unless a 2-2 alignment is a true case of an overlapping or a crossing alignment, two 1-1 links were used. An overlapping alignment is due to asymmetric sentence splitting in the two languages, whereas a crossing alignment means that the translation of a sentence in the source text shows up at another place in the target text. These two alignment types were not admissible as such in DPC and were therefore put under the umbrella of many-to-many alignments.

Ten percent of the sentence-aligned data was checked manually. For this manual verification, the sentences were run through the Vanilla aligner. Because this
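The link types listed above can be illustrated with a small sketch. It assumes, purely for illustration, that an alignment link is represented as a pair (number of source sentences, number of target sentences); this representation and the function name `link_type` are hypothetical and do not describe the actual DPC data format.

```python
from collections import Counter


def link_type(n_source: int, n_target: int) -> str:
    """Name the alignment link type for the given sentence counts.

    Hypothetical helper: a link is described only by how many source and
    target sentences it connects.
    """
    if n_source < 0 or n_target < 0 or n_source + n_target == 0:
        raise ValueError("a link needs at least one sentence")
    if n_source == 0:
        return "0-1"        # target sentence without a source counterpart
    if n_target == 0:
        return "1-0"        # source sentence without a translation
    if n_source == 1 and n_target == 1:
        return "1-1"
    if n_source == 1:
        return "1-many"
    if n_target == 1:
        return "many-1"
    return "many-many"      # only for overlapping or crossing alignments


# Made-up links, used only to show how a distribution could be tallied.
links = [(1, 1), (1, 2), (2, 1), (2, 2), (0, 1), (1, 0), (1, 1)]
counts = Counter(link_type(s, t) for s, t in links)
```

A tally like `counts` could serve as a quick sanity check during spot checking, for instance to flag texts with an unusually high share of many-many or zero links.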
325         534 963     2 044 535    102.23
Grand Total 5 211 171   2 885 418    2 698 586    10 795 175   107.95

A small part of the corpus is trilingual and contains Dutch texts translated into both English and French. The following table shows the number of Dutch words that were translated into English and French per text type.

Literature, Journalistic texts, Instructive texts, Administrative texts, External communication, Total
223 322   165 205   4 383   76 319   469 229

Appendix two: Metadata

All text material included in DPC is provided with metadata at two levels: text-related data and translation-related data. The following tables present a comprehensive overview of all possible metadata tags.

Text-related data

1 Language              NL-NL, NL-BE, EN-UK, EN-US, FR-FR, FR-BE
2 Author/translator     X
3 Text unit title       X
4 Publishing info       magazine/journal title, publisher, ISBN/ISSN, date of publication, original date of publication, place of publication, original place of publication, info on previous editions, editor, article number, page of the article in the magazine, keywords, class of the article
5 Intended outcome      written to be read, written to be spoken, written reproduction of spoken language
6 Text type / 7 Text subtype
Literature fictional
7 264
LID    Article                                685 016
N      Noun (proper, common)                1 495 304
SPEC   Abbreviation, first name, last name    237 959
TSW    Interjection                             1 124
TW     Numeral                                151 860
VG     Conjunction                            294 788
VNW    Pronoun                                432 513
VZ     Preposition                            890 746
WW     Verb                                   879 826
Total                                       6 555 902

FRENCH

A      Adjective                              268 961
C      Conjunction                            142 177
D      Determiner                             507 808
F      Punctuation                            425 053
I      Interjection                             2 805
N      Nouns                                1 019 973
P      Pronoun                                169 992
R      Adverb                                 159 950
S      Preposition                            585 868
V      Verb                                   417 852
X      Miscellaneous                            7 746
Total                                       3 708 185
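As a small worked check on the French frequencies listed above, the sketch below sums the per-tag counts, compares the sum with the reported total, and derives relative frequencies. The counts are copied from the table; the tag label "V" for the verb row is an assumption (the label is missing in the extracted table), and the variable and function names are illustrative only.

```python
# Head-tag frequencies for the French part of DPC, copied from the table.
# The key "V" for the verb row is an assumed label.
french_tag_counts = {
    "A": 268_961,    # adjective
    "C": 142_177,    # conjunction
    "D": 507_808,    # determiner
    "F": 425_053,    # punctuation
    "I": 2_805,      # interjection
    "N": 1_019_973,  # nouns
    "P": 169_992,    # pronoun
    "R": 159_950,    # adverb
    "S": 585_868,    # preposition
    "V": 417_852,    # verb
    "X": 7_746,      # miscellaneous
}
REPORTED_TOTAL = 3_708_185


def relative_frequencies(counts):
    """Return each tag's share of all tagged tokens."""
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}


computed_total = sum(french_tag_counts.values())
shares = relative_frequencies(french_tag_counts)
most_frequent_tag = max(shares, key=shares.get)
```

The same check can be applied to the English and Dutch tables once their rows are available in full.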
CONTENTS

1 Introduction
2 Corpus design
  2.1
  2.2 Text types
  2.3 Text providers
  2.4 Copyright clearance
  2.5 Metadata
  2.6 Conclusion
3 Data processing
  3.1 Quality control
  3.2 Text normalization
  3.3 Sentence alignment
  3.4 Sub-sentential alignment
  3.5 Linguistic annotation
    3.5.1 Tokenization
    3.5.2 Lemmatization and Part of Speech tagging
    3.5.3 Syntactic information
  3.6 Terminology extraction
4 Exploitation
  4.1 Monolingual vs. bilingual search
  4.2 Full corpus vs. subcorpus
  4.3 The search proper
    4.3.1 Web interface functionality: survey
    4.3.2 Web interface functionality: example
Appendices
  Appendix one: Design in numbers
  Appendix two: Metadata
  Appendix three: PoS tags

1 Introduction

The present manual describes the Dutch Parallel Corpus (DPC), a 10-million-word, high-quality, sentence-aligned parallel corpus for the language pairs Dutch-English and Dutch-French, with Dutch as the central language. The manual contains a detailed description of the design principles underlying DPC and of the different stages of data processing. The web interface is also discussed and illustrated with vari
Text type / direction        DU          EN          FR          TOTAL        %

Administrative texts
  EN > DU                  255 155     246 137           0      501 292   100.26
  FR > DU                  307 886           0     322 438      630 324   126.06
  DU > EN                  249 410     257 087           0      506 497   101.30
  DU > FR                  280 584           0     301 270      581 854   116.37
  Total                  1 093 035     503 224     623 708    2 219 961   111.00

External communication
  EN > DU                  278 515     272 460           0      550 975   110.19
  FR > DU                  233 277           0     250 604      483 881    96.78
  DU > EN                  246 448     255 634           0      502 082   100.42
  DU > FR                  241 323           0     270 074      511 397   102.28
  XDE                       21 679      20 118           0       41 797     8.36
  XDEF                      14 192      14 953      15 743       44 888     8.98
  Total                  1 035 434     563 165     536 421    2 135 020   106.75

Instructive texts
  EN > DU                  340 097     327 543           0      667 640   133.53
  FR > DU                   40 487           0      42 017       82 504    16.50
  DU > EN                   19 011      20 696           0       39 707     7.94
  DU > FR                  110 278           0     115 034      225 312    45.06
  XD F                      59 791           0      73 758      133 549    26.71
  XDE                      299 996     296 698           0      596 694   119.34
  XDEF                     138 673     145 103     166 836      450 612    90.12
  Total                  1 008 333     790 040     397 645    2 196 018

Journalistic texts
  EN > DU                  262 768     264 900           0      527 668   105.53
  FR > DU                  240 785           0     265 530      506 315   101.26
  DU > EN                  250 580     259 764           0      510 344   102.07
  DU > FR                  314 989           0     340 319      655 308   131.06
  Total                  1 069 122     524 664     605 849    2 199 635   109.98

Literature
  EN > DU                  148 488     143 185           0      291 673    58.33
  FR > DU                  186 799           0     186 620      373 419    74.68
  DU > EN                  346 802     361 140           0      707 942   141.59
  DU > FR                  323 158           0     348 343      671 501   134.30
  Total                  1 005 247     504
a smaller part, a subcorpus. A subcorpus can be put together by using the DPC metadata, which cover a whole range of features that allow the user to make a number of selections on the DPC web interface. A user can specify, for instance, what types of texts, languages and domains should be used in the search. For a complete list of all metadata, please consult Appendix Two.

[Screenshot: Putting together a subcorpus. After the selections are made, the interface reports the size of the subcorpus, here 540 820 words for language 1 and 997 358 words for language 2.]

On the basis of the selected metadata, the user creates a new corpus that can be presented as a DPC subcorpus. This subcorpus constitutes the starting point for any further search on the web interface. The metadata can thus be seen as a first filter in the search task on the interface.

4.3 The search proper

4.3.1 Web interface functionality: survey

After selecting the metadata and putting together a subcorpus, the user can execute a second search command for the search proper. This process is visualized in the following graph.

[Figure: from the DPC corpus, a selection by text metadata and language yields a DPC subcorpus, which can then be queried by lexical search, enriched lexical search or regular-expression search.]

4.3.2 Web interface fun
actic features as attributes to the word class. In total, 316 distinct full tags are discerned.

For the manual verification of 10% of the Dutch PoS tags and lemmata, the procedures of the D-COI project were used. This implies that only the words on which the different taggers did not agree were manually verified. The D-COI protocol, with its description of all possible tags, served as a reference guide. In addition to the verification of PoS and lemma, we also grouped multiword units and Dutch separable verbs, using the CGN protocol as a reference. Lemmatization and PoS tagging of the 9M corpus were also carried out with the help of the ensemble tagger. The tagging task was carried out by the team of the ILK Research group (Tilburg).

3 http www ccl kuleuven be Papers POSmanual febr2004 pdf
http lands let kun ni cegn doc Dutch topics version 1 0 annot lex linkup Ixk prot pdf

English

For English, part-of-speech tagging and lemmatization were performed by the combined memory-based PoS tagger/lemmatizer, which is part of the MBSP tools (Daelemans, Walter, Buchholz, Sabine and Veenstra, Jorn, 1999, and Daelemans, Walter and Van den Bosch, Antal, 2005). The English memory-based tagger was trained on data from the Wall Street Journal corpus in the Penn Treebank (Marcus, Mitchell P., Santorini, Beatrice and Marcinkiewicz, Mary Ann, 1993) and uses the Penn Treebank tagset. The Penn Treebank tagset contains 45 distinct tags.

All PoS codes and lemmata of 10% of the corpus were manually inspected and verified. For this we used the Penn Treebank tagging guidelines as a reference. These manually verified annotations were used to test the performance of two different PoS taggers that use the same PoS tag set and lemmatisation conventions: the MBSP tagger and TreeTagger. As the two taggers made different errors, we combined their output to process the nine-million-word corpus and only verified the PoS tags and lemmata on which the two taggers did not agree. With a limited manual verification effort we can achieve 98% precision for PoS tagging and 99% precision for lemmatization.

French

For the 10% of the data that was manually checked, the linguistic annotation was produced by combining the output of two runs of the French version of TreeTagger. The first run used the tag set of the original TreeTagger and FLEMM lemmatisation information. In the second run the LIMSI tagset was used. The output of both tagging procedures was compared during the analysis of the 1M set, and a quality procedure was developed in order to spot-check the data from the 9M set. The tagset consists of 312 morphosyntactic tags.

Allauzen, Alexandre and Hélène Bonneau-Maynard. 2008. Training and Evaluation of POS Taggers on the French MULTITAG Corpus. In Proceedings of the Sixth International Language Resources and Evaluation (LREC'08), pages 28-30.
Paroub
ctionality: example

Step 1: Select a subcorpus

In this example we carry out a bilingual search on a selection of texts: administrative texts with French as the original language and Dutch as the translated language. We see that our subcorpus contains 283 752 words for French and 276 066 words for Dutch.

Step 2: Define your search query

Once the subcorpus has been determined, the user can perform the search. In our example we carry out an enriched bilingual search for the language pair Dutch-French. The example concerns the use of past tenses in French and Dutch. In the French component we search for examples that contain the verb avoir used as an auxiliary in the indicatif présent and followed by a past participle. In the Dutch component we select verbs (searching on lemma) that are used in the past tense. In addition, we specify that the Dutch results cannot contain a past participle. Our search thus excludes the literal translation of the French passé composé.

[Screenshot: new search query form, with fields for word form, lemma, wildcards and parts of speech for each language]
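The logic of the query defined in Step 2 can be sketched as a filter over annotated sentence pairs. This is only an illustration of the search conditions, not the implementation of the web interface: the `Token` class, the feature labels ("aux", "ind", "pres", "past", "pastpart") and the example sentences are all hypothetical, and "followed by" is read here as "immediately followed by", which is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    form: str
    lemma: str
    pos: str                       # coarse word class, e.g. "VERB"
    feats: frozenset = frozenset()  # hypothetical feature labels


def french_matches(tokens):
    """avoir as present-indicative auxiliary, directly followed by a past participle."""
    for cur, nxt in zip(tokens, tokens[1:]):
        if (cur.lemma == "avoir" and {"aux", "ind", "pres"} <= cur.feats
                and nxt.pos == "VERB" and "pastpart" in nxt.feats):
            return True
    return False


def dutch_matches(tokens):
    """At least one past-tense verb and no past participle in the sentence."""
    verbs = [t for t in tokens if t.pos == "VERB"]
    has_past = any("past" in t.feats for t in verbs)
    has_participle = any("pastpart" in t.feats for t in verbs)
    return has_past and not has_participle


def pair_matches(french, dutch):
    return french_matches(french) and dutch_matches(dutch)


# Invented example pair: "Il a dressé un bilan" / "Hij schetste een overzicht".
fr = [Token("Il", "il", "PRON"),
      Token("a", "avoir", "VERB", frozenset({"aux", "ind", "pres"})),
      Token("dressé", "dresser", "VERB", frozenset({"pastpart"})),
      Token("un", "un", "DET"), Token("bilan", "bilan", "NOUN")]
nl = [Token("Hij", "hij", "PRON"),
      Token("schetste", "schetsen", "VERB", frozenset({"past"})),
      Token("een", "een", "DET"), Token("overzicht", "overzicht", "NOUN")]
nl_perfect = [Token("Hij", "hij", "PRON"),
              Token("heeft", "hebben", "VERB", frozenset({"aux", "pres"})),
              Token("geschetst", "schetsen", "VERB", frozenset({"pastpart"}))]

simple_past_hit = pair_matches(fr, nl)
literal_translation_hit = pair_matches(fr, nl_perfect)
```

The second pair shows why the Dutch condition excludes past participles: a Dutch perfect tense, the literal counterpart of the passé composé, is filtered out.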
cuments
                            3.3 Procedure descriptions
4 Administrative texts      4.1 Legislation
                            4.2 Proceedings of parliamentary debates
                            4.3 Minutes of meetings
                            4.4 Yearly reports
                            4.5 Official speeches
5 External communication    5.1 Self-presentation of organisations, events
                            5.2 Informative documents of a general nature
                            5.3 Promotion and advertising material
                            5.4 Press releases and newsletters
                            5.5 Scientific texts

Table 1: Text types and subtypes included in DPC

All this information is also stored in the metadata (see 2.5). The exact number of words per text type and translation direction can be found in Appendix One.

2.3 Text providers

To guarantee the quality of the text samples, most of them were taken from published materials or from companies or institutions working with a professional translation division. Care was taken to differentiate between kinds of data providers, among them publishing houses, press, government, corporate enterprises and European institutions. Differentiation was also compulsory at cell level: in order to preserve a good balance, the material of each cell (i.e. each unique combination of text type and translation direction) originates from at least three different providers. This is why it was decided to limit the number of words per text provider to 166 666 for every combination. In some cases, however, this ceiling could not be respected for pragmatic reasons and more material came from a single provider. This is the case, for example, with journalistic texts. Although articles may have come from only one text provider, they were in fact written or translated by various people.

2.4 Copyright clearance

In order to make the corpus accessible to the entire research community, copyright clearance was obtained for all samples included in the corpus. These licence agreements guarantee accessibility and protect the intellectual and economic property rights of the authors and publishers. Four types of agreements were used: IPR for commercial use, IPR for publishers, IPR short version, and an e-mail or letter with permission. All this information is stored in the metadata, so that each user knows which text material is accessible to the entire research community and which material has limited use.

In the following table we list, for each text type and translation direction, some text providers that contributed to the DPC project. This is not an exhaustive list, because some text providers wished to remain anonymous. In total, 55 text providers participated in DPC.

Transl.     Text provider                              Data

Literature (fictional & non-fictional)
EN > DU     Little Brown > Nijgh & Van Ditmar          Extracts of novels (fictional)
FR > DU     Editions du Seuil > Nijgh & Van Ditmar     Extracts of novels (fictional)
DU > EN     Mercatorfonds                              Expository works (non-fictional)
DU > FR
discussed in more detail.

3.2 Text normalization

The data acquired came in different formats and thus needed to be brought into conformity with a DPC standard. To this end, every text was converted into txt format, assigned a unique DPC name and grouped according to text type. Graphs, tables, tables of contents and figures were removed from the material so as to end up with clean text. Text material that had originally been drawn up in PDF format required particular attention. Considerable time and effort were devoted to cleaning the incoming texts, to ensure that the subsequent, more automated steps could be carried out as smoothly as possible. All texts, i.e. 100% of the corpus, were therefore manually cleaned.

Once a text had been cleaned, it was split into sentences, a necessary step for the subsequent processing steps. Quality control principles were applied to all of these steps: a manual check for 10% of the corpus, and spot-checking heuristics or automatic control procedures for the remaining 90%.

3.3 Sentence alignment

The whole corpus is aligned at sentence level, which means that each sentence of a source-language text was linked to its target-text equivalent. The sentences linked by the alignment procedure thus represent translations of each other in different languages. The alignment procedure resulted in links of different kinds:

1-1         one sentence in a source language is aligned with
egal documents (e.g. contracts, conditions, regulations) and procedural descriptions, i.e. documents dealing with all kinds of procedures.
• Administrative texts comprise five basic-level categories: legislation (written law), proceedings of parliamentary debates, minutes of meetings, yearly reports and official speeches. These texts are produced within an institutional context; their circulation is usually restricted to internal use or to use within a limited circle of organizations tied to the institution.
• External communication consists of five basic-level categories: self-presentations of organisations, projects and events; informative documents of a general nature; promotion and advertising material; press releases and newsletters; and scientific texts. These are texts of an informative and/or persuasive nature that are characterized by a wide circulation and are meant for external use in general or for peers in a broad sense.

The DPC typology is presented in the table below. The five main types represent superordinates, each containing several basic-level categories.

SUPERORDINATE               BASIC LEVEL
1 Literature                Fictional
                            1.1 Novels
                            1.2 Essayistic texts
                            Non-fictional
                            1.3 (Auto)biographies
                            1.4 Expository works of a general nature
2 Journalistic texts        2.1 News reporting articles
                            2.2 Comment articles (background articles, columns, editorials)
3 Instructive texts         3.1 Manuals
                            3.2 Internal legal do
ek, Patrick. 2000. Language resources as by-product of evaluation: the multitag example. In Second International Conference on Language Resources and Evaluation (LREC 2000), pages 151-154.

3.5.3 Syntactic information

A smaller part of the DPC data is syntactically annotated. The Dutch selection (200 000 words) was annotated by the LASSY team, who used the Alpino parser developed at Groningen University. The texts were selected from the following four text types:

administrative texts      26 532 words
instructive texts         25 985 words
external communication    66 379 words
journalistic texts        81 104 words

http www inf unibz it bernardi Courses CompLing Papers tagguide pdf
estimations derived from the 1 million word corpus

The texts come from both the Dutch-French and the Dutch-English part of the corpus and contain texts originally written in Dutch as well as texts translated into Dutch.

3.6 Terminology Extraction

In order to evaluate different terminology extraction tools, a Gold Standard, i.e. a manually created reference set for terminology extraction, was created within the framework of the DPC project. Terminology extraction can be seen as a first step towards terminology management. In the terminology extraction phase, terms are identified in a text and, in the case of multilingual terminology extraction, the corresponding translations are retrieved. The extracted terms and their translations
ide range of translational phenomena, three types of links were introduced: regular links, which connect straightforward correspondences; fuzzy links, for translation-specific shifts of various kinds (paraphrases and divergent translations); and null links, for source-text units that have not been translated or target-text units that have been added. A multi-level annotation is proposed in the case of divergent translations: fuzzy links are used to connect paraphrased sections, and regular links are used to connect corresponding words within the paraphrased sections. The annotation guidelines are available on the website http veto hogent be It3

For more information we refer to Lieve Macken. 2010. An annotation scheme and Gold Standard for Dutch-English word alignment. Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010), Valletta, Malta.

3.5 Linguistic annotation

Linguistic annotation involves lemmatization and part-of-speech (PoS) tagging of the DPC data, two processing steps that are usually linked together. The input data first had to be tokenized, a pre-processing task that is performed before the actual tagging procedure. The whole corpus is tokenized, lemmatized and enriched with part-of-speech tags. Since these steps are language dependent, different tokenizers, lemmatizers and PoS taggers were used for each DPC language.

3.5.1 Tokenization

During tokenization a sentence i
26. [Screenshot: search results in the web interface, showing aligned French and Dutch sentences with the matched words highlighted.]
27. ion and Part-of-Speech tagging
The lemmatization process generates the base form (lemma) for each orthographic token. Part-of-speech tagging assigns a part-of-speech code to each orthographic token. The lemmatizers for the three languages use similar definitions of base form or lemma. The base form for verbs is the infinitive; for other words it is the stem, i.e. the word form without inflectional affixes.

Although ideally one would like to compare grammatical codes over the three languages, limitations of the tools and inherent features of the three languages involved do not allow for a straightforward mutual mapping of the PoS codes. It was decided to use widely accepted PoS tag sets for each language. In Appendix Three you can find the frequency of the head PoS tags for each language.

Dutch
The PoS tagging system and tools developed within D-Coi, a corpus project for Dutch, were borrowed for the Dutch section of DPC. The advantage of this is that the Dutch data (Dutch being the central language in DPC) can be directly related to existing Dutch corpora, thus allowing for transparent search queries in linked Dutch corpora whenever need be. For the 1 million word subcorpus the ensemble tagger was used. The ensemble tagger uses the CGN PoS tagset (Van Eynde, Frank and Zavrel, Jakub and Daelemans, Walter, 2000), which is characterized by a high level of granularity. Apart from the word class, the CGN PoS tag set codes a wide range of morphosynt
28. is bi-directional. A part of the corpus is trilingual. The design principles were based on research into standards for other parallel corpus projects and a user requirements study. Two objectives were of paramount importance: balance and quality. In this chapter we discuss the aspect of balancing the corpus, while the next chapter focuses on how high quality was ensured for each processing step.

2.1 Balance
The DPC corpus consists of two language pairs (Dutch-English and Dutch-French), hence its four translation directions (Dutch into English, English into Dutch, Dutch into French and French into Dutch), and five text types:
• Literature
• Journalistic texts
• Instructive texts
• Administrative texts
• External communication

The corpus is balanced proportionally with respect to translation direction and text type. In accordance with DPC's design principles, the corpus contains five text types that each account for 2 000 000 words. Within each text type, each translation direction contains 500 000 words. When constructing DPC, two exceptions were made to the global design:
• Given the difficulty of finding information on translation direction for instructive texts, the condition on translation direction was relaxed for this text type.
• For literary texts it often proved difficult to obtain copyright clearance. Due to time constraints, the literary texts are not strictly balanced according to translation direction but are balanced according to
29. language pair. The exact number of words can be found in Appendix One.

2.2 Text types
In order to enhance the navigability of the corpus, a subclassification was imposed on the five text types, resulting in the creation of a finer tree-like structure within each type. This subdivision has no implications for the balancing of the corpus. The introduction of subtypes is merely a way of mapping the actual landscape within each text type and assigning accurate labels to the data, in order to enable the user to correctly select documents and search the corpus. The labels for the subtypes were chosen from cognitively tangible categories; most of them are encountered in everyday use. The following subtypes have been distinguished within the DPC typology:
• Literature is subdivided into fictional and non-fictional texts. The fictional texts were not further subdivided into basic-level categories, since we only managed to obtain copyright clearance for some novels, which form a well-known and recognized genre. Non-fictional literature, by contrast, is an umbrella category uniting three basic-level categories, all of them well-known genres: essays, (auto)biographies and expository works of a general nature.
• Journalistic texts were roughly subdivided into two basic-level categories: news reporting articles and comment articles. The latter comprises background articles, columns and editorials.
• Instructive texts contain three basic-level categories: manuals l
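The target figures of the balanced design in section 2.1 can be checked with a few lines of arithmetic. This is only a sketch of the design targets; the direction labels are shorthand chosen for this example, and the words actually collected deviate from these targets (see Appendix One and the two exceptions mentioned in section 2.1).

```python
# Design targets as stated in section 2.1.
WORDS_PER_DIRECTION = 500_000

# Shorthand labels for the four translation directions (assumed notation).
DIRECTIONS = ["NL>EN", "EN>NL", "NL>FR", "FR>NL"]

TEXT_TYPES = [
    "Literature",
    "Journalistic texts",
    "Instructive texts",
    "Administrative texts",
    "External communication",
]

# Each text type is split evenly over the four directions.
words_per_text_type = WORDS_PER_DIRECTION * len(DIRECTIONS)

# The overall design target follows from the five text types.
design_total = words_per_text_type * len(TEXT_TYPES)

print(words_per_text_type)  # 2000000, matching the figure per text type
print(design_total)         # 10000000
```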
30. ng to its type and domains, as well as according to the type of institution that produced the text (profit vs. non-profit) and according to its intended audience (internal communication, external communication for specialists or external communication for a general public). A list of relevant keywords is provided for the text, as well as information on copyright.
2. The second kind, translation-related data, indicates the translation direction and links original and translated texts. It also notes how the text was translated: human translation, translation by a human using translation memory, or machine translation corrected by a human.

The statistics mention how many words and sentences a certain document contains. These metadata suit different types of users. Any user can select, according to his or her needs, a more fine-tuned sample set based on the combination of metadata tags. More information can be found in Chapter 4, where the importance of the metadata as a first step in corpus search is underlined. For a complete list of all possible metadata tags, see Appendix Two.

2.6 Conclusion
This chapter discussed DPC's corpus design, marked by two features: balanced composition and research availability. The Dutch Parallel Corpus contains texts from a wide range of text types and diverse domains. It contains two bidirectional bilingual parts and one trilingual part. For exact numbers of the total amount of words that are included
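Selecting a fine-tuned sample set by combining metadata tags, as described above, amounts to filtering documents on several fields at once. The sketch below illustrates the idea only: the field names, the sample records and the `select` helper are hypothetical and do not reflect the actual DPC metadata format or the web interface.

```python
# Hypothetical metadata records; field names are invented for illustration.
documents = [
    {"id": "doc-001", "text_type": "Journalistic texts",
     "original_language": "NL", "institution": "Profit",
     "method": "Human"},
    {"id": "doc-002", "text_type": "Journalistic texts",
     "original_language": "FR", "institution": "Profit",
     "method": "Memory"},
    {"id": "doc-003", "text_type": "Administrative texts",
     "original_language": "NL", "institution": "Non-profit",
     "method": "Human"},
]


def select(records, **criteria):
    """Return the records whose metadata match every given criterion."""
    return [
        record for record in records
        if all(record.get(field) == value
               for field, value in criteria.items())
    ]


# Journalistic texts originally written in Dutch.
subset = select(documents,
                text_type="Journalistic texts",
                original_language="NL")
print([record["id"] for record in subset])  # ['doc-001']
```

Adding a criterion can only narrow the selection, which is why metadata selection works well as a first step before the actual corpus query.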
31. ns, Pensions, Benefits
10. Copyright: IPR agreement, Full version, Light version, Short version, Letter or e-mail with permission
11. Type of institution: Profit, Non-profit
12. Intended audience: Broad external audience, Limited internal audience, Specialist audience

Translation-related data
13. Original Text & Language: EN, FR, NL, Unknown
14. Translated Text & Language: EN NL FR NL EN FR NL
15. Intermediate Language: EN, FR, NL, Unknown
Memory, Machine, Unknown

Statistics
17. Number of words
18. Number of sentences

Extra
19. Subdocuments

Appendix Three: PoS tags
In the following tables you see the frequency of some PoS tags in the three DPC languages (Dutch, English, French). For Dutch and French the PoS tags have been truncated: the subcategory labels have been stripped from the category label. In French, for example, the PoS code Ncfs (nom commun féminin singulier) has been truncated to the main category N (noun). In Dutch, for example, the PoS code VNW(aanw,adv-pron,stan,red,3,getal) has been truncated to the main category VNW (pronoun). In the case of English, all PoS tags are shown, with the exclusion of those tags referring to punctuation marks or other word-delimiting codes.
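The truncation of detailed PoS codes to their main category can be sketched as follows. The two rules used here are assumptions inferred from the two examples in the text only: for Dutch, everything before the first opening parenthesis is taken as the head tag; for French, the first character is taken as the main category. The actual procedure used to build the frequency tables is not documented here.

```python
from collections import Counter


def head_tag_dutch(tag):
    """Keep the category label before the parenthesised features."""
    return tag.split("(", 1)[0]


def head_tag_french(tag):
    """Keep the first character, assumed to encode the main category."""
    return tag[:1]


def head_tag_frequencies(tags, truncate):
    """Count how often each head tag occurs in a sequence of full tags."""
    return Counter(truncate(tag) for tag in tags)


print(head_tag_dutch("VNW(aanw,adv-pron,stan,red,3,getal)"))  # VNW
print(head_tag_french("Ncfs"))                                # N
```

Counting head tags over a tagged text then reduces to a call such as `head_tag_frequencies(tags, head_tag_dutch)`.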
32. ous examples. The most recent version of this text is available online at the following address: http://www.kuleuven-kortrijk.be/dpc/manual

Acknowledgements
DPC is a research project that was financed by the Nederlandse Taalunie (Dutch Language Union) within the framework of the STEVIN programme (a Dutch acronym for Essential Speech and Language Technology Resources), a multi-year programme stimulating research in Dutch language and speech technology. DPC was created by a Flemish consortium: K.U.Leuven Campus Kortrijk and the Faculty of Translation Studies of the University College Ghent. The core team was assisted by a number of research partners with expertise in different domains: data-driven machine learning tools, linguistic annotation, alignment and corpus exploitation. The core team collaborated closely with other STEVIN projects (D-Coi, SoNaR and Lassy). To make sure that the corpus fulfils the needs of the different intended users, a user group was composed, representing specialists of the different application and research domains. The user group consists of industrial and academic partners. The following researchers contributed to DPC: Piet Desmet, Willy Vandeweghe, Hans Paulussen, Lieve Macken, Maribel Montero Perez, Orphée De Clercq, Lidia Rura, Julia Trushkina and Antoine Besnehard.

(Footnote: http://taalunieversum.org/taal/technologie/stevin)

2 Corpus Design
DPC consists of two language pairs (Dutch-English and Dutch-French) and
33. rs Glossaire pdf
IDF: International Dictionary of Finance, The Economist Books
LEXMED: Lexique de terminologie médicale, http://georges.dolisi.free.fr/Terminologie/Menu/terminologie_medicale_menu.htm
MEDREF: Medical Reflex, http://www.medicalreflex.fr/grand-public
MEDSAN: Lexique Médical, Médecine et Santé, http://www.medecine-et-sante.com/lexique.html
MWENNE: Medisch woordenboek E-N/N-E, Mostert
PINK: Pinkhof Geneeskundig Woordenboek
TAALVL: Taalvlinder, http://www.taalvlinder.com/pages/medici.htm
UBSLEXFR: Lexique bancaire UBS, http://www.ubs.com/1/f/about/bterms.html
Woordenboek geneeskunde E-N/N-E, Kerkhof
ZIEK: Ziekenhuis, http://www.ziekenhuis.nl/index.php
Table 5: Consulted reference works

4 Exploitation
The corpus can be exploited either as a full-text resource or through a web search interface. This chapter focuses on the web interface that was developed for the different users who want to search the corpus. The web interface was developed by Geert Peeters & Serge Verlinde (ILT, Leuven); it was designed for users with some knowledge of Dutch.

4.1 Monolingual vs. Bilingual search
The first choice a user can make is whether he or she wants to perform a monolingual or a bilingual search.
[Screenshot: selection of the search type ("Monolingual", "make your choice")]

4.2 Full corpus vs. Subcorpus
Secondly, the user can choose whether he or she would like to search the entire corpus or
34. s split into sequences of words. All punctuation marks not belonging to the word form, i.e. punctuation marks that are not part of an abbreviation, are stripped off. Differences between the tokenization procedures for the three languages are related to the tagging tools used. The treatment of certain punctuation marks required a different approach depending on the language. An example is the treatment of the possessive marker 's in English and Dutch. According to the conventions of the English part-of-speech taggers, the possessive marker 's is split off during tokenization and a separate PoS tag is assigned. The conventions of the Dutch part-of-speech tagger, on the other hand, do not require the possessive marker to be split off during tokenization, as the possessiveness of the noun is coded in the PoS tag.

(Footnote: EPAR stands for European Public Assessment Report; this text type includes patient information leaflets.)

Tokenization for Dutch was performed by the D-Coi tokenizer; for English a slightly adapted version was used. The French data was tokenized with the help of an adapted version of the French tokenizer scripts, which are part of the TreeTagger program documentation. After the linguists had manually checked 10% of the output, the development of spot-checking modules to tokenize the remaining 90% of the corpus turned out to be superfluous, as the tokenizers could proceed automatically given slight adaptations.

3.5.2 Lemmatizat
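The language-dependent treatment of the possessive marker 's described in the tokenization section can be illustrated with a toy tokenizer. This is a deliberately simplified sketch and not the D-Coi or TreeTagger tokenizer: it splits on whitespace only, ignores all other punctuation, and the `split_possessive` parameter is an invented switch standing in for the per-language conventions. It also does not distinguish the possessive 's from contractions such as "it's".

```python
import re

# A word ending in 's, captured as (stem, marker).
POSSESSIVE = re.compile(r"^(.+)('s)$")


def tokenize(sentence, split_possessive):
    """Whitespace tokenizer with optional splitting of the 's marker."""
    tokens = []
    for word in sentence.split():
        match = POSSESSIVE.match(word)
        if split_possessive and match:
            # English convention: 's becomes a token of its own.
            tokens.extend([match.group(1), match.group(2)])
        else:
            # Dutch convention: the marker stays attached to the noun.
            tokens.append(word)
    return tokens


print(tokenize("the president's speech", split_possessive=True))
# ['the', 'president', "'s", 'speech']
print(tokenize("Anna's boek", split_possessive=False))
# ["Anna's", 'boek']
```

With the marker split off, an English tagger can assign it its own tag (POS in the English tag list of Appendix Three), whereas for Dutch the possessive information has to be carried by the tag of the noun itself.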
