
Applications of software in the compilation of corpora

1. [Screen display, continued: the tagged opening of Beowulf in the background, with the pop-up window of the word index listing word forms and their frequencies, among them Heorogar, heresped, hie, him, hine, his, holm, hringedstefna and hronrade. Search; Abort: Esc. Index file: BEOWULF.WDX]

This type of word index could be created in advance by the compilers of a corpus and supplied on the distribution medium, thus obviating the need for users to generate such lists themselves. Given the greatly increased storage capacity of media such as CD-ROM disks, the additional space required for such index files should not be a deterrent to offering them with the primary text files of a corpus.

6 Lexical databases

Closely related to word i
2. An eek in what array that they were Inne

The second major type has the keyword in a separate column to the left, with the text line from which it is taken following it. Here one is dealing with a KWOC or keyword-out-of-context file. Both file types can be generated quite easily with Lexa. Furthermore, the information of a concordance file can be transferred to a database environment to enable users to avail of the additional manipulative power of the latter type of file.

Keyword out of context (KWOC) file

            33  And made forward erly for to ryse
            34  To take oure wey ther as I yow deuyse
            37  Me thynketh it acordaunt to resoun
            38  To telle yow al the condicioun
    Wel
            24  Wel nyne and twenty in a compaignye
            29  And wel we weren esed atte beste
    Whan
             1  Whan that Aprill with hise shoures soote
            30  And shortly whan the sonne was to reste
    What
             5  What Zephirus eek with his sweete breath
            18  That hem hath holpen what that they were seeke
            40  An whiche they were and of what degree
            41  An eek in what array that they were Inne
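The KWOC layout shown above can be approximated in a few lines of code. What follows is a minimal sketch under stated assumptions, not Lexa's own procedure: the function name `kwoc` and the tokenisation rule (runs of letters, compared without regard to case) are hypothetical choices made for illustration.

```python
import re
from collections import defaultdict


def kwoc(lines, keywords):
    """Group text lines under each keyword they contain (keyword out of context)."""
    wanted = {k.lower() for k in keywords}
    index = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        seen = set()
        for word in re.findall(r"[A-Za-z]+", line):
            key = word.lower()
            # List a line only once per keyword, even if the keyword recurs in it.
            if key in wanted and key not in seen:
                seen.add(key)
                index[key].append((number, line))
    return dict(index)


text = [
    "Whan that Aprill with hise shoures soote",
    "The droghte of March hath perced to the roote",
    "And bathed euery veyne in swich licour",
    "Of which vertu engendred is the flour",
    "Whan Zephirus eek with his sweete breath",
]

for keyword, hits in sorted(kwoc(text, ["whan", "which"]).items()):
    print(keyword)
    for number, line in hits:
        print(f"    {number:>4}  {line}")
```

Matching whole word forms rather than substrings keeps "swich" from being listed under "which"; a real concordancer would also need a policy for spelling variants.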
3. +tah o+d +t+at him +aghwylc +tara ymbsittendra ofer hronrade hyran scolde gomban gyldan
    <R 11>
    +t+at w+as god cyning
    <R 12>
    +d+am eafera w+as +after cenned geong in geardum +tone god sende folce to frofre fyren+dearfe ongeat +te hie +ar drugon aldorlease lange hwile
    <R 16>
    Him +t+as liffrea wuldres wealdend woroldare forgeaf Beowulf w+as breme bl+ad wide sprang Scyldes eafera Scedelandum in

If one takes this text, however, and runs it through the programme Make Symbols, which is supplied in the Lexa suite, then a number of substitutions are made and certain high ASCII characters are inserted where escape sequences were found in an input text. Now, under the important assumption that (i) you are using a computer with a colour monitor (typically a VGA video adapter based system) and (ii) that you have loaded the supplied Old/Middle English font of the Lexa suite, then the stretch of text printed above should now look like the following.

Beginning of Beowulf with Old English characters

    BEOWULF
    <R 1>
    Hwæt We Gardena in geardagum þeodcyninga þrym gefrunon hu ða æþelingas ellen fremedon
    <R 4>
    Oft Scyld Scefing sceaþena þreatum monegum mægþum meodosetla ofteah egsode eorlas
    <R 6>
    Syððan ærest wearð feasceaft funden he þæs frofre gebad weox under wolcnum weorðmyndum þah oð þæt him æghwylc þara ymbsittendra ofer hronrade
4.  15  And specially fram euery shires ende
    16  Of Engelond to Cauntenbury they wende
    17  The hooly blisful martir for to seke
    18  That hem hath holpen whan that they were seeke
    19  Bifil that in that seson on a day
    20  In Southwerk at the Tabard as I lay
    21  Redy to wenden on my pilgrymage
    22  To Caunterbury with ful deuout corage
    Space: Txt1/Txt2   Tab: Split screen   Shift-Tab: Menu

This comparison facility does not allow you to alter the contents of a text. Should you wish to check on and edit two texts at once, then you can use the similar comparison option in the Lexa suite text editor, Lexa Text.

2 Normalisation

While the critical editions of texts in printed form strive to be accurate in the inclusion of variants, e.g. in the edition of a work attested in different manuscripts, for the electronic form of a text a normalised version may have very definite advantages in terms of readability, not to say accessibility, particularly with older texts or those representing a dialectally divergent language variety. In essence, the process of normalisation consists of replacing variants of a grammatical form by a single form, chosen either by external consensus (e.g. as the latter is the input to a later standard form, or indeed this itself) or by a justifiable decision of the corpus compilers. Despite the almost ideological dislike of normali
5. Furthermore, online searching of texts is possible. Searching can be on a global level, encompassing all texts of a corpus and including unspecified elements in search strings by the use of wild cards. Texts can be extracted from a corpus and printed separately if required. The corpus manager is particularly suitable for those corpora which consist of many small parts and where there is a hierarchical relation between these.

Technically, to adapt a corpus for use with the programme Corpus Manager, one must create a single file, then place level markers at strategic points in this file, indicating where the breaks are in the text, so to speak. A special text editor will carry out the task of marker placement quite easily. The programme then indexes the file which has been prepared in this fashion. Once this has been done, one can consult the corpus, now in the form of a text database, at will. There is no restriction on the number of text databases, so that one could conceivably divide a corpus into several blocks, each with an internal structure determined by the compilers of the corpus. To illustrate this technique, there is a text database of letters from the history of English included as part of the Lexa suite.

9 Conclusion

The present sketch is intended to offer the int
6. seeke
    Bifil that in_PREP that seson on_PREP a_ART day
    In_PREP Southwerk at the_ART Tabard as I lay
    Redy to wenden on_PREP my_POSS_PRO pilgrymage_ROMANCE
    To Caunterbury with ful deuout corage_ROMANCE
    Alt-1   Alt-2   Alt-3   Alt-4   F7 End

Tagging texts before their distribution is something which later users may view as a linguistic straitjacket, as it imposes the grammatical classification scheme of the compilers on the user. Seeing that there is tagging software available, many compilers may now prefer to leave this work to the corpus users or to some sub-group, such as researchers in another university who would be prepared to carry out this task. As universities have to economise on resources, tagging by the compilers is likely to become less common in future, especially as partial tagging is not viewed as a sensible course of action. You either tag completely or not at all. If you decide to do so, you may bind your capacities in a manner which you come to regret later. This would appear, at least, to be the case for major projects like the Helsinki corpus. With the arrival of smaller, more specialised corpora, tagging may become feasible, particularly if it is directly connected with the research interests of the corpus compilers.

4 Using Cocoa headers

Independent of the question of whether to tag or not to tag, compilers of a corpus should consider whether it would be of avail
7. the three volumes of documentation are available from the Norwegian Computing Centre for the Humanities in Bergen, Norway.

Each of the following sections is intended to illustrate a typical situation in which software is useful in the preparatory stage of corpus building. The list is not exhaustive, but it does cover the main areas of concern in this phase of text collection and organisation.

1 Text collation

It is safe to assume that more than one individual will be involved in the compilation of a text corpus. Texts will either be scanned or keyed in directly. In either case it is more the exception than the rule to find that a text turns up error-free in the computer. This banal fact increases the status of the individual who is responsible for text correction. Again, it is commonplace for more than one version of a text to exist in some intermediary stage of compilation. Sooner or later in such a situation doubts arise as to whether a particular version of a text is the more accurate or the better corrected. The need arises quite quickly for a reliable means of comparing two versions of a single text. Of course, the time and date stamp of a file on the operating system level will tell you which of two is the more recent, but age is not a guarantee of correctness. To resolve this dilemma, a programme has been included in the Lexa suite which will compare two files with each other byte for byte an
8. to future users to include some information on the nature of the texts before distributing these. This decision has fortunately been made in favour of supplying such information by the compilers of the Helsinki corpus. The format they have chosen for the inclusion of text-relevant information is what is commonly known as the Cocoa header. Note that header information is placed at the top of a file and has nothing to do with grammatical classifications included in the body of a text.

Parameters of the Cocoa header

     1  <B name of text file>
     2  <Q text identifier>
     3  <N name of text>
     4  <A author>
     5  <C part of corpus>
     6  <O date of original>
     7  <M date of manuscript>
     8  <K contemporaneity>
     9  <D dialect>
    10  <V verse or prose>
    11  <T text type>
    12  <G relation to foreign original>
    13  <F foreign original>
    14  <W relation to spoken language>
    15  <X sex of author>
    16  <Y age of author>
    17  <H social rank of author>
    18  <U audience description>
    19  <E participant relation>
    20  <J interaction>
    21  <I setting>
    22  <Z prototypical text category>
    23  <S sample>
    24  <P page>
    25  <L line>
    26  <R record>

Although providing headers is not comparable to the task of tagging a corpus, it nonetheless requires an ad
9. which have the shapes of the Old and Middle English characters in any set of input texts at those points where it encounters an escape sequence, e.g. it inserts the yogh symbol when it hits on +g. The conversion is reversible, so that texts can be restored to their original form if desired. The numerical values of the redefined characters with Old and Middle English shapes are as follows.

    Escape     Actual      Letter name            ASCII numerical value for redefinition
    sequence   character                          by Lexa programme Make Symbols
    +a         æ           ash (l.c.)             145
    +A         Æ           Ash (u.c.)             146
    +d         ð           eth (l.c.)             253
    +D         Ð           Eth (u.c.)             252
    +g         ȝ           yogh (l.c.)            243
    +G         Ȝ           Yogh (u.c.)            242
    +t         þ           thorn (l.c.)           245
    +T         Þ           Thorn (u.c.)           244
    +tt                    crossed thorn          248
    +TT                    crossed Thorn          246
    +e                     e caudata              144
                           pound sign             156

Taking a typical text such as Beowulf and loading it with one's text editor or word processor leads to one being presented with a text which is convenient for computer manipulation but hardly readable to the Old English scholar.

Beginning of Beowulf with escape sequence coding

    BEOWULF
    <R 1>
    Hw+at We Gardena in geardagum +teodcyninga +trym gefrunon hu +da +a+telingas ellen fremedon
    <R 4>
    Oft Scyld Scefing scea+tena +treatum monegum m+ag+tum meodosetla ofteah egsode eorlas
    <R 6>
    Sy+d+dan +arest wear+d feasceaft funden he +t+as frofre gebad weox under wolcnum weor+dmyndum
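The substitution performed on escape sequences, and its reversal, can be sketched as follows. This is a minimal illustration under two assumptions, not the Make Symbols programme itself: escape sequences are written with a leading "+" (as in the Helsinki corpus coding), and the targets are Unicode characters rather than redefined positions in a screen font. The names `ESCAPES`, `decode` and `encode` are hypothetical.

```python
import re

# Escape sequence -> Unicode character (an assumption of this sketch;
# Make Symbols redefines character shapes in a screen font instead).
ESCAPES = {
    "+a": "\u00e6", "+A": "\u00c6",    # ash
    "+d": "\u00f0", "+D": "\u00d0",    # eth
    "+g": "\u021d", "+G": "\u021c",    # yogh
    "+t": "\u00fe", "+T": "\u00de",    # thorn
    "+tt": "\ua765", "+TT": "\ua764",  # crossed thorn
    "+e": "\u0119",                    # e caudata
}
CHARACTERS = {char: seq for seq, char in ESCAPES.items()}

# Longer sequences come first so that "+tt" is not read as "+t" followed by "t".
_SEQ_PATTERN = re.compile(
    "|".join(re.escape(s) for s in sorted(ESCAPES, key=len, reverse=True))
)
_CHAR_PATTERN = re.compile("[" + "".join(CHARACTERS) + "]")


def decode(text):
    """Replace escape sequences by single characters."""
    return _SEQ_PATTERN.sub(lambda m: ESCAPES[m.group(0)], text)


def encode(text):
    """Restore the escape sequence coding."""
    return _CHAR_PATTERN.sub(lambda m: CHARACTERS[m.group(0)], text)


coded = "Hw+at We Gardena in geardagum +teodcyninga +trym gefrunon"
print(decode(coded))
print(encode(decode(coded)) == coded)
```

Keeping one table and deriving the reverse mapping from it is what makes the conversion reversible: every character inserted by `decode` has exactly one escape sequence to return to.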
10. Applications of software in the compilation of corpora

Raymond Hickey
Department of English, University of Munich

Abstract

An attempt is made here to sketch some of the applications to which corpus processing software can be put in the compilation of corpora. The emphasis is, on the one hand, on the automation of many standard processes such as text collation and the provision of header information for each file of a corpus, while on the other hand the additional possibilities offered by dedicated corpus software are described. Among the latter, special emphasis is put on the transfer of textual data to a database environment for further processing. Further matters such as the use of special fonts for older stages of English and the option of organising the text files of one's corpus for potential users in advance are also discussed.

0 Introduction

Given the nature of the contributions to this volume, the present author thought it appropriate to discuss the uses to which corpus processing software could be put in the compilation and distribution of corpora, especially ones with a diachronic orientation. Assuming that the compilers of a corpus have reached basic agreement on what periods are to be covered and what texts are to be included, software can be used gainfully from this point onwards. To illustrate possible applications, the software system Lexa, developed by the present author, will be used for the ensuing discussion. This programme suite and
11. Font considerations

The corpora and corpus projects under construction which are presented in the present volume all refer to diachronic English. If the time span in a particular instance stretches back far enough, then texts will involve special characters for Middle and Old English. The practice with historical corpora has been to represent special symbols of historical stages of the language by using so-called escape sequences. For instance, the Old English character thorn is represented by +t in the Helsinki corpus, the eth symbol is indicated by +d, and so forth. This encoding has the advantage of portability. The corpus texts only include characters with numeric values between 32 and 126 in the ASCII set and are transferable to and readable on computer systems operating on a so-called 7-bit basis. The obvious disadvantage is that readability drops drastically with older texts. Something like Beowulf is undecipherable in the escape sequence form.

A practical solution to this problem, presented in the Lexa suite, is to use a supplied programme to convert the sequences to single characters with the correct shapes, so that an Old English text on screen looks more or less identical to one in printed form. The scheme devised by the present author utilises the ability of personal computers with colour monitors to display characters with customised shapes on screen. The programme which makes the alterations inserts the redefined symbols of the screen
12. [DbTrans screen display, continued: counters for the number of replacements and of matching files; a log of the substitutions made in line 13 (has > HAS), line 19 (have > HAVE) and line 22 (had > HAD); the messages "DbTrans successfully executed", "Press any key" and "Press <Escape> to abort operation"]

Needless to say, any normalisation procedure of any reasonable extent will require far more records specifying many more substitutions. This is, however, not a matter of principle but of arranging a suitable database. Again, the advantage of normalisation via a translation programme is that the original version of a text is left unimpaired. Furthermore, the normalisation can in fact be carried out by the user of the corpus if he or she so wishes, thus releasing the compiler from the arduous task of generating comprehensive databases for normalisation tasks.

3 Pre-tagging texts

A major decision which the compilers of a corpus have to take is whether the texts of their corpus are to contain any kind of grammatical information. If this decision is made in favour of including such information, then a conside
13. aldend woroldare forgeaf
    Beowulf w+as_BE_PAST breme bl+ad wide sprang Scyldes
    eafera Scedelandum in
    <R 20>
    Alt-1   Alt-2   Alt-3   Alt-4   F7 End   F10 Save   Shift-Tab Menus

Opening of the Canterbury Tales in tagged form

    C:\LEXA\DEMO\CHAUCER.LEM   Line 1   Col 1   Page 1   Text 1
    Whan that Aprill with hise shoures_PLURAL soote
    the_ART droghte of_PREP March hath_HAVE perced to the_ART roote
    And bathed euery_IND_PRO veyne in_PREP swich licour
    Of_PREP which vertu engendred is the_ART flour
    What Zephirus eek with his_POSS_PRO sweete breath
    Inspired hath_HAVE in_PREP euery_IND_PRO hold and heeth
    The_ART tendre croppes_PLURAL and the_ART yonge_PAST_PART sonne
    Hath_HAVE in_PREP the_ART Ram his_POSS_PRO half cours yronne_PAST_PART
    And smale foweles_PLURAL maken melodye
    That slepen al the_ART nyght_X_PHON with open eye
    So priketh hem nature in_PREP hir_POSS_PRO corages_PLURAL
    Thanne longen folk to geen on_PREP pilgrimages_PLURAL
    And Palmeres_PLURAL for_PREP to seken straunge strondes_PLURAL
    To ferne halwes_PLURAL kowthe in_PREP sondry londes_PLURAL
    And specially fram euery_IND_PRO shires_PLURAL ende
    Of_PREP Engelond to Cauntenbury they wende
    The_ART hooly blisful martir for_PREP to seke
    That hem hath_HAVE holpen what that they were_COPULA
14. c. can be excluded from the tagging process.

A tagging operation on two input texts, the opening lines of Beowulf and of Chaucer's Canterbury Tales, produced the following results. Note that for the purposes of illustration only a selection of tags was specified, which means that only a small number of word forms are actually tagged.

Opening of Beowulf in a tagged form

    C:\LEXA\DEMO\BEOWULF.LEM   Line 58   Col 1   Page 1   Text 1
    Hw+at_INTERROG
    We Gardena in geardagum +teodcyninga +trym gefrunon hu_CONJUNC +da
    +a+telingas ellen fremedon
    <R 4>
    Oft Scyld Scefing scea+tena +treatum monegum m+ag+tum
    meodosetla ofteah egsode eorlas
    <R 6>
    Sy+d+dan_CONJUNC +arest wear+d feasceaft funden he +t+as frofre
    gebad weox under wolcnum weor+dmyndum +tah o+d +t+at
    him_INFL_PRO +aghwylc +tara ymbsittendra ofer_LOCATIVE hronrade hyran
    scolde gomban gyldan
    <R 11>
    +t+at w+as_BE_PAST god cyning
    <R 12>
    +d+am eafera w+as_BE_PAST +after_TEMPORAL cenned geong in geardum
    +tone god sende folce to frofre fyren+dearfe ongeat +te hie
    +ar drugon aldorlease lange hwile
    <R 16>
    Him_INFL_PRO +t+as liffrea wuldres we
15. d give information on grammatical endings. If you sort a lexical database alphabetically on the field REVERSE, then you end up with the records ordered according to the end and not the beginning of the words in the field TOKEN.

    DB 1   Field 1   Col 1   Rec 119/272
    TOKEN       C 32   hine
    LEMMA       C 16   INFL_PRO
    FREQUENCY   N 6
    REVERSE     C 32   enih
    PgUp   PgDn   G Goto   F1 Help   Alt-F7 Desktop   F10 Browse   Shift-Tab Menus

Whether compilers of a corpus would feel like including lexical databases for their primary text files depends on the status of lexical analysis on their own research horizon. Once more, the question of the space which such files would occupy diminishes when one considers the large capacity of storage media nowadays.

7 Concordance files

Looking at words in isolation is one aspect of lexical analysis. Another, which stresses contextualisation, is the viewing of words in context. Here one is dealing with a file type which, while not very sophisticated from a computing point of view, nonetheless has its justification in providing valuable information to users of a corpus. The main programme Lexa can once more be employed to generate the type of file in question. There are basically two types of concordance: one in which the keyword being considered is displayed in the centre of a te
16. d report the differences on screen and write these to a file if required. With Lexa Compare one loads two text files from a directory listing and then, on pressing a dedicated key, the programme begins a comparison of the two. In the following screen print-outs two versions of the opening lines of Chaucer's Canterbury Tales are displayed. The version CHAUC_1.TXT contains four errors: what is written for whan twice, kouthe for kowthe and season for Chaucer's spelling seson. These errors are highlighted on the screen and the user can immediately recognise which of the versions is the more correct. Errors will be detected anywhere up to the end of the files chosen. If these are identical, you are informed of this.

Less correct version of text

    File 1: CHAUC_1.TXT   35 071   06-15-93   12:47   Offset 893
     1  Whan that Aprill with hise shoures soote
     2  the droghte of March hath perced to the roote
     3  And bathed euery veyne in swich licour
     4  Of which vertu engendred is the flour
     5  What Zephirus eek with his sweete breath
     6  Inspired hath in euery hold and heeth
     7  The tendre croppes and the yonge sonne
     8  Hath in the Ram his half cours yronne
     9  And smale foweles maken melodye
    10  That slepen al the nyght with open eye
    11  So priketh hem nature in hir corages
    12  Thanne longen folk to geen on pilgrimages
    13  And Palmeres
17. ditional amount of work to specify the values for these parameters for the texts of a corpus. The advantages, however, are considerable. With the Lexa suite the contents of a Cocoa header can be accessed by the information retrieval software. This is realised as follows: a programme called Cocoa extracts the header information from any set of input files and deposits this in a database. Then, with the database manager of the suite, called DbStat, one can load the database just created and impose a filter on it. By this is meant that only those records remain visible which match a certain user-specified condition.

Assuming that one generates a database of the Cocoa header information in the files of the Helsinki corpus and loads the database manager, then one could specify a filter to which only those records (i.e. file headers in database form) correspond which represent translations (Item 13) of Middle English (Item 6) prose (Item 10) texts. A list of the files for which this header information obtains can be generated by creating a list from the field information for Item 1 (name of text file). The list file created by these steps can in its turn be used as the source of the file names for an information retrieval operation with other parts of the Lexa suite, so that only Middle English prose translations from the corpus are examined. In addition, the user can specify with the retrieval programmes fro
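The sequence of steps described above (extract the headers, store them as records, filter, list the file names) can be sketched as follows. This is a hedged illustration, not the Cocoa or DbStat programmes: the header syntax `<KEY value>`, the sample values (TEXT_A, PROSE, LATIN, NONE and the date ranges) and the year limits used for "Middle English" are all assumptions made for the example.

```python
import re

# The 23 text-level parameters; P, L and R are treated here as markers
# belonging to the body of a text (an assumption of this sketch).
HEADER_KEYS = set("BQNACOMKDVTGFWXYHUEJIZS")
HEADER_LINE = re.compile(r"^<([A-Z])\s+(.*?)>\s*$")


def read_header(lines):
    """Collect header parameters from the top of a file into one record."""
    record = {}
    for line in lines:
        match = HEADER_LINE.match(line)
        if not match or match.group(1) not in HEADER_KEYS:
            break  # the header ends at the first line that is not a parameter
        record[match.group(1)] = match.group(2)
    return record


def is_me_prose_translation(record):
    """Filter: date of original (O), verse or prose (V), foreign original (F)."""
    start_year = int(record["O"].split("-")[0])
    return 1150 <= start_year < 1500 and record["V"] == "PROSE" and record["F"] != "NONE"


# Hypothetical files with invented header values.
files = {
    "TEXT_A": ["<B TEXT_A>", "<O 1350-1420>", "<V PROSE>", "<F LATIN>", "<R 1>"],
    "TEXT_B": ["<B TEXT_B>", "<O 950-1050>", "<V PROSE>", "<F LATIN>", "<R 1>"],
    "TEXT_C": ["<B TEXT_C>", "<O 1350-1420>", "<V VERSE>", "<F NONE>", "<R 1>"],
    "TEXT_D": ["<B TEXT_D>", "<O 1420-1500>", "<V PROSE>", "<F NONE>", "<R 1>"],
}

database = [read_header(lines) for lines in files.values()]
file_list = [record["B"] for record in database if is_me_prose_translation(record)]
print(file_list)
```

The resulting list of file names plays the role of the list file mentioned in the text: it can be handed to any later retrieval step as its set of input files.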
18. e bound to be additional advantages to be accrued from looking at the contents of a corpus on the computer rather than in printed form. To this end the programme Corpus Manager has been designed and included in the Lexa suite. Basically, what the corpus manager does is to provide one with a table of contents for a corpus with up to three levels of depth. There is a contents window for each level, and the user can choose any text and view it by selecting it in the current window.

    F7 Desktop   F9 Settings

    Level 1 (4 of 19)      Level 2 (1 of 2)      Level 3 (1 of 3)
     1  Overview           13  Thomas More       14  Letter to wife
        Beaumont               Margaret Roper    15  Letter to daughter
     4  Plumpton                                 16  Letters to M. Roper
    12  More et al
    19  Cromwel
    21  Cumberl
    26  Knyvett
    28  Harley
    30  Paston
    33  Ferrar
    36  Barring
    45  Proud
    56  Gawdy
    58  Haddock
    63  Strype
    65  Oxinden
    68  Hatton
    74  Pinney
    78  Henry

    View   Esc   Back track   Forward
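A three-level table of contents of this kind can be derived from a single file in which level markers indicate the breaks. The sketch below assumes a hypothetical marker syntax (a line beginning with "@1", "@2" or "@3" followed by a title); the marker format actually used by Corpus Manager is not given in this paper, and the function names are likewise invented for illustration.

```python
def build_index(lines, max_depth=3):
    """Return (level, title, position) for every level marker in the file."""
    entries = []
    for position, line in enumerate(lines):
        if line.startswith("@") and line[1:2].isdigit():
            level = int(line[1])
            if 1 <= level <= max_depth:
                entries.append((level, line[2:].strip(), position))
    return entries


def extract(lines, entries, i):
    """Return the lines belonging to entry i, up to the next marker of equal or higher rank."""
    level, _, start = entries[i]
    end = len(lines)
    for other_level, _, position in entries[i + 1:]:
        if other_level <= level:
            end = position
            break
    return lines[start + 1:end]


# Hypothetical single-file corpus; the letter texts are placeholders.
corpus = [
    "@1 More et al",
    "@2 Thomas More",
    "@3 Letter to wife",
    "(text of first letter)",
    "@3 Letter to daughter",
    "(text of second letter)",
    "@2 Margaret Roper",
    "@1 Paston",
]

index = build_index(corpus)
for level, title, _ in index:
    print("    " * (level - 1) + title)
print(extract(corpus, index, 2))
```

Because the index stores only positions, the file is scanned once and any text can afterwards be extracted without searching, which is the point of indexing the prepared file in advance.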
19. erest linguist engaged in corpus compilation an idea of what can be done both to facilitate the process of compilation for him/herself and to increase the gainful use to which the corpus can then be put once it is completed. In all instances the programmes of the Lexa suite do not have to be altered in any major way, apart from the operator of the programmes creating his/her own configuration (in some instances this is standard procedure). The net result is a greater degree of automation for many processes, which represents a saving in resources; this in turn renders many a task more feasible and brings forward the distribution date for many an interesting corpus.

References

Hickey, Raymond 1993a. Lexa. Corpus Processing Software. 3 Vols. Vol. 1: Lexical Analysis. Vol. 2: Database and Corpus Management. Vol. 3: Utility Library. Bergen: Norwegian Computing Centre for the Humanities.
Hickey, Raymond 1993b. Corpus data processing with Lexa. ICAME Journal 17: 73-96.
Hockey, Susan and Ian Marriott 1980. Oxford Concordance Program. Users' manual. Oxford: Oxford University Computing Service.
Johansson, Stig 1986. The tagged LOB Corpus. User's manual. Bergen: Norwegian Computing Centre for the Humanities.
Johansson, Stig and Anna-Brita Stenström (eds) 1991. English computer corpora. Selected papers and research guide. Berlin: Mouton de Gruyter.
Kytö, Merja, Ossi Ihalainen and Matti Rissane
20. for to seken straunge strondes
    14  To ferne halwes kouthe in sondry londes
    15  And specially fram euery shires ende
    16  Of Engelond to Cauntenbury they wende
    17  The hooly blisful martir for to seke
    18  That hem hath holpen what that they were seeke
    19  Bifil than in that seeson on a day
    20  In Southwerk at the Tabard as I lay
    21  Redy to wenden on my pilgrymage
    22  To Caunterbury with ful deuout corage
    Space: Txt1/Txt2   Tab: Split screen   Shift-Tab: Menu

More correct version of text

    File 2: CHAUC_2.TXT   35 071   06-15-93   12:45   Offset 893
     1  Whan that Aprill with hise shoures soote
     2  the droghte of March hath perced to the roote
     3  And bathed euery veyne in swich licour
     4  Of which vertu engendred is the flour
     5  Whan Zephirus eek with his sweete breath
     6  Inspired hath in euery hold and heeth
     7  The tendre croppes and the yonge sonne
     8  Hath in the Ram his half cours yronne
     9  And smale foweles maken melodye
    10  That slepen al the nyght with open eye
    11  So priketh hem nature in hir corages
    12  Thanne longen folk to geen on pilgrimages
    13  And Palmeres for to seken straunge strondes
    14  To ferne halwes kowthe in sondry londes
21. hyran scolde gomban gyldan
    <R 11>
    þæt wæs god cyning
    <R 12>
    ðæm eafera wæs æfter cenned geong in geardum þone god sende folce to frofre fyrenðearfe ongeat þe hie ær drugon aldorlease lange hwile
    <R 16>
    Him þæs liffrea wuldres wealdend woroldare forgeaf Beowulf wæs breme blæd wide sprang Scyldes eafera Scedelandum in

In the interests of a unified system, prospective compilers of diachronic corpora are advised to keep the encoding system used by the Helsinki corpus for special characters. If they do, then the software already available for the latter corpus can be used without any alteration in a newer corpus which complies with the original codification scheme. Users of several corpora will only need one special video font, namely that supplied with the Lexa suite, and can then view any text from a selection of corpora without further system adjustment. Note: this coding scheme is also that used by the present author for the medieval texts in the Corpus of Irish English (see elsewhere in this volume for details).

8 Organisational considerations

Once a corpus has been completed, a practical question arises for its potential users: how do they gain an overview of just what the corpus contains? The simple answer with many corpora is to consult the manual. However, given that a corpus is an electronic library of texts, it is surely natural to expect that this question can be answered electronically. Not only that, there ar
22. ler wishes to carry out. The net result of this procedure is an unaltered and non-normalised version of a file or files, along with a database or databases (one per text file) with which the user of a corpus can, if he or she so wishes, generate a normalised version of a text. The programme to use here is called Database Translate (DBTRANS), as it translates an input text into an altered output text on the basis of a database.

Here is a simple example of how this actually works. A database is created entitled NORMAL.DBF. This contains two fields per record, labelled MID_ENG (Middle English) and MOD_ENG (Modern English) respectively. There are six records with forms of the verb have which occur in Middle English. The programme DBTRANS now examines any input text or texts and, if it finds any instances of Middle English forms of have, it replaces these by the modern forms specified in the database NORMAL.DBF.

    Lexa utility: DbTrans   (c) Raymond Hickey

    Terminology database    NORMAL.DBF (6)     Only words     Yes
    Input language field    MID_ENG            Open ended     No
    Output language field   MOD_ENG            Ignore case    No
    Template for files      NORM_INP.TXT       Input files    ASCII
    Current input file      NORM_INP.TXT       Manual oper.   Yes
    String to be located    had
    String to be inserted   HAD
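The principle of translating a text on the basis of a database can be sketched in a few lines. The following is an illustration under stated assumptions rather than the DBTRANS programme: the function `translate` and its parameters are hypothetical, and the three records are invented stand-ins, not the contents of NORMAL.DBF. The options `only_words` and `ignore_case` mirror the settings of the same name on the screen above.

```python
import re


def translate(text, records, only_words=True, ignore_case=False):
    """Return (translated text, number of replacements); the input is left untouched."""
    table = {r["MID_ENG"]: r["MOD_ENG"] for r in records}
    if ignore_case:
        table = {key.lower(): value for key, value in table.items()}
    # Longer forms first, so that a short form never pre-empts a longer one.
    alternatives = "|".join(re.escape(k) for k in sorted(table, key=len, reverse=True))
    pattern = rf"\b(?:{alternatives})\b" if only_words else alternatives
    count = 0

    def replace(match):
        nonlocal count
        count += 1
        found = match.group(0)
        return table[found.lower() if ignore_case else found]

    result = re.sub(pattern, replace, text, flags=re.IGNORECASE if ignore_case else 0)
    return result, count


# Illustrative records only.
records = [
    {"MID_ENG": "hath", "MOD_ENG": "has"},
    {"MID_ENG": "han", "MOD_ENG": "have"},
    {"MID_ENG": "hadde", "MOD_ENG": "had"},
]

line = "That hem hath holpen whan that they were seeke"
print(translate(line, records))
```

With `only_words` switched off, the form han would also be replaced inside whan, which shows why whole-word matching is the safer setting for normalisation.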
23. m the set, such as the pattern matcher Lexa Pat and the programme for locating syntactic contexts, Lexa Context, that the Cocoa information of the files examined be enclosed in the output file of statistics generated during a search. The example just given is typical inasmuch as it illustrates how different parts of the Lexa suite link up together. For any prospective operators of the programme package, be they corpus compilers or users, it is essential to grasp the inter-relationships between items of software.

5 Word indexes

Among the simplest of tasks to carry out with any corpus processing software is the production of lists of unique words from source text files. Despite this obvious simplicity, this type of additional file is very commonly demanded by linguists examining a corpus. To this end the programme Lexa Words in the Lexa suite has been created. The programme takes any text file or set of files and generates a list of all the distinct word forms which occur in the input. Once this list has been created, the user can consult it via a pop-up window in which a sorted list appears together with the frequency of the words noted when generating the list in the first place.

[Screen display: the tagged opening of Beowulf in the background; the pop-up window of the word index, at entry 145 of 280, with the list beginning at heold]
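A sorted list of word forms with their frequencies, of the kind shown in the pop-up window, can be produced as follows. This is a minimal sketch, not the Lexa Words programme: the function name `word_index` is hypothetical, and both the tokenisation rule (runs of letters) and the folding of upper and lower case are assumptions. Texts in escape sequence coding would need a rule that keeps the escape character inside a word.

```python
import re
from collections import Counter


def word_index(texts):
    """Return an alphabetically sorted list of (word form, frequency) pairs."""
    counts = Counter()
    for text in texts:
        # Frequencies accumulate across all texts of the input set.
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return sorted(counts.items())


texts = [
    "Whan that Aprill with hise shoures soote",
    "the droghte of March hath perced to the roote",
]

for word, frequency in word_index(texts):
    print(f"{word:<12}{frequency:>4}")
```

Writing the pairs to a file once, in advance, corresponds to supplying a ready-made index file with a corpus instead of leaving its generation to each user.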
24. n (eds) 1988. Corpus linguistics, hard and soft. Amsterdam: Rodopi.
Kytö, Merja 1991. Manual to the diachronic part of the Helsinki Corpus of English Texts. Helsinki: Department of English.
Kytö, Merja and Matti Rissanen 1988. The Helsinki Corpus of English Texts: Classifying and coding the diachronic part. In: Corpus linguistics, ed. by M. Kytö et al., 169-180. Amsterdam: Rodopi.
25. ndexes is another type of file which is useful in analysing texts lexically. This is what is termed a lexical database. Recall that a database is a type of file in which information is stored in table-like form. Each row of the table is a record and each column holds the contents of one field. With the main programme of the suite under discussion, Lexa, it is possible to derive a database from any input text or texts. By this is meant that you load or specify an input text and then demand of the programme that it extract information on each word, storing this in a record with four fields. The first contains the word form itself, i.e. the token which is found. The second contains the tag, if any, which has been associated with the word form in question. In the third field the frequency of the word in the input text is noted. The frequency is stored cumulatively, which means that if you run the lexical database function on a series of texts, the entry for FREQUENCY is incremented for every find of a particular word form in each text. Thus if you, for example, divided Beowulf into six texts and generated a lexical database for each text using the same output database, then the frequency field for any given form would contain the total number of occurrences in all six texts taken together. In the fourth field the word is deposited in reverse order of spelling. The idea behind this is to allow the creation of reverse order dictionaries which woul
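The four-field record described above can be illustrated with a short Python sketch. Only the field name FREQUENCY comes from the text; the names WORD, TAG and REVERSE, the function update_lexical_db and the assumption that a tag is attached to its word with an underscore (as in was_BE) are hypothetical choices made for the illustration.

```python
def update_lexical_db(db, text):
    """Add the word forms of text to db, one record per distinct form."""
    for token in text.split():
        # Assumption: a tag, if present, follows the first underscore.
        form, _, tag = token.partition("_")
        record = db.setdefault(
            form,
            {"WORD": form, "TAG": tag, "FREQUENCY": 0, "REVERSE": form[::-1]},
        )
        if tag and not record["TAG"]:
            record["TAG"] = tag
        # Cumulative count: repeated calls on the same db keep adding up.
        record["FREQUENCY"] += 1
    return db


db = {}
update_lexical_db(db, "was_BE god cyning")
update_lexical_db(db, "god was_BE")
# Sorting on the reversed spelling gives a reverse order word list.
for record in sorted(db.values(), key=lambda r: r["REVERSE"]):
    print(record["WORD"], record["TAG"], record["FREQUENCY"], record["REVERSE"])
```

Calling the function once per text with the same db mirrors the cumulative behaviour described for a series of input texts.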
26. rable amount of additional work beyond the collection of texts has to be undertaken. Grammatical information is normally included in corpora by tagging word forms, i.e. by adding a label to words identifying them grammatically. If tagging is to be done, then it is only sensible if it is done completely. Quite apart from the actual work of tagging, agreement must be reached in advance on the system of classification to be used. The advantage for users of a corpus is obvious: the retrieval of grammatical information from a text or texts is vastly facilitated if grammatical affiliation has already been specified via tagging. Given the size of the task, it is imperative to use every resource available for accelerating the process. In effect this means employing tagging software for the purpose. In the Lexa suite the main programme, called simply Lexa, is designed to tag texts automatically. The operator of the programme, be he or she a compiler or user of a corpus, must specify what words are to be tagged in what way by creating one or more lemma definition files. The programme takes note of these definitions and then examines any set of input texts, adding tags to words it deems as representing the grammatical classes for which tags exist in the definition file or files. The programme runs in an automatic, semi-automatic or manual mode. Tagging can be done cumulatively, tags can be exchanged or updated and particularly frequent words, prepositions, articles, et
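The principle of tagging from a set of definitions can be sketched as follows in Python. The format of Lexa's lemma definition files is not given here, so the dictionary LEMMA_DEFINITIONS, its tags and word forms, the function tag_text and the underscore separator are all assumptions made purely for illustration; the sketch covers only the automatic mode.

```python
# Hypothetical lemma definitions: each tag is paired with the word forms
# that should receive it. This is not the Lexa definition file format.
LEMMA_DEFINITIONS = {
    "BE_PAST": {"was", "were"},
    "CONJUNC": {"and", "swa"},
}


def tag_text(text, definitions, separator="_"):
    """Append a tag to every untagged word listed in the definitions."""
    form_to_tag = {
        form: tag for tag, forms in definitions.items() for form in forms
    }
    tagged = []
    for token in text.split():
        if separator in token:
            # Already tagged in an earlier pass: leave it untouched, so
            # that tagging can be carried out cumulatively.
            tagged.append(token)
            continue
        tag = form_to_tag.get(token.lower())
        tagged.append(f"{token}{separator}{tag}" if tag else token)
    return " ".join(tagged)


print(tag_text("Beowulf was breme", LEMMA_DEFINITIONS))
```

A second pass with a different set of definitions adds further tags without disturbing those already present.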
27. sation, particularly on the part of medieval scholars, there are obvious advantages to it, as it allows later readers to approach a text or texts without undue linguistic difficulty, to see the wood for the trees, so to speak. In contradistinction to printing, in the compilation of corpora no a priori decision has to be made about whether to distribute a normalised text or not. Instead the corpus should include an original, unaltered form of a text along with the means for users of the corpus to normalise the text later if they so wish. Just what these means are should be explained briefly. To begin with, recall that the process of normalisation consists of replacing variant forms of a word by some standard or normalised form. What one needs then is software which will recognise every occurrence of a variant form as an instantiation of a normal form. This is realised by creating a list of normal forms and of all forms which represent variants of these. Technically this is achieved by generating a database. For each variant form there is a single record which, at the very least, consists of two fields. The first is that for the variant and the second is for the normal form with which the variant is associated. The database will have as many records as there are variant forms to be replaced by normal forms. The extent of the normalisation is thus dependent solely on the number of substitutions which the compi
28. xt line with a certain number of words to the left and right of it included as context; this is known as a KWIC or keyword in context file.
Keyword in context (KWIC) file:
        33  And made forward erly for to ryse
        34  To take oure wey ther as I yow deuyse
        37  Me thynketh it acordaunt to resoun
        38  To telle yow al the condicioun
  Wel
        24  Wel nyne and twenty in a compaignye
        29  And wel we weren esed atte beste
  Whan
         1  Whan that Aprill with hise shoures soote
        30  And shortly whan the sonne was to reste
  What
         5  What Zephirus eek with his sweete breath
        18  That hem hath holpen what that they were seeke
        40  An whiche they were and of what degree
        41
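A keyword in context listing can be produced by locating each occurrence of the keyword and cutting out a fixed number of words on either side. The Python sketch below shows this under stated assumptions: the function name kwic and its width parameter are hypothetical, matching ignores case, and the two sample lines are taken from the display above. It is not a description of how Lexa builds its concordance files.

```python
def kwic(lines, keyword, width=3):
    """Return (line number, left context, keyword, right context) tuples."""
    hits = []
    for number, line in lines:
        words = line.split()
        for position, word in enumerate(words):
            if word.lower() == keyword.lower():
                # Up to width words on each side of the keyword.
                left = " ".join(words[max(0, position - width):position])
                right = " ".join(words[position + 1:position + 1 + width])
                hits.append((number, left, word, right))
    return hits


sample = [
    (24, "Wel nyne and twenty in a compaignye"),
    (29, "And wel we weren esed atte beste"),
]
for number, left, word, right in kwic(sample, "wel", width=2):
    # Right-align the left context so the keywords form a column.
    print(f"{number:>4}  {left:>12}  {word}  {right}")
```

Aligning the keyword in a fixed column, as in the print statement, is what allows the eye to run down the occurrences and compare their contexts.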
