        Applications of software in the compilation of corpora
         Contents
1. [Screen display, continued: the word index window of Lexa Words over the tagged opening of Beowulf. The pop-up list shows word forms in alphabetical order (Heorogar, heresped, hie, hildew+apnum, him, hine, his, holm, hringedstefna, hronrade, hu, ...), each with its frequency. Index file: BEOWULF.WDX]

Raymond Hickey Applications of software Page 9 of 15

This type of word index could be created in advance by the compilers of a corpus and supplied on the distribution medium, thus obviating the need for users to generate such lists themselves. Given the greatly increased storage capacity of media such as CD-ROM disks, the additional space required for such index files should not be a deterrent to offering them with the primary text files of a corpus.

6 Lexical databases

Closely related to word i
2. An eek in    what    array that they were Inne

The second major type has the keyword in a separate column to the left, with the text line from which it is taken following it. Here one is dealing with a KWOC or 'keyword out of context' file. Both file types can be generated quite easily with Lexa. Furthermore, the information of a concordance file can be transferred to a database environment to enable users to avail of the additional manipulative power of the latter type of file.

Keyword out of context (KWOC) file

to
      33  And made forward erly for to ryse
      34  To take oure wey ther as I yow deuyse
      37  Me thynketh it acordaunt to resoun
      38  To telle yow al the condicioun
Wel
      24  Wel nyne and twenty in a compaignye
      29  And wel we weren esed atte beste
Whan
       1  Whan that Aprill with hise shoures soote
      30  And shortly whan the sonne was to reste
What
       5  What Zephirus eek with his sweete breath
      18  That hem hath holpen what that they were seeke
      40  An whiche they were and of what degree
      41  An eek in what array that they were Inne

8
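The two concordance layouts described above can be sketched in a few lines of Python. This is only an illustration of the KWIC and KWOC file types, not a description of how Lexa produces them; the function names `kwoc` and `kwic`, the `width` parameter and the three sample lines are assumptions made for the example.

```python
import re
from collections import defaultdict


def kwoc(lines, keywords):
    """Keyword out of context: map each keyword to the (line number, line) pairs containing it."""
    wanted = {k.lower() for k in keywords}
    index = defaultdict(list)
    for line_no, text in enumerate(lines, start=1):
        seen = set()
        for word in re.findall(r"[A-Za-z]+", text):
            key = word.lower()
            # List a line only once per keyword, even if the keyword recurs in it.
            if key in wanted and key not in seen:
                seen.add(key)
                index[key].append((line_no, text))
    return dict(index)


def kwic(lines, keyword, width=30):
    """Keyword in context: one row per hit, with the keyword aligned in a fixed column."""
    pattern = re.compile(r"\b%s\b" % re.escape(keyword), re.IGNORECASE)
    rows = []
    for line_no, text in enumerate(lines, start=1):
        for m in pattern.finditer(text):
            left = text[:m.start()][-width:].rjust(width)
            right = text[m.end():][:width]
            rows.append(f"{line_no:4d} {left} {m.group(0)} {right}")
    return rows


sample = [
    "Whan that Aprill with hise shoures soote",
    "And shortly whan the sonne was to reste",
    "That hem hath holpen whan that they were seeke",
]

for key, hits in sorted(kwoc(sample, ["whan"]).items()):
    print(key)
    for line_no, text in hits:
        print(f"  {line_no:4d}  {text}")

for row in kwic(sample, "whan", width=22):
    print(row)
```

Both functions return plain Python structures, so the rows could equally be written to a table for the database transfer mentioned above.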
3. +tah, o+d+t+at him +aghwylc +tara ymbsittendra ofer hronrade hyran scolde, gomban gyldan. <R 11> +t+at w+as god cyning! <R 12> +d+am eafera w+as +after cenned, geong in geardum, +tone god sende folce to frofre; fyren+dearfe ongeat +te hie +ar drugon aldorlease lange hwile. <R 16> Him +t+as liffrea, wuldres wealdend, woroldare forgeaf; Beowulf w+as breme (bl+ad wide sprang), Scyldes eafera Scedelandum in.

If one takes this text, however, and runs it through the programme Make Symbols, which is supplied in the Lexa suite, then a number of substitutions are made and certain high-ASCII characters are inserted where escape sequences were found in the input text. Now, under the important assumptions that (i) you are using a computer with a colour monitor (typically a VGA video adapter based system) and (ii) you have loaded the supplied Old/Middle English font of the Lexa suite, the stretch of text printed above should look like the following:

Beginning of Beowulf with Old English characters

BEOWULF

<R 1> Hwæt! We Gardena in geardagum, þeodcyninga, þrym gefrunon, hu ða æþelingas ellen fremedon. <R 4> Oft Scyld Scefing sceaþena þreatum, monegum mægþum, meodosetla ofteah, egsode eorlas. <R 6> Syððan ærest wearð feasceaft funden, he þæs frofre gebad, weox under wolcnum, weorðmyndum þah, oðþæt him æghwylc þara ymbsittendra ofer hronrade
4.
      15  And specially fram euery shires ende
      16  Of Engelond to Cauntenbury they wende
      17  The hooly blisful martir for to seke
      18  That hem hath holpen whan that they were seeke
      19  Bifil that in that seson on a day
      20  In Southwerk at the Tabard as I lay
      21  Redy to wenden on my pilgrymage
      22  To Caunterbury with ful deuout corage

This comparison facility does not allow you to alter the contents of a text. Should you wish to check on and edit two texts at once, you can use the similar comparison option in the Lexa suite text editor, Lexa Text.

2 Normalisation

While the critical editions of texts in printed form strive to be accurate in the inclusion of variants, e.g. in the edition of a work attested in different manuscripts, for the electronic form of a text a normalised version may have very definite advantages in terms of readability, not to say accessibility, particularly with older texts or those representing a dialectally divergent language variety.
In essence the process of normalisation consists of replacing variants of a grammatical form by a single form, chosen either by external consensus (e.g. as the latter is the input to a later standard form, or indeed this itself) or by a justifiable decision of the corpus compilers. Despite the almost ideological dislike of normali
5. Furthermore, online searching of texts is possible. Searching can be on a global level, encompassing all texts of a corpus, and search strings can include unspecified elements by the use of wild cards. Texts can be extracted from a corpus and printed separately if required.
The corpus manager is particularly suitable for those corpora which consist of many small parts and where there is a hierarchical relation between these. Technically, to adapt a corpus for use with the programme Corpus Manager one must create a single file and then place level markers at strategic points in this file, indicating where the breaks are in the text. A special text editor will carry out the task of marker placement quite easily. The programme then indexes the file which has been prepared in this fashion. Once this has been done, one can consult the corpus, now in the form of a text database, at will. There is no restriction on the number of text databases, so that one could conceivably divide a corpus into several blocks, each with an internal structure determined by the compilers of the corpus. To illustrate this technique there is a text database of letters from the history of English included as part of the Lexa suite.

9 Conclusion

The present sketch is intended to offer the int
6. seeke
Bifil that in_PREP that seson on_PREP a_ART day
In_PREP Southwerk at the_ART Tabard as I lay
Redy to wenden on_PREP my_POSS_PRO pilgrymage_ROMANCE
To Caunterbury with ful deuout corage_ROMANCE

Tagging texts before their distribution is something which later users may view as a linguistic straitjacket, as it imposes the grammatical classification scheme of the compilers on the user. Seeing that there is tagging software available, many compilers may now prefer to leave this work to the corpus users or to some sub-group, such as researchers in another university, who would be prepared to carry out this task. As universities have to economise on resources, tagging by the compilers is likely to become less common in future, especially as partial tagging is not viewed as a sensible course of action: you either tag completely or not at all. If you decide to do so, you may tie up your capacities in a manner which you come to regret later.
This would appear at least to be the case for major projects like the Helsinki corpus. With the arrival of smaller, more specialised corpora, tagging may become feasible, particularly if it is directly connected with the research interests of the corpus compilers.

4 Using Cocoa headers

Independent of the question of whether to tag or not to tag, compilers of a corpus should consider whether it would be of avail
7. the three volumes of documentation are available from the Norwegian Computing Centre for the Humanities in Bergen, Norway. Each of the following sections is intended to illustrate a typical situation in which software is useful in the preparatory stage of corpus building. The list is not exhaustive, but it does cover the main areas of concern in this phase of text collection and organisation.

1 Text collation

It is safe to assume that more than one individual will be involved in the compilation of a text corpus. Texts will either be scanned or keyed in directly. In either case it is more the exception than the rule to find that a text turns up error-free in the computer. This banal fact increases the status of the individual who is responsible for text correction. Again, it is commonplace for more than one version of a text to exist in some intermediary stage of compilation. Sooner or later in such a situation doubts arise as to whether a particular version of a text is the more accurate or the better corrected. The need arises quite quickly for a reliable means of comparing two versions of a single text. Of course the time and date stamp of a file on the operating system level will tell you which of two is the more recent, but age is no guarantee of correctness.
To resolve this dilemma a programme has been included in the Lexa suite which will compare two files with each other byte for byte an
8. to future users to include some information on the nature of the texts before distributing these. Fortunately, the compilers of the Helsinki corpus decided in favour of supplying such information. The format they have chosen for the inclusion of text-relevant information is what is commonly known as the Cocoa header. Note that header information is placed at the top of a file and has nothing to do with grammatical classifications included in the body of a text.

Parameters of the Cocoa header

 1  <B  name of text file>        2  <Q  text identifier>
 3  <N  name of text>             4  <A  author>
 5  <C  part of corpus>           6  <O  date of original>
 7  <M  date of manuscript>       8  <K  contemporaneity>
 9  <D  dialect>                 10  <V  verse or prose>
11  <T  text type>               12  <G  relation to foreign original>
13  <F  foreign original>        14  <W  relation to spoken language>
15  <X  sex of author>           16  <Y  age of author>
17  <H  social rank of author>   18  <U  audience description>
19  <E  participant relation>    20  <J  interaction>
21  <I  setting>                 22  <Z  prototypical text category>
23  <S  sample>                  24  <P  page>
25  <L  line>                    26  <R  record>

Although providing headers is not comparable to the task of tagging a corpus, it nonetheless requires an ad
9. which have the shapes of the Old and Middle English characters in any set of input texts at those points where it encounters an escape sequence; e.g. it inserts the yogh symbol when it hits on +g. The conversion is reversible, so that texts can be restored to their original form if desired. The numerical values of the redefined characters with Old and Middle English shapes are as follows:

Escape     Actual      Letter name        ASCII numerical value for redefinition
sequence   character                      by Lexa programme Make Symbols
+a         æ           ash (l.c.)         145
+A         Æ           Ash (u.c.)         146
+d         ð           eth (l.c.)         253
+D         Ð           Eth (u.c.)         252
+g         ȝ           yogh (l.c.)        243
+G         Ȝ           Yogh (u.c.)        242
+t         þ           thorn (l.c.)       245
+T         Þ           Thorn (u.c.)       244
+tt        ꝥ           crossed thorn      248
+TT        Ꝥ           crossed Thorn      246
+e         ę           e caudata          144
           £           pound sign         156

Taking a typical text such as Beowulf and loading it with one's text editor or word processor leads to one being presented with a text which is convenient for computer manipulation but hardly readable to the Old English scholar.

Beginning of Beowulf with 'escape sequence' coding

BEOWULF

<R 1> Hw+at! We Gardena in geardagum, +teodcyninga, +trym gefrunon, hu +da +a+telingas ellen fremedon. <R 4> Oft Scyld Scefing scea+tena +treatum, monegum m+ag+tum, meodosetla ofteah, egsode eorlas. <R 6> Sy+d+dan +arest wear+d feasceaft funden, he +t+as frofre gebad, weox under wolcnum, weor+dmyndum
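The reversible substitution described above can be illustrated with a short Python sketch. It is not the Make Symbols programme: it assumes that the escape sequences are written with a leading plus sign, and it maps them to Unicode letters rather than to redefined characters of a DOS screen font. The names `ESCAPES`, `to_symbols` and `to_escapes` are hypothetical.

```python
import re

# Escape sequence -> Unicode letter (an assumption for this sketch; Lexa itself
# redefines high-ASCII screen characters instead of using Unicode).
ESCAPES = {
    "+tt": "\ua765",  # crossed thorn (thorn with stroke), lower case
    "+TT": "\ua764",  # crossed Thorn, upper case
    "+a": "æ", "+A": "Æ",   # ash
    "+d": "ð", "+D": "Ð",   # eth
    "+g": "ȝ", "+G": "Ȝ",   # yogh
    "+t": "þ", "+T": "Þ",   # thorn
    "+e": "ę",              # e caudata
}

# Longest sequences first, so that "+tt" is tried before "+t".
_TO_CHAR = re.compile(
    "|".join(re.escape(k) for k in sorted(ESCAPES, key=len, reverse=True))
)
_REVERSE = {char: seq for seq, char in ESCAPES.items()}
_TO_ESCAPE = re.compile("|".join(re.escape(c) for c in _REVERSE))


def to_symbols(text):
    """Replace every escape sequence by the corresponding single character."""
    return _TO_CHAR.sub(lambda m: ESCAPES[m.group(0)], text)


def to_escapes(text):
    """Restore the escape-sequence form from the single-character form."""
    return _TO_ESCAPE.sub(lambda m: _REVERSE[m.group(0)], text)


line = "Hw+at! We Gardena in geardagum, +teodcyninga, +trym gefrunon"
readable = to_symbols(line)
print(readable)
print(to_escapes(readable) == line)
```

The round trip only holds for texts that contain none of the target letters before conversion, which is the case for the 7-bit corpus files discussed here.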
10. Applications of software in the compilation of corpora
Raymond Hickey
Department of English
University of Munich

Abstract

An attempt is made here to sketch some of the applications to which corpus processing software can be put in the compilation of corpora. The emphasis is on the one hand on the automation of many standard processes, such as text collation and the provision of header information for each file of a corpus, while on the other hand the additional possibilities offered by dedicated corpus software are described. Among the latter, special emphasis is put on the transfer of textual data to a database environment for further processing. Further matters, such as the use of special fonts for older stages of English and the option of organising the text files of one's corpus for potential users in advance, are also discussed.

0 Introduction

Given the nature of the contributions to this volume, the present author thought it appropriate to discuss the uses to which corpus processing software could be put in the compilation and distribution of corpora, especially ones with a diachronic orientation.
Assuming that the compilers of a corpus have reached basic agreement on what periods are to be covered and what texts are to be included, software can be used gainfully from this point onwards. To illustrate possible applications, the software system Lexa, developed by the present author, will be used for the ensuing discussion. This programme suite and
11. Font considerations

The corpora and corpus projects under construction which are presented in the present volume all refer to diachronic English. If the time span in a particular instance stretches back far enough, then texts will involve special characters for Middle and Old English. The practice with historical corpora has been to represent special symbols of historical stages of the language by using so-called escape sequences. For instance, the Old English character thorn is represented by +t in the Helsinki corpus, the eth symbol is indicated by +d, and so forth. This encoding has the advantage of portability. The corpus texts only include characters with numeric values between 32 and 126 in the ASCII set and are transferable to and readable on computer systems operating on a so-called 7-bit basis. The obvious disadvantage is that readability drops drastically with older texts. Something like Beowulf is undecipherable in the 'escape sequence' form.
A practical solution to this problem, presented in the Lexa suite, is to use a supplied programme to convert the sequences to single characters with the correct shapes, so that an Old English text on screen looks more or less identical to one in printed form.
The scheme devised by the present author utilises the ability of personal computers with colour monitors to display characters with customized shapes on screen. The programme which makes the alterations inserts the redefined symbols of the screen 
12. [Screen display, continued: DbTrans status fields (number of replacements; matching files: 1), the substituted forms HAS, HAVE and HAD, and the messages 'DbTrans successfully executed', 'Press any key' and 'Press <Escape> to abort operation'.]

Needless to say, any normalisation procedure of any reasonable extent will require far more records specifying many more substitutions. This is, however, not a matter of principle but of arranging a suitable database. Again, the advantage of normalisation via a translation programme is that the original version of a text is left unimpaired. Furthermore, the normalisation can in fact be carried out by the user of the corpus if he or she so wishes, thus releasing the compiler from the arduous task of generating comprehensive databases for normalisation tasks.

3 Pre-tagging texts

A major decision which the compilers of a corpus have to take is whether the texts of their corpus are to contain any kind of grammatical information. If this decision is made in favour of including such information then a conside
13. aldend, woroldare forgeaf; Beowulf w+as_BE_PAST breme (bl+ad wide sprang), Scyldes eafera Scedelandum in.
<R 20>

Opening of the Canterbury Tales in tagged form

C:\LEXA\DEMO\CHAUCER.LEM   Line 1   Col 1   Page 1   Text 1

Whan that Aprill with hise shoures_PLURAL soote
the_ART droghte of_PREP March hath_HAVE perced to the_ART roote
And bathed euery_IND_PRO veyne in_PREP swich licour
Of_PREP which vertu engendred is the_ART flour
What Zephirus eek with his_POSS_PRO sweete breath
Inspired hath_HAVE in_PREP euery_IND_PRO hold and heeth
The_ART tendre croppes_PLURAL and the_ART yonge_PAST_PART sonne
Hath_HAVE in_PREP the_ART Ram his_POSS_PRO half cours yronne_PAST_PART
And smale foweles_PLURAL maken melodye
That slepen al the_ART nyght_X_PHON with open eye
So priketh hem nature in_PREP hir_POSS_PRO corages_PLURAL
Thanne longen folk to geen on_PREP pilgrimages_PLURAL
And Palmeres_PLURAL for_PREP to seken straunge strondes_PLURAL
To ferne halwes_PLURAL kowthe in_PREP sondry londes_PLURAL
And specially fram euery_IND_PRO shires_PLURAL ende
Of_PREP Engelond to Cauntenbury they wende
The_ART hooly blisful martir for_PREP to seke
That hem hath_HAVE holpen what that they were_COPULA
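The tagged lines shown here attach a label to a word form with an underscore. A minimal Python sketch of such lookup-based tagging follows; it is an illustration only and makes no claim about how the Lexa tagger works. The tag table `TAGS` is a hypothetical selection built from labels visible in the display, and the case-insensitive lookup is an assumption.

```python
import re

# Hypothetical tag table: word form (lower case) -> tag label.
TAGS = {
    "the": "ART",
    "of": "PREP",
    "in": "PREP",
    "on": "PREP",
    "hath": "HAVE",
}


def tag_line(line, tags=TAGS):
    """Append _TAG to every word form listed in the tag table; leave other words alone."""
    def attach(match):
        word = match.group(0)
        tag = tags.get(word.lower())
        return f"{word}_{tag}" if tag else word

    return re.sub(r"[A-Za-z]+", attach, line)


print(tag_line("the droghte of March hath perced to the roote"))
```

Because only the forms in the table are touched, a small table leaves most of the text untagged, which is the situation in the illustration.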
14. c can be excluded from the tagging process.
A tagging operation on two input texts, the opening lines of Beowulf and of Chaucer's Canterbury Tales, produced the following results. Note that for the purposes of illustration only a selection of tags was specified, which means that only a small number of word forms are actually tagged.

Opening of Beowulf in a tagged form

C:\LEXA\DEMO\BEOWULF.LEM   Line 58   Col 1   Page 1   Text 1

Hw+at_INTERROG! We Gardena in geardagum, +teodcyninga, +trym gefrunon, hu_CONJUNC +da +a+telingas ellen fremedon.
<R 4>
Oft Scyld Scefing scea+tena +treatum, monegum m+ag+tum, meodosetla ofteah, egsode eorlas.
<R 6>
Sy+d+dan_CONJUNC +arest wear+d feasceaft funden, he +t+as frofre gebad, weox under wolcnum, weor+dmyndum +tah, o+d+t+at him_INFL_PRO +aghwylc +tara ymbsittendra ofer_LOCATIVE hronrade hyran scolde, gomban gyldan.
<R 11>
+t+at w+as_BE_PAST god cyning!
<R 12>
+d+am eafera w+as_BE_PAST +after_TEMPORAL cenned, geong in geardum, +tone god sende folce to frofre; fyren+dearfe ongeat +te hie +ar drugon aldorlease lange hwile.
<R 16>
Him_INFL_PRO +t+as liffrea, wuldres we
15. d give information on grammatical endings. If you sort a lexical database alphabetically on the field REVERSE, you end up with the records ordered according to the end and not the beginning of the words in the field TOKEN.

[Screen display: database record 119 of 272]

Field       Type   Width   Contents
TOKEN       C      32      hine
LEMMA       C      16      INFL_PRO
FREQUENCY   N       6      3
REVERSE     C      32      enih

Whether compilers of a corpus would feel like including lexical databases for their primary text files depends on the status of lexical analysis on their own research horizon. Once more, the question of the space which such files would occupy loses importance when one considers the large capacity of storage media nowadays.

7 Concordance files

Looking at words in isolation is one aspect of lexical analysis. Another, which stresses contextualization, is the viewing of words in context. Here one is dealing with a file type which, while not very sophisticated from a computing point of view, nonetheless has its justification in providing valuable information to users of a corpus.
The main programme Lexa can once more be employed to generate the type of file in question. There are basically two types of concordance, one in which the keyword being considered is displayed in the centre of a te
16. d report the differences on screen and write these to a file if required. With Lexa Compare one loads two text files from a directory listing and then, on pressing a dedicated key, the programme begins a comparison of the two. In the following screen print-outs two versions of the opening lines of Chaucer's Canterbury Tales are displayed. The version CHAUC_1.TXT contains four errors: what is written for whan (twice), kouthe for kowthe and season for Chaucer's spelling seson. These errors are highlighted on the screen and the user can immediately recognise which of the versions is the more correct. Errors will be detected anywhere up to the end of the files chosen. If the files are identical, you are informed of this.

Less correct version of text

File 1: CHAUC_1.TXT   06-15-93 12:47   Offset: 893

       1  Whan that Aprill with hise shoures soote
       2  the droghte of March hath perced to the roote
       3  And bathed euery veyne in swich licour
       4  Of which vertu engendred is the flour
       5  What Zephirus eek with his sweete breath
       6  Inspired hath in euery hold and heeth
       7  The tendre croppes and the yonge sonne
       8  Hath in the Ram his half cours yronne
       9  And smale foweles maken melodye
      10  That slepen al the nyght with open eye
      11  So priketh hem nature in hir corages
      12  Thanne longen folk to geen on pilgrimages
      13  And Palmeres
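A byte-for-byte comparison of the kind described can be sketched as follows. This is a hedged illustration in Python, not the Lexa Compare programme: the function names are hypothetical, and the sketch compares strictly by position, so an inserted or deleted character shifts everything after it. How Lexa Compare deals with that case is not stated here, which is why a simple line-by-line report is included as an alternative.

```python
from itertools import zip_longest


def compare_bytes(a: bytes, b: bytes):
    """Return (offset, byte_in_a, byte_in_b) for each position at which the inputs differ.

    None marks a position that lies beyond the end of the shorter input.
    """
    differences = []
    for offset in range(max(len(a), len(b))):
        x = a[offset] if offset < len(a) else None
        y = b[offset] if offset < len(b) else None
        if x != y:
            differences.append((offset, x, y))
    return differences


def compare_lines(text_a: str, text_b: str):
    """Return (line number, line_in_a, line_in_b) for each pair of lines that differ."""
    pairs = zip_longest(text_a.splitlines(), text_b.splitlines())
    return [(no, x, y) for no, (x, y) in enumerate(pairs, start=1) if x != y]


version_1 = "What Zephirus eek with his sweete breath\nInspired hath in euery hold and heeth\n"
version_2 = "Whan Zephirus eek with his sweete breath\nInspired hath in euery hold and heeth\n"

for offset, x, y in compare_bytes(version_1.encode("ascii"), version_2.encode("ascii")):
    print(f"offset {offset}: {chr(x)!r} versus {chr(y)!r}")

if not compare_lines(version_1, version_1):
    print("The files are identical.")
```

To compare files on disk, the two byte strings could be read with `open(path, "rb").read()`.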
17. ditional amount of work to specify the values for these parameters for the texts of a corpus. The advantages, however, are considerable.
With the Lexa suite the contents of a Cocoa header can be accessed by the information retrieval software. This is realised as follows: a programme, called Cocoa, extracts the header information from any set of input files and deposits this in a database. Then with the database manager of the suite, called DbStat, one can load the database just created and impose a filter on it. By this is meant that only those records remain visible which match a certain user-specified condition.
Assuming that one generates a database of the Cocoa header information in the files of the Helsinki corpus and loads the database manager, one could specify a filter which admits only those records (i.e. file headers in database form) that represent translations (Item 13) of Middle English (Item 6) prose (Item 10) texts. A list of the files for which this header information obtains can be generated by creating a list from the field information for Item 1 (name of text file). The list file created by these steps can in its turn be used as the source of the file names for an information retrieval operation with other parts of the Lexa suite, so that only Middle English prose translations from the corpus are examined. In addition the user can specify with the retrieval programmes fro
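The sequence of steps just described (extract the headers, filter the records, list the file names) can be sketched in Python. This is an illustration under assumptions, not the Cocoa or DbStat programmes: it assumes that each header parameter stands on its own line in the form <LETTER value> at the top of the file, and the sample header values, including the use of X for 'not applicable' and the date range taken to count as Middle English, are invented for the example.

```python
import re

# One header parameter per line, e.g. "<V PROSE>" (assumed layout).
HEADER_LINE = re.compile(r"^<([A-Z])\s+([^>]*)>\s*$")


def read_header(text):
    """Collect the leading block of <LETTER value> lines of a text into a dictionary."""
    header = {}
    for line in text.splitlines():
        match = HEADER_LINE.match(line.strip())
        if match:
            header.setdefault(match.group(1), match.group(2).strip())
        elif line.strip():
            break  # first line of the text body: the header is finished
    return header


def select(headers, condition):
    """Return the file names (parameter B) of all headers which satisfy the condition."""
    return [h["B"] for h in headers if condition(h)]


def is_me_prose_translation(h):
    """Filter: prose (V) with a foreign original (F) and a Middle English date (O)."""
    first_year = int(h.get("O", "0").split("-")[0])
    return (
        h.get("V") == "PROSE"
        and h.get("F") not in (None, "X")
        and 1150 <= first_year <= 1500
    )


# Invented sample files for the sketch.
SAMPLE_A = "<B TEXT_A>\n<O 1350-1420>\n<V PROSE>\n<F LATIN>\n\nbody of the text ...\n"
SAMPLE_B = "<B TEXT_B>\n<O 1350-1420>\n<V VERSE>\n<F X>\n\nbody of the text ...\n"

headers = [read_header(SAMPLE_A), read_header(SAMPLE_B)]
print(select(headers, is_me_prose_translation))
```

The resulting list of names plays the role of the list file which is handed on to the retrieval programmes.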
18. e bound to be additional advantages to be gained from looking at the contents of a corpus on the computer rather than in printed form. To this end the programme Corpus Manager has been designed and included in the Lexa suite.
Basically what the corpus manager does is to provide one with a table of contents for a corpus with up to three levels of depth. There is a contents window for each level and the user can choose any text and view it by selecting it in the current window.

[Screen display: Corpus Manager contents windows for the text database of letters. Keys: F7 Desktop, F9 Settings]

Level 1 (4 of 19): Overview (Cocoa info), Beaumont, Plumpton, More et al., Cromwel, Cumberl, Knyvett, Harley, Paston, Ferrar, Barring, Proud, Gawdy, Haddock, Strype, Oxinden, Hatton, Pinney, Henry
Level 2 (1 of 2): Thomas More, Margaret Roper
Level 3 (1 of 3): Letter to wife, Letter to daughter, Letters to M Roper
19. erested linguist engaged in corpus compilation an idea of what can be done both to facilitate the process of compilation for him- or herself and to increase the gainful use to which the corpus can be put once it is completed. In all instances, the programmes of the Lexa suite do not have to be altered in any major way (apart from the operator of the programmes creating his or her own configuration; in some instances this is standard procedure). The net result is a greater degree of automation for many processes. This represents a saving in resources which in turn renders many a task more feasible and brings forward the distribution date for many an interesting corpus.

References

Hickey, Raymond 1993a. Lexa. Corpus Processing Software. 3 Vols. Vol. 1: Lexical Analysis. Vol. 2: Database and Corpus Management. Vol. 3: Utility Library. Bergen: Norwegian Computing Centre for the Humanities.
Hickey, Raymond 1993b. 'Corpus data processing with Lexa', ICAME Journal 17: 73-96.
Hockey, Susan and Ian Marriott 1980. Oxford Concordance Program. Users' manual. Oxford: Oxford University Computing Service.
Johansson, Stig 1986. The tagged LOB Corpus. User's manual. Bergen: Norwegian Computing Centre for the Humanities.
Johansson, Stig and Anna-Brita Stenström (eds.) 1991. English computer corpora. Selected papers and research guide. Berlin: Mouton de Gruyter.
Kytö, Merja, Ossi Ihalainen, and Matti Rissane
20. for to seken straunge strondes
      14  To ferne halwes kouthe in sondry londes
      15  And specially fram euery shires ende
      16  Of Engelond to Cauntenbury they wende
      17  The hooly blisful martir for to seke
      18  That hem hath holpen what that they were seeke
      19  Bifil than in that seeson on a day
      20  In Southwerk at the Tabard as I lay
      21  Redy to wenden on my pilgrymage
      22  To Caunterbury with ful deuout corage

More correct version of text

File 2: CHAUC_2.TXT   06-15-93 12:45   Offset: 893

       1  Whan that Aprill with hise shoures soote
       2  the droghte of March hath perced to the roote
       3  And bathed euery veyne in swich licour
       4  Of which vertu engendred is the flour
       5  Whan Zephirus eek with his sweete breath
       6  Inspired hath in euery hold and heeth
       7  The tendre croppes and the yonge sonne
       8  Hath in the Ram his half cours yronne
       9  And smale foweles maken melodye
      10  That slepen al the nyght with open eye
      11  So priketh hem nature in hir corages
      12  Thanne longen folk to geen on pilgrimages
      13  And Palmeres for to seken straunge strondes
      14  To ferne halwes kowthe in sondry londes
21. hyran scolde, gomban gyldan. <R 11> þæt wæs god cyning! <R 12> ðæm eafera wæs æfter cenned, geong in geardum, þone god sende folce to frofre; fyrenðearfe ongeat þe hie ær drugon aldorlease lange hwile. <R 16> Him þæs liffrea, wuldres wealdend, woroldare forgeaf; Beowulf wæs breme (blæd wide sprang), Scyldes eafera Scedelandum in.

In the interests of a unified system, prospective compilers of diachronic corpora are advised to keep the encoding system used by the Helsinki corpus for special characters. If they do, then the software already available for the latter corpus can be used without any alteration in a newer corpus which complies with the original codification scheme. Users of several corpora will only need one special video font, namely that supplied with the Lexa suite, and can then view any text from a selection of corpora without further system adjustment. Note that this coding scheme is also that used by the present author for the medieval texts in the Corpus of Irish English (see elsewhere in this volume for details).

8 Organisational considerations

Once a corpus has been completed a practical question arises for its potential users: how do they gain an overview of just what the corpus contains? The simple answer with many corpora is to consult the manual. However, given that a corpus is an electronic library of texts, it is surely natural to expect that this question can be answered electronically. Not only that, there ar
22. ler wishes to carry out. The net result of this procedure is an unaltered and non-normalised version of a file or files along with a database or databases (one per text file) with which the user of a corpus can, if he or she so wishes, generate a normalised version of a text. The programme to use here is called Database Translate (DBTRANS), as it translates an input text into an altered output text on the basis of a database. Here is a simple example of how this actually works. A database is created entitled NORMAL.DBF. This contains two fields per record, labelled MID_ENG (Middle English) and MOD_ENG (Modern English) respectively. There are six records with forms of the verb 'have' which occur in Middle English. The programme DBTRANS now examines any input text or texts and, if it finds any instances of Middle English forms of 'have', it replaces these by the modern forms specified in the database NORMAL.DBF.

Lexa utility: DbTrans   (c) Raymond Hickey

Terminology database:    NORMAL.DBF (6)     Only words:     Yes
Input language field:    MID_ENG            Open ended:     No
Output language field:   MOD_ENG            Ignore case:    No
Template for files:      NORM_INP.TXT       Input files:    ASCII
Current input file:      NORM_INP.TXT       Manual oper.:   Yes
String to be located:    had
String to be inserted:   HAD
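The principle of normalisation via a translation table can be sketched in Python as follows. This is not the DBTRANS programme: a dictionary stands in for the database NORMAL.DBF, the entries in it are hypothetical, and the two keyword parameters merely imitate the 'Only words' and 'Ignore case' settings visible in the display.

```python
import re

# Stand-in for the database: MID_ENG form -> MOD_ENG form (hypothetical entries).
NORMAL = {
    "hath": "HAS",
    "han": "HAVE",
    "hadde": "HAD",
}


def normalise(text, table, whole_words=True, ignore_case=False):
    """Return (normalised text, number of replacements); the input string is left untouched."""
    keys = sorted(table, key=len, reverse=True)
    body = "|".join(re.escape(k) for k in keys)
    pattern = re.compile(
        rf"\b(?:{body})\b" if whole_words else body,
        re.IGNORECASE if ignore_case else 0,
    )
    lookup = {k.lower(): v for k, v in table.items()} if ignore_case else table

    def substitute(match):
        found = match.group(0)
        return lookup[found.lower() if ignore_case else found]

    # All forms are handled in a single pass, so an inserted form is never re-examined.
    return pattern.subn(substitute, text)


original = "That hem hath holpen whan that they were seeke"
normalised, replacements = normalise(original, NORMAL)
print(normalised)
print(replacements)
```

Since the function returns a new string, the original version of the text remains available, which mirrors the advantage claimed for the database approach.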
23. m the set, such as the pattern matcher Lexa Pat and the programme for locating syntactic contexts Lexa Context, that the Cocoa information of the files examined be enclosed in the output file of statistics generated during a search.

The example just given is typical inasmuch as it illustrates how different parts of the Lexa suite link up together. For any prospective operators of the programme package, be they corpus compilers or users, it is essential to grasp the interrelationships between items of software.

5 Word indexes

Among the simplest of tasks to carry out with any corpus processing software is the production of lists of unique words from source text files. Despite its obvious simplicity, this type of additional file is very commonly demanded by linguists examining a corpus. To this end the programme Lexa Words in the Lexa suite has been created. The programme takes any text file or set of files and generates a list of the unique words which occur in the input. Once this list has been created, the user can consult it via a pop-up window in which a sorted list appears, together with the frequency of each word as noted when the list was generated.

[Screen display: the tagged opening lines of Beowulf (Hwæt_INTERROG We Gardena in geardagum, þeodcyninga þrym gefrunon, hu_CONJUNC ða æþelingas ellen fremedon), overlaid by a pop-up window showing the sorted word index at entry 145 of 280, beginning with "heold"]
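The kind of list described in this section can be sketched in a few lines of Python. This is a minimal illustration, not the Lexa Words programme: the function name `word_index` is hypothetical, and folding upper and lower case together is an assumption made here for simplicity.

```python
import re
from collections import Counter


def word_index(texts):
    """Return a sorted list of (word, frequency) pairs for the unique words."""
    counts = Counter()
    for text in texts:
        # Lower-case so that 'Wel' and 'wel' count as one entry (an assumption).
        counts.update(re.findall(r"[^\W\d_]+", text.lower()))
    return sorted(counts.items())


lines = [
    "Wel nyne and twenty in a compaignye",
    "And wel we weren esed atte beste",
]
for word, freq in word_index(lines):
    print(f"{word:12} {freq}")
```

Passing several texts in one call gives a single combined index, which corresponds to running the programme on a set of files.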
24. n (eds) 1988. Corpus linguistics, hard and soft. Amsterdam: Rodopi.

Kytö, Merja 1991. Manual to the diachronic part of the Helsinki Corpus of English Texts. Helsinki: Department of English.

Kytö, Merja and Matti Rissanen 1988. The Helsinki Corpus of English Texts: Classifying and coding the diachronic part. In Corpus linguistics, ed. by M. Kytö et al., 169-180. Amsterdam: Rodopi.
25. ndexes is another type of file which is useful in analysing texts lexically. This is what is termed a lexical database. Recall that a database is a type of file in which information is stored in table-like form. The rows of the table are the records, and the columns are the fields which each record contains. With the main programme Lexa of the suite under discussion it is possible to derive a database from any input text or texts. By this is meant that you load or specify an input text and then demand of the programme that it extract information on each word, storing this in a record with four fields. The first contains the word form itself, i.e. the token which is found. The second contains the tag, if any, which has been associated with the word form in question. The third field holds the frequency of the word in the input text. The frequency is stored cumulatively, which means that if you run the lexical database function on a series of texts, the entry for FREQUENCY is incremented for every find of a particular word form in each text. Thus if you, for example, divided Beowulf into six texts and generated a lexical database for each text, using the same output database, then the frequency field for any given form would contain the total number of occurrences in all six texts taken together. In the fourth field the word is deposited in reverse order of spelling. The idea behind this is to allow the creation of reverse order dictionaries which woul
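The four-field record and the cumulative frequency count can be sketched as follows in Python. This is an illustration under stated assumptions rather than the format Lexa actually writes: the field names, the function `update_lexical_db` and the tag labels are hypothetical, and each word form is assumed to carry at most one tag.

```python
from collections import namedtuple

# One record per word form: the form, its tag (if any), its cumulative
# frequency, and the form spelt backwards.
Record = namedtuple("Record", "word tag frequency reverse")


def update_lexical_db(db, tokens):
    """Add (word, tag) pairs to db, incrementing the frequency cumulatively."""
    for word, tag in tokens:
        old = db.get(word)
        frequency = old.frequency + 1 if old else 1
        kept_tag = tag or (old.tag if old else "")
        db[word] = Record(word, kept_tag, frequency, word[::-1])
    return db


# Two runs against the same output database, as with a text split into parts.
db = {}
update_lexical_db(db, [("hu", "CONJUNC"), ("wæs", "BE_PAST"), ("god", "")])
update_lexical_db(db, [("wæs", "BE_PAST"), ("cyning", "")])

# Sorting on the fourth field yields a reverse-order listing.
for record in sorted(db.values(), key=lambda r: r.reverse):
    print(record)
```

Reusing the same `db` across calls is what makes the count cumulative; starting with a fresh dictionary for each text would give per-text frequencies instead.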
26. rable amount of additional work beyond the collection of texts has to be undertaken.

Grammatical information is normally included in corpora by tagging word forms, i.e. by adding a label to words identifying them grammatically. If tagging is to be done, then it is only sensible to do it completely. Quite apart from the actual work of tagging, agreement must be reached in advance on the system of classification to be used. The advantage for users of a corpus is obvious: the retrieval of grammatical information from a text or texts is vastly facilitated if grammatical affiliation has already been specified via tagging.

Given the size of the task, it is imperative to use every resource available for accelerating the process. In effect this means employing tagging software for the purpose. In the Lexa suite the main programme, called simply Lexa, is designed to tag texts automatically. The operator of the programme, be he or she a compiler or user of a corpus, must specify what words are to be tagged in what way by creating one or more lemma definition files. The programme takes note of these definitions and then examines any set of input texts, adding tags to words it deems to represent the grammatical classes for which tags exist in the definition file or files. The programme runs in an automatic, semi-automatic or manual mode. Tagging can be done cumulatively, tags can be exchanged or updated and particularly frequent words (prepositions, articles, et
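The automatic mode described here can be sketched in Python. This is a simplified illustration, not Lexa's algorithm or its lemma definition file format: the definitions are modelled as a dictionary from tag to word forms, the tag names and the function `tag_text` are hypothetical, and tags are attached with an underscore in the manner of the tagged examples in this paper. Words which already carry a tag are left alone, so tagging can be repeated cumulatively with further definitions.

```python
import re

# Hypothetical lemma definitions: tag -> word forms that should receive it.
DEFINITIONS = {
    "CONJUNC": {"and", "hu", "swa"},
    "BE_PAST": {"was", "were", "weren"},
}


def tag_text(text, definitions):
    """Append '_TAG' to every untagged word form listed in the definitions."""
    lookup = {form: tag for tag, forms in definitions.items() for form in forms}

    def add_tag(match):
        word = match.group(0)
        tag = lookup.get(word.lower())
        return f"{word}_{tag}" if tag else word

    # '_' counts as a word character, so \b never falls inside 'Swa_CONJUNC'
    # and already tagged tokens are skipped.
    return re.sub(r"\b[^\W\d_]+\b", add_tag, text)


print(tag_text("And wel we weren esed atte beste", DEFINITIONS))
```

A semi-automatic or manual mode would ask the operator to confirm each proposed tag before it is written; that interaction is omitted from this sketch.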
27. sation, particularly on the part of medieval scholars, there are obvious advantages to it as it allows later readers to approach a text or texts without undue linguistic difficulty, to see the wood for the trees so to speak.

In contradistinction to printing, in the compilation of corpora no a priori decision has to be made about whether to distribute a normalised text or not. Instead the corpus should include an original unaltered form of a text along with the means for users of the corpus to normalise the text later if they so wish. Just what these means are should be explained briefly.

To begin with, recall that the process of normalisation consists of replacing variant forms of a word by some standard or normalised form. What one needs then is software which will recognise every occurrence of a variant form as an instantiation of a normal form. This is realised by creating a list of normal forms and of all forms which represent variants of these. Technically this is achieved by generating a database. For each variant form there is a single record which at the very least consists of two fields. The first is that for the variant and the second is for the normal form with which the variant is associated. The database will have as many records as there are variant forms to be replaced by normal forms. The extent of the normalisation is thus dependent solely on the number of substitutions which the compi
28. xt line with a certain number of words to the left and right of it included as context; this is known as a KWIC, or 'keyword in context', file.

Keyword in context (KWIC) file

  33  And made forward erly for  to  ryse
  34                             To  take oure wey ther as I yow deuyse
  37   Me thynketh it acordaunt  to  resoun
  38                             To  telle yow al the condicioun
Wel
  24                             Wel  nyne and twenty in a compaignye
  29                        And  wel  we weren esed atte beste
Whan
   1                             Whan  that Aprill with hise shoures soote
  30                And shortly  whan  the sonne was to reste
What
   5                             What  Zephirus eek with his sweete breath
  18       That hem hath holpen  what  that they were seeke
  40 An whiche they were and of  what  degree
  41
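A KWIC listing of this kind can be sketched in Python as follows. This is an illustration only and makes no claim about how Lexa builds its concordance files: the function name `kwic` and the parameters `span` (number of context words on each side) and `width` (column width for the left context) are assumptions made for the example.

```python
def kwic(lines, keyword, span=4, width=28):
    """Return one row per hit: line number, left context, keyword, right context."""
    rows = []
    for number, line in enumerate(lines, start=1):
        words = line.split()
        for i, word in enumerate(words):
            if word.lower() == keyword.lower():
                left = " ".join(words[max(0, i - span):i])
                right = " ".join(words[i + 1:i + 1 + span])
                # Right-align the left context so the keywords line up.
                rows.append(f"{number:>4} {left:>{width}}  {word}  {right}")
    return rows


sample = [
    "And shortly whan the sonne was to reste",
    "To take oure wey ther as I yow deuyse",
]
for row in kwic(sample, "to"):
    print(row)
```

A KWOC listing differs only in layout: the keyword would be printed in a separate column on the left, followed by the complete text line from which it was taken.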
    