Computer-Assisted Lemmatisation of a Cornish Text Corpus for Lexicographical Purposes
Total Page:16
File Type:pdf, Size:1020Kb
Computer-assisted Lemmatisation of a Cornish Text Corpus for Lexicographical Purposes Jon Mills PhD Thesis University of Exeter 2002 1 Abstract This project sets out to discover and develop techniques for the lemmatisation of a historical corpus of the Cornish language in order that a lemmatised dictionary macrostructure can be generated from the corpus. The system should be capable of uniquely identifying every lexical item that is attested in the corpus. A survey of published and unpublished Cornish dictionaries, glossaries and lexicographical notes was carried out. A corpus was compiled incorporating specially prepared new critical editions. An investigation into the history of Cornish lemmatisation was undertaken. A systemic description of Cornish inflection was written. Three methods of corpus lemmatisation were trialed. Findings were as follows. Lexicographical history shapes current Cornish lexicographical practice. Lexicon based tokenisation has advantages over character based tokenisation. System networks provide the means to generate base forms from attested word types. Grammatical difference is the most reliable way of disambiguating homographs. A lemma that contains three fields, the canonical form, the part-of-speech and a semantic field label, provides of a unique code for every lexeme attested in the corpus. Programs which involve human interaction during the lemmatisation process allow bootstrapping of the lemmatisation database. Computerised morphological processing may be used at least to partially create the lemmatisation database. Disambiguation of at least some of the most common homographs may be automated by the use of computer programs. 2 Table of Contents TABLE OF CONTENTS ....................................................................... 3 TABLE OF FIGURES........................................................................... 6 1 INTRODUCTION ......................................................................... 15 1.1 Nature and scope of problem.................................................................................. 18 1.2 Method of investigation........................................................................................... 19 1.3 Principal findings..................................................................................................... 23 2 CORNISH DICTIONARIES, GLOSSES & LEXICOGRAPHICAL NOTES ............................................................................................... 26 2.1 The historical perspective ....................................................................................... 26 2.2 Onomastic dictionaries............................................................................................ 55 2.3 Interlingual relations............................................................................................... 62 3 THE CORPUS OF CORNISH...................................................... 77 3.1 Chronology of the corpus of Cornish..................................................................... 78 3.2 Methodology for compiling a historical corpus................................................... 126 4 THE LEMMA ............................................................................. 148 3 4.1 Lexical Variation ................................................................................................... 149 4.1.1 Synchronic variation ........................................................................................... 150 4.1.2 Derivational variation ......................................................................................... 179 4.1.3 Diachronic variation............................................................................................ 183 4.2 The entry-form....................................................................................................... 189 4.2.1 The base form ..................................................................................................... 191 4.2.2 The canonical form ............................................................................................. 199 4.2.3 Compounds ......................................................................................................... 202 4.3 Alphabetisation ...................................................................................................... 204 4.3.1 Derived forms ..................................................................................................... 206 4.3.2 Compounds and multi-word lexemes.................................................................. 209 4.4 The Historical Development of the Cornish Lemma .......................................... 213 5 METHODOLOGY OF CORPUS LEMMATISATION ................. 244 5.1 Lexeme tagging ...................................................................................................... 245 5.2 Lemmatisation databases...................................................................................... 248 5.3 VOLTA: a method developed for the Corpus of Cornish .................................. 251 5.4 Normalisation......................................................................................................... 260 5.5 Lemmatisation rules .............................................................................................. 264 5.6 The stochastic approach to generating morphological rules.............................. 272 5.7 Manual creation of a morphological analyser..................................................... 283 5.8 Homograph Separation ......................................................................................... 295 5.9 Interlingual Lemmatisation .................................................................................. 336 4 6 CONCLUSION .......................................................................... 342 BIBLIOGRAPHY .............................................................................. 367 Cited Dictionaries................................................................................................................. 367 Manuscripts Cited................................................................................................................ 372 Software Cited...................................................................................................................... 374 Other Works Cited...............................................................................................................374 INDEX............................................................................................... 391 5 Table of Figures Figure 1 Hierarchical system ...........................................................................21 Figure 2 Simultaneous system .........................................................................21 Figure 3 Simple system....................................................................................22 Figure 4 Compound system .............................................................................22 Figure 5 Disjunctive system.............................................................................23 Figure 6 Gloss from Oxoniensis Posterior ......................................................28 Figure 7 Lhuyd’s long-tailed-U .......................................................................34 Figure 8 Equivalents of Cornish PEN..............................................................69 Figure 9 SDMC, English lexeme BANK.........................................................73 Figure 10 The corpus of Old Cornish ..............................................................79 Figure 11 The corpus of Middle Cornish.........................................................80 Figure 12 The corpus of Modern Cornish........................................................82 Figure 13 Comparative size of the main corpus texts....................................126 Figure 14 Extract from Beunans Meriasek ....................................................131 Figure 15 First occurrence of the...................................................................132 Figure 16 The scale of rank ...........................................................................133 6 Figure 17 The unit of lemmatisation system..................................................133 Figure 18 Algorithm for character based tokenisation ..................................139 Figure 19 Algorithm for lexicon based tokenisation .....................................141 Figure 20 Simple dictionary for lexicon based tokenisation .........................141 Figure 21 Examples of combinatorial ambiguity...........................................142 Figure 22 Possible solutions of lexicon based tokenisation...........................144 Figure 23 Critical tokenisation.......................................................................145 Figure 24 Critical tokenisation implemented in Prolog database ..................146 Figure 25 The synchronic variation system of Cornish.................................151 Figure 26 The Cornish inflection system.......................................................154 Figure 27 The Cornish nominal inflection system.........................................156 Figure 28 The vowel affection system...........................................................159 Figure 29 The verbal inflection system .........................................................160 Figure 30 The past participle inflection system.............................................162