Computer-Assisted Lemmatisation of a Cornish Text Corpus for Lexicographical Purposes

Computer-assisted Lemmatisation of a Cornish Text Corpus for Lexicographical Purposes Jon Mills PhD Thesis University of Exeter 2002 1 Abstract This project sets out to discover and develop techniques for the lemmatisation of a historical corpus of the Cornish language in order that a lemmatised dictionary macrostructure can be generated from the corpus. The system should be capable of uniquely identifying every lexical item that is attested in the corpus. A survey of published and unpublished Cornish dictionaries, glossaries and lexicographical notes was carried out. A corpus was compiled incorporating specially prepared new critical editions. An investigation into the history of Cornish lemmatisation was undertaken. A systemic description of Cornish inflection was written. Three methods of corpus lemmatisation were trialed. Findings were as follows. Lexicographical history shapes current Cornish lexicographical practice. Lexicon based tokenisation has advantages over character based tokenisation. System networks provide the means to generate base forms from attested word types. Grammatical difference is the most reliable way of disambiguating homographs. A lemma that contains three fields, the canonical form, the part-of-speech and a semantic field label, provides of a unique code for every lexeme attested in the corpus. Programs which involve human interaction during the lemmatisation process allow bootstrapping of the lemmatisation database. Computerised morphological processing may be used at least to partially create the lemmatisation database. Disambiguation of at least some of the most common homographs may be automated by the use of computer programs. 2 Table of Contents TABLE OF CONTENTS ....................................................................... 3 TABLE OF FIGURES........................................................................... 6 1 INTRODUCTION ......................................................................... 15 1.1 Nature and scope of problem.................................................................................. 18 1.2 Method of investigation........................................................................................... 19 1.3 Principal findings..................................................................................................... 23 2 CORNISH DICTIONARIES, GLOSSES & LEXICOGRAPHICAL NOTES ............................................................................................... 26 2.1 The historical perspective ....................................................................................... 26 2.2 Onomastic dictionaries............................................................................................ 55 2.3 Interlingual relations............................................................................................... 62 3 THE CORPUS OF CORNISH...................................................... 77 3.1 Chronology of the corpus of Cornish..................................................................... 78 3.2 Methodology for compiling a historical corpus................................................... 126 4 THE LEMMA ............................................................................. 148 3 4.1 Lexical Variation ................................................................................................... 149 4.1.1 Synchronic variation ........................................................................................... 150 4.1.2 Derivational variation ......................................................................................... 179 4.1.3 Diachronic variation............................................................................................ 183 4.2 The entry-form....................................................................................................... 189 4.2.1 The base form ..................................................................................................... 191 4.2.2 The canonical form ............................................................................................. 199 4.2.3 Compounds ......................................................................................................... 202 4.3 Alphabetisation ...................................................................................................... 204 4.3.1 Derived forms ..................................................................................................... 206 4.3.2 Compounds and multi-word lexemes.................................................................. 209 4.4 The Historical Development of the Cornish Lemma .......................................... 213 5 METHODOLOGY OF CORPUS LEMMATISATION ................. 244 5.1 Lexeme tagging ...................................................................................................... 245 5.2 Lemmatisation databases...................................................................................... 248 5.3 VOLTA: a method developed for the Corpus of Cornish .................................. 251 5.4 Normalisation......................................................................................................... 260 5.5 Lemmatisation rules .............................................................................................. 264 5.6 The stochastic approach to generating morphological rules.............................. 272 5.7 Manual creation of a morphological analyser..................................................... 283 5.8 Homograph Separation ......................................................................................... 295 5.9 Interlingual Lemmatisation .................................................................................. 336 4 6 CONCLUSION .......................................................................... 342 BIBLIOGRAPHY .............................................................................. 367 Cited Dictionaries................................................................................................................. 367 Manuscripts Cited................................................................................................................ 372 Software Cited...................................................................................................................... 374 Other Works Cited...............................................................................................................374 INDEX............................................................................................... 391 5 Table of Figures Figure 1 Hierarchical system ...........................................................................21 Figure 2 Simultaneous system .........................................................................21 Figure 3 Simple system....................................................................................22 Figure 4 Compound system .............................................................................22 Figure 5 Disjunctive system.............................................................................23 Figure 6 Gloss from Oxoniensis Posterior ......................................................28 Figure 7 Lhuyd’s long-tailed-U .......................................................................34 Figure 8 Equivalents of Cornish PEN..............................................................69 Figure 9 SDMC, English lexeme BANK.........................................................73 Figure 10 The corpus of Old Cornish ..............................................................79 Figure 11 The corpus of Middle Cornish.........................................................80 Figure 12 The corpus of Modern Cornish........................................................82 Figure 13 Comparative size of the main corpus texts....................................126 Figure 14 Extract from Beunans Meriasek ....................................................131 Figure 15 First occurrence of the...................................................................132 Figure 16 The scale of rank ...........................................................................133 6 Figure 17 The unit of lemmatisation system..................................................133 Figure 18 Algorithm for character based tokenisation ..................................139 Figure 19 Algorithm for lexicon based tokenisation .....................................141 Figure 20 Simple dictionary for lexicon based tokenisation .........................141 Figure 21 Examples of combinatorial ambiguity...........................................142 Figure 22 Possible solutions of lexicon based tokenisation...........................144 Figure 23 Critical tokenisation.......................................................................145 Figure 24 Critical tokenisation implemented in Prolog database ..................146 Figure 25 The synchronic variation system of Cornish.................................151 Figure 26 The Cornish inflection system.......................................................154 Figure 27 The Cornish nominal inflection system.........................................156 Figure 28 The vowel affection system...........................................................159 Figure 29 The verbal inflection system .........................................................160 Figure 30 The past participle inflection system.............................................162

Computer-Assisted Lemmatisation of a Cornish Text Corpus for Lexicographical Purposes

Th/-Dh and Final -F/-V in Cornish Discussion Paper for the AHG Michael Everson and Nicholas Williams 12 March 2008

O-TYPE VOWELS in CORNISH Dr Ken George

16A Bus Time Schedule & Line Route

An Introduction to Cornish Place Names

201914Th-28Th September Programme of Events

Cornish Association of NSW - No

Summary of Sensory Team Manager Duties

Dolly Pentreath

Accents, Dialects and Languages of the Bristol Region

The Cornish Language in Education in the UK

Worcestershire Has Fluctuated in Size Over the Centuries

The Death of Cornish