Computational Approaches to Mapping Historical Irish Cognate Verb Forms
Total Page:16
File Type:pdf, Size:1020Kb
Past, present and future: Computational approaches to mapping historical Irish cognate verb forms Theodorus Leman Franciscus Fransen Thesis submitted for the Degree of Doctor of Philosophy Centre for Language & Communication Studies School of Linguistic, Speech and Communication Sciences Trinity College Dublin The University of Dublin 2019 Declaration I declare that this thesis has not been submitted as an exercise for a degree at this or any other university and it is entirely my own work. I agree to deposit this thesis in the University’s open access institutional repository or allow the Library to do so on my behalf, subject to Irish Copyright Legislation and Trinity College Library conditions of use and acknowledgement. Summary This thesis investigates how computational methods can be used to enhance our understanding of the significant historical developments in the verbal system between Old Irish (c. 8th–9th centuries A.D.) and Modern Irish (13th century onwards). Out of all grammatical subsystems, the verbal system is subject to the most severe morphological changes between Old and Modern Irish, i.e., during the Middle Irish period (c. 10th–12th centuries). The main contribution of this thesis is the creation of a morphological Finite-State Transducer (FST) for Old Irish, focusing on verbs, successfully implemented in the finite-state tool foma (Hulden 2009). The FST is an important advancement in Natural Language Processing for Old Irish and will assist research in various linguistic subdisciplines as well as in medieval Irish philology. Chapter 1 demonstrates that a hiatus exists in digital linguistic support for historical Irish language periods. This hiatus in coverage and continuity is particularly true for Early Modern Irish (c. 13th–mid 17th centuries); it not only deters one from carrying out a systematic di- achronic analysis of the verbal system, but also complicates the task of satisfactorily linking up available and emerging lexical resources for Old Irish and Modern Irish. The otherwise invalu- able electronic Dictionary of the Irish Language (eDIL), covering the period c. 700–1700, was not deemed feasible as a starting point for a morphological parser for Old Irish. However, it is part of a proposed linking framework presented in Chapter 6. The focus in the current work is on Old Irish (rather than Middle or Early Modern Irish) due to its (comparatively) uniform, normative and well-resourced nature. Chapter 2 focuses on the complex verbal system of Old Irish, introducing in a stepwise fashion the various elements that constitute the verbal complex: one accentual unit which may comprise, apart from the lexical root, various prefixes, infixes and suffixes. The concept of the verbal complex is also of significant importance for the computational implementation. The stress system of Old Irish results in a bewildering array of inflectional variation, especially in the ‘middle part’ of the verb. A major challenge that had to be overcome in this thesis is capturing the complex interplay between morphology and phonology in order to arrive at a feasible way of implementing Old Irish verb morphology programmatically. The current work focuses on the weak verb classes W1 and W2a, whose inflection patterns are much more predictable than the inflection patterns of the strong types. Chapter 3 reports on a plethora of techniques used in Natural Language Processing for his- torical texts. No ‘one-size-fits-all’ algorithm currently exists. Lemmatisation is often aided by iii iv morphological or orthographical rules, or by approximate matching techniques (string similar- ity). An approximate matching algorithm is used in Dereza’s (2016) Early Irish lemmatiser, based on eDIL. If a Part-Of-Speech Tagger for a modern variety is available, one can develop a standardisation module to bring historical forms in line with a modern standard, which are subsequently input to the tagger. This approach is employed in the context of the Royal Irish Academy’s Corpas Stairiúil na Gaeilge (‘Historical Irish Corpus’) 1600–1926. Due to the linguistic distance between Old and Modern Irish, adapting available NLP tools for the modern language was not an option. It was therefore decided to build a morphological parser for Old Irish from the ground up. Together with the tagging tools for Modern Irish, it is envisaged to constitute the core of a ‘two-pronged attack’: two morphological/morphosyntactic tools for normative and (relatively) well-resourced language varieties at opposite ends of the (historical) chronological spectrum, with standardisation methods to arrive at either end. Chapter 4 covers in detail the building of a morphological parser for Old Irish using the finite-state paradigm of two-level morphology. A key aspect of the FST implementation is the encoding of what is called the ‘monolithic stem’ in this work: a non-derived multi- morpheme base, not trivially segmentable on the surface, reflecting the ‘middle part’ of the verbal complex—crucial for operating with straightforward prefixation and suffixation rules. Another important aspect of the finite-state implementation includes the creation of two main lexicons: one for proclitics (e.g., ní ‘not’) and one for the stems and endings, accompanied by a set of fine-grained morphotactic and morphophonemic rules. Chapter 5 is dedicated to testing the FST using a digital edition of the text Táin Bó Fraích (TBF), based on Meid (1974). This chapter aims to find out how well the morphological analyser performs, and which issues one might encounter when applying an Early Irish text to the FST for verbs. For the 27 weak verb lemmas (Old Irish W1 and W2a) in this text, which are at the heart of my study, 36 unique inflected forms out of 50 in total (72%) received a morphological parse. After using Dereza’s (2016) Lemmatiser, an additional 10 were found to be correctly recognised. After having implemented a selection of function words and personal names, it was found that 9.6% of the total amount of words in TBF were covered; comparing this figure against another four Old Irish narrative texts points to a similar figure of about 10%. Chapter 6 concludes the work and provides a roadmap for the future. It proposes a prelim- inary infrastructure for bidirectional mappings between cognate verb forms. The first ‘route’ involves lexical tag mappings between my morphological FST for Old Irish and the one for Modern Irish (Uí Dhonnchadha & van Genabith 2006). A second strategy is to link eDIL lem- mas with their modern counterparts in Foclóir Gaeilge-Béarla (Ó Dónaill 1977), as contained in the Modern Irish FST. Implementing lexical-level tag mappings between the Old Irish and Modern Irish FSTs will assist research on the development of verbs and historical roots, mak- ing possible a systematic diachronic study of processes such as lexicalisation, relevant for the process of verb stem formation throughout the history of Irish. Possible computational so- lutions to dealing with grammatical and orthographical variation in Early Irish (i.e., Old and Middle Irish) are also discussed in this chapter. Acknowledgements Firstly, I would like to gratefully acknowledge two funding bodies for awarding me with schol- arships that enabled me to successfully complete my doctoral degree: Trinity College Dublin (Ussher Fellowship, 2014-2017) and the Irish Research Council (Government of Ireland Post- graduate Scholarship GOIPG/2017/1808). Thanks also to former and current postgraduate friends and colleagues at South Leinster Street, the Trinity Long Room Hub, and College in general, for creating a friendly and stim- ulating study environment. I received great support in particular from Frances Brady, Siobán˙ O’Brien Green and Tim van Wanrooij. Thank you all. Dr Eoin Mac Cárthaigh (Department of Irish and Celtic Languages, Trinity College Dublin) provided helpful feedback on an earlier draft in the context of my Ph.D. confirmation process. I am also grateful to Professor David Stifter (Department of Early Irish, Maynooth University) for showing an interest in my project and for making time to discuss matters relating to Old Irish verb morphology with me. Many of my colleagues kindly offered to proofread parts of this thesis, namely, Dr Ciaran McDonough, Dr Katherine Ravenna Morales, Dr Daniel Watson, Dr Ruud Koolen, Alejandra Núñez Asomoza, Antoin Eoin Rodgers and Adrian Doyle. All remaining errors and omissions are of course my own. My parents have always endorsed my educational choices and career path; I would like to take this opportunity to thank them from the bottom of my heart for their support and advice along the way. This work is dedicated to two former Dutch mentors who sparked my research interest in language variation and historical linguistics: Dr Ben Hermans, phonologist and dialectologist formerly of Tilburg University and the Meertens Institute in Amsterdam, and Professor Peter Schrijver, chair of Celtic Languages and Culture at Utrecht University. Heel erg bedankt voor uw inspirerende colleges en begeleiding. Last but not least, I would like to express my sincere gratitude to my supervisors Dr Elaine Uí Dhonnchadha and Professor Ruairí Ó hUiginn for providing guidance and generously shar- ing their knowledge in the course of my project. Míle buíochas ó chroí! v vi Contents List of Tables xiii List of Figures xv List of Code Examples xvii Glossary xix 1 Introduction: background and goals 1 1.1 Introduction . .1 1.2 Historical overview of the Irish language . .3 1.2.1 Language stages . .3 1.2.2 Orthography . .4 1.3 ‘Standard’ Old Irish . .5 1.4 Relevant (mainly) printed works for historical Irish . .7 1.5 A ‘lexicographical gap’ . .9 1.6 Research goals . 12 1.6.1 Aims and objectives . 12 1.6.2 Scope . 14 1.7 Synthesis . 14 2 Old Irish verbs: the morphology-phonology interface 17 2.1 Introduction . 17 2.2 The Early Irish verbal complex . 17 2.2.1 Basic structure and terminology .