Corpas na Gaeilge (1882-1926): Integrating Historical and Modern Irish Texts

Elaine Uí Dhonnchadha (3), Kevin Scannell (7), Ruairí Ó hUiginn (2), Eilís Ní Mhearraí (1), Máire Nic Mhaoláin (1), Brian Ó Raghallaigh (4), Gregory Toner (5), Séamus Mac Mathúna (6), Déirdre D’Auria (1), Eithne Ní Ghallchobhair (1), Niall O’Leary (1)

(1) Royal Irish Academy, Dublin, Ireland
(2) National University of Ireland Maynooth, Ireland
(3) Trinity College Dublin, Ireland
(4) Dublin City University, Ireland
(5) Queen's University Belfast, Northern Ireland
(6) University of Ulster, Northern Ireland
(7) Saint Louis University, Missouri, USA
Abstract

This paper describes the processing of a corpus of seven million words of Irish texts from the period 1882-1926. The texts, which have been captured by typing or optical character recognition, are processed for lexicographic purposes. First, all historical and dialectal word forms are annotated with their modern standard equivalents using software developed for this purpose. Then, using the modern standard annotations, the texts are processed with an existing finite-state morphological analyser and part-of-speech tagger. This method lets us retain the original historical text while providing full corpus-search capabilities over modern lemmas and inflected forms; the historical forms remain searchable as well. It also makes use of existing NLP tools for modern Irish and enables the integration of historical and modern Irish corpora.

Keywords: historical corpus, normalisation, standardisation, natural language processing, Irish, Gaeilge

1.