A Faroese Part-Of-Speech Tagger Built with Icelandic Methods Data Preperation, Training and Evaluation
Total Page:16
File Type:pdf, Size:1020Kb
M.A. thesis Language Technology A Faroese part-of-speech tagger built with Icelandic methods Data preperation, training and evaluation Hinrik Hafsteinsson Supervisor Dr. Anton Karl Ingason October 2020 University of Iceland School of Humanities Language Technology A Faroese part-of-speech tagger built with Icelandic methods Preperation, training and evaluation M.A. thesis in language technology Hinrik Hafsteinsson Kt.: 010694-3109 Supervisor: Dr. Anton Karl Ingason September 2020 Abstract This thesis describes the development of a dedicated, high-accuracy part-of-speech (PoS) tagging solution for Faroese. To achieve this, a state-of-the-art neural PoS tag- ger for Icelandic, ABLTagger, was trained on the 100,000 word Sosialurin PoS-tagged corpus for Faroese, standardised with methods previously applied to Icelandic cor- pora. This tagger was supplemented with a novel Experimental Database of Faroese Inflection (EDFM), which contains morphological information on 67,488 Faroese words with about one million inflectional forms. This approach produced a PoS- tagging model for Faroese which achieves a 91.40% overall accuracy when evaluated with 10-fold cross validation, which is currently the highest accuracy for a dedicated Faroese PoS-tagging implementation. The tagging model, morphological database, proposed revised PoS tagset for Faroese as well as a revised and standardised Sosial- urin corpus are all presented as products of this project and are made available for use in further research in Faroese language technology. Útdráttur Þessi ritgerð lýsir þróun nákvæms málfræðimarkara fyrir færeysku. Til að ná slíku fram var íslenski tauganetsmarkarinn ABLTagger, sem hefur náð besta birta árangri í íslenskri málfræðimörkun, þjálfaður á færeyskri markaðri málheild sem kennd er við dagblaðið Sosialurin og inniheldur u.þ.b. 100.000 lesmálsorð. Færeyska mörkunar- líkanið notast við nýja Bráðabirgðabeygingarlýsingu færeysks nútímamáls (BBFN) til að betrumbæta mörkunina en beygingarlýsingin inniheldur beygingargögn fyrir um 67,488 færeysk orð, samtals u.þ.b. milljón stakar beygingarmyndir. Þessi aðferð skilaði mörkunarlíkani fyrir færeysku sem nær 91,40% mörkunarnákvæmni, sem er besti birti árangur í sjálfvirkri málfræðimörkun á færeysku. Mörkunarlíkanið, beygingarlýsingin, tillaga að endurbættu færeysku markamengi og yfirfarin Sosial- urin málheild eru allt afurðir þessa verkefnis og eru gerðar aðgengilegar, svo þær megi nýtast sem best í frekari rannsóknum í færeyskri máltækni. iii Contents List of abbreviations vi List of acronyms vi List of Tables vii List of Figures viii Preface 1 1 Introduction 2 2 Previous work 4 2.1 Part-of-speech tagging . .4 2.2 Faroese NLP research . .5 2.2.1 The Sosialurin project . .6 2.2.2 FarParsald experiments . .7 2.2.3 Faroese implementation by Giellatekno . .9 2.3 The ABLTagger experiment for Icelandic . .9 2.3.1 Data sets . 10 2.3.1.1 Corpora used for training . 10 2.3.1.2 Morphological component . 11 2.3.2 Tagger architecture . 13 2.3.2.1 Artificial neural networks and bi-LSTMs . 13 2.3.2.2 Vectorised input . 14 2.3.2.3 Pre-tagging step . 16 2.3.3 Experiment setup and results . 17 3 Data set creation and standardisation 19 3.1 Faroese Tagset . 19 3.1.1 Original Sosialurin tagset . 19 3.1.2 Proposed Revised Tagset . 23 3.1.2.1 Revisions from the MIM-GOLD tagset . 24 3.1.2.2 Language specific revisions . 27 3.2 Preperation of the Sosialurin Corpus . 30 3.3 Experimental Database of Faroese Morphology . 33 3.3.1 Collecting morphological data . 33 3.3.1.1 Dictionary of Faroese . 34 3.3.1.2 Faroese given names . 36 3.3.1.3 Supplementation with Wiktionary . 37 3.3.2 Formatting the morphological data . 38 iv 4 Training and evaluation 42 4.1 Training setup . 42 4.1.1 Network parameters and cross validation . 42 4.1.2 Effects of tagset revision: Three models . 43 4.1.3 Effects of corpus content: Comparison to MIM-GOLD ..... 44 4.1.4 Addition of morphological database . 45 4.2 Evaluation results . 46 4.2.1 Tagset revisions and error analysis . 46 4.2.2 Corpus size and contents . 48 4.2.3 Morphological data . 50 4.2.4 Applicability of the Faroese model . 51 5 Conclusion 53 Bibliography 55 A Cross validation splits of training corpora 62 A.1 Revised Sosialurin corpus . 63 A.2 Baseline Sosialurin corpus . 64 A.3 MIM-GOLD FBL reference corpus . 65 A.4 MIM-GOLD MBL reference corpus . 66 A.5 Resized MIM-GOLD MBL-resize reference corpus . 67 v Abbreviations 1p 1st grammatical person. 2p 2nd grammatical person. 3p 3rd grammatical person. Acc. Accusative case. Dat. Dative case. Gen. Genitive case. Imp. Imperative mood. Nom. Nominative case. Past part. Past participle. Plur. Plural number. Prs. part. Present participle. Sng. Singular number. Acronyms BiLSTM Bidirectional long short-term memory network. DIM Database of Icelandic Morphology. EDFM Experimental Database of Faroese Morphology. FarPaHC Faroese Parsed Historical Corpus. IcePaHC Icelandic Parsed Historical Corpus. IFD Icelandic Frequency Dictionary corpus [Íslensk orðtíðnibók]. LSTM Long short-term memory network. MIM Tagged Icelandic corpus [Mörkuð íslensk málheild]. MIM-GOLD MIM gold standard corpus. NLP Natural language processing. PoS Part of speech. PPCHE Penn Parsed Corpora of Historical English. RNN Recurrant neural network. vi List of Tables 1 Excerpt of English PUD corpus showing Google universal tags . .4 2 Excerpt from the IFD corpus, with explanations . 10 3 Icelandic noun inflectional paradigm: Inflections of grunnur ...... 12 4 Fields of the DIM basic format . 12 5 Fields of the DIM basic format, reiterated . 15 6 Accuracy of ABLTagger models when trained on IFD . 17 7 Accuracy of full ABLTagger model when trained on MIM-GOLD . 18 8 Example of original Sosialurin tagset, with explanations . 20 9 Original Faroese tagset, used in the Sosialurin corpus . 21 10 Revisions based on the Icelandic MIM-GOLD tagging scheme . 24 11 Subcategory revisions for pronoun tags . 25 12 Adverb tag revisions . 25 13 Numeral subcategory revisions . 26 14 Punctuation mark tags in revised tagset . 27 15 Verbal conjugation in Faroese and Icelandic . 27 16 Tagging scheme for verbs in the Sosialurin corpus . 28 17 Revised tag strings for indicative forms of the verb siga ........ 28 18 Revised Sosialurin tagset for Faroese . 29 19 Comparison of original and revised Sosialurin corpora . 32 20 Nominal OBG morphological information . 34 21 Verbal OBG morphological information on verb skriva ........ 35 22 Complete morphological descreption of verb skriva .......... 35 23 Sections of the OBG database which contain inflectional information 36 24 Number of given names compiled . 37 25 UniMorph format description for verb skriva .............. 38 26 Contents of EDFM by word class and source database . 40 27 Source databases in the EDFM, with in-data flag . 41 28 Examples of proposed revisions for plural verb tags . 43 29 Occurances of plural verb tags in the original Sosialurin corpus . 44 30 Models trained to evaluate tagset revisions . 44 31 Icelandic reference models compared to the revised Sosialurin . 45 32 Models with morphological data and lexicon representation . 46 33 Accuracy of taggers trained on different tagsets . 47 34 10 most common errors in tagset evaluation, divided by tagging scheme 47 vii 35 Revised Sosialurin model, reference models and ABLTagger baseline . 49 36 Contents of Sosialurin corpus compared to the Icelandic reference corpora . 49 37 Sosialurin morphology evaluation results . 50 38 Morphology model results, accuracy gain and ratio of represented lexicon . 51 39 Tagging accuracy for the current project compared to previous best . 52 40 Sosialurin Revised cross validation splits . 63 41 Sosialurin Baseline cross validation sets . 64 42 MIM-GOLD FBL reference corpus cross validation sets . 65 43 MIM-GOLD MBL reference corpus cross validation sets . 66 44 MIM-GOLD MBL-resize reference cross validation sets . 67 List of Figures 1 Example of Sosialurin corpus text format . .7 2 Excerpt from the DIM basic format output: Inflections of grunnur .. 13 3 Excerpt from EDFM output: Inflections of grunnur .......... 40 viii Preface Faroese has for a long time been an important topic of interest for me. I was fortunate to be able to study it as a subject in my bachelor studies in linguistics, where I wrote my BA thesis on Faroese morphology. The idea of creating a PoS tagger for Faroese using Icelandic NLP tools and methods, as is the focus of the project described here, first came to mind in the spring semester of 2019, when I was attending the course Language analysis in Icelandic context, taught by Hulda Óladóttir at the University of Iceland, where I performed a miniature experiment on the topic. In that sense, this project has been more than a year in the making, although most of the work which led to this thesis was done in the last few months, with most of the thesis itself materialising in the last few weeks. During this period I was living in Germany for an exchange semester at the University of Tübingen, which ended under very unusual circumstances due to the ongoing pandemic. For the last month of the summer, I was lucky to receive a desk at the Language and Tech lab at the University of Iceland, specifically in the old emeritus room in Árnagarður, where I have spent most of my waking hours these last few weeks and which is, indeed, where I am sitting at this very moment. First and foremost, I would like to thank Dr Anton Karl Ingason for the un- faltering help and insight he provided during his supervision of this project, and for taking on the task in the first place. Dr Jógvan í Lon Jacobsen, Zakaris Sv- abo Hansen, Heðin Jákupsson and Dr Trond Trosterud deserve special thanks for the correspondences regarding all things Faroese. Dagbjört Guðmundsdóttir, PhD student in Icelandic linguistics, also recieves my best thanks for the support and numerous reality checks during my studies.