Named Entity Recognition for Icelandic – Annotated Corpus and Neural Models
Total Page:16
File Type:pdf, Size:1020Kb
Named Entity Recognition for Icelandic – Annotated Corpus and Neural Models by Svanhvít Lilja Ingólfsdóttir Thesis of 60 ECTS credits submitted to the School of Technology, Department of Computer Science at Reykjavík University in partial fulfillment of the requirements for the degree of Master of Science (M.Sc.) in Language Technology June 2020 Examiners: Hrafn Loftsson, Supervisor Associate Professor, Reykjavík University, Iceland Hannes Högni Vilhjálmsson, Examiner Professor, Reykjavik University, Iceland, Yngvi Björnsson, Examiner Professor, Reykjavik University, Iceland i Copyright Svanhvít Lilja Ingólfsdóttir June 2020 ii Named Entity Recognition for Icelandic – Annotated Corpus and Neural Models Svanhvít Lilja Ingólfsdóttir June 2020 Abstract Named entity recognition (NER) is the task of automatically extracting and classifying the names of people, places, companies, etc. from text, and can additionally include numerical entities, such as dates and monetary amounts. NER is an important prepro- cessing step in various natural language processing tasks, such as in question answering, machine translation, and speech recognition, but can prove a difficult task, especially in highly-inflected languages where each entity can have many different surface forms. We have annotated all named entities in a text corpus of one million tokens to create the first annotated NER corpus for Icelandic, containing around 48,000 named entities. The data has then been used for training neural networks to annotate named entities in unseen texts. This work consists mainly of two parts: the annotation phase and the neural network training phase. For the annotation phase, gazetteers of Icelandic named entities were collected and used to extract and classify as many entities as possible. Regular ex- pressions and other heuristics were also applied in this pre-processing step. These pre- classified results were then manually reviewed. The corpus, MIM-GOLD, is a tagged and balanced Icelandic corpus sampled from thirteen different text types, containing a variety of named entities. The entity types that have been annotated are: person, location, organization, miscellaneous, date, time, money, and percent. In the neural model training phase, a bidirectional LSTM recurrent neural network was trained on the annotated corpus, using word embeddings trained from a larger text source as external input. We trained on different sizes of the corpus, to gain an understanding of how increasing corpus sizes affects the results. We report an F1 score of 83.65% for all entity types when trained on the whole corpus. Experiments with different corpus sizes show a clear advantage in using the whole dataset, but viable results can also be obtained from smaller training sets. The different corpus text genres also allow for selecting the domains that best fit the purpose of the application each time. The corpus and models will be made publicly available, and we hope they will help in moving the rapidly developing Icelandic language technology field even further. iii Nafnakennsl fyrir íslensku – málheild og tauganetslíkön Svanhvít Lilja Ingólfsdóttir júní 2020 Útdráttur Nafnakennsl („named entity recognition“), er svið innan máltækni sem felst í því að finna og flokka sérnöfn, þ.e. nöfn á fólki, stöðum, fyrirtækjum o.fl. með sjálfvirkum hætti. Stundum eru enn fremur flokkaðar ýmsar tölulegar einingar, svo sem dagsetn- ingar og upphæðir. Nafnakennsl eru eitt af grunnverkfærum máltækni og mikilvægt skref fyrir ýmis viðfangsefni hennar, svo sem spurningasvörun, vélþýðingar og talgrein- ingu. Þetta er þó ekki einfalt verkefni, sér í lagi þegar um er að ræða beygingamál eins og íslensku þar sem hvert sérnafn getur haft margar birtingarmyndir. Hér er kynnt mörkun á öllum sérnöfnum og ýmsum tölulegum einingum í milljón orða málheild, Gullstaðlinum. Þetta er fyrsta íslenska nafnakennslamálheildin, og inniheldur yfir 48.000 nafnaeiningar. Þessi nýju gögn hafa enn fremur verið notuð til þjálfunar á tauganetslíkönum til þess að finna og flokka nafnaeiningarnar í áður óséðum texta. Verkefnið er tvíþætt: annars vegar snýst það um mörkun málheildarinnar, hins vegar um þjálfun tauganetslíkananna. Við mörkunina var notast við reglulegar segðir og ýmsa lista með íslenskum sérnöfnum til að flokka sem flest nöfn í textanum sjálfvirkt, áður en öll milljón orðin voru lesin yfir til að tryggja rétta mörkun. Gullstaðallinn, sem notaður var sem grunnur að þessari nafnakennslamálheild, er mörkuð og jafnvæg („balanced“) málheild sem samanstendur af þrettán textaflokkum þar sem fjölbreytt sérnöfn koma fyrir. Þær nafnaeiningar sem markaðar voru í málheildinni eru eftirfar- andi: person, location, organization, miscellaneous, date, time, money og percent. Í þjálfunarfasanum voru tauganet af gerðinni „bidirectional LSTM RNN“ þjálfuð á nafnakennslamálheildinni. Að auki var notast við orðavigra forþjálfaða á mun stærri málheild sem viðbótarinntak. Málheildinni var skipt upp í mismunandi þjálfunarstærð- ir, til þess að komast að því hvernig niðurstöður þróast með meiri gögnum. Niðurstöð- urnar úr þjálfun með stærsta þjálfunarsettinu á öllum flokkum gefa 83,65% F1. Tilraunir með þjálfun á mismunandi stærðum sýna að meiri gögn skila betri árangri, en að einnig má þjálfa með minna magni af gögnum, eða jafnvel ákveðnum textaflokkum og fá frambærilegar niðurstöður. Málheildin og líkönin verða gerð opinber og munu vonandi koma að gagni í einhverjum þeirra fjölmörgu verkefna sem nú eru í vinnslu á sviði íslenskrar máltækni. iv Acknowledgements Knowledge is the small part of ignorance that we arrange and classify. Ambrose Bierce This work was funded by “The Strategic Research and Development Programme for Language Technology” 2019, grant no. 180027-5301. I want to warmly thank my advisor, Hrafn Loftsson, for his excellent guidance and support thoughout this work, ever since its beginning as an undergraduate research project, and for encouraging me to take this path. I also want to thank my fellow student and co-worker in this project, Ásmundur Alma Guðjónsson, for a fun and fruitful cooperation during this last year, and Sigurjón Þorsteinsson, who worked with me on the undergraduate project in 2018. Páll Hilmarsson and Steinþór Steingrímsson, who provided me with valuable data, Jan Carbonell, who has been a constant source of encouragement, and Salome Lilja Sigurðardóttir, who worked on the annotation evaluation, also deserve my sincere thanks. I also want to thank my family and friends for their support though the years, and to Óli Sindri and little Ólafsson, who is due in September, I send my deepest love and gratitude. v Preface This dissertation is original work by the author, Svanhvít Lilja Ingólfsdóttir. Elements of the work at an earlier stage have been published in [1], co-authored by Svanhvít. This early work was conducted in equal parts by the author and Sigurjón Þorsteinsson, as a final research project submitted in partial fulfillment of the requirements for the degree of BSc in Computer Science at Reykjavik University, in February 2019. This work was conducted alongside the work of Ásmundur Alma Guðjónsson, a fellow MSc student of Language Technology at Reykjavik University, who made use of the corpus and models presented here in his own dissertation. Ásmundur also took part in the annotation process, annotating around 8% of the corpus, and helped with some of the preprocessing for this work. Salome Lilja Sigurðardóttir worked on the annotation evaluation part of this work. Hrafn Loftsson, associate professor at Reykjavik University, supervised the project at all stages. vi Contents Acknowledgements v Preface vi Contents vii List of Abbreviations ix 1 Introduction 1 2 Background 4 2.1 Origins of NER . 4 2.2 Definitions of NEs . 4 2.2.1 The philosophical approach . 5 2.2.2 The linguistic approach . 6 2.3 NER domains and entity types . 7 2.3.1 General domain NE types . 7 2.3.2 Biomedical domain . 8 2.3.3 Anonymization . 8 2.3.4 Proper names in Icelandic . 9 2.4 NER corpus annotation . 10 2.4.1 Manual annotation methods . 10 2.4.2 Automated annotation methods . 11 2.4.2.1 NER tagging format . 12 2.4.3 Text sources for NER datasets . 12 2.5 Evaluation metrics . 12 2.5.1 NER model evaluation . 12 2.6 Development of NER methods . 14 2.6.1 Rule-based methods . 14 2.6.2 Traditional machine learning methods . 15 2.6.3 Deep learning methods . 16 2.6.3.1 From a single perceptron to RNNs . 16 2.6.3.2 Word embeddings . 18 2.6.3.3 Deep learning and NER . 21 2.7 NER for Icelandic and other languages . 22 2.7.1 Nordic languages . 23 2.7.2 Icelandic . 24 3 Methods 25 3.1 Corpus annotation . 25 vii 3.1.1 Corpus selection and text categories . 25 3.1.2 Annotation process . 27 3.1.2.1 Selecting the entity types . 27 3.1.2.2 Preprocessing . 27 3.1.2.3 Gazetteers . 28 3.1.2.4 Regular expressions . 29 3.1.2.5 Manual review . 29 3.1.2.6 Annotation guidelines . 31 3.2 NER baseline model training . 32 3.2.1 NeuroNER – biLSTM RNN with external word embeddings . 32 4 Results and evaluation 36 4.1 Corpus description and evaluation . 36 4.1.1 Corpus statistics . 36 4.1.2 Annotation evaluation results . 38 4.2 Experimental results . 38 5 Discussion 43 5.1 Corpus . 43 5.2 Neural models . 45 6 Conclusion 51 Bibliography 53 A Annotation guidelines 66 A.1 General guidelines . 66 A.1.1 Overlapping entities . 66 A.1.2 Unorthodox spelling and punctuation . 67 A.1.3 Abbreviations . 67 A.2 Named entity types . 67 A.2.1 Person . 67 A.2.2 Location . 68 A.2.3 Organization . 68 A.2.4 Miscellaneous . 69 A.2.5 Time . 69 A.2.6 Date . 70 A.2.7 Money . 71 A.2.8 Percent