From Aari to Zulu: Massively Multilingual Creation of Language Tools using Interlinear Glossed Text

Ryan Georgi

A dissertation submitted in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

University of Washington

2016

Reading Committee:

Fei Xia, Chair
William D. Lewis
Emily M. Bender

Program Authorized to Offer Degree: UW Department of Linguistics

© Copyright 2016 Ryan Georgi

University of Washington

Abstract

From Aari to Zulu: Massively Multilingual Creation of Language Tools using Interlinear Glossed Text

Ryan Georgi

Chair of the Supervisory Committee:
Professor Fei Xia
Linguistics

This dissertation examines the suitability of Interlinear Glossed Text (IGT) as a computational, semi-structured resource for creating NLP tools for resource-poor languages, with a focus on the tasks of word alignment, part-of-speech (POS) tagging, and dependency parsing. The creation of a massively multilingual database of IGT instances, the Online Database of INterlinear text (Odin), made it possible to build tools that harness this particular data format on a large scale. Xia and Lewis (2007) and Lewis and Xia (2008) demonstrated the potential of using IGT instances from Odin to answer some typological questions, such as basic word order, for a large number of languages by utilizing the language–gloss–translation line structure of IGT instances to bootstrap word alignment and, consequently, syntactic projection. This dissertation seeks to perform a thorough investigation of the potential for creating these NLP tools for endangered or otherwise resource-poor languages with nothing more than the IGT instances found in Odin. After introducing the IGT data type and the particulars of the resources that will be used (Sections 3.1 to 4.4), this thesis presents each task in detail. Word alignment will be discussed in Chapter 5, POS tagging in Chapter 6, and dependency parsing in Chapter 7. In Chapter 8, the INterlinear Text ENrichment Toolkit (Intent), the software created to enrich the IGT data and extract NLP tools from it, will be introduced, along with a brief summary of where to find and use the software. Lastly, Chapter 9 will discuss the aims of the experiments and the overall viability of IGT for the tasks attempted.

ACKNOWLEDGMENTS

First and foremost, I would like to thank my advisor, Fei Xia, for all of her guidance, expertise, and patience with me throughout my graduate career. Her attention to detail instilled a methodological rigor and skepticism in me and my work that contributed to my overall maturity as a scientist. I would also like to thank my co-advisor, Will Lewis, for his input and enthusiasm for my chosen topic of study, which helped me maintain confidence that my work was meaningful, even when the results were disappointing. My final committee member, Emily Bender, was not only instrumental in helping me greatly strengthen my dissertation, and providing guidance through the last several years of the program, but is also largely responsible for there being a program for me to have started in the first place. Thanks also to my Graduate School Representative, Gail Stygall, for giving her time to participate in my defense and celebrate at my commencement.

Thanks also to the Linguistics Department Administrator, Mike Furr, and our Academic Counselor, Joyce Parvi, who helped me navigate the complicated world of funding, travel, and academic bureaucracy. And to our Systems Administrator, David Brodbeck, for running a tight ship, despite the many attempts of us graduate students to run the computing cluster into the ground.

I would also like to extend my thanks to the National Science Foundation, which provided a significant portion of the funding for my research through Grant No. BCS-0748919, and which recognized the need for the type of exploratory work on resource-poor languages that this research consisted of.

I would also like to thank my family, who supported me in uncountable ways throughout my graduate career, and enabled me to tackle the greatest challenges by knowing that they

had my back should I fall, and without whom this dissertation surely would not exist. Last but not least, my friends and partners, who supported me when I was down, and made sure I was never without joy and hope when things were at their most difficult.

DEDICATION

To Deanna and Shawn

Who not only gave of themselves to support me as the person I was, but supported my growth throughout the years into the person I was to become.

TABLE OF CONTENTS

Page

List of Figures ...... vi
List of Tables ...... viii
List of Algorithms ...... x
List of IGT Instances ...... xi
List of Code Samples ...... xii

Chapter 1: Introduction ...... 1
    1.1 Motivation ...... 1
    1.2 Research Question ...... 4
    1.3 Outline ...... 4

Chapter 2: Literature Review ...... 6
    2.1 Unsupervised and Semi-supervised Induction Methods ...... 6
        2.1.1 Fully Unsupervised Methods ...... 6
        2.1.2 Semi-Supervised Methods ...... 9
    2.2 Adapting Resources by Leveraging Typological Similarities ...... 12
    2.3 Adapting Resources by Other Methods ...... 16
    2.4 Creating Resources for Resource-Poor Languages ...... 18
    2.5 Summary ...... 21

Chapter 3: Methodology Overview ...... 23
    3.1 Interlinear Glossed Text ...... 23
        3.1.1 Using IGT for Alignment ...... 24
        3.1.2 Using IGT for Projection ...... 25

        3.1.3 The Gloss Line As a "Pseudo-Language" ...... 26
    3.2 System Overview ...... 29
        3.2.1 Word Alignment ...... 30
        3.2.2 POS Tagging ...... 31
        3.2.3 Dependency Parsing ...... 32
    3.3 Summary ...... 34

Chapter 4: The Data ...... 35
    4.1 Corpora Overview ...... 36
        4.1.1 Odin v2.1 ...... 37
        4.1.2 XL-IGT ...... 37
        4.1.3 RG-IGT ...... 39
        4.1.4 UD-2.0 ...... 39
        4.1.5 HUTP ...... 40
        4.1.6 CTN ...... 41
    4.2 Tokenization and in IGT ...... 42
    4.3 POS Tagsets ...... 43
    4.4 IGT and Data Cleaning ...... 44
        4.4.1 Cleaning: Fixing PDF-to-Text Corruption ...... 46
        4.4.2 Normalization: Removing Non-Linguistic Data ...... 48
        4.4.3 Intent Filtering ...... 49
        4.4.4 Future Work in Cleaning ...... 50
        4.4.5 Consistency Issues ...... 51
    4.5 Data Formats ...... 51

Chapter 5: Word Alignment ...... 53
    5.1 Evaluating Word Alignment within IGT Instances ...... 53
    5.2 Heuristic Alignment ...... 54
    5.3 Statistical Alignment ...... 61
        5.3.1 Baseline Statistical Alignment Approach ...... 62
        5.3.2 Gloss/Translation-Based Approach ...... 63
        5.3.3 Other Settings ...... 64

    5.4 Combining Statistical and Heuristic Alignment ...... 66
    5.5 Future Work ...... 67
        5.5.1 "Clue-based" Alignment ...... 67
        5.5.2 Constrained Giza++ Search ...... 68
        5.5.3 Bootstrapping Alignment for Other Bitexts ...... 68
    5.6 Summary ...... 69

Chapter 6: Part-of-Speech Tagging ...... 70
    6.1 Task Overview ...... 70
    6.2 Projection-Based Tagging ...... 74
        6.2.1 Projection-Based Tagging Algorithm ...... 76
        6.2.2 Projection-Based Tagging Experiments ...... 76
        6.2.3 Problems with Projection-Based Tagging ...... 78
    6.3 Classification-Based Tagging ...... 79
        6.3.1 IGT Gloss-Line Annotation ...... 80
        6.3.2 Training ...... 81
        6.3.3 Features ...... 83
        6.3.4 Running the Classifier on New Instances ...... 86
    6.4 Combining Projection and Classification ...... 86
        6.4.1 Results on IGT Data ...... 88
    6.5 A Case Study in Projection Methods: Part-of-Speech Tagging Chintang ...... 89
        6.5.1 Projection-Based Tagging ...... 90
        6.5.2 Subword Mapping ...... 91
        6.5.3 Classification-Based Tagging ...... 92
        6.5.4 Varying the Amount of Data ...... 94
        6.5.5 Analysis ...... 95
    6.6 Extending to Monolingual Corpora ...... 96
        6.6.1 Training Monolingual POS Taggers ...... 96
        6.6.2 Evaluating The Monolingual Taggers ...... 96
        6.6.3 Results ...... 98
    6.7 Future Work ...... 101

        6.7.1 Incorporating Classification Probabilities ...... 101
        6.7.2 POS Induction ...... 102

Chapter 7: Dependency Structures ...... 103
    7.1 Introduction ...... 104
    7.2 Projecting Dependency Structures ...... 105
    7.3 Combining Dependency Parsers with Projection ...... 111
        7.3.1 MSTParser ...... 111
        7.3.2 Adding Features to the Parser ...... 113
        7.3.3 Results ...... 118
        7.3.4 Summary ...... 118
    7.4 Analyzing and Correcting for Divergence Patterns ...... 120
        7.4.1 Calculating Divergence Across Dependency Structures ...... 121
        7.4.2 Defining Tree Operations to Resolve Divergence ...... 123
        7.4.3 Match Results ...... 132
        7.4.4 Learning Correction Patterns ...... 135
        7.4.5 Correction Rule Results ...... 141
    7.5 Training Dependency Parsers for Monolingual Data ...... 144
        7.5.1 Training the Monolingual Parser ...... 144
        7.5.2 Monolingual Parsing Settings ...... 146
        7.5.3 Monolingual Parsing Results ...... 147
        7.5.4 Analysis ...... 151
    7.6 Summary ...... 153
    7.7 Further Work ...... 154
        7.7.1 Additional Data Cleaning ...... 154
        7.7.2 Use Modified Parser Model ...... 154
        7.7.3 Similarity-Based Approaches ...... 155

Chapter 8: Intent: The INterlinear Text ENrichment Toolkit ...... 156
    8.1 enrich ...... 157
    8.2 extract ...... 160
    8.3 eval ...... 161

    8.4 project ...... 162
    8.5 Corpus Management Subcommands ...... 163
    8.6 Reproducing The Experiments (repro) ...... 163
    8.7 Interfaces to Intent ...... 164
    8.8 Intent For Cleaning and Annotation ...... 165
    8.9 Summary ...... 165

Chapter 9: Conclusion and Future Work ...... 171
    9.1 Summary of Results ...... 171
    9.2 Contributions of This Work and Potential Uses ...... 177
    9.3 Future Work ...... 179
    9.4 Conclusion ...... 181

Appendix A: Terms and Variables ...... 183

Appendix B: Pseudocode ...... 185
    B.1 Projection ...... 185
    B.2 Dependency Tree Manipulations ...... 187

Appendix C: POS Tagsets ...... 188
    C.1 Chintang ...... 188
    C.2 ...... 189

Appendix D: Definitions of Terms and Index ...... 190

LIST OF FIGURES

Figure Number Page

1.1 Graph Showing Distribution of Language Resources...... 2

2.1 An Aligned Welsh–English Sentence Pair ...... 17
2.2 Sentence from Fig. 2.1, with English Sentence POS Tagged ...... 18
2.3 Sentence from Fig. 2.1 Demonstrating POS Projection ...... 18

3.1 System Overview ...... 29
3.2 Overview of Word Alignment Module ...... 30
3.3 Overview of POS Tagging Module ...... 31
3.4 Overview of Dependency Parsing Module ...... 33

5.1 Graph of Improvements in Heuristic Alignment ...... 60
5.2 P/R/F1 Comparison of Statistical Alignment Methods ...... 63
5.3 Illustration of Symmetrization Heuristics ...... 64
5.4 P/R/F1 Comparison of Symmetrization Heuristics ...... 65
5.5 P/R/F1 Comparison of fastalign vs. mgiza Algorithms ...... 66
5.6 P/R/F1 Comparison of All Alignment Methods ...... 67
6.1 POS Tagging Module, Projection Approach Highlighted ...... 74
6.2 Flowchart Detailing POS Tagging via Projection ...... 75
6.3 POS Tagging Module, Classification Approach Highlighted ...... 80
6.4 Flowchart Detailing Classification-Based POS Tagging: Training Phase ...... 81
6.5 Flowchart Detailing Classification-Based POS Tagging: Test Phase ...... 87
6.6 Flowchart Detailing Classification-Based POS Tagging: Projected Labels ...... 88
6.7 Graph Comparing POS Tagging Approaches on IGT Data ...... 89
6.8 Graph of Tagging Accuracy vs. Amount of Training Data for Chintang ...... 95
6.9 Flowchart Detailing Monolingual POS Tagging ...... 97
6.10 Graph of Monolingual Tagging Accuracies on UD-2.0 Test Data ...... 100

7.1 Example Dependency Structure ...... 104
7.2 Depiction of DS Projection Process ...... 106

7.3 Flowchart Detailing DS Projection Module ...... 109
7.4 Graph of Unlabeled Attachment Scores for Projection Methods ...... 110
7.5 Flowchart Detailing Basic MSTParser ...... 112
7.6 AlignType Feature Definitions ...... 115
7.7 Flowchart for Augmented DS Parser ...... 117
7.8 Illustrations of match, merge, and swap Alignments ...... 121
7.9 Illustrations of remove, merge, swap Operations ...... 126
7.10 Example of Promotional Divergence ...... 129
7.11 Example of Structural Divergence ...... 130
7.12 Example of Conflational Divergence ...... 131
7.13 Example of Unresolved Divergence ...... 135
7.14 Example of merge Alignment ...... 136
7.15 Example of Rules Derived from a merge Alignment ...... 137
7.16 Example of Projecting DS from English to Hindi ...... 138
7.17 Example swap Configuration and Collected Statistics ...... 139
7.18 Flowchart Detailing Rewrite-Rule-Enhanced DS Projection System ...... 140
7.19 Flowchart Detailing Monolingual DS System ...... 145

8.1 Overview of Intent Modules ...... 156
8.2 Flowchart for enrich Command ...... 157
8.3 Flowchart for extract Command ...... 160
8.4 Flowchart for eval Command ...... 167
8.5 Flowchart for project Command ...... 167
8.6 INTENT Web Interface ...... 168
8.7 XIGT Editor Overview ...... 169
8.8 XIGT Editor Showing Automatic Enrichment ...... 170

LIST OF TABLES

Table Number Page

2.1 Example Word Clusters ...... 7
2.2 Results of Transfer Parsing ...... 14

4.1 Overview of Corpora and Provided Resources ...... 35
4.2 Overview of Corpora by Language ...... 36
4.3 Overview of Odin Data ...... 37
4.4 Overview of XL-IGT Data ...... 38
4.5 Overview of RG-IGT Data ...... 39
4.6 Overview of UD-2.0 Data ...... 40
4.7 Overview of HUTP Data ...... 41
4.8 Overview of CTN Data ...... 41
4.9 Universal POS Tags ...... 43
4.10 Results of Intent Filtering on Odin ...... 49

5.1 Gram-to-Word Lookup Table ...... 59
5.2 P/R/F1 Results for Heuristic Alignment by Language ...... 62
6.1 Tokens/Types Comparison for Different German Corpora ...... 73
6.2 POS Tagging Accuracies for Projection Algorithm ...... 78
6.3 Tokens/Types Breakdown for RG-IGT Corpus ...... 81
6.4 Classifier Feature List ...... 84
6.5 Classifier Feature Ablation Test ...... 85
6.6 POS Tagging Accuracy for Projection Algorithm on Chintang ...... 90
6.7 Top Twelve Chintang Gloss Tokens Labeled gm ...... 91
6.8 POS Tagging Accuracies for Classification Method on Chintang ...... 93
6.9 Monolingual POS Tagging Accuracies ...... 99

7.1 UASs for Projected DSs ...... 108
7.2 UASs for Projection-Augmented DS Parser ...... 119
7.3 Detail of match Percentages for English ↔ Hindi Language Pair ...... 132

7.4 Summary of match Percentages for Remaining Language Pairs ...... 133
7.5 Detail of merge and swap Statistics ...... 134
7.6 UASs for Rewrite-Rule-Enhanced DS Projection System ...... 141
7.7 Comparison of Rewrite Rule Settings ...... 143
7.8 Breakdown of IGT Data Filtering ...... 146
7.9 UASs for Dependency Parsing on XL-IGT Sentences ...... 148
7.10 UASs for Dependency Parsing on UD-2.0 Sentences <10 ...... 150
7.11 UASs for Dependency Parsing on All UD-2.0 Sentences ...... 152

8.1 Intent Required Packages ...... 164
9.1 Summary of Word Alignment Results ...... 172
9.2 Summary of IGT POS Tagging Accuracies ...... 173
9.3 Summary of Monolingual POS Tagging Accuracies ...... 174
9.4 Summary of IGT Dependency Structure Projection Results ...... 175
9.5 Summary of IGT Dependency Parsing Results ...... 175
9.6 Summary of Monolingual Dependency Parsing Results ...... 177

C.1 Chintang POS Tags ...... 188
C.2 Hindi-Urdu POS Tag Mapping ...... 189

LIST OF ALGORITHMS

6.2.1 Projection-Based Tagging ...... 77
6.3.1 Classification-Based Tagging: Training ...... 82
6.3.2 Classification-Based Tagging: Feature Extraction ...... 82
6.3.3 Classification-Based Tagging: Testing ...... 87
7.4.1 Remove Operation ...... 124
7.4.2 Merge Operation ...... 125
7.4.3 Swap Operation ...... 127
7.4.4 Alignment-based Tree Altering Algorithm ...... 128
B.1.1 DS Projection Algorithm ...... 185
B.1.2 Remove Operation ...... 186
B.2.1 Match Scoring Algorithm ...... 187

LIST OF IGT INSTANCES

Figure Number Page

1.1 Japanese (jpn) ...... 3

3.1 German (deu) ...... 24
3.2 Hungarian (hun) ...... 25
3.3 German (deu): with Word Alignments ...... 25
3.4 German (deu): with POS Tag Projections ...... 26
3.5 Oriya (ori) ...... 27
3.6 Turkish (tur) ...... 27
3.7 Yukaghir (ykg) ...... 27
3.8 |Gwi (gwj) ...... 27

4.1 Carapana (cbc): Non-Linguistic Information ...... 45
4.2 Arabic (ara): Raw Instance ...... 45
4.3 Frisian (fry): Corrupt Instance ...... 47
4.4 Frisian (fry): Original Version of Corrupt Instance ...... 47
4.5 Pashto (pst): Example of Faulty Cleaning ...... 50

5.1 Yaqui (yaq): Example of Morphology in IGT Instances ...... 55
5.2 Korean (kor), Malagasy (mlg): Multiple Token Occurrences ...... 56
5.3 Malagasy (mlg): Stemming on Gloss Line ...... 58
5.4 Welsh (cym), Yaqui (yaq): "Grams" on the Gloss Line ...... 58

6.1 Arabic (ara) ...... 71
6.2 Arabic (ara): POS Projection ...... 71
6.3 Italian (ita): Inexact Matches ...... 72
6.4 Arabic (ara): Gloss-Level Tags ...... 73
6.5 Hindi (hin) ...... 78
6.6 Chintang (ctn): Lack of Language–Gloss Alignment ...... 90

LIST OF CODE SAMPLES

Figure Number Page

4.1 Xigt-XML Example...... 52

8.1 Usage Dialog for Intent ...... 158
8.2 Example enrich Command ...... 159
8.3 Example extract Command ...... 161
8.4 Example eval Command ...... 162
8.5 Example project Command ...... 162


Chapter 1 INTRODUCTION

1.1 Motivation

With the remarkable achievements accomplished by data-driven Natural Language Processing (NLP), much recent work in the field has been done on expanding available sources of data. The approaches have varied from adapting comparable corpora (Hana et al., 2004; Feldman et al., 2006) to tasks typically requiring parallel corpora (Yarowsky and Ngai, 2001; Hwa et al., 2004), to using partial or noisy automatic annotation to train models (Spreyer and Kuhn, 2009), to crowdsourcing using Amazon's Mechanical Turk (Callison-Burch and Dredze, 2010). Expanded data can be used to improve performance and domain adaptation on existing tasks, or to implement NLP tasks in entirely new languages. Curated resources have become available for more languages in recent years, enabling data-driven methods for a variety of new tasks in these languages. Between the LRE Language Map [1] (Calzolari et al., 2012) and the catalog of the Linguistic Data Consortium (LDC) [2], there are resources available for approximately [3] 205 languages. Despite this increase, this is still far short of the 7,000+ living languages on Earth (Lewis, 2009). Furthermore, as Fig. 1.1 shows, a large plurality of these 205 languages have only a single resource available, with the top twelve languages accounting for 73% of the total available resources, but only 29% of the world's population.

[1] Language Resources and Evaluation, http://www.resourcebook.eu/
[2] https://www.ldc.upenn.edu/
[3] I use "approximately" because, in putting together a list of the language resources available in each catalog, the language identifiers are not normalized and may say "Multiple Languages" or contain misspellings, so this number is an estimate.

[Figure 1.1 appears here: "Distribution of Language Resources in LDC & LRE Map"; x-axis: Languages With This Many Resources; y-axis: # Of Resources (log scale, 1 to 1000), with English as the labeled point with the most resources.]

Figure 1.1: Graph showing how many languages have n resources available for them among the LDC and LRE collections, out of 205 total languages.

As a result, solutions are needed for a researcher to turn computational methods toward a language that is not among those 205 with curated resources. If the language in question has raw, unannotated language data available in large quantities, that may make unsupervised machine learning methods possible. If the researcher has access to enough resources to spend the time or money funding a language expert to create a small amount of annotation, they may attempt to constrain an unsupervised induction method with this expert knowledge. As discussed in Bender (2011), there is often a scarcity of funding, resources, or interest in the form of shared tasks for languages not among the 20 most studied. This makes it very difficult for researchers to find the time or money to fund an extensive annotation task aimed at creating resources for data-driven tools. While unsupervised and semi-supervised methods can be successful, they often require large amounts of unannotated data, which is difficult to find for infrequently studied languages.

It is this problem that this thesis seeks to address, through the use of a data source in common usage, but with few computational applications. That data source is Interlinear Glossed Text, or IGT, an example of which is shown in Instance 1.1. IGT instances have been in widespread use in the field of linguistics for years, and as such are found in a wide selection of publications covering phenomena for thousands of languages. In Xia and Lewis (2007), the authors demonstrated that a unique benefit of this data type was the ability to leverage the gloss to align the source language with resource-rich English. Using available tools for English, the English could be processed and this resulting annotation "projected" to the resource-poor language. The work of Lewis and Xia (2010) in developing Odin, the Online Database of INterlinear text, has made IGT instances readily available for over a thousand languages. Both Xia and Lewis (2007) and Bender et al. (2013) demonstrate the potential for IGT to be used in answering broad typological questions, such as basic word order.

IGT as a resource has limitations, but its availability for a wide selection of languages, and its word-level (and sometimes morpheme-level) information, make it an enticing resource. While this previous work has shown the potential for IGT as a resource in typology, I seek in this work to create an end-to-end system that can produce bootstrapped tools for any of the languages in the Odin database.

This work outlines the research I have performed on the utility of IGT as a resource in real-world scenarios, focusing on base-level component tasks that can be harnessed individually or in combination to develop more complex NLP systems for resource-poor languages.

(21) Mary- hon- watas-are-ta.
     Mary-nom book-acc pass-pass-pst
     'Mary was passed the book.'

Instance 1.1: An example of Japanese (jpn) IGT found in Baker (1996).

1.2 Research Question

While building high-performance NLP tools for resource-poor languages remains a difficult challenge, IGT is an intriguing resource with which to create a foothold for such languages. Ultimately, this work seeks to answer the following question:

Can the existing language knowledge contained in Interlinear Glossed Text be harnessed to perform basic NLP tasks on resource-poor languages in a repeatable and broadly applicable manner?

These tasks would include part-of-speech (POS) tagging, dependency parsing, and performing word alignment on IGT instances for the purposes of projection and building translation dictionaries. I will look at the ways these various tasks may be attempted given a variety of data sources that may be available for a given language pair and how access to IGT may improve or make possible the task at hand.

1.3 Outline

In the next chapter (Chapter 2), I will discuss the relevant work that has been done previously on word alignment, part-of-speech (POS) tagging, and dependency parsing as related to the particular domain of resource-poor languages and how they relate to or influence the work of this thesis. Section 3.1 will provide a more detailed description of the IGT data type, and the particular challenges in dealing with it. Section 3.2 will give a more detailed discussion of what the IGT data format that will be used throughout this thesis looks like, and how it will be used. Additionally, I will give an overview of the various components that make up the overall IGT enrichment system of this thesis, and how the different components connect into other parts of the system.

I will next introduce the various data sources that will be used for the experiments that I have carried out during the process of this work in Chapter 4, followed by a discussion of dealing with POS tags, one of the main difficulties in working across a wide variety of languages, and the particular decisions that were made for the languages and corpora presented here. Also pertinent to the specific issues that IGT presents is the cleaning of the Odin data, which is discussed in Section 4.4. Next, I will detail the experiments that were performed for each of the main tasks mentioned above: word alignment in Chapter 5, POS tagging in Chapter 6, and dependency parsing in Chapter 7. After the description of the experiments performed, I will introduce the INterlinear Text ENrichment Toolkit (Intent), the software package that I have built to perform the IGT enrichment tasks presented here, which is being used at the time of writing to provide enriched data for the Odin database, starting with the release of version 2.1, and will be freely available for users desiring to perform their own IGT-related research. Chapter 8 will describe the package's main features and usage, as well as where it may be obtained and how to find documentation. Finally, I will provide a summary of the work that has been done and the work that remains for this topic in Chapter 9.

Chapter 2 LITERATURE REVIEW

The main motivation behind this work was to find a method by which to create NLP tools for resource-poor languages that have little to no data available. Given this aim, I will discuss several broad categories in the literature that have in some way contributed toward this goal. In Section 2.1, I will look at how languages with little or no annotated data but large amounts of unannotated data have been approached in the past. In Sections 2.2 and 2.3, I will discuss ways in which data available for one or more languages can be leveraged for use on another language, either by virtue of being typologically similar in some way, or by other methods. Finally, in Section 2.4, I will discuss attempts to fill in the gaps in resource coverage for resource-poor languages.

2.1 Unsupervised and Semi-supervised Induction Methods

One of the first paths to consider in approaching the problem of resource-poor languages is the class of methods for which little or no annotated data is required. This class of methods relies on analyzing unannotated data for the target language and inducing patterns from the text.

2.1.1 Fully Unsupervised Methods

POS Tag Induction The principal hypothesis in unsupervised POS tag induction is that words in a language can be grouped into natural classes by virtue of the contexts within which they co-occur with other words. In English, for instance, words that follow determiners are typically either nouns or some sort of noun modifier. One of the earliest studies to propose such a method was Brown et al. (1992), who grouped words into n-gram based clusters based upon this principle.

Cluster 1: immediate urgent ongoing absolute extraordinary exceptional ideological unprecedented appalling overwhelming alleged automatic [ . . . ]
Cluster 2: worried concerned skeptical unhappy uneasy reticent unsure perplexed excited apprehensive legion unconcerned [ . . . ]
Cluster 3: cover include involve exclude confuse encompass designate preclude transcend duplicate defy precede [ . . . ]
Cluster 4: encourage promote protect defend safeguard restore assist preserve coordinate convince destroy integrate [ . . . ]
Cluster 5: china russia iran israel turkey ukraine japan pakistan georgia serbia europol [ . . . ]
Cluster 6: waste water drugs land fish material meat profit alcohol forest blood chemicals [ . . . ]

Table 2.1: Examples of clusters discussed in Rishøj and Søgaard (2011), extracted using the Brown clustering algorithm (Brown et al., 1992) over the English section of the Europarl corpus.

In the study, the authors use agglomerative hierarchical clustering, with a subsequent maximization step to move words between clusters if it improves the expected probability of the observed corpus. Clark (2003) extended this concept of clustering by incorporating morphological information, a feature useful for languages morphologically richer than English. Table 2.1 shows six example clusters of the 1,000 constructed in Rishøj and Søgaard (2011) using the Brown algorithm on the Europarl corpus. As the authors point out, these clusters seem to actually be subsets of traditional part-of-speech tags, with clusters 1 & 2 being adjectives, 3 & 4 being transitive verbs, and 5 & 6 being nouns. Furthermore, the nouns in 5 are countries or governmental actors, while the nouns in 6 are tangible objects. This would seem to suggest that while the clusters might not have one-to-one mappings between cluster and POS tag, the hypothesis that word clusters have a relationship to part-of-speech tags appears supported.
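For reference, the class-based bigram model underlying Brown clustering can be written as follows (a standard formulation; the notation here is mine rather than the thesis's):

\[
P(w_i \mid w_{i-1}) \;\approx\; P\big(c(w_i) \mid c(w_{i-1})\big)\, P\big(w_i \mid c(w_i)\big)
\]

where $c(\cdot)$ is a hard assignment of words to clusters; the agglomerative procedure greedily merges clusters so as to maximize the likelihood of the corpus under this model, or equivalently the average mutual information between adjacent cluster labels.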

Dependency and Phrase Structure Induction In addition to POS tagging, work has also been done on fully unsupervised grammar induction for both constituent structures (Clark, 2001; Klein and Manning, 2002) and dependency structures (Klein and Manning, 2004; Smith and Eisner, 2005b; Daumé III, 2009). While inducing word classes that resemble POS tags is a task with a large search space, introducing latent structure adds a new range of complexity. These unsupervised grammar induction methods tend to use variations on the inside-outside algorithm (Baker, 1979), a type of Expectation Maximization (EM) algorithm (Dempster et al., 1977) for use with tree structures. The inside-outside algorithm initializes a Probabilistic Context-Free Grammar (PCFG) that can be used to parse the corpus and produce an expectation that represents the current grammar's view of how probable the current corpus is. The result of this step is a large space of partial counts that represent the probability of each portion of the grammar being observed in some version of the underlying structure. In the maximization step, the parameters of the model (for a PCFG, the probabilities of the rules) are recalculated using a Maximum Likelihood Estimation (MLE) of the partial counts.
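As a minimal sketch of that maximization step, assuming a PCFG with rules of the form $A \to \alpha$, the re-estimated probability of each rule is its expected (partial) count from the inside-outside pass, normalized over all rules sharing the same left-hand side:

\[
\hat{P}(A \to \alpha) \;=\; \frac{\mathbb{E}\big[\mathrm{count}(A \to \alpha)\big]}{\sum_{\alpha'} \mathbb{E}\big[\mathrm{count}(A \to \alpha')\big]}
\]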

The result is a hill-climbing algorithm that attempts to find the grammar that maximizes the likelihood of the observed data. While this is an extremely clever solution to learning something as complex as grammar rules from unlabeled corpora, the search space this algorithm needs to traverse is massive, and thus the algorithm usually gets stuck in a local optimum. Daumé III (2009) cites Unlabeled Attachment Score (UAS) results in the 33–45% range for unsupervised dependency parsing, compared to a supervised system at 79–82%.

While unsupervised methods are aimed at finding the best ways to search a complex search space, another path of research is to find ways to constrain the search; this notion of producing "semi-supervised" or "weakly supervised" approaches will be discussed in the following section.

2.1.2 Semi-Supervised Methods

Using Linguistic Constraints While the fully unsupervised methods in Section 2.1.1 show a degree of promise in their ability to work on any language, regardless of knowledge about that language, finding the best global optimum in a search space filled with latent variables is extremely difficult. The semi-supervised class of methods seeks to use some of the same techniques, but constrain the search space.

Semi-Supervised POS Tagging Methods There has been a great deal of work on semi-supervised POS tagging (Smith and Eisner, 2005a; Haghighi and Klein, 2006b; Toutanova and Johnson, 2007; Mann and McCallum, 2008; Graça et al., 2011). Two of these techniques are determining what a "prototypical" word for a POS tag is, so as to assign probability mass to that POS tag for contexts in which the prototype appears; and incorporating partial dictionaries and utilizing the intuition that though a word may be ambiguous, its distribution over possible tags is usually sparse (Toutanova and Johnson, 2007). All of these approaches share the same goal of reducing or otherwise constraining the search space for POS induction. In Haghighi and Klein (2006b), the authors use a list of words that are prototypical of a given POS tag as a way to constrain the forward-backward sequence labeling algorithm (Rabiner, 1989) to only consider model parameters for which the prototypes are given their specified POS tag. In the study, the unsupervised forward-backward system achieves 42.2% POS tagging accuracy using the Penn Treebank tagset and sentences of length 10 or shorter from the first 48K tokens of the Wall Street Journal (WSJ) section of the treebank (Marcus et al., 1993). Using only this constraint, the authors report that a list of prototypes automatically extracted from the WSJ corpus and designed to be unambiguous achieves a 61.9% tagging accuracy. In contrast to the fully unsupervised approach, where the forward-backward algorithm is initialized randomly and unconstrained, the prototype constraints serve to guide the model to a more global optimum.

In another experiment, Haghighi and Klein expand the set of prototypes using distributional similarity features, and rerun the experiment with a much larger group of prototypes. This expansion increases POS tag accuracy from 61.9% with the original prototypes to 79.1% with the distributional similarity features. While this methodology seems promising, it should be noted that the choice of prototypes is extremely important, as the authors note that "tuning [of the prototype list] could greatly reduce the error rate," and that the automatic extraction method produces best-case results. An alternate take on constraining the POS tagging search space is the approach taken by Toutanova and Johnson (2007). In this work, the authors use a dictionary that lists words and their POS tags; however, rather than condition upon the POS tags directly, the system models the ambiguity class, or the set of possible tags for a word. For instance, activity-type words like 'run,' 'walk,' or 'feed' that may have both a verbal meaning and a nominal meaning might form the ambiguity class of {Verb, Noun}. This allows the model to allocate probability mass for a particular ambiguous word without affecting the best overall sequence for the sentence. With a dictionary created from words and their ambiguity classes that appear 3 or more times in the corpus, this approach achieves 89.7% accuracy. Although the authors acknowledge that building a dictionary from an annotated source is "arguably unrealistic" (Toutanova et al., 2002, p. 6), modeling these ambiguity classes would appear to allow for somewhat more flexibility in constraining the tagging problem than the prototype solution.
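As a rough illustration of how such a dictionary of ambiguity classes might be assembled from a tagged corpus (the function name and threshold below are illustrative, not Toutanova and Johnson's code):

    from collections import Counter, defaultdict

    def build_ambiguity_classes(tagged_corpus, min_count=3):
        """Map each sufficiently frequent word to the set of tags it was
        observed with (its "ambiguity class")."""
        word_counts = Counter()
        word_tags = defaultdict(set)
        for word, tag in tagged_corpus:
            word_counts[word] += 1
            word_tags[word].add(tag)
        return {w: frozenset(tags) for w, tags in word_tags.items()
                if word_counts[w] >= min_count}

    # e.g. a corpus containing ("run", "VERB") and ("run", "NOUN") pairs
    # yields the ambiguity class {"VERB", "NOUN"} for "run".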

Semi-Supervised Grammar Induction Methods In addition to work done in POS tagging, work on semi-supervised grammar induction is fairly extensive as well (Haghighi and Klein, 2006a; Koo et al., 2008; Naseem et al., 2010; Mirroshandel et al., 2012). Following up on the prototype method for constraining an EM algorithm's search space, Haghighi and Klein (2006a) describe a similar prototype-based system for constituent parsing. In this work, prototypes are given as Context Free Grammar (CFG) rules such as NP → DT NN or VP → VBD DT NN. The prototypes are then used during the inside-outside algorithm such that when the right-hand-side production is encountered, partial counts are discarded for rules that do not match the prototype. Using this approach, the authors report a labeled F1-score of 0.57 on the WSJ-10 corpus [1]. The authors also perform an experiment intersecting the deep-structure PCFG induction algorithm of the inside-outside algorithm with the shallow, bracketing-inducing Constituent-Context Model (CCM) of Klein and Manning (2002). Using the constraints for the inside-outside algorithm given above, Haghighi and Klein use the matrix of possible bracketings and their likelihood scores and multiply those scores with the partial count matrix of the inside-outside algorithm, to reinforce the likelihood that the structure being induced in the PCFG is also a likely constituent as determined by the CCM algorithm. Combining these two models and using the constraints and distributional features of the prototypes to expand the prototype list ultimately produces a labeled F1-score of 0.65.

Haghighi and Klein (2006a) inspired the work from my master's thesis (Georgi, 2009), where I extracted prototype CFG rules from IGT instances to constrain an inside-outside algorithm in a similar manner. The results of this research on the German-language NEGRA corpus (Brants et al., 2003) showed a labeled F1-score of only 0.27 when the terminals and nonterminals were mapped to a common reduced tagset. Compared to the uninformed EM algorithm, which produces labels that are essentially random, this 0.27 is still better than the uninformed system's F1-score of 0.007. When the labels were remapped many-to-one to the label that greedily optimized the score, this brought the scores to 0.45 and 0.41, respectively. Despite this improvement, requiring such remapping is precisely what semi-supervised methods seek to avoid. Despite the promise of this prototype-driven method, it seems that the prototypes must be chosen to be extremely unambiguous, and the noise that projection with IGT introduces is likely too great for this approach to work well.

[1] The Wall Street Journal section of the Penn Treebank, consisting only of sentences of length ten words or fewer.

Semi-Supervised Word Alignment Unsupervised word alignment using parallel corpora has been common practice in machine translation for decades, whether using the IBM Models (Brown et al., 1993) or an HMM model for word alignment (Vogel et al., 1996). When looking to perform word alignment for resource-poor languages, which do not have substantial amounts of parallel text, some modifications to the typical approach are needed. One approach to adapting such statistical methods to resource-poor languages is proposed in Fraser and Marcu (2006), where a supervised set of word alignments is used as "gold" data. In the approach taken by Fraser and Marcu (2006, 2007), the IBM Model 4 parameters of translation probability, fertility, and distortion are cast as sub-models h_m with weights λ_m of a larger log-linear model. In a modified EM algorithm, the sub-model parameters θ_m are estimated and used to produce a vector of sub-model weights λ. To augment the M step, a discriminative re-ranker is used to choose the sub-model weights with the minimum error according to the training data, and the process is repeated until convergence. While I will discuss how IGT may be used to obtain high-precision alignments in Section 3.1.3 and Section 5.2, the difficulty in using this approach is that it requires that a small gold-standard corpus be available for each language desired, something that would require specific language knowledge to produce. Although such resources may require substantial human effort to produce, the experiments on word alignment I present here could be used as a preliminary step to generate a small, relatively high-quality corpus that could then be corrected by annotators, thus greatly reducing the effort required to produce such a resource.
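As a rough sketch of the model form just described (standard log-linear notation; the exact sub-models and normalization used by Fraser and Marcu differ in detail), the combined alignment model is

\[
P_{\lambda}(a \mid e, f) \;=\; \frac{\exp\!\big(\sum_{m} \lambda_m\, h_m(a, e, f)\big)}{\sum_{a'} \exp\!\big(\sum_{m} \lambda_m\, h_m(a', e, f)\big)}
\]

where each $h_m$ scores an alignment $a$ of the sentence pair $(e, f)$ under one sub-model (translation probability, fertility, distortion, and so on), and the weights $\lambda_m$ are chosen by the discriminative re-ranking step to minimize error against the gold-standard alignments.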

2.2 Adapting Resources by Leveraging Typological Similarities

Utilizing Morphosyntactic Similarities One approach to dealing with a resource-poor language is to find a typologically or genetically related language with comparatively more resources, and leverage the similarity between the languages to use the resource-rich language of the pair as a guide for the related, but resource-poor, language. This is the type of approach taken by Hana et al. (2004) and Feldman et al. (2006) for doing morphological analysis of resource-poor Russian using resource-rich Czech and Polish. In these studies, the morphosyntactic features of the lexical items and their word order are the basis upon which Russian, Czech, and Polish all share some degree of similarity. Despite the large amount of monolingual text in each of these languages, the authors note that parallel text is difficult to come by for many Slavic languages, and so the work focused on exploiting the typological similarities rather than projection-based approaches such as those in Section 2.3. These studies adapt the morphological information between these languages by using the TnT tagger (Brants, 2000) and using surrogate data for estimating the transition and emission probabilities. The morpheme-level POS tags from the resource-rich language are used to approximate the transition probabilities of the related target language, while a statistical morphological analyzer on the resource-poor target language is used to approximate the emission probabilities by bootstrapping the emissions from cognates in the other language. While the transition properties can be transferred between the related languages, the morphological analyzer is needed in order to adapt the tagger to the different lexical items in the target. Between Czech and Russian, Hana et al. (2004) show that their tagger achieves an accuracy of 88%, compared to the commercial Xerox tagger [2] at 82%, showing a great deal of promise for this approach. Another important finding of Feldman et al. (2006) is that combining Polish and Czech training data for the Russian tagger performs better than either language alone. The authors hypothesize that the two languages may have complementary similarities, such that combining the training data ultimately produces a more Russian-like pseudo-language. Another instance of adapting resources from related languages is described in Snyder and Barzilay (2008), where Hebrew, Arabic, Aramaic, and English are analyzed in parallel to induce morphological segmentation.

[2] http://www.xrce.xerox.com/competencies/content-analysis/demos/russian

Target \ Source   Danish  German  Greek  English  Spanish  Italian  Dutch  Portuguese  Swedish
Danish             79.2    45.2   44.0    45.9     45.0     48.6    46.1     48.1      47.8
German             34.3    83.9   53.2    47.2     45.8     53.4    55.8     55.5      46.2
Greek              33.3    52.5   77.5    63.9     41.6     59.3    57.3     58.6      47.5
English            34.4    37.9   45.7    82.5     28.5     38.6    43.7     42.3      43.7
Spanish            38.1    49.4   57.3    53.3     79.7     68.4    51.2     66.7      41.4
Italian            44.8    57.7   66.8    57.7     64.7     79.3    57.6     69.1      50.9
Dutch              38.7    43.7   62.1    60.8     40.9     50.4    73.6     58.5      44.2
Portuguese         42.5    52.0   66.6    69.2     68.5     74.7    67.1     84.6      52.1
Swedish            44.5    57.0   57.8    58.3     46.3     53.4    54.5     66.8      84.8

Table 2.2: Unlabeled Attachment Score (UAS) results of transfer parsing from McDonald et al. (2011). The bold numbers are results for the parser being trained and parsed on itself. The underlined numbers are the best results for a source language other than the target itself.

Using a corpus of short parallel phrases, the authors model potential segmentations and alignments between morphemes among the parallel phrases, assuming morphemes may be aligned to zero or one other morpheme. Arabic, Hebrew, and Aramaic are Semitic languages, known for highly productive morphology (Bravmann, 1977). While English is not a related language, it is an isolating one (Sapir, 1921). Snyder and Barzilay (2008) note that, although English's lack of morphological ambiguity helps in this parallel induction, when character-to-phonetic correspondences are added from a related language, the related languages perform significantly better. Furthermore, in all but the Aramaic→Hebrew pair, the inclusion of the other languages in the analysis outperforms the state-of-the-art monolingual morpheme induction system Morfessor (Creutz and Lagus, 2007).

Utilizing Delexicalized Syntactic Information Moving on from the morphological analysis case, another interesting approach to resource adaptation is found in the "transfer parsing" approaches of Zeman (2008) and McDonald et al. (2011, 2013). In these approaches, the authors use language pairs for which words have been replaced with part-of-speech tags, either from a gold standard or by a tagger, to hide the lexical features specific to a given source or target language from the parser, thus "delexicalizing" it. In order to ensure compatibility between tags, these papers use some form of mapped POS tags, either an "Interlingua-like" tagset (Zeman, 2008) or the Universal POS tagset (Petrov et al., 2012). Täckström et al. (2013) take the "delexicalization" a step further and replace the mapped POS tags entirely using induced word clusters.

Whatever the method for grouping words into classes, the now-delexicalized parser is agnostic as to which language it is actually looking at, since the only tokens presented to it are universal POS tags that can be mapped between languages. This allows for the parser to be trained on dependency structures in one language, and parse POS-tag sequences into dependency structures in another language.
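A minimal sketch of the delexicalization step itself (illustrative only; each of the cited papers uses its own parser and feature set):

    def delexicalize(tagged_sentence):
        """Replace each word with its universal POS tag, so a parser trained on
        the output sees only tag sequences, never language-specific word forms."""
        return [upos for _word, upos in tagged_sentence]

    # Training: delexicalize source-language treebank sentences and train a
    # dependency parser on the resulting tag sequences plus their gold trees.
    # Parsing: POS-tag a target-language sentence, delexicalize it the same
    # way, and parse it with the source-trained model.
    print(delexicalize([("The", "DET"), ("teacher", "NOUN"), ("gave", "VERB")]))
    # -> ['DET', 'NOUN', 'VERB']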

Given that these transfer-parser approaches are agnostic with regards to specific lexical items of the language, the languages that the parsers “transfer” between need not be related. There is, however, an expectation that syntactic information expressed in the form of basic word order is still encoded in the stream of POS tags. While unrelated languages may still see good results in such transfer parsing methods, the transfer between the source and target language may be unpredictable when the languages are syntactically divergent.

Putting this intuition to the test, the cross-linguistic results mapping source and target languages from McDonald et al. (2011) are shown in Table 2.2, where it can be seen that the Romance languages do tend to strongly prefer other Romance languages. Interestingly, however, Portuguese performs better as a source language for parsing Swedish than German, Dutch, or English, which are themselves Germanic languages. Similarly, Greek performs better as a source language for Dutch than the other Germanic languages do as sources.

McDonald et al. (2011) make two particular adjustments to the direct transfer method of Zeman (2008): namely, adding an alignment constraint similar to that used for projection in Hwa et al. (2005) to guide parallel sentences, and also a system the authors call "Multi-Source Transfer," where training data from all available languages is concatenated and then used as training data. Despite the naïve nature of this approach, the authors find little detriment to some languages and modest improvements to others. This finding makes the case for using potentially unrelated languages to transfer linguistic information from, which I will discuss in the following section.

2.3 Adapting Resources by Other Methods

Bilingual Parsing One approach to leveraging information from a language pair, rather than a single language, can be to use methods that ingest parallel data instead of monolingual data. Recent work on this concept includes Inversion Transduction Grammars (ITGs), as introduced in Wu (1997), as well as Multitext Grammars (Melamed, 2003). Both of these approaches are extensions of a far older concept of parallel parsing algorithms often called synchronous grammars or syntax-directed translation schemata (Lewis and Stearns, 1968; Aho and Ullman, 1969). While such approaches are largely focused on machine translation, and many studies have been geared towards optimizing machine translation results using such parallel grammars (Och and Ney, 2003a; Galley et al., 2006; Chiang, 2007), Wu states that a trained ITG that is provided with fully parsed trees for one language can then be adapted to transfer "grammatical expertise in one language toward bootstrapping grammar acquisition in another" (399). That is, if a transduction grammar is trained for the language pair and a high-quality parse is available for one language, it is easy to guide the parse of the second language in the pair by using the transduction grammar model to constrain the parse. While synchronous grammar approaches such as these may indeed be helpful in leveraging language resources in the manner described above, these grammars have the requirement of needing at least some labeled training data for both languages in order to build the initial model.

Welsh: Rhoddod yr athro lyfr i'r bachgen ddoe

English: The teacher gave a book to the boy yesterday

Figure 2.1: A parallel Welsh–English sentence, with word alignments.


Projection-Based Transfer A particularly active area of research in transferring annotation between languages is what I will collectively call "projection" methods, which function by obtaining word alignment between two or more languages and "projecting" information from one language to the other, word by word, along these alignments. The concept as I will discuss it here was introduced in Yarowsky and Ngai (2001), although syntactic transfer had been explored previously, such as in Twisted Pair Grammar (Jones and Havrilla, 1998). The basic notion behind projection with parallel corpora is that, given a parallel corpus consisting of a resource-rich language and a resource-poor language, the resource-rich language either has annotations available, or can have them generated with high-quality NLP tools. This annotation can then be mapped between languages by means of word alignment. This approach relies on the fundamental assumption that basic syntactic concepts, such as POS tags, can be mapped between languages. This assumption is something that Hwa et al. (2002) describe as the Direct Correspondence Assumption (DCA), and as will be discussed later, this assumption does not always hold up. Proceeding with this assumption, however, one might take a parallel sentence such as that in Fig. 2.1, and start by finding the word alignment. Once the word alignment has been obtained, annotation is generated on the resource-rich language, as shown in Fig. 2.2. Having obtained the POS tags for the English, it is now a simple matter of following the word alignments and "copying" the POS tags from the English to the aligned Welsh sentence, as shown in Fig. 2.3. Yarowsky and Ngai (2001) use projection to project POS tags in this way as well as NP bracketings, while Hwa et al. (2005) use projection to bootstrap parsers.

Welsh:   Rhoddod yr athro lyfr i'r bachgen ddoe
English: The teacher gave a book to the boy yesterday
(1) Tag English: DET NOUN VERB DET NOUN ADP DET NOUN ADV

Figure 2.2: Sentence from Fig. 2.1, with English Sentence POS Tagged.

(2) Project to Welsh: VERB DET NOUN NOUN ADP+DET NOUN ADV
Welsh:   Rhoddod yr athro lyfr i'r bachgen ddoe
English: The teacher gave a book to the boy yesterday
(1) Tag English: DET NOUN VERB DET NOUN ADP DET NOUN ADV

Figure 2.3: Sentence from Fig. 2.1, showing how POS tags would project using provided alignment.

Already, Fig. 2.3 shows some of the potential issues that the DCA can raise, as the multiple alignment of the English words to and the with the single Welsh word i'r means that there is a potential conflict between how English and Welsh represent this dative construction. While English uses the preposition to to mark the indirect object and the determiner the to mark definiteness, it appears that Welsh does both in the single word i'r. This results in a conflict in assigning the POS tag; I will address methods of handling such conflicts later, in Chapter 6. While part-of-speech tags are one shallow form of annotation that may be projected, dependency structures are another, deeper form that will be discussed in this thesis.
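As a rough illustration of the projection step (a sketch, not the exact procedure used later in this thesis), the copy operation amounts to following alignment links and transferring tags, with multiply-aligned target words surfacing as conflicts to be resolved:

    def project_pos(source_tags, alignment):
        """Project POS tags along word alignments.  `alignment` is a set of
        (source_index, target_index) pairs; multiply-aligned target words end
        up with more than one candidate tag."""
        projected = {}
        for src_i, tgt_i in sorted(alignment):
            projected.setdefault(tgt_i, []).append(source_tags[src_i])
        return projected

    english_tags = ["DET", "NOUN", "VERB", "DET", "NOUN", "ADP", "DET", "NOUN", "ADV"]
    # Hypothetical 0-based alignment for the sentence pair in Fig. 2.1.
    alignment = {(2, 0), (0, 1), (1, 2), (4, 3), (5, 4), (6, 4), (7, 5), (8, 6)}
    print(project_pos(english_tags, alignment))
    # Welsh word 4 ("i'r") receives ["ADP", "DET"] -- the DCA conflict above.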

2.4 Creating Resources for Resource-Poor Languages

Finally, and tautologically, perhaps the most straightforward path to tackling resource-poor languages is to create or discover resources for them. In this area of research, there are

a few interesting projects that have a particular focus on broad coverage of resource-poor languages.

Indigenous Tweets Indigenous Tweets [3] (Scannell, 2014) is a directory of Twitter users that tend to tweet in various minority languages of the world. Due to the nature of social media content in general, and microblog content in particular, this data tends to be noisy and difficult to work with (Derczynski et al., 2013). Furthermore, the tweets are not filtered by language, so this resource is not a homogeneous data source. That said, as discussed in Nilsson (2015), the very multilinguality and code-switching that make it a poor corpus for monolingual NLP methods make it an exceptional source for sociological and sociolinguistic studies of minority language speakers.

The Crúbadán Project The Crúbadán Project [4] (Scannell, 2007) is another project from the same researcher behind Indigenous Tweets. At the time of writing, the database contained entries for 2,124 languages, and provided the ISO 639-3 code (SIL International, 2006), the countries in which the language is spoken, and the script type for each of the languages. Included for each language is a list of URLs for pages that contain the given language, a word list, character trigrams, and word bigrams. Such resources allow for the building of tools for resource-poor languages, such as grammar checkers (Scannell, 2016), language identifiers (Lui et al., 2014), and others. This resource appears to be very well-curated, and, as far as I can tell, underutilized.

The Online Database of INterlinear text (Odin) The principal works of relevance in this thesis are those of Lewis (2006) and Lewis and Xia (2010), which describe the development of the Online Database of INterlinear text (Odin). Odin was created in a multi-step process.

[3] http://indigenoustweets.com/
[4] http://crubadan.org/

First, a meta-crawling approach was used, creating queries based upon phenomena often discussed in IGT instances (such as ERG or ABS) and directing these queries to existing search engines. After scraping the results of each search, a classifier for detecting IGT instances within a document was then trained, and the IGT instances isolated. Once isolated, language identifiers were created for the instances, at first by leveraging the fact that the language name for the instance was typically given by the author of the document in which it was embedded. With a sufficient number of examples for each language thus identified, language models could then be built using the previously identified instances for the identification of future instances.

The resulting Odin-2.1 [5] dataset used for the experiments in this thesis consists of 151,633 instances covering 1,487 languages. More statistics on the breakdown of the Odin data can be found in Section 4.1.1.

With Odin created, Xia and Lewis (2007) describe some preliminary ways in which this resource can be leveraged to create NLP tools for the languages contained in the database via projection similar to Hwa et al. (2005), but with the added benefit of the high-quality alignment provided by the gloss line contained in IGT. The methods I used in this work for alignment and projection largely follow this study, and as such are explained further in Chapter 5 and Section 6.2.

Moving the scope of research outward from generating resources for particular languages, Lewis and Xia (2008) further investigate the possibility of using Odin as a resource to answer broad typological questions. Lewis and Xia (2008) sought to predict basic word order parameters (Greenberg, 1963) using the World Atlas of Language Structures (WALS) (Dryer and Haspelmath, 2013) as a gold standard, and were able to achieve 99% accuracy with 40 or more IGT instances, and 79% accuracy with 10–39 instances. This shows that IGT holds promise as a computational resource not only for language-specific queries, but also for answering broader typological questions.

[5] http://depts.washington.edu/uwcl/odin/

2.5 Summary

I have discussed a great number of methods which may be used to approach languages for which annotated data is scarce, but many of these approaches have limitations. When it comes to unsupervised approaches, the amount of data required to achieve the published results is typically on the order of millions of sentences. For instance, the projection-based approach of Das and Petrov (2011) sounds extremely promising, in that POS tagging accuracy figures between 79.5% and 87.9% are reported over 8 languages, with only linguistic knowledge of English required. These languages are all European, however, and the data requirements are in the millions of sentences, something that will not be achievable for many resource-poor languages. Bilingual parsing methods using Inversion Transduction Grammars as described in Wu (1997, p. 399) also sound promising, but require gold-standard parse trees for training before parsing on unseen sentences can be performed, and also require large parallel corpora. The semi-supervised approaches, too, hold promise, but in Haghighi and Klein (2006b) and Toutanova and Johnson (2007), the authors noted that the selection of the supervision can greatly affect the results. In these two studies, the authors used corpus-driven approaches to select the prototypes and tagging dictionaries, respectively. While this produced good results on these corpora, how will such approaches function when the available supervision is far sparser or noisier, as it is with IGT?

In this thesis, I will make the case for working with data that does exist for resource-poor languages, namely Interlinear Glossed Text. The Odin database provides broad, semi-structured data for over a thousand languages, and if this data can be properly harnessed, it may open the door to hundreds of previously intractable languages. In the following chapters, I will describe the ways in which the unique data source that is IGT can be used to form a foundation upon which massively multilingual tools might be built.

foundation upon which massively multilingual tools might be built. 23

Chapter 3 METHODOLOGY OVERVIEW

Having discussed previous attempts to address the creation of language tools for resource-poor languages, I will now outline the methods that I will be pursuing in this thesis. Following on the promise of Interlinear Glossed Text (IGT) demonstrated by Xia and Lewis (2007) and Lewis and Xia (2009) as a computational resource with broad coverage, this work concentrates on expanding upon different methods of exploiting IGT to bootstrap NLP tools. I will begin by giving a more detailed introduction to IGT as a data type (Section 3.1), before giving an overview of the IGT enrichment system (Section 3.2) and its individual components: the word alignment module (Section 3.2.1), the POS tagging module (Section 3.2.2), and the dependency parsing module (Section 3.2.3).

3.1 Interlinear Glossed Text

Interlinear Glossed Text, or IGT, is a simple way of presenting linguistic examples. IGTs are used to present information about a language, such as morphology, syntax, or lexical distinctions, to readers who may not have specific knowledge of the language at hand. Because the examples serve illustrative purposes, the author will often present a varying amount of information, depending on what he or she feels is needed to illustrate the particular linguistic phenomenon or phenomena at hand. Though there is no formal specification for the IGT format, the Leipzig Glossing Rules (Bickel et al., 2008) are a set of guidelines that aim to standardize the format as much as possible.

The typical instance of IGT includes a language line, a gloss line, and a translation line, and represents a single sentence, as in Instance 3.1, which shows a simple IGT instance for the German sentence Peter erzählt den Kindern eine Geschichte (“Peter tells a story to the children”). Although examples like 3.1 are the norm, IGT instances also occur in other forms, such as that in Instance 3.2, where a single IGT instance is used to represent two or more similar sentences in the same example. Other variation includes instances with additional lines of metadata, such as alternative translations, citations, or descriptions of linguistic phenomena. While these other forms are not uncommon, I will focus on the canonical form shown in Instance 3.1. I will discuss these alternate forms in more depth in Section 4.4.

3.1.1 Using IGT for Alignment

One of the most immediately noticeable aspects of the IGT data format is the gloss line, and in particular, the English translations provided by the gloss. Given that there is often an unambiguous repetition of the same English word on both gloss line and translation line, this can be used to assume alignment between the gloss and translation line tokens. Furthermore, since the gloss tokens and the language-line tokens are aligned in a one-to-one, monotonic manner, the gloss line can typically be used as a pivot to align the translation line with the language line. Instance 3.3 illustrates this alignment for the sentence from Instance 3.1. Just as we can

Language:    Peter erzählt den Kindern eine Geschichte
Gloss:       Peter tells the:DAT children:DAT an:ACC story:ACC
Translation: “Peter tells a story to the children.”

Instance 3.1: IGT with German (deu) on the language line and English as the translation (Nakamura, 1997).

Language:    János-nak/Neki győz-ni/győz-ni- kell.
Gloss:       John-dat/he.dat win-INF/win-INF-1PL must
Translation: ‘It is necessary for John/him to win.’

Instance 3.2: A Hungarian (hun) IGT instance from Laczkó (2002) that combines two examples into one.

align the translation and gloss tokens children[7] ←→ children:DAT[4], we can assume that the fourth gloss token aligns with the fourth language token and thus transitively align children[7] ←→ Kindern[4].
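To make the pivoting step concrete, the following minimal sketch (an illustration only, not the Intent implementation; the function name and output format are my own) matches gloss and translation tokens by exact string comparison and carries each match over to the language line via the assumed one-to-one gloss–language correspondence.

def align_language_to_translation(lang_tokens, gloss_tokens, trans_tokens):
    """Align language-line tokens to translation-line tokens via the gloss line.

    Gloss token i is assumed to annotate language token i (one-to-one), so any
    gloss/translation match carries over to a language/translation alignment.
    Returns a set of 1-based (language_index, translation_index) pairs.
    """
    assert len(lang_tokens) == len(gloss_tokens), "gloss must be one-to-one with language line"
    pairs = set()
    for i, gloss in enumerate(gloss_tokens, start=1):
        for j, trans in enumerate(trans_tokens, start=1):
            if gloss.lower() == trans.lower():   # exact, case-insensitive match
                pairs.add((i, j))
    return pairs

lang  = "Peter erzählt den Kindern eine Geschichte".split()
gloss = "Peter tells the:DAT children:DAT an:ACC story:ACC".split()
trans = "Peter tells a story to the children".split()
print(sorted(align_language_to_translation(lang, gloss, trans)))
# [(1, 1), (2, 2)] -- only exact matches; richer matching is the subject of Chapter 5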

3.1.2 Using IGT for Projection

In the alignment example in Instance 3.3, it happens that the alignment is felicitous, and can be used to make assumptions about the German based on what is known about the English, namely that children is a noun, and in this case is the indirect object of the sentence. Using word-to-word alignment to make this kind of correspondence is the basis of projection methods, such as those used by Yarowsky and Ngai (2001); Hwa et al. (2004); McDonald et al. (2005); Xia and Lewis (2007) and Das and Petrov (2011). Though I will discuss how I use projection in more detail in Section 6.2 and Section 7.2, Instance 3.4 shows a simple

Language (tokens 1–6):    Peter erzählt den Kindern eine Geschichte
Gloss:                    Peter tells the:DAT children:DAT an:ACC story:ACC
Translation (tokens 1–7): “Peter tells a story to the children”

Instance 3.3: IGT instance from Instance 3.1, but this time showing the alignment information provided by the gloss.

Projected tags:           NOUN VERB DET NOUN DET NOUN
Language (tokens 1–6):    Peter erzählt den Kindern eine Geschichte
Gloss:                    Peter tells the:DAT children:DAT an:ACC story:ACC
Translation (tokens 1–7): “Peter tells a story to the children”
English tags:             NOUN VERB DET NOUN ADP DET NOUN

Instance 3.4: IGT instance demonstrating how POS tags can be projected from the translation line to the language line by way of the gloss line.

illustration of using the alignments to project part-of-speech tags from the translation line to the language line.
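The projection step itself can be sketched as copying the tag of each aligned translation token onto the corresponding language token. The snippet below is only an illustrative sketch of this idea (not the Intent code); the alignment pairs are hand-specified from Instance 3.3, and unaligned language tokens are simply left untagged here, whereas a fallback would be needed in practice.

def project_pos(trans_tags, alignments, num_lang_tokens):
    """Project POS tags from translation tokens onto language tokens.

    trans_tags: list of POS tags for the translation line (index 0 = token 1).
    alignments: set of (lang_index, trans_index) pairs, 1-based.
    Returns a list of projected tags for the language line; unaligned
    language tokens are left as None.
    """
    lang_tags = [None] * num_lang_tokens
    for lang_i, trans_i in alignments:
        lang_tags[lang_i - 1] = trans_tags[trans_i - 1]
    return lang_tags

# Alignment from Instance 3.3 (1-based indices), hand-specified here
alignments = {(1, 1), (2, 2), (3, 6), (4, 7), (5, 3), (6, 4)}
trans_tags = ["NOUN", "VERB", "DET", "NOUN", "ADP", "DET", "NOUN"]
print(project_pos(trans_tags, alignments, 6))
# ['NOUN', 'VERB', 'DET', 'NOUN', 'DET', 'NOUN']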

3.1.3 The Gloss Line As a “Pseudo-Language”

Since English words are typically found on both the gloss line and the translation line of IGT instances, these words can be a useful aid in aligning the translation line with the language-line words that the gloss words annotate. Besides the English words, however, the gloss line also contains additional information in the form of grams1. One such example would be the gram 3SG, indicating that the word contains some form of inflection for third person and singular number. The following Instances 3.5 to 3.8 contain instances of Oriya, a language spoken in southern and eastern India; Turkish; Yukaghir, an endangered language of Eastern Russia with a population of 70–80 speakers (Hammarström et al., 2016); and |Gwi, a threatened language of Botswana from the Khoisan family consisting of roughly 2–4,000 speakers (Lewis, 2009). Despite such wide variation, all of these instances contain English words and familiar grammatical indicators.

1 My usage follows that of Bybee and Dahl (1989), where a “gram” refers to a token of gloss-line annotation used to mark a grammatical feature of the token, such as inflection or case.

Language:    0 da zo-ro gE-rE wuo-ro haane.
Gloss:       3sg PAST run-IMPERF go-IMPERF collect-IMPERF FACT berries
Translation: He/she was always running there collecting berries.

Instance 3.5: Instance of Oriya (ori) from Beermann and Hellan (2002).

Language:    Para-yi kim čal-di-ø?
Gloss:       money-ACC who-NOM steal-PAST-3SG
Translation: ‘Who stole the money?’

Instance 3.6: Instance of Turkish (tur) from Zwart (2002).

Language:    ta:t jo:nro-lu-ge tudel pon’o:-l’el.
Gloss:       so forget-1PL-CONV he remain-3SG
Translation: ‘Since we forgot him, he remained alone.’

Instance 3.7: Instance of Yukaghir (ykg) from Nedjalkov (1998).

Language:    Cire !ko~o       Da !ko~o
Gloss:       1.SG.NOM go      1.SG.IMP go
Translation: ‘I go.’          ‘Let me go.’

Instance 3.8: Instances of |Gwi (gwj) from Nordlinger and Sadler (2000) showing how an imperative mood for a clause is encoded on a pronoun.

As a result, the gloss line makes an appealing target for building a general system that can analyze any IGT instance based on previously seen gloss lines. This is possible, with a few caveats. For one, as the IGT instances in Odin cover over a thousand languages, and the gloss-line tokens mirror the word order of the language they annotate, word order will vary from language to language. As such, when building such a general system, context between tokens should not be relied upon. Second, though grams may indicate concepts such as person or number as verbal inflection in one language, in other languages other word classes may exhibit these grammatical features. For instance, in Instance 3.8, the authors illustrate what appears to be a case of the imperative mood being encoded on a pronoun.

In Instances 3.5 to 3.7, the variation of the grams 3SG, NOM, and IMPERF could be extremely useful in answering interesting typological questions, such as how different types of agreement are expressed, what case (Bender et al., 2013, 2014) or gender systems might be used on nouns, or how tense and mood are expressed on verbs. Because of this variation, however, a system trained to analyze glosses based on knowledge from one language may not always map correctly to other languages.

While my work does not focus on these more complex associations between the grams and the words they modify, it does take advantage of these grammatical markers as clues to the part-of-speech tag of the language-line word the gloss token is intended to represent. For instance, the IMPERF gram in Instance 3.5 or the PAST grams in Instance 3.5 or Instance 3.6 are likely to be strong indicators of a VERB POS tag. However, despite this strong positive correlation, grams like 3SG can indicate the inflection of either a VERB (as in Instances 3.6 and 3.7), a NOUN (Instance 3.5) or a PRON (Instance 3.8).
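As a toy illustration of how grams can serve as POS clues, the sketch below maps a few relatively unambiguous grams to likely tags; the gram inventory, and how such clues are actually combined in Intent, is the subject of Chapter 6, and deliberately ambiguous grams such as 3SG are left out of the lookup here.

# Toy illustration only: a handful of grams with fairly reliable POS associations.
GRAM_POS_CLUES = {
    "IMPERF": "VERB",
    "PAST":   "VERB",
    "INF":    "VERB",
    "ACC":    "NOUN",   # case marking most often appears on nominals
    "NOM":    "NOUN",
}

def pos_clues_for_gloss_token(gloss_token):
    """Return the set of POS tags suggested by grams inside a gloss token."""
    subtokens = gloss_token.replace("-", ".").split(".")
    return {s.upper() and GRAM_POS_CLUES[s.upper()] for s in subtokens if s.upper() in GRAM_POS_CLUES}

print(pos_clues_for_gloss_token("steal-PAST-3SG"))   # {'VERB'}
print(pos_clues_for_gloss_token("money-ACC"))        # {'NOUN'}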

Through all of this linguistic analysis, in a format meant to be relatively standardized, the IGT gloss line can be thought of as a “pseudo-language” in that it uses a common vocabulary, mixing English words and grammatical markers, though with a word order that changes

according to the language it annotates. This unique aspect of IGT is something that I will explore further in Sections 5.3.2 and 6.3.

3.2 System Overview

Having described what IGT is, and some of its unique features, I will now provide an overview of the system that I have created to automatically enrich these examples using those features. I call this system the INterlinear Text ENrichment Toolkit, or Intent. The flowchart in Fig. 3.1 gives a very high-level overview of the modules in the system. The overview in Fig. 3.1 first draws a distinction between “Raw” IGT and “Enriched” IGT. “Raw” IGT in this setting is the plain-text IGT instances, whether from Odin or another source. The “enriched” IGT are IGT instances with added POS tags, translation-to-gloss alignments, or dependency structures.

[Figure 3.1 flowchart: Raw IGT is fed through the Intent enrichment system’s three modules, word alignment (§5), part-of-speech tagging (§6), and dependency parsing (§7), with gloss/translation alignments used for projecting tags and trees and language-line POS tags used for training parsers and learning divergence, producing Enriched IGT.]

Figure 3.1: High-level overview of the end-to-end Intent system which will be discussed in Chapter 8. Intent enriches raw IGT with POS tags, word alignments, and Dependency Structures. Required inputs are shown with solid lines, while optional inputs are shown with dotted lines.

Figure 3.1 illustrates the three main modules of the system: the POS tagging module (Chapter 6), the word alignment module (Chapter 5), and the dependency parsing module (Chapter 7). Each module may interact with the others depending upon its settings. For instance, the solid arrow between word alignment and dependency parsing indicates that the IGT-based word alignment module is required for the dependency parsing module, as the dependency parsing experiments require some form of word alignment to project the structures from the translation line onto the target language. In the next several sections, I will give a brief overview of the system and the settings for each of the component modules.

3.2.1 Word Alignment

Figure 3.2 gives an overview of the different methods for aligning the IGT data as will be presented in Chapter 5. I describe a method for aligning the language and gloss lines of

[Figure 3.2 flowchart: Raw IGT enters the word alignment module (§5), which contains a heuristic aligner (§5.1), a statistical aligner (§5.2), and a combined method (§5.2.4), with POS tags as an additional input, producing Aligned IGT.]

Figure 3.2: Overview of the word alignment module and its different settings.

[Figure 3.3 flowchart: within the POS tagging module (§6), Raw IGT, an English tagger, and word alignment feed POS tag projection (§6.2); manually tagged gloss lines train a gloss-line classifier (§6.3); the projection and classification approaches are combined (§6.3.5); the resulting POS-tagged IGT is used to train monolingual POS taggers (§6.5).]

Figure 3.3: Overview of the POS tagging module and its different settings.

IGT instances using several different matching heuristics (Section 5.2), and compare the performance of this method against statistical approaches (Section 5.3). In addition to these approaches, I discuss how elements that have been heuristically matched can be combined with a statistical approach on the IGT instances to create a combined alignment approach (Section 5.4). The end result of this module is a set of IGT instances with alignment either between the gloss and translation lines (from which language–translation alignment follows via the one-to-one alignment assumed between gloss and language lines), or with the language and translation lines aligned directly.

3.2.2 POS Tagging

The POS tagging module (Fig. 3.3) also consists of three potential approaches for training POS taggers from IGT sources. The first is POS tag projection, which is described in Section 6.2. This requires Raw IGT, an English POS tagger, and IGT word alignment

(from the word alignment module). The second method is a classification-based approach (Section 6.3), which requires a training corpus of manually POS-tagged gloss-line data. The third approach combines projection and classification by using the projection approach to project POS tags to the gloss line, and using those projected labels to train the classifier (Section 6.4). Whether the gloss-line tags are projected or assigned directly by the classifier, these latter two approaches leverage the “pseudo-language” qualities of the gloss line: although they are trained on only a subset of IGT instances and languages, they can apply gloss-line POS tags to IGT instances regardless of the target language. These gloss-line POS tags can then use the one-to-one alignment with the language line to label the language-line tokens with the appropriate POS tags. Once the language line has POS tags, those target-language sentences can be used as a source of supervision to train a POS tagger that can be used directly on the target language. Those experiments are described in more depth in Section 6.6.

3.2.3 Dependency Parsing

Finally, the flowchart in Fig. 3.4 gives an overview of the experimental settings for dependency parsing. Like POS tagging, the basic setting for obtaining DS parses for IGT data involves parsing the translation line, aligning the instance, and projecting the structure to the language line, which is covered in Section 7.2. Unlike the POS tagging module, the dependency parsing module includes a series of processes aimed at comparing the DS trees used in projection with another set of trees to learn how the two sets of annotation diverge, and improve the results. There are two approaches for this divergence-learning task: the first system uses manually corrected trees alongside the projected DS trees and, in combination with a statistical DS parser, uses the projected trees as features in the parser training process. This system may then be run on new IGT data of the same language, and the projection is weighted against other parser

[Figure 3.4 flowchart: within the DS parsing module (§7), Raw IGT is parsed on the English side by a statistical DS parser and the trees are transferred by DS projection (§7.2) using word alignment; manually corrected DSs are used to train a projection-based IGT parser (§7.3) and to generate rewrite rules that are applied to subsequent projections (§7.4); the resulting DS-parsed IGT, together with POS tags, is used to train monolingual DS parsers (§7.5).]

Figure 3.4: Overview of the dependency parsing module and its different settings.

features to produce better parses than projection alone (Section 7.3). The second system compares the English DSs with the manually corrected IGT DSs to find divergence patterns, which are encoded as rewrite rules and applied to subsequent projections (Section 7.4). While all of the dependency parsing methods above ultimately produce IGT instances with target-language DSs, only the projection-based approach requires no additional manual intervention, and so it is the most broadly applicable. For that setting, Section 7.5 tests training monolingual DS parsers using different combinations of alignment methods and POS tag sources.

3.3 Summary

In this chapter, I have laid out a number of methods that use the unique data source that is raw IGT and exploit its particular features to add word alignment, POS tags, and dependency structures. In the next chapter, I will discuss the particular data sources in more detail, as well as some of the peculiar difficulties in dealing with different data sources, such as mapping between different POS tag sets (Section 4.3) and dealing with the very important issue of noise in many of the IGT instances (Section 4.4).

Chapter 4

THE DATA

As the goal of my research has been to develop a system capable of working on any language that is represented within Odin, a complete evaluation of the system would be run against many languages, belonging to many different language families. In addition, I have three tasks that I will evaluate: POS tagging, word alignment, and dependency parsing. While corpora with manually created gold standards are often available for one or more of these tasks, finding all three in the same place is difficult. Therefore, I have used a number of different corpora to try to answer all of these questions.

In the following sections I will give an overview of the specific corpora used for this research (Section 4.1) and a number of details about how the data was used, including tokenization and transliteration (Section 4.2), dealing with multiple tagsets (Section 4.3),

Resource Type           Odin      XL-IGT   RG-IGT        UD-2.0        HUTP     CTN
IGT                     X         X        X                           X        X
POS Tags                                   X             X             X        X
(Tagset)                                   (Universal)   (Universal)   (Hindi)  (CTN)
Dependency Structures             X                      X             X
Word Alignment                    X        X
# of Sentences          151,633   796      82            85,625        147      8,695
# of Languages          1,487     7        5             8             1        1

Table 4.1: High-level overview of the corpora used in this thesis, and the type of annotation that they contain.

and the issues of data cleanliness involved with using IGT from Odin (Section 4.4).

4.1 Corpora Overview

Table 4.1 shows a very high-level overview of the six corpora used in this thesis, and the gold-standard annotations that each contains: IGT instances, POS tags, Dependency Structures, or word alignment. Table 4.2 shows a different view of the corpora, breaking down the annotation by language, covering the sixteen different languages that will be tested on in this thesis. The key in Table 4.2 may be used as a reference for the resources provided by each corpus.

Language Family   Language     ISO
Afroasiatic       Hausa        hau
Austronesian      Indonesian   ind
                  Malagasy     mlg
Indo-European     Bulgarian    bul
                  French       fra
                  Gaelic       gla
                  German       deu
                  Hindi        hin
                  Italian      ita
                  Spanish      spa
                  Swedish      swe
                  Welsh        cym
Koreanic          Korean       kor
Sino-Tibetan      Chintang     ctn
Uto-Aztecan       Yaqui        yaq

†Orthography is not transliterated. ∗Only languages used in this thesis are listed; these corpora contain more.

Table 4.2: Matrix describing the different resources used (across the top row: Odin, XL-IGT, RG-IGT, UD-2.0, HUTP, CTN) and the information provided within each resource (POS tags, word alignment, dependency trees, IGT); the per-corpus markings are given as icons in the original table and are not reproduced here.

# Langs   # Instances   # Tokens   Avg Tokens/Instance
1,487     151,633       231,602    4.78

Table 4.3: Data overview for Odin v2.1 data.

4.1.1 Odin v2.1 (Provides: IGT)

Odin, the Online Database of INterlinear text (Xia et al., 2016), is the main repository for raw IGT instances that I will draw upon. The IGT instances within Odin are raw text, automatically extracted from the PDF documents which contained them. As a result of the PDF-to-text conversion, the IGT instances often exhibit corruption, and require cleaning to fix some of the more common errors (see Section 4.4). The version of Odin used for the experiments in this thesis was version 2.1, available at http://depts.washington.edu/uwcl/odin/. Table 4.3 shows an overview of the amount of data contained in the version of the Odin data used for this thesis. Due to the large amount of coverage provided, the Odin data provides IGT instances for nearly every language that I will discuss in this thesis, and a great deal more. With over 150,000 instances representing nearly 1,500 languages, the Odin database holds the potential to bootstrap language tools for an extremely wide array of languages. Despite having the unique IGT data format to leverage, the Odin database does not have gold-standard labels on the language data itself, and so other resources are used for evaluation.

4.1.2 XL-IGT (Provides: IGT, word alignment, dependency structures)

This is a small subset of instances drawn from the Odin database that was used for the experiments in Xia and Lewis (2007), consisting of data for German, Gaelic (Scottish), Hausa, Korean, Malagasy, Welsh, and Yaqui. This dataset was of particular interest because the IGT had been cleaned

Language   Instances   # Lang Tokens   Avg Tokens/Instance
Gaelic     67          394             5.88
German     174         1,290           7.41
Hausa      130         780             6.00
Korean     138         718             5.20
Malagasy   128         728             5.69
Welsh      75          460             6.13
Yaqui      84          506             6.02
Total      796         4,876           6.13

Table 4.4: Overview of XL-IGT data

and so was of optimal quality, but also because these languages span a range of language families. Hausa is a good example of a language that has a large speaker population but few resources: it is an Afro-Asiatic language spoken primarily in Nigeria, with a speaker population of approximately 41 million (Lewis, 2009), yet with very few online resources.

Yaqui exemplifies another side of resource-poor languages: it is an indigenous language of the Uto-Aztecan family with 12,230 speakers in the Sonoran region of Mexico and Arizona in the United States (Lewis, 2009). While Welsh and Gaelic have more speakers than Yaqui, they might similarly be considered threatened minority languages. In this context, NLP tools might be less valuable in traditional roles, but still of interest for tasks such as language documentation and preservation. For instance, minority-language detection on microblogging platforms could be used to help minority-language speakers find others who share the language, or spark interest in those who might like to learn.

The instances in the XL-IGT corpus have been annotated for word alignment and dependency structures. Table 4.4 gives the breakdown of this data set by language. No gold-standard POS tags are available.

Language    Instances   # Lang Tokens   Avg Tokens/Instance
Bulgarian   9           43              4.8
French      40          282             7.1
German      70          419             6.0
Italian     7           28              4.0
Spanish     15          83              5.5
Total       141         855             6.1

Table 4.5: Overview of RG-IGT Data

4.1.3 RG-IGT (Provides: IGT, POS tags, word alignment)

This is another subset of the Odin database, this time with the language line, gloss line, and translation line manually annotated with POS tags, as well as with translation–gloss and gloss–language alignment. Annotating the gloss line directly with POS tags was done to train the gloss-line POS classifier described in Section 6.3, but it also provides the ability to evaluate projected POS tags independent of any language–gloss misalignment that may occur. The languages here were chosen primarily for the author’s ability to understand the orthography and ensure that the data was clean and well formatted. This unfortunately limited the set to Indo-European languages, but since the primary target for this corpus was the gloss line, this annotation should be portable to other languages and language families. This data set also provides annotation for word alignment, but no tree structures. The same data were developed and used in Georgi et al. (2013). Table 4.5 provides an overview of the data.

4.1.4 UD-2.0 (Provides: POS tags, dependency structures)

This corpus, the Universal Dependency Treebank, v2.0 (McDonald et al., 2013), provides labeled dependency trees in CoNLL-X format (Buchholz and Marsi, 2006), as well as POS tags. Table 4.6 provides a breakdown of this resource by language. This is the primary data source that will be used for evaluating POS tags and DSs on monolingual data. The POS tags used are the Universal POS tags defined by Petrov et al. (2012), and are the same as those used for the RG-IGT data. Additionally, the Universal Dependency project has sought to create treebanks that are syntactically consistent in their analyses across languages, and thus should reduce errors that might arise from independently constructed resources, such as the Hindi-Urdu Treebank Project (Section 4.1.5).

Language                 Sentences   Tokens      Types     Avg. Tags Per Type
French                   16,422      396,511     41,452    1.08
German                   15,918      293,460     50,461    1.05
Indonesian               5,593       121,923     19,915    1.08
Italian                  7,189       167,873     20,655    1.06
Korean                   6,339       69,690      36,323    1.02
Portuguese (Brazilian)   11,998      298,323     30,756    1.08
Spanish                  16,007      424,425     46,045    1.08
Swedish                  6,159       96,319      15,037    1.03
Total                    85,625      1,868,524   260,644   1.03

Table 4.6: Overview of the UD-2.0 data.

4.1.5 HUTP (Provides: IGT, POS tags, dependency structures)

This corpus consists of a set of IGT instances that were used as guideline sentences for the construction of the Hindi-Urdu Treebank Project (Bhatt et al., 2009). The IGT instances are very clean, and represent the languages in a romanized transliteration. In addition to the IGT text, this resource also provides POS tags and labeled dependency parses. The dependency structures created for this project were not created with the same goal of cross-lingual representation as those in the UD-2.0 corpus, however, so some choices in the format

of the dependency trees differ, and as a result, evaluation on the two data sets may differ significantly for reasons other than the particular methodology. This corpus serves to complement the other IGT-containing corpora in order to evaluate the POS tagging and dependency parsing methods, and increase the breadth of linguistic families for which these evaluations are performed. The HUTP uses its own POS tagset, which can be found at http://verbs.colorado.edu/hindiurdu/guidelines.html.

Language     # Instances   # Lang Tokens   Avg Tokens/Instance
Hindi/Urdu   147           963             6.55

Table 4.7: Overview of the HUTP data.

4.1.6 CTN (Provides: IGT, POS tags)

This corpus contains a set of IGT examples from the Chintang language of Nepal (Bickel et al., 2009). The IGT instances themselves are very clean, but as will be discussed in Section 4.3, the POS tags used in the database are not the same as the Universal Tagset, and so the use of this resource for evaluation is not as straightforward as the other annotated IGT corpora shown above.1

Language   Instances   # Lang Tokens   Avg Tokens/Instance
Chintang   8,695       38,896          4.47

Table 4.8: Overview of the CTN data.

1 See also Georgi et al. (2015).

4.2 Tokenization and Transliteration in IGT

While examples like the one in Instance 3.3 show an idealized case of IGT, not all instances are as amenable to being worked with as German and English. Besides increasing divergence between languages, there are languages that do not share an orthography with English. One issue arising from such orthographic differences is how meaning-bearing tokens are delimited. Since IGT is ultimately a format for presentation, and typically for a predominately English-speaking audience, this results in IGT instances that are not given in the original orthography, or, if they are, in a representation that includes whitespace tokenization, and often morphological analysis using hyphenation. Without the original tokenization, however, an IGT-bootstrapped tool will be unable to correctly process untokenized input.

Besides the whitespace issue, an orthographic difference between the typical representation of a language and its representation in the IGT presents a different problem. While many languages use orthographic representations that are now available in Unicode formats, transliteration of the orthography is a difficult issue. Standards do exist [e.g. ISO 15919 (Stone, 2004)], but the linguists who create the IGTs may or may not use one. Reasons for not using the native orthography or a phonetic transcription may be a preference for another approach, such as the International Phonetic Alphabet (IPA), or a desire to simplify the presentation based on the paper’s intended audience. This means that when it comes to utilizing the information extracted for these languages with non-Latin orthographies, matching it with other resources that may be available for the language may prove difficult. As a result, for the purposes of this work, experiments that evaluate on monolingual data will be limited to languages which use Latin orthographies, or perfectly regular transliteration.2

2 Japanese and Korean, for instance, occur in both Odin and the UD-2.0 treebank, but are found entirely in transliteration in the former, and in native orthography in the latter.

Tag    Description
ADJ    Adjectives
ADP    Adpositions (Prepositions/Postpositions)
ADV    Adverbs
CONJ   Conjunctions
DET    Determiners and Articles
NOUN   Nouns
NUM    Numerals
PRON   Pronouns
PRT    Particles
VERB   Verbs
X      Catchall (Abbreviations/Foreign Words)
.      Punctuation

Table 4.9: Universal POS Tagset following Petrov et al. (2012)

4.3 POS Tagsets

One of the biggest difficulties in working with massively multilingual data is that the tools and annotations used for different languages are geared toward a language-specific analysis of the target language. Using an English POS tagger will typically mean using the Penn Treebank tagset, which tends to be focused on English constructions. In Yarowsky and Ngai (2001), the authors focused on projecting from English to French, and made the concession of not distinguishing the mood, person, or number distinctions that are realized on verbs in French but not English, yet they still worked to distinguish between the VB/VBN/VBG/VBD3 forms. While projecting tags may be feasible between these two Indo-European languages, using these tags may make less sense when projecting between English and a highly isolating language where tense is formed by the bare verb and a tense-bearing particle. A number of works in recent years (Zeman, 2008; de Marneffe and Manning, 2008; Petrov et al., 2012) have posited different approaches to selecting an appropriate set of part-of-speech

3 Present/Past Participle/Gerund/Past

(POS) tags that can be applied to a wide variety of languages, ideally a tagset that is universal among languages. This is a difficult undertaking: as Zeman (2008) pointed out, while the primary goal of POS tags is to encode a particular bundle of morphosyntactic information carried by a given word, the way in which that information is bundled varies from language to language. Despite this difficulty, Xia and Lewis (2007) used projected POS tags to reliably determine basic word order for hundreds of languages, highlighting the utility of a coarse tagset for answering typological questions.

Mapping Between Tagsets With the UD-2.0 corpus using the Petrov et al. universal tagset, shown in Table 4.9, this tagset seemed a good set to use in my work with other languages. As a consequence, I used the universal tagset when creating labels for the RG-IGT corpus, and also had to create mappings for the HUTP and Chintang corpora. The mappings for these tagsets are shown in Appendix C in Tables C.1 and C.2, respectively. A number of tagset mappings to the Petrov et al. (2012) set for other languages are also provided by the authors at https://github.com/slavpetrov/universal-pos-tags, including mappings from the Penn Treebank tagset. Both the mappings created for this thesis and those by Petrov et al. are many-to-one, with multiple source tags potentially mapping to the same universal tag.
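Applying such a mapping is a simple dictionary lookup, as in the sketch below. The Penn Treebank pairs shown follow the published Petrov et al. mappings; the HUTP and CTN mappings actually used in this thesis are the ones listed in Appendix C, not the toy dictionary here.

# Many-to-one mapping from a source tagset to the universal tagset.
# The Penn Treebank pairs below follow the published Petrov et al. (2012)
# mappings; the full HUTP and CTN mappings are listed in Appendix C.
PTB_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN", "NNPS": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB", "VBN": "VERB",
    "JJ": "ADJ",  "RB": "ADV",  "DT": "DET",   "IN": "ADP",
    "PRP": "PRON", "CC": "CONJ", "CD": "NUM",
}

def map_tags(tags, mapping, default="X"):
    """Map a sequence of source tags to universal tags, falling back to X."""
    return [mapping.get(t, default) for t in tags]

print(map_tags(["NNP", "VBD", "DT", "JJ", "NN"], PTB_TO_UNIVERSAL))
# ['NOUN', 'VERB', 'DET', 'ADJ', 'NOUN']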

4.4 IGT and Data Cleaning

While much of the data contained in Odin is very straightforward, such as the example in Instance 3.1, there are ways in which the data in Odin requires modification for use with the automated methods I will be using here. The data may be non-ideal for one of two reasons. Either it exhibits corruption resulting from being extracted from PDF documents, or the author has included metadata in the example that is not linguistic in nature. In both cases, the Intent system preprocesses the raw Odin data in an attempt to make it more readily usable. In the former case, fixing the unintended errors arising from

(6) a. Carapana (E. Tucanoan; Metzger 1981:34)
       -w
       work-3.SG.FEM.PAST
       ‘She worked.’ (no evidential reading)

Instance 4.1: An instance of Carapana (cbc), demonstrating some of the non-linguistic information that is found in IGT instances, including the language name, example numbering, author citation, and explanation of linguistic phenomena.

the format conversion is what I call cleaning, and is the focus of Section 4.4.1. In the latter case, the removal of inline metadata from the example is what I refer to as normalization, which is addressed in Section 4.4.2. Instance 4.1 illustrates several such cases, with example numbering, the language name (Carapana), a citation for the source of the instance, and an explanation of the intended interpretation. Instance 4.2 contains an example with both inline metadata and irregular whitespace. While these issues are different in nature, both can present problems in extracting proper inferences from the IGT instances. I will describe some of the most common issues below, and what preprocessing steps are taken by the Intent system to clean and normalize the data.

doc_id=430 513 520 L M B B B B G T+AC+LN
language: arabic (arb)
line=513 tag=L:       8) daxal{ *-na / -at} n-nisaa -u makaatib-a-hunna
line=514 tag=M:
line=515 tag=B:
line=518 tag=B:
line=519 tag=G:       entered{ *-3. PL.F. / -3.SG.F.} the-women(3. PL.F.)-NOM office(PL.)-ACC-their
line=520 tag=T+AC+LN: ‘The women have entered their offices.’ (Fassi Fehri, 1993: 32)

Instance 4.2: A raw IGT instance of Arabic (ara) in Odin, extracted from Nasu (2001). This instance shows corruption on the gloss line, numbering on the language line, and other errors. Dealing with these sources of noise is key to improving performance.

4.4.1 Cleaning: Fixing PDF-to-Text Corruption

Spurious or Missing Whitespace In Instance 4.2, both lines 513 and 519 show tokens that have had spurious whitespace inserted, likely because of the PDF-to-text extraction process mistaking kerning for whitespace. The token ‘n-nisaa -u’ on line 513 should not have whitespace before the final -u, nor should the ‘3. PL.F.’ on line 519 have a space between the ‘3’ and ‘PL’. These may seem like simple errors, and they can sometimes be easily fixed by reattaching sentence-internal characters separated by morphological boundary markers such as ‘.’ and ‘-’, but similar problems sometimes occur between non-punctuation symbols, making such reattachment heuristics trickier. Though the example in Instance 4.2 does not contain any cases where whitespace has been erroneously omitted, such cases do exist in the data. These instances are more difficult to detect, particularly if the whitespace is missing from the language line, where less is known about the language in question. Unlike the semi-structured gloss line, which contains a known vocabulary of English and grammatical terms, it is more difficult to reconstruct the intended structure of words and morphemes for a previously unseen language. In cases of either spurious or missing whitespace, the instance is discarded if the language line and gloss line do not contain the same number of whitespace-delineated tokens. Whether this mismatch is due to a conversion error or the result of the author’s formatting, since I assume that the gloss line refers one-to-one to language-line elements, a mismatch means this assumption can no longer be made. Conversely, instances where the number of whitespace-delineated tokens does match, but corruption has caused a shift, can lead to cases where one-to-one gloss alignment produces incorrect data. If the language line of the instance being cleaned were from a language with abundant resources, language models could be built that might help, such as detecting out-of-vocabulary words that are likely metadata, or, at the character level, detecting where spaces are likely erroneous. However, with the IGT data, although such methods may be possible on the gloss line, the

language line is typically from a language for which there is not enough data for this, and so this language-line corruption is ultimately left in.
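To give a flavor of the reattachment heuristic described above for the gloss line, the sketch below rejoins pieces that were split off after a period or hyphen when the next piece looks like a gram. This is an illustration only; the rules used by Intent are fuzzier than this, and the example string is a simplified version of the gloss line in Instance 4.2.

def reattach_gram_pieces(token_sequence):
    """Rejoin pieces split off after '.' or '-' by PDF extraction (sketch only).

    If a whitespace-delimited piece ends in '.' or '-' and the next piece
    starts with an uppercase, gram-like character, the two are rejoined.
    """
    pieces = token_sequence.split()
    out = []
    for piece in pieces:
        if out and out[-1][-1] in '.-' and piece[:1].isupper():
            out[-1] += piece          # reattach, e.g. '3.' + 'PL.F.'
        else:
            out.append(piece)
    return ' '.join(out)

print(reattach_gram_pieces("entered -3. PL.F. the-women(3. PL.F.)-NOM"))
# 'entered -3.PL.F. the-women(3.PL.F.)-NOM'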

Line Corruption Instances can also end up being corrupted in ways such as that in Instance 4.3, compared to the original instance in Instance 4.4. In this case, a single line that was intended to contain a word with a circumflex and several subscripts was split into two lines, separating the circumflex from the character it was meant to modify, and either splitting the subscripts or combining them with the neighboring word. In Odin, at least 36,716 out of the total 151,633 instances were flagged by the original IGT detection algorithm as exhibiting line corruption. In the cleaning process, two approaches to fixing these sorts of corruption were attempted. The first involved finding whitespace between the two lines, or cases where a combining character, such as the ‘^’ on line 27, is present in isolation, and attempting to combine the characters. In this case, this would yield the desired result of hâld, but sometimes lines

line=26 tag=M+LN: (3) Frisian
line=27 tag=L+CR: Maxi h^ d
line=28 tag=L+CR: al himi /*himsels i,j
line=29 tag=G:    Max behave.3SG him/himself
line=30 tag=T:    ‘Max behaves (himself ).’

Instance 4.3: Example of a Frisian (fry) IGT instance exhibiting line corruption, character corruption, and language names and parentheticals, extracted from de Swart (2003).

(3) Frisian
    Maxi hâld himi/*himselsi,j
    Max behave.3SG him/himself
    ‘Max behaves (himself).’

Instance 4.4: Original text of Instance 4.3 as intended by the author.

wrap instead of combining, and in these instances the appropriate behavior would be to concatenate the lines rather than merge. Due to the difficulties in attempting to merge, when such multi-line corruption is encountered, I have chosen to concatenate the lines.

4.4.2 Normalization: Removing Non-Linguistic Data

Judgments and Judgment Alternation Instance 4.2 also has another interesting phenomenon which is helpful for illustrative purposes, but is difficult to parse programmatically. This phenomenon is the alternation in judgment between the -na/-at suffixes on line 513, which the gloss line (519) clarifies are the 3.PL.F/3.SG.F suffixes, respectively. Judgment markings in themselves are a particularly interesting feature of IGT, and potentially a very useful one for providing negative examples to parsers or language models. However, as the data is often intended for display and not automated parsing, whether the ungrammaticality refers to the utterance being unacceptable, or a particular interpretation being unavailable, is often unclear without the explanatory prose. I have chosen to save that topic for future study.

Instance Numbering Many IGT instances occur with a label of some form, such as ‘a)’, ‘ix.’, or ‘[3]’ on the language line, or elsewhere, for subsequent reference in the containing publication. While a simple regex can catch most of these instances, they can also contribute to language–gloss-line misalignment.
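The following sketch shows the kind of pattern involved; the expressions here are illustrative only and are not the exact regular expressions used by Intent.

import re

# Illustrative patterns for leading instance labels such as '(3)', 'a)', 'ix.', '[3]'.
INSTANCE_LABEL = re.compile(r'''^\s*(?:
      \(\d+\)        # (3)
    | \[\d+\]        # [3]
    | [a-z]\)        # a)
    | [ivxlc]+\.     # ix.
    )\s*''', re.VERBOSE | re.IGNORECASE)

def strip_instance_label(line):
    """Remove a leading example label from a language line, if present."""
    return INSTANCE_LABEL.sub('', line, count=1)

print(strip_instance_label("(3) Maxi hâld himi/*himselsi,j"))
# 'Maxi hâld himi/*himselsi,j'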

Citations Instance 4.2 contains a reference to the publication from which the IGT instance itself was cited. When such citations occur on the translation line, as in this example, they may contribute to the English-language parser or POS-tagger producing an incorrect result. These cases are not particularly problematic, however. That said, if they occur on the language or gloss line, they can again contribute to misalignment.

             Languages with ≥1 Instance   Instances
Unfiltered   1,497                        157,974
Filtered     1,223                        61,711

Table 4.10: Result of Intent filtering on instances in Odin.

4.4.3 Intent Filtering

After the cleaning and normalization processes have been run, Intent checks whether the instances are valid and useful for further processing. A check is done on whether the number of whitespace-delineated tokens in the language and gloss lines match; if they do not, the instance is discarded. An instance is similarly discarded if it lacks a language, gloss, or translation line, or if any of those lines are empty after cleaning and normalization. Table 4.10 shows the results of this filtering procedure on the instances in Odin, which filters out 61% of the total instances.
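A sketch of this validity check is shown below; it mirrors the conditions just described, though the actual Intent implementation differs in its details, and the example strings are hypothetical.

def keep_instance(lang_line, gloss_line, trans_line):
    """Decide whether a cleaned/normalized IGT instance is kept (sketch only).

    All three lines must be present and non-empty, and the language and
    gloss lines must have the same number of whitespace-delimited tokens.
    """
    lines = (lang_line, gloss_line, trans_line)
    if any(line is None or not line.strip() for line in lines):
        return False
    return len(lang_line.split()) == len(gloss_line.split())

print(keep_instance("Plaar mee léeg-i", "father 1.SG send-PRES.3.SG", "Your father is sending me."))
# True
print(keep_instance("Plaar mee léeg-i Pashto", "father 1.SG send-PRES.3.SG", "..."))
# False  (token-count mismatch between language and gloss lines)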

Cleaning and Normalization Failures Neither properly cleaned instances nor properly filtered instances contribute erroneous data. The cleaning and normalization processes can, however, make errors that do not get caught and filtered out. For instance, Instance 4.5 shows an IGT instance where the language-line cleaning has correctly removed the author citation, but has erroneously removed the token *(dee) using the regular expression /\*.*[a-z]\)/ intended to remove instance numberings. Meanwhile, the language name Pashto was not removed. As a result, the numbers of whitespace-delimited language-line tokens and gloss-line tokens match, but the language line is incorrectly glossed.

(a) Original IGT instance, before cleaning:
    Plaar mee *(dee) léeg-i Pashto
    father 1.SG 2.SG send-PRES.3.SG
    ‘Your father is sending me.’

(b) Cleaned IGT instance with faulty lang–gloss alignment:
    Plaar mee léeg-i Pashto
    father 1.SG 2.SG send-Pres.3.SG
    ‘Your father is sending me’

Instance 4.5: A Pashto (pst) instance extracted from Neeleman and Szendrői (2007) showing faulty cleaning that results in an incorrectly glossed language line.

4.4.4 Future Work in Cleaning

These are just a few of the issues with noise that are present in the IGT instances found in the Odin database. At the time of writing, there are two ongoing projects to address some of the noise issues identified here. The first is a continuation of the work done by the information engineering and synthesis for Resource-Poor Languages (RiPLes) project (Xia et al., 2016), using PDFLib’s Text and Image Extraction Toolkit (TET) (PDFLib GmbH, 2015), which performs substantially better than the original pdftotext program (Foo Labs, 2014) used for the initial conversion. The second approach uses human annotators to manually clean the extracted instances, using the interface developed as a part of this work and discussed further in Section 8.8.

While data from these projects is not available at the time of writing, I expect that many of the results presented here would benefit modestly by use of the cleaner data. Any amount of noise is likely to affect the performance of these IGT-based systems, particularly those languages for which the number of available instances is lower. Minimally, and crucially, data cleaning will increase the number of IGT instances available for downstream processing. Even a modest increase in the number of available instances could improve the performance of the tools developed over the data, and possibly, the number of languages that can be directly affected.

4.4.5 Consistency Issues

Beyond issues of cleanliness and normalization, IGT data has another issue: consistency between authors. While there is a set of guidelines for how IGT data are to be represented in linguistics papers, in the form of the Leipzig Glossing Rules (LGR; Bickel et al., 2008), these guidelines are not always strictly followed. In some instances, authors delineate all morphemes in the language line with hyphens; in others, they are not separated. Many rules are also optional, such as using a colon when morphology is present but the author does not wish to show the segmentation. As a result, not all rules are used consistently across instances, so a system looking to deal with these glosses must be robust to such inconsistencies.

4.5 Data Formats

The corpora listed above are made available in various data formats. The UD-2.0 corpus is distributed in the CoNLL-2009 format (Hajič et al., 2009), and the HUTP, XL-IGT, and CTN corpora were provided in various text formats. The 2.1 version of Odin and the RG-IGT corpora are currently available in Xigt-XML, a format first presented by Goodman et al. (2014) that can preserve the original text of an IGT instance while allowing for terse standoff annotation layers. The Xigt-XML format is also used internally by the Intent package that will be discussed in Chapter 8. Sample 4.1 shows a snippet of a Hindi IGT instance rendered in the Xigt-XML format, showing the raw text in the top annotation “tier,” as well as word segmentation and POS tags. For a full discussion of this data format, refer to Goodman et al. (2014).

rAma ne mohana ko nIlI kiwAba xI
ram erg Mohan acc blue book gave
Ram gave a blue book to Mohan
NNP PSP ...

Code Sample 4.1: Example of the Xigt-XML data format (the XML markup of the original sample is not reproduced here).
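Because the XML markup of Sample 4.1 does not survive in the extracted text above, the following is only a schematic sketch of the tier-and-item layout used by Xigt-XML, reconstructed from my reading of Goodman et al. (2014); the element and attribute names shown are approximate and should be checked against the Xigt documentation rather than taken as the exact schema.

<!-- Schematic sketch only: layout approximates Xigt-XML; names may differ. -->
<igt id="i1" language="hin">
  <tier type="odin" id="r" state="raw">
    <item id="r1">rAma ne mohana ko nIlI kiwAba xI</item>
    <item id="r2">ram erg Mohan acc blue book gave</item>
    <item id="r3">Ram gave a blue book to Mohan</item>
  </tier>
  <tier type="words" id="w" segmentation="r">
    <item id="w1" segmentation="r1[0:4]"/>   <!-- rAma -->
    <item id="w2" segmentation="r1[5:7]"/>   <!-- ne -->
    <!-- ... -->
  </tier>
  <tier type="pos" id="w-pos" alignment="w">
    <item id="pos1" alignment="w1">NNP</item>
    <item id="pos2" alignment="w2">PSP</item>
    <!-- ... -->
  </tier>
</igt>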

Chapter 5 WORD ALIGNMENT

One of the features of IGT that makes it extremely useful for resource-poor languages is the gloss line. As discussed in Section 3.1, the gloss line can be used to align the language line with the translation line using matching tokens, making it possible to transfer annotation along these alignments. This chapter will discuss how IGT as a resource can be used to obtain high-quality alignments between the language and translation lines using a variety of methods. Due to IGT’s unique format, these alignments are of a much higher quality than would otherwise be achievable for the same amount of data using traditional statistical alignment methods. As intra-IGT word alignments of this type are required for projection-based transfer methods, obtaining alignments that are precise and have as broad a coverage as possible is extremely important for any task requiring projected data. Given the importance of obtaining high-quality word alignment between language and translation lines, I will discuss and compare two different approaches for aligning IGT data, using the manually aligned instances in the RG-IGT and XL-IGT corpora as evaluation targets. A heuristic approach based upon repeated tokens, with several variations, is described in Section 5.2, and several different types of traditional statistical approaches are discussed in Section 5.3. Finally, I also discuss some methods by which the two approaches may be combined in Section 5.5.

5.1 Evaluating Word Alignment within IGT Instances

In order to evaluate and compare the different methods and settings used for the experiments in this section, I will be using the manual alignments in the RG-IGT and XL-IGT corpora.

When evaluating, every alignment between one word and another is considered. Each of these word-to-word pairs is counted in the alignment to be evaluated, and compared to the word-to-word pairs in the gold standard to calculate precision, recall, and F1-Score. For the purposes of this work, I will not differentiate between sure and possible alignments, and will consider every alignment in the gold standard as a sure alignment.
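Concretely, the computation over sets of word-to-word pairs looks like the following sketch (an illustration only, not the evaluation code used to produce the reported numbers).

def alignment_prf(predicted, gold):
    """Precision, recall, and F1 over sets of (source, target) alignment pairs.

    All gold pairs are treated as 'sure' alignments, as described above.
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

print(alignment_prf({(1, 1), (2, 2), (4, 7)}, {(1, 1), (2, 2), (3, 6), (4, 7)}))
# (1.0, 0.75, approx. 0.857)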

5.2 Heuristic Alignment

The first approach to alignment leverages the basic format of IGT, discussed in Section 3.1; namely, the repetition of tokens from the gloss line in the translation line. As the gloss line is intended to be an intermediary step between the foreign language and the language of the reader, the gloss line is typically populated with many of the same words (or synonyms) found in the translation. Producing an alignment, then, can be done programmatically much in the same way that a human reader might do so: by finding the words in the translation and gloss that share meanings, and then associating the two. One major complication, as I will discuss, is precisely what “sharing a meaning” means for certain gloss and translation tokens, as this is not always entirely straightforward.

I examined the use of six different manipulations of the gloss line in order to achieve the best possible alignment: (1) Basic String Matching, (2) Tokenizing for Morphology, (3) Allowing Multiple Matches, (4) Stemming Words, (5) Matching Grammatical Markers, and (6) Matching POS Tags. As each heuristic was added, the alignment algorithm was run and evaluated against the manual alignments in the RG-IGT and XL-IGT corpora.

The rest of this section discusses the implementation of these approaches, and the evaluation is shown in Fig. 5.1 and Table 5.2.

Language:    inepo mache’eta-m into kuchi’i-m kecha-k .
Gloss:       1SG machete-PL and knife-PL put.up.SG.OBJ-PST .
Translation: I put up the machete and the knife .

Instance 5.1: IGT Instance in Yaqui (yaq) showing how morphology is given on the gloss line. Gloss token #5, ‘put.up.SG.OBJ-PST’, shows both the morpheme marker ‘-’, which typically marks morpheme boundaries, and the combining marker ‘.’ indicating that multiple aspects are contained within a single morpheme.

String Matching

The simplest baseline approach to aligning the translation and gloss lines is tokenizing the strings on whitespace and searching for exact string matches. This serves as a strong baseline, given the F1-scores shown, yet there are many reasons why a token on the gloss line does not match a token on the translation line exactly: differences in capitalization, inflection, or the use of non-language tokens and symbols are among the issues that the other approaches address. As an extremely simple first pass beyond the baseline system, I perform case-insensitive string matching; this improves the alignment performance somewhat, but not drastically.

Tokenizing for Morphology

As is shown in Instance 5.1, gloss tokens do not merely consist of the words used in the translation, but also contain morphological information and inflection. According to the Leipzig Glossing Rules (Bickel et al., 2008), morphemes should be delineated in the gloss by hyphens (‘-’), clitics by equals signs (‘=’), and inflection or other information that is not morphological in nature with periods (‘.’). While this is not always the case in the Odin data, I would still like to be able to recover alignments between strings that are separated by

these characters. For instance, in Instance 5.1, I would like the system to create the alignment 5:{2,3}, mapping the gloss token ‘put.up.SG.OBJ-PST’ to both ‘put’ and ‘up’. For the system to find matches like these, it tokenizes each portion of the gloss line, splitting the strings on the morpheme separators (-, =, and the compound gram marker, the period ‘.’). There were also a few cases where parentheses and colons were used, so I included these as well. Tokenizing and performing case-insensitive matching boosted the F1-score from 0.71 to 0.83, a 41% reduction in error.
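A sketch of this splitting step is shown below; the separator set is the one just described, though the exact tokenization code in Intent may differ in its details.

import re

# Separators treated as morpheme/gram boundaries when splitting gloss tokens
# for matching: hyphen, equals sign, period, plus the occasional parentheses
# and colons observed in the data.
GLOSS_SEPARATORS = re.compile(r'[-=.():]+')

def gloss_subtokens(gloss_token):
    """Split a gloss token into matchable pieces, dropping empty strings."""
    return [piece for piece in GLOSS_SEPARATORS.split(gloss_token) if piece]

print(gloss_subtokens("put.up.SG.OBJ-PST"))
# ['put', 'up', 'SG', 'OBJ', 'PST']
print(gloss_subtokens("the:DAT"))
# ['the', 'DAT']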

Finding Multiple Matches

I make one further simple modification to finding matches when there are repetitions of a token on one or both lines of the gloss or translation. Instance 5.2 shows two instances where this occurs. When a token is found that has multiple occurrences on one or more

Language:    i mwuncey-nun ku mwuncey-wa kath-ta .
Gloss:       this problem-Top that problem-as same .
Translation: This problem is the same as that problem.

(a) IGT Instance from Korean (kor) demonstrating repetition of words. ‘problem’ is repeated twice in both gloss lines and translation lines, so alignment is done from left to right in order of the pairs.

Language:    nanomboka niteny ity tonon-kira ity Rabe indroa .
Gloss:       began knock this door this Rabe twice .
Translation: Rabe twice began to knock on this door.

(b) In this example from Malagasy (mlg), ‘this’ is repeated twice in the gloss, but appears only once in the translation. It is possible that the occurrence in ‘this Rabe’ is not meant to align with the one in ‘this door’, but the algorithmic output is in keeping with the gold standard.

Instance 5.2: Examples of multiple occurrences of tokens between gloss and translation lines.

lines, the words are aligned in order of occurrence left-to-right until one side runs out of occurrences, and then the remaining tokens are aligned to the last token on the other side. In Instance 5.2a, the alignment is one-to-one and done left to right. In Instance 5.2b, there are multiple occurrences of ‘this’ on the gloss line, but only one on the translation line. This modification aligns both gloss occurrences of ‘this’ with the single occurrence on the translation line. While it may be that the occurrence in ‘this Rabe’ is meant to be a determiner which is unaligned in the translation, such an ambiguity is hard to discern with certainty, and the gold standard alignment agrees with the output of the algorithm for this case.

Finding Root Forms

While splitting the gloss line by morpheme improves performance, the words on the gloss line are not always inflected as they would be in the translation. Instance 5.3 gives an example of this issue, showing an instance of Malagasy where the token on the gloss line ‘dry-IV-VN’ does not match the inflected form found in the translation, ‘drying’.

To solve this mismatch, I next attempted a chain of stemming approaches, starting with the implementation of the snowball stemmer (Porter, 2001) found in NLTK (Bird et al., 2009). If a stem could not be found this way, NLTK’s implementation of the morphy lemmatizer from WordNet (Princeton University, 2010) is used to attempt to lemmatize

verbs, nouns, or adjectives.1 After this modification, the performance is boosted to an F1-score of 0.93.

1 I did not use the morphy package directly, as the software for this thesis was written in Python; NLTK is easily installable for Python, and this allowed for a simpler code implementation.
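The fallback chain can be sketched as follows; this requires NLTK and its WordNet data, and is an illustration of the approach rather than the exact code used here.

# Sketch of the stemming fallback described above; requires NLTK plus its
# WordNet data (nltk.download('wordnet')).  The surrounding matching logic is
# simplified: we just produce a set of candidate root forms per word.
from nltk.stem import SnowballStemmer
from nltk.corpus import wordnet as wn

stemmer = SnowballStemmer("english")

def root_candidates(word):
    """Return lowercase candidate root forms: the snowball stem, plus any
    WordNet (morphy) lemmas for verb, noun, and adjective readings."""
    word = word.lower()
    candidates = {word, stemmer.stem(word)}
    for pos in (wn.VERB, wn.NOUN, wn.ADJ):
        lemma = wn.morphy(word, pos)
        if lemma:
            candidates.add(lemma)
    return candidates

# 'drying' from the translation and 'dry' from the gloss now share a candidate:
print(bool(root_candidates("drying") & root_candidates("dry")))   # True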

Language:    ìigaa tanàa gàatsanèewaa .
Gloss:       gown 3fs-CONT at dry-IV-VN .
Translation: The gown is drying .

Instance 5.3: IGT Instance from Malagasy (mlg) demonstrating the need for finding the root forms of morphs in the gloss line. Gloss token #4, ‘dry-IV-VN’ shows the word ‘dry’ whereas the translation uses the English inflected form ‘drying’.

Finding Grammatical Markers (Grams)

Finally, the presence of non-English subwords such as ‘1SG’ or ‘MASC’ in the gloss line is part of the power of the IGT format and is key to allowing a heuristic aligner to pick up

Language:    Cheisiodd Gwyn ddim beidio ag ateb y cwestiwn.
Gloss:       try-PAST-3SG Gwyn NEG NEG with answer the question.
Translation: Gwyn did n’t try to not answer the question .

(a) IGT Instance from Welsh (cym) demonstrating how the two ‘NEG’ grams representing negation are aligned with ‘n’t’ and ‘not’ when gram matching is enabled.

Language:    inepo Diana-ta bicha-k , apoik achai into ketchia .
Gloss:       1SG Diana-NNOM.SG see-PST , 3SG.POSS father and too .
Translation: I saw Diana and her father .

(b) IGT Instance from Yaqui (yaq) demonstrating how the ‘1SG’ and ‘3SG.POSS’ tokens from the gloss line are aligned correctly to ‘I’ and ‘her’ on the translation line when gram matching is enabled.

Instance 5.4: IGT instances demonstrating the importance of looking for grammatical markers, or “grams,” on the gloss line.

Gram   Possible Meanings
1sg    i, me
2sg    you
2pl    you
3sg    he, she, him, her
3sgf   she, her
3sgm   he, him
3pl    they, their
poss   his, her, my, their
neg    n’t, not
det    the

Table 5.1: Gram-to-word lookup table used to match grams with possible translations.

a few more matches. Instance 5.4 shows two IGT instances in which matching the grams leads to improvements in alignment. After the previous matching has been performed, a second pass looks for gloss tokens in a user-defined list and looks for possible matches via a simple table lookup, matching either full words or sub-word-level tokens. The list containing the mappings can be found in Table 5.1. This list is not exhaustive, but is targeted at the fact that IGT instances frequently provide only the person/number gloss for pronouns, while using the lexical item on the translation line.
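This second pass can be sketched as a table lookup over the sub-tokens of each gloss token, as below; only a slice of Table 5.1 is included, and the matching around it is simplified relative to the actual heuristic.

# A slice of the gram-to-word lookup in Table 5.1 (sketch only).
GRAM_TO_WORDS = {
    "1sg": {"i", "me"},
    "3sg": {"he", "she", "him", "her"},
    "poss": {"his", "her", "my", "their"},
    "neg": {"n't", "not"},
}

def gram_matches(gloss_token, trans_tokens):
    """Return translation-line indices (1-based) that a gloss token's grams map to."""
    grams = {g.lower() for g in gloss_token.replace("-", ".").split(".")}
    targets = set()
    for gram in grams & GRAM_TO_WORDS.keys():
        targets |= GRAM_TO_WORDS[gram]
    return [i for i, tok in enumerate(trans_tokens, start=1) if tok.lower() in targets]

trans = "I saw Diana and her father .".split()
print(gram_matches("3SG.POSS", trans))   # [5]  -> 'her'
print(gram_matches("1SG", trans))        # [1]  -> 'I'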

Matching POS Tags

The final alignment heuristic I experimented with was using the POS tags of the remaining unaligned gloss and translation tokens to attempt to recover additional alignments. POS tags are generated on the gloss tokens with the classifier in Section 6.3, and POS tags on the translation line are generated using the Stanford Tagger (Toutanova et al., 2003), both trained using the universal POS tagset (Petrov et al., 2012) mentioned in Section 6.1. Next, the tokens that remain unaligned after the previous methods are aligned in the same way as described in the multiple-matches section: left-to-right and one-to-one, and any remaining tokens on either side are aligned with the rightmost token of the other.

Heuristic Alignment System Comparison (Precision / Recall / F1):

                     RG-IGT               XL-IGT
Baseline             0.95 / 0.46 / 0.62   0.95 / 0.57 / 0.71
+Case-Insensitive    0.96 / 0.51 / 0.67   0.97 / 0.62 / 0.75
+Tokenization        0.95 / 0.70 / 0.81   0.97 / 0.73 / 0.84
+Multiple Matches    0.97 / 0.70 / 0.81   0.97 / 0.72 / 0.83
+Morphing            0.97 / 0.73 / 0.83   0.96 / 0.78 / 0.86
+Grams               0.97 / 0.73 / 0.83   0.96 / 0.78 / 0.86
+POS                 0.92 / 0.84 / 0.88   0.80 / 0.82 / 0.81

Figure 5.1: Performance of the heuristic matching approach between gloss and translation as heuristics are added going from top to bottom.

Summary

Fig. 5.1 shows the results of incrementally adding heuristic approaches to the alignment algorithm on both RG-IGT and XL-IGT corpora. The baseline approach, matching only full strings between translation and gloss lines, gives an F1-score of 0.62 on the RG-IGT data, and 0.71 on the XL-IGT data. While the precision of this baseline approach is extremely high at 0.95, the recall is very low, at 0.46 and 0.57 for the two corpora. Each additional heuristic improves recall, culminating in 0.84 on the RG-IGT data and 0.82 on the XL-IGT data. The precision remains nearly the same for all heuristics, save the POS matching heuristic, where despite the increase in recall, the precision drops significantly. Despite the drop in precision, the RG-IGT data still shows an increase in F1-score from 0.83 to 0.88, while the XL-IGT

data drops from 0.86 to 0.81. The addition of heuristics up until the POS tag matching improved overall performance, and so the heuristic system to use in other experiments will be the system containing the set of heuristics leading up to, but excluding the POS tag matching, and referred to as “Heur.” Very high-precision alignments might be more useful for some tasks, while higher recall, or more balanced approaches might be better overall, so given the large increase in recall that the POS tag matching system offers, this system will also be present in subsequent comparisons, labeled as “Heur +POS.”

Table 5.2 shows the precision, recall, and F1-scores for the “Heur” system on both corpora, broken down by language. Given the small size of the corpora, I am wary of drawing conclusions about differences in the individual languages, but the variation between languages also corresponds with variation between documents and authors, and so is a useful reminder that conventions and formatting of IGT instances may differ greatly among the data in Odin. In summary, this heuristic approach to word alignment within IGT is simple, straightforward, and viable for languages with as few as a handful of IGT instances. Despite the multiple strengths of this approach, it is rule-based, and performs the same whether given ten IGT instances or ten thousand. Given the domain of resource-poor languages, such a rule-based approach is a good place to start, but an important path of inquiry is whether performance can be improved when additional data is available. The next section explores the possible ways in which the large and expanding set of instances in Odin might be harnessed en masse to improve IGT alignment performance.

5.3 Statistical Alignment

The heuristic approach above takes advantage of IGT’s unique format to align the language line and translation line by using the gloss line as a pivot. The typical approach to such

XL-IGT
Language   Precision  Recall  F1-score
Gaelic     0.97       0.90    0.94
German     0.93       0.81    0.87
Hausa      0.98       0.66    0.79
Korean     0.98       0.73    0.83
Malagasy   0.97       0.89    0.93
Welsh      0.96       0.79    0.87
Yaqui      0.99       0.72    0.83

RG-IGT
Language   Precision  Recall  F1-score
Bulgarian  0.93       0.66    0.77
French     0.97       0.73    0.83
German     0.96       0.76    0.85
Italian    1.00       0.61    0.76
Spanish    1.00       0.64    0.78

Table 5.2: Results of the heuristic alignment by language, without using POS tags.

bilingual alignment would be to use a statistical alignment system for phrasal alignment, such as Giza++ (Och and Ney, 2003b), to align language and translation lines directly. I will examine this approach in Section 5.3.1, while laying out an alternate approach to statistical alignment in Section 5.3.2 that aligns gloss lines with translation lines in a manner similar to the heuristic approach. Evaluation was performed following the description in Section 5.1, and results are shown in Fig. 5.2.

5.3.1 Baseline Statistical Alignment Approach

As a baseline approach to aligning two languages using parallel sentences, I build a set of parallel sentences between the language lines and translation lines of all the instances for a given language in Odin, and use mgiza (Gao, 2013) to produce word alignments. This approach is labeled as “L-T” in Fig. 5.2, as this approach aligns the language line (‘L’) directly with the translation line (‘T’). As noted in Chapter 4, these corpora are relatively small, sometimes only a few hundred sentences, which is typically not large enough to achieve good word alignment results. This is reflected clearly in the results, as the L-T systems perform well below any other alignment system shown here, with F1-scores of 0.47 and 0.51 for the RG-IGT and XL-IGT data, respectively.
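The sketch below shows how such an L-T bitext might be assembled; it assumes IGT instances expose language, translation, and language-code attributes (illustrative names), and it leaves out the actual mgiza invocation, which depends on the local installation and its usual preprocessing pipeline.

# A minimal sketch of writing the L-T bitext for one language, one sentence pair per line.
# The .language / .translation / .lang_code attribute names are assumptions for illustration.
def write_bitext(instances, lang_code, src_path, tgt_path):
    with open(src_path, 'w', encoding='utf-8') as src, \
         open(tgt_path, 'w', encoding='utf-8') as tgt:
        for igt in instances:
            if igt.lang_code != lang_code or not igt.translation:
                continue
            # One language line and its translation per line; whitespace-normalized.
            src.write(' '.join(igt.language.split()) + '\n')
            tgt.write(' '.join(igt.translation.split()) + '\n')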

[Bar graph omitted; it compares precision, recall, and F1 on the RG-IGT and XL-IGT corpora for the L-T, L-T+Heur, G-T, G-T+Heur, G-T+ODIN, and G-T+ODIN+Heur systems, with the Heuristic +POS system shown for reference.]

Figure 5.2: Graph showing performance of the different statistical approaches on the different corpora, with the Heuristic +POS system for comparison.

5.3.2 Gloss/Translation-Based Approach

While using the language and translation lines directly is possible, and has the benefit of being able to use the IGT instances in the Odin database that do not include a gloss line, it is also possible to align the IGT instances based upon their translation and gloss lines, as in the heuristic approach. Rather than creating a parallel corpus of language and translation lines for this approach, the parallel corpus consists of gloss lines and translation lines for each language. This simple change improves alignment F1-scores from the L-T system’s 0.47 and 0.51 to 0.78 and 0.62 on the RG-IGT and XL-IGT corpora.

There are even more substantial gains to be made over using only the gloss and translation lines for a single language, however. Using the gloss line affords a sort of “pseudo-language,” as described in Section 3.1.3, that is found in every IGT instance with an English translation

throughout Odin. Therefore, instead of attempting to create a statistical alignment based upon only the several hundred sentences for a selected language, the entire set of 151,633 instances across all languages can be used. This does in fact increase performance, as seen in Figs. 5.2 and 5.6, improving F1-scores in the RG-IGT data from 0.78 to 0.84, and in the XL-IGT data from 0.62 to 0.78. While this improvement is notable, the gloss-line “pseudo-language” presents a particular challenge to a statistical word alignment system in that the word order is variable. Its order reflects that of the glossed language, and therefore is not fixed over the whole set of Odin instances. As a consequence, it is likely that further experiments on tweaking the distortion parameter for the alignment model might boost performance even higher.

5.3.3 Other Settings

Two other settings that I looked at with the statistical alignment systems were the choice of symmetrization heuristic and software package. Symmetrization heuristics are the methods by which bidirectional alignments are combined. Fig. 5.3 shows a hypothetical set of unidirectional alignments between English→French and French→English. Also shown are two potential solutions for combining these alignments: either by taking the Intersection of the alignments and selecting only the word pairs that are aligned in both directions, or the Union of all word pairs aligned in either direction.

[Illustration omitted: the English token “not” aligned with French “ne pas” in the E→F and F→E directions, together with the resulting Intersection and Union alignments.]

Figure 5.3: Illustration of resolving a potential difference in unidirectional alignments between French and English using Intersection and Union symmetrization heuristics.
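The two simplest symmetrization heuristics pictured in Figure 5.3 can be sketched directly as set operations over alignment links; here alignments are represented as sets of (source_index, target_index) pairs, which is an illustrative representation rather than any particular toolkit’s format.

# A small sketch of the Intersection and Union symmetrization heuristics.
def intersect(e2f, f2e):
    """Keep only links present in both directions (high precision)."""
    return e2f & {(e, f) for (f, e) in f2e}

def union(e2f, f2e):
    """Keep links present in either direction (high recall)."""
    return e2f | {(e, f) for (f, e) in f2e}

# For the 'not' / 'ne ... pas' example: if English->French proposes {(0, 0)} and
# French->English proposes {(0, 0), (1, 0)}, the intersection keeps only (0, 0),
# while the union also keeps the link between 'pas' and 'not'.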

Symmetrization Heuristics Across the XL-IGT and RG-IGT Corpora (Precision / Recall / F1):

                   G-T                  G-T+ODIN             G-T+ODIN+Heur
None               0.54 / 0.66 / 0.59   0.66 / 0.79 / 0.72   0.68 / 0.81 / 0.74
Grow-Diag-Final    0.46 / 0.73 / 0.57   0.57 / 0.83 / 0.68   0.61 / 0.84 / 0.71
Intersection       0.75 / 0.56 / 0.64   0.84 / 0.75 / 0.79   0.84 / 0.77 / 0.80
Union              0.44 / 0.75 / 0.55   0.55 / 0.84 / 0.66   0.60 / 0.85 / 0.70

Figure 5.4: Precision, Recall, and F1-score for the different symmetrization heuristics on the statistical approaches on the XL-IGT and RG-IGT corpora.

To compare symmetrization heuristics, I ran union, intersection, and grow-diag-final, and compared each of these against the single translation→gloss alignment. Figure 5.4 shows the results of these experiments. While the grow-diag-final and union methods did boost recall over intersection, the intersection approach was higher in precision than any other method by roughly 0.2 (absolute), and thus still comes out ahead of all other methods in terms of F1-score. As a result, I used the intersection symmetrization heuristic as the default for statistical alignment in all other experiments.

Finally, to compare software packages, I also ran the fastalign software package (Dyer et al., 2013), an optimized reparameterization of IBM Model 2, on the G-T and G-T+Heur systems. The graphs in Fig. 5.5 show the comparison. While performing roughly the same in recall, mgiza does outperform fastalign, with substantially better precision when evaluating on the XL-IGT data and modestly better precision on the RG-IGT data, for an overall higher F1-score. The mgiza package also has the further advantage of being able to precompute some

FastAlign vs. mgiza System Comparison (Precision / Recall / F1):

                 XL-IGT               RG-IGT
FastAlign        0.48 / 0.48 / 0.48   0.83 / 0.66 / 0.73
mgiza            0.76 / 0.45 / 0.57   0.91 / 0.68 / 0.78
FastAlign+Heur   0.69 / 0.69 / 0.69   0.92 / 0.75 / 0.83
mgiza+Heur       0.92 / 0.69 / 0.79   0.95 / 0.75 / 0.84

Figure 5.5: Comparisons of the fastalign (Dyer et al., 2013) software package with mgiza (Gao, 2013).

of the alignments and “force” further alignments using a transductive approach,² resulting in much faster program execution time. For these reasons, I use mgiza for the statistical alignments in this thesis.

5.4 Combining Statistical and Heuristic Alignment

In an attempt to combine the precision of the heuristic approach with the possible breadth of the statistical approach, I also performed heuristic alignment across the Odin corpus, and added the matching gloss/translation token pairs identified by the heuristic aligner to the input of the statistical aligner as additional parallel sentences. As the graph in Fig. 5.2 shows, this improves performance greatly over the baseline, and when used in conjunction with the gloss/translation lines from the Odin database, results in the best performance among the statistical systems.

² https://web.archive.org/web/20160322220830/http://kyloo.net/software/doku.php/mgiza:forcealignment
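The combination just described can be sketched as follows: the heuristic aligner’s high-confidence gloss/translation token pairs are appended to the gloss/translation bitext as extra one-token “sentence” pairs before the statistical aligner is trained. This is a sketch under assumed attribute and function names; heuristic_align() stands in for the Heur system and is not defined here.

# A rough sketch of building the combined training bitext for the statistical aligner.
def build_combined_bitext(instances, heuristic_align):
    gloss_lines, trans_lines = [], []
    for igt in instances:
        # The full gloss/translation sentence pair...
        gloss_lines.append(igt.gloss)
        trans_lines.append(igt.translation)
        # ...plus each heuristically matched token pair as an extra one-token pair.
        for gloss_tok, trans_tok in heuristic_align(igt):
            gloss_lines.append(gloss_tok)
            trans_lines.append(trans_tok)
    return gloss_lines, trans_lines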

Comparison of Alignment Methods (Precision / Recall / F1):

                 RG-IGT               XL-IGT
G-T              0.91 / 0.66 / 0.77   0.72 / 0.54 / 0.62
G-T+ODIN         0.92 / 0.77 / 0.84   0.82 / 0.74 / 0.78
G-T+ODIN+Heur    0.94 / 0.79 / 0.86   0.82 / 0.77 / 0.80
Heur –POS        0.97 / 0.73 / 0.83   0.96 / 0.78 / 0.86
Heur +POS        0.92 / 0.84 / 0.88   0.80 / 0.82 / 0.81

Figure 5.6: Precision, Recall, and F1-score comparing alignment methods.

5.5 Future Work

While the experiments here cover some of the ways in which IGT instances can be aligned, there are still other methods that I have not explored; I will briefly mention a few that may be attempted in future work.

5.5.1 “Clue-based” Alignment

Tiedemann (2003) described an interesting approach to word alignment, wherein heuristics are run between words in a language pair to build up a “clue matrix” that is used to determine alignment. The heuristics include the Dice coefficient, for identifying co-occurrence throughout the corpus, and the longest common sub-sequence ratio (LCSR), for identifying cognates and matching POS tags. The heuristics are combined to produce a score for each word pair in the clue matrix, and then a Viterbi-like algorithm is used to select the best

combination of alignments from the matrix. This approach seems extremely well-suited to combining a statistical analysis of the translation/gloss lines with the strong string-similarity cues available in IGT, along with the ability of the Intent system to produce gloss-line POS tags.

5.5.2 Constrained Giza++ Search

A second approach, outlined by Gao et al. (2010), plays to the strength of being able to obtain high-precision alignments through IGT, while allowing for the weakness of incomplete coverage in the provided alignment examples. The essence of this approach is to use a source of high-precision alignments to constrain the EM hill-climbing of the word alignment. In brief, this constraint means that for an alignment a = {a_1, a_2, ..., a_j} and a partial alignment α = {α_1, α_2, ..., α_j} on the sentence pair (f, e), the translation probability t(f, a | e, α) will be zero if the alignment a does not fit the constraints of the partial alignment α. This approach is extremely promising for maintaining the high precision of the alignments obtained by the heuristic approach, while recovering some of the tokens that the heuristic approach leaves unaligned.

5.5.3 Bootstrapping Alignment for Other Bitexts

With IGT containing language and translation lines, and with extremely high-precision word pairs identifiable through the heuristic alignment methods, the resource provides both full parallel sentences and likely word pairings between the English translation line and the target language. Although I have focused primarily on very resource-poor languages in this thesis, there may be languages for which parallel corpora with English are available. For these languages, it would be possible to use the word pairs produced by the heuristic approach to seed the word alignment of a statistical alignment approach. Using the aligned word pairs from the IGT data in this way

might not improve performance for languages where a large number of parallel sentences are available, but for languages with only minimal parallel corpora, this kind of bootstrapping might produce significant gains in word alignment performance.

5.6 Summary

In this chapter, I laid out two main approaches to finding word alignments within IGT instances: heuristic and statistical. I also looked into combining the approaches by feeding the results of the heuristic matches into the statistical training data, in order to combine the higher-recall statistical approach with the higher-precision heuristic method. In the end, the Heur +POS system achieved an F1-score of 0.81 on the XL-IGT corpus data and 0.88 on the RG-IGT data, to the best statistical system’s 0.80 and 0.86. Given that both of these evaluation corpora are relatively small, these differences may not be particularly significant, but it seems that optimizing for high precision with the fairly noisy Odin data achieves the most favorable word alignment results. While the heuristic systems with and without POS tags were largely similar in F1-score, due to the Heur –POS system’s superior precision, I will use this system as the default heuristic system elsewhere in this thesis. Similarly, the mgiza software outperforms FastAlign, intersection works as the best symmetrization method, and the G-T+ODIN+Heur system outperforms the other statistical methods. I will use this combination of settings as the default for the statistical alignment method, referring to it elsewhere in the thesis as simply Stat for the G-T+ODIN setting, or Stat+Heur for the G-T+ODIN+Heur setting. While aligning IGT instances may be helpful in developing word lists on its own, the primary reason for its inclusion here is as an upstream task for the projection of linguistic information such as part-of-speech tags and dependency parses. In the following chapters, Chapter 6 and Chapter 7, I will use the results of these aligned IGT instances to perform further enrichment.

Chapter 6 PART-OF-SPEECH TAGGING

The next task I will describe is how IGT data may be used to perform part-of-speech (POS) tagging. POS tagging is a sub-task of many other annotation tasks, such as dependency parsing. Automatically created annotators that could apply POS tag labels to IGT data could be generally useful, for example in linguistic fieldwork or large-scale typological studies. In this section, I will describe two primary goals, the first being using IGT to produce POS tags for the target language line of IGT. The second goal is using this POS-tagged IGT to train POS taggers that can be used to tag novel, monolingual data. Within the first goal of obtaining POS tags for IGT instances, I will look at three approaches. The first of these approaches will be the standard projection-based IGT Tagger (Section 6.2), while the second will be an IGT Tagger based upon classification of the gloss line (Section 6.3). A third approach will look at how these first two approaches may be combined (Section 6.4). I will discuss research done as a case study in using both of these approaches on an endangered language in Section 6.5, and finally, how both of the approaches for obtaining language-line POS tags could be used to train monolingual POS taggers (Section 6.6).

6.1 Task Overview

The tasks that I will be describing in this section are the POS tagging of both the language lines of IGT instances from the RG-IGT corpora and monolingual language data from the UD-2.0 corpus. Gold-standard tags from the Petrov et al. universal tagset are available on both the RG-IGT and UD-2.0 corpora, and both tasks will be evaluated for their tagging accuracy on the gold-standard tags provided by the respective corpus.

LANG nnisaau daxalna makaatibahunna
GLOSS the-women(3.PL.F.)-NOM entered-3.PL.F office(PL.)-ACC-their(F.)
TRANS “The women have entered their offices.”

Instance 6.1: An Arabic (ara) IGT instance of the sentence “The women have entered their offices”.

To give a concrete example of how IGT may be used to accomplish POS tagging in these settings, Instance 6.1 shows a typical IGT instance in Arabic. In this example, there are several English words that are shared between the gloss and the translation, and the gloss is extensively marked-up. A linguist reading this example will immediately notice that there is a substantial amount of information provided about each word in the Arabic.

Section 3.1.2 gave a high-level overview of using a projection algorithm to run English-language tools on the translation line and then transfer this annotation to the language line. Instance 6.2 illustrates just such a case, where the POS line at the bottom shows the POS tags assigned by an English-language POS tagger. These tags are associated with the English words on the TRANS line, which in turn are aligned with portions of the tokens on the GLOSS line.

LANG nnisaau daxalna makaatibahunna

PROJECTION (NOUN) (VERB) (NOUN)

GLOSS the-women(3.PL.F.)-NOM entered-3.PL.F office(PL.)-ACC-their(F.)

TRANS “ The women have entered their offices”

POS DET NOUN VERB VERB PRON NOUN

Instance 6.2: An illustration of the IGT instance from Instance 6.1 where the translation has been POS tagged, and then projected via the gloss line.

LANG Di Gianni rimpiango di non essermi ricordata

GLOSS of Gianna regret of not be-SE.1SG remembered.F.SG

TRANS “Of Gianni I regret not having thought of.”

Instance 6.3: An example of IGT from Italian (ita) where gloss and translation words do not match precisely.

The PROJECTION line indicates that the overall POS tag for each gloss token has been set to the POS tag associated with one of its elements, and this tag is ultimately transferred up to the LANG line. Projection via this method is not unique to IGT, and has been used in other studies on projection that focus predominantly on parallel texts, such as Hwa et al. (2004) and Yarowsky and Ngai (2001). What is unique to using IGT as a data source is how the gloss line is able to aid in aligning the translation and language lines with a far lower data requirement than methods that require statistical alignment. This method does have downsides, though. In Instance 6.2, only those parts of the gloss that match the translation line are utilized in the projection process, while a far greater amount of text is available. Second, as shown in Instance 6.3, sometimes the correspondence between the gloss and translation lines is not exact, and so words that might otherwise be used for projection may be skipped. It is for reasons such as these that I worked to extract more information from the gloss line in the POS tagging task. Instance 6.4 gives an alternative approach to finding information for part-of-speech tagging by looking directly at the gloss line itself, shown by the layer labeled GLOSS-TAGS. The intuition here is that the content on the gloss line is a combination of English words and grams—elements which specify grammatical features of the word on the language line. The first word in the Arabic in Instance 6.1, nnisaau, is marked on the gloss line as the-women(3.PL.F.)-NOM. Aside from the root, women, each of the tokens 3,

LANG nnisaau daxalna makaatibahunna

CLASSIFICATION (NOUN) (VERB) (NOUN)

GLOSS-TAGS DET-NOUN(PER.NUM.GEN.)-CASE VERB-PER.NUM.GEN NOUN(NUM.)-CASE-PRON(GEN.)

GLOSS the-women(3.PL.F.)-NOM entered-3.PL.F office(PL.)-ACC-their(F.)

TRANS “The women have entered their offices.”

Instance 6.4: The Arabic (ara) IGT from Instances 6.1 and 6.2, but using a GLOSS-TAGS layer to label the elements of the GLOSS line directly, and use that information to make a label choice.

PL, and F indicates inflection for person, number, and gender, while the initial the- appears to indicate a definite marker, with -NOM marking the nominative case. Collectively, I will refer to both grams and words within a single whitespace-delimited token as subwords. Looking at these subwords, I sought to exploit my intuition that, even when the translation does not align perfectly with the gloss, the gloss can be used as a resource on its own to gain information about the words in the language line, both independently from, and in conjunction with, the translation line. I will discuss how I implemented this approach further in Section 6.3.

Corpus                                           #Tokens   #Types   Average Tags/Type
German Instances in XL-IGT                       739       321      1.13
NEGRA Corpus                                     33,133    8,890    1.04
TIGER Corpus                                     34,093    8,726    1.04
UD-2.0                                           30,460    7,015    1.05
European Corpus Initiative Multilingual Corpus   333,882   15,755   N/A

Table 6.1: A comparison of the count of words and unique words in the German data of the XL-IGT corpus and other German-language corpora, illustrating the sparsity of IGT.

One other concern in the POS tagging task when IGT is used is that of data sparsity, and particularly the out-of-vocabulary (OOV) rate. As Table 6.1 shows, the number of German words covered by IGT data is extremely small, with only 321 unique words compared to the thousands in other German-language corpora. This means that, if the IGT data were used directly to train a POS tagger, it would have to contend with an extremely high OOV rate. Thus, maximizing the number of words that are tagged in the IGT data is very important, particularly by expanding the number of POS tags that can be recovered when alignment cannot be obtained. This concern is also one that the alternative approach presented in Section 6.3 seeks to address.

6.2 Projection-Based Tagging

To begin the discussion of my implementation of POS tagging, I will first focus on the method that has traditionally been used for transferring annotation with bilingual corpora: projection-based POS tagging.

[Flowchart omitted; it shows the components of the POS tagging module (§6): word alignment and POS tag projection (§6.2), gloss-line tagging (manual or via projection), training of the gloss-line classifier (§6.3), combination of the approaches (§6.3.5), and training of a monolingual POS tagger from the POS-tagged IGT (§6.5).]

Figure 6.1: Flowchart of the high-level overview of the POS tagging module, with the projection pathway highlighted.

As shown in Fig. 6.1, this method requires only a set of raw IGT instances, translation-line POS tags, and gloss-to-translation-line word alignment. Using the conventions from Appendix A, I use F for the language line, E for the translation line, and P_E and P_F for the part-of-speech tags for each.

The raw IGT instances may be obtained through Odin, or any of the other IGT-containing corpora in Chapter 4. The part-of-speech tags to project from the translation line may be obtained either through manual labelings, such as those available in the RG-IGT corpus, or from an English-language part-of-speech tagger, such as the Stanford Part-of-Speech Tagger (Toutanova et al., 2003). The third element, the gloss-to-translation-line alignment A, can be obtained by any of the methods described in Chapter 5.

[Flowchart omitted; it shows raw IGT instances split into language (F), gloss, and translation (E) lines, the alignment A obtained manually or via the Heur –POS, Heur +POS, Stat, or Stat+Heur methods (§5), P_E obtained manually or from a tagger, and P_E projected via the word alignment to yield P_F for the IGT instance (POS System 1: Projection).]

Figure 6.2: Flowchart illustrating the pathway from raw IGT to tagged language line (P_F) using IGT projection.

6.2.1 Projection-Based Tagging Algorithm

The basic mechanism for projecting part-of-speech tags using IGT is illustrated in Fig. 6.2, and the pseudocode is given in Algorithm 6.2.1. The English POS tagger and gloss/trans aligner are represented in the pseudocode by the functions engTag(Sentence) and getAlignedTrans(GlossWord). As IGT instances are encountered, they are separated into translation and gloss lines (Lines 7 and 9), and the translation line is tagged (Line 8). Iterating over the gloss words, they are examined for alignment with the translation line (Line 11) and if an aligned translation word exists, it is assigned the translation word’s tag (Lines 12 to 14), otherwise it is assigned the unknown ‘UNK’ tag (Lines 15 to 16). Since gloss and language lines have a word-to-word alignment, the language word is obtained this way (Line 17). With the language word available, and the gloss word’s projected or ‘UNK’ tag, this (Word, Tag) pair is collected into the tagged language sentence (Line 19) and this sentence is added to the corpus (Line 20).

6.2.2 Projection-Based Tagging Experiments

Table 6.2 shows the accuracy of the projection algorithm in a number of different settings. The Alignment Type columns distinguish whether the alignment between translation and gloss used for the projection was from the heuristic system or manual alignments. The POS Tag Source columns refer to whether the tags being projected from the translation line are the manually labeled Gold tags, or the automatically produced Tagger tags. It is worth noting that, even when gold alignment data is used with gold POS tags, the projection algorithm achieves an accuracy of 90% overall, while the usage case that more readily simulates real-world performance on an unknown language—heuristic alignment and automatic tags—achieves only 66.8% overall accuracy. While these results would still be promising for a language with no other resources, these figures highlight some of the weaknesses of the projection algorithm for POS tagging.

Algorithm 6.2.1: Projection-Based Tagging Algorithm
input  : IgtCorpus                        // This corpus is pre-aligned
       1 engTag(Sentence)
       2 getAlignedTrans(GlossWord)
output : PosTaggedCorpus

 3 Function projectTags:
 4   Let PosTaggedCorpus be new Corpus;
 5   foreach Instance in IgtCorpus do
 6     Let TaggedLanguageSentence be new Sentence;
       /* Get the translation line from the instance and tag it */
 7     TransLine ← getTrans(Instance);      // Get the trans line
 8     TransTags ← engTag(TransLine);       // ...and tag it
 9     GlossLine ← getGloss(Instance);      // Get the gloss line
10     foreach GlossWord in GlossLine do
11       TransWord ← getAlignedTrans(GlossWord);
12       if GlossWord has an aligned TransWord then
           /* If the gloss is aligned, assign the trans tag to it */
13         TransTag ← getTagFor(TransWord, TransTags);
14         GlossTag ← TransTag;
15       else
           /* Else if unaligned, assign the 'unknown' tag */
16         GlossTag ← "UNK";
17       LangWord ← wordToWordAlignment(GlossWord);
18       LangTag ← GlossTag;
19       addToSentence(TaggedLanguageSentence, (LangWord, LangTag));
20     addToCorpus(PosTaggedCorpus, TaggedLanguageSentence);
21   return PosTaggedCorpus;

Projection Algorithm Accuracies

Alignment Type       Gold               Heuristic
POS Tag Source       Gold    Tagger     Gold    Tagger
Bulgarian            90.5    95.2       71.4    76.2
French               90.6    80.2       71.9    63.0
German               90.9    83.9       74.8    69.4
Italian              94.4    88.9       66.7    61.1
Spanish              80.0    69.1       65.5    60.0
Overall              90.0    82.1       72.8    66.8

Table 6.2: POS tagging accuracies for the projection algorithm on the five languages in the RG-IGT corpus. The table shows results for projection using either gold word alignments or heuristic alignments, and whether the translation-line POS tags were the gold tags or generated by the Stanford POS tagger (Toutanova et al., 2003).

6.2.3 Problems with Projection-Based Tagging

As mentioned in Section 6.1, while this method does produce a POS-tagged language line, it has several major limitations. First, as illustrated in Instance 6.1, when an English translation word does not share a root with the word or words on the gloss line it is translating, it may not align, thus leaving the projected tag as ‘UNK’. A second source of problems is when the English translation shares a root with the gloss, but the part of speech associated with the translation differs from that of the glossed word,

borda gaTana kiyA gayA
board of formation do-perf go-perf
The board was formed

Instance 6.5: Example Hindi (hin) IGT where the Hindi gaTana (“formation”) is a NOUN whereas the English formed is a VERB. The words share the same root, so the VERB tag will be incorrectly projected to gaTana from the translation line using the heuristic alignment approach.

such as in Instance 6.5. In this case, in an attempt to find the best match for the word, the heuristic approach will project a tag from the translation line that does not match the intent of the word on the gloss line, despite the shared root. Finally, projecting POS tags has the problem of deciding which of multiple aligned words determines the tag for a gloss word. As Instance 6.1 shows, the-women(3.PL.F.)-NOM aligns with both The/DET and women/NOUN, while office(PL.)-ACC-their(F.) aligns with their/PRON and offices/NOUN. Deciding which aligned word’s tag to assign is problematic. The algorithm solves this problem by using a precedence list that favors likely content words over potential function words. The list used is: VERB > NOUN > ADV > ADJ > PRON > DET > ADP > CONJ > PRT > NUM > PUNC > X. While this solution does address the concern, it is rule-based and prone to error. For reasons such as these, it is attractive to consider a classification-based approach, where more features can be considered for a given word, and every word is assigned a tag.
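The precedence rule just described can be sketched in a few lines; the list mirrors the one given in the text, while the function name is illustrative.

# A minimal sketch of the precedence rule for choosing among several projected tags.
PRECEDENCE = ['VERB', 'NOUN', 'ADV', 'ADJ', 'PRON', 'DET', 'ADP',
              'CONJ', 'PRT', 'NUM', 'PUNC', 'X']

def choose_tag(candidate_tags):
    """Pick the highest-precedence tag among those projected onto a gloss token."""
    ranked = [tag for tag in PRECEDENCE if tag in candidate_tags]
    return ranked[0] if ranked else 'UNK'

# e.g. choose_tag({'DET', 'NOUN'}) == 'NOUN' and choose_tag({'PRON', 'NOUN'}) == 'NOUN',
# matching the two cases from Instance 6.1.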

6.3 Classification-Based Tagging

An alternative to the projection-based approach is one in which the gloss line is annotated without requiring alignment with the translation line, and the word-to-word alignment between the gloss and language lines is used to assign the gloss-line decisions to the language line. In order to accomplish this, I use a classifier that operates on each gloss-line word. While it would be possible to use a context-sensitive model such as a Hidden Markov Model (HMM) or Conditional Random Field (CRF) for this approach, if we treat the gloss lines of IGT as a pseudo-language, this pseudo-language varies widely in its ordering between IGT language samples, and so predicating decisions on context learned from this multilingual source may not work well for all languages. Figure 6.3 shows the high-level overview for this approach. Unlike the projection step, which can be performed on IGT instances without anything other than off-the-shelf tools, this classification approach requires a training set of POS tags on the gloss line. While plenty of annotated data exists to train a POS classifier on English, IGT glosses are instead an English-like pseudo-language.

[Flowchart omitted; it repeats the POS tagging module overview of Fig. 6.1, here centered on the gloss-line classifier pathway (§6.3).]

Figure 6.3: High-level overview of the classifier-based POS tagging approach that functions directly on the gloss line of an IGT instance.

6.3.1 IGT Gloss-Line Annotation

Both to train the classifier and to evaluate it, I created the RG-IGT corpus mentioned in Section 4.1.3, consisting of 141 instances from 102 separate documents with manual POS labels on the translation, gloss, and language lines. I sampled IGT instances from multiple different Odin source documents to avoid a bias toward only a few specific constructions. In addition, I spread the annotation over a number of languages so that annotation pertaining to constructions unique to different languages might be obtained. Table 6.3 shows how many tokens and types were annotated for each language on the gloss, language, and translation lines, as well as the average tags per type for each language, also broken down by IGT line.

            Language line                  Gloss line                     Translation line
Language    Tokens  Types  Avg Tags/Type   Tokens  Types  Avg Tags/Type   Tokens  Types  Avg Tags/Type
Bulgarian   20      20     1.00            44      44     1.00            32      25     1.33
French      208     114    1.89            224     132    1.71            84      56     1.53
German      346     221    1.62            365     256    1.46            270     159    1.74
Italian     22      19     1.16            28      27     1.08            18      16     1.13
Spanish     83      75     1.15            69      64     1.08            66      58     1.16
Overall     679     449    1.56            730     523    1.42            470     314    1.53

Table 6.3: Breakdown of all IGT instances in the RG-IGT corpus by token and type.

[Flowchart omitted; it shows gloss-tagged IGT instances providing the gloss tokens (W_G) and gloss tags (P_G), features extracted from the tokens with the P_G tags as classification targets, and the classifier trainer producing the gloss-line classifier model (POS System 2: Classification Training).]

Figure 6.4: Flowchart illustrating how the classification-based approach works, both in training the gloss-line classifier and applying it to test IGT instances.

With labels now obtained for 730 gloss tokens, the classifier can be trained on this data. Figure 6.4 shows a flowchart outlining the training and testing steps of this approach, and Algorithms 6.3.1 to 6.3.3 give the pseudocode.

6.3.2 Training

After the IGT instances have been annotated, the next steps, as shown in Fig. 6.4, are to extract the features from the annotation and train a classifier for future gloss instances. Section 6.3.3 will explain the features and the feature extraction step in full, but aside from this, as shown in Algorithm 6.3.1, the training step is rather simple.

Algorithm 6.3.1: Classification-Based Tagging Algorithm: Training Phase
input  : GlossAnnotatedIgtCorpus
         ClassifierTrainer
         runTrainer(Trainer, FeatureVectorCollection)
         getFeatures(Word)
         addTrainingInstance(FeatureVector, Target, FeatureVectorCollection)
output : ClassifierModel

 1 Function trainGlossClassifier:
 2   Let GlossTrainingVectors be new FeatureVectorCollection;
 3   foreach Instance in GlossAnnotatedIgtCorpus do
 4     AnnotatedGlossLine ← getGloss(Instance);
 5     foreach GlossWord, GoldGlossTag in AnnotatedGlossLine do
 6       InstanceVector ← getFeatures(GlossWord, Instance, EnglishDictionary);
 7       addTrainingInstance(InstanceVector, GoldGlossTag, GlossTrainingVectors);
 8   Let GlossClassifierModel ← runTrainer(ClassifierTrainer, GlossTrainingVectors);
 9   return GlossClassifierModel;

Algorithm 6.3.2: Classification-Based Tagging Algorithm: Feature Extraction
input  : GlossWord
         IgtInstance
         EnglishDictionary
output : FeatureVector

 1 Function getFeatures:
 2   Let FeatureVector be new FeatureVector;
 3   Alignment ← getAlignment(IgtInstance);
 4   foreach featureFunction() in ActiveFeatures do
       /* Args are optional arguments, e.g. Alignment and EnglishDictionary */
 5     featureValue ← featureFunction(GlossWord, ...Args);
       addToVector(FeatureVector, featureValue);
 6   return FeatureVector;

Line 2 instantiates a new collection of training vectors to feed to the classifier trainer. In my experiments I used a MaxEnt classifier, but any classifier may be used. Lines 3 to 4 iterate over all the instances and extract the gloss line, while Lines 5 to 7 iterate over the gloss tokens on the gloss line and add the feature vector extracted from each token to the training vectors. Finally, Lines 8 to 9 use the classifier trainer to train a model from the input vectors, and return the trained model.

6.3.3 Features

In order to train a classifier, features must be extracted for the training instances. Table 6.4 contains a full list of features used in the classifier. Most features take only the current gloss Word as an argument, but alignedTag takes an Alignment between the gloss line and translation line, and a set of TranslationTags for the POS tags on the translation line. Algorithm 6.3.2 shows the pseudocode for how features are extracted at both training time and testing time. While some features can be extremely reliable, others are not particularly helpful. In order to understand which features help the classifier the most, I conducted a test by enabling each of the features in Table 6.4 individually and in combinations, the results of which can be seen in Table 6.5. In this test, each feature was used in isolation to train the classifier, except where combinations of features are indicated. For comparison, the performance of the projection method across the entire dataset is provided, as is the result of assigning the most frequent tag (NOUN). The subWords feature alone was quite successful, achieving a 79.5% accuracy by itself. Interestingly, the prefix feature achieved a higher performance, at 83.6%. The projection-based alignedTag feature also performed rather well, though as gold tags are not available for all translation lines, the feature fired far less frequently when using gold tags, and thus

subWords(Word): Simply return, as features, each subword contained in the word.
    Example: ‘Woman.3sg.FEM’ → {‘Woman’, ‘3sg’, ‘FEM’}
alignedTag(Word, Alignment, TranslationTags): Given the current word and its alignment with the translation line, return the POS tag of the word(s) with which it is aligned. This feature must also be provided with an alignment to the translation line and POS tags for the translation.
    Example: GLOSS the-women(3.PL.F.)-NOM, TRANS “The/DET women/NOUN”; ‘the-women(3.PL.F.)-NOM’ → {‘DET’, ‘NOUN’}
wordHasNumber(Word): Return 1 if there is a number somewhere in the word, 0 otherwise.
    Example: ‘3sg.FEM’ → 1; ‘dog.NOM’ → 0
suffix(Word): Return up to the last three characters of the word.
    Example: ‘leading’ → {‘ing’, ‘ng’, ‘g’}
prefix(Word): Similar to suffix, return up to the first three characters of the word.
    Example: ‘disavow’ → {‘d’, ‘di’, ‘dis’}
numSubwords(Word): Return an integer value of the number of subwords contained in the word.
    Example: ‘Women.3sg.FEM.PL’ → 4
prevSubwords(Word): Return the subwords contained in the preceding word; the same as calling subWords(PrevWord).
    Example: PREV: ‘3sg.FEM’ --- CURRENT: ‘dog.NOM’ → {‘3sg’, ‘FEM’}
nextSubwords(Word): Return the subwords contained in the following word; the same as calling subWords(NextWord).
    Example: CURRENT: ‘3sg.FEM’ --- NEXT: ‘dog.NOM’ → {‘dog’, ‘NOM’}
dictTag(Word, Dict): Using a dictionary generated by scanning the words in the Penn Treebank, return the most common POS tag for each subword found in the dictionary, if it is found. (In this example, no tag is found for PL.)
    Example: ‘the.women.PL’ → {‘DET’, ‘NOUN’}
prevDictTag(Word, Dict): Same as dictTag, but for the previous word.
    Example: PREV: ‘woman.3sg.FEM’ --- CURRENT: ‘went.PRES’ → {‘NOUN’}
nextDictTag(Word, Dict): Same as dictTag, but for the following word.
    Example: CURRENT: ‘woman.3sg.FEM’ --- NEXT: ‘went.PRES’ → {‘VERB’}

Table 6.4: List of features used in the classifier. While most features simply take a Word as an argument, the alignedTag feature takes an additional Alignment and TranslationTags, and dictTag takes an additional Dict argument.
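To make a few of these features concrete, the sketch below extracts a simplified feature dictionary for a gloss token; delimiters and feature names are simplified relative to the actual feature set, and the alignment- and dictionary-based features are omitted.

# An illustrative sketch of some of the gloss-token features from Table 6.4.
import re

def subwords(token):
    """Split a gloss token into subwords, e.g. 'the-women(3.PL.F.)-NOM'."""
    return [s for s in re.split(r'[-.()]+', token) if s]

def features(token, prev_token=None, next_token=None):
    feats = {}
    for sub in subwords(token):                                  # subWords
        feats['sw=' + sub.lower()] = 1
    feats['has_number'] = int(bool(re.search(r'\d', token)))     # wordHasNumber
    feats['num_subwords'] = len(subwords(token))                 # numSubwords
    for i in range(1, 4):                                        # prefix / suffix
        feats['pre=' + token[:i].lower()] = 1
        feats['suf=' + token[-i:].lower()] = 1
    if prev_token is not None:                                   # prevSubwords
        for sub in subwords(prev_token):
            feats['prev_sw=' + sub.lower()] = 1
    if next_token is not None:                                   # nextSubwords
        for sub in subwords(next_token):
            feats['next_sw=' + sub.lower()] = 1
    return feats

# features('the-women(3.PL.F.)-NOM') produces subword features for
# 'the', 'women', '3', 'pl', 'f', and 'nom', among others.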

Group            Feature or Ensemble                               Train    Test
Baselines        Most Common Tag (NOUN)                            27.9%    26.9%
                 Projection(H,A)                                   75.1% (entire dataset)
Basic Features   subWords                                          86.3%    79.5%
                 numSubwords                                       31.2%    31.5%
                 wordHasNumber                                     31.5%    38.4%
Alignment        alignedTag(H,A)                                   78.5%    82.2%
                 alignedTag(H,G)                                   60.3%    63.0%
                 alignedTag(G,A)                                   84.5%    89.0%
                 alignedTag(G,G)                                   64.4%    67.1%
Affixes          suffix                                            84.0%    65.8%
                 prefix                                            88.7%    83.6%
                 suffix + prefix                                   96.7%    80.8%
Context          prevSubwords                                      60.9%    47.9%
                 nextSubwords                                      66.1%    37.0%
                 prevSubwords + nextSubwords                       84.2%    60.3%
Dictionary       prevDictTag                                       42.8%    47.9%
                 nextDictTag                                       40.5%    46.6%
                 dictTag                                           85.8%    89.0%
                 prevDictTag + nextDictTag                         53.4%    56.2%
                 prevDictTag + nextDictTag + dictTag               88.6%    89.0%
Best System      subWords + alignedTag(H,A) + suffix + prefix +
                 prevSubwords + nextSubwords + prevDictTag +
                 nextDictTag + dictTag                             99.8%    94.5%

Table 6.5: Feature tests for features listed in Section 6.3.3 on the RG-IGT corpus. The performance on a training set and test set are given using a 90/10% split. For the alignedTag feature, the arguments provided show the (AlignmentType, TagType) provided to the function, where ‘H’ and ‘G’ represent Heuristic alignment and Gold alignment respectively, while ‘A’ and ‘G’ likewise represent Automatically labeled tags and Gold tags.

suffered a performance hit. The dictTag feature also performed extremely well, showing the helpfulness of the English-language words on the gloss line, similar to the alignedTag feature when using gold alignments and automatic tags. Finally, the Best system, chosen by testing all feature combinations and choosing the highest performing, consists of a large ensemble of features to achieve an extremely high accuracy of 94.5%, a 78% error reduction over the projection method, and a 73% reduction over using the basic features alone.

6.3.4 Running the Classifier on New Instances

Now, having trained a classifier capable of labeling gloss-line instances with their POS labels, the classifier can be run upon previously unannotated IGT instances as shown in Fig. 6.5, and implemented in Algorithm 6.3.3. In the algorithm, Line 7 uses the same getFeatures function that was described previously to extract features, and Line 8 uses the classifier produced in the training phase to produce the POS label for the current gloss word. Finally, the word-to-word alignment is again utilized in Line 10 to assign this label to the language word.

6.4 Combining Projection and Classification

While the RG-IGT corpus was originally created to serve as a small set of labeled instances for both training and evaluation of the gloss-line classifier, there is an alternative method of obtaining gloss-line POS tags for training: using projection over the entire Odin corpus. Figure 6.6 shows this revised training step. Although the projection-based approach leaves many gaps when words are not aligned, and thus has low overall accuracy for POS tagging, the classifier approach does not require complete sequences, but can instead use only the tokens for which projection is successful.

Algorithm 6.3.3: Classification-Based Tagging Algorithm: Testing Phase
input  : GlossClassifierModel
         IgtCorpus
         EnglishDictionary
         classifyWord(FeatureVector, Model)
output : PosTaggedCorpus

 1 Function testPosClassifier:
 2   Let PosTaggedCorpus be new Corpus;
 3   foreach Instance in IgtCorpus do
 4     Let TaggedLanguageSentence be new Sentence;
 5     GlossLine ← getGloss(Instance);
 6     foreach GlossWord in GlossLine do
 7       Features ← getFeatures(GlossWord);
 8       GlossTag ← classifyWord(Features, ClassifierModel);
 9       LangTag ← GlossTag;
10       LangWord ← wordToWordAlignment(GlossWord, Instance, EnglishDictionary);
11       addToSentence(TaggedLanguageSentence, (LangWord, LangTag));
12     addToCorpus(PosTaggedCorpus, TaggedLanguageSentence);
13   return PosTaggedCorpus;

[Flowchart omitted; it shows raw IGT instances providing gloss tokens (W_G), features extracted and fed to the gloss-line classifier model to yield gloss tags (P_G), and the 1-to-1 gloss/language alignment used to produce P_F for the instance (POS System 2: Classification Testing).]

Figure 6.5: Flowchart illustrating the testing step of the classification-based tagging approach.

[Flowchart omitted; it repeats the POS tagging module overview of Fig. 6.1, here with the gloss-line classifier trained on gloss-line tags obtained via projection rather than manual annotation.]

Figure 6.6: Flowchart of classification-based POS tagging training step using projected labels in place of gold-standard labels (Fig. 6.4).

Using this approach would thus potentially expand the number of training instances from the 432 tokens in the RG-IGT corpus to the 231,602 in the Odin corpus, though with the potential of introducing noise of various types mentioned in Section 4.4.

6.4.1 Results on IGT Data

Figure 6.7 shows the results of the classification approach either using training data from 90/10% splits and 10-fold validation on the RG-IGT corpus (labeled “Manual”), or using projected POS tags from the full Odin corpus (labeled “ODIN”). As the figure shows, the “Manual” system, despite having many fewer training instances, performs better on the test data than does the “ODIN” system. Given that the set of training data is very close to the test data, covering the same languages and likely biased by use of a similar set of subwords, this result is not particularly surprising, and should not be given too much weight. While both classifier-based approaches show substantial improvement over the projection method, the amount of evaluation data available in these experiments is limited. The following sections, Sections 6.5 and 6.6, will discuss POS tagging experiments on a more varied set of evaluation data.

[Bar graph omitted; it shows per-language and overall tagging accuracies on the RG-IGT corpus for the Projection, Classifier: ODIN, and Classifier: Manual systems on Bulgarian, French, German, Italian, and Spanish.]

Figure 6.7: POS tag accuracies of the projection-based and classifier-based approaches, using the RG-IGT corpus for evaluation. The classifier built using the manually-created RG-IGT corpus is labeled “Manual” and the classifier built using automatically projected POS tags over the entire Odin corpus is labeled “ODIN.”

6.5 A Case study in Projection Methods: Part-of-Speech Tagging Chintang

While testing on the languages in the RG-IGT corpus provided some amount of diversity in the languages being tested, all of those languages still belonged to the Indo-European language family, and so I also sought to perform part-of-speech tagging on a non-Indo-European language for which a very clean collection of IGT data was available, namely the Chintang (ctn) corpus (Bickel et al., 2009).¹ Chintang was a good candidate to study because, in addition to being an endangered language without traditional computational resources, no language-specific data had been created for it for any of the POS tagging approaches thus far. For these reasons, running the POS tagging experiments on Chintang would represent a proof of concept for other resource-poor languages.

¹ My work on this corpus was previously published as Georgi et al. (2015).

Method              Accuracy   % Unaligned
Basic Mapping       12.6       83.8
Extended Mapping    39.6       57.1

Table 6.6: POS tag accuracies for the projection-only approach on Chintang data.

As shown in Section 4.1.6, the Chintang IGT data from Bickel et al. (2009) consisted of 8,695 IGT instances, converted into the Xigt-XML format (Goodman et al., 2014) for use in Bender et al. (2014). Experiments included running POS tagging via both the projection and classification approaches. Gold-standard POS tags come from the Chintang corpus itself, though with mappings, as will be discussed in Section 6.5.2.

6.5.1 Projection-Based Tagging

POS tagging experiments were first attempted using the projection-based methods from Section 6.2, but as Table 6.6 shows, an initial 12.6% POS tagging accuracy quickly suggested an issue with the data. Looking more closely at the Chintang data, far more gloss-line tokens were unaligned than is typical for IGT data. This unusual number of unaligned tokens was due to the particular conventions of the Chintang data, where the overlap between the words used in the gloss line and the words used in the translation line was far smaller, and some gloss tokens consisted only of grammatical features.

hun-ko-i tis-u-m pache
DEM-NMLZ-LOC put.into-3P-1/2nsA SEQ
(pro) (vt) (gm)
after putting dal or arum

Instance 6.6: An instance from the Chintang (ctn) development set, showing the lack of alignment between gloss line and translation line, as well as the gold standard POS tags.

This lack of overlap is illustrated in Instance 6.6. In this particular instance, only the gloss ‘put’ has a matching segment on the translation line, ‘putting’. Most of the rest of the gloss-line tokens provide grammatical information, such as ‘DEM’ (demonstrative). This lack of overlap is crucial, since when words are unaligned, projection cannot happen. Thus, 83.8% unaligned tokens means an upper bound on accuracy of 16.2% using the alignment heuristics developed in Section 5.2.

6.5.2 Subword Mapping

The tag gm in the Chintang data set refers to grammatical markers. While some words tagged gm do not have counterparts in English (e.g., a TOPIC marker), others are listed as the English words and, or, or but. In order to see if projection accuracy could be improved, I attempted to map the twelve most frequently seen subwords in the gloss line that had been labeled with the ‘gm’ POS tag. Table 6.7 gives the list and frequencies of these twelve gloss tokens, which show that BUT and and were labeled as ‘gm’ in the gold standard, words that would be labeled as ‘CONJ’ in projection if they were seen on the translation line. As shown in Table 6.6, this mapping dropped the number of tokens that were unable to receive tags from 83.8% to

Gloss Word   # Tokens      Gloss Word   # Tokens
FOC          1049          CIT          360
TOP          1027          REP          243
SEQ          855           and          237
ADD          621           SURP         236
EMPH         504           SPEC.TOP     223
BUT          365           COND         207

Table 6.7: Top twelve gloss tokens labeled gm in the CTN corpus, sorted by decreasing frequency.

57.1%, and raised the tagging accuracy from 12.6% to 39.6%. For classification, these ‘gm’ tags are added to the existing dictionary of Penn Treebank word-tag pairs used for the dictTag feature described in Table 6.4. This expanded dictionary would now contain CTN-specific (gloss-token, tag) pairs in addition to the high-frequency tags for other English words contained by the gloss-line token. This dictionary is then used to provide the best-guess tag feature to the classifier at training and test time.
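A small sketch of this dictionary expansion is shown below; dict_tag_lookup is an illustrative name for the Penn-Treebank-derived lookup used by the dictTag feature, and the token list mirrors Table 6.7.

# A sketch of adding the frequent CTN gm-tagged gloss tokens to the dictTag dictionary.
CTN_GM_TOKENS = ['FOC', 'TOP', 'SEQ', 'ADD', 'EMPH', 'BUT',
                 'CIT', 'REP', 'and', 'SURP', 'SPEC.TOP', 'COND']

def expand_dictionary(dict_tag_lookup, gm_tag='gm'):
    """Return a copy of the lookup with CTN-specific (gloss token, tag) pairs added."""
    expanded = dict(dict_tag_lookup)
    for token in CTN_GM_TOKENS:
        expanded[token.lower()] = gm_tag
    return expanded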

6.5.3 Classification-Based Tagging

In the next set of experiments, the classifier was run using four different settings for its training data. The first classifier used was the one trained on the full Odin corpus, using projected tags, as described in Section 6.4. No instances from the Chintang corpus were used for training the classifier in this approach. For the second classification-based approach, I trained the classifier with the training portion of the Chintang corpus, ignoring the gold-standard POS labels and instead using projected POS tags, as in the first scenario. The intention of this experiment was to see what improvement using instances specific to Chintang might bring, since many of the gloss-line tokens in the Chintang corpus were rarely, if ever, seen in much of the original Odin database. Additionally, since all the instances used to train the classifier were coming from the same language and being sensitive to word order might be beneficial, I used this experimental setting to test whether adding context features, such as prevSubwords and nextSubwords, would improve performance. The third approach combined the Odin and Chintang training data. This setting was designed to test how the addition of the Chintang instances would affect a model trained primarily on instances from other languages. The fourth approach used the training portion of the CTN data, this time utilizing gold-standard POS tags rather than projected tags. Two additional parameters were included for the training procedure: the addition of the

Training Data              Expanded Dictionary   Context Features   Accuracy
ODIN                                                                 43.1
ODIN                       X                                         53.0
CTN                                                                  75.0
CTN                        X                                         74.9
CTN                                              X                   74.8
CTN                        X                     X                   74.9
CTN+ODIN                                                             61.6
CTN+ODIN                   X                                         70.7
CTN+ODIN                   X                     X                   72.6
Supervised (Labeled CTN)                                             89.6
Supervised (Labeled CTN)   X                                         90.6
Supervised (Labeled CTN)   X                     X                   90.1

Table 6.8: POS tag accuracies for the classification-based method on CTN test data comparing different sets of training data and classifier features. The “Expanded Dictionary” refers to adding the gm tokens from Table 6.7 to the dictTag lookup dictionary. “Context Features” are the features labeled “Context” in Section 6.3.3.

twelve most frequent ‘gm’-labeled tokens from Table 6.7 to the dictionary used by the dictTag feature described in Table 6.4, labeled in the results in Table 6.8 as “Expanded Dictionary,” and the inclusion of contextual features, since this particular training targets a single language rather than the large variety of languages the classifier is typically intended to cover.

The results of the classification experiments are shown in Table 6.8, where the ODIN-trained classifier shows a poor performance of 43.1% with the standard settings: better than the 39.6% displayed by the projection-based approach, even with remapping, but still far short of the 94.5% demonstrated on the RG-IGT data by the feature ablation test in Table 6.5. Adding the twelve ‘gm’-tagged tokens to the expanded dictionary improved performance by an absolute ten percent, but this result was still far lower than hoped for.

Training instead on the CTN data with projected POS tags improved the classification

drastically, reaching up to 75%. Adding the expanded dictionary and contextual features appeared to have no significant effect, possibly because the amount of training data available was so high already.

The combination of Chintang data and Odin training data resulted in a tagging accuracy of 61.6%. This decrease in performance is expected, given that the resulting classifier model is less targeted toward this specific data. Adding the expanded dictionary again has a substantial effect, improving accuracy from 61.6% to 70.7%, and the contextual features generated from the Chintang training data help compensate further, bringing the accuracy for this set of data up to 72.6%. Finally, using the gold-standard POS tags from the Chintang data yields accuracies in the 89.6%–90.6% range, with the additional features again showing little effect.

6.5.4 Varying the Amount of Data

Finally, I also wanted to see what performance one might expect to see with various amounts of available IGT training data. Figure 6.8 shows the result of this experiment, comparing the classifier over a range of training data sizes, from only 50 tokens to nearly 32,000 tokens.

In the graph, we see that the classifier using the expanded dictionary achieves almost 75% accuracy with approximately 400 tokens, coming very close to the supervised approach at around 77%. With an increased amount of data, the unsupervised methods converge at 75%, and the supervised method approaches 90%. While the supervised method clearly outperforms the projection-based methods, the fact that the classifier-based method, without context features, can achieve 75% with fewer than 500 tokens, an amount more representative of the typical quantity of IGT data available for a language, is a very promising result.

[Figure 6.8 appears here: a line plot titled "Data Size vs. Accuracy," plotting tagging accuracy (55–90%) against the number of tokens in the training data (250–16,000) for four settings: Supervised; Dictionary, No Context; Dictionary, Context; and No Dictionary, No Context.]

Figure 6.8: Graph showing the result of using the classifier-based tagging approach on CTN test data using varying amounts of training data. The legend refers to the different feature combinations used to train the classifier.

6.5.5 Analysis

While using the Chintang data was crucial for examining how the classification-based POS tagging approach performed on an endangered language, evaluation was difficult due to a combination of gloss-line conventions and the tags used in the gold standard. Without more knowledge of the specifics of Chintang, it is hard to determine whether the issues encountered in mapping the gold-standard tagset to the universal POS tags are merely an issue of design choices, or a more fundamental typological difference that the universal tagset is unable to account for.

6.6 Extending to Monolingual Corpora

While the results shown in Sections 6.4 and 6.5 are interesting, the systems that produced them rely on IGT input data. Instead of requiring the input to be IGT data, the POS-tagged IGT data could be used as training data for a monolingual POS tagger, resulting in a system that can be more widely used for the target language. The last set of POS tagging experiments I describe were done in order to achieve this goal.

6.6.1 Training Monolingual POS Taggers

In order to create monolingual POS taggers, I experimented with both methods of transferring tags to the language line described in the preceding section: either projection from translation to gloss, or classification of the gloss line, followed by 1-to-1 projection from the gloss line to the language line. Once a POS-tagged language line has been acquired, the resulting sentence can be used as a training instance for a standard sequence-labeling POS tagger, on the assumption that the contextual information exploited by traditional sequence labelers is more helpful when dealing with a single language than with the gloss line, which can represent multiple languages. For my experiments, I used the Stanford Tagger (Toutanova et al., 2003). The flowchart in Fig. 6.9 gives a high-level overview of how the POS tagger is trained, using either of these approaches.
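To make the training step concrete, the following is a minimal sketch (not the actual Intent implementation) of how POS-tagged language lines might be serialized into the common word/TAG training format that sequence taggers such as the Stanford Tagger can consume; the function name, the delimiter choice, and the toy Welsh example are illustrative assumptions only.

def write_tagger_training_file(tagged_lines, path, delimiter="/"):
    """Write one sentence per line, with tokens rendered as word<delimiter>TAG."""
    with open(path, "w", encoding="utf-8") as out:
        for sent in tagged_lines:
            out.write(" ".join("{}{}{}".format(w, delimiter, t) for w, t in sent) + "\n")

# Hypothetical input: (word, tag) pairs for two projected/classified language lines.
example = [
    [("Rhoddodd", "VERB"), ("yr", "DET"), ("athro", "NOUN")],
    [("lyfr", "NOUN"), ("ddoe", "ADV")],
]
write_tagger_training_file(example, "welsh_train.txt")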

6.6.2 Evaluating The Monolingual Taggers

In order to evaluate the new, monolingual POS taggers, I use the test sections of the Universal Dependency Treebank, v2.0 corpus (UD-2.0) (Agić et al., 2015), as it provides gold-standard POS tags in a variety of different languages, all using the universal tagset from Petrov et al. (2012). The English-language POS tagger used for tagging the translation lines was also trained on the WSJ corpus using POS tags mapped to the universal set. Finally, as Section 4.2

[Figure 6.9 appears here: a flowchart titled "Monolingual POS Tagger Creation," showing target-language IGT instances being extracted and fed to either the projection approach or the classifier approach (trained on projected or manual gloss-line POS tags, including RG-IGT instances) to produce POS-tagged language lines, which are then used to train a target-language POS tagger.]

Figure 6.9: Flowchart depicting the process for training the monolingual POS tagger, using either the projection-based approach from Section 6.2 or the classifier-based approach from Section 6.3.

mentions, one of the constraints chosen to limit the scope of this work was avoiding complications due to transliteration of orthography, so although the UD-2.0 corpus provides data for Japanese and Korean, these languages were not used for these POS tagging experiments. The languages chosen for evaluation were: French, German, Italian, Spanish, Indonesian, Portuguese, and Swedish. It is worth noting that although no language-specific information was used in creating the classifier for POS tagging the gloss line, IGT instances from the first four languages were found among those which had their gloss lines annotated in the RG-IGT corpus. The last three languages, Indonesian, Portuguese, and Swedish, did not.
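As an illustration of the tag-mapping step mentioned above for the English tagger, the sketch below shows a partial Penn Treebank-to-universal mapping in the style of Petrov et al. (2012); only a handful of entries are listed, and the dictionary and function names are assumptions for illustration, not the actual mapping file used.

# Partial, illustrative PTB -> universal POS mapping (Petrov et al., 2012 style).
PTB_TO_UNIV = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN", "NNPS": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB", "VBN": "VERB",
    "VBP": "VERB", "VBZ": "VERB", "MD": "VERB",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "RB": "ADV", "RBR": "ADV", "RBS": "ADV",
    "DT": "DET", "IN": "ADP", "CC": "CONJ", "CD": "NUM",
    "PRP": "PRON", "PRP$": "PRON",
}

def to_universal(ptb_tag):
    # Fall back to "X" for tags not covered by this partial table.
    return PTB_TO_UNIV.get(ptb_tag, "X")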

6.6.3 Results

Figure 6.10 and Table 6.9 show the results of these monolingual experiments, comparing monolingual POS taggers created with IGT language lines tagged via projection or classifier methods against POS taggers created with data from the training sections of the UD-2.0. As the results from the RG-IGT (Section 6.4.1) and Chintang experiments (Section 6.5.5) suggested, the classification-based approach outperforms projection in almost all of the tested languages. Two settings for projection were attempted: one, labeled Projection: Partial, where IGT instances were used regardless of whether all words in the instance were aligned; the other, labeled Projection: Full, which only used IGT instances for which all tokens in the instance were aligned with the translation line and received a label. This Full setting drastically improves monolingual tagging performance for all languages except Indonesian. Among the classification-based approaches, the classifier trained on a small amount of manually labeled gloss-line tokens (Classifier: Manual) performed better in most cases than the classifier trained on a much larger set of projected labels (Classifier: ODIN). German and Swedish were the exception, with the Odin classifier outperforming the manual

Language      Projection:   Projection:   Classifier:   Classifier:   Supervised:   Supervised:
              Partial       Full          ODIN          Manual        1K Tokens     All Tokens
French        50.8          61.9          67.4          70.6          79.3          95.8
German        67.8          70.1          73.7          72.6          70.5          92.6
Italian       55.0          59.1          63.4          66.6          73.5          96.2
Spanish       61.6          68.8          70.4          73.2          80.2          95.9
Indonesian    61.5          59.7          66.2          70.9          80.7          93.9
Portuguese    52.5          65.5          60.8          65.8          78.5          96.0
Swedish       47.0          50.1          55.1          53.5          67.3          93.8
Overall       56.3          63.4          66.3          69.2          77.7          95.4

(a) Tagging accuracies for all sentences in the UD-2.0 test data.

Language      Projection:   Projection:   Classifier:   Classifier:   Supervised:   Supervised:
              Partial       Full          ODIN          Manual        1K Tokens     All Tokens
French        62.3          69.9          74.1          75.8          76.8          96.4
German        68.5          66.8          75.0          72.8          74.0          92.6
Italian       65.1          65.6          69.6          72.3          69.6          95.3
Spanish       64.7          65.3          70.2          73.4          78.2          95.8
Indonesian    63.8          63.3          66.7          70.1          79.4          90.9
Portuguese    56.7          66.7          64.1          65.7          75.9          94.1
Swedish       55.6          57.5          61.5          60.3          69.1          93.1
Overall       62.7          65.5          69.5          70.4          74.9          94.1

(b) Tagging accuracies for sentences in the UD-2.0 test data with length of ten words or fewer.

Table 6.9: POS tagging accuracies for monolingual taggers trained on IGT language lines and tested on UD-2.0 test data (McDonald et al., 2013). While no tagger involved annotation that was specific to a particular language, the first four languages were languages for which some IGT instances had been annotated on the gloss line for the classifier labeled 'Manual.' Neither Indonesian, Portuguese, nor Swedish were among the languages with gloss-line annotations, and no manual labeling was done for the 'ODIN' classifier.

classifier by a percentage point. Overall, however, the Manual classifier achieved 69% tagging accuracy, compared to the Odin classifier's 66%.

Shown in the results are two supervised systems, using varying amounts of training data. In the first system, labeled "1K Tokens," only the first 1,000 tokens of the training data were used. In the second system, all available training sentences were used. Given that 1,000 tokens is a far more realistic amount of data to have available for a resource-poor language, these scores are given to provide more of a real-world comparison than the "All Tokens" systems, which are fully supervised with tens or hundreds of thousands of available tokens. While this 1K-token system generally outperforms the IGT-based methods by a substantial

[Figure 6.10 appears here: a bar graph titled "POS Tagging Methods on UD-2.0 Test Data (All Sentences)," plotting POS tag accuracy (%) for French, German, Italian, Spanish, Indonesian, Portuguese, Swedish, and Overall, with one bar per method: Projection: Partial, Projection: Full, Classifier: ODIN, Classifier: Manual, Supervised: 1K Tokens, and Supervised: All Tokens.]

Figure 6.10: Bar graph showing the POS tag accuracies for the POS tagging on monolingual UD-2.0 test data.

margin, in German the classifier-based approaches manage to beat the 1K approach by 2–3 percentage points.

Another set of experiments seeks to address the shift in domain that the UD-2.0 data set represents. The average length of a sentence in the UD-2.0 corpus is 21.8 words, while the average length of an IGT instance in Odin is 4.78 words. Furthermore, the domain of the UD corpus is more along the lines of newswire articles than the extremely simple instances found in IGT. In order to compare differences in sentence length, I ran a second set of experiments, shown in Table 6.9b, where the systems were evaluated only on sentences in the UD-2.0 test data that consisted of ten words or fewer. On this set of test data, all IGT-based methods perform better, while the supervised approaches perform worse, but the trend shown for evaluation on all sentence lengths (Table 6.9a) stays the same. We see in these results that, although the supervised systems still regularly outperform the IGT-based systems, given that such supervised data will not be available for the typical resource-poor language, the IGT-based systems do appear to perform decently at the task.

6.7 Future Work

While the experiments performed here give a foundation for the ways in which IGT data may be used as a source for POS tagging resource-poor languages, there are some paths that would be interesting to pursue further.

6.7.1 Incorporating Classification Probabilities

Given that the classification approach I outline uses a MaxEnt classifier, which can output a top-n list of the potential POS tags and their probabilities, this information could be exposed to a downstream tagger, which could combine this ambiguity with the contextual information from the target language to produce higher-quality tagging on the monolingual task.
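As an illustration only, the sketch below shows how a MaxEnt-style classifier could expose such a top-n distribution; scikit-learn's LogisticRegression is used here purely as a stand-in for the actual classifier, and the feature vectors and tag set are toy assumptions.

import numpy as np
from sklearn.linear_model import LogisticRegression

def top_n_tags(clf, feature_vector, n=3):
    """Return the n most probable tags with their probabilities."""
    probs = clf.predict_proba(feature_vector.reshape(1, -1))[0]
    order = np.argsort(probs)[::-1][:n]
    return [(clf.classes_[i], float(probs[i])) for i in order]

# Toy example: two features, three tag classes.
X = np.array([[1, 0], [0, 1], [1, 1]])
y = np.array(["NOUN", "VERB", "ADJ"])
clf = LogisticRegression(max_iter=200).fit(X, y)
print(top_n_tags(clf, np.array([1, 0]), n=2))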

6.7.2 POS Induction

In the case where the amount of supervised information available for a language is limited, but a good deal of unsupervised information is available, it is possible that the POS tagging task might instead be cast as a semi-supervised POS induction task. The prototyping approach of Haghighi and Klein (2006b) is one which would ideally hold promise for this setting. In that work, the authors apply a typically unsupervised approach to monolingual data, and use "prototypes" as constraints on a Markov random field model, allowing the observed monolingual data to guide the induction process while steering the algorithm toward a solution that includes the pre-defined prototypes. While I performed some preliminary experiments along these lines, I was unable to replicate the published results, and it seemed that the choice of prototypes may require a good deal more work.

Chapter 7

DEPENDENCY STRUCTURES

The previous chapters have discussed word alignment and POS tagging, and the problems presented when using IGT as a data source. In particular, the previous tasks face the issues of (1) noisy data and (2) sparse data. For the task of dependency parsing, the target to be learned moves from the shallower tasks of word alignment and POS tagging to more complex linguistic analysis. While POS tags may differ between languages, if we treat their correspondence as a statistical likelihood, the relationship between two aligned words is even less likely to hold. Thus, dependency structures, which contain many such relationships between words, are an even harder target for bootstrapping than the previously discussed tasks.

To get a full picture of the dependency parsing task, I will present it in five parts. First, Section 7.1 will give a brief introduction to dependency structures. Second, Section 7.2 will discuss the fundamental task of projecting dependency structures using the projection algorithm alone, using different alignment methods. Next, Section 7.3 will discuss how projected parses were used in combination with a modified dependency parser to see if improvements could be made over projection alone.1 Section 7.4 will discuss measuring the divergence between languages by means of the dependency structures, and how these patterns might be learned and applied to the output of a parser.2 Finally, Section 7.5 will talk about training a monolingual dependency parser for use in the monolingual parsing task.

1 Previously published as Georgi et al. (2012).
2 Previously published as Georgi et al. (2014).

[Figure 7.1 appears here, in two panels.
(a) Parallel Welsh–English sentence pair:
    Welsh:   Rhoddod yr athro lyfr i'r bachgen ddoe
    English: The teacher gave a book to the boy yesterday
(b) Dependency structure representation of the English sentence, rooted at gave.]

Figure 7.1: Example of a Welsh–English sentence pair, and English dependency structure, represented as a DAG in tree form.

7.1 Introduction

Dependency structures (DSs) are used to describe individual sentences not as constituents, but rather as a set of head–dependent pairs, where the pairings describe a minimal relationship between words, such as an adjective and a noun. This concept was introduced to the field of modern linguistics in Tesnière (1959), and further supported by Hays (1964), who introduced the possibility of using such a formalism in machine translation. An example sentence and its DS representation are shown in Fig. 7.1.

The primary advantage of DSs over constituent-based models is that, as mentioned by Hays (1964), in a multilingual setting, the more basic head–dependent relations represented in a dependency structure are more likely to be retained between languages than the complex, language-specific interior structures present in a constituent-based

tree. It is for this reason that I choose to focus the syntactic elements of this work upon DS representations, for which a literature base exists to compare against, particularly with regard to other projection methods. In this chapter, I will present five different DS parsing systems. Section 7.2 will describe System 1, a projection-only system which functions on IGT data. Section 7.3 will discuss Systems 2 and 3, in which a modified dependency parser uses projected edges as a feature, potentially correcting for unreliable projections. Section 7.4 will describe System 4, in which divergence patterns are detected and corrected for by comparing manually corrected DSs in the target language against the English DSs used as projection sources. Finally, Section 7.5 describes System 5, which uses IGT to extract training data in the form of POS tags and projected DSs, then trains a dependency parser directly on the IGT-extracted data for use on monolingual data in the target language.

7.2 Projecting Dependency Structures

I start with the most straightforward method of obtaining dependency structures for a given language for which IGT data is available: use the IGT itself to obtain the structures. This method is essentially the same as standard bitext DS projection methods such as that implemented by Hwa et al. (2004), with the exception that IGT data allows for different alignment methods, as discussed in Chapter 5.

Figure 7.2a shows the English DS from Fig. 7.1, along with the word alignments to the Welsh sentence from Fig. 2.1. The projection method used follows Quirk et al. (2005) and Xia and Lewis (2007). To start, the system parses the translation line of the IGT, E, into a dependency tree, TE, and aligns the nodes with the words in the language line, F. For each translation-line node ei that aligns with a language-line word fi, that node is replaced with the language-line word. If a single translation node ei aligns with multiple language-line words (fi, fj), I make multiple copies of ei as siblings in the tree for each language-line

[Figure 7.2 appears here, in three panels.
(a) Using the English DS and the word alignment from Fig. 7.1, the next step is to project the structure via the alignments onto the Welsh sentence "Rhoddod yr athro lyfr i'r bachgen ddoe."
(b) Replacing the words in the English DS with those from the Welsh in Fig. 7.2a results in a tree in which the Welsh word i'r, being aligned with the English words to and the, occurs twice.
(c) For target words that were aligned to multiple source words, all but the shallowest are removed.]

Figure 7.2: Illustration of the projection of dependency structures using the aligned English–Welsh sentence and the English DS from Fig. 7.1, following Xia and Lewis (2007).

word, then replace the translation-line words with those from the language line. If multiple translation nodes align to a single language-line word, the node highest up in the tree is kept, and all others removed. Finally, remaining unaligned words are reattached using the following heuristic: for the indices i < j < k, where j is the unaligned word’s order in the sentence, and i and k are the closest aligned words to the left and right, respectively, j is attached to the lower of i or k if one is descended from the other. That is, if k is a dependent child of i, then j will attach to k. If there are no aligned words i to the left of j, j is attached to the leftmost word that is aligned, and vice versa for k. The full pseudocode for this algorithm as implemented here can be found in Appendix B.1 as Algorithm B.1.1.
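The sketch below illustrates the core of this projection step under simplifying assumptions: the alignment is treated as 1-to-1 (multiply-aligned words are assumed to have already been resolved as described above), and unaligned words are left for the reattachment heuristic rather than handled here. The head-dictionary representation and the function name are illustrative; the full implementation is Algorithm B.1.1 in the appendix.

def project_tree(eng_heads, alignment, num_lang_words):
    """eng_heads: {e_index: head_e_index or None (root)}.
    alignment: set of (f_index, e_index) pairs, assumed 1-to-1 here.
    Returns {f_index: head_f_index or None}; unaligned words get head -1
    as a placeholder for the reattachment heuristic."""
    e_to_f = {e: f for f, e in alignment}
    f_to_e = {f: e for f, e in alignment}
    lang_heads = {}
    for f in range(num_lang_words):
        if f not in f_to_e:
            lang_heads[f] = -1               # left for the reattachment heuristic
            continue
        # Walk up the English tree until we reach a node that projects to F.
        e = eng_heads.get(f_to_e[f])
        while e is not None and e not in e_to_f:
            e = eng_heads.get(e)
        lang_heads[f] = e_to_f[e] if e is not None else None   # None marks the root
    return lang_heads

# Example: English "gave(0) book(1)" with book -> gave, aligned to
# language-line indices 0 and 3 respectively.
eng = {0: None, 1: 0}
align = {(0, 0), (3, 1)}
print(project_tree(eng, align, 4))   # {0: None, 1: -1, 2: -1, 3: 0}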

This method has the obvious advantage of being available for any language that is avail- able in Odin, as it produces the dependency structures from the IGT instances themselves. The downside of this method is that it does not directly produce a model which can process non-IGT data; I will discuss how that may be done in Section 7.5.

Figure 7.3 illustrates the way in which the DSs are produced. In creating the dependency structures, I looked at five ways to obtain the alignment: two variants of the heuristic method (Section 5.2), two variants of the statistical G-T alignment methods (Section 5.3), and the gold-standard alignments themselves. Additionally, I look at two ways of obtaining trees for the translation-line: the manually-produced gold standard trees, and those produced by using a parser.

The first four systems, including the projection-only system described here, will be evaluated against the gold-standard dependency structures in the XL-IGT and HUTP corpora described in Chapter 4. The evaluation metric will be Unlabeled Attachment Score (UAS), which was chosen over Labeled Attachment Score (LAS) to focus the content of the experiments on the projected structure, rather than the complex semantics involved in labeled dependencies, which may map poorly between languages.
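As a point of reference, UAS reduces to the fraction of tokens whose predicted head matches the gold head, ignoring the dependency label; the small sketch below, with an illustrative function name and toy example, computes it over head dictionaries indexed by token position.

def uas(gold_heads, pred_heads):
    """Both arguments map each token index to its head index (None for root)."""
    assert gold_heads.keys() == pred_heads.keys()
    correct = sum(1 for tok in gold_heads if gold_heads[tok] == pred_heads[tok])
    return correct / len(gold_heads) if gold_heads else 0.0

# Example: four tokens, one wrong attachment -> UAS = 0.75.
gold = {0: 1, 1: None, 2: 1, 3: 1}
pred = {0: 1, 1: None, 2: 3, 3: 1}
print(uas(gold, pred))   # 0.75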

Figure 7.4 shows an overview of the UAS results for the projection-only methods across

Using Manual English DSs

Language    Stat    Stat+Heur   Heur −POS   Heur +POS   Manual
Gaelic      74.6    74.6        75.0        79.8        81.8
German      74.0    74.0        66.7        74.6        88.2
Hindi       59.9    60.0        49.0        57.3        62.8
Hausa       68.1    68.1        51.3        59.1        79.5
Korean      80.9    80.9        67.3        77.8        88.1
Malagasy    81.2    81.2        72.8        77.7        88.8
Welsh       79.2    79.2        73.1        78.2        88.1
Yaqui       76.1    76.6        56.9        82.0        85.8
Overall     72.6    72.6        61.9        71.2        81.0

(a) UAS for projected DSs, using the gold standard DSs as the source.

Using Parser-Produced English DSs

Language    Stat    Stat+Heur   Heur −POS   Heur +POS   Manual
Gaelic      61.5    61.5        66.3        69.1        65.5
German      66.9    66.9        59.5        66.6        75.4
Hindi       54.5    54.4        46.8        55.3        57.9
Hausa       57.5    57.5        47.6        53.3        61.2
Korean      75.8    75.8        63.7        73.7        83.2
Malagasy    69.3    69.3        63.0        67.5        73.2
Welsh       70.5    70.5        65.4        70.2        73.4
Yaqui       59.9    59.9        42.6        67.3        67.1
Overall     63.8    63.7        55.5        64.1        69.1

(b) UAS for projected DSs, using the Stanford Parser (de Marneffe and MacCartney, 2006) to produce the source DSs.

Table 7.1: Unlabeled Attachment Score (UAS) for projected dependency structures in the XL-IGT and Hindi datasets, showing various alignment methods and using either gold standard trees or parser-generated English trees as the projection sources.

[Figure 7.3 appears here: a flowchart labeled "System 1: Projection," showing IGT instances (language, gloss, and translation lines) feeding an alignment step with five options (Manual, Heur −POS, Heur +POS, Stat, Stat+Heur) and a translation-line tree step with two options (manual trees or parser output), whose outputs are combined to project TE onto TF, yielding a TF for each IGT instance.]

Figure 7.3: Flowchart illustrating the different options that can be used to create DSs via projection. There are five alignment methods, more detailed descriptions of which may be found in Chapter 5, and two options for the translation-line trees. Results for all ten pairwise combinations can be found in Table 7.1.

all languages in the XL-IGT and Hindi datasets, while Tables 7.1a and 7.1b show the results for the ten pairwise combinations of settings across those languages (see Chapter 4 for further descriptions of these datasets).

Looking at the overview in Fig. 7.4, we see that the projected DSs that use the gold standard as the source perform reliably better than those with automatically-generated sources, and that using manual alignment similarly boosts the scores. While some languages achieve scores as high as 88% when using gold-standard source DSs and alignment, an important takeaway from these results is that the overall UAS reaches only 81.0%. This 19% error rate is important, as it represents an upper bound on what we may reasonably expect from systems that rely upon projection. This demonstrates that even with perfect word alignment and

[Figure 7.4 appears here: a bar graph titled "Projected DS Unlabeled Attachment Scores (UAS) on XL-IGT and RG-IGT Corpora," plotting UAS (%) for manual versus parser-produced translation-line DSs under five alignment methods: G-T+ODIN, G-T+ODIN+Heur, Heur −POS, Heur +POS, and Manual.]

Figure 7.4: An overview of unlabeled parse accuracy for the dependency structures over the languages in the XL-IGT corpus when the projection algorithm is used with different alignment methods. The "Translation-Line DS" grouping reflects the two methods for creating the dependency structure on the translation line: manually created, or generated by a parser.

gold-standard translation-line DSs, the assumption that a projected DS will be correct for the language line does not always hold.

Keeping this in mind, the 70+% UAS achieved for some languages, and the 63–64% overall UAS using non-gold sources, while not particularly impressive, are comparatively good results for languages in which the only data available may be the IGT instances in Odin.

Hwa et al. (2004) report figures of 67–72% for Spanish and 54–64% for Chinese, languages for which the authors have 100,000–240,000 parallel sentences available, respectively. More recent work on projection by Ganchev et al. (2009) reports similar ranges without using language-specific rules: 63.8% for projection between English and Bulgarian, and 67.6% for English to Spanish. These methods use different data with longer sentences, so the numbers are not strictly comparable. For comparison, the languages here have between 67–174 IGT

instances, with an average length of between 6–7 words. Beyond the potential suspects of noise and difficulty with projection, Section 7.4 will discuss how part of this low overall score may be attributed either to structural dissimilarities between languages or particular annotation decisions made in creating the evaluation data.

7.3 Combining Dependency Parsers with Projection

While the projection-based approach works for languages with very few resources, it has two major limitations. First, this deterministic approach is unable to take word/tag ambiguities into account unless they have been explicitly coded for. Second, systematic differences in how dependency structures are represented between languages, whether intrinsic to the language or due to a design decision made in the treebank, present a challenge to the algorithm, which operates on the assumption that the dependency structures are English-like in nature. One approach to better handling ambiguity, as previously published in Georgi et al. (2012), is to combine the information provided by the projected trees within IGT instances with a small amount of training data for a statistical parser. The projected dependency edges may then be used as a feature in parsing, one which may receive a low weight if its use is shown to not be particularly predictive at the training step. Section 7.3.1 will describe the parser and the modifications that were made to include the projected dependency structures, and Section 7.3.3 will discuss the results of these modifications.

7.3.1 MSTParser

The parser used in these experiments is MSTParser (McDonald et al., 2005), so named for searching for Maximum Spanning Trees (MSTs) in directed graphs, an approach which has an advantage over previous attempts (Eisner, 1996; Collins, 1999) in that, unlike transition-based parsing, the graph-traversal approach is able to find a globally optimal parse, which is important for languages with freer word order and long-range dependencies. As the

[Figure 7.5 appears here, in two panels.
(a) System 2a: Basic MST Training. A flowchart illustrating the training procedure for a given LF–LE language pair, where LF is the foreign (non-English) language and LE is English. The "gold" trees for training the parser can either be manually created trees or trees created by projection only.
(b) System 2b: Basic MST Testing. A flowchart showing an overview of the testing procedure for the LF–LE language pair, using the LF parser produced in the training phase and augmented by the information projected from the parsed LE portion of the bitext. A future system will include an augmented statistical aligner to produce the LF–LE alignments.]

Figure 7.5: Flowcharts describing the end-to-end functioning of the parsing system at both training and testing phases.

approach I present here is geared toward a broad range of languages, this aspect of the algorithm is appealing.

By default, MSTParser operates on unigram, bigram, and trigram features. The unigram features operate on a single word out of the two being examined at any given step by the parser, and use either the word form or part-of-speech tag itself. The bigram features look at both the child and parent words at a given step, while the trigram features add context. According to McDonald et al. (2005), such trigram features help rule out certain unlikely scenarios, such as a noun f1 depending on a verb f3 with an intervening noun f2. For a full description of the feature set used, see §2.4 of McDonald et al. (2005). The learning algorithm used is the Margin Infused Relaxed Algorithm (Crammer and Singer, 2003), which Collins (2002) showed helps avoid over-fitting when training the parser. The unmodified MSTParser is used as a baseline in the subsequent experiments, and a flowchart of the basic MSTParser’s training and testing procedures using IGT data is shown in Fig. 7.5.

7.3.2 Adding Features to the Parser

In order for MSTParser to incorporate the information from the projected trees, new features were needed for the parser. I introduced three new types of features. Let E be the English-language sentence, F the foreign-language sentence, TP the set of edges that make up the projected tree, and RF→E and RE→F the relations that map words from the target language (F) to words in the translation (E), or vice versa. For a summary of all notation used, see Appendix A.

\[
ProjBool(f_i, f_j) =
\begin{cases}
\text{TRUE} & \text{if } (f_i, f_j) \in T_P \\
\text{FALSE} & \text{otherwise}
\end{cases}
\tag{7.1}
\]

\[
ProjTag_{pos_1, pos_2}(f_i, f_j) =
\begin{cases}
\text{TRUE} & \text{if } (f_i, f_j) \in T_P \land POS(f_i) = pos_1 \land POS(f_j) = pos_2 \\
\text{FALSE} & \text{otherwise}
\end{cases}
\tag{7.2}
\]

\[
AlignType(f_i, f_j) =
\begin{cases}
\text{IS\_SINGLE} & \text{if } |R_{F \to E}(f_i)| = 1 \\
\text{IS\_UNALIGNED} & \text{if } |R_{F \to E}(f_i)| = 0 \\
\text{IS\_MATCH} & \text{if } \exists e_a, e_b : e_a \in R_{F \to E}(f_i) \land e_b \in R_{F \to E}(f_j) \land (e_a, e_b) \in T_E \\
\text{IS\_LEFTMERGE} & \text{if } \exists e_i, e_j : e_i, e_j \in R_{F \to E}(f_i) \land isParent(e_i, e_j) \\
\text{IS\_RIGHTMERGE} & \text{if } \exists e_i, f_j : e_i \in R_{F \to E}(f_i) \land f_j \in R_{E \to F}(e_i)
\end{cases}
\tag{7.3}
\]

The ProjBool feature (Eq. 7.1) is the most basic of the three, simply taking on a TRUE value if the edge being considered by the parser also occurs in the projected tree TP. While this feature covers the most cases, I also wanted to see if subdividing the feature by part-of-speech tag might help reduce errors caused by particular POS tags that project poorly. The ProjTag feature (Eq. 7.2) does this by creating separate boolean features for each POS tag

[Figure 7.6 appears here, in five panels showing word-alignment configurations between the language-line tokens (f1 ... fj) and the English tree (e1 ... e5):
(a) the configuration in which the IS_SINGLE alignment feature is fired for token fi;
(b) the configuration in which the IS_UNALIGNED alignment feature is fired for token fi;
(c) the configuration in which the IS_MATCH alignment feature is fired for tokens (fi, fj);
(d) the configuration in which the IS_LEFTMERGE alignment feature is fired for token fi;
(e) the configuration in which the IS_RIGHTMERGE alignment feature is fired for tokens (fi, fj).]

Figure 7.6: Different alignment configurations that trigger the AlignType feature. In each case, dotted alignment arrows indicate that a word may have other words with which it aligns, while solid arrows indicate that the shown alignment is the only one allowed.

combination of parent and child. This risks exposing the parser to a severe sparsity issue, so it is expected that this feature will be complementary to the ProjBool feature. Finally, AlignType (Eq. 7.3) is actually a group of binary features used to subdivide agreement with the projection based upon the type of alignment exhibited by the word on the foreign-language side. AlignType has five sub-features based on possible alignment types, which are illustrated in Fig. 7.6. IS_SINGLE [Fig. 7.6a] is TRUE when a token fi has only a single match to the English-language tree in the alignments. IS_UNALIGNED [Fig. 7.6b] is triggered for foreign-language words that are not aligned with any English word, that is, the set of

English words that fi aligns to is the empty set.

IS_MATCH is TRUE in the case where the edge (fi, fj) being considered by the parser has its parent and child aligned with words ei, ej on the source (English) side that stand in the same parent/child relationship, as seen in Fig. 7.6c. IS_LEFTMERGE is TRUE when the foreign-language word fi aligns with multiple words in the English tree, and one of those words is an ancestor of the others (Fig. 7.6d). IS_RIGHTMERGE is TRUE when the foreign-language word fi is one of multiple foreign words aligned with a single English word ei (Fig. 7.6e).

While the AlignType features and ProjBool features may seem similar, they operate in two somewhat different ways. The ProjBool and ProjTag features compare the partially constructed parse tree to the projected version of the same tree during training and testing, and thus the parser learns the weight of the feature based upon how well the projected tree correlates with the training trees. AlignType, on the other hand, compares the tokens of the current target tree strictly using alignments to the source tree, and does not look at the projected tree. This way, the AlignType features are focused on replicating the structures seen in the alignments between the two languages. My intuition was that combining these two different approaches should result in more robust performance than either feature set alone.
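The following Python sketch restates a subset of these boolean features in executable form; it is an illustration of the definitions in Eqs. (7.1)-(7.3), not the actual (Java) MSTParser feature code, and the two merge-type sub-features are omitted.

def proj_bool(fi, fj, proj_tree):
    """ProjBool: TRUE iff the candidate edge also appears in the projected tree."""
    return (fi, fj) in proj_tree

def proj_tag(fi, fj, proj_tree, pos):
    """ProjTag: the POS-specific variant; returns the name of the feature that
    fires, or None. `pos` maps a language-side index to its POS tag."""
    if (fi, fj) in proj_tree:
        return "ProjTag_{}_{}".format(pos[fi], pos[fj])
    return None

def align_type(fi, fj, align, eng_tree):
    """A subset of the AlignType features: IS_UNALIGNED, IS_MATCH, IS_SINGLE.
    `align` maps a language-side index to the set of English indices it aligns to;
    `eng_tree` is the set of English (child, parent) edges."""
    ei_set, ej_set = align.get(fi, set()), align.get(fj, set())
    if not ei_set:
        return "IS_UNALIGNED"
    if any((ea, eb) in eng_tree for ea in ei_set for eb in ej_set):
        return "IS_MATCH"
    if len(ei_set) == 1:
        return "IS_SINGLE"
    return None   # the two merge configurations are omitted in this sketch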

With these modifications made, the result is a parser that can be trained either with trees formed by projection, in a sort of self-training, or on a set of manually annotated trees.

[Figure 7.7 appears here, in two panels.
(a) System 3a: Aln/Match Feature-Enhanced MST Parser Training. A flowchart demonstrating the training process for the modified MSTParser system, in which standard features, projected dependency-edge features, and match/alignment features extracted from the IGT (language, gloss, and translation lines, with generated F–E alignments) are combined with the training trees ("gold," projected, or System 4 corrected trees) to train the MST parsing model.
(b) System 3b: Aln/Match Feature-Enhanced MST Parser Testing. A flowchart demonstrating the testing process for the modified MSTParser system, in which test sentences and POS tags are extracted from the IGT, the same feature types are generated from the alignments and projected trees, and the trained model produces parsed TF trees.]

Figure 7.7: Flowcharts demonstrating the training and testing process for the modified MSTParser, using additional features provided by projected trees as described in Section 7.3.1. Trees provided at training and testing time can either come from the gold standard, projections (as shown in "System 1," Fig. 7.3), or the corrected trees ("System 4") that will be discussed in Section 7.4.

A depiction of this system can be seen in Fig. 7.7.

7.3.3 Results

The MSTParser was evaluated using a number of different feature combinations, from the baseline features ('B'), to the baseline features plus all the new Bool/Tag/Align features discussed in Section 7.3.1, to a system where the gold-standard trees were provided in place of projected trees for the additional features ('Oracle' systems). The results of these experiments on the modified MSTParser can be seen in Table 7.2, where Table 7.2a shows the results of using manually corrected trees for the training data with projected trees for the additional features, and Table 7.2b shows the results of using projected trees for both the training data and the additional features. The results in Table 7.2a show a fairly low baseline for the unmodified MSTParser, ranging from 51.4% to 80.8% and never outperforming the projection algorithm. This result shows the power of using projection for IGT instances: despite the very small amount of language data in the XL-IGT and HUTP corpora, a projection-based approach is not constrained by data sparsity, and can produce reasonable structures using only the IGT data and projection. When the parser is trained using the additional features, however, adding the features from the projected trees improves performance across the board. The results for the projection-based systems in Table 7.2b are more modest, rarely performing better than the projection results themselves. This is expected, as the parser is being guided even more strongly by the projected trees in this system formulation.

7.3.4 Summary

The results of these experiments showed that using manually corrected trees in combination with projected trees resulted in a better-performing system than either projection alone or

Using Manually-Corrected English Trees for Training Source

System                   Gaelic  German  Hausa  Hindi  Korean  Malagasy  Welsh  Yaqui  Overall
B + Oracle               95.6    98.1    99.3   98.0   99.2    97.9      98.3   96.5   98.0
B + Oracle + POS         89.3    97.9    94.4   97.3   98.3    97.7      94.4   97.3   97.0
B + POS + Bool/Tag/Aln   79.4    91.3    88.0   78.9   89.9    91.9      94.4   86.8   88.4
B + POS + Bool/Aln       80.6    91.5    87.5   80.2   90.3    91.5      93.1   86.3   88.3
B + POS + Aln            75.8    90.1    85.2   78.7   89.7    90.9      89.6   86.5   87.1
B + POS + Bool/Tag       70.2    90.2    88.7   79.1   87.8    89.6      88.9   86.5   86.4
B + POS + Tag            65.9    83.5    82.6   76.7   82.4    79.5      72.9   83.8   79.1
B + POS + Bool           69.8    91.5    88.4   79.5   87.4    90.3      89.2   86.3   86.5
B + POS                  63.1    82.1    83.8   77.2   80.8    81.5      75.0   81.8   79.3
B + Bool/Aln             78.6    89.3    90.3   76.7   91.7    91.5      94.1   86.3   88.1
B + Aln                  76.2    89.3    90.5   73.0   93.0    89.6      90.6   86.0   86.5
B + Bool                 70.2    87.8    87.7   77.3   92.3    89.4      91.3   85.5   87.2
MST Baseline (B)         55.2    62.7    72.2   65.2   80.8    73.0      51.4   66.1   67.3
Projection               78.6    87.9    79.4   67.8   89.7    89.6      89.6   84.8   84.3

(a) Results when manually-corrected trees were used as the main training source for the modified parser, combined with the projection-based features.

Using Projected English Trees for Training Source

System                   Gaelic  German  Hausa  Hindi  Korean  Malagasy  Welsh  Yaqui  Overall
B + Oracle               79.8    95.1    84.5   75.7   98.6    94.2      87.5   96.3   90.4
B + Oracle + POS         72.6    90.7    80.3   67.0   95.7    90.9      86.1   89.5   85.8
B + POS + Bool/Tag/Aln   78.2    87.9    79.2   67.1   90.7    89.4      88.9   85.3   84.3
B + POS + Bool/Aln       78.2    87.2    79.2   66.9   90.5    90.3      88.9   83.8   84.1
B + POS + Aln            73.8    87.5    76.4   65.2   89.4    87.1      84.7   84.5   82.2
B + POS + Bool/Tag       73.4    87.1    78.7   66.1   90.3    89.2      86.1   84.0   83.1
B + POS + Tag            59.9    77.2    69.9   56.5   81.4    77.8      69.1   78.1   72.6
B + POS + Bool           74.2    87.6    79.4   66.7   89.4    88.6      84.4   85.0   82.8
B + POS                  60.7    76.1    70.6   56.0   80.8    77.0      66.0   76.3   71.2
B + Bool/Aln             79.0    87.6    81.0   66.9   89.4    89.6      89.6   85.3   84.2
B + Aln                  76.6    88.2    79.4   66.1   91.1    91.9      87.5   86.0   84.5
B + Bool                 76.2    87.9    80.6   66.9   90.3    90.3      87.9   86.5   84.4
MST Baseline (B)         49.2    61.2    51.9   49.5   80.3    73.0      38.5   71.3   62.6
Projection               78.6    87.9    79.4   67.8   89.7    89.6      89.6   84.8   84.3

(b) Results when projected trees were used as the main training source for the modified parser, combined with the projection-based features.

Table 7.2: UASs for the different feature sets used in the modified parser on the XL-IGT and HUTP corpora. The baseline MST parser features are represented by 'B,' and additional feature sets are combined with this system additively. "Oracle" describes not a feature set, but the result of using the gold-standard trees in place of projected trees at train and test time. The best-performing results for a language on a non-oracle-based system are shown in bold.

a parser trained on the projected trees. The main limitation of this approach, however, is that the parsers created by this system require IGT data at test time, as well as manually corrected trees in a potentially resource-poor language, to receive this boost. These settings are not necessarily appropriate for real-world applications with resource-poor languages; a more generalized approach will be presented in Section 7.5.

7.4 Analyzing and Correcting for Divergence Patterns

While the approach of the previous section was to train a parser that was able to use features extracted from projected dependency trees at test time, I took a different approach to see if errors that were recurrent in projected trees could be analyzed and systematically corrected in the resulting projections.3 An additional goal of this set of experiments was not only to improve parsing performance, but also to analyze language pairs for ways in which translationally equivalent sentences might be represented in fundamentally different ways, thus establishing a likely upper bound for any methods that assume direct correspondence between languages. The concept of linguistic divergence here follows that of Dorr (1994), who describes a number of types of linguistic divergence using the concept of lexical conceptual structures (Jackendoff, 1983), structures that provide a framework to describe divergence arising from syntactic, lexical, or semantic differences. Here, the focus will be restricted to syntactic divergence only, since the systems presented here use predominantly shallow processing and are lexically agnostic as well. In this approach to automatically detecting divergent structures between language pairs, I first propose a metric to measure the degree of matched edges between source and target trees (Section 7.4.1). Second, I define three operations on trees that rewrite the trees to more closely resemble the other language in the pair and resolve some divergence

3 Previously published in Georgi et al. (2014).

[Figure 7.8 appears here, in three panels showing aligned source (s) and target (t) tree fragments: (a) a match alignment, (b) a merge alignment, and (c) a swap alignment.]

Figure 7.8: Definition of match, merge, and swap edges in a tree pair.

patterns (Section 7.4.2). After a discussion of the results between different language pairs (Section 7.4.3), I will discuss how these operations may be learned automatically and applied as post-projection corrections (Section 7.4.4). Finally, the results of the application of the correction rules will be discussed in Section 7.4.5.

7.4.1 Calculating Divergence Across Dependency Structures

One of the key aspects of this method was devising a metric to compare dependency trees cross-linguistically, as most existing tree similarity measures are intended to compare tree representations with the same number of tokens. Comparing between languages, on the other hand, means that the number of tokens can vary. Instead, I look for a method to determine similarity by means of matched edges in the tree, as shown in Fig. 7.8.

Given an IGT, let F be the representation of the language line, consisting of the language words WF (Eq. 7.5) and parse tree TF (Eq. 7.6). The parse tree TF is defined as a set of (parent, child) edges, as shown in Eq. (7.6).

\[
F = (W_F, T_F) \tag{7.4}
\]
\[
W_F = (f_1 \ldots f_n) \tag{7.5}
\]
\[
T_F = \left\{ (f_i, f_k), \ldots, (f_n, f_m) \right\} \tag{7.6}
\]

E is defined similarly, except words in the translation line are denoted as ei, not fi. The alignment A is a set of word pairs:

\[
A = \left\{ (f_i, e_k), \ldots, (f_j, e_l) \right\} \tag{7.7}
\]

An aligned tree pair consists of the 3-tuple (F, E, A). A corpus C, in these experiments, is a set of (F, E, A) tuples.

An edge (fi, fk) in the foreign-language tree is said to match an edge (ei, ek) in the English-language tree if fi is aligned to ei and fk is aligned to ek. Because the alignment between a sentence pair can be many-to-many, I define the following relations RF→E and RE→F, which map a word from one sentence to the set of words in the other sentence.

\[
R_{F \to E}(f_i, A) = \left\{ e \mid (f_i, e) \in A \right\} \tag{7.8}
\]
\[
R_{E \to F}(e_i, A) = \left\{ f \mid (f, e_i) \in A \right\} \tag{7.9}
\]

I then define the boolean function match, as follows:

\[
match(f_i, f_j, T_E, A) =
\begin{cases}
1 & \text{if } \exists e_a, e_b : e_a \in R_{F \to E}(f_i) \land e_b \in R_{F \to E}(f_j) \land (e_a, e_b) \in T_E \\
0 & \text{otherwise}
\end{cases}
\tag{7.10}
\]

That is, an edge (fi, fj) in F matches some edge in E according to A if there exist two target words ea and eb in E such that ea aligns to fi, eb aligns to fj, and (ea, eb) is an edge in E.

Given an aligned tree pair (F, E, A), SentMatch(F, E, A) is defined as the percentage of edges in F that match some edge in E. Again, the dependency type (label) is ignored, as for the purposes of this work I wish to focus on the structure alone and save projection of dependency labels and semantic roles for future work. Given a corpus C, CorpusMatch_{Src→Tgt}(C) is the percentage of edges in the source trees that match some edge in the corresponding target trees. Similarly, CorpusMatch_{Tgt→Src}(C) is the percentage of edges in the target trees that match some edge in the corresponding source trees.

\[
CorpusMatch_{Src \to Tgt}(C) =
\frac{\displaystyle\sum_{(F,E,A) \in C} \; \sum_{(f_i, f_j) \in T_F} match(f_i, f_j, T_E, A)}
     {\displaystyle\sum_{(F,E,A) \in C} |T_F|}
\tag{7.11}
\]
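A small sketch of these calculations follows, assuming trees are represented as sets of (child, parent) edges and a corpus as a list of (T_F, T_E, A) triples; the function names are illustrative, not those of the actual implementation.

def r_f_to_e(fi, alignment):
    """R_F->E: the set of English words aligned to the language word fi."""
    return {e for f, e in alignment if f == fi}

def match(fi, fj, t_e, alignment):
    """1 if the edge (fi, fj) has a counterpart edge in T_E, else 0 (Eq. 7.10)."""
    return int(any((ea, eb) in t_e
                   for ea in r_f_to_e(fi, alignment)
                   for eb in r_f_to_e(fj, alignment)))

def corpus_match_src_to_tgt(corpus):
    """CorpusMatch_{Src->Tgt} over a list of (T_F, T_E, A) triples (Eq. 7.11)."""
    matched = sum(match(fi, fj, t_e, a)
                  for t_f, t_e, a in corpus
                  for (fi, fj) in t_f)
    total = sum(len(t_f) for t_f, t_e, a in corpus)
    return matched / total if total else 0.0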

7.4.2 Defining Tree Operations to Resolve Divergence

When an edge (fi, fk) in a tree does not match any edge in the other tree, it may be caused by one of the following cases:

C1. fi or fk are spontaneous (they do not align with any words in the other tree).

C2. fi and fk are both aligned with the same node ei in the other tree (Fig. 7.8b).

C3. fi and fk are both aligned with nodes in the other tree, ek and ei, but in a reversed parent-child relationship (Fig. 7.8c).

C4. There is some other structural difference not caused by C1–C3.

The first three cases are common. To capture and resolve them, I define three operations on a tree — remove, merge, and swap.

O1 – Remove

The remove operation is used to remove spontaneous words. As shown in Fig. 7.9a, removal of the node l is accomplished by removing the link between node l and its parent, j, and adding links between the parent and the removed node’s children.

The result of this operation can be seen in Fig. 7.9a, using the relation Children, which maps a word to the set of all its children in the tree.

Algorithm 7.4.1: Remove a token w from the tree T.

    Input:  T = {(w_i, w_j), ..., (w_m, w_n)}                    // input tree
    Input:  w                                                    // word to remove
    Output: T'                                                   // modified tree

    T' = T − {(w, w_i) | w_i = parent(w, T)}                     // remove the edge between w and its parent w_i
           − {(w_j, w) | w = parent(w_j, T)}                     // remove the edges for the children of w
           + {(w_j, w_i) | w_i = parent(w, T), w = parent(w_j, T)}
                                                                 // finish by "promoting" the former children of w
                                                                 // to now attach to w's parent, w_i
    return T'

O2 – Merge

The merge operation is used when a node and some or all of its children in one tree align to the same node(s) in the other tree, as can be seen in Fig. 7.8b. The parent j and child l are collapsed into a merged node, as indicated by l+j in Fig. 7.9b, and the children of l are promoted to become children of the new node l+j. The result can be seen in Fig. 7.9b.

Algorithm 7.4.2: Merge a child w_c and parent w_p in the tree T, and "promote" the children of w_c to be children of w_p.

    Input:  T = {(w_i, w_j), ..., (w_m, w_n)}                    // input tree
    Input:  w_c                                                  // child word to merge
    Input:  w_p                                                  // parent word to merge
    Output: T'                                                   // modified tree

    T' = T − {(w_c, w_p)}                                        // remove the child-parent edge
           − {(w_i, w_c) | w_c = parent(w_i, T)}                 // remove the edges for the children of w_c
           + {(w_i, w_p) | w_c = parent(w_i, T)}                 // re-attach those children to w_p
    return T'

O3 – Swap

The swap operation is used when two nodes in one tree are aligned to two nodes in the other tree, but in a reciprocal relationship, as shown in Fig. 7.8c. This operation can be used to handle certain divergence types such as demotional and promotional divergence, which will be discussed in more detail below. Figure 7.9c illustrates how the swap operation takes place by swapping nodes l and j. Node j, the former parent, is demoted, keeping its attachment to its children. Node l, the former child, is promoted, and its children become siblings of node j, the result of which can be seen in Fig. 7.9c.

[Figure 7.9 appears here, in three panels, each showing a tree before and after an operation:
(a) before and after the node l has been removed (O1); afterward, Children(j) = {o, p, m, n};
(b) before and after the nodes l and j have been merged (O2); afterward, Children(j+l) = {o, p, m, n} and Children(h) = {i, j+l, k};
(c) before and after the nodes l and j have been swapped (O3); afterward, Children(j) = {m, n}, Children(h) = {i, l, k}, and Children(l) = {o, p, j}.]

Figure 7.9: Trees showing the results of the operations defined in O1–O3. Children(w) returns the set of words that depend on w. Here we show the value of Children(node) after the operations only if its value is changed by the operations.

Algorithm 7.4.3: Swap a child w_c and parent w_p in the tree T.

    Input:  T = {(w_i, w_j), ..., (w_m, w_n)}                    // input tree
    Input:  w_c                                                  // child word to swap
    Input:  w_p                                                  // parent word to swap
    Output: T'                                                   // modified tree

    T' = T − {(w_c, w_p)} + {(w_p, w_c)}                         // swap the order in the (w_c, w_p) edge
           − {(w_p, w_i) | w_i = parent(w_p, T)}                 // remove the edge from w_p to its parent
           + {(w_c, w_i) | w_i = parent(w_p, T)}                 // add an edge from w_p's former parent to w_c
    return T'
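For concreteness, the following is a compact Python sketch of the three operations over the same edge-set representation of a tree used in the algorithms above; it is an illustration rather than the dissertation's implementation, and it assumes the token being removed or swapped is not the root of the tree.

def parent(w, tree):
    """Return the parent of w in a tree given as a set of (child, parent) edges."""
    return next((p for c, p in tree if c == w), None)

def children(w, tree):
    """Return the set of direct children of w."""
    return {c for c, p in tree if p == w}

def remove(w, tree):
    """O1: delete w; its children re-attach to w's former parent."""
    p = parent(w, tree)
    kept = {(c, h) for c, h in tree if c != w and h != w}
    return kept | {(c, p) for c in children(w, tree)}

def merge(wc, wp, tree):
    """O2: collapse the child wc into its parent wp; wc's children attach to wp."""
    kept = {(c, h) for c, h in tree if c != wc and h != wc}
    return kept | {(c, wp) for c in children(wc, tree)}

def swap(wc, wp, tree):
    """O3: make wc the parent of wp; wc takes over wp's former attachment point."""
    gp = parent(wp, tree)
    kept = {(c, h) for c, h in tree if (c, h) not in {(wc, wp), (wp, gp)}}
    new = kept | {(wp, wc)}
    if gp is not None:
        new |= {(wc, gp)}
    return new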

Calculating Tree Matches After Applying Operations

The operations O1–O3 are proposed to handle the common divergence cases in C1–C3. To measure how common C1–C3 are in a language pair, I designed an algorithm that transforms a tree pair based on a word alignment. The algorithm takes a tree pair (F, E) and a word alignment A as input and creates a modified tree pair (F', E') and an updated word alignment A' as output. It has several steps. First, spontaneous nodes (nodes that do not align to any node in the other tree) are removed from each tree. Next, if a node and its parent align to the same node in the other tree, they are merged and the word alignment is changed accordingly. Finally, the swap operation is applied to a node fi and its parent fp in one tree if they align to ei and ep respectively and ep is a child of ei in the other tree. The pseudocode of the algorithm is shown in Algorithm 7.4.4. Now, given a corpus C and word alignments between each sentence pair, I can measure the impact of C1–C3 by comparing CorpusMatch_{Src→Tgt}(C) scores before and after applying operations O1–O3. This process can also reveal some patterns of divergence (e.g., what types of nodes are often merged), and the patterns can later be used to enhance existing projection algorithms.

Algorithm 7.4.4: Algorithm for altering an aligned tree pair.

    Input:  c = (F, E, A)                                        // a parallel sentence pair with alignment
    Output: c' = (F', E', A')                                    // modified output sentence pair

    Let F = (W_F, T_F)
    Let E = (W_E, T_E)
    Let A = {(f_i, e_j), ..., (f_k, e_l)}

    // Step 1(a): Remove spontaneous nodes from F
    foreach f_i ∈ W_F do
        if there is no e_j such that (f_i, e_j) ∈ A then
            T_F = Remove(f_i, T_F)                               // see Algorithm B.1.2

    // Step 1(b): Remove spontaneous nodes from E
    foreach e_j ∈ W_E do
        if there is no f_i such that (f_i, e_j) ∈ A then
            T_E = Remove(e_j, T_E)                               // see Algorithm B.1.2

    // Step 2(a): Find nodes to merge in F and merge them
    foreach (f_i, e_j) ∈ A do
        let f_p = parent(f_i, T_F)
        if (f_p, e_j) ∈ A then
            T_F = Merge(f_i, f_p, T_F)                           // see Algorithm 7.4.2
            A = A − {(f_i, e_j)}

    // Step 2(b): Find nodes to merge in E and merge them
    foreach (f_i, e_j) ∈ A do
        let e_p = parent(e_j, T_E)
        if (f_i, e_p) ∈ A then
            T_E = Merge(e_j, e_p, T_E)                           // see Algorithm 7.4.2
            A = A − {(f_i, e_j)}

    // Step 3: Find nodes to swap in F and swap them
    foreach (f_i, e_j) ∈ A do
        let f_p = parent(f_i, T_F)
        if there exists e_c such that e_j = parent(e_c, T_E) and (f_p, e_c) ∈ A then
            T_F = Swap(f_i, f_p, T_F)                            // see Algorithm 7.4.3

    return (F', E', A')

Relationship to Dorr (1994)

Dorr (1994) lists seven types of divergence for language pairs. While the analysis method presented here is more coarse-grained than the Lexical Conceptual Structure (LCS) that Dorr proposes, it is nonetheless able to capture some of the same cases. For instance, Fig. 7.10 illustrates an example of what Dorr identified as "promotional" divergence, where usually, a dependent of the verb goes in English, is "promoted" to become the main verb, suele, in Spanish. In this case, the direction of the dependency between usually and goes is reversed in Spanish, and thus the swap operation can be applied to the English tree to produce a tree that looks very much like the Spanish tree. A similar operation is performed for demotional divergence cases, such as aligning I like eating with the German translation Ich esse gern ("I eat likingly"). Here, the main verb in English (like) is demoted to an adverbial modifier in German (gern). The swap operation is applicable to both types of divergence and treats them equivalently, so it essentially handles a superset of promotional and demotional divergence, namely, head-swapping.

[Figure 7.10 appears here: the English dependency tree for "John usually goes home," rooted at goes, alongside the Spanish tree rooted at suele, with the interlinear example:

    Juan suele    ir a casa
    John tends-to go home
    ''John usually goes home'']

Figure 7.10: An example of promotional divergence from Dorr (1994). The reversal in the parent-child relation is handled by the Swap operation.

Another type of divergence that can be captured by our approach is Dorr's "structural" divergence type, as illustrated in Fig. 7.11. The difference between the English and Spanish structures in this case is the form of the argument that the verb takes. In English, it is a noun

[Figure 7.11 appears here: the English dependency tree for "John entered the house," rooted at entered, alongside the Spanish tree rooted at entró, with the interlinear example:

    Juan entró   en la  casa
    John entered in the house
    ''John entered the house'']

Figure 7.11: Example of structural divergence, which is handled by the remove operation.

phrase; in Spanish, it is a prepositional phrase. While the tree operations defined previously do not explicitly recognize this difference in syntactic labels, the divergence can be handled by the remove operation, where the spontaneous en on the Spanish side is removed.

Next, Dorr's description of conflational divergence lines up well with the merge operation (see Fig. 7.9b). Fig. 7.12 illustrates an example for English and Hindi, where both sides have spontaneous words (e.g., to and a in English and ne, se, and ko in Hindi) and a causative verb in Hindi corresponds to multiple verbs in English. Fig. 7.12(b) shows the original tree pair, Fig. 7.12(c) demonstrates the altered tree pair after removing spontaneous words from both sides, and Fig. 7.12(d) shows the tree pair after the English verbs are merged into a single node. It is clear that the remove and merge operations make the Hindi and English trees much more similar to each other.

In addition to the four divergence types mentioned above, additional operations could be added to handle other divergence types. For instance, if dependency types (e.g., patient, agent) are given in the dependency structure, we can define a new operation that changes the dependency type of an edge to account for thematic divergence, where thematic roles are switched, as in I like Mary in English vs. María me gusta a mí ("Mary pleases me") in Spanish. Similarly, an operation that changes the POS tag of a word can be added to cover categorial divergence, where words representing the same semantic content have different word categories in the two languages, such as I am hungry in English versus Ich habe

mohana ne    kala      Arif se     mInA ko    kiwAba xilavAyI
Mohan  [erg] yesterday Arif [inst] Mina [dat] book   give-caus
''Mohan caused Mina to be given a book through Arif yesterday.''

(a) Interlinear text of a sentence pair in Hindi (hin).

[Tree diagrams for panels (b)–(d) appear here.]

(b) Initial trees showing spontaneous words on both sides.

(c) Altered trees after removing spontaneous words from both sides, and showing conflational divergence between multiple English words and a single Hindi word.

(d) Altered trees after merging multiple words on the English side.

Figure 7.12: Case of conflational divergence, handled by remove and merge operations.

                         English→Hindi                                    Hindi→English
            Match   Swap   Unaligned   Merge   Edges      Match   Swap   Unaligned   Merge   Edges
Baseline    47.7    9.1    20.9        1.6     794        46.3    5.6    20.7        1.7     816
+Remove     66.1    11.7   0           2.1     622        63.4    5.3    0           2.2     647
+Merge      69.5    12.3   0           0       586        69.2    4.6    0           0       587
+Swap       90.3    0.3    0           0       586        89.9    2.4    0           0       587

Table 7.3: Breakdown of edges as operations are applied to the English ↔ Hindi language pair (from bottom up) on the HUTP corpus. The "Edges" column represents the total number of edges in the trees of the left-hand language of the pair. The numbers given in the other columns are the percentages of those edges that are either in a match, swap, or merge alignment, or the edges for which the child is unaligned.

Hunger (“I have hunger”) in German.

In contrast to Dorr’s divergence types, whose identification requires knowledge about the language pairs, my operations on the dependency structure rely on word alignment and tree pairs and can be applied automatically.

7.4.3 Match Results

By running Algorithm B.2.1, CorpusMatch_{Src→Tgt} and CorpusMatch_{Tgt→Src} can be calculated before and after each operation in order to measure how the operation affects the percentage of matched edges in the corpus. As the operations are applied, the percentage of matches between the trees should increase until all the divergence cases that can be handled by operations O1–O3 have been resolved. At this point, the final match percentage can be seen as an estimate of the upper bound on the performance of a projection algorithm, if C1–C3 can be identified and handled by O1–O3. Table 7.3 shows the full results of this process for English and Hindi, while Table 7.4 shows a summary of the results for the remaining seven language pairs.

            Gaelic   German   Hausa   Malagasy   Korean   Welsh   Yaqui
Baseline    72.0     76.7     54.4    57.4       56.0     75.4    54.4
+Remove     87.8     93.9     95.7    88.9       88.1     95.1    90.9
+Merge      92.5     95.4     97.5    97.4       95.4     97.2    95.9
+Swap       94.1     96.8     97.5    98.0       96.1     98.2    96.2

Table 7.4: Summary of match percentages for the remaining seven language pairs of the XL-IGT corpus.

Breakdown By POS

After performing the operations as described in Section 7.4.3, breaking down the operations by language and POS, as is done in Table 7.5, provides a good opportunity to check that the defined operations conform with expectations for specific languages. For instance, row 1 in Table 7.5a shows Modals (MD) merging with a parent verb (VB). This is in line with instances such as Figure 7.12c, where Hindi states in a single verb a causative meaning that is typically expressed as separate words in English. This does not appear to be a very frequent occurrence, however, as it only occurs for 42.9% of MD→VB dependencies. Row 3 shows the case in Hindi where auxiliary verbs (VAUX) merge with main verbs (VM). These cases typically represent those where Hindi expresses tense as an auxiliary verb, whereas English tense is expressed by inflection on the verb. Rows 5 and 6 show the English→German pair merging many nouns, as multiple English words are expressed as compounds in German. In another case, rows 1–2 in Table 7.5b show that all Hindi nouns undergo swap with prepositions, as Hindi uses postpositions instead of prepositions, and the HUTP corpus places these nouns as the heads of the phrases. With regard to spontaneous words in English and Hindi, row 3 in Table 7.5c shows that 69.8% of case markers (PSP) were removed from Hindi that were either absent in English or realized as inflections on the noun, while 86% of determiners in English were removed, as they are not present in Hindi (row 1).

(a) Breakdown of merge operations between language pairs by POS tag.

    Lang Pair   Child POS   Parent POS   % Merge
    Eng→Hin     MD          VB              42.9
                NN          NN              14.3
    Hin→Eng     VAUX        VM              45.4
                NN          VM               5.5
    Eng→Ger     NN          NNS             66.7
                NN          NN              65.4

(b) Breakdown of swap operations between language pairs by POS tag.

    Lang Pair   Child POS   Parent POS   % Swap
    Hin→Eng     NN          IN               100
                NNP         IN                20
    Ger→Eng     NN          APPRART         72.7
                CC          NN              61.5

(c) Breakdown of removal operations between language pairs by POS tag.

    Lang Pair   POS Tag   % Remove
    Eng→Hin     DT            86.4
                TO            75.6
    Hin→Eng     PSP           69.8
                VAUX          18.6
    Eng→Ger     POS           57.1
                DT            20.2
    Ger→Eng     PRF           85.2
                ADV           43.9

Table 7.5: Breakdown of significant merge, swap, and removal statistics for various language pairs in the XL-IGT and HUTP corpora, where the language to the left of the arrow is the one being altered.

[Figure: English and Hindi dependency trees for the sentence “I am mentally in America” (Hindi: merA mana amarIkA meM hE, gloss: my mind America in is), with word alignments between the two trees shown as dotted lines.]

Figure 7.13: A tree pair that still has unmatched edges after applying the three operations described above. The dotted line indicates word alignment that would be needed to resolve the divergence with the extended merge operation.

Remaining Cases

After applying the three operations, there may still be unmatched edges. An example is given in Figure 7.13.4 The dependency edge (in, America) can be reversed by the swap operation to match the Hindi counterpart. The difficult part is that the adverb mentally in English corresponds to the noun mana (mind) in Hindi. If the word alignment includes the three word pairs as indicated by the dotted lines, one potential way to handle this kind of divergence is to extend the definition of merge to allow edges to be merged on both sides simultaneously – in this case, merging am and mentally on the English side, and hE (is) and mana (mind) on the Hindi side.

7.4.4 Learning Correction Patterns

After looking closely at how frequently particular patterns occurred in Section 7.4.3, the next step was to devise a series of simple corrections that could be applied to the trees as a post-processing step. By examining the projected and gold-standard tree pairs, it was possible to learn when and how these corrections needed to be applied. Using the three alignment types

4. It is a topic of debate whether mentally in English should depend on in or am. If it depends on in, handling the divergence would be more difficult.

[Figure: IGT instance “rAma buxXimAna lagawA hE” (gloss: Ram intelligent seem be-Pres; translation: “Ram seems intelligent”), shown beside the English and Hindi dependency trees.]

Figure 7.14: An example of merged alignment, where the English word seems aligns to two Hindi words hE and lagawA. Beside the IGT instance are the dependency trees for English and Hindi. Dotted arrows indicate word alignment, and the solid arrow indicates that hE should depend on lagawA.

discussed in Section 7.4.2, the following improvements could be made to the resultant trees:

O1. Merge: better-informed choice of head for multiply-aligned words.
O2. Swap: post-projection correction of frequently swapped word pairs.
O3. Spontaneous: better-informed attachment of target spontaneous words.

The details of the enhancements are explained below.

Merge Correction

“Merged” words, or multiple words on the target side that align to a single source word, are problematic for the projection algorithm because it is not clear which target word should be the head and which word should be the dependent. An example is given in Figure 7.14, where the English word seems aligns to two Hindi words hE and lagawA. In these merge cases, the training trees can be used to analyze which word is most likely to be the head and which the dependent. This learning process can be generalized away from the specific words by using the POS tags of the words in the merge alignment. The process is illustrated in Fig. 7.15. In this example, the target words tm and tn are both aligned with the source word si whose POS tag is POSi, and tm appears before tn in the target sentence. Going through the examples of merged alignments in the training data, a count is kept for the POS tag of the source word and the position of the head on the target side.5 Based on these counts for a given language, the system will generate rules such as the ones in Figure 7.15(c), which say that if a source word whose POS is POSi aligns to two target words, the probability of the right target word depending on the left one is 75%, and the probability of the left target word depending on the right one is 25%. Maximum likelihood estimation (MLE) is used to calculate this probability.

The projection algorithm will use those rules to handle merged alignment; that is, when a source word aligns to multiple target words, the algorithm determines the direction of the dependency edge based on the direction preference stored in the rules. In addition to rules for an individual source POS tag, this method also keeps track of the overall direction preference for all the merged examples in that language. For merges in which the source POS tag is unseen or there are no rules for that tag, this language-wide preference is used as a backoff.

5. The position of the head is used, not the POS tag of the head, because the POS tags of the target words are not available when running the projection algorithm on the test data.
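A minimal sketch of this rule learning is given below, assuming the training examples have already been reduced to (source POS tag, head side) pairs, where the head side is "left" or "right"; the function and variable names are illustrative rather than Intent's actual implementation.

    from collections import Counter, defaultdict

    def learn_merge_rules(examples):
        """examples: iterable of (source_pos, head_side) pairs, head_side in {"left", "right"}.

        Returns per-POS rules and a language-wide backoff distribution, each mapping a
        head side to its MLE probability.
        """
        by_pos = defaultdict(Counter)
        overall = Counter()
        for pos, side in examples:
            by_pos[pos][side] += 1
            overall[side] += 1

        def mle(counter):
            total = sum(counter.values())
            if total == 0:
                return {"left": 0.5, "right": 0.5}
            return {side: counter[side] / total for side in ("left", "right")}

        return {pos: mle(c) for pos, c in by_pos.items()}, mle(overall)

    def merge_direction(source_pos, rules, backoff):
        """Pick the more probable head side for a merged alignment, backing off to the
        language-wide preference when the source POS tag was unseen in training."""
        dist = rules.get(source_pos, backoff)
        return max(dist, key=dist.get)

With counts in the proportions shown in Figure 7.15c (left 0.75, right 0.25), merge_direction would return "left" for that source tag.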

[Figure: a source word si with POS tag POSi aligns to two target words tm and tn, where tm precedes tn in the target sentence and tm is the parent of tn (a “left” dependency). The derived rules record the direction preference, e.g. POSi → left 0.75 and POSi → right 0.25.]

(a) Alignment between a source word and two target words, where one target word tm is the parent of the other word tn. (b) Target sentence showing the “left” dependency between tm and tn. (c) Rules for handling merged alignment.

Figure 7.15: Example of merge alignment and derived rules.

    siwA   ne    pAnI    se     GadZe      ko    BarA
    Sita   erg   water   with   clay-pot   acc   filled
    ‘‘Sita filled the clay-pot with water’’

(a) An Interlinear Glossed Text (IGT) instance in Hindi and word alignment between the gloss line and the English translation. (b) Dependency parse of the English translation. (c) English words are replaced with Hindi words and the spontaneous word “the” is removed from the tree. (d) Siblings in the tree are reordered based on the word order of the Hindi sentence, and spontaneous Hindi words are attached as indicated by dotted lines. The words pAnI and se are incorrectly inverted, as indicated by the curved arrow.

Figure 7.16: An example of projecting a dependency tree from English to Hindi.

Swap Correction

An example of swapped alignment is in Fig. 7.17a, where (sj, si) is an edge in the source tree, (tm, tn) is an edge in the target tree, and sj aligns to tn and si aligns to tm. Figure 7.16d shows an error made by the projection algorithm due to swapped alignment. In order to correct such errors, I count the number of (POSchild, POSparent) dependency edges in the source trees, and the number of times that the directions of the edges are reversed on the target side. Figure 7.17b shows a possible set of counts resulting from this approach. Based on the counts, we keep only the POS pairs that appear in at least 10% of training sentences and for which the percentage of swaps is no less than 70%.6 We say that those pairs trigger a swap operation. At test time, swap rules are applied as a post-processing step to the projected tree. After the projected tree is completed, the swap handling step checks each edge in the source tree.

6. These thresholds are set empirically.

(a) A swapped alignment between source words sj and si and target words tm and tn.

(b) Example set of learned swap rules. “Swaps” counts the number of times the given (child, parent) pair is seen in a swap configuration on the source side, and “Total” is the number of times said pair occurs overall.

    POS Pair         Swaps   Total   %
    (POSi, POSj)        16      21   76
    (POSk, POSl)         1       1   100
    (POSn, POSo)         1      10   10

Figure 7.17: Example swap configuration and collected statistics.

If the POS tag pair for the edge triggers a swap operation, the corresponding nodes in the projected tree will be swapped (see Fig. 7.9c).
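The learning side of this correction can be sketched as follows, assuming each training sentence has been reduced to (child POS, parent POS, swapped) edge records; the 10% and 70% thresholds mirror the values above, and all names are illustrative only.

    from collections import Counter

    def learn_swap_triggers(sentences, min_sent_frac=0.10, min_swap_rate=0.70):
        """sentences: list of training sentences, each a list of
        (child_pos, parent_pos, swapped) tuples, where swapped is True if the
        corresponding edge is reversed on the target side."""
        pair_total = Counter()       # occurrences of each (child, parent) POS pair
        pair_swapped = Counter()     # occurrences found in a swap configuration
        pair_sentences = Counter()   # number of sentences containing the pair
        for sent in sentences:
            seen_here = set()
            for child_pos, parent_pos, swapped in sent:
                pair = (child_pos, parent_pos)
                pair_total[pair] += 1
                if swapped:
                    pair_swapped[pair] += 1
                seen_here.add(pair)
            for pair in seen_here:
                pair_sentences[pair] += 1

        triggers = set()
        for pair, total in pair_total.items():
            if (pair_sentences[pair] / len(sentences) >= min_sent_frac
                    and pair_swapped[pair] / total >= min_swap_rate):
                triggers.add(pair)
        return triggers

At test time, any projected edge whose (child POS, parent POS) pair is in the returned trigger set has its nodes swapped, as described above.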

Spontaneous Reattachment

Target spontaneous words are difficult to handle because they do not align to any source word and thus there is nothing to project to them. To address this problem, two types of information are collected from the training data. First, I keep track of all the lexical items that appear in the training trees, and the relative position of their head. This lexical approach may be useful in handling closed-class words which account for a large percentage of spontaneous words. Second, we use the training trees to determine the favored attachment direction for the language as a whole.

At test time, for each spontaneous word in the target sentence, if it is one of the words for which statistics have been gathered from the training data, it is attached to the next aligned word in the preferred direction. If the word is unseen, it is attached using the overall language preference as a backoff.
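A rough sketch of this attachment decision, assuming the training pass has already produced a per-word preferred direction and a language-wide default (all names here are hypothetical):

    def attach_spontaneous(position, aligned_positions, word, word_prefs, default_side):
        """Choose a head position for an unaligned (spontaneous) target word.

        position: index of the spontaneous word in the target sentence.
        aligned_positions: indices of target words that received a projected head.
        word_prefs: {word: "left" or "right"} learned from the training trees.
        default_side: language-wide preferred direction, used for unseen words.
        """
        side = word_prefs.get(word, default_side)
        if side == "left":
            left = [p for p in aligned_positions if p < position]
            return max(left) if left else None     # nearest aligned word to the left
        right = [p for p in aligned_positions if p > position]
        return min(right) if right else None       # nearest aligned word to the right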

[Flowchart: IGT language lines (F), gloss lines, and translation lines (E) are used to generate F–E word alignments and to parse the English translations into TE trees; System 1 produces projected TF trees, which are compared against gold TF trees to learn correction rules (System 4a: Corrected Projection Training).]

(a) Flowchart illustrating the training phase of learning the correction patterns.

[Flowchart: at test time, System 1 produces projected TF trees from the IGT language, gloss, and translation lines, and the learned correction rules are applied to yield auto-corrected TF trees (System 4b: Corrected Projection Testing).]

(b) Flowchart demonstrating the testing phase of applying correction patterns to test IGT data.

Figure 7.18: Flowcharts demonstrating the training and testing steps of the enhanced pro- jection system (System 4).

Resulting System

The system that results from integrating these correction rules into the projection algorithm is illustrated in Fig. 7.18. In addition to the above enhancements to the projection algorithm itself, I ran experiments training a dependency parser on the improved projections, reusing the additional parser features from Section 7.3.1. The resulting system is a combination of the one shown in Fig. 7.18 feeding into the training and testing trees provided to “System 3” shown in Fig. 7.7.

7.4.5 Correction Rule Results

I ran two different experiment settings using these automatically applied correction rules; the first was to use the correction rules directly on projected sentences. These results are shown in Table 7.6a. The second setting was to use the improved projections at training and testing for the projection-guided parser described in Section 7.3.1. These results are shown in Table 7.6b. In both of the tables, the “Best” row uses the enhanced projection algorithm. The “Baseline” rows use the original projection algorithm from Section 7.2, where the word in the parentheses indicates the direction of merge. The “Error Reduction” row shows the error reduction of the “Best” system over the best performing baseline for each language. The “No Projection” row in the second table shows parsing results when no features from

                   Gaelic   German   Hausa   Hindi   Korean   Malagasy   Welsh   Yaqui   Overall
Best                 87.7     88.7    90.1    77.4     91.8       93.1    94.9    88.0      88.0
Baseline (Right)     86.9     88.0    79.3    57.5     90.3       89.6    89.8    87.3      87.3
Baseline (Left)      77.0     88.0    79.5    68.1     88.9       89.6    89.8    84.3      84.3
Error Reduction       6.1      5.7    51.7    29.3     14.6       33.4    50.0     5.9       5.9

(a) The accuracies of the original projection algorithm (the Baseline rows) and the enhanced algorithm (the Best row) on eight language pairs. For each language, the best performing baseline is in italic. The last row shows the error reduction of the Best row over the best performing baseline, which is calculated as ErrorReduction = (Best − BestBaseline) / (100 − BestBaseline) × 100.

                                   Gaelic   German   Hausa   Hindi   Korean   Malagasy   Welsh   Yaqui   Overall
Best                                 81.4     92.9    88.7    81.4     93.0       93.1    94.9    89.3      89.3
Baseline (Right)                     81.0     90.5    87.6    78.0     92.4       92.4    94.2    88.3      88.3
Baseline (Left)                      81.0     90.5    89.2    79.6     91.0       92.4    94.2    87.9      87.9
No Projection                        55.2     62.7    72.2    65.2     80.8       73.0    91.3    66.1      66.1
Error Reduction (BestBaseline)        2.1     25.7    -4.3     8.4      8.0        8.2    11.8     8.5       8.5
Error Reduction (No Projection)      58.4     81.0    59.5    46.5     63.4       74.2    41.2    68.4      68.4

(b) The parsing accuracies of the MSTParser with or without new features extracted from projected trees. There are two error reduction rows: one is with respect to the best performing baseline for each language, the other is with respect to No Projection, where the parser does not use features from projected trees.

Table 7.6: Unlabeled Attachment Scores of enhanced projection system (System 4) on the eight languages of the XL-IGT and HUTP corpora.

the projected trees are added to the parser, and the last row in that table shows the error reduction of the “Best” row over the “No Projection” row. The results in Table 7.6 show that using features from the projected trees provides a big boost to the quality of the statistical parser, reinforcing what was previously demonstrated in Section 7.3.1. Furthermore, the enhancements laid out in Section 7.4.4 improve the performance of both the projection algorithm and the parser that uses features from projected trees. The degree of improvement may depend on the properties of a particular language pair and the labeled data for that language pair. For instance, the swap rule is used frequently for the Hindi–English pair because of the choice of headedness for adpositions between Hindi and English. As a result, the enhancement for the swapped alignment alone results in a large error reduction, as in Table 7.7. This table shows the projection accuracy on the Hindi data when each of the three enhancements is turned on or off. The rows are sorted by descending overall accuracy, and the row that corresponds to the system labeled “Best” in Table 7.6b is in bold.
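To make the error-reduction figures concrete: for Hausa in Table 7.6a, the best-performing baseline is 79.5, so the reduction achieved by the Best system is (90.1 − 79.5) / (100 − 79.5) × 100 ≈ 51.7, the value reported in the last row of that table.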

Spont   Swap   Merge Direction   Accuracy
  ✓      ✓     Left                  78.1
  ✓      ✓     Informed              77.4
         ✓     Left                  76.7
         ✓     Informed              76.1
  ✓            Left                  69.5
  ✓            Informed              69.0
               Left                  68.1
               Informed              67.6
  ✓      ✓     Right                 66.3
         ✓     Right                 65.0
  ✓            Right                 58.8
               Right                 57.5

Table 7.7: Projection accuracy on the Hindi data, with the three enhancements turned on or off. The “spont” and “swap” columns show a check mark when the corresponding enhancement is turned on. The merge direction indicates whether a left or right choice was made as a baseline, or whether the choice was informed by the rules learned from the training data.

Discussion

These sets of experiments have two important takeaways: first, that using projection to transfer DSs between languages is indeed more complicated than just finding the optimal word alignments between language pairs – there are some fundamental differences in how the same sentences are structured in different languages. Second, the results of these experiments show that there is indeed room to improve upon the projection algorithms, and that the projection-guided parser performs even better with these improved projections.

In the end, however, this approach is still fundamentally limited by the requirement of having gold-standard parse trees to achieve optimal results, as well as requiring parallel data at testing time. In the next set of experiments, I examine what possibilities there are for using these IGT resources for producing a system capable of parsing monolingual input data.

7.5 Training Dependency Parsers for Monolingual Data

While the previous approaches to improving DS projection did indeed result in improvements, they assumed an idealized set of conditions, requiring IGT data as input, manually-corrected DS trees, and gold-standard POS tags. Given the target of resource-poor languages, however, it would be ideal if the parser produced by the system had the ability to parse monolingual sentences in the target language without any further intervention. This system would be even more powerful if no language-specific knowledge were required, aside from that which is provided by the IGT found in Odin. This end-to-end pipeline from Odin data to a monolingual DS parser is what will be described in this section. I will start with a discussion of how the parser will be trained (Section 7.5.1), followed by the different scenarios that were used for the experiments (Section 7.5.2), before presenting the results (Section 7.5.3).

7.5.1 Training the Monolingual Parser

The previous dependency parsing systems, Systems 1–4, required IGT data at test time to produce parses. Although these systems were able to leverage additional information through the use of IGT test data, here I would like to produce a system capable of training on IGT data, but running only on monolingual test data. With this goal, I now present the system shown in Fig. 7.19. Here, the IGT instances in Odin are used as training data for the MSTParser, as described in Section 7.3, but this time without the assumption of any manual DSs or POS tags. Testing is carried out the same as in the example shown in Fig. 7.5a. This system represents the end of a pipeline of enrichment that includes multiple word alignment methods, POS tag information added by leveraging the unique gloss-line features of IGT, and the ability to draw this data from any of the 1,487 languages in Odin.
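To make the training side of this pipeline concrete, the sketch below serializes one projected tree as a CoNLL-style training sentence; the exact file format expected by MSTParser differs in its details, and all names here are illustrative.

    def tree_to_training_rows(tokens, pos_tags, heads):
        """Serialize one projected dependency tree as tab-separated training rows.

        tokens: language-line tokens; pos_tags: projected or classifier-produced POS tags;
        heads: 0-based index of each token's head, or -1 for the root.
        """
        rows = []
        for i, (tok, pos, head) in enumerate(zip(tokens, pos_tags, heads)):
            rows.append("\t".join([str(i + 1), tok, pos, str(head + 1)]))  # 1-based ids, 0 = root
        return "\n".join(rows)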

[Flowchart: IGT instances provide language lines (F), gloss lines, and translation lines (E). The translation line is parsed with an English parser; POS tags are added to the IGT instances by either classification or projection over the available alignments (Heur −POS, Heur +POS, Stat, Stat+Heur); DS projection (System 1) then produces TF trees for the IGT instances, which are used to train the MSTParser, yielding an MSTParser model (System 5a: Monolingual Pipeline Training).]

Figure 7.19: Flowchart illustrating the pipeline for training a monolingual DS parser from IGT. This is essentially the same process as the system shown in Fig. 7.5, except that POS tags are not assumed to be present on the IGT instances, and a choice between the classification and projection approaches described in Section 6.3 and Section 6.2 is used to add POS tags to the training TF trees.

7.5.2 Monolingual Parsing Settings

Language     ISO   All         Filtered    --------- Fully Aligned Instances ---------
                   Instances   Instances   Stat     Stat+Heur    Heur     Heur+POS
Bulgarian    bul       1,802         659     302          302     158          292
French       fra       7,412       1,360     479          479     216          394
German       deu      10,814       3,233   1,659        1,658     813        1,464
Gaelic       gle         954         354     176          174      65          125
Hausa        hau       2,504       1,149     398          398      50          221
Indonesian   ind       1,699       1,101     524          524     131          427
Italian      ita       3,513         964     439          439     166          363
Korean       kor       5,383       2,287   1,496        1,496     575        1,181
Malagasy     plt       1,402         660     146          147     113          242
Portuguese   por         742         251     118          118      65          115
Spanish      spa       3,390       1,340     593          593     249          451
Swedish      swe       1,628         351     180          179      87          159
Welsh        cym         404         196      95           95      54           81
Yaqui        yaq         664         415     303          303     124          277

Table 7.8: Breakdown of Odin instances, with all available instances; instances filtered for having Language, Gloss, Translation lines, and 1-to-1 language↔gloss alignment; and number of instances with every language line token aligned with one or more translation line tokens for each of the languages tested against the XL-IGT and UD-2.0 corpora.

With the monolingual parsing approach, there are several options to consider for training the parser. Since the parser is being trained on projected IGT instances as in Section 7.2, there are four available sources for word alignment. An additional setting to consider, given the reliance upon projected DSs, is whether or not to use IGT instances for which there is not complete alignment between translation and gloss lines.7 Table 7.8 gives a breakdown of the data that will be available for each of these settings. This table shows the number of IGT instances that were available in total for each language under “All

7. As is the case in all the experiments using the raw Odin data, IGT instances that do not have an equal number of whitespace-delimited tokens on the gloss and language lines (ignoring punctuation) are discarded. That is, one-to-one alignment between gloss and language tokens is required, to attempt to cut down on misaligned projection.

Instances,” the number of instances for which Language, Gloss, and Translation lines are all present and the Language and Gloss lines have a 1-to-1 alignment (“Filtered Instances”), and then the number of instances for which every language line token is aligned with one or more translation line tokens (“Fully Aligned Instances”) for each alignment method. Each additional filtering step greatly reduces the number of available instances, so the more aggressive the filtering, the more quantity is traded for quality of annotation. Finally, I will be using the UD-2.0 corpus (McDonald et al., 2013), which was created from “newswire, weblogs and/or consumer reviews” (p. 94), for evaluation. This means a substantial difference in both the length and the domain of the text, so I also evaluate against held-out IGT data with gold-standard DSs from the XL-IGT corpus. In addition, I also perform evaluation on the UD-2.0 corpus using only sentences with a length of 10 words or fewer.
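These filtering steps can be summarized in a short sketch; the instance representation and helper names below are hypothetical and do not correspond to Odin's or Intent's actual APIs.

    import string

    def tokens(line):
        """Whitespace tokenization, ignoring punctuation-only tokens."""
        return [t for t in line.split() if t.strip(string.punctuation)]

    def is_filtered_instance(instance):
        """Language, gloss, and translation lines are all present, and the language and
        gloss lines have the same number of tokens (1-to-1 alignment)."""
        lang, gloss, trans = (instance.get(k) for k in ("lang", "gloss", "trans"))
        if not (lang and gloss and trans):
            return False
        return len(tokens(lang)) == len(tokens(gloss))

    def is_fully_aligned(instance, alignment):
        """Every language-line token is aligned to at least one translation-line token.
        alignment: set of (language_index, translation_index) pairs."""
        aligned = {i for i, _ in alignment}
        return all(i in aligned for i in range(len(tokens(instance["lang"]))))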

7.5.3 Monolingual Parsing Results

In presenting the results of these experiments, I will start with the most favorable settings, testing the trained parsers on IGT language data that has been stripped of annotation and excluded from the training data.

Results on XL-IGT Corpus

The tables in Table 7.9 show the Unlabeled Attachment Scores for four settings. Tables 7.9a and 7.9b show the scores when the parser is trained with projected tags and classifier-produced tags, respectively, and when all instances are used for training, whether or not all words on the language line are aligned with a translation token. Tables 7.9c and 7.9d show the scores when the parser is trained only with instances for which every language line token is aligned with a translation-line token. Though the scores are similar between settings on average, they can vary wildly between

(a) Projected Tags – All Instances: UAS results using projected POS tags and all instances, regardless of how many words in the instance were aligned.

              Stat   Stat+Heur   Heur −POS   Heur +POS
    German    68.7        70.3        59.0        72.4
    Hausa     55.8        56.5        36.3        41.3
    Korean    73.4        72.6        50.7        68.7
    Welsh     55.8        50.8        49.2        60.4
    Yaqui     53.6        54.4        39.2        59.4
    Overall   63.0        62.9        48.2        61.9

(b) Classifier Tags – All Instances: UAS results using classifier-produced POS tags and all instances, regardless of how many words in the instance were aligned.

              Stat   Stat+Heur   Heur −POS   Heur +POS
    German    70.6        69.9        57.3        65.8
    Hausa     51.9        54.7        38.3        41.5
    Korean    65.8        65.8        48.7        65.8
    Welsh     52.8        55.5        49.8        53.8
    Yaqui     55.1        52.9        33.4        59.4
    Overall   61.0        61.3        46.7        58.5

(c) Projected Tags – Fully Aligned Instances: UAS results using projected POS tags and only instances in which an aligned token was found for every word in the language line.

              Stat   Stat+Heur   Heur −POS   Heur +POS
    German    71.5        71.5        69.6        70.0
    Hausa     52.6        52.4        18.8        49.2
    Korean    76.9        77.7        75.0        73.8
    Welsh     41.9        40.6        39.0        47.0
    Yaqui     56.9        56.6        53.9        58.6
    Overall   62.9        62.8        54.9        62.1

(d) Classifier Tags – Fully Aligned Instances: UAS results using classifier-produced POS tags and only instances in which an aligned token was found for every word in the language line.

              Stat   Stat+Heur   Heur −POS   Heur +POS
    German    70.4        70.4        70.6        66.1
    Hausa     50.1        50.3        23.4        37.6
    Korean    75.0        76.3        75.3        73.2
    Welsh     40.6        40.3        43.1        45.1
    Yaqui     59.1        58.4        57.1        60.6
    Overall   61.9        62.1        57.2        58.8

Table 7.9: UASs for parsers trained on the automatically enriched data from all available instances in Odin for that language and tested on the language lines of the XL-IGT corpus data, using POS tags that were either projected using the given alignment method or produced by the gloss-line classification method.

languages. For instance, although Korean shows results of 50.7% and 48.7% using heuristic alignments when all instances are used, this jumps to 75% and 75.3% when only fully aligned instances are used. Conversely, scores for Welsh drop roughly 10% across all alignment methods when only fully-aligned instances are used. As can be seen in Table 7.8, while Korean drops from 2,287 filtered instances to 575 when filtering for those that are heuristically fully-aligned, Welsh drops from 196 to only 54. As might be expected, using fully-aligned instances tends to improve performance by increasing the quality of data, but quantity can become a limiting factor. With regard to the alignment methods, while there is again variation between specific languages, overall the Stat and Stat+Heur alignment methods outperform the heuristic-only approaches for this task. For the POS tagging methods, using projected POS tags appears to outperform the classification-based approach overall as well. With these three observations, it appears that for this task, a higher-recall alignment method is preferred for optimal results, in combination with the higher-precision, lower-recall POS tagging methods.

Results on UD Data, Short Sentences (≤ 10 Words)

The tables in Table 7.10 give a breakdown of the different settings for this experiment, similar to the previous table, but for this setting the evaluation is on sentences from the Universal Dependency Treebank v2.0 consisting of ten words or fewer. The languages used from the Universal Dependency treebank vary from those used in the XL-IGT corpus, though German is found in both. By switching domains from the simple, illustrative sentences used in the IGT data from the XL-IGT corpus to the far freer newswire and blog domain of the UD-2.0 corpus, we see the UAS results drop substantially. This change in domain is accompanied by a substantial change in vocabulary from the IGT instances in Odin. Furthermore, there are likely syntactic

(a) Projected Tags – All Instances: UAS results using projected POS tags and all instances, regardless of how many words in the instance were aligned.

                  Stat   Stat+Heur   Heur −POS   Heur +POS
    French        36.4        36.6        30.0        40.1
    German        40.5        39.0        31.2        44.4
    Indonesian    24.1        24.1        13.8        28.2
    Italian       35.5        36.7        33.5        37.0
    Portuguese    35.7        34.2        25.1        36.3
    Spanish       35.4        34.6        28.9        34.5
    Swedish       27.2        27.2        18.8        26.9
    Overall       34.5        34.0        26.5        36.5

(b) Classifier Tags – All Instances: UAS results using classifier-produced POS tags and all instances, regardless of how many words in the instance were aligned.

                  Stat   Stat+Heur   Heur −POS   Heur +POS
    French        36.2        35.2        32.9        37.2
    German        42.6        42.9        33.1        41.9
    Indonesian    23.9        23.9        20.3        27.3
    Italian       34.0        34.2        33.3        35.5
    Portuguese    37.0        35.3        34.2        36.2
    Spanish       30.9        32.3        31.4        32.0
    Swedish       26.2        26.2        26.1        35.5
    Overall       34.2        34.1        30.7        36.0

(c) Projected Tags – Fully Aligned Instances: UAS results using projected POS tags and only instances in which an aligned token was found for every word in the language line.

                  Stat   Stat+Heur   Heur −POS   Heur +POS
    French        38.6        38.6        34.8        36.7
    German        41.0        41.0        41.7        42.0
    Indonesian    25.4        25.4        26.8        30.8
    Italian       36.2        36.2        33.5        34.0
    Portuguese    30.1        30.9        32.9        30.8
    Spanish       33.8        33.3        36.2        34.1
    Swedish       32.0        32.0        35.7        31.4
    Overall       34.8        34.8        35.5        35.2

(d) Classifier Tags – Fully Aligned Instances: UAS results using classifier-produced POS tags and only instances in which an aligned token was found for every word in the language line.

                  Stat   Stat+Heur   Heur −POS   Heur +POS
    French        38.6        38.6        37.4        37.4
    German        40.3        40.3        44.4        40.8
    Indonesian    26.0        26.0        27.3        28.2
    Italian       35.7        35.7        39.7        35.7
    Portuguese    33.0        32.8        32.0        34.7
    Spanish       29.9        30.0        28.6        30.6
    Swedish       29.4        29.4        35.7        35.5
    Overall       34.1        34.1        35.8        35.5

Table 7.10: UASs for parsers trained on the automatically enriched data from all available instances in Odin for that language, using POS tags that were either projected using the given alignment method or produced by the gloss-line classification method. Evaluation was done on sentences from the UD-2.0 evaluation corpus that were ten words in length or shorter.

differences between the IGT instances and the longer newswire sentences. As discussed in Lewis and Xia (2008), there could be an IGT-Bias; that is, given IGT's purpose in illustrating unusual phenomena, the IGT data may be biased toward unrepresentative examples. The longer newswire sentences may also simply differ by virtue of their length – longer sentences may include subordinate clauses with different syntactic structures than are encountered in the typically short IGT instances. Whatever the cause of the degraded performance, it appears that, for the case when evaluation is performed on UD-2.0 data instead of XL-IGT data, selecting only the fully aligned instances results in improvements for every alignment method when averaged across the languages. Among these results, the highest performing system uses heuristic alignments, projected tags, and only fully aligned instances – all methods that favor precision over recall in other settings.

Results on UD Data, All Sentences

Table 7.11 shows the results for the final monolingual parsing experiment, again using the UD-2.0 corpus for evaluation, but this time using all sentences, regardless of length. Here we see the lowest UASs of any of the monolingual experiments, as we have moved not only out-of-domain, but also into a corpus where the average sentence length is now 21.8 tokens, rather than the 4.8 of the Odin corpus. The system that performs the best among these is again the system that achieved the best results for the short sentences, fully aligned instances using heuristic alignment and projected tags, though the UAS of 22.4% is much lower.

7.5.4 Analysis

While the results achieved by these monolingual dependency parsing approaches may be seen as somewhat disappointing, particularly for sentences of unconstrained length, such multilingual dependency parsing is still very much an unsolved problem. Recent work such

(a) Projected Tags – All Instances: UAS results using projected POS tags and all instances, regardless of how many words in the instance were aligned.

                  Stat   Stat+Heur   Heur −POS   Heur +POS
    French        23.1        23.4        16.7        20.0
    German        28.8        28.8        23.0        29.4
    Indonesian    14.9        14.9         8.2        17.9
    Italian       20.5        22.0        16.8        20.6
    Portuguese    19.3        22.6        13.3        20.4
    Spanish       26.0        25.7        21.9        21.1
    Swedish       16.2        16.2         9.7        17.3
    Overall       22.3        23.0        16.8        20.8

(b) Classifier Tags – All Instances: UAS results using classifier-produced POS tags and all instances, regardless of how many words in the instance were aligned.

                  Stat   Stat+Heur   Heur −POS   Heur +POS
    French        22.1        22.3        21.4        20.6
    German        28.4        28.3        25.6        27.7
    Indonesian     9.9         9.9        16.0        17.5
    Italian       22.4        21.9        19.6        20.8
    Portuguese    18.9        17.5        17.9        17.4
    Spanish       22.2        24.1        17.9        18.4
    Swedish       18.4        18.4        16.6        17.5
    Overall       20.8        21.0        19.3        19.5

(c) Projected Tags – Fully Aligned Instances: UAS results using projected POS tags and only instances in which an aligned token was found for every word in the language line.

                  Stat   Stat+Heur   Heur −POS   Heur +POS
    French        23.2        23.2        22.7        20.9
    German        28.4        28.4        29.3        28.6
    Indonesian    13.6        13.6        20.1        18.3
    Italian       20.8        20.8        19.3        18.8
    Portuguese    22.5        21.6        22.0        20.5
    Spanish       24.4        26.8        27.9        26.5
    Swedish       19.1        19.1        20.5        17.5
    Overall       22.6        23.1        23.9        22.4

(d) Classifier Tags – Fully Aligned Instances: UAS results using classifier-produced POS tags and only instances in which an aligned token was found for every word in the language line.

                  Stat   Stat+Heur   Heur −POS   Heur +POS
    French        24.2        24.2        24.6        21.3
    German        26.7        26.7        28.2        28.2
    Indonesian    14.9        14.9        19.0        17.6
    Italian       19.9        19.9        21.1        19.9
    Portuguese    17.6        17.6        17.1        15.7
    Spanish       19.0        18.6        19.2        19.1
    Swedish       18.6        18.6        20.3        18.4
    Overall       20.3        20.2        21.0        19.6

Table 7.11: UASs for parsers trained on the automatically enriched data from all available instances in Odin for that language, using POS tags that were either projected using the given alignment method or produced by the gloss-line classification method. Evaluation was done on all sentences from the UD-2.0 evaluation corpus.

as Xiao and Guo (2015), using millions of parallel sentences available in the Europarl corpus (Koehn, 2005) to produce word alignments and project DSs, still achieves a best result of 59.7% average over the eight language pairs covered for sentences of all lengths, and 65.8% for sentences of fewer than ten words. While this is still a great deal higher than the 25.0% and 40.6% seen in the equivalent experiments above, the approach given here is also working with far fewer resources. Additionally, if it is possible to limit the input to simpler sentences such as those found in IGT, the 63.0% UAS achieved for this system is actually fairly competitive, and available for such rare and resource-poor languages as Yaqui and Hausa. In the latter cases, the choice otherwise would be to develop no system at all.

7.6 Summary

In this chapter, I presented five dependency parsing approaches using IGT as training data. The first projection-based approach in System 1 was improved on by subsequent Systems 3 and 4. While these systems outperformed either a parser trained solely on the projected trees or the projection alone, they both relied upon some manually corrected trees to achieve this boost, and required IGT data to parse. System 2, the baseline statistical parser trained directly on IGT data, performed extremely poorly, and thus was similarly improved by the enhanced parsers in Systems 3 and 4. System 5 presented a parser which utilized the word alignment system from Chapter 5 and the POS tagging systems from Chapter 6 to generate training data from IGT instances that could then be used on monolingual input data. The results of System 5 were disappointing, but this system is the result of sparse data as well as noisy data, which cause compounding errors through the model pipeline. Future work on other parsing approaches, as well as the data cleaning and expansion of Odin under the RiPLes project, may improve the performance of these methods substantially.

7.7 Further Work

As the monolingual parsing results show, there is still clearly room for improvement. Below are a few possible paths to follow for future work.

7.7.1 Additional Data Cleaning

Although it will possibly sound a bit repetitive at this point, the dependency parsing results demonstrated in this chapter are the result of a pipeline that combines word alignment, part-of-speech tagging, and dependency projection, all of which are independently affected by pdf-to-text corruption found in the original source data. At the end of the pipeline, the compounding errors are certain to have a substantial impact on the resulting performance of the system. Reducing noise by discarding instances also sacrifices the amount of available training data. Better approaches to reconstructive cleaning could help reduce noise while maintaining the maximum amount of training instances. At the time of writing, the RiPLes project is currently working on improving and expanding the source data for Odin by reprocessing the data with an improved pdf-to-text extractor on both the old documents and a large number of new documents. When that work is complete, it is possible that the systems outlined here may see substantial gains just by use of the new data. Given that the software I created for this thesis will be available online, this expanded Odin data should be able to bootstrap new, improved systems immediately once available.

7.7.2 Use Modified Parser Model

As shown in previous sections, lack of alignments can be a cause of error when attempting to project DSs, and heuristic methods to reattach unaligned words do not appear to result in trees that are particularly well suited to training subsequent parsers. One possible avenue would be to create a modified DS parser, following Spreyer and Kuhn (2009), where the trees created by the initial projection need not be complete. Instead, the modified parsers use a modified score for calculating likely dependency links during training such that only dependencies that arise from sure alignments are used. Such a modification could reduce errors caused by unaligned words that were incorrectly reattached during projection.

7.7.3 Similarity-Based Approaches

Finally, one of the weaknesses of IGT remains its sparsity. Another path for future research would be a variety of approaches that focus on expanding the coverage of category identifiers for unseen data, whether through clustering based on part-of-speech tags (Koo et al., 2008) or more general lexical similarity features (Mirroshandel et al., 2012). In particular, these approaches would help address a failing of the systems produced by projection with IGT, in that there is a large data sparsity issue for the open-class words. As Mirroshandel et al. noted, even for supervised parsers trained on several thousand sentences, performance on open-class words is greatly improved by such features. In conjunction with improved POS induction and word alignment, these approaches could greatly reduce the OOV rate for POS tagging, which is one of the key weaknesses of IGT-based systems. This improved tagging for previously unknown words would in turn inform the parser, leading to improved parsing accuracies as well.

Chapter 8

INTENT: THE INTERLINEAR TEXT ENRICHMENT TOOLKIT

The INterlinear Text ENrichment Toolkit, or Intent, is the code package I wrote that implements the IGT enrichment methods presented in this thesis. The following resources are available:

Code: https://github.com/rgeorgi/intent
Annotated IGT Data: https://github.com/rgeorgi/xigt-data

As I will be distributing this package for any researcher who wishes to use it under the MIT license, I will use this chapter to give an overview of the software and the main commands that can be used, though more complete and up-to-date documentation can be found at the URLs above.

INTENT v1.0
  Enrichment Tools: enrich, extract, eval, project
  Corpus Tools: stats, split, filter, text
  Test Tool: repro

Figure 8.1: Overview of the modules of the Intent software package.

Raw IGT → INTENT enrich → Enriched IGT
  --align {giza, gizaheur, heur, heurpos}
  --pos {class, proj, trans}
  --parse {proj, trans}

Figure 8.2: Flowchart demonstrating the primary arguments to the enrich command and their options.

Figure 8.1 gives a brief overview of the primary modules contained within the overall Intent package, while Sample 8.1 shows the main commandline prompt for Intent. These subcommands are split over two primary tasks. The first group deals with enrichment of IGT instances, while the second consists of utilities designed to deal with Xigt-XML corpora (Goodman et al., 2014). I will describe the function of each module in the following sections.

8.1 enrich

Perhaps the most important command in the Intent package is the enrich command. The enrich command, as illustrated in Fig. 8.2, provides three arguments that control the enrichment for word alignment, POS tagging, and dependency parsing and projection described in Chapters 5 to 7. In addition to DSs, the --parse argument also supports phrase structure parsing and projection. The example in Sample 8.2 gives a commandline for running enrichment on a cmn.xml Xigt-XML file, which would correspond to Mandarin Chinese in the Odin-2.1 data.

usage: intent.py [-h] {enrich,stats,split,filter,extract,eval,text,project} ...

This is the main module for the INTENT package.

positional arguments:
  {enrich,stats,split,filter,extract,eval,text,project}
                        Valid subcommands
    enrich              Enrich igt data.
    stats               Get corpus statistics for a set of XIGT files.
    split               Command to split input file(s) into train/dev/test instances.
    filter              Command to filter input file(s) for instances
    extract             Command to extract data from enriched Xigt-XML files
    eval                Command to eval INTENT functions against a gold-standard Xigt-XML.
    text                Command to convert a text document into Xigt-XML.
    project             Command that will (re)project pos/ps/ds using the specified pos
                        source and alignment type.

optional arguments:
  -h, --help            show this help message and exit

Code Sample 8.1: Main commandline usage prompt for Intent, with a list of all available subcommands.

intent.py enrich --align heurpos --pos proj,class --parse proj \
    cmn.xml cmn-enriched.xml

Code Sample 8.2: Example enrichment command providing Heur+POS alignment, both classifier-based and projection-based POS tagging, and DS and PS projections.

The --align argument selects a word alignment method to use in aligning the gloss and translation lines from among giza, gizaheur, heur, or heurpos. These settings correspond to the Stat (Section 5.3.2), Stat+Heur (Section 5.4), and Heur −POS and Heur +POS methods (Section 5.2) discussed previously. The --pos argument selects a POS tagging method from among class for classification (Section 6.3), proj for projection-based tagging (Section 6.2), or trans, which tags only the translation line. The --parse argument similarly offers an option of proj to parse and project the DS/PS trees (Section 7.2), or trans to only parse the translation line.

After the enrich command is run, it generates an output Xigt-XML document with the requested enrichment, if possible. Some issues with the IGT data, such as a lack of 1-to-1 gloss–language line alignment, will cause enrichment involving the language line to fail, and the instance will be left unenriched in the output.

For a user who would like only POS tags on Welsh (cym) with maximal precision, potentially sacrificing recall, an appropriate command would be to enrich using only projected tags aligned using the heuristic alignment:

intent.py enrich --pos proj --align heur cym.xml cym_tagged.xml

Alternatively, a user who wants the maximum number of POS-tagged language-line tokens on Hausa (hau), and cares less about precision, might simply use:

intent.py enrich --pos class hau.xml hau_tagged.xml

8.2 extract

The extract subcommand is the second-most important command in the Intent package, in that it provides the mechanism by which to extract data from the enriched IGT. The flowchart in Fig. 8.3 shows the extract subcommand's various arguments, and the output products that each generates. Unlike the enrich command, the extract command expects that the provided Xigt-XML file already contains enrichment. This enrichment may be generated from Intent itself, or from other sources, such as XigtEdit (Xia et al., 2016). The --classifier-prefix argument uses POS tags found on the gloss-line tokens in the enriched IGT instance to train a Mallet (McCallum, 2002) MaxEnt classifier (Berger et al., 1996). The --tagger-prefix argument will use the POS tags found on language-line tokens to train a Stanford Tagger POS tagging model (Toutanova et al., 2003). The --dep-prefix argument will train a Stanford Parser model (Klein and Manning, 2003; de Marneffe and MacCartney, 2006) on projected dependency structures found on the language line. The last two arguments are not used for purposes explored in this thesis, but would be useful for other tasks. The --sent-prefix argument produces either language–translation

Enriched IGT → INTENT extract:
  --classifier-prefix → gloss-line POS classifier
  --tagger-prefix     → target-language POS tagger
  --dep-prefix        → target-language DS parser
  --sent-prefix       → L/T or G/T parallel sentences
  --cfg-rules         → target-language CFG rules

Figure 8.3: Flowchart illustrating the extract command, and the possible outputs.

or gloss–translation parallel sentences, with or without heuristic word-level matches, which could potentially be useful in producing or improving a word alignment system between the target language and English. The --cfg-rules argument will produce a list of CFG rules extracted from phrase structures that have been projected to the language line, which could potentially be analyzed to infer syntactic phenomena of a language. While I have not described the phrase structure projection process in this thesis, my projection algorithm follows that of Xia and Lewis (2007). Sample 8.3 gives an example of how the extract subcommand might be used to extract a product for a target language from an enriched Xigt-XML file. In this case, the --tagger-prefix argument is given a german* filename prefix in the ./taggers subdirectory, and will extract the language-line POS tags from the file deu-enriched.xml.

intent.py extract --tagger-prefix ./taggers/german deu-enriched.xml

Code Sample 8.3: An example commandline for the extract subcommand, which will produce a German POS tagger from the deu-enriched.xml file.

8.3 eval

The eval subcommand is used for evaluating the performance of various Intent methods on a Xigt-XML document containing gold-standard annotations. If gold-standard POS tags are present, it can evaluate a gloss-line classifier using --classifier and a specified classifier file, as well as --pos-projection. If gold-standard word alignments are available, --alignment can be specified to give a report of how the different alignment methods perform. If gold-standard DSs are present in the file, the --ds-projection argument will evaluate DS projection against those. The commandline in Sample 8.4 shows an example where the my_gloss.maxent file (which can be extracted from an IGT with the extract subcommand) is evaluated against

intent.py eval --classifier my_gloss.maxent fra-gold.xml

Code Sample 8.4: Example of an eval commandline.

intent.py project --aln-method heur --completeness 0.75 \
    jpn-enriched.xml jpn-projected.xml

Code Sample 8.5: An example commandline for the project subcommand.

the fra-gold.xml file containing gold-standard gloss-line POS tags.

8.4 project

In many ways, the project subcommand is similar to the enrich subcommand in that it produces enriched IGT. However, unlike enrich, project expects an already-enriched Xigt-XML file that has word alignments and translation-line POS tags or DSs. Similar to the extract subcommand, the input enriched Xigt-XML file need not have been generated by Intent. This subcommand can be used for testing the projection algorithms with files that provide gold-standard POS tags or word alignment. The project subcommand takes two arguments. The first, --aln-method, selects the alignment method from the file to use in case there are multiple methods specified. Aside from the four Intent methods, this argument may also select for manual word alignments or any, which selects the first available alignment. Selecting any alignment may be required if the metadata specifying the alignment type is not present in the file. The second argument, --completeness, provides the option to only project the annotations from the translation line if a given proportion of the words are aligned. This proportion is represented as a decimal between 0 and 1, inclusive. The example in Sample 8.5 will use the heur alignment method to project from the translation line of the jpn-enriched.xml file, but only when at least 75% of the language line tokens can be aligned. The output will be written to jpn-projected.xml.

8.5 Corpus Management Subcommands

The stats, split, filter, and text subcommands are used as simple utilities for working with Xigt-XML corpora, and I will discuss their functions only briefly here. The stats command provides simple statistics on an Xigt-XML corpus file, consisting of counts of instances and tokens as well as counts of unique words. The split command offers the ability to split a corpus into train, dev, and test sets, with the ability to specify the estimated proportion of tokens that should appear in each. The filter subcommand can be used to filter Xigt-XML documents according to whether the instances they contain do or do not have gloss, language, or translation lines (as some instances are missing one or more), or whether or not they have one-to-one alignment between whitespace-delimited language and gloss tokens. Intent will attempt to fail gracefully when an instance fails to meet the requirements for enrichment or extraction, but it will also attempt to maximize the utility of every instance available. For example, only a POS-tagged gloss line is required to extract a gloss-line classifier. This may mean that certain extracted systems may use more instances than others, depending on the approach. For comparing systems based on a shared set of instances, it may be more appropriate to pre-filter the Xigt-XML document. Finally, the text subcommand can be used to convert plain text of Language, Gloss, Translation lines separated by whitespace into a Xigt-XML document.

8.6 Reproducing The Experiments (repro)

An all-too-common problem in the sciences is the difficulty of reproducing a set of experimental results. The repro subcommand seeks to address this problem by providing a built-in set of scripts that reproduce the experiments detailed in this thesis, when used in combination

Component             Version   Purpose
Python                ≥3.4      Main codebase
Mallet                ≥2.0      Classifier support
NLTK                  ≥3.0      Tree structure support
Xigt                  ≥1.0      Xigt support
Java                  ≥8        Support for Mallet, Stanford Parser/Tagger
mgiza                 ≥0.7      Statistical alignment
Stanford Parser       ≥3.6.0    DS/PS parsing
Stanford POS Tagger   ≥3.6.0    POS tagging

Table 8.1: Software requirements for Intent

with the data made available at the link above. The five options for the repro command are ds-igt, ds-mono, pos-igt, pos-mono, and align. These options reproduce the dependency structure and POS tagging experiments, both on IGT and monolingual data, as well as the alignment experiments.

8.7 Interfaces to Intent

While Intent was initially designed as a commandline tool, it quickly became apparent that, due to the requirements of installing a Python interpreter and several third-party software packages for full functionality, this commandline interface might be too cumbersome for many end-users to install, given the platform-specific dependencies (see Table 8.1 for a list of required packages).

In order to simplify usage for end-users, and additionally offer a basic query and submit functionality, INTENT-WEB was developed as a web-based interface for common enrichment tasks. Figure 8.6 shows a screenshot of INTENT-WEB using the Upload/Enrichment mode to enrich an instance of German, with verbose output. This interface can be accessed at http://intent.xigt.org/.

8.8 Intent For Cleaning and Annotation

Another tool that makes use of Intent is a web-based IGT editor tool that enables the creation of manually-reviewed, gold-standard IGT data. This tool helps support the RiPLes project (Xia et al., 2016)1 in providing a universal, browser-based interface for annotators to easily work with Xigt-based IGT data. At the time of writing, cleaned and normalized IGT data is the end product, with a visualization of the Intent-produced annotations, but it is hoped that future modifications will add the ability for annotators to correct these annotations. The current cleaning process improves IGT usability in the computational approaches outlined in this thesis, as well as providing a means to create gold-standard data that can be used to evaluate different automatic cleaning approaches. Figure 8.7 gives an example of the basic editor interface, and its ability to provide access to multiple corpora, as well as select different instances within each corpus. Additionally, annotators may change the automatically detected TAG for a line, which identifies whether a line is a Language line, Gloss line, Translation line, or contains non-IGT Metadata. Users can also tag each line with multiple labels, such as the LN shown in Fig. 8.7 that indicates the Language Name. Figure 8.8 shows where Intent is used more extensively in the interface, to perform the automatic segmentation and some initial enrichment so that the user can verify that the instance has been cleaned appropriately for the automated methods to function correctly. In future releases, users will have the ability to correct the automatically generated enrichment, in order to create gold standard corpora with minimal additional human effort.

8.9 Summary

While a great deal of the work involved in this thesis was creating the methods with which the problems would be approached, Intent represents the bigger task of implementing those

1. http://faculty.washington.edu/fxia/riples/

methods, and making them available in a tool that users can apply to their own data. As a researcher with interest in rare and endangered languages, my hope is that the release of this software, along with the Odin-2.1 data and subsequent versions, will provide tools with which other researchers may find the means to answer questions about or build tools for many languages that have, up until this point, had little to no other resources available. While Intent is by no means a comprehensive tool for all possible IGT-related tasks, I hope that it makes the data format more accessible to a broader audience.

Enriched IGT (w/Gold Data) → INTENT eval
  --classifier CLASSIFIER_FILE
  --pos-projection
  --ds-projection
  --alignment

Figure 8.4: Flowchart illustrating the eval command and its arguments.

Enriched IGT → INTENT project → Enriched IGT (w/Projections)
  --aln-method {giza, gizaheur, heur, heurpos, manual, any}
  --completeness 0.0–1.0

Figure 8.5: Flowchart illustrating the project subcommand and its arguments.

Figure 8.6: The INTENT Web Interface running at the University of Washington's Department of Linguistics server.

Figure 8.7: Overview of the XIGT editor interface for cleaning IGT instances. The editor allows annotators to clean corruption from XIGT instances, normalize the data so that it is ready for the automated methods of Intent and other systems, and provide a cleanliness rating and other comments that are saved in the IGT instance metadata.

Figure 8.8: INTENT-enriched information is automatically generated by the editor interface for inspection by the annotator. Future releases will include the ability for annotators to correct the enriched data, as well as view other information such as word alignment.

Chapter 9

CONCLUSION AND FUTURE WORK

In this thesis, I sought to answer the question: can the existing language knowledge contained in Interlinear Glossed Text be harnessed to perform basic NLP tasks on resource-poor languages in a repeatable and broadly applicable manner? For my conclusions, I will examine the answers that I found to this question, and suggest pathways for future studies.

9.1 Summary of Results

Word Alignment

Chapter 5 proposed two primary approaches to using the unique format of IGT instances to align words between translation and language lines, for the purpose of enabling many of the subsequent methods through projection. The first of these methods was a purely heuristic approach, taking advantage of the unique gloss line that IGT provides to find tokens that matched between the translation and gloss lines. That “matching” could be defined as literal string matches, a mapped set of grams, or words with matching POS tags. The second approach utilized statistical alignment methods, but leveraged the thousands of instances in the Odin database that reuse translation words and gloss words, regardless of the language the instance is annotating. Thus, a language with only 10 instances would still potentially have access to 151,623 other gloss–translation sentence pairs with which to inform alignment. This statistical method could be informed by the higher-precision, lower-recall heuristic method to add heuristic matches to the alignment data. Experiments on these methods, summarized in Table 9.1, showed that the best-performing heuristic

System        Settings             Precision   Recall   F-Measure
Heuristic     −POS                      0.96     0.75        0.85
              +POS                      0.86     0.83        0.85
Statistical   G-T (+ODIN +Heur)         0.89     0.78        0.83
              L-T                       0.47     0.51        0.49

Table 9.1: Summary of word alignment results on IGT instances in the RG-IGT corpus. Heuristic alignment is shown with or without the POS tag matching heuristic, while the statistical approaches are shown using either the augmented gloss/translation alignment (G-T), or the language/translation alignment (L-T).

method achieved an F1-score of 0.88 on the RG-IGT corpus, and 0.86 on the XL-IGT corpus. The best statistical approach showed results of 0.86 and 0.80 on the same datasets. While the heuristic approach appears to outperform the statistical approach, these evaluation corpora are very small, and both approaches appear to be good options for aligning IGT instances. Furthermore, these results show that IGT instances are able to produce high-quality word alignments even with very little data to align. Finally, both alignment methods might be used to inform one another. The heuristic approach was used to add high-precision alignments to guide the statistical approach, while a future path of research could investigate using high-reliability statistical alignments to create a translation ↔ gloss line translation dictionary as an additional alignment heuristic.
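The matching heuristics themselves are easy to sketch. Assuming gloss tokens carry sub-token grams (split on '-' and '.') and, optionally, POS tags on both lines, something like the following captures the three kinds of matches described above; this is an illustration of the idea rather than Intent's implementation, and a real system would apply further constraints.

    def heuristic_matches(gloss_tokens, trans_tokens, gloss_pos=None, trans_pos=None):
        """Return a set of (gloss_index, translation_index) candidate alignments."""
        links = set()
        # Pass 1: literal string matches and gram matches (gloss tokens split on '-' and '.').
        for i, g in enumerate(gloss_tokens):
            grams = set(g.lower().replace("-", ".").split("."))
            for j, t in enumerate(trans_tokens):
                if t.lower() == g.lower() or t.lower() in grams:
                    links.add((i, j))
        # Pass 2 (optional): for gloss tokens still unmatched, fall back to POS-tag matches.
        if gloss_pos and trans_pos:
            unmatched = set(range(len(gloss_tokens))) - {i for i, _ in links}
            for i in unmatched:
                for j in range(len(trans_tokens)):
                    if gloss_pos[i] == trans_pos[j]:
                        links.add((i, j))
        return links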

POS Tagging

Chapter 6 focused on two main tasks for obtaining POS tags for IGT instances. The first used word alignment and a POS tagger trained on English to tag the translation line and project those tags to the target language. This was the standard approach for projecting a POS tagger in previous work, although the work in this thesis was able to show substantially higher performance than previous work when projecting POS tags within IGT

            Projection   Classifier
                         Automatic   Manual
Accuracy    66.8         84.3        92.9

Table 9.2: Summary of POS tagging accuracies on the RG-IGT corpus. Projected tags are evaluated without special handling for unaligned words. The classifier approach is evaluated using tokens labeled either by automatic projection or manually, with a 90/10 split and tenfold cross-validation.

instances, due to the availability of high-quality word alignment.

The second method focused instead on the particular qualities of IGT, namely, the information encoded in the gloss line. Even with high-quality word alignment, unaligned or incorrectly aligned words present a problem for projection methods. I therefore proposed looking instead at the gloss line's grammatical markers, such as 3.SG.FEM, and using such markers as features in a classifier to predict the POS tag of the gloss token directly. The results of this classifier could even be used as a preprocessing step for the word aligner, finding unaligned words with matching POS tags and attempting to align them to improve recall.

One further addition to the classifier-based approach was that, rather than training the classifier with a small collection of manually-labeled gloss tokens, the classifier could be trained with all of the gloss tokens for which heuristic alignments were available, greatly improving coverage at the cost of introducing noise. Table 9.2 shows the summary of the projection and classifier-based POS tagging methods on the RG-IGT corpus. The projection approach achieved an accuracy of 66.8% when using automatic alignments. The classifier approach on the same data achieved 92.9% with the manually-labeled data and 84.3% with the automatically-labeled data. On IGT test data, the classifier-based approach thus reduces errors over the projection-based approach by 79% with manually-labeled data and 53% with automatically-labeled data.

After either of these methods was run, the resulting IGT instances with POS-tagged

                   Projection          Classifier           Supervised
                   Partial    Full     Automatic   Manual   1K      All Tokens
All Sentences      56.3       63.4     66.3        69.2     77.7    95.4
Short Sentences    62.7       65.5     69.5        70.4     74.9    94.1

Table 9.3: Summary of the overall POS tagging accuracies evaluated against the UD-2.0 corpus. Tagger results using projected POS tags are shown both allowing partially aligned sentences and restricted to only fully aligned sentences. Classifier approaches are shown with tokens labeled via automatic projection as well as manually. Supervised approaches are shown using 1,000 tokens and all available tokens.

language-lines could thus be used as a source of training data for POS taggers in the target language; the results are summarized in Table 9.3. Although this represented a shift in domain from IGT, taggers built with these approaches were evaluated on the languages in the UD-2.0 treebank. The best-performing projection-based method, and the classifiers using automatically and manually labeled data, achieved 63.4%, 66.3%, and 69.2% accuracy, respectively. While a fully supervised approach was able to achieve 95.4% accuracy overall, a supervised approach utilizing a number of tokens comparable to those available in the IGT instances achieved only 77.7% overall. Evaluating on sentences of only ten words or fewer, and thus more comparable to the short sentences in IGT, closed the performance gap even further.

This thesis presents a unique approach of training a classifier to work on the gloss line of IGT, providing a method to obtain POS tags on the language line without requiring alignment with the translation line. The resulting system shows that, while not directly competitive with fully supervised methods, POS taggers can be created for over a thousand languages with nothing more than the IGT data that is freely available in Odin. Although the performance of the resulting systems is limited by the amount of IGT data available and the cleanliness of that data, the RiPLes project is currently engaged in extracting more, and cleaner, data for subsequent versions of Odin.
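
As an illustration of this classifier-based idea, the sketch below turns gloss tokens such as dog-nom or 3.sg.fem into gram-level features and trains a small classifier with scikit-learn. The feature template and the toy training pairs are hypothetical, and scikit-learn's logistic regression stands in here for whatever classifier the full system uses.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def gram_features(gloss_token):
    """Turn one whitespace-delimited gloss token into a feature dict."""
    subwords = gloss_token.lower().replace("-", ".").split(".")
    feats = {"sub=" + s: 1 for s in subwords}          # each gram/word subword
    feats["has_gram"] = int(len(subwords) > 1)         # token carries attached grams
    feats["has_digit"] = int(any(c.isdigit() for c in gloss_token))
    return feats

# Tiny hypothetical training set of (gloss token, POS tag) pairs, e.g. obtained
# from heuristic alignment with a tagged translation line or from manual labels.
train = [("dog-nom", "NOUN"), ("eat-3sg", "VERB"), ("3.sg.fem", "PRON"),
         ("quickly", "ADV"), ("house-loc", "NOUN"), ("run-past", "VERB")]

vec = DictVectorizer()
X = vec.fit_transform([gram_features(tok) for tok, _ in train])
y = [tag for _, tag in train]
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Predict a tag for an unseen gloss token directly from its grams.
print(clf.predict(vec.transform([gram_features("cat-nom")])))
```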

                             Alignment
Source DS   G-T (+ODIN)   G-T (+ODIN +Heur)   Heuristic   Heuristic (+POS)   Manual
Parser      63.8          63.7                55.5        64.1               69.1
Manual      72.6          72.6                61.9        71.2               81.0

Table 9.4: Summary of dependency structure projection UASs on the XL-IGT corpus with various alignment sources, and with source DSs generated either manually or by a parser.

Dependency Parsing

In Chapter 7, I addressed the task of dependency parsing in a number of potential settings. The first was to project DS trees using IGT-based word alignments, similar to the initial approach used for the POS tagging experiments. When projections were performed using a parser and automatic alignments, as Table 9.4 shows, the Unlabeled Attachment Score (UAS) was 64.1% on the XL-IGT corpus. Using gold-standard trees and manual alignments, the results were as high as 81.0%.

The next approaches looked at two different ways of using a small set of manually corrected trees to improve the dependency parses produced by these projections, the results of which are summarized in Table 9.5. The first of these extended approaches took a set of projected trees for the target language and had an annotator manually correct them. Then, a parser that had been modified to use the suggested projected edges as a feature was trained

                      Unimproved Projections   Improved Projections
Parser + Projection   88.4                     89.3
Projection            84.3                     88.0
Parser                67.3

Table 9.5: Summary of Unlabeled Attachment Scores (UASs) for dependency parsing on the XL-IGT corpus.

and tested against unseen IGT instances. In this setting, the combined parser was able to improve from the projection-only 81.0% to 88.4%, a 39% reduction in error, and a 66% reduction in error over the 65.8% achieved by the baseline parser that did not use the projected trees.

The second approach took the same set of manually corrected trees and compared them against the English parse trees using a set of defined tree comparison operations. I used these tree comparison operations to measure the structural differences between languages, and used these statistics to develop a method for rewriting the dependency structures post-projection.

Finally, as with the POS tagging task, once an IGT instance has projected dependency structures available for the target language, these structures can be used to train a monolingual dependency parser. While it is exciting to potentially be able to generate a monolingual dependency parser using nothing more than IGT data, the results on this monolingual parsing task were less promising. When the language-line sentences in the IGT were used for evaluation, the best-performing system, using statistical alignments and POS tags projected with all available instances, achieved 63% UAS. Heuristic alignment performed uniformly poorly, though heuristic alignment combined with POS tags was competitive with the statistical methods. This result suggests that for the dependency projection task a higher-recall method is preferable to a system that emphasizes only precision.

When the monolingual parsing target was moved from IGT instances to the newswire text in the UD-2.0 corpus, however, performance dropped such that the best-performing system achieved only 36.5% UAS on sentences of ten words or fewer, and 23.9% overall. While disappointing, these figures are still for a system dealing with a great many compounding errors and noisy data. While it does not yet compete against state-of-the-art systems for resource-rich languages with high-quality annotated training data, my system is able to produce parsers for languages for which existing state-of-the-art systems do not have sufficient data to train against. Even with less than optimal performance, the produced parses can still serve to reduce labor in treebank creation by producing parses which can be corrected with less effort than starting from scratch.

Corpus             IGT     UD-2.0
Sentences          All     ≤10      All
All Alignments     63.0    20.8     34.5
Full Alignments    62.9    22.6     34.8

Table 9.6: Summary of dependency parsing results, evaluated both against the combination of the RG-IGT and XL-IGT corpora (IGT) and against the UD-2.0 corpus. For the UD-2.0 corpus, the parsers were evaluated against all sentences, and against sentences with ten words or fewer.

In answer to the research question for the dependency parsing task, I found that IGT can indeed produce parse trees repeatably and broadly for resource-poor languages, though the quality of these trees makes them more useful within an active-learning system than as end products.

9.2 Contributions of This Work and Potential Uses

Overall, this thesis constitutes a major step toward providing tools for languages for which previously no electronic resources were available. With IGT data available for approximately 1,500 languages currently in Odin, the methodology presented here provides the means to quickly and automatically enrich these raw IGT instances. This is a substantial development, and it opens up a number of tasks that can now be accomplished for languages on which they would previously have been impossible.

Word Alignment and Machine Translation Although I discussed obtaining word alignment for IGT instances using the Intent software, the word pairs extracted from IGT instances by the package's various alignment methods can also be used to seed translation dictionaries for languages for which some amount of parallel data is available. While this approach will not guarantee high-quality word alignment, it might help improve alignment quality, and thus translation quality, for languages where a

small amount of parallel data exists, just not in the amounts typically used for MT systems. For languages where data is scarce, even modest improvements in MT output can be highly impactful.
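
A minimal sketch of how such a translation lexicon might be harvested from word-aligned IGT instances is given below; the data layout, the example tokens, and the frequency cutoff are assumptions for illustration, not the Xigt/Odin representation.

```python
from collections import Counter, defaultdict

def build_lexicon(aligned_instances, min_count=2):
    """aligned_instances: iterable of (lang_tokens, trans_tokens, links),
    where links is a set of (lang_index, trans_index) pairs."""
    counts = defaultdict(Counter)
    for lang_toks, trans_toks, links in aligned_instances:
        for i, j in links:
            counts[lang_toks[i].lower()][trans_toks[j].lower()] += 1
    # keep only the translation pairs seen at least min_count times to limit noise
    lexicon = {src: [t for t, c in tgt.most_common() if c >= min_count]
               for src, tgt in counts.items()}
    return {src: tgts for src, tgts in lexicon.items() if tgts}

# Toy instances with hypothetical language-line tokens:
instances = [
    (["nima", "qimmiq"], ["nima", "saw", "the", "dog"], {(0, 0), (1, 3)}),
    (["qimmiq", "niri"],  ["the", "dog", "ate"],         {(0, 1), (1, 2)}),
]
print(build_lexicon(instances, min_count=2))   # -> {'qimmiq': ['dog']}
```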

Typological Studies Also implemented in the Intent software is the ability to project and extract CFG rules from phrase structures. While the projection process might be noisy, the CFG rules, if filtered appropriately or analyzed for statistically significant patterns, could be used to infer interesting syntactic properties across the 1,500+ languages in the Odin database. Additionally, the POS tags generated on the gloss line by the classification-based POS tagging approach could be used to identify nouns and verbs, which, together with gloss-line grammatical markers, could help detect whether a language has accusative or ergative alignment, or something else entirely. Indeed, this type of typological inquiry is one of the goals of the AGGREGATION project1, as discussed in Bender et al. (2014). In addition to the editor interface described in Section 8.8, the enriched data provided by Intent for Odin v2.1 will soon be included in a search interface that will allow linguists to search for typological characteristics of a language as represented in the enriched data.
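
As a rough illustration of the ergative/accusative query mentioned above, the sketch below scans gloss lines for case grams and tallies ergative-absolutive versus nominative-accusative marking. The gram inventories and the simple majority rule are illustrative assumptions, not a validated typological test.

```python
import re

ERG_GRAMS = {"erg", "abs"}      # ergative-absolutive case marking
ACC_GRAMS = {"nom", "acc"}      # nominative-accusative case marking

def case_alignment_guess(gloss_lines):
    erg = acc = 0
    for line in gloss_lines:
        # split on the usual gloss delimiters to get individual grams
        grams = set(re.split(r"[-.\s=]+", line.lower()))
        erg += len(grams & ERG_GRAMS)
        acc += len(grams & ACC_GRAMS)
    if erg == acc == 0:
        return "no overt case grams found"
    return "ergative-absolutive" if erg > acc else "nominative-accusative"

print(case_alignment_guess(["man-ERG dog-ABS see-PST",
                            "woman-ERG bread-ABS eat-3SG"]))   # -> ergative-absolutive
```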

Treebank/Resource Creation While the automated approaches described in this thesis are able to function independently of additional expert knowledge, data that have been cleaned and vetted by a language expert are valuable not only for training higher-quality systems than automated approaches are capable of, but also for evaluating systems in a given language. One immediately useful application of the systems in this thesis is therefore improving the speed and ease of treebank and resource creation. Though not currently implemented in the editor interface, the ability for users to correct the Intent-produced annotations would essentially turn IGT data into the source for an active-learning system,

1 Automatic Generation of Grammars for Endangered Languages from Glosses and Typological Information, http://depts.washington.edu/uwcl/aggregation/

potentially producing word-aligned and POS-tagged parallel treebanks for previously unavailable languages. It is even possible that, were the correction task made simple enough for native, non-linguist speakers, this Intent-aided correction task could be crowdsourced, creating new gold-standard resources for future studies.

Other Impacts These are only a few of the ways in which Intent can be used at present. My research also produced a number of findings about IGT as a resource more generally. Significantly, it shows that although a single resource-poor language may have only a handful of IGT instances available in Odin or other sources, Intent can use IGT instances from other languages to improve performance on the word alignment and POS tagging tasks. Thus, as Odin is expanded with more, and cleaner, data, the resulting Intent systems should improve, both by incorporating more training instances for languages in Odin and by obtaining more cross-linguistically useful data. Furthermore, the work done in this thesis contains, as far as I am aware, the first published results of a part-of-speech tagging system for a number of rare or endangered languages, including Chintang, Hausa, Welsh, and Yaqui. Intent can generate similar systems for any language with available IGT data; only the lack of means to evaluate such systems prevented more results from being reported here. This thesis has affirmed that IGT can indeed be harnessed as a computational resource for a number of basic NLP tools. While these tools are simple, they may serve as the basis for many other related tasks and provide a foundation for forming typological queries over a data source that covers over a thousand languages.

9.3 Future Work

This thesis represents a major advance in the domain of resource-poor language tool development, though there naturally remain many paths yet available for further research.

Word Alignment For the word alignment task, there are two promising experiments that could be taken as next steps to the work presented here. First, the work of Gao et al. (2010) uses high-precision "known" word alignments within a sentence pair to constrain the statistical alignment search to be consistent with those known links. Using the high-precision heuristic alignments with this aligner to align IGT instances would impose a stronger constraint on the statistical alignment approach than is currently achieved by adding the extracted heuristic matches as additional sentence pairs, hopefully allowing the statistical approach to benefit more directly from the heuristics' high precision. With respect to non-IGT data, the parallel text that Intent is able to extract from IGT instances, in the form of both parallel sentences and translation lexicons, could potentially be used to boost the performance of an existing statistical alignment system for which parallel data is available but limited.

POS Tagging For the part-of-speech tagging task, while the coverage of IGT appears limited, its potential for high-quality but sparse POS-tagged data seems well-suited to a number of semi-supervised approaches. The prototype approach of Haghighi and Klein (2006a) seems particularly appropriate for this task, although the noise and sparsity of IGT data might make it difficult to extract prototypes that are truly representative of a word class. Similarly, attempting to model ambiguity classes as in Toutanova and Johnson (2007) sounds promising, but again the noise and sparsity of IGT might limit how reliable the extracted distributions for such ambiguity classes would be.

Dependency Parsing Dependency parsing was the task that was found to be the most difficult. One of the possible reasons for this was the nature of attempting to project complete DSs when alignment information was imprecise and unaligned words were attached using best-guess heuristics. One potential path toward solving this issue is to follow the work of Spreyer and Kuhn (2009) and focus exclusively on precision rather than completeness.

Using Spreyer and Kuhn's approach, partially projected trees could be used to train a parser, so that the parser trained on the projected IGT data need only consider correctly projected edges as ground truth, rather than those attached heuristically, which are far more likely to be erroneous.

Cleaning Ultimately, one of the biggest limitations in working with the version of the IGT data that was available was data cleanliness. All of the enrichment-related tasks depend on clean IGT instances, and while effort was made to automatically clean and normalize the instances as much as possible, a great deal of noise remains. As the enrichment progresses from basic word alignment to shallow POS tagging to deeper dependency parsing, errors caused by this noise compound, increasingly degrading performance. Current work on the RiPLes project is focused on producing new, cleaner data for Odin. With future, cleaner versions of the IGT instances, many of the methods detailed here may see substantial improvements without any further investment.

9.4 Conclusion

This thesis has presented an integral step toward broadening the range of languages served by resources in the field of natural language processing. While the tasks presented here are limited, the methods used could easily be adapted. Chapter 6 focuses on the task of POS tagging, but the methods described could be used to project Named Entity Recognition (NER) annotations in the same manner. Similarly, the dependency structures from Chapter 7 represent one form of shallow parsing that could also be used for Semantic Role Labeling (SRL). The word alignment methods from Chapter 5 focus on obtaining alignments within the IGT instances to assist the other tasks in the thesis, but these alignments could be used to extract translation lexicons that would be useful for a wide variety of bilingual tasks, including machine translation. It is hard to picture a time when computational resources are available for all the

languages of the world. In the intervening time, broadly applicable methods such as those presented here will remain valuable tools in the field of computational linguistics.

Appendix A

TERMS AND VARIABLES

E    An English-language sentence
F    A foreign-language sentence
C    A corpus

Each corpus C contains pairs of sentences and the alignments between the words in each pair:

    C = {(F_1, E_1, A_1), ..., (F_n, E_n, A_n)}

Each corpus entry c contains a single pair of sentences and an alignment:

    c = (F, E, A)

Each sentence contains a set of words W, a set of edges T, and a set of POS tags P:

    F = {W_F, T_F, P_F}        E = {W_E, T_E, P_E}

The set of words is simply the tokens of the sentence, while the edges are (child, parent) pairs, with a special token w_0 for the root node. The POS tags are also represented as a set of pairs of a word and its associated tag, where p ∈ Tagset:

    W_F = {f_1, f_2, ..., f_n}                W_E = {e_1, e_2, ..., e_n}

    T_F = {(f_i, f_j), ..., (f_k, f_l)}       T_E = {(e_i, e_j), ..., (e_k, e_l)}

    P_F = {(f_i, p_j), ..., (f_k, p_l)}       P_E = {(e_i, p_j), ..., (e_k, p_l)}

Each alignment contains pairs linking the foreign-language words to the English words:

    A = {(f_i, e_j), ..., (f_k, e_l)}

Finally, I define a relation R that maps a word from one language onto the set of words it aligns with in the other:

    R_{F→E}(f_i) = {e_j | (f_i, e_j) ∈ A}

    R_{E→F}(e_j) = {f_i | (f_i, e_j) ∈ A}
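
For readers who prefer code to set notation, the sketch below is one possible in-code rendering of these structures; the class and field names follow the appendix but are otherwise an assumption, not the representation used by Intent.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

Edge = Tuple[int, int]          # (child index, parent index); 0 is the root w_0
TaggedWord = Tuple[int, str]    # (word index, POS tag)
Link = Tuple[int, int]          # (foreign word index, English word index)

@dataclass
class Sentence:
    words: List[str]                                     # W: the tokens of the sentence
    edges: Set[Edge] = field(default_factory=set)        # T: dependency edges
    tags: Set[TaggedWord] = field(default_factory=set)   # P: POS tag pairs

@dataclass
class Instance:
    foreign: Sentence                                    # F
    english: Sentence                                    # E
    alignment: Set[Link] = field(default_factory=set)    # A

def aligned_english(inst: Instance, f_index: int) -> Set[int]:
    """R_{F->E}: the English words aligned to foreign word f_index."""
    return {e for f, e in inst.alignment if f == f_index}
```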

Appendix B

PSEUDOCODE

B.1 Projection

Algorithm B.1.1: Algorithm for projecting DSs from one language to another, following Quirk et al. (2005).

Input : Source sentence W_E = {e_1, e_2, ..., e_n}
        Source tree T_E = {(e_ci, e_pk), ..., (e_cj, e_pl)}
        Target sentence W_F = {f_1, f_2, ..., f_n}
        Word alignment A_E→F = {(e_i, f_k), ..., (e_j, f_l)}
Output: Projected tree T_F

begin
    T_F ← Copy(T_E)
    foreach (e_c, e_p) ∈ T_F do
        /* If the child is unaligned, delete it and promote its children, if it has any. */
        if IsNotAligned(A_E→F, e_c) then
            Remove(e_c, T_F)                              // see Algorithm B.1.2
        /* Otherwise, replace the source word with the target word it aligns with. If
           multiple target words align with this source word, each is inserted as a sibling. */
        else
            foreach f_c ∈ GetAlignedWords(A_E→F, e_c) do
                DeleteFromTree(T_F, (e_c, e_p))
                InsertIntoTree(T_F, (f_c, e_p))

    /* After the initial replacement, target words that aligned with multiple source
       words may appear multiple times in the tree; remove all but the shallowest. */
    foreach f ∈ W_F do
        f_nodes  ← {(f_c, f_p) ∈ T_F | f_c = f}           // the set of nodes containing f
        f_depths ← {Depth(T_F, f_c) for all (f_c, f_p) ∈ f_nodes}
        foreach (f_c, f_p) ∈ f_nodes do
            if Depth(T_F, f_c) > min(f_depths) then
                Remove(T_F, f_c)                          // remove nodes deeper than the shallowest

    /* Finally, unaligned target words are reattached. */
    foreach f_i < f_j < f_k such that IsAligned(A_E→F, f_i) ∧ IsAligned(A_E→F, f_k)
            ∧ IsUnaligned(A_E→F, f_j) do
        /* Attach to the leftmost aligned node if no indices to the left of j are aligned. */
        if |f_i| = 0 then
            InsertIntoTree(T_F, (f_j, min(f_k)))
        /* Attach to the rightmost aligned node if no indices to the right of j are aligned. */
        else if |f_k| = 0 then
            InsertIntoTree(T_F, (f_j, max(f_i)))
        /* If f_i depends on f_k or vice versa, attach to the lower of the two. */
        else if (f_k, f_i) ∈ T_F then
            InsertIntoTree(T_F, (f_j, f_k))
        else if (f_i, f_k) ∈ T_F then
            InsertIntoTree(T_F, (f_j, f_i))

    return T_F

Algorithm B.1.2: Remove a token w from the tree T .

Algorithm: Remove(w, T)

Input : T = {(w_i, w_j), ..., (w_m, w_n)}    // input tree
        w                                     // word to remove
Output: T′                                    // modified tree

begin
    T′ ← T − {(w, w_i)   | w_i = parent(w, T)}                        // remove the edge between w and its parent w_i
           − {(w_j, w)   | w = parent(w_j, T)}                         // remove the edges for the children of w
           + {(w_j, w_i) | w_i = parent(w, T), w = parent(w_j, T)}     // “promote” the former children of w to attach to w’s parent w_i
    return T′
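
The following Python sketch mirrors the core of Algorithms B.1.1 and B.1.2 for trees represented as child→parent dictionaries. It covers unaligned-node removal with child promotion and the replacement of source words by aligned target words, but it omits the duplicate-pruning and unaligned-target reattachment steps; the representation and function names are illustrative, not Intent's implementation.

```python
def remove_and_promote(tree, node):
    """tree maps child -> parent; delete node and re-attach its children
    to node's parent (cf. Algorithm B.1.2)."""
    parent = tree.pop(node)
    for child, par in list(tree.items()):
        if par == node:
            tree[child] = parent
    return tree

def project_tree(src_tree, alignment):
    """src_tree: dict {src_child: src_parent}, word indices 1..n, 0 = root.
    alignment: set of (src_index, tgt_index) pairs.
    Returns a projected dict {tgt_child: tgt_parent}."""
    tree = dict(src_tree)
    # 1) drop unaligned source words, promoting their children
    aligned_src = {s for s, _ in alignment}
    for node in [n for n in tree if n not in aligned_src]:
        remove_and_promote(tree, node)
    # 2) map the remaining source indices onto target indices
    src2tgt = {}
    for s, t in sorted(alignment):
        src2tgt.setdefault(s, t)              # keep one target word per source word
    projected = {}
    for child, parent in tree.items():
        tgt_child = src2tgt[child]
        tgt_parent = src2tgt.get(parent, 0)   # the root stays the root
        projected[tgt_child] = tgt_parent
    return projected

# English "the dog sleeps" (1 the, 2 dog, 3 sleeps) with 'the' unaligned:
src_tree = {3: 0, 2: 3, 1: 2}                 # sleeps<-root, dog<-sleeps, the<-dog
alignment = {(2, 1), (3, 2)}                  # dog -> tgt word 1, sleeps -> tgt word 2
print(project_tree(src_tree, alignment))      # -> {2: 0, 1: 2}
```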

B.2 Dependency Tree Manipulations

Algorithm B.2.1: Calculating the percentage of matched edges in a corpus C.

Input : A corpus C
Output: CorpusMatch_Src→Tgt(C), CorpusMatch_Tgt→Src(C)

begin
    Let F→E_matches = 0
    Let E→F_matches = 0
    foreach (F, E, A) ∈ C do
        Let F = (W_F, T_F)
        Let E = (W_E, T_E)
        Let A = {(f_i, e_j), ..., (f_k, e_l)}

        // Count matches for F → E
        foreach (f_c, f_p) ∈ T_F do
            /* If the child and parent in this edge align with a child→parent edge of
               the other tree, increase the match count. */
            if ∃ e_c, e_p : e_p = parent(e_c, T_E) and (f_p, e_p) ∈ A and (f_c, e_c) ∈ A then
                F→E_matches++

        // Count matches for E → F
        foreach (e_c, e_p) ∈ T_E do
            if ∃ f_c, f_p : f_p = parent(f_c, T_F) and (f_p, e_p) ∈ A and (f_c, e_c) ∈ A then
                E→F_matches++

    return Match(F→E) = 100 × F→E_matches / |T_F|
    return Match(E→F) = 100 × E→F_matches / |T_E|
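
A compact Python rendering of this edge-match statistic for a single sentence pair is sketched below, with trees as child→parent dictionaries and the alignment as a set of (f, e) index pairs; it is an illustrative re-implementation, not the code used for the experiments.

```python
def matched_edges(tree_f, tree_e, alignment):
    """tree_f, tree_e: dicts {child: parent}; alignment: set of (f, e) pairs.
    Returns the percentage of F->E and E->F edges whose child and parent both
    align with a corresponding child->parent edge in the other tree."""
    def count(src_tree, tgt_tree, links):
        hits = 0
        for child, parent in src_tree.items():
            if any((child, c) in links and (parent, p) in links
                   for c, p in tgt_tree.items()):
                hits += 1
        return 100.0 * hits / len(src_tree) if src_tree else 0.0

    f_to_e = count(tree_f, tree_e, alignment)
    e_to_f = count(tree_e, tree_f, {(e, f) for f, e in alignment})
    return f_to_e, e_to_f

tf = {2: 0, 1: 2}          # foreign: word 2 is the root, word 1 depends on it
te = {3: 0, 2: 3, 1: 2}    # English: 3 is the root, 2 <- 3, 1 <- 2
a = {(1, 2), (2, 3)}       # foreign 1 ~ English 2, foreign 2 ~ English 3
print(matched_edges(tf, te, a))   # -> (50.0, 33.33...)
```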

Appendix C

POS TAGSETS

C.1 Chintang

Tag         Universal   Description
n           NOUN        a word that can denote a referent without derivation
pro         PRON        various words with deictic reference (presently not used consistently)
v, vi, vt   VERB        a word that can serve as a predicate without derivation and that has one agreement slot (vi) or two (vt)
adj         ADJ         a word that can modify a head noun without derivation (only occurs in Nepali)
predadj     ADJ         a word that is mainly used as a predicate noun but can neither form the head of an NP nor modify a noun without further derivation
adv         ADV         a word that can modify a predicate without derivation
num         NUM         a noun used for counting that can take numeral classifiers
gm          PRT         any kind of grammatical marker, dependent or independent
v2          VERB        a dependent verb marking a grammatical function
interj      PRT         a word that regularly constitutes an utterance on its own
sound       ADV         an adverb necessarily followed by =mo [CIT]; mostly but not always indicates a sound
NoPOS       X           no POS tag provided

Table C.1: POS tags used for the Chintang data (Section 4.1.6; Bickel et al., 2009) and their mapping to the universal POS tags given in Table 4.9.

C.2 Hindi

Original    Universal   Description
NNP, NNPC   NOUN        Proper Noun (+Compound)
NN, NNC     NOUN        Common Noun (+Compound)
NST         NOUN        Noun denoting spatial relationship
PRP, WQ     PRON        Personal Pronoun (+Question)
DEM         PRON        Demonstrative
PSP         ADP         Postposition
VM          VERB        Main Verb
VAUX        VERB        Auxiliary Verb
RB          ADV         Adverb
RP          PRT         Particle
CC          CONJ        Conjunction (Subordinating & Coordinating)
QF          ADJ         Quantifier (e.g. lots, some)
QC          NUM         Cardinal (two, three)
QO          NUM         Ordinals (first, third)
INTF        ADJ         Intensifier
INJ         PRT         Interjection
RDP         PRT         Reduplication
ECH         PRT         Echo Words
NEG         ADV         Negative
UT          VERB        Quotative
SYM         X           Special Symbol
JJ          ADJ         Adjective
PUNC        .           Punctuation

Table C.2: POS tags used in the Hindi-Urdu Treebank Project (Section 4.1.5; Bhatt et al., 2009) and their mapping to the universal POS tags given in Table 4.9.
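
Applied programmatically, a mapping like Table C.2 reduces to a simple lookup; the sketch below shows only a subset of the rows, with "X" assumed as the fallback for unlisted tags.

```python
# Partial, illustrative rendering of the Table C.2 mapping.
HINDI_TO_UNIVERSAL = {
    "NNP": "NOUN", "NNPC": "NOUN", "NN": "NOUN", "NNC": "NOUN", "NST": "NOUN",
    "PRP": "PRON", "WQ": "PRON", "DEM": "PRON", "PSP": "ADP",
    "VM": "VERB", "VAUX": "VERB", "RB": "ADV", "RP": "PRT",
    "CC": "CONJ", "JJ": "ADJ", "QC": "NUM", "QO": "NUM", "PUNC": ".",
}

def to_universal(tags, mapping=HINDI_TO_UNIVERSAL, default="X"):
    """Map a sequence of treebank tags onto the universal tagset."""
    return [mapping.get(t, default) for t in tags]

print(to_universal(["NNP", "PSP", "VM", "PUNC"]))   # -> ['NOUN', 'ADP', 'VERB', '.']
```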

Appendix D

DEFINITIONS OF TERMS AND INDEX

Definition of Terms

gram  A token of gloss-line annotation used to mark a grammatical feature of a token, such as inflection or case, e.g. ‘3sg’ for third-person singular, or NOM for nominative case, following Bybee and Dahl (1989). 26, 28, 58, 72, 73, 191

projection  Using annotation on the tokens of one language, together with an alignment to the tokens of another, to “project” that annotation from the source language onto the other. 25, 71

subword  Any element, gram or word, that is contained within a single whitespace-delimited gloss-line token. iii, 58, 73, 84, 88, 91

BIBLIOGRAPHY

Željko Agić, Maria Jesus Aranzabe, Aitziber Atutxa, Cristina Bosco, Jinho Choi, Marie-Catherine de Marneffe, Timothy Dozat, Richárd Farkas, Jennifer Foster, Filip Ginter, Iakes Goenaga, Koldo Gojenola, Yoav Goldberg, Jan Hajič, Anders Trærup Johannsen, Jenna Kanerva, Juha Kuokkala, Veronika Laippala, Alessandro Lenci, Krister Lindén, Nikola Ljubešić, Teresa Lynn, Christopher D. Manning, Héctor Alonso Martínez, Ryan McDonald, Anna Missilä, Simonetta Montemagni, Joakim Nivre, Hanna Nurmi, Petya Osenova, Slav Petrov, Jussi Piitulainen, Barbara Plank, Prokopis Prokopidis, Sampo Pyysalo, Wolfgang Seeker, Mojgan Seraji, Natalia Silveira, Maria Simi, Kiril Simov, Aaron Smith, Reut Tsarfaty, Veronika Vincze, and Daniel Zeman. 2015. Universal dependencies 2.0. https://github.com/ryanmcd/uni-dep-tb.

Alfred V Aho and Jeffrey D Ullman. 1969. Syntax directed translations and the pushdown assembler. Computer And System Sciences 3(1):37–56.

J. K. Baker. 1979. Trainable grammars for speech recognition. In Proceedings of the Spring Conference of the Acoustical Society of America. pages 547–550.

Mark Baker. 1996. Phrase Structure and the Lexicon, Springer Netherlands, Dordrecht, chapter On the Structural Positions of Themes and Goals, pages 7–34.

Dorothee A. Beermann and Lars Hellan. 2002. VP-chaining in Oriya. In Proceedings of the 2002 Lexical Functional Grammar Conference. Stanford Linguistic Association and CSLI, Stanford, CA, USA.

Emily M Bender. 2011. On achieving and evaluating language-independence in NLP. Linguistic Issues in Language Technology 6(3):1–26.

Emily M. Bender, Joshua Crowgey, Michael Wayne Goodman, and Fei Xia. 2014. Learning grammar specifications from IGT: A case study of Chintang. In Proceedings of the 2014 Workshop on the Use of Computational Methods in the Study of Endangered Languages. Baltimore, MD, USA, pages 43–53.

Emily M. Bender, Michael Wayne Goodman, Joshua Crowgey, and Fei Xia. 2013. Towards creating precision grammars from interlinear glossed text: Inferring large-scale typological properties. In Proceedings of the 7th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities. Sofia, Bulgaria, pages 74–83.

Adam L Berger, Vincent J Della Pietra, and Stephen A Della Pietra. 1996. A maximum entropy approach to natural language processing. Computational Linguistics 22(1):39–71.

Rajesh Bhatt, Bhuvana Narasimhan, Martha Palmer, Owen Rambow, Dipti Misra Sharma, and Fei Xia. 2009. A multi-representational and multi-layered treebank for Hindi/Urdu. In ACL-IJCNLP ’09: Proceedings of the Third Linguistic Annotation Workshop. Singapore, pages 186–189.

Balthasar Bickel, Goma Banjade, Toya N. Bhatta, Martin Gaenzle, Netra P. Paudyal, Manoj Rai, Novel Kishore Rai, Ichchha Purna Rai, and Sabine Stoll. 2009. Audiovisual corpus of the Chintang language, including a longitudinal corpus of language acquisition by six children, plus a trilingual dictionary, paradigm sets, grammar sketches, ethnographic descriptions, and photographs. DoBeS, Universität Leipzig, Nijmegen, Leipzig. http://www.mpi.nl/DOBES.

Balthasar Bickel, Bernard Comrie, and Martin Haspelmath. 2008. The Leipzig glossing rules: Conventions for interlinear morpheme-by-morpheme glosses. Max Planck Institute for Evolutionary Anthropology and Department of Linguistics, University of Leipzig. https://www.eva.mpg.de/lingua/pdf/Glossing-Rules.pdf.

Steven Bird, Ewan Klein, and Edward Loper. 2009. Natural Language Processing with Python. O’Reilly Media.

Thorsten Brants. 2000. TnT: A statistical part-of-speech tagger. In Proceedings of the Sixth Conference on Applied Natural Language Processing. Seattle, Washington, pages 224–231.

Thorsten Brants, Wojciech Skut, and Hans Uszkoreit. 2003. Syntactic annotation of a Ger- man newspaper corpus. In Treebanks, Springer Netherlands, Dordrecht, pages 73–87.

M. M. Bravmann. 1977. Studies in Semitic philology. Brill.

Peter F Brown, Peter V deSouza, Robert L Mercer, Vincent J Della Pietra, and Jenifer C Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics 18(4):467–479.

Peter F Brown, Vincent J Della Pietra, Stephen A Della Pietra, and Robert L Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics 19(2):263–311.

Sabine Buchholz and Erwin Marsi. 2006. CoNLL-X shared task on multilingual dependency parsing. In Proceedings of the Tenth Conference on Computational Natural Language Learning. New York, NY, USA, pages 149–164.

Joan L Bybee and Östen Dahl. 1989. The creation of tense and aspect systems in the languages of the world. Studies in Language 13(1):51–103.

Chris Callison-Burch and Mark Dredze. 2010. Creating speech and language data with Amazon's mechanical turk. In Proceedings of the NAACL HLT Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk. Los Angeles, CA, USA, pages 1–12.

Nicoletta Calzolari, Riccardo Del Gratta, Gil Francopoulo, Joseph Mariani, Francesco Rubino, Irene Russo, and Claudia Soria. 2012. The LRE map: Harmonising community descriptions of resources. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12). European Language Resources Association (ELRA), Istanbul, Turkey.

David Chiang. 2007. Hierarchical phrase-based translation. Computational Linguistics 33(2):201–228.

Alexander Clark. 2001. Unsupervised induction of stochastic context-free grammars using distributional clustering. In Proceedings of the 2001 Workshop on Computational Natural Language Learning. Toulouse, France.

Alexander Clark. 2003. Combining distributional and morphological information for part of speech induction. In Proceedings of the Tenth Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, Budapest, Hungary.

Michael Collins. 1999. Head-driven statistical models for natural language parsing. Ph.D. thesis, University of Pennsylvania.

Michael Collins. 2002. Discriminative training methods for hidden Markov models: theory and experiments with perceptron algorithms. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing. Stroudsburg, PA, USA, pages 1–8.

Koby Crammer and Yoram Singer. 2003. Ultraconservative online algorithms for multiclass problems. The Journal of Machine Learning Research 3:951–991.

Mathias Creutz and Krista Lagus. 2007. Unsupervised models for morpheme segmentation and morphology learning. ACM Transactions on Speech and Language Processing (TSLP) 4(1):3:1–3:34.

Dipanjan Das and Slav Petrov. 2011. Unsupervised part-of-speech tagging with bilingual graph-based projections. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011). Portland, OR, USA, pages 600–609.

Hal Daumé III. 2009. Unsupervised search-based structured prediction. In Proceedings of the 26th Annual International Conference on Machine Learning. Montreal, Canada, pages 209–216.

Marie-Catherine de Marneffe and Bill MacCartney. 2006. Generating typed dependency parses from phrase structure parses. In Proceedings of the fifth International Conference on Language Resources and Evaluation. Genoa, Italy.

Marie-Catherine de Marneffe and Christopher D Manning. 2008. The Stanford typed depen- dencies representation. In Proceedings of the workshop on Cross-Framework and Cross- Domain Parser Evaluation. Association for Computational Linguistics.

Peter de Swart. 2003. Dative pronouns, binding, and empty prepositions. In Mini Conference on Binding. Nijmegen.

Arthur P Dempster, Nan M Laird, and Donald B Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society 39:1–38.

Leon Derczynski, Alan Ritter, Sam Clark, and Kalina Bontcheva. 2013. Twitter part-of- speech tagging for all: Overcoming sparse and noisy data. In Proceedings of Recent Ad- vances in Natural Language Processing. Hissar, Bulgaria, pages 198–206.

Bonnie Jean Dorr. 1994. Machine translation divergences: a formal description and proposed solution. Computational Linguistics 20:597–633.

Matthew S. Dryer and Martin Haspelmath, editors. 2013. WALS Online. Max Planck Institute for Evolutionary Anthropology, Leipzig.

Chris Dyer, Victor Chahuneau, and Noah A Smith. 2013. A simple, fast, and effective reparameterization of IBM model 2. In North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Atlanta.

Jason M. Eisner. 1996. Three new probabilistic models for dependency parsing: an explo- ration. In Proceedings of the 16th conference on Computational linguistics. Association for Computational Linguistics, Stroudsburg, PA, USA, volume 1 of COLING ’96 , pages 340–345.

Anna Feldman, Jirka Hana, and Chris Brew. 2006. Experiments in cross-language morpho- logical annotation transfer. In Computational Linguistics and Intelligent Text Processing: 7th International Conference. Springer Berlin Heidelberg, pages 41–50.

Foo Labs. 2014. Xpdf: A PDF viewer for X. http://www.foolabs.com/xpdf/.

Alexander Fraser and Daniel Marcu. 2006. Semi-supervised training for statistical word alignment. In Proceedings of the 21st International Conference on Computational Lin- guistics and the 44th annual meeting of the Association for Computational Linguistics. Sydney, Australia, pages 769–776. 198

Alexander Fraser and Daniel Marcu. 2007. Getting the structure right for word alignment: LEAF. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL). Prague, Czech Republic, pages 51–60.

Michel Galley, Jonathan Graehl, Kevin Knight, Daniel Marcu, Steve DeNeefe, Wei Wang, and Ignacio Thayer. 2006. Scalable inference and training of context-rich syntactic trans- lation models. In Proceedings of the 44th Annual Meeting of the Association for Compu- tational Linguistics. Sydney, Australia.

Kuzman Ganchev, Jennifer Gillenwater, and Ben Taskar. 2009. Dependency grammar in- duction via bitext projection constraints. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP. Singapore, pages 369–377.

Qin Gao. 2013. mgiza. https://github.com/moses-smt/mgiza.

Qin Gao, Nguyen Bach, and Stephan Vogel. 2010. A semi-supervised word alignment al- gorithm with partial manual alignments. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and Metrics (MATR). Uppsala, Sweden, pages 1–10.

Ryan Georgi. 2009. Grammar induction with prototypes derived from interlinear text. Mas- ter’s thesis, University of Washington, Seattle, WA.

Ryan Georgi, William D Lewis, and Fei Xia. 2014. Capturing divergence in dependency trees to improve syntactic projection. Language Resources and Evaluation 48(4):709–739.

Ryan Georgi, Fei Xia, and William D Lewis. 2012. Improving dependency parsing with interlinear glossed text and syntactic projection. In Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012). Mumbai, India.

Ryan Georgi, Fei Xia, and William D. Lewis. 2013. Enhanced and portable dependency projection algorithms using interlinear glossed text. In The 51st Annual Meeting of the Association for Computational Linguistics. Sofia, Bulgaria.

Ryan Georgi, Fei Xia, and William D Lewis. 2015. Enriching interlinear text using automat- ically constructed annotators. In Proceedings of the 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH-2015), in conjunction with ACL 2015 . Beijing, China.

Michael Wayne Goodman, Joshua Crowgey, Fei Xia, and Emily M Bender. 2014. Xigt: extensible interlinear glossed text for natural language processing. Language Resources and Evaluation 49(2):455–485.

João V. Graça, Kuzman Ganchev, Luísa Coheur, Fernando Pereira, and Ben Taskar. 2011. Controlling complexity in part-of-speech induction. Journal of Artificial Intelligence Research 41.

Joseph H Greenberg. 1963. Some universals of grammar with particular reference to the order of meaningful elements. Universals of Language pages 73–113.

Aria Haghighi and Dan Klein. 2006a. Prototype-driven grammar induction. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics. Sydney, Australia, pages 881–888.

Aria Haghighi and Dan Klein. 2006b. Prototype-driven learning for sequence models. In Proceedings of the main conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics. Association for Computational Linguistics, New York, NY, USA, pages 320–327.

Jan Hajič, Massimiliano Ciaramita, Richard Johansson, Daisuke Kawahara, Maria Antònia Martí, Lluís Màrquez, Adam Meyers, Joakim Nivre, Sebastian Padó, Jan Štěpánek, Pavel Straňák, Mihai Surdeanu, Nianwen Xue, and Yi Zhang. 2009. The CoNLL-2009 shared task: Syntactic and semantic dependencies in multiple languages. In Proceedings of the 13th Conference on Computational Natural Language Learning (CoNLL-2009), June 4–5. Boulder, CO, USA.

Harald Hammarström, Robert Forkel, Martin Haspelmath, and Sebastian Bank. 2016. Glottolog 2.7, Max Planck Institute for the Science of Human History, chapter Yukaghir. Available online at http://glottolog.org, accessed on 2016-08-16.

Jiri Hana, Anna Feldman, and Chris Brew. 2004. A resource-light approach to Russian morphology: Tagging Russian using Czech resources. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Rodopi Bv Editions, pages 222–229.

David G Hays. 1964. Dependency theory: A formalism and some observations. Language 40(4):511.

Rebecca Hwa, Philip Resnik, Amy Weinberg, Clara Cabezas, and Okan Kolak. 2004. Boot- strapping parsers via syntactic projection across parallel texts. Natural Language Engi- neering 1(1):1–15.

Rebecca Hwa, Philip Resnik, Amy Weinberg, Clara Cabezas, and Okan Kolak. 2005. Boot- strapping parsers via syntactic projection across parallel texts. Natural Language Engi- neering 11(03):311–325.

Rebecca Hwa, Philip Resnik, Amy Weinberg, and Okan Kolak. 2002. Evaluating translational correspondence using annotation projection. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Philadelphia, PA, USA.

Ray Jackendoff. 1983. Semantics and Cognition, volume 8 of Current Studies in Linguistics Series. MIT Press, Cambridge, MA, USA.

Douglas Jones and Rick Havrilla. 1998. Twisted pair grammar: Support for rapid develop- ment of machine translation for low density languages. In Proceedings of the Third Confer- ence of the Association for Machine Translation in the Americas on Machine Translation and the Information Soup. Springer-Verlag.

Dan Klein and Christopher D Manning. 2002. A generative constituent-context model for improved grammar induction. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Philadelphia, PA, USA, pages 128–135.

Dan Klein and Christopher D Manning. 2003. Fast exact inference with a factored model for natural language parsing. In Advances in Neural Information Processing Systems 15 . Cambridge, MA, USA, pages 3–10.

Dan Klein and Christopher D Manning. 2004. Corpus-based induction of syntactic structure: Models of dependency and constituency. In Proceedings of the 42nd Meeting of the Associ- ation for Computational Linguistics (ACL’04). Association for Computational Linguistics, Morristown, NJ, USA. Article No. 478.

Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT Summit. Phuket, Thailand.

Terry Koo, Xavier Carreras, and Michael Collins. 2008. Simple semi-supervised dependency parsing. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-08: HLT). Columbus, OH, USA.

Tibor Laczkó. 2002. Control and complex event nominals in Hungarian. In Miriam Butt and Tracy Holloway King, editors, Proceedings of the LFG02 Conference. CSLI Publications.

M. Paul Lewis. 2009. Ethnologue: Languages of the World. SIL International, 16th edition.

Philip M. Lewis II and Richard E Stearns. 1968. Syntax-directed transduction. Journal of the ACM (JACM) 15(3):465–488.

William D Lewis. 2006. ODIN: a model for adapting and enriching legacy infrastructure. In 2006 Second IEEE International Conference on e-Science and Grid Computing (e- Science’06). IEEE, Amsterdam, Netherlands, pages 137–137.

William D Lewis and Fei Xia. 2008. Automatically identifying computationally relevant typological features. In Proceedings of the Third International Joint Conference on Natural Language Processing. Hyderabad, India.

William D Lewis and Fei Xia. 2009. Parsing, projecting & prototypes: repurposing linguistic data on the web. In 12th Conference of the European Chapter of the ACL. Association for Computational Linguistics, Athens, Greece, pages 41–44.

William D Lewis and Fei Xia. 2010. Developing ODIN: A multilingual repository of annotated language data for hundreds of the world’s languages. Literary and Linguistic Computing 25(3):303–319.

Marco Lui, Jey Han Lau, and Timothy Baldwin. 2014. Automatic detection and language identification of multilingual documents. Transactions of the Association for Computa- tional Linguistics 2:27–40.

Gideon S Mann and Andrew McCallum. 2008. Generalized expectation criteria for semi- supervised learning of conditional random fields. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics. Columbus, OH, USA, pages 955–984.

Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: the Penn treebank. Computational Linguistics 19(2):313–330.

Andrew Kachites McCallum. 2002. MALLET: a machine learning for language toolkit. http://mallet.cs.umass.edu.

Ryan McDonald, Joakim Nivre, Yvonne Quirmbach-Brundage, Yoav Goldberg, Dipanjan Das, Kuzman Ganchev, Keith Hall, Slav Petrov, Hao Zhang, Oscar Täckström, Claudia Bedini, Núria Bertomeu Castelló, and Jungmee Lee. 2013. Universal dependency annotation for multilingual parsing. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics. Sofia, Bulgaria.

Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajič. 2005. Non-projective dependency parsing using spanning tree algorithms. Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing pages 523–530.

Ryan McDonald, Slav Petrov, and Keith Hall. 2011. Multi-source transfer of delexicalized dependency parsers. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Edinburgh, UK, pages 62–72.

Ilya Dan Melamed. 2003. Multitext grammars and synchronous parsers. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology. Morristown, NJ, USA, pages 79–86.

Seyed Abolghasem Mirroshandel, Alexis Nasr, and Joseph Le Roux. 2012. Semi-supervised dependency parsing using lexical affinities. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics. pages 777–785.

Wataru Nakamura. 1997. A Constraint-Based Typology of Case Systems. Ph.D. thesis, State University of New York at Buffalo.

Tahira Naseem, Harr Chen, Regina Barzilay, and Mark Johnson. 2010. Using universal linguistic knowledge to guide grammar induction. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. pages 1234–1244.

Norio Nasu. 2001. Towards a theory of non-cyclic A-movement. In Essex Graduate Student Papers in Language and Linguistics. volume 3, pages 133–160.

Igor Nedjalkov. 1998. Converbs in the languages of eastern Siberia. Language Sciences 20(3):339–351.

Ad Neeleman and Kriszta Szendrői. 2007. Radical pro drop and the morphology of pronouns. Linguistic Inquiry 38(4):671–714.

Peter Nilsson. 2015. A Snowball Sampling Approach for Studying Digital Minority Languages. Ph.D. thesis, Swarthmore College.

Rachel Nordlinger and Louisa Sadler. 2000. Tense as a nominal category. In Miriam Butt and Tracy Holloway King, editors, Proceedings of the 2000 Lexical Functional Grammar Conference. CSLI Publications, Stanford, CA, USA.

Franz Josef Och and Hermann Ney. 2003a. A systematic comparison of various statistical alignment models. Computational Linguistics 29(1):19–51.

Franz Josef Och and Hermann Ney. 2003b. A systematic comparison of various statistical alignment models. Computational Linguistics 29(1):19–51.

PDFLib GmbH. 2015. PDFLib TET 5 – text and image extraction toolkit. https://www.pdflib.com/products/tet/.

Slav Petrov, Dipanjan Das, and Ryan McDonald. 2012. A universal part-of-speech tagset. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12). European Language Resources Association (ELRA), Istanbul, Turkey.

Martin F Porter. 2001. Snowball: A language for stemming algorithms. http://snowballstem.org/.

Princeton University. 2010. About WordNet. Princeton University. http://wordnet.princeton.edu.

Chris Quirk, Arul Menezes, and Colin Cherry. 2005. Dependency treelet translation. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, Morristown, NJ, USA, pages 271–279.

Lawrence R Rabiner. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. In Proceedings of the IEEE. volume 77, pages 257–286.

Christian Rishøj and Anders Søgaard. 2011. Factored translation with unsupervised word clusters. In Proceedings of the Sixth Workshop on Statistical Machine Translation. pages 447–451.

Edward Sapir. 1921. Language: An introduction to the study of speech. Harcourt, Brace and company, New York.

Kevin Scannell. 2014. Indigenous tweets. http://indigenoustweets.com/.

Kevin P. Scannell. 2007. The Crúbadán project: Corpus building for under-resourced languages. In Building and Exploring Web Corpora: Proceedings of the 3rd Web as Corpus Workshop. Louvain-la-Neuve, Belgium, pages 5–15. http://crubadan.org/.

Kevin P. Scannell. 2016. An gramadóir. http://borel.slu.edu/gramadoir/index.html.

SIL International. 2006. ISO 639-3. http://www-01.sil.org/iso639-3/.

Noah A Smith and Jason Eisner. 2005a. Contrastive estimation: Training log-linear models on unlabeled data. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Ann Arbor, MI, USA, pages 354–362.

Noah A Smith and Jason Eisner. 2005b. Guiding unsupervised grammar induction using contrastive estimation. In Proceedings of IJCAI Workshop on Grammatical Inference Applications. pages 73–82.

Benjamin Snyder and Regina Barzilay. 2008. Unsupervised multilingual learning for mor- phological segmentation. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-08: HLT). Columbus, OH, USA.

Kathrin Spreyer and Jonas Kuhn. 2009. Data-driven dependency parsing of new languages using incomplete and noisy training data. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL). Boulder, CO, USA, pages 12–20.

Anthony Stone. 2004. Transliteration of Indic scripts: How to use ISO 15919.

Oscar Täckström, Ryan McDonald, and Joakim Nivre. 2013. Target language adaptation of discriminative transfer parsers. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Atlanta, GA, USA.

Lucien Tesnière. 1959. Éléments de syntaxe structurale. Klincksieck, Paris.

Jörg Tiedemann. 2003. Combining clues for word alignment. In Proceedings of the Tenth Conference of the European Chapter of the Association for Computational Linguistics. Budapest, Hungary, pages 339–346.

Kristina Toutanova, Haluk Tolga Ilhan, and Christopher D Manning. 2002. Extensions to HMM-based statistical word alignment models. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Morristown, NJ, USA, pages 87–94.

Kristina Toutanova and Mark Johnson. 2007. A Bayesian LDA-based model for semi- supervised part-of-speech tagging. In J.C. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems, MIT Press, Cambridge, MA, volume 20, pages 1521–1528.

Kristina Toutanova, Dan Klein, Christopher D Manning, and Yoram Singer. 2003. Feature- rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguis- tics on Human Language Technology. Morristown, NJ, USA, pages 173–180.

Stephan Vogel, Hermann Ney, and Christoph Tillmann. 1996. HMM-based word alignment in statistical translation. In Proceedings of the 16th Conference on Computational Linguistics. Copenhagen, Denmark.

Dekai Wu. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics 23(3):377–403.

Fei Xia and William D Lewis. 2007. Multilingual structural projection across interlinear text. In Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics. Rochester, NY, USA, pages 452–459.

Fei Xia, William D Lewis, Michael Wayne Goodman, Glenn Slayden, Ryan Georgi, Joshua Crowgey, and Emily M Bender. 2016. Enriching a massively multilingual database of interlinear glossed text. Language Resources and Evaluation pages 1–29.

Min Xiao and Yuhong Guo. 2015. Annotation projection-based representation learning for cross-lingual dependency parsing. In Proceedings of the 2015 Conference on Computational Natural Language Learning. Beijing, China.

David Yarowsky and Grace Ngai. 2001. Inducing multilingual POS taggers and NP bracketers via robust projection across aligned corpora. In Proceedings of the Second meeting of the North American Association for Computational Linguistics. Johns Hopkins University, Stroudsburg, PA.

Daniel Zeman. 2008. Reusable tagset conversion using tagset drivers. In Proceedings of the Sixth International Conference on Language Resources and Evaluation. Marrakech, Morocco.

Jan-Wouter Zwart. 2002. The antisymmetry of Turkish. In Generative Grammar in Geneva. volume 3, pages 23–26.