Infixer: a Method for Segmenting Non-Concatenative Morphology in Tagalog
City University of New York (CUNY)
CUNY Academic Works
Dissertations, Theses, and Capstone Projects
6-2016

Infixer: A Method for Segmenting Non-Concatenative Morphology in Tagalog
Steven R. Butler
Graduate Center, City University of New York

More information about this work at: https://academicworks.cuny.edu/gc_etds/1308
Discover additional works at: https://academicworks.cuny.edu
This work is made publicly available by the City University of New York (CUNY).

INFIXER: A METHOD FOR SEGMENTING NON-CONCATENATIVE MORPHOLOGY IN TAGALOG
by STEVEN BUTLER

A masters thesis submitted to the Graduate Faculty in Linguistics in partial fulfillment of the requirements for the degree of Master of Arts, The City University of New York
2016

© 2016 STEVEN BUTLER
Some Rights Reserved
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
https://creativecommons.org/licenses/by-nc-sa/4.0/

This manuscript has been read and accepted by the Graduate Faculty in Linguistics in satisfaction of the thesis requirement for the degree of Master of Arts.

Professor William Sakas, Thesis Advisor (Date)
Professor Gita Martohardjono, Executive Officer (Date)
THE CITY UNIVERSITY OF NEW YORK

Abstract

Adviser: Professor William Sakas

In this paper, I present a method for coercing a widely-used morphological segmentation algorithm, Morfessor (Creutz and Lagus 2005a), into accurately segmenting non-concatenative morphological patterns.
The non-concatenative patterns targeted (infixation and partial-reduplication) present problems for many segmentation algorithms, and tools that can successfully identify and segment those patterns can improve a number of downstream natural language processing tasks, including keyword search and machine translation.

Included with this project is an implementation of the segmentation method described in the form of a Python library called Infixer. This approach involves a preprocessing step that restructures non-concatenative patterns into concatenative ones using regular expressions, which allows the algorithm to operate on the input data as it would any other data. The target language for this project is Tagalog, a major language of the Philippines that makes extensive use of non-concatenative morphology, and the training data is built from data from the IARPA Babel Program and the Tagalog Wikipedia.

The results for this test were promising, especially for the more straightforward cases of infixation tested. For the data tested, the Infixer implementation using affix regular expressions showed performance gains over those without, demonstrating an improved ability to segment data containing non-concatenative morphological forms. In the future it is hoped that this project can lead to the development of tools that can be used effectively on languages besides Tagalog and on a more diverse array of phenomena.

Acknowledgments

I would like to thank everyone who advised, assisted, and supported me through the completion of this project. That includes my advisors and instructors William Sakas, Andrew Rosenberg, Nava Shaked, Juliette Blevins, and all of the other professors I’ve had through my time at the Graduate Center.
I would also like to thank many of my fellow students for the invaluable support and advice they have provided me, including Rebecca Lovering, Rachel Rakov, Maxwell Schwartz, Pablo Gonzalez, Michelle Morales, and all of the other members of the computational linguistics reading group. Finally, I have to thank my partner, Susan Heinick, for quite literally supporting me through every step of this process, as well as my family, Melanie Mendenhall, Robert Butler, Amanda Butler, Daphne Mendenhall, and everyone else.

This work was partially supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Defense U.S. Army Research Laboratory (DoD / ARL) contract number W911NF-12-C-0012. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. Disclaimer: The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DoD/ARL, or the U.S. Government. This effort uses the IARPA Babel Program Tagalog language collection release IARPA-babel106b-v0.2g.

Contents

1 Introduction
2 Background
  2.1 Morpho Challenge and Morfessor
  2.2 Non-Concatenative Morphology
  2.3 Tagalog Grammar
    2.3.1 Focus System
    2.3.2 -um-
    2.3.3 -in-
    2.3.4 Partial Reduplication and Aspect
    2.3.5 Tagalog as a Test Language
3 Data
  3.1 Corpus Preparation
  3.2 IARPA Babel Program
  3.3 Wikipedia
4 Method
  4.1 Overview of the Process
  4.2 Regular Expressions for Affixes
    4.2.1 Infixation
    4.2.2 Infixation within Reduplication
    4.2.3 Reduplication Alone
    4.2.4 Ordering
5 Implementation
  5.1 Tokenizers
  5.2 Filtering and Modeling
    5.2.1 AffixFilter
    5.2.2 InfixerModel
  5.3 Segmentation and Evaluation
  5.4 Interface
6 Evaluation
  6.1 Evaluation Data
  6.2 Results by Test Set
    6.2.1 Test Set 1: Infixation
    6.2.2 Test Set 2: Infixation within Reduplication
    6.2.3 Test Set 3: Reduplication
    6.2.4 Test Set 4: Concatenative Morphology Only
  6.3 Discussion
7 Future work
  7.1 Testing on Other Languages
  7.2 Generalization
  7.3 Affix identification
References

List of Tables

1 Count of Word Annotations per Test Set
2 Results for Babel corpus trained models on test set 1
3 Results for Wikipedia corpus trained models on test set 1
4 Results for Babel corpus trained models on test set 2
5 Results for Wikipedia corpus trained models on test set 2
6 Results for Babel corpus trained models on test set 3
7 Results for Wikipedia corpus trained models on test set 3
8 Results for Babel corpus trained models on test set 4
9 Results for Wikipedia corpus trained models on test set 4

1 Introduction

Morphology is the division of linguistics focused on morphemes, the smallest grammatical units in language, and morphological segmentation is an area of research in natural language processing (henceforth NLP) focused primarily on methods for splitting words into their component morphemes. Accurate morphological segmentation can lead to dramatic improvements in many “downstream” NLP tasks, such as machine learning and keyword search, as segmentation can reduce the percentage of out-of-vocabulary (OOV) tokens in a corpus and can group words that share a lemma form but differ in inflection. When an NLP process makes use only of word tokens, morphologically related forms that vary due to inflection or some other form of affixation will be treated as entirely distinct. To take an example from Indonesian, the verbs mendukung “support” (active voice) and didukung “support” (passive voice) are transparently derived from a single root, dukung “support”. Without segmentation of the data, a keyword-search operation would fail to return didukung on a query for mendukung, reducing the usefulness of the system overall.

For high-resource languages like English, Spanish, and French, a variety of well-supported and tailored morphological segmentation systems already exist, such as the tools found in the Stanford CoreNLP Toolkit (Manning et al. 2014). Tools like these can and do provide high-quality results for their target languages, but the time and money needed to build a system of that quality are substantial, and preparing something similar for even a tiny fraction of the more than 7000 languages currently spoken across the globe would be prohibitively expensive. For the speakers of many of those languages, even those with tens of millions of speakers, access to the tools that result from the development of advanced NLP systems is currently out of reach.
Instead, many researchers have turned to designing semi-supervised and unsupervised machine-learning algorithms that can be applied to many different languages, both high- and low-resource. One particular family