A Basic Language Technology Toolkit for Quechua

A thesis submitted in fulfilment of the requirements for the degree of Doctor of Philosophy to the Faculty of Arts of the University of Zurich

by Annette Rios

Accepted in the Autumn Term 2015 on the Recommendation of the Doctoral Committee:
Prof. Dr. Martin Volk (main advisor)
Prof. Dr. Balthasar Bickel

Zurich, 2015

Abstract

In this thesis, we describe the development of several natural language processing tools and resources for the Andean language Cuzco Quechua as part of the SQUOIA project at the University of Zurich. The main focus of this work lies on the implementation of a machine translation system for the language pair Spanish-Cuzco Quechua. Since the target language Quechua is not only a non-mainstream language in the field of computational linguistics, but also typologically quite different from the source language Spanish, several rather unusual problems became evident, and we had to find solutions in order to deal with them. Therefore, the first part of this thesis presents monolingual tools and resources that are not directly related to machine translation, but are nevertheless indispensable.

The main contributions of this thesis are as follows:

• We built a hybrid machine translation system that can translate Spanish text into Cuzco Quechua. The core system is a classical rule-based transfer engine; however, several statistical modules are included for tasks that cannot be resolved reliably with rules.
• We implemented a text normalization pipeline that automatically rewrites Quechua texts in different orthographies or dialects to the official Peruvian standard orthography. This includes a tool for the morphological analysis of Quechua words that achieves high coverage. Furthermore, we also created a slightly adapted version that can be used as a spell checker back-end, in combination with a plug-in for the open-source productivity suite LibreOffice/OpenOffice.
• We built a Quechua dependency treebank of about 2000 annotated sentences, which not only provided training data for some of the translation modules, but also served as a source of verification, since it allows us to observe the distribution of certain syntactic and morphological structures. Furthermore, we trained a statistical parser on the treebank and thus now have a complete pipeline to morphologically analyze, disambiguate and then parse Quechua texts.

All resources and tools are freely available from the project's website (https://github.com/ariosquoia/squoia). Apart from the scientific interest in developing tools and applications for a language that is typologically distant from the mainstream languages in computational linguistics, we hope that the various resources presented in this thesis will be useful not only for language learners and linguists, but also to Quechua speakers who want to use modern technology in their native language.

Acknowledgements

Above all, I would like to thank my supervisor Martin Volk for his support and guidance during the four years of this project. I am also very grateful for the continued assistance, endless discussions and many laughs with my fellow researcher in the SQUOIA project, Anne Göhring. I would also like to thank the members of the doctoral committee, Balthasar Bickel and Paul Heggarty, who provided a detailed review with many suggestions for improvement.

Moreover, I wish to thank the people in Peru who made this work possible:

• Richard Castro Mamani for the collaboration on the spell checkers, the management and organization of the evaluation of the MT system and the translations for the treebank
• Roger Gonzalo Segura for the syntactic annotation and the numerous discussions about Quechua syntax
• César Morante Luna for translations, corrections and filling the gaps of the bilingual dictionary of the MT system
• Virginia Mamani Mamani and Irma Álvarez Ccoscco for the contribution of the translations of the treebank texts
• Juan Cruz Tello for providing contacts and general support of the project

Furthermore, I would like to thank all my colleagues at the Institute of Computational Linguistics, especially Simon Clematide for his help with the finite-state tools, and my fellow PhD students Magdalena Plamada, for general advice on MT-related issues, Don Tuggener, for ideas and discussions about coreference resolution to deal with Quechua switch-reference, and Johannes Graën, for the technical support with the web-related parts of this thesis, as well as my former colleague Rico Sennrich for his valuable tips and tricks concerning the machine learning parts of the Spanish-Quechua translation system.

I would also like to thank my family, especially Naira and my mother Susanne, for their patience and support during these past four years.

Most importantly, I am grateful for the financial support provided by the Swiss National Science Foundation under grants 100015_132219 and 100015_149841.
Contents

Abstract
Acknowledgements
Contents
List of Figures
List of Tables
Abbreviations

1 Introduction
  1.1 Overview
  1.2 The Quechua Language Family
    1.2.1 Distribution of Quechua Languages
  1.3 NLP for Quechua
  1.4 The SQUOIA Project
  1.5 Research Questions
  1.6 Thesis Outline

I Monolingual Quechua Resources

2 Quechua Morphology
  2.1 Introduction
  2.2 Orthographic Variation
  2.3 Morphological Analysis
    2.3.1 Finite-State Networks
    2.3.2 Finite-State Analysis for Quechua
  2.4 Morphological Disambiguation and Text Normalization
    2.4.1 Model 1: Disambiguation of Ambiguous Roots
    2.4.2 Model 2: Disambiguation of Nominalizing and Verbalizing Suffixes
    2.4.3 Model 3: Disambiguation of Verbal Morphology
    2.4.4 Model 4: Disambiguation of Independent Suffixes
    2.4.5 Performance of the Four Models
    2.4.6 Evaluation
  2.5 Spell Checking
  2.6 Summary

3 Quechua Treebank
  3.1 Introduction
  3.2 Corpus
  3.3 Quechua Dependency Annotation Scheme
    3.3.1 Case Suffixes
    3.3.2 Elision of Copula
    3.3.3 Coordination
    3.3.4 Focus
    3.3.5 Relative Clauses
    3.3.6 Internally Headed Relative Clauses
    3.3.7 Embedded Clauses
  3.4 Annotation Process
  3.5 Parsing Quechua Sentences
    3.5.1 Conversion PML to CoNLL
    3.5.2 Parsing and Preliminary Evaluation
  3.6 Summary

II Bilingual Spanish-Quechua Resources

4 Word-Aligned Parallel Text: Bilingwis Spanish-Quechua
  4.1 Introduction
  4.2 Spanish-Quechua Bilingwis
  4.3 Summary

5 Hybrid Machine Translation Spanish-Quechua
  5.1 Introduction
  5.2 Analysis of Spanish Input
  5.3 Verb Form Disambiguation
    5.3.1 Relative Clauses
      5.3.1.1 Relative Clause Disambiguation with Machine Learning
      5.3.1.2 Training Data
      5.3.1.3 Features
      5.3.1.4 Evaluation
      5.3.1.5 Relative Clauses with no Direct Correspondence
    5.3.2 Coreference Resolution
    5.3.3 Disambiguation of Subordinated Clauses
      5.3.3.1 Disambiguation of Subordinated Clauses with Machine Learning
        5.3.3.1.1 Training Data
        5.3.3.1.2 Features
        5.3.3.1.3 Classification
      5.3.3.2 Rule-based Translation System with Machine Learning Verb Disambiguation
      5.3.3.3 Evaluation
        5.3.3.3.1 Whole Verb Disambiguation Pipeline
        5.3.3.3.2 Additional Verb Disambiguation Module
  5.4 Lexical Transfer
  5.5 Morphological Disambiguation
  5.6 Syntactic Transfer and Generation
  5.7 Ranking and Morphological Generation
  5.8 Discourse: Modeling Topic and Focus
    5.8.1 Discourse Morphology and Information Structure in Quechua
    5.8.2 Modeling Information Structure for Machine Translation
  5.9 Evaluation of the Machine Translation Output
    5.9.1 Setting
    5.9.2 Results
  5.10 Summary

6 Conclusions
  6.1 Recapitulation and Contributions
  6.2 Discussion and Research Questions
  6.3 Outlook
    6.3.1 Morphology Tools
    6.3.2 Treebank
