Redalyc.A Basic Language Technology Toolkit for Quechua

Procesamiento del Lenguaje Natural ISSN: 1135-5948 [email protected] Sociedad Española para el Procesamiento del Lenguaje Natural España Rios, Annette A Basic Language Technology Toolkit for Quechua Procesamiento del Lenguaje Natural, núm. 56, 2016, pp. 91-94 Sociedad Española para el Procesamiento del Lenguaje Natural Jaén, España Available in: http://www.redalyc.org/articulo.oa?id=515754423009 How to cite Complete issue Scientific Information System More information about this article Network of Scientific Journals from Latin America, the Caribbean, Spain and Portugal Journal's homepage in redalyc.org Non-profit academic project, developed under the open access initiative Procesamiento de Lenguaje Natural, Revista nº 56, marzo de 2016, pp 91-94 recibido 29-10-2015 revisado 17-12-2015 aceptado 28-01-2016 A Basic Language Technology Toolkit for Quechua Una colecciónde herramientas básicas para el procesamiento automático del Quechua Annette Rios Institute of Computational Linguistics University of Zurich [email protected] Resumen: Tesis escrita por Annette Rios en la Universidad de Zúrich bajo la di- recciónde Prof. Dr. Martin Volk. La tesis fue defendida el 21 de septiembre de 2015 en la Universidad de Zúrich ante el tribunal formado por Prof. Dr. Martin Volk (Universidad de Zúrich, Departamento de Lingü´ısticaComputacional), Prof. Dr. Balthasar Bickel (Universidad de Zúrich, Departamento de Lingü´ısticaCompar- ativa) y Dr. Paul Heggarty (Instituto Max Planck para la Antropolog´ıaEvolutiva). La tesis obtuvo la calificación`Summa cum Laude'. Palabras clave: Traducciónautomática,análisismorfológico,programa de con- cordancia, transductor de estados finitos, traducciónautomáticah´ıbrida,treebank, gramáticade dependencias Abstract: Thesis written by Annette Rios under the supervision of Prof. Dr. Mar- tin Volk at the University of Zurich. The thesis defense was held at the University of Zurich on September 21, 2015 and was awarded `Summa Cum Laude'. The mem- bers of the committee were Prof. Dr. Martin Volk (University of Zurich, Institute of Computational Linguistics), Prof. Dr. Balthasar Bickel (University of Zurich, Department of Comparative Linguistics) and Dr. Paul Heggarty (Max Planck In- stitute for Evolutionary Anthropology). Keywords: Machine translation, morphological analysis, concordancer, finite state, hybrid MT, treebank, dependency grammar 1 Introduction also be used for closely related varieties. In this thesis, we describe the development The main focus of this work lies on the of several natural language processing tools implementation of a machine translation sys- and resources for the Andean language Cuzco tem for the language pair Spanish-Cuzco Quechua as part of the SQUOIA project at Quechua. Since the target language Quechua the University of Zurich. is not only a non-mainstream language in Quechua is a group of closely related the field of computational linguistics, but languages, spoken by 8-10 million people in also typologically quite different from the Peru, Bolivia, Ecuador, Southern Colombia source language Spanish, several rather un- and the North-Western parts of Argentina. usual problems became evident, and we had Although Quechua is often referred to to find solutions in order to deal with them. as a `language' and its local varieties as Therefore, the first part of this thesis presents `dialects', Quechua represents a language monolingual tools and resources that are not family, comparable in depth to the Romance directly related to machine translation, but or Slavic languages (Adelaar and Muysken, are nevertheless indispensable. 2004).Mutual intelligibility, especially be- All resources and tools are freely available tween speakers of distant dialects, is not from the project's website.1 always given. The applications described Apart from the scientific interest in devel- in this thesis were developed for Cuzco oping tools and applications for a language Quechua, but some applications, such as the morphology tools and the spell checker, can 1https://github.com/ariosquoia/squoia ISSN 1135-5948 © 2016 Sociedad Española para el Procesamiento del Lenguaje Natural Annette Rios that is typologically distant from the main- directly related to machine translation, au- stream languages in computational linguis- tomatic processing of Quechua morphology tics, we hope that the various resources pre- provides the necessary resources for impor- sented in this thesis will be useful not only tant parts of the translation system. for language learners and linguists, but also Moreover, the project involved the cre- to Quechua speakers who want to use modern ation of a parallel treebank in all 3 languages. technology in their native language. While the Spanish and German parts of the treebank were finished within the first year 2 Structure of Thesis of the project, building the Quechua tree- The thesis is structured as follows: bank took considerably longer: before the ac- In the first chapter, we set the broader tual annotation process started, the texts had context for the development of NLP tools for to be translated into Quechua. Additionally, a low-resource language and we give a short we had to design an annotation scheme from overview on the characteristics and the dis- scratch and set up pre-processing and anno- tribution of the Quechua languages. tation tools. Chapter 2 explains the morphological A by-product of the resulting parallel cor- structures in Quechua word formation and pus is the Spanish-Quechua version of Bil- how we deal with them in technological ap- ingwis, a web tool that allows to search for plications, this includes morphological anal- translations in context in word-aligned texts. ysis, disambiguation, automatic text normal- Generally speaking, the monolingual tools ization and spell checking. and resources described in the first part of Chapter 3 summarizes the treebanking this thesis are necessary to build the multi- process and how existing tools were adapted lingual applications of the second part. for the syntactic annotation of Quechua texts 3 Contributions with dependency trees. Chapter 4 describes how Bilingwis, an on- The main contributions of this thesis are as line service for searching translations in par- follows: allel, word-aligned texts, was adapted to the language pair Spanish-Quechua. • We built a hybrid machine translation Chapter 5 describes the implementation of system that can translate Spanish text the hybrid machine translation system for the into Cuzco Quechua. The core system language pair Spanish-Quechua, with a spe- is a classical rule-based transfer engine, cial focus on resolving morphological ambi- however, several statistical modules are guities. included for tasks that cannot be re- The main part of this work was the de- solved reliably with rules. velopment of the translation system Spanish- • We have created an extensive finite Cuzco Quechua. However, the special situa- state morphological analyzer for South- tion of the target language Quechua as a non- ern Quechua that achieves high cover- mainstream language in computational lin- age. The analyzer consists of a set cas- guistics and as a low-prestige language in so- caded finite state transducers, where the ciety resulted in many language specific prob- last transducer is a guesser that can an- lems that had to be solved along the way. alyze words with unknown roots. Fur- For instance, the wide range of different thermore, we included the Spanish lem- orthographies used in written Quechua is a mas from the FreeLing library2 into the problem for any statistical text processing, analyzer in order to recognize the nu- such as the training of a language model to merous Spanish loan words in Quechua rank different translation options. Thus, in texts (words that consist of a Spanish order to get a statistical language model, we root with Quechua suffixes). had to first find a way to normalize Quechua • We implemented a text normaliza- texts automatically. tion pipeline that automatically rewrites Furthermore, the resulting normalization Quechua texts in different orthographies pipeline can be adapted for spell checking or dialects to the official Peruvian stan- with little effort. For this reason, an entire dard orthography. Additionally, we cre- chapter of this thesis is dedicated to the treat- ment of Quechua morphology: although not 2http://nlp.lsi.upc.edu/freeling/ 92 A Basic Language Technology Toolkit for Quechua ated a slightly adapted version that translation system was trained on normalized can be used as spell checker back-end, morphemes instead of word forms, in order to in combination with a plug-in for the mitigate data sparseness. open-source productivity suite LibreOf- Apart from the rich morphology, Quechua fice/OpenOffice.3 has several characteristics that have rarely (or not at all) been dealt with in machine • We built a Quechua dependency tree- translation. One of the most important issues bank of about 2000 annotated sentences, concerns verb forms in subordinated clauses: that provided not only training data for while Spanish has mostly finite verbs in sub- some of the translation modules, but ordinated clauses that are marked for tense, also served as a source of verification, aspect, modality and person, Quechua often since it allows to observe the distribu- has nominal forms that vary according to tion of certain syntactic and morpholog- the clause type. Furthermore, Quechua uses ical structures. Furthermore, we trained switch-reference as a device of clause linkage, a statistical parser on the treebank and while in Spanish, co-reference of subjects is thus have now a complete pipeline to unmarked and pronominal subjects are usu- morphologically analyze, disambiguate ally omitted (`pro-drop'). This leads to am- and then parse Quechua texts. biguities when Spanish text is translated into Quechua. 4 Conclusions and Outlook Another special case are relative clauses: We have created tools and applications for a the form of the nominalized verb in the language with very limited resources: While Quechua relative clause depends on whether printed Quechua dictionaries and grammars the head noun is the semantic agent of the rel- exist, some of them quite outdated, digital ative clause.

Load more