<<

Procesamiento del Lenguaje Natural ISSN: 1135-5948 [email protected] Sociedad Española para el Procesamiento del Lenguaje Natural España

Rios, Annette A Basic Language Technology Toolkit for Quechua Procesamiento del Lenguaje Natural, núm. 56, 2016, pp. 91-94 Sociedad Española para el Procesamiento del Lenguaje Natural Jaén, España

Available in: http://www.redalyc.org/articulo.oa?id=515754423009

How to cite Complete issue Scientific Information System More information about this article Network of Scientific Journals from Latin America, the Caribbean, Spain and Portugal Journal's homepage in redalyc.org Non-profit academic project, developed under the open access initiative Procesamiento de Lenguaje Natural, Revista nº 56, marzo de 2016, pp 91-94 recibido 29-10-2015 revisado 17-12-2015 aceptado 28-01-2016

A Basic Language Technology Toolkit for Quechua

Una colecci´onde herramientas b´asicas para el procesamiento autom´atico del Quechua

Annette Rios Institute of Computational University of Zurich [email protected]

Resumen: Tesis escrita por Annette Rios en la Universidad de Z´urich bajo la di- recci´onde Prof. Dr. Martin Volk. La tesis fue defendida el 21 de septiembre de 2015 en la Universidad de Z´urich ante el tribunal formado por Prof. Dr. Martin Volk (Universidad de Z´urich, Departamento de Ling¨u´ısticaComputacional), Prof. Dr. Balthasar Bickel (Universidad de Z´urich, Departamento de Ling¨u´ısticaCompar- ativa) y Dr. Paul Heggarty (Instituto Max Planck para la Antropolog´ıaEvolutiva). La tesis obtuvo la calificaci´on‘Summa cum Laude’. Palabras clave: Traducci´onautom´atica,an´alisismorfol´ogico,programa de con- cordancia, transductor de estados finitos, traducci´onautom´aticah´ıbrida,, gram´aticade dependencias Abstract: Thesis written by Annette Rios under the supervision of Prof. Dr. Mar- tin Volk at the University of Zurich. The thesis defense was held at the University of Zurich on September 21, 2015 and was awarded ‘Summa Cum Laude’. The mem- bers of the committee were Prof. Dr. Martin Volk (University of Zurich, Institute of Computational Linguistics), Prof. Dr. Balthasar Bickel (University of Zurich, Department of Comparative Linguistics) and Dr. Paul Heggarty (Max Planck In- stitute for Evolutionary Anthropology). Keywords: Machine , morphological analysis, concordancer, finite state, hybrid MT, treebank, dependency grammar

1 Introduction also be used for closely related varieties. In this thesis, we describe the development The main focus of this work lies on the of several natural language processing tools implementation of a sys- and resources for the Andean language Cuzco tem for the language pair Spanish-Cuzco Quechua as part of the SQUOIA project at Quechua. Since the target language Quechua the University of Zurich. is not only a non-mainstream language in Quechua is a group of closely related the field of computational linguistics, but languages, spoken by 8-10 million people in also typologically quite different from the Peru, Bolivia, Ecuador, Southern Colombia source language Spanish, several rather un- and the North-Western parts of Argentina. usual problems became evident, and we had Although Quechua is often referred to to find solutions in order to deal with them. as a ‘language’ and its local varieties as Therefore, the first part of this thesis presents ‘dialects’, Quechua represents a language monolingual tools and resources that are not family, comparable in depth to the Romance directly related to machine translation, but or Slavic languages (Adelaar and Muysken, are nevertheless indispensable. 2004).Mutual intelligibility, especially be- All resources and tools are freely available tween speakers of distant dialects, is not from the project’s website.1 always given. The applications described Apart from the scientific interest in devel- in this thesis were developed for Cuzco oping tools and applications for a language Quechua, but some applications, such as the morphology tools and the , can 1https://github.com/ariosquoia/squoia ISSN 1135-5948 © 2016 Sociedad Española para el Procesamiento del Lenguaje Natural Annette Rios that is typologically distant from the main- directly related to machine translation, au- stream languages in computational linguis- tomatic processing of Quechua morphology tics, we hope that the various resources pre- provides the necessary resources for impor- sented in this thesis will be useful not only tant parts of the translation system. for language learners and linguists, but also Moreover, the project involved the cre- to Quechua speakers who want to use modern ation of a parallel treebank in all 3 languages. technology in their native language. While the Spanish and German parts of the treebank were finished within the first year 2 Structure of Thesis of the project, building the Quechua tree- The thesis is structured as follows: bank took considerably longer: before the ac- In the first chapter, we set the broader tual annotation process started, the texts had context for the development of NLP tools for to be translated into Quechua. Additionally, a low-resource language and we give a short we had to design an annotation scheme from overview on the characteristics and the dis- scratch and set up pre-processing and anno- tribution of the Quechua languages. tation tools. Chapter 2 explains the morphological A by-product of the resulting parallel cor- structures in Quechua formation and pus is the Spanish-Quechua version of Bil- how we deal with them in technological ap- ingwis, a web tool that allows to search for plications, this includes morphological anal- in context in word-aligned texts. ysis, disambiguation, automatic text normal- Generally speaking, the monolingual tools ization and spell checking. and resources described in the first part of Chapter 3 summarizes the treebanking this thesis are necessary to build the multi- process and how existing tools were adapted lingual applications of the second part. for the syntactic annotation of Quechua texts 3 Contributions with dependency trees. Chapter 4 describes how Bilingwis, an on- The main contributions of this thesis are as line service for searching translations in par- follows: allel, word-aligned texts, was adapted to the language pair Spanish-Quechua. • We built a hybrid machine translation Chapter 5 describes the implementation of system that can translate Spanish text the hybrid machine translation system for the into Cuzco Quechua. The core system language pair Spanish-Quechua, with a spe- is a classical rule-based transfer engine, cial focus on resolving morphological ambi- however, several statistical modules are guities. included for tasks that cannot be re- The main part of this work was the de- solved reliably with rules. velopment of the translation system Spanish- • We have created an extensive finite Cuzco Quechua. However, the special situa- state morphological analyzer for South- tion of the target language Quechua as a non- ern Quechua that achieves high cover- mainstream language in computational lin- age. The analyzer consists of a set cas- guistics and as a low-prestige language in so- caded finite state transducers, where the ciety resulted in many language specific prob- last transducer is a guesser that can an- lems that had to be solved along the way. alyze with unknown roots. Fur- For instance, the wide range of different thermore, we included the Spanish lem- orthographies used in written Quechua is a mas from the FreeLing library2 into the problem for any statistical , analyzer in order to recognize the nu- such as the training of a language model to merous Spanish loan words in Quechua rank different translation options. Thus, in texts (words that consist of a Spanish order to get a statistical language model, we root with Quechua suffixes). had to first find a way to normalize Quechua • We implemented a text normaliza- texts automatically. tion pipeline that automatically rewrites Furthermore, the resulting normalization Quechua texts in different orthographies pipeline can be adapted for spell checking or dialects to the official Peruvian stan- with little effort. For this reason, an entire dard orthography. Additionally, we cre- chapter of this thesis is dedicated to the treat- ment of Quechua morphology: although not 2http://nlp.lsi.upc.edu/freeling/ 92 A Basic Language Technology Toolkit for Quechua

ated a slightly adapted version that translation system was trained on normalized can be used as spell checker back-end, morphemes instead of word forms, in order to in combination with a plug-in for the mitigate data sparseness. open-source productivity suite LibreOf- Apart from the rich morphology, Quechua fice/OpenOffice.3 has several characteristics that have rarely (or not at all) been dealt with in machine • We built a Quechua dependency tree- translation. One of the most important issues bank of about 2000 annotated sentences, concerns verb forms in subordinated clauses: that provided not only training data for while Spanish has mostly finite verbs in sub- some of the translation modules, but ordinated clauses that are marked for tense, also served as a source of verification, aspect, modality and person, Quechua often since it allows to observe the distribu- has nominal forms that vary according to tion of certain syntactic and morpholog- the clause type. Furthermore, Quechua uses ical structures. Furthermore, we trained switch-reference as a device of clause linkage, a statistical parser on the treebank and while in Spanish, co-reference of subjects is thus have now a complete pipeline to unmarked and pronominal subjects are usu- morphologically analyze, disambiguate ally omitted (‘pro-drop’). This leads to am- and then parse Quechua texts. biguities when Spanish text is translated into Quechua. 4 Conclusions and Outlook Another special case are relative clauses: We have created tools and applications for a the form of the nominalized verb in the language with very limited resources: While Quechua relative clause depends on whether printed Quechua dictionaries and grammars the head noun is the semantic agent of the rel- exist, some of them quite outdated, digital ative clause. In Spanish, on the other hand, resources are scarce. Furthermore, the lack relative clauses can be highly ambiguous, as of standardization in written Quechua texts in certain cases relativization on subjects and combined with the rich morphology hampers objects is not formally distinguished, but in- any statistical approach. We have imple- stead requires semantic knowledge to under- mented a pipeline to automatically analyze stand (Rios and G¨ohring,2016). Consider and normalize Quechua word forms, which these examples: lays the foundation for any further process- (1) non-agentive: ing (Rios and Castro Mamani, 2014). Due to the rich morphology, we decided to el pan que la mujer comi´o use morphemes as basic units instead of com- the bread REL the woman ate plete word forms in several of our resources: ’the bread that the woman ate’ For instance, the dependency treebank is built on morphemes since many of the typical (2) agentive: ’function words’ in European languages cor- la mujer que comi´o el pan respond to Quechua suffixes, e.g. we consider the woman REL ate the bread case markers as equivalent to prepositions in languages such as English (e.g. Quechua in- ’the woman who ate the bread’ strumental case -wan corresponds to English Furthermore, Spanish can express possession ‘by, with’). In accordance with the Stanford in a relative clause with cuyo - ‘whose’, while Dependency scheme (de Marneffe and Man- there is no such option in Quechua. The ning, 2008) we treat case suffixes as the head translation system can currently not handle of the noun they modify (Rios and G¨ohring, this case, as it would require a complete re- 2012). structuring of the sentence. Furthermore, the tools for the morpholog- Another difficult case are translations that ical analysis and normalization are relevant involve the first person plural: Spanish has as well for machine translation: The statisti- only one form, but Quechua distinguishes be- cal language model that we use in our hybrid tween an inclusive (‘we and you’) and an ex- clusive (‘we, but not you’) form. Unless the 3The plug-in was implemented by Richard Cas- tro from the Universidad Nacional de San Anto- Spanish source explicitly mentions if the ‘you’ nio Abad in Cuzco and is available from: https: is included or not, we cannot know which //github.com/hinantin/LibreOfficePlugin. form to use in Quechua and thus generate 93 Annette Rios both. The user will have to choose which References form is appropriate. Adelaar, W. F. H. and P. Muysken. 2004. Furthermore, Quechua conveys informa- The Languages of the Andes. Cambridge tion structure in discourse not only through Language Surveys. Cambridge University word order, but also through morphological Press, Cambridge. markings on in situ elements, while in Span- ish, information structure is mostly expressed de Marneffe, M.-C. and C. D. Manning. 2008. through non-textual features, such as into- Stanford Dependencies manual. Technical nation and stress. We have experimented report. with machine learning to insert discourse- Faller, M. 2002. Semantics and Pragmatics relevant morphology into the Quechua trans- of Evidentials in Cuzco Quechua. Ph.D. lation, but the results are not good enough thesis, Stanford University. to be used reliably for machine translation. Apart from a few cases that allow a rule- Rios, A. and R. Castro Mamani. 2014. Mor- based insertion, we do not include the suffixes phological Disambiguation and Text Nor- that mark topic and focus in the Quechua malization for Southern Quechua Vari- translation by default, but the module can eties. In Proceedings of the First Work- be activated through an optional parameter shop on Applying NLP Tools to Similar at runtime. Languages, Varieties and Dialects (Var- However, the most challenging issue re- Dial), pages 39–47, Dublin, Ireland, Au- garding different grammatical categories for gust. Association for Computational Lin- the translation system is evidentiality:4 while guistics. Spanish, like every language, has means to Rios, A. and A. G¨ohring. 2012. A tree is express the source of knowledge for an utter- a Baum is an ´arbol is a sach’a: Creat- ance, evidentiality is not a grammatical cat- ing a trilingual treebank. In N. Calzolari, egory in this language. Therefore, explicit K. Choukri, T. Declerck, M. U. Do˘gan, mention of the data source is optional and B. Maegaard, J. Mariani, J. Odijk, and usually absent. In Quechua, on the other S. Piperidis, editors, Proceedings of the hand, evidentiality needs to be expressed for Eight International Conference on Lan- every statement. Unmarked sentences are guage Resources and Evaluation (LREC possible in discourse, but they are usually 2012), pages 1874–1879, Istanbul, Turkey, understood as the speaker having direct evi- May. European Language Resources Asso- dence, unless the context clearly indicates in- ciation (ELRA). direct evidentiality (Faller, 2002). Since ev- identiality encodes a relation of the speaker Rios, A. and A. G¨ohring. 2016. Machine (or writer) to his proposition, and thus re- Learning applied to Rule-Based Machine quires knowledge about the speaker and his Translation. In M. Costa-juss`a,R. Rapp, experience in the world, this information can- P. Lambert, K. Eberle, R. E. Banchs, and not be inferred from the Spanish source text. B. Babych, editors, Hybrid Approaches to Since we cannot automatically infer the cor- Machine Translation, Theory and Appli- rect evidentiality, the translation system has cations of Natural Language Processing. a switch that allows the user to set eviden- Springer International Publishing. tiality for the translation of a document.

5 Acknowledgements This research was funded by the Swiss National Science Foundation under grants 100015 132219 and 100015 149841.

4Evidentiality is the indication of the source of knowledge for a given utterance. Cuzco Quechua dis- tinguishes three evidential categories: direct (speaker has witnessed/experienced what he describes), indi- rect (speaker heard from someone else) and conjec- ture (speaker makes an assumption). In fact, this is a highly simplified description, for a more elaborate analysis of Quechua evidentiality see (Faller, 2002). 94