<<

Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 2629–2633 Marseille, 11–16 May 2020 c European Language Resources Association (ELRA), licensed under CC-BY-NC

Development of a Guarani - Spanish Parallel Corpus

Luis Chiruzzo1, Pedro Amarilla2, Adolfo R´ıos2, Gustavo Gimenez´ Lugo3 1Universidad de la Republica,´ , 2Universidad Nacional de Asuncion,´ San Lorenzo, 3Universidade Tecnologica´ Federal do Parana,´ Curitiba, PR - Brasil luischir@fing.edu.uy, [email protected], [email protected], [email protected]

Abstract This paper presents the development of a Guarani - Spanish parallel corpus with sentence-level alignment. The Guarani sentences of the corpus use the Jopara Guarani dialect, the dialect of Guarani spoken in Paraguay, which is based on Guarani grammar and may include several Spanish loanwords or neologisms. The corpus has around 14,500 sentence pairs aligned using a semi-automatic process, containing 228,000 Guarani tokens and 336,000 Spanish tokens extracted from web sources.

Keywords: Guarani, Spanish, parallel corpus

1. Introduction ica. It is an official language in Paraguay and also in some Despite being widely spoken in many regions of South territories of and . America, with estimates of between 5 and 12 million speak- Guarani can be classified as an agglutinative polysynthetic ers, Guarani is an under-resourced language in the Natural language (Estigarribia and Pinta, 2017; Lustig, 2010), Language Processing and Computational Linguistics com- where the words are built by concatenation of affixes, gen- munities. The aim of this work is to present a parallel erally around a head (Academia de la Lengua Guaran´ı corpus of Guarani and Spanish with sentence-level align- (ALG), 2018). A word can act as a noun or a verb depend- ment. In this work we focus on the Paraguayan dialect of ing on the affixes it uses or the context it is in. For example Guarani, called Jopara, which is the widest spoken dialect the word “memby” could mean the noun “child” or it could of Guarani. The corpus contains text extracted from differ- be used as the root for the verb “to give birth”. ent web sources; we describe the extraction, alignment and In Paraguay the language has official status since 1992 and evaluation processes. is spoken by the majority of the population in the coun- As stated in the declaration of the 2019 International Year try (Academia de la Lengua Guaran´ı (ALG), 2018), with of Indigenous Languages by the United Nations1, language only a minority of the population that speaks exclusively is a core component of human rights and fundamental Spanish or other languages. However, as a result of the freedoms. Focusing on languages that are less resourced education policies and the demographic movements in the and under-represented in the Natural Language Processing country throughout the years, the language is not spoken community could foster interest in the development of tools homogeneously and different levels of mixture with Span- for these languages that might help making the lives of their ish can be seen in different parts of the territory or different speakers better by narrowing the digital divide. Because social strata (Lustig, 2010). The mix between Guarani and of this, we consider generating linguistic tools for less re- Spanish in Paraguay receives the name of Jopara, which is sourced languages such as Guarani is of the utmost impor- a Guarani word that could be translated as “mixture”. The tance. Jopara Guarani can be seen as a form of creole, and the The rest of the document is structured as follows: Section many variants of its use include: 2 gives a brief introduction of the Guarani language, par- ticularly the Paraguayan dialect; Section 3 discusses some • Pure Guarani sentences: prior linguistic resources and work in the area that exist for “Embohasam´ına ko marandu umi rehayhuvevape...”´ / Guarani; Section 4 presents the extraction and alignment “Pass this message to the people you love...” methods used for building the corpus; Section 5 shows the results of the corpus evaluation; and Section 6 gives some • Using Guarani grammar incorporating Spanish neolo- final conclusions and future work. gisms adapted to Guarani morphology: 2. Guarani and Jopara “Afara orenunsiata´ ko’erˆ o”˜ / “Afara will resign to- morrow” Guarani is a language of the Tupi-Guarani family spo- ken by the Guarani people, a group of indigenous peo- In this case “orenunsiata”´ is based on the Spanish ple from . Unlike other indigenous lan- verb “renunciar” / “to resign”. guages, Guarani did not disappear during the colonization • Using Guarani grammar incorporating Spanish loan- period, and nowadays it is spoken by indigenous and non- words: indigenous people from several countries in South Amer- “Ojuhuma´ 52 allanamiento Argentina gotyo ha 21 de- 1https://en.iyil2019.org/ tenido, 200.000 municion´ ha 2.500 fusil ojokova.”´ /

2629 “In Argentina there have been about 52 raids, 21 de- inspection was necessary in order to separate the Guarani tainees, 200,000 ammunition and 2,500 rifles were and the Spanish text in both cases, then a semi-automatic collected.” process was used to align the Guarani and Spanish sen- In this example the words “allanamiento” / “raid”, tences. “detenido” / “detainee”, “municion”´ / “ammunition” 4.1. News text and “fusil” / “rifle” are written in Spanish. We collected 2,878 news articles written in Guarani be- • Using Spanish sentences incorporating Guarani loan- tween December 2017 and August 2019, out of which words: 1,793 had a corresponding version in Spanish. However, “Tambien´ de ese horno salen humeantes y olorosas upon manual inspection it was noticeable that the text in sopas paraguayas y chipa guasu.” / “Smoky and fra- both documents was rather different: Guarani articles were grant Paraguayan soups and chipa guasu also come generally much shorter (three or four lines) and represented out of that oven.” a translation of only a fraction of the content in the Spanish versions. On the other hand, there were many articles that “chipa guasu” In this case the name , a type of food had Guarani version but no Spanish version. that has no translation, is used. In total we extracted 312,850 tokens in Guarani and In this work we try to focus on sentences that are based 622,779 in Spanish. Notice the difference between the on Guarani grammar, either with or without Spanish loan- number of files and the number of tokens: although there words or neologisms. are more articles in Guarani, the number of tokens is about half the number for Spanish. This can be explained in part 3. Related Work because Guarani sentences tend to have fewer tokens and Except for a few exceptions, Guarani remains largely unex- in part because the Spanish files included more content. plored in the Natural Language Processing and Computa- The text in these Guarani news articles uses a colloquial tional Linguistics fields. There is a reference corpus for cur- form of Paraguayan Guarani (Jopara) that uses Guarani rent Paraguayan Guarani called COREGUAPA (Secretar´ıa grammar mixed with several Spanish terms and loanwords. de Pol´ıticas Lingu¨´ısticas del Paraguay, 2019), although this 4.2. Blogs text corpus is available for online searches only and it is mono- We collected 116 blog entries written both in Guarani and lingual, containing no Spanish translations. Spanish that range from folktales and biographies to de- For other South American languages there have been at- scriptions of religious ceremonies. In this case the two ver- tempts to create parallel corpora or treebanks, for exam- sions of the text could be displayed side to side in the ar- ple for the Quechua-Spanish pair (Rios et al., 2008). Al- ticle, or intertwined in the same paragraphs. We manually though there have been no similar attempts for doing this processed these files to extract the sentences in both lan- kind of parallel treebank resource in Guarani, there is a guages and align them properly. small corpus of a Guarani dialect called Mbya Guarani an- The blog entries are on average twice as long as the news ar- notated with dependencies (Thomas, 2019; Dooley, 2006) ticles. We extracted 26,461 tokens in Guarani correspond- as part of the Universal Dependencies project (Nivre et ing to 36,149 tokens in Spanish. In this case we are sure that al., 2016). This corpus contains 98 sentences (around both texts contain approximately the same information, so 1,300 words) and has English translations of the sentences. the difference in token count should be explained because Mbya Guarani is the language spoken by the Mbya Guarani of the differences in the languages. people, and it is significantly different from Paraguayan The text extracted from Guarani blogs uses far less Spanish Guarani. loanwords, and is more akin to pure Guarani, although in Other small collections of sentences extracted from both cases they are forms of Paraguayan Guarani. Wikipedia and automatically aligned for Guarani-English or Guarani-Spanish also exist. These corpora are very small 4.3. Sentence Alignment because the Guarani version of Wikipedia itself is small. As mentioned before, the blogs text was aligned manually There is also an attempt to create a computer aided transla- as it was in general less noisy than the news text, but in tion system between Guarani and Spanish (Gasser, 2018), total it represents about 11% of the corpus. For the lat- using morpho-syntactic rules for providing translation can- ter, we designed a heuristic (similar to (Gale and Church, didates for the users. 1993)) that exploits the property that these texts tend to As far as we know, our work presents the first medium sized share many loanwords, and also include many names of aligned corpus of current Paraguayan Guarani and Spanish people and places that are written similarly in Guarani and sentences. Spanish. First of all we used the sentence splitter from the 4. Development NLTK library (Bird et al., 2009) to separate each document in sentences. We then used a dynamic programming algo- We collected text from different Paraguayan websites rithm that tries to find the best alignment between pairs of (blogs and news sources) that contained articles in Guarani sentences with the following considerations: and Spanish versions. In some cases the Guarani and Span- ish text was intertwined in the same article (e.g. at para- • The weight of the alignment between two sentences graph level) and in other cases there were two different ver- is based on the Jaccard coefficient of two and three sions of the article accessible via links in the text. Manual character n-grams.

2630 • There are usually more sentences on the Spanish side, 5. Evaluation so many of them will have to be discarded: it is more The evaluation of the corpus was done by inspecting a sam- likely to discard a Spanish sentence than a Guarani ple of the documents and manually analyzing the quality of sentence. the different alignments. In order to make the evaluation • Assume that no sentences have to be split up or merged process fair, it was done by people that did not participate on either side. in the alignment process. The task of the evaluators was to analyze each sentence pair and indicate which of the fol- This process was run over the news articles collected from lowing classes the pair belongs to: December 2017 through March 2019 and a manual evalu- ation was done over the results. During that evaluation we found out that some of the assumptions done during pre- • A - The Guarani sentence contains the same informa- processing and automatic alignment did not hold. In partic- tion as the Spanish sentence. ular: • B - The Spanish sentence contains more information • The sentence splitter was noisy for both Guarani and than the Guarani sentence. Spanish sentences alike. Interestingly, it committed the same kind of mistakes in both cases, for example • C - The Guarani sentence contains more information splitting a sentence when finding not so common ab- than the Spanish sentence. breviations or titles such as “Cnel.” (short for “Coro- nel” / “Colonel”). • D - The sentences do not match at all. • Sometimes, the sentences in one language were in a different order than in the other language. A summary of the results of both methods, the fully auto- • On occasions, the articles collected for both languages matic one and the manually curated semi-automatic one, is did not refer to the same topic at all, probably because shown in Figure 1. of an error in the linking of the articles. Because of this, we decided to go for a semi-automatic pro- cess in a second step, as follows: For all news articles col- lected from December 2017 through August 2019, we cre- ated a pre-calculated version of the alignment matrices that contained information about the number of tokens and Jac- card index for n-grams of different ranges. We used this information to analyze all cases and manually correct er- rors in sentence splitting and wrong alignments. We also removed content in Guarani files that was written entirely in Spanish, such as verbatim quotes and sentences that had not been translated. (a) Automatic method Docs Pairs of Guarani Spanish Sentences Tokens Tokens News 1,742 12,077 201,685 300,048 Blogs 116 2,454 26,461 36,149 Total 1,858 14,531 228,146 336,197

Table 1: Composition of the corpus in number of docu- ments, sentences and tokens for each language.

The final corpus consists of a set of text documents con- taining pairs of sentences. Each sentence in the corpus has a prefix “gn:” or “es:” depending on its language. Ta- (b) Semi-automatic method ble 1 shows the number of sentences and tokens for each language in the corpus, broken down by source (news or Figure 1: Estimated proportion of pairs that are correctly blogs text). Notice that all of the extracted text from blogs aligned (category A), partially correct with more informa- is present in the final version of the corpus, but only about tion on the Spanish side (category B), partially correct with two thirds of the news text. The news text that is miss- more information on the Guarani side (category C) or in- ing from the final corpus includes articles that only had a correctly aligned (category D) for (a) the automatic method Guarani version and also sentences that were pruned be- and (b) the manually curated semi-automatic method. cause they could not be properly aligned.

2631 Category Original Guarani sentence English translation for the Original Spanish sentence English translation for the Guarani sentence Spanish sentence Onemombyta˜ umi obra 17 They stop the works until Paran obras hasta el 17 They stop the works until A peve the 17th the 17th Onomongeta˜ oikovo´ We are talking with every- Se esta´ conversando con to- We are talking with every- opavave ndive. one. dos. one.

Opurahei´ Patria Querida, Signing Patria Querida, Al son de la cancion´ Pa- Signing the song Patria peva´ 100 mitapyahu˜ poyvi about 100 young people tria Querida, poco mas´ de Querida, just over 100 B ha pankarta oguerohoryvo´ with flags and banners cel- 100 jovenes´ con banderas y young people with flags and hikuai´ okuigui´ Jose´ Mar´ıa ebrated the resignation of pancartas celebraron la re- banners celebrated the res- Iba´nez˜ (ANR). Jose´ Mar´ıa Iba´nez˜ (ANR). nuncia de por lo menos uno ignation of at least one de los diputados corrup- of the corrupt legislators, tos, en este caso el ladron´ in this case the confessed confeso Jose´ Mar´ıa Iba´nez˜ thief Jose´ Mar´ıa Iba´nez˜ (ANR). (ANR). Peteˆı tenda ojehechava´ A place that caught their at- Un sitio que les llamo´ la A place that caught their area´ de Urgencias. tention was the Emergency atencion,´ a juzgar por sus attention, judging by their department. rostros, fue el area´ de Ur- faces, was the Emergency gencias. department. Ndorekoi´ proyectos a fu- He doesn’t have future Agradecido, el joven no Grateful, the young man turo ni opesa ouvo.;´ ome’eˆ projects or thinks about habla de proyectos a fu- doesn’t speak about future aguyke umi familia oipy- coming back; he thanks his turo ni de cuando´ piensa projects or when he plans to tyvova˜ ichupe. family for supporting him. regresar; solo agradece a come back; he just thanks su familia por el respaldo his family for their uncondi- incondicional, y a la vida, tional support , and life, for por las interminables the endless anecdotes and anecdotas´ y experiencias. experiences. C Mirta Gusinky “ojukaite” Mirta Gusinky “killed” the Mirta Gusinky “mato”´ el Mirta Gusinky “killed” the plan orekova´ hikuai´ plan they had plan plan D Onemyengovia˜ He was replaced for work- Jefe policial fue destituido Police chief was dismissed omba’apogui´ hekopete ing correctly por atacar a contrabandis- for attacking smugglers tas

Table 2: Examples of Guarani-Spanish pairs of sentences for each category, with English translation for each one high- lighting the main differences in the meaning of each sentence.

5.1. Automatic method Guarani side, and the pairs that were a complete mismatch For the first evaluation, a small sample of 20 documents dropped to only 0.3%, as shown in Figure 1b. (around 150 sentences) was used. From this small sample, Table 2 shows examples for each one of the categories in the the evaluation process found that 64.0% of the pairs were a final version of the corpus. Sentence pairs in category C are full match, in 25.5% cases there was more information on similar to the one shown in the table, where the sentences the Spanish side, in 4.6% cases there was more information have approximately the same meaning with some clarifica- on the Guarani side and on 5.9% cases the sentences did tions in Guarani that are not in Spanish but could be derived not match. This is shown in Figure 1a. from context given the rest of the text. However, the most Given this situation, we decided that the automated process interesting class is category B, where there are more het- was not enough to obtain a high quality alignment, so we erogeneous examples: cases with information that could be used a semi-automatic process to align the rest of the corpus derived from context, case with differences due to nuances and manually correct the sentence splitting and alignments, of meaning, but also cases where there is new information using the n-gram overlap as a base to guide the manual anal- that is not present in the Guarani version, like in the first ysis. and third examples of category B in the table. Further work is needed to analyze these different classes and find ways 5.2. Semi-automatic method to improve the corpus. After performing the manually curated alignment process for all the corpus, we ran a second round of evaluation with 6. Conclusions a sample of 38 documents (around 300 sentences). In this We presented the development of a parallel corpus of case, the evaluation process found that 84.1% of the pairs Guarani-Spanish text with sentence-level alignment. The were a full match, 13.9% contained more information on corpus is medium-sized, containing around 228,000 to- the Spanish side, 1.7% contained more information on the kens in Guarani along the corresponding 336,000 tokens

2632 in Spanish. The Guarani dialect we use is Jopara, the Rios, A., Gohring,¨ A., and Volk, M. (2008). A quechua- Paraguayan Guarani dialect, which in many cases includes spanish parallel treebank. LOT Occasional Series, Spanish loanwords or neologisms. We manually evaluated 12:53–64. a small sample of sentence pairs from the corpus and found Secretar´ıa de Pol´ıticas Lingu¨´ısticas del Paraguay. (2019). that 84.1% of the pairs were correctly matched, a further Corpus de Referencia del Guaran´ı Paraguayo Actual 15.6% were correctly aligned but one of the two sentences – COREGUAPA. http://www.spl.gov.py. Ac- might contain more information than the other, while only cessed: 2019-11-01. about 0.3% of the pairs were incorrectly aligned. Thomas, G. (2019). Universal dependencies for mbya´ As future work, we plan to augment the corpus by extract- guaran´ı. In Proceedings of the Third Workshop on Uni- ing text from more sources while using this corpus to guide versal Dependencies (UDW, SyntaxFest 2019), pages the alignment process in order to make it less dependent 70–77. on manual curation. It would also be interesting to design ways to detect and correct examples like the ones in cat- egory B that have new information in Spanish that is not present in Guarani, either by pruning this kind of sentences or by editing them. Furthermore, it would be interesting to use this corpus to develop a machine translation system. Due to the size of the corpus, a statistical or neural machine translation sys- tem might not yield optimal results. However, as both Guarani and Spanish are morphologically rich languages, some morpho-syntactic preprocessing might be done in order to extract more information from the text already present in the corpus. Besides this, it would be interest- ing to develop resources for other pairs such as Guarani- English, which might be easier to do having this initial Guarani-Spanish corpus to begin with and using the re- sources that have already been created for Spanish.

7. Bibliographical References Academia de la Lengua Guaran´ı (ALG). (2018). Gramatica´ Guaran´ı. Bird, S., Klein, E., and Loper, E. (2009). Natural language processing with Python: analyzing text with the natural language toolkit. ” O’Reilly Media, Inc.”. Estigarribia, B. and Pinta, J. (2017). Guarani linguistics in the 21st century. Brill. Gale, W. A. and Church, K. W. (1993). A program for aligning sentences in bilingual corpora. Computational linguistics, 19(1):75–102. Gasser, M. (2018). Mainumby: un ayudante para la traduccion´ castellano-guaran´ı. arXiv preprint arXiv:1810.08603. Lustig, W. (2010). Mba’eichapa´ oiko la guarani? guaran´ı y jopara en el paraguay. PAPIA-Revista Brasileira de Es- tudos do Contato Lingu´ıstico, 4(2):19–43.

8. Language Resource References Dooley, R. A. (2006). Lexico´ guarani, dialeto mbya´ com informac¸oes˜ uteis´ para o ensino medio,´ a aprendizagem e a pesquisa lingu¨´ıstica. Cuiaba,´ MT: Sociedade Inter- nacional de Lingu¨´ıstica, 143:206. Nivre, J., De Marneffe, M.-C., Ginter, F., Goldberg, Y., Hajic, J., Manning, C. D., McDonald, R., Petrov, S., Pyysalo, S., Silveira, N., et al. (2016). Universal de- pendencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 1659–1666.

2633