AmericasNLI: Evaluating Zero-shot Natural Language Understanding of Pretrained Multilingual Models in Truly Low-resource Languages Abteen Ebrahimi♦ Manuel Mager♠ Arturo Oncevay♥ Vishrav Chaudhary∇ Luis Chiruzzo4 Angela Fan∇ John OrtegaΩ Ricardo Ramosη Annette Riosψ Ivan Vladimir] Gustavo A. Gimenez-Lugo´ ♣ Elisabeth Mager] Graham Neubig./ Alexis Palmer♦ Rolando A. Coto Solanof Ngoc Thang Vu♠ Katharina Kann♦ ./Carnegie Mellon University fDartmouth College ∇Facebook AI Research ΩNew York University 4Universidad de la Republica,´ ηUniversidad Tecnologica´ de Tlaxcala ]Universidad Nacional Autonoma´ de ´ ♣Universidade Tecnologica´ Federal do Parana´ ♦University of Colorado Boulder ♥University of Edinburgh ♠University of Stuttgart ψUniversity of Zurich

Abstract Language ISO Family Dev Test Aymara aym Aymaran 743 750 Pretrained multilingual models are able to per- Ashaninka´ cni Arawak 658 750 form cross-lingual transfer in a zero-shot set- Bribri bzd Chibchan 743 750 ting, even for languages unseen during pre- Guarani gn Tupi-Guarani 743 750 training. However, prior work evaluating per- ´ nah Uto-Aztecan 376 738 formance on unseen languages has largely Otom´ı oto Oto-Manguean 222 748 Quechua quy Quechuan 743 750 been limited to low-level, syntactic tasks, and Raramuri´ tar Uto-Aztecan 743 750 it remains unclear if zero-shot learning of Shipibo-Konibo shp Panoan 743 750 high-level, semantic tasks is possible for un- Wixarika hch Uto-Aztecan 743 750 seen languages. To explore this question, we present AmericasNLI, an extension of XNLI Table 1: The languages in AmericasNLI, along with (Conneau et al., 2018) to 10 indigenous lan- their ISO codes, language families, and dataset sizes. guages of the . We conduct experi- ments with XLM-R, testing multiple zero-shot and translation-based approaches. Addition- further improvements (Muller et al., 2020; Pfeiffer ally, we explore model adaptation via contin- et al., 2020a,b; Wang et al., 2020). ued pretraining and provide an analysis of the Practically all work evaluating the zero-shot per- dataset by considering hypothesis-only mod- formance of models on languages not seen dur- els. We find that XLM-R’s zero-shot perfor- ing pretraining is currently limited to low-level, mance is poor for all 10 languages, with an av- erage performance of 38.62%. Continued pre- syntactic tasks such as part-of-speech tagging, de- training offers improvements, with an average pendency parsing, and named-entity recognition accuracy of 44.05%. Surprisingly, training on (Muller et al., 2020; Wang et al., 2020). This is due poorly translated data by far outperforms all to the fact that most multilingual datasets for high- other methods with an accuracy of 48.72%. level, semantic tasks only cover languages which are well-resourced enough to already be contained 1 Introduction in the pretraining data.1 This limits our ability to Pretrained multilingual models such as XLM draw more general conclusions with regards to the arXiv:2104.08726v1 [cs.CL] 18 Apr 2021 (Lample and Conneau, 2019), multilingual BERT zero-shot learning abilities of pretrained multilin- (mBERT; Devlin et al., 2019), and XLM-R (Con- gual models for unseen languages. neau et al., 2020) achieve strong cross-lingual trans- In order to make such an evaluation possi- fer results for many languages and natural language ble, we introduce AmericasNLI, an extension of processing (NLP) tasks. However, there exists a XNLI (Conneau et al., 2018) – a natural language discrepancy in terms of zero-shot performance be- inference (NLI; cf. §2.3) dataset covering 15 tween languages present in the pretraining data and high-resource languages – to 10 indigenous lan- those that are not: performance is generally highest guages spoken in the Americas: Ashaninka,´ Ay- for well-represented languages, and decreases with mara, Bribri, Guarani, Nahuatl,´ Otom´ı, Quechua, less representation. Yet, even for unseen languages, 1One notable exception is XCoPA (Ponti et al., 2020), performance is generally above chance, and model which covers two languages not present in mBERT’s pretrain- adaptation approaches have been shown to yield ing data: Haitian Creole and Quechua. Raramuri,´ Shipibo-Konibo, and Wixarika. All These models follow the standard pretraining– of them are truly low-resource languages: they finetuning paradigm: they are first trained on unla- have small or no Wikipedia corpora, and are not beled monolingual corpora from various languages present in the training data of current state-of-the- (the pretraining languages) and later finetuned art pretrained multilingual models. This dataset en- on target-task data in a – usually high-resource ables us to address the following research questions – source language. Having been exposed to a vari- (RQs): (1) How do existing multilingual models ety of languages through this training setup, cross- perform on unseen languages as compared to their lingual transfer results for these models are com- performance on XNLI? (2) Do methods aimed at petitive with the state of the art for many languages adapting models to unseen languages – previously and tasks. Commonly used models are mBERT exclusively evaluated on low-level, syntactic tasks (Devlin et al., 2019), which is pretrained on the – also increase performance on NLI? Wikipedias of 104 languages with masked language We experiment with XLM-R, both with and with- modeling (MLM) and next sentence prediction out model adaptation via continued pretraining on (NSP), and XLM, which is trained on 15 languages monolingual corpora in the target language. Our and introduces the translation language modeling results show that the performance of XLM-R out- objective, which is based on MLM, but uses pairs of-the-box is moderately above chance, and model of parallel sentences. XLM-R has improved per- adaptation leads to improvements of up to 5.88 formance over XLM, and trains on data from 100 percentage points. Training on machine-translated different languages with only the MLM objective. training data, however, results in an even larger per- Common to all models is a large shared subword formance gain of 10.1 percentage points over the vocabulary created using either WordPiece (Sen- corresponding XLM-R model without adaptation. nrich et al., 2016) or SentencePiece (Kudo and We further perform an analysis via experiments Richardson, 2018) tokenization. with hypothesis-only models, to examine poten- tial artifacts which may have been inherited from 2.2 Evaluating Pretrained Multilingual XNLI and find that performance is above chance Models for most models, but still below that for using the full example. Just as in the monolingual setting, where bench- AmericasNLI is publicly available at marks such as GLUE (Wang et al., 2018) and Super- nala-cub.github.io/resources. We GLUE (Wang et al., 2019) provide a look into the hope that it will serve as a benchmark for measur- performance of models across various tasks, multi- ing the zero-shot natural language understanding lingual benchmarks (Hu et al., 2020; Liang et al., abilities of multilingual models for unseen lan- 2020) cover a wide variety of tasks involving sen- guages. Additionally, we hope that our dataset will tence structure, classification, retrieval, and ques- motivate the development of novel pretraining and tion answering. Evaluation is based on zero-shot model adaptation techniques which are suitable for transfer from English, providing a strong bench- truly low-resource languages. mark for cross-lingual transfer. Additional work has been done examining what 2 Background and Related Work mechanisms allow multilingual models to trans- fer across languages (Pires et al., 2019; Wu and 2.1 Pretrained Multilingual Models Dredze, 2019). Wu and Dredze(2020) examine Prior to the widespread use of pretrained trans- transfer performance dependent on a language’s former models, cross-lingual transfer was mainly representation in the pretraining data. For language achieved through word embeddings (Mikolov et al., with low representation, multiple methods have 2013; Pennington et al., 2014; Bojanowski et al., been proposed to improve performance, including 2017), either by aligning monolingual embeddings extending the vocabulary, transliterating the target into the same embedding space (Conneau et al., text, and continuing pretraining before finetuning 2017; Lample et al., 2017; Grave et al., 2018) or (Lauscher et al., 2020; Chau et al., 2020; Muller by training multilingual embeddings (Ammar et al., et al., 2020; Pfeiffer et al., 2020a,c; Wang et al., 2016; Artetxe and Schwenk, 2019). Pretrained mul- 2020). In this work, we focus on continued pretrain- tilingual models represent the extension of multilin- ing to analyze the performance of model adaptation gual embeddings to pretrained transformer models. for a high-level, semantic task. 2.3 Natural Language Inference Lang. Source(s) Sent. aym Tiedemann(2012) 6,531 Given two sentences, the premise and the hypothe- Feldman and Coto-Solano(2020); Margery(2005); sis, the task of NLI consists of determining whether Jara Murillo(2018a); Constenla et al.(2004); bzd 7,508 the hypothesis logically entails, contradicts, or is Jara Murillo and Garc´ıa Segura(2013); Jara Murillo(2018b); Flores Sol orzano´ (2017) neutral to the premise. The most widely used cni Cushimariano Romano and Sebastian´ Q.(2008) 3,883 datasets for NLI in English are SNLI (Bowman et al., 2015) and MNLI (Williams et al., 2018). gn Chiruzzo et al.(2020) 26,032 XNLI (Conneau et al., 2018) is the multilingual hch Mager et al.(2017) 8,966 expansion of MNLI to 15 languages, providing nah Gutierrez-Vasques et al.(2016) 16,145 manually translated evaluation sets and machine- oto https://tsunkua.elotl.mx 4,889 translated training sets. While datasets for NLI or quy Agic´ and Vulic´(2019) 125,008 Galarreta et al.(2017); Loriot et al.(1993); the similar task of recognizing textual entailment shp 14,592 Gomez´ Montoya et al.(2019) exist for other languages (Bos et al., 2009; Alab- Brambila(1976); bas, 2013; Eichler et al., 2014; Amirkhani et al., tar github.com/pywirrarika/tar_par 14,720 2020), their lack of similarity prevents a general- ized study of cross-lingual zero-shot performance. Table 2: Parallel data used for our translation models. This is in contrast to XNLI, where examples for all 15 languages are parallel. To preserve this property 3.2 Languages of XNLI, when creating AmericasNLI, we choose to translate Spanish XNLI as opposed to creating We now discuss the languages in AmericasNLI. For examples directly in the target language. additional background on previous NLP research However, NLI datasets are not without issue: on indigenous languages of the Americas, we refer Gururangan et al.(2018) shows that artifacts from the reader to Mager et al.(2018). the creation of MNLI allow for models to classify Aymara Aymara is a polysynthetic Amerindian examples depending on only the hypothesis, show- language spoken in , , and by over ing that models may not be reasoning as expected. two million people (Homola, 2012). Aymara has Motivated by this, we provide further analysis of multiple , including Northern and Southern AmericasNLI in Section6 by comparing the perfor- Aymara, spoken on the southern Peruvian shore of mance of hypothesis-only models to models trained Lake Titicaca as well as around La Paz and, respec- on full examples. tively, in the eastern half of the Iquique province in northern Chile, the Bolivian department of Oruro, 3 AmericasNLI in northern Potosi, and southwest Cochabamba. However, Southern Aymara is slowly being re- 3.1 Data Collection Setup placed by Quechua in the last two regions. A rare AmericasNLI is the translation of a subset of XNLI linguistic phenomenon found in Aymara is vowel (Conneau et al., 2018). As translators between elision, an omission of various sounds in a lan- Spanish and the target languages are more fre- guage. Aymara has an SOV word order. Amer- quently available than those for English, we trans- icasNLI examples are translated into the Central late from the Spanish version. Additionally, some Aymara variant, specifically Aymara La Paz. translators reported that code-switching is often used to describe certain topics, and, while many Ashaninka´ Ashaninka´ is an Amazonian lan- words without an exact equivalence in the target guage from the Arawak family, spoken in Central language are worked in through translation or inter- and Eastern Peru, in a geographical region located pretation, others are kept in Spanish. To minimize between the eastern foothills of the and the the amount of Spanish vocabulary in the translated western fringe of the Amazon basin (Mihas, 2017). examples, we choose sentences from genres that A national-scale census in 2017 revealed a popu- 2 we judged to be relatively easy to translate into lation of 73,567 speakers. While Ashaninka´ in a the target languages: “face-to-face,” “letters,” and strict sense refers to the linguistic varieties spoken “telephone.” We choose up to 750 examples from in Ene, Tambo and Bajo Perene´ rivers, the name each of the development and test set, with exact 2https://bdpi.cultura.gob.pe/pueblos/ counts for each language in Table1. ashaninka Language Premise Hypothesis en And he said, Mama, I’m home. He told his mom he had gotten home. es Y el´ dijo: Mama,´ estoy en casa. Le dijo a su madre que hab´ıa llegado a casa. aym Jupax sanwa: Mamita, utankastwa. Utar purinxtwa sasaw mamaparux sanxa bzd En˜ a˜ ie’ iche: am˜ `ıx, ye’ tso’ u` a.˜ I am˜ `ıx a˜ iche´ irir to¨ ye’ dem´ ˜ın˜e u` a.˜ cni Iriori ikantiro: Ina, nosaiki pankotsiki. Ikantiro iriniro yaretaja pankotsiki. gn Ha ha’e he’i: Mama, aime ogape.´ He’´ıkuri isype´ oguahˆ ehagueˆ hogape.´ hch meta´ mik+ petay+: ne mama kita´ nepa yeka.´ yu mama m+pa+ p+ra h+awe kai kename yu kita´ he nuakai. nah huan yehhua quiihtoh: Nonantzin, niyetoc nochan quiilih inantzin niehcoquia oto xi nydi bienˆ a:ˆ maMe dimi an nguˆ bimabiˆ o ini maMe gueˆ o nguˆ quy Hinaptinmi pay nirqa: Mamay wasipim kachkani. Wasinman chayasqanmanta mananta willarqa. shp Jara neskata iki: tita, xobonkoriki ea. Jawen tita yoiaia iki moa xobon nokota. tar A’l´ı je an´ıli echiko:´ ku bitich´ı ne at´ıki Nana Iyela´ ku ruyeli,´ mapu bitich´ı ku nawali.´

Table 3: A parallel example in AmericasNLI with the label entailment. is also used to talk about the following nearby and like head-internal relative clauses, directional verbs closely-related Asheninka varieties: Alto Perene,´ and numerical classifiers (Jara Murillo, 2018a). Pichis, Pajonal, Ucayali-Yurua, and Apurucayali. There are several orthographies which use differ- Although it is the most widely spoken Amazonian ent diacritics for the same phenomena. For exam- language in Peru, certain varieties, such as Alto ple, the Constenla et al.(2004) system marks nasal Perene,´ are highly endangered. vowels with a line underneath the vowel, whereas Ashaninka´ is an agglutinating and polysynthetic the Jara Murillo and Garc´ıa Segura(2013) system language with a VSO word order. The verb is the does this with a tilde, e.g., e’/e’˜ (there). Even for most morphologically complex word class, with a researchers who use the same orthography, the Uni- rich repertoire of aspectual and modal categories. code encoding of similar diacritics differs amongst The language lacks case, except for one locative authors (e.g., the combining low line, minus sign suffix, so the grammatical relations of and and macron are all found for the nasal marking). are indexed as affixes on the verb itself. The dialects of Bribri differ in their exact vocab- Other notable linguistic features of the language ularies, e.g., n˜ala`/nol˜ ˜o` (road), and there are phono- include obligatory marking of a realis/irrealis dis- logical processes, like the deletion of unstressed tinction on the verb, a rich system of applicative suf- vowels, which also change the tokens found in texts, fixes, serial verb constructions, and a pragmatically e.g., dakaro`/kro` (chicken). In addition, Bribri has conditioned split intransitivity. Code-switching only been a written language for about 40 years, so with Spanish or Portuguese is a regular practice there are very few people who produce written ma- in everyday dialogue. terials in the language, and existing materials have Bribri Bribri is a Chibchan language spoken by a large degree of idiosyncratic variation. These 7000 people in Southern Costa Rica (INEC, 2011). variations are standardized in AmericasNLI, which It has three dialects, and it is still spoken by chil- is written in the Amubri variant. dren. However, it is a vulnerable language (Mose- ley, 2010;S anchez´ Avendano˜ , 2013), which means Guarani Guarani is spoken by between 6 to 10 that there are few settings where the language is million people in . Roughly 3 mil- written or used in official functions. The language lion people use it as their main language, including does not have official status and it is not the main more than 10 native nations in Paraguay, Brazil, medium of instruction of Bribri children, but it is , and Bolivia, along with Paraguayan, Ar- offered as a class in primary and secondary schools. gentinian, and Brazilian peoples. According to Bribri is a tonal language, with fusional morphol- the Paraguayan Census, in 2002 there were around ogy, SOV syntax and an ergative-absolutive case 1.35 million monolingual speakers, which has since system. Bribri grammar also includes phenomena increased to around 1.5 million people (Dos Santos, 2017; Melia`, 1992).3 ish (Cajero, 1998, 2009). When speaking nuhmu˜ , Although the use of Guarani as spoken language pronunciation is elongated, especially on the last is much older, the first written record dates to 1591 syllable. In this variant there are 13 pronunciations, (Catechism) followed by the first dictionary in 1639 and each is clearly marked in writing, where the and linguistic descriptions in 1640. Guarani usage alphabet is composed of 19 consonants, 12 vowel in text continued until the Paraguay-Triple Alliance phonemes, and characters formed from the combi- War (1864-1870) and declined thereafter. However, nation of consonants, cedillas, and circumflex ac- from the 1920s on, Guarani has slowly re-emerged cents (Cajero, 1998). Words follow an SVO order, and received renewed focus. In 1992, Guarani was with affirmative, negative, interrogative, exclama- the first American language declared an official tory, and imperative sentences, which are simple, language of a country, followed by a surge of lo- compound, and complex (Cajero, 1998; Lastra and cal, national, and international recognition in the de Suarez´ , 1997). early 21st century.4 The official grammar of the Guarani language was approved in 2018. Guarani Quechua Quechua, or Runasimi, is an indige- is an , with ample use of pre- nous spoken by the Quechua peo- fixes and suffixes. Code-switching with Spanish or ples that primarily live in the Peruvian Andes. De- Portuguese is common among speakers. rived from an ancestral language, it is the most widely spoken pre-Columbian language family of Nahuatl´ Nahuatl´ belongs to the Nahuan subdivi- the Americas. It has around 8-10 million speakers, sion of the Uto-Aztecan langauge family. There are and approximately 25% (7.7 million) of Peruvians 30 recognized variants of Nahuatl´ spoken by over speak a Quechuan language. Historically, Quechua 1.5 millions speakers across 17 different states of was the main language family during the Incan Mexico, where Nahuatl´ is recognized as an official Empire, and it was spoken until the Peruvian strug- language (SEGOB, 2021). Nahuatl´ is polysynthetic gle for independence from Spain during the 1780s. and agglutinative; different roots with or without af- Currently, many variants of Quechua are widely fixes are combined to form new words. The suffixes spoken and it is the co-official language of many that are added to a word modify the meaning of the regions in Peru. original word (Sullivan and Leon-Portilla´ , 1976), There are multiple subdivisions of Quechua, in- and 18 prepositions stand out based on postposi- cluding Southern, Northern, and Central Quechua. tions of names and (Simeon´ , 1977). In AmericasNLI examples are translated into the Nahuatl´ many sentences have an SVO structure or, standard version of , Quechua for emphasis, an OVS-structure (MacSwan, 1998). Chanka, also known as Quechua , which The translations in AmericasNLI belong to the Cen- is spoken in different regions of Peru and can be tral Nahuatl´ (Nahuatl´ de la Huasteca) . As understood in different areas of other countries, there is a lack of consensus regarding the ortho- such as Bolivia or Argentina. In the translations of graphic standard, the orthography is normalized to AmericasNLI, the apostrophe and the pentavocal- a version similar to .´ ism from other regions are not used Otom´ı Otom´ı belongs to the Oto-Pamean lan- guage family and has nine linguistic variants Raramuri´ The Raramuri´ language, also known with different regional self-denominations, such as as Tarahumara, which means light foot (INALI, n˜ah¨ nu˜ or n˜ah¨ no˜ , hn˜ah¨ nu,˜ nuju,˜ noju,˜ yuhu,¨ hnah¨ no,˜ 2017), belongs to the Taracahitan subgroup of n˜uh¨ u,´ nanh˜ u,´ n˜oth¨ o,´ nhat˜ o´ and hnoth˜ o´ (INALI, the Uto-Aztecan language family (Goddard, 1996). 2014). There are around 307,928 speakers spread Raramuri´ is an official language of Mexico, spoken across 7 Mexican states. In the state of Tlaxcala, mainly in the Sierra Madre Occidental region in the yuhmu or nuhmu˜ variant is spoken by less than the state of Chihuahua by a total of 89,503 speak- 100 speakers, and we use this variant for the Otom´ı ers (SEGOB, 2021). Raramuri´ is a polysynthetic examples in AmericasNLI. Otom´ı is a tonal lan- language, characterized by a head-marking struc- guage and many words are homophonous to Span- ture (Nichols, 1986). There are five variants of Raramuri,´ and AmericasNLI examples are trans- 3 https://www.ine.gov.py/news/ lated into the Highlands variant (INEGI, 2008). 25-de-agosto-dia-del-Idioma-Guarani.php 4https://es.wikipedia.org/wiki/Idioma_ Translation orthography and word boundaries are guarani similar to Caballero(2008). aym bzd cni gn hch nah oto quy shp tar es→XX 0.19 0.08 0.10 0.22 0.13 0.18 0.06 0.33 0.14 0.05 ChrF XX→es 0.09 0.06 0.09 0.14 0.07 0.10 0.06 0.14 0.09 0.08 es→XX 0.30 0.54 0.03 3.26 3.18 0.33 0.01 1.58 0.34 0.01 BLEU XX→es 0.04 0.01 0.01 0.18 0.01 0.02 0.02 0.05 0.01 0.01

Table 4: Translation performance for all target languages. es→XX represents translating into the target language, which is used for translate-train, and XX→es represents translating into Spanish, used for translate-test.

Shipibo-Konibo Shipibo-Konibo is a Panoan is based on RoBERTa (Liu et al., 2019), and it is language spoken by around 35,000 native speakers trained using MLM on web-crawled data in 100 in the Amazon region of Peru. It is a language with languages. It uses a shared vocabulary consisting agglutinative processes, a majority of which are of 250k subwords, created using SentencePiece suffixes. However, clitics are also used, and are (Kudo and Richardson, 2018) tokenization. We use a widespread element in Panoan literature (Valen- the Base version of XLM-R for our experiments. zuela, 2003). Shipibo-Konibo uses an SOV word order (Faust, 1973) and postpositions (Vasquez et al., 2018). The translations in AmericasNLI Adaptation Methods To adapt XLM-R to the make use of the official alphabet and standard writ- various target languages, we continue training with ing supported by the Ministry of Education in Peru. the MLM objective on monolingual text in the tar- get language before finetuning. To keep a fair com- Wixarika The Wixarika, or , language, parison with other approaches, we only use target the language of the doctors and heal- meaning data which was also used to train the translation ers (Lumholtz, 2011), is a language in the Cora- models, which we describe in Section 4.2. How- chol subgroup of the Uto-Aztecan language family ever, we note that one benefit of continued pretrain- (Campbell, 2000). Wixarika is a national language ing for adaptation is that it does not require parallel of Mexico with four variants: Northern, Southern, text, and could therefore benefit from text which Eastern, and Western (INEGI, 2008). It is spoken could not be used for a translation-based approach. mainly in the three Mexican states of Jalisc , Na- For continued pretraining, we use a batch size of yari , and , with a total of around 47,625 32 and a learning rate of 2e-5. We train for a to- speakers (INEGI, 2001). tal of 40 epochs. Each adapted model starts from Wixarika is a with head- the same version of XLM-R, and is adapted indi- marking (Nichols, 1986), a head-final structure vidually to each target language, which leads to a (Greenberg, 1963), nominal incorporation, argu- different model for each language. We denote mod- mentative marks, inflected adpositions, possession els adapted with continued pretraining as +MLM. marks, as well as instrumental and directional af- fixes (Iturrioz and Gomez-L´ opez´ , 2008). Wixarika follows an SOV word order, and lexical borrowing Finetuning To finetune XLM-R, we follow the from and code-switching with Spanish are com- approach of Devlin et al.(2019) and use an addi- monly used Translations in AmericasNLI are in tional linear layer. We train on either the English Northern Wixarika and use an orthography com- MNLI data or the machine-translated Spanish data, mon among native speakers (Mager-Hois, 2017). and we call the final models XLM-R (en) and XLM- 4 Experiments R (es), respectively. Following Hu et al.(2020), we use a batch size of 32 and a learning rate of 2e-5. In this section, we detail the experimental setup We train for a maximum of 5 epochs, and evaluate we use to evaluate the performance of various ap- performance every 625 steps on the XNLI develop- proaches on AmericasNLI. ment set corresponding to the finetuning language. We employ early stopping with a patience of 15 4.1 Zero-Shot Learning evaluation steps and use the best performing check- Pretrained Model We use XLM-R (Conneau point for the final evaluation. All finetuning is done et al., 2020) as the pretrained multilingual model using the Huggingface Transformers library (Wolf in our experiments. The architecture of XLM-R et al., 2020) with two Nvidia V100 GPUs. FT aym bzd cni gn hch nah oto quy shp tar Avg. Majority baseline - 33.33 33.33 33.33 33.33 33.33 33.47 33.42 33.33 33.33 33.33 - Zero-shot XLM-R (en) 85.15 36.00 39.20 37.20 40.67 36.80 42.28 36.90 35.73 40.67 36.27 38.17 XLM-R (es) 81.32 37.87 41.60 37.87 39.47 36.27 39.57 39.04 40.93 38.27 35.33 38.62 Zero-shot w/ adaptation XLM-R (en) +MLM - 41.60 36.53 40.80 51.47 39.87 46.48 37.83 64.53 40.67 40.67 44.05 XLM-R (es) +MLM - 43.87 37.60 38.80 52.27 36.00 45.12 41.58 60.80 41.20 38.80 43.60 Translate-train XLM-R - 49.33 52.00 42.80 55.87 41.07 54.07 36.50 59.87 52.00 43.73 48.72 Translate-test XLM-R - 39.47 40.40 35.07 49.20 38.40 41.19 33.96 52.40 39.07 36.27 40.54

Table 5: Results for zero-shot, translate-train, and translate-test. FT represents the XNLI test set performance for the finetuning language and is not included in the average. The majority baseline represents expected performance when predicting only the majority class of the test set. Random guessing would result in an accuracy of 33.33%.

4.2 Translation-based Approaches which work well across all languages, we evalu- We also experiment with two translation-based ap- ate each run using the average performance on the proaches, translate-train and translate-test, detailed machine-translated Aymara and Guarani develop- below along with the used translation model. ment sets, as these languages have moderate and high translation quality, respectively. We find that Translation Models For our translation-based decreasing the learning rate to 5e-6 and keeping the approaches, we train two sets of translation mod- batch size at 32 yields the best performance. Other els: one to translate from Spanish into the tar- than the learning rate, we use the same approach as get language, and one in the opposite direction. for zero-shot finetuning. We use transformer sequence-to-sequence models Translate-test For the translate-test approach, (Vaswani et al., 2017) with the hyperparameters we translate the test sets of each target language proposed by Guzman´ et al.(2019). We employ into Spanish. This allows us to apply the model the same model architecture for both translation finetuned on Spanish, XLM-R (es), to each test directions, and we measure translation quality in set. Additionally, a benefit of translate-test over terms of BLEU (Papineni et al., 2002) and ChrF translate-train and the adapted XLM-R models is (Popovic´, 2015), cf. Table4. We use fairseq (Ott that we only need to finetune once overall, as op- et al., 2019) to implement all translation models.5 posed to once per language. For evaluation, we use Translate-train For the translate-train approach, the checkpoint with the highest performance on the the Spanish training data provided by XNLI is Spanish XNLI development set. translated into each target language. It is then used to finetune XLM-R for each language individually. 5 Results and Discussion Along with the training data, we also translate the Zero-shot Models We present our results in Ta- Spanish development data, which is used for val- ble5. Zero-shot performance is low for all 10 idation and early stopping. Notably, we find that languages, with an average accuracy of 38.17% the finetuning hyperparameters defined above do and 38.62% for the English and Spanish model, not reliably allow the model to converge for many respectively. However, in all cases the performance of the target languages. To find suitable hyper- is higher than the majority baseline. As shown parameters, we tune the batch size and learning in Table A.3 in the appendix, the same models rate by conducting a grid search over {5e-6, 2e-5, achieve an average of 74.65% and, respectively, 1e-4 } for the learning rate and {32, 64, 128} for 75.58% accuracy when evaluated on the 15 XNLI the batch size. In order to select hyperparameters languages. Thus, answering RQ 1, we conclude 5The code for translation models can be found at https: that zero-shot tasks are much harder for our mod- //github.com/AmericasNLP/americasnlp2021 els if the target language is unseen. Interestingly, even though code-switching with Spanish is en- test. When averaged across all languages, translate- countered in many target languages, finetuning on train outperforms the Spanish zero-shot model by Spanish labeled data only slightly outperforms the 10.10 points, and translate-test by 8.18 points. It is model trained on English and does not improve con- important to note that the translation performance sistently across languages – performance is better from Spanish to each target language is not par- for only five of the languages. The English model ticularly high: when considering ChrF scores, the achieves a highest accuracy of 42.28%, when evalu- highest is 0.33, and the highest BLEU score is 3.26. ated on Nahuatl,´ while the Spanish model achieves Both translation-based models are correlated with a highest accuracy of 41.60%, when evaluated on ChrF scores, with a Pearson correlation coefficient Bribri. The lowest performance is achieved when of 0.79 and 0.90 for translate-train and translate- evaluating on Quechua and Raramuri,´ for the En- test, respectively. Correlations are not as strong for glish and Spanish model, respectively. BLEU, with coefficients of 0.25 and 0.58. Turning to RQ 2, we find that model adapta- The sizable difference between translate-train tion via continued pretraining improves both mod- and the other methods suggests that translation- els, with an average gain of 5.88 percentage points based approaches may be a valuable asset for cross- for English and 4.98 percentage points for Span- lingual transfer, especially for low-resource lan- ish. Notably, continued pretraining increases per- guages. While the largest downsides to this ap- formance for Quechua by 28.8 percentage points proach are the requirement for parallel data and when finetuning on English, and 19.87 points when the need for multiple models, the potential per- finetuning on Spanish. Performance only decreases formance gain over other approaches may prove for Bribri (in both cases) and for Wixarika when worthwhile. Additionally, we believe that the using Spanish data. performance of both translation-based approaches would improve given a stronger translation system, Translate-test Performance of the translate-test and future work detailing the necessary level of model improves over both zero-shot baselines. We translation quality for the best performance would see the largest increase in performance for Guarani offer great practical usefulness for NLP applica- and Quechua, with gains of 8.53 and, respectively, tions for low-resource languages. 11.47 points over the best performing zero-shot model without adaptation. Considering the trans- 6 Analysis lation metrics in Table4, models for Guarani and 6.1 Hypothesis-only Models Quechua achieve the two highest scores for both metrics. Interestingly, translate-test performance is As shown by Gururangan et al.(2018), SNLI and lower than zero-shot performance for Ashaninka´ MNLI – the datasets AmericasNLI is based on and Otom´ı. While these two languages do not have – contain artifacts created during the annotation the lowest translation performance, other languages process which models exploit to artificially inflate with similar translation quality either achieve sim- performance. To analyze whether similar artifacts ilar, or higher scores than their zero-shot counter- exist in AmericasNLI and if they can also be ex- parts. On average, translate-test does worse when ploited, we train and evaluate models using only the compared to the adapted zero-shot models, and hypothesis, for the 5 languages which performed in all but two cases both adapted models perform highest in the standard setting, when averaged better than translate-test. across all approaches. Most importantly, as displayed in Table6, the av- Translate-train The most surprising result is erage performance across languages is better than that of translate-train, which considerably outper- chance for all models except for XLM-R without forms the performance of translate-test models for adaptation. Translate-train obtains the highest re- all languages, and outperforms the zero-shot mod- sult with 47.35% accuracy. Thus, similar as for els for all but two languages. Compared to the SNLI and MNLI, artifacts in the hypotheses can be best non-adapted zero-shot model, the largest per- used to predict, to some extent, the correct labels. formance gain is 18.94 points for Quechua. For However, as shown in Table A.1 in the appendix, the language with the lowest performance, Otom´ı, in all but 2 cases the performance of hypothesis- translate-train performs 2.54 points worse than only models is lower than that of the standard zero-shot; however, it still outperforms translate- ones. This indicates that the models are learning FT aym gn nah quy shp Avg. Avg.+P Zero-shot XLM-R (en) 62.34 33.60 33.47 33.06 33.33 33.60 33.41 39.07 XLM-R (es) 62.26 34.13 35.33 33.60 33.07 36.80 34.59 39.22 Zero-shot w/ adaptation XLM-R (en) +MLM - 37.07 42.40 34.55 44.40 35.33 38.75 48.95 XLM-R (es) +MLM - 36.27 41.73 35.37 47.87 35.60 39.37 48.65 Translate-train XLM-R - 44.93 47.60 45.80 52.13 46.27 47.35 54.23 Translate-test XLM-R - 36.53 43.60 43.22 48.13 42.67 42.83 44.27

Table 6: Hypothesis-only results. The Avg. column represents the average of the hypothesis-only results, while the Avg.+P column, computed from Table5, represents the average of the 5 languages when using both the premise and hypothesis. something beyond just exploiting artifacts in the 7 Conclusion hypotheses, even though AmericasNLI is a zero- shot task with the additional challenge that the lan- To better understand the zero-shot abilities of pre- guages are unseen during pretraining. trained multilingual models for semantic tasks in unseen languages, we present AmericasNLI, a par- 6.2 Early Stopping allel NLI dataset covering 10 low-resource lan- Early stopping is vital to prevent overfitting in deep guages indigenous to the Americas. learning models. However, in the case of zero-shot We conduct experiments with XLM-R, and find learning, to truly mimic a realistic scenario, hand- that the model’s zero-shot performance, while bet- labeled development sets in the target language ter than a majority baseline, is poor. However, it cannot be used (Kann et al., 2019). Therefore, in can be improved by model adaptation via continued our main experiments, when finetuning on a high- pretraining. Additionally, we find that translation- resource language, we use a development set in based approaches outperform a zero-shot approach, that language for early stopping. For translate-train, which is surprising given the low quality of the em- we translate the source language’s development set ployed translation systems. We hope that this work into the target language. In both cases, performance will not only spur further research into improving on the development set is an imperfect signal for model adaptation to unseen languages, but also mo- how the model will ultimately perform. tivate the creation of more resources for languages To explore how this affects final model perfor- not frequently studied by the NLP community. mance, we present the difference in results for 8 Acknowledgments translate-train models when an oracle translation is used for early stopping in Table7. We find that per- We thank the following people for their work on formance is 2.74 points higher on average, with a the translations: Francisco Morales for Bribri, Fe- maximum difference of 6.93 points for Ashaninka.´ liciano Torres R´ıos for Ashaninka,´ Perla Alvarez Thus, creating ways to better approximate a devel- Britez for Guarani, Silvino Gonzalez´ de la Cruz´ opment set in the target language might be useful for Wixarika, Giovany Mart´ınez Sebastian,´ Pedro for achieving higher performance. Kapoltitan, and Jose´ Antonio for Nahuatl,´ Jose´ Ma- teo Lino Cajero Velazquez´ for Otom´ı, Liz Chavez´ aym bzd cni gn hch nah oto quy shp tar Avg. for Shipibo-Konibo, and Mar´ıa del Carmen´ Sotelo 2.80 0.40 6.93 3.60 2.66 3.38 2.54 1.46 0.93 2.67 2.74 Holgu´ın for Raramuri.´ This work would not have been possible without the financial support of Face- Table 7: Difference between translate-train results ob- book AI Research, Microsoft Research, Google Re- tained using the oracle development set and the trans- search, the Institute of Computational Linguistics lated development set for early stopping. at the University of Zurich, the NAACL Emerging Regions Fund, Comunidad Elotl, and Snorkel AI. References Luis Chiruzzo, Pedro Amarilla, Adolfo R´ıos, and Gus- ˇ tavo Gimenez´ Lugo. 2020. Development of a Zeljko Agic´ and Ivan Vulic.´ 2019. JW300: A wide- Guarani - Spanish parallel corpus. In Proceedings of coverage parallel corpus for low-resource languages. the 12th Language Resources and Evaluation Con- In Proceedings of the 57th Annual Meeting of the ference, pages 2629–2633, Marseille, France. Euro- Association for Computational Linguistics, pages pean Language Resources Association. 3204–3210, Florence, Italy. Association for Compu- tational Linguistics. Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Maytham Alabbas. 2013. A dataset for textual Guzman,´ E. Grave, Myle Ott, Luke Zettlemoyer, and entailment. In Proceedings of the Student Research Veselin Stoyanov. 2020. Unsupervised cross-lingual Workshop associated with RANLP 2013, pages 7– representation learning at scale. In ACL. 13, Hissar, Bulgaria. INCOMA Ltd. Shoumen, BUL- GARIA. Alexis Conneau, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer, and Herve´ Jegou.´ 2017. Hossein Amirkhani, Mohammad AzariJafari, Azadeh Word translation without parallel data. arXiv Amirak, Zohreh Pourjafari, Soroush Faridan preprint arXiv:1710.04087. Jahromi, and Zeinab Kouhkan. 2020. Farstail: A persian natural language inference dataset. ArXiv, Alexis Conneau, Ruty Rinott, Guillaume Lample, Ad- abs/2009.08820. ina Williams, Samuel R. Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. Xnli: Evaluating cross- Waleed Ammar, George Mulcaire, Yulia Tsvetkov, lingual sentence representations. In Proceedings of Guillaume Lample, Chris Dyer, and Noah A. Smith. the 2018 Conference on Empirical Methods in Natu- 2016. Massively multilingual word embeddings. ral Language Processing. Association for Computa- ArXiv, abs/1602.01925. tional Linguistics. M. Artetxe and Holger Schwenk. 2019. Massively mul- Adolfo Constenla, Feliciano Elizondo, and Francisco tilingual sentence embeddings for zero-shot cross- Pereira. 2004. Curso Basico´ de Bribri. Editorial de lingual transfer and beyond. Transactions of the As- la Universidad de Costa Rica. sociation for Computational Linguistics, 7:597–610. Ruben´ Cushimariano Romano and Richer C. Se- Piotr Bojanowski, Edouard Grave, Armand Joulin, and bastian´ Q. 2008. Naantsipeta˜ ashaninkaki´ bi- Tomas Mikolov. 2017. Enriching word vectors with rakochaki. diccionario ashaninka-castellano.´ version´ subword information. Transactions of the Associa- preliminar. http://www.lengamer.org/ tion for Computational Linguistics, 5:135–146. publicaciones/diccionarios/. Visitado: 01/03/2013. Johan Bos, Fabio Massimo Zanzotto, and M. Pennac- chiotti. 2009. Textual entailment at evalita 2009. J. Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirec- Samuel R. Bowman, Gabor Angeli, Christopher Potts, tional transformers for language understanding. In and Christopher D. Manning. 2015. A large anno- NAACL-HLT. tated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Rafaela Alves Dos Santos. 2017. DIGLOSSIA Methods in Natural Language Processing (EMNLP). NO PARAGUAI: A restric¸ao˜ dos monol´ıngues em Association for Computational Linguistics. guarani no acesso a` informac¸ao˜ . Trabalho de Conclusao˜ de Curso, Bacharelado em L´ınguas Es- David Brambila. 1976. Diccionario raramuri - castel- trangeiras. Universidade de Bras´ılia, Brasilia. lano: Tarahumar. Kathrin Eichler, Aleksandra Gabryszak, and Gunter¨ Gabriela Caballero. 2008. Choguita raramuri´ (tarahu- Neumann. 2014. An analysis of textual inference mara) phonology and morphology. in German customer emails. In Proceedings of the Third Joint Conference on Lexical and Com- M. Cajero. 1998. Ra´ıces del Otom´ı: diccionario. Gob- putational Semantics (*SEM 2014), pages 69–74, ierno del Estado de Tlaxcala. Dublin, Ireland. Association for Computational Lin- guistics and Dublin City University. Mateo Cajero. 2009. Historia de los Otom´ıes en Ix- tenco, volume 1. Instituto Tlaxcalteca de la Cultura, Norma Faust. 1973. Lecciones para el aprendizaje del Tlaxcala, Mexico.´ idioma shipibo-conibo, volume 1 of Documento de Trabajo. Instituto Lingu¨´ıstico de Verano, Yarina- Lyle Campbell. 2000. American Indian languages: the cocha. historical linguistics of Native America, volume 4. Oxford University Press on Demand. Isaac Feldman and Rolando Coto-Solano. 2020. Neu- ral machine translation models with back-translation Ethan C. Chau, Lucy H. Lin, and Noah A. Smith. 2020. for the extremely low-resource indigenous language Parsing with multilingual bert, a small corpus, and a Bribri. In Proceedings of the 28th International small treebank. ArXiv, abs/2009.14124. Conference on Computational Linguistics, pages 3965–3976, Barcelona, Spain (Online). Interna- Junjie Hu, Sebastian Ruder, Aditya Siddhant, Gra- tional Committee on Computational Linguistics. ham Neubig, Orhan Firat, and Melvin Johnson. 2020. Xtreme: A massively multilingual multi-task Sof´ıa Flores Solorzano.´ 2017. Corpus oral pandialectal benchmark for evaluating cross-lingual generaliza- de la lengua bribri. http://bribri.net. tion. CoRR, abs/2003.11080.

Ana-Paula Galarreta, Andres´ Melgar, and Arturo On- INALI. 2014. Norma de escritura de la Lengua cevay. 2017. Corpus creation and initial SMT ex- Hn˜ah¨ nu˜ (Otom´ı), 1st edition. Secretaria de culatura. periments between Spanish and Shipibo-konibo. In Proceedings of the International Conference Recent INALI. 2017. Etnograf´ıa del pueblo tarahumara Advances in Natural Language Processing, RANLP (raramuri).´ 2017, pages 238–244, Varna, Bulgaria. INCOMA INEC. 2011. Poblacion´ total en territorios ind´ıgenas Ltd. por autoidentificacion´ a la etnia ind´ıgena y habla ´ ´ Ives Goddard. 1996. Introduccion. In William C. de alguna lengua indıgena, segun pueblo y territo- ´ ´ Sturtevant, editor, Handbook of North American In- rio indıgena. In Instituto Nacional de Estadıstica y Censo 2011 dians (vol. 17), chapter 1, pages 1–6. University of Censos, editor, . Texas. INEGI. 2001. Censo general de poblacion´ y vivienda, 2000. Aguascalientes, Mexico,´ Instituto Nacional Hector´ Erasmo Gomez´ Montoya, Kervy Dante Rivas de Estad´ıstica, Geograf´ıa e Informatica´ . Rojas, and Arturo Oncevay. 2019. A continuous improvement framework of machine translation for INEGI. 2008. Catalogo´ de las lenguas ind´ıgenas na- Shipibo-konibo. In Proceedings of the 2nd Work- cionales: Variantes lingu¨´ısticas de mexico´ con sus shop on Technologies for MT of Low Resource Lan- autodenominaciones y referencias geoestad´ısticas. guages, pages 17–23, Dublin, Ireland. European As- Diario Oficial, pages 31–108. sociation for Machine Translation. Jose´ L. Iturrioz and Paula Gomez-L´ opez.´ 2008. Gra- Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Ar- matica wixarika i. mand Joulin, and Tomas Mikolov. 2018. Learning word vectors for 157 languages. In Proceedings Carla Victoria Jara Murillo. 2018a. Gramatica´ de la of the International Conference on Language Re- Lengua Bribri. EDigital. sources and Evaluation (LREC 2018). Carla Victoria Jara Murillo. 2018b. I Tte` Historias Joseph Harold Greenberg. 1963. Universals of lan- Bribris, second edition. Editorial de la Universidad guage. de Costa Rica.

Suchin Gururangan, Swabha Swayamdipta, Omer Carla Victoria Jara Murillo and Al´ı Garc´ıa Segura. Levy, Roy Schwartz, Samuel R. Bowman, and 2013. Se’ tto’¨ bribri ie Hablemos en bribri. EDigi- Noah A. Smith. 2018. Annotation artifacts in nat- tal. NAACL-HLT ural language inference data. In . Katharina Kann, Kyunghyun Cho, and Samuel R. Bow- man. 2019. Towards realistic practices in low- Ximena Gutierrez-Vasques, Gerardo Sierra, and resource natural language processing: The develop- Isaac Hernandez Pompa. 2016. Axolotl: a web ment set. In Proceedings of the 2019 Conference on accessible parallel corpus for Spanish-Nahuatl. In Empirical Methods in Natural Language Processing Proceedings of the Tenth International Conference and the 9th International Joint Conference on Natu- on Language Resources and Evaluation (LREC’16), ral Language Processing (EMNLP-IJCNLP), pages pages 4210–4214, Portoroz,ˇ Slovenia. European 3342–3349, Hong Kong, China. Association for Language Resources Association (ELRA). Computational Linguistics.

Francisco Guzman,´ Peng-Jen Chen, Myle Ott, Juan Taku Kudo and John Richardson. 2018. SentencePiece: Pino, Guillaume Lample, Philipp Koehn, Vishrav A simple and language independent subword tok- Chaudhary, and Marc’Aurelio Ranzato. 2019. The enizer and detokenizer for neural text processing. In flores evaluation datasets for low-resource machine Proceedings of the 2018 Conference on Empirical translation: Nepali–english and sinhala–english. In Methods in Natural Language Processing: System Proceedings of the 2019 Conference on Empirical Demonstrations, pages 66–71, Brussels, Belgium. Methods in Natural Language Processing and the Association for Computational Linguistics. 9th International Joint Conference on Natural Lan- guage Processing (EMNLP-IJCNLP), pages 6100– Guillaume Lample and Alexis Conneau. 2019. Cross- 6113. lingual language model pretraining. In NeurIPS.

Petr Homola. 2012. Building a formal grammar for a Guillaume Lample, Alexis Conneau, Ludovic Denoyer, polysynthetic language. In Formal Grammar, pages and Marc’Aurelio Ranzato. 2017. Unsupervised ma- 228–242, Berlin, Heidelberg. Springer Berlin Hei- chine translation using monolingual corpora only. delberg. arXiv preprint arXiv:1711.00043. Y. Lastra and Y.L. de Suarez.´ 1997. El otom´ı de Ix- Elena Mihas. 2017. The kampa subgroup of the arawak tenco. Universidad Nacional Autonoma´ de Mexico,´ language family. The Cambridge Handbook of Lin- Instituto de Investigaciones Antropologicas.´ guistic Typology, page 782–814.

Anne Lauscher, Vinit Ravishankar, Ivan Vulic,´ and Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Cor- Goran Glavas.ˇ 2020. From zero to hero: On the lim- rado, and Jeff Dean. 2013. Distributed representa- itations of zero-shot language transfer with multilin- tions of words and phrases and their compositional- gual Transformers. pages 4483–4499. ity. In Advances in Neural Information Processing Systems, volume 26, pages 3111–3119. Curran As- Yaobo Liang, Nan Duan, Yeyun Gong, Ning Wu, Fen- sociates, Inc. fei Guo, Weizhen Qi, Ming Gong, Linjun Shou, Daxin Jiang, Guihong Cao, Xiaodong Fan, Ruofei Christopher Moseley. 2010. Atlas of the World’s Lan- Zhang, Rahul Agrawal, Edward Cui, Sining Wei, guages in Danger. Unesco. Taroon Bharti, Ying Qiao, Jiun-Hung Chen, Winnie Wu, Shuguang Liu, Fan Yang, Daniel Campos, Ran- B. Muller, Antonis Anastasopoulos, Benoˆıt Sagot, and gan Majumder, and Ming Zhou. 2020. XGLUE: A Djame´ Seddah. 2020. When being unseen from new benchmark datasetfor cross-lingual pre-training, mbert is just the beginning: Handling new lan- understanding and generation. In Proceedings of the guages with multilingual language models. ArXiv, 2020 Conference on Empirical Methods in Natural abs/2010.12858. Language Processing (EMNLP), pages 6008–6018, Online. Association for Computational Linguistics. Johanna Nichols. 1986. Head-marking and dependent- marking grammar. Language, 62(1):56–119. Y. Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, M. Lewis, Luke Myle Ott, Sergey Edunov, Alexei Baevski, Angela Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: Fan, Sam Gross, Nathan Ng, David Grangier, and A robustly optimized bert pretraining approach. Michael Auli. 2019. fairseq: A fast, extensible ArXiv, abs/1907.11692. toolkit for sequence modeling. In Proceedings of James Loriot, Erwin Lauriout, and Dwight Day. NAACL-HLT 2019: Demonstrations. 1993. Diccionario Shipibo-Castellano. Instituto Lingu¨´ıstico de Verano. Kishore Papineni, Salim Roukos, Todd Ward, and Wei- Jing Zhu. 2002. Bleu: a method for automatic eval- Carl Lumholtz. 2011. Unknown Mexico: A Record uation of machine translation. In Proceedings of of Five Years’ Exploration Among the Tribes of the the 40th Annual Meeting of the Association for Com- Western Sierra Madre, volume 2. Cambridge Uni- putational Linguistics, pages 311–318, Philadelphia, versity Press. Pennsylvania, USA. Association for Computational Linguistics. Jeff MacSwan. 1998. The argument status of nps in southeast puebla nahuatl: Comments on the polysyn- Jeffrey Pennington, Richard Socher, and Christopher thesis parameter. Southwest Journal of Linguistics, Manning. 2014. GloVe: Global vectors for word 17(2):101–114. representation. In Proceedings of the 2014 Confer- ence on Empirical Methods in Natural Language Manuel Mager, Dionico Gonzalez, and Ivan Meza. Processing (EMNLP), pages 1532–1543, Doha, 2017. Probabilistic finite-state morphological seg- Qatar. Association for Computational Linguistics. menter for wixarika (huichol). Jonas Pfeiffer, Ivan Vulic,´ Iryna Gurevych, and Se- Manuel Mager, Ximena Gutierrez-Vasques, Gerardo bastian Ruder. 2020a. Mad-x: An adapter-based Sierra, and Ivan Meza-Ruiz. 2018. Challenges of framework for multi-task cross-lingual transfer. In language technologies for the indigenous languages EMNLP. of the Americas. In Proceedings of the 27th Inter- national Conference on Computational Linguistics , Jonas Pfeiffer, Ivan Vulic,´ Iryna Gurevych, and Sebas- pages 55–69, Santa Fe, New Mexico, USA. Associ- tian Ruder. 2020b. Unks everywhere: Adapting mul- ation for Computational Linguistics. tilingual language models to new scripts. ArXiv, Jesus Manuel Mager-Hois. 2017. Traductor abs/2012.15562. hıbrido wixarika-espanol´ con escasos recursos bilingues¨ . Ph.D. thesis, Master’s thesis, Universidad Jonas Pfeiffer, Ivan Vulic,´ Iryna Gurevych, and Sebas- Autonoma´ Metropolitana. tian Ruder. 2020c. Unks everywhere: Adapting mul- tilingual language models to new scripts. Enrique Margery. 2005. Diccionario Fraseologico´ Bribri-Espanol˜ Espanol-Bribri˜ , second edition. Edi- Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. torial de la Universidad de Costa Rica. How multilingual is multilingual BERT? In Pro- ceedings of the 57th Annual Meeting of the Asso- Bartomeu Melia.` 1992. La lengua Guaran´ı del ciation for Computational Linguistics, pages 4996– Paraguay: Historia, sociedad y literatura. Editorial 5001, Florence, Italy. Association for Computa- MAPFRE, Madrid. tional Linguistics. Edoardo Maria Ponti, Goran Glavas,ˇ Olga Majewska, Alex Wang, Amanpreet Singh, Julian Michael, Fe- Qianchu Liu, Ivan Vulic,´ and Anna Korhonen. 2020. lix Hill, Omer Levy, and Samuel Bowman. 2018. XCOPA: A multilingual dataset for causal common- GLUE: A multi-task benchmark and analysis plat- sense reasoning. In Proceedings of the 2020 Con- form for natural language understanding. In Pro- ference on Empirical Methods in Natural Language ceedings of the 2018 EMNLP Workshop Black- Processing (EMNLP), pages 2362–2376, Online. As- boxNLP: Analyzing and Interpreting Neural Net- sociation for Computational Linguistics. works for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics. Maja Popovic.´ 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Zihan Wang, Karthikeyan K, Stephen Mayhew, and Tenth Workshop on Statistical Machine Translation, Dan Roth. 2020. Extending multilingual BERT to pages 392–395, Lisbon, Portugal. Association for low-resource languages. In Findings of the Associ- Computational Linguistics. ation for Computational Linguistics: EMNLP 2020, pages 2649–2656, Online. Association for Computa- Carlos Sanchez´ Avendano.˜ 2013. Lenguas en peli- tional Linguistics. gro en Costa Rica: vitalidad, documentacion´ y de- scripcion.´ Revista Ka´nina˜ , 37(1):219–250. Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. A broad-coverage challenge corpus for sen- SEGOB. 2021. Sistema de Informacion´ Cultural. tence understanding through inference. In Proceed- https://sic.cultura.gob.mx/. ings of the 2018 Conference of the North American Rico Sennrich, Barry Haddow, and Alexandra Birch. Chapter of the Association for Computational Lin- 2016. Neural machine translation of rare words guistics: Human Language Technologies, Volume with subword units. In Proceedings of the 54th An- 1 (Long Papers), pages 1112–1122. Association for nual Meeting of the Association for Computational Computational Linguistics. Linguistics (Volume 1: Long Papers), pages 1715– 1725, Berlin, Germany. Association for Computa- Thomas Wolf, Lysandre Debut, Victor Sanh, Julien tional Linguistics. Chaumond, Clement Delangue, Anthony Moi, Pier- ric Cistac, Tim Rault, Remi´ Louf, Morgan Funtow- Remi´ Simeon.´ 1977. Diccionario de la lengua nahuatl´ icz, Joe Davison, Sam Shleifer, Patrick von Platen, o mexicana, volume 1. Siglo XXI. Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Thelma D Sullivan and Miguel Leon-Portilla.´ 1976. Quentin Lhoest, and Alexander M. Rush. 2020. Compendio de la gramatica´ nahuatl´ , volume 18. Transformers: State-of-the-art natural language pro- Universidad nacional autonoma´ de Mexico,´ Instituto cessing. In Proceedings of the 2020 Conference on de investigaciones historicas.´ Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Asso- Jorg¨ Tiedemann. 2012. Parallel data, tools and inter- ciation for Computational Linguistics. faces in OPUS. In Proceedings of the Eighth In- ternational Conference on Language Resources and Shijie Wu and Mark Dredze. 2019. Beto, bentz, be- Evaluation (LREC’12), pages 2214–2218, Istanbul, cas: The surprising cross-lingual effectiveness of Turkey. European Language Resources Association BERT. In Proceedings of the 2019 Conference on (ELRA). Empirical Methods in Natural Language Processing Pilar Valenzuela. 2003. Transitivity in Shipibo-Konibo and the 9th International Joint Conference on Natu- grammar. Ph.D. thesis, University of Oregon. ral Language Processing (EMNLP-IJCNLP), pages 833–844, Hong Kong, China. Association for Com- Alonso Vasquez, Renzo Ego Aguirre, Candy An- putational Linguistics. gulo, John Miller, Claudia Villanueva, Zeljkoˇ Agic,´ Roberto Zariquiey, and Arturo Oncevay. 2018. To- Shijie Wu and Mark Dredze. 2020. Are all lan- ward Universal Dependencies for Shipibo-konibo. guages created equal in multilingual bert? In In Proceedings of the Second Workshop on Uni- RepL4NLP@ACL. versal Dependencies (UDW 2018), pages 151–161, Brussels, Belgium. Association for Computational Linguistics. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Pro- cessing Systems, volume 30. Curran Associates, Inc. Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. arXiv preprint 1905.00537. A Detailed Results

FT aym gn nah quy shp Avg. Zero-shot XLM-R (en) -22.81 -2.40 -7.20 -9.22 -2.40 -7.07 -5.66 XLM-R (es) -19.06 -3.74 -4.14 -5.97 -7.86 -1.47 -4.64 Zero-shot w/ adaptation XLM-R (en) +MLM - -4.53 -9.07 -11.93 -20.13 -5.34 -10.20 XLM-R (es) +MLM - -7.60 -10.54 -9.75 -12.93 -5.60 -9.28 Translate-train XLM-R - -4.40 -8.27 -8.27 -7.74 -5.73 -6.88 Translate-test XLM-R - -2.94 -5.60 2.03 -4.27 3.60 -1.44 Table A.1: Differences between hypothesis-only and standard results.

Language Split Entailment Contradiction Neutral Majority Baseline Test 250 250 250 0.333 aym Dev 248 248 247 0.334 Test 250 250 250 0.333 bzd Dev 248 248 247 0.334 Test 250 250 250 0.333 cni Dev 220 220 218 0.334 Test 250 250 250 0.333 gn Dev 248 248 247 0.334 Test 250 250 250 0.333 hch Dev 248 248 247 0.334 Test 246 245 247 0.335 nah Dev 124 124 128 0.340 Test 249 249 250 0.334 oto Dev 78 75 69 0.351 Test 250 250 250 0.333 quy Dev 248 248 247 0.334 Test 250 250 250 0.333 shp Dev 248 248 247 0.334 Test 250 250 250 0.333 tar Dev 248 248 247 0.334 Table A.2: Distribution of labels in the test and development sets, per language.

Source ar bg de el en es fr hi ru sw th tr ur vi zh Avg. en 72.40 78.38 77.33 76.35 85.15 79.14 78.26 70.74 76.05 63.97 72.99 72.48 66.91 75.33 74.27 74.65 es 73.53 79.32 77.96 76.85 83.63 81.32 79.04 72.79 77.05 64.63 73.33 73.63 68.78 76.41 75.41 75.58 Table A.3: XNLI results for zero-shot models. Scores are underlined when the same language used for training is used for evaluation as well.