Arxiv:2104.08726V1 [Cs.CL] 18 Apr 2021
Total Page:16
File Type:pdf, Size:1020Kb
AmericasNLI: Evaluating Zero-shot Natural Language Understanding of Pretrained Multilingual Models in Truly Low-resource Languages Abteen Ebrahimi} Manuel Mager♠ Arturo Oncevay~ Vishrav Chaudharyr Luis Chiruzzo4 Angela Fanr John OrtegaΩ Ricardo Ramosη Annette Rios Ivan Vladimir] Gustavo A. Gimenez-Lugo´ | Elisabeth Mager] Graham Neubig./ Alexis Palmer} Rolando A. Coto Solanof Ngoc Thang Vu♠ Katharina Kann} ./Carnegie Mellon University fDartmouth College rFacebook AI Research ΩNew York University 4Universidad de la Republica,´ Uruguay ηUniversidad Tecnologica´ de Tlaxcala ]Universidad Nacional Autonoma´ de Mexico´ |Universidade Tecnologica´ Federal do Parana´ }University of Colorado Boulder ~University of Edinburgh ♠University of Stuttgart University of Zurich Abstract Language ISO Family Dev Test Aymara aym Aymaran 743 750 Pretrained multilingual models are able to per- Ashaninka´ cni Arawak 658 750 form cross-lingual transfer in a zero-shot set- Bribri bzd Chibchan 743 750 ting, even for languages unseen during pre- Guarani gn Tupi-Guarani 743 750 training. However, prior work evaluating per- Nahuatl´ nah Uto-Aztecan 376 738 formance on unseen languages has largely Otom´ı oto Oto-Manguean 222 748 Quechua quy Quechuan 743 750 been limited to low-level, syntactic tasks, and Raramuri´ tar Uto-Aztecan 743 750 it remains unclear if zero-shot learning of Shipibo-Konibo shp Panoan 743 750 high-level, semantic tasks is possible for un- Wixarika hch Uto-Aztecan 743 750 seen languages. To explore this question, we present AmericasNLI, an extension of XNLI Table 1: The languages in AmericasNLI, along with (Conneau et al., 2018) to 10 indigenous lan- their ISO codes, language families, and dataset sizes. guages of the Americas. We conduct experi- ments with XLM-R, testing multiple zero-shot and translation-based approaches. Addition- further improvements (Muller et al., 2020; Pfeiffer ally, we explore model adaptation via contin- et al., 2020a,b; Wang et al., 2020). ued pretraining and provide an analysis of the Practically all work evaluating the zero-shot per- dataset by considering hypothesis-only mod- formance of models on languages not seen dur- els. We find that XLM-R’s zero-shot perfor- ing pretraining is currently limited to low-level, mance is poor for all 10 languages, with an av- erage performance of 38.62%. Continued pre- syntactic tasks such as part-of-speech tagging, de- training offers improvements, with an average pendency parsing, and named-entity recognition accuracy of 44.05%. Surprisingly, training on (Muller et al., 2020; Wang et al., 2020). This is due poorly translated data by far outperforms all to the fact that most multilingual datasets for high- other methods with an accuracy of 48.72%. level, semantic tasks only cover languages which are well-resourced enough to already be contained 1 Introduction in the pretraining data.1 This limits our ability to Pretrained multilingual models such as XLM draw more general conclusions with regards to the arXiv:2104.08726v1 [cs.CL] 18 Apr 2021 (Lample and Conneau, 2019), multilingual BERT zero-shot learning abilities of pretrained multilin- (mBERT; Devlin et al., 2019), and XLM-R (Con- gual models for unseen languages. neau et al., 2020) achieve strong cross-lingual trans- In order to make such an evaluation possi- fer results for many languages and natural language ble, we introduce AmericasNLI, an extension of processing (NLP) tasks. However, there exists a XNLI (Conneau et al., 2018) – a natural language discrepancy in terms of zero-shot performance be- inference (NLI; cf. §2.3) dataset covering 15 tween languages present in the pretraining data and high-resource languages – to 10 indigenous lan- those that are not: performance is generally highest guages spoken in the Americas: Ashaninka,´ Ay- for well-represented languages, and decreases with mara, Bribri, Guarani, Nahuatl,´ Otom´ı, Quechua, less representation. Yet, even for unseen languages, 1One notable exception is XCoPA (Ponti et al., 2020), performance is generally above chance, and model which covers two languages not present in mBERT’s pretrain- adaptation approaches have been shown to yield ing data: Haitian Creole and Quechua. Raramuri,´ Shipibo-Konibo, and Wixarika. All These models follow the standard pretraining– of them are truly low-resource languages: they finetuning paradigm: they are first trained on unla- have small or no Wikipedia corpora, and are not beled monolingual corpora from various languages present in the training data of current state-of-the- (the pretraining languages) and later finetuned art pretrained multilingual models. This dataset en- on target-task data in a – usually high-resource ables us to address the following research questions – source language. Having been exposed to a vari- (RQs): (1) How do existing multilingual models ety of languages through this training setup, cross- perform on unseen languages as compared to their lingual transfer results for these models are com- performance on XNLI? (2) Do methods aimed at petitive with the state of the art for many languages adapting models to unseen languages – previously and tasks. Commonly used models are mBERT exclusively evaluated on low-level, syntactic tasks (Devlin et al., 2019), which is pretrained on the – also increase performance on NLI? Wikipedias of 104 languages with masked language We experiment with XLM-R, both with and with- modeling (MLM) and next sentence prediction out model adaptation via continued pretraining on (NSP), and XLM, which is trained on 15 languages monolingual corpora in the target language. Our and introduces the translation language modeling results show that the performance of XLM-R out- objective, which is based on MLM, but uses pairs of-the-box is moderately above chance, and model of parallel sentences. XLM-R has improved per- adaptation leads to improvements of up to 5.88 formance over XLM, and trains on data from 100 percentage points. Training on machine-translated different languages with only the MLM objective. training data, however, results in an even larger per- Common to all models is a large shared subword formance gain of 10.1 percentage points over the vocabulary created using either WordPiece (Sen- corresponding XLM-R model without adaptation. nrich et al., 2016) or SentencePiece (Kudo and We further perform an analysis via experiments Richardson, 2018) tokenization. with hypothesis-only models, to examine poten- tial artifacts which may have been inherited from 2.2 Evaluating Pretrained Multilingual XNLI and find that performance is above chance Models for most models, but still below that for using the full example. Just as in the monolingual setting, where bench- AmericasNLI is publicly available at marks such as GLUE (Wang et al., 2018) and Super- nala-cub.github.io/resources. We GLUE (Wang et al., 2019) provide a look into the hope that it will serve as a benchmark for measur- performance of models across various tasks, multi- ing the zero-shot natural language understanding lingual benchmarks (Hu et al., 2020; Liang et al., abilities of multilingual models for unseen lan- 2020) cover a wide variety of tasks involving sen- guages. Additionally, we hope that our dataset will tence structure, classification, retrieval, and ques- motivate the development of novel pretraining and tion answering. Evaluation is based on zero-shot model adaptation techniques which are suitable for transfer from English, providing a strong bench- truly low-resource languages. mark for cross-lingual transfer. Additional work has been done examining what 2 Background and Related Work mechanisms allow multilingual models to trans- fer across languages (Pires et al., 2019; Wu and 2.1 Pretrained Multilingual Models Dredze, 2019). Wu and Dredze(2020) examine Prior to the widespread use of pretrained trans- transfer performance dependent on a language’s former models, cross-lingual transfer was mainly representation in the pretraining data. For language achieved through word embeddings (Mikolov et al., with low representation, multiple methods have 2013; Pennington et al., 2014; Bojanowski et al., been proposed to improve performance, including 2017), either by aligning monolingual embeddings extending the vocabulary, transliterating the target into the same embedding space (Conneau et al., text, and continuing pretraining before finetuning 2017; Lample et al., 2017; Grave et al., 2018) or (Lauscher et al., 2020; Chau et al., 2020; Muller by training multilingual embeddings (Ammar et al., et al., 2020; Pfeiffer et al., 2020a,c; Wang et al., 2016; Artetxe and Schwenk, 2019). Pretrained mul- 2020). In this work, we focus on continued pretrain- tilingual models represent the extension of multilin- ing to analyze the performance of model adaptation gual embeddings to pretrained transformer models. for a high-level, semantic task. 2.3 Natural Language Inference Lang. Source(s) Sent. aym Tiedemann(2012) 6,531 Given two sentences, the premise and the hypothe- Feldman and Coto-Solano(2020); Margery(2005); sis, the task of NLI consists of determining whether Jara Murillo(2018a); Constenla et al.(2004); bzd 7,508 the hypothesis logically entails, contradicts, or is Jara Murillo and Garc´ıa Segura(2013); Jara Murillo(2018b); Flores Sol orzano´ (2017) neutral to the premise. The most widely used cni Cushimariano Romano and Sebastian´ Q.(2008) 3,883 datasets for NLI in English are SNLI (Bowman et al., 2015) and MNLI (Williams et al., 2018). gn Chiruzzo et al.(2020) 26,032 XNLI (Conneau et al., 2018) is the multilingual hch Mager et al.(2017) 8,966 expansion of MNLI to 15 languages, providing nah Gutierrez-Vasques et al.(2016) 16,145 manually translated evaluation sets and machine- oto https://tsunkua.elotl.mx 4,889 translated training sets. While datasets for NLI or quy Agic´ and Vulic´(2019) 125,008 Galarreta et al.(2017); Loriot et al.(1993); the similar task of recognizing textual entailment shp 14,592 Gomez´ Montoya et al.(2019) exist for other languages (Bos et al., 2009; Alab- Brambila(1976); bas, 2013; Eichler et al., 2014; Amirkhani et al., tar github.com/pywirrarika/tar_par 14,720 2020), their lack of similarity prevents a general- ized study of cross-lingual zero-shot performance.