When Being Unseen from Mbert Is Just the Beginning: Handling New Languages with Multilingual Language Models

When Being Unseen from mBERT is just the Beginning: Handling New Languages With Multilingual Language Models Benjamin Mullery Antonis Anastasopoulosz Benoît Sagoty Djamé Seddahy yInria, Paris, France zDepartment of Computer Science, George Mason University, USA [email protected] [email protected] Abstract East) or Bambara (spoken by around 5 million people in Mali and neighboring countries) are not cov- Transfer learning based on pretraining lan- ered by any available language models. guage models on a large amount of raw data Even if training multilingual models that cover has become a new norm to reach state-of-the- art performance in NLP. Still, it remains un- more languages and language varieties is tempt- clear how this approach should be applied for ing, the curse of multilinguality described by Con- unseen languages that are not covered by any neau et al.(2020) makes it an impractical solu- available large-scale multilingual language tion, as it would require to train ever larger mod- model and for which only a small amount els. Furthermore, as shown by Wu and Dredze of raw data is generally available. In this (2020), large-scale multilingual language models work, by comparing multilingual and mono- reach sub-optimal performance for languages that lingual models, we show that such models only account for a small portion of the pretraining behave in multiple ways on unseen languages. Some languages greatly benefit from trans- data. fer learning and behave similarly to closely In this paper, we describe and analyze task related high resource languages whereas oth- and language adaptation experiments to get usable ers apparently do not. Focusing on the lat- language model-based representations for under- ter, we show that this failure to transfer is studied low resource languages. We run exper- largely related to the impact of the script used iments on 16 typologically diverse unseen lan- to write such languages. Transliterating those languages improves very significantly the abil- guages on three NLP tasks with different char- ity of large-scale multilingual language mod- acteristics: part-of-speech (POS) tagging, depen- els on downstream tasks. dency parsing (DEP) and named entity recognition (NER). Our results bring forth a great diversity of be- 1 Introduction haviors that we classify in three categories re- Language models are now a new standard to flecting the abilities of pretrained multilingual lan- build state-of-the-art Natural Language Processing guage models to be used for low-resource lan- (NLP) systems. In the past year, monolingual languages. Some languages, the “Easy” ones, largely guage models have been released for more than 20 behave like high resource languages. Fine-tuning languages including Arabic, French, German, Ital- large-scale multilingual language models in a task- ian, Polish, Russian, Spanish, Swedish, and Viet- specific way leads to state-of-the-art performance. namese (Antoun et al., 2020; Martin et al., 2020b; The “Intermediate” languages are harder to pro- de Vries et al., 2019; Cañete et al., 2020; Kura- cess as large-scale multilingual language models tov and Arkhipov, 2019; Schweter, 2020, et alia). lead to sub-optimal performance as such. How- Additionally, large-scale multilingual models cov- ever, adapting them using unsupervised fine-tuning ering more than 100 languages are now available on available raw data in the target language leads (XLM-R by Conneau et al.(2020) and mBERT to a significant boost in performance, reaching or by Devlin et al.(2019)). Still, most of the 7000+ extending the state of the art. Finally, the “Hard” spoken languages in the world are not covered— languages are those for which large-scale multi- remaining unseen—by those models. Even lan- lingual models fail to provide decent downstream guages with millions of native speakers like Sorani performance even after unsupervised adaptation. Kurdish (about 7 million speakers in the Middle “Hard" languages include both stable and endangered languages, but they predominantly are 2020) enabled transfer from high to low-resource languages of communities that are majorly under- languages, leading to significant improvements in served by modern NLP. Hence, we direct our at- downstream task performance (Rahimi et al., 2019; tention to these “Hard" languages. For those lan- Kondratyuk and Straka, 2019). Furthermore, in guages, we show that the script they are written their most recent forms, multilingual models, such in is a critical element in the transfer abilities as mBERT, process tokens at the sub-word level us- of pretrained multilingual language models. We ing SentencePiece tokenization (Kudo and Richard- show that transliterating them into the script of a son, 2018). This means that they work in an open possibly-related high resource language leads to vocabulary setting.1 This flexibility enables such large gains in performance leading to outperform- models to process any language, even those that ing non-contextual strong baselines. are not part of their pretraining data. To sum up, our main contributions are the fol- When it comes to low-resource languages, one lowing: direction is simply to train such contextualized • Based on our empirical results, we propose a embedding models on whatever data is available. new categorization of low-resource languages Another option is to adapt/finetune a multilingual that are currently not covered by any available pretrained model to the language of interest. We language models: the Hard, the Intermediate briefly discuss these two options. and the Easy languages. • We show that Hard languages can be better Pretraining language models on a small addressed by transliterating them into a better- amount of raw data Even though the amount handled script (typically Latin), providing a of pre-training data seems to correlate with down- promising direction for rendering multilingual stream task performance (e.g. compare BERT and language models useful for a new set of unseen RoBERTa), several attempts have shown that train- languages. ing a new model from scratch can be efficient even if the amount of data in that language is limited. 2 Background and Motivations Indeed, Suárez et al.(2020) showed that pretraining ELMo models (Peters et al., 2018) on less than As Joshi et al.(2020) vividly illustrate, there is 1GB of Wikipedia text leads to state-of-the-art per- a great divergence in the coverage of languages formance while Martin et al.(2020a) showed for by NLP technologies. The majority of the 7000+ French that pretraining a BERT model on as few of the world’s languages are not studied by the as 4GB of diverse enough data results in state-of- NLP community. Some languages have very few the-art performance. This was further confirmed by or no annotated datasets, making the development Micheli et al.(2020) who demonstrated that decent of systems challenging. performance was achievable with as low as 100BM The development of such models is a matter of of raw text data. first importance for the inclusion of communities, the preservation of endangered languages and more Adapting large-scale models for low-resource generally to support the rise of tailored NLP ecosys- languages Multilingual language models can be tems for such languages (Schmidt and Wiegand, used directly on unseen languages, or they can 2017; Stecklow, 2018; Seddah et al., 2020) In that also be adapted using unsupervised methods. For regard, the advent of the Universal Dependency example, Han and Eisenstein(2019) successfully project (Nivre et al., 2016) and the WikiAnn dataset used unsupervised model adaptation of the English (Pan et al., 2017) have greatly opened the num- BERT model to Early Modern English for sequence ber of covered languages by providing annotated labeling. Instead of finetuning the whole model, datasets for respectively 90 languages for depen- Pfeiffer et al.(2020) recently showed that adapter dency parsing and 282 languages for Named Entity layers (Houlsby et al., 2019) can be injected into Recognition. multilingual language models to provide parameter efficient task and language transfer. Regarding modeling approaches, the emergence of multilingual representation models, first with Still, as of today, the availability of monolin- static word embeddings (Ammar et al., 2016) and gual or multilingual language models is limited to then with language model-based contextual rep- 1As long as the input text is written in a script that is used resentations (Devlin et al., 2019; Conneau et al., in the training languages. approximately 120 languages, leaving many lan- Language (iso) Script Family #sents source guages without access to valuable NLP technology, Bambara (bm) Latin Niger-Congo 1k OSCAR Wolof (wo) Latin Niger-Congo 10k OSCAR although some are spoken by millions of people, Swiss* (gsw) Latin West Germanic 250k OSCAR including Bambara, Maltese and Sorani Kurdish. Naija (pcm) Latin Pidgin (En) 237k Other Faroese (fao) Latin North Germanic 297k Leipzig What can be done for unseen languages? Un- Maltese (mlt) Latin Semitic 50k OSCAR Narabizi (nrz) Latin Semitic** 87k Other seen languages vary greatly in the amount of avail- Sorani (ckb) Arabic Indo-Iranian 380k OSCAR able data, in their script (many languages use non- Uyghur (ug) Arabic Turkic 105k OSCAR Sindhi (sd) Arabic Indo-Aryan 375k OSCAR Latin scripts such as Sorani Kurdish and Mingre- Mingrelian (xmf) Georg. Kartvelian 29k Wiki lian), and in their morphological or syntactical Buryat (bxu) Cyrillic Mongolic 7k Wiki Mari (mhr) Cyrillic

When Being Unseen from Mbert Is Just the Beginning: Handling New Languages with Multilingual Language Models

Pulley / Descender / Belay Device

Da Heng Building

Reply to Heng: Inborn Aneuploidy and Chromosomal Instability

Unicode Alphabets for L ATEX

Heng Xian and the Problem of Studying Looted Artifacts

Highway Dharma Letters: Two Buddhist Pilgrims Write to Their Teacher

Quantum Integrable Systems from Conformal Blocks Quantum

1 Symbols (2286)

The Brill Typeface User Guide & Complete List of Characters

Yu Tse Heng [email protected] • Yutseheng.Com • 4I2–5I7–294I

Rubber Bands Thailand AD Prelim

Heng Zhang EDUCATION PUBLICATION WORKING PAPER