When Being Unseen from mBERT is just the Beginning: Handling New Languages With Multilingual Language Models

Benjamin Muller†* Antonios Anastasopoulos‡ Benoît Sagot† Djamé Seddah† †Inria, Paris, France *Sorbonne Université, Paris, France ‡Department of Computer Science, George Mason University, USA [email protected] [email protected]

Abstract

Transfer learning based on pretraining language models on a large amount of raw data has become a new norm to reach state-of-the-art performance in NLP. Still, it remains unclear how this approach should be applied for unseen languages that are not covered by any available large-scale multilingual language model and for which only a small amount of raw data is generally available. In this work, by comparing multilingual and monolingual models, we show that such models behave in multiple ways on unseen languages. Some languages greatly benefit from transfer learning and behave similarly to closely related high-resource languages, whereas others apparently do not. Focusing on the latter, we show that this failure to transfer is largely related to the impact of the script used to write such languages. We show that transliterating those languages significantly improves the potential of large-scale multilingual language models on downstream tasks. This result provides a promising direction towards making these massively multilingual models useful for a new set of unseen languages.¹

¹Code available at https://github.com/benjamin-mlr/mbert-unseen-languages.git

1 Introduction

Language models are now a new standard to build state-of-the-art Natural Language Processing (NLP) systems. In the past year, monolingual language models have been released for more than 20 languages, including Arabic, French, German, and Italian (Antoun et al., 2020; Martin et al., 2020; de Vries et al., 2019; Cañete et al., 2020; Kuratov and Arkhipov, 2019; Schweter, 2020, inter alia). Additionally, large-scale multilingual models covering more than 100 languages are now available (XLM-R by Conneau et al. (2020) and mBERT by Devlin et al. (2019)).
Still, most of the 6,500+ spoken languages in the world (Hammarström, 2016) are not covered—remaining unseen—by those models. Even languages with millions of native speakers, like Sorani Kurdish (about 7 million speakers in the Middle East) or Bambara (spoken by around 5 million people in Mali and neighboring countries), are not covered by any available language model at the time of writing.

Even if training multilingual models that cover more languages and language varieties is tempting, the curse of multilinguality (Conneau et al., 2020) makes it an impractical solution, as it would require training ever larger models. Furthermore, as shown by Wu and Dredze (2020), large-scale multilingual language models are sub-optimal for languages that are under-sampled during pretraining.

In this paper, we analyze task and language adaptation experiments to get usable language-model-based representations for under-studied low-resource languages. We run experiments on 15 typologically diverse languages on three NLP tasks: part-of-speech (POS) tagging, dependency parsing (DEP) and named-entity recognition (NER). Our results bring forth a diverse set of behaviors that we classify in three categories reflecting the abilities of pretrained multilingual language models to be used for low-resource languages. We dub those categories Easy, Intermediate and Hard.

Hard languages include both stable and endangered languages, but they predominantly are languages of communities that are majorly under-served by modern NLP. Hence, we direct our attention to these Hard languages. For those languages, we show that the script they are written in can be a critical element in the transfer abilities of pretrained multilingual language models. Transliterating them leads to large gains in performance, outperforming strong non-contextual baselines. To sum up, our contributions are the following:

• We propose a new categorization of the low-resource languages that are unseen by available language models: the Hard, the Intermediate and the Easy languages.

• We show that Hard languages can be better addressed by transliterating them into a better-handled script (typically Latin), providing a promising direction towards making multilingual language models useful for a new set of unseen languages.

2 Background and Motivation

As Joshi et al. (2020) vividly illustrate, there is a large divergence in the coverage of languages by NLP technologies. The majority of the world's 6,500+ languages are not studied by the NLP community, since most have few or no annotated datasets, making systems' development challenging.

The development of such models is a matter of high importance for the inclusion of communities, the preservation of endangered languages and, more generally, to support the rise of tailored NLP ecosystems for such languages (Schmidt and Wiegand, 2017; Stecklow, 2018; Seddah et al., 2020). In that regard, the advent of the Universal Dependencies project (Nivre et al., 2016) and the WikiAnn dataset (Pan et al., 2017) has greatly increased the number of covered languages by providing annotated datasets for more than 90 languages for dependency parsing and 282 languages for NER.

Regarding modeling approaches, the emergence of multilingual representation models, first with static word embeddings (Ammar et al., 2016) and then with language-model-based contextual representations (Devlin et al., 2019; Conneau et al., 2020), enabled transfer from high- to low-resource languages, leading to significant improvements in downstream task performance (Rahimi et al., 2019; Kondratyuk and Straka, 2019). Furthermore, in their most recent forms, these multilingual models process tokens at the sub-word level (Kudo and Richardson, 2018). As such, they work in an open-vocabulary setting, only constrained by the pretraining character set. This flexibility enables such models to process any language, even those that are not part of their pretraining data.

Still, as of today, the availability of monolingual or multilingual language models is limited to approximately 120 languages, leaving many languages without access to valuable NLP technology, although some are spoken by millions of people, including Bambara and Sorani Kurdish, or are an official language of the European Union, like Maltese.

What can be done for unseen languages? Unseen languages strongly vary in the amount of available data, in their script (many languages use non-Latin scripts, such as Sorani Kurdish and Mingrelian), and in their morphological or syntactic properties (most largely differ from high-resource Indo-European languages). This makes the design of a methodology to build contextualized models for such languages challenging at best. In this work, by experimenting with 15 typologically diverse unseen languages, (i) we show that there is a diversity of behaviors depending on the script, the amount of available data, and the relation to the pretraining languages; (ii) focusing on the unseen languages that lag in performance compared to their easier-to-handle counterparts, we show that the script plays a critical role in the transfer abilities of multilingual language models. Transliterating such languages to a script used by a related language seen during pretraining can lead to significant improvements in downstream performance.

When it comes to low-resource languages, one direction is to simply train contextualized embedding models on whatever data is available. Another option is to adapt/fine-tune a multilingual pretrained model to the language of interest. We briefly discuss these two options.

Pretraining language models on a small amount of raw data. Even though the amount of pretraining data seems to correlate with downstream task performance (e.g. compare BERT and RoBERTa (Liu et al., 2020)), several attempts have shown that training a model from scratch can be efficient even if the amount of data in that language is limited. Indeed, Ortiz Suárez et al. (2020) showed that pretraining ELMo models (Peters et al., 2018) on less than 1GB of text leads to state-of-the-art performance, while Martin et al. (2020) showed that pretraining a BERT model on as few as 4GB of diverse enough data results in state-of-the-art performance. Micheli et al. (2020) further demonstrated that decent performance was achievable with only 100MB of raw text data.

Adapting large-scale models for low-resource languages. Multilingual language models can be used directly on unseen languages, or they can also be adapted using unsupervised methods. For example, Han and Eisenstein (2019) successfully used unsupervised model adaptation of the English BERT model to Early Modern English for sequence labeling. Instead of fine-tuning the whole model, Pfeiffer et al. (2020) recently showed that adapter layers (Houlsby et al., 2019) can be injected into multilingual language models to provide parameter-efficient task and language transfer.
3 Experimental Setting

We will refer to any languages that are not covered by pretrained language models as "unseen." We select a small portion of those languages within a large scope of language families and scripts. Our selection is constrained to 15 typologically diverse languages for which we have evaluation data for at least one of our three downstream tasks. Our selection includes low-resource Indo-European and Uralic languages, as well as members of the Niger-Congo, Semitic, and Turkic families. None of these 15 languages are included in the pretraining corpora of mBERT. Information about their scripts, language families, and amount of available raw data can be found in Table 12 in the Appendix.

3.1 Raw Data

To perform pretraining and fine-tuning on monolingual data, we use the deduplicated datasets from the OSCAR project (Ortiz Suárez et al., 2019). OSCAR is a corpus extracted from a Common Crawl Web snapshot.² It provides a significant amount of data for all the unseen languages we work with, except for Buryat, Meadow Mari, Erzya and Livvi, for which we use Wikipedia dumps, and for Narabizi, Naija and Faroese, for which we use data collected by Seddah et al. (2020), Caron et al. (2019) and Biemann et al. (2007), respectively.

²http://commoncrawl.org/

3.2 Non-contextual Baselines

For parsing and POS tagging, we use the UDPipe-future system (Straka, 2018) as our baseline. This model is an LSTM-based (Hochreiter and Schmidhuber, 1997) recurrent architecture trained with pretrained static word embeddings (Mikolov et al., 2013) (hence our non-contextual characterization) along with character-level embeddings. This system was ranked in the very first positions for parsing and tagging in the CoNLL 2018 shared task (Zeman and Hajič, 2018). For NER, we use an LSTM-CRF model with character- and word-level embeddings, using the implementation of Qi et al. (2020).

3.3 Language Models

In all our experiments, we train our language models using the Transformers library (Wolf et al., 2020).

MLM from scratch. The first approach we evaluate is to train a dedicated language model from scratch on the available raw data. To do so, we train a language-specific SentencePiece tokenizer (Kudo and Richardson, 2018) before training a Masked Language Model (MLM) using the RoBERTa (base) architecture and objective functions (Liu et al., 2019). As we work with significantly smaller pretraining sets than in the original setting, we reduce the number of layers to 6 in place of the original 12.
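To make this recipe concrete, here is a minimal, hedged sketch of the from-scratch pretraining setup using the sentencepiece, datasets and Transformers libraries. The corpus file name, the choice of Wolof as the example language, and the training hyperparameters are illustrative assumptions, not the paper's exact configuration.

```python
# Illustrative sketch: pretrain a small RoBERTa-style MLM from scratch on raw OSCAR text.
import sentencepiece as spm
from datasets import load_dataset
from transformers import (XLMRobertaTokenizer, RobertaConfig, RobertaForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# 1) Language-specific SentencePiece tokenizer trained on the monolingual corpus.
spm.SentencePieceTrainer.train(input="oscar_wolof.txt", model_prefix="wolof_spm",
                               vocab_size=32000, model_type="unigram")
# XLMRobertaTokenizer can wrap a raw SentencePiece model file.
tokenizer = XLMRobertaTokenizer(vocab_file="wolof_spm.model")

# 2) RoBERTa (base) architecture reduced to 6 layers, as described above.
config = RobertaConfig(vocab_size=len(tokenizer), num_hidden_layers=6,
                       max_position_embeddings=514,
                       pad_token_id=tokenizer.pad_token_id,
                       bos_token_id=tokenizer.bos_token_id,
                       eos_token_id=tokenizer.eos_token_id)
model = RobertaForMaskedLM(config)

# 3) Masked-language-model training on the raw data.
raw = load_dataset("text", data_files={"train": "oscar_wolof.txt"})["train"]
enc = raw.map(lambda b: tokenizer(b["text"], truncation=True, max_length=512),
              batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
Trainer(model=model,
        args=TrainingArguments(output_dir="wolof_mlm", per_device_train_batch_size=64,
                               learning_rate=5e-5, num_train_epochs=10),
        train_dataset=enc, data_collator=collator).train()
```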
Multilingual Language Models. We want to assess how large-scale multilingual language models can be used and adapted to languages that are not in their pretraining corpora. We work with the multilingual version of BERT (mBERT), trained on the concatenation of Wikipedia corpora in 104 languages (Devlin et al., 2019). We also ran experiments with the XLM-R base version (Conneau et al., 2020), trained on 100 languages using data extracted from the Web. As the observed behaviors are very similar between both models, we only report results using mBERT. Note that mBERT is highly biased toward Indo-European languages written in the Latin script: more than 77% of the subword vocabulary is in the Latin script, while only 1% is in the Georgian script (Ács, 2019).

Adapting Multilingual Language Models to unseen languages with MLM-TUNING. Following previous work (Han and Eisenstein, 2019; Karthikeyan et al., 2019; Pfeiffer et al., 2020), we adapt large-scale multilingual models by fine-tuning them with their Masked-Language-Model objective directly on the available raw data in the unseen target language. We refer to this process as MLM-TUNING, and to an MLM-tuned mBERT model as mBERT+MLM.
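In practice, MLM-TUNING amounts to resuming masked-language-model training from the released mBERT checkpoint on raw text in the target language. The sketch below is an illustrative approximation (the Uyghur corpus file name is a placeholder); the hyperparameter values follow Table 7 in Appendix B.

```python
# Illustrative sketch of MLM-TUNING: continue masked-LM training of mBERT on raw target-language text.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

raw = load_dataset("text", data_files={"train": "oscar_uyghur.txt"})["train"]
enc = raw.map(lambda b: tokenizer(b["text"], truncation=True, max_length=512),
              batched=True, remove_columns=["text"])

# Same masked-LM objective as pretraining, applied to the unseen language's raw data.
Trainer(
    model=model,
    args=TrainingArguments(output_dir="mbert_mlm_uyghur", per_device_train_batch_size=64,
                           learning_rate=5e-5, num_train_epochs=10),
    train_dataset=enc,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15),
).train()
# The resulting checkpoint is what the paper calls mBERT+MLM.
```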

3.4 Downstream Tasks

We perform experiments on POS tagging, Dependency Parsing (DEP), and Named Entity Recognition (NER). We use annotated data from the Universal Dependencies project (Nivre et al., 2016) for POS tagging and parsing, and the WikiAnn dataset (Pan et al., 2017) for NER. For POS tagging and NER, we append a linear classifier layer on top of the language model. For parsing, following Kondratyuk and Straka (2019), we append a Bi-Affine graph prediction layer (Dozat and Manning, 2017). We refer to the process of fine-tuning a language model in a task-specific way as TASK-TUNING.³

³Details about optimization can be found in Appendix B.
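For POS tagging and NER, TASK-TUNING therefore reduces to standard token classification with sub-word label alignment; as noted in Appendix B, the loss is only back-propagated through the first sub-token of each word. The following sketch is an illustrative approximation with a toy tag set and sentence, not the paper's training code.

```python
# Illustrative TASK-TUNING step for POS tagging / NER: a linear classification head on mBERT,
# with labels assigned to the first sub-token of each word and -100 (ignored) elsewhere.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tags = ["NOUN", "VERB", "PROPN", "ADV", "PUNCT"]        # toy tag inventory
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(tags))

words = ["mBERT", "handles", "Faroese", "well", "."]    # toy word-level example
word_labels = [2, 1, 2, 3, 4]

enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
labels, previous = [], None
for word_id in enc.word_ids(batch_index=0):
    if word_id is None or word_id == previous:
        labels.append(-100)      # special tokens and non-initial sub-tokens are ignored
    else:
        labels.append(word_labels[word_id])
    previous = word_id

loss = model(**enc, labels=torch.tensor([labels])).loss  # cross-entropy over first sub-tokens
loss.backward()
```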

3.5 Dataset Splits

For each task and language, we use the provided training, validation and test split, except for the ones that have fewer than 500 training sentences. In this case, we concatenate the training and test sets, perform 8-fold cross-validation, and use the validation set for early stopping. If no validation set is available, we isolate one of the folds for validation and report the test scores as the average over the other folds. This enables us to train on at least 500 sentences in all our experiments (except for Swiss German, for which we only have 100 training examples) and reduces the impact of the annotated dataset size on our analysis. Since cross-validation results in training on a very limited number of examples, we refer to training in this cross-validation setting as few-shot learning.

Figure 1: Visualizing our Typology of Unseen Languages. (X, Y) positions are computed for each language and each task as follows: given the score of mBERT denoted s_mBERT, the score of mBERT+MLM denoted s_mBERT+MLM, and the baseline score s_0, X = (s_mBERT - s_0) / s_0 and Y = (s_mBERT+MLM - s_0) / s_0. Easy languages are the ones on which mBERT performs better than the baseline without MLM-TUNING, and Intermediate languages are the ones that require MLM-TUNING. For Hard languages, mBERT under-performs the baselines in all settings.

4 The Three Categories of Unseen Languages

For each unseen language and each task, we experiment with our three modeling approaches: (a) training a language model from scratch on the available raw data and then fine-tuning it on any available annotated data in the target language; (b) fine-tuning mBERT with TASK-TUNING directly on the target language; (c) adapting mBERT to the unseen language using MLM-TUNING before fine-tuning it in a supervised way on the target language. We then compare all these experiments to our strong non-contextual baselines. By doing so, we can assess whether language models are a practical solution to handle each of these unseen languages.

Interestingly, we find a large diversity of behaviors across languages regarding those language model training techniques. As summarized in Figure 1, we observe three clear clusters of languages. The first cluster, which we dub "Easy", corresponds to the languages that do not require extra MLM-TUNING for mBERT to achieve good performance: mBERT has the modeling abilities to process such languages without relying on raw data and can outperform strong non-contextual baselines as such. In the second cluster, the "Intermediate" languages require MLM-TUNING: mBERT is not able to beat strong non-contextual baselines using only TASK-TUNING, but MLM-TUNING enables it to do so. Finally, Hard languages are those on which mBERT fails to deliver any decent performance even after MLM- and TASK-fine-tuning. mBERT simply does not have the capacity to learn and process such languages.
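This categorization can be read directly off the relative scores defined in the Figure 1 caption. The helper below is a small illustrative sketch (not part of the paper's code base); the example values are taken from Tables 1-3.

```python
# Relative gains over the non-contextual baseline, as in Figure 1:
# X = (s_mBERT - s_0) / s_0 and Y = (s_mBERT+MLM - s_0) / s_0.
def categorize(s_mbert: float, s_mbert_mlm: float, s_baseline: float) -> str:
    x = (s_mbert - s_baseline) / s_baseline
    y = (s_mbert_mlm - s_baseline) / s_baseline
    if x > 0:
        return "Easy"            # mBERT alone beats the baseline
    if y > 0:
        return "Intermediate"    # mBERT needs MLM-TUNING to beat the baseline
    return "Hard"                # even mBERT+MLM stays below the baseline

print(categorize(84.0, 86.4, 83.1))   # Faroese LAS  -> Easy
print(categorize(74.4, 82.1, 79.7))   # Maltese LAS  -> Intermediate
print(categorize(24.3, 34.6, 53.8))   # Uyghur NER   -> Hard
```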

We emphasize that our categorization of unseen languages is only based on the relative performance of mBERT after fine-tuning compared to strong non-contextual baseline models. We leave for future work the analysis of the absolute performance of the model on such languages (e.g. analysing the impact of the fine-tuning dataset size on mBERT's downstream performance).

In this section, we present our results in detail for each of these language clusters and provide insights into their linguistic properties.

4.1 Easy

Easy languages are the ones on which mBERT delivers high performance out-of-the-box, compared to strong baselines. We classify Faroese, Swiss German, Naija and Mingrelian as Easy languages and report performance in Table 1.

Table 1: Easy languages: POS, parsing and NER scores comparing mBERT, mBERT+MLM and monolingual MLM to strong non-contextual baselines when trained and evaluated on unseen languages. Easy languages are the ones on which mBERT outperforms strong baselines out-of-the-box. Baselines are the LSTM-based models of UDPipe-future (Straka, 2018) for parsing and POS tagging and Stanza (Qi et al., 2020) for NER.

              UPOS                                   LAS                                    NER
Model         mBERT mBERT+MLM MLM  Baseline         mBERT mBERT+MLM MLM  Baseline          mBERT mBERT+MLM MLM  Baseline
Faroese       96.3  96.5      91.1 95.4             84.0  86.4      67.6 83.1              52.1  58.3      39.3 44.8
Naija         89.3  89.6      87.1 89.2             71.5  69.2      63.0 68.3              -     -         -    -
Swiss German  76.7  78.7      65.4 75.2             41.2  69.6      30.0 32.2              -     -         -    -
Mingrelian    -     -         -    -                -     -         -    -                 53.6  68.4      42.0 48.2

We find that those languages match two conditions:
• They are closely related to languages used during MLM pretraining.
• They use the same script as such closely related languages.
Such languages benefit from multilingual models, as cross-lingual transfer is easy to achieve and hence quite effective. More details about those languages can be found in Appendix C.

4.2 Intermediate

The second type of languages (which we dub "Intermediate") are generally harder to process for pretrained MLMs out-of-the-box. In particular, pretrained multilingual language models are typically outperformed by strong non-contextual baselines. Still, MLM-TUNING has an important impact and leads to usable state-of-the-art models.

Table 2: Intermediate languages: POS, parsing and NER scores comparing mBERT, mBERT+MLM and monolingual MLM to strong non-contextual baselines when trained and evaluated on unseen languages. Intermediate languages are the ones for which mBERT requires MLM-TUNING to outperform the baselines.

              UPOS                                   LAS                                    NER
Model         mBERT mBERT+MLM MLM   Baseline        mBERT mBERT+MLM MLM  Baseline          mBERT mBERT+MLM MLM  Baseline
Maltese       92.0  96.4      92.05 96.0            74.4  82.1      66.5 79.7              61.2  66.7      62.5 63.1
Narabizi      81.6  84.2      71.3  84.2            56.5  57.8      41.8 52.8              -     -         -    -
Bambara       90.2  92.6      78.1  92.3            71.8  75.4      46.4 76.2              -     -         -    -
Wolof         92.8  95.2      88.4  94.1            73.3  77.9      62.8 77.0              -     -         -    -
Erzya         89.3  91.2      84.4  91.1            61.2  66.6      47.8 65.1              -     -         -    -
Livvi         83.0  85.5      81.1  84.1            36.3  42.3      35.2 40.1              -     -         -    -
Mari          -     -         -     -               -     -         -    -                 55.2  57.6      44.0 56.1

A good example of such an Intermediate language is Maltese, a member of the Semitic language family but using the Latin script. Maltese has not been seen by mBERT during pretraining; other Semitic languages, though, namely Arabic and Hebrew, have been included in the pretraining languages. As seen in Table 2, the non-contextual baseline outperforms mBERT. Additionally, a monolingual MLM trained on only 50K sentences matches mBERT's performance for both NER and POS tagging. However, the best results are reached with MLM-TUNING: the proper use of monolingual data and the advantage of similarity to other pretraining languages render Maltese a tackle-able language, as shown by the performance gain over our strong non-contextual baselines.

Our Maltese dependency parsing results are in line with those of Chau et al. (2020), who also showed that MLM-TUNING leads to significant improvements. They additionally showed that a small vocabulary transformation allowed fine-tuning to be even more effective and gain 0.8 LAS points more. We further discuss the vocabulary adaptation technique of Chau et al. (2020) in Section 6.

We consider Narabizi (Seddah et al., 2020), an Arabic dialect spoken in North Africa, written in the Latin script and code-mixed with French, to fall in the same Intermediate category, because it follows the same pattern. For both POS tagging and parsing, the multilingual models outperform the monolingual NarabiziBERT. In addition, MLM-TUNING leads to significant improvements over the non-language-tuned mBERT baseline, also outperforming the non-contextual dependency parsing baseline.
We also categorize Bambara, a Niger-Congo language spoken in Mali and surrounding countries, as Intermediate, relying mostly on the POS tagging results, which follow similar patterns as Maltese and Narabizi. We note that the BambaraBERT we trained achieves notably poor performance compared to the non-contextual baseline, a fact we attribute to the extremely low amount of available data (1,000 sentences only). We also note that the non-contextual baseline is the best performing model for dependency parsing, which could also potentially classify Bambara as a "Hard" language instead.

Our results on Wolof follow the same pattern. The non-contextual baseline achieves 77.0 LAS, outperforming mBERT. However, MLM-TUNING achieves the highest score of 77.9.

The importance of script. We now turn our focus to Uralic languages. Finnish, Estonian, and Hungarian are high-resource representatives of this language family that are typically included in multilingual LMs, also having task-tuning data available in large quantities. However, for several smaller Uralic languages, task-tuning data are generally very scarce.

We report in Table 2 the performance for two low-resource Uralic languages, namely Livvi and Erzya, using 8-fold cross-validation, with each run only using around 700 training instances. Note the striking difference between the parsing performance (LAS) of mBERT on Livvi, written with the Latin script, and on Erzya, which uses the Cyrillic script. This suggests that the script could be playing a critical role when transferring to those languages. We explore this hypothesis in detail in Section 5.2.

4.3 Hard

The last category, the Hard unseen languages, is perhaps the most interesting one, as these languages are very hard to process: mBERT is outperformed by non-contextual baselines as well as by monolingual language models trained from scratch on the available raw data. At the same time, MLM-TUNING on the available raw data has a minimal impact on performance.

Table 3: Hard languages: POS, parsing and NER scores comparing mBERT, mBERT+MLM and monolingual MLM to strong non-contextual baselines when trained and evaluated on unseen languages. Hard languages are the ones for which mBERT fails to reach decent performance even after MLM-TUNING.

                UPOS                                 LAS                                    NER
Model           mBERT mBERT+MLM MLM  Baseline       mBERT mBERT+MLM MLM  Baseline          mBERT mBERT+MLM MLM  Baseline
Uyghur          77.0  88.4      87.4 90.0           45.5  48.9      57.3 67.9              24.3  34.6      41.4 53.8
Sindhi          -     -         -    -              -     -         -    -                 42.3  47.9      45.2 51.4
Sorani Kurdish  -     -         -    -              -     -         -    -                 70.4  75.6      80.6 80.5

Uyghur, a Turkic language with about 10-15 million speakers in Central Asia, is a prime example of a Hard language for current models. In our experiments, outlined in Table 3, the non-contextual baseline outperforms all contextual variants, both monolingual and multilingual, in all the tasks, with up to a 20-point difference compared to mBERT for parsing. Additionally, the monolingual UyghurBERT trained on only 105K sentences outperforms mBERT even after MLM-TUNING. We attribute this discrepancy to script differences: Uyghur uses the Perso-Arabic script, while the other Turkic languages that were part of mBERT pretraining use either the Latin (e.g. Turkish) or the Cyrillic script (e.g. Kazakh).

Sorani Kurdish (also known as Central Kurdish) is a similarly hard language, mainly spoken in Iraqi Kurdistan by around 8 million speakers, which uses the Sorani alphabet, a variant of the Arabic script. We can only evaluate on the NER task, where the non-contextual baseline and the monolingual SoraniBERT perform similarly, around 80.5 F1, significantly outperforming mBERT, which only reaches 70.4 F1. MLM-TUNING on 380K sentences of Sorani text improves mBERT's performance to 75.6 F1, but it still lags behind the baseline.

Our results on Sindhi follow the same pattern. The non-contextual baseline achieves a 51.4 F1-score, outperforming our language models by a large margin (a monolingual SindhiBERT achieves an F1-score of 45.2, and mBERT is worse at 42.3).

5 Tackling Hard Languages with Multilingual Language Models

Our Intermediate Uralic language results provide initial supporting evidence for our argument on the importance of having pretrained LMs on languages with similar scripts, even for generally high-resource language families. Our hypothesis is that the script is a key element for language models to correctly process unseen languages.

To test this hypothesis, we assess the ability of mBERT to process an unseen language after transliterating it to another script present in the pretraining data. We experiment on six languages belonging to four language families: Erzya, Buryat and Meadow Mari (Uralic), Sorani Kurdish (Iranian, Indo-European), Uyghur (Turkic) and Mingrelian (Kartvelian). We apply the following transliterations:
• Erzya/Buryat/Mari: Cyrillic script → Latin script
• Uyghur: Arabic script → Latin script
• Sorani: Arabic script → Latin script
• Mingrelian: Georgian script → Latin script

5.1 Linguistically-motivated transliteration

The strategy we used to transliterate the above-listed languages is specific to the purpose of our experiments. Indeed, our goal is for the model to take advantage of the information it has learned during training on a related language written in the Latin script. The goal of our transliteration is therefore to transcribe each character in the source script, which we assume corresponds to a phoneme, into the most frequent (sometimes only) way this phoneme is rendered in the closest related language written in the Latin script, hereafter the target language. This process is not a transliteration strictly speaking, and it need not be reversible. It is not a phonetization either, but rather a way to render the source language in a way that maximizes the similarity between the transliterated source language and the target language.

We manually developed transliteration scripts for Uyghur and Sorani Kurdish, using respectively Turkish and Kurmanji Kurdish as target languages, only Turkish being one of the languages used to train mBERT. Note however that Turkish and Kurmanji Kurdish share a number of conventions for rendering phonemes in the Latin script (for instance, /ʃ/, rendered in English by "sh", is rendered in both languages by "ş"; as a result, the Arabic letter "ش", used in both languages, is rendered as "ş" by both our transliteration scripts).

As for Erzya, Buryat and Mari, we used the readily available transliteration package transliterate,⁴ which performs a standard transliteration.⁵ We used the Russian transliteration module, as it covers the Cyrillic script. Finally, for our control experiments on Mingrelian, we used the Georgian transliteration module from the same package.

⁴https://pypi.org/project/transliterate/
⁵In future work, we intend to develop dedicated transliteration scripts using the strategy described above, and to compare the results obtained with them with those described here.
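As an illustration of this standard Cyrillic-to-Latin step, the following hedged sketch uses the transliterate package's Russian module; the Erzya example string and its exact Latin rendering are assumptions for illustration only.

```python
# Illustrative use of the transliterate package (https://pypi.org/project/transliterate/).
# The Russian language pack covers the Cyrillic script; reversed=True maps Cyrillic -> Latin.
from transliterate import translit

erzya_cyrillic = "Шумбрат, мастор!"   # made-up Erzya-like example sentence
erzya_latin = translit(erzya_cyrillic, "ru", reversed=True)
print(erzya_latin)  # expected to print something like "Shumbrat, mastor!"
# The transliterated text is what mBERT is then MLM-tuned and task-tuned on (Section 5.2).
```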
For one of the low-resource Uralic lan- Original Script → Latin Script guages, Meadow Mari, we observe an 8 F1-score Model POS LAS NER Arabic 96.4 → 94.9 82.9 → 78.8 87.8 → 80.9 points improvement on NER, while for other Uralic Russian 98.1 → 96.0 88.4 → 84.5 88.1 → 86.0 languages like Erzya the effect of transliteration is Japanese 97.4 → 95.7 88.5 → 86.9 61.5 → 55.6 very minor. The only case where transliteration to the Latin script leads to a drop in performance for Table 5: mBERT TASK-TUNED on high resource mBERT and mBERT+MLM is Mingrelian. languages for POS tagging, parsing and NER. We compare fine-tuning done on data written the origi- We interpret our results as follows. When run- nal language script with fine-tuning done on Latin ning MLM-TUNING and TASK-TUNING, mBERT transliteration. In all cases, transliteration degrades associates the target unseen language to a set of downstream performance. similar languages seen during pretraining based on the script. In consequence, mBERT is not able to associate a language to its related language if they forming the performance of non-contextual strong are not written in the same script. For instance, baselines and deliver usable models (e.g. +12.5 transliterating Uyghur enables mBERT to match POS accuracy in Uyghur). it to Turkish, a language which accounts for a siz- able portion of mBERT pretraining. In the case 6 Discussion and Conclusion of Mingrelian, transliteration has the opposite ef- Pretraining ever larger language models is a re- fect: transliterating Mingrelian in the Latin script search direction that is currently receiving a lot is harming the performance as mBERT is not able of attention and resources from the NLP research to associate it to Georgian which is seen during community (Raffel et al., 2019; Brown et al., 2020). pretraining and uses the Georgian script. Still, a large majority of human languages are This is further supported by our experiments under-resourced making the development of mono- on high resource languages (cf. Table5). When lingual language models very challenging in those transliterating pretrained languages such as Arabic, settings. Another path is to build large scale mul- Russian or Japanese, mBERT is not able to com- tilingual language models.7 However, such an ap- pete with the performance reached when using the proach faces the inherent zipfian structure of human script seen during pretraining. Transliterating the languages, making the training of a single model Arabic script and the Cyrillic script to Latin does to cover all languages an unfeasible solution (Con- not automatically improve mBERT performance as neau et al., 2020). Reusing large scale pretrained it does for Sorani, Uyghur and Meadow Mari. For language models for new unseen languages seems instance, transliterating Arabic to the Latin script to be a more promising and reasonable solution leads to a drop in performance of 1.5, 4.1 and 6.9 from a cost-efficiency and environmental perspec- points for POS tagging, parsing and NER respec- tive (Strubell et al., 2019). tively.6 Recently, Pfeiffer et al.(2020) proposed to use Our findings are generally in line with previous adapter layers (Houlsby et al., 2019) to build pa- work. Transliteration to English specifically (Lin rameter efficient multilingual language models for et al., 2016; Durrani et al., 2014) and named entity unseen languages. 
However, this solution brings transliteration (Kundu et al., 2018; Grundkiewicz no significant improvement in the supervised set- and Heafield, 2018) has been proven useful for ting, compared to a more simple Masked-Language cross-lingual transfer in tasks like NER, entity link- Model finetuning. Furthermore, developing a lan- ing (Rijhwani et al., 2019), morphological inflec- guage agnostic adaptation method is an unreason- tion (Murikinati et al., 2020), and Machine Trans- able wish with regard to the large typological diver- lation (Amrhein and Sennrich, 2020). sity of human languages. The transliteration approach provides a viable On the other hand, the promising vocabulary path for rendering large pretrained models like adaptation technique of Chau et al.(2020) which mBERT useful for all languages of the world. In- leads to good dependency parsing results on unseen deed, as reported in Table4, transliterating both languages when combined with task-tuning has Uyghur and Sorani leads to matching or outper- 7Even though we explore a different research direction, recent advances in small scale and domain specific language 6Details and complete results on these controlled experi- models suggest such models could also have an important ments can be found in AppendixE. impact for those languages (Micheli et al., 2020). 455 so far been tested only on Latin script languages . (Singlish and Maltese). We expect that it will be orthogonal to our transliteration approach, but we leave for future work the study of its applicability and efficacy on more languages and tasks. In this context, we bring empirical evidence to assess the efficiency of language models pretrain- ing and adaptation methods on 15 low-resource and typologically diverse unseen languages. Our results show that the “Hard" languages are currently out- of-the-scope of any currently available language models and are therefore left outside of the current NLP progress. By focusing on those, we find that this challenge is mostly due to the script. Transliter- ating them to a script that is used by a related higher resource language on which the language model has been pretrained on leads to large improvements in downstream performance. Our results shed some new light on the importance of the script in mul- tilingual pretrained models. While previous work suggests that multilingual language models could transfer efficiently across scripts in zero-shot set- tings (Pires et al., 2019; Karthikeyan et al., 2019), our results show that such cross-script transfer is possible only if the model has seen related lan- guages in the same script during pretraining. Our work paves the way for a better understand- ing of the mechanics at play in cross-language transfer learning in low-resource scenarios. We strongly believe that our method can contribute to bootstrapping NLP resources and tools for low- resource languages, thereby favoring the emer- gence of NLP ecosystems for languages currently under-served by the NLP community. Acknowledgments The Inria authors were partly funded by two French Research National agency projects, namely projects PARSITI (ANR-16-CE33-0021) and SoSweet (ANR-15-CE38-0011), as well as by Benoit Sagot’s chair in the PRAIRIE institute as part of the “Investissements ’avenir” programme under the reference ANR-19-P3IA-0001. Antonios Anastasopoulos is generously supported by NSF Award 2040926 and is also thankful to Graham Neubig for very insightful initial discussions on this research direction.

References

Judit Ács. 2019. Exploring BERT's vocabulary. http://juditacs.github.io/2019/02/19/bert-tokenization-stats.html.

Waleed Ammar, George Mulcaire, Yulia Tsvetkov, Guillaume Lample, Chris Dyer, and Noah A. Smith. 2016. Massively multilingual word embeddings. arXiv:1602.01925.

Chantal Amrhein and Rico Sennrich. 2020. On Romanization for model transfer between scripts in neural machine translation. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2461–2469.

Wissam Antoun, Fady Baly, and Hazem Hajj. 2020. AraBERT: Transformer-based model for Arabic language understanding. In Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection, pages 9–15.

Chris Biemann, Gerhard Heyer, Uwe Quasthoff, and Matthias Richter. 2007. The Leipzig Corpora Collection: monolingual corpora of standard size. In Proceedings of Corpus Linguistics 2007.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv:2005.14165.

José Cañete, Gabriel Chaperon, Rodrigo Fuentes, and Jorge Pérez. 2020. Spanish pre-trained BERT model and evaluation data. In PML4DC at ICLR 2020.

Bernard Caron, Marine Courtin, Kim Gerdes, and Sylvain Kahane. 2019. A surface-syntactic UD treebank for Naija. In Proceedings of the 18th International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2019), pages 13–24.

Ethan C. Chau, Lucy H. Lin, and Noah A. Smith. 2020. Parsing with multilingual BERT, a small corpus, and a small treebank. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1324–1334.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In 5th International Conference on Learning Representations, ICLR 2017.

Nadir Durrani, Hassan Sajjad, Hieu Hoang, and Philipp Koehn. 2014. Integrating an unsupervised transliteration model into statistical machine translation. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, Volume 2: Short Papers, pages 148–153.

Roman Grundkiewicz and Kenneth Heafield. 2018. Neural machine translation techniques for named entity transliteration. In Proceedings of the Seventh Named Entities Workshop, pages 89–94.

Harald Hammarström. 2016. Linguistic diversity and language evolution. Journal of Language Evolution, 1(1):19–29.

Xiaochuang Han and Jacob Eisenstein. 2019. Unsupervised domain adaptation of contextualized embeddings for sequence labeling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4238–4248.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. arXiv:1902.00751.

Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. 2020. The state and fate of linguistic diversity and inclusion in the NLP world. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 6282–6293.

K Karthikeyan, Zihan Wang, Stephen Mayhew, and Dan Roth. 2019. Cross-lingual ability of multilingual BERT: An empirical study. In International Conference on Learning Representations.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015.

Dan Kondratyuk and Milan Straka. 2019. 75 languages, 1 model: Parsing Universal Dependencies universally. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2779–2795.

Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71.

Soumyadeep Kundu, Sayantan Paul, and Santanu Pal. 2018. A deep learning based approach to transliteration. In Proceedings of the Seventh Named Entities Workshop, pages 79–83.

Yuri Kuratov and Mikhail Arkhipov. 2019. Adaptation of deep bidirectional multilingual transformers for Russian language. arXiv:1905.07213.

Ying Lin, Xiaoman Pan, Aliya Deri, Heng Ji, and Kevin Knight. 2016. Leveraging entity linking and related language projection to improve name transliteration. In Proceedings of the Sixth Named Entity Workshop, pages 1–10.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2020. RoBERTa: A robustly optimized BERT pretraining approach.

Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric de la Clergerie, Djamé Seddah, and Benoît Sagot. 2020. CamemBERT: a tasty French language model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7203–7219.

Vincent Micheli, Martin d'Hoffschmidt, and François Fleuret. 2020. On the importance of pre-training data volume for compact language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7853–7858.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.
Nikitha Murikinati, Antonios Anastasopoulos, and Graham Neubig. 2020. Transliteration for cross-lingual morphological inflection. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 189–197.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, et al. 2016. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 1659–1666.

Pedro Javier Ortiz Suárez, Laurent Romary, and Benoît Sagot. 2020. A monolingual approach to contextualized word embeddings for mid-resource languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1703–1714.

Pedro Javier Ortiz Suárez, Benoît Sagot, and Laurent Romary. 2019. Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures. In Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7), pages 9–16.

Xiaoman Pan, Boliang Zhang, Jonathan May, Joel Nothman, Kevin Knight, and Heng Ji. 2017. Cross-lingual name tagging and linking for 282 languages. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1946–1958.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL.

Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder. 2020. MAD-X: An adapter-based framework for multi-task cross-lingual transfer. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7654–7673.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001.

Peng Qi, Yuhao Zhang, Yuhui Zhang, Jason Bolton, and Christopher D. Manning. 2020. Stanza: A Python natural language processing toolkit for many human languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 101–108.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv:1910.10683.

Afshin Rahimi, Yuan Li, and Trevor Cohn. 2019. Massively multilingual transfer for NER. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 151–164.

Shruti Rijhwani, Jiateng Xie, Graham Neubig, and Jaime Carbonell. 2019. Zero-shot neural transfer for cross-lingual entity linking. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6924–6931.

Anna Schmidt and Michael Wiegand. 2017. A survey on hate speech detection using natural language processing. In Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, pages 1–10.

Stefan Schweter. 2020. BERTurk - BERT models for Turkish.

Djamé Seddah, Farah Essaidi, Amal Fethi, Matthieu Futeral, Benjamin Muller, Pedro Javier Ortiz Suárez, Benoît Sagot, and Abhishek Srivastava. 2020. Building a user-generated content North-African Arabizi treebank: Tackling hell. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1139–1150.

Steve Stecklow. 2018. Why Facebook is losing the war on hate speech in Myanmar. Reuters. https://www.reuters.com/investigates/special-report/myanmar-facebook-hate/.

Milan Straka. 2018. UDPipe 2.0 prototype at CoNLL 2018 UD shared task. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 197–207.

Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3645–3650.

Wietse de Vries, Andreas van Cranenburgh, Arianna Bisazza, Tommaso Caselli, Gertjan van Noord, and Malvina Nissim. 2019. BERTje: A Dutch BERT model. arXiv:1912.09582.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45.

Shijie Wu and Mark Dredze. 2020. Are all languages created equal in multilingual BERT? In Proceedings of the 5th Workshop on Representation Learning for NLP, pages 120–130.

Daniel Zeman and Jan Hajič, editors. 2018. Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. Association for Computational Linguistics, Brussels, Belgium.

A Languages

We list the 15 typologically diverse unseen languages we experiment with in Table 12, with information on their language family, script, data source and number of available sentences, along with the categories we classified them in.

Data Sources. We base our experiments on data originating from two sources: the Universal Dependencies project (Nivre et al., 2016), downloadable at https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2988, and the WikiAnn dataset (Pan et al., 2017). We also use the CoNLL-2003 shared task English NER dataset (https://www.clips.uantwerpen.be/conll2003/).
B Reproducibility

Infrastructure. Our experiments were run on a shared cluster, on the equivalent of 15 Nvidia Tesla T4 GPUs.⁸

⁸https://www.nvidia.com/en-sg/data-center/tesla-t4/

Optimization. For all pretraining and fine-tuning runs, we use the Adam optimizer (Kingma and Ba, 2015). For fine-tuning, following Devlin et al. (2019), we only back-propagate through the first token of each word. We select the hyperparameters that minimize the loss on the validation set. The reported results are the average score of 5 runs with different random seeds, computed on the test splits. We report the hyperparameters in Tables 6 and 7.

Table 6: Fine-tuning best hyperparameters for each task, as selected on the validation set, with search bounds. #grid: number of grid search trials. Run-time is the average for training and evaluation, in minutes.

Params.         Parsing  NER     POS    Bounds
batch size      32       16      16     [1, 256]
learning rate   5e-5     3.5e-5  5e-5   [1e-6, 1e-3]
epochs (best)   15       6       6      [1, +inf]
#grid           60       60      180    -
Run-time        32       24      75     -

Table 7: Unsupervised (MLM) fine-tuning hyperparameters.

Parameter         Value
batch size        64
learning rate     5e-5
optimizer         Adam
warmup            linear warmup, 10% of total steps
epochs (best of)  10
C Easy Languages

We describe here in more detail the languages that we classify as Easy in Section 4.1. In practice, one can obtain very high performance even in zero-shot settings for such languages, by performing task-tuning on related languages.

Perhaps the best example of such an "easy" setting is Faroese. mBERT has been trained on several languages of the North Germanic genus of the Indo-European language family, all of which use the Latin script. As a result, the multilingual mBERT model performs much better than the monolingual FaroeseBERT model that we trained on the available Faroese text (cf. rows 1-2 and 5-6 in Table 8). Fine-tuning mBERT on the Faroese text is even more effective (rows 3 and 7 in Table 8), leading to further improvements, reaching more than 96.5% POS-tagging accuracy, 86% LAS for dependency parsing, and 58% NER F1 in the few-shot setting, surpassing the non-contextual baseline. In fact, even in zero-shot conditions, where we task-tune only on related languages (Danish, Norwegian, and Swedish), the model achieves remarkable performance of over 83% POS-tagging accuracy and 67.8% LAS for dependency parsing.

Table 8: Faroese is an "easy" unseen language: a multilingual model (+ language-specific MLM) easily outperforms all baselines. Zero-shot performance, after task-tuning only on related languages (Danish, Norwegian, Swedish), is also high.

Model                UPOS   LAS    NER
Zero-Shot
(1) FaroeseBERT      66.4   35.8   -
(2) mBERT            79.4   67.5   -
(3) mBERT+MLM        83.4   67.8   -
Few-Shot (CV with around 500 instances)
(4) Baseline         95.36  83.02  44.8
(5) FaroeseBERT      91.12  67.66  39.3
(6) mBERT            96.31  84.02  52.1
(7) mBERT+MLM        96.52  86.41  58.3

Swiss German is another example of a language for which one can easily adapt a multilingual model and obtain good performance even in zero-shot settings. As in Faroese, simple MLM fine-tuning of the mBERT model with 200K sentences leads to an improvement of more than 25 points in both POS tagging and dependency parsing (Table 9) in zero-shot settings, with similar improvement trends in the few-shot setting.

Table 9: Swiss German is an "easy" unseen language: a multilingual model (+ language-specific MLM) outperforms all baselines in both zero-shot (task-tuning on the related High German) and few-shot settings.

Model                  UPOS   LAS
Zero-Shot
(1) SwissGermanBERT    64.7   30.0
(2) mBERT              62.7   41.2
(3) mBERT+MLM          87.9   69.6
Few-Shot (CV with around 100 instances)
(4) Baseline           75.22  32.18
(5) SwissGermanBERT    65.42  30.0
(6) mBERT              76.66  41.2
(7) mBERT+MLM          78.68  69.6

The potential of similar-language pretraining along with script similarity is also showcased in the case of Naija (also known as Nigerian English or Nigerian Pidgin), an English creole spoken by millions in Nigeria. As Table 10 shows, with results after language- and task-tuning on 6K training examples, the multilingual approach surpasses the monolingual baseline.

Table 10: Performance on Naija, an English creole, is very high, so we also classify it as an "easy" unseen language.

Model        UPOS  LAS
NaijaBERT    87.1  63.02
mBERT        89.3  71.6
mBERT+MLM    89.6  69.2

On a side note, we can rely on Han and Eisenstein (2019) to also classify Early Modern English as an easy language. Similarly, the work of Chau et al. (2020) allows us to also classify Singlish (Singaporean English) as an easy language. In both cases, these languages are technically unseen by mBERT, but the fact that they are variants of English allows them to be easily handled by mBERT.

D Additional Uralic languages

Following a similar procedure as in Appendix C, we start with mBERT, perform task-tuning on Finnish and Estonian (both of which use the Latin script), and then do zero-shot experiments on Livvi, Erzya, and Komi, all low-resource Uralic languages (results in the top part of Table 11). We also report results on the Finnish treebanks after task-tuning, for better comparison. The difference in performance between Livvi (which uses the Latin script) and the other languages, which use the Cyrillic script, is striking.

Table 11: The script matters for the efficacy of cross-lingual transfer. The zero-shot performance on Livvi, which is written in the same script as the task-tuning languages (Finnish, Estonian), is almost twice as good as the performance on the Uralic languages that use the Cyrillic script.

Language                     UPOS  LAS
Task-tuned (Latin script)
Finnish (FTB)                93.1  77.5
Finnish (TDT)                95.0  78.9
Finnish (PUD)                96.8  83.5
Zero-Shot Experiments
Livvi (Latin script)         72.3  40.3
Erzya (Cyrillic script)      51.5  18.6
Few-Shot Experiments (CV)
Livvi (Latin script)
  Baseline                   84.1  40.1
  mBERT                      83.0  36.3
  mBERT+MLM                  85.5  42.3
Erzya (Cyrillic script)
  Baseline                   91.1  65.1
  mBERT                      89.3  61.2
  mBERT+MLM                  91.2  66.6

Although they are not easy enough to be tackled in a zero-shot setting, we show that the low-resource Uralic languages fall in the "Intermediate" category, since mBERT has been trained on similar languages: a small amount of annotated data is enough to improve over mBERT using task-tuning.
For both Livvi and Erzya, the multilingual model along with MLM-TUNING achieves the best performance, outperforming the non-contextual baseline by more than 1.5 points for parsing and POS tagging.

E Controlled Experiment: Transliterating High-Resource Languages

To get a broader view of the effect of transliteration when using mBERT (Section 5.2), we study the impact of transliteration to the Latin script on high-resource languages seen during mBERT pretraining, such as Arabic, Japanese and Russian. We compare the performance of mBERT fine-tuned and evaluated on the original script with mBERT fine-tuned and evaluated on the transliterated text. As reported in Table 5, transliterating those languages to the Latin script leads to a large drop in performance for all three tasks.

Table 12: Unseen languages used in our experiments. #sents indicates the number of sentences used for training monolingual language models from scratch as well as for MLM-TUNING mBERT. *Code-mixed with French.

Language (iso)      Script    Family          #sents  Source                  Category
Faroese (fao)       Latin     North Germanic  297K    (Biemann et al., 2007)  Easy
Mingrelian (xmf)    Georgian  Kartvelian      29K     Wikipedia               Easy
Naija (pcm)         Latin     English Pidgin  237K    (Caron et al., 2019)    Easy
Swiss German (gsw)  Latin     West Germanic   250K    OSCAR                   Easy
Bambara (bm)        Latin     Niger-Congo     1K      OSCAR                   Intermediate
Wolof (wo)          Latin     Niger-Congo     10K     OSCAR                   Intermediate
Narabizi (nrz)      Latin     Semitic*        87K     (Seddah et al., 2020)   Intermediate
Maltese (mlt)       Latin     Semitic         50K     OSCAR                   Intermediate
Buryat (bxu)        Cyrillic  Mongolic        7K      Wikipedia               Intermediate
Mari (mhr)          Cyrillic  Uralic          58K     Wikipedia               Intermediate
Erzya (myv)         Cyrillic  Uralic          20K     Wikipedia               Intermediate
Livvi (olo)         Latin     Uralic          9.4K    Wikipedia               Intermediate
Uyghur (ug)         Arabic    Turkic          105K    OSCAR                   Hard
Sindhi (sd)         Arabic    Indo-Aryan      375K    OSCAR                   Hard
Sorani (ckb)        Arabic    Indo-Iranian    380K    OSCAR                   Hard
