A Primer on Pretrained Multilingual Language Models

Sumanth Doddapaneni1∗ Gowtham Ramesh1∗ Anoop Kunchukuttan3,4 Pratyush Kumar1,2,4 Mitesh M. Khapra1,2,4
1Robert Bosch Center for Data Science and Artificial Intelligence, 2Indian Institute of Technology, Madras, 3The AI4Bharat Initiative, 4Microsoft
∗The first two authors have contributed equally.
arXiv:2107.00676v1 [cs.CL] 1 Jul 2021

Abstract

Multilingual Language Models (MLLMs) such as mBERT, XLM, XLM-R, etc. have emerged as a viable option for bringing the power of pretraining to a large number of languages. Given their success in zero shot transfer learning, there has emerged a large body of work in (i) building bigger MLLMs covering a large number of languages, (ii) creating exhaustive benchmarks covering a wider variety of tasks and languages for evaluating MLLMs, (iii) analysing the performance of MLLMs on monolingual, zero shot crosslingual and bilingual tasks, (iv) understanding the universal language patterns (if any) learnt by MLLMs, and (v) augmenting the (often) limited capacity of MLLMs to improve their performance on seen or even unseen languages. In this survey, we review the existing literature covering the above broad areas of research pertaining to MLLMs. Based on our survey, we recommend some promising directions of future research.

1 Introduction

The advent of BERT (Devlin et al., 2019) has revolutionised the field of NLP and has led to state-of-the-art performance on a wide variety of tasks (Wang et al., 2018a). The recipe is to train a deep transformer-based model (Vaswani et al., 2017) on large amounts of monolingual data and then fine-tune it on small amounts of task-specific data. The pretraining happens using a masked language modeling objective and essentially results in an encoder which learns good sentence representations. These pretrained sentence representations then lead to improved performance on downstream tasks when fine-tuned on even small amounts of task-specific training data (Devlin et al., 2019). Given its success in English NLP, this recipe has been replicated across languages, leading to many language-specific BERTs such as FlauBERT (French) (Le et al., 2020), CamemBERT (French) (Martin et al., 2020), BERTje (Dutch) (Delobelle et al., 2020), FinBERT (Finnish) (Rönnqvist et al., 2019), BERTeus (Basque) (Agerri et al., 2020), AfriBERT (Afrikaans) (Ralethe, 2020), IndicBERT (Indian languages) (Kakwani et al., 2020), etc. However, training such language-specific models is only feasible for a few languages which have the necessary data and computational resources.
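To make the masked language modeling objective mentioned above concrete, the following is a minimal, self-contained sketch of BERT-style dynamic masking (the 80/10/10 corruption split follows Devlin et al., 2019). The whitespace tokenization, the toy replacement vocabulary and the default mask_prob are illustrative placeholders, not the actual mBERT configuration.

import random

def mask_for_mlm(tokens, mask_prob=0.15, vocab=None, seed=None):
    """Select ~15% of positions as prediction targets; of those, replace 80%
    with [MASK], 10% with a random token, and leave 10% unchanged."""
    rng = random.Random(seed)
    vocab = vocab or ["the", "cat", "sat", "on", "mat", "dog"]  # tiny stand-in vocabulary
    inputs, labels = list(tokens), [None] * len(tokens)         # None = not predicted
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok                    # the model must recover this token
            r = rng.random()
            if r < 0.8:
                inputs[i] = "[MASK]"
            elif r < 0.9:
                inputs[i] = rng.choice(vocab)  # random replacement
            # else: keep the original token, but it is still a prediction target
    return inputs, labels

corrupted, targets = mask_for_mlm("the cat sat on the mat".split(), seed=0)

The encoder is then trained with a cross-entropy loss to recover the original tokens at the selected positions; fine-tuning simply replaces this objective with a task-specific loss on labeled data.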
The above situation has led to the undesired effect of limiting recent advances in NLP to English and a few high resource languages (Joshi et al., 2020a). The question then is: how do we bring the benefit of such pretrained BERT-based models to a very long list of languages of interest? One alternative, which has become popular, is to train multilingual language models (MLLMs) such as mBERT (Devlin et al., 2019), XLM (Conneau and Lample, 2019), XLM-R (Conneau et al., 2020a), etc. An MLLM is pretrained using large amounts of unlabeled data from multiple languages, with the hope that low resource languages may benefit from high resource languages due to shared vocabulary, genetic relatedness (Nguyen and Chiang, 2017) or contact relatedness (Goyal et al., 2020). Several such MLLMs have been proposed in the past 3 years, and they differ in their architecture (e.g., number of layers, parameters, etc.), the objective functions used for training (e.g., monolingual masked language modeling objective, translation language modeling objective, etc.), the data used for pretraining (Wikipedia, CommonCrawl, etc.) and the number of languages involved (ranging from 12 to 100). To keep track of these rapid advances in MLLMs, as a first step, we present a survey of all existing MLLMs, clearly highlighting their similarities and differences.

While training an MLLM is more efficient and inclusive (covers more languages), is there a trade-off in performance compared to a monolingual model? More specifically, for a given language, is a language-specific BERT better than an MLLM? For example, if one is only interested in English NLP, should one use English BERT or an MLLM? The advantage of the former is that there is no capacity dilution (i.e., the entire capacity of the model is dedicated to a single language), whereas the advantage of the latter is that there is additional pretraining data from multiple (related) languages. In this work, we survey several existing studies (Conneau et al., 2020a; Wu and Dredze, 2020; Agerri et al., 2020; Virtanen et al., 2019; Rönnqvist et al., 2019; Ro et al., 2020; de Vargas Feijó and Moreira, 2020; Wang et al., 2020a) which show that the right choice depends on various factors such as model capacity, amount of pretraining data, fine-tuning mechanism and amount of task-specific training data.

One of the main motivations for training MLLMs is to enable transfer from high resource languages to low resource languages. Of particular interest is the ability of MLLMs to facilitate zero shot crosslingual transfer (K et al., 2020) from a resource rich language to a resource deprived language which does not have any task-specific training data. To evaluate such crosslingual transfer, several benchmarks, such as XGLUE (Liang et al., 2020), XTREME (Hu et al., 2020) and XTREME-R (Ruder et al., 2021), have been proposed. We review these benchmarks, which contain a wide variety of tasks such as classification, structure prediction, Question Answering and crosslingual retrieval. Using these benchmarks, several studies (Pires et al., 2019; Wu and Dredze, 2019; K et al., 2020; Artetxe et al., 2020a; Dufter and Schütze, 2020; Liu et al., 2020a; Lauscher et al., 2020; Liu et al., 2020c; Conneau and Lample, 2019; Wang et al., 2019; Liu et al., 2019a; Cao et al., 2020; Wang et al., 2020d; Zhao et al., 2020; Wang et al., 2020b; Chi et al., 2020b) have studied the crosslingual effectiveness of MLLMs and have shown that such transfer depends on various factors such as the amount of shared vocabulary, explicit alignment of representations across languages, size of pretraining corpora, etc. We collate the main findings of these studies in this survey.
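To make the zero shot crosslingual transfer setup concrete, the sketch below fine-tunes a multilingual encoder on English NLI data only and then evaluates the unchanged model on another language. It assumes the Hugging Face transformers and datasets libraries; the checkpoint, the XNLI language subsets, the small training slice and the hyperparameters are illustrative choices rather than the configurations used in the surveyed studies.

import numpy as np
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "bert-base-multilingual-cased"   # any MLLM checkpoint could be used here
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

def encode(batch):  # XNLI-style premise/hypothesis classification
    return tok(batch["premise"], batch["hypothesis"],
               truncation=True, padding="max_length", max_length=128)

train_en = load_dataset("xnli", "en", split="train[:2000]").map(encode, batched=True)
test_sw = load_dataset("xnli", "sw", split="test").map(encode, batched=True)  # no Swahili fine-tuning data

def accuracy(eval_pred):  # zero-shot accuracy on the target language
    preds = np.argmax(eval_pred.predictions, axis=-1)
    return {"accuracy": float((preds == eval_pred.label_ids).mean())}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=16,
                           num_train_epochs=1),
    train_dataset=train_en,
    compute_metrics=accuracy,
)
trainer.train()                   # fine-tune on English only
print(trainer.evaluate(test_sw))  # evaluate zero-shot on Swahili

Any gap between this zero-shot score and a fully supervised target-language model is exactly the quantity the benchmarks and studies cited above try to measure and explain.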
While the above discussion has focused on transfer learning and facilitating NLP in low resource languages, MLLMs could also be used for bilingual tasks. For example, could the shared representations learnt by MLLMs improve Machine Translation between two resource rich languages? We survey several works (Conneau and Lample, 2019; Kakwani et al., 2020; Huang et al., 2019; Conneau et al., 2020a; Eisenschlos et al., 2019; Zampieri et al., 2020; Libovický et al., 2020; Jalili Sabet et al., 2020; Chen et al., 2020; Zenkel et al., 2020; Dou and Neubig, 2021; Imamura and Sumita, 2019; Ma et al., 2020; Zhu et al., 2020; Liu et al., 2020b; Xue et al., 2021) which use MLLMs for downstream bilingual tasks such as unsupervised machine translation, crosslingual word alignment, crosslingual QA, etc. We summarise the main findings of these studies, which indicate that MLLMs are useful for bilingual tasks, particularly in low resource scenarios.

The surprisingly good performance of MLLMs in crosslingual transfer as well as bilingual tasks motivates the hypothesis that MLLMs are learning universal patterns. However, our survey of the studies in this space indicates that there is no consensus yet. While the representations learnt by MLLMs share commonalities across languages, as identified by different correlation analyses, these commonalities are dominantly within languages of the same family, and only in certain parts of the network (primarily the middle layers). Also, while probing tasks such as POS tagging are able to benefit from such commonalities, harder tasks such as evaluating MT quality remain beyond their scope as yet. Thus, though promising, MLLMs do not yet represent an interlingua.

Lastly, given the effort involved in training MLLMs, it is desirable that they be easy to extend to new languages which weren't a part of the initial pretraining. We review existing studies which propose methods for (a) extending MLLMs to unseen languages, and (b) improving the capacity (and hence performance) of MLLMs for languages already seen during pretraining. These range from simple techniques, such as fine-tuning the MLLM for a few epochs on the target language, to using language- and task-specific adapters to augment the capacity of MLLMs.

1.1 Goals of the survey

Summarising the above discussion, the main goal of this survey is to review existing work with a focus on the following questions:

• How are different MLLMs built and how do they differ from each other? (Section 2)
• What are the benchmarks used for evaluating MLLMs? (Section 3)
• For a given language, are MLLMs better than monolingual LMs? (Section 4)
• Do MLLMs facilitate zero shot crosslingual transfer? (Section 5)
• Are MLLMs useful for bilingual tasks? (Section 6)
• Do MLLMs learn universal patterns? (Section 7)
• How to extend MLLMs to new languages? (Section 8)
• What are the recommendations based on this survey?

Data can be sampled using exponentially weighted smoothing (discussed later) (Conneau et al., 2020a; Devlin et al., 2018), or separate vocabularies can be learnt for clusters of languages (Chung et al., 2020), partitioning the vocabulary size among the clusters.
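The exponentially smoothed sampling mentioned above re-weights the raw per-language data shares so that high resource languages do not dominate pretraining and vocabulary construction. A sketch of the commonly used formulation follows; the exponent is model-specific and the range quoted in the comment is indicative only.

% If language $i$ contributes $n_i$ sentences to the combined corpus, its raw
% share $p_i$ is re-weighted with an exponent $0 < \alpha < 1$, which boosts
% low resource languages relative to their raw frequency.
\[
  p_i = \frac{n_i}{\sum_{k} n_k}, \qquad
  q_i = \frac{p_i^{\alpha}}{\sum_{j} p_j^{\alpha}}, \qquad 0 < \alpha < 1 .
\]
% Sentences are then drawn according to $q_i$; reported values of $\alpha$ in
% the surveyed models typically lie between 0.3 and 0.7.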

Transformer Layers: A typical MLLM comprises the encoder of the transformer network and contains a stack of N layers, with each layer containing k attention heads followed by a feedforward neural network. For every token in the input sequence, an attention head computes an embedding using an attention-weighted linear combination of the representations of all the other tokens in the sentence.
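The per-head computation described above is the standard scaled dot-product attention of Vaswani et al. (2017). A sketch in the usual notation follows; the projection matrices W^Q, W^K, W^V, W^O and the head dimension d_k are the standard symbols, not notation introduced in this primer.

% One attention head: project the token representations $X$ into queries, keys
% and values, then mix the value vectors with softmax-normalised scores.
\[
  Q = X W^{Q}, \quad K = X W^{K}, \quad V = X W^{V}, \qquad
  \mathrm{head}(X) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V .
\]
% The $k$ heads of a layer are concatenated and projected back before the
% position-wise feedforward network mentioned above.
\[
  \mathrm{MultiHead}(X) = \left[\mathrm{head}_1(X); \ldots; \mathrm{head}_k(X)\right] W^{O} .
\]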
