A Primer on Pretrained Multilingual Language Models

Sumanth Doddapaneni^1*, Gowtham Ramesh^1*, Anoop Kunchukuttan^3,4, Pratyush Kumar^1,2,4, Mitesh M. Khapra^1,2,4

^1 Robert Bosch Center for Data Science and Artificial Intelligence, ^2 Indian Institute of Technology Madras, ^3 The AI4Bharat Initiative, ^4 Microsoft

Abstract

Multilingual Language Models (MLLMs) such as mBERT, XLM, XLM-R, etc. have emerged as a viable option for bringing the power of pretraining to a large number of languages. Given their success in zero shot transfer learning, there has emerged a large body of work in (i) building bigger MLLMs covering a large number of languages, (ii) creating exhaustive benchmarks covering a wider variety of tasks and languages for evaluating MLLMs, (iii) analysing the performance of MLLMs on monolingual, zero shot crosslingual and bilingual tasks, (iv) understanding the universal language patterns (if any) learnt by MLLMs and (v) augmenting the (often) limited capacity of MLLMs to improve their performance on seen or even unseen languages. In this survey, we review the existing literature covering the above broad areas of research pertaining to MLLMs. Based on our survey, we recommend some promising directions of future research.

1 Introduction

The advent of BERT (Devlin et al., 2019) has revolutionised the field of NLP and has led to state of the art performance on a wide variety of tasks (Wang et al., 2018a). The recipe is to train a deep transformer based model (Vaswani et al., 2017) on large amounts of monolingual data and then fine-tune it on small amounts of task-specific data. The pretraining happens using a masked language modeling objective and essentially results in an encoder which learns good sentence representations. These pretrained sentence representations then lead to improved performance on downstream tasks when fine-tuned on even small amounts of task-specific training data (Devlin et al., 2019). Given its success in English NLP, this recipe has been replicated across languages leading to many language specific BERTs such as FlauBERT (French) (Le et al., 2020), CamemBERT (French) (Martin et al., 2020), BERTje (Dutch) (Delobelle et al., 2020), FinBERT (Finnish) (Rönnqvist et al., 2019), BERTeus (Basque) (Agerri et al., 2020), AfriBERT (Afrikaans) (Ralethe, 2020), IndicBERT (Indian languages) (Kakwani et al., 2020), etc. However, training such language-specific models is only feasible for a few languages which have the necessary data and computational resources.

* The first two authors have contributed equally.

The above situation has led to the undesired effect of limiting recent advances in NLP to English and a few high resource languages (Joshi et al., 2020a). The question then is: How do we bring the benefit of such pretrained BERT based models to a very long list of languages of interest? One alternative, which has become popular, is to train multilingual language models (MLLMs) such as mBERT (Devlin et al., 2019), XLM (Conneau and Lample, 2019), XLM-R (Conneau et al., 2020a), etc. A MLLM is pretrained using large amounts of unlabeled data from multiple languages with the hope that low resource languages may benefit from high resource languages due to shared vocabulary, genetic relatedness (Nguyen and Chiang, 2017) or contact relatedness (Goyal et al., 2020). Several such MLLMs have been proposed in the past 3 years and they differ in the architecture (e.g., number of layers, parameters, etc), objective functions used for training (e.g., monolingual masked language modeling objective, translation language modeling objective, etc), data used for pretraining (Wikipedia, CommonCrawl, etc) and the number of languages involved (ranging from 12 to 100). To keep track of these rapid advances in MLLMs, as a first step, we present a survey of all existing MLLMs clearly highlighting their similarities and differences.

While training an MLLM is more efficient and inclusive (covers more languages), is there a trade-off in the performance compared to a monolingual model? More specifically, for a given language, is a language-specific BERT better than a MLLM? For example, if one is only interested in English NLP, should one use English BERT or a MLLM? The advantage of the former is that there is no capacity dilution (i.e., the entire capacity of the model is dedicated to a single language), whereas the advantage of the latter is that there is additional pretraining data from multiple (related) languages. In this work, we survey several existing studies (Conneau et al., 2020a; Wu and Dredze, 2020; Agerri et al., 2020; Virtanen et al., 2019; Rönnqvist et al., 2019; Ro et al., 2020; de Vargas Feijó and Moreira, 2020; Virtanen et al., 2019; Wang et al., 2020a; Wu and Dredze, 2020) which show that the right choice depends on various factors such as model capacity, amount of pretraining data, fine-tuning mechanism and amount of task-specific training data.

One of the main motivations of training MLLMs is to enable transfer from high resource languages to low resource languages. Of particular interest is the ability of MLLMs to facilitate zero shot crosslingual transfer (K et al., 2020) from a resource rich language to a resource deprived language which does not have any task-specific training data. To evaluate such crosslingual transfer, several benchmarks, such as XGLUE (Liang et al., 2020), XTREME (Hu et al., 2020) and XTREME-R (Ruder et al., 2021), have been proposed. We review these benchmarks which contain a wide variety of tasks such as classification, structure prediction, Question Answering, and crosslingual retrieval. Using these benchmarks several studies (Pires et al., 2019; Wu and Dredze, 2019; K et al., 2020; Artetxe et al., 2020a; K et al., 2020; Dufter and Schütze, 2020; Liu et al., 2020a; Lauscher et al., 2020; Liu et al., 2020c; Conneau and Lample, 2019; Wang et al., 2019; Liu et al., 2019a; Cao et al., 2020; Wang et al., 2020d; Zhao et al., 2020; Wang et al., 2020b; Chi et al., 2020b) have studied the crosslingual effectiveness of MLLMs and have shown that such transfer depends on various factors such as amount of shared vocabulary, explicit alignment of representations across languages, size of pretraining corpora, etc. We collate the main findings of these studies in this survey.

While the above discussion has focused on transfer learning and facilitating NLP in low resource languages, MLLMs could also be used for bilingual tasks. For example, could the shared representations learnt by MLLMs improve Machine Translation between two resource rich languages? We survey several works (Conneau and Lample, 2019; Kakwani et al., 2020; Huang et al., 2019; Conneau et al., 2020a; Eisenschlos et al., 2019; Zampieri et al., 2020; Libovický et al., 2020; Jalili Sabet et al., 2020; Chen et al., 2020; Zenkel et al., 2020; Dou and Neubig, 2021; Imamura and Sumita, 2019; Ma et al., 2020; Zhu et al., 2020; Liu et al., 2020b; Xue et al., 2021) which use MLLMs for downstream bilingual tasks such as unsupervised machine translation, crosslingual word alignment, crosslingual QA, etc. We summarise the main findings of these studies which indicate that MLLMs are useful for bilingual tasks, particularly in low resource scenarios.

The surprisingly good performance of MLLMs in crosslingual transfer as well as bilingual tasks motivates the hypothesis that MLLMs are learning universal patterns. However, our survey of the studies in this space indicates that there is no consensus yet. While representations learnt by MLLMs share commonalities across languages identified by different correlation analyses, these commonalities are dominantly within languages of the same family, and only in certain parts of the network (primarily middle layers). Also, while probing tasks such as POS tagging are able to benefit from such commonalities, harder tasks such as evaluating MT quality remain beyond the scope as yet. Thus, though promising, MLLMs do not yet represent interlingua.

Lastly, given the effort involved in training MLLMs, it is desirable that it is easy to extend them to new languages which weren't a part of the initial pretraining. We review existing studies which propose methods for (a) extending MLLMs to unseen languages, and (b) improving the capacity (and hence performance) of MLLMs for languages already seen during pretraining. These range from simple techniques such as fine-tuning the MLLM for a few epochs on the target language to using language and task specific adapters to augment the capacity of MLLMs.

1.1 Goals of the survey

Summarising the above discussion, the main goal of this survey is to review existing work with a focus on the following questions:

• How are different MLLMs built and how do they differ from each other? (Section 2)
• What are the benchmarks used for evaluating MLLMs? (Section 3)
• For a given language, are MLLMs better than monolingual LMs? (Section 4)
• Do MLLMs facilitate zero shot crosslingual transfer? (Section 5)
• Are MLLMs useful for bilingual tasks? (Section 6)
• Do MLLMs learn universal patterns? (Section 7)
• How to extend MLLMs to new languages? (Section 8)
• What are the recommendations based on this survey? (Section 9)

This survey focuses on the multilingual aspects of language models for NLU. Hence, we do not discuss related topics like monolingual LMs, pretrained models for NLG, training of large models, model compression, etc.

2 How are MLLMs built?

The goal of MLLMs is to learn a model that can generate a multilingual representation of a given text. Loosely, the model should generate similar representations in a common vector space for similar sentences and words (or words in similar context) across languages. In this section we describe the neural network architecture, objective functions, data and languages used for building MLLMs. We also highlight the similarities and differences between existing MLLMs.

2.1 Architecture

Multilingual Language models are typically based on the transformer architecture introduced by Vaswani et al. (2017) and then adapted for Natural Language Understanding (NLU) by Devlin et al. (2019).

Input Layer. The input to the MLLM is a sequence of tokens. The token input comes from a one-hot representation of a finite vocabulary, which is typically a subword vocabulary. This vocabulary is generally learnt from a concatenation of monolingual data from various languages using algorithms like BPE (Sennrich et al., 2016b), wordpiece (Wu et al., 2016) or sentencepiece (Kudo and Richardson, 2018a). To ensure reasonable representation in the vocabulary for different languages and scripts, data can be sampled using exponential weighted smoothing (discussed later) (Conneau et al., 2020a; Devlin et al., 2018) or separate vocabularies can be learnt for clusters of languages (Chung et al., 2020), partitioning the vocab size.

Transformer Layers. A typical MLLM comprises the encoder of the transformer network and contains a stack of N layers with each layer containing k attention heads followed by a feedforward neural network. For every token in the input sequence, an attention head computes an embedding using an attention weighted linear combination of the representations of all the other tokens in the sentence. The embeddings from all the attention heads are then concatenated and passed through a feedforward network to produce a d dimensional embedding for each input token. As shown in Table 1, existing MLLMs may differ in the choice of N, k and d. Further, the parameters in each layer may be shared as in (Kakwani et al., 2020).

Output Layer. The outputs of the last transformer layer are typically used as contextual representations for each token, while the embedding corresponding to the [CLS] token is considered to be the embedding of the entire input text. Alternatively, the text embedding can also be computed via pooling operations on the token embeddings. The output layer contains a simple linear transformation followed by a softmax that takes as input a token embedding from the last transformer layer and outputs a probability distribution over the tokens in the vocabulary. Note that the output layer is required only during pretraining and can be discarded during fine-tuning and task-inference - a fact that the RemBERT model uses to reduce model size (Chung et al., 2021b).

2.2 Training Objective Functions

A variety of objective functions have been proposed for training MLLMs. These can be broadly categorized as monolingual or parallel objectives depending on the nature of training data required. We discuss and compare these objective functions in this section.

2.2.1 Monolingual Objectives

These objective functions are defined on monolingual data alone. These are unsupervised/self-supervised objectives that train the model to generate multilingual representations by predicting missing tokens given the context tokens.

| Model | N | k | d | #Params | Objective Function | Mono. data | Parallel | Task specific | #langs. | vocab. |
| IndicBERT (Kakwani et al., 2020) | 12 | 12 | 768 | 33M | MLM | IndicCorp | ✗ | ✗ | 12 | 200K |
| Unicoder (Huang et al., 2019) | 12 | 16 | 1024 | 250M | MLM, TLM, CLWR, CLPC, CLMLM | Wikipedia | ✓ | ✗ | 15 | 95K |
| XLM-15 (Conneau and Lample, 2019) | 12 | 8 | 1024 | 250M | MLM, TLM | Wikipedia | ✓ | ✗ | 15 | 95K |
| XLM-17 (Conneau and Lample, 2019) | 16 | 16 | 1280 | 570M | MLM, TLM | Wikipedia | ✓ | ✗ | 17 | 200K |
| MuRIL (Khanuja et al., 2021) | 12 | 12 | 768 | 236M | MLM, TLM | CommonCrawl + Wikipedia | ✓ | ✗ | 17 | 197K |
| VECO-small (Luo et al., 2021) | 6 | 12 | 768 | 247M | MLM, CS-MLM† | CommonCrawl | ✓ | ✗ | 50 | 250K |
| VECO-Large (Luo et al., 2021) | 24 | 16 | 1024 | 662M | MLM, CS-MLM† | CommonCrawl | ✓ | ✗ | 50 | 250K |
| InfoXLM-base (Chi et al., 2021a) | 12 | 12 | 768 | 270M | MLM, TLM, XLCO | CommonCrawl | ✓ | ✗ | 94 | 250K |
| InfoXLM-Large (Chi et al., 2021a) | 24 | 16 | 1024 | 559M | MLM, TLM, XLCO | CommonCrawl | ✓ | ✗ | 94 | 250K |
| XLM-100 (Conneau and Lample, 2019) | 16 | 16 | 1280 | 570M | MLM, TLM | Wikipedia | ✗ | ✗ | 100 | 200K |
| XLM-R-base (Conneau et al., 2020a) | 12 | 12 | 768 | 270M | MLM | CommonCrawl | ✗ | ✗ | 100 | 250K |
| XLM-R-Large (Conneau et al., 2020a) | 24 | 16 | 1024 | 559M | MLM | CommonCrawl | ✗ | ✗ | 100 | 250K |
| X-STILTS (Phang et al., 2020) | 24 | 16 | 1024 | 559M | MLM | CommonCrawl | ✗ | ✓ | 100 | 250K |
| HiCTL-base (Wei et al., 2021) | 12 | 12 | 768 | 270M | MLM, TLM, HICTL | CommonCrawl | ✓ | ✗ | 100 | 250K |
| HiCTL-Large (Wei et al., 2021) | 24 | 16 | 1024 | 559M | MLM, TLM, HICTL | CommonCrawl | ✓ | ✗ | 100 | 250K |
| Ernie-M-base (Ouyang et al., 2021) | 12 | 12 | 768 | 270M | MLM, TLM, CAMLM, BTMLM | CommonCrawl | ✓ | ✗ | 100 | 250K |
| Ernie-M-Large (Ouyang et al., 2021) | 24 | 16 | 1024 | 559M | MLM, TLM, CAMLM, BTMLM | CommonCrawl | ✓ | ✗ | 100 | 250K |
| mBERT (Devlin et al., 2019) | 12 | 12 | 768 | 172M | MLM | Wikipedia | ✗ | ✗ | 104 | 110K |
| Amber (Hu et al., 2021) | 12 | 12 | 768 | 172M | MLM, TLM, CLWA, CLSA | Wikipedia | ✓ | ✗ | 104 | 120K |
| RemBERT (Chung et al., 2021a) | 32 | 18 | 1152 | 559M‡ | MLM | CommonCrawl + Wikipedia | ✗ | ✗ | 110 | 250K |

Table 1: A comparison of existing Multilingual Language Models. † - Cross sequence MLM, which is useful for NLG tasks. ‡ - For pretraining, RemBERT has 995M parameters.

Masked Language Model (MLM). This is the most standard training objective used for training most MLLMs. Typically, other pretraining objectives are used in conjunction with the MLM objective. It is a simple extension of the unsupervised MLM objective for a single language to multiple languages by pooling together monolingual data from multiple languages. Let x_1, x_2, ..., x_T be the sequence of words in a given training example. Of these, k tokens (≈ 15%) are randomly selected for masking. If the i-th token is selected for masking then it is replaced by (i) the [MASK] token (≈ 80% of the time), or (ii) a random token (≈ 10% of the time), or (iii) kept as it is (≈ 10% of the time). The goal is to then predict these k masked tokens using the remaining T − k tokens. More formally, the model is trained to minimize the cross entropy loss for predicting the masked tokens. Specifically, if u_i ∈ R^d is the representation for the i-th masked token computed by the last layer, then the cross-entropy loss for predicting this token is computed as:

$$\mathcal{L}(i) = -\log \frac{e^{W u_i}}{\sum_{j=1}^{V} e^{W u_j}}$$

where V is the size of the vocabulary and W ∈ R^{V×d} is a parameter to be learned. The total loss is then obtained by summing over the loss of all the masked tokens. While this objective is monolingual, it surprisingly helps learn multilingual models where the encoder representations across languages are aligned without the need for any parallel corpora. The potential reasons for this surprising effectiveness of MLM for multilingual models are discussed later.
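To make the masking procedure concrete, here is a minimal sketch (our own illustration, not code from any of the surveyed models; the function name, the 15% rate and the 80/10/10 split simply mirror the description above):

```python
import random

def mask_tokens(token_ids, vocab_size, mask_id, mask_prob=0.15):
    """BERT-style masking: select ~15% of positions; of the selected tokens,
    replace 80% with [MASK], 10% with a random token, keep 10% unchanged.
    Returns the corrupted sequence and prediction targets (-1 = not predicted)."""
    corrupted = list(token_ids)
    targets = [-1] * len(token_ids)  # cross entropy is computed only where target != -1
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:
            targets[i] = tok                      # the model must recover the original token
            r = random.random()
            if r < 0.8:
                corrupted[i] = mask_id            # (i) replace with [MASK]
            elif r < 0.9:
                corrupted[i] = random.randrange(vocab_size)  # (ii) random token
            # (iii) otherwise keep the token as it is
    return corrupted, targets
```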
Causal Language Model (CLM). This is the traditional language modelling objective of predicting the next word given the previous words. Unlike MLM, CLM has access to just unidirectional context. Given the success of MLM based language models for NLU applications, CLM has fallen out of favour and is currently used for pretraining NLG models where only unidirectional context is available for generation. We describe it for the sake of completeness. Let x_1, x_2, ..., x_T be the sequence of words in a given training batch. The goal then is to predict the i-th word given the previous i − 1 words. Specifically, the model is trained to minimize the cross entropy loss for predicting the i-th word given the previous i − 1 words.

2.2.2 Parallel-corpora Objectives

These objectives require parallel corpora and are designed to explicitly force representations of similar text across languages to be close to each other in the multilingual encoder space. The objectives are either word-level (TLM, CAMLM, CLMLM, HICTL, CLSA) or sentence-level (XLCO, HICTL, CLSA). Since parallel corpus is generally much smaller than the monolingual data, the parallel objectives are used in conjunction with monolingual models. This can be done via joint optimization of parallel and monolingual objectives, with each objective weighted appropriately. Sometimes, the initial training period may involve only monolingual objectives (XLCO, HICTL, CLSA).

Translation Language Model (TLM). In addition to monolingual data in each language, we may also have access to parallel data between some languages. Conneau and Lample (2019) introduced the translation language modeling objective to leverage such parallel data. Let x_1^A, x_2^A, ..., x_T^A be the sequence of words in a language A and let x_1^B, x_2^B, ..., x_T^B be the corresponding parallel sequence of words in a language B. Both the sequences are fed as input to the MLM with a [SEP] token in between. Similar to MLM, a total of k tokens are masked such that these tokens could either belong to the sequence in A or the sequence in B. To predict a masked word in A the model could rely on the surrounding words in A or the translation in B, thereby implicitly being forced to learn aligned representations. More specifically, if the context in A is not sufficient (due to masking or otherwise), then the model can use the context in B to predict a masked token in A. The final objective function is the same as MLM, i.e., to minimise the cross entropy loss of the masked tokens. The only difference is that the masked tokens could belong to either language.

Cross-attention Masked Language Modeling (CAMLM). CAMLM, introduced in Ouyang et al. (2021), learns crosslingual representations by predicting masked tokens in a parallel sentence pair. While predicting the masked tokens in a source sentence, the model is restricted to use semantics of the target sentence and vice versa. As opposed to TLM, where the model has access to both input sentence pairs to predict the masked tokens, in CAMLM, the model is restricted to only use the tokens in the corresponding parallel sentence to predict the masked tokens in the source sentence.

Crosslingual Masked Language Modeling (CLMLM). CLMLM (Huang et al., 2019) is very similar to the TLM objective as the masked-language-modeling is performed with crosslingual sentences as input. The main difference is that unlike TLM, the input is constructed at document level where multiple sentences from a crosslingual document are replaced by their translations in another language.

Crosslingual Contrastive Learning (XLCO). Chi et al. (2021b) propose that we can leverage parallel data for training MLLMs by maximizing the information content between parallel sentences. For example, let a_i^A be a sentence in language A and let b_i^B be its translation in language B. In addition, let {b_j^B}_{j=1, j≠i}^N be N − 1 sentences in B which are not translations of a_i^A. Chi et al. (2021b) show that the information content between a_i^A and b_i^B can be maximized by minimizing the following InfoNCE (van den Oord et al., 2019) based loss function:

$$\mathcal{L}_{XLCO} = -\log \frac{\exp\left(f(a_i^A)^\top f(b_i^B)\right)}{\sum_{j=1}^{N} \exp\left(f(a_i^A)^\top f(b_j^B)\right)} \qquad (1)$$

where f(a) is the encoding of the sentence a as computed by the transformer. Instead of just explicitly sampling negative sentences from {b_j^B}_{j=1, j≠i}^N, the authors use momentum (He et al., 2020) and mixup contrast (Chi et al., 2021b) to construct harder negative samples.

Hierarchical Contrastive Learning (HICTL). HICTL, introduced in Wei et al. (2021), also uses an InfoNCE based contrastive loss (CTL) and extends it to learn both sentence level and word level crosslingual representations. For sentence level CTL (same as Equation (1)), instead of directly sampling from {b_j^B}_{j=1, j≠i}^N, the authors use smoothed linear interpolation (Bowman et al., 2016; Zheng et al., 2019) between sentences in the embedding space to construct hard negative samples. For word level CTL, the similarity score used in the contrastive loss is calculated between the sentence representation q ([CLS] token of a parallel sentence pair <a_i^A, b_i^B>) and other words. A bag of words W is maintained for each parallel sentence pair input and each word in W is considered a positive sample while the other words in the vocabulary are considered negative. Instead of sampling negative words from the large vocabulary V, they construct a subset S ⊂ V − W of negative words that are very similar to q in the embedding space.
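As a concrete illustration of the InfoNCE objective in Equation (1), the sketch below simply uses the other sentences of language B in a batch as negatives; the momentum and mixup-based hard negatives of the original papers are omitted, and the function and argument names are ours:

```python
import torch
import torch.nn.functional as F

def xlco_loss(enc_a, enc_b):
    """enc_a, enc_b: (batch, dim) sentence encodings f(a_i^A) and f(b_i^B) of
    parallel pairs. For each a_i the aligned b_i is the positive and every other
    b_j in the batch serves as a negative, as in Equation (1)."""
    logits = enc_a @ enc_b.t()                                  # f(a_i)^T f(b_j) for all i, j
    labels = torch.arange(enc_a.size(0), device=enc_a.device)   # positives lie on the diagonal
    return F.cross_entropy(logits, labels)                      # = -log softmax at the true translation
```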
Crosslingual Sentence alignment (CLSA). Hu et al. (2021) leverage parallel data by training the crosslingual model to align sentence representations. Given a parallel sentence pair (X, Y), the model is trained to predict the corresponding translation Y for sentence X from negative samples in a mini-batch. Unlike mBERT, which encodes two sentences and uses the [CLS] embedding for sentence representation, the corresponding sentence representation in AMBER (Hu et al., 2021) is computed by averaging the word embeddings in the final layer of the MLLM.

2.2.3 Objectives based on other parallel resources

While parallel corpora are the most commonly used resource in the parallel objectives, other objectives use additional parallel resources like word alignments (CLWR, CLWA), crosslingual paraphrases (CLPC), code-mixed data (ALM) and backtranslated data (BTMLM).

Crosslingual word recovery (CLWR). Similar to TLM, this task introduced by Huang et al. (2019) aims to learn the word alignments between parallel sentences in two languages. A trainable attention matrix (Bahdanau et al., 2016) is used to represent source language word embeddings by the target language word embeddings. The crosslingual model is then trained to reconstruct the source language word embeddings from the transformation.

Crosslingual paraphrase classification (CLPC). Huang et al. (2019) leverage parallel data by introducing a paraphrase classification objective where parallel sentences (X, Y) across languages are positive samples and non-parallel sentences (X, Z) are treated as negative samples. To make the task challenging, they train a lightweight paraphrase detection model and sample Z that is very close to X but is not equal to Y.

Alternating Language Model (ALM). ALM (Yang et al., 2020) uses parallel sentences to construct code-mixed sentences and performs MLM on them. The code-mixed sentences are constructed by replacing aligned phrases between the source and target language.
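For intuition, here is a toy sketch of how such code-mixed inputs could be assembled, assuming phrase alignments between the parallel pair are already available (the span format and all names are our own simplification, not the procedure of Yang et al. (2020); the MLM masking step described earlier is then applied to the result):

```python
import random

def make_code_mixed(src_tokens, tgt_tokens, phrase_alignments, swap_prob=0.5):
    """Toy ALM-style construction: each aligned phrase pair, given as
    (src_start, src_end, tgt_start, tgt_end) spans, is swapped into the
    target language with probability swap_prob."""
    out, cursor = [], 0
    for s_start, s_end, t_start, t_end in sorted(phrase_alignments):
        out.extend(src_tokens[cursor:s_start])        # copy unaligned source tokens
        if random.random() < swap_prob:
            out.extend(tgt_tokens[t_start:t_end])     # substitute the aligned target phrase
        else:
            out.extend(src_tokens[s_start:s_end])     # keep the original source phrase
        cursor = s_end
    out.extend(src_tokens[cursor:])
    return out
```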
IndicBERT (Yang et al., 2020) uses parallel sentences to con- (Kakwani et al., 2020) on the other hand is trained struct code-mixed sentences and perform MLM on on custom crawled data in Indian languages. These it. The code mixed sentences are constructed by re- pretraining data-sets used by different MLLMs are placing aligned phrases between source and target summarised in Table1. language. Crosslingual Word alignment (CLWA). Hu et al. Languages. Some MLLMs like XLM-R are mas- (2021) leverage attention mechanism in transform- sively multilingual as they support ∼ 100 lan- ers to learn a word alignment based objective with guages whereas others like IndicBERT (Kakwani parallel data. The model is trained to produce two et al., 2020) and MuRIL (Khanuja et al., 2021) attention matrices - source to target attention Ax→y support a smaller set of languages. When deal- which measures the similarity of source words with ing with large number of languages one needs target words and similarly target to source atten- to be careful about the imbalance between the tion Ay→x. To encourage the model to align words amount of pretraining data available in different similarly in both source and target direction, they languages. For example, the number of English minimise the distance between Ax→y and Ay→x. articles in Wikipedia and CommonCrawl is much Back Translation Masked Language Modeling lager than the number of Finnish or Odia articles. (BTMLM). BTMLM introduced in Ouyang et al. Similarly, there might be a difference between the (2021) attempts to overcome the limitation of un- amount of parallel data available between differ- availability of parallel corpora for learning crosslin- ent languages. To ensure that low resource lan- Task Corpus Train Dev Test Test Sets Lang Task Metric Domain Benchmark XNLI 392,702 2,490 5,010 Translations 15 NLI Acc. Misc. XT, XTR, XG PAWS-X 49,401 2,000 2,000 Translations 7 Paraphrase Acc. Wiki / Quora XT, XTR, XG XCOPA 33,410+400 100 500 Translations 11 Reasoning Acc. Misc XTR Classification NC 100k 10k 10k - 5 Sent. Labelling Acc. News XG QADSM 100k 10k 10k - 3 Sent. Relevance Acc. Bing XG WPR 100k 10k 10k - 7 Sent. Relevance nDCG Bing XG QAM 100k 10k 10k - 7 Sent. Relevance Acc. Bing XG UD-POS 21,253 3,974 47-20,436 Ind. annot. 37(104) POS F1 Misc. XT, XTR, XG Struct. Pred WikiANN-NER 20,000 10,000 1,000-10,000 Ind. annot. 47(176) NER F1 Wikipedia XT, XTR NER 15k 2.8k 3.4k - 4 NER F1 News?? XG XQuAD 87,599 34,736 1,190 Translations 11 Span Extraction F1/EM Wikipedia XT, XTR QA MLQA 4,517-11,590 Translations 7 Span Extraction F1/EM Wikipedia XT, XTR TyDiQA-GoldP 3,696 634 323-2,719 Ind. annot. 9 Span Extraction F1/EM Wikipedia XT, XTR BUCC - - 1,896-14,330 - 5 Sent. Retrieval F1 Wiki / News XT Tatoeba - - 1,000 - 33(122) Sent. Retrieval Acc. Misc. XT Retrieval Mewsli-X 116,903 10,252 428-1,482 ind. annot. 11(50) Lang. agn. retrieval mAP@20 News XTR LAReQA XQuAD-R 87,599 10,570 1,190 translations 11 Lang. agn. retrieval mAP@20 Wikipedia XTR

Table 2: Benchmarks for all the tasks used for evaluation of MLLMs. Lang represents the number of languages considered from the entire pool of languages as part of the benchmark. Here XT refers to XTREME, XTR refers to XTREME-R and XG refers to XGLUE.

To ensure that low resource languages are not under-represented in the model, most MLLMs use exponentially smoothed weighting of the data while creating the pretraining data. In particular, if m% of the total pretraining data belongs to language i then the probability of that language is p_i = m/100. Each p_i is exponentiated by a factor α and the resulting values are then normalised to give a probability distribution over the languages. The pretraining data is then sampled according to this distribution. If α < 1 then the net effect is that the high-resource languages will be under sampled and the low resource languages will be over sampled. This also ensures that the low resource languages get a reasonable share of the total vocabulary used by the model (while training the wordpiece (Schuster and Nakajima, 2012) or sentencepiece model (Kudo and Richardson, 2018b)). Table 1 summarises the number of languages supported by different MLLMs and the total vocabulary used by them. Typically, MLLMs which support more languages have a larger vocabulary.
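A small sketch of the exponentially smoothed sampling described above (purely illustrative; α = 0.7 is just an example value and different MLLMs use different settings):

```python
def smoothed_sampling_probs(data_sizes, alpha=0.7):
    """data_sizes: mapping language -> amount of pretraining data (e.g. token counts).
    Returns p_i^alpha / sum_j p_j^alpha; with alpha < 1 high resource languages are
    down-sampled and low resource languages are up-sampled."""
    total = sum(data_sizes.values())
    p = {lang: size / total for lang, size in data_sizes.items()}   # p_i as described above
    smoothed = {lang: prob ** alpha for lang, prob in p.items()}
    norm = sum(smoothed.values())
    return {lang: value / norm for lang, value in smoothed.items()}

# e.g. smoothed_sampling_probs({"en": 1_000_000, "fi": 50_000, "or": 5_000})
```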
3 What are the benchmarks used for evaluating MLLMs?

The most common evaluation for MLLMs is crosslingual performance on downstream tasks, i.e., fine-tune the model on task-specific data for a high-resource language like English and evaluate it on other languages. Some common crosslingual benchmarks are XGLUE (Liang et al., 2020), XTREME (Hu et al., 2020) and XTREME-R (Ruder et al., 2021). These benchmarks contain training/evaluation data for a wide variety of tasks and languages as shown in Table 2. These tasks can be broadly classified into the following categories as discussed below: (i) classification, (ii) structure prediction, (iii) question answering, and (iv) retrieval.

Classification. Given an input comprising of a single sentence or a pair of sentences, the task here is to classify the input into one of k classes. For example, consider the task of Natural Language Inference (NLI) where the input is a pair of sentences and the output is one of 3 classes: entails, neutral, contradicts. Some of the popular text classification datasets used for evaluating MLLMs are XNLI (Conneau et al., 2018), PAWS-X (Yang et al., 2019), XCOPA (Ponti et al., 2020), NC (Liang et al., 2020), QADSM (Liang et al., 2020), WPR (Liang et al., 2020) and QAM (Liang et al., 2020).

Structure Prediction. Given an input sentence, the task here is to predict a label for every word in the sequence. Two popular tasks here are Parts-Of-Speech (POS) tagging and Named Entity Recognition (NER). For NER, the datasets from WikiANN-NER (Nivre et al., 2018b), CoNLL 2002 (Tjong Kim Sang, 2002) and CoNLL 2003 (Tjong Kim Sang and De Meulder, 2003) shared tasks are used whereas for POS tagging the Universal Dependencies dataset (Nivre et al., 2018a) is used.

Question Answering. Here the task is to extract an answer span given a context and a question. The training data is typically available only in English while the evaluation sets are available in multiple languages. The datasets used for this task include XQuAD (Artetxe et al., 2020b), MLQA (Lewis et al., 2020) and TyDiQA-GoldP (Clark et al., 2020).

Retrieval. Given a sentence in a source language, the task here is to retrieve a matching sentence in the target language from a collection of sentences. The following datasets are used for this task: BUCC (Zweigenbaum et al., 2017), Tatoeba (Artetxe and Schwenk, 2019), Mewsli-X (Ruder et al., 2021) and LAReQA XQuAD-R (Roy et al., 2020). Of these, Mewsli-X and LAReQA XQuAD-R are considered to be more challenging as they involve retrieving a matching sentence in the target language from a multilingual pool of sentences.

4 Are MLLMs better than monolingual models?

As mentioned earlier, pretrained language models such as BERT (Devlin et al., 2019) and its variants have achieved state of the art results on many NLU tasks. The typical recipe is to first pretrain a BERT-like model on large amounts of unlabeled data and then fine-tune the model with training data for a specific task in a language L. Given this recipe, there are two choices for pretraining: (i) pretrain a monolingual model using monolingual data from L only, or (ii) pretrain a MLLM using data from multiple languages (including L). The argument in favor of the former is that, since we are pretraining a model for a specific language, there is no capacity dilution (i.e., all the model capacity is being used to cater to the language of interest). The argument in favor of the latter is that there might be some benefit of using the additional pretraining data from multiple (related) languages. Existing studies show that there is no clear winner and the right choice depends on a few factors as listed below:

Model capacity. Conneau et al. (2020a) argue that using a high capacity MLLM trained on much larger pretraining data is better than smaller capacity MLLMs. In particular, they compare the performance of mBERT (172M parameters), XLM-R_base (270M parameters) and XLM-R (559M parameters) with state of the art monolingual models, i.e., BERT (335M parameters) and RoBERTa (355M parameters) (Liu et al., 2019b), on the following datasets: XNLI (Conneau et al., 2018), NER (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003), QA (Lewis et al., 2020), MNLI (Williams et al., 2018), QNLI (Wang et al., 2018b), QQP (Wang et al., 2018b), SST (Socher et al., 2013; Wang et al., 2018b), MRPC (Dolan and Brockett, 2005; Wang et al., 2018b) and STS-B (Cer et al., 2017; Wang et al., 2018b). They show that, in general, XLM-R performs better than XLM-R_base which in turn performs better than mBERT. While none of the MLLMs outperform a state of the art monolingual model, XLM-R matches its performance (within 1-2%) on most tasks.

Amount of pretraining data. Conneau et al. (2020a) compare two similar capacity models pretrained on different amounts of data (Wikipedia v/s CommonCrawl) and show that the model trained on larger data consistently performs better on 3 different tasks (XNLI, NER, MLQA). The model trained on larger data is also able to match the performance of state of the art monolingual models. Wu and Dredze (2020) perform an exhaustive study comparing the performance of state of the art monolingual models (trained using in-language training data) with mBERT based models (fine-tuned using in-language training data). Note that the state of the art monolingual model used in these experiments is not necessarily a pretrained BERT based model (as for many low resource languages pretraining with smaller amounts of corpora does not really help). They consider 3 tasks and a large number of languages: NER (99 languages), POS tagging (54 languages) and Dependency Parsing (54 languages). Their main finding is that for the bottom 30% of languages mBERT's performance drops significantly compared to a monolingual model. They attribute this poor performance to the inability of mBERT to learn good representations for these languages from the limited pretraining data.
However, they also caution that this is not necessarily due to “multilngual” pretraining as a monolingual BERT trained on these low resource Model capacity. Conneau et al.(2020a) argue that languages still performs poorly compared to using a high capacity MLLM trained on much mBERT. The performance is worse than a state of larger pretraining data is better than smaller ca- the art non-BERT based model simply because pacity MLLMs. In particular, they compare the pretraining with smaller corpora in these languages performance of mBERT (172M parameters), XLM- is not useful (neither in a multilingual nor in a R (270M parameters) and XLM-R (559M pa- base monolingual setting). rameters) with a state of the art monolingual mod- els, i.e., BERT (335M parameters), RoBERTa The importance of the amount of pretraining (355M paramaters) (Liu et al., 2019b) on the fol- is also emphasised in Agerri et al.(2020) where lowing datasets: XNLI (Conneau et al., 2018), they show that a monolingual BERT based model pretrained with larger data (216M tokens v/s 35M a single joint model as opposed to 4 individual tokens in mBERT) outperforms mBERT on 4 tasks: models. Moon et al.(2019) report similar results topic classification, NER, POS tagging and senti- for NER and strongly advocate a single joint ment classification (1.5 to 10 point better across model. Lastly, Wang et al.(2020a) report that for these tasks). Similarly, Virtanen et al.(2019) show the task of identifying offensive language in social that pretraining a monolingual BERT from scratch media posts, a jointly trained model outperforms for Finnish with larger data (13.5B tokens v/s 450M individually trained models on 4 out of the 5 tokens in mBERT) outperforms mBERT on 4 tasks: languages considered. However, more careful NER, POS tagging, dependency parsing and news analysis involving a diverse set of languages and classification. Both these works partly attribute the tasks is required to conclusively say if a jointly better performance to better tokenization and vo- fine-tuned MLLM is preferable over individually cabulary representation in the monolingual model. fine-tuned language specific models. Similarly Rönnqvist et al.(2019) show that for four Nordic languages (Danish, Swedish, Norwegian, Amount of task-specific training data. Typi- Finnish) the performance of mBERT is very poor cally, there is some correlation between the amount as compared to its performance on high resource of task-specific training data and the amount of languages such as English and German. Further, pretraining data available in a language. For exam- even for English and German, the performance of ple, a low resource language would have smaller mBERT is poor when compared to the correspond- amounts of training data as well as pretraining data. ing monolingual models in these languages. Ro Hence, intuitively, one would expect that in such et al.(2020) report similar results for Open Informa- cases of extreme scarcity, a multilingual model tion extraction where a BERT based English model would perform better. However, Wu and Dredze outperforms mBERT. In contrast to the results pre- (2020) show that this is not the case. In particular, sented so far, de Vargas Feijó and Moreira(2020) they show that for the task of NER for languages show that across 7 different tasks and many differ- with only 100 labeled sentences, a monolingual ent experimental setups, on average mBERT per- model outperforms an mBERT based model. 
The forms better than a monolingual Portuguese model reason for this could be a mix of poor tokenization, trained on much larger data (4.8GB/992M tokens lower vocabulary share and poor representation v/s 2GB in mBERT). learning from limited pretraining data. We believe In summary, based on existing studies, there is that more experiments which carefully ablate the no clear answer to the following question: “At amount of training data across different tasks and what size of monolingual corpora does the advan- languages would help us better understand the tage of training a multilingual model disappear?”. utility of MLLMs. Also note that most of these studies use mBERT Summary: Based on existing studies it is not clear and hence more experiments involving XLM-Rbase, whether MLLMs are always better than monolin- XLM-Rlarge and other recent MLLMs are required to draw more concrete conclusions. gual models. We recommend a more systematic study where the above parameters are carefully ab- Joint v/s individual fine-tuning. Once a model is lated for a wider range of tasks and languages. pretrained, there are two choices for fine-tuning it (i) fine-tune an individual model for each language 5 Do MLLMs facilitate zero shot using the training data for that language or (ii) crosslingual transfer? fine-tune a joint model by combining the training data available in all the languages. For example, In the context of MLLMs , the standard zero shot Virtanen et al.(2019) consider a scenario where crosslingual transfer from a source language to NER training data is available in 4 languages. a target language involves the following steps: They show that a joint model fine-tuned using the (i) pretrain a MLLM using training data from training data in all the 4 languages matches the multiple languages (including the source and performance of monolingual models individually target language) (ii) fine-tune the source model on fine-tuned for each of these languages. The task-specific training data available in the source performance drop (if any) is so small that it does language (iii) evaluate the fine-tuned model on test not offset the advantage of deploying/maintaining data from the target language. In this section, we summarise existing studies on using MLLMs for show that the performance of crosslingual transfer crosslingual transfer and highlight some factors depends on the depth of the network and not which could influence their performance in such a so much on the number of attention heads. In setup. particular, they show that decent transfer can be achieved even with a single headed attention. Shared vocabulary. Before training a MLLM, the Similarly, the total number of parameters in the sentences from all languages are first tokenized by model are not as important as the number of layers jointly training a WordPiece model (Schuster and in determining the performance of crosslingual Nakajima, 2012) or a SentencePiece model (Kudo transfer. In fact, Dufter and Schütze(2020) argue and Richardson, 2018b). The basic idea is to to- that mBERT’s multilinguality is due to its limited kenize each word into high frequency subwords. number of parameters which forces it to exploit The vocabulary used by the model is then a union common structures to align representations across of such subwords identified across all languages. languages. 
Another important architectural choice This joint tokenization ensures that the subwords is the context window for self attention, i.e., the are shared across many similar (or even distant) number of tokens fed as input to the MLLM during languages. For example, the token ‘es’ in the vo- training. Liu et al.(2020a) show that using smaller cabulary would be shared by en, fr, de, es, context windows is useful if the pretraining data etc. It is thus possible that if a subword which ap- is small (around 200K sentences) but when large pears in the testset of the target language is also amounts of pretraining data is available then it is present in the training set of the source language beneficial to use longer context windows. then some model performance will be transferred through this shared vocabulary. Several studies ex- Size of pretraining corpora. Liu et al.(2020a) amine the role of this shared vocabulary in enabling show that crosslingual transfer is better when crosslingual transfer as discussed below. mBERT is pretrained on larger corpora (1000K Pires et al.(2019) and Wu and Dredze(2019) sentence per language) as opposed to smaller show that there is strong positive correlation corpora (200K sentences per language). Lauscher between crosslingual zero-shot transfer perfor- et al.(2020) show that for higher level tasks such mance and the amount of shared vocabulary as XNLI and XQuAD, the performance of zero between the source and target language. The shot transfer has a strong correlation with the results are consistent across 5 different tasks amount of data in the target language used for (MLDoc (Schwenk and Li, 2018), XNLI (Conneau pretraining the MLLM. For lower level tasks such et al., 2018), NER (Tjong Kim Sang, 2002; Tjong as POS, dependency parsing and NER also there Kim Sang and De Meulder, 2003), POS (Nivre is a positive correlation but not as high as that for et al., 2018a), Dependency parsing (Nivre et al., XNLI and XQuAD. 2018a) ) and 5-39 different languages (the number of languages varies across tasks). However, K et al. Fine-tuning strategies. Liu et al.(2020c) argue (2020) present contradictory results and show that that when a MLLM is fine-tuned its parameters the performance drop is negligible even when the change which weakens its crosslingual ability as word overlap is reduced to zero synthetically (by some of the alignments learnt during pretraining shifting the unicode of each English character by a are lost. They motivate this by showing that large constant thereby replacing it by a completely the crosslingual retrieval performance of a different character). On similar lines Artetxe MLLM drops drastically when it is fine-tuned et al.(2020a) show that crosslingual transfer can for the task of POS tagging. To avoid this, they happen even when the vocabulary is not shared, suggest using a continual learning framework for by appropriately fine-tuning all layers of the fine-tuning so that the model does not forget the transformer except for the input embedding layer. original task (masked language modeling) that it was trained on. Using such a fine-tuning strategy Architecture of the MLLM. The model capacity they report better results on crosslingual POS of a MLLM depends on the number of layers, tagging, NER and sentence retrieval. number of attention heads per layer and the size of the hidden representations. K et al.(2020) Using bitext. 
While MLLMs show good trans- fer without explicitly being trained with any due to fine-tuning on other tasks and pretraining crosslingual signals, it stands to reason that with parallel data. The progress of crosslingual QA explicitly providing such signals during training datasets (such as MLQA) is very minimal and the should improve the performance. Indeed, XLM-R overall scores on the QA task are much less when (Conneau and Lample, 2019) and InfoXLM (Chi compared to monolingual QA in English. Simi- et al., 2021b) show that using parallel corpora with larly, on structure prediction tasks like NER and the TLM objective gives improved performance. POS tagging there isn’t much improvement in the If parallel corpus is not available then Dufter and performance going from some of the earlier models Schütze(2020) suggest that it is better to train like XLM-R (Conneau et al., 2020a) to some of the MLLMs with comparable corpora (say Wikipedia more recent models like VECO (Luo et al., 2021). or CommonCrawl) than using corpora from They recommend that (i) some of the easier tasks different sources across languages. such as BUCC and PAWS-X should be dropped from these evaluation benchmarks and (ii) more Explicit alignment of representations. Some complex tasks such as crosslingual retrival from works (Wang et al., 2019; Liu et al., 2019a) a mixed multilingual pool should be added (e.g., compare the performance of zero shot crosslingual LAReQA (Roy et al., 2020) and Mewsli-X (Ruder transfer using (i) representations learned by et al., 2021)). MLLMs which are implicitly aligned and (ii) Summary: While not very conclusive, existing representations from monolingual models which studies show that there is some evidence that the are then explicitly aligned post-hoc using some performance of MLLMs on zero shot crosslingual bitext. They observe that the latter leads to better transfer is generally better when (i) the source and performance. Taking cue from this, Cao et al. target languages share some vocabulary (ii) there (2020); Wang et al.(2020d) propose a method to is some similarity between the source and target explicitly align the representations of aligned word languages (iii) the MLLM uses a deeper architec- pairs across languages while training mBERT. ture (iv) enough pretraining data is available in the This is achieved by adding a loss function which target languages (v) a continual learning (learning- minimises the Euclidean distance between the without-forgetting) framework is used (vi) the rep- embeddings of aligned words. Zhao et al.(2020) resentations are explicitly aligned using bitext and also report similar results by explicitly aligning the appropriate loss functions and (vii) the complexity representations of aligned word pairs and further of the task is less. Note that in all the above cases, normalising the vector spaces (i.e., ensuring that crosslingual transfer using MLLMs performs much the representations across all languages have zero worse than using in-language supervision (as ex- mean and unit variance). pected). Further, in most cases it performs worse than a translate-and-train1 or a translate-and-test2 baseline. Knowledge Distillation. Both Wang et al.(2020b) and Chi et al.(2020b) argue that due to limited 6 Are MLLMs useful for bilingual model capacity MLLMs cannot capture all the tasks? nuances of multiple languages as compared to a pretrained monolingual model which caters to This survey has so far looked at the utility of only one language. 
They show that the crosslin- MLLMs for crosslingual tasks, where the multi- gual performance of a MLLM can be improved lingual capabilities of the MLLMs help in transfer by distilling knowledge from a monolingual model. learning and building universal models. Recent work has also explored the utility of MLLMs for bilingual tasks like word-alignment, sentence- Complexity of the task. As outlined in section3, retrieval, etc. This section analyzes the role of the tasks used for evaluating MLLMs are of dif- ferent complexity (ranging from binary classifica- 1translate-and-train: The training data available in one language is translated to the language of interest using a MT tion to Question Answering). Ruder et al.(2021) system. This (noisy) data is then used for training a model in show that much of the progress on zero shot trans- the target language. fer on existing benchmarks has not been uniform 2translate-and-test: The training data available in one (high resource) source language is used to train a model in with most of the gains coming from crosslingual that language. The test data from the language of interest is retrieval tasks. Here again, the gains are mainly then translated to the source language using a MT system. MLLMs for such bilingual tasks. tence embeddings (Libovický et al., 2020) and/or fine-tuning the MLLMs on parallel corpora with 6.1 Word Alignment contrastive/margin-based objectives (Feng et al., Recent work has shown that MLLMs, which are 2020; Ruder et al., 2021) and result in high-quality trained on monolingual corpora only, can be used sentence retrieval. to learn high-quality word alignments in parallel sentences (Libovický et al., 2020; Jalili Sabet et al., 6.3 Machine Translation 2020). This presents a promising unsupervised al- NMT models are typically encoder-decoder mod- ternative to statistical aligners (Brown et al., 1993; els with attention trained using parallel corpora. Och and Ney, 2003; Östling and Tiedemann, 2016; It has been show that using BERT for initializing Dyer et al., 2013) and neural MT based aligners the encoder/decoder (Conneau and Lample, 2019; (Chen et al., 2020; Zenkel et al., 2020), both of Imamura and Sumita, 2019; Ma et al., 2020) or ex- which are trained on parallel corpora. Finding the tracting features (Zhu et al., 2020) can help MT by best word alignment can be framed as a maximum- pretraining the model with linguistic information. weight maximal matching problem in the weighted In particular, Conneau and Lample(2019) show bipartite graph induced by the distance between that initialization of the encoder/decoder with a word embeddings. Greedy, iterative and optimal pretrained model can help in data-scarce scenarios transport based solutions to the problem have been like unsupervised NMT and low-resource NMT, proposed. The iterative solutions seem to perform and act as a substitute for backtranslation in su- the best with good alignment speed as well. Us- pervised NMT scenarios. Subsequent research has ing contextual embeddings from some intermedi- drawn inspiration from the success of MLLMs and ate layers gives better word alignment than the top shown the utility of denoising pretraining strategies encoder layer. Unlike statistical aligners, these specifically for sequence to sequence models (Liu MLLM based aligners are inherently multilingual. et al., 2020b; Xue et al., 2021). They also significantly outperform aligners based To summarize, MLLMs are useful for some on static word embeddings like FastText 3. 
The bilingual tasks, particularly in low-resource sce- MLLM aligners can be significantly improved by narios and fine-tuning with parallel data provides aligning on parallel corpora and/or word-aligned added benefits. data; fine-tuning also preserves the multilingual word-alignment (Dou and Neubig, 2021). Word 7 Do MLLMs learn universal patterns? alignment is useful for transfer in downstream tasks using the translate-test paradigm which needs pro- The success of language models such as BERT has jection of spans between parallel text (e.g., question led to a large body of research in understanding answering, POS, NER, etc). how such language models work, what information they learn, and how such information is represented 6.2 Crosslingual Sentence Retrieval in the models. Many of these questions studied in ‘BERTology’ are also relevant to multilingual lan- Since MLLMs represent sentences from different guage models (MLLMs) given the similarity in languages in a common embedding space, they the neural architectures of these networks. But can be used to find the translation of a sentence in one question relates specifically to the analysis of another language using nearest-neighbour search. MLLMs - Do these models learn and represent pat- The sentence embedding can be obtained from the terns which generalise across languages? Such an [CLS] token or appropriate pooling operations expectation is an old one going back to the “uni- on the token embeddings. However, the encoder versals of language” proposed in 1966 (Greenberg, representations across different languages are not 1966) and has been studied at different times. For aligned well enough for high quality sentence- instance, during the onset of word embeddings, it retrieval (Libovický et al., 2020). Unlike word was shown that the embeddings learnt across lan- alignment, embeddings learnt from monolingual guages can be aligned effectively (Mikolov et al., corpora alone are not sufficient for crosslingual sen- 2013). This expectation is renewed due to MLLMs tence retrieval given the large search space. These demonstrating surprisingly high crosslingual trans- shortcomings can be overcome by centering of sen- fer as summarised in §5. 3https://github.com/facebookresearch/fastText Inference on parallel text Different works ap- proach this question of universal patterns in differ- Japanese). This disparity between SOV and SVO ent ways. One set of methods analyse the inference languages is also observed for part-of-speech tag- on MLLMs of parallel text in different languages. ging (Pires et al., 2019). Each layer has different With such parallel text, the intermediate represen- specialisations and it is therefore useful to combine tations on the MLLM can be compared to identify information from different layers for best results, alignment, quantified with different mathematical instead of selecting a single layer based on the best techniques such as Canonical Correlation Analy- overall performance as demonstrated for Dutch on sis (CCA) and Centered Kernel Alignment (CKA). a range of NLU tasks (de Vries et al., 2020). In CCA analysis for mBERT showed that the model the same work, a comparison with a monolingual does not project the representations of different Dutch model revealed that a multilingual model has languages on to a shared space - a trend that is more informative representations for POS tagging stronger towards the later layers of the network in earlier layers. (Singh et al., 2019). 
Further, the correlation of Controlled Ablations Another set of results con- the representations of the languages mirrored lan- trol for the several confounding factors which need guage evolution, in particular phylogenetic trees to be controlled or ablated to check the hypothesis discovered by linguists. On the other hand, there that MLLMs learn language-independent represen- exist symmetries in the representations of multiple tations. The first such confounding factor is the languages as evidenced by isomorphic embedding joint script between many of the high resource lan- spaces (Conneau et al., 2020b). The argument in guages. This was identified not to be a sensitive support for the existence of such symmetries is that factor by demonstrating that transfer between rep- monolingual BERT models exhibit high degrees of resentations also occur between languages that do CKA similarity (Conneau et al., 2020b) Another not share the same script, such as Urdu written in related technique to find common representations the Arabic script and Hindi written in Devanagari across languages is machine translation. Given script (Pires et al., 2019). An important component a sentence in a source language and a few candi- of the model is the input tokenization. There is a date sentences in a target language, can we find strong bias to learn language-independent represen- the correct translation by identifying the nearest tations when using sub-word tokenization rather neighbour in the representation space? It is found than word-level or character-level (Singh et al., that such translation is sensitively dependent on 2019). Another variable of interest is the pretrain- the layer from which the representation is learnt - ing objective. Models such as LASER and XLM peaking in the middle layers of between 5 and 8 which are trained on crosslingual objectives retain with accuracy over 75% for related language pairs language-neutral features into the higher layers such as English-German and Hindi-Urdu (Pires better than m-BERT and XLMR which are only et al., 2019). One may conclude that that the very trained on monolingual objectives (Choenni and large neural capacity in MLLMs leads to multi- Shutova, 2020). lingual representations that have language-neutral and language-specific components. The language- In summary, there is no consensus yet on neutral components are rich enough to align word MLLMs learning universal patterns. There is clear embeddings and also retrieve similar sentences, but evidence that MLLMs learn embeddings which are not rich enough to solve complex tasks such as have high overlap across languages, primarily be- MT quality evaluation (Libovicky` et al., 2019). tween those of the same family. These common representations seem to be clearest in the middle Probing tasks Another approach to study univer- layers, after which the network specialises for dif- sality is by ‘probing tasks’ on the representations ferent languages as modelled in the pretraining learnt at different layers. For instance, consistent objectives. These common representations can dependency trees can be learnt from the represen- be probed to accurately perform supervised NLU tations of intermediate layers indicating syntactic tasks such as POS tagging, dependency parsing, abstractions in mBERT (Limisiewicz et al., 2020; in some cases with zero-shot transfer. However, Chi et al., 2020a). 
Probing tasks. Another approach to study universality is via 'probing tasks' on the representations learnt at different layers. For instance, consistent dependency trees can be learnt from the representations of intermediate layers, indicating syntactic abstractions in mBERT (Limisiewicz et al., 2020; Chi et al., 2020a). However, the dependency trees were more accurate for Subject-Verb-Object (SVO) languages (such as English, French, and Indonesian) than for SOV languages (such as Turkish, Korean, and Japanese). This disparity between SOV and SVO languages is also observed for part-of-speech tagging (Pires et al., 2019). Each layer has different specialisations and it is therefore useful to combine information from different layers for best results, instead of selecting a single layer based on the best overall performance, as demonstrated for Dutch on a range of NLU tasks (de Vries et al., 2020). In the same work, a comparison with a monolingual Dutch model revealed that a multilingual model has more informative representations for POS tagging in earlier layers.

Controlled Ablations. Another set of results controls for the several confounding factors which need to be controlled or ablated to check the hypothesis that MLLMs learn language-independent representations. The first such confounding factor is the joint script between many of the high resource languages. This was identified not to be a sensitive factor by demonstrating that transfer between representations also occurs between languages that do not share the same script, such as Urdu written in the Arabic script and Hindi written in the Devanagari script (Pires et al., 2019). An important component of the model is the input tokenization: there is a strong bias to learn language-independent representations when using sub-word tokenization rather than word-level or character-level tokenization (Singh et al., 2019). Another variable of interest is the pretraining objective. Models such as LASER and XLM which are trained on crosslingual objectives retain language-neutral features into the higher layers better than mBERT and XLM-R which are trained only on monolingual objectives (Choenni and Shutova, 2020).

In summary, there is no consensus yet on MLLMs learning universal patterns. There is clear evidence that MLLMs learn embeddings which have high overlap across languages, primarily between those of the same family. These common representations seem to be clearest in the middle layers, after which the network specialises for different languages as modelled in the pretraining objectives. These common representations can be probed to accurately perform supervised NLU tasks such as POS tagging and dependency parsing, in some cases with zero-shot transfer. However, more complex tasks such as MT quality evaluation (Libovický et al., 2019) or language generation (Rönnqvist et al., 2019) remain outside the realm of these models currently, keeping the debate on universal patterns incomplete.

8 How to extend MLLMs to new languages

Despite their success in zero shot crosslingual transfer, MLLMs suffer from the curse of multilinguality, which leads to capacity dilution. This limited capacity is an issue for (i) high resource languages, as the performance of MLLMs for such languages is typically lower than that of the corresponding monolingual models, (ii) low resource languages, where the performance is even poorer, and finally (iii) languages which are unseen in training (the last point is obvious but needs to be stated nonetheless). Given this situation, an obvious question to ask is how to enhance the capacity of MLLMs such that it benefits languages already seen during training (high resource or low resource) as well as languages which were unseen during training. The solutions proposed in the literature to address this question can be broadly classified into four categories, as discussed below.

Fine-tuning on the target language. Here the assumption is that we only care about the performance on a single target language at test time. To ensure that this language gets an increased share of the model capacity, we can simply fine-tune the pretrained MLLM using monolingual data in this language. This is akin to fine-tuning a MLLM (or even a monolingual LM) for a downstream task. Pfeiffer et al. (2020b) show that such target language adaptation prior to task-specific fine-tuning on source language data leads to improved performance over the standard crosslingual transfer setting. In particular, it does not result in catastrophic forgetting of the multilingual knowledge learned during pretraining which enables crosslingual transfer. The disadvantage of course is that the model is now specific to the given target language and may not be suitable for other languages. Further, this method does not address the fundamental limitation in the model capacity and still hinders adaptation to low resource and unseen languages.
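A minimal sketch of such target language adaptation is shown below: it continues masked language modeling on monolingual target language text with the HuggingFace transformers library; the model name, placeholder corpus and hyperparameters are assumptions for illustration and not the settings of Pfeiffer et al. (2020b).

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForMaskedLM, DataCollatorForLanguageModeling

model_name = "xlm-roberta-base"                      # illustrative MLLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Placeholder monolingual corpus in the target language.
target_corpus = ["A monolingual sentence in the target language.",
                 "Another target-language sentence used for adaptation."]
encodings = [tokenizer(s, truncation=True, max_length=128) for s in target_corpus]

# Randomly masks 15% of the tokens, as in standard MLM pretraining.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
loader = DataLoader(encodings, batch_size=2, shuffle=True, collate_fn=collator)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for epoch in range(1):                               # a real run needs far more data and steps
    for batch in loader:
        loss = model(**batch).loss                   # MLM cross-entropy on the masked positions
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```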
Augmenting vocabulary. A simple but effective way of extending a MLLM to a new language which was unseen during training is to simply augment the vocabulary of the model with new tokens corresponding to the target language. This would in turn lead to additional parameters getting created in the input (embedding) layer and the decoder (output) layer. The pretrained MLLM can then be further trained using monolingual data from the target language so that the newly introduced parameters get trained. Wang et al. (2020c) show that such an approach is not only effective for unseen languages but also benefits languages that are already present in the MLLM. The increase in the performance on languages already seen during training is surprising and can be attributed to (i) increased representation of this language in the vocabulary, (ii) focused fine-tuning on the target language and (iii) increased monolingual data for the target language. Some studies have also considered unseen languages which share vocabulary with already seen languages. For example, Muller et al. (2020) show that for Narabizi (a North African Arabic dialect written in the Latin script with a lot of code mixing with French), simply fine-tuning BERT with very few sentences from Narabizi (around 50,000 sentences) leads to reasonable zero shot transfer from French. For languages having an unseen script, Müller et al. (2020) suggest that such languages should be transliterated to a script which is used by a related language seen during pretraining.
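The vocabulary augmentation step itself can be sketched as follows, again assuming the HuggingFace transformers API; the added tokens are hypothetical placeholders, and in practice one would typically learn a proper subword vocabulary from target language text.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# Hypothetical new tokens for the target language (placeholders for illustration).
new_tokens = ["targetlang_token_a", "targetlang_token_b"]
num_added = tokenizer.add_tokens(new_tokens)

# Creates new rows in the input embedding matrix (and the tied output layer);
# the new rows are randomly initialised and must be trained on target language
# data, e.g. with the MLM objective sketched above.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; new vocabulary size: {len(tokenizer)}")
```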
Exploiting Latent Semantics in the embedding matrix. Pfeiffer et al. (2020c) argue that lexically overlapping tokens play an important role in crosslingual transfer. However, for a new language with an unseen script such transfer is not possible. To leverage the multilingual knowledge learned in the embedding matrix, they factorise the embedding matrix (in R^{|V|×d}) into lower dimensional word embeddings (F ∈ R^{|V|×d_1}) and C shared up-projection matrices (G_1, ..., G_C ∈ R^{d_1×d}). The matrix F encodes token specific information and the up-projection matrices encode general crosslingual information. Each token is associated with one of the C up-projection matrices. For an unseen language T with an unseen script, they then learn an embedding matrix (F' ∈ R^{|V_T|×d_1}) and an assignment for each token in that language to one of the C up-projection matrices. This allows the token to leverage the multilingual knowledge already encoded in the pretrained MLLM via the up-projection matrices.
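The sketch below illustrates this factorisation in PyTorch, with token embeddings of dimension d_1, C shared up-projection matrices and a per-token assignment; the shapes and the random assignment are assumptions for illustration, not the implementation of Pfeiffer et al. (2020c).

```python
import torch
import torch.nn as nn

class FactorisedEmbedding(nn.Module):
    """Token embeddings F (|V| x d1) up-projected by one of C shared matrices."""

    def __init__(self, vocab_size: int, d1: int, d: int, num_groups: int):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d1)                   # F
        self.up_proj = nn.Parameter(torch.randn(num_groups, d1, d))   # G_1 .. G_C
        # Assignment of each token to one of the C up-projection matrices
        # (learned/induced in the original work; random here for illustration).
        self.register_buffer("assignment", torch.randint(num_groups, (vocab_size,)))

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        low = self.tok_emb(token_ids)                      # (batch, seq, d1)
        proj = self.up_proj[self.assignment[token_ids]]    # (batch, seq, d1, d)
        return torch.einsum("bsk,bskd->bsd", low, proj)    # (batch, seq, d)

# Toy usage: a new language only needs a new tok_emb and assignment;
# the shared up-projections (and the rest of the MLLM) are reused.
emb = FactorisedEmbedding(vocab_size=1000, d1=64, d=768, num_groups=8)
print(emb(torch.randint(1000, (2, 5))).shape)   # torch.Size([2, 5, 768])
```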
Using Adapters. Another popular way to increase model capacity is to use adapters (Rebuffi et al., 2017; Houlsby et al., 2019), which essentially introduce a small number of additional parameters for every language and/or task, thereby augmenting the limited capacity of MLLMs. Several studies have shown the effectiveness of using such adapters in MLLMs (Üstün et al., 2020; Vidoni et al., 2020; Pfeiffer et al., 2020b; Nguyen et al., 2021). We explain the overall idea by referring to the work of Pfeiffer et al. (2020b). An adapter can be added at every layer of the transformer. Let h be the hidden size of the Transformer and d be the dimension of the adapter. An adapter layer simply contains a down-projection (D : R^h → R^d) followed by a ReLU activation and an up-projection (U : R^d → R^h). Such an adapter layer thus introduces 2 × d × h additional parameters for every language. During adaptation, the rest of the parameters of the MLLM are kept fixed and the adapter parameters are trained on unlabelled data from the target language using the MLM objective. Further, Pfeiffer et al. (2020b) propose that during task-specific fine-tuning the language adapter of the source language should be used, whereas during zero shot crosslingual transfer, at test time, the source language adapter should be replaced by the target language adapter. However, this requires that the underlying MLLM does not change during fine-tuning. Hence, they add a task-specific adapter layer on top of a language specific adapter layer. During task-specific fine-tuning, only the parameters of this adapter layer are trained, leaving the rest of the MLLM unchanged. Vidoni et al. (2020) use orthogonality constraints to ensure that the language and task specific adapters learn complementary information. Lastly, to account for unseen languages whose vocabulary is not seen during training, they add invertible adapters at the input layer. The function of these adapters is to learn token level characteristics. This adapter is also trained with the MLM objective using the unlabelled monolingual data in the target language. In the context of the above discussion on adapters, we would like to point the readers to AdapterHub.ml (https://github.com/Adapter-Hub/adapter-transformers), a useful repository containing all recent adapter architectures (Pfeiffer et al., 2020a).
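A minimal PyTorch sketch of such an adapter layer is given below; the residual connection and the per-language ModuleDict are common adapter conventions assumed for illustration rather than the exact architecture of Pfeiffer et al. (2020b).

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project to d, ReLU, up-project back to h."""

    def __init__(self, hidden_size: int, adapter_dim: int):
        super().__init__()
        self.down = nn.Linear(hidden_size, adapter_dim, bias=False)  # D: R^h -> R^d
        self.up = nn.Linear(adapter_dim, hidden_size, bias=False)    # U: R^d -> R^h
        # 2 * d * h parameters per adapter (biases omitted to match that count).

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the pretrained representation intact.
        return hidden_states + self.up(torch.relu(self.down(hidden_states)))

# One adapter per language (and/or task); only these parameters are trained,
# while the underlying MLLM stays frozen.
h, d = 768, 64
adapters = nn.ModuleDict({"en": Adapter(h, d), "sw": Adapter(h, d)})
x = torch.randn(2, 10, h)     # (batch, seq, hidden) from a frozen MLLM layer
out = adapters["sw"](x)       # swap adapters to switch language at test time
print(out.shape, sum(p.numel() for p in adapters["sw"].parameters()))  # 2*d*h = 98304
```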

9 Recommendations

Based on our review of the literature on MLLMs, we make the following recommendations:

Ablation Studies. The design of deep neural models involves various parameters which are often optimized only by exhaustive ablation studies. In the case of MLLMs, the axes of ablation belong to three sets: architectural parameters, pretraining objectives, and the subset of languages chosen. Given the number of options for each of these sets, an exhaustive ablation study would be prohibitively expensive. However, in the absence of such a study some questions remain open: for instance, what subset of languages should one choose for training a multilingual model? How should the architecture be shaped as we change the number of languages? One research direction is to design controlled and scaled-down ablation studies where a broader set of parameters can be evaluated and generalized guidelines can be derived.

Zero-Shot Evaluation. The primary promise of MLLMs remains crosslingual performance, especially with zero-shot learning. However, results on zero-shot learning have a wide variance in the published literature across tasks and languages (Keung et al., 2020). A more systematic study, controlling for the design parameters discussed above and for the training and test sets, is required. Specifically, there is a need for careful comparisons against translation-based baselines such as translate-test, translate-train and translate-train-all (Conneau et al., 2019) across tasks and languages.

mBERTology. A large body of literature has studied what the monolingual BERT model learns, and how and where such information is stored in the model (Rogers et al., 2020). One example is the analysis of the role of attention heads in encoding syntactic, semantic, or positional information (Voita et al., 2019). Given the similarity in architecture, such analysis may be extended to MLLMs. This may help the interpretability of MLLMs but, crucially, also contrast multilingual models with monolingual models, providing clues on the emergence of crosslinguality.

Language inclusivity. MLLMs hold promise as an 'infrastructure' resource for the long list of the languages in the world. Many of these languages are widely spoken but do not receive sufficient focus in research and development (Joshi et al., 2020b). Towards this end, MLLMs must become more inclusive, scaling up to thousands of languages. This may require model innovations such as moving beyond language-specific adapters. Crucially, it also requires the availability of inclusive benchmarks in a variety of tasks and languages. Without such benchmark datasets, the research community has little incentive to train and evaluate MLLMs targeting the long list of the world's languages. We see this as an important next phase in the development of MLLMs.

Efficient models. MLLMs represent some of the largest models that are being trained today. However, running inference on such large models is often not possible on edge devices and is increasingly expensive on cloud services. One important research direction is to downsize these large models without affecting accuracy. Several standard methods such as pruning, quantization, factorization, distillation, and architecture search have been used on monolingual models (Tay et al., 2020). These methods need to be explored for MLLMs while ensuring that the generality of MLLMs across languages is retained.

Robust models. MLLMs supporting multiple languages need to be extensively evaluated for any encoded biases and their ability to generalize. One direction of research is to build extensive diagnostic and evaluation suites such as the MultiChecklist proposed in Ruder et al. (2021). Evaluation frameworks such as Explainaboard (Liu et al., 2021; Fu et al., 2020) also need to be developed for a range of tasks and languages to identify the nature of errors made by multilingual models. Also significant is extending the analysis of bias in deep NLP models to multilingual systems (Blodgett et al., 2020).

10 Conclusion

We reviewed existing literature on MLLMs covering a wide range of sub-areas of research. In particular, we surveyed papers on building better and bigger MLLMs using different training objectives and different resources (monolingual data, parallel data, back-translated data, etc.). We also reviewed existing benchmarks for evaluating MLLMs and covered several studies which use these benchmarks to assess the factors which contribute to the performance of MLLMs in (i) a zero shot crosslingual transfer setup, (ii) a monolingual setup and (iii) a bilingual setup. Given the surprisingly good performance of MLLMs in several zero shot transfer learning setups, we also reviewed existing work which investigates whether these models learn any universal patterns across languages. Lastly, we reviewed studies on improving the limited capacity of MLLMs and extending them to new languages. Based on our review, we recommend that future research on MLLMs should focus on (i) controlled ablation studies involving a broader set of parameters, (ii) comprehensive evaluation of the zero shot performance of MLLMs across a wider set of tasks and languages, (iii) understanding the patterns learnt by attention heads and other components in MLLMs, (iv) including more languages in pretraining and evaluation and (v) building efficient and robust MLLMs using better evaluation frameworks (e.g., Explainaboard).

Acknowledgments

We would like to thank the Robert Bosch Center for Data Science and Artificial Intelligence for supporting Sumanth and Gowtham through their Post Baccalaureate Fellowship Program.

References

Rodrigo Agerri, Iñaki San Vicente, Jon Ander Campos, Ander Barrena, Xabier Saralegi, Aitor Soroa, and Eneko Agirre. 2020. Give your text representation models some love: the case for basque. In Proceedings of The 12th Language Resources and Evaluation Conference, LREC 2020, Marseille, France, May 11-16, 2020, pages 4781–4788. European Language Resources Association.

Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. 2020a. On the cross-lingual transferability of monolingual representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 4623–4637. Association for Computational Linguistics.

Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. 2020b. On the cross-lingual transferability of monolingual representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4623–4637, Online. Association for Computational Linguistics.

Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Trans. Assoc. Comput. Linguistics, 7:597–610.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2016. Neural machine translation by jointly learning to align and translate.

Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. 2020. Language (technology) is power: A critical survey of "bias" in nlp. arXiv preprint arXiv:2005.14050.

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew Dai, Rafal Jozefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 10–21, Berlin, Germany. Association for Computational Linguistics.

information-theoretic framework for cross-lingual language model pre-training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, pages 3576–3588. Association for Computational Linguistics.

Peter F. Brown, Stephen A. Della Pietra, Vincent J. Rochelle Choenni and Ekaterina Shutova. 2020. What Della Pietra, and Robert L. Mercer. 1993. The math- does it mean to be language-agnostic? probing mul- ematics of statistical machine translation: Parameter tilingual sentence encoders for typological proper- estimation. Computational Linguistics, 19(2):263– ties. arXiv preprint arXiv:2009.12862. 311. Hyung Won Chung, Thibault Fevry, Henry Tsai, Steven Cao, Nikita Kitaev, and Dan Klein. 2020. Mul- Melvin Johnson, and Sebastian Ruder. 2021a. Re- tilingual alignment of contextual word representa- thinking embedding coupling in pre-trained lan- tions. In 8th International Conference on Learning guage models. In International Conference on Representations, ICLR 2020, Addis Ababa, Ethiopia, Learning Representations. April 26-30, 2020. OpenReview.net. Hyung Won Chung, Thibault Févry, Henry Tsai, Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez- Melvin Johnson, and Sebastian Ruder. 2021b. Re- Gazpio, and Lucia Specia. 2017. Semeval-2017 thinking embedding coupling in pre-trained lan- task 1: Semantic textual similarity-multilingual and guage models. In 9th International Conference on cross-lingual focused evaluation. arXiv preprint Learning Representations, ICLR 2021, Virtual Event, arXiv:1708.00055. Austria, May 3-7, 2021. OpenReview.net. Yun Chen, Yang Liu, Guanhua Chen, Xin Jiang, and Hyung Won Chung, Dan Garrette, Kiat Chuan Tan, and Qun Liu. 2020. Accurate word alignment induction Jason Riesa. 2020. Improving multilingual models from neural machine translation. In Proceedings of with language-clustered vocabularies. In Proceed- the 2020 Conference on Empirical Methods in Natu- ings of the 2020 Conference on Empirical Methods ral Language Processing (EMNLP), pages 566–576, in Natural Language Processing, EMNLP 2020, On- Online. Association for Computational Linguistics. line, November 16-20, 2020, pages 4536–4546. As- Ethan A. Chi, John Hewitt, and Christopher D. Man- sociation for Computational Linguistics. ning. 2020a. Finding universal grammatical rela- tions in multilingual BERT. In Proceedings of the Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan 58th Annual Meeting of the Association for Compu- Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and tational Linguistics, ACL 2020, Online, July 5-10, Jennimaria Palomaki. 2020. Tydi qa: A benchmark 2020, pages 5564–5577. Association for Computa- for information-seeking question answering in typo- tional Linguistics. logically diverse languages. Transactions of the As- sociation for Computational Linguistics. Zewen Chi, Li Dong, Furu Wei, Xianling Mao, and Heyan Huang. 2020b. Can monolingual pretrained Alexis Conneau, Kartikay Khandelwal, Naman Goyal, models help cross-lingual classification? In Pro- Vishrav Chaudhary, Guillaume Wenzek, Francisco ceedings of the 1st Conference of the Asia-Pacific Guzmán, Edouard Grave, Myle Ott, Luke Zettle- Chapter of the Association for Computational Lin- moyer, and Veselin Stoyanov. 2019. Unsupervised guistics and the 10th International Joint Conference cross-lingual representation learning at scale. arXiv on Natural Language Processing, AACL/IJCNLP preprint arXiv:1911.02116. 2020, Suzhou, China, December 4-7, 2020, pages 12–17. Association for Computational Linguistics. 
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Zewen Chi, Li Dong, Furu Wei, Nan Yang, Sak- Guzmán, Edouard Grave, Myle Ott, Luke Zettle- sham Singhal, Wenhui Wang, Xia Song, - moyer, and Veselin Stoyanov. 2020a. Unsupervised Mao, Heyan Huang, and Ming Zhou. 2021a. In- cross-lingual representation learning at scale. In foXLM: An information-theoretic framework for Proceedings of the 58th Annual Meeting of the As- cross-lingual language model pre-training. In Pro- sociation for Computational Linguistics, ACL 2020, ceedings of the 2021 Conference of the North Amer- Online, July 5-10, 2020, pages 8440–8451. Associa- ican Chapter of the Association for Computational tion for Computational Linguistics. Linguistics: Human Language Technologies, pages 3576–3588, Online. Association for Computational Alexis Conneau and Guillaume Lample. 2019. Cross- Linguistics. lingual language model pretraining. In Advances in Neural Information Processing Systems 32: An- Zewen Chi, Li Dong, Furu Wei, Nan Yang, Saksham nual Conference on Neural Information Processing Singhal, Wenhui Wang, Xia Song, Xian-Ling Mao, Systems 2019, NeurIPS 2019, 8-14 December 2019, Heyan Huang, and Ming Zhou. 2021b. Infoxlm: An Vancouver, BC, Canada, pages 7057–7067. Alexis Conneau, Ruty Rinott, Guillaume Lample, Ad- Philipp Dufter and Hinrich Schütze. 2020. Identifying ina Williams, Samuel R. Bowman, Holger Schwenk, elements essential for BERT’s multilinguality. In and Veselin Stoyanov. 2018. Xnli: Evaluating cross- Proceedings of the 2020 Conference on Empirical lingual sentence representations. In Proceedings of Methods in Natural Language Processing (EMNLP), the 2018 Conference on Empirical Methods in Natu- pages 4423–4437, Online. Association for Computa- ral Language Processing. Association for Computa- tional Linguistics. tional Linguistics. Chris Dyer, Victor Chahuneau, and Noah A. Smith. Alexis Conneau, Shijie Wu, Haoran Li, Luke Zettle- 2013. A simple, fast, and effective reparameter- moyer, and Veselin Stoyanov. 2020b. Emerging ization of IBM model 2. In Proceedings of the cross-lingual structure in pretrained language mod- 2013 Conference of the North American Chapter of els. In Proceedings of the 58th Annual Meeting the Association for Computational Linguistics: Hu- of the Association for Computational Linguistics, man Language Technologies, pages 644–648, At- pages 6022–6034, Online. Association for Compu- lanta, Georgia. Association for Computational Lin- tational Linguistics. guistics. Diego de Vargas Feijó and Viviane Pereira Moreira. Julian Eisenschlos, Sebastian Ruder, Piotr Czapla, 2020. Mono vs multilingual transformer-based mod- Marcin Kardas, Sylvain Gugger, and Jeremy els: a comparison across several language tasks. Howard. 2019. Multifit: Efficient multi-lingual lan- CoRR, abs/2007.09757. guage model fine-tuning. In Proceedings of the Wietse de Vries, Andreas van Cranenburgh, and Malv- 2019 Conference on Empirical Methods in Natu- ina Nissim. 2020. What’s so special about bert’s lay- ral Language Processing and the 9th International ers? A closer look at the NLP pipeline in monolin- Joint Conference on Natural Language Processing, gual and multilingual models. In Proceedings of the EMNLP-IJCNLP 2019, Hong Kong, China, Novem- 2020 Conference on Empirical Methods in Natural ber 3-7, 2019, pages 5701–5706. Association for Language Processing: Findings, EMNLP 2020, On- Computational Linguistics. 
line Event, 16-20 November 2020, volume EMNLP 2020 of Findings of ACL, pages 4339–4350. Associ- Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen ation for Computational Linguistics. Arivazhagan, and Wei Wang. 2020. Language- agnostic BERT sentence embedding. CoRR, Pieter Delobelle, Thomas Winters, and Bettina Berendt. abs/2007.01852. 2020. Robbert: a dutch roberta-based language model. In Proceedings of the 2020 Conference on Jinlan Fu, Pengfei Liu, and Graham Neubig. 2020. In- Empirical Methods in Natural Language Process- terpretable multi-dataset evaluation for named entity ing: Findings, EMNLP 2020, Online Event, 16-20 recognition. arXiv preprint arXiv:2011.06854. November 2020, volume EMNLP 2020 of Findings of ACL, pages 3255–3265. Association for Compu- Vikrant Goyal, Anoop Kunchukuttan, Rahul Kejriwal, tational Linguistics. Siddharth Jain, and Amit Bhagwat. 2020. Con- tact relatedness can help improve multilingual NMT: Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Microsoft STCI-MT @ WMT20. In Proceedings Kristina Toutanova. 2018. Bert: Pre-training of deep of the Fifth Conference on Machine Translation, bidirectional transformers for language understand- pages 202–206, Online. Association for Computa- ing. arXiv preprint arXiv:1810.04805. tional Linguistics. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Joseph H Greenberg. 1966. Language universals. Kristina Toutanova. 2019. BERT: Pre-training of Mouton The Hague. deep bidirectional transformers for language under- standing. In Proceedings of the 2019 Conference Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and of the North American Chapter of the Association Ross Girshick. 2020. Momentum contrast for unsu- for Computational Linguistics: Human Language pervised visual representation learning. Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Associ- Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, ation for Computational Linguistics. Bruna Morrone, Quentin De Laroussilhe, Andrea William B Dolan and Chris Brockett. 2005. Automati- Gesmundo, Mona Attariyan, and Sylvain Gelly. cally constructing a corpus of sentential paraphrases. 2019. Parameter-efficient transfer learning for NLP. In Proceedings of the Third International Workshop In Proceedings of the 36th International Conference on Paraphrasing (IWP2005). on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 2790–2799. Zi-Yi Dou and Graham Neubig. 2021. Word alignment PMLR. by fine-tuning embeddings on parallel corpora. In Proceedings of the 16th Conference of the European Junjie Hu, Melvin Johnson, Orhan Firat, Aditya Sid- Chapter of the Association for Computational Lin- dhant, and Graham Neubig. 2021. Explicit align- guistics: Main Volume, pages 2112–2128, Online. ment objectives for multilingual bidirectional en- Association for Computational Linguistics. coders. Junjie Hu, Sebastian Ruder, Aditya Siddhant, Gra- Simran Khanuja, Diksha Bansal, Sarvesh Mehtani, ham Neubig, Orhan Firat, and Melvin Johnson. Savya Khosla, Atreyee Dey, Balaji Gopalan, 2020. XTREME: A massively multilingual multi- Dilip Kumar Margam, Pooja Aggarwal, Rajiv Teja task benchmark for evaluating cross-lingual gener- Nagipogu, Shachi Dave, Shruti Gupta, Subhash alisation. In Proceedings of the 37th International Chandra Bose Gali, Vish Subramanian, and Partha Conference on Machine Learning, ICML 2020, 13- Talukdar. 2021. Muril: Multilingual representations 18 July 2020, Virtual Event, volume 119 of Proceed- for indian languages. 
ings of Machine Learning Research, pages 4411– 4421. PMLR. Taku Kudo and John Richardson. 2018a. Sentence- piece: A simple and language independent subword Haoyang Huang, Yaobo Liang, Nan Duan, Ming Gong, tokenizer and detokenizer for neural text processing. Linjun Shou, Daxin Jiang, and Ming Zhou. 2019. In Proceedings of the 2018 Conference on Empirical Unicoder: A universal language encoder by pre- Methods in Natural Language Processing, EMNLP training with multiple cross-lingual tasks. In Pro- 2018: System Demonstrations, Brussels, Belgium, ceedings of the 2019 Conference on Empirical Meth- October 31 - November 4, 2018, pages 66–71. As- ods in Natural Language Processing and the 9th In- sociation for Computational Linguistics. ternational Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, Taku Kudo and John Richardson. 2018b. Sentence- China, November 3-7, 2019, pages 2485–2494. As- Piece: A simple and language independent subword sociation for Computational Linguistics. tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Kenji Imamura and Eiichiro Sumita. 2019. Recycling a Methods in Natural Language Processing: System pre-trained BERT encoder for neural machine trans- Demonstrations, pages 66–71, Brussels, Belgium. lation. In Proceedings of the 3rd Workshop on Neu- Association for Computational Linguistics. ral Generation and Translation, pages 23–31, Hong Kong. Association for Computational Linguistics. Anne Lauscher, Vinit Ravishankar, Ivan Vulic, and Masoud Jalili Sabet, Philipp Dufter, François Yvon, Goran Glavas. 2020. From zero to hero: On the and Hinrich Schütze. 2020. SimAlign: High qual- limitations of zero-shot cross-lingual transfer with ity word alignments without parallel training data us- multilingual transformers. CoRR, abs/2005.00633. ing static and contextualized embeddings. In Find- ings of the Association for Computational Linguis- Hang Le, Loïc Vial, Jibril Frej, Vincent Segonne, Max- tics: EMNLP 2020, pages 1627–1643, Online. As- imin Coavoux, Benjamin Lecouteux, Alexandre Al- sociation for Computational Linguistics. lauzen, Benoît Crabbé, Laurent Besacier, and Didier Schwab. 2020. Flaubert: Unsupervised language Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika model pre-training for french. In Proceedings of Bali, and Monojit Choudhury. 2020a. The state and The 12th Language Resources and Evaluation Con- fate of linguistic diversity and inclusion in the NLP ference, pages 2479–2490, Marseille, France. Euro- world. In Proceedings of the 58th Annual Meet- pean Language Resources Association. ing of the Association for Computational Linguistics, pages 6282–6293, Online. Association for Computa- Patrick S. H. Lewis, Barlas Oguz, Ruty Rinott, Sebas- tional Linguistics. Riedel, and Holger Schwenk. 2020. MLQA: evaluating cross-lingual extractive question answer- Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika ing. In Proceedings of the 58th Annual Meeting of Bali, and Monojit Choudhury. 2020b. The state and the Association for Computational Linguistics, ACL fate of linguistic diversity and inclusion in the nlp 2020, Online, July 5-10, 2020, pages 7315–7330. world. arXiv preprint arXiv:2004.09095. Association for Computational Linguistics. Karthikeyan K, Zihan Wang, Stephen Mayhew, and Yaobo Liang, Nan Duan, Yeyun Gong, Ning Wu, Fen- Dan Roth. 2020. Cross-lingual ability of multi- fei Guo, Weizhen , Ming Gong, Linjun Shou, lingual BERT: an empirical study. 
In 8th Inter- Daxin Jiang, Guihong Cao, Xiaodong Fan, Ruofei national Conference on Learning Representations, Zhang, Rahul Agrawal, Edward Cui, Sining Wei, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, Taroon Bharti, Ying Qiao, Jiun-Hung Chen, Winnie 2020. OpenReview.net. Wu, Shuguang Liu, Fan Yang, Daniel Campos, Ran- Divyanshu Kakwani, Anoop Kunchukuttan, Satish gan Majumder, and Ming Zhou. 2020. XGLUE: A Golla, Gokul N.C., Avik Bhattacharyya, Mitesh M. new benchmark datasetfor cross-lingual pre-training, Khapra, and Pratyush Kumar. 2020. IndicNLPSuite: understanding and generation. In Proceedings of Monolingual Corpora, Evaluation Benchmarks and the 2020 Conference on Empirical Methods in Nat- Pre-trained Multilingual Language Models for In- ural Language Processing, EMNLP 2020, Online, dian Languages. In Findings of EMNLP. November 16-20, 2020, pages 6008–6018. Associ- ation for Computational Linguistics. Phillip Keung, Yichao Lu, Julian Salazar, and Vikas Bhardwaj. 2020. On the evaluation of contex- Jindrichˇ Libovicky,` Rudolf Rosa, and Alexander Fraser. tual embeddings for zero-shot cross-lingual transfer 2019. How language-neutral is multilingual bert? learning. CoRR, abs/2004.15001. arXiv preprint arXiv:1911.03310. Jindrichˇ Libovický, Rudolf Rosa, and Alexander Fraser. Louis Martin, Benjamin Muller, Pedro Javier Or- 2020. On the language neutrality of pre-trained mul- tiz Suárez, Yoann Dupont, Laurent Romary, Éric tilingual representations. In Findings of the Associ- de la Clergerie, Djamé Seddah, and Benoît Sagot. ation for Computational Linguistics: EMNLP 2020, 2020. CamemBERT: a tasty French language model. pages 1663–1674, Online. Association for Computa- In Proceedings of the 58th Annual Meeting of the tional Linguistics. Association for Computational Linguistics, pages 7203–7219, Online. Association for Computational Tomasz Limisiewicz, David Marecek, and Rudolf Rosa. Linguistics. 2020. Universal dependencies according to BERT: both more specific and more general. In Proceed- Tomas Mikolov, Quoc V Le, and Ilya Sutskever. 2013. ings of the 2020 Conference on Empirical Methods Exploiting similarities among languages for ma- in Natural Language Processing: Findings, EMNLP chine translation. arXiv preprint arXiv:1309.4168. 2020, Online Event, 16-20 November 2020, volume EMNLP 2020 of Findings of ACL, pages 2710–2722. Taesun Moon, Parul Awasthy, Jian Ni, and Radu Flo- Association for Computational Linguistics. rian. 2019. Towards lingua franca named entity recognition with BERT. CoRR, abs/1912.01389. Chi-Liang Liu, Tsung-Yuan Hsu, Yung-Sung Chuang, and Hung-yi Lee. 2020a. A study of cross-lingual Benjamin Müller, Antonis Anastasopoulos, Benoît ability and language-specific information in multilin- Sagot, and Djamé Seddah. 2020. When being un- gual BERT. CoRR, abs/2004.09205. seen from mbert is just the beginning: Handling Pengfei Liu, Jinlan Fu, Yang Xiao, Weizhe Yuan, new languages with multilingual language models. Shuaicheng Chang, Junqi Dai, Yixin Liu, Zihui- CoRR, abs/2010.12858. wen Ye, and Graham Neubig. 2021. Explainaboard: An explainable leaderboard for nlp. arXiv preprint Benjamin Muller, Benoit Sagot, and Djamé Seddah. arXiv:2104.06387. 2020. Can multilingual language models transfer to an unseen dialect? a case study on north african ara- Qianchu Liu, Diana McCarthy, Ivan Vulic,´ and Anna bizi. Korhonen. 2019a. Investigating cross-lingual align- ment methods for contextualized embeddings with Minh Nguyen, Viet Lai, Amir Pouran Ben Veyseh, and token-level evaluation. 
In Proceedings of the Thien Huu Nguyen. 2021. Trankit: A light-weight 23rd Conference on Computational Natural Lan- transformer-based toolkit for multilingual natural guage Learning (CoNLL), pages 33–43, Hong Kong, language processing. CoRR, abs/2101.03289. China. Association for Computational Linguistics. Toan Q. Nguyen and David Chiang. 2017. Trans- Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey fer learning across low-resource, related languages Edunov, Marjan Ghazvininejad, Mike Lewis, and for neural machine translation. In Proceedings of Luke Zettlemoyer. 2020b. Multilingual denoising the Eighth International Joint Conference on Natu- pre-training for neural machine translation. Transac- ral Language Processing (Volume 2: Short Papers), tions of the Association for Computational Linguis- pages 296–301, Taipei, Taiwan. Asian Federation of tics, 8:726–742. Natural Language Processing.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Joakim Nivre, Mitchell Abrams, Željko Agic,´ Lars Mandar Joshi, Danqi Chen, Omer Levy, M. Lewis, Ahrenberg, Lene Antonsen, Katya Aplonova, Luke Zettlemoyer, and Veselin Stoyanov. 2019b. Maria Jesus Aranzabe, Gashaw Arutie, Masayuki Roberta: A robustly optimized bert pretraining ap- Asahara, Luma Ateyah, Mohammed Attia, and proach. ArXiv, abs/1907.11692. et. al. 2018a. Universal dependencies 2.3. LINDAT/CLARIAH-CZ digital library at the Insti- Zihan Liu, Genta Indra Winata, Andrea Madotto, and Pascale Fung. 2020c. Exploring fine-tuning tech- tute of Formal and Applied Linguistics (ÚFAL), Fac- niques for pre-trained cross-lingual models via con- ulty of Mathematics and Physics, Charles Univer- tinual learning. CoRR, abs/2004.14218. sity.

Fuli Luo, Wei Wang, Jiahao Liu, Yijia Liu, Bin Bi, Joakim Nivre, Rogier Blokland, Niko Partanen, and Songfang Huang, Fei Huang, and Luo Si. 2021. Michael Rießler. 2018b. Universal dependencies {VECO}: Variable encoder-decoder pre-training for 2.2. cross-lingual understanding and generation. Franz Josef Och and Hermann Ney. 2003. A systematic Shuming Ma, Jian Yang, Haoyang Huang, Zewen Chi, comparison of various statistical alignment models. Li Dong, Dongdong Zhang, Hany Hassan Awadalla, Computational Linguistics, 29(1):19–51. Alexandre Muzio, Akiko Eriguchi, Saksham Sing- hal, Xia Song, Arul Menezes, and Furu Wei. 2020. Robert Östling and Jörg Tiedemann. 2016. Efficient XLM-T: scaling up multilingual machine translation word alignment with markov chain monte carlo. with pretrained cross-lingual transformer encoders. The Prague Bulletin of Mathematical Linguistics, CoRR, abs/2012.15547. 106(1):125–146. Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Youngbin Ro, Yukyung Lee, and Pilsung Kang. 2020. Hao Tian, Hua Wu, and Haifeng Wang. 2021. Ernie- Multi2oie: Multilingual open information extraction m: Enhanced multilingual representation by align- based on multi-head attention with bert. ing cross-lingual semantics with monolingual cor- pora. Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2020. A primer in bertology: What we know about Jonas Pfeiffer, Andreas Rücklé, Clifton Poth, Aish- how bert works. Transactions of the Association for warya Kamath, Ivan Vulic, Sebastian Ruder, Computational Linguistics, 8:842–866. Kyunghyun Cho, and Iryna Gurevych. 2020a. Samuel Rönnqvist, Jenna Kanerva, Tapio Salakoski, Adapterhub: A framework for adapting transform- and Filip Ginter. 2019. Is multilingual BERT flu- ers. CoRR, abs/2007.07779. ent in language generation? In Proceedings of the First NLPL Workshop on Deep Learning for Natural Jonas Pfeiffer, Ivan Vulic, Iryna Gurevych, and Se- Language Processing, pages 29–36, Turku, Finland. bastian Ruder. 2020b. MAD-X: an adapter-based Linköping University Electronic Press. framework for multi-task cross-lingual transfer. In Proceedings of the 2020 Conference on Empirical Uma Roy, Noah Constant, Rami Al-Rfou, Aditya Methods in Natural Language Processing, EMNLP Barua, Aaron Phillips, and Yinfei Yang. 2020. 2020, Online, November 16-20, 2020, pages 7654– LAReQA: Language-agnostic answer retrieval from 7673. Association for Computational Linguistics. a multilingual pool. In Proceedings of the 2020 Con- ference on Empirical Methods in Natural Language Jonas Pfeiffer, Ivan Vulic, Iryna Gurevych, and Sebas- Processing (EMNLP), pages 5919–5930, Online. As- tian Ruder. 2020c. Unks everywhere: Adapting mul- sociation for Computational Linguistics. tilingual language models to new scripts. CoRR, abs/2012.15562. Sebastian Ruder, Noah Constant, Jan Botha, Aditya Siddhant, Orhan Firat, Jinlan Fu, Pengfei Liu, Jason Phang, Iacer Calixto, Phu Mon Htut, Yada Junjie Hu, Graham Neubig, and Melvin John- Pruksachatkun, Haokun Liu, Clara Vania, Katha- son. 2021. XTREME-R: towards more challeng- rina Kann, and Samuel R. Bowman. 2020. English ing and nuanced multilingual evaluation. CoRR, intermediate-task training improves zero-shot cross- abs/2104.07412. lingual transfer too. In Proceedings of the 1st Con- M. Schuster and Kaisuke Nakajima. 2012. Japanese ference of the Asia-Pacific Chapter of the Associa- and korean voice search. 
2012 IEEE International tion for Computational Linguistics and the 10th In- Conference on Acoustics, Speech and Signal Pro- ternational Joint Conference on Natural Language cessing (ICASSP), pages 5149–5152. Processing, AACL/IJCNLP 2020, Suzhou, China, December 4-7, 2020, pages 557–575. Association Holger Schwenk and Xian Li. 2018. A corpus for mul- for Computational Linguistics. tilingual document classification in eight languages. In Proceedings of the Eleventh International Confer- Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. ence on Language Resources and Evaluation (LREC How multilingual is multilingual bert? In Pro- 2018), Paris, France. European Language Resources ceedings of the 57th Conference of the Association Association (ELRA). for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Pa- Rico Sennrich, Barry Haddow, and Alexandra Birch. pers, pages 4996–5001. Association for Computa- 2016a. Improving neural machine translation mod- tional Linguistics. els with monolingual data.

Edoardo M. Ponti, Goran Glavaš, Olga Majewska, Rico Sennrich, Barry Haddow, and Alexandra Birch. Qianchu Liu, Ivan Vulic,´ and Anna Korhonen. 2020. 2016b. Neural machine translation of rare words XCOPA: A multilingual dataset for causal common- with subword units. In Proceedings of the 54th An- sense reasoning. In Proceedings of the 2020 Con- nual Meeting of the Association for Computational ference on Empirical Methods in Natural Language Linguistics (Volume 1: Long Papers), pages 1715– Processing (EMNLP). 1725, Berlin, Germany. Association for Computa- tional Linguistics. Sello Ralethe. 2020. Adaptation of deep bidirectional Jasdeep Singh, Bryan McCann, Richard Socher, and transformers for Afrikaans language. In Proceed- Caiming Xiong. 2019. Bert is not an interlingua and ings of The 12th Language Resources and Eval- the bias of tokenization. In Proceedings of the 2nd uation Conference, pages 2475–2478, Marseille, Workshop on Deep Learning Approaches for Low- France. European Language Resources Association. Resource NLP (DeepLo 2019), pages 47–55. Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Richard Socher, Alex Perelygin, Jean Wu, Jason Vedaldi. 2017. Learning multiple visual domains Chuang, Christopher D Manning, Andrew Ng, and with residual adapters. In Advances in Neural Infor- Christopher Potts. 2013. Recursive deep models mation Processing Systems, volume 30. Curran As- for semantic compositionality over a sentiment tree- sociates, Inc. bank. In Proceedings of the 2013 conference on empirical methods in natural language processing, and Interpreting Neural Networks for NLP, Black- pages 1631–1642. boxNLP@EMNLP 2018, Brussels, Belgium, Novem- ber 1, 2018, pages 353–355. Association for Com- Yi Tay, Mostafa Dehghani, Dara Bahri, and Donald putational Linguistics. Metzler. 2020. Efficient transformers: A survey. arXiv preprint arXiv:2009.06732. Shuohuan Wang, Jiaxiang Liu, Xuan Ouyang, and Yu Sun. 2020a. Galileo at SemEval-2020 task 12: Erik F. Tjong Kim Sang. 2002. Introduction to the Multi-lingual learning for offensive language identi- CoNLL-2002 shared task: Language-independent fication using pre-trained language models. In Pro- named entity recognition. In COLING-02: The ceedings of the Fourteenth Workshop on Semantic 6th Conference on Natural Language Learning 2002 Evaluation, pages 1448–1455, Barcelona (online). (CoNLL-2002). International Committee for Computational Linguis- tics. Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Xinyu Wang, Yong Jiang, Nguyen Bach, Tao Wang, Language-independent named entity recognition. In Fei Huang, and Kewei Tu. 2020b. Structure-level Proceedings of the Seventh Conference on Natu- knowledge distillation for multilingual sequence la- ral Language Learning at HLT-NAACL 2003, pages beling. In Proceedings of the 58th Annual Meet- 142–147. ing of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 3317– Ahmet Üstün, Arianna Bisazza, Gosse Bouma, and 3330. Association for Computational Linguistics. Gertjan van Noord. 2020. Udapter: Language adap- tation for truly universal dependency parsing. CoRR, Yuxuan Wang, Wanxiang Che, Jiang Guo, Yijia Liu, abs/2004.14327. and Ting Liu. 2019. Cross-lingual BERT trans- formation for zero-shot dependency parsing. In Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Proceedings of the 2019 Conference on Empirical 2019. Representation learning with contrastive pre- Methods in Natural Language Processing and the dictive coding. 
9th International Joint Conference on Natural Lan- guage Processing, EMNLP-IJCNLP 2019, Hong Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Kong, China, November 3-7, 2019 Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz , pages 5720– Kaiser, and Illia Polosukhin. 2017. Attention is all 5726. Association for Computational Linguistics. you need. CoRR, abs/1706.03762. Zihan Wang, Karthikeyan K, Stephen Mayhew, and Marko Vidoni, Ivan Vulic, and Goran Glavas. 2020. Dan Roth. 2020c. Extending multilingual BERT to Orthogonal language and task adapters in zero-shot low-resource languages. In Proceedings of the 2020 cross-lingual transfer. CoRR, abs/2012.06460. Conference on Empirical Methods in Natural Lan- guage Processing: Findings, EMNLP 2020, Online Antti Virtanen, Jenna Kanerva, Rami Ilo, Jouni Luoma, Event, 16-20 November 2020, volume EMNLP 2020 Juhani Luotolahti, Tapio Salakoski, Filip Ginter, and of Findings of ACL, pages 2649–2656. Association Sampo Pyysalo. 2019. Multilingual is not enough: for Computational Linguistics. BERT for finnish. CoRR, abs/1912.07076. Zirui Wang, Jiateng Xie, Ruochen Xu, Yiming Yang, Elena Voita, David Talbot, Fedor Moiseev, Rico Sen- Graham Neubig, and Jaime G. Carbonell. 2020d. nrich, and Ivan Titov. 2019. Analyzing multi-head Cross-lingual alignment vs joint training: A com- self-attention: Specialized heads do the heavy lift- parative study and A simple unified framework. In ing, the rest can be pruned. In Proceedings of the 8th International Conference on Learning Represen- 57th Annual Meeting of the Association for Com- tations, ICLR 2020, Addis Ababa, Ethiopia, April putational Linguistics, pages 5797–5808, Florence, 26-30, 2020. OpenReview.net. Italy. Association for Computational Linguistics. Xiangpeng Wei, Rongxiang Weng, Yue Hu, Luxi Xing, Alex Wang, Amanpreet Singh, Julian Michael, Fe- Heng Yu, and Weihua Luo. 2021. On learning uni- lix Hill, Omer Levy, and Samuel Bowman. 2018a. versal representations across languages. In 9th Inter- GLUE: A multi-task benchmark and analysis plat- national Conference on Learning Representations, form for natural language understanding. In Pro- ICLR 2021, Virtual Event, Austria, May 3-7, 2021. ceedings of the 2018 EMNLP Workshop Black- OpenReview.net. boxNLP: Analyzing and Interpreting Neural Net- works for NLP, pages 353–355, Brussels, Belgium. Adina Williams, Nikita Nangia, and Samuel Bowman. Association for Computational Linguistics. 2018. A broad-coverage challenge corpus for sen- tence understanding through inference. In Proceed- Alex Wang, Amanpreet Singh, Julian Michael, Fe- ings of the 2018 Conference of the North American lix Hill, Omer Levy, and Samuel R. Bowman. Chapter of the Association for Computational Lin- 2018b. GLUE: A multi-task benchmark and guistics: Human Language Technologies, Volume analysis platform for natural language understand- 1 (Long Papers), pages 1112–1122. Association for ing. In Proceedings of the Workshop: Analyzing Computational Linguistics. Shijie Wu and Mark Dredze. 2019. Beto, bentz, becas: Wenzhao Zheng, Zhaodong Chen, Jiwen Lu, and Jie The surprising cross-lingual effectiveness of BERT. Zhou. 2019. Hardness-aware deep metric learn- In Proceedings of the 2019 Conference on Empiri- ing. In Proceedings of the IEEE/CVF Conference on cal Methods in Natural Language Processing and Computer Vision and Pattern Recognition (CVPR). 
the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Jinhua Zhu, Yingce Xia, Lijun Wu, Di He, Tao Qin, Kong, China, November 3-7, 2019, pages 833–844. Wengang Zhou, Houqiang Li, and Tie-Yan Liu. Association for Computational Linguistics. 2020. Incorporating BERT into neural machine translation. CoRR, abs/2002.06823. Shijie Wu and Mark Dredze. 2020. Are all languages created equal in multilingual bert? In Proceedings Pierre Zweigenbaum, Serge Sharoff, and Reinhard of the 5th Workshop on Representation Learning for Rapp. 2017. Overview of the second BUCC shared NLP, RepL4NLP@ACL 2020, Online, July 9, 2020, task: Spotting parallel sentences in comparable cor- pages 120–130. Association for Computational Lin- pora. In Proceedings of the 10th Workshop on Build- guistics. ing and Using Comparable Corpora, pages 60–67, Vancouver, Canada. Association for Computational Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Linguistics. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin John- son, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rud- nick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144. Linting Xue, Noah Constant, Adam Roberts, Mi- hir Kale, Rami Al-Rfou, Aditya Siddhant, Aditya Barua, and Colin Raffel. 2021. mt5: A massively multilingual pre-trained text-to-text transformer. Jian Yang, Shuming Ma, Dongdong Zhang, ShuangZhi Wu, Zhoujun Li, and Ming Zhou. 2020. Alternat- ing language modeling for cross-lingual pre-training. Proceedings of the AAAI Conference on Artificial In- telligence, 34(05):9386–9393. Yinfei Yang, Yuan Zhang, Chris Tar, and Jason Baldridge. 2019. PAWS-X: A Cross-lingual Adver- sarial Dataset for Paraphrase Identification. In Proc. of EMNLP. Marcos Zampieri, Preslav Nakov, Sara Rosenthal, Pepa Atanasova, Georgi Karadzhov, Hamdy Mubarak, Leon Derczynski, Zeses Pitenis, and Çagri Çöltekin. 2020. Semeval-2020 task 12: Multilingual offensive language identification in social media (offenseval 2020). In Proceedings of the Fourteenth Workshop on Semantic Evaluation, SemEval@COLING 2020, Barcelona (online), December 12-13, 2020, pages 1425–1447. International Committee for Computa- tional Linguistics. Thomas Zenkel, Joern Wuebker, and John DeNero. 2020. End-to-end neural word alignment outper- forms GIZA++. In Proceedings of the 58th Annual Meeting of the Association for Computational Lin- guistics, pages 1605–1617, Online. Association for Computational Linguistics. Wei Zhao, Steffen Eger, Johannes Bjerva, and Isabelle Augenstein. 2020. Inducing language-agnostic mul- tilingual representations. CoRR, abs/2008.09112.