Arxiv:1812.10464V2 [Cs.CL] 25 Sep 2019 While the Recent Advent of Deep Learning Has Led to Sentations for 93 Languages (See Table1)
Total Page:16
File Type:pdf, Size:1020Kb
Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond Mikel Artetxe Holger Schwenk University of the Basque Country (UPV/EHU)∗ Facebook AI Research [email protected] [email protected] Abstract Pennington et al., 2014), but has recently been su- perseded by sentence-level representations (Peters We introduce an architecture to learn joint et al., 2018; Devlin et al., 2019). Nevertheless, all multilingual sentence representations for 93 these works learn a separate model for each lan- languages, belonging to more than 30 dif- guage and are thus unable to leverage information ferent families and written in 28 differ- across different languages, greatly limiting their ent scripts. Our system uses a single BiLSTM encoder with a shared BPE vo- potential performance for low-resource languages. cabulary for all languages, which is cou- In this work, we are interested in universal pled with an auxiliary decoder and trained on publicly available parallel corpora. This language agnostic sentence embeddings, that is, enables us to learn a classifier on top of vector representations of sentences that are gen- the resulting embeddings using English an- eral with respect to two dimensions: the input lan- notated data only, and transfer it to any guage and the NLP task. The motivations for such of the 93 languages without any modifi- representations are multiple: the hope that lan- cation. Our experiments in cross-lingual guages with limited resources benefit from joint natural language inference (XNLI dataset), training over many languages, the desire to per- cross-lingual document classification (ML- Doc dataset) and parallel corpus mining form zero-shot transfer of an NLP model from (BUCC dataset) show the effectiveness of one language (typically English) to another, and our approach. We also introduce a new the possibility to handle code-switching. To that test set of aligned sentences in 112 lan- end, we train a single encoder to handle multiple guages, and show that our sentence em- languages, so that semantically similar sentences beddings obtain strong results in multilin- in different languages are close in the embedding gual similarity search even for low-resource space. languages. Our implementation, the pre- trained encoder and the multilingual test set While previous work in multilingual NLP has are available at https://github.com/ been limited to either a few languages (Schwenk facebookresearch/LASER. and Douze, 2017; Yu et al., 2018) or specific appli- cations like typology prediction (Malaviya et al., 1 Introduction 2017) or machine translation (Neubig and Hu, 2018), we learn general purpose sentence repre- arXiv:1812.10464v2 [cs.CL] 25 Sep 2019 While the recent advent of deep learning has led to sentations for 93 languages (see Table1). Using a impressive progress in Natural Language Process- single pre-trained BiLSTM encoder for all the 93 ing (NLP), these techniques are known to be par- languages, we obtain very strong results in various ticularly data hungry, limiting their applicability in scenarios without any fine-tuning, including cross- many practical scenarios. An increasingly popular lingual natural language inference (XNLI dataset), approach to alleviate this issue is to first learn gen- cross-lingual classification (MLDoc dataset), bi- eral language representations on unlabeled data, text mining (BUCC dataset) and a new multilin- which are then integrated in task-specific down- gual similarity search dataset we introduce cover- stream systems. This approach was first popular- ing 112 languages. To the best of our knowledge, ized by word embeddings (Mikolov et al., 2013b; this is the first exploration of general purpose mas- This∗ work was performed during an internship at Face- sively multilingual sentence representations across book AI Research. a large variety of tasks. 2 Related work length vector representation, which is used by the decoder to create the target sequence. This de- Following the success of word embeddings coder is then discarded, and the encoder is kept to (Mikolov et al., 2013b; Pennington et al., 2014), embed sentences in any of the training languages. there has been an increasing interest in learn- While some proposals use a separate encoder for ing continuous vector representations of longer each language (Schwenk and Douze, 2017), shar- linguistic units like sentences (Le and Mikolov, ing a single encoder for all languages also gives 2014; Kiros et al., 2015). These sentence embed- strong results (Schwenk, 2018). dings are commonly obtained using a Recurrent Nevertheless, most existing work is either lim- Neural Network (RNN) encoder, which is typi- ited to few, rather close languages (Schwenk and cally trained in an unsupervised way over large Douze, 2017; Yu et al., 2018) or, more commonly, collections of unlabelled corpora. For instance, consider pairwise joint embeddings with English the skip-thought model of Kiros et al.(2015) cou- and one foreign language (España-Bonet et al., ple the encoder with an auxiliary decoder, and 2017; Guo et al., 2018). To the best of our knowl- train the entire system to predict the surrounding edge, existing work on learning multilingual rep- sentences over a collection of books. It was later resentations for a large number of languages is shown that more competitive results could be ob- limited to word embeddings (Ammar et al., 2016; tained by training the encoder over labeled Natu- Dufter et al., 2018) or specific applications like ty- ral Language Inference (NLI) data (Conneau et al., pology prediction (Malaviya et al., 2017) or ma- 2017). This was later extended to multitask learn- chine translation (Neubig and Hu, 2018), ours be- ing, combining different training objectives like ing the first paper exploring general purpose mas- that of skip-thought, NLI and machine translation sively multilingual sentence representations. (Cer et al., 2018; Subramanian et al., 2018). All the previous approaches learn a fixed-length While the previous methods consider a single representation for each sentence. A recent re- language at a time, multilingual representations search line has obtained very strong results using have attracted a large attention in recent times. variable-length representations instead, consisting Most of this research focuses on cross-lingual of contextualized embeddings of the words in the word embeddings (Ruder et al., 2017), which are sentence (Dai and Le, 2015; Peters et al., 2018; commonly learned jointly from parallel corpora Howard and Ruder, 2018; Devlin et al., 2019). For (Gouws et al., 2015; Luong et al., 2015). An al- that purpose, these methods train either an RNN ternative approach that is becoming increasingly or self-attentional encoder over unnanotated cor- popular is to separately train word embeddings for pora using some form of language modeling. A each language, and map them to a shared space classifier can then be learned on top of the result- based on a bilingual dictionary (Mikolov et al., ing encoder, which is commonly further fine-tuned 2013a; Artetxe et al., 2018a) or even in a fully un- during this supervised training. Concurrent to supervised manner (Conneau et al., 2018a; Artetxe our work, Lample and Conneau(2019) propose a et al., 2018b). Cross-lingual word embeddings are cross-lingual extension of these models, and report often used to build bag-of-word representations of strong results in cross-lingual natural language in- longer linguistic units by taking their respective ference, machine translation and language mod- (IDF-weighted) average (Klementiev et al., 2012; eling. In contrast, our focus is on scaling to a Dufter et al., 2018). While this approach has the large number of languages, for which we argue advantage of requiring weak or no cross-lingual that fixed-length approaches provide a more ver- signal, it has been shown that the resulting sen- satile and compatible representation form.1 Also, tence embeddings work poorly in practical cross- our approach achieves strong results without task- lingual transfer settings (Conneau et al., 2018b). specific fine-tuning, which makes it interesting for A more competitive approach that we follow tasks with limited resources. here is to use a sequence-to-sequence encoder- decoder architecture (Schwenk and Douze, 2017; 1For instance, there is not always a one-to-one correspon- Hassan et al., 2018). The full system is trained dence among words in different languages (e.g. a single word end-to-end on parallel corpora akin to multilingual of a morphologically complex language might correspond to several words of a morphologically simple language), so hav- neural machine translation (Johnson et al., 2017): ing a separate vector for each word might not transfer as well the encoder maps the source sequence into a fixed- across languages. ENCODER sent emb DECODER max pooling y1 y2 </s> BiLSTM BiLSTM … BiLSTM softmax softmax … softmax … … … W LSTM LSTM … LSTM BiLSTM BiLSTM … BiLSTM sent BPE L sent BPE L … sent BPE L BPE emb BPE emb … BPE emb id id id <s> y1 yn x1 x2 </s> Figure 1: Architecture of our system to learn multilingual sentence embeddings. 3 Proposed method resulting sentence representations (after concate- nating both directions) are 1024 dimensional. The We use a single, language agnostic BiLSTM en- decoder has always one layer of dimension 2048. coder to build our sentence embeddings, which The input embedding size is set to 320, while the is coupled with an auxiliary decoder and trained language ID embedding has 32 dimensions. on parallel corpora. From Section 3.1 to 3.3, we describe its architecture, our training strategy to 3.2 Training strategy scale to 93 languages, and the training data used In preceding work (Schwenk and Douze, 2017; for that purpose. Schwenk, 2018), each input sentence was jointly 3.1 Architecture translated into all other languages. However, this approach has two obvious drawbacks when trying Figure1 illustrates the architecture of the proposed to scale to a large number of languages. First, it system, which is based on Schwenk(2018).