Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond

Mikel Artetxe
University of the Basque Country (UPV/EHU)∗
[email protected]

Holger Schwenk
Facebook AI Research
[email protected]

Abstract

We introduce an architecture to learn joint multilingual sentence representations for 93 languages, belonging to more than 30 different families and written in 28 different scripts. Our system uses a single BiLSTM encoder with a shared BPE vocabulary for all languages, which is coupled with an auxiliary decoder and trained on publicly available parallel corpora. This enables us to learn a classifier on top of the resulting embeddings using English annotated data only, and transfer it to any of the 93 languages without any modification. Our experiments in cross-lingual natural language inference (XNLI dataset), cross-lingual document classification (MLDoc dataset) and parallel corpus mining (BUCC dataset) show the effectiveness of our approach.

Learning general language representations on unlabeled data was first popularized by word embeddings (Pennington et al., 2014), but has recently been superseded by sentence-level representations (Peters et al., 2018; Devlin et al., 2019). Nevertheless, all these works learn a separate model for each language and are thus unable to leverage information across different languages, greatly limiting their potential performance for low-resource languages.

In this work, we are interested in universal language-agnostic sentence embeddings, that is, vector representations of sentences that are general with respect to two dimensions: the input language and the NLP task. The motivations for such representations are multiple: the hope that languages with limited resources benefit from joint training over many languages, the desire to perform zero-shot transfer of an NLP model from one language (typically English) to another, and the possibility to handle code-switching.
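To make this setup concrete, the following is a minimal PyTorch sketch of the two pieces involved: a single BiLSTM encoder shared across all languages over a joint BPE vocabulary, and a small classifier trained on English embeddings only and reused unchanged for other languages. The pooling strategy (max-pooling over hidden states), layer count, dimensions, and class names are illustrative assumptions rather than the exact configuration of the system, and the auxiliary decoder used during training is omitted.

```python
import torch
import torch.nn as nn


class MultilingualEncoder(nn.Module):
    """Single BiLSTM encoder shared by all input languages.

    Tokens come from one joint BPE vocabulary, so there are no
    language-specific parameters. A fixed-size sentence embedding is
    obtained by max-pooling the BiLSTM states over time (an illustrative
    choice, as are the dimensions below).
    """

    def __init__(self, vocab_size=50000, embed_dim=320,
                 hidden_dim=512, pad_id=0):
        super().__init__()
        self.pad_id = pad_id
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_id)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim,
                              bidirectional=True, batch_first=True)

    def forward(self, token_ids, lengths):
        # token_ids: (batch, seq_len) BPE ids; lengths: (batch,) true lengths
        x = self.embed(token_ids)
        packed = nn.utils.rnn.pack_padded_sequence(
            x, lengths.cpu(), batch_first=True, enforce_sorted=False)
        out, _ = self.bilstm(packed)
        out, _ = nn.utils.rnn.pad_packed_sequence(out, batch_first=True)
        # Mask padded positions so they cannot win the max.
        mask = (token_ids[:, :out.size(1)] != self.pad_id).unsqueeze(-1)
        out = out.masked_fill(~mask, float("-inf"))
        # (batch, 2 * hidden_dim) language-agnostic sentence embedding
        return out.max(dim=1).values


class TaskClassifier(nn.Module):
    """Classifier fit on top of frozen sentence embeddings using English
    annotated data only; because all languages share one embedding space,
    it is then applied to any other language without modification."""

    def __init__(self, embed_dim=1024, num_classes=3):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(embed_dim, 256), nn.ReLU(),
                                 nn.Linear(256, num_classes))

    def forward(self, sentence_embedding):
        return self.mlp(sentence_embedding)
```

In this sketch, the encoder would be trained jointly with the auxiliary decoder on parallel corpora; at transfer time it is frozen, and only the classifier is trained, e.g. on English annotated data such as XNLI, before being evaluated directly on the remaining languages.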