Bilingual Word Embeddings from Non-Parallel Document-Aligned Data Applied to Bilingual Lexicon Induction

Ivan Vulić and Marie-Francine Moens
Department of Computer Science, KU Leuven, Belgium
{ivan.vulic|marie-francine.moens}@cs.kuleuven.be

Abstract

We propose a simple yet effective approach to learning bilingual word embeddings (BWEs) from non-parallel document-aligned data (based on the omnipresent skip-gram model), and its application to bilingual lexicon induction (BLI). We demonstrate the utility of the induced BWEs in the BLI task by reporting on benchmarking BLI datasets for three language pairs: (1) We show that our BWE-based BLI models significantly outperform the MuPTM-based and context-counting models in this setting, and obtain the best reported BLI results for all three tested language pairs; (2) We also show that our BWE-based BLI models outperform other BLI models based on recently proposed BWEs that require parallel data for bilingual training.

1 Introduction

Dense real-valued vectors known as distributed representations of words or word embeddings (WEs) (Bengio et al., 2003; Collobert and Weston, 2008; Mikolov et al., 2013a; Pennington et al., 2014) have been introduced recently as part of neural network architectures for statistical language modeling. Recent studies (Levy and Goldberg, 2014; Levy et al., 2015) have showcased a direct link and comparable performance to "more traditional" distributional models (Turney and Pantel, 2010), but the skip-gram model with negative sampling (SGNS) (Mikolov et al., 2013c) is still established as the state-of-the-art word representation model, due to its simplicity, fast training, as well as its solid and robust performance across a wide variety of semantic tasks (Baroni et al., 2014; Levy et al., 2015).

A natural extension of interest from monolingual to multilingual word embeddings has occurred recently (Klementiev et al., 2012; Zou et al., 2013; Mikolov et al., 2013b; Hermann and Blunsom, 2014a; Hermann and Blunsom, 2014b; Gouws et al., 2014; Chandar et al., 2014; Soyer et al., 2015; Luong et al., 2015). When operating in multilingual settings, it is highly desirable to learn embeddings for words denoting similar concepts that are very close in the shared inter-lingual embedding space (e.g., the representations for the English word school and the Spanish word escuela should be very similar). These shared inter-lingual embedding spaces may then be used in a myriad of multilingual natural language processing tasks, such as the fundamental tasks of computing cross-lingual and multilingual semantic word similarity and bilingual lexicon induction (BLI). However, all these models critically require at least sentence-aligned parallel data and/or readily available translation dictionaries to induce bilingual word embeddings (BWEs) that are consistent and closely aligned over languages in the same semantic space.

Contributions In this work, we alleviate these requirements: (1) We present the first model that is able to induce bilingual word embeddings from non-parallel data without any other readily available translation resources such as pre-given bilingual lexicons; (2) We demonstrate the utility of BWEs induced by this simple yet effective model in the BLI task from comparable Wikipedia data on benchmarking datasets for three language pairs (Vulić and Moens, 2013b). Our BLI model based on our novel BWEs significantly outperforms a series of strong baselines that reported previous best scores on these datasets in the same learning setting, as well as other BLI models based on recently proposed BWE induction models (Gouws et al., 2014; Chandar et al., 2014). The focus of the work is on learning lexicons from document-aligned comparable corpora (e.g., Wikipedia articles aligned through inter-wiki links).


Figure 1: The architecture of our BWE Skip-Gram model for learning bilingual word embeddings from document-aligned comparable data. Source language words and documents are drawn as gray boxes, while target language words and documents are drawn as blue boxes. The right side of the figure (separated by a vertical dashed line) illustrates how a pseudo-bilingual document is constructed from a pair of aligned documents; the two documents are first merged, and then words in the pseudo-bilingual document are randomly shuffled to ensure that both source and target language words occur as context words.

2 Model Architecture

In the following architecture description, we assume that the reader is familiar with the main assumptions and training procedure of SGNS (Mikolov et al., 2013a; Mikolov et al., 2013c). We extend the SGNS model to work with bilingual document-aligned comparable data. An overview of our architecture for learning BWEs from such comparable data is given in fig. 1.

Let us assume that we possess a document-aligned comparable corpus which is defined as C = {d_1, d_2, ..., d_N} = {(d_1^S, d_1^T), (d_2^S, d_2^T), ..., (d_N^S, d_N^T)}, where d_j = (d_j^S, d_j^T) denotes a pair of aligned documents in the source language L^S and the target language L^T, respectively, and N is the number of documents in the corpus. V^S and V^T are vocabularies associated with languages L^S and L^T. The goal is to learn word embeddings for all words in both V^S and V^T that will be semantically coherent and closely aligned over languages in a shared cross-lingual word embedding space.

In the first step, we merge two documents d_j^S and d_j^T from the aligned document pair d_j into a single "pseudo-bilingual" document d_j' and remove sentence boundaries. Following that, we randomly shuffle the newly constructed pseudo-bilingual document. The intuition behind this pre-training completely random shuffling step [1] (see fig. 1) is to assure that each word w, regardless of its actual language, obtains word collocates from both vocabularies. The idea of having bilingual contexts for each pivot word in each pseudo-bilingual document will steer the final model towards constructing a shared inter-lingual embedding space. Since the model depends on the alignment at the document level, in order to ensure the bilingual contexts instead of monolingual contexts, it is intuitive to assume that larger window sizes will lead to better bilingual embeddings. We test this hypothesis and the effect of window size in sect. 4.

The final model called BWE Skip-Gram (BWESG) then relies on the monolingual variant of skip-gram trained on the shuffled pseudo-bilingual documents [2]. The model learns word embeddings for source and target language words that are aligned over the d embedding dimensions and may be represented in the same shared cross-lingual embedding space. The BWESG-based representation of word w, regardless of its actual language, is then a d-dimensional vector: w = [f_1, ..., f_k, ..., f_d], where f_k ∈ R denotes the score for the k-th inter-lingual feature within the d-dimensional shared embedding space. Since all words share the embedding space, semantic similarity between words may be computed both monolingually and across languages. Given w, the most similar word cross-lingually should be its one-to-one translation, and we may use this intuition to induce one-to-one bilingual lexicons from comparable data.

In another interpretation, BWESG builds BWEs based on (pseudo-bilingual) document-level co-occurrence. The window size parameter then just controls the amount of random data dropout. With larger windows, the model becomes prohibitively computationally expensive, but in sect. 4 we show that the BLI performance flattens out for "reasonably large" windows.

[1] In this paper, we investigate only the random shuffling procedure and show that the model is fairly robust to different outputs of the procedure if the window size is large enough. As one line of future work, we plan to investigate other, more systematic and deterministic shuffling algorithms.
[2] We were also experimenting with GloVe and CBOW, but they were falling behind SGNS on average.
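To make the pseudo-bilingual document construction concrete, here is a minimal Python sketch of the merge-and-shuffle step described above. This is our own illustration rather than the authors' released code; the function names (make_pseudo_bilingual, build_training_corpus) and the fixed random seed are assumptions for the example.

```python
import random

def make_pseudo_bilingual(doc_s, doc_t, rng):
    """Merge one aligned document pair (token lists with sentence
    boundaries already removed) into a single pseudo-bilingual document
    and randomly shuffle it, so that the skip-gram context window of
    every word can contain words from both languages."""
    merged = list(doc_s) + list(doc_t)
    rng.shuffle(merged)   # the pre-training random shuffling step
    return merged

def build_training_corpus(aligned_pairs, seed=0):
    """aligned_pairs: list of (source_tokens, target_tokens) pairs,
    one per aligned Wikipedia article pair."""
    rng = random.Random(seed)
    return [make_pseudo_bilingual(ds, dt, rng) for ds, dt in aligned_pairs]
```

Repeating the procedure with different seeds would yield the "10 random corpora shuffles" used later in the experiments.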

3 Experimental Setup

Training Data We use comparable Wikipedia data introduced in (Vulić and Moens, 2013a; Vulić and Moens, 2013b), available in three language pairs, to induce bilingual word embeddings: (i) a collection of 13,696 Spanish-English Wikipedia article pairs (ES-EN), (ii) a collection of 18,898 Italian-English Wikipedia article pairs (IT-EN), and (iii) a collection of 7,612 Dutch-English Wikipedia article pairs (NL-EN). All corpora are theme-aligned comparable corpora, that is, the aligned document pairs discuss similar themes, but are in general not direct translations. Following prior work (Haghighi et al., 2008; Prochasson and Fung, 2011; Vulić and Moens, 2013b), we retain only nouns that occur at least 5 times in the corpus. Lemmatized word forms are recorded when available, and original forms otherwise. TreeTagger (Schmid, 1994) is used for POS tagging and lemmatization. After this preprocessing, the vocabularies comprise between 7,000 and 13,000 noun types for each language in each language pair. Exactly the same training data and vocabularies are used to induce bilingual lexicons with all other BLI models in comparison.

BWESG Training Setup We have trained the BWESG model with random shuffling on 10 random corpora shuffles for all three training corpora with the following parameters from the word2vec package (Mikolov et al., 2013c): stochastic gradient descent with a default learning rate of 0.025, negative sampling with 25 samples, and a subsampling rate of value 1e-4. All models are trained for 15 epochs. We have varied the number of embedding dimensions: d = 100, 200, 300, and have also trained the model with d = 40 to be directly comparable to pre-trained state-of-the-art BWEs from (Gouws et al., 2014; Chandar et al., 2014). Moreover, in order to test the effect of window size on final results, we have varied the maximum window size cs from 4 to 60 in steps of 4 [3]. Since cosine is used for all similarity computations in the BLI task, we call our new BLI model BWESG+cos.

Baseline BLI Models We compare BWESG+cos to a series of state-of-the-art BLI models from document-aligned comparable data:
(1) BiLDA-BLI - A BLI model that relies on the induction of latent cross-lingual topics (Mimno et al., 2009) by the bilingual LDA model and represents words as probability distributions over these topics (Vulić et al., 2011).
(2) Assoc-BLI - A BLI model that represents words as vectors of association norms (Roller and Schulte im Walde, 2013) over both vocabularies, where these norms are computed using a multilingual topic model (Vulić and Moens, 2013a).
(3) PPMI+cos - A standard distributional model for BLI relying on positive pointwise mutual information and cosine similarity (Bullinaria and Levy, 2007). The seed lexicon is bootstrapped using the method from (Peirsman and Padó, 2011; Vulić and Moens, 2013b).
All parameters of the baseline BLI models (i.e., topic models and their settings, the number of dimensions K, feature pruning values, window size) are set to their optimal values according to suggestions in prior work (Steyvers and Griffiths, 2007; Vulić and Moens, 2013a; Vulić and Moens, 2013b; Kiela and Clark, 2014). Due to space constraints, for (much) more details about the baselines we point to the relevant literature (Peirsman and Padó, 2011; Tamura et al., 2012; Vulić and Moens, 2013a; Vulić and Moens, 2013b).

Test Data For each language pair, we evaluate on standard 1,000 ground truth one-to-one translation pairs built for the three language pairs (ES/IT/NL-EN) (Vulić and Moens, 2013a; Vulić and Moens, 2013b). Translation direction is ES/IT/NL → EN.

Evaluation Metrics Since we can build a one-to-one bilingual lexicon by harvesting one-to-one translation pairs, the lexicon quality is best reflected in the Acc1 score, that is, the number of source language (ES/IT/NL) words w_i^S from ground truth translation pairs for which the top ranked word cross-lingually is the correct translation in the other language (EN) according to the ground truth, over the total number of ground truth translation pairs (=1000) (Gaussier et al., 2004; Tamura et al., 2012; Vulić and Moens, 2013b).

[3] We will make all our BWESG BWEs available at: http://people.cs.kuleuven.be/~ivan.vulic/
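As a rough sketch of the training setup above, the following code trains the skip-gram model on the shuffled pseudo-bilingual documents using gensim's Word2Vec as a stand-in for the original word2vec package (a recent gensim version with the vector_size/epochs keywords is assumed; pseudo_docs and train_bwesg are our own names, and the hyperparameter values are those reported in this section).

```python
from gensim.models import Word2Vec

def train_bwesg(pseudo_docs, d=200, cs=48):
    """Train monolingual SGNS on the shuffled pseudo-bilingual documents
    with the hyperparameters reported above."""
    return Word2Vec(
        sentences=pseudo_docs,  # one shuffled pseudo-bilingual document per article pair
        sg=1,                   # skip-gram
        negative=25,            # negative sampling with 25 samples
        sample=1e-4,            # subsampling rate
        alpha=0.025,            # default initial learning rate
        vector_size=d,          # number of embedding dimensions d
        window=cs,              # maximum window size cs
        min_count=5,            # nouns occurring at least 5 times
        epochs=15,              # 15 training epochs
    )
```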

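A hedged sketch of BWESG+cos retrieval together with the Acc1 metric follows; again this is our own illustration, where acc1 is our function name, gold_pairs is assumed to be a list of (source word, gold EN translation) tuples, target_vocab the EN noun vocabulary, and model the trained gensim model from the previous sketch.

```python
import numpy as np

def acc1(model, gold_pairs, target_vocab):
    """Acc1: fraction of source words whose most similar word in the
    target-language (EN) vocabulary, by cosine similarity in the shared
    embedding space, is the gold one-to-one translation."""
    targets = [t for t in target_vocab if t in model.wv]
    T = np.stack([model.wv[t] for t in targets])
    T = T / np.linalg.norm(T, axis=1, keepdims=True)   # unit-normalise target vectors
    correct = 0
    for src, gold in gold_pairs:
        if src not in model.wv:
            continue
        v = model.wv[src]
        v = v / np.linalg.norm(v)
        best = targets[int(np.argmax(T @ v))]          # nearest cross-lingual neighbour
        correct += int(best == gold)
    return correct / len(gold_pairs)
```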
Spanish-English (ES-EN), query: reina
(1) Spanish      (2) English      (3) Combined
rey              queen(+)         queen(+)
trono            heir             rey
monarca          throne           trono
heredero         king             heir
matrimonio       royal            throne
hijo             reign            monarca
reino            succession       heredero
reinado          princess         king
regencia         marriage         matrimonio
duque            prince           royal

Italian-English (IT-EN), query: madre
(1) Italian      (2) English      (3) Combined
padre            mother(+)        mother(+)
moglie           father           padre
sorella          sister           moglie
figlia           wife             father
figlio           daughter         sorella
fratello         son              figlia
casa             friend           figlio
amico            childhood        sister
marito           family           fratello
donna            cousin           wife

Dutch-English (NL-EN), query: schilder
(1) Dutch             (2) English      (3) Combined
kunstschilder         painter(+)       painter(+)
schilderij            painting         kunstschilder
kunstenaar            portrait         painting
olieverf              artist           schilderij
olieverfschilderij    canvas           kunstenaar
schilderen            impressionist    portrait
frans                 cubism           olieverf
nederlands            art              olieverfschilderij
componist             poet             schilderen
beeldhouwer           drawing          artist

Table 1: Example lists of the top 10 semantically similar words for all 3 language pairs obtained using BWESG+cos; d = 200, cs = 48. Column (1): only source language words (ES/IT/NL) are listed while target language words are skipped (monolingual similarity); (2): only target language words (EN) are listed (cross-lingual similarity); (3): words from both languages are listed (multilingual similarity). The correct one-to-one translation for each source word is marked by (+).

4 Results and Discussion

Exp 0: Qualitative Analysis Tab. 1 displays the top 10 semantically similar words monolingually, across languages and combined/multilingually for one ES, IT and NL word. The BWESG+cos model is able to find semantically coherent lists of words for all three directions of similarity (i.e., monolingual, cross-lingual, multilingual). In the combined (multilingual) ranked lists, words from both languages are represented as top similar words. This initial qualitative analysis already demonstrates the ability of BWESG to induce a shared cross-lingual embedding space using only document alignments as bilingual signals.

Exp I: BWESG+cos vs. Baseline Models In the first experiment, we test whether our BWESG+cos BLI model produces better results than the baseline BLI models which obtain the current state-of-the-art results for BLI from comparable data on these test sets. Tab. 2 summarizes the BLI results. As the most striking finding, the results reveal the superior performance of the BWESG+cos model for BLI, which relies on our new framework for inducing bilingual word embeddings, over other BLI models relying on previously used bilingual word representations. The relative increase in Acc1 scores over the best scoring baseline BLI models from comparable data is 19.4% for the ES-EN pair, 6.1% for IT-EN (significant at p < 0.05 using McNemar's test) and 65.4% for NL-EN. For large enough values of cs (cs >= 20) (see also fig. 2(a)-2(c)), almost all BWESG+cos models for all language pairs outperform the highest baseline results. We may also observe that the performance of BWESG+cos is fairly stable for all models with larger values of cs (cs >= 20). This finding reveals that even a coarse tuning of these parameters might lead to optimal or near-optimal scores in the BLI task with BWESG+cos.

Pair:                    ES-EN   IT-EN   NL-EN
Model                    Acc1    Acc1    Acc1
BiLDA-BLI                0.441   0.575   0.237
Assoc-BLI                0.518   0.618   0.236
PPMI+cos                 0.577   0.647   0.206
BWESG+cos d:100,cs:16    0.617   0.599   0.300
          d:100,cs:48    0.667   0.669   0.389
          d:200,cs:16    0.613   0.601   0.254
          d:200,cs:48    0.685   0.683   0.392
          d:300,cs:16    0.596   0.583   0.224
          d:300,cs:48    0.689   0.683   0.363
          d:40,cs:16     0.558   0.533   0.266
          d:40,cs:48     0.578   0.595   0.308
CHANDAR                  0.432   -       -
GOUWS                    0.516   0.557   0.575

Table 2: BLI performance for all tested BLI models for ES/IT/NL-EN, with all bilingual word representations except CHANDAR and GOUWS learned from comparable Wikipedia data. The scores for BWESG+cos are computed as post-hoc averages over 10 random shuffles.

Exp II: Shuffling and Window Size Since our BWESG model relies on the pre-training random shuffling procedure, we also test whether the shuffling has a significant or rather minor impact on the induction of BWEs and the final BLI scores. Therefore, in fig. 2, we present maximum, minimum, and average Acc1 scores for all three language pairs obtained using 10 different random corpora shuffles with d = 100, 200, 300 and varying values for cs.

Figure 2: Maximum (MAX), minimum (MIN) and average (AVG) Acc1 scores with BWESG+cos in the BLI task over 10 different random corpora shuffles for all 3 language pairs, and varying values for parameters cs and d: (a) d = 100, (b) d = 200, (c) d = 300; the x-axis is the window size cs (4 to 60). Solid horizontal lines denote the highest baseline Acc1 scores for each language pair. NOS (thicker dotted lines) refers to BWESG+cos without random shuffling.

Results reveal that random shuffling affects the overall BLI scores, but the variance of results is minimal and often highly insignificant. It is important to mark that even the minimum Acc1 scores over these 10 different random shuffles are typically higher than the previous state-of-the-art baseline scores for large enough values of d and cs (compare the results in tab. 2 and fig. 2(a)-2(c)). A comparison with the BWESG model without shuffling (NOS in fig. 2) reveals that shuffling is useful even for larger values of cs.

Exp III: BWESG+cos vs. BWE-Based BLI We also compare our BWESG BLI model with two other models that are most similar to ours in spirit, as they also induce shared cross-lingual word embedding spaces (Chandar et al., 2014; Gouws et al., 2014), proven superior to or on a par with the BLI model from (Mikolov et al., 2013b). We use their pre-trained BWEs (obtained from the authors) and report the BLI scores in tab. 2. To make the comparison fair, we search for translations over the same vocabulary as with all other models. The results clearly reveal that, although both other BWE models critically rely on parallel Europarl data for training, and Gouws et al. (2014) in addition train on entire monolingual Wikipedias in both languages, our simple BWE induction model trained on much smaller amounts of document-aligned non-parallel data produces significantly higher BLI scores for IT-EN and ES-EN with sufficiently large windows.

However, the results for NL-EN with all BLI models from comparable data from tab. 2 are significantly lower than with the GOUWS BWEs. We attribute this to using less (and clearly insufficient) document-aligned training data for NL-EN (i.e., the training corpora for ES-EN and IT-EN are almost double or triple the size of the training corpus for NL-EN, see sect. 3).

5 Conclusions and Future Work

We have proposed Bilingual Word Embeddings Skip-Gram (BWESG), a simple yet effective model that is able to learn bilingual word embeddings solely on the basis of document-aligned comparable data. We have demonstrated its utility in the task of bilingual lexicon induction from such comparable data, where our new BWESG-based BLI model outperforms state-of-the-art models for BLI from document-aligned comparable data and related BWE induction models.

The low-cost BWEs may be used in other (semantic) tasks besides the ones discussed here, and it would be interesting to experiment with other types of context aggregation and selection beyond random shuffling, and other objective functions. Preliminary studies also demonstrate the utility of the BWEs in monolingual and cross-lingual information retrieval (Vulić and Moens, 2015).

Finally, we may use the knowledge of BWEs obtained by BWESG from document-aligned data to learn bilingual correspondences (e.g., word translation pairs or lists of semantically similar words across languages) which may in turn be used for representation learning from large unaligned multilingual datasets as proposed in (Haghighi et al., 2008; Mikolov et al., 2013b; Vulić and Moens, 2013b). In the long run, this idea may lead to large-scale fully data-driven representation learning models from huge amounts of multilingual data without any "pre-requirement" for parallel data or manually built lexicons.

Acknowledgments

We would like to thank the reviewers for their insightful comments and suggestions. This research has been carried out in the frameworks of the SCATE project (IWT-SBO 130041) and the PARIS project (IWT-SBO 110067).

References

Marco Baroni, Georgiana Dinu, and Germán Kruszewski. 2014. Don't count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In ACL, pages 238–247.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155.

John A. Bullinaria and Joseph P. Levy. 2007. Extracting semantic representations from word co-occurrence statistics: A computational study. Behavior Research Methods, 39(3):510–526.

Sarath Chandar, Stanislas Lauly, Hugo Larochelle, Mitesh M. Khapra, Balaraman Ravindran, Vikas C. Raykar, and Amrita Saha. 2014. An autoencoder approach to learning bilingual word representations. In NIPS, pages 1853–1861.

Ronan Collobert and Jason Weston. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In ICML, pages 160–167.

Éric Gaussier, Jean-Michel Renders, Irina Matveeva, Cyril Goutte, and Hervé Déjean. 2004. A geometric view on bilingual lexicon extraction from comparable corpora. In ACL, pages 526–533.

Stephan Gouws, Yoshua Bengio, and Greg Corrado. 2014. BilBOWA: Fast bilingual distributed representations without word alignments. CoRR, abs/1410.2455.

Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick, and Dan Klein. 2008. Learning bilingual lexicons from monolingual corpora. In ACL, pages 771–779.

Karl Moritz Hermann and Phil Blunsom. 2014a. Multilingual distributed representations without word alignment. In ICLR.

Karl Moritz Hermann and Phil Blunsom. 2014b. Multilingual models for compositional distributed semantics. In ACL, pages 58–68.

Douwe Kiela and Stephen Clark. 2014. A systematic study of semantic vector space model parameters. In CVSC Workshop at EACL, pages 21–30.

Alexandre Klementiev, Ivan Titov, and Binod Bhattarai. 2012. Inducing crosslingual distributed representations of words. In COLING, pages 1459–1474.

Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In NIPS, pages 2177–2185.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the ACL, 3:211–225.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Bilingual word representations with monolingual quality in mind. In Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing, pages 151–159.

Tomas Mikolov, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In ICLR Workshop Papers.

Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. 2013b. Exploiting similarities among languages for machine translation. CoRR, abs/1309.4168.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013c. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119.

David Mimno, Hanna Wallach, Jason Naradowsky, David A. Smith, and Andrew McCallum. 2009. Polylingual topic models. In EMNLP, pages 880–889.

Yves Peirsman and Sebastian Padó. 2011. Semantic relations in bilingual lexicons. ACM Transactions on Speech and Language Processing, 8(2):article 3.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, pages 1532–1543.

Emmanuel Prochasson and Pascale Fung. 2011. Rare word translation extraction from aligned comparable documents. In ACL, pages 1327–1335.

Stephen Roller and Sabine Schulte im Walde. 2013. A multimodal LDA model integrating textual, cognitive and visual modalities. In EMNLP, pages 1146–1157.

Helmut Schmid. 1994. Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing.

Hubert Soyer, Pontus Stenetorp, and Akiko Aizawa. 2015. Leveraging monolingual data for crosslingual compositional word representations. In ICLR.

Mark Steyvers and Tom Griffiths. 2007. Probabilistic topic models. Handbook of Latent Semantic Analysis, 427(7):424–440.

Akihiro Tamura, Taro Watanabe, and Eiichiro Sumita. 2012. Bilingual lexicon extraction from comparable corpora using label propagation. In EMNLP, pages 24–36.

Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37(1):141–188.

Ivan Vulić and Marie-Francine Moens. 2013a. Cross-lingual semantic similarity of words as the similarity of their semantic word responses. In NAACL, pages 106–116.

Ivan Vulić and Marie-Francine Moens. 2013b. A study on bootstrapping bilingual vector spaces from non-parallel data (and nothing else). In EMNLP, pages 1613–1624.

Ivan Vulić and Marie-Francine Moens. 2015. Monolingual and cross-lingual information retrieval models based on (bilingual) word embeddings. In SIGIR, to appear.

Ivan Vulić, Wim De Smet, and Marie-Francine Moens. 2011. Identifying word translations from comparable corpora using latent topic models. In ACL, pages 479–484.

Will Y. Zou, Richard Socher, Daniel Cer, and Christopher D. Manning. 2013. Bilingual word embeddings for phrase-based machine translation. In EMNLP, pages 1393–1398.
