Learning Multilingual Word Embeddings in Latent Metric Space: A Geometric Approach
Pratik Jawanpuria¹, Arjun Balgovind²*, Anoop Kunchukuttan¹, Bamdev Mishra¹
¹Microsoft, India   ²IIT Madras, India
¹{pratik.jawanpuria,ankunchu,[email protected]   ²[email protected]

*This work was carried out during the author's internship at Microsoft, India.

Abstract

We propose a novel geometric approach for learning bilingual mappings given monolingual embeddings and a bilingual dictionary. Our approach decouples the source-to-target language transformation into (a) language-specific rotations on the original embeddings to align them in a common, latent space, and (b) a language-independent similarity metric in this common space to better model the similarity between the embeddings. Overall, we pose the bilingual mapping problem as a classification problem on smooth Riemannian manifolds. Empirically, our approach outperforms previous approaches on the bilingual lexicon induction and cross-lingual word similarity tasks. We next generalize our framework to represent multiple languages in a common latent space. Language-specific rotations for all the languages and a common similarity metric in the latent space are learned jointly from bilingual dictionaries for multiple language pairs. We illustrate the effectiveness of joint learning for multiple languages in an indirect word translation setting.

1 Introduction

Bilingual word embeddings are a useful tool in natural language processing (NLP) that has attracted a lot of interest lately due to a fundamental property: similar concepts/words across different languages are mapped close to each other in a common embedding space. Hence, they are useful for joint/transfer learning and sharing annotated data across languages in different NLP applications such as machine translation (Gu et al., 2018), building bilingual dictionaries (Mikolov et al., 2013b), mining parallel corpora (Conneau et al., 2018), text classification (Klementiev et al., 2012), sentiment analysis (Zhou et al., 2015), and dependency parsing (Ammar et al., 2016).

Mikolov et al. (2013b) empirically show that a linear transformation of embeddings from one language to another preserves the geometric arrangement of word embeddings. In a supervised setting, the transformation matrix, $W$, is learned given a small bilingual dictionary and their corresponding monolingual embeddings. Subsequently, many refinements to the bilingual mapping framework have been proposed (Xing et al., 2015; Smith et al., 2017b; Conneau et al., 2018; Artetxe et al., 2016, 2017, 2018a,b).

In this work, we propose a novel geometric approach for learning bilingual embeddings. We rotate the source and target language embeddings from their original vector spaces to a common latent space via language-specific orthogonal transformations. Furthermore, we define a similarity metric, the Mahalanobis metric, in this common space to refine the notion of similarity between a pair of embeddings. We achieve the above by learning the transformation matrix as follows: $W = U_t B U_s^\top$, where $U_t$ and $U_s$ are the orthogonal transformations for the target and source language embeddings, respectively, and $B$ is a positive definite matrix representing the Mahalanobis metric.
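To make the factored map concrete, here is a minimal NumPy sketch (not the authors' implementation) of how $W = U_t B U_s^\top$ scores candidate translations: both words are rotated into the common latent space and compared under the metric $B$. All arrays (`U_s`, `U_t`, `B`, the embeddings) are random toy placeholders standing in for learned parameters, and the scoring convention shown is one consistent reading of the factorization.

```python
import numpy as np

# Toy sizes: d-dimensional embeddings, a 5-word target vocabulary.
rng = np.random.default_rng(0)
d, n_tgt = 4, 5

# Stand-ins for learned parameters: orthogonal rotations and an SPD metric.
U_s, _ = np.linalg.qr(rng.normal(size=(d, d)))  # source-language rotation
U_t, _ = np.linalg.qr(rng.normal(size=(d, d)))  # target-language rotation
A = rng.normal(size=(d, d))
B = A @ A.T + d * np.eye(d)                     # symmetric positive definite

x_s = rng.normal(size=d)            # embedding of one source word
X_t = rng.normal(size=(n_tgt, d))   # target-word embeddings, one per row

# x_t^T W x_s = (U_t^T x_t)^T B (U_s^T x_s): rotate both sides into the
# common latent space, then compare them under the Mahalanobis metric B.
z_s = U_s.T @ x_s                   # source word in the latent space
Z_t = X_t @ U_t                     # rows are U_t^T x_t for each target word
scores = Z_t @ (B @ z_s)            # one similarity score per target word
print("best candidate index:", int(np.argmax(scores)))
```

In practice the parameters would come from the training procedure described later in the paper, and retrieval would run over a full vocabulary rather than five random vectors.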
The proposed formulation has the following benefits:

• The learned similarity metric allows for a more effective similarity comparison of embeddings based on evidence from the data.

• A common latent space decouples the source and target language transformations, and naturally enables representation of word embeddings from both languages in a single vector space.

• We also show that the proposed method can be easily generalized to jointly learn multilingual embeddings, given bilingual dictionaries of multiple language pairs. We map multiple languages into a single vector space by learning the characteristics common across languages (the similarity metric) as well as language-specific attributes (the orthogonal transformations).

The optimization problem resulting from our formulation involves orthogonal constraints on the language-specific transformations ($U_i$ for language $i$) as well as the symmetric positive-definite constraint on the metric $B$. Instead of solving the optimization problem in the Euclidean space with constraints, we view it as an optimization problem on smooth Riemannian manifolds, which are well-studied topological spaces (Lee, 2003). The Riemannian optimization framework embeds the given constraints into the search space and conceptually views the problem as an unconstrained optimization problem over the manifold.

We evaluate our approach on different bilingual as well as multilingual tasks across multiple languages and datasets. The following is a summary of our findings:

• Our approach outperforms state-of-the-art supervised and unsupervised bilingual mapping methods on the bilingual lexicon induction as well as the cross-lingual word similarity tasks.

• An ablation analysis reveals that the following contribute to our model's improved performance: (a) aligning the embedding spaces of different languages, (b) learning a similarity metric which induces a latent space, (c) performing inference in the induced latent space, and (d) formulating the tasks as a classification problem.

• We evaluate our multilingual model on an indirect word translation task: translation between a language pair that does not have a bilingual dictionary, but where the source and target languages each possess a bilingual dictionary with a third, common pivot language. Our multilingual model outperforms a strong unsupervised baseline as well as methods based on adapting bilingual methods for this indirect translation task.

• Lastly, we propose a semi-supervised extension of our approach that further improves performance over the supervised approaches.

The rest of the paper is organized as follows. Section 2 discusses related work. The proposed framework, including problem formulations for bilingual and multilingual mappings, is presented in Section 3. The proposed Riemannian optimization algorithm is described in Section 4. In Section 5, we discuss our experimental setup. Section 6 presents the results of experiments on direct translation with our algorithms and analyzes the results. Section 7 presents experiments on indirect translation using our generalized multilingual algorithm. We discuss a semi-supervised extension to our framework in Section 8. Section 9 concludes the paper.

2 Related Work

Bilingual Embeddings. Mikolov et al. (2013b) show that a linear transformation from embeddings of one language to another can be learned from a bilingual dictionary and corresponding monolingual embeddings by performing linear least-squares regression. A popular modification to this formulation constrains the transformation matrix to be orthogonal (Xing et al., 2015; Smith et al., 2017b, 2018a). This is known as the orthogonal Procrustes problem (Schönemann, 1966).
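For reference, the orthogonal Procrustes problem has a well-known closed-form solution via the singular value decomposition. Below is a minimal, self-contained NumPy sketch; the random `X` and `Y` are toy stand-ins for the source and target embeddings of dictionary word pairs, not data from the paper.

```python
import numpy as np

# Orthogonal Procrustes: minimize ||X W - Y||_F subject to W^T W = I.
rng = np.random.default_rng(0)
n, d = 100, 4                    # dictionary size, embedding dimension
X = rng.normal(size=(n, d))      # source embeddings of dictionary pairs
Y = rng.normal(size=(n, d))      # corresponding target embeddings

# Closed-form solution (Schönemann, 1966): if X^T Y = U S V^T, then W = U V^T.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

assert np.allclose(W.T @ W, np.eye(d))  # W is orthogonal by construction
mapped = X @ W                          # source embeddings in the target space
```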
Orthogonality preserves monolingual distances and ensures that the transformation is reversible. Lazaridou et al. (2015) and Joulin et al. (2018) optimize alternative loss functions in this framework. Artetxe et al. (2018a) improve on the Procrustes solution and propose a multi-step framework that applies a series of linear transformations to the data. Faruqui and Dyer (2014) use Canonical Correlation Analysis (CCA) to learn linear projections from the source and target languages to a common space such that the correlations between the embeddings projected to this space are maximized. Procrustes solution–based approaches have been shown to perform better than CCA-based approaches (Artetxe et al., 2016, 2018a).

We view the problem of mapping the source and target language word embeddings as (a) aligning the two language spaces and (b) learning a similarity metric in this (learned) common space. We accomplish this by learning suitable language-specific orthogonal transformations (for alignment) and a symmetric positive-definite matrix (as the Mahalanobis metric). The similarity metric is useful in addressing the limitations of mapping to a common latent space under orthogonality constraints, an issue discussed by Doval et al. (2018). Whereas Doval et al. (2018) learn a second correction transformation by assuming the average […]

Multilingual Embeddings. One family of approaches represents embeddings of multiple languages in a common vector space by designating one of the languages as a pivot language. In this simple approach, bilingual mappings are learned independently from all other languages to the pivot language. A GPA-based method (Kementchedjhieva et al., 2018) may also be used to jointly transform multiple languages to a common latent space. However, this requires an n-way dictionary to represent n languages. In contrast, the proposed approach requires only pairwise bilingual dictionaries such that every language under consideration is represented in at least one bilingual dictionary.

The above-mentioned approaches are referred to as offline, since the monolingual and bilingual embeddings are learned separately. In contrast, online approaches directly learn a bilingual/multilingual embedding from parallel corpora (Hermann and Blunsom, 2014; Huang et al., 2015; Duong […]