Are All Good Word Vector Spaces Isomorphic?
Ivan Vulić1∗   Sebastian Ruder2∗   Anders Søgaard3,4∗
1 Language Technology Lab, University of Cambridge   2 DeepMind
3 Department of Computer Science, University of Copenhagen   4 Google Research, Berlin
[email protected]   [email protected]   [email protected]
∗ All authors contributed equally to this work.
arXiv:2004.04070v2 [cs.CL] 20 Oct 2020

Abstract

Existing algorithms for aligning cross-lingual word vector spaces assume that vector spaces are approximately isomorphic. As a result, they perform poorly or fail completely on non-isomorphic spaces. Such non-isomorphism has been hypothesised to result from typological differences between languages. In this work, we ask whether non-isomorphism is also crucially a sign of degenerate word vector spaces. We present a series of experiments across diverse languages which show that variance in performance across language pairs is not only due to typological differences, but can mostly be attributed to the size of the monolingual resources available, and to the properties and duration of monolingual training (e.g. "under-training").

[Figure 1: scatter plot of BLI scores (Accuracy) against # of tokens in Wikipedia (10M-500M, log scale), one point per target language, with a linear fit.]
Figure 1: Performance of a state-of-the-art BLI model mapping from English to a target language and the size of the target language Wikipedia are correlated. Linear fit shown as a blue line (log scale).

1 Introduction

Word embeddings have been argued to reflect how language users organise concepts (Mandera et al., 2017; Torabi Asr et al., 2018). The extent to which they really do so has been evaluated, e.g., using semantic word similarity and association norms (Hill et al., 2015; Gerz et al., 2016), and word analogy benchmarks (Mikolov et al., 2013c). If word embeddings reflect more or less language-independent conceptual organisations, word embeddings in different languages can be expected to be near-isomorphic. Researchers have exploited this to learn linear transformations between such spaces (Mikolov et al., 2013a; Glavaš et al., 2019), which have been used to induce bilingual dictionaries, as well as to facilitate multilingual modeling and cross-lingual transfer (Ruder et al., 2019).

In this paper, we show that near-isomorphism arises only with sufficient amounts of training. This is of practical interest for applications of linear alignment methods for cross-lingual word embeddings. It furthermore provides us with an explanation for reported failures to align word vector spaces in different languages (Søgaard et al., 2018; Artetxe et al., 2018a), which have so far been largely attributed only to inherent typological differences. In fact, the amount of data used to induce the monolingual embeddings is predictive of the quality of the aligned cross-lingual word embeddings, as evaluated on bilingual lexicon induction (BLI). Consider, for motivation, Figure 1; it shows the performance of a state-of-the-art alignment method, RCSLS with iterative normalisation (Zhang et al., 2019), on mapping English embeddings onto embeddings in other languages, and its correlation (ρ = 0.72) with the size of the tokenised target language Polyglot Wikipedia (Al-Rfou et al., 2013).

We investigate to what extent the amount of data available for some languages and the corresponding training conditions provide a sufficient explanation for the variance in reported results; that is, whether it is the full story or not. The answer is 'almost'; that is, its interplay with inherent typological differences does have a crucial impact on the 'alignability' of monolingual vector spaces.

We first discuss current standard methods of quantifying the degree of near-isomorphism between word vector spaces (§2.1). We then outline training settings that may influence isomorphism (§2.2) and present a novel experimental protocol for learning cross-lingual word embeddings that simulates a low-resource environment, and also controls for topical skew and differences in morphological complexity (§3). We focus on two groups of languages: 1) Spanish, Basque, Galician, and Quechua, and 2) Bengali, Tamil, and Urdu, as these are arguably spoken in culturally related regions, but have very different morphology. Our experiments, among other findings, indicate that a low-resource version of Spanish is as difficult to align to English as Quechua, challenging the assumption from prior work that the primary issue to resolve in cross-lingual word embedding learning is language dissimilarity (instead of, e.g., procuring additional raw data for embedding training). We also show that by controlling for different factors, we reduce the gap between aligning Spanish and Basque to English from 0.291 to 0.129. Similarly, under these controlled circumstances, we do not observe any substantial performance difference between aligning Spanish and Galician to English, or between aligning Bengali and Tamil to English.

We also investigate the learning dynamics of monolingual word embeddings and their impact on BLI performance and near-isomorphism of the resulting word vector spaces (§4), finding training duration, amount of monolingual resources, preprocessing, and self-learning all to have a large impact. The findings are verified across a set of typologically diverse languages, where we pair English with Spanish, Arabic, and Japanese.

We will release our new evaluation dictionaries and subsampled Wikipedias controlling for topical skew and morphological differences to facilitate future research at: https://github.com/cambridgeltl/iso-study.

2 Isomorphism of Vector Spaces

Studies analyzing the qualities of monolingual word vector spaces have focused on intrinsic tasks (Baroni et al., 2014), correlations (Tsvetkov et al., 2015), and subspaces (Yaghoobzadeh and Schütze, 2016). In the cross-lingual setting, the most important indicator for performance has been the degree of isomorphism, that is, how (topologically) similar the structures of the two vector spaces are.

Mapping-based approaches
The prevalent way to learn a cross-lingual embedding space, especially in low-data regimes, is to learn a mapping between a source and a target embedding space (Mikolov et al., 2013a). Such mapping-based approaches assume that the monolingual embedding spaces are isomorphic, i.e., that one can be transformed into the other via a linear transformation (Xing et al., 2015; Artetxe et al., 2018a). Recent unsupervised approaches rely even more strongly on this assumption: they assume that the structures of the embedding spaces are so similar that they can be aligned by minimising the distance between the transformed source language and the target language embedding space (Zhang et al., 2017; Conneau et al., 2018; Xu et al., 2018; Alvarez-Melis and Jaakkola, 2018; Hartmann et al., 2019).

2.1 Quantifying Isomorphism

We employ measures that quantify isomorphism in three distinct ways, based on graphs, metric spaces, and vector similarity.

Eigenvector similarity (Søgaard et al., 2018)
Eigenvector similarity (EVS) estimates the degree of isomorphism based on properties of the nearest neighbour graphs of the two embedding spaces. We first length-normalise embeddings in both embedding spaces and compute the nearest neighbour graphs on a subset of the N most frequent words. We then calculate the Laplacian matrices L_1 and L_2 of each graph. For L_1, we find the smallest k_1 such that the sum of its k_1 largest eigenvalues, \sum_{i=1}^{k_1} \lambda_{1i}, is at least 90% of the sum of all its eigenvalues. We proceed analogously for k_2 and set k = \min(k_1, k_2). The eigenvector similarity metric \Delta is then the sum of the squared differences of the k largest Laplacian eigenvalues: \Delta = \sum_{i=1}^{k} (\lambda_{1i} - \lambda_{2i})^2. The lower \Delta, the more similar the graphs and the more isomorphic the embedding spaces.
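To make the computation above concrete, the following is a minimal Python sketch of EVS using NumPy and scikit-learn. It assumes X and Y are (N x d) arrays holding the embeddings of the N most frequent words of each language; the function name and the neighbourhood size are illustrative choices, and only the 90% spectrum threshold is taken from the description above.

import numpy as np
from sklearn.neighbors import kneighbors_graph

def eigenvector_similarity(X, Y, n_neighbors=10, threshold=0.9):
    """Lower values indicate more nearly isomorphic embedding spaces."""
    spectra, ks = [], []
    for Z in (X, Y):
        # length-normalise the embeddings
        Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)
        # unweighted nearest-neighbour graph, symmetrised
        A = kneighbors_graph(Z, n_neighbors, mode="connectivity").toarray()
        A = np.maximum(A, A.T)
        # graph Laplacian L = D - A and its eigenvalues, largest first
        L = np.diag(A.sum(axis=1)) - A
        eig = np.sort(np.linalg.eigvalsh(L))[::-1]
        # smallest k whose top-k eigenvalues hold >= 90% of the spectrum mass
        cum = np.cumsum(eig)
        ks.append(int(np.searchsorted(cum, threshold * cum[-1])) + 1)
        spectra.append(eig)
    k = min(ks)
    # sum of squared differences of the k largest Laplacian eigenvalues
    return float(np.sum((spectra[0][:k] - spectra[1][:k]) ** 2))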
Gromov-Hausdorff distance (Patra et al., 2019)
The Hausdorff distance is a measure of the worst-case distance between two metric spaces \mathcal{X} and \mathcal{Y} with a distance function d:

H(\mathcal{X}, \mathcal{Y}) = \max \left\{ \sup_{x \in \mathcal{X}} \inf_{y \in \mathcal{Y}} d(x, y), \; \sup_{y \in \mathcal{Y}} \inf_{x \in \mathcal{X}} d(x, y) \right\}

Intuitively, it measures the distance between the nearest neighbours that are farthest apart. The Gromov-Hausdorff distance (GH) in turn minimises this distance over all isometric transforms (orthogonal transforms in our case, as we apply mean centering) of \mathcal{X} and \mathcal{Y} as follows:

GH(\mathcal{X}, \mathcal{Y}) = \inf_{f, g} H(f(\mathcal{X}), g(\mathcal{Y}))

In practice, GH is calculated by computing the Bottleneck distance between the metric spaces (Chazal et al., 2009; Patra et al., 2019).

Relational similarity
As an alternative, we consider a simpler measure inspired by Zhang et al. (2019). This measure, dubbed RSIM, is based on the intuition that the similarity distributions of translations within each language should be similar.
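The paragraph above only states the intuition behind RSIM, so the sketch below fills in one concrete reading of it as an assumption rather than the paper's exact formulation: for a list of translation pairs, compute all pairwise cosine similarities among the source words and among the target words, and correlate the two similarity distributions with Pearson's r (higher means more isomorphic).

import numpy as np
from scipy.stats import pearsonr

def rsim(X_src, X_tgt):
    """X_src[i] and X_tgt[i] are the embeddings of the i-th translation pair."""
    def pairwise_cosines(Z):
        Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)
        sims = Z @ Z.T
        iu = np.triu_indices(len(Z), k=1)  # each unordered word pair once
        return sims[iu]
    # correlation of the two intra-language similarity distributions
    r, _ = pearsonr(pairwise_cosines(X_src), pairwise_cosines(X_tgt))
    return float(r)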
Corpus size
It has become standard to align monolingual word embeddings trained on Wikipedia (Glavaš et al., 2019; Zhang et al., 2019). As can be seen in Figure 1, and also in Table 1, Wikipedias of low-resource languages are more than an order of magnitude smaller than Wikipedias of high-resource languages. Corpus size has been shown to play a role in the performance of monolingual embeddings (Sahlgren and Lenci, 2016), but it is unclear how it influences their structure and isomorphism.

Training duration
As it is generally too expensive to tune hyper-parameters separately for each language, monolingual embeddings are typically trained for the same number of epochs in large-scale studies. As a result, word embeddings of low-resource languages may be "under-trained".
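To illustrate the one-size-fits-all regime described above, the sketch below trains word embeddings with identical hyper-parameters and a fixed number of epochs for several corpora, using gensim's fastText implementation as an illustrative choice. The language codes, file names, and hyper-parameter values are placeholders, not the settings used in the paper; the point is simply that a small corpus trained this way receives far fewer updates.

from gensim.models import FastText

# Identical hyper-parameters and epochs for every language, as in typical
# large-scale studies; smaller Wikipedias therefore receive far fewer updates.
for lang in ("es", "eu", "gl"):                  # placeholder language codes
    corpus_path = f"wiki.{lang}.tokenized.txt"   # placeholder corpus files
    model = FastText(corpus_file=corpus_path,
                     vector_size=300, window=5, min_count=5, sg=1, epochs=5)
    model.wv.save(f"wiki.{lang}.kv")             # keep only the word vectors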