
Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19)

Getting in Shape: Word Embedding SubSpaces

Tianyuan Zhou1, João Sedoc2 and Jordan Rodu1*
1Department of Statistics, University of Virginia
2Department of Computer and Information Science, University of Pennsylvania
[email protected], [email protected], [email protected]
*Corresponding Author. Supplementary materials and source code: https://github.com/NoahZhouTianyuan/ConceptorOnNondistEmbedding

Abstract

Many tasks in natural language processing require the alignment of word embeddings. Embedding alignment relies on the geometric properties of the manifold of word vectors. This paper focuses on supervised linear alignment and studies the relationship between the shape of the target embedding and the performance of the aligned embeddings. We assess the performance of aligned word vectors on semantic similarity tasks and find that the isotropy of the target embedding is critical to the alignment. Furthermore, aligning with isotropic noise can deliver satisfactory results. We provide a theoretical framework and guarantees which aid in the understanding of empirical results.

1 Introduction

Mono-lingual and multi-lingual alignment of word embeddings is important for domain adaptation, embedding assessment, and machine translation [Ben-David et al., 2007; Blitzer et al., 2011; Tsvetkov et al., 2015; Lample et al., 2018]. Fundamentally, this is a subspace alignment problem which has seen much interest in several communities, including computer vision and natural language processing (NLP) [Fernando et al., 2013; Xing et al., 2015; Wang and Mahadevan, 2013; Lample et al., 2018; Mikolov et al., 2013a], and it can be either supervised or unsupervised, depending on the setting.

Simultaneously, some work has focused on uncovering the structure of word embedding manifolds [Mimno and Thompson, 2017a; Hartmann et al., 2018; Shin et al., 2018; Mu et al., 2017]; in this paper we focus on their isotropy/anisotropy. Word embeddings are known to not be isotropic [Andreas and Klein, 2015], but Arora et al. [2015] argue that isotropic word embeddings mitigate the effect of approximation error. Indeed, recent work in post-processing word embeddings has shown that increasing isotropy increases semantic task performance [Mu et al., 2017; Liu et al., 2019b; Liu et al., 2019a].

Despite a large body of work in each of the two areas, the link between subspace alignment and manifold isotropy has not been fully explored. Given that word embeddings contain some form of "meaning", presumably common to all word embeddings, different representations should align to some degree. But alignment necessarily prioritizes resolving disparity in larger singular values, which could be problematic for two word embeddings that encode information differently across their spectra. When two word embeddings represent information similarly, they can be successfully aligned. For instance, Artetxe et al. [2016] show that orthogonal transformations for alignment are superior for retaining performance on analogy tasks. However, orthogonal transformations do not allow for the alignment of more disparate methods such as distributional and non-distributional word embeddings.

In this paper, we present a theoretical framework for understanding the alignment between word embeddings: when it can work, and when it might fail. Our theoretical results show that the singular value structure of the source embeddings is completely discarded, and when information is encoded differently in the source and target embeddings, the distortion of the spectrum from source to aligned embeddings could drastically affect downstream results.

2 Theoretical Results

In this section we provide some theoretical underpinnings for the phenomena observed in our experiments. Lemma 1 shows that when a source representation is aligned to a target representation using a linear transformation, the column space of the aligned representation is determined by the column space of the source, but the singular value structure of the source is entirely discarded. Theorem 1 guarantees the existence of a lower-bounded singular value in the aligned representation. Since the sum of the singular values is upper-bounded by that of the target embedding, if the lower bound of the largest singular value is relatively large in comparison, then the singular value structure will have high eccentricity.

Finally, Proposition 1 shows that the correlation between measurements (the Euclidean distance between vectors, or the cosine of their angles) in a vector cloud and those same measurements after the cloud has been stretched by unequal amounts in different directions is small. Combined with the results of Theorem 1, this implies that alignment can greatly impact the performance of aligned word representations on semantic similarity tasks. We expand on this below.
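These claims are straightforward to check numerically. The sketch below applies the least squares alignment model stated in the next subsection to synthetic Gaussian matrices and verifies Lemma 1 and Theorem 1; the sizes and spectra are illustrative assumptions, not the embeddings used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 50                            # illustrative sizes (assumption)
X = rng.standard_normal((n, p))            # stand-in for the source representations
Y = rng.standard_normal((n, p)) @ np.diag(np.linspace(5.0, 0.1, p))  # anisotropic stand-in target

# Least squares alignment: Y_hat = X W* = X (X^T X)^{-1} X^T Y
W, *_ = np.linalg.lstsq(X, Y, rcond=None)
Y_hat = X @ W

U_X, _, _ = np.linalg.svd(X, full_matrices=False)
U_Y, s_Y, _ = np.linalg.svd(Y, full_matrices=False)

# Lemma 1: the singular values of Y_hat are those of U_X^T U_Y Sigma_Y
lhs = np.linalg.svd(Y_hat, compute_uv=False)
rhs = np.linalg.svd(U_X.T @ U_Y @ np.diag(s_Y), compute_uv=False)
print(np.allclose(lhs, rhs))               # True, up to numerical precision

# Theorem 1: Y_hat has a singular value at least c_i * sigma_i(Y) for every i,
# where c_i = ||U_X^T U_{y_i}||_2; the largest singular value satisfies all bounds.
c = np.linalg.norm(U_X.T @ U_Y, axis=0)
print(np.all(lhs[0] >= c * s_Y - 1e-8))    # True
```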

We include the proof of the theorem here, but defer the proofs of the lemma and the proposition to the supplementary material. We also provide additional theorems in the supplementary material that, while not critical for the main focus of this work, provide justification for some minor experimental results and contribute more broadly to our fundamental theoretical understanding of word representation alignment.

We begin by stating our word representation alignment model.

Alignment Model: Let $Y \in \mathbb{R}^{n \times p}$ be the target word representations, and $X \in \mathbb{R}^{n \times p}$ the source representations. To align $X$ to $Y$, we seek a linear mapping $W$ such that $Y = XW + \epsilon$. The least squares solution for $W$ yields
$$\hat{Y} = XW^{*} = X(X^{\top}X)^{-1}X^{\top}Y = U_X U_X^{\top} Y.$$

Lemma 1. Let $U_X$ and $U_Y$ be the left singular vectors of $X$ and $Y$ respectively, and $\Sigma_Y$ a diagonal matrix containing the singular values of $Y$. Then $\sigma(\hat{Y}) = \sigma(U_X^{\top} U_Y \Sigma_Y)$.

Corollary 1. If $U_X = U_Y$, then $\sigma(\hat{Y}) = \sigma(Y)$.

For the following theorem, define $\sigma_i(M)$ to be the $i$-th singular value of the matrix $M$, and $U_{Mi}$ to be the singular vector associated with the $i$-th singular value of $M$.

Theorem 1. Suppose $X, Y \in \mathbb{R}^{n \times p}$, and let $\hat{Y} = XW^{*}$ be the least squares solution of $Y = XW + \epsilon$. For singular value $\sigma_i(Y)$ and corresponding left singular vector $U_{yi}$, let $c_i = \|U_X^{\top} U_{yi}\|_2$. Then there exists a singular value of $\hat{Y}$ at least as large as $c_i \sigma_i(Y)$.

Proof. By Lemma 1, $\sigma(\hat{Y}) = \sigma(U_X^{\top} U_Y \Sigma_Y)$. Recalling the definition of singular values, $\sigma_1(U_X^{\top} U_Y \Sigma_Y) = \max_{\|u\|_2 = 1, \|v\|_2 = 1} u^{\top} U_X^{\top} U_Y \Sigma_Y v$, where $u, v$ are unit vectors. Define $e_i = (0, \dots, 0, 1, 0, \dots, 0)^{\top}$, the unit vector whose elements are all 0 except the $i$-th, which is 1. Then for any $1 \le i \le p$,
$$\sigma_1(U_X^{\top} U_Y \Sigma_Y) \ge \max_{\|u\|_2 = 1} u^{\top} U_X^{\top} U_Y \Sigma_Y e_i = \max_{\|u\|_2 = 1} u^{\top} U_X^{\top} \sigma_i(Y) U_{yi} = \sigma_i(Y) \|U_X^{\top} U_{yi}\|_2 = c_i \sigma_i(Y).$$
Therefore, the largest singular value must be greater than or equal to $c_i \sigma_i(Y)$, for all $i$.

Proposition 1. Suppose $X \in \mathbb{R}^p$ is a random vector with all entries distributed i.i.d. with mean zero and a bounded fourth moment. Let $S$ be a matrix with $\sqrt{s_1}, \sqrt{s_2}, \dots, \sqrt{s_p}$ along the diagonal and 0 everywhere else. Then for realizations $X_i$ and $X_j$ of $X$, we have the following two results:
$$\mathrm{corr}\big(\|X_i - X_j\|_2^2,\ \|X_i S - X_j S\|_2^2\big) = \frac{(s_1 + s_2 + \cdots + s_p)/\sqrt{p}}{\sqrt{s_1^2 + s_2^2 + \cdots + s_p^2}}$$
and
$$\mathrm{corr}\big(\langle X_i, X_j \rangle,\ \langle X_i S, X_j S \rangle\big) = \frac{(s_1^2 + s_2^2 + \cdots + s_p^2)/\sqrt{p}}{\sqrt{s_1^4 + s_2^4 + \cdots + s_p^4}}.$$
Further, for a subset $I \subset \{1, \dots, p\}$, define $X_{i,I}$ to be the vector $X_i$ with indices not in $I$ set to 0. Then
$$\mathrm{corr}\big(\|X_i S - X_j S\|_2^2,\ \|X_{i,I} - X_{j,I}\|_2^2\big) = \frac{\sum_{i \in I} s_i / \sqrt{|I|}}{\sqrt{s_1^2 + s_2^2 + \cdots + s_p^2}}.$$

To see the implications of this proposition, consider the alignment process loosely as 1) establishing the directions of the aligned singular vectors (which could be by an orthogonal rotation of the source singular vectors, or through linear combinations of the source embeddings if the information content is not encoded similarly in the source and target embeddings), 2) stretching the singular vectors according to the singular values of the source embeddings (or functions of the singular values that distribute their information appropriately if the transformation is not orthogonal), and 3) adjusting the stretched singular vectors through the $S$ matrix of Proposition 1 according to the spectrum of the target embeddings (see Lemma 1). For two word embeddings that encode information similarly (say, two distributional word embeddings), the entries of the $S$ matrix will all be roughly equal. However, for two word embeddings that do not, significant adjustment will likely be required, with some entries of $S$ large, to stretch the singular vectors, and some small, to shrink them. The results of our experiments support this idea.

3 Empirical Results

In this section, we show some empirical results of word representation alignment. Our key finding suggests that isotropy is important to successful alignment. In fact, aligning with isotropic noise can even yield satisfactory intrinsic evaluation results.

Conceptor Negation (CN). It is worth noting that the geometry of distributional word embeddings has been studied carefully [Mimno and Thompson, 2017b]. Mimno and Thompson [2017b] note that for skip-gram with negative sampling, word embedding point clouds are concentrated within a narrow cone, which may lead to bias. ABTT [Mu et al., 2017] and conceptor negation [Liu et al., 2019b] are two methods used to correct this bias by damping the larger singular values. Hence we suggest that conceptor negation should be used post-alignment in order to control the eccentricity of the resulting aligned representation.
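For reference, a minimal sketch of conceptor negation in the spirit of Liu et al. [2019b] is given below; the aperture value alpha = 2 and the rows-as-words orientation are assumptions of this sketch, not settings taken from our experiments.

```python
import numpy as np

def conceptor_negation(X, alpha=2.0):
    """Dampen the high-variance directions of an embedding matrix X (n_words x dim).

    alpha is the aperture parameter (alpha = 2.0 is an assumed default here).
    """
    n, d = X.shape
    R = X.T @ X / n                                        # d x d correlation matrix
    C = R @ np.linalg.inv(R + (alpha ** -2) * np.eye(d))   # conceptor matrix
    return X @ (np.eye(d) - C)                             # apply the negated conceptor

# Toy check: CN shrinks the largest singular values the most, reducing eccentricity.
rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 300)) * np.linspace(3.0, 0.1, 300)
print(np.linalg.svd(X, compute_uv=False)[:3])
print(np.linalg.svd(conceptor_negation(X), compute_uv=False)[:3])
```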

3.1 Experimental Setup

We perform multiple experiments using distributional word representations (each 300-dimensional), including word2vec [Mikolov et al., 2013b] (Google News), GloVe [Pennington et al., 2014] (840-billion-token Common Crawl) and FastText [Bojanowski et al., 2017] (Common Crawl, without subword information), as our source embeddings, and align them through linear regression to various target representations. We then test the aligned word vectors on seven semantic similarity tasks [Faruqui and Dyer, 2014], and in some cases an additional three concept categorization tasks as a supplement. Our target representations are (1) other distributional word representations; (2) word representations such as low-rank non-distributional word vectors with singular values [Faruqui and Dyer, 2015], LSA [Landauer et al., 1998] (on the British National Corpus [Consortium, 2007]), Eigenwords [Dhillon et al., 2015] and dependency word embeddings [Levy and Goldberg, 2014a]; (3) the same word representations as in (2) but with the labels permuted; and (4) designed noise, including isotropic Gaussian noise and rank-1 noise. We also ran experiments in which we performed conceptor negation on the target, source, or aligned word representation, and compared these results to the non-conceptored versions.

Note that the shapes of the manifolds of the different distributional word representations are quite similar, and the alignment of distributional source representations to distributional target representations has a high R². In contrast, the shapes of the manifolds of the word representations in (2) are quite different from those of distributional word representations, and as a consequence the alignment has a relatively low R². The permuted embeddings disassociate the labels from the representations, thus breaking the supervised nature of the alignment. Finally, the designed noise simulates two extreme cases: one purely isotropic target representation and one highly non-isotropic.

Alignment Order for Display of Semantic Similarity Results

To better display and compare the results of our alignment experiments on semantic similarity tasks, we establish a sequence of alignments. For aligning two distributional word representations (call them representation 1 and representation 2), the alignment sequence is (in parentheses are example labels, taking word2vec as representation 1 and GloVe as representation 2): (1) representation 1 (e.g. W2V), (2) conceptor negated representation 1 (e.g. W2V+CN), (3) representation 1 aligned to conceptor negated representation 2 (e.g. W2V by GV+CN), (4) representation 1 aligned to representation 2 (e.g. W2V by GV), (5) representation 2 aligned to representation 1 (e.g. GV by W2V), (6) representation 2 aligned to conceptor negated representation 1 (e.g. GV by W2V+CN), (7) conceptor negated representation 2 (e.g. GV+CN), (8) representation 2 (e.g. GV). Results for distributional sources and targets are shown in Fig 1.

For a distributional source aligned with an 'other' target, the sequence is (in parentheses are the labels shown in the figures): (1) source representation (Source Embedding), (2) source aligned with permuted and conceptor negated target (Aligned by permuted+CN Target), (3) source aligned with permuted target (Aligned by permuted Target), (4) source aligned with target (Aligned by Target), (5) target representation (Target Embedding). For example, the results for the FastText source and non-distributional target representations are shown in Fig 4.

3.2 Alignment Results

Distributional Source and Target Representations

Distributional word vector representations model similar information [Levy and Goldberg, 2014b; Pennington et al., 2014], implying that the manifold shapes are similar across representations. We show that aligning with conceptor negated target vectors leads to improved semantic similarity task scores. From Corollary 1, we have that, when the subspaces of the source and target embeddings are nearly identical, the singular value information of the target representations dictates the information of the aligned word representations, which predicts that conceptor negating the target representation yields good results. The results are shown in Fig 1.

Distributional Source with Other Word Representations

Distributional source with non-distributional targets. Non-distributional word representations typically encode lexical information, and thus the shape of the representation manifold is quite different from that of distributional representations, as confirmed by their differences in performance seen in Fig 3. Alignment of distributional word representations to non-distributional word representations highlights another important consequence of Lemma 1. Namely, both the singular values of the target representations and the degree of overlap in the subspaces spanned by the source and target representations matter in determining the singular value structure of the aligned representations. Importantly, the distributional and non-distributional subspaces seem to overlap only in directions with high target singular values: Fig 2 shows that singular vectors corresponding to large singular values have much higher R². From Theorem 1, then, the largest singular value of the aligned representation is lower-bounded, particularly by $\|U_X^{\top} U_{y1}\|_2 \sigma_1(Y)$, as both $\sigma_1(Y)$ and $\|U_X^{\top} U_{y1}\|_2$ are the largest among the $\sigma_i(Y)$ and the $\|U_X^{\top} U_{yi}\|_2$ respectively, resulting in highly eccentric aligned word representations. Also, the overall R² when the target embedding is not conceptor-negated is higher, because these singular vectors contribute more than when the target embedding is conceptor-negated. This suggests that the best way to combine the information from distributional and non-distributional word vectors is through concatenation instead of alignment. The sequential comparisons (Fig 4) generally show that alignment with the 'permuted target' ('Aligned by permuted Target' in the figures) or the unchanged target ('Aligned by Target' in the figures) tends to give the worst results, while aligning with the 'permuted + conceptor negated target' ('Aligned by permuted+CN Target' in the figures) typically performs much better. A major difference between the source representation and the aligned representation is the singular value structure, as the singular value structure of the source is discarded in the alignment. This explains why aligning with the 'permuted + conceptor negated target' does relatively well: the information from the target singular values is tempered, making the aligned vectors more isotropic. In contrast, aligning with the 'permuted target' or the unchanged target transmits the full information of the target singular values. Target singular value information seems most detrimental when the source and target representations cannot be aligned well. Note that although aligning with the 'permuted + conceptor negated target' performs well on semantic similarity tasks, the R² is nearly 0.

Distributional source with other targets (except non-distributional). For distributional source representations aligned with other target representations (Figs 5 - 10), the sequential comparison trends are similar: aligning the distributional word vectors with permuted target representations yields the worst results in most cases, while aligning with the unchanged target performs worse than aligning with the permuted and conceptor negated target. In some cases, aligning with a permuted and conceptor negated target yields better results than the source embedding, especially for GloVe, though the R² is very low. Results of alignments with dependency CBOW and GloVe (500 dimensions) are included in the supplementary material.
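The pattern above, where tempering the target spectrum helps and transmitting a highly uneven spectrum hurts, is what Proposition 1 predicts: stretching a vector cloud by unequal amounts decorrelates its pairwise distances. The simulation sketch below uses purely illustrative spectra; the dimensions and values are assumptions, not quantities measured from any embedding.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n_pairs = 100, 5000
s_flat = np.ones(p)                                        # isotropic stretch
s_rank1 = np.r_[np.array([100.0]), np.full(p - 1, 0.01)]   # nearly rank-1 stretch

def stretch_correlation(s):
    """Correlation of pairwise squared distances before and after stretching
    coordinate k by sqrt(s_k), for i.i.d. standard normal points."""
    Xi = rng.standard_normal((n_pairs, p))
    Xj = rng.standard_normal((n_pairs, p))
    d_before = ((Xi - Xj) ** 2).sum(axis=1)
    d_after = (((Xi - Xj) * np.sqrt(s)) ** 2).sum(axis=1)
    return np.corrcoef(d_before, d_after)[0, 1]

for s in (s_flat, s_rank1):
    # Proposition 1: corr = (sum_k s_k / sqrt(p)) / sqrt(sum_k s_k^2)
    predicted = (s.sum() / np.sqrt(p)) / np.sqrt((s ** 2).sum())
    print(f"simulated {stretch_correlation(s):.3f}   predicted {predicted:.3f}")
```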


Figure 1: Comparison of similarity task scores between distributional word vectors, the conceptor negated vectors, and aligned vectors. Here we compare word2vec (W2V), GloVe (GV) and FastText (FT).
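The similarity task scores compared in Fig 1 and the later figures are standard intrinsic evaluations: the Spearman rank correlation between cosine similarities of word pairs and human ratings. The sketch below is generic; the toy vectors and ratings are placeholders, and our experiments use the seven datasets of Faruqui and Dyer [2014].

```python
import numpy as np
from scipy.stats import spearmanr

def similarity_score(embeddings, pairs):
    """Spearman correlation between cosine similarities and human ratings.

    embeddings: dict mapping word -> 1-D numpy vector
    pairs: iterable of (word1, word2, human_rating); pairs with OOV words are skipped.
    """
    cos, gold = [], []
    for w1, w2, rating in pairs:
        if w1 in embeddings and w2 in embeddings:
            u, v = embeddings[w1], embeddings[w2]
            cos.append(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
            gold.append(rating)
    rho, _ = spearmanr(cos, gold)
    return rho

# Toy usage with made-up vectors and ratings (placeholders, not real evaluation data).
rng = np.random.default_rng(0)
emb = {w: rng.standard_normal(300) for w in ["cat", "dog", "car", "truck"]}
print(similarity_score(emb, [("cat", "dog", 8.0), ("car", "truck", 7.5), ("cat", "car", 2.0)]))
```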

Figure 2: R² of each singular vector of the non-distributional word vectors fitted by the distributional word vectors. The x-axis is the order of the singular vectors, from the largest singular value to the smallest; for example, x = 1 is the singular vector corresponding to the largest singular value, and x = 2 the singular vector corresponding to the second largest. The results when the source embedding is GloVe are nearly identical to those for word2vec, so they are not shown to avoid overlapping curves. The dashed reference lines show the overall R² of the alignments, while the dotted reference lines show the overall R² of the alignments when the target word vectors are conceptor negated.

Figure 3: Scores on the concept categorization tasks for distributional and non-distributional word vectors. The non-distributional word vectors performed worse than the distributional word vectors on all three tasks. This illustrates why distributional and non-distributional word vectors can hardly be aligned, as their magnitude information is different.

Distributional Source with Designed Noise Target

To further understand the scope of and issues with alignment, we align the distributional source representations with two types of designed noise targets: isotropic Gaussian noise and rank-1 noise. The results (Fig 11) show that aligning with Gaussian noise only minimally decreases performance relative to the source embedding, while aligning with rank-1 noise destroys information across the board. This result is a consequence of Proposition 1.

4 Conclusion

Despite the success of word representations in NLP tasks, relatively little is known about their theoretical properties. In this paper we try to aid understanding through studying the properties of word representation alignment. We show through theory and experiment that the target representations drive the information content of the singular values of the aligned representations, while the source representations preserve the subspace. In theory, a carefully designed experiment with carefully constructed tasks could help tease apart which aspects of the word representations encode which characteristics of words. We lay this groundwork here.

Further, one must take care when performing word representation alignment. While there are demonstrated benefits to alignment, they do not uniformly apply to semantic similarity tasks. Our theoretical framework provides guidance as to why alignment might work in some cases but not in others. For instance, our theory and results provide justification for why concatenating distributional and non-distributional word representations is preferable to alignment.

Establishing a theoretical foundation for understanding word representations will provide impetus for improved performance of word representations in NLP tasks. In this paper, we lay the groundwork for understanding alignment which, in addition to allowing for the integration of information across word representations, can provide a unique lens into how word representations encode information.


Figure 4: Sequential Comparison of FastText/GloVe/word2vec Aligned with Non-distributional Word Vectors

Figure 5: Sequential Comparison of FastText/GloVe/word2vec Aligned with Eigenwords

Figure 6: Sequential Comparison of FastText/GloVe/word2vec Aligned with LSA

Figure 7: Sequential Comparison of FastText/GloVe/word2vec Aligned with Dependency Word Embedding (Skip-Gram 250D)


Figure 8: Sequential Comparison of FastText/GloVe/word2vec Aligned with Dependency Word Embedding (Skip-Gram 500D)

Figure 9: Sequential Comparison of FastText/GloVe/word2vec Aligned with Dependency Word Embedding (CBOW 500D)

Figure 10: Sequential Comparison of FastText/GloVe/word2vec Aligned with Dependency Word Embedding (GloVe 500D)

Figure 11: Similarity task scores of FastText/GloVe/word2vec Aligned with Designed Noise
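For concreteness, designed-noise targets of the kind used in Fig 11 can be constructed as below; the vocabulary size, dimension, and scales are illustrative assumptions rather than the exact settings behind the figure.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, dim = 50000, 300                 # illustrative vocabulary size and dimension

# Isotropic Gaussian noise target: every direction carries equal variance.
iso_target = rng.standard_normal((n_words, dim))

# Rank-1 noise target: each "word vector" is a scalar multiple of one direction,
# so all variance lies along a single singular vector (maximally anisotropic).
direction = rng.standard_normal(dim)
rank1_target = np.outer(rng.standard_normal(n_words), direction)

def align(X, Y):
    """Least squares alignment of source X to target Y (rows in the same word order)."""
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return X @ W

# e.g. aligned_iso = align(source_vectors, iso_target)   # source_vectors: n_words x dim
```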


References

[Andreas and Klein, 2015] Jacob Andreas and Dan Klein. When and why are log-linear models self-normalizing? In Proceedings of NAACL HLT, pages 244–249, 2015.
[Arora et al., 2015] Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. Rand-walk: A latent variable model approach to word embeddings. arXiv preprint arXiv:1502.03520, 2015.
[Artetxe et al., 2016] Mikel Artetxe, Gorka Labaka, and Eneko Agirre. Learning principled bilingual mappings of word embeddings while preserving monolingual invariance. In Proceedings of EMNLP, pages 2289–2294, 2016.
[Ben-David et al., 2007] Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In NIPS, pages 137–144, 2007.
[Blitzer et al., 2011] John Blitzer, Sham Kakade, and Dean Foster. Domain adaptation with coupled subspaces. In Proceedings of AISTATS, pages 173–181, 2011.
[Bojanowski et al., 2017] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching word vectors with subword information. TACL, 5:135–146, 2017.
[Consortium, 2007] BNC Consortium. British National Corpus, version 3 (BNC XML edition), 2007.
[Dhillon et al., 2015] Paramveer S. Dhillon, Dean P. Foster, and Lyle H. Ungar. Eigenwords: Spectral word embeddings. JMLR, 16(1):3035–3078, 2015.
[Faruqui and Dyer, 2014] Manaal Faruqui and Chris Dyer. Community evaluation and exchange of word vectors at wordvectors.org. In Proceedings of ACL: System Demonstrations, pages 19–24, 2014.
[Faruqui and Dyer, 2015] Manaal Faruqui and Chris Dyer. Non-distributional word vector representations. In Proceedings of ACL, pages 464–469, Beijing, China, July 2015. Association for Computational Linguistics.
[Fernando et al., 2013] Basura Fernando, Amaury Habrard, Marc Sebban, and Tinne Tuytelaars. Unsupervised visual domain adaptation using subspace alignment. In Proceedings of the IEEE International Conference on Computer Vision, pages 2960–2967, 2013.
[Hartmann et al., 2018] Mareike Hartmann, Yova Kementchedjhieva, and Anders Søgaard. Why is unsupervised alignment of English embeddings from different algorithms so hard? In Proceedings of EMNLP, pages 582–586, Brussels, Belgium, October-November 2018. Association for Computational Linguistics.
[Lample et al., 2018] Guillaume Lample, Alexis Conneau, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. Word translation without parallel data. In ICLR, 2018.
[Landauer et al., 1998] Thomas K. Landauer, Peter W. Foltz, and Darrell Laham. An introduction to latent semantic analysis. Discourse Processes, 25(2-3):259–284, 1998.
[Levy and Goldberg, 2014a] Omer Levy and Yoav Goldberg. Dependency-based word embeddings. In Proceedings of ACL, volume 2, pages 302–308, 2014.
[Levy and Goldberg, 2014b] Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. In NIPS, pages 2177–2185, 2014.
[Liu et al., 2019a] Tianlin Liu, Lyle Ungar, and João Sedoc. Continual learning for sentence representations using conceptors. In Proceedings of NAACL HLT, 2019.
[Liu et al., 2019b] Tianlin Liu, Lyle Ungar, and João Sedoc. Unsupervised post-processing of word vectors via conceptor negation. In Proceedings of AAAI, 2019.
[Mikolov et al., 2013a] Tomas Mikolov, Quoc V. Le, and Ilya Sutskever. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168, 2013.
[Mikolov et al., 2013b] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In NIPS, pages 3111–3119, 2013.
[Mimno and Thompson, 2017a] David Mimno and Laure Thompson. The strange geometry of skip-gram with negative sampling. In Proceedings of EMNLP, pages 2873–2878, 2017.
[Mimno and Thompson, 2017b] David Mimno and Laure Thompson. The strange geometry of skip-gram with negative sampling. In Proceedings of EMNLP, pages 2873–2878, 2017.
[Mu et al., 2017] Jiaqi Mu, Suma Bhat, and Pramod Viswanath. All-but-the-top: Simple and effective post-processing for word representations. arXiv preprint arXiv:1702.01417, 2017.
[Pennington et al., 2014] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of EMNLP, pages 1532–1543, 2014.
[Shin et al., 2018] Jamin Shin, Andrea Madotto, and Pascale Fung. Interpreting word embeddings with eigenvector analysis. https://openreview.net/pdf?id=rJfJiR5ooX, 2018.
[Tsvetkov et al., 2015] Yulia Tsvetkov, Manaal Faruqui, Wang Ling, Guillaume Lample, and Chris Dyer. Evaluation of word vector representations by subspace alignment. In Proceedings of EMNLP, pages 2049–2054, 2015.
[Wang and Mahadevan, 2013] Chang Wang and Sridhar Mahadevan. Manifold alignment preserving global geometry. In IJCAI, pages 1743–1749, 2013.
[Xing et al., 2015] Chao Xing, Dong Wang, Chao Liu, and Yiye Lin. Normalized word embedding and orthogonal transform for bilingual word translation. In Proceedings of NAACL HLT, pages 1006–1011, 2015.
