Word Embeddings as Metric Recovery in Semantic Spaces

Tatsunori B. Hashimoto, David Alvarez-Melis and Tommi S. Jaakkola
CSAIL, Massachusetts Institute of Technology
{thashim, davidam, tommi}@csail.mit.edu

Abstract

Continuous word representations have been remarkably useful across NLP tasks but remain poorly understood. We ground word embeddings in semantic spaces studied in the cognitive-psychometric literature, taking these spaces as the primary objects to recover. To this end, we relate log co-occurrences of words in large corpora to semantic similarity assessments and show that co-occurrences are indeed consistent with an Euclidean semantic space hypothesis. Framing word embedding as metric recovery of a semantic space unifies existing word embedding algorithms, ties them to manifold learning, and demonstrates that existing algorithms are consistent metric recovery methods given co-occurrence counts from random walks. Furthermore, we propose a simple, principled, direct metric recovery algorithm that performs on par with the state-of-the-art word embedding and manifold learning methods. Finally, we complement recent focus on analogies by constructing two new inductive reasoning datasets—series completion and classification—and demonstrate that word embeddings can be used to solve them as well.

1 Introduction

Continuous space models of words, objects, and signals have become ubiquitous tools for learning rich representations of data, from natural language processing to computer vision. Specifically, there has been particular interest in word embeddings, largely due to their intriguing semantic properties (Mikolov et al., 2013b) and their success as features for downstream natural language processing tasks, such as named entity recognition (Turian et al., 2010) and parsing (Socher et al., 2013).

The empirical success of word embeddings has prompted a parallel body of work that seeks to better understand their properties and associated estimation algorithms, and to explore possible revisions. Recently, Levy and Goldberg (2014a) showed that the linear linguistic regularities first observed with word2vec extend to other embedding methods. In particular, explicit representations of words in terms of co-occurrence counts can be used to solve analogies in the same way. In terms of algorithms, Levy and Goldberg (2014b) demonstrated that the global minimum of the skip-gram method with negative sampling of Mikolov et al. (2013b) implicitly factorizes a shifted version of the pointwise mutual information (PMI) matrix of word-context pairs. Arora et al. (2015) explored links between random walks and word embeddings, relating them to contextual (probability ratio) analogies, under specific (isotropic) assumptions about word vectors.

In this work, we take semantic spaces studied in the cognitive-psychometric literature as the prototypical objects that word embedding algorithms estimate. Semantic spaces are vector spaces over concepts where Euclidean distances between points are assumed to indicate semantic similarities. We link such semantic spaces to word co-occurrences through semantic similarity assessments, and demonstrate that the observed co-occurrence counts indeed possess statistical properties that are consistent with an underlying Euclidean space where distances are linked to semantic similarity.


Transactions of the Association for Computational Linguistics, vol. 4, pp. 273–286, 2016. Action Editor: Scott Yih. Submission batch: 12/2015; Revision batch: 3/2016; Published 7/2016. © 2016 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

[Figure 1: three panels — Analogy, Series Completion, Classification.]

Figure 1: Inductive reasoning in semantic space proposed in Sternberg and Gardner (1983). A, B, C are given, I is the ideal point and D1-D4 are the choices. The correct answer is shaded green.
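The ideal-point constructions illustrated in Figure 1 reduce to simple vector arithmetic once words are represented as points. The following is a minimal sketch of the three inference rules; the `vocab` dictionary mapping words to NumPy vectors is a hypothetical stand-in for any embedding, not part of the paper's method:

```python
import numpy as np

def nearest(vocab, target, exclude=()):
    """Return the word whose vector is closest (in L2 distance) to `target`."""
    candidates = {w: v for w, v in vocab.items() if w not in exclude}
    return min(candidates, key=lambda w: np.linalg.norm(candidates[w] - target))

def solve_analogy(vocab, a, b, c):
    """a:b :: c:? -- the ideal point is the fourth vertex of the parallelogram."""
    return nearest(vocab, vocab[b] - vocab[a] + vocab[c], exclude={a, b, c})

def solve_series(vocab, seq):
    """Series completion -- the ideal point is a displacement from the last word."""
    step = vocab[seq[-1]] - vocab[seq[-2]]
    return nearest(vocab, vocab[seq[-1]] + step, exclude=set(seq))

def solve_classification(vocab, category, choices):
    """Classification -- the ideal point is the centroid of the category words."""
    centroid = np.mean([vocab[w] for w in category], axis=0)
    return min(choices, key=lambda w: np.linalg.norm(vocab[w] - centroid))
```

With embeddings that respect the semantic metric, `solve_analogy(vocab, 'king', 'man', 'queen')` would be expected to return 'woman' (cf. Table 1).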

Formally, we view word embedding methods as performing metric recovery. This perspective is significantly different from current approaches. Instead of aiming for representations that exhibit specific semantic properties or that perform well at a particular task, we seek methods that recover the underlying metric of the hypothesized semantic space. The clearer foundation afforded by this perspective enables us to analyze word embedding algorithms in a principled, task-independent fashion. In particular, we ask whether word embedding algorithms are able to recover the metric under specific scenarios. To this end, we unify existing word embedding algorithms as statistically consistent metric recovery methods under the theoretical assumption that co-occurrences arise from (metric) random walks over semantic spaces. The new setting also suggests a simple and direct recovery algorithm, which we evaluate and compare against other embedding methods.

The main contributions of this work can be summarized as follows:

• We ground word embeddings in semantic spaces via log co-occurrence counts. We show that PMI (pointwise mutual information) relates linearly to human similarity assessments, and that nearest-neighbor statistics (centrality and reciprocity) are consistent with an Euclidean space hypothesis (Sections 2 and 3).

• In contrast to prior work (Arora et al., 2015), we take metric recovery as the key object of study, unifying existing algorithms as consistent metric recovery methods based on co-occurrence counts from simple Markov random walks over graphs and manifolds. This strong link to manifold estimation opens a promising direction for extensions of word embedding methods (Sections 4 and 5).

• We propose and evaluate a new, principled, direct metric recovery algorithm that performs comparably to the existing state of the art on both word embedding and manifold learning tasks, and show that GloVe (Pennington et al., 2014) is closely related to the second-order Taylor expansion of our objective.

• We construct and make available two new inductive reasoning datasets¹—series completion and classification—to extend the evaluation of word representations beyond analogies, and demonstrate that these tasks can be solved with vector operations on word embeddings as well (examples in Table 1).

¹http://web.mit.edu/thashim/www/supplement materials.zip

2 Word vectors and semantic spaces

Most current word embedding algorithms build on the distributional hypothesis (Harris, 1954), where similar contexts imply similar meanings, so as to tie co-occurrences of words to their underlying meanings.

The relationship between semantics and co-occurrences has also been studied in psychometrics and cognitive science (Rumelhart and Abrahamson, 1973; Sternberg and Gardner, 1983), often by means of free word association tasks and semantic spaces. The semantic spaces, in particular, provide a natural conceptual framework for continuous representations of words as vector spaces where semantically related words are close to each other. For example, the observation that word embeddings can solve analogies was already made by Rumelhart and Abrahamson (1973) using vector representations of words derived from surveys of pairwise word similarity judgments.

A fundamental question regarding vector space models of words is whether an Euclidean vector space is a valid representation of semantic concepts. There is substantial empirical evidence in favor of this hypothesis. For example, Rumelhart and Abrahamson (1973) showed experimentally that analogical problem solving with fictitious words and human mistake rates were consistent with an Euclidean space. Sternberg and Gardner (1983) provided further evidence supporting this hypothesis, proposing that general inductive reasoning was based upon operations in metric embeddings. Using the analogy, series completion and classification tasks shown in Table 1 as testbeds, they proposed that subjects solve these problems by finding the word closest (in semantic space) to an ideal point: the vertex of a parallelogram for analogies, a displacement from the last word in series completion, and the centroid in the case of classification (Figure 1).

Task            Prompt                          Answer
Analogy         king:man::queen:?               woman
Series          penny:nickel:dime:?             quarter
Classification  horse:zebra:{deer, dog, fish}   deer

Table 1: Examples of the three inductive reasoning tasks proposed by Sternberg and Gardner (1983).

We use semantic spaces as the prototypical structures that word embedding methods attempt to uncover, and we investigate the suitability of word co-occurrence counts for doing so. In the next section, we show that co-occurrences from large corpora indeed relate to semantic similarity assessments, and that the resulting metric is consistent with an Euclidean semantic space hypothesis.

3 The semantic space of log co-occurrences

Most word embedding algorithms are based on word co-occurrence counts. In order for such methods to uncover an underlying Euclidean semantic space, we must demonstrate that co-occurrences themselves are indeed consistent with some semantic space. We must relate co-occurrences to semantic similarity assessments, on one hand, and show that they can be embedded into a Euclidean metric space, on the other. We provide here empirical evidence for both of these premises.

[Figure 2: Normalized log co-occurrence (PMI) linearly correlates with human semantic similarity judgments (MEN survey).]

We commence by demonstrating in Figure 2 that the pointwise mutual information (Church and Hanks, 1990) evaluated from co-occurrence counts has a strong linear relationship with semantic similarity judgments from survey data (Pearson's r = 0.75).² However, this suggestive linear relationship does not by itself demonstrate that log co-occurrences (with normalization) can be used to define an Euclidean metric space.

²Normalizing the log co-occurrence with the unigram frequency taken to the 3/4th power maximizes the linear correlation in Figure 2, explaining this choice of normalization in prior work (Levy and Goldberg, 2014a; Mikolov et al., 2013b).

Earlier psychometric studies have asked whether human semantic similarity evaluations are consistent with an Euclidean space. For example, Tversky and Hutchinson (1986) investigate whether concept representations are consistent with the geometric sampling (GS) model: a generative model in which points are drawn independently from a continuous distribution in an Euclidean space. They use two nearest-neighbor statistics to test agreement with this model, and conclude that certain hierarchical vocabularies are not consistent with an Euclidean embedding. Similar results are observed by Griffiths et al. (2007). We extend this embeddability analysis to lexical co-occurrences and show that semantic similarity estimates derived from these are mostly consistent with an Euclidean space hypothesis.

The first test statistic for the GS model, the centrality C, is defined as

    C = \frac{1}{n} \sum_{i=1}^{n} \Big( \sum_{j=1}^{n} N_{ij} \Big)^2

where N_{ij} = 1 iff i is j's nearest neighbor. Under the GS model (i.e. when the words are consistent with a Euclidean space representation), C ≤ 2 with high probability as the number of words n → ∞, regardless of the dimension or the underlying density (Tversky and Hutchinson, 1986). For metrically embeddable data, typical non-asymptotic values of C range between 1 and 3, while non-embeddable hierarchical structures have C > 10.

The second statistic, the reciprocity fraction R_f (Schwarz and Tversky, 1980; Tversky and Hutchinson, 1986), is defined as

    R_f = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{n} N_{ij} N_{ji}

and measures the fraction of words that are their nearest neighbor's nearest neighbor. Under the GS model, this fraction should be greater than 0.5.³

³Both R_f and C are asymptotically dimension independent because they rely only on the single nearest neighbor. Estimating the latent dimensionality requires other measures and assumptions (Kleindessner and von Luxburg, 2015).

Corpus            C     R_f
Free association  1.51  0.48
Wikipedia corpus  2.21  0.63
word2vec corpus   2.24  0.73
GloVe corpus      2.66  0.62

Table 2: Semantic similarity data derived from multiple sources show evidence of embeddability.

Table 2 shows the two statistics computed on three popular large corpora and a free word association dataset (see Section 6 for details). The nearest neighbor calculations are based on PMI. The results show a surprisingly high agreement across all three corpora, with C and R_f contained in small intervals: C ∈ [2.21, 2.66] and R_f ∈ [0.62, 0.73]. These results are consistent with Euclidean semantic spaces and the GS model in particular. The largest violators of C and R_f are consistent with Tversky's analysis: the word with the largest centrality in the non-stopword Wikipedia corpus is 'the', whose inclusion would increase C to 3.46, compared to 2.21 without it. Tversky's original analysis of semantic similarities argued that certain words, such as superordinate and function words, could not be embedded. Despite such specific exceptions, we find that for an appropriately normalized corpus, the majority of words are consistent with the GS model, and therefore can be represented meaningfully as vectors in Euclidean space.

The results of this section are an important step towards justifying the use of word co-occurrence counts as the central object of interest for semantic vector representations of words. We have shown that they are empirically related to a human notion of semantic similarity and that they are metrically embeddable, a desirable condition if we expect word vectors derived from them to truly behave as elements of a metric space. This, however, does not yet fully justify their use to derive semantic representations. The missing piece is to formalize the connection between these co-occurrence counts and some intrinsic notion of semantics, such as the semantic spaces described in Section 2. In the next two sections, we establish this connection by framing word embedding algorithms that operate on co-occurrences as metric recovery methods.

4 Semantic spaces and manifolds

We take a broader, unified view on metric recovery of semantic spaces, since the notion of semantic spaces and the associated parallelogram rule for analogical reasoning extend naturally to objects other than words. For example, images can be approximately viewed as points in an Euclidean semantic space by representing them in terms of their underlying degrees of freedom (e.g. orientation, illumination). Thus, questions about the underlying semantic spaces and how they can be recovered should be related.

The problem of recovering an intrinsic Euclidean coordinate system over objects has been specifically addressed in manifold learning. For example, methods such as Isomap (Tenenbaum et al., 2000) reconstitute an Euclidean space over objects (when possible) based on local comparisons. Intuitively, these methods assume that naive distance metrics, such as the L2 distance over pixels in an image, may be meaningful only when objects are very similar. Longer distances between objects are evaluated through a series of local comparisons. These longer distances—geodesic distances over the manifold—can be approximated by shortest paths on a neighborhood graph. If we view the geodesic distances on the manifold (represented as a graph) as semantic distances, then the goal is to isometrically embed these distances into an Euclidean space. Tenenbaum (1998) showed that such isometric embeddings of image manifolds can be used to solve "visual analogies" via the parallelogram rule.

Typical approaches to manifold learning as discussed above differ from word embedding in terms of how the semantic distances between objects are extracted. Word embeddings approximate semantic distances between words using the negative log co-occurrence counts (Section 3), while manifold learning approximates semantic distances using neighborhood graphs built from local comparisons of the original, high-dimensional points. Both views seek to estimate a latent geodesic distance.

In order to study the problem of metric recovery from co-occurrence counts, and to formalize the connection between word embedding and manifold learning, we introduce a simple random walk model over the underlying objects (e.g. words or images). This toy model permits us to establish clean consistency results for recovery algorithms. We emphasize that while the random walk is introduced over the words, it is not intended as a model of language but rather as a tool to understand the recovery problem.

4.1 Random walk model

Consider now a simple metric random walk X_t over words, where the probability of transitioning from word i to word j is given by

    P(X_t = j \mid X_{t-1} = i) = h\big( \|x_i - x_j\|_2^2 / \sigma^2 \big)    (1)

Here ||x_i − x_j||²₂ is the Euclidean distance between words in the underlying semantic space to be recovered, and h is some unknown, sub-Gaussian function linking semantic similarity to co-occurrence.⁴ Under this model, the log frequency of occurrences of word j immediately after word i will be proportional to log(h(||x_i − x_j||²₂/σ²)) as the corpus size grows large. Here we make the surprising observation that if we consider co-occurrences over a sufficiently large window, the log co-occurrence instead converges to −||x_i − x_j||²₂/σ², i.e. it directly relates to the underlying metric. Intuitively, this result is an analog of the central limit theorem for random walks. Note that, for this reason, we do not need to know the link function h.

⁴This toy model ignores the role of syntax and function words, but these factors can be included as long as the moment bounds originally derived in Hashimoto et al. (2015b) remain fulfilled.

Formally, given an m-token corpus consisting of sentences generated according to Equation 1 from a vocabulary of size n, let C^{m,n}_{ij}(t_n) be the number of times word j occurs t_n steps after word i in the corpus.⁵ We can show that there exist unigram normalizers a^{m,n}_i, b^{m,n}_j such that the following holds:

Lemma 1. Given a corpus generated by Equation 1, there exist a_i and b_j such that simultaneously over all i, j:

    \lim_{m,n \to \infty} \log\big(C^{m,n}_{ij}(t_n)\big) - a^{m,n}_i - b^{m,n}_j \to -\|x_i - x_j\|_2^2 / \sigma^2.

⁵The window size t_n depends on the vocabulary size to ensure that all word pairs have nonzero co-occurrence counts in the limit of large vocabulary and corpus. For details see the definition of g_n in Appendix A.

We defer the precise statement and conditions of Lemma 1 to Corollary 6. Conceptually, this limiting⁶ result captures the intuition that while one-step transitions in a sentence may be complex and include non-metric structure expressed in h, co-occurrences over large windows relate directly to the latent semantic metric. For ease of notation, we henceforth omit the corpus and vocabulary size descriptors m, n (using C_ij, a_i, and b_j in place of C^{m,n}_{ij}(t_n), a^{m,n}_i, and b^{m,n}_j), since in practice the corpus is large but fixed.

⁶In Lemma 1, we take m → ∞ (growing corpus size) to ensure all word pairs appear sufficiently often, and n → ∞ (growing vocabulary) to ensure that every point in the semantic space has a nearby word.

Lemma 1 serves as the basis for establishing consistency of recovery for word embedding algorithms (next section). It also allows us to establish a precise link between manifold learning and word embedding, which we describe in the remainder of this section.

4.2 Connection to manifold learning

Let {v_1, ..., v_n} ⊂ R^D be points drawn i.i.d. from a density p, where D is the dimension of the observed inputs (e.g. number of pixels, in the case of images), and suppose that these points lie on a manifold M ⊂ R^D that is isometrically embeddable into d < D dimensions, where d is the intrinsic dimensionality of the data (e.g. coordinates representing illumination or camera angle, in the case of images). The problem of manifold learning consists of finding an embedding of v_1, ..., v_n into R^d that preserves the structure of M by approximately preserving the distances between points along this manifold. In light of Lemma 1, this problem can be solved with word embedding algorithms in the following way:

1. Construct a neighborhood graph (e.g. connecting points within a distance ε) over {v_1, ..., v_n}.
2. Record the vertex sequence of a simple random walk over this graph as a sentence, and concatenate these sequences initialized at different nodes into a corpus.
3. Use a word embedding method on this corpus to generate d-dimensional vector representations of the data.

Under the conditions of Lemma 1, the negative log co-occurrences over the vertices of the neighborhood graph will converge, as n → ∞, to the geodesic distance (squared) over the manifold. In this case we will show that the globally optimal solutions of word embedding algorithms recover the low-dimensional embedding (Section 5).⁷

⁷This approach of applying random walks and word embeddings to general graphs has already been shown to be surprisingly effective for social networks (Perozzi et al., 2014), and demonstrates that word embeddings serve as a general way to connect metric random walks to embeddings.

5 Recovering semantic distances with word embeddings

We now show that, under the conditions of Lemma 1, three popular word embedding methods can be viewed as performing metric recovery from co-occurrence counts. We use this observation to derive a new, simple word embedding method inspired by Lemma 1.

5.1 Word embeddings as metric recovery

GloVe. The Global Vectors (GloVe) (Pennington et al., 2014) method for word embedding optimizes the objective function

    \min_{\hat{x}, \hat{c}, \hat{a}, \hat{b}} \sum_{i,j} f(C_{ij}) \big( 2\langle \hat{x}_i, \hat{c}_j \rangle + \hat{a}_i + \hat{b}_j - \log(C_{ij}) \big)^2

with f(C_{ij}) = min(C_{ij}, 100)^{3/4}. If we rewrite the bias terms as \hat{a}_i = a_i − ||\hat{x}_i||²₂ and \hat{b}_j = b_j − ||\hat{c}_j||²₂, we obtain the equivalent representation:

    \min_{\hat{x}, \hat{c}, a, b} \sum_{i,j} f(C_{ij}) \big( -\log(C_{ij}) - \|\hat{x}_i - \hat{c}_j\|_2^2 + a_i + b_j \big)^2.

Together with Lemma 1, we recognize this as a weighted multidimensional scaling (MDS) objective with weights f(C_{ij}). Splitting the word vector x̂_i and context vector ĉ_i is helpful in practice but not necessary under the assumptions of Lemma 1, since the true embedding x̂_i = ĉ_i = x_i/σ and a_i, b_i = 0 is a global minimum whenever dim(x̂) = d. In other words, GloVe can recover the true metric provided that we set d correctly.

word2vec. The skip-gram model of word2vec approximates a softmax objective:

    \min_{\hat{x}, \hat{c}} \; -\sum_{i,j} C_{ij} \log \Bigg( \frac{\exp(\langle \hat{x}_i, \hat{c}_j \rangle)}{\sum_{k=1}^{n} \exp(\langle \hat{x}_i, \hat{c}_k \rangle)} \Bigg).

Without loss of generality, we can rewrite the above with a bias term b̂_j by making dim(x̂) = d + 1 and setting one of the dimensions of x̂ to 1. By redefining the bias b̂_j = b_j − ||ĉ_j||²₂/2, we see that word2vec solves

    \min_{\hat{x}, \hat{c}, \hat{b}} \; -\sum_{i,j} C_{ij} \log \Bigg( \frac{\exp(-\frac{1}{2}\|\hat{x}_i - \hat{c}_j\|_2^2 + \hat{b}_j)}{\sum_{k=1}^{n} \exp(-\frac{1}{2}\|\hat{x}_i - \hat{c}_k\|_2^2 + \hat{b}_k)} \Bigg).

Since according to Lemma 1, C_{ij} / \sum_{k=1}^{n} C_{ik} approaches exp(−||x_i − x_j||²₂/σ²) / \sum_{k=1}^{n} exp(−||x_i − x_k||²₂/σ²), this is the stochastic neighbor embedding (SNE) (Hinton and Roweis, 2002) objective weighted by \sum_{k=1}^{n} C_{ik}. The global optimum is achieved by x̂_i = ĉ_i = x_i(√2/σ) and b̂_j = 0 (see Theorem 8). The negative sampling approximation used in practice behaves much like the SVD approach of Levy and Goldberg (2014b), and by applying the same stationary point analysis as they do, we show that in the absence of a bias term the true embedding is a global minimum under the additional assumption that ||x̂_i||²₂(2/σ²) = log(\sum_j C_{ij} / \sqrt{\sum_{ij} C_{ij}}) (Hinton and Roweis, 2002).

SVD. The method of Levy and Goldberg (2014b) uses the log PMI matrix, defined in terms of the unigram frequency C_i as

    M_{ij} = \log(C_{ij}) - \log(C_i) - \log(C_j) + \log\Big(\sum_j C_j\Big)

and computes the SVD of the shifted and truncated matrix (M_{ij} + τ)₊, where τ is a truncation parameter to keep M_{ij} finite. Under the limit of Lemma 1, the corpus is sufficiently large that no truncation is necessary (i.e. τ = −min(M_{ij}) < ∞). We will recover the underlying embedding if we additionally assume (1/σ)||x_i||²₂ = log(C_i / \sqrt{\sum_j C_j}), via the law of large numbers, since M_{ij} → ⟨x_i, x_j⟩ (see Theorem 7). Centering the matrix M before obtaining the SVD would relax the norm assumption, resulting exactly in classical MDS (Sibson, 1979).

5.2 Metric regression from log co-occurrences

We have shown that, by simple reparameterizations and use of Lemma 1, existing embedding algorithms can be interpreted as consistent metric recovery methods. However, the same Lemma suggests a more direct regression method for recovering the latent coordinates, which we propose here. This new embedding algorithm serves as a litmus test for our metric recovery paradigm.

Lemma 1 describes a log-linear relationship between distance and co-occurrences. The canonical way to fit this relationship would be to use a generalized linear model, where the co-occurrences follow a negative binomial distribution, C_{ij} ∼ NegBin(θ, p), where p = θ/[θ + exp(−½||x_i − x_j||²₂ + a_i + b_j)]. Under this overdispersed log-linear model,

    E[C_{ij}] = \exp\big(-\tfrac{1}{2}\|x_i - x_j\|_2^2 + a_i + b_j\big)
    Var(C_{ij}) = E[C_{ij}]^2/\theta + E[C_{ij}].

Here, the parameter θ controls the contribution of large C_{ij}, and is akin to GloVe's f(C_{ij}) weight function. Fitting this model is straightforward if we define the log-likelihood in terms of the expected rate λ_{ij} = exp(−½||x_i − x_j||²₂ + a_i + b_j) as:

    LL(x, a, b, \theta) = \sum_{i,j} \Big[ \theta \log(\theta) - \theta \log(\lambda_{ij} + \theta) + C_{ij} \log\Big(1 - \frac{\theta}{\lambda_{ij} + \theta}\Big) + \log\Big(\frac{\Gamma(C_{ij} + \theta)}{\Gamma(\theta)\,\Gamma(C_{ij} + 1)}\Big) \Big]

To generate word embeddings, we minimize the negative log-likelihood using stochastic gradient descent. The implementation mirrors that of GloVe: it randomly selects word pairs i, j and attracts or repulses the vectors x̂_i and ĉ_j in order to achieve the relationship in Lemma 1. Implementation details are provided in Appendix C.

Relationship to GloVe. The overdispersion parameter θ in our metric regression model sheds light on the role of GloVe's weight function f(C_{ij}). Taking the Taylor expansion of the log-likelihood at log(λ_{ij}) ≈ log(C_{ij}), we have

    LL(x, a, b, \theta) = \sum_{i,j} \Big[ k_{ij} - \frac{C_{ij}\,\theta}{2(C_{ij} + \theta)} (u_{ij})^2 + O\big((u_{ij})^3\big) \Big],

where u_{ij} = (log λ_{ij} − log C_{ij}) and k_{ij} does not depend on x. Note the similarity of the second-order term with the GloVe objective. As C_{ij} grows, the weight functions C_{ij}θ/(2(C_{ij} + θ)) and f(C_{ij}) = min(C_{ij}, x_max)^{3/4} converge to θ/2 and x_max^{3/4} respectively, down-weighting large co-occurrences.

6 Empirical validation

We will now experimentally validate two aspects of our theory: the semantic space hypothesis (Sections 2 and 3), and the correspondence between word embedding and manifold learning (Sections 4 and 5). Our goal with this empirical validation is not to find the absolute best method and evaluation metric for word embeddings, which has been studied before (e.g. Levy et al. (2015)). Instead, we provide empirical evidence in favor of the semantic space hypothesis, and show that our simple algorithm for metric recovery is competitive with the state-of-the-art on both semantic induction tasks and manifold learning. Since metric regression naturally operates over integer co-occurrences, we use co-occurrences over unweighted windows for it and—for fairness—for the other methods (see Appendix C for details).

6.1 Datasets

Corpus and training: We used three different corpora for training: a Wikipedia snapshot of 03/2015 (2.4B tokens), the original word2vec corpus (Mikolov et al., 2013a) (6.4B tokens), and a combination of Wikipedia with Gigaword5 emulating GloVe's corpus (Pennington et al., 2014) (5.8B tokens). We preprocessed all corpora by removing punctuation and numbers and lower-casing all the text. The vocabulary was restricted to the 100K most frequent words in each corpus. We trained embeddings using four methods: word2vec, GloVe, randomized SVD,⁸ and metric regression (referred to as regression). Full implementation details are provided in the Appendix.

⁸We used randomized, rather than full, SVD due to the difficulty of scaling SVD to this problem size. For the performance of full SVD factorizations see Levy et al. (2015).

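To make the regression objective concrete, the following toy sketch fits the negative binomial model of Section 5.2 by gradient descent on a small dense count matrix. The dense double loop and all hyperparameter values are illustrative assumptions only; the paper's actual implementation is GloVe-style stochastic sampling over word pairs (Appendix C):

```python
import numpy as np

def fit_metric_regression(C, dim=2, theta=1.0, lr=0.05, epochs=500, seed=0):
    """Fit vectors x, c and biases a, b so that, per Lemma 1,
    log E[C_ij] ~ -0.5 * ||x_i - c_j||^2 + a_i + b_j,
    by descending the negative binomial log-likelihood of Section 5.2."""
    rng = np.random.default_rng(seed)
    n = C.shape[0]
    x = rng.normal(scale=0.1, size=(n, dim))
    c = rng.normal(scale=0.1, size=(n, dim))
    a, b = np.zeros(n), np.zeros(n)
    for _ in range(epochs):
        for i in range(n):
            for j in range(n):
                diff = x[i] - c[j]
                lam = np.exp(-0.5 * diff @ diff + a[i] + b[j])
                # d(-LL)/d(log lam) for the NegBin(theta, p) likelihood:
                r = (C[i, j] + theta) * lam / (lam + theta) - C[i, j]
                # chain rule: d(log lam)/dx_i = -diff, /dc_j = diff, /da_i = /db_j = 1
                x[i] -= lr * (-r) * diff
                c[j] -= lr * r * diff
                a[i] -= lr * r
                b[j] -= lr * r
    return x, c, a, b
```

Note that for large λ the residual r is bounded above by θ, which is exactly the down-weighting of large co-occurrences discussed in the GloVe comparison above.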
279 Google Semantic Google Syntactic Google Total SAT Classification Sequence

Method L2 Cos L2 Cos L2 Cos L2 Cos L2 Cos L2 Cos Regression 75.5 78.4 70.9 70.8 72.6 73.7 39.2 37.8 87.6 84.6 58.3 59.0 GloVe 71.1 76.4 68.6 71.9 69.6 73.7 36.9 35.5 74.6 80.1 53.0 58.9 SVD 50.9 58.1 51.4 52.0 51.2 54.3 32.7 24.0 71.6 74.1 49.4 47.6 word2vec 71.4 73.4 70.9 73.3 71.1 73.3 42.0 42.0 76.4 84.6 54.4 56.2

Table 3: Accuracies on Google, SAT analogies and on two new inductive reasoning tasks. Manifold Learning Word Embedding the lack of publicly available datasets, we generated Semantic 83.3 70.7 our own questions using WordNet (Fellbaum, 1998) Syntactic 8.2 76.9 relations and word-word PMI values. These datasets Total 51.4 73.4 were constructed before training the embeddings, so Table 4: Semantic similarity alone can solve the as to avoid biasing them towards any one method. Google analogy tasks For the classification task, we created in-category words by selecting words from WordNet relations For fairness we fix all hyperparameters, and de- associated to root words, from which we pruned velop and test the code for metric regression exclu- to four words based on PMI-similarity to the other sively on the first 1GB subset of the wiki dataset. words in the class. Additional options for the mul- For open-vocabulary tasks, we restrict the set of an- tiple choice questions were created searching over swers to the top 30K words, since this improves per- words related to the root by a different relation type, formance while covering the majority of the ques- and selecting those most similar to the root. tions. In the following, we show performance for the For the sequence completion task, we obtained GloVe corpus throughout but include results for all WordNet trees of various relation types, and pruned corpora along with our code package. these based on similarity to the root word to obtain the sequence. For the multiple-choice questions, we Evaluation tasks: We test the quality of the word proceeded as before to select additional (incorrect) embeddings on three types of inductive tasks: analo- options of a different relation type to the root. gies, sequence completion and classification (Figure After pruning, we obtain 215 classification ques- 1). 
For the analogies, we used the standard open- tions and 220 sequence completion questions, of vocabulary analogy task of Mikolov et al. (2013a) which 51 are open-vocabulary and 169 are multiple (henceforth denoted Google), consisting of 19,544 choice. These two new datasets are available9. semantic and syntactic questions. In addition, we use the more difficult SAT analogy dataset (ver- 6.2 Results on inductive reasoning tasks sion 3) (Turney and Littman, 2005), which contains Solving analogies using survey data alone: We 374 questions from actual exams and guidebooks. demonstrate that, surprisingly, word embeddings Each question consists of 5 exemplar pairs of words trained directly on semantic similarity derived from word1:word2, where the same relation holds for all survey data can solve analogy tasks. Extending a pairs. The task is to pick from among another five study by Rumelhart and Abrahamson (1973), we pairs of words the one that best fits the category im- use a free-association dataset (Nelson et al., 2004) plicitly defined by the exemplars. to construct a similarity graph, where vertices cor- Inspired by Sternberg and Gardner (1983), we respond to words and the weights w are given by propose two new difficult inductive reasoning tasks ij the number of times word j was considered most beyond analogies to verify the semantic space hy- similar to word i in the survey. We take the largest pothesis: sequence completion and classification. connected component of this graph (consisting of As described in Section 2, the former involves 4845 words and 61570 weights) and embed it us- choosing the next step in a semantically coherent ing Isomap for which squared edge distances are de- sequence of words (e.g. hour, minute, . . .), and fined as log(w / max (w )). We use the result- the latter consists of selecting an element within the − ij kl kl same category out of five possible choices. Given 9http://web.mit.edu/thashim/www/supplement materials.zip

We use the resulting vectors as word embeddings to solve the Google analogy task. The results in Table 4 show that embeddings obtained with Isomap on survey data can outperform the corpus-based metric regression vectors on semantic, but not syntactic tasks. We hypothesize that free-association surveys capture semantic, but not syntactic, similarity between words.

Analogies: The results on the Google analogies shown in Table 3 demonstrate that our proposed framework of metric regression and L2 distance is competitive with the baseline of word2vec with cosine distance. The performance gap across methods is small and fluctuates across corpora, but metric regression consistently outperforms GloVe on most tasks and outperforms all methods on semantic analogies, while word2vec does better on syntactic categories. For the SAT dataset, the L2 distance performs better than the cosine similarity, and we find word2vec to perform best, followed by metric regression. The results on these two analogy datasets show that directly embedding the log co-occurrence metric and taking L2 distances between vectors is competitive with current approaches to analogical reasoning.

Sequence and classification tasks: As predicted by the semantic field hypothesis, word embeddings perform well on the two novel inductive reasoning tasks (Table 3). Again, we observe that metric recovery with metric regression coupled with L2 distance consistently performs as well as, and often better than, the current state-of-the-art word embedding methods on these two additional semantic datasets.

6.3 Word embeddings can embed manifolds

In Section 4 we proposed a reduction for solving manifold learning problems with word embeddings, which we show achieves comparable performance to manifold learning methods. We now test this relation by performing nonlinear dimensionality reduction on the MNIST digit dataset, reducing from D = 256 to two dimensions. Using a four-thousand image subset, we construct a k-nearest neighbor graph (k = 20) and generate 10 simple random walks of length 200 starting from each vertex in the graph, resulting in 40,000 sentences of length 200 each. We compare the four word embedding methods against standard dimensionality reduction methods: PCA, Isomap, SNE and t-SNE. We evaluate the methods by clustering the resulting low-dimensional data and computing cluster purity, measured as the percentage of 5-nearest neighbors having the same digit label.

Figure 3: Dimensionality reduction using word embedding and manifold learning. Performance is quantified by percentage of 5-nearest neighbors sharing the same digit label.

The resulting embeddings, shown in Figure 3, demonstrate that metric regression is highly effective at this task, outperforming metric SNE and beaten only by t-SNE (91% cluster purity), which is a visualization method specifically designed to preserve cluster separation. All word embedding methods, including SVD (68%), embed the MNIST digits remarkably well and outperform baselines of PCA (48%) and Isomap (49%).

7 Discussion

Our work recasts word embedding as a metric recovery problem pertaining to the underlying semantic space. We use co-occurrence counts from random walks as a theoretical tool to demonstrate that existing word embedding algorithms are consistent metric recovery methods. Our direct regression method is competitive with the state of the art on various semantic tasks, including two new inductive reasoning problems of series completion and classification.

Our framework highlights the strong interplay and common foundation between word embedding methods and manifold learning, suggesting several avenues for recovering vector representations of phrases and sentences via properly defined Markov processes and their generalizations.
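The reduction in Section 6.3 can be sketched end to end: build a k-nearest-neighbor graph, emit simple random walks as "sentences" (which any word embedding method could then consume), and score a layout by 5-nearest-neighbor label purity. The sketch below uses two synthetic clusters standing in for MNIST; all sizes and names are illustrative, and the embedding-training step itself is omitted:

```python
import random

def knn_graph(points, k):
    """Adjacency list: each point links to its k nearest neighbors (L2)."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [sorted(range(len(points)), key=lambda j: (d2(points[i], points[j]), j))[1:k + 1]
            for i in range(len(points))]

def random_walks(adj, walks_per_vertex, length, rng):
    """Simple random walks over the graph, emitted as 'sentences' of vertex ids."""
    sentences = []
    for start in range(len(adj)):
        for _ in range(walks_per_vertex):
            walk, v = [start], start
            for _ in range(length - 1):
                v = rng.choice(adj[v])
                walk.append(v)
            sentences.append(walk)
    return sentences

def knn_purity(points, labels, k=5):
    """Fraction of each point's k nearest neighbors sharing its label."""
    adj = knn_graph(points, k)
    hits = sum(labels[j] == labels[i] for i in range(len(points)) for j in adj[i])
    return hits / (k * len(points))

rng = random.Random(0)
# Two well-separated synthetic 'digit' clusters (toy stand-in for MNIST).
points = [(rng.gauss(0, 0.1), rng.gauss(0, 0.1)) for _ in range(20)] + \
         [(rng.gauss(5, 0.1), rng.gauss(5, 0.1)) for _ in range(20)]
labels = [0] * 20 + [1] * 20

sentences = random_walks(knn_graph(points, k=5), walks_per_vertex=2, length=10, rng=rng)
print(len(sentences), knn_purity(points, labels))  # 80 1.0
```

Because the walks stay inside each well-separated cluster, even the raw points score perfect 5-NN purity here; on real MNIST the embedding step is what recovers this structure in two dimensions.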

Appendix

A Metric recovery from Markov processes on graphs and manifolds

Consider an infinite sequence of points $\mathcal{X}_n = \{x_1, \dots, x_n\}$, where the $x_i$ are sampled i.i.d. from a density $p(x)$ over a compact Riemannian manifold equipped with a geodesic metric $\rho$. For our purposes, $p(x)$ should have a bounded log-gradient and a strict lower bound $p_0$ over the manifold. The random walks we consider are over unweighted spatial graphs, defined as:

Definition 2 (Spatial graph). Let $\sigma_n : \mathcal{X}_n \to \mathbb{R}_{>0}$ be a local scale function and $h : \mathbb{R}_{\ge 0} \to [0,1]$ a piecewise continuous function with sub-Gaussian tails. A spatial graph $G_n$ corresponding to $\sigma_n$ and $h$ is a random graph with vertex set $\mathcal{X}_n$ and a directed edge from $x_i$ to $x_j$ with probability $p_{ij} = h(\rho(x_i, x_j)^2 / \sigma_n(x_i)^2)$.

Simple examples of spatial graphs where the connectivity is not random include the $\varepsilon$-ball graph ($\sigma_n(x) = \varepsilon$) and the $k$-nearest neighbor graph ($\sigma_n(x)$ = distance to the $k$-th neighbor).

Log co-occurrences and the geodesic will be connected in two steps: (1) we use known results to show that a simple random walk over the spatial graph, properly scaled, behaves similarly to a diffusion process; (2) the log-transition probability of a diffusion process will be related to the geodesic metric on a manifold.

(1) The limiting random walk on a graph: Just as the simple random walk over the integers converges to a Brownian motion, we may expect that under specific constraints the simple random walk $X_t^n$ over the graph $G_n$ will converge to some well-defined continuous process. We require that the scale functions converge a.s. to a continuous function $\bar\sigma$ ($\sigma_n(x) g_n^{-1} \to \bar\sigma(x)$), and that the size of a single step vanishes ($g_n \to 0$) while still containing at least a polynomial number of points within $\sigma_n(x)$ ($g_n n^{1/(d+2)} \log(n)^{-1/(d+2)} \to \infty$). Under this limit, our assumptions about the density $p(x)$, and regularity of the transitions10, the following holds:

Theorem 3 ((Hashimoto et al., 2015b; Ting et al., 2011)). The simple random walk $X_t^n$ on $G_n$ converges in Skorokhod space $D([0,\infty), D)$ after a time scaling $\hat t = t g_n^2$ to the Ito process $\hat Y_{\hat t}$ valued in $C([0,\infty), D)$ as $X^n_{\hat t g_n^{-2}} \to \hat Y_{\hat t}$. The process $\hat Y_t$ is defined over the normal coordinates of the manifold $(D, g)$ with reflecting boundary conditions on $D$ as

  $d\hat Y_t = \nabla \log(p(\hat Y_t))\, \sigma(\hat Y_t)^2\, dt + \sigma(\hat Y_t)\, dW_t$   (2)

The equicontinuity constraint on the marginal densities of the random walk implies that the transition density for the random walk converges to its continuum limit.

Lemma 4 (Convergence of marginal densities; Hashimoto et al., 2015a). Let $x_0$ be some point in our domain $\mathcal{X}_n$ and define the marginal densities $q_{\hat t}(x) = P(\hat Y_{\hat t} = x \mid \hat Y_0 = x_0)$ and $q^n_{t_n}(x) = P(X^n_{t_n} = x \mid X^n_0 = x_0)$. If $t_n g_n^2 = \hat t = \Theta(1)$, then under condition (?) and the results of Theorem 3 such that $X^n_{t_n} \to \hat Y_{\hat t}$ weakly, we have

  $\lim_{n \to \infty} n\, q^n_{t_n}(x) = q_{\hat t}(x)\, p(x)^{-1}.$

(2) Log transition probability as a metric: We may now use the stochastic process $\hat Y_t$ to connect the log transition probability to the geodesic distance using Varadhan's large deviation formula.

Theorem 5 ((Varadhan, 1967; Molchanov, 1975)). Let $\hat Y_t$ be an Ito process defined over a complete Riemann manifold $(D, g)$ with geodesic distance $\rho(x_i, x_j)$; then

  $\lim_{t \to 0} -t \log(P(\hat Y_t = x_j \mid \hat Y_0 = x_i)) \to \rho(x_i, x_j)^2.$

This estimate holds more generally for any space admitting a diffusive stochastic process (Saloff-Coste, 2010). Taken together, we finally obtain:

Corollary 6 (Varadhan's formula on graphs). For any $\delta, \gamma, n_0$ there exists some $\hat t$, $n > n_0$, and sequence $b_j$ such that the following holds for the simple random walk $X_t^n$:

  $P\Big( \sup_{x_i, x_j \in \mathcal{X}_n} \big| -\hat t \log\big(P(X^n_{\hat t g_n^{-2}} = x_j \mid X^n_0 = x_i)\big) - b_j \hat t - \rho_{\sigma(x)}(x_i, x_j)^2 \big| > \delta \Big) < \gamma$

10 For $t = \Theta(g_n^{-2})$, the marginal distribution $nP(X_t \mid X_0)$ must be a.s. uniformly equicontinuous. For undirected spatial graphs, this is always true (Croydon and Hambly, 2008), but for directed graphs it is an open conjecture from (Hashimoto et al., 2015b).

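Varadhan's formula (Theorem 5) can be sanity-checked numerically on the simplest spatial graph, the integer lattice, where the heat kernel of the simple random walk satisfies $-\log P(X_t = x \mid X_0 = 0) \approx x^2/(2t) + \text{const}$ for $|x| \ll t$: doubling the distance should roughly quadruple the log-probability gap. This is an illustrative sketch of ours, not part of the paper's derivation:

```python
import math

def walk_distribution(t):
    """Exact {position: probability} of a simple random walk after t steps,
    computed by dynamic programming."""
    probs = {0: 1.0}
    for _ in range(t):
        nxt = {}
        for x, p in probs.items():
            nxt[x - 1] = nxt.get(x - 1, 0.0) + 0.5 * p
            nxt[x + 1] = nxt.get(x + 1, 0.0) + 0.5 * p
        probs = nxt
    return probs

t = 100
p = walk_distribution(t)

# Log-probability gap relative to the origin; Varadhan-type scaling
# predicts gap(x) ~ x^2 / (2t), i.e. quadratic in the distance.
gap = lambda x: -math.log(p[x]) + math.log(p[0])

ratio = gap(20) / gap(10)  # should be close to 4 (quadratic scaling)
print(round(gap(10), 3), round(gap(20), 3), round(ratio, 2))
```

The small deviation from exactly 4 reflects the higher-order corrections to the Gaussian heat-kernel approximation at finite $t$.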
where $\rho_{\sigma(x)}$ is the geodesic defined as

  $\rho_{\sigma(x)}(x_i, x_j) = \min_{f \in C^1 : f(0) = x_i,\, f(1) = x_j} \int_0^1 \sigma(f(t))\, dt.$

Proof. The proof is in two parts. First, by Varadhan's formula (Theorem 5; Molchanov, 1975, Eq. 1.7), for any $\delta_1 > 0$ there exists some $\hat t$ such that

  $\sup_{y, y_0 \in D} \big| -\hat t \log(P(\hat Y_{\hat t} = y_0 \mid \hat Y_0 = y)) - \rho_{\sigma(x)}(y_0, y)^2 \big| < \delta_1.$

The uniform equicontinuity of the marginals implies their uniform convergence (Lemma S4), so for any $\delta_2 > 0$ and $\gamma_0$ there exists an $n$ such that

  $P\Big( \sup_{x_i, x_j \in \mathcal{X}_n} \big| P(\hat Y_{\hat t} = x_j \mid \hat Y_0 = x_i) - n p(x_j) P(X^n_{g_n^{-2} \hat t} = x_j \mid X^n_0 = x_i) \big| > \delta_2 \Big) < \gamma_0.$

By the lower bound on $p$ and compactness of $D$, $P(\hat Y_{\hat t} \mid \hat Y_0)$ is lower bounded by some strictly positive constant $c$, and we can apply uniform continuity of $\log(x)$ over $(c, \infty)$ to get that for some $\delta_3$ and $\gamma$,

  $P\Big( \sup_{x_i, x_j \in \mathcal{X}_n} \big| \log(P(\hat Y_{\hat t} = x_j \mid \hat Y_0 = x_i)) - \log(n p(x_j)) - \log(P(X^n_{g_n^{-2} \hat t} = x_j \mid X^n_0 = x_i)) \big| > \delta_3 \Big) < \gamma.$   (3)

Finally we have the bound

  $P\Big( \sup_{x_i, x_j \in \mathcal{X}_n} \big| -\hat t \log(P(X^n_{g_n^{-2} \hat t} = x_j \mid X^n_0 = x_i)) - \hat t \log(n p(x_j)) - \rho_{\sigma(x)}(x_i, x_j)^2 \big| > \delta_1 + \hat t \delta_3 \Big) < \gamma.$

To combine the bounds, given some $\delta$ and $\gamma$, set $b_j = \log(n p(x_j))$, pick $\hat t$ such that $\delta_1 < \delta/2$, then pick $n$ such that the bound in Eq. 3 holds with probability $\gamma$ and error $\delta_3 < \delta/(2\hat t)$.

B Consistency proofs for word embedding

Lemma 7 (Consistency of SVD). Assume the norm of the latent embedding is proportional to the unigram frequency,

  $\|x_i\|^2 / \sigma^2 = C_i / \big(\sum_j C_j\big)^{1/2}.$

Let $X$ be the embedding derived from the SVD of $M_{ij}$ as

  $2XX^T = M_{ij} = \log(C_{ij}) - \log C_i - \log C_j + \log\big(\sum_i C_i\big) + \tau.$

Then there exists a $\tau$ such that this embedding is close to the true embedding, under the same equivalence class as Lemma S7:

  $P\Big( \sum_i \big\| A x_i / \sigma - \hat x_i \big\|_2^2 > \delta \Big) < \epsilon.$

Proof. By Corollary 6, for any $\delta_1 > 0$ and $\epsilon_1 > 0$ there exists an $m$ such that

  $P\Big( \sup_{i,j} \big| \log(C_{ij}) - \big(-\|x_i - x_j\|_2^2 / \sigma^2\big) - \log(mc) \big| > \delta_1 \Big) < \epsilon_1.$

Now additionally, if $C_i / \big(\sum_j C_j\big)^{1/2} = \|x_i\|^2 / \sigma^2$, then we can rewrite the above bound as

  $P\Big( \sup_{i,j} \big| \log(C_{ij}) - \log(C_i) - \log(C_j) + \log\big(\sum_i C_i\big) - 2\langle x_i, x_j \rangle / \sigma^2 - \log(mc) \big| > \delta_1 \Big) < \epsilon_1,$

and therefore

  $P\Big( \sup_{i,j} \big| M_{ij} - 2\langle x_i, x_j \rangle / \sigma^2 - \log(mc) \big| > \delta_1 \Big) < \epsilon_1.$

Given that the dot product matrix has error at most $\delta_1$, the resulting embedding is known to have at most $\sqrt{\delta_1}$ error (Sibson, 1979). This completes the proof, since we can pick $\tau = -\log(mc)$, $\delta_1 = \delta$ and $\epsilon_1 = \epsilon$.

Theorem 8 (Consistency of softmax/word2vec). Define the softmax objective function with bias as

  $g(\hat x, \hat c, \hat b) = \sum_{ij} C_{ij} \log\Big( \frac{\exp(-\|\hat x_i - \hat c_j\|_2^2 + \hat b_j)}{\sum_{k=1}^n \exp(-\|\hat x_i - \hat c_k\|_2^2 + \hat b_k)} \Big).$

Define $\hat x_m, \hat c_m, \hat b_m$ as the global minima of the above objective function for a co-occurrence $C_{ij}$ over a corpus of size $m$. For any $\epsilon > 0$ and $\delta > 0$ there exists some $m$ such that

  $P\big( \big| g(\tfrac{x}{\sigma}, \tfrac{x}{\sigma}, 0) - g(\hat x_m, \hat c_m, \hat b_m) \big| > \delta \big) < \epsilon.$

Proof. By differentiation, any objective of the form

  $\min_{\lambda_{ij}} \sum C_{ij} \log\Big( \frac{\exp(-\lambda_{ij})}{\sum_k \exp(-\lambda_{ik})} \Big)$

has the minima $\lambda^*_{ij} = -\log(C_{ij}) + a_i$, up to un-identifiable $a_i$, with objective function value

$\sum_{ij} C_{ij} \log\big(C_{ij} / \sum_k C_{ik}\big)$. This gives a global function lower bound

  $g(\hat x_m, \hat c_m, \hat b_m) \ge \sum_{ij} C_{ij} \log\Big( \frac{C_{ij}}{\sum_k C_{ik}} \Big).$   (4)

Now consider the function value of the true embedding $x/\sigma$:

  $g(\tfrac{x}{\sigma}, \tfrac{x}{\sigma}, 0) = \sum_{ij} C_{ij} \log\Big( \frac{\exp(-\frac{1}{\sigma^2}\|x_i - x_j\|_2^2)}{\sum_k \exp(-\frac{1}{\sigma^2}\|x_i - x_k\|_2^2)} \Big) = \sum_{ij} C_{ij} \log\Big( \frac{\exp(\log(C_{ij}) + \delta_{ij} + a_i)}{\sum_k \exp(\log(C_{ik}) + \delta_{ik} + a_i)} \Big).$

We can bound the error variables $\delta_{ij}$ using Corollary 6 as $\sup_{ij} |\delta_{ij}| < \delta_0$ with probability $\epsilon_0$ for sufficiently large $m$, with $a_i = \log(m_i) - \log\big(\sum_{k=1}^n \exp(-\|x_i - x_k\|_2^2/\sigma^2)\big)$. Taking the Taylor expansion at $\delta_{ij} = 0$, we have

  $g(\tfrac{x}{\sigma}, \tfrac{x}{\sigma}, 0) = \sum_{ij} C_{ij} \log\Big( \frac{C_{ij}}{\sum_k C_{ik}} \Big) + \sum_{ij} C_{ij} \sum_{l=1}^n \frac{C_{il}}{\sum_k C_{ik}} \delta_{il} + o(\|\delta\|_2).$

By the law of large numbers of $C_{ij}$,

  $P\Big( \big| g(\tfrac{x}{\sigma}, \tfrac{x}{\sigma}, 0) - \sum_{ij} C_{ij} \log\big(\tfrac{C_{ij}}{\sum_k C_{ik}}\big) \big| > n\delta_0 \Big) < \epsilon_0,$

which combined with (4) yields

  $P\big( \big| g(\tfrac{x}{\sigma}, \tfrac{x}{\sigma}, 0) - g(\hat x, \hat c, \hat b) \big| > n\delta_0 \big) < \epsilon_0.$

To obtain the original theorem statement, take $m$ to fulfil $\delta_0 = \delta/n$ and $\epsilon_0 = \epsilon$.

Note that for word2vec with negative sampling, applying the stationary point analysis of Levy and Goldberg (2014b) combined with the analysis in Lemma S7 shows that the true embedding is a global minimum.

C Empirical evaluation details

C.1 Implementation details

We used off-the-shelf implementations of word2vec11 and GloVe12. The two other methods, (randomized) SVD and regression embedding, are both implemented on top of the GloVe codebase. We used 300-dimensional vectors and window size 5 in all models. Further details are provided below.

word2vec. We used the skip-gram version with 5 negative samples, 10 iterations, α = 0.025 and frequent word sub-sampling with a parameter of 10^-3.

GloVe. We disabled GloVe's corpus weighting, since this generally produced superior results. The default step sizes result in NaN-valued embeddings, so we reduced them. We used XMAX = 100, η = 0.01 and 10 iterations.

SVD. For the SVD algorithm of Levy and Goldberg (2014b), we use the GloVe co-occurrence counter combined with a parallel randomized-projection SVD factorizer based upon the redsvd library13, due to memory and runtime constraints. Following Levy et al. (2015), we used the square-root factorization, no negative shifts (τ = 0 in our notation), and 50,000 random projections.

Regression Embedding. We use standard SGD with two differences. First, we drop co-occurrence values with probability proportional to 1 - C_ij/10 when C_ij < 10, and scale the gradient, which resulted in training-time speedups with no loss in accuracy. Second, we use an initial line-search step combined with a linear step-size decay by epoch. We use θ = 50, and η is line-searched starting at η = 10.

C.2 Solving inductive reasoning tasks

The ideal point for a task is defined below:

- Analogies: Given A:B::C, the ideal point is given by B - A + C (parallelogram rule).
- Analogies (SAT): Given prototype A:B and candidates C1:D1 ... Cn:Dn, we compare Di - Ci to the ideal point B - A.
- Categories: Given a category implied by w1, ..., wn, the ideal point is I = (1/n) Σ_{i=1}^n wi.
- Sequence: Given a sequence w1 : ... : wn, we compute the ideal point as I = wn + (1/n)(wn - w1).

Once we have the ideal point I, we pick the answer as the word closest to I among the options, using L2 or cosine distance. For the latter, we normalize I to unit norm before taking the cosine distance. For L2 we do not apply any normalization.

11 http://code.google.com/p/word2vec
12 http://nlp.stanford.edu/projects/glove
13 https://github.com/ntessore/redsvd-h
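The ideal-point rules above amount to a few lines of vector arithmetic. The toy embedding below is hypothetical, purely to illustrate the computation:

```python
import math

def add(u, v): return [a + b for a, b in zip(u, v)]
def sub(u, v): return [a - b for a, b in zip(u, v)]
def scale(u, s): return [a * s for a in u]

def ideal_analogy(a, b, c):
    # A:B::C -> B - A + C (parallelogram rule)
    return add(sub(b, a), c)

def ideal_category(words):
    # mean of the given category members
    return scale([sum(x) for x in zip(*words)], 1.0 / len(words))

def ideal_sequence(words):
    # w_n + (1/n)(w_n - w_1)
    n = len(words)
    return add(words[-1], scale(sub(words[-1], words[0]), 1.0 / n))

def answer(ideal, options):
    # pick the option vector closest to the ideal point (L2, no normalization)
    return min(options, key=lambda name: math.dist(options[name], ideal))

# Hypothetical 2-D embedding (not from any trained model).
vec = {"man": [1, 0], "woman": [1, 1], "king": [3, 0], "queen": [3, 1],
       "apple": [9, 9]}

ideal = ideal_analogy(vec["man"], vec["woman"], vec["king"])
print(answer(ideal, {w: vec[w] for w in ["queen", "apple"]}))  # queen
```

For the cosine-distance variant, one would normalize the ideal point to unit norm before comparing, as described above.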

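The co-occurrence subsampling used in the regression embedding of Section C.1 (drop an entry with probability 1 - C_ij/10 when C_ij < 10 and scale its gradient) is an inverse-probability weighting: each kept entry is up-weighted by the reciprocal of its keep probability, so the expected gradient contribution is unchanged. A minimal simulation of this unbiasedness (our own illustration, not the paper's code):

```python
import random

def subsample_weight(c, threshold=10.0):
    """Keep a co-occurrence with probability min(1, c/threshold); if kept,
    return the inverse-probability weight so the expectation is unbiased."""
    keep_prob = min(1.0, c / threshold)
    if random.random() < keep_prob:
        return 1.0 / keep_prob
    return 0.0  # entry dropped this pass

random.seed(0)
c = 3.0  # a rare co-occurrence count (< 10), kept ~30% of the time
trials = 100_000
mean_weight = sum(subsample_weight(c) for _ in range(trials)) / trials

# Unbiased: the average applied weight is ~1, so gradients keep their
# expected magnitude while ~70% of rare entries are skipped.
print(round(mean_weight, 2))
```

Counts at or above the threshold are always kept with weight 1, so only the long tail of rare pairs is thinned out.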
References

Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. 2015. Random walks on context spaces: Towards an explanation of the mysteries of semantic word embeddings. arXiv preprint arXiv:1502.03520.

Kenneth Ward Church and Patrick Hanks. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29.

David A Croydon and Ben M Hambly. 2008. Local limit theorems for sequences of simple random walks on graphs. Potential Analysis, 29(4):351–389.

Christiane Fellbaum, editor. 1998. WordNet: an electronic lexical database.

Thomas L Griffiths, Mark Steyvers, and Joshua B Tenenbaum. 2007. Topics in semantic representation. Psychological Review, 114(2):211.

Zellig S Harris. 1954. Distributional structure. Word, 10(23):146–162.

Tatsunori Hashimoto, Yi Sun, and Tommi Jaakkola. 2015a. From random walks to distances on unweighted graphs. In Advances in Neural Information Processing Systems.

Tatsunori Hashimoto, Yi Sun, and Tommi Jaakkola. 2015b. Metric recovery from directed unweighted graphs. In Artificial Intelligence and Statistics, pages 342–350.

Geoffrey E Hinton and Sam T Roweis. 2002. Stochastic neighbor embedding. In Advances in Neural Information Processing Systems, pages 833–840.

Matthäus Kleindessner and Ulrike von Luxburg. 2015. Dimensionality estimation without distances. In Proceedings of the Artificial Intelligence and Statistics conference (AISTATS).

Omer Levy and Yoav Goldberg. 2014a. Linguistic regularities in sparse and explicit word representations. In Proceedings of the 18th Conference on Computational Natural Language Learning, pages 171–180.

Omer Levy and Yoav Goldberg. 2014b. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems, pages 2177–2185.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In ICLR Workshop.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

SA Molchanov. 1975. Diffusion processes and Riemannian geometry. Russian Mathematical Surveys, 30(1):1.

Douglas L Nelson, Cathy L McEvoy, and Thomas A Schreiber. 2004. The University of South Florida free association, rhyme, and word fragment norms. Behavior Research Methods, Instruments, & Computers, 36(3):402–407.

Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of Empirical Methods in Natural Language Processing, pages 1532–43.

Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 701–710. ACM.

David E Rumelhart and Adele A Abrahamson. 1973. A model for analogical reasoning. Cognitive Psychology, 5(1):1–28.

Laurent Saloff-Coste. 2010. The heat kernel and its estimates. Probabilistic Approach to Geometry, 57:405–436.

Gideon Schwarz and Amos Tversky. 1980. On the reciprocity of proximity relations. Journal of Mathematical Psychology, 22(3):157–175.

Robin Sibson. 1979. Studies in the robustness of multidimensional scaling: Perturbational analysis of classical scaling. Journal of the Royal Statistical Society, Series B (Methodological), pages 217–229.

Richard Socher, John Bauer, Christopher D Manning, and Andrew Y Ng. 2013. Parsing with compositional vector grammars. In Proceedings of the ACL conference.

Robert J Sternberg and Michael K Gardner. 1983. Unities in inductive reasoning. Journal of Experimental Psychology: General, 112(1):80.

Joshua B Tenenbaum, Vin De Silva, and John C Langford. 2000. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323.

Joshua B Tenenbaum. 1998. Mapping a manifold of perceptual observations. In Advances in Neural Information Processing Systems, pages 682–688.

Daniel Ting, Ling Huang, and Michael Jordan. 2011. An analysis of the convergence of graph Laplacians. arXiv preprint arXiv:1101.5435.

Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: a simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 384–394. ACL.

Peter D Turney and Michael L Littman. 2005. Corpus-based learning of analogies and semantic relations. Machine Learning, 60(1-3):251–278.

Amos Tversky and J Hutchinson. 1986. Nearest neighbor analysis of psychological spaces. Psychological Review, 93(1):3.

Srinivasa RS Varadhan. 1967. Diffusion processes in a small time interval. Communications on Pure and Applied Mathematics, 20(4):659–685.
