Word Embeddings As Metric Recovery in Semantic Spaces
Tatsunori B. Hashimoto, David Alvarez-Melis and Tommi S. Jaakkola
CSAIL, Massachusetts Institute of Technology
{thashim, davidam, tommi}@csail.mit.edu

Transactions of the Association for Computational Linguistics, vol. 4, pp. 273–286, 2016. Action Editor: Scott Yih. Submission batch: 12/2015; Revision batch: 3/2016; Published 7/2016. © 2016 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.

Abstract

Continuous word representations have been remarkably useful across NLP tasks but remain poorly understood. We ground word embeddings in semantic spaces studied in the cognitive-psychometric literature, taking these spaces as the primary objects to recover. To this end, we relate log co-occurrences of words in large corpora to semantic similarity assessments and show that co-occurrences are indeed consistent with an Euclidean semantic space hypothesis. Framing word embedding as metric recovery of a semantic space unifies existing word embedding algorithms, ties them to manifold learning, and demonstrates that existing algorithms are consistent metric recovery methods given co-occurrence counts from random walks. Furthermore, we propose a simple, principled, direct metric recovery algorithm that performs on par with the state-of-the-art word embedding and manifold learning methods. Finally, we complement recent focus on analogies by constructing two new inductive reasoning datasets—series completion and classification—and demonstrate that word embeddings can be used to solve them as well.

1 Introduction

Continuous space models of words, objects, and signals have become ubiquitous tools for learning rich representations of data, from natural language processing to computer vision. Specifically, there has been particular interest in word embeddings, largely due to their intriguing semantic properties (Mikolov et al., 2013b) and their success as features for downstream natural language processing tasks, such as named entity recognition (Turian et al., 2010) and parsing (Socher et al., 2013).

The empirical success of word embeddings has prompted a parallel body of work that seeks to better understand their properties, associated estimation algorithms, and explore possible revisions. Recently, Levy and Goldberg (2014a) showed that linear linguistic regularities first observed with word2vec extend to other embedding methods. In particular, explicit representations of words in terms of co-occurrence counts can be used to solve analogies in the same way. In terms of algorithms, Levy and Goldberg (2014b) demonstrated that the global minimum of the skip-gram method with negative sampling of Mikolov et al. (2013b) implicitly factorizes a shifted version of the pointwise mutual information (PMI) matrix of word-context pairs. Arora et al. (2015) explored links between random walks and word embeddings, relating them to contextual (probability ratio) analogies, under specific (isotropic) assumptions about word vectors.

In this work, we take semantic spaces studied in the cognitive-psychometric literature as the prototypical objects that word embedding algorithms estimate. Semantic spaces are vector spaces over concepts where Euclidean distances between points are assumed to indicate semantic similarities. We link such semantic spaces to word co-occurrences through semantic similarity assessments, and demonstrate that the observed co-occurrence counts indeed possess statistical properties that are consistent with an underlying Euclidean space where distances are linked to semantic similarity.
Figure 1: Inductive reasoning in semantic space proposed in Sternberg and Gardner (1983), with one panel each for Analogy, Series Completion, and Classification. A, B, C are given, I is the ideal point and D are the choices. The correct answer is shaded green.

Formally, we view word embedding methods as performing metric recovery. This perspective is significantly different from current approaches. Instead of aiming for representations that exhibit specific semantic properties or that perform well at a particular task, we seek methods that recover the underlying metric of the hypothesized semantic space. The clearer foundation afforded by this perspective enables us to analyze word embedding algorithms in a principled, task-independent fashion. In particular, we ask whether word embedding algorithms are able to recover the metric under specific scenarios. To this end, we unify existing word embedding algorithms as statistically consistent metric recovery methods under the theoretical assumption that co-occurrences arise from (metric) random walks over semantic spaces. The new setting also suggests a simple and direct recovery algorithm which we evaluate and compare against other embedding methods.

The main contributions of this work can be summarized as follows:

• We ground word embeddings in semantic spaces via log co-occurrence counts. We show that PMI (pointwise mutual information) relates linearly to human similarity assessments, and that nearest-neighbor statistics (centrality and reciprocity) are consistent with an Euclidean space hypothesis (Sections 2 and 3). A toy version of the PMI check is sketched below, after this list.

• In contrast to prior work (Arora et al., 2015), we take metric recovery as the key object of study, unifying existing algorithms as consistent metric recovery methods based on co-occurrence counts from simple Markov random walks over graphs and manifolds. This strong link to manifold estimation opens a promising direction for extensions of word embedding methods (Sections 4 and 5).

• We propose and evaluate a new principled direct metric recovery algorithm that performs comparably to the existing state of the art on both word embedding and manifold learning tasks, and show that GloVe (Pennington et al., 2014) is closely related to the second-order Taylor expansion of our objective.

• We construct and make available two new inductive reasoning datasets1—series completion and classification—to extend the evaluation of word representations beyond analogies, and demonstrate that these tasks can be solved with vector operations on word embeddings as well (Examples in Table 1).

1 http://web.mit.edu/thashim/www/supplement materials.zip
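To make the first contribution concrete, the following sketch shows how the PMI-versus-similarity check can be run from raw counts. It is only illustrative: the tiny vocabulary, co-occurrence matrix, and human ratings are made-up stand-ins for a real corpus and a survey such as MEN, and the correlation reported in the paper (r=0.75) comes from that real data, not from this toy.

import numpy as np

def pmi_matrix(counts):
    """Pointwise mutual information from a word-word co-occurrence count matrix.

    counts[i, j] is the number of times words i and j co-occur within a
    context window over the corpus.
    """
    total = counts.sum()
    p_ij = counts / total                    # joint probabilities
    p_i = counts.sum(axis=1) / total         # marginal probabilities
    with np.errstate(divide="ignore"):       # zero counts give -inf PMI
        return np.log(p_ij) - np.log(np.outer(p_i, p_i))

# Hypothetical inputs: a small vocabulary, its co-occurrence counts, and
# human similarity ratings for word pairs (placeholders, not real survey data).
vocab = {"king": 0, "queen": 1, "man": 2, "woman": 3}
counts = np.array([[0., 40., 35., 10.],
                   [40., 0., 12., 38.],
                   [35., 12., 0., 30.],
                   [10., 38., 30., 0.]])
human_ratings = [("king", "queen", 8.5), ("man", "woman", 8.2), ("king", "woman", 4.1)]

pmi = pmi_matrix(counts)
model_scores = [pmi[vocab[a], vocab[b]] for a, b, _ in human_ratings]
survey_scores = [s for _, _, s in human_ratings]
r = np.corrcoef(model_scores, survey_scores)[0, 1]   # Pearson correlation
print(f"Correlation between PMI and human similarity: {r:.2f}")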
2 Word vectors and semantic spaces

Most current word embedding algorithms build on the distributional hypothesis (Harris, 1954), where similar contexts imply similar meanings, so as to tie co-occurrences of words to their underlying meanings. The relationship between semantics and co-occurrences has also been studied in psychometrics and cognitive science (Rumelhart and Abrahamson, 1973; Sternberg and Gardner, 1983), often by means of free word association tasks and semantic spaces. The semantic spaces, in particular, provide a natural conceptual framework for continuous representations of words as vector spaces where semantically related words are close to each other. For example, the observation that word embeddings can solve analogies was already shown by Rumelhart and Abrahamson (1973) using vector representations of words derived from surveys of pairwise word similarity judgments.

Task            Prompt                          Answer
Analogy         king:man::queen:?               woman
Series          penny:nickel:dime:?             quarter
Classification  horse:zebra:{deer, dog, fish}   deer

Table 1: Examples of the three inductive reasoning tasks proposed by Sternberg and Gardner (1983).

Figure 2: Normalized log co-occurrence (PMI) linearly correlates with human semantic similarity judgments (MEN survey).

A fundamental question regarding vector space models of words is whether an Euclidean vector space is a valid representation of semantic concepts. There is substantial empirical evidence in favor of this hypothesis. For example, Rumelhart and Abrahamson (1973) showed experimentally that analogical problem solving with fictitious words and human mistake rates were consistent with an Euclidean space. Sternberg and Gardner (1983) provided further evidence supporting this hypothesis, proposing that general inductive reasoning was based upon operations in metric embeddings. Using the analogy, series completion and classification tasks shown in Table 1 as testbeds, they proposed that subjects solve these problems by finding the word closest (in semantic space) to an ideal point: the vertex of a parallelogram for analogies, a displacement from the last word in series completion, and the centroid in the case of classification (Figure 1).

We use semantic spaces as the prototypical structures that word embedding methods attempt to uncover, and we investigate the suitability of word co-occurrence counts for recovering them. As Figure 2 shows, the normalized log co-occurrence (PMI) computed from such counts has a strong linear relationship with semantic similarity judgments from survey data (Pearson's r=0.75). However, this suggestive linear relationship does not by itself demonstrate that log co-occurrences (with normalization) can be used to define an Euclidean metric space.

Earlier psychometric studies have asked whether human semantic similarity evaluations are consistent with an Euclidean space. For example, Tversky and Hutchinson (1986) investigate whether concept representations are consistent with the geometric sampling (GS) model: a generative model in which points are drawn independently from a continuous distribution in an Euclidean space. They use two nearest neighbor statistics to test agreement with this model, and conclude that certain hierarchical vocabularies are not consistent with an Euclidean embedding.
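A rough sketch of such nearest-neighbor diagnostics is given below. The precise definitions are our assumption based on the description above (centrality as the mean squared nearest-neighbor in-degree, reciprocity as the fraction of mutual nearest-neighbor pairs), not a quotation of Tversky and Hutchinson's formulas.

import numpy as np

def nearest_neighbor_stats(dissimilarity):
    """Centrality and reciprocity of a dissimilarity matrix (assumed definitions).

    Large centrality suggests hub-like, hierarchical structure rather than a
    low-dimensional Euclidean configuration; points sampled from a continuous
    distribution in Euclidean space (the GS model) should give low centrality
    and comparatively high reciprocity.
    """
    d = dissimilarity.astype(float).copy()
    np.fill_diagonal(d, np.inf)               # a point is not its own neighbor
    nn = d.argmin(axis=1)                     # nearest neighbor of each point
    n = len(nn)
    in_degree = np.bincount(nn, minlength=n)  # how often each point is a nearest neighbor
    centrality = np.mean(in_degree.astype(float) ** 2)
    reciprocity = np.mean(nn[nn] == np.arange(n))
    return centrality, reciprocity

# Example: points drawn from a continuous distribution in Euclidean space.
rng = np.random.default_rng(0)
points = rng.normal(size=(500, 5))
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
print(nearest_neighbor_stats(dist))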
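The ideal-point account of the tasks in Table 1 translates directly into vector arithmetic: the parallelogram vertex for analogies, a displacement from the last word for series completion, and the centroid for classification, with the answer chosen as the candidate nearest to that point. The sketch below uses a made-up two-dimensional toy space purely for illustration; in practice the vectors would come from a learned embedding or a psychometric semantic space.

import numpy as np

def nearest(ideal, candidates, vectors):
    """Return the candidate word whose vector is closest to the ideal point."""
    return min(candidates, key=lambda w: np.linalg.norm(vectors[w] - ideal))

def solve_analogy(a, b, c, candidates, vectors):
    # a:b :: c:? -> ideal point is the fourth vertex of the parallelogram.
    return nearest(vectors[c] + (vectors[b] - vectors[a]), candidates, vectors)

def solve_series(words, candidates, vectors):
    # Ideal point: the last word displaced by the last observed step.
    step = vectors[words[-1]] - vectors[words[-2]]
    return nearest(vectors[words[-1]] + step, candidates, vectors)

def solve_classification(words, candidates, vectors):
    # Ideal point: centroid of the given category members.
    centroid = np.mean([vectors[w] for w in words], axis=0)
    return nearest(centroid, candidates, vectors)

# Toy, hand-made 2-d "semantic space" purely for illustration.
vectors = {
    "king": np.array([1.0, 1.0]), "man": np.array([1.0, 0.0]),
    "queen": np.array([0.0, 1.0]), "woman": np.array([0.0, 0.0]),
    "penny": np.array([0.1, 0.0]), "nickel": np.array([0.3, 0.0]),
    "dime": np.array([0.5, 0.0]), "quarter": np.array([0.7, 0.0]),
    "horse": np.array([2.0, 2.0]), "zebra": np.array([2.2, 2.1]),
    "deer": np.array([2.1, 1.9]), "dog": np.array([3.0, 0.5]),
    "fish": np.array([4.0, -1.0]),
}
print(solve_analogy("king", "man", "queen", ["woman", "dog", "fish"], vectors))        # woman
print(solve_series(["penny", "nickel", "dime"], ["quarter", "dog", "fish"], vectors))  # quarter
print(solve_classification(["horse", "zebra"], ["deer", "dog", "fish"], vectors))      # deer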