GloVe: Global Vectors for Word Representation

Jeffrey Pennington, Richard Socher, Christopher D. Manning
Computer Science Department, Stanford University, Stanford, CA 94305
[email protected], [email protected], [email protected]

Abstract

Recent methods for learning vector space representations of words have succeeded in capturing fine-grained semantic and syntactic regularities using vector arithmetic, but the origin of these regularities has remained opaque. We analyze and make explicit the model properties needed for such regularities to emerge in word vectors. The result is a new global log-bilinear regression model that combines the advantages of the two major model families in the literature: global matrix factorization and local context window methods. Our model efficiently leverages statistical information by training only on the nonzero elements in a word-word co-occurrence matrix, rather than on the entire sparse matrix or on individual context windows in a large corpus. The model produces a vector space with meaningful substructure, as evidenced by its performance of 75% on a recent word analogy task. It also outperforms related models on similarity tasks and named entity recognition.

1 Introduction

Semantic vector space models of language represent each word with a real-valued vector. These vectors can be used as features in a variety of applications, such as information retrieval (Manning et al., 2008), document classification (Sebastiani, 2002), question answering (Tellex et al., 2003), named entity recognition (Turian et al., 2010), and parsing (Socher et al., 2013).

Most word vector methods rely on the distance or angle between pairs of word vectors as the primary method for evaluating the intrinsic quality of such a set of word representations. Recently, Mikolov et al. (2013c) introduced a new evaluation scheme based on word analogies that probes the finer structure of the word vector space by examining not the scalar distance between word vectors, but rather their various dimensions of difference. For example, the analogy "king is to queen as man is to woman" should be encoded in the vector space by the vector equation king − queen = man − woman. This evaluation scheme favors models that produce dimensions of meaning, thereby capturing the multi-clustering idea of distributed representations (Bengio, 2009).

The two main model families for learning word vectors are: 1) global matrix factorization methods, such as latent semantic analysis (LSA) (Deerwester et al., 1990), and 2) local context window methods, such as the skip-gram model of Mikolov et al. (2013c). Currently, both families suffer significant drawbacks. While methods like LSA efficiently leverage statistical information, they do relatively poorly on the word analogy task, indicating a sub-optimal vector space structure. Methods like skip-gram may do better on the analogy task, but they poorly utilize the statistics of the corpus since they train on separate local context windows instead of on global co-occurrence counts.

In this work, we analyze the model properties necessary to produce linear directions of meaning and argue that global log-bilinear regression models are appropriate for doing so. We propose a specific weighted least squares model that trains on global word-word co-occurrence counts and thus makes efficient use of statistics. The model produces a word vector space with meaningful substructure, as evidenced by its state-of-the-art performance of 75% accuracy on the word analogy dataset. We also demonstrate that our methods outperform other current methods on several word similarity tasks, and also on a common named entity recognition (NER) benchmark.

We provide the source code for the model as well as trained word vectors at http://nlp.stanford.edu/projects/glove/.
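The word analogy evaluation discussed in the introduction is typically scored with simple vector arithmetic: the answer to "a is to b as c is to ?" is taken to be the vocabulary word whose vector is closest, by cosine similarity, to b − a + c. The sketch below illustrates this procedure; it is not the paper's evaluation code, and the `vectors` dictionary mapping words to NumPy arrays is a hypothetical stand-in for a trained embedding.

```python
import numpy as np

def analogy(a, b, c, vectors, exclude_queries=True):
    """Answer 'a is to b as c is to ?' by vector arithmetic.

    `vectors` is assumed to map words to nonzero NumPy arrays; the answer
    is the vocabulary word whose vector has the highest cosine similarity
    to (b - a + c), with the query words excluded from the candidates.
    """
    target = vectors[b] - vectors[a] + vectors[c]
    target /= np.linalg.norm(target)

    best_word, best_sim = None, -np.inf
    for word, vec in vectors.items():
        if exclude_queries and word in (a, b, c):
            continue
        sim = np.dot(vec, target) / np.linalg.norm(vec)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

# For vectors that encode king - queen = man - woman, we would expect
# analogy("king", "queen", "man", vectors) to return "woman".
```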
2 Related Work

Matrix Factorization Methods. Matrix factorization methods for generating low-dimensional word representations have roots stretching as far back as LSA. These methods utilize low-rank approximations to decompose large matrices that capture statistical information about a corpus. The particular type of information captured by such matrices varies by application. In LSA, the matrices are of "term-document" type, i.e., the rows correspond to words or terms, and the columns correspond to different documents in the corpus. In contrast, the Hyperspace Analogue to Language (HAL) (Lund and Burgess, 1996), for example, utilizes matrices of "term-term" type, i.e., the rows and columns correspond to words and the entries correspond to the number of times a given word occurs in the context of another given word.

A main problem with HAL and related methods is that the most frequent words contribute a disproportionate amount to the similarity measure: the number of times two words co-occur with "the" or "and", for example, will have a large effect on their similarity despite conveying relatively little about their semantic relatedness. A number of techniques exist that address this shortcoming of HAL, such as the COALS method (Rohde et al., 2006), in which the co-occurrence matrix is first transformed by an entropy- or correlation-based normalization. An advantage of this type of transformation is that the raw co-occurrence counts, which for a reasonably sized corpus might span 8 or 9 orders of magnitude, are compressed so as to be distributed more evenly in a smaller interval. A variety of newer models also pursue this approach, including a study (Bullinaria and Levy, 2007) that indicates that positive pointwise mutual information (PPMI) is a good transformation. More recently, a square root type transformation in the form of Hellinger PCA (HPCA) (Lebret and Collobert, 2014) has been suggested as an effective way of learning word representations.

Shallow Window-Based Methods. Another approach is to learn word representations that aid in making predictions within local context windows. For example, Bengio et al. (2003) introduced a model that learns word vector representations as part of a simple neural network architecture for language modeling. Collobert and Weston (2008) decoupled the word vector training from the downstream training objectives, which paved the way for Collobert et al. (2011) to use the full context of a word for learning the word representations, rather than just the preceding context as is the case with language models.

Recently, the importance of the full neural network structure for learning useful word representations has been called into question. The skip-gram and continuous bag-of-words (CBOW) models of Mikolov et al. (2013a) propose a simple single-layer architecture based on the inner product between two word vectors. Mnih and Kavukcuoglu (2013) also proposed closely-related vector log-bilinear models, vLBL and ivLBL, and Levy et al. (2014) proposed explicit word embeddings based on a PPMI metric.

In the skip-gram and ivLBL models, the objective is to predict a word's context given the word itself, whereas the objective in the CBOW and vLBL models is to predict a word given its context. Through evaluation on a word analogy task, these models demonstrated the capacity to learn linguistic patterns as linear relationships between the word vectors.

Unlike the matrix factorization methods, the shallow window-based methods suffer from the disadvantage that they do not operate directly on the co-occurrence statistics of the corpus. Instead, these models scan context windows across the entire corpus, which fails to take advantage of the vast amount of repetition in the data.
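To make the count transformations discussed above concrete, the following sketch computes a positive pointwise mutual information (PPMI) matrix from raw co-occurrence counts, in the spirit of Bullinaria and Levy (2007). It is an illustrative NumPy implementation using the standard definition of PMI, not the exact recipe of any of the cited methods, and it assumes a dense, nonempty count matrix.

```python
import numpy as np

def ppmi(counts, eps=1e-12):
    """Positive pointwise mutual information transform of a word-word
    co-occurrence count matrix (rows: target words, cols: context words).

    PMI(i, j) = log[ P(i, j) / (P(i) * P(j)) ]; PPMI clips negative values
    to zero, which also compresses the huge dynamic range of raw counts.
    """
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()                                # assumed > 0
    p_ij = counts / total                               # joint probabilities
    p_i = counts.sum(axis=1, keepdims=True) / total     # row marginals
    p_j = counts.sum(axis=0, keepdims=True) / total     # column marginals
    pmi = np.log((p_ij + eps) / (p_i * p_j + eps))
    return np.maximum(pmi, 0.0)                         # keep only positive PMI
```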
3 The GloVe Model

The statistics of word occurrences in a corpus is the primary source of information available to all unsupervised methods for learning word representations, and although many such methods now exist, the question still remains as to how meaning is generated from these statistics, and how the resulting word vectors might represent that meaning. In this section, we shed some light on this question. We use our insights to construct a new model for word representation which we call GloVe, for Global Vectors, because the global corpus statistics are captured directly by the model.

First we establish some notation. Let the matrix of word-word co-occurrence counts be denoted by $X$, whose entries $X_{ij}$ tabulate the number of times word $j$ occurs in the context of word $i$. Let $X_i = \sum_k X_{ik}$ be the number of times any word appears in the context of word $i$. Finally, let $P_{ij} = P(j \mid i) = X_{ij}/X_i$ be the probability that word $j$ appears in the context of word $i$.

We begin with a simple example that showcases how certain aspects of meaning can be extracted directly from co-occurrence probabilities. Consider two words $i$ and $j$ that exhibit a particular aspect of interest; for concreteness, suppose we are interested in the concept of thermodynamic phase, for which we might take $i$ = ice and $j$ = steam. Table 1 shows the co-occurrence probabilities of these two target words with selected probe words $k$, together with their ratios.

Table 1: Co-occurrence probabilities for target words ice and steam with selected context words from a 6 billion token corpus. Only in the ratio does noise from non-discriminative words like water and fashion cancel out, so that large values (much greater than 1) correlate well with properties specific to ice, and small values (much less than 1) correlate well with properties specific to steam.

Probability and ratio        k = solid     k = gas       k = water     k = fashion
P(k | ice)                   1.9 × 10^-4   6.6 × 10^-5   3.0 × 10^-3   1.7 × 10^-5
P(k | steam)                 2.2 × 10^-5   7.8 × 10^-4   2.2 × 10^-3   1.8 × 10^-5
P(k | ice) / P(k | steam)    8.9           8.5 × 10^-2   1.36          0.96

The above argument suggests that the appropriate starting point for word vector learning should be ratios of co-occurrence probabilities rather than the probabilities themselves. Noting that the ratio $P_{ik}/P_{jk}$ depends on three words $i$, $j$, and $k$, the most general model takes the form

$$F(w_i, w_j, \tilde{w}_k) = \frac{P_{ik}}{P_{jk}}, \quad (1)$$

where $w \in \mathbb{R}^d$ are word vectors and $\tilde{w} \in \mathbb{R}^d$ are separate context word vectors. We would like $F$ to encode the information present in the ratio $P_{ik}/P_{jk}$ in the word vector space. Since vector spaces are inherently linear structures, the most natural way to do this is with vector differences. With this aim, we can restrict our consideration to those functions $F$ that depend only on the difference of the two target words, modifying Eqn.
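As a concrete illustration of the notation above, the sketch below row-normalizes a co-occurrence count matrix to obtain $P_{ij} = X_{ij}/X_i$ and computes the probability ratio $P_{ik}/P_{jk}$ that Table 1 examines. It assumes a dense NumPy count matrix with nonzero row sums and integer word indices; it is a reading aid, not the paper's implementation.

```python
import numpy as np

def cooccurrence_probabilities(X):
    """Row-normalize a co-occurrence count matrix X so that
    P[i, j] = P(j | i) = X_ij / X_i (rows are assumed to have nonzero sums)."""
    X = np.asarray(X, dtype=float)
    X_i = X.sum(axis=1, keepdims=True)   # X_i = sum_k X_ik
    return X / X_i

def probability_ratio(P, i, j, k):
    """Ratio P_ik / P_jk used to probe whether context word k
    discriminates between target words i and j (cf. Table 1)."""
    return P[i, k] / P[j, k]

# Hypothetical usage with integer word indices ice, steam, solid:
# P = cooccurrence_probabilities(X)
# probability_ratio(P, ice, steam, solid)   # expected to be large (>> 1)
# probability_ratio(P, ice, steam, water)   # expected to be close to 1
```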
