Word2vec: What and Why
Jiaqi Mu
University of Illinois at Urbana-Champaign
[email protected]

Abstract—Words have been studied for decades as the basic unit in natural language. Although words are traditionally modeled as atomic units, a real-valued representation can wield power in many application domains. The state of the art in such real-valued representations is word2vec, known for its efficiency in handling large datasets and for its ability to capture multiple degrees of similarity. In this report, we present the training model of word2vec and summarize its evolution, from its canonical form to state-of-the-art implementations. We show that word2vec can be understood as a low-dimensional factorization of the so-called word-context matrix, whose cells are the pointwise mutual information (PMI) of the respective word and context pairs, shifted by a global constant. Following this insight, we explain the ability of word2vec to model similarity through a probabilistic interpretation of the "distributional hypothesis" from linguistics.

Keywords—words, vector representations, multiple degrees of similarity.

I. INTRODUCTION

Most applications in natural language processing (NLP) take words as the basic unit. Understanding the interaction of words in a corpus of documents is one of the most important research areas in NLP. The fundamental difficulty is the large cardinality of the vocabulary (for example, the size of the vocabulary in English is around 10^6). Learning the marginal word distribution, i.e., the unigram distribution, is already a difficult task: the underlying distribution follows Zipf's law, so the majority of words appear only a few times in the corpus. If one would like to model the joint distribution of two words, i.e., the bigram distribution, there are potentially 10^12 free parameters, which are too many to infer from current datasets. Due to the grammatical and semantic structure of documents, neither the unigram distribution nor the bigram distribution is enough to model them. Higher-order statistics, however, are almost impossible to infer.

An alternative solution to the cardinality problem that arises from representing words as atomic units is to represent them in real space, using an appropriate function (for example, a neural network) to model the interaction between words. Such an approach is particularly useful because similarities between words can now be captured by the distances between their representations, which is not possible when we take words as discrete items. Moreover, we would like the functions (in the neural network) that operate on these real-valued representations to behave smoothly: similar input word sequences should map to similar outputs. The primary challenge, then, lies in defining an appropriate mapping between words and their real-valued representations.

To address this challenge, different word embedding algorithms have been proposed in recent years. Neural network language models represent a word by its context words [1], [2], [3], [4], where a context word with respect to the current word is defined to be any word within a window ρ to the left and right of the current word, excluding the current word itself. For example, in the following piece of text, where the current word is "genres" and ρ = 3,

... a series of many genres, including fantasy, drama, coming of age ...

"series", "of", "many", "including", "fantasy", and "drama" are the context words. The nonlinearity and non-convexity involved in constructing such representations lead to computationally intensive training stages. To ameliorate the computational difficulties, [5] simplified the deep neural network models by using a single-layer feedforward neural network architecture that allows an efficient training process. Thanks to this simplification, word vector representations can benefit from each word being modeled by a larger window of contexts and from training on larger corpora. This work has made a large impact on the NLP community and is popularly known as word2vec.

It is not clear whether word2vec is the best representation, nor is it clear exactly what properties of words and sentences the representations should model. Different properties of the representations are useful in different applications. One property, that similar words should have similar representations, is important in all tasks, since it captures the basic difference between continuous representations and discrete ones. What makes word2vec a successful algorithm is that the representations it produces capture multiple degrees of similarity, e.g., between words or between pairs of words, in an unsupervised fashion. The similarity between words is captured by the distance between the words' corresponding vectors. The similarity between pairs of words is captured by the distance between the pairs' corresponding difference vectors. For example, the closest vector to vector("king") − vector("man") + vector("woman") is the vector of the word "queen".

After making this surprising observation, one would ask a natural question: why can word2vec capture multiple degrees of similarity? Even though this question has not yet been satisfactorily answered, several recent works provide insightful understanding of word2vec.
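The vector-offset analogy described above can be made concrete with a small sketch. The toy 3-dimensional vectors and the helper `closest` below are invented for illustration (real word2vec embeddings are learned from a large corpus and typically have hundreds of dimensions); the sketch only shows how a nearest-neighbor search under cosine similarity recovers the analogy.

```python
import numpy as np

# Toy embeddings; these numbers are made up for illustration only.
vectors = {
    "king":  np.array([0.8, 0.7, 0.1]),
    "man":   np.array([0.6, 0.2, 0.1]),
    "woman": np.array([0.7, 0.3, 0.9]),
    "queen": np.array([0.9, 0.8, 0.9]),
    "fruit": np.array([0.1, 0.9, 0.4]),
}

def closest(target, exclude):
    """Return the word whose vector has the highest cosine similarity
    to `target`, skipping the query words in `exclude`."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(
        (w for w in vectors if w not in exclude),
        key=lambda w: cos(vectors[w], target),
    )

# vector("king") - vector("man") + vector("woman") lands near "queen".
target = vectors["king"] - vectors["man"] + vectors["woman"]
print(closest(target, exclude={"king", "man", "woman"}))  # prints "queen"
```

Excluding the three query words is the standard convention in analogy evaluation, since the raw offset vector is often closest to one of the inputs themselves.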
The understanding of similarity between words comes from a linguistic hypothesis, the distributional hypothesis [6]: "a word is characterized by the company it keeps." According to this hypothesis, [7] interpreted the similarity between two words w1 and w2 as the similarity between the empirical conditional distributions of the context word given w1 and w2, respectively. In parallel, [8] showed that word2vec implicitly performs a weighted low-dimensional factorization of the cooccurrence statistics of each word and the words around it, after some preprocessing and with a careful choice of hyperparameters. More recently, [9] showed that the distances between word vectors capture the similarity between the empirical distributions of the words' contexts via a generative model, and thus capture the similarities between words.

II. ALGORITHM

The skip-gram (SG) model of word2vec aims at efficiently predicting the context words given the current word. The architecture of SG is presented in Figure 1, where w(t) is the t-th word in the corpus. The goal of word2vec is to predict w(t − ρ), ..., w(t + ρ) (e.g., series, of, many, including, fantasy, drama) given the current word w(t) (e.g., genres).

[Fig. 1. The architecture in word2vec to predict the context words given the current one (SG): the input w(t) ("genres") is mapped to the outputs w(t−3), w(t−2), w(t−1), w(t+1), w(t+2), w(t+3) ("series", "of", "many", "including", "fantasy", "drama").]

The complexity of prior work [1], [2], [3], [4] stemmed from the nonlinear operations in the training methods' hidden layers. Word2vec addressed this by replacing the nonlinear operations with more efficient bilinear ones, while also training on larger datasets to compensate for the loss of nonlinearity. To allow efficient computation, word2vec further makes independence assumptions on the context words. The training objective for SG is to maximize the probability of the context words given the current one,

    L_{SG} = \sum_t \log p(w_{t-\rho}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+\rho} \mid w_t)
           = \sum_t \sum_{c=-\rho,\, c \neq 0}^{\rho} \log p_{C|W}(w_{t+c} \mid w_t).    (2)

To avoid ambiguity, we use w to represent the current word and c to represent its context word. The conditional distribution p_{C|W}(c|w) in (2) is defined as

    p_{C|W}(c \mid w) = \frac{\exp(u(w)^T v(c))}{\sum_{c' \in V} \exp(u(w)^T v(c'))},    (3)

where u and v are the mappings from a word to its word-vector and context-vector representations, respectively, and V is the vocabulary as a list of words.

In (3), the summation over c' (i.e., over the whole vocabulary) normalizes the distribution; however, this summation is computationally intensive. To solve this issue, word2vec introduces the negative sampling (NS) mechanism to further reduce the computational complexity. Let p(D = 1 | w, c) be the probability that the word/context pair (w, c) comes from the data, and p(D = 0 | w, c) the probability that it does not. These probabilities are defined via logistic regression,

    p(D = 1 \mid w, c) = \sigma(u(w)^T v(c)),
    p(D = 0 \mid w, c) = \sigma(-u(w)^T v(c)),

where \sigma(x) = 1/(e^{-x} + 1) is the sigmoid function. Let #(w, c) be the number of occurrences of the word/context pair (w, c) in the training corpus. We define the marginal statistics by #(w) = \sum_{c \in V} #(w, c), #(c) = \sum_{w \in V} #(w, c), and |D| = \sum_{w,c} #(w, c). The training objective for SG then turns into maximizing p(D = 1 | w, c) for observed pairs (w, c) and maximizing p(D = 0 | w, c') for randomly generated pairs (w, c') (such a pair is called a "negative" sample), where c' is drawn from the empirical unigram distribution p_C, defined as

    p_C(c) = \frac{\sum_w #(w, c)}{\sum_{w, c'} #(w, c')} = \frac{#(c)}{|D|}.

The optimization objective for SG with NS is

    L_{SG}^{NS}(u, v) = \sum_{(w,c)} #(w, c) \left[ \log \sigma(u(w)^T v(c)) + k \, E_{c' \sim p_C} \log \sigma(-u(w)^T v(c')) \right],    (1)

where k is a hyperparameter controlling the number of negative samples. One can implement a parallel training algorithm using mini-batch asynchronous gradient descent with AdaGrad [10].

III. THEORETICAL ANALYSIS

It is the ability to capture multiple degrees of similarity, e.g., between words or between pairs of words, in an unsupervised fashion that makes word2vec a successful algorithm. We theoretically justify this ability in the remainder of this section.

Let n be the size of the vocabulary and let d be the dimension of the vector representations. Word2vec represents both words and their contexts in a dense low-dimensional space R^d via the mappings u and v defined in Section II. We embed u and v into a word matrix U \in R^{n \times d} and a context matrix V \in R^{n \times d}, respectively, where the i-th row of each matrix is