Deconstructing word embedding algorithms

Kian Kenyon-Dean†∗, Edward Newell∗, Jackie Chi Kit Cheung
BMO AI Capabilities Team, Bank of Montreal - Toronto, Ontario
Mila - Québec AI Institute
McGill University - Montréal, Québec
[email protected]  [email protected]  [email protected]

∗ Kian and Edward contributed equally.
† This work was pursued while Kian was a member of Mila.

Abstract

Word embeddings are reliable feature representations of words used to obtain high quality results for various NLP applications. Uncontextualized word embeddings are used in many NLP tasks today, especially in resource-limited settings where high memory capacity and GPUs are not available. Given the historical success of word embeddings in NLP, we propose a retrospective on some of the most well-known word embedding algorithms. In this work, we deconstruct Word2vec, GloVe, and others, into a common form, unveiling some of the common conditions that seem to be required for making performant word embeddings. We believe that the theoretical findings in this paper can provide a basis for more informed development of future models.

1 Introduction

The advent of efficient uncontextualized word embedding algorithms (e.g., Word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014)) marked a historical breakthrough in NLP. Countless researchers employed word embeddings in new models to improve results on a multitude of NLP problems. In this work, we provide a retrospective analysis of these groundbreaking models of the past, which simultaneously offers theoretical insights for how future models can be developed and understood. We build on the theoretical work of Levy and Goldberg (2014), proving that their findings on the relationship between pointwise mutual information (PMI) and word embeddings go beyond Word2vec and singular value decomposition.

In particular, we generalize several word embedding algorithms into a common form by proposing the low rank embedder framework. We deconstruct each algorithm into its constituent parts, and find that, despite their many different hyperparameters, the algorithms collectively intersect upon the following two key design features. First, vector-covector dot products are learned to approximate PMI statistics in the corpus. Second, modulation of the loss gradient, directly or indirectly, is necessary to balance weak and strong signals arising from the highly imbalanced distribution of corpus statistics.

These findings can provide an informed basis for future development of both new embedding algorithms and deep contextualized models.

2 Fundamental concepts

We begin by formally defining embeddings, their vectors and covectors (also known as "input" and "output" vectors (Rong, 2014; Nalisnick et al., 2016)), and pointwise mutual information (PMI).

Embedding. In general topology, an embedding is understood as an injective structure-preserving map, f : X → Y, between two mathematical structures X and Y. A word embedding algorithm (f) learns an inner-product space (Y) to preserve a linguistic structure within a reference corpus of text, D (X), based on a vocabulary, V. The structure in D is analyzed in terms of the relationships between words induced by their co-appearances, according to a certain definition of context. In such an analysis, each word figures dually: (1) as a focal element inducing a local context; and (2) as an element of the local contexts induced by focal elements. To make these dual roles explicit, we distinguish two copies of the vocabulary: the focal, or term, words V_T, and the context words V_C.

Word embedding consists of two maps:

$$V_C \to \mathbb{R}^{1\times d}, \quad i \mapsto \langle i| \qquad\qquad V_T \to \mathbb{R}^{d\times 1}, \quad j \mapsto |j\rangle.$$

We use Dirac notation to distinguish vectors |j⟩, associated to focal words, from covectors ⟨i|, associated to context words.

In matrix notation, |j⟩ corresponds to a column vector and ⟨i| to a row vector. Their inner product is ⟨i|j⟩. We later demonstrate that many word embedding algorithms, intentionally or not, learn a vector space where the inner product between a focal word j and context word i aims to approximate their PMI in the reference corpus: ⟨i|j⟩ ≈ PMI(i, j).

Pointwise mutual information (PMI). PMI is a commonly used measure of association in computational linguistics, and has been shown to be consistent and reliable for many tasks (Terra and Clarke, 2003). It measures the deviation of the cooccurrence probability between two words i and j from the product of their marginal probabilities:

$$\mathrm{PMI}(i, j) := \ln \frac{p_{ij}}{p_i p_j} = \ln \frac{N\,N_{ij}}{N_i N_j}, \qquad (1)$$

where p_ij is the probability of word i and word j cooccurring (for some notion of cooccurrence), and where p_i and p_j are the marginal probabilities of words i and j occurring. The empirical PMI can be found by replacing probabilities with corpus statistics. Words are typically considered to cooccur if they are separated by no more than w words; N_ij is the number of counted cooccurrences between a context i and a term j; N_i, N_j, and N are computed by marginalizing over the N_ij statistics.
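For illustration, the empirical PMI of Eq. 1 can be computed from a cooccurrence-count matrix as in the following sketch. This is not code from the original work; the dense-matrix representation and the toy counts are assumptions made purely for the example.

```python
import numpy as np

def empirical_pmi(N_ij: np.ndarray) -> np.ndarray:
    """Empirical PMI from a |V_C| x |V_T| cooccurrence-count matrix.

    N_ij[i, j] counts how often context word i appears within the
    window of focal word j; the marginals N_i, N_j and the total N
    are obtained by summing over the counts, as in Section 2.
    """
    N = N_ij.sum()
    N_i = N_ij.sum(axis=1, keepdims=True)   # context marginals
    N_j = N_ij.sum(axis=0, keepdims=True)   # focal marginals
    with np.errstate(divide="ignore"):      # ln 0 -> -inf where N_ij = 0
        return np.log(N * N_ij) - np.log(N_i * N_j)

# toy example: 3 context words x 3 focal words
counts = np.array([[10., 2., 0.],
                   [ 2., 8., 1.],
                   [ 0., 1., 5.]])
print(empirical_pmi(counts))
```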

3 Word embedding algorithms

We will now introduce the low rank embedder framework for deconstructing word embedding algorithms, inspired by the theory of generalized low rank models (Udell et al., 2016). We unify several word embedding algorithms by observing them all from the common vantage point of their global loss function. Note that this framework is used for theoretical analysis, not necessarily implementation.

The global loss function for a low rank embedder takes the following form:

$$L = \sum_{(i,j)\in V_C\times V_T} f_{ij}\big(\psi(\langle i|, |j\rangle),\, \phi(i, j)\big), \qquad (2)$$

where ψ(⟨i|, |j⟩) is a kernel function of learned model parameters, and φ(i, j) is some scalar function (such as a measure of association based on how often i and j appear in the corpus); we denote these with ψ_ij and φ_ij for brevity. As well, f_ij are loss functions that take ψ_ij and φ_ij as inputs; all f_ij satisfy the property:

$$\frac{\partial f_{ij}}{\partial \psi_{ij}} = 0 \quad \text{at} \quad \psi_{ij} = \phi_{ij}. \qquad (3)$$

The design variable φ_ij is some function of corpus statistics, and its purpose is to quantitatively measure some relationship between words i and j. The design variable ψ_ij is a function of model parameters that aims to approximate φ_ij; i.e., an embedder's fundamental objective is to learn ψ_ij ≈ φ_ij, and thus to train embeddings that capture the statistical relationships measured by φ_ij. The simplest choice for the kernel function ψ_ij is to take ψ_ij = ⟨i|j⟩. But the framework allows any function that is symmetric and positive definite, allowing the inclusion of bias parameters (e.g., in GloVe) and subword parameterization (e.g., in FastText). We later demonstrate that skip-gram with negative sampling takes φ_ij := PMI(i, j) − ln k and ψ_ij := ⟨i|j⟩, and then learns parameter values that approximate ⟨i|j⟩ ≈ PMI(i, j) − ln k.

To understand the range of models encompassed, it is helpful to see how the framework relates (but is not limited) to matrix factorization. Consider φ_ij as providing the entries of a matrix: M := [φ_ij]_ij. For models that take ψ_ij = ⟨i|j⟩, we can write M̂ = WV, where W is defined as having row i equal to ⟨i|, and V as having column j equal to |j⟩. Then, the loss function can be rewritten as:

$$L = \sum_{(i,j)\in V_C\times V_T} f_{ij}\big((WV)_{ij},\, M_{ij}\big).$$

This loss function can be interpreted as matrix reconstruction error, because the constraint in Eq. 3 means that the gradient goes to zero as WV ≈ M.

Selecting a particular low rank embedder instance requires key design choices to be made: we must choose the embedding dimension d, the form of the loss terms f_ij, the kernel function ψ_ij, and the association function φ_ij. The derivative of f_ij with respect to ψ_ij, which we call the characteristic gradient, helps compare models because it exhibits the action of the gradient yet is symmetric in the parameters. In the Appendix we show how this derivative relates to gradient descent.

In the following subsections, we present the derivations of ∂f_ij/∂ψ_ij, ψ_ij, and φ_ij for SVD (Levy and Goldberg, 2014; Levy et al., 2015), SGNS (Mikolov et al., 2013), FastText (Joulin et al., 2017), GloVe (Pennington et al., 2014), and LDS (Arora et al., 2016). The derivation for Swivel (Shazeer et al., 2016) as a low rank embedder is trivial, as it is already posed as a matrix factorization of PMI statistics. We summarize the derivations in Table 1.
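As a minimal illustration of the matrix-reconstruction view, the sketch below takes the simple kernel ψ_ij = ⟨i|j⟩, a weighted squared-error f_ij, and full-batch gradient descent; these are assumed choices for the example and are not prescribed by the framework or by any specific algorithm in Table 1.

```python
import numpy as np

def low_rank_embedder_step(W, V, M, weights, lr=0.05):
    """One full-batch gradient step on L = sum_ij f_ij((WV)_ij, M_ij).

    W: |V_C| x d matrix of covectors (row i is <i|).
    V: d x |V_T| matrix of vectors (column j is |j>).
    M: matrix of association statistics, M_ij = phi_ij.
    weights: per-pair multiplier playing the role of the "tempered"
             N_ij factor that modulates the gradient.
    With f_ij taken (illustratively) as a weighted squared error, the
    characteristic gradient is weights_ij * (psi_ij - phi_ij).
    """
    psi = W @ V                          # psi_ij = <i|j>
    char_grad = weights * (psi - M)      # d f_ij / d psi_ij
    grad_W = char_grad @ V.T             # chain rule through psi
    grad_V = W.T @ char_grad
    return W - lr * grad_W, V - lr * grad_V

rng = np.random.default_rng(0)
M = rng.normal(size=(50, 40))            # stand-in for a PMI matrix
W = rng.normal(scale=0.1, size=(50, 8))
V = rng.normal(scale=0.1, size=(8, 40))
for _ in range(500):
    W, V = low_rank_embedder_step(W, V, M, np.ones_like(M))
```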

Model   | ∂f_ij/∂ψ_ij                            | ψ_ij              | φ_ij            | ⟨i|j⟩ ≈
SVD     | 2 · (ψ_ij − φ_ij)                      | ⟨i|j⟩             | PMI(i, j)       | PMI(i, j)
SGNS    | (N_ij + N⁻_ij) · (σ(ψ_ij) − σ(φ_ij))   | ⟨i|j⟩             | ln(N_ij/N⁻_ij)  | PMI(i, j) − ln k
GloVe   | 2h(N_ij) · (ψ_ij − φ_ij)               | ⟨i|j⟩ + b_i + b_j | ln N_ij         | PMI(i, j)
LDS     | 4h(N_ij) · (ψ_ij − φ_ij + C)           | ‖⟨i| + |j⟩ᵀ‖²     | ln N_ij         | d·PMI(i, j) − dγ
Swivel  | √N_ij · (ψ_ij − φ_ij)                  | ⟨i|j⟩             | PMI(i, j)       | PMI(i, j)
        | 1 · σ(ψ_ij − φ_ij)                     | ⟨i|j⟩             | PMI*(i, j)      | PMI*(i, j)

Table 1: Comparison of low rank embedders. The final column shows the value of ⟨i|j⟩ at ∂f_ij/∂ψ_ij = 0. GloVe and LDS set f_ij = 0 when N_ij = 0; h(N_ij) is a weighting function sublinear in N_ij. Swivel takes one form when N_ij > 0 (first row) and another when N_ij = 0 (second row). N⁻_ij is the number of negative samples; in SGNS, N⁻_ij ∝ N_i N_j, and both N_ij and N⁻_ij are tempered by undersampling and unigram smoothing.

3.1 SVD as a low rank embedder

Singular value decomposition (SVD) of the positive-PMI (PPMI) matrix is used by Levy and Goldberg (2014); Levy et al. (2015) to produce word embeddings that perform more or less equivalently to SGNS and GloVe. Converting the PMI matrix into PPMI is a trivial preprocessing step; φ is augmented according to a factor α = 0 such that φ_ij = 0 ∀ φ_ij ≤ α. We now prove why SVD of the PMI matrix results in word embeddings with dot products ⟨i|j⟩ ≈ PMI(i, j), noting that this proof naturally holds for all augmentations of φ according to the α factor, including PPMI.

Proof. Truncated SVD provides an optimal solution to the problem min_D ‖D − A‖_F for some integer K less than the dimensionality of matrix A such that rank(D) = K (Udell et al., 2016). The solution is the truncated SVD of A, where D = Σ_{k=1}^{K} σ_k u_k v_kᵀ, with σ_k being the kth singular value and u_k and v_k the kth left and right singular vectors.

Within our framework, the truncated SVD of the PMI matrix thus solves the following loss function (note A_ij = φ_ij = PMI(i, j)):

$$L = \sum_{(i,j)\in V_C\times V_T} \big(\psi_{ij} - \mathrm{PMI}(i, j)\big)^2, \qquad (4)$$

where ψ_ij = u_iᵀ Σ v_j. Allowing the square matrix of singular values Σ to be absorbed into the vectors (as in Levy et al. (2015)), we have ⟨i| = u_iᵀ and |j⟩ = Σ v_j. Thus, taking the derivative ∂f_ij/∂ψ_ij (noting that f_ij here is simply the squared difference between ψ_ij and φ_ij) and setting it equal to zero, we observe:

$$\langle i|j\rangle = \mathrm{PMI}(i, j). \qquad (5)$$
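A minimal sketch of this procedure, assuming the empirical PMI matrix is already available as a dense array and using numpy's dense SVD; the dimension d and the clipping factor are illustrative arguments, and this does not reproduce the experimental setup of the cited works.

```python
import numpy as np

def svd_embeddings(pmi: np.ndarray, d: int, alpha: float = 0.0):
    """Factorize the (augmented) PMI matrix so that <i|j> ~= PMI(i, j).

    Entries at or below alpha are zeroed out (alpha = 0 gives PPMI).
    The singular values are absorbed into the focal-word vectors,
    so <i| = u_i and |j> = Sigma v_j, as in Section 3.1.
    """
    phi = np.where(pmi > alpha, pmi, 0.0)
    U, S, Vt = np.linalg.svd(phi, full_matrices=False)
    covectors = U[:, :d]                       # rows are <i|
    vectors = (np.diag(S[:d]) @ Vt[:d]).T      # rows are |j> transposed
    return covectors, vectors

# covectors @ vectors.T reconstructs the rank-d approximation of phi
```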

3.2 SGNS as a low rank embedder

Mikolov et al. (2013) proposed skip-gram with negative sampling with the following loss function:

$$L = -\sum_{(i,j)\in D_2} \Big\{ \ln \sigma\langle i|j\rangle + \sum_{\ell=1}^{k} \mathbb{E}\big[\ln\big(1 - \sigma\langle i'_\ell|j\rangle\big)\big] \Big\},$$

where σ is the logistic sigmoid function, D_2 is a list containing each cooccurrence of a context word i with a focal word j in the corpus, and the expectation is taken by drawing i'_ℓ from the (smoothed) unigram distribution to generate k "negative samples" for a given focal word (Mikolov et al., 2013). We will demonstrate that SGNS is a low rank embedder with ⟨i|j⟩ ≈ PMI(i, j) − ln k.

Proof. We can transform the loss function by counting the number of times each pair occurs in the corpus, N_ij, and the number of times each pair is drawn as a negative sample, N⁻_ij, while indexing the sum over the set V_C × V_T:

$$L = -\sum_{(i,j)\in V_C\times V_T} \Big\{ N_{ij} \ln \sigma\langle i|j\rangle + N^{-}_{ij} \ln\big(1 - \sigma\langle i|j\rangle\big) \Big\}.$$

The global loss is almost in the required form for a low rank embedder (Eq. 2), and the appropriate setting for the model approximation function is ψ_ij = ⟨i|j⟩.

Calculating the partial derivative with respect to the model approximation function ψ_ij, and following algebraic manipulation (using the identity a ≡ (a + b)σ(ln(a/b))), we arrive at the following definition of the characteristic gradient for SGNS as a low rank embedder, where ∂f_ij/∂ψ_ij = ∂L/∂⟨i|j⟩:

$$\frac{\partial L}{\partial \langle i|j\rangle} = N^{-}_{ij}\,\sigma\langle i|j\rangle - N_{ij}\big(1 - \sigma\langle i|j\rangle\big) = \big(N_{ij} + N^{-}_{ij}\big)\Big[\sigma\big(\langle i|j\rangle\big) - \sigma\Big(\ln \tfrac{N_{ij}}{N^{-}_{ij}}\Big)\Big] = \big(N_{ij} + N^{-}_{ij}\big)\big[\sigma(\psi_{ij}) - \sigma(\phi_{ij})\big]. \qquad (6)$$

This provides that the association function for SGNS is φ_ij = ln(N_ij/N⁻_ij), since the derivative will be equal to zero at that point (Eq. 3). However, recall that negative samples are drawn according to the unigram distribution (or a smoothed variant (Levy et al., 2015)). This means that N⁻_ij = kN_iN_j/N. Therefore, in agreement with Levy and Goldberg (2014), we find that:

$$\phi_{ij} = \ln \frac{N_{ij}\,N}{N_i N_j k} = \mathrm{PMI}(i, j) - \ln k. \qquad (7)$$
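Because the transformed loss depends only on the count statistics (N_ij, N⁻_ij), it can be optimized directly from those counts rather than by sampling. The sketch below does so with full-batch gradient descent on the characteristic gradient of Eq. 6; the unsmoothed negative counts, learning rate, and dimensionality are assumptions made for illustration, not settings from the original algorithm.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_from_counts(N_ij, k=5, d=50, lr=1e-3, steps=2000, seed=0):
    """Optimize L = -sum_ij [N_ij ln s(<i|j>) + N-_ij ln(1 - s(<i|j>))].

    With N-_ij = k N_i N_j / N (unsmoothed here, for simplicity), the
    optimum satisfies <i|j> = ln(N_ij / N-_ij) = PMI(i, j) - ln k.
    """
    N = N_ij.sum()
    N_neg = k * np.outer(N_ij.sum(axis=1), N_ij.sum(axis=0)) / N
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(N_ij.shape[0], d))   # covectors <i|
    V = rng.normal(scale=0.1, size=(d, N_ij.shape[1]))   # vectors |j>
    for _ in range(steps):
        s = sigmoid(W @ V)
        # Eq. 6: dL/d<i|j> = (N_ij + N-_ij) * sigma(<i|j>) - N_ij
        char_grad = (N_ij + N_neg) * s - N_ij
        # the learning rate may need tuning for a given count scale
        W, V = W - lr * char_grad @ V.T, V - lr * W.T @ char_grad
    return W, V
```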

3.3 FastText as a low rank embedder

Proposed by Joulin et al. (2017), FastText's motivation is orthogonal to the present work. Its purpose is to provide subword-based representations of words to improve the vocabulary coverage and generalizability of word embeddings. Nonetheless, it can also be understood as a low rank embedder.

Proof. FastText uses a loss function that is identical to SGNS, except that the vector for each word is taken as the sum of embeddings for all character n-grams appearing in the word, with 3 ≤ n ≤ 6. Therefore, define |j⟩ by |j⟩ ≡ Σ_{g∈z(j)} |g⟩, where |g⟩ is the vector for n-gram g, and z(j) is the set of n-grams in word j. Covectors are accorded to words directly, so need not be redefined. The loss function and the derivation of entries for Table 1 are then formally identical to those for SGNS. This provides that ψ_ij = ⟨i|j⟩ and φ_ij = PMI(i, j) − ln k.

3.4 GloVe as a low rank embedder

GloVe was proposed as an algorithm halfway between sampling methods and matrix factorization (Pennington et al., 2014). Ignoring samples where N_ij = 0, GloVe uses the following loss function:

$$L = \sum_{ij} h(N_{ij})\big(\langle i|j\rangle + b_i + b_j - \ln N_{ij}\big)^2, \qquad (8)$$

where b_i and b_j are learned bias parameters, and h(N_ij) is a weighting function sublinear in N_ij. GloVe can be cast as a low rank embedder by using the model approximation function as a kernel with bias parameters, and setting the association measure to simply be the objective:

$$\psi_{ij} = \begin{bmatrix} \langle i|_1 & \cdots & \langle i|_d & b_i & 1 \end{bmatrix} \cdot \begin{bmatrix} |j\rangle_1 & \cdots & |j\rangle_d & 1 & b_j \end{bmatrix}^{\top}, \qquad \phi_{ij} = \ln N_{ij}.$$

Proof. Observe an optimal solution to the loss function, when ∂f_ij/∂ψ_ij = 0:

$$\frac{\partial f_{ij}}{\partial \psi_{ij}} = 2h(N_{ij})\big[\langle i|j\rangle + b_i + b_j - \ln N_{ij}\big] = 0 \;\Longrightarrow\; \langle i|j\rangle + b_i + b_j = \ln N_{ij}.$$

Multiplying the log operand by 1:

$$\langle i|j\rangle + b_i + b_j = \ln\Big( N_{ij}\,\frac{N_i N_j}{N}\,\frac{N}{N_i N_j} \Big) \qquad (9)$$
$$= \ln\frac{N_i}{\sqrt{N}} + \ln\frac{N_j}{\sqrt{N}} + \mathrm{PMI}(i, j). \qquad (10)$$

On the right side, we have two terms that depend respectively only on i and j, which are candidates for the bias terms. Based on this equation alone, we cannot draw any conclusions. However, empirically the bias terms are in fact very near ln(N_i/√N) and ln(N_j/√N), and PMI is observed to be normally distributed, as can be seen in Fig. 1. This means that Eq. 10 provides ⟨i|j⟩ ≈ PMI(i, j).

[Figure 1 (plots omitted): A) Histogram of PMI(i, j) values, for all pairs (i, j) with N_ij > 0. B) Scatter plot of GloVe's learned biases. Both from a Wikipedia 2018 corpus.]
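The argument above rests on the empirical observation (Fig. 1) that GloVe's learned biases track ln(N_i/√N) and ln(N_j/√N). Given trained biases and unigram counts (assumed inputs here), that observation can be probed with a short check such as the following; the function name and the use of a Pearson correlation are illustrative choices, not part of the original analysis.

```python
import numpy as np

def bias_vs_unigram_log(biases: np.ndarray, unigram_counts: np.ndarray):
    """Compare learned GloVe biases b_i with ln(N_i / sqrt(N)).

    If the biases capture the unigram terms in Eq. 10, the two
    quantities should be strongly correlated (cf. Fig. 1B).
    """
    N = unigram_counts.sum()
    predicted = np.log(unigram_counts / np.sqrt(N))
    corr = np.corrcoef(biases, predicted)[0, 1]
    return predicted, corr
```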

Analyzing the optimum of GloVe's loss function yields important insights. First, GloVe can be added to the list of low rank embedders that learn a bilinear parameterization of PMI. Second, we can see why such a parameterization is advantageous. Generally, it helps to standardize features of low rank models (Udell et al., 2016), and this is essentially what transforming cooccurrence counts into PMI achieves. Thus, PMI can be viewed as a parameterization trick, providing an approximately normal target association to be modelled.

3.5 LDS as a low rank embedder

Arora et al. (2016) introduced an embedding perspective based on generative modelling with random walks through a latent discourse space (LDS). LDS provided a theoretical basis for the performant SIF document embedding algorithm, developed soon afterwards (Arora et al., 2017). We now demonstrate that LDS is also a low rank embedder.

Proof. The low rank learning objective for LDS follows directly from Corollary 2.3 in Arora et al. (2016):

$$\mathrm{PMI}(i, j) = \frac{\langle i|j\rangle}{d} + \gamma + O(\epsilon).$$

∂f_ij/∂ψ_ij can be found by straightforward differentiation of LDS's loss function:

$$L = \sum_{ij} h(N_{ij})\big(\ln N_{ij} - \|\langle i| + |j\rangle^{\top}\|^2 - C\big)^2,$$

where h(N_ij) is as defined by GloVe. The quadratic term is a valid kernel function because:

$$\psi_{ij} = \|\langle i| + |j\rangle^{\top}\|^2 = \langle \tilde{i}|\tilde{j}\rangle,$$

where

$$\langle \tilde{i}| = \begin{bmatrix} \sqrt{2}\,\langle i|_1 & \cdots & \sqrt{2}\,\langle i|_d & \langle i|\langle i|^{\top} & 1 \end{bmatrix}, \qquad |\tilde{j}\rangle = \begin{bmatrix} \sqrt{2}\,|j\rangle_1 & \cdots & \sqrt{2}\,|j\rangle_d & 1 & |j\rangle^{\top}|j\rangle \end{bmatrix}^{\top}.$$
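The kernel identity used above, that the squared norm equals the dot product of the augmented representations, can be verified numerically in a few lines; this is only a sanity check of the algebra, not part of the original derivation.

```python
import numpy as np

rng = np.random.default_rng(1)
ci = rng.normal(size=5)            # covector <i| (as a length-d array)
vj = rng.normal(size=5)            # vector |j>   (as a length-d array)

# psi_ij = ||<i| + |j>^T||^2
psi = np.sum((ci + vj) ** 2)

# augmented representations from Section 3.5
ci_tilde = np.concatenate([np.sqrt(2) * ci, [ci @ ci, 1.0]])
vj_tilde = np.concatenate([np.sqrt(2) * vj, [1.0, vj @ vj]])

assert np.isclose(psi, ci_tilde @ vj_tilde)   # <i~|j~> reproduces psi_ij
```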
4 Related work

Our derivation of SGNS's solution is inspired by the work of Levy and Goldberg (2014), who proved that skip-gram with negative sampling (SGNS) (Mikolov et al., 2013) was implicitly factorizing the PMI − ln k matrix. However, they required additional assumptions for their derivation to hold. Li et al. (2015) explored relations between SGNS and matrix factorization, but their derivation diverges from Levy and Goldberg's result and masks the connection between SGNS and other low rank embedders. Other works have also explored theoretical or empirical relationships between SGNS and GloVe (Shi and Liu, 2014; Suzuki and Nagata, 2015; Levy et al., 2015; Arora et al., 2016).

5 Discussion

We observe common features between each of the algorithms (Table 1). In each case, ∂f_ij/∂ψ_ij takes the form (multiplier) · (difference). The multiplier is always a "tempered" version of N_ij (or N_iN_j); that is, it increases sublinearly with N_ij.

For each algorithm, φ_ij is equal to PMI or a scaled log of N_ij. Yet, the choice of ψ_ij in combination with φ_ij provides that every model is optimized when ⟨i|j⟩ tends toward PMI(i, j) (with or without a constant shift or scaling). We demonstrated that the optimum for SGNS (and FastText) is equivalent to the shifted PMI (§3.2). For GloVe, we showed that incorporation of the bias terms captures the unigram counts needed for PMI (§3.4). A similar property is found in LDS with regards to the L2 norm in its learning objective (Arora et al., 2016). Thus, these algorithms all converge on two key points: (1) an optimum in which model parameters are bilinearly related to PMI; and (2) the weighting of ∂f_ij/∂ψ_ij by some tempered form of N_ij.

6 Conclusion

Our low rank embedder framework has evoked the commonalities between many word embedding algorithms. We believe a robust understanding of these algorithms is a prerequisite for theoretically motivated development of deeper models. Indeed, we offer the following conjectures: deep embedding models would benefit by incorporating PMI statistics into their training objective; such models will also benefit from sub-linear scaling of frequent word pairs during training; and, lastly, such models would benefit by learning representations with a dual character, as all of the embedding algorithms we described do by learning vectors and covectors.

Acknowledgements

This work is supported by the Fonds de recherche du Québec – Nature et technologies and by the Natural Sciences and Engineering Research Council of Canada. The last author is supported in part by the Canada CIFAR AI Chair program.

References

Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. 2016. A latent variable model approach to PMI-based word embeddings. Transactions of the Association for Computational Linguistics, 4:385–399.

Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2017. A simple but tough-to-beat baseline for sentence embeddings. International Conference on Learning Representations.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. European Association for Computational Linguistics 2017, page 427.

Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems, pages 2177–2185.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.

Yitan Li, Linli Xu, Fei Tian, Liang Jiang, Xiaowei Zhong, and Enhong Chen. 2015. Word embedding revisited: a new representation learning and explicit matrix factorization perspective. In Proceedings of the 24th International Conference on Artificial Intelligence, pages 3650–3656. AAAI Press.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Eric Nalisnick, Bhaskar Mitra, Nick Craswell, and Rich Caruana. 2016. Improving document ranking with dual word embeddings. In Proceedings of the 25th International Conference Companion on World Wide Web, pages 83–84. International World Wide Web Conferences Steering Committee.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1532–1543.

Xin Rong. 2014. Word2vec parameter learning explained. arXiv preprint arXiv:1411.2738.

Noam Shazeer, Ryan Doherty, Colin Evans, and Chris Waterson. 2016. Swivel: Improving embeddings by noticing what's missing. arXiv preprint arXiv:1602.02215.

Tianze Shi and Zhiyuan Liu. 2014. Linking GloVe with word2vec. arXiv preprint arXiv:1411.5595.

Jun Suzuki and Masaaki Nagata. 2015. A unified learning framework of skip-grams and global vectors. In Proceedings of the 53rd Annual Meeting of the ACL and the 7th IJCNLP (Volume 2: Short Papers), volume 2, pages 186–191.

Egidio Terra and Charles L. A. Clarke. 2003. Frequency estimates for statistical word similarity measures. In Proceedings of the 2003 Conference of NAACL-HLT, Volume 1, pages 165–172. Association for Computational Linguistics.

Madeleine Udell, Corinne Horn, Reza Zadeh, Stephen Boyd, et al. 2016. Generalized low rank models. Foundations and Trends in Machine Learning, 9(1):1–118.

A Appendix

A.1 On the characteristic gradient

The relationship between ∂f_ij/∂ψ_ij and the gradient descent actions taken during learning requires simply taking the next step in the chain rule during differentiation. For simplicity of exposition, we will assume, like SGNS and Swivel, that ψ_ij = ⟨i|j⟩, although the motivation for taking this derivative holds for any definition of ψ_ij, provided that it is a kernel function of the model parameters.

By examining the derivative ∂f_ij/∂⟨i|j⟩ we observe the primary objective of the model (to approximate φ_ij with dot products), and how this objective symmetrically updates vectors and covectors during learning.

Consider the generic update that occurs for a single (i, j) pair with the pairwise loss function f_ij. The gradient descent rule for a single update to the vector for word j, using some learning rate η, is:

$$|j\rangle \leftarrow |j\rangle - \eta\,\frac{\partial f_{ij}}{\partial |j\rangle}. \qquad (11)$$

However, since f_ij is a function of ⟨i|j⟩ and not of the vectors or covectors independently, we can use the chain rule to arrive at the following:

$$|j\rangle \leftarrow |j\rangle - \eta\,\frac{\partial f_{ij}}{\partial \langle i|j\rangle}\,\frac{\partial \langle i|j\rangle}{\partial |j\rangle}, \qquad (12)$$

$$|j\rangle \leftarrow |j\rangle - \eta\,\frac{\partial f_{ij}}{\partial \langle i|j\rangle}\,\langle i|^{\top}, \qquad (13)$$

since ∂⟨i|j⟩/∂|j⟩ = ⟨i|. Symmetrically, we also arrive at the following for the updates to covectors:

$$\langle i| \leftarrow \langle i| - \eta\,\frac{\partial f_{ij}}{\partial \langle i|j\rangle}\,|j\rangle^{\top}. \qquad (14)$$

Therefore, taking ∂f_ij/∂⟨i|j⟩ (more generally, ∂f_ij/∂ψ_ij) to be the focal point of analysis in determining the objectives of the low rank embedders is well grounded.
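A minimal sketch of the updates in Eqs. 13 and 14, assuming ψ_ij = ⟨i|j⟩ and a precomputed scalar characteristic gradient; this only illustrates the symmetry of the chain rule and is not an excerpt of any released implementation.

```python
import numpy as np

def pairwise_update(ci, vj, char_grad, lr=0.01):
    """Apply Eqs. 13-14 for a single (i, j) pair.

    ci: covector <i| (shape (d,));  vj: vector |j> (shape (d,)).
    char_grad: the scalar d f_ij / d <i|j> evaluated at <i|j> = ci @ vj.
    The vector and covector receive symmetric updates through the
    chain rule, since d<i|j>/d|j> = <i| and d<i|j>/d<i| = |j>.
    """
    new_vj = vj - lr * char_grad * ci    # Eq. 13
    new_ci = ci - lr * char_grad * vj    # Eq. 14
    return new_ci, new_vj
```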
