Deconstructing word embedding algorithms

Kian Kenyon-Dean†∗, Edward Newell∗, Jackie Chi Kit Cheung
BMO AI Capabilities Team, Bank of Montreal - Toronto, Ontario
Mila - Québec AI Institute
McGill University - Montréal, Québec
[email protected]  [email protected]  [email protected]

∗ Kian and Edward contributed equally.
† This work was pursued while Kian was a member of Mila.

Abstract

Word embeddings are reliable feature representations of words used to obtain high quality results for various NLP applications. Uncontextualized word embeddings are used in many NLP tasks today, especially in resource-limited settings where high memory capacity and GPUs are not available. Given the historical success of word embeddings in NLP, we propose a retrospective on some of the most well-known word embedding algorithms. In this work, we deconstruct Word2vec, GloVe, and others, into a common form, unveiling some of the common conditions that seem to be required for making performant word embeddings. We believe that the theoretical findings in this paper can provide a basis for more informed development of future models.

1 Introduction

The advent of efficient uncontextualized word embedding algorithms (e.g., Word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014)) marked a historical breakthrough in NLP. Countless researchers employed word embeddings in new models to improve results on a multitude of NLP problems. In this work, we provide a retrospective analysis of these groundbreaking models of the past, which simultaneously offers theoretical insights for how future models can be developed and understood. We build on the theoretical work of Levy and Goldberg (2014), proving that their findings on the relationship between pointwise mutual information (PMI) and word embeddings go beyond Word2vec and singular value decomposition.

In particular, we generalize several word embedding algorithms into a common form by proposing the low rank embedder framework. We deconstruct each algorithm into its constituent parts, and find that, despite their many different hyperparameters, the algorithms collectively intersect upon the following two key design features. First, vector-covector dot products are learned to approximate PMI statistics in the corpus. Second, modulation of the loss gradient, directly or indirectly, is necessary to balance weak and strong signals arising from the highly imbalanced distribution of corpus statistics.

These findings can provide an informed basis for future development of both new embedding algorithms and deep contextualized models.

2 Fundamental concepts

We begin by formally defining embeddings, their vectors and covectors (also known as "input" and "output" vectors (Rong, 2014; Nalisnick et al., 2016)), and pointwise mutual information (PMI).

Embedding. In general topology, an embedding is understood as an injective structure-preserving map, f : X → Y, between two mathematical structures X and Y. A word embedding algorithm (f) learns an inner-product space (Y) to preserve a linguistic structure within a reference corpus of text, D (X), based on a vocabulary, V. The structure in D is analyzed in terms of the relationships between words induced by their co-appearances, according to a certain definition of context. In such an analysis, each word figures dually: (1) as a focal element inducing a local context; and (2) as an element of the local contexts induced by focal elements. To make these dual roles explicit, we distinguish two copies of the vocabulary: the focal, or term, words V_T, and the context words V_C.

Word embedding consists of two maps:

$$V_C \to \mathbb{R}^{1\times d}, \quad i \mapsto \langle i| \qquad\qquad V_T \to \mathbb{R}^{d\times 1}, \quad j \mapsto |j\rangle.$$

We use Dirac notation to distinguish vectors |j⟩, associated to focal words, from covectors ⟨i|, associated to context words.

In matrix notation, |j⟩ corresponds to a column vector and ⟨i| to a row vector. Their inner product is ⟨i|j⟩. We later demonstrate that many word embedding algorithms, intentionally or not, learn a vector space where the inner product between a focal word j and context word i aims to approximate their PMI in the reference corpus: ⟨i|j⟩ ≈ PMI(i, j).

Pointwise mutual information (PMI). PMI is a commonly used measure of association in computational linguistics, and has been shown to be consistent and reliable for many tasks (Terra and Clarke, 2003). It measures the deviation of the cooccurrence probability between two words i and j from the product of their marginal probabilities:

$$\mathrm{PMI}(i, j) := \ln \frac{p_{ij}}{p_i p_j} = \ln \frac{N\,N_{ij}}{N_i N_j}, \qquad (1)$$

where p_ij is the probability of word i and word j cooccurring (for some notion of cooccurrence), and where p_i and p_j are the marginal probabilities of words i and j occurring. The empirical PMI can be found by replacing probabilities with corpus statistics. Words are typically considered to cooccur if they are separated by no more than w words; N_ij is the number of counted cooccurrences between a context i and a term j; N_i, N_j, and N are computed by marginalizing over the N_ij statistics.
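For illustration, the empirical PMI of Eq. 1 can be computed from a cooccurrence-count matrix as in the following sketch. This is not code from the original work; the dense-matrix representation and the toy counts are assumptions made purely for the example.

```python
import numpy as np

def empirical_pmi(N_ij: np.ndarray) -> np.ndarray:
    """Empirical PMI from a |V_C| x |V_T| cooccurrence-count matrix.

    N_ij[i, j] counts how often context word i appears within the
    window of focal word j; the marginals N_i, N_j and the total N
    are obtained by summing over the counts, as in Section 2.
    """
    N = N_ij.sum()
    N_i = N_ij.sum(axis=1, keepdims=True)   # context marginals
    N_j = N_ij.sum(axis=0, keepdims=True)   # focal marginals
    with np.errstate(divide="ignore"):      # ln 0 -> -inf where N_ij = 0
        return np.log(N * N_ij) - np.log(N_i * N_j)

# toy example: 3 context words x 3 focal words
counts = np.array([[10., 2., 0.],
                   [ 2., 8., 1.],
                   [ 0., 1., 5.]])
print(empirical_pmi(counts))
```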

3 Word embedding algorithms

We will now introduce the low rank embedder framework for deconstructing word embedding algorithms, inspired by the theory of generalized low rank models (Udell et al., 2016). We unify several word embedding algorithms by observing them all from the common vantage point of their global loss function. Note that this framework is used for theoretical analysis, not necessarily implementation.

The global loss function for a low rank embedder takes the following form:

$$L = \sum_{(i,j)\in V_C\times V_T} f_{ij}\big(\psi(\langle i|, |j\rangle),\, \phi(i, j)\big), \qquad (2)$$

where ψ(⟨i|, |j⟩) is a kernel function of learned model parameters, and φ(i, j) is some scalar function (such as a measure of association based on how often i and j appear in the corpus); we denote these with ψ_ij and φ_ij for brevity. As well, f_ij are loss functions that take ψ_ij and φ_ij as inputs; all f_ij satisfy the property:

$$\frac{\partial f_{ij}}{\partial \psi_{ij}} = 0 \quad \text{at} \quad \psi_{ij} = \phi_{ij}. \qquad (3)$$

The design variable φ_ij is some function of corpus statistics, and its purpose is to quantitatively measure some relationship between words i and j. The design variable ψ_ij is a function of model parameters that aims to approximate φ_ij; i.e., an embedder's fundamental objective is to learn ψ_ij ≈ φ_ij, and thus to train embeddings that capture the statistical relationships measured by φ_ij. The simplest choice for the kernel function ψ_ij is to take ψ_ij = ⟨i|j⟩. But the framework allows any function that is symmetric and positive definite, allowing the inclusion of bias parameters (e.g., in GloVe) and subword parameterization (e.g., in FastText). We later demonstrate that skip-gram with negative sampling takes φ_ij := PMI(i, j) − ln k and ψ_ij := ⟨i|j⟩, and then learns parameter values that approximate ⟨i|j⟩ ≈ PMI(i, j) − ln k.

To understand the range of models encompassed, it is helpful to see how the framework relates (but is not limited) to matrix factorization. Consider φ_ij as providing the entries of a matrix: M := [φ_ij]_ij. For models that take ψ_ij = ⟨i|j⟩, we can write M̂ = WV, where W is defined as having row i equal to ⟨i|, and V as having column j equal to |j⟩. Then, the loss function can be rewritten as:

$$L = \sum_{(i,j)\in V_C\times V_T} f_{ij}\big((WV)_{ij},\, M_{ij}\big).$$

This loss function can be interpreted as matrix reconstruction error, because the constraint in Eq. 3 means that the gradient goes to zero as WV ≈ M.

Selecting a particular low rank embedder instance requires key design choices to be made: we must choose the embedding dimension d, the form of the loss terms f_ij, the kernel function ψ_ij, and the association function φ_ij. The derivative of f_ij with respect to ψ_ij, which we call the characteristic gradient, helps compare models because it exhibits the action of the gradient yet is symmetric in the parameters. In the Appendix we show how this derivative relates to gradient descent.

In the following subsections, we present the derivations of ∂f_ij/∂ψ_ij, ψ_ij, and φ_ij for SVD (Levy and Goldberg, 2014; Levy et al., 2015), SGNS (Mikolov et al., 2013), FastText (Joulin et al., 2017), GloVe (Pennington et al., 2014), and LDS (Arora et al., 2016). The derivation for Swivel (Shazeer et al., 2016) as a low rank embedder is trivial, as it is already posed as a matrix factorization of PMI statistics. We summarize the derivations in Table 1.
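As a minimal illustration of the matrix-reconstruction view, the sketch below takes the simple kernel ψ_ij = ⟨i|j⟩, a weighted squared-error f_ij, and full-batch gradient descent; these are assumed choices for the example and are not prescribed by the framework or by any specific algorithm in Table 1.

```python
import numpy as np

def low_rank_embedder_step(W, V, M, weights, lr=0.05):
    """One full-batch gradient step on L = sum_ij f_ij((WV)_ij, M_ij).

    W: |V_C| x d matrix of covectors (row i is <i|).
    V: d x |V_T| matrix of vectors (column j is |j>).
    M: matrix of association statistics, M_ij = phi_ij.
    weights: per-pair multiplier playing the role of the "tempered"
             N_ij factor that modulates the gradient.
    With f_ij taken (illustratively) as a weighted squared error, the
    characteristic gradient is weights_ij * (psi_ij - phi_ij).
    """
    psi = W @ V                          # psi_ij = <i|j>
    char_grad = weights * (psi - M)      # d f_ij / d psi_ij
    grad_W = char_grad @ V.T             # chain rule through psi
    grad_V = W.T @ char_grad
    return W - lr * grad_W, V - lr * grad_V

rng = np.random.default_rng(0)
M = rng.normal(size=(50, 40))            # stand-in for a PMI matrix
W = rng.normal(scale=0.1, size=(50, 8))
V = rng.normal(scale=0.1, size=(8, 40))
for _ in range(500):
    W, V = low_rank_embedder_step(W, V, M, np.ones_like(M))
```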

Model   | ∂f_ij/∂ψ_ij                            | ψ_ij              | φ_ij            | ⟨i|j⟩ ≈
SVD     | 2 · (ψ_ij − φ_ij)                      | ⟨i|j⟩             | PMI(i, j)       | PMI(i, j)
SGNS    | (N_ij + N⁻_ij) · (σ(ψ_ij) − σ(φ_ij))   | ⟨i|j⟩             | ln(N_ij/N⁻_ij)  | PMI(i, j) − ln k
GloVe   | 2h(N_ij) · (ψ_ij − φ_ij)               | ⟨i|j⟩ + b_i + b_j | ln N_ij         | PMI(i, j)
LDS     | 4h(N_ij) · (ψ_ij − φ_ij + C)           | ‖⟨i| + |j⟩ᵀ‖²     | ln N_ij         | d·PMI(i, j) − dγ
Swivel  | √N_ij · (ψ_ij − φ_ij)                  | ⟨i|j⟩             | PMI(i, j)       | PMI(i, j)
        | 1 · σ(ψ_ij − φ_ij)                     | ⟨i|j⟩             | PMI*(i, j)      | PMI*(i, j)

Table 1: Comparison of low rank embedders. The final column shows the value of ⟨i|j⟩ at ∂f_ij/∂ψ_ij = 0. GloVe and LDS set f_ij = 0 when N_ij = 0; h(N_ij) is a weighting function sublinear in N_ij. Swivel takes one form when N_ij > 0 (first row) and another when N_ij = 0 (second row). N⁻_ij is the number of negative samples; in SGNS, N⁻_ij ∝ N_i N_j, and both N_ij and N⁻_ij are tempered by undersampling and unigram smoothing.

3.1 SVD as a low rank embedder

Singular value decomposition (SVD) of the positive-PMI (PPMI) matrix is used by Levy and Goldberg (2014); Levy et al. (2015) to produce word embeddings that perform more or less equivalently to SGNS and GloVe. Converting the PMI matrix into PPMI is a trivial preprocessing step; φ is augmented according to a factor α = 0 such that φ_ij = 0 ∀ φ_ij ≤ α. We now prove why SVD of the PMI matrix results in word embeddings with dot products ⟨i|j⟩ ≈ PMI(i, j), noting that this proof naturally holds for all augmentations of φ according to the α factor, including PPMI.

Proof. Truncated SVD provides an optimal solution to the problem min_D ‖D − A‖_F for some integer K less than the dimensionality of matrix A such that rank(D) = K (Udell et al., 2016). The solution is the truncated SVD of A, where D = Σ_{k=1}^{K} σ_k u_k v_kᵀ, with σ_k being the kth singular value and u_k and v_k the kth left and right singular vectors.

Within our framework, the truncated SVD of the PMI matrix thus solves the following loss function (note A_ij = φ_ij = PMI(i, j)):

$$L = \sum_{(i,j)\in V_C\times V_T} \big(\psi_{ij} - \mathrm{PMI}(i, j)\big)^2, \qquad (4)$$

where ψ_ij = u_iᵀ Σ v_j. Allowing the square matrix of singular values Σ to be absorbed into the vectors (as in Levy et al. (2015)), we have ⟨i| = u_iᵀ and |j⟩ = Σ v_j. Thus, taking the derivative ∂f_ij/∂ψ_ij (noting that f_ij here is simply the squared difference between ψ_ij and φ_ij) and setting it equal to zero, we observe:

$$\langle i|j\rangle = \mathrm{PMI}(i, j). \qquad (5)$$
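A minimal sketch of this procedure, assuming the empirical PMI matrix is already available as a dense array and using numpy's dense SVD; the dimension d and the clipping factor are illustrative arguments, and this does not reproduce the experimental setup of the cited works.

```python
import numpy as np

def svd_embeddings(pmi: np.ndarray, d: int, alpha: float = 0.0):
    """Factorize the (augmented) PMI matrix so that <i|j> ~= PMI(i, j).

    Entries at or below alpha are zeroed out (alpha = 0 gives PPMI).
    The singular values are absorbed into the focal-word vectors,
    so <i| = u_i and |j> = Sigma v_j, as in Section 3.1.
    """
    phi = np.where(pmi > alpha, pmi, 0.0)
    U, S, Vt = np.linalg.svd(phi, full_matrices=False)
    covectors = U[:, :d]                       # rows are <i|
    vectors = (np.diag(S[:d]) @ Vt[:d]).T      # rows are |j> transposed
    return covectors, vectors

# covectors @ vectors.T reconstructs the rank-d approximation of phi
```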

3.2 SGNS as a low rank embedder

Mikolov et al. (2013) proposed skip-gram with negative sampling with the following loss function:

$$L = -\sum_{(i,j)\in D_2} \Big\{ \ln \sigma\langle i|j\rangle + \sum_{\ell=1}^{k} \mathbb{E}\big[\ln\big(1 - \sigma\langle i'_\ell|j\rangle\big)\big] \Big\},$$

where σ is the logistic sigmoid function, D_2 is a list containing each cooccurrence of a context word i with a focal word j in the corpus, and the expectation is taken by drawing i'_ℓ from the (smoothed) unigram distribution to generate k "negative samples" for a given focal word (Mikolov et al., 2013). We will demonstrate that SGNS is a low rank embedder with ⟨i|j⟩ ≈ PMI(i, j) − ln k.

Proof. We can transform the loss function by counting the number of times each pair occurs in the corpus, N_ij, and the number of times each pair is drawn as a negative sample, N⁻_ij, while indexing the sum over the set V_C × V_T:

$$L = -\sum_{(i,j)\in V_C\times V_T} \Big\{ N_{ij} \ln \sigma\langle i|j\rangle + N^{-}_{ij} \ln\big(1 - \sigma\langle i|j\rangle\big) \Big\}.$$

The global loss is almost in the required form for a low rank embedder (Eq. 2), and the appropriate setting for the model approximation function is ψ_ij = ⟨i|j⟩.

Calculating the partial derivative with respect to the model approximation function ψ_ij, and following algebraic manipulation (using the identity a ≡ (a + b)σ(ln(a/b))), we arrive at the following definition of the characteristic gradient for SGNS as a low rank embedder, where ∂f_ij/∂ψ_ij = ∂L/∂⟨i|j⟩:

$$\frac{\partial L}{\partial \langle i|j\rangle} = N^{-}_{ij}\,\sigma\langle i|j\rangle - N_{ij}\big(1 - \sigma\langle i|j\rangle\big) = \big(N_{ij} + N^{-}_{ij}\big)\Big[\sigma\big(\langle i|j\rangle\big) - \sigma\Big(\ln \tfrac{N_{ij}}{N^{-}_{ij}}\Big)\Big] = \big(N_{ij} + N^{-}_{ij}\big)\big[\sigma(\psi_{ij}) - \sigma(\phi_{ij})\big]. \qquad (6)$$

This provides that the association function for SGNS is φ_ij = ln(N_ij/N⁻_ij), since the derivative will be equal to zero at that point (Eq. 3). However, recall that negative samples are drawn according to the unigram distribution (or a smoothed variant (Levy et al., 2015)). This means that N⁻_ij = kN_iN_j/N. Therefore, in agreement with Levy and Goldberg (2014), we find that:

$$\phi_{ij} = \ln \frac{N_{ij}\,N}{N_i N_j k} = \mathrm{PMI}(i, j) - \ln k. \qquad (7)$$
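Because the transformed loss depends only on the count statistics (N_ij, N⁻_ij), it can be optimized directly from those counts rather than by sampling. The sketch below does so with full-batch gradient descent on the characteristic gradient of Eq. 6; the unsmoothed negative counts, learning rate, and dimensionality are assumptions made for illustration, not settings from the original algorithm.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_from_counts(N_ij, k=5, d=50, lr=1e-3, steps=2000, seed=0):
    """Optimize L = -sum_ij [N_ij ln s(<i|j>) + N-_ij ln(1 - s(<i|j>))].

    With N-_ij = k N_i N_j / N (unsmoothed here, for simplicity), the
    optimum satisfies <i|j> = ln(N_ij / N-_ij) = PMI(i, j) - ln k.
    """
    N = N_ij.sum()
    N_neg = k * np.outer(N_ij.sum(axis=1), N_ij.sum(axis=0)) / N
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(N_ij.shape[0], d))   # covectors <i|
    V = rng.normal(scale=0.1, size=(d, N_ij.shape[1]))   # vectors |j>
    for _ in range(steps):
        s = sigmoid(W @ V)
        # Eq. 6: dL/d<i|j> = (N_ij + N-_ij) * sigma(<i|j>) - N_ij
        char_grad = (N_ij + N_neg) * s - N_ij
        # the learning rate may need tuning for a given count scale
        W, V = W - lr * char_grad @ V.T, V - lr * W.T @ char_grad
    return W, V
```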

3.3 FastText as a low rank embedder

Proposed by Joulin et al. (2017), FastText's motivation is orthogonal to the present work. Its purpose is to provide subword-based representations of words to improve the vocabulary coverage and generalizability of word embeddings. Nonetheless, it can also be understood as a low rank embedder.

Proof. FastText uses a loss function that is identical to SGNS, except that the vector for each word is taken as the sum of embeddings for all character n-grams appearing in the word, with 3 ≤ n ≤ 6. Therefore, define |j⟩ by |j⟩ ≡ Σ_{g∈z(j)} |g⟩, where |g⟩ is the vector for n-gram g, and z(j) is the set of n-grams in word j. Covectors are accorded to words directly, so need not be redefined. The loss function and the derivation of entries for Table 1 are then formally identical to those for SGNS. This provides that ψ_ij = ⟨i|j⟩ and φ_ij = PMI(i, j) − ln k.

3.4 GloVe as a low rank embedder

GloVe was proposed as an algorithm halfway between sampling methods and matrix factorization (Pennington et al., 2014). Ignoring samples where N_ij = 0, GloVe uses the following loss function:

$$L = \sum_{ij} h(N_{ij})\big(\langle i|j\rangle + b_i + b_j - \ln N_{ij}\big)^2, \qquad (8)$$

where b_i and b_j are learned bias parameters, and h(N_ij) is a weighting function sublinear in N_ij. GloVe can be cast as a low rank embedder by using the model approximation function as a kernel with bias parameters, and setting the association measure to simply be the objective:

$$\psi_{ij} = \begin{bmatrix} \langle i|_1 & \cdots & \langle i|_d & b_i & 1 \end{bmatrix} \cdot \begin{bmatrix} |j\rangle_1 & \cdots & |j\rangle_d & 1 & b_j \end{bmatrix}^{\top}, \qquad \phi_{ij} = \ln N_{ij}.$$

Proof. Observe an optimal solution to the loss function, when ∂f_ij/∂ψ_ij = 0:

$$\frac{\partial f_{ij}}{\partial \psi_{ij}} = 2h(N_{ij})\big[\langle i|j\rangle + b_i + b_j - \ln N_{ij}\big] = 0 \;\Longrightarrow\; \langle i|j\rangle + b_i + b_j = \ln N_{ij}.$$

Multiplying the log operand by 1:

$$\langle i|j\rangle + b_i + b_j = \ln\Big( N_{ij}\,\frac{N_i N_j}{N}\,\frac{N}{N_i N_j} \Big) \qquad (9)$$
$$= \ln\frac{N_i}{\sqrt{N}} + \ln\frac{N_j}{\sqrt{N}} + \mathrm{PMI}(i, j). \qquad (10)$$

On the right side, we have two terms that depend respectively only on i and j, which are candidates for the bias terms. Based on this equation alone, we cannot draw any conclusions. However, empirically the bias terms are in fact very near ln(N_i/√N) and ln(N_j/√N), and PMI is observed to be normally distributed, as can be seen in Fig. 1. This means that Eq. 10 provides ⟨i|j⟩ ≈ PMI(i, j).

[Figure 1 (plots omitted): A) Histogram of PMI(i, j) values, for all pairs (i, j) with N_ij > 0. B) Scatter plot of GloVe's learned biases. Both from a Wikipedia 2018 corpus.]
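The argument above rests on the empirical observation (Fig. 1) that GloVe's learned biases track ln(N_i/√N) and ln(N_j/√N). Given trained biases and unigram counts (assumed inputs here), that observation can be probed with a short check such as the following; the function name and the use of a Pearson correlation are illustrative choices, not part of the original analysis.

```python
import numpy as np

def bias_vs_unigram_log(biases: np.ndarray, unigram_counts: np.ndarray):
    """Compare learned GloVe biases b_i with ln(N_i / sqrt(N)).

    If the biases capture the unigram terms in Eq. 10, the two
    quantities should be strongly correlated (cf. Fig. 1B).
    """
    N = unigram_counts.sum()
    predicted = np.log(unigram_counts / np.sqrt(N))
    corr = np.corrcoef(biases, predicted)[0, 1]
    return predicted, corr
```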

Analyzing the optimum of GloVe's loss function yields important insights. First, GloVe can be added to the list of low rank embedders that learn a bilinear parameterization of PMI. Second, we can see why such a parameterization is advantageous. Generally, it helps to standardize features of low rank models (Udell et al., 2016), and this is essentially what transforming cooccurrence counts into PMI achieves. Thus, PMI can be viewed as a parameterization trick, providing an approximately normal target association to be modelled.

3.5 LDS as a low rank embedder

Arora et al. (2016) introduced an embedding perspective based on generative modelling with random walks through a latent discourse space (LDS). LDS provided a theoretical basis for the performant SIF document embedding algorithm, developed soon afterwards (Arora et al., 2017). We now demonstrate that LDS is also a low rank embedder.

Proof. The low rank learning objective for LDS follows directly from Corollary 2.3 in Arora et al. (2016):

$$\mathrm{PMI}(i, j) = \frac{\langle i|j\rangle}{d} + \gamma + O(\epsilon).$$

∂f_ij/∂ψ_ij can be found by straightforward differentiation of LDS's loss function:

$$L = \sum_{ij} h(N_{ij})\big(\ln N_{ij} - \|\langle i| + |j\rangle^{\top}\|^2 - C\big)^2,$$

where h(N_ij) is as defined by GloVe. The quadratic term is a valid kernel function because:

$$\psi_{ij} = \|\langle i| + |j\rangle^{\top}\|^2 = \langle \tilde{i}|\tilde{j}\rangle,$$

where

$$\langle \tilde{i}| = \begin{bmatrix} \sqrt{2}\,\langle i|_1 & \cdots & \sqrt{2}\,\langle i|_d & \langle i|\langle i|^{\top} & 1 \end{bmatrix}, \qquad |\tilde{j}\rangle = \begin{bmatrix} \sqrt{2}\,|j\rangle_1 & \cdots & \sqrt{2}\,|j\rangle_d & 1 & |j\rangle^{\top}|j\rangle \end{bmatrix}^{\top}.$$
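The kernel identity used above, that the squared norm equals the dot product of the augmented representations, can be verified numerically in a few lines; this is only a sanity check of the algebra, not part of the original derivation.

```python
import numpy as np

rng = np.random.default_rng(1)
ci = rng.normal(size=5)            # covector <i| (as a length-d array)
vj = rng.normal(size=5)            # vector |j>   (as a length-d array)

# psi_ij = ||<i| + |j>^T||^2
psi = np.sum((ci + vj) ** 2)

# augmented representations from Section 3.5
ci_tilde = np.concatenate([np.sqrt(2) * ci, [ci @ ci, 1.0]])
vj_tilde = np.concatenate([np.sqrt(2) * vj, [1.0, vj @ vj]])

assert np.isclose(psi, ci_tilde @ vj_tilde)   # <i~|j~> reproduces psi_ij
```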
4 Related work

Our derivation of SGNS's solution is inspired by the work of Levy and Goldberg (2014), who proved that skip-gram with negative sampling (SGNS) (Mikolov et al., 2013) was implicitly factorizing the PMI − ln k matrix. However, they required additional assumptions for their derivation to hold. Li et al. (2015) explored relations between SGNS and matrix factorization, but their derivation diverges from Levy and Goldberg's result and masks the connection between SGNS and other low rank embedders. Other works have also explored theoretical or empirical relationships between SGNS and GloVe (Shi and Liu, 2014; Suzuki and Nagata, 2015; Levy et al., 2015; Arora et al., 2016).

5 Discussion

We observe common features between each of the algorithms (Table 1). In each case, ∂f_ij/∂ψ_ij takes the form (multiplier) · (difference). The multiplier is always a "tempered" version of N_ij (or N_iN_j); that is, it increases sublinearly with N_ij.

For each algorithm, φ_ij is equal to PMI or a scaled log of N_ij. Yet, the choice of ψ_ij in combination with φ_ij provides that every model is optimized when ⟨i|j⟩ tends toward PMI(i, j) (with or without a constant shift or scaling). We demonstrated that the optimum for SGNS (and FastText) is equivalent to the shifted PMI (§3.2). For GloVe, we showed that incorporation of the bias terms captures the unigram counts needed for PMI (§3.4). A similar property is found in LDS with regards to the L2 norm in its learning objective (Arora et al., 2016). Thus, these algorithms all converge on two key points: (1) an optimum in which model parameters are bilinearly related to PMI; and (2) the weighting of ∂f_ij/∂ψ_ij by some tempered form of N_ij.

6 Conclusion

Our low rank embedder framework has evoked the commonalities between many word embedding algorithms. We believe a robust understanding of these algorithms is a prerequisite for theoretically motivated development of deeper models. Indeed, we offer the following conjectures: deep embedding models would benefit by incorporating PMI statistics into their training objective; such models will also benefit from sub-linear scaling of frequent word pairs during training; and, lastly, such models would benefit by learning representations with a dual character, as all of the embedding algorithms we described do by learning vectors and covectors.

Acknowledgements

This work is supported by the Fonds de recherche du Québec – Nature et technologies and by the Natural Sciences and Engineering Research Council of Canada. The last author is supported in part by the Canada CIFAR AI Chair program.

References

Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. 2016. A latent variable model approach to PMI-based word embeddings. Transactions of the Association for Computational Linguistics, 4:385–399.

Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2017. A simple but tough-to-beat baseline for sentence embeddings. International Conference on Learning Representations.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. European Association for Computational Linguistics 2017, page 427.

Omer Levy and Yoav Goldberg. 2014. Neural word embedding as implicit matrix factorization. In Advances in Neural Information Processing Systems, pages 2177–2185.

Omer Levy, Yoav Goldberg, and Ido Dagan. 2015. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225.

Yitan Li, Linli Xu, Fei Tian, Liang Jiang, Xiaowei Zhong, and Enhong Chen. 2015. Word embedding revisited: a new representation learning and explicit matrix factorization perspective. In Proceedings of the 24th International Conference on Artificial Intelligence, pages 3650–3656. AAAI Press.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Eric Nalisnick, Bhaskar Mitra, Nick Craswell, and Rich Caruana. 2016. Improving document ranking with dual word embeddings. In Proceedings of the 25th International Conference Companion on World Wide Web, pages 83–84. International World Wide Web Conferences Steering Committee.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1532–1543.

Xin Rong. 2014. Word2vec parameter learning explained. arXiv preprint arXiv:1411.2738.

Noam Shazeer, Ryan Doherty, Colin Evans, and Chris Waterson. 2016. Swivel: Improving embeddings by noticing what's missing. arXiv preprint arXiv:1602.02215.

Tianze Shi and Zhiyuan Liu. 2014. Linking GloVe with word2vec. arXiv preprint arXiv:1411.5595.

Jun Suzuki and Masaaki Nagata. 2015. A unified learning framework of skip-grams and global vectors. In Proceedings of the 53rd Annual Meeting of the ACL and the 7th IJCNLP (Volume 2: Short Papers), volume 2, pages 186–191.

Egidio Terra and Charles L. A. Clarke. 2003. Frequency estimates for statistical word similarity measures. In Proceedings of the 2003 Conference of NAACL-HLT, Volume 1, pages 165–172. Association for Computational Linguistics.

Madeleine Udell, Corinne Horn, Reza Zadeh, Stephen Boyd, et al. 2016. Generalized low rank models. Foundations and Trends in Machine Learning, 9(1):1–118.

A Appendix

A.1 On the characteristic gradient

The relationship between ∂f_ij/∂ψ_ij and the gradient descent actions taken during learning requires simply taking the next step in the chain rule during differentiation. For simplicity of exposition, we will assume, like SGNS and Swivel, that ψ_ij = ⟨i|j⟩, although the motivation for taking this derivative holds for any definition of ψ_ij, provided that it is a kernel function of the model parameters.

By examining the derivative ∂f_ij/∂⟨i|j⟩ we observe the primary objective of the model (to approximate φ_ij with dot products), and how this objective symmetrically updates vectors and covectors during learning.

Consider the generic update that occurs for a single (i, j) pair with the pairwise loss function f_ij. The gradient descent rule for a single update to the vector for word j, using some learning rate η, is:

$$|j\rangle \leftarrow |j\rangle - \eta\,\frac{\partial f_{ij}}{\partial |j\rangle}. \qquad (11)$$

However, since f_ij is a function of ⟨i|j⟩ and not of the vectors or covectors independently, we can use the chain rule to arrive at the following:

$$|j\rangle \leftarrow |j\rangle - \eta\,\frac{\partial f_{ij}}{\partial \langle i|j\rangle}\,\frac{\partial \langle i|j\rangle}{\partial |j\rangle}, \qquad (12)$$

$$|j\rangle \leftarrow |j\rangle - \eta\,\frac{\partial f_{ij}}{\partial \langle i|j\rangle}\,\langle i|^{\top}, \qquad (13)$$

since ∂⟨i|j⟩/∂|j⟩ = ⟨i|. Symmetrically, we also arrive at the following for the updates to covectors:

$$\langle i| \leftarrow \langle i| - \eta\,\frac{\partial f_{ij}}{\partial \langle i|j\rangle}\,|j\rangle^{\top}. \qquad (14)$$

Therefore, taking ∂f_ij/∂⟨i|j⟩ (more generally, ∂f_ij/∂ψ_ij) to be the focal point of analysis in determining the objectives of the low rank embedders is well grounded.
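A minimal sketch of the updates in Eqs. 13 and 14, assuming ψ_ij = ⟨i|j⟩ and a precomputed scalar characteristic gradient; this only illustrates the symmetry of the chain rule and is not an excerpt of any released implementation.

```python
import numpy as np

def pairwise_update(ci, vj, char_grad, lr=0.01):
    """Apply Eqs. 13-14 for a single (i, j) pair.

    ci: covector <i| (shape (d,));  vj: vector |j> (shape (d,)).
    char_grad: the scalar d f_ij / d <i|j> evaluated at <i|j> = ci @ vj.
    The vector and covector receive symmetric updates through the
    chain rule, since d<i|j>/d|j> = <i| and d<i|j>/d<i| = |j>.
    """
    new_vj = vj - lr * char_grad * ci    # Eq. 13
    new_ci = ci - lr * char_grad * vj    # Eq. 14
    return new_ci, new_vj
```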
