Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17)

Distributed Negative Sampling for Word Embeddings

Stergios Stergiou (Yahoo Research), Zygimantas Straznickas (MIT), Rolina Wu (University of Waterloo), Kostas Tsioutsiouliklis (Yahoo Research)

Abstract

Word2Vec recently popularized dense vector word representations as fixed-length features for machine learning algorithms and is in widespread use today. In this paper we investigate one of its core components, Negative Sampling, and propose efficient distributed algorithms that allow us to scale to vocabulary sizes of more than 1 billion unique words and corpus sizes of more than 1 trillion words.

Introduction

Recently, Mikolov et al. (Mikolov et al. 2013; Mikolov and Dean 2013) introduced Word2Vec, a suite of algorithms for unsupervised training of dense vector representations of words on large corpora. The resulting embeddings have been shown (Mikolov and Dean 2013) to capture semantic word similarity as well as more complex semantic relationships through linear vector manipulation, such as vec(“Madrid”) − vec(“Spain”) + vec(“France”) ≈ vec(“Paris”). Word2Vec achieved a significant performance increase over the state-of-the-art while maintaining or even increasing the quality of the results in Natural Language Processing (NLP) applications.

More recently, novel applications of word2vec involving composite “words” and training corpora have been proposed. The unique ideas empowering the word2vec models have been adopted by researchers from other fields for tasks beyond NLP, including relational entities (Bordes et al. 2013; Socher et al. 2013), general text-based attributes (Kiros, Zemel, and Salakhutdinov 2014), descriptive text of images (Kiros, Salakhutdinov, and Zemel 2014), nodes in the graph structure of networks (Perozzi, Al-Rfou, and Skiena 2014), queries (Grbovic et al. 2015) and online ads (Ordentlich et al. 2016). Although most NLP applications of word2vec do not require training of large vocabularies, many of the above-mentioned real-world applications do. For example, the number of unique nodes in a social network (Perozzi, Al-Rfou, and Skiena 2014), or the number of unique queries in a search engine (Grbovic et al. 2015), can easily reach a few hundred million, a scale that is not achievable using existing word2vec systems.

Training for such large vocabularies presents several challenges. Word2Vec needs to maintain two d-dimensional vectors of single-precision floating point numbers for each vocabulary word. All vectors need to be kept in main memory to achieve acceptable training latency, which is impractical for contemporary commodity servers. As a concrete example, Ordentlich et al. (Ordentlich et al. 2016) describe an application in which they use word embeddings to match search queries to online ads. Their dictionary comprises ≈ 200 million composite words, which implies a main memory requirement of ≈ 800 GB for d = 500. A complementary challenge is that corpora sizes themselves increase. The largest reported dataset that has been used was 100 billion words (Mikolov and Dean 2013), which required training time in the order of days. Such training speeds are impractical for web-scale corpus sizes or for applications that require frequent retraining in order to avoid staleness of the models. Even for smaller corpora, reduced training times shorten iteration cycles and increase research agility.
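For concreteness, the ≈ 800 GB figure above is a back-of-the-envelope estimate that follows from storing two single-precision (4-byte) vectors per vocabulary word:

$$2 \times (2 \times 10^{8})\ \text{words} \times 500\ \text{dimensions} \times 4\ \text{bytes} = 8 \times 10^{11}\ \text{bytes} \approx 800\ \text{GB}.$$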
Word2Vec is an umbrella term for a collection of word embedding models. The focus of this work is on the Skip-Grams with Negative Sampling (SGNS) model, which has been shown (Mikolov and Dean 2013) experimentally to perform better, especially for larger corpora. The central component of SGNS is Negative Sampling (NS). Our main contributions are: (1) a novel large-scale distributed SGNS training system developed on top of a custom graph processing framework; (2) a collection of approximation algorithms for NS that maintain the quality of single-node implementations while offering significant speedups; and (3) a novel hierarchical distributed algorithm for sampling from a discrete distribution. We obtain results on a corpus created from the top 2 billion web pages of Yahoo Search, which includes 1.066 trillion words and a dictionary of 1.42 billion words. Training time per epoch is 2 hours for typical hyperparameters. To the best of our knowledge, this is the first work that has been shown to scale to corpora of more than 1 trillion words and out-of-memory dictionaries.

Related Work

Dense word embeddings have a rich history (Hinton, McClelland, and Rumelhart 1986; Elman 1990; Hinton, Rumelhart, and Williams 1985), with the first popular Neural Network Language Model (NNLM) introduced in (Bengio et al. 2003). Mikolov introduced the concept of using word vectors constructed with a simple hidden layer to train the full NNLM (Mikolov 2008; Mikolov et al. 2009), and this was the first work to practically reduce the high dimensionality of bag-of-words representations to dense, fixed-length vector representations. This model later inspired Mikolov's Word2Vec system (Mikolov and Dean 2013), which has since been implemented and improved by many researchers.

Single Node Implementations

The reference implementation of word2vec, later implemented in (Gensim 2016), (TensorFlow 2016) and (Medallia 2016), uses multithreading to increase training speed. Gensim improves its word2vec implementation three-fold by integrating BLAS primitives. Since SGNS word2vec maintains two matrices of size Words × Dimensions, limitations on main memory and single-machine performance make it impossible for single-node implementations to scale up with dictionary and corpus size.

Distributed Implementations

Examples of distributed training systems that have been presented are (Deeplearning4j 2016), (MLLib 2016), Ji et al.'s (Ji et al. 2016) and Ordentlich et al.'s (Ordentlich et al. 2016). Deeplearning4j adopts GPUs to accelerate learning times. Spark MLLib's word2vec implementation uses an approach that is similar to Ji et al.'s (Ji et al. 2016). Both implementations replicate the word embeddings on multiple executors, which are then periodically synchronized among themselves. The input corpus is partitioned onto the executors, each of which independently updates its local copy of the embeddings. This technique reduces training time, yet the memory required to store all vectors in all Spark executors makes this approach unsuitable for large dictionaries, although it addresses well the problem of large corpora.

Ordentlich et al. provide an alternative approach in (Ordentlich et al. 2016). Their implementation scales to large vocabularies by using a Parameter Server (PS) framework to store the latest values of the model parameters. The word embedding matrices are distributed over a predefined number k of worker nodes by assigning each node 1/k-th of the columns (dimensions). The worker nodes then perform partial inner product sum operations on their corresponding segment, upon request by a separate computation node. This parameter server approach enabled the models to be trained over large vocabularies. Nevertheless, it is still hard to scale up the system, since the number of executors is limited by the dimensionality of the word embeddings. In fact, it only scales up to 10-20 computing nodes for typical model parameters, while the average throughput they obtained is 1.6 × 10^6 words per second because of this limitation.

Continuous Skip-gram Model

For each skip-gram (w_I, w_O) and a set of negative samples W_neg, Mikolov et al. define the loss function

$$E = -\log \sigma(v^{\prime\top}_{w_O} v_{w_I}) - \sum_{w_j \in W_{neg}} \log \sigma(-v^{\prime\top}_{w_j} v_{w_I}),$$

where σ(x) = 1/(1 + e^(−x)), and v_w and v'_w are the “input” and “output” vector representations for word w. Its gradients with respect to the word vectors are:

$$\frac{\partial E}{\partial v'_{w_O}} = \left(\sigma(v^{\prime\top}_{w_O} v_{w_I}) - 1\right) v_{w_I}$$

$$\frac{\partial E}{\partial v'_{w_j}} = \sigma(v^{\prime\top}_{w_j} v_{w_I})\, v_{w_I}$$

$$\frac{\partial E}{\partial v_{w_I}} = \left(\sigma(v^{\prime\top}_{w_O} v_{w_I}) - 1\right) v'_{w_O} + \sum_{w_j \in W_{neg}} \sigma(v^{\prime\top}_{w_j} v_{w_I})\, v'_{w_j}$$

The loss function is then minimized using a variant of stochastic gradient descent, yielding a simple learning algorithm that iterates over all the skip-grams, calculates the gradients of their loss functions and updates the values of their respective vectors, as well as the vectors of the negative samples chosen for each skip-gram. A summary of the model is depicted in Algorithm 1.

Algorithm 1 SGNS Word2Vec
 1: function UPDATE(η)
 2:   for all skip-grams (w_I, w_O) do
 3:     W_neg ← {}
 4:     for i ∈ [1, N] do
 5:       W_neg ← W_neg ∪ sample(P_n)
 6:     g_{w_O} ← (σ(v_{w_O}^T v_{w_I}) − 1) v_{w_I}    ▷ get gradients
 7:     g_{w_I} ← (σ(v_{w_O}^T v_{w_I}) − 1) v_{w_O}
 8:     for all w_j ∈ W_neg do
 9:       g_{w_j} ← σ(v_{w_j}^T v_{w_I}) v_{w_I}
10:       g_{w_I} ← g_{w_I} + σ(v_{w_j}^T v_{w_I}) v_{w_j}
11:     v_{w_O} ← v_{w_O} − η · g_{w_O}    ▷ update vectors
12:     v_{w_I} ← v_{w_I} − η · g_{w_I}
13:     for all w_j ∈ W_neg do
14:       v_{w_j} ← v_{w_j} − η · g_{w_j}
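To make the update rule concrete, the following NumPy sketch implements the per-skip-gram update of Algorithm 1 on a single machine. It is only an illustration of the mathematics above: the array names (W_in, W_out), the hyperparameter defaults and the use of numpy.random for drawing negatives are our own choices, not part of the paper's distributed system.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_update(W_in, W_out, w_I, w_O, noise_probs, N=5, eta=0.025, rng=None):
    """One SGNS update for a single skip-gram (w_I, w_O), following Algorithm 1.

    W_in, W_out : |W| x d arrays holding the "input" and "output" vectors.
    noise_probs : noise distribution P_n over the vocabulary.
    N           : number of negative samples per skip-gram.
    eta         : learning rate.
    """
    if rng is None:
        rng = np.random.default_rng()
    v_I = W_in[w_I].copy()  # snapshot of the input vector used in all gradients

    # Lines 3-5: draw N negative samples from the noise distribution P_n.
    W_neg = rng.choice(len(noise_probs), size=N, p=noise_probs)

    # Lines 6-7: gradients for the observed output word and the input word.
    s_O = sigmoid(W_out[w_O] @ v_I)
    g_O = (s_O - 1.0) * v_I
    g_I = (s_O - 1.0) * W_out[w_O]

    # Lines 8-10: gradients for the negative samples, accumulated into g_I.
    g_neg = []
    for w_j in W_neg:
        s_j = sigmoid(W_out[w_j] @ v_I)
        g_neg.append((w_j, s_j * v_I))
        g_I = g_I + s_j * W_out[w_j]

    # Lines 11-14: gradient-descent updates of all touched vectors.
    W_out[w_O] -= eta * g_O
    W_in[w_I] -= eta * g_I
    for w_j, g_j in g_neg:
        W_out[w_j] -= eta * g_j
```

Only the "output" vectors of the N negative samples and the two vectors of the skip-gram itself are touched, which is what keeps the per-skip-gram cost proportional to the number of samples rather than the vocabulary size.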
Distributed Discrete Sampling

Central to SGNS is sampling from a noise distribution P_n(w), w ∈ W, over the vocabulary W. Mikolov et al. (Mikolov and Dean 2013) experimentally showed that the unigram distribution U(w) raised to the 3/4-th power significantly outperformed both the U(w) and uniform distributions. Negative Sampling lies at the core of the performance improvements obtained by Word2Vec as it allows for updating the loss function in O(kd) instead of O(|W|d), where k is the number of noise samples per skip-gram and d the dimensionality of the embedding. Nevertheless, it remains the dominant time component of the algorithm. It is desirable to be able to draw from the noise distribution.
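As a baseline for what needs to be distributed, a minimal single-node sampler for P_n(w) ∝ U(w)^{3/4} can be sketched as follows. The helper names and toy counts are ours, and this is not the hierarchical distributed algorithm contributed by the paper.

```python
import numpy as np

def build_noise_distribution(unigram_counts, power=0.75):
    """Noise distribution P_n(w) proportional to U(w)^(3/4)."""
    weights = np.asarray(unigram_counts, dtype=np.float64) ** power
    return weights / weights.sum()

def sample_negatives(noise_probs, k, rng):
    """Draw k samples by inverse-CDF lookup: a binary search on the
    cumulative distribution, O(log |W|) per sample on a single node."""
    cdf = np.cumsum(noise_probs)
    idx = np.searchsorted(cdf, rng.random(k), side="right")
    return np.minimum(idx, len(noise_probs) - 1)  # guard against float round-off

# Toy example: a 4-word vocabulary with made-up unigram counts.
rng = np.random.default_rng(0)
P_n = build_noise_distribution([100, 10, 1, 50])
print(sample_negatives(P_n, 5, rng))
```

A common single-node refinement is to precompute an alias table, which makes each draw O(1) after O(|W|) preprocessing.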