Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17)

Distributed Negative Sampling for Word Embeddings

Stergios Stergiou (Yahoo Research), Zygimantas Straznickas (MIT), Rolina Wu (University of Waterloo), Kostas Tsioutsiouliklis (Yahoo Research)

Abstract

Word2Vec recently popularized dense vector word representations as fixed-length features for machine learning algorithms and is in widespread use today. In this paper we investigate one of its core components, Negative Sampling, and propose efficient distributed algorithms that allow us to scale to vocabulary sizes of more than 1 billion unique words and corpus sizes of more than 1 trillion words.

Introduction

Recently, Mikolov et al. (Mikolov et al. 2013; Mikolov and Dean 2013) introduced Word2Vec, a suite of algorithms for unsupervised training of dense vector representations of words on large corpora. The resulting embeddings have been shown (Mikolov and Dean 2013) to capture semantic word similarity as well as more complex semantic relationships through linear vector manipulation, such as vec("Madrid") - vec("Spain") + vec("France") ≈ vec("Paris"). Word2Vec achieved a significant performance increase over the state-of-the-art while maintaining or even increasing the quality of the results in Natural Language Processing (NLP) applications.

More recently, novel applications of word2vec involving composite "words" and training corpora have been proposed. The ideas underlying the word2vec models have been adopted by researchers from other fields for tasks beyond NLP, including relational entities (Bordes et al. 2013; Socher et al. 2013), general text-based attributes (Kiros, Zemel, and Salakhutdinov 2014), descriptive text of images (Kiros, Salakhutdinov, and Zemel 2014), nodes in the graph structure of networks (Perozzi, Al-Rfou, and Skiena 2014), queries (Grbovic et al. 2015) and online ads (Ordentlich et al. 2016). Although most NLP applications of word2vec do not require training of large vocabularies, many of the above-mentioned real-world applications do. For example, the number of unique nodes in a social network (Perozzi, Al-Rfou, and Skiena 2014), or the number of unique queries in a search engine (Grbovic et al. 2015), can easily reach a few hundred million, a scale that is not achievable with existing word2vec systems.

Training for such large vocabularies presents several challenges. Word2Vec needs to maintain two d-dimensional vectors of single-precision floating point numbers for each vocabulary word. All vectors need to be kept in main memory to achieve acceptable training latency, which is impractical for contemporary commodity servers. As a concrete example, Ordentlich et al. (Ordentlich et al. 2016) describe an application in which they use word embeddings to match search queries to online ads. Their dictionary comprises ≈ 200 million composite words, which implies a main memory requirement of ≈ 800 GB for d = 500. A complementary challenge is that corpora sizes themselves increase. The largest reported dataset that has been used was 100 billion words (Mikolov and Dean 2013), which required training time on the order of days. Such training speeds are impractical for web-scale corpus sizes or for applications that require frequent retraining in order to avoid staleness of the models. Even for smaller corpora, reduced training times shorten iteration cycles and increase research agility.

Word2Vec is an umbrella term for a collection of word embedding models. The focus of this work is on the Skip-Grams with Negative Sampling (SGNS) model, which has been shown (Mikolov and Dean 2013) experimentally to perform better, especially for larger corpora. The central component of SGNS is Negative Sampling (NS). Our main contributions are: (1) a novel large-scale distributed SGNS training system developed on top of a custom graph processing framework; (2) a collection of approximation algorithms for NS that maintain the quality of single-node implementations while offering significant speedups; and (3) a novel hierarchical distributed algorithm for sampling from a discrete distribution. We obtain results on a corpus created from the top 2 billion web pages of Yahoo Search, which includes 1.066 trillion words and a dictionary of 1.42 billion words. Training time per epoch is 2 hours for typical hyperparameters. To the best of our knowledge, this is the first work that has been shown to scale to corpora of more than 1 trillion words and out-of-memory dictionaries.

Related Work

Dense word embeddings have a rich history (Hinton, McClelland, and Rumelhart 1986; Elman 1990; Hinton, Rumelhart, and Williams 1985), with the first popular Neural Network Language Model (NNLM) introduced in (Bengio et al. 2003).

Mikolov introduced the concept of using word vectors constructed with a simple hidden layer to train the full NNLM (Mikolov 2008; Mikolov et al. 2009) and was the first work to practically reduce the high dimensionality of bag-of-words representations to dense, fixed-length vector representations. This model later inspired Mikolov's Word2Vec system (Mikolov and Dean 2013), which has since been implemented and improved by many researchers.

Single Node Implementations

The reference implementation of word2vec, later implemented in (Gensim 2016), (TensorFlow 2016) and (Medallia 2016), uses multithreading to increase training speed. Gensim improves its word2vec implementation three-fold by integrating BLAS primitives. Since SGNS word2vec maintains two matrices of size Words x Dimensions, limitations on main memory and single-machine performance make it impossible for single-node implementations to scale up with dictionary and corpus size.

Distributed Implementations

Examples of distributed training systems that have been presented are (Deeplearning4j 2016), (MLLib 2016), Ji et al.'s (Ji et al. 2016) and Ordentlich et al.'s (Ordentlich et al. 2016). Deeplearning4j adopts GPUs to accelerate learning times. Spark MLLib's word2vec implementation uses an approach that is similar to Ji et al.'s (Ji et al. 2016). Both implementations replicate the word embeddings on multiple executors, which are then periodically synchronized among themselves. The input corpus is partitioned onto the executors, each of which independently updates its local copy of the embeddings. This technique reduces training time, yet the memory required to store all vectors in all Spark executors makes this approach unsuitable for large dictionaries, although it addresses well the problem of large corpora.

Ordentlich et al. provide an alternative approach in (Ordentlich et al. 2016). Their implementation scales to large vocabularies using a Parameter Server (PS) framework to store the latest values of the model parameters. The word embedding matrices are distributed over a predefined number k of worker nodes by assigning each node 1/k-th of the columns (dimensions). The worker nodes then perform partial inner-product sum operations on their corresponding segment, upon request by a separate computation node. This parameter server approach enabled the models to be trained over large vocabularies. Nevertheless, it is still hard to scale up the system since the number of executors is limited by the dimensionality of the word embeddings. In fact it only scales up to 10-20 computing nodes for typical model parameters, and the average throughput they obtained is 1.6·10^6 words per second because of this limitation.

Continuous Skip-gram Model

For each skip-gram $(w_I, w_O)$ and a set of negative samples $W_{neg}$, Mikolov et al. (Mikolov and Dean 2013) define the Skip-gram Negative Sampling loss function as:

$$E = -\log \sigma(v'^{T}_{w_O} v_{w_I}) - \sum_{w_j \in W_{neg}} \log \sigma(-v'^{T}_{w_j} v_{w_I})$$

where $\sigma(x) = \frac{1}{1+e^{-x}}$, and $v_w$ and $v'_w$ are the "input" and "output" vector representations for word $w$. Its gradients with respect to the word vectors are:

$$\frac{\partial E}{\partial v'_{w_O}} = (\sigma(v'^{T}_{w_O} v_{w_I}) - 1)\, v_{w_I}$$

$$\frac{\partial E}{\partial v'_{w_j}} = \sigma(v'^{T}_{w_j} v_{w_I})\, v_{w_I}$$

$$\frac{\partial E}{\partial v_{w_I}} = (\sigma(v'^{T}_{w_O} v_{w_I}) - 1)\, v'_{w_O} + \sum_{w_j \in W_{neg}} \sigma(v'^{T}_{w_j} v_{w_I})\, v'_{w_j}$$

The loss function is then minimized using a variant of stochastic gradient descent, yielding a simple learning algorithm that iterates over all the skip-grams, calculates the gradients of their loss functions and updates the values of their respective vectors, as well as the vectors of the negative samples chosen for each skip-gram. A summary of the model is depicted in Algorithm 1.

Algorithm 1 SGNS Word2Vec
1: function UPDATE(η)
2:   for all skip-grams (w_I, w_O) do
3:     W_neg ← {}
4:     for i ∈ [1, N] do
5:       W_neg ← W_neg ∪ sample(P_n)
6:     g_wO ← (σ(v'_wO^T v_wI) − 1) v_wI        ▷ get gradients
7:     g_wI ← (σ(v'_wO^T v_wI) − 1) v'_wO
8:     for all w_j ∈ W_neg do
9:       g_wj ← σ(v'_wj^T v_wI) v_wI
10:      g_wI ← g_wI + σ(v'_wj^T v_wI) v'_wj
11:    v'_wO ← v'_wO − η · g_wO                  ▷ update vectors
12:    v_wI ← v_wI − η · g_wI
13:    for all w_j ∈ W_neg do
14:      v'_wj ← v'_wj − η · g_wj
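As an illustration of the per-skip-gram step in Algorithm 1 and of the gradients above, the following is a minimal single-node NumPy sketch. It is not the authors' implementation; the learning rate default and the pre-drawn negative samples are illustrative assumptions.

```python
import numpy as np

def sgns_update(v_in, v_out, w_i, w_o, neg_samples, eta=0.025):
    """One SGNS step for skip-gram (w_i, w_o); v_in/v_out are |W| x d matrices."""
    def sigma(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Gradient w.r.t. the output word, and the positive term of the input gradient.
    s_o = sigma(v_out[w_o] @ v_in[w_i])
    g_wo = (s_o - 1.0) * v_in[w_i]
    g_wi = (s_o - 1.0) * v_out[w_o]

    # Gradients for the negative samples drawn from the noise distribution P_n.
    g_neg = []
    for w_j in neg_samples:
        s_j = sigma(v_out[w_j] @ v_in[w_i])
        g_neg.append((w_j, s_j * v_in[w_i]))
        g_wi = g_wi + s_j * v_out[w_j]

    # Stochastic gradient descent updates (Algorithm 1, lines 11-14).
    v_out[w_o] -= eta * g_wo
    v_in[w_i] -= eta * g_wi
    for w_j, g in g_neg:
        v_out[w_j] -= eta * g
```

The distributed algorithms presented later split exactly these gradient terms across partitions.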
Distributed Discrete Sampling

Central to SGNS is sampling from a noise distribution $P_n(w), w \in W$, over the vocabulary $W$. Mikolov et al. (Mikolov and Dean 2013) experimentally showed that the unigram distribution $U(w)$ raised to the 3/4-th power significantly outperformed both the $U(w)$ and uniform distributions. Negative Sampling lies at the core of the performance improvements obtained by Word2Vec, as it allows for updating the loss function in $O(kd)$ instead of $O(|W|d)$, where $k$ is the number of noise samples per skip-gram and $d$ the dimensionality of the embedding. Nevertheless, it remains the dominant time component of the algorithm.

It is desirable to be able to draw from the noise distribution in constant time. Word2Vec (Mikolov and Dean 2013) achieves this by using the following algorithm. It creates an array $A$ (of size $|A| \gg |W|$) in which it places word identifiers proportionally to their probability in $P_n(w)$ and draws uniformly from $A$.

As a concrete example, assuming a vocabulary $W = \{A, B, C, D\}$ where $P_n(A) = 0.5$, $P_n(B) = 0.25$, $P_n(C) = 0.15$, $P_n(D) = 0.1$, array $A$ might be: [A,A,A,A,A,A,A,A,A,A, B,B,B,B,B, C,C,C, D,D]. This algorithm is approximate, as probabilities in $P_n(w)$ are quantized to $1/|A|$, and it also requires significant memory resources for larger vocabularies (for instance, $|A|/|W| > 100$ in the reference implementation). In the following section we present an algorithm for drawing from $P_n(w)$ in constant time that only requires $O(|W|)$ space and $O(|W|)$ pre-processing time.

Alias Method

Let $P_n(w), w \in [0, .., |W|-1]$, be a discrete distribution over $W$. We construct two arrays $S, A$, where $|S| = |A| = |W|$, initialized to $S[i] = P_n(i) \cdot |W|$ and $A[i] = i$. While there exists $j$ such that $S[j] < 1$, select an arbitrary $k$ with $S[k] > 1$ and set $A[j] = k$, $S[k] = S[k] - 1 + S[j]$. Entry $j$ is never examined thereafter. Such a selection for $k$ is always possible since $\sum_w S[w] = |W|$.

We will use arrays $S, A$ to draw from $P_n(w)$ in constant time as follows. Let $u \in [0, |W|)$ be a draw from the continuous uniform distribution $U\{0, |W|\}$, obtained in constant time. Then

$$q = \begin{cases} \lfloor u \rfloor & \text{if } S[\lfloor u \rfloor] > u - \lfloor u \rfloor \\ A[\lfloor u \rfloor] & \text{otherwise} \end{cases}$$

is a draw from $P_n(w)$, also obtained in constant time.

Let us examine the construction of arrays $S, A$ for $P_n(A) = 0.5$, $P_n(B) = 0.25$, $P_n(C) = 0.15$, $P_n(D) = 0.1$ from our previous example. Arrays $S, A$ are initialized to:

S = [2.0, 1.0, 0.6, 0.4]
A = [0, 1, 2, 3]

Let us first pick $j = 3$ and $k = 0$, since $S[3] < 1.0$ and $S[0] > 1.0$. The arrays are updated to:

S = [1.4, 1.0, 0.6, 0.4]
A = [0, 1, 2, 0]

Next, we select $j = 2$ and $k = 0$, since $S[2] < 1.0$ and $S[0] > 1.0$. Notice that $j = 3$ is no longer a valid selection as it has already been examined. The arrays are updated to:

S = [1.0, 1.0, 0.6, 0.4]
A = [0, 1, 0, 0]

To gain some insight into this algorithm, we observe that for each word $w$, either $S[w] = 1$ or $A[w] \ne w$, and that $P_n(w) = \big(S[w] + \sum_{w' \ne w,\, A[w'] = w} (1 - S[w'])\big) / |W|$. We also note that constructing the arrays $S, A$ requires $O(|W|)$ time: initially, all words are separated into three sets $T_H, T_M, T_L$, containing words with $P_n(w) > 1/|W|$, $P_n(w) = 1/|W|$ and $P_n(w) < 1/|W|$ respectively, in $O(|W|)$ time. The algorithm iterates over set $T_L$ and "pairs" each word $j \in T_L$ with a word $k \in T_H$. If $S[k]$ is reduced to less than 1.0, then $k$ is added to $T_L$. The algorithm terminates when $T_L = \emptyset$. The alias method was studied in (Walker 1974; 1977; Vose 1991; Marsaglia et al. 2004) and is presented in Algorithm 2.

Algorithm 2 Alias Method
1: function BUILDALIAS(W, P_n(w))
2:   ∀w ∈ W : S[w] = P_n(w) · |W|
3:   ∀w ∈ W : A[w] = w
4:   T_L = {w | P_n(w) < 1/|W|}
5:   T_H = {w | P_n(w) > 1/|W|}
6:   for j ∈ T_L do
7:     k ← POP(T_H)                 ▷ remove an element from T_H
8:     S[k] = S[k] − 1 + S[j]
9:     A[j] = k
10:    if S[k] < 1 then
11:      T_L = T_L ∪ {k}
12:    else if S[k] > 1 then
13:      T_H = T_H ∪ {k}
14:    T_L = T_L \ {j}
     return S, A
15: function SAMPLEALIAS(S, A)
16:   u = Sample(U{0, |W|})         ▷ real sample ∈ [0, |W|)
17:   if S[⌊u⌋] ≤ u − ⌊u⌋ then
18:     return A[⌊u⌋]
19:   else
20:     return ⌊u⌋
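A compact Python rendering of Algorithm 2 may help; it is a sketch of the alias construction described above (array names S, A follow the text), not the authors' implementation.

```python
import math
import random

def build_alias(p):
    """Build alias tables S, A for a discrete distribution p (list of probabilities)."""
    n = len(p)
    S = [pi * n for pi in p]                 # scaled probabilities
    A = list(range(n))                       # alias targets
    TL = [i for i in range(n) if S[i] < 1.0]
    TH = [i for i in range(n) if S[i] > 1.0]
    while TL and TH:
        j = TL.pop()
        k = TH.pop()
        A[j] = k                             # j's leftover mass is covered by k
        S[k] = S[k] - 1.0 + S[j]
        if S[k] < 1.0:
            TL.append(k)
        elif S[k] > 1.0:
            TH.append(k)
    return S, A

def sample_alias(S, A):
    """Draw one word id in O(1): pick a slot uniformly, keep it or take its alias."""
    u = random.uniform(0, len(S))
    i = int(math.floor(u))
    return i if S[i] > u - i else A[i]

# Worked example from the text: P_n = (0.5, 0.25, 0.15, 0.1)
S, A = build_alias([0.5, 0.25, 0.15, 0.1])   # yields S = [1.0, 1.0, 0.6, 0.4], A = [0, 1, 0, 0]
```

Running the construction on the example distribution reproduces the final S, A arrays shown above.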

Hierarchical Sampling

In a distributed context, where the vocabulary $W$ is (ideally) equi-partitioned onto $p$ computation nodes, even assigning $O(|W|)$ space for sampling on each node may be wasteful. We present a hierarchical sampling algorithm that still allows for drawing from $P_n(w)$ in constant time, but only requires $O(p + |W|/p)$ space per partition and $O(p + |W|/p)$ preprocessing time. While the alias method is an improvement over the sampling algorithm in (Mikolov and Dean 2013), it still requires $O(|W|)$ memory, which may become prohibitively large as the dictionary size increases. For instance, for a dictionary size of 1 billion words, an implementation of the alias method may require 16 GB of memory on every computation node. In this section we propose a two-stage sampling algorithm that still allows drawing from $P_n(w)$ in constant time. However, each partition only maintains information about the subset of the vocabulary it is responsible for (in addition to some low-overhead auxiliary information).

Let $W$ be the set of all words and $W_i$, $1 \le i \le p$, be its $p$ partitions such that $\cup_i W_i = W$ and $W_i \cap W_j = \emptyset$ for all $1 \le i \ne j \le p$. Let $c(u)$ be some positive function of the word $u$ and $T$ be the probability distribution of sampling a word $u$ from the whole set $W$:

$$T(u) = \frac{c(u)}{\sum_{w \in W} c(w)} \quad (1)$$

For each partition $i$, let its word probability distribution be:

$$T_i(u) = \begin{cases} \dfrac{c(u)}{\sum_{w \in W_i} c(w)} & \text{if } u \in W_i \\ 0 & \text{otherwise} \end{cases} \quad (2)$$
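As a small illustration of Eqs. 1-2, the sketch below derives $T$ and the per-partition distributions $T_i$ from raw counts with $c(u) = \text{count}(u)^{3/4}$, which is proportional to the unigram-to-the-3/4 noise distribution used by word2vec; the toy counts and partitioning are assumptions for the example.

```python
# Toy example (assumed data): 4 words split onto p = 2 partitions.
counts = {"the": 100, "cat": 20, "sat": 10, "mat": 5}
partitions = [["the", "sat"], ["cat", "mat"]]

c = {w: n ** 0.75 for w, n in counts.items()}   # c(u) = count(u)^(3/4)
Z = sum(c.values())

T = {w: c[w] / Z for w in c}                                  # Eq. 1: global distribution
T_i = [{w: c[w] / sum(c[x] for x in part) for w in part}      # Eq. 2: per-partition
       for part in partitions]

# Each per-partition distribution is itself a proper probability distribution.
assert all(abs(sum(t.values()) - 1.0) < 1e-9 for t in T_i)
```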

Consider a sampling process where first a partition $k$ is sampled from a probability distribution $P$, and then a word is sampled from the probability distribution $T_k$. Our goal is to choose $P$ such that the total distribution of sampling a word using this process is equal to $T$. Assume the word $u$ is in partition $k$. Then the probability of sampling it is:

$$\sum_{i=1}^{p} P(i) T_i(u) = P(k) T_k(u) = P(k) \frac{c(u)}{\sum_{w \in W_k} c(w)}$$

We would like this to be equal to $T(u)$. Then:

$$P(k) \frac{c(u)}{\sum_{w \in W_k} c(w)} = \frac{c(u)}{\sum_{w \in W} c(w)}$$

Therefore:

$$P(k) = \frac{\sum_{w \in W_k} c(w)}{\sum_{w \in W} c(w)} \quad (3)$$

We perform hierarchical sampling as follows. First, we obtain an auxiliary probability distribution over the set of all partitions, as defined in Eq. 3. This distribution is made available to all computation nodes and is low-overhead as $p \ll |W|$. We construct arrays $S_P, A_P$ for it as per the alias method. For each partition $i$ we additionally construct arrays $S_i, A_i$ using the word distributions in Eq. 2. We can now sample from distribution $T(u)$ (Eq. 1) as follows: we first draw a partition $j$ from $P(k)$ and subsequently draw a word $w$ from $T_j(u)$. We observe that preprocessing is now $O(|W_i| + p)$ for partition $i$, while drawing from $T(u)$ is still a constant-time operation. The algorithm is summarized in Algorithm 3.

Algorithm 3 Hierarchical Sampling
1: function BUILDTABLES(c(u), W_i : i ∈ [1..p])
2:   P(k) = Σ_{w∈W_k} c(w) / Σ_{w∈W} c(w)
3:   [S_P, A_P] = BUILDALIAS([1..p], P(k))
4:   for j ∈ [1..p] do
5:     T_j(u) = c(u) / Σ_{w∈W_j} c(w) if u ∈ W_j, 0 otherwise
6:     [S_j, A_j] = BUILDALIAS(W_j, T_j(u))
7: function SAMPLE([S_P, A_P], [S_i, A_i] : i ∈ [1..p])
8:   k = SAMPLEALIAS(S_P, A_P)
9:   return SAMPLEALIAS(S_k, A_k)
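To check the decomposition behind Algorithm 3, here is a small simulation sketch: it draws a partition from $P(k)$ of Eq. 3 and then a word from $T_k$, and compares empirical frequencies to $T$. It uses numpy.random.choice in place of the alias tables for brevity, and the toy weights and partitioning are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setup: 6 words with weights c(u), split onto p = 3 partitions.
c = np.array([8.0, 4.0, 2.0, 1.0, 1.0, 0.5])
parts = [np.array([0, 3]), np.array([1, 4]), np.array([2, 5])]

T = c / c.sum()                                        # Eq. 1: target distribution
P = np.array([c[p].sum() for p in parts]) / c.sum()    # Eq. 3: partition distribution
T_k = [c[p] / c[p].sum() for p in parts]               # Eq. 2: per-partition distributions

def sample():
    k = rng.choice(len(parts), p=P)                    # stage 1: draw a partition
    return rng.choice(parts[k], p=T_k[k])              # stage 2: draw a word within it

draws = np.array([sample() for _ in range(100_000)])
empirical = np.bincount(draws, minlength=len(c)) / len(draws)
print(np.round(empirical, 3), np.round(T, 3))          # empirical vs. target distribution
```

In the actual system, both stages would use the constant-time alias tables, so the per-draw cost does not grow with either $p$ or $|W|$.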
Distributed SGNS

Preliminaries

We closely follow the reference SGNS implementation with respect to the conversion from an input corpus to the set of skip-grams. In particular, we reject input word $w_i$ with probability:

$$P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}$$

where $t$ is a frequency cutoff, typically set to $10^{-5}$, and $f(w)$ are the unigram frequencies.
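A minimal sketch of this subsampling rule follows; the token list and helper names are illustrative, not part of the paper's system.

```python
import random
from math import sqrt
from collections import Counter

def subsample(tokens, t=1e-5, seed=0):
    """Drop each occurrence of word w with probability 1 - sqrt(t / f(w))."""
    rng = random.Random(seed)
    total = len(tokens)
    freq = {w: n / total for w, n in Counter(tokens).items()}
    kept = []
    for w in tokens:
        p_reject = max(0.0, 1.0 - sqrt(t / freq[w]))
        if rng.random() >= p_reject:
            kept.append(w)
    return kept
```

Words with frequency below the cutoff $t$ have a non-positive rejection probability and are always kept, which is why only the identifiers of words with frequency greater than $t$ need to be replicated on every partition, as described next.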

Each word $w$ and its associated information is placed on a specific partition $M(h(w))$, determined by $h(w)$, which is the 128-bit md5sum of $w$. During each epoch, the input is subsampled and translated into a set of skip-grams $(w_i, w_o)$. All skip-grams related to the input word $w_i$ are then sent to partition $M(h(w_i))$. To support efficient subsampling, we maintain all word identifiers with frequency $> t$ on all partitions. This set of words is typically a very small subset of the vocabulary.

Considering dictionary words as nodes in a graph $G$ and skip-grams as directed edges between them, SGNS can be seen as a distributed computation over $G$. In the following, we provide a brief overview of our custom distributed graph processing framework on which we developed the algorithms presented in this work. We then present our three distributed algorithms for negative sampling, which differ only in the approach with which they obtain the negative samples and update the related word vectors. The learning phase of the epoch varies according to the negative sampling algorithm used and dominates the execution time.

Graph Processing Framework

Chronos is a proprietary in-memory / secondary-storage hybrid graph processing system, developed in C++11 on top of Hadoop MapReduce. Word-related state is maintained in memory, while the graph itself, specifically a serialization of the skip-gram information, can either be stored in memory or on disk. Messages originating from a source process are transparently buffered before being sent to target processes, with buffering occurring separately for each target partition. Each message contains the related word data as its payload. The framework can optionally [de]compress buffers on-the-fly, transparently to the application. Communication is performed via a large clique of unidirectional point-to-point message queues. The system builds on the following low-level communication primitives: clique communication for graph-level computations, and one-to-all and all-to-one communication for process-level computations. It implements loading and storing of vertex information, through which it supports fault tolerance via check-pointing. Each superstep is executed as a sequence of three primitives: a one-to-all communication primitive that sends initialization parameters to all peers, a clique communication primitive that transmits edge messages, and an all-to-one communication primitive that gathers partition-level results from the peers to the master process.

Baseline Negative Sampling

In Baseline Negative Sampling (BNS), each partition iterates over the skip-grams it maintains. For each skip-gram $(w_I, w_O)$, it sends the input vector $v_{w_I}$ to partition $p_0 = M(h(w_O))$, which is the partition that owns word $w_O$. Upon reception of $v_{w_I}$, partition $p_0$ updates output vector $v'_{w_O}$ and sends back gradient $g_{w_I}$ to the source partition for updating $v_{w_I}$. BNS then draws $N$ samples $p_1, .., p_N$ from the partitions distribution. Subsequently, for each $p_i$ it sends $v_{w_I}$ to partition $p_i$. Upon reception, partition $p_i$ draws a negative word $w_{neg}$ from its own word distribution, updates $v'_{w_{neg}}$ and sends back gradient $g_{w_I}$ to the source partition for updating $v_{w_I}$. The communication across partitions is depicted in Figure 1, while the algorithm is summarized in Algorithm 4. We observe that, while BNS closely adheres to the reference implementation, this comes at the cost of sending $2(N+1)$ vectors over the network per skip-gram, which may be prohibitively expensive for large corpora.

Algorithm 4 Baseline Negative Sampling
1: function PROCESSSKIPGRAM(w_I, w_O, η)
2:   g_wI ← GETREMOTE(M(h(w_O)), v_wI, w_O)
3:   for all i ∈ [1, N] do
4:     p ← SAMPLEALIAS(S_P, A_P)
5:     g_wI ← g_wI + GETNEGSAMPLE(p, v_wI)
6:   v_wI ← v_wI − η · g_wI
7: function GETREMOTE(v_wI, w_O)
8:   g_wO ← (σ(v'_wO^T v_wI) − 1) v_wI
9:   g_wI ← (σ(v'_wO^T v_wI) − 1) v'_wO
10:  v'_wO ← v'_wO − η · g_wO
11:  return g_wI
12: function GETNEGSAMPLE(v_wI)
13:  w_neg ← SAMPLEALIAS(S_k, A_k)           ▷ k is the partition id
14:  g_wneg ← σ(v'_wneg^T v_wI) v_wI
15:  g_wI ← σ(v'_wneg^T v_wI) v'_wneg
16:  v'_wneg ← v'_wneg − η · g_wneg
17:  return g_wI

[Figure 1: Baseline Negative Sampling communication pattern during processing of skip-gram (w_I, w_O), for 3 negative samples w_1, w_2, w_3.]

Single Negative Sampling

A reasonable approximation to BNS is to restrict sampling to pools of words whose frequencies still approximate the original distribution while being faster to learn from. We present the first such approximation, Single Negative Sampling (SNS). SNS behaves similarly to BNS with respect to the output word $w_O$: it transfers input vector $v_{w_I}$ to partition $p_0 = M(h(w_O))$, which subsequently updates output vector $v'_{w_O}$ and sends back gradient $g_{w_I}$ to the source partition for updating $v_{w_I}$. However, it draws all negative samples for a given skip-gram from a single partition drawn from the partitions distribution. It is presented in Algorithm 5. We observe that SNS scales much better with $N$, since it requires communicating 4 vectors per skip-gram over the network, a significant reduction from $2(N+1)$.

Algorithm 5 Single Negative Sampling
1: function PROCESSSKIPGRAM(w_I, w_O, η)
2:   g_wI ← GETREMOTE(M(h(w_O)), v_wI, w_O)
3:   p ← SAMPLEALIAS(S_P, A_P)
4:   g_wI ← g_wI + GETNEGSAMPLES(p, v_wI)
5:   v_wI ← v_wI − η · g_wI
6: function GETREMOTE(v_wI, w_O)
7:   g_wO ← (σ(v'_wO^T v_wI) − 1) v_wI
8:   g_wI ← (σ(v'_wO^T v_wI) − 1) v'_wO
9:   v'_wO ← v'_wO − η · g_wO
10:  return g_wI
11: function GETNEGSAMPLES(k, v_wI)
12:  g_wI ← 0
13:  for j ∈ [1..N] do
14:    w_j ← SAMPLEALIAS(S_k, A_k)           ▷ k is the partition id
15:    g_wj ← σ(v'_wj^T v_wI) v_wI
16:    g_wI ← g_wI + σ(v'_wj^T v_wI) v'_wj
17:    v'_wj ← v'_wj − η · g_wj
18:  return g_wI

Target Negative Sampling

Target Negative Sampling (TNS) aims to further reduce the communication cost. We observe that, if negative samples are drawn from partition $p_0 = M(h(w_O))$ for skip-gram $(w_I, w_O)$, the gradient update to $v_{w_I}$ that corresponds to the negative samples can be added to the gradient that $p_0$ is returning to the source partition. This allows for a communication pattern of 2 vectors per processed skip-gram. TNS is presented in Algorithm 6. Let us examine the set of skip-grams with the same input word $w_I$: $\{(w_I, w_{O_j})\}$. The set of partitions from which TNS will draw negative samples is $R = \cup_j \{M(h(w_{O_j}))\}$. Considering that the number of partitions is typically much smaller than the dictionary size, $R$ is likely a large subset of the set of all partitions $R_A$. However, it is not necessarily a true random subset of $R_A$, and additionally it remains fixed across epochs (modulo any variation induced by subsampling). Nevertheless, our experiments demonstrate that TNS performs well in practice.

Algorithm 6 Target Negative Sampling
1: function PROCESSSKIPGRAM(w_I, w_O, η)
2:   g_wI ← GETREMOTE(M(h(w_O)), v_wI, w_O)
3:   v_wI ← v_wI − η · g_wI
4: function GETREMOTE(k, v_wI, w_O)
5:   g_wO ← (σ(v'_wO^T v_wI) − 1) v_wI
6:   g_wI ← (σ(v'_wO^T v_wI) − 1) v'_wO
7:   for j ∈ [1..N] do
8:     w_j ← SAMPLEALIAS(S_k, A_k)
9:     g_wj ← σ(v'_wj^T v_wI) v_wI
10:    g_wI ← g_wI + σ(v'_wj^T v_wI) v'_wj
11:  v'_wO ← v'_wO − η · g_wO
12:  for all w_j ∈ W_neg do
13:    v'_wj ← v'_wj − η · g_wj
14:  return g_wI
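To put the three schemes side by side, here is a back-of-the-envelope sketch of per-skip-gram network traffic, using the vector counts stated above (2(N+1) for BNS, 4 for SNS, 2 for TNS), the experimental settings d = 300 and neg = 5, and single-precision floats; the helper name is ours.

```python
def bytes_per_skipgram(scheme, N=5, d=300, bytes_per_float=4):
    """Approximate d-dimensional vector traffic per processed skip-gram."""
    vectors = {"BNS": 2 * (N + 1), "SNS": 4, "TNS": 2}[scheme]
    return vectors * d * bytes_per_float

for s in ("BNS", "SNS", "TNS"):
    print(s, bytes_per_skipgram(s), "bytes")   # BNS: 14400, SNS: 4800, TNS: 2400
```

This 6x and 12x reduction in transferred vectors is what drives the speedups reported in the experiments below.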

Experiments

We collect experimental results on a cluster of commodity nodes whose configuration is depicted in Table 1.

Table 1: Cluster Node Configuration
CPU: 2x Intel Xeon E5-2620
Frequency: 2.5 GHz (max)
RAM: 64 GB
Memory Bandwidth: 42.6 GB/s (max)
Network: 10 Gbps Ethernet

We establish the quality of the embeddings obtained by our algorithms on a composite corpus comprising two news corpora [1][2], the 1 Billion Word Language Model Benchmark [3] (Chelba et al. 2013), the UMBC WebBase corpus [4] (Han et al. 2013) and Wikipedia [5], as constructed by the open-source reference word2vec implementation evaluation scripts.

We report quality results on the Google Analogy evaluation dataset (Mikolov et al. 2013) for the BNS, SNS and TNS algorithms, as well as the reference word2vec implementation, in Table 2. We observe that all three algorithms perform similarly to each other, slightly trailing the reference implementation. In Table 3, we report single-epoch learning times. Specifically, we report a 20x speedup over the reference implementation. We then present scalability results on two large datasets. The first dataset is obtained from (Ordentlich et al. 2016) and is related to mapping web queries to online ads. Results are shown in Table 4. The second dataset is obtained from 2 billion web pages and comprises more than 1 trillion words and more than 1.4 billion unique dictionary words. Results are shown in Table 5.

Table 2: Accuracy results [%] on the Google Analogy dataset for single-epoch trained embeddings on the composite corpus, for d = 300, win = 5, neg = 5, t = 10^-5, min-count = 10 and threads = 20. 350 partitions are used for BNS, SNS and TNS.
Model              Semantic  Syntactic  Total
SGNS (Reference)   74.27     69.35      71.58
SG-BNS             73.10     66.16      69.31
SG-SNS             73.48     68.08      70.53
SG-TNS             74.72     66.13      70.03

Table 3: Single-epoch learning times on the composite corpus for d = 300, win = 5, neg = 5, t = 10^-5, min-count = 10 and threads = 20. 350 partitions are used for BNS, SNS and TNS.
Model              Learning Time (sec)  Speedup
SGNS (Reference)   11507                1x
SG-BNS             4378                 2.6x
SG-SNS             1331                 8.7x
SG-TNS             570                  20.2x

Table 4: Query-Ads dataset and single-epoch learning runtime for d = 300, win = 5, neg = 5 and t = 10^-5.
Dictionary Words: 200 Million
Dataset Words: 54.8 Billion
Dataset Size: 411.5 GB
Partitions: 1500
Learning Time: 856 sec

Table 5: Web dataset and single-epoch learning runtime for d = 300, win = 5, neg = 5 and t = 10^-5.
Dictionary Words: 1.42 Billion
Dataset Words: 1.066 Trillion
Dataset Size: 6021.6 GB
Partitions: 3000
Learning Time: 7344 sec

Conclusion

In this paper we proposed an efficient distributed algorithm for sampling from a discrete distribution and used it to optimize Negative Sampling for SGNS Word2Vec, allowing us to scale to vocabulary sizes of more than 1 billion words and corpus sizes of more than 1 trillion words. Specifically, our system learns from a web corpus of 1.066 trillion words over 1.42 billion vocabulary words in 2 hours.

[1] http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2012.en.shuffled.gz
[2] http://www.statmt.org/wmt14/training-monolingual-news-crawl/news.2013.en.shuffled.gz
[3] http://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz
[4] http://ebiquity.umbc.edu/redirect/to/resource/id/351/UMBC-webbase-corpus
[5] http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

References

Bengio, Y.; Ducharme, R.; Vincent, P.; and Jauvin, C. 2003. A neural probabilistic language model. Journal of Machine Learning Research 3:1137-1155.

Bordes, A.; Usunier, N.; Garcia-Duran, A.; Weston, J.; and Yakhnenko, O. 2013. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, 2787-2795.

Chelba, C.; Mikolov, T.; Schuster, M.; Ge, Q.; Brants, T.; and Koehn, P. 2013. One billion word benchmark for measuring progress in statistical language modeling. CoRR abs/1312.3005.

Deeplearning4j. 2016. http://deeplearning4j.org/word2vec.html.

Elman, J. L. 1990. Finding structure in time. Cognitive Science 14(2):179-211.

Gensim. 2016. http://rare-technologies.com/word2vec-in-python-part-two-optimizing/.

Grbovic, M.; Djuric, N.; Radosavljevic, V.; Silvestri, F.; and Bhamidipati, N. 2015. Context- and content-aware embeddings for query rewriting in sponsored search. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, 383-392. ACM.

Han, L.; Kashyap, A. L.; Finin, T.; Mayfield, J.; and Weese, J. 2013. UMBC EBIQUITY-CORE: Semantic textual similarity systems. In Proceedings of the Second Joint Conference on Lexical and Computational Semantics. Association for Computational Linguistics.

Hinton, G. E.; McClelland, J. L.; and Rumelhart, D. E. 1986. Distributed representations. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1: Foundations. MIT Press, Cambridge, MA.

Hinton, G.; Rumelhart, D.; and Williams, R. 1985. Learning internal representations by back-propagating errors. Parallel Distributed Processing: Explorations in the Microstructure of Cognition 1.

Ji, S.; Satish, N.; Li, S.; and Dubey, P. 2016. Parallelizing word2vec in shared and distributed memory. arXiv preprint arXiv:1604.04661.

Kiros, R.; Salakhutdinov, R.; and Zemel, R. S. 2014. Multimodal neural language models. In ICML, volume 14, 595-603.

Kiros, R.; Zemel, R.; and Salakhutdinov, R. R. 2014. A multiplicative model for learning distributed text-based attribute representations. In Advances in Neural Information Processing Systems, 2348-2356.

Marsaglia, G.; Tsang, W. W.; Wang, J.; et al. 2004. Fast generation of discrete random variables. Journal of Statistical Software 11(3):1-11.

Medallia. 2016. https://github.com/medallia/Word2VecJava.

Mikolov, T., and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems.

Mikolov, T.; Kopecky, J.; Burget, L.; Glembek, O.; et al. 2009. Neural network based language models for highly inflective languages. In 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, 4725-4728. IEEE.

Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Mikolov, T. 2008. Language models for automatic speech recognition of Czech lectures. Proc. of Student EEICT.

MLLib, S. 2016. https://spark.apache.org/docs/latest/mllib-feature-extraction.html.

Ordentlich, E.; Yang, L.; Feng, A.; Cnudde, P.; Grbovic, M.; Djuric, N.; Radosavljevic, V.; and Owens, G. 2016. Network-efficient distributed word2vec training system for large vocabularies. arXiv preprint arXiv:1606.08495.

Perozzi, B.; Al-Rfou, R.; and Skiena, S. 2014. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 701-710. ACM.

Socher, R.; Chen, D.; Manning, C. D.; and Ng, A. 2013. Reasoning with neural tensor networks for knowledge base completion. In Advances in Neural Information Processing Systems, 926-934.

TensorFlow. 2016. https://www.tensorflow.org/.

Vose, M. D. 1991. A linear algorithm for generating random numbers with a given distribution. IEEE Transactions on Software Engineering 17(9):972-975.

Walker, A. J. 1974. New fast method for generating discrete random numbers with arbitrary frequency distributions. Electronics Letters 8(10):127-128.

Walker, A. J. 1977. An efficient method for generating discrete random variables with general distributions. ACM Transactions on Mathematical Software (TOMS) 3(3):253-256.
