arXiv:1606.07822v1 [cs.CL] 24 Jun 2016

Efficient Parallel Learning of Word2Vec

Jeroen B.P. Vuurens (J.B.P.VUURENS@TUDELFT.NL)
The Hague University of Applied Science and Delft University of Technology, The Netherlands

Carsten Eickhoff (CARSTEN.EICKHOFF@INF.ETHZ.CH)
Department of Computer Science, ETH Zurich, Zurich, Switzerland

Arjen P. de Vries (ARJEN@ACM.ORG)
Radboud University Nijmegen, Institute for Computing and Information Sciences, Nijmegen, The Netherlands

Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).

Abstract

Since its introduction, Word2Vec and its variants are widely used to learn semantics-preserving representations of words or entities in an embedding space, which can be used to produce state-of-the-art results for various Natural Language Processing tasks. Existing implementations aim to learn efficiently by running multiple threads in parallel while operating on a single model in shared memory, ignoring incidental memory update collisions. We show that these collisions can degrade the efficiency of parallel learning, and propose a straightforward caching strategy that improves the efficiency by a factor of 4.

1. Introduction

Traditional NLP approaches predominantly use simple bag-of-words representations of documents and sentences, but recent approaches that use a distributed representation of words, constructed as a so-called "word embedding", have been shown to be effective across many different NLP tasks (Mikolov et al., 2013; Weston et al., 2014). Mikolov et al. introduced efficient strategies to learn word embeddings from text corpora, which are collectively known as Word2Vec (Mikolov et al., 2013). They have shown that simply increasing the volume of training data improves the semantic and syntactic generalizations that are automatically encoded in the embedding space; efficiency is therefore the key to increasing the potential of this embedding technique.

In recent work, Ji et al. argue that current Word2Vec implementations do not scale well over the number of cores used, and propose a solution that uses higher-level linear algebra functions to improve the efficiency of Word2Vec with negative sampling (Ji et al., 2016). In this work, we analyze the scalability problem and conclude that it is mainly caused by simultaneous access of the same memory by multiple threads. As a solution, we propose a straightforward strategy that caches frequently updated vectors, and show a large gain in efficiency as a result.
2. Related Work

Bengio et al. propose to learn an embedding for words in a distributed representation, based on the words that immediately precede a word in natural language (Bengio et al., 2006). They use a shallow neural network architecture to learn a vector-space representation for words that is comparable to those obtained with latent semantic indexing. Recently, Mikolov et al. proposed Word2Vec, a very efficient way to learn word embeddings over large text corpora (Mikolov et al., 2013), which produces state-of-the-art results on various tasks, such as word analogy and word similarity tasks, named entity recognition, machine translation and question answering (Weston et al., 2014).

Word2Vec uses Stochastic Gradient Descent to optimize the vector representations, by iteratively learning on words that jointly appear in text. For efficient learning of word embeddings, Mikolov et al. proposed two strategies that avoid having to update every word pair in the vocabulary. Negative sampling consists of updating observed word pairs together with a limited number of random word pairs to regularize the weights. Alternatively, a hierarchical softmax is used as a learning objective, for which the output layer is transformed into a binary Huffman tree; instead of predicting an observed word, the model learns the word's position in the Huffman tree by the left and right turns at each inner node along its path from the root.
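To make this objective concrete, the following minimal sketch (our illustration, not code from the paper) scores one input-target word pair under a hierarchical softmax; path_nodes and path_turns stand for the inner nodes on the target word's Huffman path and the turn taken at each of them, consistent with the gradient (1 - t - f) used later in Algorithm 1.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hs_probability(input_vec, path_nodes, path_turns, node_weights):
    """Probability the hierarchical softmax assigns to a target word, given the
    embedding of the input word and the target's path through the Huffman tree.

    input_vec    -- embedding of the input (context) word
    path_nodes   -- indices of the inner nodes on the target word's path
    path_turns   -- turn t taken at each inner node (0 = one branch, 1 = the other)
    node_weights -- matrix with one weight vector per inner node
    """
    p = 1.0
    for n, t in zip(path_nodes, path_turns):
        f = sigmoid(np.dot(input_vec, node_weights[n]))
        # each inner node contributes a binary decision: f for turn 0, (1 - f) for turn 1
        p *= f if t == 0 else 1.0 - f
    return p

The product runs over at most roughly log2 of the vocabulary size inner nodes instead of over the whole vocabulary, which is what makes the hierarchical softmax cheap; it also means that updates concentrate on the few nodes near the root of the tree, which becomes relevant for the memory collisions analyzed in Section 3.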
The original Word2Vec implementation uses a Hogwild strategy that allows processors lock-free access to shared memory, in which they can update at will (Recht et al., 2011). Recht et al. show that when data access is sparse, memory overwrites are rare and barely introduce error when they occur. However, Ji et al. notice that for training Word2Vec this strategy does not use computational resources efficiently (Ji et al., 2016). They propose to use level-3 BLAS routines on specific CPUs to operate on mini-batches, which are then written back to main memory, reducing the number of accesses to main memory. They address the Word2Vec variant that uses negative sampling. In our work we address the hierarchical softmax variant, and show that straightforward caching of frequently occurring data is sufficient to improve efficiency on conventional hardware.

3. Experiment

3.1. Analysis of inefficiencies for learning Word2Vec

For faster training of models, the input is commonly processed in parallel by multiple processors. One strategy is to map the input into separate batches and aggregate the results in a reduce step. Alternatively, multithreading can be used on a single model in shared memory, which is the approach taken by the original Word2Vec implementation in C as well as the widely used Gensim package for Python. We analyzed how the learning times scale over the number of cores used with Word2Vec, and noticed that the total run time appears to be optimal when using approximately 8 cores, beyond which efficiency degrades (see Figure 1).

Ji et al. analyze this problem and conclude that multiple threads can cause conflicts when they attempt to update the same memory (Ji et al., 2016). Current Word2Vec implementations simply ignore memory conflicts between threads; however, concurrent attempts to read and update the same memory increase memory access latency. The Hogwild strategy was introduced under the assumption that memory overwrites are rare; for natural language, however, memory collisions may be more likely than for other data, given the skew towards frequently occurring words. Collisions are even more likely when learning against a hierarchical softmax, since the top nodes in the Huffman tree are used by a large part of the vocabulary.

3.2. Caching frequently used vectors

For efficient learning of word embeddings with negative sampling, Ji et al. propose a mini-batch strategy that uses a level-3 BLAS function to multiply a matrix that consists of all words that share the same context word (e.g., with a window size of 5, up to 10 words would train against the same context term) with a matrix that contains the context word and a shared set of negative samples (Ji et al., 2016). Efficiency is improved up to 3.6x over the original Word2Vec implementation, by leveraging the improved ability for matrix-matrix multiplication in high-end CPUs, and by only updating the model after each mini-batch. Typically, the batches are limited to about 10-20 words for convergence reasons, reporting only a marginal loss of accuracy.

We suspect that for parallel training of a model in shared memory the real bottleneck is latency caused by concurrent access of the same memory. The solution presented by (Ji et al., 2016) uses level-3 BLAS functions to perform matrix-matrix multiplications, which implicitly lowers the number of times shared memory is accessed and updated. We alternatively propose a straightforward caching strategy that provides a comparable efficiency gain for the hierarchical softmax variant, and that is less hardware dependent since it does not need highly optimized level-3 BLAS functions. When using the hierarchical softmax, efficiency is more affected by memory conflicts than with negative sampling: the root node of the tree is in every word's path and therefore gives a conflict every time, the direct children of the root will be in the path of half the vocabulary, and so on. When each thread caches the most frequently used upper nodes of the tree and only updates the change to main memory after a number of trained words, the number of memory conflicts is reduced.

Algorithm 1 describes the caching in pseudocode: for the most frequently used inner nodes in the Huffman tree a local copy is created and used for the updates, while the non-frequently used inner nodes are updated directly in shared memory. After processing a number of words (typically in the range 10 to 100), the update for every cached weight vector is computed by subtracting the original value stored when it was cached, and this update is added to the weight vector in main memory. The vectors are processed using level-1 BLAS functions: scopy to copy a vector, sdot to compute the dot product between two vectors, and saxpy to add a scalar multiplied by the first vector to the second vector.

The proposed caching strategy is implemented in Python/Cython and published as part of the Cythnn open source project (http://cythnn.github.io).
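As an illustration of this delayed write-back, here is a minimal NumPy sketch of the flush step (our paraphrase, not code from the Cythnn project); weight, cachedweight, original, cached and frequent are hypothetical arrays mirroring the names used in Algorithm 1, and the in-place vector operations correspond to the scopy/saxpy calls described above.

import numpy as np

def flush_cache(weight, cachedweight, original, cached, frequent):
    """Write locally accumulated updates of frequently used nodes back to shared memory.

    weight       -- shared weight matrix (one row per inner tree node)
    cachedweight -- this thread's local copies of the frequent rows
    original     -- the row values at the moment they were cached
    cached       -- boolean flags marking which rows are currently cached
    frequent     -- boolean flags marking which rows are eligible for caching
    """
    for n in np.nonzero(frequent & cached)[0]:
        delta = cachedweight[n] - original[n]   # saxpy(-1, original[n], cachedweight[n])
        weight[n] += delta                      # saxpy(1, delta, weight[n]): add the update to shared memory
        cached[n] = False                       # the row is re-copied (scopy) on its next use

Because only the delta accumulated since caching is written back, updates that other threads made to the same row in the meantime are added to rather than overwritten, up to the usual unsynchronized-add races of the Hogwild setting.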

Algorithm 1 Cached Skipgram HS
 1: for each wordid v in text do
 2:   for each wordid w within window of v do
 3:     for each node n, turn t in treepath to v do
 4:       if isFrequent(n) then
 5:         if not cached[n] then
 6:           scopy( weight[n], cachedweight[n] )
 7:           scopy( cachedweight[n], original[n] )
 8:           cached[n] = True
 9:         end if
10:         usedweight = cachedweight[n]
11:       else
12:         usedweight = weight[n]
13:       end if
14:       f = sigmoid( sdot( vector[w], usedweight ) )
15:       gradient = ( 1 - t - f ) * alpha
16:       saxpy( gradient, usedweight, hiddenlayer )
17:       saxpy( gradient, vector[w], usedweight )
18:     end for
19:     saxpy( 1, hiddenlayer, vector[w] )
20:   end for
21:   if time to flush cache then
22:     for each node n do
23:       if isFrequent(n) and cached[n] then
24:         saxpy( -1, original[n], cachedweight[n] )
25:         saxpy( 1, cachedweight[n], weight[n] )
26:         cached[n] = False
27:       end if
28:     end for
29:   end if
30:   if time to update alpha then
31:     update alpha
32:   end if
33: end for

4. Results

4.1. Experiment setup

To compare the efficiency of the proposed caching strategy with the existing implementations, we used the skip-gram with hierarchical softmax, with the default settings mintf=5, windowsize=5, vectorsize=100, downsample=0.001, and without negative sampling. Although our implementation is only slightly improved by downsampling of frequent terms, the original C implementation becomes approximately 24% faster through downsampling and produces slightly more accurate word embeddings on the analogy test. Every run processed 10 iterations over the standard Text8 collection (http://mattmahoney.net/dc/text8.zip), which consists of the first 100MB of flat text words found in Wikipedia. All experiments were performed on the same virtual machine with two Intel(R) Xeon(R) CPU E5-2698 v3 processors, which together have 32 physical cores.
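For reference, an approximately equivalent configuration in Gensim is sketched below (a hedged illustration using the parameter names of Gensim releases from around the time of writing; in Gensim 4.x, size and iter are called vector_size and epochs):

from gensim.models.word2vec import Word2Vec, Text8Corpus

sentences = Text8Corpus('text8')          # the Text8 collection used in our runs
model = Word2Vec(sentences,
                 sg=1, hs=1, negative=0,  # skip-gram with hierarchical softmax, no negative sampling
                 min_count=5,             # mintf=5
                 window=5,                # windowsize=5
                 size=100,                # vectorsize=100
                 sample=0.001,            # downsample=0.001
                 iter=10,                 # 10 iterations over the corpus
                 workers=8)               # number of training threads; varied from 2 to 28 in our runs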

4.2. Efficiency

In Figure 1, we compare the changes in total run time when adding more cores between the original Word2Vec C implementation, Gensim, Cythnn without caching (c0) and Cythnn when caching the 31 most frequently used nodes in the Huffman tree (c31). The first observation is that Word2Vec and Cythnn without caching have degrading efficiency beyond the use of 8 cores, while Gensim is apparently more optimized and scales up to 16 cores. The run of Cythnn that caches the 31 top-vectors (157 sec) performs up to 3.9x faster than the fastest C run (630 sec) and up to 2.5x faster than the fastest Gensim run (385 sec).

[Plot: execution time (s) on the y-axis against the number of cores (2-28) on the x-axis, one line each for w2v, gensim, c0 and c31.]
Figure 1. Comparison of the total execution time when changing the number of cores (x-axis) between the original Word2Vec, Gensim, Cythnn without caching, and Cythnn when caching the top-31 nodes in the Huffman tree.

In Figure 2 we compare the execution time of Cythnn when changing the number of cached top-level nodes in the Huffman tree over different numbers of cores used. Interestingly, caching just the root node (#cached nodes=1) has the highest impact on efficiency, and reduces the execution time by more than 40% when using more than 20 cores. This supports the idea that the reduction of concurrent access of the same memory is key to improving efficiency. Since the root node in the Huffman tree is in the path of every word in the vocabulary, caching just this node improves efficiency most. On the Text8 dataset the improvement is near optimal when caching at least the top-15 nodes in the tree. In our experiments, using more than 24 cores is slightly counter-effective, possibly indicating that memory bandwidth is fully used.
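To see how skewed this update traffic is, the following self-contained sketch (our illustration, not an experiment from the paper) builds a Huffman tree over a Zipfian toy vocabulary and reports which fraction of all inner-node updates falls on the k busiest nodes, for the same cache sizes as on the x-axis of Figure 2.

import heapq
from collections import defaultdict

def huffman_parents(freqs):
    """Word2Vec-style Huffman construction: returns parent[node] for all nodes,
    where nodes 0..V-1 are words and nodes V..2V-2 are inner tree nodes."""
    V = len(freqs)
    heap = [(f, i) for i, f in enumerate(freqs)]
    heapq.heapify(heap)
    parent = {}
    next_id = V
    while len(heap) > 1:
        f1, a = heapq.heappop(heap)
        f2, b = heapq.heappop(heap)
        parent[a] = parent[b] = next_id
        heapq.heappush(heap, (f1 + f2, next_id))
        next_id += 1
    return parent, next_id - 1  # the last created node is the root

V = 10000
freqs = [1.0 / r for r in range(1, V + 1)]   # Zipfian toy vocabulary
parent, root = huffman_parents(freqs)

# traffic[n] = how often inner node n is updated, weighted by word frequency
traffic = defaultdict(float)
for w, f in enumerate(freqs):
    n = parent[w]
    while True:                  # walk the path from the word up to the root
        traffic[n] += f
        if n == root:
            break
        n = parent[n]

total = sum(traffic.values())
busiest = sorted(traffic.values(), reverse=True)
for k in (1, 3, 7, 15, 31, 63):
    print('top-%d inner nodes absorb %.1f%% of all updates' % (k, 100 * sum(busiest[:k]) / total))

With a frequency-skewed vocabulary, a small number of nodes near the root accounts for a disproportionate share of the updates, which is consistent with the observation that caching only the top nodes already removes most of the contention.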

[Plot: execution time (s) on the y-axis against the number of cached nodes (0, 1, 3, 7, 15, 31, 63) on the x-axis, one line per number of cores (4-28).]
Figure 2. Comparison of the execution time when changing the number of cores, with the number of most frequently used nodes from the Huffman tree that are cached on the x-axis.

Table 1. Comparison of the accuracy between systems.

System                  Accuracy
W2V C                   33.73
Gensim                  35.45
Cythnn, no cache        33.54
Cythnn, cache top-31    33.57
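The accuracies in Table 1 were obtained with the question-words test set, as described in Section 4.3 below. For reference, a hedged sketch of an equivalent evaluation in Gensim (pre-4.0 API; newer versions expose model.wv.evaluate_word_analogies instead), reusing the model from the configuration sketch in Section 4.1:

# 'questions-words.txt' is the analogy test set distributed with the original word2vec package
sections = model.accuracy('questions-words.txt')
correct = sum(len(s['correct']) for s in sections if s['section'] != 'total')
incorrect = sum(len(s['incorrect']) for s in sections if s['section'] != 'total')
print('analogy accuracy: %.2f%%' % (100.0 * correct / (correct + incorrect)))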

4.3. Effectiveness

Ji et al. compared the effectiveness of their approach and the original Word2Vec implementation, and report that processing in mini-batches has a marginal but noticeable negative effect on the accuracy of the embeddings learned. In Table 1, we show the average accuracy of each system over the runs with 2-28 cores, where the accuracy was computed using the evaluation tool and question-words test set that is packaged with Word2Vec (http://word2vec.googlecode.com). The results show a higher accuracy for Gensim, but we have found no documentation that helps to explain this difference. Using Cythnn, caching does not result in lower effectiveness as reported by (Ji et al., 2016); a difference is that our caching approach always uses updated vectors within a thread, whereas the BLAS-3 solution does not use updated vectors within a mini-batch.

When varying the cache update frequency u, we expect a trade-off between training efficiency and model accuracy. In practice, our experiments show that this balance is a slowly shifting one that is further influenced by the chosen window size n. Using the default window size of n = 5, updating the cache after each training word (u = 1) is as efficient as flushing after u = 10 words. For this paper, cached updates were written back to shared memory every 10 words.

5. Conclusion

In this study, we analyzed the scalability of parallel learning of a single model in shared memory, specifically learning Word2Vec using a Skipgram architecture against a hierarchical softmax. When different processes access and update the same memory, either because of a skew in the data or because of the increased number of concurrent cores used, there is only limited benefit to adding more cores. The original Word2Vec implementation is most efficient when trained using 8 cores and becomes less efficient when more cores are added. We propose a straightforward caching strategy that caches the weight vectors that are used most frequently, and updates their change to main memory after a short delay, reducing concurrent access of shared memory. We compared the efficiency of this strategy against existing Word2Vec implementations and show up to 4x improvement in efficiency.

Acknowledgment

This work was carried out on the Dutch national e-infrastructure with the support of SURF Foundation.

References

Bengio, Yoshua, Schwenk, Holger, Senécal, Jean-Sébastien, Morin, Fréderic, and Gauvain, Jean-Luc. Neural probabilistic language models. In Innovations in Machine Learning, pp. 137–186. Springer, 2006.

Ji, Shihao, Satish, Nadathur, Li, Sheng, and Dubey, Pradeep. Parallelizing Word2Vec in shared and distributed memory. arXiv preprint arXiv:1604.04661, 2016.

Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Greg S, and Dean, Jeff. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119, 2013.

Recht, Benjamin, Re, Christopher, Wright, Stephen, and Niu, Feng. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 693–701, 2011.

Weston, Jason, Chopra, Sumit, and Adams, Keith. #TagSpace: Semantic embeddings from hashtags. In Proceedings of EMNLP 2014, 2014.
