Distributed Training of Embeddings using Graph Analytics

Gurbinder Gill, Katana Graph Inc., Austin, TX, USA ([email protected])
Roshan Dathathri, Katana Graph Inc., Austin, TX, USA ([email protected])
Saeed Maleki, Microsoft Research, Redmond, WA, USA ([email protected])

Madan Musuvathi, Microsoft Research, Redmond, WA, USA ([email protected])
Todd Mytkowicz, Microsoft Research, Redmond, WA, USA ([email protected])
Olli Saarikivi, Microsoft Research, Redmond, WA, USA ([email protected])

Abstract—Many applications today, such as natural language processing, network and code analysis, rely on semantically embedding objects into low-dimensional fixed-length vectors. Such embeddings naturally provide a way to perform useful downstream tasks, such as identifying relations among objects and predicting objects for a given context. Unfortunately, training accurate embeddings is usually computationally intensive and requires processing large amounts of data.
This paper presents a distributed training framework for a class of applications that use Skip-gram-like models to generate embeddings. We call this class Any2Vec and it includes Word2Vec (Gensim) and Vertex2Vec (DeepWalk and Node2Vec), among others. We first formulate the Any2Vec training algorithm as a graph application. We then adapt the state-of-the-art distributed graph analytics framework, D-Galois, to support dynamic graph generation and re-partitioning, and incorporate novel communication optimizations. We show that on a cluster of 32 48-core hosts our framework GraphAny2Vec matches the accuracy of the state-of-the-art shared-memory implementations of Word2Vec and Vertex2Vec, and gives geo-mean speedups of 12× and 5× respectively. Furthermore, GraphAny2Vec is on average 2× faster than DMTK, the state-of-the-art distributed Word2Vec implementation, on 32 hosts while yielding much better accuracy.
Index Terms—Machine Learning, Graph Analytics, Distributed Computing

I. INTRODUCTION

includes Word2Vec (Gensim [9]), Vertex2Vec (DeepWalk [4] and Node2Vec [5] are examples), Sequence2Vec [8], and Doc2Vec [3] among others. Applications in this class maintain a large embedding matrix, where each row corresponds to the embedding vector for each object. Sequences of objects in the training data (text segments for Word2Vec and graph paths in Vertex2Vec) identify their relations, which are used during (semi-supervised) training to update the associated embedding vectors. The relations between objects are irregular, which leads to sparse accesses to the embedding matrix. Furthermore, the relations change with the training data, making accesses dynamic during the training process. Due to these dynamic and sparse accesses, distributing the training of Any2Vec embeddings is challenging, and existing distributed implementations [10]–[13] scale poorly due to the overheads of frequent communication; even then they do not match the accuracy of the shared-memory implementations.
We first formulate the Any2Vec class of algorithms that use Skip-gram-like models as a graph application. The objects are vertices and the relations between them are edges. An edge operator updates the embedding vectors of the edge's source and destination vertices. The algorithm applies these edge operators in a specific order or schedule (the schedule is critical for accuracy). However, these applications cannot be implemented in existing distributed graph analytics frameworks [14]–[17] because they require: (1) edge operators, (2) general (vertex-cut) graph partitioning policies, and (3) dynamic graph re-partitioning. None of the existing frameworks support all these required features.

Embedding is a popular technique that associates a fixed-length vector to objects in a set such that the mathematical closeness of any two vectors approximates the relationship of the associated objects. Applications in areas such as natural language processing (NLP) [1]–[3], network analysis [4], [5], code analysis [6], [7], and bioinformatics [8] rely on semanti-
cally embedding objects into these vectors. Embeddings pro- This paper introduces a distributed training framework for vide a way to perform downstream tasks, such as identifying Any2Vec applications called GraphAny2Vec, which extends relations among words, by studying their associated vectors. D-Galois [17], a state-of-the-art distributed graph analytics Training these embeddings requires large amounts of data framework, to support dynamic graph generation and re- to be used and is computationally expensive, which makes partitioning. GraphAny2Vec leverages the communication op- distributed solutions especially important. timizations in D-Galois and adds novel application-specific An important class of embedding applications use Skip- communication optimizations. It also leverages the Adaptive gram-like models, like the one used in Word2Vec [1], to Summation technique [18] for preserving accuracy during syn- generate embeddings. We call this class Any2Vec and it chronization and exposes a runtime parameter to transparently trade-off synchronization rounds and accuracy...... window words We evaluate two applications, Word2Vec and Vertex2Vec, e e 1 1 the 0 t 0 t in GraphAny2Vec on a cluster of up to 32 48-core hosts 1 1 0 1 fox dog 1 with 3 different datasets each. We show that compared to the The quick brown fox jumps over the lazy dog. quick 0 ...... 1 1 0 jumps state-of-the-art shared-memory implementations (the original 0 cat 0 0 cat 0 implementation [2] and Gensim [9] for Word2Vec and brown over words DeepWalk [4] for Vertex2Vec) on 1 host, GraphAny2Vec on ...... 32 hosts is on average 12× and 5× faster than Word2Vec Fig. 1. Viewing Word2Vec in Skip-gram model as a graph. The unique words and Vertex2Vec respectively, while matching their accuracy. in the training data corpus form vertices of the graph and edges represent GraphAny2Vec can reduce the training time for Word2Vec the positive (edge label: 1) and negative (edge label: 0) correlations among from 21 hours to less than 2 hours on our largest dataset of vertices. Labels on the vertices are: embedding (e) and training (t) vectors. Wikipedia articles while matching the accuracy of Gensim. On 32 hosts, GraphAny2Vec is on average 2× faster than the state- to be carefully set and decayed as the training progresses. of-the-art distributed Word2Vec implementation in Microsoft’s Training is complete when the model reaches a desired loss Distributed Machine Learning Toolkit (DMTK) [10]. or evaluation accuracy. An epoch of training is the number The rest of this paper is organized as follows. Section II of updates needed to go through the whole dataset once or provides a background on Any2Vec and graph analytics. equivalently, l iterations. For subsequent epochs, the dataset is Section III describes our GraphAny2Vec framework and Sec- shuffled again. For the rest of this paper, we will drop w tion IV presents our evaluation of GraphAny2Vec. Related i−1 from g (w ) as the model is known from the context. work and conclusions are presented in Sections V and VI. ri i−1 As it is evident from Equation 1, SGD’s update rule for II.BACKGROUND wi depends on wi−1 which makes it an inherently sequential In this section, we first briefly describe how stochastic algorithm. Luckily, there are several parallelization approaches gradient descent is used to train machine learning models for SGD algorithm each of which changes the semantics of (Section II-A), followed by how Any2Vec models are trained the algorithm. 
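For readability, the sequential SGD update that the text above refers to as Equation 1, together with the gradient it uses, can be restated in LaTeX as follows (alpha_i is the learning rate and L_{r_i} is the loss of the training example drawn at iteration i):

  w_i := w_{i-1} - \alpha_i \, g_{r_i}(w_{i-1}),
  \qquad
  g_{r_i}(w_{i-1}) = \left.\frac{\partial L_{r_i}}{\partial w}\right|_{w_{i-1}}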
The most well-known technique is mini-batch with Word2Vec as an example (Section II-B). We then provide SGD [20], where the gradients for multiple training examples an overview of graph analytics (Section II-C). are computed from the same point and then averaged. Mini- batch algorithm is semantically different from SGD and there- A. Stochastic Gradient Descent fore, the convergence of the algorithm is negatively affected A machine learning model is usually expressed by a func- for the same number of gradients computed. Hogwild! [21] is tion M : IRn × IRk → IRr or M(w, x) = y where w ∈ IRn another well-known SGD parallelization technique, wherein is the weight of the model parameters, x ∈ IRk is the input multiple threads compute gradients for different training ex- example and y ∈ IRr is the prediction of the model. The amples in parallel in a lock-free way and update the model training task of a machine learning model is defined by a set in a racy fashion. Surprisingly, this approach works well k r of training examples (xi, yi) where xi ∈ IR and yi ∈ IR and on a shared-memory system, especially with models where a loss function L : IRr × IRr → IR+ (IR+ is the set of non- gradients are sparse. The staleness of the models following negative reals) which assigns a penalty based on closeness of this parallelization technique hurts the accuracy with too M(w, xi), the model prediction, and yi, the true answer. In an many working threads. Adaptive Summation (AdaSum) [18] ideal world, M(w, xi) = yi and the loss function would be 0 is another technique similar to mini-batch algorithm which for all training examples i. Since (xi, yi) are constant during computes the gradients for multiple gradients in parallel but n + the training process, we will use function Li : IR → IR instead of averaging, it uses the following formula to reduce g g defined by Li(w) = L(M(w, xi), yi) which is the loss for two arbitrary gradients, a and b, into one: training example i given model parameters w for convenience. T ga · gb Since perfect prediction is usually impossible, for a given set AdaSum(ga, gb) = ga + gb − 2 gb (2) P kgbk of training examples, a w with minimum i Li(w) is desired. Stochastic Gradient Descent (SGD) [19] is a popular algo- Using this reduction rule, AdaSum takes any number of rithm for machine learning training to find w with minimum gradients and produces one. Maleki et al. [18] show that total loss. Suppose that there are l training examples and AdaSum preserves the convergence of the sequential SGD {r1, . . . , rl} is a random shuffle of indices from 1 to l. The algorithm when applied to training of dense deep neural model is initially set to a random guess w0 and at iteration i, networks (DNNs). w := w − α · g (w ) (1) i i−1 i ri i−1 B. Training Any2Vec Embeddings

∂Lri where αi is the learning rate, and gri (wi−1) = ∂w w An embedding is a mapping from a dataset D to a vector i−1 N is the gradient of Lri at wi−1 for training example i. The space IR such that elements of D that are related are close to learning rate determines how much the model is moved in the each other in the vector space. The length N of the embedding gradient direction and is a sensitive hyper-parameter that needs vectors is typically much smaller than the size of D. Many models have been proposed for learning embeddings. machines. The combination of these approaches provides suffi- We will focus on the popular Skip-gram model [1] together cient parallelism while preserving the accuracy in a distributed with the negative sampling introduced in [2], and explain it training. As we will discuss in Section IV-C, the original here in a form suitable for a graph analytic understanding. AdaSum reduction technique was proposed for DNNs but we Note that all the ideas proposed in this work are applicable for adapted it to Any2Vec for performance gains. Continuous Bag-of-Words (CBOW) model as it shares archi- tectural similarities with the Skip-gram model for computing C. Graph Analytics vector representations of words. In typical graph analytics applications, each vertex has one Skip-gram uses a training task where it predicts if a target or more labels, which are updated during algorithm execution word wO appears in the context of a center word wI . Figure 1 until a global quiescence condition is reached. The labels are (left) illustrates this for an example sentence, with “fox” as the updated by iteratively applying a computation rule, known as center word. The context is then defined as the words inside an operator, to the vertices or edges in the graph. The order in a window of size c (a hyper-parameter) centered on “fox”. which the operator is applied to the vertices or edges is known In Figure 1 (left), these positive samples have green edges as the schedule.A vertex operator takes a vertex n and updates with label 1. For each positive sample, Skip-gram picks k (a labels on n or its neighbors, whereas an edge operator takes hyper-parameter) random words as negative samples such as an edge e and updates labels on source and destination of e. the red edges with label 0 in the figure. In case of Vertex2Vec, To execute graph applications in distributed-memory, the words correspond to the vertices of the network and context edges are first partitioned [22] among the hosts and for each is defined by the paths generated by doing random walks on edge on a host, proxies are created for its endpoints. As the network [4], [5]. a consequence of this design, a vertex might have proxies The Skip-gram model consists of two vectors of size N (or replicas) on many hosts. One of these is chosen as the for each word w in the vocabulary: an embedding vector ew master proxy to hold the canonical value of the vertex. The and a training vector tw. For a pair of words hwI , wOi, the others are known as mirror proxies. Several heuristics exist σ(eT · t ) eT · t for partitioning edges and choosing master proxies [22]. model predicts the label with wI wO . wI wO is the 1 Most distributed graph analytics systems [14]–[17] use bulk- inner product of the two vectors and σ(x) = 1+exp(−x) is the sigmoid function which is a strictly increasing function from synchronous parallel (BSP) execution. Execution is done in 0 to 1. 
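As a concrete illustration of the per-pair training step described here (and formalized in Equation 3 below), the following is a minimal, self-contained C++ sketch of one update for a single sample. The function name, types, and the (label - sigma) error term follow the reference Word2Vec code; they are illustrative and not taken from any framework discussed in this paper.

#include <cmath>
#include <cstddef>
#include <vector>

// One Skip-gram (negative-sampling) step for a single sample <wI, wO, label>:
// label is 1 for a positive (context) pair and 0 for a negative sample.
// e is the embedding vector of the center word wI and t is the training
// vector of the target word wO; both have length N and are updated in place.
void pairUpdate(std::vector<float>& e, std::vector<float>& t,
                int label, float alpha) {
  float dot = 0.0f;
  for (std::size_t k = 0; k < e.size(); ++k) dot += e[k] * t[k];
  const float pred = 1.0f / (1.0f + std::exp(-dot));           // sigma(e . t)
  const float g = alpha * (static_cast<float>(label) - pred);  // scaled error
  for (std::size_t k = 0; k < e.size(); ++k) {
    const float eOld = e[k];   // use the old e when updating t
    e[k] += g * t[k];
    t[k] += g * eOld;
  }
}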
For a pair of related words, a close to 1 prediction is rounds of computation followed by bulk-synchronous com- desired and a close to 0 otherwise. The loss function is the munication. In the computation phase, every host applies the − log(1 − |y − σ(eT · t )|) y 1 0 operator on vertices/edges in its own partition and updates defined by wI wO where is ( ) for related (unrelated) words. Following the standard gradient the labels of the local proxies. Thus, different proxies of the same vertex might have different values for the same label. rules, ewI and twO are update as follows: Every host then participates in a global communication phase i i−1 i T i−1 e = e − α (y − σ(e · tw ))t to synchronize the labels of all proxies. Different proxies of the wI wI wI O wO (3) i i−1 i T i−1 same vertex are reconciled by applying a reduction operator, tw = ew − α (y − σ(ew · twO ))ew O I I I which depends on the algorithm being executed. where the superscript identifies the iteration similar to the subscripts in Equation 1. Note that only one vector from each III.DISTRIBUTED ANY2VEC word is updated. Also, αi is the learning rate for each iteration In this section, we first describe our formulation of Any2Vec and our experiments suggested it is a sensitive hyper-parameter as a graph application and provide an overview of our dis- for Any2Vec and needs to set and decayed carefully. tributed GraphAny2Vec (Section III-A). We then describe We base the work in this paper on Google’s Word2Vec tool1, dynamic graph generation and partitioning (Section III-B), which uses the Hogwild! parallelization technique. Each thread model synchronization (Section III-C), and communication is given a subset of the corpus and goes through the words in optimizations (Section III-D) in more detail. it sequentially (skipping some due to sub-sampling frequent We use Word2Vec as the running example in this section words as described in [2]). On each thread and for each word as other examples of Any2Vec follow the same logic. wI in the assigned text, Word2Vec finds the related words A. Overview of Distributed GraphAny2Vec in the context window centered around wI . For each pair of related words, Word2Vec randomly samples k unrelated words We formulate Any2Vec as a graph problem and call it to update the vectors. Therefore, for each pair of related words, GraphAny2Vec. Each unique element in the dataset D (such a total k + 1 updates occur, 1 with label 1 and k with label 0. as unique words in Word2Vec) corresponds to a vertex in a As discussed before, these updates occur in a racy manner. graph, and the positive and negative samples correspond to For GraphAny2Vec in this paper, we use Hogwild! paral- edges in the graph with label 1 and 0 respectively. Figure 1 lelization strategy on each machine in a cluster and Adaptive (right) illustrates this for Word2Vec. Training the Skip-gram Summation (AdaSum) for synchronization of updates across model is now a graph analytics application. Each vertex has two vectors — e and t — for embedding and training vectors, 1https://code.google.com/archive/p/word2vec/ respectively, of size N (labels on the vertices as described in Algorithm 1 Execution on each distributed host of GraphAn- operator is then applied to all edges in the graph (line 9) y2Vec. in a specific schedule. The operator updates the vertex vectors 1: procedure GRAPHANY2VEC(Corpus C, Num. of epochs R, directly using Equation 3. Then, all hosts participate in a bulk- Num. 
of sync rounds S, Learning rate α) synchronous communication to synchronize the vertex vectors 2: Let h be the host ID (line 10). We explain this communication in more detail in 3: Stream C from disk to build set of vertices V 4: Read partition h of C that forms a work-list W Section III-C. 5: for epoch r from 1 to R do GraphAny2Vec exploits both data and model parallelism. 6: for sync round s from 1 to S do The edges on the graph represent the positive and negative 7: Let Ws be partition s of W samples and as discussed, the edges are partitioned among 8: Build graph G = (V , E) where E are samples in W s distributed hosts and are operated on in parallel. This corre-

9: Compute(G, Ws, α) . Updates G locally sponds to data parallelism. The vectors associated with vertices 10: Synchronize(G) . Updates G globally of the graph represent the model parameters. While the vertices 11: end for are distributed among hosts, the same vertex might have 12: end for proxies on multiple hosts. Updates for these vectors (the proxy 13: end procedure copies or the master ones) may occur in the same round and therefore, the model parameters are updated in parallel which corresponds to model parallelism. Section II-C). The model corresponds to these vectors for all There are many distributed graph analytics systems [14]– vertices. These vectors are initialized randomly and updated [17] but GraphAny2Vec cannot be implemented in any of them during training by applying an edge operator, that takes the without significant effort because none of them provide the source src and destination dst of an edge, and updates esrc following three requirements in the same framwork: (1) edge and tdst using Equation 3 where src corresponds to wI and operators, (2) vertex-cut partitioning policies, and (3) dynamic dst corresponds to wO. In each epoch, the operator is applied graphs generation where the vertices and edges change during to all edges once in a specific schedule, which we describe in algorithm execution out of the box. We extend D-Galois [17], Section III-B. This schedule is critical for accuracy. the state-of-the-art distributed graph analytics system, to sup- Algorithm 1 gives a brief overview of our distributed port GraphAny2Vec. D-Galois consists of the Galois [23] GraphAny2Vec execution. The first step is to construct the multi-threaded for computation and the Gluon [17] set of vertices V (unique words in case of Word2Vec) by communication substrate for synchronization. Galois provides making a pass over the training data corpus C on each host efficient, concurrent data structures like graphs, work-lists, in parallel (line 3). As C may not fit in the memory of dynamic bit-vectors, etc. Gluon incorporates communication a single host, we stream it from disk to construct V . The optimizations that enable it to scale to a large number of hosts. corpus C is then partitioned (logically) into roughly equal D-Galois already supports edge operators and vertex-cuts. We contiguous chunks among hosts. All hosts read their own adapted D-Galois to handle dynamic graph generation effi- partition of C in parallel (line 4). The list of vertices in a host’s ciently during computation and communication. as explained 2 partition of C constitutes the work-list W that the host is in Sections III-B and III-D respectively. responsible for computing Any2Vec on in each epoch (line 5). The graph edges (positive and negative samples as described B. Graph Generation and Partitioning in Section II-B) are dynamically generated as described in As explained in Section II-B, the Skip-gram model gener- Section III-B. The same vertex may occur in W on different ates positive and negative samples using randomization. Con- hosts, so its edges might be generated on different hosts. Such sequently, the samples or edges generated for the same element graph partitions are known as a vertex-cut [22] because edges or vertex in the corpus in different epochs may be different. of a vertex could be split across partitions. 
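To make the control flow of Algorithm 1 concrete, here is a structural C++ sketch of the per-host loop. The Sample type, the chunking, and the two callables are illustrative stand-ins, not the D-Galois API: in GraphAny2Vec the samples of W_s are generated on the fly (Section III-B), compute applies the edge operator of Equation 3, and synchronize is the bulk-synchronous proxy reduction.

#include <cstddef>
#include <functional>
#include <vector>

struct Sample { int src, dst, label; };  // a positive (1) or negative (0) edge

// Per-host loop of Algorithm 1: R epochs, S synchronization rounds per epoch.
// W is this host's work-list (its contiguous chunk of the training corpus).
void trainOnHost(const std::vector<Sample>& W, int R, int S, float alpha,
                 const std::function<void(const std::vector<Sample>&, float)>& compute,
                 const std::function<void()>& synchronize) {
  const std::size_t n = W.size();
  for (int r = 0; r < R; ++r) {          // line 5: epochs
    for (int s = 0; s < S; ++s) {        // line 6: synchronization rounds
      // line 7: W_s is the s-th roughly equal contiguous chunk of W
      const std::size_t lo = n * static_cast<std::size_t>(s) / S;
      const std::size_t hi = n * static_cast<std::size_t>(s + 1) / S;
      std::vector<Sample> Ws(W.begin() + static_cast<std::ptrdiff_t>(lo),
                             W.begin() + static_cast<std::ptrdiff_t>(hi));
      compute(Ws, alpha);                // lines 8-9: build subgraph, apply operator
      synchronize();                     // line 10: reduce proxies across hosts
    }
  }
}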
Since the number of As the same edge may not be generated again, one way to edges generated and processed are determined by the window abstract this is to consider that the edges are being streamed size (c) and number of negative samples (k) (both hyper- and each edge is processed only once, even across epochs. parameters), the equal-sized chunk assignment of the training Due to this, the graph needs to be constructed dynamically in data corpus aids in load-balancing computation among hosts. each synchronization round, as shown in Algorithm 1 (line 8). We introduce a new (hyper-)parameter for controlling the The graph can be explicitly constructed in each round. number of synchronization rounds within an epoch. In each However, this may add unnecessary overheads as each edge epoch on each host, W is partitioned into S roughly equal is processed only once before the graph is destroyed. More contiguous chunks (one for each round) as shown in Algo- importantly, this does not distinguish between edges (samples) rithm 1 (line 6). In each round s, Ws is the corresponding from different occurrences of the same vertex (element) in subset of W and it forms a smaller subgraph. The Any2Vec the corpus. Consequently, the relative ordering of the edges is not preserved. We observed that the accuracy of the model is 2The work-list W does not change across epochs, so we construct it once highly sensitive to the order in which the edges are processed and reuse it for all epochs and synchronization rounds in our evaluation. However, if it does not fit in memory, partition s of W can be constructed because the learning rate decays after each occurrence of the from the corpus in each synchronization round. vertex is processed. Hence, the key to our graph formulation is Embedding Training C. Model Synchronization Masters Mirrors n Figure 3 shows an example where proxies on two hosts need Model Embedding Training to be synchronized after computation. Consider the word “fox” P1 that is present on both hosts. It may have different values n for the embeddings on both hosts after computation. Prior Model

works for distributed Word2Vec [10]–[13] such as Microsoft's Distributed Machine Learning Toolkit (DMTK) [10] (and other machine learning algorithms) use a parameter server to synchronize the model, as illustrated in Figure 2(a). One of the hosts (say P1) is chosen as the parameter server which holds the master proxy of the vectors. Other hosts asynchronously receive the updated model from the server and update their model locally based on the assigned computation. Finally, each host independently sends its local model to P1 to update the global model.

Fig. 2. Synchronization of the Skip-gram Model: (a) Parameter Server, (b) GraphAny2Vec.

words GraphAny2Vec, however, uses a different synchronization 0 model based on D-Galois [17], as illustrated in Figure 2(b). 0 0 fox 0 fox Abstractly, this can be viewed as a decentralized parameter the 0 dog the 0 1 1 0 1 server model where each host acts as a parameter server for 0 0 1 1 1 jumps quick a partition of the model and the updates occur in a bulk- quick over table synchronous fashion. In Figure 2(b), Pi has the master proxies brown wild cat cat for the ith partition of the vertices (model). Dividing the Synchronization model equally among hosts and making them responsible for maintaining the canonical state of their partitions of the Fig. 3. Synchronization of proxies in graph partitions. model aids in load-balancing communication among hosts. As described in Section III-B the local updates to mirrored proxies are sent to the host with the master proxy to be reduced. This is that on each host, the schedule of applying operators on edges more efficient than parameter server in modern interconnects in GraphAny2Vec must match the order in which samples because one server host does not become the bottleneck in would be processed in Any2Vec. Note that the work-list W communication. preserves the ordering of vertex occurrences in the corpus. Different hosts may update the vectors for the same vertex Thus, GraphAny2Vec generates or streams edges on-the-fly in the same round and therefore, the host with the master using partition s of W in round s, instead of constructing the proxy needs to reduce multiple updates into one. Prior works graph. for distributed Word2Vec use averaging but GraphAny2Vec leverages the AdaSum [18] reduction operator as it allows us Each host generates edges for its own partition of the data to match the accuracy of shared-memory implementations. corpus in Ws in each synchronization round. In other words, The AdaSum paper shows that the optimal way to perform the graph is re-partitioned in every round. Each host owns such distributed reductions for all parameters in the model the edges generated in a given synchronization round. As is by using vector-halving distance-doubling technique [24]. mentioned in Section II-C, vertex proxies are created for the However, in GraphAny2Vec, not all vectors are updated in ev- endpoints of edges on a host. The master proxy for each vertex ery round. Consequently, such communication would result in can be chosen from among its proxies thus created, but this redundant communication. As described above, in GraphAn- would incur overheads in every round. We instead logically y2Vec, the master is responsible for applying the updates partition the vertices (or unique words in case of Word2Vec) from all mirrors. Suppose that in a synchronization round, once into roughly equal contiguous chunks among the hosts u0 = g0 is the value of an embedding on the master proxy and each host creates master proxies for the vertices in its and g1, . . . , gn are the set of updates received from its n mirror partition; this is also referred to as the model partitioning in proxies. Then the updates are reduced sequentially on that host the literature. Master proxies do not change across rounds or in the following way: ∀1 ≤ i ≤ n, ui = AdaSum(ui−1, gi). epochs. Proxies for other vertices on the host would be mirror proxies. Each mirror knows the host that has its master using D. Communication Schemes and Optimizations the static partitioning of vertices. 
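A minimal C++ sketch of the master-side reduction just described: the master folds the updates g_1, ..., g_n received from its mirror proxies into its current value u_0 one at a time, u_i = AdaSum(u_{i-1}, g_i). The adasum helper follows one reading of the pairwise rule in Equation 2 (taking the denominator as 2*||g_b||^2); names and types are illustrative, not the framework's implementation.

#include <cstddef>
#include <vector>

// Pairwise AdaSum of two updates, following Equation 2 as reproduced above:
// ga + gb - (ga . gb) / (2 * ||gb||^2) * gb
std::vector<float> adasum(const std::vector<float>& ga, const std::vector<float>& gb) {
  float dotAB = 0.0f, dotBB = 0.0f;
  for (std::size_t k = 0; k < ga.size(); ++k) {
    dotAB += ga[k] * gb[k];
    dotBB += gb[k] * gb[k];
  }
  const float scale = (dotBB > 0.0f) ? dotAB / (2.0f * dotBB) : 0.0f;
  std::vector<float> out(ga.size());
  for (std::size_t k = 0; k < ga.size(); ++k)
    out[k] = ga[k] + gb[k] - scale * gb[k];
  return out;
}

// Sequential fold on the master proxy: u_i = AdaSum(u_{i-1}, g_i), 1 <= i <= n.
std::vector<float> reduceOnMaster(std::vector<float> u,
                                  const std::vector<std::vector<float>>& mirrorUpdates) {
  for (const auto& g : mirrorUpdates) u = adasum(u, g);
  return u;
}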
Each master also needs to We provide two major schemes to exchange updates be- know the hosts with its dynamic mirrors as the edges change tween mirror and master proxies with communication opti- in each round. At the end of each round, the updates to mirrors mization for each. are sent to the master proxy to be reduced. Section III-C RepModel-Base: In this scheme, each host has proxies for and III-D discuss the update exchange schemes and reduction all vertices, so the entire model is replicated on each host. operators. Thus, each host statically knows that every other host has mirror proxies for the masters on it. This means that at the TABLE I beginning of a round, each host sends the master proxy of each DATASETS AND THEIR PROPERTIES. vertex to every other host and at the end of the round, each host Vocabulary Training Applications Datasets Size Labels sends back its mirror proxies to the host with master proxy to Words Words update. Obviously, this approach has an overwhelming volume 1-billion 399.0K 665.5M 3.7GB N/A Word2Vec news 479.3K 714.1M 3.9GB N/A of communication as a host may not need the update for a (Text) mirror proxy in a round. wiki 2759.5K 3594.1M 21GB N/A RepModel-Opt: This is the same scheme as RepModel- Vertex2Vec BlogCatalog 10.3K 4.1M 0.02GB 39 (Graph) Flickr 80.5K 32.2M 0.18GB 195 Base with a communication optimization. In this version, each Youtube 1138.5K 455.4M 2.8GB 47 host maintains a bit-vector for the mirror proxies updated in each round. Note that from Equation 3, for edge corresponding src dst e t e t TABLE II to and , only src and dst are updated and dst and src WORD2VECTRAININGTIME (HOURS) ON A SINGLE HOST COMPARED are neither read nor updated. Therefore, we need two bits for WITH GW2V(AS) ON 32 HOSTS.SPEEDUPGAINEDBYDISTRIBUTED each vertex, one for each vector. Synchronization in D-Galois GW2V (AS) ON 32 HOSTS OVER THE STATE-OF-THE-ART uses this bit-vector to ensure that only the updated mirrors are SHARED-MEMORY W2V. sent to their masters and the masters broadcast their values 1 host 32 hosts to all other hosts only if it was updated on any host in that DMTK GW2V GW2V Dataset W2V GEM Speedup round. (AVG) (AS) (AS) PullModel-Base: In this scheme, each host makes an in- 1-billion 4.24 4.39 4.21 3.98 0.32 13x spection pass over Ws (its work-list) before computation in news 4.45 4.66 4.28 4.51 0.37 12x wiki 20.49 OOM 25.43 22.34 1.85 11x each round to track the vertices that would be accessed during computation. Mirror proxies are then created for the vertices tracked and the master proxy communicates both vectors to using up to 32 Intel Xeon Platinum 8160 (“Skylake”) hosts, the mirror proxy. Then each host updates its vectors and each with 48 cores with clock rate 2.1Ghz, 192GB DDR4 communicates back the vectors of tracked vertices. RAM, and 32KB L1 data cache. Machines in the cluster are PullModel-Opt: As discussed in RepModel-Opt, e are src connected with a 100Gb/s Intel Omni-Path interconnect. Code accessed only at the src of an edge and t are accessed only dst is compiled with g++ 7.1 and MPI mvapich2/2.3. at the destination. If a mirror proxy on a host has only outgoing The evaluation in this paper focuses on distributed CPUs (or incoming) edges, then it will not access t (or e ). 
dst src because the existing (shared-memory and distributed-memory) This is not exploited in PullModel-Base because masters and Word2Vec and Vertex2Vec frameworks [2], [4], [9], [10] have mirrors are not label-specific in D-Galois (labels are 2 vectors been implemented and evaluated for CPU architectures. Nev- e and t). We modified D-Galois to maintain masters and ertheless, the techniques in GraphAny2Vec described in Sec- mirrors specific to each label. We also modified our inspection tion III are not restricted to CPU architectures. Furthermore, phase to track sources and destinations separately, and create the GraphAny2Vec framework is based on the state-of-the- mirrors for e and t respectively. Due to this, masters will src dst art distributed and heterogeneous graph analytics framework, broadcast e and t only to those hosts that access it. D-Galois [17], which supports both CPUs and GPUs. Thus, Generally, RepModel schemes requires the entire model to GraphAny2Vec can be extended to run on GPUs but that is fit in the memory of a host and the entire model is updated beyond the focus of this work. Due to the limited memory on each vertex for every round which requires a huge volume of GPUs, larger GPU clusters may be required to process the of communication. However, PullModel allows larger models bigger datasets considered in this study. However, we expect with limited required communication for each round at the GraphAny2Vec’s online graph generation, model partitioning, expense of an inspection pass. We will compare these schemes and communication optimizations to reduce memory usage and optimization in Section IV. and benefit distributed execution on GPUs as well. IV. EVALUATION Datasets: Table I lists the training datasets used for our We implement distributed Word2Vec and Vertex2Vec in our evaluation: text datasets for Word2Vec and graph datasets for GraphAny2Vec framework, and we refer to these applications Vertex2Vec. Prior Word2Vec and Vertex2Vec publications used as GraphWord2Vec (GW2V) and GraphVertex2Vec (GV2V) the same datasets. The wiki (21GB) and Youtube (2.8GB) respectively. Our evaluation methodology is described in detail datasets are the largest text and graph datasets respectively. in Section IV-A. First, we compare GW2V and GV2V with We used DeepWalk [4]3 for generating training corpus for the state-of-the-art third-party implementations (Section IV-B). Vertex2Vec by performing 10 random walks each of length We then analyze the impact of AdaSum [18] (Section IV-C) 40 from all vertices of the graph. We limit our Word2Vec and communication optimizations (Section IV-D). and Vertex2Vec evaluation to 32 and 16 hosts of Stampedes A. Experimental Methodology respectively because these datasets do not scale beyond that. For all frameworks and datasets, the preprocessing steps like Hardware: All our experiments were conducted on the Stampede2 cluster at the Texas Advanced Computing Center 3https://github.com/phanein/deepwalk 1-billion news wiki 1-billion news wiki 12500 2000 10000 10 2000 1500 7500 5 1000 1000 5000 Time (sec)


Fig. 4. Speedup of DMTK and GW2V on 32 hosts over W2V on 1 host.
Fig. 5. Breakdown of time of DMTK(AS) and GW2V(AS) on 32 hosts.

TABLE III GraphWord2Vec (GW2V) with the state-of-the-art shared- WORD2VECACCURACY (SEMANTIC, SYNTACTIC, AND TOTAL) OF memory Word2Vec implementations, the original C imple- DIFFERENT FRAMEWORKS RELATIVE TO W2V.B LUE (FIRST ROW FOR EACHACCURACYTYPE) INDICATESTHEBASELINEACCURACYALONG mentation (W2V) [2] as well as the more recent Gensim WITHTHETOLERANCEBOUNDS.GREENANDREDINDICATEACCURACY (GEN) [9] python implementation. We also compared our CHANGE ≥ AND < THETOLERANCELOWERBOUNDRESPECTIVELY. GraphVertex2Vec (GV2V) with the state-of-the-art shared- memory Vertex2Vec framework, DeepWalk [4] (both Deep- Framework 1-billion news wiki Walk and Node2Vec [5] use Gensim’s [9] Skip-gram model). W2V (1 Host) 75.86±0.07 70.79±0.54 79.10±0.31 GEN (1 Host) -0.22 -0.22 OOM Distributed-memory third-party implementations: We DMTK (1 Host) -13.79 -18.43 -7.46 compared GraphWord2Vec with the state-of-the-art dis- DMTK(AVG) (32 Hosts) -57.36 -57.15 -34.39 tributed-memory Word2Vec from Microsoft’s Distributed Ma- Semantic DMTK(AS) (32 Hosts) -10.93 -17 -5.17 chine Learning Toolkit (DMTK) [10], which is based on GW2V (1 Host) +0.07 -0.08 +0.26 the parameter server model. The model is distributed among GW2V(AVG) (32 Hosts) -7.00 -9.15 -4.03 parameter server hosts. During execution, hosts acting as GW2V(AS) (32 Hosts) +0.21 +0.07 -0.17 workers request the required model parameters from the W2V (1 Host) 50.0±0.18 50.0±0.26 49.22±0.12 servers and send model updates back to the servers. Each GEN (1 Host) -0.14 -0.12 OOM host in the cluster acts as both server and worker, and it DMTK (1 Host) -1.89 -0.67 -3.11 is the only configuration possible. DMTK uses OpenMP for DMTK(AVG) (32 Hosts) -24.89 -25.11 -23.11

Syntactic parallelization within a host, while GW2V uses Galois [23] DMTK(AS) (32 Hosts) -3.56 -1.78 -1.44 GW2V (1 Host) -0.37 0.0 -0.12 for parallelization within a host. Both GW2V and DMTK use GW2V(AVG) (32 Hosts) -4.89 -4.11 -7.55 MPI for communication between hosts. We modified DMTK GW2V(AS) (32 Hostss) +0.10 +0.11 +0.18 to include a runtime option of configuring the number of

W2V (1 Host) 72.36±0.21 69.21±0.42 74.10±0.42 synchronization rounds. DMTK uses averaging as the reduc- GEN (1 Host) +0.0 -0.14 OOM tion operation to combine the gradients. We refer to this as DMTK (1 Host) -11.65 -15.42 -3.03 DMTK(AVG). We also implemented AdaSum [18] in DMTK

Total DMTK(AVG) (32 Hosts) -51.29 -51.71 -32.03 and we call this DMTK(AS). There are no prior distributed DMTK(AS) (32 Hosts) -9.86 -14.78 -5.24 implementations of Vertex2Vec. Unless otherwise specified, GW2V (1 Host) -0.14 -0.28 +0.1 GW2V and GV2V use AS to combine gradients and use GW2V(AVG) (32 Hosts) -6.79 -9.28 -5.17 PullModel-Opt communication optimization. GW2V(AS) (32 Hostss) +0.14 +0.29 -0.17 Hyper-parameters: We used the hyper-parameters sug- gested by [2], unless otherwise specified: window size of 5, number of negative samples of 15, sentence length of 10K, vocabulary building take < 5% of the execution time and can threshold of 10−4 for Word2Vec and 0 for Vertex2Vec for be amortized. Thus, we report the accuracy and the training (or down-sampling the frequent words, and vector dimension- execution) time for all frameworks on these datasets, excluding ality N of 200. All models were trained for 16 epochs. preprocessing time, as an average of three distinct runs. For distributed frameworks, GV2V, GW2V, and DMTK, we Shared-memory third-party implementations: We eval- compared 2 gradient combining methods: Averaging (AVG) uated the Skip-gram [2] (with negative sampling) training and AdaSum (AS) [18] method. Synchronization rounds are model for both Word2Vec and Vertex2Vec. We compared increased linearly (roughly) with the number of hosts in order to maintain the desired accuracy (discussed in Section IV-C). TABLE IV Therefore, Unless otherwise specified, GV2V, GW2V, and VERTEX2VECTRAININGTIME (SEC) OF DEEPWALKON 1 HOSTVS. GRAPHVERTEX2VEC (GV2V) ON 16 HOSTS. DMTK use the same number of synchronization rounds: 1 for 1 host, 3 for 2 hosts, 6 for 4 hosts, 12 for 8 hosts, 24 for Dataset DeepWalk GV2V Speedup 16 hosts, and 48 for 32 hosts. Note that DMTK uses only 1 BlogCatalog 115.3 28.8 4.0x synchronization round by default, but this yields much lower Flickr 976.7 183.1 5.3x accuracy, so we do not report these results. Youtube 11589.2 2226.2 5.2x Accuracy: In order to measure the accuracy of trained models of Word2Vec on different datasets, we used the ana- logical reasoning task outlined by Word2Vec [2]. We evaluated TABLE V VERTEX2VECACCURACY (MACRO F1 AND MICRO F1) OF GV2V ON 16 the accuracy using scripts and question-words.txt provided HOSTS RELATIVE TO DEEPWALKON 1 HOST (NOTE:THESCORING by the Word2Vec code base4. Question-words.txt consists of SCRIPTSPROVIDEDBY DEEPWALKWERENOTABLETOHANDLETHE analogies such as "Athens" : "Greece" :: "Berlin" : ?, which are YOUTUBE DATASET). predicted by finding a vector x such that embedding vector(x) % Labeled 30% 60% 90% is closest to embedding vector("Athens") - vector("Greece") + Vertices vector("Berlin") according to the cosine distance. For this par- Deepwalk 34.0 37.2 38.4 ticular example the accepted value of x is "Germany". There BlogCatalog Micro-F1 GV2V -0.1 +0.1 +0.7 Deepwalk 38.6 40.4 41.1 are 14 categories of such questions, which are broadly divided Flickr into 2 main categories: (1) the syntactic analogies (such as GV2V +0.1 -0.1 -0.2 Deepwalk 34.1 37.2 38.4 "calm" : "calmly" :: "quick" : "quickly") and (2) the semantic BlogCatalog analogies such as the country to capital city relationship. We Macro-F1 GV2V -0.3 +0.1 +0.7 Deepwalk 26.5 28.7 29.5 report semantic, syntactic, and total accuracy averaged over all Flickr GV2V +0.1 -0.1 +0.3 the 14 categories of questions. For Vertex2Vec, we measured Micro-F1 and Macro-F1 scores using scoring scripts provided by DeepWalk [4]5. standard template library. Moreover, in GW2V, hosts can B. 
Comparing With The State-of-The-Art update their masters in-place. This is not possible in DMTK Word2Vec: We compare GW2V with distributed-memory as workers on each host have to fetch model parameters from implementation DMTK and shared-memory implementations, servers on the same host to update, incurring overhead for ad- W2V and GEN. Table II compares their training time on a ditional copies. Secondly, DMTK communicates much higher single host. Figure 4 shows the speedup of both GW2V and volume (≈ 3.5×) than GW2V. GW2V memoizes [17] the DMTK on 32 hosts over W2V on 1 host. Note that AVG and vertex IDs exchanged during inspection phase and exchanges AS methods are used to combine gradients during inter-host only the required values. In contrast, DMTK sends the vertex synchronization, so they have no impact on a single host. IDs along with the updated values to the parameter servers Performance: We observe that for all datasets on a single before and after the communication. In addition, GW2V’s host, the training time of GW2V is similar to that of W2V, inspection precisely identifies both the positive and negative GEN, and DMTK. GW2V scales up to 32 hosts and speeds up samples required for the current round. DMTK, on the other the training time by ≈ 13× on average over 1 host. GW2V is hand, only identifies precise positive samples, and builds ≈ 2× faster on average than DMTK on 32 hosts. Figure 4 also a pool for negative samples. During computation, negative shows that there is negligible performance difference between samples are randomly picked from this pool. The entire pool using AVG and using AS to combine gradients in both DMTK is communicated from and to the parameter servers, although and GW2V. Training wiki using GW2V takes only 1.9 hours, some of them may not be updated, leading to redundant which saves 18.6 hours and 1.5 hours compared to training communication. using W2V and DMTK respectively. Accuracy: Table III compares the accuracies (semantic, To understand the performance differences between DMTK syntactic, and total) for all frameworks on 1 and 32 hosts and GW2V better, Figure 5 shows the breakdown of their relative to the accuracies achieved by W2V. On a single host, training time into 3 phases: inspection, (maximum) computa- GW2V is able to achieve accuracies (semantic, syntactic and tion, and (non-overlapped) communication. Firstly, GW2V’s total) comparable to W2V. DMTK on a single host is less inspection phase as well as serialization and de-serialization accurate due to implementation differences in the Skip-gram during synchronization are parallel using D-Galois parallel model training: DMTK only updates learning rate between constructs and concurrent data-structures [17], [23], whereas mini-batches, whereas others continuously degrade learning these phases are sequential in DMTK as it uses non-concurrent rate, and DMTK uses a different strategy to choose negative data-structures such as set and vector provided by the C++ samples as described earlier. On 32 hosts, DMTK(AVG) has a much lower accuracy than the baseline while GW2V(AVG) has 4https://github.com/tmikolov/word2vec 5https://github.com/phanein/deepwalk/blob/master/example_graphs/ a better accuracy than DMTK(AVG) but still subpar compared scoring.py to the baseline. AS significantly improves the accuracies faster on 32 hosts than SM (as shown in Table II). Not having

to tune the learning rate and still getting accuracy at scale is made possible by AdaSum.

Synchronization Rounds: AdaSum (AS) improves the accuracies significantly, but in order to get accuracies comparable to shared-memory implementations, the number of synchronization rounds in each epoch is an important knob to tune.
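The schedule used in the evaluation (Section IV-A) follows a simple rule of thumb: 1 round on a single host, and roughly 1.5x the host count otherwise (3, 6, 12, 24, and 48 rounds on 2, 4, 8, 16, and 32 hosts). A trivial helper capturing that schedule, for illustration only:

// Synchronization rounds per epoch as used in the evaluation (Section IV-A).
int syncRoundsPerEpoch(int hosts) {
  return (hosts <= 1) ? 1 : (3 * hosts) / 2;
}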

0 We observe that accuracies improve as we increase the number of synchronization rounds within an epoch for both AS and 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 AVG. Nonetheless, accuracies show more improvement for AS Epoch Number (for example, semantic: 3.07%, syntactic: 3.99% and total: Reduction Learning Rate 3.36% when synchronization rounds are increased from 12 AS AVG SM 0.025 0.1 0.4 to 48 on 32 hosts) as opposed to AVG, which shows very 0.05 0.2 0.8 little change in accuracies with synchronization rounds. In general, we have observed that in order to maintain the desired Fig. 6. Accuracy of GW2V after each epoch for 1-billion dataset on 1 host accuracy, the synchronization rounds needs to be increased (SM) and on 32 hosts using AS and AVG. (roughly) linearly with the number of hosts. We have followed this rule of thumb in all our experiments. over AVG for both DMTK and GW2V. DMTK(AS) matches its own single host accuracy and GW2V(AS) matches the D. Impact of Communication Optimizations accuracy of W2V. Vertex2Vec: Table IV compares the training time of Deep- Figure 7 compares the strong scaling of GW2V with Walk [4] on a single host with our GV2V on 16 hosts. We RepModel-Base (RMB), RepModel-Opt (RMO), PullModel- observe that for all datasets, GraphVertex2Vec can train the Base (PMB), and PullModel-Opt (PMO). Each of the opti- model ≈ 4.8× faster on average. Similar to GraphWord2Vec, mized variants scale well for all datasets but the performance this speedup does not come at the cost of the accuracy, as differences between them increase with more hosts. This is shown in Table V, which shows the Micro-F1 and Macro-F1 due to two reasons: (a) synchronization rounds doubles with score with 30%, 60%, and 90% labeled vertices. the number of hosts, thus communicating more data, and (b) Discussion: GraphAny2Vec significantly speeds up the train- as training data or corpus gets divided among more hosts, ing time for Word2Vec and Vertex2Vec applications by dis- sparsity in the updates increase. Figure 8 shows the breakdown tributing the computation across the cluster without sacri- of the training time into inspection, (maximum) computation, ficing the accuracy. Reduced training time also accelerates and (non-overlapped) communication time on 32 hosts. the process of improving the training algorithms as it allows RepModel-Opt, which avoids communicating untouched application designers to make more end-to-end passes (with vectors reduces communication volume by 2.5× and is 16% different hyper-parameters) in a short duration of time. faster on average than RepModel-Base on 32 hosts. This shows that RepModel-Opt is able to exploit the sparsity in C. AdaSum Reduction Operator in GraphAny2Vec the communication. In contrast to RepModel-Opt, PullModel- If time were not an issue, all machine learning algorithms Base only communicates the vectors that are needed in a would run sequentially. A sequential SGD is simple to tune round. This is done through an inspection pass which by itself and converges fast. Unfortunately, it is slow. A point (x, y) has a compute and a communication overhead. On 32 hosts, on Figure 6 denotes the total accuracy (y) as a function of PullModel-Base sends 1.3× more communication volume on epoch (x). The blue line (SM) shows the accuracy of GW2V average than RepModel-Opt. PullModel-Opt enhances the on a single shared-memory host. It clearly converges to a high inspection pass to only communicate half of the vectors as accuracy quickly. 
In contrast, the green lines plot accuracy of explained in Section III-D. It is 14%, 22%, and 34% faster on distributed GW2V that uses averaging the gradients (AVG) average than PullModel-Base, RepModel-Opt, and RepModel- with different learning rates on 32 hosts. The learning rate Base respectively. It also reduces communication volume by of 0.025 is the same as SM while the learning rate of 0.8 is 1.5×, 1.1×, and 2.7×, respectively. 32 times larger. The former converges slowly while the latter Summary: PullModel-Opt in GraphAny2Vec always performs does not converge at all (accuracy is 0) because the learning better than the other variants due to the enhanced inspec- rate is too large. As it can also be observed, none of the other tion phase which reduces communication volume exchanged learning rates have an on-par accuracy with averaging. Finally, among hosts at the cost of slightly increased bookkeeping. the red line plots accuracy of distributed GW2V that uses AS These improvements are expected to grow as we scale to and 0.025 as the learning rate on 32 hosts. AS has no problem bigger datasets and number of hosts. Hence, PullModel-Opt meeting the accuracy of the sequential algorithm. In addition not only allows one to train bigger models, but also gives the to providing the same accuracy as SM, it is ≈ 12× times best performance. 1-billion news wiki 1-billion news wiki


Fig. 7. Strong scaling of GW2V (RMB: RepModel-Base, RMO: RepModel-Opt, PMB: PullModel-Base, PMO: PullModel-Opt).
Fig. 8. Breakdown of training time of GW2V with different communication schemes on 32 hosts.
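The communication schemes compared above differ mainly in how precisely each round's accesses are tracked before any data is exchanged. The following C++ sketch shows the label-specific inspection idea behind PullModel-Opt (Section III-D): for each sample, the embedding vector e is only needed at the source and the training vector t only at the destination, so the two are tracked separately. The data structures are illustrative, not the D-Galois/Gluon implementation.

#include <vector>

struct Sample { int src, dst, label; };

// Inspection pass over the round's samples W_s. The two bit-vectors record,
// per vertex, whether its e vector (as a source) or its t vector (as a
// destination) will be touched; only those vectors are then communicated.
void inspectRound(const std::vector<Sample>& Ws, int numVertices,
                  std::vector<bool>& needsE, std::vector<bool>& needsT) {
  needsE.assign(numVertices, false);
  needsT.assign(numVertices, false);
  for (const Sample& s : Ws) {
    needsE[s.src] = true;
    needsT[s.dst] = true;
  }
}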

V. RELATED WORK to use commodity machines and network available on public Training models: Mikolov et al. [1] proposed two simple clouds. Our approach communicates infrequently and uses model architectures for computing continuous vector represen- AdaSum [18] to overcome the resulting staleness. tations of words from very large unstructured data sets, known Ordentlich et al. [27] can handle models that do not fit as Continuous Bag-of-Words (CBOW) and Skip-gram (SG). on a single machine in a cluster by vertically partitioning These models removed the non-linear hidden layer and hence the embedding and training vectors and assigning parts of avoid dense matrix multiplications, which was responsible it to different machine. These partitions compute partial dot for most of the complexity in the previous models such as products locally but communicate to compute global dot LSA [25] and LDA. CBOW is similar to the feedforward products. However, in order to compute partial dot products, Neural Net Language Model (NNLM) [26], where the non- all hosts have to process the same training data at a given linear hidden layer is removed and the projection layer is instance, whereas, in GraphAny2Vec, each host can indepen- shared for all words. SG on the other hand unlike CBOW, in- dently process different chunks of the training data corpus in stead of predicting the current word based on the context, tries parallel. Therefore, our design not only allows for horizontal to maximize classification of a word based on another word partitioning of large models that do not fit in memory, but also within a sentence. Later Mikolov et al. [2] further introduced enables parallel processing of large corpses of data. several extensions, such as using hierarchical softmax instead Big data analytics frameworks like Spark [29]) and machine of full softmax, negative sampling, subsampling of frequent learning frameworks like TensorFlow [30] and DMTK [10] are words, etc., to SG model that improves both the accuracy and sufficiently general to support Any2Vec applications. Conse- the training time. SG has also been adopted in other domains quently, there are Word2Vec implementations in Spark [12], to represent vertices of a graph/network (DeepWalk [4] and [27], TensorFlow [28], and DMTK [10]. These implementa- Node2Vec [5]), genome sequences in bio-informatics (Se- tions cannot be adapted to support the graph-specific opti- quence2Vec [8]), and documents (Doc2Vec [3]). mizations and Any2Vec-specific optimizations introduced in Shared-memory frameworks: Our work adapts the algorithm this paper. That is why GraphAny2Vec outperforms and scales from Mikolov et al. [2] for distribution. This work, together even better than DMTK which is the state-of-the-art for such with many current implementations [9] are designed to run on problems. Thus, the generality of these frameworks comes a single machine but utilizing multi-threaded parallelism. Our at the cost of performance. On the other hand, traditional work is motivated by the fact that these popularly used imple- HPC programming models like MPI [31] can also be used to mentations take days or even weeks to train on large training implement Any2Vec like the Word2Vec implementation [13] corpus. GraphAny2Vec is able to significantly speedup the that uses MPI. Optimizations introduced in this paper can also training process while maintaining the accuracy comparable be implemented with MPI but they would require a significant to the state-of-the-art shared-memory frameworks. 
effort to support a new Any2Vec application as MPI is too Distributed-memory frameworks: Prior works on distribut- low-level. ing Word2Vec either use synchronous data-parallelism [11]– Our formulation of Any2Vec as a graph problem enables [13], [27] or parameter-server style asynchronous data paral- Any2Vec applications to be implemented in existing graph lelism [10], [28]. However, they perform communication after frameworks. Any2Vec requires (1) edge operators, (2) vertex- every mini-batch, which is prohibitively expensive in terms of cut partitioning policies, and (3) dynamic graph re-partitioning, network bandwidth. Our design was motivated by the need but none of the existing distributed graph frameworks [14]– [17] support all these features. Gemini [16] does not support [14] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin, “Power- edge operators or vertex-cut partitioning policies. Power- Graph: Distributed Graph-parallel Computation on Natural Graphs,” ser. OSDI’12, CA, USA. Graph [14] supports vertex-cuts but does not support edge [15] W. Xiao, J. Xue, Y. Miao, Z. Li, C. Chen, M. Wu, W. Li, and operators. TuX2 [15] is a graph engine that supports subset L. Zhou, “Tux2: Distributed graph computation for machine learning,” of distributed machine learning applications by extending the ser. NSDI’17. Berkeley, CA, USA: USENIX Association, 2017. [16] X. Zhu, W. Chen, W. Zheng, and X. Ma, “Gemini: A Computation- Gather-Apply-Scatter model [14]. D-Galois [17] is the state- centric Distributed Graph Processing System,” ser. OSDI’16, CA, USA. of-the-art graph engine that optimizes communication based [17] R. Dathathri, G. Gill, L. Hoang, H.-V. Dang, A. Brooks, N. Dryden, on invariants in the graph partitioning policies. Both TuX2 and M. Snir, and K. Pingali, “Gluon: A communication-optimizing substrate for distributed heterogeneous graph analytics,” SIGPLAN Not., 2018. D-Galois support edge operators and vertex cuts, but they do [18] S. Maleki, M. Musuvathi, T. Mytkowicz, O. Saarikivi, T. Xu, V. Ek- not support dynamic graphs. GraphAny2Vec extends D-Galois sarevskiy, J. Ekanayake, and E. Barsoum, “Scaling distributed training to support dynamic graph generation and re-partitioning. It with adaptive summation,” 2020. [19] L. Bottou, Stochastic Gradient Descent Tricks, ser. Lecture Notes in is the first distributed graph framework to support Any2Vec Computer Science (LNCS). Springer, January 2012. and it introduces communication optimizations specific to [20] L. Bottou, F. E. Curtis, and J. Nocedal, “Optimization Methods for Any2Vec applications. Large-Scale Machine Learning,” arXiv e-prints, Jun 2016. [21] B. Recht, C. Re, S. Wright, and F. Niu, “Hogwild: A lock-free approach to parallelizing stochastic gradient descent,” in NeurIPS 24, 2011. VI.CONCLUSIONS [22] L. Hoang, R. Dathathri, G. Gill, and K. Pingali, “CuSP: A Customizable Streaming Edge Partitioner for Distributed Graph Analytics,” in IPDPS, Through GraphAny2Vec, we show that popular Any2Vec 2019. ML applications can be mapped to graph applications. [23] D. Nguyen, A. Lenharth, and K. Pingali, “A lightweight infrastructure GraphAny2Vec highlights and implements the extensions re- for graph analytics,” ser. SOSP ’13. New York, NY, USA: ACM. [24] R. Thakur, R. Rabenseifner, and W. 
Gropp, “Optimization of collective quired in the current state-of-the graph analytics framework to communication operations in mpich,” The International Journal of High support Any2Vec applications, namely dynamic graph gener- Performance Computing Applications, vol. 19, 2005. ation, graph re-partitioning, and communication optimization. [25] T. Mikolov, S. W.-t. Yih, and G. Zweig, “Linguistic regularities in continuous space word representations,” in NAACL-HLT-2013. GraphAny2Vec substantially speeds up the training time for [26] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin, “A neural proba- Any2Vec applications by distributing the computation across bilistic language model,” J. Mach. Learn. Res., vol. 3, Mar. 2003. [27] E. Ordentlich, L. Yang, A. Feng, P. Cnudde, M. Grbovic, N. Djuric, the cluster without sacrificing the accuracy. Reduced training V. Radosavljevic, and G. Owens, “Network-efficient distributed time also accelerates the process of improving the training word2vec training system for large vocabularies,” in CIKM 2016. algorithms as it allows application designers to make more [28] W. in TensorFlow, https://www.tensorflow.org/tutorials/text/word2vec. [29] M. Zaharia, R. S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, end-to-end passes in a short duration of time. GraphAny2Vec X. Meng, J. Rosen, S. Venkataraman, M. J. Franklin, A. Ghodsi, thus enables more explorations of Any2Vec applications in J. Gonzalez, S. Shenker, and I. Stoica, “: A unified engine areas such as natural language processing, network analysis, for big data processing,” Commun. ACM, vol. 59, no. 11, p. 56–65, Oct. 2016. [Online]. Available: https://doi.org/10.1145/2934664 and code analysis. [30] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, REFERENCES S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng, “Tensorflow: A system for large-scale [1] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of machine learning,” in Proceedings of the 12th USENIX Conference on word representations in vector space.” Operating Systems Design and Implementation, ser. OSDI’16. USA: [2] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, “Distributed USENIX Association, 2016, p. 265–283. representations of words and phrases and their compositionality,” ser. [31] M. P. Forum, “Mpi: A message-passing interface standard,” USA, Tech. NIPS’13, USA. Rep., 1994. [3] Q. Le and T. Mikolov, “Distributed representations of sentences and documents,” ser. ICML’14. JMLR.org, 2014. [4] B. Perozzi, R. Al-Rfou, and S. Skiena, “Deepwalk: Online learning of social representations,” ser. KDD ’14. New York, NY, USA: ACM. [5] A. Grover and J. Leskovec, “Node2vec: Scalable for networks,” ser. KDD ’16. New York, NY, USA: ACM. [6] U. Alon, M. Zilberstein, O. Levy, and E. Yahav, “Code2vec: Learning distributed representations of code,” Proc. ACM Program. Lang., vol. 3, no. POPL, pp. 40:1–40:29, Jan. 2019. [7] U. Alon, O. Levy, and E. Yahav, “code2seq: Generating sequences from structured representations of code,” 2018. [8] H. Dai, R. Umarov, H. Kuwahara, Y. Li, L. Song, and X. Gao, “Sequence2Vec: a novel embedding approach for modeling transcription factor binding affinity landscape,” Bioinformatics, vol. 33, 2017. [9]R. Reh˚uˇ rekˇ and P. Sojka, “Software Framework for Topic Modelling with Large Corpora,” in LREC 2010. [10] D. M. L. Toolkit, http://www.dmtk.io/. 
[11] word2vec in Deeplearning4j, https://deeplearning4j.org/docs/latest/deeplearning4j-nlp-word2vec.
[12] word2vec in Spark, spark.apache.org/docs/latest/mllib-feature-extraction.html.
[13] S. Ji, N. Satish, S. Li, and P. K. Dubey, "Parallelizing word2vec in shared and distributed memory," IEEE Transactions on Parallel and Distributed Systems, 2019.