
Scalable Asynchronous Gradient Descent Optimization for Out-of-Core Models

Chengjie Qin‡∗   Martin Torres†   Florin Rusu†
†University of California Merced   ‡GraphSQL, Inc.
{cqin3, mtorres58, frusu}@ucmerced.edu

ABSTRACT
Existing data analytics systems have approached predictive model training exclusively from a data-parallel perspective. Data examples are partitioned to multiple workers and training is executed concurrently over different partitions, under various synchronization policies that emphasize speedup or convergence. Since models with millions and even billions of features become increasingly common nowadays, model management becomes an equally important task for effective training. In this paper, we present a general framework for parallelizing stochastic optimization algorithms over massive models that cannot fit in memory. We extend the lock-free HOGWILD! family of algorithms to disk-resident models by vertically partitioning the model offline and asynchronously updating the resulting partitions online. Unlike HOGWILD!, concurrent requests to the common model are minimized by a preemptive push-based sharing mechanism that reduces the number of disk accesses. Experimental results on real and synthetic datasets show that the proposed framework achieves improved convergence over HOGWILD! and is the only solution scalable to massive models.

1. INTRODUCTION
Data analytics is a major topic in contemporary data management and machine learning. Many platforms, e.g., OptiML [38], GraphLab [26], SystemML [12], Vowpal Wabbit [1], SimSQL [2], GLADE [4], Tupleware [7], and libraries, e.g., MADlib [15], Bismarck [11], MLlib [37], Mahout(1), have been proposed to provide support for parallel statistical analytics. Stochastic gradient descent is the most popular optimization method used to train analytics models across all these systems. It is implemented – in one form or another – by all of them. The seminal HOGWILD! family of algorithms [28] for stochastic gradient descent has received particular attention due to its near-linear speedups across a variety of machine learning tasks, but, mostly, because of its simplicity. Several studies have applied HOGWILD! to parallelize classical learning methods [33, 11, 25, 9, 8, 5] by performing model updates concurrently and asynchronously, without locks.

Due to the explosive growth in data acquisition, the current trend is to devise prediction models with an ever-increasing number of features, i.e., big models. For example, Google has reported models with billions of features for predicting ad click-through rates [27] as early as 2013. Feature vectors with 25 billion unigrams and 218 billion bigrams are constructed for text analysis of the English Wikipedia corpus in [20]. Big models also appear in recommender systems. Spotify applies low-rank matrix factorization (LMF) to 24 million users and 20 million songs(2), which leads to 4.4 billion features at a relatively small rank of 100.

Since HOGWILD! is an in-memory algorithm, it cannot handle these big models – models that go beyond the available memory of the system – directly. In truth, none of the analytics systems mentioned above support out-of-memory models because they represent the model as a single shared variable—not as a partitioned dataset, which is the case for the training data. The only exception is the sequential dot-product join operator introduced in [32], which represents the model as a relational table. Parameter Server [22] is an indirect approach that resorts to distributed shared memory. The big model is partitioned across several servers, with each server storing a sufficiently small model partition that fits in its local memory. In addition to the complexity incurred by model partitioning and replication across servers, Parameter Server also has a high cost in hardware and network traffic. While one can argue that memory will never be a problem in the cloud, this is not the case in IoT settings. The edge and fog computing paradigms(3) push processing to the devices acquiring the data, which have rather scarce resources and do not consider data transfer a viable alternative—for bandwidth and privacy reasons. Machine learning training in such an environment has to consider secondary storage, e.g., disk, SSD, and flash cards, for storing big models.

Problem. In this work, we investigate parallel stochastic optimization methods for big models that cannot fit in memory. Specifically, we focus on designing a scalable HOGWILD! algorithm. Our setting is a single multi-core server with attached storage, i.e., disk(s). There is a worker thread associated with each core in the system. The training data as well as the model are stored on disk and moved into memory only when accessed. Training data are partitioned into chunks that are accessed and processed as a unit. Several chunks are processed concurrently by multiple worker threads—data-parallel processing. While access to the training data follows a well-behaved sequential pattern, access to the model is unpredictable. Moreover, there are many model accesses for each training example—one for each non-zero entry in the example.

∗ Work mostly done while a Ph.D. student at UC Merced.
(1) https://mahout.apache.org
(2) http://slideshare.net/MrChrisJohnson/algorithmic-music-recommendations-at-spotify
(3) https://techcrunch.com/2016/08/02/how-fog-computing-pushes-iot-intelligence-to-the-edge/
Proceedings of the VLDB Endowment, Vol. 10, No. 10. Copyright 2017 VLDB Endowment 2150-8097/17/06.

Thus, the challenge in handling big models is how to efficiently schedule access to the model. In the worst case, each model access requires a disk access. This condition is worsened in data-parallel processing by the fact that multiple model accesses are made concurrently by the worker threads—model-parallel processing.

Approach. While extending HOGWILD! to disk-resident models is straightforward, designing a truly scalable algorithm that supports model- and data-parallel processing is a considerably more challenging task. At a high level, our approach targets the main source that impacts performance – the massive number of concurrent model accesses – with two classical database processing techniques—vertical partitioning and model access sharing.

The model is vertically partitioned offline based on the concept of "feature occurrence" – we say a feature "occurs" when it has a non-zero value in a training example – such that features that co-occur require a single model access. Feature co-occurrence is a common characteristic of big models in many analytics tasks. For example, textual features such as n-grams(4), which extract a contiguous sequence of n words from text, generate co-occurring features for commonly used sentences. Gene sequence patterns represent an even more widespread example in this category. It is important to notice that feature co-occurrence is fundamentally different from the feature correlation that standard feature engineering processes [24, 41, 6, 17] try to eliminate. In feature engineering, correlation between features is measured by coefficients such as Pearson's coefficient [13] instead of co-occurrence. In this work, we are interested exclusively in which features co-appear together. Thus, we refer to "feature co-occurrence" as "correlation".

During online training, access sharing is maximized at several stages in the processing hierarchy in order to reduce the number of disk-level model accesses. The data examples inside a chunk are logically partitioned vertically according to the model partitions generated offline. The goal of this stage is to cluster together accesses to model features even across examples—vertical partitioning achieves this only for the features that co-occur in the same example. In order to guarantee that access sharing occurs across partitions, we introduce a novel push-based mechanism that enforces sharing by vertical traversals of the example data and partial dot-product materialization. Workers preemptively push the features they acquire to all the other threads, asynchronously. This is done only for read accesses. The number of write accesses is minimized by executing model updates at batch level, rather than for every example. This technique, i.e., HogBatch [31, 35], is shown to dramatically increase the speedup of HOGWILD! – and its convergence – for memory-resident models because it eliminates the "pingpong" effect [35] on cache-coherent architectures.

Contributions. We design a scalable model- and data-parallel framework for parallelizing stochastic optimization algorithms over big models. The framework organizes processing in two separate stages – offline model partitioning and asynchronous online training – and brings the following major contributions:
• Formalize model partitioning as vertical partitioning and design a scalable frequency-based model vertical partitioning algorithm. The resulting partitions are mapped to a novel composite key-value storage scheme.
• Devise an asynchronous method to traverse the training examples in all the data partitions vertically, according to the model partitions generated offline.
• Design a push-based model sharing mechanism for incremental gradient computation based on partial dot-products.
• Implement the entire framework using User-Defined Aggregates (UDA), which provides generality across databases.
• Evaluate the framework for three analytics tasks over synthetic and real datasets. The results prove the scalability, the reduced overhead incurred by model partitioning, and the consistently superior performance of the framework over an optimized HOGWILD! extension to big models.

Outline. Preliminaries on stochastic optimization and vertical partitioning are presented in Section 2, while HOGWILD! is introduced in Section 3. The high-level approach of the proposed framework is presented in Section 4, while the details are given in Section 5 (offline stage) and Section 6 (online stage). Experimental results are included in Section 7, related work in Section 8, and concluding remarks and plans for future work in Section 9.

2. PRELIMINARIES
In this section, we give an overview of several topics relevant to the management and processing of big models. Specifically, we discuss gradient descent optimization as the state-of-the-art in big model training, key-value stores as the standard big model storage manager, and vertical partitioning.

Big model training. Consider the following optimization problem with a linearly separable objective function:

    \Lambda(w) = \min_{w \in \mathbb{R}^d} \sum_{i=1}^{N} f(w, x_i; y_i)    (1)

in which a d-dimensional model w has to be found such that the objective function is minimized. The constants x_i and y_i, 1 ≤ i ≤ N, correspond to the feature vector of the i-th data example and its scalar label, respectively.

Gradient descent represents the most popular method to solve the class of optimization problems given in Eq. (1). Gradient descent is an iterative optimization algorithm that starts from an arbitrary model w^(0) and computes new models w^(k+1) such that the objective function, a.k.a. the loss, decreases at every step, i.e., Λ(w^(k+1)) < Λ(w^(k)). The new models w^(k+1) are determined by moving along the opposite gradient direction. Formally, the gradient

    \nabla\Lambda(w) = \left[ \frac{\partial\Lambda(w)}{\partial w_1}, \dots, \frac{\partial\Lambda(w)}{\partial w_d} \right]

is a vector consisting of entries given by the partial derivative with respect to each dimension. The length of the move at a given iteration is known as the step size—denoted by α^(k). With these, we can write the recursive equation characterizing the gradient descent method:

    w^{(k+1)} = w^{(k)} - \alpha^{(k)} \nabla\Lambda\left(w^{(k)}\right)    (2)

To increase the number of steps taken in one iteration, stochastic gradient descent (SGD) estimates the gradient of Λ from a subset of the training dataset. Notice that the model update is applied only to indices with non-zero gradient, which correspond to the non-zero indices in the training example. In order for SGD to achieve convergence, the examples have to be processed in random order at each iteration. Parallelizing SGD is not straightforward because of a chain dependency on model updates, where the current gradient relies on the previous update. Two classes of parallel SGD algorithms stem from this problem. Model-merging SGD [43, 31] simply ignores the dependencies across data partitions and averages the partial models. HOGWILD! [28, 35] is a parallel lock-free SGD algorithm for shared-memory architectures that allows multiple threads to update a single shared model concurrently, without any synchronization primitives. The resulting non-determinism enhances randomness and guarantees convergence for sparse models.

(4) https://en.wikipedia.org/wiki/N-gram
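To make the update rule in Eq. (2) concrete, the following minimal sketch shows one serial SGD step over a sparse example. It is illustrative only: it assumes a squared-loss objective and a dense in-memory model, and the types and names (SparseExample, alpha) are our own, not code from the paper.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Sparse training example: (index, value) pairs plus a scalar label.
struct SparseExample {
    std::vector<std::pair<std::size_t, double>> features;
    double label;
};

// One SGD step with squared loss f(w, x; y) = (w.x - y)^2 / 2.
// Only the non-zero indices of the example are read and written,
// mirroring the observation that an update touches non-zero entries only.
void sgd_step(std::vector<double>& w, const SparseExample& ex, double alpha) {
    double dot = 0.0;                                 // w . x
    for (const auto& [idx, val] : ex.features) dot += w[idx] * val;
    const double grad_scale = dot - ex.label;         // dLoss / d(dot)
    for (const auto& [idx, val] : ex.features)
        w[idx] -= alpha * grad_scale * val;           // Eq. (2) restricted to non-zeros
}
```

HOGWILD! runs this per-example update from many threads over a shared w without locks; the algorithms below extend the same pattern to a disk-resident w.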

In big model gradient descent optimization, the model w is too large to fit in memory. There is no assumption about the relationship between d and N—the number of examples may or may not fit in memory. This makes gradient computation and model updates considerably more complicated. In [32], the out-of-core dot-product w · x_i is identified as the primordial operation for big model training. In order to compute the out-of-core dot-product, the model w is range-partitioned into pages on secondary storage. Each training example x_i accesses the model by requesting its corresponding pages. A two-stage solution is proposed to optimize the dot-product computation. In the reordering stage, groups of examples x_i are partially reordered to minimize the number of secondary storage accesses to model w. In the batch stage, the requests made to the same model partition are aggregated into a single mega-request. This reduces the number of calls to the model storage manager. However, the solution proposed in [32] has several limitations. First, the dot-product join operator is serial, with each dot-product w · x_i being computed sequentially. Second, range-based partitioning wastes memory since model entries are grouped together rather arbitrarily, based on their index. Finally, it is assumed that all the model pages accessed by an example x_i fit in memory. In this work, we address all these limitations and propose a scalable model- and data-parallel SGD algorithm. While our focus is on disk-resident models, a similar problem exists between main memory and cache—how to optimize model access to cache? We leave this topic for future work.

Key-value stores. Parameter Server [22] tackles the big model problem by partitioning the model w over the distributed memory of a cluster of "parameter servers". Essentially, the model is stored as (key, value) pairs inside a memory-resident distributed hash table, with the model index as key and the model value as value. This storage configuration allows the model to be accessed and modified by individual indices and avoids the drawback of wasting memory in range-based partitioning. The key-value representation is also more compact when storing sparse models, considering that only the non-zero indices are materialized.

In this work, we focus on disk-resident key-value stores instead of distributed in-memory hash tables. Disk-resident key-value stores implement optimizations such as log-structured merge (LSM) trees. LSM trees [29, 3] are data structures optimized for random writes. They defer and batch disk writes, cascading the changes from memory to disk in an efficient manner reminiscent of merge sort. The scalable write performance of LSM trees can be particularly useful for SGD since the model is updated frequently. Key-value stores also provide concurrent access to key-value pairs by applying sharding techniques on keys. Sharding splits the keys into different partitions in the buffer manager, allowing concurrent lock-free access to keys in distinct partitions. Key-value stores have a limited interface, where values are retrieved and updated using a simple get/put API. Even though get/put operations allow direct access to every index of the model w, a large number of these operations have to be performed for big models, which significantly amplifies the overhead of the API function calls. In this paper, we study how to exploit the advantages of disk-based key-value stores for big models while reducing the number of get/put calls.

Vertical partitioning. Vertical partitioning [16, 42] is a physical design technique to partition a given logical relation into a set of physical tables that contain only a subset of the columns. One extreme is the column-store, where each column of a table is stored as a partition. The other extreme is the row-store, where there is only one partition containing all the columns. The purpose of vertical partitioning is to improve I/O performance for queries that access only a subset of the columns of a wide table by reading smaller vertical partitions instead of the full table. Vertical partitioning algorithms seek to find the optimal partitioning scheme for a given workload consisting of a set of SQL statements. The drawback of vertical partitioning is that the workload has to be known beforehand and extra cost is incurred when different vertical partitions are joined back to reconstruct the original table. There has been an abundance of work proposing vertical partitioning algorithms for different scenarios. The survey and detailed comparison in [16] is an excellent reference. In this work, we propose to use vertical partitioning techniques in order to improve the I/O performance of disk-based key-value stores by reducing the number of get/put requests made to the model w. Moreover, different (key, value) pair partitions can be accessed concurrently.

3. HOGWILD! FOR BIG MODELS
In this section, we introduce the original shared-memory HOGWILD! algorithm [28] for stochastic gradient descent optimization and provide an immediate extension to big models. Then we identify the limitations of the straightforward algorithm and specify the challenges that have to be addressed to make it scalable.

Algorithm 1 HOGWILD!
1. for i = 1 to N do in parallel
2.    w^(k+1) ← w^(k) − α^(k) ∇Λ(w^(k), x_η(i); y_η(i))

HOGWILD! Conceptually, HOGWILD! is a very simple algorithm. It iterates over the examples x_i and applies the model update equation (2). However, this is done in parallel and the order η(i) in which examples are considered is random. Since there is no synchronization, the model at step k+1 can be written concurrently by multiple threads. This happens, though, only for the common non-zero entries—an example updates only its non-zero entries. As long as the examples are sparse relative to the number of features in the model, [28] proves that HOGWILD! achieves the same convergence as the serial implementation of SGD. Due to the complete lack of synchronization, it is expected that HOGWILD! achieves linear speedup. However, this is not the case on modern multi-core architectures because of the implicit cache-coherency mechanism that triggers the "pingpong" effect [35]. Fortunately, this effect can be eliminated by a simple modification to the algorithm—instead of updating model w for every example, we update it only once for a batch of examples. This technique is known as mini-batch SGD and requires each thread to make a local copy of the model w.

Algorithm 2 Big Model HOGWILD!
1.  for i = 1 to N do in parallel
2.     for each non-zero feature j ∈ {1, ..., d} in x_η(i) do
3.        get w^(k)[j]
4.        compute ∇Λ(w^(k)[j], x_η(i); y_η(i))
5.     end for
6.     w^(k+1) ← w^(k) − α^(k) ∇Λ(w^(k), x_η(i); y_η(i))
7.     for each non-zero feature j ∈ {1, ..., d} in x_η(i) do
8.        put w^(k+1)[j]
9.     end for
10. end for

Naive HOGWILD! for big models. In the case of big models, the vector w cannot be fully stored in memory—not to mention replicated across threads. Since only portions of the model are cached in memory, every model access is potentially a disk access.
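The per-feature get/put pattern of Algorithm 2 can be sketched against any key-value interface. The fragment below is a simplified, single-threaded illustration – not the paper's implementation – that substitutes a std::map for the disk-based key-value store and reuses the hypothetical SparseExample type and squared loss from the earlier sketch.

```cpp
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

struct SparseExample {
    std::vector<std::pair<std::size_t, double>> features;
    double label;
};

// Stand-in for a disk-based key-value store (e.g., an LSM tree): one
// get/put call per model index, which is exactly the pressure point
// discussed for the naive extension. Missing keys default to 0.0.
using KVStore = std::map<std::size_t, double>;

void naive_big_model_step(KVStore& store, const SparseExample& ex, double alpha) {
    // get phase: one request per non-zero feature.
    std::vector<double> local(ex.features.size());
    double dot = 0.0;
    for (std::size_t k = 0; k < ex.features.size(); ++k) {
        local[k] = store[ex.features[k].first];              // get w[j]
        dot += local[k] * ex.features[k].second;
    }
    // gradient and put phase: one more request per non-zero feature.
    const double grad_scale = dot - ex.label;
    for (std::size_t k = 0; k < ex.features.size(); ++k)
        store[ex.features[k].first] =
            local[k] - alpha * grad_scale * ex.features[k].second;  // put w[j]
}
```

With hundreds of non-zero features per example, these per-index calls dominate the cost; the composite storage scheme and push-based sharing introduced later are designed to remove most of them.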


Figure 1: High-level approach in the proposed scalable HOGWILD! framework for big models.

Thus, model accessing becomes a series of requests to the storage manager—instead of simple memory references. This raises the question of how to store and retrieve the model from disk, i.e., what is the optimal model storage manager? Since the model is accessed at index granularity, a key-value store, e.g., LevelDB, makes for the perfect storage manager—the model index is the key and the model value corresponds to the payload. While in-memory key-value stores have been used before for model access due to their simple get/put interface [22, 5], as far as we know, this is the first attempt to use a disk key-value store for model management.

The immediate extension of HOGWILD! to big models – shown in Algorithm 2 – requires a get and a put call for each non-zero feature in an example, hence the explicit loops over the features. Given that there are many such non-zero features for every example, this puts tremendous pressure on the key-value store. The relatively random distribution of non-zero features across examples worsens the condition—to the point where the implicit sharding characteristic of key-value stores becomes irrelevant. Moreover, considering that half of the requests are put, the number of merge operations incurred by the LSM tree is also high. While the number of put requests can be reduced with the mini-batch technique, this is hardly sufficient for big models because each get may require a disk access. As with any other storage manager, key-value stores resort to caching in order to reduce the latency of get/put operations. Thus, they are susceptible to cache thrashing because the order in which get/put requests are issued matters. The naive HOGWILD! extension does not consider the request order inside a thread or across threads. Finally, the complete independence between threads in HOGWILD! becomes a limitation in the case of big models because a model access is considerably more expensive than a memory reference. Copying a model index locally inside a thread becomes necessary—not an option specific to mini-batch SGD. This local copy provides an alternative for sharing that bypasses the key-value store and has the potential to eliminate the corresponding disk accesses from other threads.

4. SCALABLE HOGWILD!
We address the limitations of the naive HOGWILD! extension to big models by designing a novel model- and data-parallel SGD framework specifically optimized for disk-based key-value stores. In this section, we provide a high-level overview of the proposed approach. The details are presented in subsequent sections.

Correlated indices. We illustrate the main idea of the proposed framework based on the example depicted in Figure 1. The sparse training examples X = {x_11, ..., x_14, ..., x_31, ..., x_34} are organized in 3 partitions as shown in the figure. The dense big model w is stored on disk, with its working set W – consisting of indices 1, 4, 5, and 6, respectively – kept in memory. It can be seen that partitions have correlated features, i.e., indices that co-occur across (almost) all the examples in the partition. For example, indices 2 and 3 in partitions 1 and 3 co-occur in 3 examples, while indices 1 and 5 co-occur in all the examples of partition 2. While index correlation is emphasized in Figure 1, this is a rather common characteristic of big models across analytics tasks in many domains—several real examples are provided in Section 1.

We design the scalable HOGWILD! framework by taking advantage of index correlation in both how the model is stored on disk and how it is concurrently accessed across threads. There are two stages in the scalable HOGWILD! framework. The offline stage aims to identify index correlations in the training dataset in order to generate a correlation-aware model partitioning that minimizes the number of get/put requests to the key-value store. In the online training stage, the model partitioning is used to maximize access sharing across threads in order to reduce the number of disk-level model accesses. We illustrate how each stage works based on the example in Figure 1.

Offline model partitioning. Existing solutions do not consider index correlation for storing and organizing the model on disk. They fall into two categories. In the first case, each index is assigned an individual key that is retrieved using the key-value store get/put interface [22]. This corresponds to columnar partitioning and incurs all the limitations of the naive HOGWILD! extension. Range-based partitioning is the other alternative [32]. In this case, indices that are adjacent are grouped together. For example, the range-based partitioning of model w constrained to two indices per group yields the pairs {1, 2}, {3, 4}, and {5, 6}, respectively. However, this partitioning does not reflect the correlated indices in the training dataset, e.g., indices 2 and 3 still require two requests. An optimal partitioning of the model w groups indices {1, 5}, {2, 3}, and {4, 6}, so that indices 2 and 3 are accessed together, effectively minimizing the number of requests to the model. We propose a novel solution to find the optimal model partitioning by formalizing the problem as the well-known vertical partitioning problem [16] in storage access methods (Section 5).

Figure 2: The offline model vertical partitioning stage.

Asynchronous online training. Data and model partitioning facilitate completely asynchronous processing in online training—the partitions over examples x_i are processed concurrently and themselves access the model partitions generated in the offline stage concurrently. This HOGWILD! processing paradigm – while optimal in shared-memory settings – is limited by the disk access latency in the case of big models. Model partitioning alone reduces the number of accesses for the correlated indices inside a training example—partition 2 incurs a single access to indices 1 and 5 for each example in Figure 1. However, it does not address correlated index accesses across examples and across threads. In Figure 1, we illustrate the case when data examples x_11, x_21, and x_31 access the model w concurrently. The order in which requests are made plays an important role in enhancing cache locality. For example, if x_11 and x_21 request index 1 while x_31 requests index 2, the opportunity of sharing access to index 2 – used by all the examples – is completely incidental. The same reasoning also applies to the examples inside a partition. At first look, coordinating model access across partitions seems to require synchronization between the corresponding threads—exactly the opposite of the HOGWILD! approach. This is not the case. We devise an asynchronous solution that traverses the training examples in all the data partitions vertically, according to the model partitions generated offline. This enhances cache locality by allowing the same model index to be used by all the examples inside a partition. With this, partition 2 in Figure 1 incurs a single access to indices 1 and 5 for all the examples. In order to guarantee access sharing across partitions, we design an asynchronous mechanism in which partitions preemptively push the indices they acquire to all the other concurrent partitions. This is done only for get accesses. Following the example in Figure 1, when x_31 obtains index 2, it pushes it to x_11 and x_21, effectively eliminating their requests for index 2. The online stage is presented in Section 6.

Convergence considerations. We pay careful attention not to impact the convergence characteristics of the original HOGWILD! algorithm. Quite the opposite, the experimental results in Section 7 show that our use of correlation not only preserves the convergence of HOGWILD!, but improves it. This is in line with the results obtained in [40], where an orthogonal approach based on the concept of a conflict graph is proposed. Abstractly, model partitioning is equivalent to the conflict graph—a partition corresponds to a connected component in the graph. Both are considered as a unit in model updates. As long as there is no interaction between partitions, i.e., connected components, [40] proves that a higher convergence rate than HOGWILD! can be achieved. When that is not the case, we get the default HOGWILD! behavior. We improve upon this by applying mini-batch updates. Instead of updating the model for each example – extremely inefficient because of the massive number of expensive put calls – we execute a single put for a batch. In order to avoid local model staleness [35], we precede the put with a get to obtain the latest model. This technique is shown to drastically improve convergence over HOGWILD! [35]. To summarize, the proposed scalable HOGWILD! framework has the convergence guarantees given in [40] when partitions do not overlap and the HogBatch [35] behavior otherwise.

5. MODEL ACCESS MANAGEMENT
In this section, we present the offline model partitioning stage, which serves to identify correlated indices in the training examples. The goal is to generate a correlation-aware partitioning that minimizes the number of model requests, i.e., get/put calls, for the entire dataset. This translates into a corresponding reduction in the number of disk accesses. To this end, we propose a composite storage scheme that maps correlated indices into a single key in the key-value store. We find the correlated indices by formalizing model access as vertical partitioning. We design a scalable frequency-based vertical partitioning algorithm able to cope with the dimensionality of big models and the massive number of training examples. The complete process is illustrated in Figure 2.

Model storage. In a straightforward implementation, each index and value in the model is stored as a key-value pair. For example, model index 2 in Figure 1 is represented as the key-value pair (2 : 1.2), while model index 3 is mapped into (3 : 1.0). Each of them is accessed independently with separate get/put calls. We introduce a novel composite storage scheme that considers index correlation and stores multiple indices under the same key. Following the example, since model indices 2 and 3 are correlated, the key-value pairs (2 : 1.2) and (3 : 1.0) are grouped together and stored as a composite payload under the new key (a : (2 : 1.2), (3 : 1.0)). The key feature of the composite storage scheme is that model indices are not stored individually, but clustered based on the correlations among them. This composite key-value mapping reduces the number of get requests to the key-value store, thus the number of disk seeks. The tradeoff is an increase in the payload size. We introduce a vertical partitioning algorithm that quantifies this tradeoff when deciding whether to merge two indices. Although we propose the composite scheme for disk-based key-value stores, it may also be beneficial for in-memory hash tables since grouped indices are fetched to cache with a single instruction.
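A minimal sketch of the composite storage scheme follows, using an in-memory map as a stand-in for the key-value store. The index map and composite key layout mirror the running example above (indices 2 and 3 grouped under key "a"), but all type and helper names are illustrative, not the paper's code.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <unordered_map>
#include <utility>
#include <vector>

// Composite payload: the (original index, value) pairs stored under one key.
using CompositeValue = std::vector<std::pair<std::size_t, double>>;
// Stand-in for the disk-based key-value store.
using CompositeStore = std::map<std::uint64_t, CompositeValue>;
// Index map produced by the offline stage: original index -> composite key.
using IndexMap = std::unordered_map<std::size_t, std::uint64_t>;

// A single get on the composite key returns every correlated index at once.
double get_model_value(const CompositeStore& store, const IndexMap& index_map,
                       std::size_t original_index) {
    const std::uint64_t key = index_map.at(original_index);
    for (const auto& [idx, val] : store.at(key))
        if (idx == original_index) return val;
    return 0.0;  // index not materialized in the payload
}

int main() {
    // Offline output for the running example: indices 2 and 3 share key 0 ("a").
    IndexMap index_map{{2, 0}, {3, 0}};
    CompositeStore store;
    store[0] = {{2, 1.2}, {3, 1.0}};                  // one put instead of two
    double w2 = get_model_value(store, index_map, 2); // one get serves both 2 and 3
    (void)w2;
    return 0;
}
```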

Model vertical partitioning. Given the set of training examples, the purpose of model partitioning is to identify the optimal composite key-value scheme. Specifically, we have to determine the number of keys and the payload corresponding to each key. The output of model partitioning is an index map that assigns correlated original indices to the same new index. For example, in Figure 2, index 2 co-occurs with index 3 seven times and index 1 co-occurs with index 5 four times. In the index map, indices 1 and 5 are mapped to the same key a, while indices 2 and 3 are mapped to key b. The remaining indices, 4 and 6, are mapped individually to c and d, respectively, even though index 6 co-occurs four times with index 2. The strategy to manage the index map – which can be extremely large for big models – is determined by the model partitioning algorithm. We can rewrite the set of training examples based on the computed index map, as shown in Figure 3. Each training example x_i contains at most four keys, each of which may be a composite key. We emphasize that this rewriting is only logical—we do not create a new copy of the training data. The mapping is applied only during online training.

Our solution to model partitioning is inspired by vertical partitioning, where a similar problem is solved for optimal physical design at a much smaller scale—the number of columns in a table is rarely in the order of hundreds and the number of queries in the workload is rarely in the millions. Model partitioning can be mapped elegantly to vertical partitioning as follows. The training data correspond to the query workload, with each sparse example vector corresponding to a query and its non-zero indices to the accessed columns. The big model is equivalent to the logical relation that has to be partitioned. The difference is that the big model is essentially a one-row table that does not incur any join cost when partitioned. However, the cost of having small model partitions comes from the extra get requests made to the key-value store. As far as we know, we are the first to identify model partitioning as a vertical partitioning problem.

Algorithm 3 Model Vertical Partitioning
Require:
   Model index set I = {1, 2, ..., d}
   Sparse vector set X = {x_1, x_2, ..., x_n}
   Cost function cost(s) for a partition of size s
Ensure: Model partitions P = {P_1, P_2, ...}
1.  for each index i ∈ I do P_i ← {i}
    // Main loop
2.  while true do
    // Compute affinity matrix AF for partitions in P
3.     for each vector x_i ∈ X do
4.        Collect the partitions accessed by x_i in P_{x_i}
5.        Compute affinity AF[i][j] for all pairs (P_i, P_j) in P_{x_i}
6.     end for
7.     for each partition P_i ∈ P do
8.        Compute cost C_i ← AF[i][i] · cost(|P_i|)
    // Select the best pair of partitions to merge
9.     for each pair (P_i, P_j) in P do
10.       Compute cost C_ij for partition P_ij = P_i ∪ P_j:
             C_ij = (AF[i][i] + AF[j][j] − AF[i][j]) · cost(|P_ij|)
11.       Compute the reduction in cost if P_i and P_j are merged:
             Δ_ij = C_i + C_j − C_ij
12.    end for
13.    Pick the pair (P_i', P_j') with the largest reduction in cost Δ_i'j'
14.    if Δ_i'j' = 0 return partitioning P
15.    else merge P_i' and P_j'
16. end while

Vertical partitioning algorithm. We design a bottom-up model vertical partitioning algorithm [16, 42], depicted in Algorithm 3. It takes as input the model index set I, the sparse training examples X, and the cost function cost(s), which computes the access cost for a partition of size s. It returns a model partitioning that minimizes the overall model access cost for the training set X. The algorithm uses a bottom-up greedy strategy inspired by [14]. The main loop spans lines (2) to (16). Initially, each model partition contains a single index. At each iteration, the pair of partitions that generates the largest reduction in the model access cost is merged into a single partition. This pair is identified by examining all the possible pairs of partitions in the current partitioning. The process repeats until no pair of partitions can be found that reduces the overall access cost.

Computing the reduction in cost of merging two partitions requires a pass over the training vectors X. This is time-consuming when X contains a large number of examples—the case in analytics. Hill-Climb [14] alleviates this problem by pre-computing the cost for all O(2^d) possible partitions, where d is the dimensionality of the model. Considering the size of d for big models, this method is impractical. Instead of pre-computing all O(2^d) partitions, we compute an affinity matrix AF for the current partitioning at the beginning of each iteration (lines (3)–(6)). An entry AF[i][j] in the affinity matrix represents how many times partition i co-occurs with partition j in the vector set X. Thus, the affinity matrix AF is symmetric, with the diagonal value AF[i][i] representing how many times partition i appears. Using the affinity matrix, the subsequent computation of the merging cost can be performed without scanning the example set X for every candidate pair (lines (10)–(11)). This is an important optimization because of the size of X.

Cost function. Assuming model indices are read from disk with no caching, the cost of reading a single index i is c_i = t_s + t_u, where t_s is the disk seek time and t_u is the time to read a one-unit payload, i.e., one key-value pair. Assume also that only one disk seek is required per read. When indices i and j are grouped together under the same key, the cost of reading either of them is c = t_s + 2 · t_u. Essentially, combining two indices generates a payload with twice the size, while the disk seek time stays the same. Given a partition P_i, the cost of accessing it is cost(P_i) = t_s + |P_i| · t_u. This is the cost function used in the model vertical partitioning algorithm. Thus, the overall cost of accessing partition P_i across the set X is AF[i][i] · cost(|P_i|).

Sampling & frequency-based pruning. The time complexity of Algorithm 3 is O(N·d^2), where N is the number of examples in the vector set X and d is the dimensionality of the model. Given the scale of N and d, this complexity is infeasible for practical purposes. We propose a combined sampling and frequency-based pruning technique to reduce the number of examples and model indices considered by the algorithm. First, reservoir sampling [39] is applied to extract a sample of N' ≪ N examples. Second, frequency-based dimension pruning is executed over the samples. This involves computing the frequency with which each model index occurs in the training data. Only the top-k most frequent indices are preserved and used in model partitioning. The non-extracted indices are kept as individual partitions. The intuition behind pruning is that infrequent indices have little impact on reducing the partitioning cost. In the extreme case, an index that does not appear at all should not be considered for merging. Computing the top-k over big models is challenging due to the size of the domain, i.e., the dimensionality of the model. Exact computation requires disk access to preserve the counts for all the indices.
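To make the cost model concrete, the sketch below computes cost(P) = t_s + |P| · t_u and the merge gain Δ_ij used in Algorithm 3. The seek and transfer constants are illustrative placeholders, and the affinity counts would come from the pass over the sampled examples.

```cpp
#include <cstddef>

// Illustrative constants: disk seek time and per key-value transfer time (ms).
constexpr double kSeekMs = 10.0;   // t_s
constexpr double kUnitMs = 0.01;   // t_u

// cost(P) = t_s + |P| * t_u : one seek plus a payload proportional to the
// number of model indices stored under the composite key.
double partition_cost(std::size_t size) { return kSeekMs + size * kUnitMs; }

// Reduction in total access cost if partitions i and j are merged.
// af_ii / af_jj: how often each partition is accessed; af_ij: co-accesses.
double merge_gain(std::size_t size_i, std::size_t size_j,
                  double af_ii, double af_jj, double af_ij) {
    const double cost_i = af_ii * partition_cost(size_i);
    const double cost_j = af_jj * partition_cost(size_j);
    const double cost_ij =
        (af_ii + af_jj - af_ij) * partition_cost(size_i + size_j);
    return cost_i + cost_j - cost_ij;   // Delta_ij in Algorithm 3
}
```

The greedy loop simply evaluates merge_gain for every candidate pair and merges the pair with the largest positive gain, stopping when no positive gain remains.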


Figure 3: Online asynchronous training stage. Strikethrough dot-products are partial, while dashed indexes are pushed to other partitions.

A key-value store can be used for this purpose. Alternatively, approximate top-k methods based on sketches [34] have been shown to be accurate and efficient. This is what we use in our implementation.

6. ASYNCHRONOUS BIG MODEL SGD
In this section, we present how the scalable HOGWILD! framework performs asynchronous updates on vertically partitioned big models. Recall that the model indices are mapped to a smaller set of new keys (Figure 2). Two challenges have to be addressed. The first challenge is how to perform the SGD updates when the training data and the model have different representations. The index map built in the offline stage is used to translate between the two representations at runtime. The second major challenge is how to efficiently access the partitioned model such that disk requests are shared across partitions. We introduce a novel push-based mechanism that enforces sharing by vertical traversals of the example data and partial dot-product materialization.

On-the-fly data mapping. In order to accommodate the composite model storage scheme generated offline, we map the example vectors according to the index map. This is done on-the-fly, when the example is used in training. For example, in Figure 3, x_2 in partition 1, which originally has three non-zero indices (2 : 1), (3 : 4), and (6 : 2), respectively, is mapped into a vector with only two non-zero indices – (b : (2 : 1), (3 : 4)) and (d : (6 : 2)) – in which b is a composite key. Due to the correlation-aware model partitioning, the number of model requests made by example x_2 is effectively reduced by 1 with data mapping. During parallel processing, data mapping is executed concurrently across all the active partitions. Since data mapping is a read-only operation on the index map, a single copy is shared across partitions.

Concurrent model access. Supporting the asynchronous HOGWILD! access pattern – multiple data partitions request and update the model concurrently, without any synchronization – is considerably more difficult when the model is stored on disk than when it is in memory because the overhead of accessing distinct model indices differs dramatically. The overhead of accessing indices that are in the key-value store cache is considerably smaller than the overhead for indices on disk. We examine several alternatives to this problem. Batching and reordering have been proposed to improve disk-based model access locality in a serial scenario [32]. The idea is to buffer model requests and reorder them such that requests to the same model indices are grouped together in order to share model accesses. This technique not only reduces the number of model requests, but also improves cache locality. We can extend the technique to parallel settings by aggregating requests from multiple partitions and performing the serial algorithm. However, this introduces synchronization between partitions – fast partitions have to wait for slow partitions to finish their updates – and results in poor speedup—not much higher than the serial baseline. A more scalable strategy is to perform batching and reordering locally for each partition and allow the model requests to proceed concurrently, without synchronization. Although this alternative provides higher concurrency, it suffers from poor cache locality since the indices requested by distinct partitions can be arbitrarily different.

Push-based model access sharing. We propose a novel push-based sharing technique that supports scalable asynchronous model access while achieving good cache locality. When a partition acquires a model index, it preemptively pushes the corresponding value to all the active partitions, before they start their own requests. This is realized with an inverted index data structure. Pushing achieves better cache locality since, otherwise, the model value may have already been evicted from the key-value store cache by the time other partitions request it. Push-based sharing is an optimistic strategy that assumes the same model index is required by other partitions as well. When this is the case, it not only improves cache locality, but also saves the other partitions from issuing requests that increase contention on the key-value store. Reducing contention is a critical optimization in concurrent settings.

Inverted index. In order to further increase cache locality, a key-to-vector inverted index is built for each partition before it starts to request model indices. The inverted index is essentially a hash table having as key the model index and as value the example vectors that need the corresponding index. For example, in Figure 3, model index a is requested by examples x_1, x_3, and x_4 in partition 1, shown in the first entry of the inverted index for partition 1. The inverted index guarantees that each model value required by a partition is requested only once. Partitions start to access model values only after they build their inverted index. Notice that preemptive model pushing happens concurrently with inverted index building, i.e., while a partition is building the inverted index, some of the entries can be served by pushed model values from other partitions. Thus, highly-concurrent operations on the inverted index data structure are critical for achieving high speedup. We realize this with the concurrent cuckoo hash table [23] in our implementation.
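The sketch below illustrates the inverted-index idea: each partition records which of its examples need a given key, and a value obtained by any partition (via its own get or via a push from a peer) is applied to all waiting examples as a partial dot-product. It is a simplified, single-threaded model of the mechanism – the paper uses a concurrent cuckoo hash table shared across UDAs – and, for clarity, it treats every key as holding a single model value rather than a composite payload; all type and function names are illustrative.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Example rewritten in key space: key -> feature coefficient for w.x,
// plus the running dot-product and the number of keys still missing.
struct MappedExample {
    std::unordered_map<std::uint64_t, double> coefficients;
    double partial_dot = 0.0;
    std::size_t missing_keys = 0;
};

class PartitionState {
public:
    // Build the inverted index: key -> examples waiting for it.
    void add_example(std::size_t ex_id, const MappedExample& ex) {
        examples_[ex_id] = ex;
        examples_[ex_id].missing_keys = ex.coefficients.size();
        for (const auto& [key, coeff] : ex.coefficients)
            inverted_index_[key].push_back(ex_id);
    }

    // Called both for values this partition fetched itself and for values
    // preemptively pushed by other partitions; repeated keys are ignored.
    void serve_key(std::uint64_t key, double model_value) {
        auto it = inverted_index_.find(key);
        if (it == inverted_index_.end()) return;   // nobody (still) waits for it
        for (std::size_t ex_id : it->second) {
            MappedExample& ex = examples_[ex_id];
            ex.partial_dot += ex.coefficients.at(key) * model_value;
            --ex.missing_keys;   // when it reaches 0, the gradient can be computed
        }
        inverted_index_.erase(it);                 // each key is applied only once
    }

private:
    std::unordered_map<std::uint64_t, std::vector<std::size_t>> inverted_index_;
    std::unordered_map<std::size_t, MappedExample> examples_;
};
```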

We illustrate how the push-based model access strategy uses the inverted indexes based on the example in Figure 3. At the time instant shown in the figure, partitions 1, 2, and 3 have all finished data mapping and have started to access the model concurrently. Partitions 1 and 3 are ahead of partition 2 in that both of them have completely built their inverted indexes and started to request model values from the key-value store, while partition 2 has built an inverted index only for examples x_5 and x_6. Assume that partition 1 is the first to request the pair (a : (1 : 1.9), (5 : 0.7)) from the key-value store. It subsequently pushes this pair to partitions 2 and 3, both of which require key a. After getting key a, partition 3 skips it and requests key b instead. Meanwhile, partition 2 continues building its inverted index. A similar action happens when partition 3 gets key b. The pair (b : (2 : 1.2), (3 : 1.0)) is pushed to partitions 1 and 2, saving partition 1 from requesting key b again. The dashed arrows in Figure 3 denote the model pushing directions. Notice that, even though partition 2 has not finished building its inverted index, examples x_5 and x_6 are still able to process keys a and b, respectively.

Asynchronous model updates. While preemptive pushing increases concurrency, maintaining multiple copies of the model for each partition is impractical given the scale of big models. We avoid this by pre-aggregating the pushed model values for each example vector. This is possible due to the nature of analytics tasks, which have as primitive computation the dot-product between an example vector and the model [32], i.e., x_i · w. The dot-product is essentially a sum over the products of the non-zero indices in x_i and w. While [32] requires fetching into memory all the w indices corresponding to non-zero indices in x_i, in this work we support partial dot-product computation. Each example is attached a running dot-product that is incrementally computed as indices are acquired—or pushed by other partitions (Figure 3). Only when the dot-product is complete can the example proceed to gradient computation. Instead of applying the gradient to the model for each example vector, we accumulate the gradients within one partition and execute batch model updates, i.e., mini-batch SGD. An inherent issue with mini-batch SGD is that each partition can read stale model values which do not reflect the updates from other partitions [35]. The staleness is more problematic for big models because processing a batch takes longer. In order to reduce staleness, we introduce an additional get call before the model is pushed to the key-value store. This get is only for the model indices updated inside the batch and guarantees that correlated indices co-located in the same composite key – but not accessed by the batch – are not overwritten with stale values. This minor complication is the consequence of co-locating correlated indices. From experiments, we observe that the additional get has a major impact on convergence, without increasing the execution time.

7. EXPERIMENTAL EVALUATION
In this section, we evaluate the convergence rate, scalability, and efficiency of the proposed HOGWILD! framework on four datasets – two synthetic and two real – for three popular analytics tasks—support vector machines (SVM), logistic regression (LR), and low-rank matrix factorization (LMF). We also study the isolated effect of model partitioning and push-based sharing on the number of disk accesses. We take as a baseline for comparison the HogBatch extension of the naive HOGWILD! algorithm, which is shown to be considerably superior in practice [31, 35]. HogBatch buffers the gradient for a batch and executes a single model update, effectively reducing the number of put calls to one. Since this method is logically equivalent to our proposed solution, it allows for a more in-depth comparison for the same set of configuration parameters, e.g., partition size and step size. We denote HogBatch as KV (key-value) and the proposed method as KV-VP (key-value with vertical partitioning) throughout all the results. The serial dot-product join operator [32] corresponds to KV-VP-1. We do not consider distributed solutions based on Hadoop, Spark, or Parameter Server because our focus is on single-node shared-memory parallelism. Specifically, the experiments answer the following questions:

• What is the effect of model partitioning and push-based sharing on the convergence rate and the number of disk accesses as a function of the degree of parallelism available in the system?
• How do vertical model partitioning and push-based sharing improve runtime and reduce disk access with respect to the baseline HOGWILD!?
• How scalable is the proposed solution with the dimensionality of the model, the degree of parallelism, and the key-value store cache size?
• What is the sensitivity of offline model partitioning with respect to the model dimensionality, the size of the training dataset, and the frequency-based pruning?
• How much overhead does the key-value store incur compared to an in-memory implementation for small models?

7.1 Setup
Implementation. We implement the proposed framework as User-Defined Aggregates (UDA) in a parallel database system with extensive support for executing external user code. We extend the HOGWILD! UDAs from Bismarck [11] with support for big models and implement the runtime optimizations presented in Section 6. The database supports multi-thread parallelism and takes care automatically of all the aspects related to data partitioning, UDA scheduling, and resource allocation. The user has to provide only the UDA code containing the model to be trained and the example data. The model is stored in HyperLevelDB [10]—an embedded key-value store forked from LevelDB(5) with improved support for concurrency. The UDAs access HyperLevelDB through the standard get/put interface. There is a single HyperLevelDB instance shared across the HOGWILD! UDAs. This instance stores the shared model and manages access across all the HOGWILD! UDAs. The inverted indexes are implemented using the highly-concurrent cuckoo hashmap data structure [23]. Each HOGWILD! UDA has an associated inverted index that is accessible to the other HOGWILD! UDAs for push-based sharing. The offline model partitioning algorithm is also implemented using UDAs. A UDA computes approximate index frequencies using sketch synopses. These are fed into the UDA for vertical model partitioning. Essentially, our code is general enough to be executed by any database supporting UDAs.

System. We execute the experiments on a standard server running Ubuntu 14.04 SMP 64-bit with Linux kernel 3.13.0-43. The server has 2 AMD Opteron 6128 series 8-core processors – 16 cores – 28 GB of memory, and a 1 TB 7200 RPM SAS hard-drive. Each processor has 12 MB of L3 cache, while each core has 128 KB L1 and 512 KB L2 local caches. The average disk bandwidth is 120 MB/s.

Table 1: Datasets used in the experiments.
  Dataset     #Dims      #Examples   Size     Model
  skewed      1B         1M          4.5 GB   12 GB
  splice      13M        500K        30 GB    156 MB
  matrix      10Mx10K    300M        4.5 GB   80 GB
  MovieLens   6Kx4K      1M          24 MB    120 MB

Methodology. We perform all the experiments at least 3 times and report the average value as the result. Each task is run for 10 iterations and the reported time per iteration is the average value across the 10 iterations. The convergence rate is measured by the objective function value at the end of each iteration.

(5) https://rawgit.com/google/leveldb/master/doc/index.html

[Figure 4 spans this page. Rows, top to bottom: SVM on skewed, SVM on splice, LMF on matrix, LMF on MovieLens; columns (a)–(d) are described in the caption below.]

Figure 4: (a) Convergence over time. (b) Time per iteration. (c) Speedup over serial KV. (d) Number of key-value store requests.

994 objective function value at the end of each iteration. The time to Convergence rate. The convergence rate over time of the base- evaluate the objective function value is not included in the iteration line (KV) and the proposed scalable HOGWILD! (KV-VP) for three time. We always enforce data to be read from disk in the first itera- degrees of parallelism (1, 8, and 16) are depicted in the (a) subplots tion by cleaning the file system buffers before execution. Memory of the figure. KV-VP achieves similar – and even better – loss than constraints are enforced by limiting the HyperLevelDB cache size KV at each iteration. This is also true for all the thread configura- to 1 GB across all the executions. Even with this hard restriction, tions and is especially clear for splice, which does not achieve HyperLevelDB uses considerably more memory – up to double the convergence in the figure. Thus, parallelism does not seem to de- cache size – in the LSM-tree merging phase. This limitation guar- grade convergence. The reason is our mini-batch implementation antees that disk access is incurred for the large models used in the of SGD. While KV-VP converges faster as we increase the number experiments. While the small models can be cached entirely by of threads, this is not always true for KV – 16 threads are slower HyperLevelDB, in this case the results clearly show the benefit of than 8 threads – due to a larger time per iteration. KV-VP always reducing the number of model requests. outperforms KV for the same number of threads. Even more, KV- Measurements. We report the following measurements: VP-8 outperforms KV-16 in all the configurations. As a result, KV- VP achieves the same loss 2X faster for splice and matrix, • loss over time represents the objective value as a function of the 3X for skewed, and 7X faster for MovieLens. The gap is es- execution time. This measurement is taken for configurations pecially wide for MovieLens, where even the 1 thread KV-VP with 1, 8, and 16 threads. The loss over time characterizes the solution outperforms KV-16. This happens because LMF models convergence rate and how it changes with the number of threads. exhibit massive index correlation that is successfully exploited by The purpose of the experiments is to identify the difference be- vertical model partitioning—we generate matrix without corre- tween the proposed method and HogBatch for a given set of con- lation. The 1 thread results isolate perfectly the effect of model figuration parameters, not to find the optimal hyper-parameters partitioning since there is no sharing. When index correlation is that give the best convergence rate. high – the case for skewed and MovieLens – KV-VP signifi- • speedup is measured by dividing the execution time per iteration cantly outperforms KV. to the serial (1 thread) execution time for the baseline method. Time per iteration. The execution time per iteration is depicted We do this for configurations with 1, 2, 4, 8, and 16 threads, in the (b) subplots of Figure 4. As the number of threads doubles, respectively, and also report the average time per iteration sepa- the execution time roughly halves. This is generally true up to 8 rately. Speedup shows how scalable each method is. threads. At this point, the number of concurrent model requests • number of requests corresponds to the number of calls to the key- overwhelms HyperLevelDB. 
Measurements. We report the following measurements:
• loss over time represents the objective function value as a function of the execution time. This measurement is taken for configurations with 1, 8, and 16 threads. The loss over time characterizes the convergence rate and how it changes with the number of threads. The purpose of the experiments is to identify the difference between the proposed method and HogBatch for a given set of configuration parameters, not to find the optimal hyper-parameters that give the best convergence rate.
• speedup is measured as the ratio between the serial (1 thread) execution time per iteration of the baseline method and the execution time per iteration of the evaluated configuration; the exact formula is given below, after the Speedup discussion. We report this for configurations with 1, 2, 4, 8, and 16 threads, and also report the average time per iteration separately. Speedup shows how scalable each method is.
• number of requests corresponds to the number of calls to the key-value store, while number of reduced requests corresponds to the number of requests saved by vertical partitioning.
• model partitioning time quantifies the overhead of the offline vertical partitioning stage.

Datasets and tasks. We run experiments over four datasets—two synthetic and two real. Table 1 depicts their characteristics. skewed contains sparse example vectors with dimensionality 1 billion and non-zero entries at random indices. The frequency of the non-zero indices follows a zipfian (zipf) distribution with coefficient 1.0. The index and the vectors are randomly generated. We do not enforce any index correlation in generating the example vectors; however, the high-frequency indices are likely to be correlated. On average, there are 300 non-zero entries per vector. The size of the model is 12 GB. matrix is generated following the same process, with the additional constraint that there is at least one non-zero entry for each index—if there is no rating for a movie/song, then it can be removed from the data altogether. splice [1] is a real openly-available dataset for distributed gradient descent optimization with a much higher index frequency than skewed—close to uniform. MovieLens [11] is a small real dataset—both in terms of number of examples and number of dimensions. We emphasize that the model size, i.e., the dimensionality, is the main performance driver and the optimizations proposed in this paper target model access efficiency. The number of examples, i.e., the dataset size, has only a linear impact on execution time. We execute SVM and LR over skewed and splice – we include only the results for SVM due to lack of space; the LR results are similar – and LMF with rank 1000 over matrix and MovieLens. We choose the dataset-task combinations so that we cover a large set of configurations that illustrate the proposed optimizations.

7.2 Results
The results for online model training are depicted in Figure 4. We discuss the main findings for each measurement across all the experimental configurations to facilitate comparative analysis.

Convergence rate. The convergence rate over time of the baseline (KV) and the proposed scalable HOGWILD! (KV-VP) for three degrees of parallelism (1, 8, and 16 threads) is depicted in the (a) subplots of Figure 4. KV-VP achieves similar – and even better – loss than KV at each iteration. This holds for all the thread configurations and is especially clear for splice, which does not reach convergence within the figure. Thus, parallelism does not seem to degrade convergence. The reason is our mini-batch implementation of SGD. While KV-VP converges faster as we increase the number of threads, this is not always true for KV – 16 threads are slower than 8 threads – due to a larger time per iteration. KV-VP always outperforms KV for the same number of threads. Moreover, KV-VP-8 outperforms KV-16 in all the configurations. As a result, KV-VP achieves the same loss 2X faster for splice and matrix, 3X faster for skewed, and 7X faster for MovieLens. The gap is especially wide for MovieLens, where even the 1-thread KV-VP solution outperforms KV-16. This happens because LMF models exhibit massive index correlation that is successfully exploited by vertical model partitioning—we generate matrix without correlation. The 1-thread results isolate perfectly the effect of model partitioning since there is no sharing. When index correlation is high – the case for skewed and MovieLens – KV-VP significantly outperforms KV.

Time per iteration. The execution time per iteration is depicted in the (b) subplots of Figure 4. As the number of threads doubles, the execution time roughly halves. This is generally true up to 8 threads. Beyond that point, the number of concurrent model requests overwhelms HyperLevelDB. The effect is considerably stronger for the baseline solution—the time goes up at 16 threads, except for matrix. Due to the reduction in the number of model requests through push-based sharing, KV-VP-16 still manages to improve upon KV-VP-8—at a lower rate, though. The isolated impact of vertical model partitioning is best illustrated by the 1-thread execution time: in this case, KV-VP is exactly the same as KV, plus model partitioning. The more index correlation in the training examples, the higher the impact of model partitioning. skewed and MovieLens clearly exhibit this, with execution times that are as much as 3X faster for KV-VP than for KV.

Speedup. The speedup corresponding to the time per iteration is shown in the (c) subplots of Figure 4. The reference is the execution time of KV-1. The speedup ranges between 4X for splice and 14X (almost linear) for MovieLens. In the case of LMF, the number of model requests per example is two times the rank, e.g., 2000. While KV has to execute this number of requests for each training example, model partitioning reduces the requests to 2 – in the best scenario – for KV-VP. The resulting difference of roughly three orders of magnitude – probably less, on average – in the number of requests is the main reason for the (almost) linear MovieLens speedup. Push-based sharing is reflected in the KV-VP speedup over its serial instance KV-VP-1, which follows the same trend as the speedup over KV-1 and is in line with the execution time per iteration.
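Stated as a formula (notation ours), with T_m(t) denoting the average time per iteration of method m ∈ {KV, KV-VP} with t threads, the speedup plotted in the (c) subplots is

    \mathrm{speedup}_m(t) = \frac{T_{\text{KV}}(1)}{T_m(t)}

while the speedup of KV-VP over its own serial instance replaces the numerator with T_{KV-VP}(1).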
Number of key-value store requests. The effect of model access sharing across threads is depicted in the (d) subplots of Figure 4. Independent of the number of threads, KV issues the same number of requests; sharing is only incidental—in the HyperLevelDB buffer manager. The number of requests in KV-VP decreases as the number of threads increases. This shows that push-based sharing is able to reuse a model index across multiple threads. The reuse is larger for the correlated datasets skewed and MovieLens. The reduction in the number of requests between KV-1 and KV-VP-1 is entirely due to offline model vertical partitioning. It ranges between 4X and 100X, the latter for the highly-correlated MovieLens dataset.
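To make the request accounting above concrete, the sketch below shows one simplified way worker threads could reuse a model partition fetched by a peer instead of issuing their own key-value store request. It is illustrative only: the class, the FetchFromKeyValueStore stand-in, and the partition size are hypothetical, and the framework's actual mechanism is preemptive and push-based rather than the pull-style shared cache shown here.

    #include <atomic>
    #include <mutex>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Shared per-iteration cache: a partition fetched by one thread is visible
    // to all other threads, so only the first access pays a store request.
    class SharedPartitionCache {
     public:
      const std::vector<double>& Get(const std::string& key) {
        {
          std::lock_guard<std::mutex> lock(mutex_);
          auto it = cache_.find(key);
          if (it != cache_.end()) return it->second;   // reused: no new request
        }
        // Two threads may race here and both fetch; the duplicate is harmless.
        std::vector<double> value = FetchFromKeyValueStore(key);
        std::lock_guard<std::mutex> lock(mutex_);
        return cache_.emplace(key, std::move(value)).first->second;
      }

      long RequestsIssued() const { return requests_issued_.load(); }

     private:
      // Stand-in for a real key-value store read, e.g., db->Get(key).
      std::vector<double> FetchFromKeyValueStore(const std::string& /*key*/) {
        ++requests_issued_;                     // one key-value store request
        return std::vector<double>(1024, 0.0);  // dummy partition payload
      }

      std::mutex mutex_;
      std::unordered_map<std::string, std::vector<double>> cache_;
      std::atomic<long> requests_issued_{0};
    };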

Figure 5: Key-value store requests eliminated by KV-VP.
Figure 6: Key-value store cache size effect.
Figure 7: Key-value store overhead over in-memory.

Offline vertical partitioning impact on the number of key-value store requests. Figure 5 depicts the reduction in the number of model index requests for three values of K used in frequency-based pruning. The baseline is K = 0, which corresponds to no partitioning, i.e., each index is a separate key. The results show that the number of model requests is reduced by a very significant margin – as much as 4 · 10^7 – even when only the 100 most frequent indices are considered in partitioning. Thus, we use K = 100 to generate the model partitions for the experiments in Figure 4.

Key-value store cache size effect on big models. Figure 6 depicts the time per iteration as a function of the cache size for skewed and matrix – the big models in the experiments – with 16 threads. As expected, as the cache size increases, the execution time decreases because more data can be kept in memory. A small cache of 100 MB increases the KV execution time dramatically, especially for matrix. While a 10 GB cache provides improvement, the gain is relatively small—a factor of 2X or less.

Key-value store vs. in-memory implementation. We compare a memory-based key-value store with the HOGWILD! implementation in Bismarck [11] for the small models that fit in memory—splice and MovieLens. There is no difference in our code, except that the key-value store is mapped to tmpfs (https://en.wikipedia.org/wiki/Tmpfs)—a memory-based file system. The goal is to identify the overhead incurred by the function calls to the key-value store. Figure 7 depicts the time per iteration for 16 threads. While the execution time of both KV and KV-VP improves when compared to the disk-based key-value store (Figure 4), KV-VP is hardly slower than the in-memory HOGWILD! implementation. This proves that storing the model in an optimized key-value store incurs minimal overhead – less than 5% – over a stand-alone solution.

Table 2: Vertical partitioning time (in seconds).

  K      skewed   splice   matrix   MovieLens
  100    722      79       52       55
  500    2028     1404     224      181
  1000   3054     4075     344      600

Vertical partitioning time. The time to perform the offline vertical model partitioning algorithm (Algorithm 3) is shown in Table 2. It is important to emphasize that this algorithm is executed only once for a given dataset. Moreover, this happens offline, not during training. Since Algorithm 3 is iterative, the execution time depends on the number of iterations performed – which is strongly dependent on the training dataset – and on the number of samples N' considered by the algorithm. We observe experimentally that a 1% sample provides a good tradeoff between running time and partitioning quality. Table 2 shows that – as the pruning criterion is relaxed, i.e., K increases – the partitioning time increases. This is because the number of candidate partitions becomes larger. With the most restrictive pruning (K = 100), model partitioning takes roughly as long as one KV-VP-16 iteration.
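As a rough illustration of the frequency-based pruning step discussed above, the helper below counts index frequencies over a small (e.g., 1%) sample of the training examples and keeps the K most frequent indices as candidates for grouping into model partitions; all other indices remain individual keys. This is a sketch under our reading of the pruning criterion, not a reproduction of Algorithm 3, and the function name and signature are hypothetical.

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    // Each sampled example is represented by the list of its non-zero model indices.
    std::vector<int64_t> TopKFrequentIndices(
        const std::vector<std::vector<int64_t>>& sampled_examples, std::size_t K) {
      // Count how often each model index appears in the sample.
      std::unordered_map<int64_t, std::uint64_t> frequency;
      for (const auto& example : sampled_examples)
        for (int64_t index : example) ++frequency[index];

      // Keep only the K most frequent indices as partitioning candidates.
      std::vector<std::pair<int64_t, std::uint64_t>> counts(frequency.begin(),
                                                            frequency.end());
      std::size_t k = std::min(K, counts.size());
      std::partial_sort(counts.begin(), counts.begin() + k, counts.end(),
                        [](const auto& a, const auto& b) { return a.second > b.second; });

      std::vector<int64_t> candidates;
      candidates.reserve(k);
      for (std::size_t i = 0; i < k; ++i) candidates.push_back(counts[i].first);
      return candidates;
    }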
8. RELATED WORK
In-database analytics. There has been a sustained effort to add analytics functionality to traditional database servers over the past years. MADlib [15] is a library of analytics algorithms built on top of PostgreSQL. It implements the HOGWILD! algorithm for generalized linear models, such as LR and LMF, using the UDF-UDA extensions. Due to the strict PostgreSQL memory limitations, the size of the model – represented as the state of the UDA – cannot be larger than 1 GB. GLADE [31] follows a similar approach. Distributed learning frameworks such as MLlib [37] and Vowpal Wabbit [1] represent the model as a program variable and allow the user to fully manage its materialization. Thus, they cannot handle big models directly. The integration of relational join with gradient computation has been studied in [18, 19, 36]. In all these solutions, the model fits entirely in memory and they work exclusively for batch gradient descent (BGD), not SGD. The Gamma matrix [30] summarizes the training examples into a d × d matrix. While the Gamma matrix provides significant size reductions when d < N, it cannot be applied in our setting, where d is extremely large. Moreover, SGD does not benefit from the Gamma matrix because an update is applied for every example, while the Gamma matrix summarizes all the examples across each dimension.
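For context, one common formulation of the Gamma matrix (notation ours, following the description in [30]) aggregates the N training examples (x_i, y_i), with x_i ∈ R^d, into a single matrix:

    \Gamma = \sum_{i=1}^{N} z_i z_i^{T}, \qquad
    z_i = \begin{bmatrix} 1 \\ x_i \\ y_i \end{bmatrix} \in \mathbb{R}^{d+2}

Γ therefore stores N, the linear sums \sum_i x_i and \sum_i y_i, and the second-order sums \sum_i x_i x_i^T, \sum_i x_i y_i, and \sum_i y_i^2. Its O(d^2) size is what makes it impractical for the big models considered here, and its aggregation over all examples is why per-example SGD updates cannot exploit it.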

Big models. Parameter Server [21, 22] is the first system that addresses the big model analytics problem. Their approach is to partition the model across the distributed memory of multiple parameter servers—in charge of managing the model. In STRADS [20], the parameter servers are driving the computation, not the workers. The servers select the subset of the model to be updated in an iteration by each worker. Instead of partitioning the model across machines, we use secondary storage. Minimizing the number of secondary storage accesses is the equivalent of minimizing network traffic in Parameter Server. The dot-product computation between training data and big models stored on disk is considered in [32]. Batching and reordering are proposed to improve disk-based model access locality in a serial scenario. Moreover, model updates require atomic dot-product computation. The proposed framework is both model- and data-parallel, does not reorder the examples, and supports incremental dot-product evaluation.

Stochastic gradient descent. SGD is the most popular optimization method used to train analytics models. HOGWILD! [28] implements SGD by performing model updates concurrently and asynchronously, without locks. Due to this simplicity – and the near-linear speedup – HOGWILD! is widely used in many analytics tasks [33, 11, 25, 9, 8, 5]. HogBatch [35] provides an in-depth analysis of the asynchronous model updates in HOGWILD! and introduces an extension – mini-batch SGD – that is more scalable on cache-coherent architectures. This is our baseline for comparison. CYCLADES [40] partitions the training examples based on the conflict graph, which corresponds to a model partitioning. This reduces model update conflicts across data partitions and results in higher speedup and faster convergence. While the proposed framework also applies model partitioning, the method and the target are different—we reduce the number of model requests. Model averaging [43] is an alternative method to parallelize SGD that is applicable to distributed settings. Only HOGWILD! can be extended to big models. Averaging requires multiple model copies – one for each thread – that cannot all be cached in memory. A detailed experimental comparison of these methods is provided in [31].
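As a reminder of the HOGWILD! idea referenced above, the sketch below shows worker threads applying SGD steps to a single shared, dense, in-memory model without any synchronization. It is a minimal illustration of the lock-free update scheme, not the implementation from [28] or [35]; the squared loss and all names are our own choices.

    #include <cstddef>
    #include <functional>
    #include <thread>
    #include <utility>
    #include <vector>

    // A sparse training example: (index, value) pairs plus a label.
    struct Example {
      std::vector<std::pair<std::size_t, double>> features;
      double label;
    };

    // Each worker updates the shared model in place, without locks. The races on
    // individual components are formally data races, but the HOGWILD! analysis
    // shows they rarely hurt convergence when the updates are sparse.
    void Worker(std::vector<double>& model, const std::vector<Example>& data,
                double step) {
      for (const Example& ex : data) {
        double prediction = 0.0;
        for (const auto& [idx, val] : ex.features) prediction += model[idx] * val;
        double error = prediction - ex.label;   // gradient of a squared loss
        for (const auto& [idx, val] : ex.features)
          model[idx] -= step * error * val;     // unsynchronized update
      }
    }

    void TrainHogwild(std::vector<double>& model,
                      const std::vector<std::vector<Example>>& partitions,
                      double step) {
      std::vector<std::thread> workers;
      for (const auto& part : partitions)
        workers.emplace_back(Worker, std::ref(model), std::cref(part), step);
      for (auto& w : workers) w.join();
    }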
9. CONCLUSIONS AND FUTURE WORK
In this paper, we present a general framework for parallelizing stochastic optimization algorithms over big models that cannot fit in memory. We extend the lock-free HOGWILD!-family of algorithms to disk-resident models by vertically partitioning the model offline and asynchronously updating the resulting partitions online. The experimental results prove the scalability and the superior performance of the framework over an optimized HOGWILD! extension to big models. In future work, we plan to explore two directions. First, we will investigate how the offline model partitioning can be integrated with online training. Second, we plan to explore how similar techniques can be applied when moving data from memory to cache.

Acknowledgments. This work is supported by a U.S. Department of Energy Early Career Award (DOE Career). We want to thank the anonymous reviewers for their insightful comments that improved the quality of the paper significantly.
10. REFERENCES
[1] A. Agarwal et al. A Reliable Effective Terascale Linear Learning System. Journal of Machine Learning Research, 15(1):1111-1133, 2014.
[2] Z. Cai et al. Simulation of Database-Valued Markov Chains using SimSQL. In SIGMOD 2013.
[3] F. Chang et al. Bigtable: A Distributed Storage System for Structured Data. ACM TOCS, 26(2):4, 2008.
[4] Y. Cheng, C. Qin, and F. Rusu. GLADE: Big Data Analytics Made Easy. In SIGMOD 2012.
[5] T. Chilimbi, Y. Suzue et al. Project Adam: Building an Efficient and Scalable Deep Learning Training System. In OSDI 2014.
[6] D. Crankshaw, P. Bailis et al. The Missing Piece in Complex Analytics: Low Latency, Scalable Model Management and Serving with Velox. CoRR, abs/1409.3809, 2014.
[7] A. Crotty et al. Tupleware: Redefining Modern Analytics. CoRR, abs/1406.6667, 2014.
[8] J. Dean et al. Large Scale Distributed Deep Networks. In NIPS 2012.
[9] J. Duchi, M. I. Jordan, and B. McMahan. Estimation, Optimization, and Parallelism When Data Is Sparse. In NIPS 2013.
[10] R. Escriva, B. Wong, and E. G. Sirer. HyperDex: A Distributed, Searchable Key-Value Store. In SIGCOMM 2012.
[11] X. Feng, A. Kumar, B. Recht, and C. Ré. Towards a Unified Architecture for in-RDBMS Analytics. In SIGMOD 2012.
[12] A. Ghoting et al. SystemML: Declarative Machine Learning on MapReduce. In ICDE 2011.
[13] M. Hall. Correlation-Based Feature Selection for Discrete and Numeric Class Machine Learning. In ICML 2000.
[14] R. Hankins and J. M. Patel. Data Morphing: An Adaptive, Cache-Conscious Storage Technique. In VLDB 2003.
[15] J. Hellerstein et al. The MADlib Analytics Library: Or MAD Skills, the SQL. PVLDB, 5(12):1700-1711, 2012.
[16] A. Jindal, E. Palatinus, V. Pavlov, and J. Dittrich. A Comparison of Knives for Bread Slicing. In VLDB 2013.
[17] A. Kumar, R. McCann et al. Model Selection Management Systems: The Next Frontier of Advanced Analytics. SIGMOD Record, 44(4):17-22, 2015.
[18] A. Kumar, J. Naughton, and J. M. Patel. Learning Generalized Linear Models Over Normalized Data. In SIGMOD 2015.
[19] A. Kumar, J. Naughton, J. M. Patel, and X. Zhu. To Join or Not to Join? Thinking Twice about Joins before Feature Selection. In SIGMOD 2016.
[20] S. Lee, J. K. Kim, X. Zheng, Q. Ho, G. A. Gibson, and E. P. Xing. On Model Parallelization and Scheduling Strategies for Distributed Machine Learning. In NIPS 2015.
[21] M. Li, L. Zhou, Z. Yang, A. Li, F. Xia, D. G. Andersen, and A. Smola. Parameter Server for Distributed Machine Learning. In Big Learning NIPS Workshop 2013.
[22] M. Li et al. Scaling Distributed Machine Learning with the Parameter Server. In OSDI 2014.
[23] X. Li and D. Andersen et al. Algorithmic Improvements for Fast Concurrent Cuckoo Hashing. In EuroSys 2014.
[24] H. Liu and L. Yu. Toward Integrating Feature Selection Algorithms for Classification and Clustering. IEEE TKDE, 17(4):491-502, 2005.
[25] J. Liu, S. Wright, V. Bittorf, and S. Sridhar. An Asynchronous Parallel Stochastic Coordinate Descent Algorithm. In ICML 2014.
[26] Y. Low et al. GraphLab: A New Parallel Framework for Machine Learning. In UAI 2010.
[27] B. McMahan et al. Ad Click Prediction: A View from the Trenches. In KDD 2013.
[28] F. Niu, B. Recht, C. Ré, and S. J. Wright. A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. In NIPS 2011.
[29] P. O'Neil et al. The Log-Structured Merge-Tree (LSM-Tree). Acta Informatica, 33(4):351-385, 1996.
[30] C. Ordonez, Y. Zhang, and W. Cabrera. The Gamma Matrix to Summarize Dense and Sparse Data Sets for Big Data Analytics. IEEE TKDE, 28(7):1905-1918, 2016.
[31] C. Qin and F. Rusu. Speculative Approximations for Terascale Distributed Gradient Descent Optimization. In DanaC 2015.
[32] C. Qin and F. Rusu. Dot-Product Join: An Array-Relation Join Operator for Big Model Analytics. CoRR, abs/1602.08845, 2016.
[33] B. Recht, C. Ré, J. Tropp, and V. Bittorf. Factoring Non-Negative Matrices with Linear Programs. In NIPS 2012.
[34] P. Roy, A. Khan et al. Augmented Sketch: Faster and More Accurate Stream Processing. In SIGMOD 2016.
[35] S. Sallinen et al. High Performance Parallel Stochastic Gradient Descent in Shared Memory. In IPDPS 2016.
[36] M. Schleich, D. Olteanu, and R. Ciucanu. Learning Linear Regression Models over Factorized Joins. In SIGMOD 2016.
[37] E. Sparks et al. MLI: An API for Distributed Machine Learning. In ICDM 2013.
[38] A. Sujeeth et al. OptiML: An Implicitly Parallel Domain-Specific Language for Machine Learning. In ICML 2011.
[39] J. S. Vitter. Random Sampling with a Reservoir. ACM TOMS, 11(1):37-57, 1985.
[40] P. Xinghao, M. Lam et al. CYCLADES: Conflict-free Asynchronous Machine Learning. CoRR, abs/1605.09721, 2016.
[41] C. Zhang, A. Kumar, and C. Ré. Materialization Optimizations for Feature Selection Workloads. ACM TODS, 41(1):2, 2016.
[42] W. Zhao, Y. Cheng, and F. Rusu. Vertical Partitioning for Query Processing over Raw Data. In SSDBM 2015.
[43] M. Zinkevich, M. Weimer, A. Smola, and L. Li. Parallelized Stochastic Gradient Descent. In NIPS 2010.
