
Scaling Distributed Machine Learning with System and Algorithm Co-design

Mu Li

February 2017

School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213

Thesis Committee:
David Andersen, co-chair
Alexander Smola, co-chair
Jeffrey Dean
Barnabas Poczos
Ruslan Salakhutdinov

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Copyright © 2017 Mu Li

Keywords: Large Scale Machine Learning, Distributed System, Parameter Server, Distributed Optimization Method

Dedicated to my lovely wife, QQ.


Abstract

Due to the rapid growth of data and the ever increasing model complexity, which often manifests itself in a large number of model parameters, many important machine learning problems today cannot be solved efficiently by a single machine. Distributed optimization and inference are becoming ever more necessary for solving large scale machine learning problems in both academia and industry. However, obtaining an efficient distributed implementation of an algorithm is far from trivial. Both the intensive computational workloads and the volume of data communication demand careful design of distributed computing systems and distributed machine learning algorithms.

In this thesis, we focus on the co-design of distributed computing systems and distributed optimization algorithms that are specialized for large machine learning problems. In the first part, we propose two distributed computing frameworks: Parameter Server, a distributed machine learning framework that features efficient data communication between the machines, and MXNet, a multi-language library that aims to simplify the development of deep neural network algorithms. We have witnessed the wide adoption of the two proposed systems in the past two years. They have enabled, and will continue to enable, more people to harness the power of distributed computing to design efficient large-scale machine learning applications.

In the second part, we examine a number of distributed optimization problems in machine learning, leveraging the two computing platforms. We present new methods to accelerate the training process, such as data partitioning with better locality properties, communication-friendly optimization methods, and more compact statistical models. We implement the new algorithms on the two systems and test them on large scale real data sets. We successfully demonstrate that careful co-design of computing systems and learning algorithms can greatly accelerate large scale distributed machine learning.


Acknowledgments


Contents

1 Introduction
  1.1 Background
    1.1.1 Large Scale Models
    1.1.2 Distributed Computing
    1.1.3 Optimization Methods
  1.2 Thesis Statement
  1.3 Thesis Contributions
  1.4 Notations, Datasets and Computing Systems
    1.4.1 Notations
    1.4.2 Datasets
    1.4.3 Computing systems

I System

2 Preliminaries on Distributed Computing Systems
  2.1 Heterogeneous Computing
  2.2 Data center

3 Parameter Server: Scaling Distributed Machine Learning
  3.1 Introduction
    3.1.1 Engineering Challenges
    3.1.2 Our contribution
    3.1.3 Related Work
  3.2 Architecture
    3.2.1 (Key,Value) Vectors
    3.2.2 Range-based Push and Pull
    3.2.3 User-Defined Functions on the Server
    3.2.4 Asynchronous Tasks and Dependency
    3.2.5 Flexible Consistency
    3.2.6 User-defined Filters
  3.3 Implementation
    3.3.1 Vector Clock
    3.3.2 Messages
    3.3.3 Consistent Hashing
    3.3.4 Replication and Consistency
    3.3.5 Server Management
    3.3.6 Worker Management
  3.4 Evaluation
    3.4.1 Sparse Logistic Regression
    3.4.2 Latent Dirichlet Allocation
    3.4.3 Sketches

4 MXNet: a Flexible and Efficient Deep Learning Library
  4.1 Introduction
    4.1.1 Background
    4.1.2 Our contribution
  4.2 Front-End Programming Interface
  4.3 Back-End System
    4.3.1 Computation Graph
    4.3.2 Graph Transformation and Execution
  4.4 Data Communication
    4.4.1 Distributed Key-Value Store
    4.4.2 Implementation of KVStore
  4.5 Evaluation
    4.5.1 Multiple GPUs on a Single Machine
    4.5.2 Multiple GPUs on Multiple Machines
    4.5.3 Convergence
  4.6 Discussions

II Algorithm

5 Preliminaries on Optimization Methods for Machine Learning
  5.1 Optimization Methods
  5.2 Convergence Analysis
  5.3 Distributed Optimization
    5.3.1 Data Parallelism versus Model Parallelism
    5.3.2 Synchronous Update versus Asynchronous Update

6 DBPG: Delayed Block Proximal Gradient Method
  6.1 Introduction
  6.2 Delayed Block Proximal Gradient Method
    6.2.1 Proposed Algorithm
    6.2.2 Convergence Analysis
  6.3 Experiments
    6.3.1 Sparse Logistic Regression
    6.3.2 Reconstruction ICA
  6.4 Proof of Theorem 2

7 EMSO: Efficient Minibatch Training for Stochastic Optimization
  7.1 Introduction
    7.1.1 Problem formulation
    7.1.2 Minibatch Stochastic Gradient Descent
    7.1.3 Related Work and Discussion
    7.1.4 Our work
  7.2 Efficient Minibatch Training Algorithm
    7.2.1 Our algorithm
    7.2.2 Convergence Analysis
    7.2.3 Efficient Implementation
  7.3 Experiments
  7.4 Proof of Theorem 7

8 AdaDelay: Delay Adaptive Stochastic Optimization
  8.1 Introduction
  8.2 AdaDelay Algorithm
    8.2.1 Model Assumptions
    8.2.2 Algorithm
    8.2.3 Convergence Analysis
  8.3 Experiments
    8.3.1 Setup
    8.3.2 Results

9 Parsa: Data Partition via Submodular Approximation
  9.1 Introduction
  9.2 Problem Formulation
  9.3 Algorithm
    9.3.1 Partition the data vertex set U
    9.3.2 Partition the parameter vertex set V
  9.4 Efficient Implementation
    9.4.1 Find Solution to (9.6)
    9.4.2 Divide into Subgraphs
    9.4.3 Parallelization with Parameter Server
    9.4.4 Initialize the Neighbor Sets
  9.5 Experiments
    9.5.1 Setup
    9.5.2 Performance of Parsa

10 DiFacto: Scaling Distributed Factorization Machines
  10.1 Introduction
  10.2 Background
    10.2.1 Objectives
    10.2.2 Factorization Machine
  10.3 Statistical Model
    10.3.1 Memory Adaptive Constraints
    10.3.2 Sparse Regularization
    10.3.3 Frequency Adaptive Regularization
  10.4 Distributed Optimization
    10.4.1 Asynchronous Stochastic Gradient Descent
    10.4.2 Convergence Analysis
    10.4.3 Implementation
  10.5 Experiments
    10.5.1 Adaptive memory
    10.5.2 Fixed-point Compression
    10.5.3 Comparison with LibFM
    10.5.4 Scalability

11 Conclusion

Bibliography

List of Figures

1.1 Three aspects towards distributed large scale machine learning.
1.2 Machine learning algorithms studied in this thesis.
1.3 The size of training data for ad click estimation in a large Internet company from year 2010 to 2014.
1.4 The number of floating-point operations required for processing a single image using LeNet and several recent ImageNet challenge winners.
1.5 A distributed computing system with distributed memory.
1.6 A distributed computing system with shared memory.
1.7 The communication and synchronization overhead of Algorithm 1.
1.8 Connections between the proposed systems and the algorithms.

2.1 The architecture of CPU and GPU.
2.2 Typical capacity and bandwidth of system components.
2.3 Connecting 8 GPUs to 2 CPUs via two PCIe switches. Each solid line represents 16 lanes.
2.4 Multi-rooted tree topology for machine connections in a data center.
2.5 Cluster-level infrastructure.

3.1 Largest machine learning experiments conducted using different computing systems. Problems: blue circles, sparse logistic regression; red squares, latent variable graphical models; grey pentagons, deep networks.
3.2 Communication between several groups of workers in Parameter Server.
3.3 Parameters per worker node decreases with the number of workers.
3.4 Steps of Algorithm 2. Note that each worker node only caches its working set of parameters w.
3.5 Example of asynchronous processing of different tasks by the same node. Here iteration 12 depends on 11, and iterations 10 and 11 are independent.
3.6 Directed acyclic graphs for different consistency models. The size of the DAG increases with the delay.
3.7 Server node layout.
3.8 Servers generate replicas of key ranges. Left: a single worker. Right: multiple workers updating values simultaneously.
3.9 Time spent on computation and waiting (per worker) in sparse logistic regression.
3.10 Savings of outgoing network traffic. Left: per server. Right: per worker.
3.11 Distribution of log-likelihoods per worker as a function of time, in the setting of 1000 machines and 5 billion users.
3.12 Distribution of log-likelihoods per worker as a function of time, stratified by the number of iterations.
3.13 Convergence of log-likelihoods per worker, in the setting of 1000 and 6000 machines, 500 million users.

4.1 The NDArray interface in Python.
4.2 Example: define a multilayer perceptron using a symbol expression in MXNet.
4.3 Example: create and run a module in MXNet.
4.4 A partial computation graph for the forward and the backward of a fully connected neural network. Yellow circles and green rectangles represent data variables and operators, respectively. Arrows indicate data dependencies between variables and operators.
4.5 Run one SGD iteration with the key-value store.
4.6 Two-level parameter server for KVStore. The level-1 server nodes aggregate data over devices on the same machine, and the level-2 server nodes communicate data between machines.
4.7 Two-level parameter server for KVStore, where each device has a level-1 server.
4.8 The topology of GPU connections for P2.16xlarge. Each line indicates a PCIe 16x connection.
4.9 The communication cost and total cost of one SGD iteration on ResNet-152. Experiments are performed on a single machine, and the number of GPUs is 1, 2, 4, 8, or 16.
4.10 The communication cost of one SGD iteration for different numbers of machines (2, 4, 8, 16) and different numbers of GPUs per machine (1, 2, 4, 8, 16).
4.11 The communication cost and total cost of one SGD iteration. Experiments are performed on multiple machines (1, 2, 4, 8, 16), and the number of GPUs per machine is fixed to be 8.
4.12 Top-1 validation accuracy versus epoch for ResNet-152 on the ImageNet dataset. Each GPU uses batch size 32 and synchronized SGD is used.

6.1 Comparison between the Parameter Server implementation of Algorithm 5 and the Shotgun and CDN implementations.
6.2 Convergence of sparse logistic regression on the 636 TB CTRb dataset.
6.3 Time to reach the same convergence criteria under various allowed delays.
6.4 Percentage of coordinates skipped when using the KKT filters.
6.5 Speedup of Parameter Server when increasing the number of workers with a fixed number of servers. The dataset is 340 million examples sampled from CTRb.
6.6 Convergence of RICA on dataset ImageNet with different delays.
6.7 Speedup of Parameter Server when increasing the number of workers from 1 to 16 for RICA.
6.8 Comparison between computation time and total time for RICA.

7.1 Objective value versus minibatch size after a total of 10^7 examples are processed on a single node. Here CTRa is downsampled to 4 million examples due to the limited capacity of a single node.
7.2 Value of the objective function versus minibatch size after a total of 10^7 examples are processed on each machine.
7.3 Value of the objective function versus run time.
7.4 The fraction of synchronization cost as a function of minibatch size when using 12 machines.
7.5 Value of the objective function versus minibatch size. 12 machines are used. Left: the total number of examples is fixed to 5 × 10^6. Right: the runtime is fixed to 1000 seconds.
7.6 Value of the objective function versus run time for EMSO-CD and L-BFGS using different numbers of machines.

8.1 The first 3,000 observed delays on one server node.
8.2 Histogram of all observed delays.
8.3 Relative online LogLoss (% worsening) as a function of maximal delays (lower is better).
8.4 Relative test AUC (higher is better) as a function of maximal delays.
8.5 Relative test AUC (higher is better) as a function of maximal delays with the existence of stragglers.
8.6 The speedup of AdaDelay. The results of AsyncAdaGrad and AdaptiveRevision are almost identical to AdaDelay and therefore omitted.

9.1 The amount of network communication versus the size of data in a real text classification dataset for random partition.
9.2 The dependencies are modeled as a bipartite graph.
9.3 Each machine is assigned a server and a worker, and gets part of the vertex sets U and V. The inter-machine dependencies (edges) are highlighted and the communication costs for these three machines are 1, 3, and 3, respectively. Note that moving the 3rd vertex in V to either machine 0 or machine 1 can reduce the cost.
9.4 The data structure to store the vertex costs. It is an array with the i-th entry for vertex u_i, where assigned vertices are marked with gray color. The pointers and the doubly-linked list provide faster access to the data.
9.5 Visualization of Table 9.1.
9.6 Partition quality and runtime for different numbers of partitions.
9.7 Partition quality and runtime for different percentages of data used in initialization (single-thread implementation).
9.8 Partition quality and runtime for different percentages of data used in initialization (parallel implementation).
9.9 Speedup of Parsa when the number of machines increases (on dataset CTRa).

10.1 Number of non-zero entries in V.
10.2 Runtime for one iteration.
10.3 Relative test logloss compared to logistic regression (k = 0 and 0 relative loss).
10.4 Total data sent by workers in one iteration. The compression rates from 4-byte to 1-byte are 4.2x and 2.9x for Criteo and CTRa, respectively.
10.5 The relative test logloss compared to no fixed-point compression.
10.6 Comparison with LibFM on a single machine.
10.7 The speedup from 1 machine to 16 machines, where each machine runs 10 workers and 10 servers.

List of Tables

1.1 Statistics for typical parallel and distributed jobs.
1.2 Failure rate for machine learning jobs in a data center over a three-month period.
1.3 Notations used in the thesis.
1.4 The datasets for binary text classification.
1.5 The social network datasets.
1.6 Computing systems used in the experiments.

3.1 Attributes of distributed data analysis systems.
3.2 Comparison of performance between Parameter Server and other systems.
3.3 Insertion rates of distributed CountMin implemented with Parameter Server.

4.1 Comparison between the imperative and declarative paradigms.
4.2 Comparison between MXNet and other popular open-source ML libraries.

7.1 Evaluated algorithms.
7.2 Run time and speedup for EMSO-CD to reach the same value of the objective function when running on 5, 10 and 20 machines.

8.1 Total memory used by server nodes.

9.1 Improvements (%) compared to random partition on the maximal individual memory footprint Mmax, maximal individual traffic volume Tmax, and total traffic volume Tsum, together with running times (in sec) on 16 partitions. The best results are colored red and the second best green. Only 1% of CTRa is used.
9.2 Time in hours for solving ℓ1-regularized logistic regression. We run 45 data passes using 16 machines for the dataset CTRa.

Chapter 1

Introduction

Over the past decade, machine learning (ML) has prevailed in both industry and academia. ML has a wide range of applications in automated document analysis, natural language processing, voice recognition and computational advertising. Across the technological advances in the field of ML, there are two prevalent trends: the size of training data is getting larger ("bigger and bigger data"), and the statistical models are becoming more complicated ("deeper and deeper models"). In this thesis, we aim to address these two issues of large scale machine learning by exploring the co-design of distributed computing systems and distributed learning algorithms.

Both "big data" and "deep learning" significantly increase the computational cost of ML applications. For example, companies often need to train ML models on terabytes of business data [27]. A state-of-the-art image classifier usually employs hundreds of layers in a convolutional neural network model [64], in which processing a single image requires billions of floating-point operations. At such a large scale, even though the computing power of modern hardware grows exponentially, no single machine can finish the training tasks within a time frame that meets industrial demands.

Distributed computing is a common approach to tackle the problem of large scale machine learning. The basic idea of distributed computing is to partition the computational workload and assign different parts to different computing machines. The machines then coordinate to complete the task. Recently, thanks to the increasingly convenient access to public cloud services such as Amazon AWS [8], Google Cloud [61] and Microsoft Azure [100], using distributed computing to accelerate large scale machine learning has led to a surge of research interest in both academia and industry.

However, both the design of an efficient distributed computing system and the efficient implementation of ML algorithms on such a system are highly non-trivial. A major challenge comes from the communication cost in the distributed computing environment. In particular, both the iterative nature of many machine learning algorithms and the sheer size of the models and the training data require a large amount of communication between different machines in the training process. However, today, even in leading industrial data centers, both the network bandwidth and the communication latency between machines are at least 10 times worse than those within a single machine. This communication overhead is indeed the major bottleneck that prevents us from applying distributed computing to solve large scale machine learning problems.

There are also other challenges in using distributed computing, such as enhancing fault tolerance so that the computation is not interrupted when one or more machines break down; these challenges all need to be carefully addressed to make distributed computing practical for large scale machine learning.


Figure 1.1: Three aspects towards distributed large scale machine learning.

1.1 Background

In this thesis, we tackle the problem of large scale machine learning from three aspects: distributed computing systems, large scale models, and optimization methods. For a wide class of machine learning applications, we design new distributed computing frameworks, study new machine learning models, and propose new optimization methods to show that distributed machine learning can be made simple, fast, and scalable. As illustrated in Figure 1.1, these three aspects are closely connected. Next, we give a brief overview of each aspect and highlight the corresponding challenges.

1.1.1 Large Scale Models

The realm of machine learning is mainly divided into supervised and unsupervised learning. In supervised learning, the training data consists of pairs of input and output values, and the learning goal is to infer a mapping from input to output, which can be used for predicting the output for new input instances. An example of supervised learning is image classification, where one example of the training data is a pair of an input image and the corresponding output label, indicating the name of the object contained in the image. In unsupervised learning, the training data contains only the input values, and the learning goal is to find "interesting patterns" in the training data. One example of unsupervised learning is to cluster input data points such that the data points in the same cluster are more similar to each other than to points in different clusters.


Figure 1.2: Machine learning algorithms studied in this thesis: supervised learning (logistic regression, factorization machines, convolutional neural networks) and unsupervised learning (latent Dirichlet allocation), with the neural network models falling under deep learning.

In this thesis, we consider several representative ML algorithms in each category as motivating examples for our system and algorithm co-design. These applications range from simple linear models to complex neural network models with hundreds of layers. We list and classify these algorithms in Figure 1.2. One example application is logistic regression for click-through rate estimation, which aims to learn a map from an ad impression to the probability that the ad is clicked by an end user. Another example is latent Dirichlet allocation, which identifies the topics a user is interested in from the user's browsing history.

A recent breakthrough in both supervised and unsupervised machine learning is called deep learning. In conventional statistical models, examples of the training data are often represented as multi-dimensional vectors, and this representation is obtained from the raw data by feature extraction. For instance, we can use n-grams to encode word documents, or use scale-invariant feature transformations [92] to describe local features in images. Instead of using hand-crafted features, deep learning serves to automatically "learn" a feature representation from the raw data by training a multi-layer neural network.

Today, a common challenge across different machine learning problems is the rapidly increasing size of training data and the growing model complexity. For instance, a large Internet company wants to use one year's ad impression log [74] to train an ad-click predictor. The training data consists of trillions of examples, each of which is typically represented by a high-dimensional feature vector [27]. Figure 1.3 shows the size of training data for the problem of ad-click estimation in a large Internet company. Note that the size of the data almost doubled every year from 2010 to 2014. Many widely used machine learning algorithms process the training data in an iterative way. When the training data is at such a large scale, the amount of computing resources required is enormous. Moreover, billions of new ad impressions are generated every day.

Figure 1.3: The size of training data for ad click estimation in a large Internet company from year 2010 to 2014.

Figure 1.4: The number of floating-point operations required for processing a single image using LeNet and several recent ImageNet challenge winners (AlexNet, VGG-19, Inception, Inception v3, ResNet-152).

Incorporating real-time data in the training process improves ad-click prediction accuracy, but real-time large scale learning imposes even greater challenges [99].

Meanwhile, the large size of training data enables and encourages engineers to use more complex machine learning models to discover finer structures in the data. Take deep learning as an example: the model size, in terms of the depth of the neural networks, has been consistently increasing since the 1980s. In 1989, LeNet, a widely used convolutional neural network, had only 5 convolutional layers, while the recent ImageNet challenge winners [64, 134] employ hundreds of convolutional layers. More complex models are often associated with higher computational cost. Figure 1.4 shows that the number of floating-point operations required to process a single example increased from 10 million to over 10 billion over two decades. Besides the computational cost, deeper neural networks also bring more complex computational patterns: even evaluating a single example involves hundreds of tensor operations.

Therefore, the rapidly increasing amount of training data and the more complex machine learning models necessitate new solutions in the design of both computing systems and machine learning algorithms.

1.1.2 Distributed Computing

As shown in Figure 1.5, a distributed computing system consists of multiple nodes with computing power, and the nodes are connected through a communication network. Examples of distributed computing systems range from distributed scientific computing on supercomputers to distributed autonomous sensors for monitoring physical conditions. In this thesis, we focus on cluster computing, where actual computers are connected via local communication networks. Examples of cluster computing include campus clusters as well as public cloud services such as Amazon AWS [8], Google Cloud [61] and Microsoft Azure [100]. In cluster computing, the structure of the system and the network topology are known in advance.

Figure 1.5: A distributed computing system with distributed memory.
Figure 1.6: A distributed computing system with shared memory.

Distributed computing is different from a commonly used technology called shared-memory parallel computing, which assumes that the processors are located within a small distance of each other, so that they have access to the same piece of shared memory. In a distributed computing system, each computing machine has its own private memory, which cannot be directly accessed by another machine. The difference between distributed memory and shared memory can be seen from Figure 1.5 and Figure 1.6. Table 1.1 shows that compared to shared-memory parallel computing, distributed computing systems usually have many more processors and can handle more computational jobs simultaneously.

             parallel job     distributed job
# of CPUs    4                1000s
# of GPUs    8                100s
latency      100 ns           0.1 ms
bandwidth    400 Gbit/sec     10 Gbit/sec

Table 1.1: Statistics for typical parallel and distributed jobs.

For distributed computing, information exchange between machines is conducted over the communication network, which has limited bandwidth. Indeed, communication is one of the scarcest resources in a distributed computing system. Moreover, a machine may fail at any time, and a running job can be preempted. Such unreliability becomes worse when the number of machines and the size of the workload increase. These properties of distributed systems impose great challenges on developing efficient and reliable machine learning applications using distributed computing.

≈ #machine × time    # of jobs    failure rate
100 hours            13,187       7.8%
1,000 hours          1,366        13.7%
10,000 hours         77           24.7%

Table 1.2: Failure rate for machine learning jobs in a data center over a three-month period.

There are three properties that are most important for a distributed computing system: efficiency, fault tolerance, and ease of use.

Efficiency An efficient distributed computing system incurs a small communication cost and takes full advantage of the available computing power in the system. Apart from the actual computational cost, which is shared among multiple machines, distributed computing incurs the additional costs of communication overhead and machine synchronization. Compared to accessing memory within a single machine, both the communication latency and the communication bandwidth in a distributed computing system are significantly worse. For example, the latency of a main memory access is on the order of 100 ns, while the latency between machines in a data center is on the order of 0.1 ms to 1 ms. The memory bandwidth in a personal computer is around 400 Gbit/sec, while a typical network bandwidth provided by Amazon AWS is only 10 Gbit/sec. Moreover, this limited communication bandwidth is shared among all the computing machines and among all the running tasks.

In addition, the computing machines in the system may be very different, or they may run heterogeneous computational tasks, so some machines may finish their tasks faster than others. In this case, if machine synchronization is required to implement an algorithm, the computational power of the machines may not be fully used whenever there are stragglers in the system. Good system design reduces the communication overhead and the effects of machine synchronization.
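To put these numbers in perspective, here is a small back-of-the-envelope comparison; the one-billion-parameter model size is a hypothetical example, and the bandwidths are the nominal figures quoted above.

```python
# Rough, illustrative comparison of moving a 1-billion-parameter model (4 bytes
# per parameter) through main memory versus over a 10 Gbit/s data-center network.
model_bytes = 1_000_000_000 * 4
mem_bw = 400e9 / 8   # 400 Gbit/s memory bandwidth, in bytes/s
net_bw = 10e9 / 8    # 10 Gbit/s network bandwidth, in bytes/s

print(f"memory: {model_bytes / mem_bw:.2f} s, network: {model_bytes / net_bw:.2f} s")
# memory: 0.08 s, network: 3.20 s -- plus 0.1 ms to 1 ms of latency per network round trip
```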

Fault Tolerance In a distributed computing system, a single machine may fail at any time, and a running job can be preempted due to such a machine breakdown. Moreover, the number of failures increases with the number of machines in the system and with the size of the computational tasks. We collected the job logs from a computing cluster that runs production machine learning tasks in a large Internet company over a three-month period. Here, task failures are mostly due to preemption or machine breakdown. Table 1.2 shows the failure rates for tasks at various scales. Observe that the failure rate can be as high as 25% for large scale problems that need over 10 thousand machine-hours of computation. An important feature of a good distributed computing system is fault tolerance, namely its robustness to such failures.

Ease of use It is also desirable that the programming interface of a distributed system strikes a balance between simplicity and flexibility. On one hand, the interface should hide as many implementation details as possible from the application developers. The implementation details of using multiple machines include task partitioning and allocation, data communication, machine synchronization, and the fault tolerance mechanism. On the other hand, the interface should be flexible enough that developers can conveniently implement a wide range of algorithms on the system.

There exist different approaches to application programming interface (API) design for distributed computing systems. For example, MapReduce [41] is one of the most widely used frameworks. However, the synchronization forced at the end of each map and reduce cycle potentially limits the performance of iterative machine learning algorithms. Another example is the Message Passing Interface (MPI). It provides flexible routines for high performance data communication, yet it exposes developers to so many implementation details that reliable programming can be challenging.

1.1.3 Optimization Methods

In machine learning, training a statistical model can often be formulated as an optimization problem [143] of the following form:

\[
\underset{w \in \Omega}{\text{minimize}} \quad f(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w), \tag{1.1}
\]

where w denotes the model parameter and Ω denotes the parameter space. The function f_i is the objective function evaluated on the i-th example, describing how well the model w fits that particular example in the training data. Consider standard linear regression: the i-th example in the training data consists of a p-dimensional vector x_i and an output value y_i. The goal of model training is to find a p-dimensional model vector w so that, given a new example x, we can approximately predict the output value y with the value ⟨w, x⟩. In order to achieve a small prediction error, one possible objective function is the squared Euclidean distance between ⟨w, x_i⟩ and y_i, namely f_i(w) = ∥⟨w, x_i⟩ − y_i∥_2^2.

For a general objective function f_i, there is usually no explicit solution to the optimization problem. A common way to obtain a numerical solution is the iterative gradient method, which refines the model parameter w through multiple iterations of gradient updates. The basic idea is to start with an initial point w_0 ∈ Ω at t = 0, and then update the model as

\[
w_{t+1} = \operatorname{proj}_{\Omega}\Big[ w_t - H_t \sum_{i \in I_t} \partial f_i(w_t) \Big] \quad \text{for } I_t \subseteq \{1, \dots, n\}, \tag{1.2}
\]

where ∂f_i is the partial gradient of f_i with respect to the model parameter w, H_t is a p-by-p scaling matrix or a scalar, and the set I_t is the subset of example indices processed at iteration t. Different choices of H_t and I_t lead to different optimization methods. For instance, the standard gradient descent method is obtained by setting I_t = {1, . . . , n} and setting H_t to a constant scalar η. The iterations terminate when a stopping criterion is reached.

The bottleneck of iterative gradient methods is often the cost of calculating the gradients ∂f_i(w) in each iteration. It is possible to partition and share this computational workload among multiple machines.
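To make the update rule (1.2) concrete, the following sketch implements it for the linear regression objective above using NumPy. The function names, the constant step size, and the unconstrained parameter space are illustrative assumptions, not part of the thesis.

```python
import numpy as np

def grad_fi(w, x_i, y_i):
    """Gradient of f_i(w) = (<w, x_i> - y_i)^2 with respect to w."""
    return 2.0 * (x_i @ w - y_i) * x_i

def iterative_gradient(X, y, T=100, eta=0.01, batch=None, rng=np.random):
    """Update rule (1.2) with H_t = eta (a scalar) and Omega = R^p.
    batch=None gives standard gradient descent (I_t = {1, ..., n});
    a small batch size gives (minibatch) stochastic gradient descent."""
    n, p = X.shape
    w = np.zeros(p)
    for t in range(T):
        idx = np.arange(n) if batch is None else rng.choice(n, size=batch, replace=False)
        g = sum((grad_fi(w, X[i], y[i]) for i in idx), np.zeros(p))
        w = w - eta * g  # proj_Omega is the identity when Omega = R^p
    return w
```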

Algorithm 1 Distributed gradient-based optimization
 1: Initialize w_0 at every machine
 2: for t = 0, 1, ... do
 3:   Partition I_t = ∪_{k=1}^m I_{tk}
 4:   for k = 1, ..., m do in parallel
 5:     Compute g_t^(k) ← Σ_{i ∈ I_{tk}} ∂f_i(w_t) on machine k
 6:   end for
 7:   Aggregate g_t ← Σ_{k=1}^m g_t^(k) on machine 0
 8:   Update w_{t+1} ← w_t − H_t^{-1} g_t on machine 0
 9:   Broadcast w_{t+1} from machine 0 to all machines
10: end for

Figure 1.7: The communication and synchronization overhead of Algorithm 1.

Algorithm 1 sketches the approach called data parallelism. In each iteration, the training data is partitioned and assigned to different machines, and each machine only calculates the gradients of the objective function evaluated on its assigned training data.

One important measure of the performance of an iterative optimization method is its convergence rate, namely the amount of computing time required to achieve a desirable accuracy. There is a rich body of research on accelerating the vanilla gradient descent method. For example, stochastic gradient descent (SGD) [117] only samples a small subset of the training examples to form the index set I_t, so that each iteration takes much less time. Given the same total run time, this approach can afford more iterations to update the model parameter, and hopefully obtains a better value of the objective function.

When data parallelism is used, another measure of the performance of the optimization method is the communication cost, which includes the amount of communication required between the machines and the amount of time that the machines stay idle due to communication latency or system synchronization. For example, Figure 1.7 shows a possible distributed implementation of Algorithm 1. The problem with this straightforward implementation is that if the data is partitioned unevenly or one machine is significantly slower than the others, a large communication cost is incurred due to machine synchronization. A good distributed optimization method should achieve fast convergence with a small communication cost.
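A minimal single-process simulation of Algorithm 1 is sketched below; the list of index shards stands in for actual machines, and the static partition and constant step size are illustrative assumptions rather than the implementations developed in later chapters. The per-example gradient grad_fi can be, for example, the one sketched in the previous listing.

```python
import numpy as np

def distributed_gradient_descent(X, y, grad_fi, m=4, T=50, eta=0.01):
    """Data parallelism (Algorithm 1): each of m simulated machines computes a
    partial gradient over its own shard; machine 0 aggregates and updates w."""
    n, p = X.shape
    shards = np.array_split(np.arange(n), m)  # static partition of I_t
    w = np.zeros(p)
    for t in range(T):
        # Steps 4-6: partial gradients, computed "in parallel" on each machine.
        partial = [sum((grad_fi(w, X[i], y[i]) for i in shard), np.zeros(p))
                   for shard in shards]
        # Steps 7-8: aggregate on machine 0 and update with a scalar step size.
        g = np.sum(partial, axis=0)
        w = w - eta * g
        # Step 9: broadcast w_{t+1}; implicit here because everything shares memory.
    return w
```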

1.2 Thesis Statement

This thesis seeks to address the multifaceted challenges arising in distributed computing, optimization methods, and large scale models in order to make large scale distributed machine learning more accessible. In particular, this thesis provides evidence to support the following statement:

Thesis Statement: With appropriate computational frameworks and algorithm design, distributed machine learning can be made simple, fast, and scalable, both in theory and in practice.

We believe that the computational frameworks and the algorithmic ideas developed in this thesis will enable more people to take advantage of the power of distributed computing to develop efficient machine learning applications for solving large scale problems. All the code developed for this thesis is publicly available at https://github.com/dmlc/ under the Apache 2.0 license.

1.3 Thesis Contributions


Figure 1.8: Connections between the proposed systems and the algorithms

Our major contributions in these three areas are summarized below:

Distributed systems We design new distributed computing systems that are optimized for machine learning tasks. Our systems provide a high-level system abstraction so that developers can focus on designing machine learning algorithms without worrying about the implementation details. We demonstrate that our systems are highly efficient and scalable.

Optimization methods We propose system-friendly optimization methods that can be easily parallelized and implemented in a distributed computing system. We show that the methods achieve fast convergence with reduced communication overhead.

Large scale models Given a fixed amount of training data, a complex model may create the problem of overfitting, which is often resolved by model parameter regularization. In a distributed computing environment, a complex model also introduces a large amount of communication overhead between machines. We explore the approach of regularization to design large scale models with sparse parameters, which are further exploited to reduce the communication overhead as well as the computational cost.

Next, we briefly describe the tentative structure of the thesis. The thesis is mainly divided into two parts: distributed computing systems (Chapters 2 to 4) and distributed optimization methods (Chapters 5 to 10). Figure 1.8 visualizes the connections between the different chapters.

Part I In the first part of the thesis, we introduce two computing frameworks designed for large scale distributed machine learning: a new generation of Parameter Server, and MXNet. The key features of these two systems are summarized below.

Parameter Server (PS) PS is a general purpose distributed machine learning framework. Compared to existing frameworks, it has the following prominent features:
• Asynchronous communication, which is optimized for machine learning tasks to reduce network communication overhead.
• A flexible consistency model, which further lowers the synchronization cost and communication latency.
• Elastic scalability, which enables adding new machines to the system without restarting the running framework.
• Continuous fault tolerance, which prevents non-catastrophic machine failures from interrupting the overall computation process.
• An efficient implementation of vector clocks, which ensures well-defined behavior after network failures.
This is joint work with David Andersen, Alex Smola, and Junwoo Park from CMU, together with Amr Ahmed, Vanja Josifovski, James Long, Eugene Shekita and Bor-Yiing Su from Google. Part of the work has been published in OSDI'14 [83].

MXNet MXNet is a multi-language library aiming to simplify the algorithm development for large scale deep neural networks. It outperforms many existing platforms by exploiting the following features:
• A mixed interface with imperative programming and symbolic programming to achieve both flexibility and efficiency.
• A compiler-like back-end system that efficiently optimizes workloads.
• A more general asynchronous execution engine that extends Parameter Server's asynchronous communication model to alleviate the data dependencies in deep neural networks.
• More flexible ways of using heterogeneous computing.
This is joint work with a large number of collaborators from different universities and companies, including Tianqi Chen (U. Washington), Yutian Li (Stanford), Min Li (NUS), Naiyan Wang (TuSimple), Minjie Wang (NYU), Tianjun Xiao (Microsoft), Bing Xu (Apple), Chiyuan Zhang (MIT), and Zheng Zhang (NYU Shanghai). Part of the work has been published in the learning systems workshop at NIPS'16 [31].

Part II In the second part of the thesis, we present new distributed optimization algorithms for several machine learning problems. We implement the algorithms on either PS or MXNet and demonstrate that careful co-design of computing systems and optimization algorithms can greatly accelerate large scale distributed machine learning.

We first study the problem of speeding up coordinate descent and stochastic gradient descent (SGD), which are widely used optimization methods in distributed computing environments.

DBPG To speed up coordinate descent, we propose a new algorithm named DBPG based on the proximal gradient method to solve non-convex and non-smooth problems. It updates parameters in a blockwise style and allows delays between blocks to reduce the synchronization cost. Theoretical analysis shows that the algorithm converges under weak assumptions. This is joint work with David Andersen and Alexander Smola, together with Kai Yu from Baidu. The results were published in NIPS'14 [84].

EMSO To speed up SGD in parallel computing, minibatch training has been used to reduce the communication cost, at the price of a slower convergence rate. We propose a new minibatch training algorithm called EMSO. Instead of just running gradient descent on each minibatch, for each minibatch the algorithm solves an optimization problem with a conservatively regularized objective function. We show that this more efficient use of minibatches speeds up convergence while maintaining a low communication cost. This is joint work with Alexander Smola, together with Tong Zhang and Yuqiang Chen from Baidu. The results were published in KDD'14 [85].

AdaDelay To speed up asynchronous SGD, we propose a new algorithm, AdaDelay, which allows the parameter updates to be sensitive to the actual delays experienced, rather than to worst-case bounds on the maximum delay. We show that this delay-sensitive update rule leads to larger stepsizes that help gain rapid initial convergence without having to wait too long for slower machines, while maintaining the same asymptotic complexity. This is joint work with Suvrit Sra from MIT, together with Adams Yu and Alexander Smola from CMU, and the results were published in AISTATS'16 [127].

We then study the problem of using data partitioning to reduce the communication and synchronization cost.

Parsa We formulate data placement as a submodular load-balancing problem and propose a parallel partition algorithm named Parsa to solve it approximately. We show that with high probability the objective function is at least n/log(n) of that with the best partition. The runtime of the algorithm is on the order of O(k|E|), where k is the number of partitions and |E| is the number of edges in the graph. This is joint work with David Andersen and Alexander Smola, and the results appeared on arXiv [86].

Finally, we study a promising nonlinear model for recommendation and estimation: factorization machines. It has been shown that factorization machines can achieve much better performance than simple linear models. However, this complex model increases the computation cost by at least an order of magnitude, which is the main barrier to its application in practice.

DiFacto To make factorization machines scale to large amounts of data and large numbers of features, we propose a new algorithm, DiFacto, which uses a refined factorization machine model with sparse memory adaptive constraints and frequency adaptive regularization. This is joint work with Ziqi Liu, Alexander Smola, and Yu-Xiang Wang. The results were published in WSDM'16 [87].

1.4 Notations, Datasets and Computing Systems

1.4.1 Notations

The notations used in this thesis are listed in Table 1.3.

f            objective function
(x_i, y_i)   i-th data example
w            model parameter
g            gradient
[v]_i        the i-th coordinate of vector v
n            number of examples
p            number of parameters
m            number of machines
T            number of iterations

Table 1.3: Notations used in the thesis.

1.4.2 Datasets

We used the following publicly available datasets in our experiments.

Binary text classification

RCV1 (1): The documents come from Reuters Corpus Volume 1.
News20 (2): The documents come from 20 newsgroups.
KDD04 (3): The particle physics task in KDD Cup 2004. The goal is to classify two types of particles generated in high energy collider experiments.
KDD14 (4): The first problem in KDD Cup 2010. The goal is to predict student performance on mathematical problems from logs of student interaction with Intelligent Tutoring Systems.
URL (5): The goal is to detect malicious URLs.
Criteo (6): The goal is to estimate the click-through rate of ads that come from Criteo.
CTR: Similar to Criteo. The dataset comes from a private Internet company. CTRa is sampled from a three-month period, while CTRb is sampled from a two-year period.

One-hot encoding is used for all datasets. More specifically, we form a dictionary using all the unique features. Then each example in the dataset is represented by a vector x whose length is equal to the size of the dictionary. The element x_i = 1 if the i-th word in the dictionary appears in this record, and x_i = 0 otherwise (a short code sketch of this sparse representation appears at the end of this subsection). Note that the size of the dictionary can be huge, yet the number of distinct features that appear in one example is usually very small. In other words, the datasets are often high dimensional, but extremely sparse. We list the statistics of these datasets in Table 1.4.

name      examples   unique features   non-zero entries
RCV1      20 K       47 K              1 M
News20    20 K       1 M               9 M
KDD04     146 K      74                11 M
KDD14     8 M        20 M              305 M
URL       2.4 M      3.2 M             277 M
Criteo    1.9 B      360 M             58 B
CTRa      100 M      283 M             10 B
CTRb      170 B      65 B              17 T

Table 1.4: The datasets for binary text classification.

Social Networks We have two datasets of the social networks LiveJournal and Orkut, obtained from http://snap.stanford.edu/data/. We list the statistics of these two datasets in Table 1.5.

name          nodes    edges    type
LiveJournal   5 M      69 M     directed
Orkut         3 M      113 M    undirected

Table 1.5: The social network datasets.

Image Classification We use the ImageNet competition 2012 (7) dataset, which contains 1.3M images from 1,000 classes.

1. http://www.daviddlewis.com/resources/testcollections/rcv1/
2. http://qwone.com/~jason/20Newsgroups/
3. http://osmot.cs.cornell.edu/kddcup/datasets.html
4. https://pslcdatashop.web.cmu.edu/KDDCup/
5. http://sysnet.ucsd.edu/projects/url/
6. http://labs.criteo.com/downloads/download-terabyte-click-logs/
7. http://image-net.org/challenges/LSVRC/2012/
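Returning to the one-hot encoding described above: because each example is a very long but extremely sparse vector, in practice only the indices of the non-zero entries are stored. The following sketch illustrates this with a SciPy CSR matrix; the tiny example records and the dictionary-building step are hypothetical, not the feature extraction pipelines used for the datasets above.

```python
from scipy.sparse import csr_matrix

records = [["sports", "ball", "game"],     # hypothetical tokenized examples
           ["politics", "election"],
           ["sports", "election"]]

# Build the feature dictionary from all unique features.
dictionary = {feat: j for j, feat in enumerate(sorted({f for r in records for f in r}))}

# Each example becomes a row with x_i = 1 for the features it contains.
rows, cols = [], []
for i, r in enumerate(records):
    for f in r:
        rows.append(i)
        cols.append(dictionary[f])
X = csr_matrix(([1.0] * len(cols), (rows, cols)),
               shape=(len(records), len(dictionary)))
# X stores only the non-zero entries, even if the dictionary has billions of features.
```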

13 January 4, 2017 DRAFT name CPU GPU memory network # of machines (GB) (Gbit/s) CompanyA 2×Intel Xeon series - 192 10 1000 CompanyB 2×Intel Xeon series - 128 ≥ 10 5000 CampusA 2×Intel Xeon E5620 - 64 1 16 CampusB 4×AMD Opteron 6272 Tesla K20 128 40 36 CampusC 2×Intel Xeon E5-2680 v2 2×GTX 980 128 40 10 Desktop Intel i7-2600 GTX 750 TI 32 1 1 EC2-g2.8x 2×Intel Xeon E5-2686 V4 8×Tesla K80 768 25 16 EC2-c4.8x 2×Intel Xeon E5-2666 V3 - 64 10 10

Table 1.6: Computing systems used in the experiments.

1.4.3 Computing systems In Table 1.6, we list the specifications of the diverse clusters used in the experiments. They range from campus clusters to public and private cloud services.


Part I

System


Chapter 2

Preliminaries on Distributed Computing Systems

In this chapter we provide some background on computing systems that are closely related to our proposed systems. In particular, we discuss heterogeneous computing and data centers. Heterogeneous computing is widely used to accelerate computation intensive workloads such as deep neural networks, and data centers are where most distributed machine learning applications run. All the systems and algorithms proposed in this thesis are evaluated in these two computing environments.

2.1 Heterogeneous Computing

In a distributed computing system, for the sake of performance or energy efficiency, it is often desirable to include different types of co-processors instead of using only one type of processor. Heterogeneous computing [121] refers to distributed computing systems that use more than one kind of processor or core. Recently, it has been widely used in supercomputing and cloud computing services. For example, all of the current top 3 supercomputers [3] are equipped with co-processors: Sunway TaihuLight uses on-board slave cores, Tianhe-2 uses Intel Xeon Phi, and Titan uses NVIDIA K20x. Cloud computing providers including Amazon Web Services and Microsoft Azure also provide instances equipped with both GPUs and CPUs. In the rest of this section, we use systems equipped with GPUs as the example of heterogeneous computing. Note that the techniques discussed here can be extended to other co-processors such as Xeon Phi.

The purpose of adding GPUs to the system is to leverage their computational power, as the computing power of GPUs far exceeds that of standard CPUs. For example, the Pascal NVIDIA TITAN X provides 11 TFLOPS [1], while the peak performance of high-end CPUs, such as the Intel Xeon E5-2699 v4, is still less than 1 TFLOPS [2]. In terms of computer architecture, GPUs and CPUs are very dissimilar processors. As shown in Figure 2.1, in a CPU more than half of the chip is used for cache and control units, which makes CPUs suitable for workloads with complex logic and irregular memory access patterns. In contrast, the major chip area of a GPU is dedicated to arithmetic and logical operations. As a result, a GPU can often afford hundreds of computing threads, compared to just tens of threads for a CPU. Therefore GPUs fit computation-intensive and highly parallel workloads well.

Figure 2.1: The architecture of a CPU and a GPU.

Figure 2.2: Typical capacity and bandwidth of system components: main memory (100 GB, DDR3, 50 GB/s), GPU memory (10 GB, DDR5, 200 GB/s), the CPU-GPU link (PCIe 3.0 x16, 16 GB/s), SSD (1 TB, 500 MB/s), HDD (10 TB, 100 MB/s), and 10 Gbit/sec Ethernet.

In heterogeneous computing, GPUs are usually connected to CPUs by PCI Express (PCIe). A single lane (x1 connection) of PCIe 3.0 contains two pairs of wires, one for sending and one for receiving, with a bandwidth of 0.985 GB/sec in each direction. When using a x16 configuration, a GPU can communicate with another device at a bandwidth of 15.75 GB/sec in each direction. Figure 2.2 shows how a GPU and other system components are connected in a typical heterogeneous computing system. First note that accessing a GPU via PCIe is more than 10 times faster than accessing the disk or the network, yet it is still significantly slower than accessing the main memory. Also, a GPU is often equipped with its own memory, which its cores can access with high bandwidth, typically about 4 times that of the main memory.

Standard CPUs only provide a limited number of PCIe lanes. For example, the Intel Xeon E5-2600 v3 generation supports at most 40 lanes. In order to connect multiple GPUs to a CPU simultaneously and utilize the maximal bandwidth, we often use PCIe switches. Figure 2.3 shows the connection between 2 PCIe switches and 8 GPUs. Each PCIe switch connects 4 GPUs to 1 CPU and uses 16 PCIe lanes for each connection. The two CPUs are then connected by QPI, which has a bandwidth similar to that of PCIe. Note that GPUs connected to the same PCIe switch enjoy full bidirectional bandwidth for peer-to-peer communication. However, the bottleneck of communication between GPUs belonging to different switches is determined by the GPU-CPU and CPU-CPU bandwidth. For example, when GPUs 0-3 send data to GPUs 4-7 at the same time, the guaranteed bandwidth is at most 15.75 GB/sec (16 PCIe lanes between the switch and the CPU) shared among them.

18 January 4, 2017 DRAFT

QPI CPU 0 CPU 1

80-Lane 80-Lane Switch Switch GPU 0 GPU 1 GPU 2 GPU 3 GPU 4 GPU 5 GPU 6 GPU 7

Figure 2.3: Connecting 8 GPUs to 2 CPUs via two PCIe switches. Each solid line represents 16 lanes. communication between GPUs belonging to different switches is determined by the GPU-CPU and CPU-CPU bandwidth. For example, when GPU 0-3 send data to GPU 4-7 at the same time, the guaranteed bandwidth is at most 15.75 GB/sec (16 lanes of PCIe between the switch and the CPU) shared among them.
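As a quick sanity check on these numbers, the snippet below works out the guaranteed per-GPU bandwidth when GPUs 0-3 all send to GPUs 4-7 across the two switches; the figures are just the nominal PCIe 3.0 rates quoted above.

```python
lane_bw = 0.985           # GB/s per PCIe 3.0 lane, one direction
uplink_bw = 16 * lane_bw  # one x16 link between a switch and its CPU: about 15.75 GB/s

# Four GPUs behind the same switch share that single uplink when all of them
# send to GPUs behind the other switch at the same time.
per_gpu = uplink_bw / 4
print(f"uplink: {uplink_bw:.2f} GB/s, guaranteed per GPU: {per_gpu:.2f} GB/s")
```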

2.2 Data center

A data center hosts a cluster of computing machines and other components. It also provides software stacks that make access to the hardware more convenient for end users. Today, a large number of distributed computing jobs for machine learning are run in data centers. In this section, we give a brief overview of how hardware and software are structured in a typical data center. One may refer to [13] for more details.


Figure 2.4: Multi-rooted tree topology for machine connections in a data center

In a data center, computing machines are connected in a network. A typical choice of the network topology is the multi-rooted tree illustrated in Figure 2.4.


Figure 2.5: Cluster-level infrastructure: programming software (data processing frameworks such as Hadoop, Hive and Spark, and machine learning frameworks), utilities (resource management such as Yarn, Mesos and Borg, and performance debugging tools), and storage (distributed file systems such as HDFS, S3 and GFS, and distributed data stores such as BigTable and Dynamo).

Machines in a data center are grouped into racks. Each rack contains tens of machines, which are connected via an edge switch. The edge switches are then connected by aggregation switches, which usually have a multi-layer structure, and the last layer of aggregation switches is connected to the core data center switches. Recall that in Figure 2.3 the communication bottleneck between two GPUs that belong to two different switches is the bandwidth of the single link between a switch and the CPU. Here, each edge switch has more than one up-link. Assume all links have the same bandwidth and an edge switch has four down-links to the four machines in a rack and two up-links to other switches; then when all four machines attempt to communicate with machines in other racks, each of them enjoys the bandwidth of half of an up-link. The ratio between the number of down-links and up-links is called the over-subscription ratio, which is 2:1 in this example and 4:1 in the example in Figure 2.3. An ideal over-subscription ratio would be 1:1, also called full bisection bandwidth. Such an ideal hardware configuration is very expensive.

There is also an architecture for the different pieces of software running in a data center. Usually the software can be classified into three layers [13]. The firmware, kernel, and operating systems that abstract the hardware of the computing machines run at the bottom level. The top level is the application level, where the data center runs the applications that implement specific services such as website hosting, e-mail services, and various data processing jobs. Between the bottom level and the top level is the cluster level, which manages the computing resources in the cluster and provides storage and compute services for the application level. In Figure 2.5, we list several well-known infrastructures for the cluster level. They can be classified into three categories:

Resource managers abstract away the hardware, including CPUs, GPUs, memory and storage, from the physical machines to the application level. They allocate and manage resources for the other software running in the cluster. Examples of resource managers include Yarn [52], Mesos [67], and Kubernetes [35]. Performance debugging tools are used to identify performance bottlenecks, which could be caused by inefficient usage of the CPU, memory, disk, or network.

Storage software provides storage services for shared data. A distributed file system is a file system that provides an interface for machines to mount, list, read and write data. It can be mounted by multiple machines, and it can span multiple machines for larger capacity. It often duplicates data in the storage and uses the redundancy to achieve higher throughput. Examples of distributed file systems include GFS [57] and HDFS [53]. A distributed data store instead enforces a particular API for storing structured data. For example, Dynamo [43] stores data as key-value pairs, BigTable [29] uses a sparse multi-dimensional sorted mapping as the data format, and Cassandra [54] provides a database interface.

Programming software consists of platforms that help end users develop distributed computing programs. A data processing framework utilizes multiple machines to process large scale data sets. Notable examples include MapReduce [41] and its followers Hadoop [53] and Spark [152]. In general, a MapReduce program is composed of a Map procedure for data filtering and sorting, and a Reduce procedure for a data summarization operation. Another example of a data processing framework is Flink [55], which supports streaming data flow. A machine learning framework is specialized for different classes of machine learning algorithms, and it works closely with the other related software. For example, a machine learning framework needs to communicate with the resource manager to submit computing jobs; it usually reads data directly from a distributed file system or a data store; and it may obtain its input from the output of data processing frameworks. Machine learning frameworks are the focus of the first part of this thesis: we propose two machine learning frameworks, Parameter Server (PS) (Chapter 3) and MXNet (Chapter 4).
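As a concrete illustration of the Map and Reduce structure just described, here is a hypothetical single-process word-count sketch in the classic MapReduce style; it is not tied to Hadoop, Spark, or any particular framework.

```python
from collections import defaultdict
from itertools import chain

def map_fn(line):
    # Map: emit (key, value) pairs, here (word, 1) for every word in a line.
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # Reduce: summarize all values that share the same key.
    return word, sum(counts)

def word_count(lines):
    grouped = defaultdict(list)  # stands in for the shuffle/grouping step
    for key, value in chain.from_iterable(map(map_fn, lines)):
        grouped[key].append(value)
    return dict(reduce_fn(k, v) for k, v in grouped.items())

print(word_count(["the cat sat", "the cat ran"]))
# {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}
```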



Chapter 3

Parameter Server: Scaling Distributed Machine Learning

3.1 Introduction

Since its introduction, the Parameter Server (PS) framework [123] has proliferated in both academia and industry. In this chapter, we introduce our implementation of the third generation Parameter Server. The focus is on the system aspects of distributed inference, and our design decisions were guided by the workloads found in real systems.

3.1.1 Engineering Challenges

When solving distributed data analysis problems, the issue of reading and updating model parameters shared between different worker nodes is ubiquitous. The PS framework provides an efficient mechanism for aggregating and synchronizing statistics, including model parameters, between the worker nodes. In PS, each worker node only needs to maintain a small part of the model parameters, namely the part it typically operates on. Two key challenges arise in constructing a high performance PS system:

Communication In a conventional datastore, the parameters could be updated as key-value pairs. However, using this abstraction naively is inefficient: the values are typically small (floats or integers), so the overhead of sending each update as an individual key-value pair can be very high. Our insight for improving this situation comes from the observation that many learning algorithms represent parameters as structured mathematical objects, such as vectors, matrices, or tensors. At each logical time (or iteration), typically only a small part of the object is updated; that is, worker nodes usually communicate just a segment of a vector or a row of a matrix. This provides an opportunity to automatically batch both the communication of updates and their processing on the PS, and it allows the consistency tracking to be implemented efficiently.

Fault tolerance As argued before, fault tolerance is a critical property of a distributed computing system, especially at large scale. In particular, for efficient operation, the failure of a single worker node should not require a full restart of a long-running computation.

System             Shared Data              Consistency    Fault Tolerance
Graphlab [91]      graph                    eventual       checkpoint
Petuum [40]        hash table               delay bound    none
REEF [32]          array                    BSP            checkpoint
Naiad [102]        (key,value)              multiple       checkpoint
MLbase [75]        table                    BSP            RDD
Parameter Server   (sparse) vector/matrix   various        continuous

Table 3.1: Attributes of distributed data analysis systems.


Figure 3.1: Largest machine learning experiments conducted using different computing systems. Problems: blue circles — sparse logistic regression; red squares — latent variable graphical models; grey pentagons — deep networks.

To boost fault tolerance in PS, we implement live replication of parameters between servers to support hot failover, namely switching to a redundant node upon a node failure. Moreover, by treating machine removal or addition as failure or repair, respectively, failover and self-repair in PS can in turn support dynamic scaling.

3.1.2 Our contribution

Our PS confers two main advantages to developers: first, by factoring out components that machine learning systems commonly require, it allows the application-specific code to remain concise; second, as a shared platform targeting system-level optimization, it provides a robust, versatile, and high-performance implementation capable of handling a diverse array of algorithms, ranging from sparse logistic regression to topic models and distributed sketching. In particular, our PS features the following five key properties:

Efficient communication We adopt an asynchronous communication model which does not block computation unless requested. Moreover, it is optimized for machine learning tasks to further reduce network communication overhead.

Flexible consistency models Our relaxed consistency further lowers synchronization cost and latency, and it offers developers the choice of how to balance algorithmic convergence rate against system efficiency.

Elastic scalability New machines and worker nodes can be added to the system without restarting the running framework.

Fault tolerance and durability Non-catastrophic machine failures can be repaired within 1s, without interrupting the computation. Vector clocks ensure that the post-failure behavior is well defined.

Ease of use To facilitate the development of machine learning applications, the globally shared parameters are represented as potentially sparse vectors and matrices, which come with high-performance multi-threaded libraries.

Our PS is the first general purpose machine learning system capable of handling problems at industrial scale. The novelty of the proposed system lies in the synergy of picking the right systems techniques, adapting them to the machine learning algorithms, and modifying the machine learning algorithms to be more system-friendly.

Figure 3.1 provides an overview of the largest supervised and unsupervised machine learning experiments run on a number of well-known systems. When possible, we confirmed the scaling limits with the authors of each of these systems (data current as of 4/2014). As shown in the figure, we are able to handle orders of magnitude more data on orders of magnitude more processors than other published systems. Furthermore, Table 3.1 provides an overview of the main features of several distributed computing systems. Among them, our PS offers the greatest degree of flexibility in terms of consistency, it is the only system with continuous fault tolerance, and the type of its shared data makes it particularly user-friendly for data analysis applications.

3.1.3 Related Work

Related distributed computing systems have been implemented at large companies including Amazon, Baidu, Facebook, Google [42], Microsoft, and Yahoo [6], and open source implementations such as YahooLDA [6] and Petuum [68] exist. Graphlab [91] also supports parameter synchronization, on a best-effort basis. There have been several major breakthroughs in the development of the Parameter Server. The first generation of PS, as introduced by [123] in 2010, lacked flexibility.

In particular, it repurposed the memcached distributed (key,value) store as the synchronization mechanism. In 2012, YahooLDA improved this design by implementing a dedicated server with user-definable update primitives (set, get, update) and a more principled load distribution algorithm [6]. This second generation of application-specific PS can also be found in Distbelief [42] and the synchronization mechanism of [82]. A first step towards a general platform was undertaken by Petuum [68] in 2013, which improved YahooLDA with a bounded delay model while placing further constraints on the worker threading model. We will show that our third generation PS overcomes these limitations.

Finally, it is useful to compare PS to more general-purpose distributed computing systems for machine learning. Several of those systems scale well to tens of worker nodes. However, since they mandate synchronous, iterative communication, at large scale this synchronization makes them vulnerable to slow worker nodes. Mahout [12], based on Hadoop [53], and MLI [124], based on Spark [153], both adopt the iterative MapReduce [41] framework. A key insight of Spark and MLI is preserving state between iterations, which is indeed a core goal of PS.

We also compare our PS with two other systems, Distributed GraphLab [91] and Piccolo [111]. Distributed GraphLab [91] schedules its communication asynchronously using a graph abstraction. However, GraphLab lacks the elastic scalability of the map/reduce-based frameworks, and it relies on coarse-grained snapshots for recovery, which also impedes scalability. Its applicability to certain algorithms is limited by its lack of global variable synchronization as an efficient first-class primitive. In a sense, a core goal of the PS framework is to capture the benefits of GraphLab's asynchrony while going beyond its structural limitations. Piccolo [111] uses a strategy similar to PS to share and aggregate state between machines. In Piccolo, worker nodes pre-aggregate state locally and transmit the updates to a server that keeps the aggregate state. It implements largely a subset of the functionality of our system, while lacking the optimization techniques specialized for machine learning, including message compression, replication, and variable consistency models expressed via dependency graphs.

3.2 Architecture

In this section, we discuss the architecture of our third generation PS. As shown in Figure 3.2, the nodes are grouped into a server group and several worker groups. In the server group, each server node keeps track of a partition of the globally shared parameters; server nodes communicate with each other to replicate and migrate parameters for reliability and scaling; and a server manager node maintains a consistent view of the metadata of the server group, such as node liveness and the assignment of parameter partitions.

A single instance of PS can run more than one algorithm simultaneously. In particular, each worker group runs an application. There is a scheduler node in each worker group, which assigns tasks to the worker nodes in the group and monitors their progress. If workers are added or removed, the scheduler node is also in charge of rescheduling the unfinished tasks. A worker in a worker group typically stores only a portion of the training data locally, which it uses to compute local statistics such as gradients. Workers do not communicate among themselves; they communicate only with the server nodes to update and retrieve the shared parameters.


Figure 3.2: Communication between several groups of workers in Parameter Server.


Figure 3.3: Parameters per worker node decreases with the number of workers.

Note that each worker only needs the working set of the parameters for local computation, which is a small portion of all the parameters. Figure 3.3 shows that with 100 workers, each worker only needs 7.8% of the parameters for the ad click prediction application that will be discussed in detail in Section 6.3.1.

Algorithm 2 Distributed Gradient-based Optimization with PS

Task Scheduler:
1: for iteration t = 0, ..., T do
2:   choose I_t ⊆ {1, ..., n}
3:   partition I_t = ∪_{k=1}^m I_{tk}
4:   issue WORKERITERATE(t) to all workers
5: end for

Worker k = 1, ..., m:
1: function WORKERITERATE(t)
2:   pull w_t^(k) from servers
3:   compute g_t^(k) ← Σ_{i ∈ I_tk} ∂f_i(w_t^(k))
4:   push g_t^(k) to servers
5: end function

Servers:
1: function SERVERITERATE(t)
2:   aggregate g_t ← Σ_{k=1}^m g_t^(k)
3:   use the gradient g_t to update w_{t+1}
4: end function

With 10,000 workers, this fraction further reduces to 0.15%, which leads to less memory being required on each machine.

Our PS supports independent parameter namespaces, which allows a worker group to isolate its set of shared parameters from the others. Several worker groups may also share the same namespace. One example of using this feature is that more than one worker group can be used to solve the same deep learning application [42] to achieve a higher level of parallelization. Another example is that a model can be actively queried by some worker nodes, for instance in online services, while at the same time being updated by a different group of worker nodes as new training data arrives.

Next, we discuss the key components of our parameter server in detail. The rest of this section unfolds as follows. The shared parameters are presented as (key,value) vectors to facilitate linear algebra operations (Section 3.2.1), and they are distributed across a group of server nodes (Section 3.3.3). Any node can both push out its local parameters and pull parameters from remote nodes (Section 3.2.2). By default, tasks are executed by worker nodes, but they can also be assigned to server nodes via user-defined functions (Section 3.2.3). Tasks are run in parallel and asynchronously (Section 3.2.4). The PS gives the algorithm designer flexibility in choosing a consistency model via the task dependency graph (Section 3.2.5) and in choosing the subset of parameters to communicate (Section 3.2.6). As a toy example, Figure 3.4 visualizes the PS implementation of Algorithm 2, the standard distributed gradient descent.



Figure 3.4: Steps of Algorithm 2. Note that each worker node only caches its working set of parameters w.

3.2.1 (Key,Value) Vectors

The model parameters shared among the worker nodes can be represented as a set of (key,value) pairs. For example, in a loss minimization problem, the pair is a feature ID and its weight. In LDA, the key is a combination of the word ID and the topic ID, and the value is the corresponding frequency count. Each entry of the model parameters can be read and written locally, or accessed remotely, via its key. This (key,value) abstraction is widely adopted by existing approaches [40, 75, 102].

Our PS improves upon this basic approach by exploiting the structure underlying these key-value items: in machine learning algorithms, the model parameters typically take the form of linear algebra objects. For instance, for the risk minimization problem with the objective function in Algorithm 2, the model parameter w is a vector. Our PS provides the same functionality as the plain (key,value) abstraction. Moreover, by treating these model parameters as linear algebra objects, which are often sparse, it allows efficient implementation of important operations such as vector addition w + u, multiplication Xw, finding the 2-norm ‖w‖₂, and other more sophisticated operations [48].

We assume that the keys are ordered, so that we can provide vector and matrix semantics to the (key,value) pairs of the model parameters (non-existing keys are associated with zeros by default). Introducing linear algebra semantics into the PS greatly reduces the effort of implementing optimization algorithms.

29 January 4, 2017 DRAFT to highly efficient codes by leveraging CPU-efficient multithreaded self-tuning linear algebra libraries such as BLAS [48], LAPACK [10], and ATLAS [146].

3.2.2 Range-based Push and Pull

In our PS, nodes communicate through push and pull operations. In Algorithm 2, each worker node pushes the gradients it computes with local data to the server nodes, and then pulls back the updated weights. Our PS supports range-based push and pull; this bulk communication API greatly improves both computation and communication efficiency. For example, let R be a range of keys. The command w.push(R, dest) sends all existing entries of the model w whose keys fall in the range R to the destination dest, which can be either a particular node or a node group. Similarly, the command w.pull(R, dest) reads all existing entries of model w with keys in the range R from the destination dest. If we set R to be the whole key range, the whole model vector w is communicated; if we set R to be a single key, only an individual entry is selected.

Note that this communication interface can be extended to communicate any local data structure that shares the same keys as the model parameters. For example, in Algorithm 2, a worker pushes the gradients g it computes with local data to the PS for aggregation. Since the gradients g share the keys of the model parameters w at the worker node, the developer can simply use the command w.push(R, g, dest) to push the local gradients.
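As an illustration of how Algorithm 2 maps onto this interface, the sketch below writes one WorkerIterate step against a hypothetical Python binding kv with range-based pull and push; the names kv, pull, push, and grad_fn are placeholders and not the actual PS API.

import numpy as np

def worker_iterate(kv, keys, local_data, grad_fn):
    # One WorkerIterate step of Algorithm 2 (illustrative sketch).
    w = kv.pull(keys)              # 1. pull the working-set weights
    g = np.zeros_like(w)           # 2. local gradient over this worker's partition
    for x_i, y_i in local_data:
        g += grad_fn(w, x_i, y_i)
    kv.push(keys, g)               # 3. push the gradient; it shares w's keys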

3.2.3 User-Defined Functions on the Server

The nodes in the server group are not only in charge of aggregating data from worker nodes; they can also execute user-defined functions. This is favorable as the server nodes often have more complete or up-to-date information about the shared model parameters. For example, in Algorithm 2, the server nodes use subgradients to update the model parameter w. In the context of sketching (see Section 3.4.3), almost all operations occur on the server nodes.

3.2.4 Asynchronous Tasks and Dependency

A task is issued by a remote procedure call. It can be a push or a pull that a worker sends to servers, or a user-defined function sent by the scheduler to any node. Tasks may include an arbitrary number of subtasks. For example, the task WorkerIterate in Algorithm 2 contains one push and one pull.

Tasks are executed asynchronously: the caller of a task can start other computation immediately after issuing the task, without waiting for the reply from the callee. The caller marks a task as finished only when it receives the reply, which could be the function return of a user-defined function, the (key,value) pairs requested by a pull, or an empty acknowledgement. On the other end, the callee marks a task it receives as finished only if the call of the task has returned and all the subtasks issued by this call are finished.

By default, callees execute tasks in parallel for better efficiency. A caller that wishes to enforce serial task execution can place a dependency, execute-after-finished, between tasks. For example, Figure 3.5 depicts three iterations of WorkerIterate.



Figure 3.5: Example of asynchronous processing of different tasks by the same node. Here iteration 12 depends on iteration 11, while iterations 10 and 11 are independent.

Figure 3.6: Directed acyclic graphs for different consistency models: (a) sequential, (b) eventual, (c) bounded delay with τ = 1. The size of the DAG increases with the delay.

Iterations 10 and 11 are independent, but 12 depends on 11. The callee therefore begins iteration 11 immediately after the local gradients are computed in iteration 10. Iteration 12, however, is postponed until the pull of iteration 11 finishes.

Such task dependencies facilitate the implementation of algorithm logic. For example, in Algorithm 2, the aggregation logic in ServerIterate updates the weight w only after the gradients from all the workers have been aggregated. This can be implemented by making the update task depend on the push tasks of all workers. Another important use of task dependency is to support the flexible consistency models introduced next.
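The execute-after-finished dependency can be pictured with ordinary futures; the sketch below (a generic illustration, not the PS scheduler) issues tasks asynchronously and lets the caller declare that one task must wait for another.

from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=4)

def issue(fn, *deps):
    # Issue fn asynchronously, but only after all tasks in deps have finished.
    def run():
        for d in deps:
            d.result()        # block until the dependency is finished
        return fn()
    return pool.submit(run)

iter10 = issue(lambda: "push/pull 10")
iter11 = issue(lambda: "push/pull 11")            # independent of iteration 10
iter12 = issue(lambda: "push/pull 12", iter11)    # execute-after-finished: waits for 11
print(iter12.result())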

3.2.5 Flexible Consistency

Running independent tasks simultaneously can greatly improve system efficiency, since it parallelizes the use of CPU, disk, and network bandwidth. However, it may also lead to data inconsistency between nodes. For example, consider the diagram in Figure 3.5 for Algorithm 2: the worker starts iteration 11 before the updated model has been pulled back, so the outdated local model is used in iteration 11 and the same gradients are obtained as in iteration 10. This inconsistency can potentially slow down the convergence of the algorithm. Nevertheless, some algorithms are less sensitive to this type of inconsistency, and starting iteration 11 without waiting for iteration 10 may only cause a small part of the model to be inconsistent. The best trade-off between system efficiency and algorithm convergence usually depends on various factors, including the algorithm's sensitivity to data inconsistency, the feature correlation in the training data, and the capacity difference between hardware components. Instead of forcing the user to adopt one fixed task dependency pattern that may be ill-suited to a specific problem, our PS gives algorithm designers the flexibility to define their own consistency models.

Figure 3.6 shows the directed acyclic graphs for three different models of task dependencies. Next, we discuss how each of these models can be implemented in our PS.

Sequential In sequential consistency, all tasks are executed one by one. The next task starts only after the previous one has finished. This model is identical to the single-thread implementation and is also known as Bulk Synchronous Processing.

Eventual Eventual consistency is the opposite: all tasks may start simultaneously. A system with eventual consistency is discussed in [123]. However, this model is only useful if the algorithm is robust to the resulting delays.

Bounded Delay In bounded delay consistency, a maximal delay τ is set. A new task is blocked until all tasks started τ time slots ago have finished. This model provides more flexible control than the previous two, and it contains them as special cases: setting τ = 0 gives the sequential consistency model, and setting τ = ∞ gives the eventual consistency model.

Note that the dependency graphs can evolve over time. The scheduler may increase or decrease the maximal delay according to the runtime progress in order to balance system efficiency and algorithm convergence.
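A minimal way to picture bounded delay is as a gate that admits task t only once every task started at least τ time slots earlier has finished; the sketch below is an illustration of the rule, not the PS implementation.

import threading

class BoundedDelayGate:
    # tau = 0 recovers sequential consistency; a very large tau behaves
    # like eventual consistency.
    def __init__(self, tau):
        self.tau = tau
        self.finished = set()
        self.cv = threading.Condition()

    def wait_to_start(self, t):
        with self.cv:
            # block until all tasks 0 .. t - tau - 1 have finished
            self.cv.wait_for(
                lambda: all(s in self.finished for s in range(t - self.tau)))

    def mark_finished(self, t):
        with self.cv:
            self.finished.add(t)
            self.cv.notify_all()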

3.2.6 User-defined Filters

Besides the scheduler-based task flow control, our PS supports user-defined filters to selectively synchronize individual (key,value) pairs, allowing fine-grained control of data consistency within a task. The insight is that the optimization algorithm itself usually possesses information about which parameters are most useful and need to be synchronized with high priority. One example is the significantly modified filter, which only pushes entries that have changed by more than a threshold since their last synchronization. Another example is fixed point quantization, which converts floating-point numbers into lower-bit integers.
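As a concrete example, the significantly modified filter can be sketched as follows (an illustration only; the key values and the threshold are placeholders, not PS defaults).

import numpy as np

def significantly_modified(keys, values, last_synced, threshold=1e-3):
    # Keep only entries whose value changed by more than the threshold
    # since the last synchronization.
    changed = np.abs(values - last_synced) > threshold
    return keys[changed], values[changed]

keys = np.array([3, 7, 9])
values = np.array([1.00, 2.50, 4.00])
last = np.array([1.00, 2.49, 3.00])
print(significantly_modified(keys, values, last))   # keys 7 and 9 pass the filter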

3.3 Implementation

In this section, we discuss the key implementation details of our PS. Unlike prior (key,value) systems, our PS is optimized for range-based communication, with compression for both range-based vector clocks (Section 3.3.1) and data (Section 3.3.2). Section 3.3.3 shows how the server nodes use consistent hashing [132] to store the key-value pairs of the model parameters. Section 3.3.4 shows how entries are replicated using chain replication [142] to achieve fault tolerance.

3.3.1 Vector Clock

In our PS, given the potentially complex task dependency graph and the need for fast recovery, each (key,value) pair is associated with a vector clock [43, 78], which records the time of each individual node on this (key,value) pair. Vector clocks have various benefits; for example, they are useful for tracking aggregation status or rejecting doubly-sent data. However, a naive implementation of the vector clock requires O(nm) space to handle n nodes and m parameters.

Algorithm 3 Set vector clock to t for range R and node i

1: for each existing range S with S ∩ R ≠ ∅ do
2:   if S ⊆ R then
3:     vc_i(S) ← t
4:   else
5:     a ← max(S_b, R_b) and b ← min(S_e, R_e)
6:     split range S into [S_b, a), [a, b), [b, S_e)
7:     vc_i([a, b)) ← t
8:   end if
9: end for

With thousands of nodes and billions of parameters, this is infeasible in terms of memory and bandwidth. Fortunately, as a result of the range-based communication pattern of our PS, many parameters share the same timestamp: if a node pushes the parameters within a range, then the timestamps of those parameters are likely to be the same, and thus they can be compressed into a single range-based vector clock. More specifically, let vc_i(k) denote the timestamp of key k for node i. Given a key range R, setting the ranged vector clock vc_i(R) = t means that for any key k in the range R we set vc_i(k) = t. Initially, there is only one range vector clock for each node i; it covers the entire key space and has 0 as its initial timestamp. Each range-set operation splits at most one existing range and creates at most 3 new vector clocks (see Algorithm 3). Let k be the total number of unique ranges created this way and n be the number of nodes; then there are at most O(nk) vector clocks. Since k is typically much smaller than the total number of parameters, this range-based vector clock significantly reduces the memory and bandwidth requirements.¹
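A compact way to picture the range vector clock is as a sorted list of disjoint ranges, each carrying one timestamp; the sketch below (illustrative only, for a single node) implements the range-set operation of Algorithm 3 with half-open ranges [begin, end).

class RangeVectorClock:
    # Timestamps stored per key range rather than per key.
    def __init__(self, key_space_end):
        self.ranges = [(0, key_space_end, 0)]     # (begin, end, timestamp)

    def set(self, r_begin, r_end, t):
        new_ranges = []
        for s_begin, s_end, ts in self.ranges:
            if s_end <= r_begin or s_begin >= r_end:    # no overlap
                new_ranges.append((s_begin, s_end, ts))
                continue
            a, b = max(s_begin, r_begin), min(s_end, r_end)
            if s_begin < a:                             # left remainder keeps old ts
                new_ranges.append((s_begin, a, ts))
            new_ranges.append((a, b, t))                # overlapped part gets new ts
            if b < s_end:                               # right remainder keeps old ts
                new_ranges.append((b, s_end, ts))
        self.ranges = new_ranges

vc = RangeVectorClock(key_space_end=100)
vc.set(10, 40, t=3)
print(vc.ranges)    # [(0, 10, 0), (10, 40, 3), (40, 100, 0)]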

3.3.2 Messages

When a node sends a message to another individual node or to a node group, the message consists of a list of (key,value) pairs whose keys lie in a range R, together with the associated range vector clock:

[vc(R), (k_1, v_1), ..., (k_p, v_p)],  where k_j ∈ R.

This is the basic communication format of our PS, used for communicating both the shared parameters and the tasks. For the latter, a (key,value) pair usually takes the form (task ID, arguments or return results). Messages may carry only a subset of the keys within range R; the missing keys are assigned the same timestamp without changing their values.

A message can be split by key range. This happens when a worker sends a message to the entire server group, or when the key assignment of the receiving node has changed. To achieve this, we simply partition the (key,value) lists and split the range vector clock, similar to Algorithm 3.

Since machine learning problems typically require high bandwidth for communication between the nodes, it is desirable to compress these messages.

¹Ranges can also be merged to reduce the number of fragments. However, in practice both n and k are small enough to be easily handled. We leave range merging for future work.



Figure 3.7: Server node layout.

One observation is that the training data often remains unchanged between iterations of the algorithm, so a worker may send the same key lists multiple times. Therefore, we allow the receiving node to cache the key lists, so that when such duplication happens later, the sender only needs to send a hash of the list. Another observation is that the model parameters may contain many zero entries. For example, a large portion of the parameters remain zero in sparse logistic regression, as evaluated in Section 3.4.1. Likewise, a user-defined filter may also zero out a large fraction of the values. Hence we only need to send the (key,value) pairs with nonzero values. To achieve this, we use the Snappy compression library [63] to compress messages, which effectively removes the zeros. Note that key-caching and value-compression can be used jointly in our PS.
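The sender side of key-caching and value-compression might look like the sketch below; it is a hypothetical helper, not the PS code, and it uses zlib as a stand-in for the Snappy library that the PS actually employs.

import hashlib
import pickle
import zlib

class MessageSender:
    # If the receiver has already seen this exact key list, send only its
    # hash; the values are compressed to squeeze out zeros.
    def __init__(self):
        self.receiver_known_hashes = set()   # in reality tracked per receiver

    def build_message(self, vc, keys, values):
        key_blob = pickle.dumps(keys)
        key_hash = hashlib.sha1(key_blob).hexdigest()
        if key_hash in self.receiver_known_hashes:
            key_field = key_hash             # receiver reconstructs the cached list
        else:
            key_field = key_blob
            self.receiver_known_hashes.add(key_hash)
        value_field = zlib.compress(pickle.dumps(values))
        return {'vc': vc, 'keys': key_field, 'values': value_field}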

3.3.3 Consistent Hashing

Similar to a conventional distributed hash table [24, 118], our PS partitions keys by inserting both keys and server node IDs into a hash ring (Figure 3.7). Each server node manages a key range, which starts at this server's insertion point and ends at the next insertion point of another node in the counter-clockwise direction. This node is called the master of this key range. A physical server is often represented on the ring by multiple "virtual" servers to improve load balancing and recovery. We simplify key range management by using a directly mapped DHT design: the server manager handles the ring management, and all other nodes locally cache the key partition, so that they can determine which server is responsible for a given key range and are notified of any change.
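A generic consistent-hashing assignment, with virtual servers for load balancing, can be sketched as follows; this is a textbook illustration rather than the PS implementation, whose ring is managed by the server manager and may use a different direction convention.

import bisect
import hashlib

def ring_hash(x):
    # Map an ID or key to a position on the hash ring.
    return int(hashlib.md5(str(x).encode()).hexdigest(), 16) % (2 ** 32)

class HashRing:
    def __init__(self, server_ids, virtual_copies=4):
        self.points = sorted(
            (ring_hash(f"{s}#{v}"), s)
            for s in server_ids for v in range(virtual_copies))
        self.positions = [p for p, _ in self.points]

    def owner(self, key):
        # The server whose insertion point is at or before the key's
        # position, wrapping around the ring.
        i = bisect.bisect_right(self.positions, ring_hash(key)) - 1
        return self.points[i][1]

ring = HashRing(["server1", "server2", "server3"])
print(ring.owner(12345))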

3.3.4 Replication and Consistency

Each server node stores a replica of the key ranges of its k counterclockwise neighbors on the ring, in addition to the range it owns. For example, in Figure 3.7, we have k = 2 and server 1 replicates the key ranges owned by server 2 and server 3. We refer to nodes holding copies of a key range as slaves of that key range, and to the owner of the key range as its master. Worker nodes communicate with the master of a key range for both push and pull. Any modification of a key range by the master is copied, together with its timestamp, to its slaves synchronously.



Figure 3.8: Servers generate replicas of key ranges. Left: a single worker. Right: multiple workers updating values simultaneously.

For example, the left of Figure 3.8 shows that when worker 1 pushes x into server 1, a user-defined function f(x) that modifies the shared data is invoked. The push task is completed only when the data modification f(x) has been copied to every slave.

Naive replication increases the network traffic by a factor of k, which is undesirable for many machine learning applications where the communication traffic is already high. To address this, our PS framework permits "replication after aggregation". The basic idea is that the server nodes can first aggregate data from the worker nodes, for instance by summing local gradients, and postpone replication until such aggregation is completed. For example, on the right-hand side of Figure 3.8, two workers push x and y to the server, respectively. The server first aggregates the pushes by computing x + y, then applies the modification f(x + y), and finally performs the replication. With n workers, such replication uses only k/n times the bandwidth. Often k is a small constant, while n can be hundreds or even thousands. Although aggregation increases the latency of the task reply, this latency can be hidden by our relaxed consistency conditions.
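The aggregate-then-replicate idea on a range master can be sketched as below; the class, its callbacks, and the update rule are all hypothetical placeholders rather than the PS implementation.

class RangeMaster:
    # Worker pushes are summed first; the update and its replication to
    # the k slaves happen once per aggregation round.
    def __init__(self, slaves, num_workers, update_fn):
        self.slaves = slaves            # list of callables that replicate state
        self.pending = []               # pushes received in this round
        self.num_workers = num_workers
        self.update_fn = update_fn      # e.g. apply an aggregated gradient
        self.state = 0.0

    def on_push(self, x):
        self.pending.append(x)
        if len(self.pending) == self.num_workers:
            aggregate = sum(self.pending)          # x + y + ...
            self.state = self.update_fn(self.state, aggregate)
            for replicate in self.slaves:          # one replication, not one per push
                replicate(self.state)
            self.pending.clear()

received = []
master = RangeMaster(slaves=[received.append], num_workers=2,
                     update_fn=lambda state, agg: state - 0.1 * agg)
master.on_push(1.0)
master.on_push(2.0)
print(received)    # one replicated value after aggregating both pushes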

3.3.5 Server Management

Our PS supports dynamic addition and removal of nodes to achieve fault tolerance and dynamic scaling. In particular, the following steps happen when a server node joins:

1. The server manager assigns the new node a key range, which may come from a terminated node or from a key range split.

2. The new server node becomes the master of this key range and fetches the data within it. The server node also gets the data corresponding to the k additional key ranges for which it serves as a slave. For the new server node, fetching the data in a key range R from some other node S proceeds similarly to the Ouroboros protocol [110]. First, node S pre-copies all (key,value) pairs in the range R together with the vector clocks. Note that this may cause a range vector clock to split, as in Algorithm 3. During the pre-copy stage, node S keeps the new node updated with all changes that occur in R. Then, node S stops accepting any message associated with the key range R, while the new node takes over.

3. The server manager broadcasts the system change. Based on the key range changes, other nodes shrink their data and resubmit their unfinished tasks to the new server node if needed.

Upon receiving the broadcast message about the new server node and the key range change, a node first checks whether it also maintains this key range. If so, and if the key range should no longer be maintained by it, the node deletes all (key,value) pairs within the key range together with the vector clocks. Next, the node scans all outgoing messages for which no reply has yet been received. If the key range of such a message overlaps with the new key range, the message is split according to the key range change and forwarded to the new server node. Such messages may be sent twice due to delays, failures, and lost acknowledgements; both the original recipient and the new server node can correct for this by checking the vector clocks.

The server manager detects node failure by a heartbeat signal. The departure of a server node (voluntary or due to failure) is treated similarly to a join: the server manager simply assigns the key range of the leaving node to a different server node. Integration with a cluster resource manager such as Yarn [52] or Mesos [67] is left for future work.

3.3.6 Worker Management

The steps for adding a new worker node are similar to, but much simpler than, those for adding a new server node:

1. The task scheduler assigns the new worker node the data within a key range.

2. The new node loads the corresponding training data either from a network file system or from some existing workers. Since the training data is often read-only, there is no two-phase fetch, unlike Step 2 when a new server node joins the system. The new node also pulls the shared parameters from the server nodes.

3. The task scheduler broadcasts the system change. Other worker nodes free the corresponding training data if needed.

The task scheduler may start a replacement when a worker leaves the system. We give the algorithm designer the option to continue without replacing a failed worker, for two reasons. First, when the training data is huge, recovering a worker node is usually much more expensive than recovering a server node. Second, for large scale problems, losing a small amount of training data has only a very limited effect on the overall optimization problem. Sometimes, it may even be desirable to actively terminate workers that are too slow.

3.4 Evaluation

In order to evaluate the performance of our PS, we implement two representative algorithms: Sparse Logistic Regression for risk minimization, and Latent Dirichlet Allocation for generative models. To demonstrate the versatility of our approach, these experiments are run on clusters in two large Internet companies and a university research cluster.

System             Method    Consistency                 LOC
System A           L-BFGS    Sequential                  10,000
System B           DBPG      Sequential                  30,000
Parameter Server   DBPG      Bounded Delay, KKT Filter   300

Table 3.2: Comparison of performance between Parameter Server and other systems.

3.4.1 Sparse Logistic Regression

Setting Sparse logistic regression is one of the most popular algorithms for large scale risk minimization [27]. It combines the logistic loss, ℓ(x_i, y_i, w) = log(1 + exp(−y_i ⟨x_i, w⟩)), with the ℓ1 regularizer Ω(w) = ‖w‖₁ = Σ_i |w_i|. The ℓ1 regularizer encourages a compact, sparse solution, yet its non-smoothness makes the optimization problem difficult.

We use the ad click prediction dataset CTRb, with 170 billion examples and 65 billion unique features, totaling 636 TB uncompressed (141 TB compressed). We deploy our PS on a cluster of 1000 machines in CompanyA. Each machine is equipped with 16 physical cores and 192GB of DRAM, and is connected by 10 Gb Ethernet. Among the 1000 machines, 800 are worker nodes and 200 are server nodes. During our experiment, this cluster is in concurrent use by other tasks.
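Putting the loss and the regularizer together, the problem being minimized can be written as follows, where λ denotes the regularization weight, which the text above leaves implicit:

    min_w   Σ_{i=1}^n  log(1 + exp(−y_i ⟨x_i, w⟩))  +  λ ‖w‖₁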

Algorithm We propose the DBPG algorithm in Chapter 6 and implement it here for sparse logistic regression. This distributed algorithm differs from the standard version in the following ways. In each iteration, only a subset of parameters is updated. The worker nodes compute both the first-order gradients and the diagonal part of the second-order derivatives for this subset. Based on the aggregated local derivatives, the server nodes update the model by solving a proximal operator. Moreover, to reduce communication overhead, we use a bounded-delay model over iterations, and we use a "KKT" filter to suppress transmission of the parts of the gradient update with negligible magnitude.
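For intuition, the ℓ1 proximal step that the servers solve reduces, in its simplest diagonally-scaled form, to coordinate-wise soft thresholding; the sketch below is a generic illustration of such an update, not the DBPG update of Chapter 6.

import numpy as np

def soft_threshold(z, kappa):
    # Proximal operator of kappa * ||.||_1: shrink each coordinate toward zero.
    return np.sign(z) * np.maximum(np.abs(z) - kappa, 0.0)

def server_prox_update(w, grad, hess_diag, lam, eta=1.0):
    # One illustrative proximal-gradient step with a diagonal scaling.
    step = eta / (hess_diag + 1e-8)          # per-coordinate step size
    return soft_threshold(w - step * grad, step * lam)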

Results We compare the performance of our PS with two special-purpose systems, named System A and System B, both developed by a large Internet company. The results are shown in Table 3.2. Notably, to perform the same computational task, our PS only requires 300 lines of code, while Systems A and B each consist of more than 10K lines of code. Our PS factors out most of the complicated details of the distributed implementation into a generalized component that is reusable by many other algorithms.

We run sparse logistic regression on the three systems and compare the time each system takes to reach the same objective value. Figure 3.9 shows that the relaxed consistency model substantially increases worker node utilization. As can be seen from the figure, workers in System A are idle for 32% of the time, and workers in System B are idle for 53% of the time, waiting for the barrier in each block. Our PS reduces this cost to under 2%: workers can start processing the next block without waiting for the previous blocks to complete, which alleviates the delay otherwise imposed by barrier synchronization.



Figure 3.9: Time spent on computation and waiting (per worker) in sparse logistic regression.


Figure 3.10: Savings of outgoing network traffic. Left: per server. Right: per worker.

This gain does not come entirely free: we observe that our PS uses slightly more CPU than System B. There are two reasons for this: first, and less fundamentally, System B optimizes its gradient calculations by careful data preprocessing; second, asynchronous updates with the PS require more iterations to achieve the same objective value. Nevertheless, thanks to the significantly reduced communication cost, our PS still halves the total runtime.

Figure 3.10 shows the reduction in network traffic. Allowing the senders and receivers to cache the keys reduces nearly 50% of the traffic; this is because the key (int64) and the value (double) are of the same size, and the key set does not change during optimization. In addition, when the KKT filter is applied, data compression is effective for compressing the values for both servers (>20x) and workers (>6x). The reasons for its effectiveness are twofold: first, the ℓ1 regularizer encourages a sparse model, so most of the values pulled from the servers are simply 0; second, the KKT filter also forces a large portion of the gradients sent to the servers to be 0.


Figure 3.11: Distribution of log-likelihoods per worker as a function of time, in the setting of 1000 machines and 5 billion users.

Figure 3.12: Distribution of log-likelihoods per worker as a function of time, stratified by the number of iterations.

3.4.2 Latent Dirichlet Allocation

Setting We apply our PS architecture to the problem of modeling user interests based on the domains of the URLs they click when presented with search results. We collect search log data (ClickProfile in Table 1.4) with over 5 billion unique user identifiers and evaluate a user interest model over the 5 million most frequently clicked domains.


Figure 3.13: Convergence of log-likelihoods per worker, in the settings of 1000 and 6000 machines, with 500 million users.

We run the same algorithm and compare its performance in two settings: 800 workers and 200 servers, and 5000 workers and 1000 servers, respectively. The machines belong to the cluster CompanyB (Table 1.6); each has 10 physical cores, 128GB of DRAM, and at least 10 Gb/s of network connectivity. We again only have access to the cluster while production jobs are running concurrently.

Algorithm We perform LDA using a combination of stochastic variational methods [69], collapsed Gibbs sampling [62], and distributed gradient descent. Similar to the setting of [6], the gradients are aggregated asynchronously as they arrive from different worker nodes. We divide the model parameters into local and global parameters. The local parameters (i.e., auxiliary metadata) are pertinent to a given node and are streamed from disk whenever we query that node. The global parameters are shared among the nodes and are represented as (key,value) pairs stored in the PS.

To our knowledge, no other system (e.g., YahooLDA, Graphlab, or Petuum) can handle this amount of data and model complexity for LDA, with up to 10 billion shared parameters (5 million tokens and 2000 topics). The largest previously reported experiments [5] had under 100 million users active at any time, fewer than 100,000 tokens, and under 1000 topics (2% of the data, 1% of the parameters).

Results To evaluate the quality of the inference algorithm in different settings, we use the training log-likelihood as a measure of fit and monitor how rapidly it converges. In Figure 3.13, we observe an approximately 4x speedup in convergence when the number of machines increases from 1000 to 6000. The stragglers observed in Figures 3.11 and 3.12 also illustrate the importance of having an architecture that can cope with performance variation across worker nodes.

Algorithm 4 CountMin Sketch

Init: M[i, j] = 0 for i ∈ {1, ..., k} and j ∈ {1, ..., n}
Insert(x):
1: for i = 1 to k do
2:   M[i, hash(i, x)] ← M[i, hash(i, x)] + 1
3: end for
Query(x):
1: return min {M[i, hash(i, x)] : 1 ≤ i ≤ k}

Peak inserts per second            1.3 billion
Average inserts per second         1.1 billion
Peak net bandwidth per machine     4.37 GBit/s
Time to recover a failed node      0.8 second

Table 3.3: Insertion rates of distributed CountMin implemented with Parameter Server.

3.4.3 Sketches

Setting To demonstrate the generality of our system, we test our PS with sketches [16, 37], whose computation pattern is very different from that of other machine learning algorithms: a large number of write events constantly arrive from a streaming data source. The computational task is to insert a streaming log of pageviews into a structure that allows the user to efficiently track pageview counts for a large collection of web pages. Each entry in the streaming log is a pair consisting of the key of a webpage and the corresponding number of requests served in a particular hour. We use the Wikipedia (and other Wiki projects) page view statistics as a benchmark. For the time period between 12/2007 and 1/2014, there are 300 billion entries associated with more than 100 million unique keys. We run our PS with 90 virtual server nodes on 15 machines of the research cluster CampusB. Each machine has 64 cores and is connected by 40 Gb Ethernet.

Algorithm Sketching algorithms efficiently summarize and store huge volumes of data so that queries can be answered quickly. They are particularly important in streaming applications where large volumes of data and queries arrive in real time. For example, Cloudflare's DDoS-prevention service analyzes the page requests across its entire content delivery architecture to identify likely DDoS targets and attackers. The volume of data logged in such applications is far beyond the capacity of a single machine. While a conventional approach might be to shard the workload across a key-value cluster such as Redis, those solutions typically do not allow the user-defined aggregation semantics needed to implement approximate aggregation.

Algorithm 4 lists the key steps of the CountMin sketch [37]. In particular, a query returns an upper bound on the number of times key x has been observed. Splitting keys into ranges automatically allows us to parallelize the sketch algorithm. We thus implement a distributed CountMin with our PS.
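For reference, a single-machine version of Algorithm 4 is easy to write down; the sketch below uses a salted MD5 hash purely for illustration, which is not the hash family used in the PS implementation.

import hashlib
import numpy as np

class CountMin:
    def __init__(self, k=4, n=2 ** 20):
        self.k, self.n = k, n
        self.M = np.zeros((k, n), dtype=np.int64)

    def _hash(self, i, x):
        digest = hashlib.md5(f"{i}:{x}".encode()).hexdigest()
        return int(digest, 16) % self.n

    def insert(self, x, count=1):
        for i in range(self.k):
            self.M[i, self._hash(i, x)] += count

    def query(self, x):
        # Never an undercount: returns an upper bound on x's true count.
        return min(self.M[i, self._hash(i, x)] for i in range(self.k))

cm = CountMin()
cm.insert("example.org/page", count=17)
print(cm.query("example.org/page"))   # at least 17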

41 January 4, 2017 DRAFT Results Table 3.3 shows the insertion rates of distributed CountMin implemented with our PS. There are two key factors that contribute to the good performance. First, bulk communication featured in our PS significantly reduces the communication cost. Second, message compression further reduces the average (key,value) size to around 50 bits. Moreover, when a server node is terminated during the insertion operation, our PS can recover the failed node within 1 second. This fault tolerance makes our system well equipped for realtime applications.


Chapter 4

MXNet: a Flexible and Efficient Deep Learning Library

4.1 Introduction

4.1.1 Background

The term "deep learning" has come to represent a class of algorithms that can absorb huge volumes of data and learn elegant and useful patterns from that data. It prevails in various fields, including computer vision, natural language processing, and voice recognition. Thanks to the rapid growth of data volume and computing power, developers are able to create complex neural network models, which can model large scale data sets more accurately than conventional models. For example, almost all recent ImageNet [119] challenge winners employ neural networks with hundreds of layers. These networks, however, require billions of floating-point operations to process each sample, and there are usually millions to billions of samples in a typical large scale training data set. The rise in modeling and computational complexity poses interesting challenges to the design of computing systems specialized for deep learning applications. We believe that an ideal system should meet the following criteria:

• Scalability Deep learning models can take days or weeks to train, so even modest improvements can make a huge difference in the speed at which new models are developed and evaluated. As one of the major themes of this thesis, distributed computing is a promising way to speed up the training process by exploiting the computing power of multiple devices. The efficiency with which a deep learning framework scales out across multiple cores is one of its defining features.

• Flexibility A flexible and easy-to-use programming interface is also important. Deep neural network models are complicated and usually have a rich set of operators, and there are many ways to construct the objective function for a learning problem. The programming interface should allow users to specify deep neural networks in a flexible way to accommodate all these variations. Besides, developers also need to handle other common tasks such as data processing and visualization, which are supported by several other specialized frameworks.

To reduce this transition cost, the programming interface of a deep learning system should either resemble those of existing frameworks or provide ways to interact with them.

• Portability Deep learning applications are trained and deployed in a wide range of situations, from laptops and server farms with high computing power to mobile and connected devices, which are often located in remote places with less reliable networking and considerably less computing power. Portability across such a broad range of devices and platforms is therefore also required. However, efficient implementations of even the same algorithm vary significantly across devices, and it is a great challenge to hide these implementation difficulties from application developers.

A set of programming models have emerged to help developers define and train deep neural networks, along with open source frameworks that put deep learning in the hands of mere mortals. Well-known examples include Caffe [70], Torch7 [33], Theano [14], and TensorFlow [95]. All these systems embed a domain-specific language (DSL) into a host language (e.g. Python, Lua, C++). The programming paradigms can be classified into two categories: imperative, where the user specifies exactly how the computation needs to be performed, and declarative, where the user specification focuses on what to do. Table 4.1 summarizes the properties of the two paradigms. Examples of imperative programming include NumPy and Torch, while examples of declarative programming include Caffe and CNTK, which abstract away the inner workings of the actual implementation and let users program over layer definitions. The dividing line between the two categories can be blurry at times: frameworks such as Theano and the more recent TensorFlow can be viewed as a mixture of both, since they declare abstract computation graphs while the computation within the graphs can be specified imperatively.

The learning curve of a deep learning system also depends on its host language. For example, Caffe uses Protobuf as its host language while CNTK uses BrainScript. And even though the host language of many systems is Python nowadays, programmers still need to learn a large set of functions that are specific to each system.

4.1.2 Our contribution

To address these challenges in computing systems specialized for deep learning applications, we propose MXNet, a deep learning framework designed for both efficiency and flexibility. We highlight the key features of MXNet below:

Flexibility MXNet features a superset programming interface that blends the advantages of imperative tensor computation and declarative symbolic expressions. This allows developers to enjoy the benefits of both approaches: declarative programming draws a clear boundary around the computation, exposing more optimization opportunities, whereas imperative programming has a much lower learning barrier.

Well-optimized back-end system User programs are transformed into an intermediate representation, which is issued to the back-end system for execution. The back-end system uses techniques motivated by compiler technology to optimize the memory footprint and computation cost.

Execute a = b + 1
  Imperative: eagerly compute and store the result in a, with the same type as b.
  Declarative: return a computation graph; bind data to b and do the computation later.

Advantages
  Imperative: conceptually straightforward, and often works seamlessly with the host language's built-in data structures, functions, debugger, and third-party libraries.
  Declarative: the whole computation graph is available before execution, which is beneficial for optimizing performance and memory utilization; also convenient for implementing functions such as load, save, and visualization.

Table 4.1: Comparison between the imperative and declarative paradigms.

System            Core Lang  Binding Langs          Distributed  Imperative  Declarative
Caffe [70]        C++        Python/Matlab          ×            ×           √
Torch7 [33]       Lua        -                      ×            √           ×
Theano [14]       Python     -                      ×            ×           √
CNTK [149]        C++        BrainScript/Python     √            ×           √
TensorFlow [95]   C++        Python                 √            ×           √
MXNet             C++        Python/R/Scala/Julia   √            √           √

Table 4.2: Comparison between MXNet and other popular open-source ML libraries.

Efficiency MXNet has an efficient implementation that achieves a high degree of scalability. We simplify the parameter server interface introduced in Chapter 3 for deep learning workloads, and exploit the communication bandwidth hierarchy to make the best use of local communication in heterogeneous computing, which reduces the communication overhead.

In Table 4.2, we compare MXNet with other state-of-the-art computing systems. The rest of this chapter unfolds as follows. We first introduce the front-end programming interface of MXNet in Section 4.2, and then in Section 4.3 discuss how user programs are optimized and executed by the back-end system. We explain in detail how data is communicated between different devices in Section 4.4. In Section 4.5 we present encouraging experimental results that demonstrate MXNet's scalability. We conclude the chapter with a discussion of future work for MXNet.

4.2 Front-End Programming Interface

A deep neural network training program often consists of two parts: first, defining the specification of the neural network model; second, specifying how the training algorithm interacts with the neural network, including how to load training data, how to update model parameters, and how to monitor the computation progress. The second part is common to many machine learning applications, while the first part is unique to deep learning applications.

>>> import mxnet as mx
>>> a = mx.ndarray.zeros((2, 3), mx.gpu())
>>> a += 2
>>> b = mx.ndarray.dot(a, a.T)
>>> print(b.asnumpy())
[[ 12.  12.]
 [ 12.  12.]]

Figure 4.1: The NDArray interface in Python.

Given the complexity of deep neural network models, it is desirable that the programming interface has the flexibility to let users specify a complex model compactly and conveniently. To address this need, MXNet provides a multi-dimensional array computation interface. Moreover, it supports symbolic expressions, which make constructing neural networks more convenient. In the rest of this section, we explain these two interfaces. In the examples we use Python as the front-end language, but note that MXNet also supports other programming languages including R, Scala, and Julia.

Multi-dimensional Array Computation NDArray, the library for multi-dimensional array computation in MXNet, is similar to NumPy, the most widely used scientific computing package in Python. NDArray provides multi-dimensional array manipulations and efficient linear algebra computations. Compared to NumPy, one major difference is that, besides the CPU, NDArray also supports other computation devices such as GPUs and FPGAs. Figure 4.1 shows an example of array creation, element-wise arithmetic, and matrix multiplication on a GPU using MXNet.

Symbolic Expressions A symbolic expression is an expression built from variables and operators. A symbolic variable is a free variable which can be assigned a value. A symbolic operator performs computations on one or multiple symbolic expressions and outputs new expressions. It can be a simple multi-dimensional array operation, such as element-wise addition, or a complex neural network component, such as the construction of a fully-connected layer. A symbolic expression only declares the computations, and all the free variables need to be bound with values when the expression is actually executed. This is a major difference from the multi-dimensional array interface, where all variables are assigned values upon creation so that the computation can be run step by step.

Neural network models can be constructed using symbolic expressions in MXNet. Figure 4.2 shows the construction of a multilayer perceptron by chaining the input data variable and several layer-construction operators. Note that even though we could write a deep learning program using only the multi-dimensional array interface, symbolic expressions still provide several advantages, especially for deep neural network construction. First, it is convenient to save a symbolic expression to disk and load it from any front-end language supported by MXNet. Second, manipulating a symbolic expression incurs little overhead, which is useful when dealing with large scale deep neural networks. Third, all the computations are known before the program is executed, which makes it possible to optimize the training process ahead of time.

data = mx.symbol.Variable('data')
fc1  = mx.symbol.FullyConnected(data, num_hidden=64)
relu = mx.symbol.Activation(fc1, act_type='relu')
fc2  = mx.symbol.FullyConnected(relu, num_hidden=10)
mlp  = mx.symbol.SoftmaxOutput(fc2, name='softmax')

Figure 4.2: Example: define a multilayer perceptron using a symbolic expression in MXNet.

mod = mx.mod.Module(mlp, context=mx.gpu())
mod.bind(data_shapes=[('data', (2, 2))],
         label_shapes=[('softmax_label', (2,))])
mod.init_params(initializer=mx.init.Uniform(1.0))  # uniform in [-1, 1]
mod.init_optimizer(optimizer='sgd',
                   optimizer_params=(('learning_rate', 0.1),))
batch = mx.io.DataBatch(data=[b], label=[mx.ndarray.zeros((2,))])
mod.forward(batch)   # forward pass
mod.backward()       # back-propagate gradients
mod.update()         # apply one SGD step

Figure 4.3: Example: create and run a module in MXNet.

Computation Module Despite their efficiency and portability, declarative symbolic expressions are less straightforward to compute with than imperative multi-dimensional arrays. MXNet therefore provides Module to simplify the execution of symbolic expressions. A module works as a computation machine: it accepts a symbolic expression together with values bound to its variables, and provides functions such as forward to compute the output, backward to compute the gradients with respect to the model parameters, and update to apply the gradients to the model parameters. Figure 4.3 shows an example of creating a module for the multilayer perceptron defined in Figure 4.2 and running one iteration of stochastic gradient descent on a GPU, using the multi-dimensional array defined in Figure 4.1 as the training data.

4.3 Back-End System

MXNet features a division between the front-end programming interface and the back-end system. In MXNet, the workloads created by the front-end interface are pushed into the back-end system for optimization and execution. This division is motivated by how a compiler works: a compiler translates the source code into an intermediate representation, which is then transformed and optimized to obtain the target program. The back-end system of MXNet is similar to a compiler. It first represents the workloads, which arrive in the form of multi-dimensional arrays and symbolic expressions, as computation graphs, the intermediate representation of the system; these computation graphs are then transformed and optimized for more efficient execution on the target devices; finally, the back-end system schedules the execution of the computation graphs.


Figure 4.4: A partial computation graph for the forward and the backward of a fully connected neural network. Yellow circles and green rectangles represent data variables and operators, respectively. Arrows indicate data dependencies between variables and operators.

4.3.1 Computation Graph

Computation graphs serve as the intermediate representation of workloads issued from the front-end to the back-end. A computation graph is usually defined as a directed bipartite graph consisting of two sets of nodes, which represent the variables and the operators. A variable is associated with a storage unit: it can write a value to the storage unit and output the value it has been assigned. An operator defines the computation. It takes one or multiple variables as inputs and generates one or multiple outputs. In particular, an input variable can also be used as an output variable. In a computation graph, an edge connecting a variable node a and an operator node b means that the variable a is an input or an output of operator b. For example, Figure 4.4 shows part of the computation graph of both the forward and the backward operations for the multilayer perceptron defined in Figure 4.2.
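To make the bipartite structure concrete, here is a toy Python sketch of how such a graph could be represented; the class and field names are hypothetical and do not correspond to MXNet's internal data structures.

# Hypothetical, simplified bipartite representation of a computation graph:
# variable nodes own storage, operator nodes list their input and output
# variables, and the graph edges are implied by those lists.
class Variable:
    def __init__(self, name):
        self.name = name
        self.storage = None          # assigned later during memory planning

class Operator:
    def __init__(self, name, inputs, outputs):
        self.name = name
        self.inputs = inputs         # list of Variable nodes
        self.outputs = outputs       # list of Variable nodes

# Forward part of the perceptron in Figure 4.2: data -> fullc -> relu
data, w, b, fc1, relu = (Variable(n) for n in ('data', 'W', 'b', 'fc1', 'relu'))
graph = [
    Operator('FullyConnected', inputs=[data, w, b], outputs=[fc1]),
    Operator('Activation',     inputs=[fc1],        outputs=[relu]),
]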

4.3.2 Graph Transformation and Execution

Before being executed, the computation graphs first go through a number of transformations, mainly for the following purposes. First, the graph received from the front-end may need additional information before it can be executed, such as the memory allocated to each variable and the device on which to run each operator. Such information can be added during graph transformation. Graph transformation also helps to reduce the memory footprint and facilitates graph optimization for more efficient execution. The back-end system of MXNet implements a variety of graph transformation techniques. Next, we discuss a few implementation details to illustrate the effectiveness of graph transformation.

Memory Optimization Naively allocating a different memory unit to each variable may result in a large memory working set, which is unfavorable for large scale computation. For a given computation graph, observe that the lifetime of each variable, namely the period between its creation and the last time the variable is used, can be calculated before execution. This suggests a more economical way of using memory: variables whose lifetimes do not overlap can share the same storage unit. However, for a computation graph with n variables, obtaining the optimal memory allocation strategy has time complexity in the order of O(n^2). Instead, we propose two heuristics for finding non-overlapping variables in order to obtain a suboptimal memory allocation scheme. Both heuristics have time complexity linear in n. The first heuristic, called in-place, simulates the graph execution. This method keeps track of the lifetime of each variable, and the memory of a variable is recycled when the variable will not be used by any operator later. The second heuristic, called co-share, finds the variables which can share the same memory space with the least conflicts. For more details of these heuristics, one can refer to our paper [31].
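The following is a simplified sketch of the in-place heuristic, using the toy graph representation sketched in Section 4.3.1; the function is hypothetical and only illustrates the idea of recycling storage once a variable's last consumer has run.

# Walk the operators in topological order, keep a count of how many later
# operators still read each variable, and hand freed storage to new outputs.
def plan_memory(ops, consumers):
    """consumers[v]: number of operators that will still read variable v."""
    free_pool, storage_of, next_id = [], {}, 0
    for op in ops:
        for out in op.outputs:                 # allocate or reuse a storage unit
            if free_pool:
                storage_of[out] = free_pool.pop()
            else:
                storage_of[out] = next_id
                next_id += 1
        for inp in op.inputs:                  # recycle inputs that are now dead
            consumers[inp] -= 1
            if consumers[inp] == 0 and inp in storage_of:
                free_pool.append(storage_of[inp])
    return storage_of                          # maps variable -> storage unit id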

Fusion Operations Sometimes the overhead of executing an operator, such as memory reads / writes and the function call, may dominate the actual computation cost. In such situations, we combine several operators into a super operator to reduce this overhead. For example, the normal computation graph for the function (a × b + c) involves two operators, namely × and +, and one temporary variable to store the intermediate result of (a × b). In the fused version, we can define a single operator which takes a, b and c as inputs and outputs the result directly. Besides eliminating the temporary variable and one function call, the fused operator can even use the fused multiply-add instruction supported by the specific hardware to further decrease the number of instructions. Note that operator fusion usually consists of two steps: first, a subgraph is merged into a single fused node; then, the code for the fused node is generated, which can be compiled and executed at runtime.
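Below is a toy illustration of the graph-level rewrite, using Numpy only to check the result; a real back end would instead generate and compile a single kernel for the fused node, so that the hardware's fused multiply-add can be used and no temporary is materialized.

import numpy as np

a, b, c = (np.random.rand(1024) for _ in range(3))

# Unfused graph: two operators and a temporary variable for a * b.
tmp = a * b
out_unfused = tmp + c

# Fused version: a single "super operator" that produces the result directly.
def fused_multiply_add(a, b, c):
    return a * b + c        # one logical node in the transformed graph

assert np.allclose(out_unfused, fused_multiply_add(a, b, c))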

Lazy Evaluation After issuing the workloads in the form of computation graphs to the back-end system, the front-end thread returns immediately without waiting for the execution to finish. We call this lazy evaluation. It allows the back-end system to batch process multiple graphs, which potentially improves graph optimization and parallel execution. For example, to execute the two multi-dimensional array operations t = a × b and c = t + c, with lazy evaluation we can merge the two into a single operator using operator fusion. Note that lazy evaluation does not affect the semantics of the programming interface. If the outputs of the operators are needed for other tasks, such as copying to non-MXNet variables, the front-end implicitly calls the 'wait to read' function, which blocks the thread until the outputs are ready.
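A small sketch of how lazy evaluation looks from the front-end, assuming the mxnet Python package; the operation below is only queued to the back-end engine, and the explicitly blocking calls synchronize with it.

import mxnet as mx

a = mx.nd.ones((1000, 1000))
# The front-end returns immediately; the multiplication has only been
# pushed to the back-end engine at this point.
c = a * 2
# Blocking calls such as wait_to_read(), or asnumpy() which waits
# implicitly, synchronize with the back end before the data is used.
c.wait_to_read()
print(c.asnumpy()[0, 0])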

Parallel Execution The transformed and optimized computation graphs are then pushed into the back-end engine for execution. All the engine threads execute the operators in the graph in parallel as long as the graph dependencies are satisfied. Parallel execution is particularly important when there are multiple connected computation devices in a heterogeneous computing environment.

4.4 Data Communication

In a distributed computing setting, data are communicated between devices. How to reduce the communication overhead is a main issue addressed in this thesis. As deep learning applications usually involve large scale computation and are well suited to distributed computing, in this section we discuss data communication in MXNet for both the front-end interface and the back-end implementation. The main feature of MXNet in terms of data communication is KVStore. This distributed key-value store module serves to reduce the data communication overhead. Compared to the low-level function copyto, which is called to move data between devices explicitly, KVStore provides a high-level abstract interface for communication between devices.

4.4.1 Distributed Key-Value Store

KVStore maintains and updates a list of (key, value) pairs across multiple devices. The key can be an integer or a string associated with the value, which is often a multi-dimensional array. A device can either push a list of pairs into the key-value store or pull back the values corresponding to a given list of keys. A customized update function can be registered with the store. The update function specifies how the pushed values are merged into the store. By default the "add to" updater is used, namely the new values are added to the values in the store. Figure 4.5 shows how to implement stochastic gradient descent (SGD) with KVStore. It differs from the implementation in Figure 4.3 in two ways. First, we register the SGD updater with the store instead of running it in the computation module. Second, we pull the current parameters from the store before the forward pass, and then push the results, i.e. the gradients, into the store after the backward pass to update the model parameters. When the code in Figure 4.5 is executed using multiple computing devices located in different machines, KVStore sums over the gradients pushed by the different devices before calling the update function. With this sequential data consistency model, any pull operation after the push operation is guaranteed to access the updated model parameters. Therefore, if we partition the data into m parts and assign them to m devices, the result we obtain is equivalent to that of running the same algorithm on a single device. This approach is often called data parallelism. Besides the sequential data consistency model, KVStore also supports the other relaxed data consistency models discussed in Chapter 3. For example, with the asynchronous consistency model, KVStore does not wait until all the gradients are received. It updates the model parameters immediately upon receiving the gradients from each device.

4.4.2 Implementation of KVStore

We implemented KVStore based on the parameter server architecture presented in Chapter 3. Here, we need two major modifications. First, since both push and pull are issued by the front-end as operator nodes to the back-end system, they can be further optimized for execution and do not need to be executed immediately. Second, since the device-to-device communication bandwidth within a machine can be 10 times higher than the machine-to-machine bandwidth (see Section 2.1), we adopt a two-level server structure to better exploit the communication hierarchy.

kv = mx.kvstore.create()

def update(key, grad, weight):
    weight += 0.1 * grad

kv.set_updater(update)
for key, weight in enumerate(model.get_params()):
    kv.pull(key, weight)
mod.forward(data)
mod.backward()
for key, grad in enumerate(model.get_gradients()):
    kv.push(key, grad)

Figure 4.5: Run one SGD iteration with the key-value store.


Figure 4.6: Two-level parameter server for KVStore. The level 1 server nodes aggregate data over devices on the same machine, and the level 2 server nodes communicate data between machines.

The two-level architecture is illustrated in Figure 4.6. Each machine may have multiple computing devices, and each device has one worker node. At level 1, each machine has one aggregation server node for communication between the devices on this machine. At level 2, another layer of server nodes connects the aggregation server nodes from level 1. These server nodes are responsible for communication between different machines. For example, when running SGD, the level 1 servers sum all the data pushed by the devices on each machine before sending them to the level 2 servers; similarly, the level 1 servers pull back data and broadcast them to the corresponding devices. In this way, the inter-machine traffic can be greatly reduced, at the cost of the latency caused by data aggregation within the machines.



Figure 4.7: Two-level parameter server for KVStore, where each device has a level-1 server.

Recall that devices within a machine may have higher peer-to-peer communication bandwidth than device-to-CPU communication bandwidth. Figure 4.7 shows an architecture which leverages this property. Instead of associating a single server node with each machine, we run a server node on each device. In this setting, the communication between the level 1 server nodes and the worker nodes in the same machine is through peer-to-peer communication, which is much faster than device-to-CPU communication. Note that this comes at the cost of more memory usage.

4.5 Evaluation

In this section, we provide experimental results to demonstrate the performance of MXNet. We focus on its computational efficiency in the distributed computing setting. Other experiments, including the performance on a single device, can be found in our paper [31]. For the training process, we run synchronized stochastic gradient descent (SGD) on the convolutional neural network model ResNet [65]. The ResNet we use has 152 convolutional layers. It has in total 157 data variables that need to be communicated between machines in distributed computing, and the total size of these variables is 240 MB. All the experiments are run on the AWS EC2 cloud service using P2.16xlarge instances. Each instance is a virtual machine equipped with 8 Nvidia Tesla K80 accelerators, and each accelerator has two GPUs. We use the Deep Learning AMI version 1.4 from the AWS marketplace, where CUDA 7.5 and cuDNN 5.1 are pre-installed. In terms of communication, there are two PCIe 3.0 16x switches, and each switch connects 4 K80 accelerators. The GPUs share the same PCIe switches to communicate with the CPU. All the machines are connected by Ethernet with a bandwidth of 25 Gbit/sec.



Figure 4.8: The topology of GPU connections for P2.16xlarge. Each line indicates a PCIe 16x connection.

4.5.1 Multiple GPUs on a Single Machine

We first show the results of running MXNet with multiple GPUs on a single machine (an AWS EC2 P2.16xlarge instance). We examine the communication cost, namely the time spent on pushing and pulling synthetic data in each iteration, and the total cost, which is the total elapsed time during which the gradient computation, parameter updates, and data communication take place. For synchronized SGD, the batch size is the number of examples processed in each iteration. In our experiments, we fix the batch size of examples processed by every GPU while increasing the number of GPUs from 1 to 16. Thus the total batch size of SGD increases with the number of GPUs. Figure 4.9 shows the communication cost versus the total cost when running one iteration of SGD on ResNet-152. The red line corresponds to the communication cost. Note that the communication cost scales almost linearly with the number of GPUs when the number of GPUs is in the range between 2 and 8. To run one SGD iteration with 8 GPUs, each worker node at a GPU sends 240 MB of data variables, while the 8 server nodes send 240 MB × 8 of data variables in total. In Figure 4.9, we observe a communication cost of 0.1s for 8 GPUs. This means that the uni-directional aggregate GPU bandwidth is at least 38.4 GB/sec for 8 GPUs, which achieves over 60% of the theoretical maximum uni-directional aggregate bandwidth of 63 GB/sec for the 4 shared PCIe 16x connections. In Figure 4.9, there is a 4x increase in the communication cost when the number of GPUs increases from 8 to 16. The reason is twofold. First, each GPU can only communicate via peer-to-peer connection with at most 8 other GPUs. Assume that GPU 0 can communicate with GPUs 1-8 via the PCIe switches directly. If GPU 0 needs to communicate with GPUs 9-15, the communication has to be routed through the CPU, which incurs additional communication overhead. Second, even when a peer-to-peer connection is available, for example when GPU 0 communicates with GPU 7, the single PCIe 16x connection between the two PCIe switches becomes the bottleneck. The other colored lines in Figure 4.9 correspond to the total cost for different batch sizes per GPU. For each line, since we fix the batch size per GPU, ideally the total cost of one SGD iteration stays the same as the number of GPUs increases, namely the ideal scalability is linear in the number of GPUs.


Figure 4.9: The communication cost and total cost of one SGD iteration on ResNet-152. Experiments are performed on a single machine, and the number of GPUs is 1, 2, 4, 8, or 16.

However, in practice the total cost may increase due to the growing communication cost as the number of GPUs increases. In Figure 4.9, observe that when the number of GPUs increases from 1 to 8, all the lines stay nearly flat, namely we achieve near-ideal scalability. This is due to the fact that in this range the communication cost is small and the total cost is dominated by the actual computation time. However, since the communication cost increases significantly when the number of GPUs is 16, we see that the total cost also slightly increases, especially for a small batch size per GPU.

4.5.2 Multiple GPUs on Multiple Machines

We also measure the communication cost and total cost of running MXNet on multiple machines. We plot the communication cost of one SGD iteration in Figure 4.10. In particular, for each colored line in the figure, we fix the number of GPUs used on each machine, and we plot the communication cost versus the total number of GPUs used as the number of machines increases over 2, 4, 8, 16. The red line corresponds to the setting where we use one GPU on each machine. The communication cost increases linearly from 0.22s to 0.68s when the number of machines increases from 1 to 16. Compare the red line with the other lines, for which we use 2, 4, or 8 GPUs on each machine, respectively. We observe that for a fixed number of machines, the communication cost only increases by less than 0.1s when we increase the number of GPUs per machine. This is far less than the linear growth of each line for a fixed number of GPUs per machine while the number of machines increases over 2, 4, 8, 16.


Figure 4.10: The communication cost of one SGD iteration for different numbers of machines (2, 4, 8, 16) and different numbers of GPUs per machine (1, 2, 4, 8, 16).

This comparison shows the efficiency of the two-level parameter server architecture when there are more GPUs per machine. In Figure 4.10, we also observe a large increase in the communication cost when the number of GPUs per machine is 16 (the orange line). This is similar to what we observed in Figure 4.9. The lack of full GPU peer-to-peer access requires a significant amount of shared CPU memory bandwidth, which slows down the data communication between machines. In Figure 4.11, we plot the communication cost and total cost of one SGD iteration for an increasing number of machines (1, 2, 4, 8, 16) with the number of GPUs per machine fixed to 8. Different colored lines correspond to different batch sizes per GPU. Similar to our observation of Figure 4.9, for small batch sizes (bs/GPU = 2, 4, 8), the total cost increases significantly with the number of GPUs, due to the fact that a large part of the total cost is the increasing communication cost. However, for a sufficiently large batch size (bs/GPU = 16), the actual computation cost is much higher than the communication cost, and we observe near-linear scalability.

4.5.3 Convergence

In Sections 4.5.1 and 4.5.2, we observed that a large SGD batch size helps to amortize the communication cost and leads to near-optimal scalability of the total cost. However, in practice a large batch usually slows down the algorithm convergence. In this section, we show some heuristics for tuning the SGD learning rate to improve the convergence speed when the batch size is large. A more detailed study can be found in Chapter 7. In our experiments, we train ResNet-152 on the ImageNet dataset [45]. As a baseline, we use 8 GPUs with a batch size of 32 per GPU, namely an aggregate batch size of 256 for the algorithm.


Figure 4.11: The communication cost and total cost of one SGD iteration. Experiments are performed on multiple machines (1,2,4,8,16), and the number of GPUs per machine is fixed to be 8.

We tune the learning rate as follows: we start the training process with a learning rate of 0.1; then we reduce the learning rate by a factor of 10 at epochs 30, 60, and 90, respectively; and we further stop data augmentation at epoch 100. As shown in Figure 4.12, for the baseline we obtain 77.8% top-1 validation accuracy at convergence (at epoch 110), which matches the reported result of 77% top-1 validation accuracy (at epoch 90) in [65]. We then increase the number of GPUs from 8 to 80, and thus the aggregate batch size increases from 256 to 2,560 (2K). We change the initial learning rate from 0.1 to 0.5. For 160 GPUs and an aggregate batch size of 5,120 (5K), we further increase the initial learning rate to 1 and defer the first learning rate reduction from epoch 30 to epoch 50. Intuitively, these heuristics introduce more "variance" into the training process. In Figure 4.12, we plot the top-1 validation accuracy versus the number of SGD epochs. Observe that with our carefully tuned learning rate, the batch size does not significantly affect the algorithm convergence. In particular, the convergence for batch size 2K is very close to that of the baseline; for batch size 5K, despite the unstable convergence at the beginning of the training process, it converges to the same final accuracy as the baseline as the training progresses.
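For concreteness, a minimal sketch of the step-wise learning rate schedule described above; the function name and signature are ours, and the large-batch runs simply pass different base rates and decay epochs.

def learning_rate(epoch, base_lr=0.1, decay_epochs=(30, 60, 90), factor=0.1):
    # Multiply the base rate by `factor` at every boundary that has passed.
    lr = base_lr
    for boundary in decay_epochs:
        if epoch >= boundary:
            lr *= factor
    return lr

# Baseline: 0.1 until epoch 30, then 0.01, 0.001, 0.0001.
# 5K-batch run (sketch): learning_rate(epoch, base_lr=1.0, decay_epochs=(50, 60, 90))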

4.6 Discussions

MXNet is an efficient and flexible deep learning library. To reduce communication cost, it features a simplified communication interface supported by a two-level parameter server architecture.


Figure 4.12: Top-1 validation accuracy versus epoch for ResNet-152 on the ImageNet dataset. Each GPU uses batch size 32 and synchronized SGD is used.

We have demonstrated encouraging experimental results for MXNet in a large scale distributed computing environment. MXNet is an ongoing project. We started it in 2015 based on two earlier projects, CXXNet [46] and Minerva [36]. MXNet inherits the Numpy-like interface and features such as lazy evaluation, and develops a new front-end with a mixed interface and multi-language support, as well as a new back-end with memory optimization and distributed computing. Moving forward, for the front-end interface we are making it easier to use, as ease of use is of great importance for users new to a framework in today's fast growing deep learning community. For the back-end system, we will continue to explore compiler and database technologies and use them in MXNet to develop a more efficient back-end implementation. Moreover, with a modularized system design in which the system components can be decoupled, MXNet will continue to evolve and promptly adapt to evolving neural network models and hardware devices.



Part II

Algorithm


Chapter 5

Preliminaries on Optimization Methods for Machine Learning

In this chapter, we provide some background information on the interplay between optimization methods and machine learning, which will be helpful for our discussion in the second part of this thesis. For a thorough overview of modern optimization methods, the reader is referred to the textbooks [17, 22, 106]. The books [19, 126] contain more detailed discussions of machine learning and large scale optimization. In the rest of this chapter, we first present the generic optimization formulation that is central to the model training of various machine learning problems, and overview a few standard techniques for solving it. Then we introduce some terminology used in the convergence analysis of iterative methods. Finally, we discuss distributed optimization, which is vital for large scale optimization problems.

5.1 Optimization Methods

For many machine learning applications, the model training problem can be formulated as the following optimization problem:

\operatorname*{minimize}_{w \in \Omega} \; f(w) = \frac{1}{n} \sum_{i=1}^{n} f_i(w),   (5.1)

where w denotes the model parameter and Ω denotes the parameter space. The function f_i(·) is the cost function evaluated on the i-th example in the training data, describing how well the model w fits that particular example. For general cost functions, there is usually no closed form solution to the above optimization problem, and we need to resort to numerical methods. One class of commonly used numerical methods is the iterative gradient descent method. As its name suggests, this method uses the gradient information of the objective function f(·) at different points to iteratively update the model parameter w. It usually starts with an initial point w_0 ∈ Ω at t = 0, and then updates

the model parameter according to the following update rule at iteration t:

w_{t+1} = \mathrm{proj}_{\Omega}\Big[ w_t - H_t \sum_{i \in I_t} \partial f_i(w_t) \Big], \quad I_t \subseteq \{1, \dots, n\},   (5.2)

where ∂f_i is the partial gradient of f_i with respect to the model parameter w, H_t is a scaling matrix or scalar, and I_t is the subset of example indices processed at iteration t. Different choices of H_t and I_t lead to different optimization methods. For instance, in the standard gradient descent method, we set I_t to be all the training data and set H_t to be a constant scalar. The iterations terminate when a stopping criterion is reached, for example when the progress on the objective function falls below a certain threshold. Next, we discuss several commonly used variants of the iterative gradient descent method.

Batch versus Stochastic In a batch update, all the examples in the training data are used to compute the gradient in each iteration, namely I_t = {1, . . . , n}. In a stochastic gradient update, in each iteration one usually samples only a few random examples and computes the gradient information of the cost function on these examples. Compared to the batch update, the gradient information used in stochastic methods is noisier due to the fact that only a few examples are sampled. However, for the same reason, each iteration of the stochastic update is much faster, as we only need to compute the gradients on the samples. In a fixed amount of time, more iterations of the stochastic update can be run.
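A small Numpy sketch of the generic update rule (5.2), with a scalar H_t, the projection omitted (Ω = R^p), and the index set I_t chosen either as the full dataset (batch) or as a random mini-batch (stochastic); the toy cost functions are ours.

import numpy as np

def gradient_step(w, grads, eta, index_set):
    # One update of rule (5.2): w <- w - eta * sum of the selected gradients.
    g = sum(grads[i](w) for i in index_set)
    return w - eta * g

# Toy problem: f_i(w) = 0.5 * (w - x_i)^2, so the gradient is w - x_i.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
grads = [lambda w, xi=xi: w - xi for xi in x]

w = 0.0
w = gradient_step(w, grads, eta=0.01, index_set=range(len(x)))           # batch
w = gradient_step(w, grads, eta=0.01,
                  index_set=rng.choice(len(x), size=10, replace=False))  # stochastic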

Jacobi versus Gauss-Seidel When the model size is large, or equivalently the number of coordinates of the parameter w is large, it may not be necessary to compute the gradient with respect to every coordinate and update every coordinate in each iteration. In the Jacobi method, all the coordinates of w are updated in each iteration. In the Gauss-Seidel method, a subset of the coordinates of w is selected in each iteration. The gradient information of the objective function with respect to this subset of coordinates is obtained and used to update these coordinates. Note that this may be done much faster than computing the gradients with respect to all the coordinates, especially when the model size is large. One special case of the Gauss-Seidel method is called coordinate descent, where in each iteration only one coordinate of the model parameter is selected and updated.
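A sketch of a Gauss-Seidel style block update in the same toy setting: only the gradient entries of the chosen block are computed and only those coordinates change; grad_block is a hypothetical helper returning the partial gradient on a block.

import numpy as np

def block_update(w, grad_block, block, eta):
    w = w.copy()
    w[block] -= eta * grad_block(w, block)     # touch only the selected block
    return w

# Example: f(w) = 0.5 * ||w - 1||^2, whose partial gradient is (w - 1)[block].
grad_block = lambda w, block: (w - 1.0)[block]
w = np.zeros(6)
w = block_update(w, grad_block, block=np.array([0, 2, 4]), eta=0.5)
# Coordinate descent is the special case of a single-index block.
w = block_update(w, grad_block, block=np.array([1]), eta=0.5)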

Proximal Gradient Method The proximal gradient method [34, 108] is often used to handle non-differentiable objective functions. Assume that the objective function f(·) can be decomposed into two parts, f(w) = ℓ(w) + h(w), where ℓ(·) is differentiable and h(·) is not. The proximal gradient method updates the parameter w in two steps: a forward step that performs steepest gradient descent only on ℓ(·), and a backward step that carries out a projection using h(·). More specifically, the t-th iteration with a learning rate γ_t can be written as:

w_{t+1} = \mathrm{Prox}^{U}_{\gamma_t}\big[ w_t - \gamma_t \partial \ell(w_t) \big],   (5.3)

where the generalized proximal operator for h(·) is defined to be:

\mathrm{Prox}^{U}_{\gamma}(x) := \operatorname*{argmin}_{y \in \Omega} \; h(y) + \frac{1}{2\gamma} \|x - y\|_U^2.   (5.4)


The Mahalanobis norm \|x\|_U^2 := x^\top U x is taken with respect to a predefined positive semidefinite matrix U. A sophisticated choice of U can accelerate the convergence of the iterative algorithm.
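For example, when h(y) = λ‖y‖_1 and U is the identity, the generalized proximal operator (5.4) reduces to element-wise soft thresholding, and one proximal gradient step (5.3) takes the following form (a minimal Numpy sketch; grad_l is a placeholder for the gradient of ℓ).

import numpy as np

def prox_l1(x, gamma, lam):
    # Soft thresholding: the proximal operator of lam*||.||_1 with step gamma.
    return np.sign(x) * np.maximum(np.abs(x) - gamma * lam, 0.0)

def proximal_gradient_step(w, grad_l, gamma, lam):
    # Forward step on the smooth part, backward (prox) step on the l1 part.
    return prox_l1(w - gamma * grad_l(w), gamma, lam)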

Approximate Second Order Methods There are situations in which standard gradient descent fails to make sufficient progress on the objective function over the iterations. For example, if the condition number of the Hessian matrix, evaluated at the points along the iteration path, is too large, the gradients, which point in the direction of maximum change of the objective function, are not well aligned with the directions pointing to the minimum solution, and it would take too many iterations for gradient descent to converge to that minimum. In this case, second-order information of the objective function can be used to improve the convergence. In the classic Newton method, the inverse of the Hessian matrix is used as the scaling matrix, namely H_t = \big( \sum_{i \in I_t} \partial^2 f_i(w_t) \big)^{-1}. However, for large scale models, computing such a scaling matrix requires the inversion of a giant matrix in each iteration, which is often too expensive in practice. There are different approaches to approximating the Hessian matrix and the scaling matrix. One straightforward approximation is to compute only the diagonal entries of the Hessian matrix while setting the off-diagonal entries to zero. Another widely used approach is to approximate the Hessian matrix using the gradients. For example, L-BFGS [89] forms an approximation of the Hessian matrix using the gradients over the past iterations.

5.2 Convergence Analysis

For general objective functions, there is usually no guarantee that iterative gradient descent methods will converge to the minimum of the optimization problem in (5.1). The best hope is that the sequence of model parameters {w_0, w_1, ...} converges to a "good enough" solution. We list three types of "good" solutions, characterized by the value of the objective function and the gradient at the solution ŵ.
1. Global minimum: for any w ∈ Ω, it holds that f(w) ≥ f(ŵ).
2. Local minimum: for a fixed radius ε > 0 and any w ∈ Ω with ‖w − ŵ‖ < ε, it holds that f(w) ≥ f(ŵ).
3. Stationary point: the gradient vanishes, namely ∂f(ŵ) = 0.
Note that in the special case where the objective function is convex, the above characterizations can be shown to be equivalent. Convergence analysis studies the conditions under which an iterative algorithm converges to a good solution. For gradient based iterative algorithms with general objective functions, we usually aim to prove convergence to a stationary point. We are interested in the convergence rate of iterative algorithms, defined as the speed at which the convergent sequence approaches its limit. It is often used as a performance measure to compare different algorithms. Consider a strongly convex objective function for which the global optimal solution is denoted by w^*. For batch gradient descent, it can be shown that there exists a constant ρ ∈ (0, 1) such that at iteration t, we can bound

f(w_t) − f(w^*) ≤ O(ρ^t).

In other words, to find a solution ŵ such that the objective function evaluated at ŵ is at most f(w^*) + ε, the number of iterations required is in the order of O(log(1/ε)). We call this a linear convergence rate. For stochastic gradient descent, the convergence is also stochastic due to the randomness of the algorithm. One can show that, in expectation,

E[f(w_t)] − f(w^*) ≤ O(1/t).

Namely, for stochastic gradient descent to achieve the same value of the objective function, the number of iterations needed is in the order of O(1/ε). However, recall that one iteration of SGD is often much cheaper than an iteration of batch gradient descent. The performance measure of convergence rate should therefore be used with caution. In practice, we also need to compare the actual runtimes of two iterative algorithms.

5.3 Distributed Optimization

Today, many machine learning problems are large scale. When the size of the dataset and/or the model is too large, even one iteration of the gradient descent method cannot be computed in reasonable time by a single machine. Distributed optimization leverages multiple machines to process the computation in a parallelized or decentralized fashion. There are two major degrees of freedom in designing a distributed optimization method: first, how to partition the computational workloads and assign them to different machines; second, how to communicate between machines in order to synchronize the computation results and ensure the convergence of the algorithm.

5.3.1 Data Parallelism versus Model Parallelism

There are two common approaches to partitioning the computational workloads: data parallelism and model parallelism. In data parallelism, the examples in the training data are partitioned and assigned to different machines. Take the batch gradient descent method as an example. Each machine only computes the gradient of the cost function evaluated on the examples assigned to it. The gradient computation can be done in parallel, and these local gradients are then merged to obtain the gradient of the objective function. Model parallelism partitions the coordinates of the model parameter w and assigns them to different machines, while each machine has access to the entire training data. In each iteration of gradient descent, each machine computes the gradients of the objective function with respect to the coordinates assigned to it. Data parallelism is mostly used when the size of the training data is too large to fit into a single machine, while model parallelism is often used when the number of model parameters is large.
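A small Numpy sketch of a data-parallel batch gradient step: the training examples are split into shards, each shard's gradient is computed independently (on its own machine in a real system), and the local gradients are summed before the update. The helper names are ours.

import numpy as np

def data_parallel_step(w, X, y, grad_fn, eta, num_machines):
    shards_X = np.array_split(X, num_machines)
    shards_y = np.array_split(y, num_machines)
    # Each term below would be computed on a different machine; the sum is
    # the merge step performed by the servers (or an all-reduce).
    g = sum(grad_fn(w, Xs, ys) for Xs, ys in zip(shards_X, shards_y))
    return w - eta * g / len(X)

# Least-squares example: the gradient of 0.5 * ||X w - y||^2 on one shard.
grad_fn = lambda w, Xs, ys: Xs.T @ (Xs @ w - ys)
X, y = np.random.rand(1000, 5), np.random.rand(1000)
w = data_parallel_step(np.zeros(5), X, y, grad_fn, eta=0.1, num_machines=4)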

5.3.2 Synchronous Update versus Asynchronous Update

While multiple machines perform local computation in parallel, if the original optimization problem is not perfectly decoupled, the machines need to communicate with each other about the

results of their local computation across iterations. For example, in data parallelism, the gradients of the locally assigned examples are computed by every machine in each iteration, and the model parameter updated in the last iteration is needed to compute the local gradients. Synchronous update means that in each iteration the model is updated only after the local computation at every machine has completed. The same convergence result as in the single machine setting is guaranteed. However, it potentially incurs a large latency in each iteration, especially when a few machines are particularly slow, in which case the computing power of the machines may not be fully utilized. Asynchronous update allows different machines to use stale information about the computation results from other machines for their local computation in the current iteration, without waiting for all the machines to complete their local computation. For example, in model parallelism, one machine can compute its local gradients using an outdated model without waiting for the most recent update from every other machine. Asynchronous update can greatly reduce the synchronization cost. However, if not carefully designed, it may also slow down the convergence of the algorithm.



Chapter 6

DBPG: Delayed Block Proximal Gradient Method

6.1 Introduction

The objective function in the optimization for many model training problems can be decomposed into two parts: a loss and a regularization. The loss measures how well the model fits the data, while the regularization penalizes overly complex models. In this chapter, we consider the following objective function, which decomposes into the loss function ℓ(·) and the regularization term h(·):

\operatorname*{minimize}_{w \in \Omega} \; f(w) = \ell(w) + h(w), \quad w \in \mathbb{R}^p.   (6.1)

We assume that the loss function ℓ : Ω → ℝ is continuously differentiable but not necessarily convex, and that the regularizer h : Ω → ℝ is convex, left-side continuous, block separable, but possibly non-smooth. We propose a proximal gradient method [108] based algorithm, called Delayed Block Proximal Gradient (DBPG), to solve the above optimization problem. Our algorithm is significantly more efficient than existing algorithms, especially in the regime where the dimension of w is high (millions or billions of coordinates) and the data is sparse (with many zero entries). Its efficiency comes from the following design choices:
• Taking advantage of the fact that (block) Gauss-Seidel updates are more efficient on sparse data [116, 151], our algorithm only updates a subset (block) of the coordinates in each iteration.
• The proximal operator uses coordinate-specific learning rates to adapt the convergence progress to the sparsity pattern inherent in the data.
• In the setting of distributed computing, we adopt asynchronous communication. To reduce network traffic, each worker node only maintains partially consistent model parameters, and only the coordinates that would change the associated model weights are communicated.
We demonstrate the efficiency of our algorithm by applying it to two challenging problems: first, a non-smooth ℓ1-regularized logistic regression on sparse text datasets with over 100 billion examples and features; second, a non-convex and non-smooth ICA reconstruction problem [80], where the goal is to extract billions of sparse features from dense image data.

Algorithm 5 Delayed Block Proximal Gradient Method (DBPG) for solving (6.1)
Scheduler:
1: Partition the examples into m parts: ∪_{k=1}^{m} I_k = {1, . . . , n}
2: Partition the parameters into b blocks: ∪_{k=1}^{b} B_k = {1, . . . , p}
3: for iteration t = 1 to T do
4:   Randomly pick a block b_t and issue the task to the workers
5: end for
Worker k at iteration t:
1: Wait until all iterations before t − τ are finished
2: Compute the gradient g_t^{(k)} = Σ_{i∈I_k} ∂ℓ_i(w_t^{(k)}) and the coordinate-specific learning rates u_t^{(k)} on block b_t
3: Push g_t^{(k)} and u_t^{(k)} to the servers with user-defined filters, e.g., the random skip or the KKT filter
4: Pull w_{t+1}^{(k)} from the servers with user-defined filters, e.g., the significantly-modified filter
Servers at iteration t:
1: Aggregate g_t and u_t
2: Update w_{t+1} by solving the generalized proximal operator (5.4) defined in Section 5.1 with U = diag(u_t)


Related work There is growing interest in asynchronous iterative algorithms for optimization. Shotgun [23] performs parallel coordinate descent for solving ℓ1-regularized optimization problems. Other methods [7, 79, 138, 157] partition the examples, distribute them over several machines, and update the model parameters in a data parallel fashion. Lock-free variants were proposed in Hogwild! [113]. Mixed variants, which partition both data and parameters into non-overlapping components, were introduced in [137], albeit at the price of having to move or replicate data across several machines. Lastly, the NIPS framework [125] discusses general non-convex approximate proximal methods. The proposed algorithm differs from the existing approaches mainly in two aspects. First, we focus on solving large scale problems. Given the sheer size of the training data and the limited network bandwidth, neither the shared memory approach of Shotgun and Hogwild! nor the approach of moving the entire dataset during the training process is applicable. We carry out experiments at a scale that is many orders of magnitude larger. Second, we aim at solving general non-convex and non-smooth objective functions. We obtain convergence rate results under weaker assumptions than those in [125].

6.2 Delayed Block Proximal Gradient Method

6.2.1 Proposed Algorithm

Algorithm 5 gives a sketch of the proposed algorithm, named Delayed Block Proximal Gradient (DBPG). It takes advantage of the opportunities offered by our Parameter Server framework (see Chapter 3) to handle high-dimensional sparse data. Next, we highlight how our algorithm substantially differs from standard methods such as gradient descent and distributed gradient descent.
1. Only a subset of the parameters is updated in each iteration.
2. The worker nodes compute both local gradients and coordinate-specific learning rates.
3. We use a bounded-delay model for the asynchronous iterations. Given the maximal delay τ, iteration t can be started without waiting for any previous iteration initiated in the time frame [t − τ, t] to be completed. Here, τ is a trade-off between data consistency and system performance. For τ = 0, it corresponds to the standard sequential update, while a larger τ allows more iterations to run in parallel, which potentially improves system efficiency.
4. To reduce the communication overhead, we apply user-defined filters to suppress the transmission of updates whose effect on the model is likely to be negligible.

6.2.2 Convergence Analysis

We introduce a few assumptions to simplify the convergence analysis of the proposed algorithm. Under the assumption that the loss function is additive, let ℓ_{I_k} = Σ_{i∈I_k} ℓ_i denote the loss function associated with the subset of training data assigned to worker node k. Consider the subset of parameters B_t chosen at iteration t. Assumption 1 below ensures that the gradients of ℓ_{I_k} with respect to a subset of the parameters are Lipschitz, and that the amount of "cross-talk" to other subsets of parameters is also bounded.

Assumption 1 (Block Lipschitz Continuity) There exist positive constants L_{var,k} and L_{cov,k} such that for any iteration t and all x, y ∈ Ω with x_i = y_i for any i ∉ B_t, we have

\| \partial_{B_t} \ell_{I_k}(x) - \partial_{B_t} \ell_{I_k}(y) \| \le L_{var,k} \| x - y \|, \quad \text{for } 1 \le k \le m;   (6.2a)
\| \partial_{B_s} \ell_{I_k}(x) - \partial_{B_t} \ell_{I_k}(y) \| \le L_{cov,k} \| x - y \|, \quad \text{for } 1 \le k \le m, \; t < s \le t + \tau,   (6.2b)

where ∂_B ℓ(x) denotes block B of ∂ℓ(x). Further define L_{var} := Σ_{k=1}^{m} L_{var,k} and L_{cov} := Σ_{k=1}^{m} L_{cov,k}. Theorem 2 shows that the proposed algorithm converges to a stationary point under the relaxed consistency model, provided that a suitable learning rate is chosen. Note that for general objective functions there is no guarantee of global optimality.

Theorem 2 Assume that in Algorithm 5, the updates are performed with a τ bounded-delay model, and assume that Assumption 1 holds. Also assume that we apply a random skip filter on pushing gradients and a significantly-modified filter on pulling weights with threshold O(t^{-1}).


Let M_t denote the minimal coordinate-specific learning rate at time t. Algorithm 5 converges to a stationary point in expectation if the learning rate γ_t satisfies

\gamma_t < \frac{M_t}{L_{var} + \tau L_{cov} + \epsilon}, \quad \text{for all } t > 0.   (6.3)

The detailed proof of the above theorem is given in Section 6.4. Intuitively, the difference between w_{t−τ} and w_t will be small near a stationary point. As a consequence of the Lipschitz assumption, the changes in the gradients will also vanish. The inexact gradient obtained from the delayed and inexact model is therefore likely to be a good approximation of the true gradient, so that the standard convergence results of proximal gradient methods can be applied. Note that, when the maximum delay τ increases, it may not be necessary to decrease the learning rate in order to guarantee convergence. Instead, one can carefully choose a block partition of the model parameters and their ordering. For example, if the features in a block are less correlated, then L_var decreases. If the block is less correlated with the previous blocks, then L_cov decreases. This was also exploited in [23, 113]. We apply two filters to reduce the data communication overhead. One filter randomly sets some gradient entries to 0, which is also known as "dropout" in the deep learning literature [129]. The second filter skips pulling the entries that have not been updated for more than O(t^{-1}) time slots. Surprisingly, our experiments show that applying these two filters in the iterations does not slow down the convergence.

6.3 Experiments

In this section, we demonstrate how the proposed DBPG algorithm can be used to solve challenging machine learning problems. We first consider non-smooth sparse logistic regression, and then we look into the more challenging Reconstruction ICA problem, whose objective function is both non-convex and non-smooth.

6.3.1 Sparse Logistic Regression

The objective function of sparse logistic regression is

\operatorname*{minimize}_{w \in \mathbb{R}^p} \; \sum_{i=1}^{n} \log\big(1 + \exp(-y_i \langle x_i, w \rangle)\big) + \lambda \|w\|_1,   (6.4)

where x_i is a p-dimensional vector and y_i ∈ {+1, −1}. Note that the ℓ1-regularizer is non-smooth at zero.

Implementation We implement Algorithm 5 on our proposed Parameter Server framework (see Chapter 3). In particular, we use upper bounds on the diagonal entries of the Hessian as the coordinate-specific learning rates [154]. The learning rate is chosen from the grid [0.2, 0.3, . . . , 0.9, 1], and we

70 January 4, 2017 DRAFT search for the largest learning rate that ensures algorithm convergence. Moreover, we design a Karush-Kuhn-Tucker (KKT) filter which works to skip updating inactive coordinates. Its prin- ciple is analogous to that of the active-set selection strategies of SVM optimization [71, 96]. In particular, if wi = 0 for the i-th coordinate, let gi denote the gradient with respect to the i-th coordinate at the current iteration. According to the optimality condition of the proximal opera- tor, also known as the soft-shrinkage operator, the model parameter wi remains 0 as long as the gradient |gi| ≤ λ, and thus the worker does not need to send the gradient gi to the server. We use an outdated value gˆi to approximate gi, and the i-th coordinate is skipped if |gˆi| ≤ λ − δ, where δ ∈ [0, λ] controls how aggressive the filtering is.

Performance on a Single Machine To the best of our knowledge, no open source system can handle sparse logistic regression at the scale considered in our work. GraphLab provides only a multi-threaded, single machine implementation, and MLbase, Petuum and REEF do not support sparse logistic regression (as confirmed with the authors in 4/2014). Therefore, we could only compare our implementation to other solvers in their multicore setting, where available, and limited to a relatively small dataset. More specifically, we compare our Parameter Server implementation of Algorithm 5 to Shotgun [23]1 on a single machine with 32 threads/workers. The performance of CDN [51] (single-threaded Shotgun) is also reported for reference. Figure 6.1 shows how the objective value decreases over time. All three implementations obtain similar objective values after 50 data passes; however, our algorithm is 4 times faster. We gain the speedup with clever data partitioning strategies. In each iteration, each thread of Shotgun only processes a single coordinate. The irregular pattern of the non-zero entries makes it hard to perform load balancing. On the highest-dimensional dataset, CTRa, Shotgun is even slower than the single-threaded implementation. In contrast, our Parameter Server implementation uses multi-threaded linear algebra operators on much larger blocks of training data, and this coarse-grained parallelization leads to better load balancing and an overall speedup. Also note that the proposed algorithm produces less sparse models than the other algorithms during the first few iterations, which might be due to our use of a large block size for the features. However, when close to convergence, the model produced by our algorithm has a sparsity level comparable to that of the other algorithms.

Performance on Multiple Machines We compare our Parameter Server implementation of the proposed algorithm with two special-purpose second generation parameter servers, named System A and System B, developed by a leading Internet company. Both systems adopt the sequential consistency model; System A uses a variant of L-BFGS [11, 89], while System B runs an algorithm that is similar to ours but does not allow any data delay (i.e., synchronous update). We again run the implementations until they reach the same convergence criterion. Figure 6.2 shows that System B outperforms System A due to its better algorithm. Our Parameter Server implementation is 2 times faster than System B for essentially the same algorithm.

1www.select.cs.cmu.edu/projects/shotgun/


Figure 6.1: Comparison between the Parameter Server implementation of Algorithm 5 and the Shotgun and CDN implementations: (a) dataset URL; (b) dataset KDD14; (c) 4 million examples sampled from CTRb. Each panel plots the relative objective value versus time (sec.) and the number of nonzero parameters versus the number of data passes.


Figure 6.2: Convergence of sparse logistic regression on the 636 TB CTRb dataset.

Figure 6.3: Time to reach the same convergence criterion under various allowed delays, split into computing and waiting time.

Figure 6.4: Percentage of coordinates skipped when using the KKT filter.

Figure 6.5: Speedup of the Parameter Server when increasing the number of workers with a fixed number of servers. The dataset is 340 million examples sampled from CTRb.

Figure 6.3 shows that increasing the allowed delay significantly decreases the waiting time at the cost of slightly slower convergence. The best trade-off is a delay of 8, which gives a 1.6x speedup compared to the sequential consistency model. Figure 6.4 demonstrates the effectiveness of the KKT filter. Observe that less than 10 minutes after the training process starts, our algorithm makes more than 95% of the coordinates inactive. This suggests that using the filter can potentially improve performance, as we do not need to compute and communicate the gradients of those inactive coordinates. Figure 6.5 demonstrates the scalability of our algorithm. It shows a 9x speedup when the

number of worker nodes increases from 16 to 256.

6.3.2 Reconstruction ICA

Reconstruction Independent Component Analysis (RICA) aims to find a sparse representation of the raw dataset. Compared to standard Independent Component Analysis (ICA), RICA allows an overcomplete solution [80]. The objective function of RICA, given below, has a non-convex loss function and a convex but non-smooth penalty function.

\operatorname*{minimize}_{W \in \mathbb{R}^{\ell \times p}} \; \sum_{i=1}^{n} \frac{1}{2} \big\| W^\top W x_i - x_i \big\|_2^2 + \lambda \| W x_i \|_1.   (6.5)

We denote the observations {x_i}_{i=1}^{n} ⊂ ℝ^p by the data matrix X = (x_1, . . . , x_n)^⊤ ∈ ℝ^{n×p}. The gradient of the loss function is given by:

\partial \ell(W) = W \big( (W^\top W - I) X^\top X + X^\top X (W^\top W - I) \big).   (6.6)

This can be seen by rewriting the objective function ℓ(W) using trace identities:

\ell(W) = \frac{1}{2} \| X W^\top W - X \|_F^2
        = \frac{1}{2} \mathrm{tr}\big( W^\top W X^\top X W^\top W - 2 X^\top X W^\top W + X^\top X \big)   (\|A\|_F^2 = \mathrm{tr}(A^\top A))
        = \frac{1}{2} \mathrm{tr}\big( W^\top W X^\top X W^\top W \big) - \mathrm{tr}\big( X^\top X W^\top W \big) + \frac{1}{2} \mathrm{tr}\big( X^\top X \big)   (\mathrm{tr}(A + B) = \mathrm{tr}(A) + \mathrm{tr}(B))
        = \frac{1}{2} \mathrm{tr}\big( X W^\top W W^\top W X^\top \big) - \mathrm{tr}\big( W X^\top X W^\top \big) + \frac{1}{2} \mathrm{tr}\big( X^\top X \big)   (\mathrm{tr}(AB) = \mathrm{tr}(BA))

Applying (112) and (100) of [109] directly, we obtain (6.6).

Implementation

In this problem, invoking the proximal operator in our algorithm is nontrivial, as ‖WX‖_1 is not separable but only block separable. Therefore, we need to solve each of the following n independent optimization problems simultaneously using ADMM [21]:

\operatorname*{minimize}_{u_i} \; \frac{1}{2\gamma} \| u_i - z_i \|_{H_i}^2 + \lambda \| X u_i \|_1, \quad \text{for } i = 1, \dots, n,   (6.7)

where we set z_i = w_i − γ H_i^{-1} ∂_i ℓ(W), and w_i ∈ ℝ^p denotes the i-th row of the parameter W. Here γ is the learning rate and H_i ∈ ℝ^{d×d} are scaling matrices of the metric space. For notational convenience we drop the subscript i when it is clear from the context. Following [42], we choose the scaling matrices at iteration t to be:

H_t^2 = H_{t-1}^2 + \mathrm{diag}(w_t - w_{t-1})^2 \quad \text{for } t \ge 0, \quad H_0 = 1,

which can be computed locally.



Figure 6.6: Convergence of RICA on the ImageNet dataset with different delays.

With the auxiliary variable y := Xu, the augmented Lagrangian is given by:

L(u, y, \mu) = \frac{1}{2\gamma} \| u - z \|_H^2 + \lambda \| y \|_1 + \langle \mu, Xu - y \rangle + \frac{1}{2\theta} \| Xu - y \|^2.   (6.8)

Let S_λ(·) be the soft-thresholding function. We obtain the corresponding update rules as:

u \leftarrow \big( \gamma^{-1} H + \theta^{-1} X^\top X \big)^{-1} \big( \gamma^{-1} H z + X^\top (\theta^{-1} y - \mu) \big),   (6.9a)
y \leftarrow S_{\lambda\theta} \big( X u + \theta \mu \big),   (6.9b)
\mu \leftarrow \mu + \theta^{-1} (X u - y).   (6.9c)

Note that in our Parameter Server implementation, each worker node can update its parameters independently if it holds the training data X and has access to the matrix W^⊤W, whose dimension is typically much smaller than that of W. Therefore, we apply parameter partitioning for RICA, where the server maintains W^⊤W, and each worker has a copy of X and only a subset of the rows of W. Since most computations are dense matrix operations, we implement the proposed algorithm on GPUs using cuBLAS, which uses all the computational units within the GPU by default. We configure one worker node on each GPU card. We run the experiments on the cluster CampusB, where each machine is equipped with an Nvidia Tesla K20. To obtain the training set, we randomly sub-sample 100,000 images from ImageNet and resize them to 100 × 100 pixels.
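A minimal dense Numpy sketch of the ADMM inner loop for one row, following the update rules (6.9a)-(6.9c) as reconstructed above; the fixed iteration count, the missing stopping rule, and the function names are simplifications of ours, and the real implementation runs these steps as dense GPU matrix operations.

import numpy as np

def soft_threshold(x, kappa):
    return np.sign(x) * np.maximum(np.abs(x) - kappa, 0.0)

def admm_prox(z, X, H, gamma, lam, theta, num_iters=50):
    """Approximately solve (6.7) for one row:
    min_u 1/(2*gamma)*||u - z||_H^2 + lam*||X u||_1, with H diagonal."""
    u = z.copy()
    y = X @ u
    mu = np.zeros(X.shape[0])
    A = H / gamma + (X.T @ X) / theta          # the u-update solves A u = rhs
    for _ in range(num_iters):
        rhs = (H @ z) / gamma + X.T @ (y / theta - mu)
        u = np.linalg.solve(A, rhs)                            # (6.9a)
        y = soft_threshold(X @ u + theta * mu, lam * theta)    # (6.9b)
        mu = mu + (X @ u - y) / theta                          # (6.9c)
    return u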

Performance Figure 6.6 shows how the delay affects the convergence speed of the algorithm. Similar to the simpler problem of ℓ1-regularized logistic regression, we observe an improvement from asynchronous updating. However, increasing the delay beyond 1 slightly affects the time of each iteration and the convergence rate. This is because the update delay observed for RICA is typically 1.



Figure 6.7: Speedup of the Parameter Server when increasing the number of workers from 1 to 16 for RICA.

Figure 6.8: Comparison between computation time and total time for RICA.

In Figure 6.7, we see a 13.5x acceleration for RICA when increasing the number of workers by 16 times. The larger speedup of RICA compared to ℓ1-regularized logistic regression is because RICA mainly consists of dense matrix operations, which are easier to load balance than sparse matrix operations. This point is illustrated by Figure 6.8, where we observe that even for synchronous updating (delay 0), over 80% of the total cost of RICA comes from the actual computation.

6.4 Proof of Theorem 2

We first prove several technical lemmas towards the proof of Theorem 2. Let B ⊆ {1, . . . , p} denote a subset of coordinates and let x_B ∈ ℝ^p denote the vector obtained by setting the entries of x that are not in block B to 0. We first show that under Assumption 1 the objective is well behaved under subspace shifts.

Lemma 3 Assume block B is chosen at time t. Then the following holds under Assumption 1 for any ℓ_{I_k}, for any time t, and for any x, y ∈ ℝ^p:

\ell_{I_k}(x + y_B) \le \ell_{I_k}(x) + \langle \partial \ell_{I_k}(x), y_B \rangle + \frac{L_{var,k}}{2} \| y_B \|^2.   (6.10)

Proof. By the mean value theorem it follows that

\ell_{I_k}(x + y_B) = \ell_{I_k}(x) + \langle \partial \ell_{I_k}(x + \xi y_B), y_B \rangle \quad \text{for some } \xi \in [0, 1].   (6.11)


Using the Lipschitz property of Assumption 1, the gradient at x + ξy_B can be bounded via ‖∂ℓ_{I_k}(x + ξy_B) − ∂ℓ_{I_k}(x)‖ ≤ L_{var,k} ξ ‖y_B‖. Combining this with ξ ≤ 1 proves the claim. Next we prove that for block separable regularizers the solutions also satisfy an appropriate decomposition property:

Lemma 4 Assume that h is block separable, that 0 ∈ ∂h(0), and that U is diagonal. For any x and γ > 0, denote the solutions of the proximal operator applied to the full vector and to a subset of it by z = Prox^U_γ(x) and z_B = Prox^U_γ(x_B), respectively. For any block B the following holds:

U (x_B - z_B) \in \gamma \, \partial h(z_B).   (6.12)

Proof. Since 0 ∈ ∂h(0) it follows that Prox^U_γ(0) = 0. Further, since h is block separable, the proximity function h(y) + \frac{1}{2\gamma} \|x - y\|_U^2 is also block separable. z_B = Prox^U_γ(x_B) follows from this by setting all entries of x except those in block B to 0. Finally, (6.12) follows by taking derivatives on both sides of the definition of the proximal operator.

Let g̃_t and ũ_t denote the aggregated gradients and scaling coefficients at the server nodes, respectively. Assume that each worker randomly skips a coordinate with probability 1 − q, where 0 < q < 1. Let g_t := q^{-1} g̃_t and u_t := q^{-1} ũ_t be the unbiased inexact gradient and scaling coefficient estimates, respectively (note that more sophisticated subsampling techniques such as reservoir sampling could be employed, too). The next step is to bound the change of the objective function between subsequent iterations t and t + 1 using the update ∆_t = w_{t+1} − w_t together with the difference between g_t and ∂ℓ(w_t).

Lemma 5 Let g_t be the unbiased inexact gradient aggregated by the servers at time t. Under the assumptions of Theorem 2 we have

    E[ℓ(w_{t+1}) − ℓ(w_t)] ≤ (L_var − M_t/γ_t) ‖∆_t‖² + ‖∆_t‖ ‖∂_{B_t} ℓ(w_t) − E[g_t]‖,   (6.13)

where the expectation is taken with respect to the random skip filter.

Proof. For notational simplicity, we drop the index t for the block indicator B_t, the scaling matrix U_t, the learning rate γ_t, and the constant M_t (recall that M_t = min_i (U_t)_{ii} is the smallest coefficient-specific learning rate as induced by the Mahalanobis metric in the proximal operator). First note that (g_t)_B = g_t because the gradients are computed in block B. Hence it follows that the update ∆_t is also restricted to block B. By Lemma 4 we have that

    (∆_t)_B = Prox_γ^U((w_t)_B − γ U⁻¹ g_t) − (w_t)_B = ∆_t

and therefore (w_{t+1})_B = Prox_γ^U((w_t)_B − γ U⁻¹ g_t). Using Lemma 4 again, we have

    (U/γ) ((w_t)_B − γ U⁻¹ g_t − (w_{t+1})_B) ∈ ∂h((w_{t+1})_B).

Since h is block separable we can decompose the updates to obtain

    h(w_{t+1}) − h(w_t) = h((w_{t+1})_B) − h((w_t)_B)
                        ≤ ⟨(U/γ)((w_t)_B − γ U⁻¹ g_t − (w_{t+1})_B), (w_{t+1})_B − (w_t)_B⟩
                        = −(1/γ) ‖∆_t‖²_U − ⟨g_t, ∆_t⟩
                        ≤ −(M/γ) ‖∆_t‖² − ⟨g_t, ∆_t⟩.   (6.14)

On the other hand, only the entries of w_{t+1} in block B have changed compared to those of w_t, which satisfies the requirement of Assumption 1. Therefore, by Lemma 3,

    ℓ(w_{t+1}) − ℓ(w_t) ≤ ⟨w_{t+1} − w_t, Σ_{k=1}^{m} ∂_B ℓ_{I_k}(w_t)⟩ + Σ_{k=1}^{m} (L_{var,k}/2) ‖∆_t‖²
                        = ⟨∆_t, ∂_B ℓ(w_t)⟩ + L_var ‖∆_t‖².   (6.15)

Combining (6.14) and (6.15), we have

    E[ℓ(w_{t+1}) − ℓ(w_t)] ≤ (L_var − M/γ) ‖∆_t‖² + E[⟨∆_t, ∂_B ℓ(w_t) − g_t⟩]
                           ≤ (L_var − M/γ) ‖∆_t‖² + ‖∆_t‖ ‖∂_B ℓ(w_t) − E[g_t]‖.

In other words, the change of the objective function is bounded from above both by the change in the parameters ∆_t and by the discrepancy in the block gradient.

Proof of Theorem 2. We now have all the ingredients to prove convergence to a stationary point. In a nutshell, we must bound ‖∆_t‖ and all else follows. For a fixed iteration t, let block B = B_t. We first upper bound the term ‖∂_B ℓ(w_t) − E[g_t]‖ in (6.13). By Assumption 1 we have for 1 ≤ i ≤ τ that

    ‖∂_B ℓ_{I_k}(w_{t−i+1}) − ∂_B ℓ_{I_k}(w_{t−i})‖ ≤ L_{cov,k} ‖w_{t−i+1} − w_{t−i}‖ = L_{cov,k} ‖∆_{t−i}‖.

Due to the bounded delay, worker k's model at time t is only out of date within the range t − τ ≤ t_k ≤ t. The significantly-modified filter places an additional noise term σ_{t_k} on the model. By the design of the filter we have

    ‖σ_{t_k}‖_∞ ≤ δ_{t_k} = O(1/t_k).

Furthermore, by the random skip filter, the expectation of the unbiased inexact gradient aggregated at time t is given by

    E[g_t] = Σ_{k=1}^{m} ∂_B ℓ_{I_k}(w_{t_k} + σ_{t_k}).

Then we have

    ‖∂_B ℓ(w_t) − E[g_t]‖
      = ‖ Σ_{k=1}^{m} [ Σ_{i=1}^{t−t_k} (∂_B ℓ_{I_k}(w_{t−i+1}) − ∂_B ℓ_{I_k}(w_{t−i})) + ∂_B ℓ_{I_k}(w_{t_k}) − ∂_B ℓ_{I_k}(w_{t_k} + σ_{t_k}) ] ‖
      ≤ Σ_{k=1}^{m} [ Σ_{i=1}^{t−t_k} ‖∂_B ℓ_{I_k}(w_{t−i+1}) − ∂_B ℓ_{I_k}(w_{t−i})‖ + ‖∂_B ℓ_{I_k}(w_{t_k}) − ∂_B ℓ_{I_k}(w_{t_k} + σ_{t_k})‖ ]
      ≤ Σ_{k=1}^{m} [ Σ_{i=1}^{t−t_k} L_{cov,k} ‖∆_{t−i}‖ + L_{cov,k} ‖σ_{t_k}‖ ]
      ≤ Σ_{k=1}^{m} [ Σ_{i=1}^{τ} L_{cov,k} ‖∆_{t−i}‖ + L_{cov,k} √p δ_{t−τ} ]
      = L_cov Σ_{i=1}^{τ} ‖∆_{t−i}‖ + L_cov √p δ_{t−τ},   (6.16)

where we used the fact that σ_{t_k} = (σ_{t_k})_B so that Assumption 1 can be applied, and ‖x‖ ≤ √p ‖x‖_∞. Substituting (6.16) into (6.13) in Lemma 5, we have

    E[ℓ(w_{t+1}) − ℓ(w_t)] ≤ (L_var − M_t/γ_t) ‖∆_t‖² + L_cov ‖∆_t‖ ( Σ_{i=1}^{τ} ‖∆_{t−i}‖ + √p δ_{t−τ} )
                           ≤ (L_var + L_cov τ/2 − M_t/γ_t) ‖∆_t‖² + (L_cov/2) Σ_{i=1}^{τ} ‖∆_{t−i}‖² + L_cov p δ²_{t−τ}.

Summing over t yields

    E[ℓ(w_{T+1}) − ℓ(w_1)] ≤ Σ_{t=1}^{T} [ (L_var + L_cov τ − M_t/γ_t) ‖∆_t‖² + L_cov p δ²_{t−τ} ].   (6.17)

Define c_t = M_t/γ_t − L_var − L_cov τ, and assume γ_t = M_t/(L_var + L_cov τ + ε) for all t with ε > 0; then c_t = ε > 0 for all t. So

    ε Σ_{t=1}^{T} ‖∆_t‖² ≤ Σ_{t=1}^{T} c_t ‖∆_t‖² ≤ E[ℓ(w_1) − ℓ(w_{T+1})] + Σ_{t=0}^{T} L_cov p δ²_{t−τ}   (6.18)

for any T. Since δ_t = O(1/t), and by the fact that 1 + 1/2² + 1/3² + · · · = π²/6, the right-hand side of (6.18) remains bounded by a constant as T → ∞, which implies lim_{t→∞} ∆_t → 0. So lim_{t→∞} Prox_{γ_t}^{U_t}(w_t) − w_t → 0, and thus we reach a stationary point.


Chapter 7

EMSO: Efficient Minibatch Training for Stochastic Optimization

7.1 Introduction

7.1.1 Problem formulation

We first formally introduce the optimization problem studied in this chapter. Let f_i : Ω → ℝ be a convex loss function evaluated on the i-th data example. Let w be a shared parameter which we optimize to minimize the average loss over n examples:

    w* = argmin_{w∈Ω} f(w),  where f(w) = (1/n) Σ_{i=1}^{n} f_i(w).   (7.1)

7.1.2 Minibatch Stochastic Gradient Descent

Standard stochastic gradient descent (SGD) processes one example at a time, while minibatch training makes SGD easy to parallelize in a distributed computing environment by considering a subset of examples, i.e. a minibatch, at a time. In the vanilla minibatch SGD algorithm, we choose a minibatch I of b random examples and define the loss function on this minibatch as

    f_I(w) = (1/|I|) Σ_{i∈I} f_i(w).   (7.2)

Letting I be uniformly distributed over all minibatches, the expected loss function is given by

    f(w) = E_I[f_I(w)].

Recall that Ω is the support of the variable w. In the simple case where Ω = ℝ^d, minibatch SGD employs the following stochastic update rule. At each iteration t, we randomly pick I_t and update the variable as:

    w_t = w_{t−1} − η_t ∇f_{I_t}(w_{t−1}).   (7.3)

If the support Ω is a nontrivial set, a projection step follows [156], which finds the nearest neighbor of w_t in the support.
If the loss function f_i is convex, minibatch SGD converges to the optimum at a rate O(1/√(bT) + 1/T) [44], where T is the number of iterations. Note that, given the same number of processed examples, standard SGD converges at a rate of O(1/√(bT) + 1/(bT)). If the batch size b is larger than the number of iterations, the term 1/T dominates the rate and therefore gives a worse convergence speed than standard SGD. This convergence-rate slowdown is indeed observed in practice, especially with a large minibatch size.
There are other challenges for an efficient implementation of minibatch training. In a distributed implementation, machines need to communicate with each other over iterations in order to synchronize the shared variables. Given that both the bandwidth and the latency of network communication are much worse than those of accessing physical memory, the synchronization cost of minibatch training may make it prohibitive for large-scale applications. A larger minibatch size may serve to reduce the communication cost; however, it also slows down the convergence rate in practice [26]. For the same problem, if standard SGD converges within T iterations, vanilla minibatch training with batch size b needs more than T/b iterations. With a larger batch size, this increase in computation workload may outweigh the benefit of the reduced communication cost. In addition, the I/O cost also increases if the data is too large to fit into memory, so that one needs to fetch minibatches from disk or network for more iterations [150].
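To make the update (7.3) and the projection step concrete, here is a minimal NumPy sketch of projected minibatch SGD; grad_fn, project, and the learning-rate schedule lr are placeholders of our own to be supplied for a specific problem, not part of the thesis code.

import numpy as np

def minibatch_sgd(grad_fn, project, w0, n, b, T, lr):
    # grad_fn(w, idx): average gradient of f over the examples indexed by idx
    # project(w): projection onto Omega (identity if Omega = R^d)
    w = w0.copy()
    for t in range(1, T + 1):
        idx = np.random.choice(n, size=b, replace=False)  # minibatch I_t of size b
        w = project(w - lr(t) * grad_fn(w, idx))          # update (7.3) plus projection
    return w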

7.1.3 Related Work and Discussion

The idea of using minibatches in stochastic optimization has been studied by a number of researchers. For example, it was shown in [44] that distributed minibatch gradient methods can achieve a convergence rate of O(1/√(Tb) + 1/T), which is comparable to that of serial SGD when the minibatch size is small. Additional studies include [38, 120, 135].
There is also a large volume of work aiming to improve the standard minibatch approach. For example, [77] proposed to solve min_w f_{I_t}(w) directly, while [25] presented an L-BFGS-style update. In addition, [72, 93] argued for reducing the stochastic variance via gradients computed on the whole dataset.
There have been many attempts to speed up the convergence of minibatch SGD. For example, in [58], asynchronous communication is suggested. In [120], an accelerated version of minibatch SGD is studied. At a more fundamental level, the authors of [94, 157] consider the problem of solving subproblems in parallel, followed by averaging of the results. The above works, however, are not the most efficient in the sense that no communication occurs during the phase of gradient computation.
Another line of research focuses on practical performance, especially when the data cannot fit into memory. For example, [150] studied the approach of solving linear SVM in the dual form by processing a block of data at a time. In [96] it was shown that having both I/O and computational threads working together can further improve the performance. In [30] the selection of the data blocks was explored.

7.1.4 Our work

We propose an efficient minibatch algorithm with reduced communication cost and good convergence properties. In particular, the new algorithm does not slow down the convergence as the minibatch size increases. The key observation is that, when a minibatch is large, it is desirable to solve a more complex optimization problem, rather than simply update the solution using the stochastic gradients. More specifically, in each iteration we solve a conservative risk minimization subproblem, which consists of two components: the original objective function on the minibatch and a conservative penalty. In this way, we are able to gain more progress on the objective function from each minibatch iteration. The conservative penalty reduces the variance and prevents divergence from the previous consensus. The key challenge of this approach is twofold: we need a more sophisticated update strategy as well as an efficient way to solve the conservative subproblem, such that the increase in computation workload does not offset the reduced synchronization cost.
Our approach differs from previous works by exploiting the data partition in a nontrivial manner beyond gradient-flow methods: we solve subproblems using the partitioned data instead. We show that the proposed algorithm has an O(1/√(bT)) convergence rate, which significantly improves the result in [44] when the batch size b is large. We show that it can be further improved to O(log T/(λbT) + λ/√(bT)) for a λ-strongly convex objective function. We also show a communication-efficient distributed implementation of the proposed algorithm, and demonstrate its efficacy with experiments on large-scale datasets.

7.2 Efficient Minibatch Training Algorithm

7.2.1 Our algorithm

Throughout this chapter, we focus on the case where fi(w) is convex for all i. This standard assumption simplifies the theoretical analysis. The proposed algorithm, however, can be applied to more general non-convex functions. First note that we can write the updating rule of minibatch SGD as solving the following optimization problem:

    w_t = argmin_{w∈Ω} [ f_{I_t}(w_{t−1}) + ⟨∇f_{I_t}(w_{t−1}), w − w_{t−1}⟩ + (1/(2η_t)) ‖w − w_{t−1}‖₂² ],

where η_t is the learning rate and I_t is the index set of the examples chosen at iteration t. Note that this objective can also be viewed as an approximate solution to the minimization of the loss function f_{I_t}(w) plus a conservative penalty at w_{t−1}, in the sense that it uses the first-order Taylor approximation of f_{I_t}(w) at w_{t−1}. However, this first-order approximation might be too coarse to make sufficient progress in each iteration. In particular, such an aggressive trade-off of fast convergence in favor of computational efficiency is undesirable for a large batch size, as there is often significant overhead of switching minibatches, due to process synchronization, data reads from disk, and network communication. Also, note that given the variance of the randomly chosen examples, SGD often favors a small step size, and thus each iteration makes

even smaller progress. However, when the size of a minibatch increases, the variance of a minibatch of examples decreases, which suggests that more sophisticated methods could be used to obtain a faster convergence rate.
The proposed algorithm is given in Algorithm 6. The key step is that at each iteration the parameter is updated by solving the following subproblem:

    w_t = argmin_{w∈Ω} [ f_{I_t}(w) + (γ_t/2) ‖w − w_{t−1}‖₂² ].   (7.4)

Note that the objective function in this optimization has two components. The first part is the loss function on minibatch I_t, which serves to exploit this minibatch as much as possible; the second component is a conservative constraint which limits dramatic changes of the parameter to avoid over-utilization of this minibatch.
Compared to the update rule of SGD, here we need to solve the more complex conservative subproblem for each minibatch. We assume that the optimization is performed exactly for the simplicity of the convergence analysis; in practice, however, an approximate solution to the subproblem suffices, especially in the early stages of the iterations. If the computational cost for this approximate optimization is not too high, this update rule can speed up the convergence and drastically reduce the amount of network communication required between iterations.

Algorithm 6 Single-node template for EMSO
1: Input: initial w_0, conservative coefficients γ_1, . . . , γ_T
2: for t = 1, . . . , T do
3:   randomly choose minibatch I_t ⊂ {1, . . . , n} of size b
4:   solve the conservative subproblem:
         w_t = argmin_{w∈Ω} [ f_{I_t}(w) + (γ_t/2) ‖w − w_{t−1}‖₂² ]
5: end for

7.2.2 Convergence Analysis

Compared to vanilla minibatch SGD, the advantage of our algorithm is that the convergence does not slow down when the minibatch size increases. We first introduce the notion of a Bregman divergence for a convex function f, defined as:

    D_f(w; w') := f(w) − f(w') − ∇f(w')ᵀ(w − w').   (7.5)

This is the difference between f(w) and the value of the first-order Taylor expansion of f at w', when evaluated at w. The properties of the Bregman divergence include:
Non-negativity: D_f(w; w') ≥ 0.
Convexity: D_f(w; w') is convex with respect to w.
Linearity: D_f(w; w') is linear with respect to f, namely D_{f+cf'}(w; w') = D_f(w; w') + c D_{f'}(w; w').
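As a quick sanity check of definition (7.5), for the squared Euclidean norm the Bregman divergence reduces to the squared distance (a worked instance added here for illustration):

\[
\text{For } f(w) = \tfrac{1}{2}\|w\|_2^2:\quad
D_f(w; w') = \tfrac{1}{2}\|w\|_2^2 - \tfrac{1}{2}\|w'\|_2^2 - \langle w', w - w'\rangle
           = \tfrac{1}{2}\|w - w'\|_2^2 .
\]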

We also impose the following technical assumption, which bounds the amount of “surprise” we can expect on the full Bregman divergence when it is replaced by the Bregman divergence on each minibatch plus a conservative penalty.

Assumption 6 Assume that for all t:

    E_{I_t}[D_f(w_t; w_{t−1})] ≤ E_{I_t}[ D_{f_{I_t}}(w_t; w_{t−1}) + (γ_t/2) ‖w_t − w_{t−1}‖₂² ].

Note that, in general, D_f(w; w_{t−1}) = E_{I_t}[D_{f_{I_t}}(w; w_{t−1})] holds for any w that is independent of the minibatch I_t. However, since the parameter w_t is a function of I_t, we need to impose this assumption for some γ_t > 0. The assumption holds as long as we pick γ_t greater than or equal to the smoothness parameter of f, namely

    f(w) − f(w') − ∇f(w')ᵀ(w − w') ≤ (γ_t/2) ‖w − w'‖₂².

In other words, the counterpart of strong convexity, namely that there exists a quadratic upper bound on the amount of change, suffices to guarantee this assumption. In practice, however, one can be more aggressive and allow a much smaller γ_t when the minibatch size is large. In fact, one can show that a choice of γ_t = O(1/b) is sufficient. We state the main theorem for the convergence analysis below. The detailed proof is deferred to Section 7.4.

Theorem 7 Consider the stochastic update rule (7.4). Assume that f_i is λ-strongly convex for all i. Under Assumption 6 with the step size γ_t = γ + λ(t − 1), for all parameters w* ∈ Ω it holds that:

    Σ_{t=1}^{T} E[f(w_t) − f(w*)] ≤ (γ/2) ‖w* − w_0‖₂² + (A²/b) Σ_{t=1}^{T} 1/γ_t,

where

    A² = sup_{w∈Ω} (1/n) Σ_{i=1}^{n} ‖∇f_i(w) − ∇f(w)‖₂².

For general convex functions, we set λ = 0, which amounts to a constant update rate γ in the above theorem. In this case, setting γ = √(2T/b) · A/‖w* − w_0‖₂ minimizes the right-hand side of the bound. Note that there is no a-priori guarantee that such a small γ is feasible. However, since the variance decreases with the minibatch size at a rate of O(1/b), the scaling γ = O(1/√b) is appropriate. When it is feasible, this yields the following aggregate regret bound:

    (1/T) Σ_{t=1}^{T} E[f(w_t) − f(w*)] ≤ (√2 A / √(Tb)) ‖w* − w_0‖₂.


The above bound says that if the minibatch size is b, after T steps we have a convergence rate of order O(1/√(Tb)). Therefore, increasing the minibatch size does not affect the convergence rate in terms of the number of training examples processed by the algorithm. In contrast, for standard minibatch SGD, the convergence rate O(1/√(bT) + 1/T) is dominated by the second term O(1/T) when the number of iterations T is less than the batch size b, which often happens when using a large batch size with early stopping. In addition, for a strongly convex loss function with λ > 0, we can optimize γ and achieve a stronger regret bound O(log T/(λbT) + λ/√(bT)).

7.2.3 Efficient Implementation

In this section, we discuss two distributed implementations of the proposed algorithm. Note that we solve the conservative subproblem (7.4) exactly only for the simplicity of the convergence analysis. In practice, we only need to solve this optimization approximately.

Early Stopping Many optimization methods for the original problem (7.1) can be applied to the subproblem in (7.4), and the most suitable method indeed depends on the specific risk function. Here we discuss two generic methods that are helpful for distributed implementation of (7.4), and we use early stopping to speed up the computation.

Algorithm 7 EMSO-GD: solving (7.4) with gradient descent
1: Input: model parameter w_{t−1}, minibatch I_t,
2:        conservative coefficient γ_t, learning rate η_t
3: Output: updated model parameter w_t
4: w ← w_{t−1}
5: for ℓ = 1, . . . , L do
6:   update
         w ← w − η_t (∇f_{I_t}(w) + γ_t (w − w_{t−1}))   (7.6)
7: end for
8: w_t ← w
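For concreteness, a minimal NumPy sketch of Algorithm 7 is given below; grad_f(w, idx) stands for ∇f_{I_t}(w) and is a placeholder to be supplied by the application (e.g., the logistic loss used in Section 7.3). This is an illustrative sketch, not the thesis implementation.

import numpy as np

def emso_gd_step(grad_f, idx, w_prev, gamma, eta, L=5):
    # approximately solve (7.4) with L gradient-descent passes, i.e. update (7.6)
    w = w_prev.copy()
    for _ in range(L):
        w = w - eta * (grad_f(w, idx) + gamma * (w - w_prev))
    return w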

Our first method, named EMSO-GD, is a direct extension of SGD. Note that, if we set the parameter γ_t = 0 in (7.4), SGD would be equivalent to performing gradient descent with a single pass over the minibatch with w_{t−1} as the starting point. We can relax the single-pass constraint to obtain a more accurate solution of the conservative subproblem. In Algorithm 7 we sketch the algorithm EMSO-GD. It solves the subproblem (7.4) using gradient descent. We limit the maximal number of iterations to a fixed constant L. This early-stopping strategy helps to simplify the synchronization in the distributed implementation.
Our second method, EMSO-CD, sketched in Algorithm 8, is motivated by the work in [150], which applies minibatch coordinate descent to solve the dual form of linear SVM. For our subproblem, we directly solve it using coordinate descent in the primal form. Let p denote the dimension of the parameter w.

Algorithm 8 EMSO-CD: solving (7.4) with coordinate descent
1: Input: previous parameter w_{t−1}, minibatch I_t,
2:        conservative coefficient γ_t
3: Output: new parameter w_t
4: w ← w_{t−1}
5: for ℓ = 1, . . . , Lp do
6:   randomly choose a coordinate j ∈ [1, p]
7:   update
         [w]_j ← [w]_j − η_t ([∇f_{I_t}(w)]_j + γ_t([w]_j − [w_{t−1}]_j)) / ([∇²f_{I_t}(w)]_{jj} + γ_t)   (7.7)
8: end for
9: w_t ← w

In each iteration, EMSO-CD chooses a random coordinate j ∈ [1, p] and solves the following one-dimensional problem using Newton's method:

    argmin_{[w]_j} { f_{I_t}(w) + (γ_t/2) ‖w − w_{t−1}‖₂² }.

We apply early stopping similar to that in EMSO-GD.
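A corresponding sketch of the EMSO-CD update (7.7): grad_j and hess_jj return the j-th partial derivative and the j-th diagonal Hessian entry of f_{I_t} and are placeholders of our own; the actual thesis implementation is not shown here.

import numpy as np

def emso_cd_step(grad_j, hess_jj, w_prev, gamma, eta, L, p):
    # approximately solve (7.4) with L*p randomized coordinate Newton steps (7.7)
    w = w_prev.copy()
    for _ in range(L * p):
        j = np.random.randint(p)                         # random coordinate
        num = grad_j(w, j) + gamma * (w[j] - w_prev[j])  # first-order term of (7.7)
        den = hess_jj(w, j) + gamma                      # curvature plus conservative term
        w[j] -= eta * num / den
    return w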

Distributed Model Averaging. To implement our algorithm in a distributed computing environment, we first partition the minibatch into m parts and then assign them to different machines. Instead of having all machines coordinate with each other to solve the subproblem, which is expensive given the limited communication resources, we propose a more communication-friendly approach. In Algorithm 9, each machine simply solves the subproblem independently using the part of the data assigned to it, and the results are then averaged across all machines at the end of each iteration.

Algorithm 9 Distributed EMSO with model averaging
1: Input: initialization w_0, conservative coefficients {γ_t}_{t=1}^{T}, learning rates {η_t}_{t=1}^{T}, number of machines m
2: for t = 1, . . . , T do
3:   randomly choose minibatch I_t
4:   partition I_t = ∪_{k=1}^{m} I_t^{(k)}
5:   for k = 1, . . . , m do in parallel
6:     solve the conservative subproblem on I_t^{(k)} using Algorithm 7 or 8 to obtain w_t^{(k)}
7:   end for
8:   average w_t = (1/m) Σ_{k=1}^{m} w_t^{(k)} via communication
9: end for
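The following single-process sketch mimics one round of Algorithm 9: each simulated machine solves its conservative subproblem on its part of the minibatch and the resulting models are averaged. In the actual Parameter Server implementation the averaging is a push/pull over the network; here it is a plain mean, and solve_local is a placeholder standing in for Algorithm 7 or 8.

import numpy as np

def distributed_emso_round(solve_local, parts, w_prev, gamma):
    # parts: the m pieces I_t^{(k)} of the current minibatch, one per machine
    local_models = [solve_local(part, w_prev, gamma) for part in parts]  # lines 5-7
    return np.mean(local_models, axis=0)                                 # line 8: model averaging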

name                  update iteration   distributed
L-BFGS [89]           batch              yes
LIBLINEAR [51]        batch              no
Minibatch SGD (7.3)   minibatch          yes
EMSO-GD (7.6)         minibatch          yes
EMSO-CD (7.7)         minibatch          yes

Table 7.1: Evaluated algorithms.

7.3 Experiments

In this section, we evaluate our algorithm and its implementation with a case study of classification using logistic regression. The loss function is given by f_i(w) = log(1 + exp(−y_i ⟨x_i, w⟩)). We compare our algorithms with two other algorithms; the evaluated methods are summarized in Table 7.1. All experiments were carried out on the cluster CampusB, where each machine is equipped with four 16-core AMD Opteron 6272 (Interlagos) CPUs, 128 GB of memory, and 10 Gbit Ethernet. We summarize the key implementation details below.
L-BFGS This is a parallelized version of the classical memory-limited BFGS method, as described in [89]. The root machine obtains subgradients from each of the client machines and aggregates the information into a global subgradient. The parameters are then updated and broadcast to the machines.
LibLinear This is a sophisticated batch solver for convex problems. The single-machine implementation is obtained from the authors' website¹. It is included as a reference to existing well-known solvers.
Minibatch SGD As the name suggests, for each minibatch this algorithm computes the subgradients on the worker machines and then aggregates them into a full minibatch subgradient. After that, we update the parameter on the root machine using (7.3) and broadcast the changes. We used an O(1/√t) decaying learning rate, setting η_t = η √(α/(t + α)) for the t-th iteration, where the constants η and α specify the initial scale and the decay speed, respectively. We use grid search over η ∈ {10⁰, . . . , 10⁻⁵} and α ∈ {10⁰, . . . , 10⁴} to choose the constants that give the best convergence progress.
EMSO-GD We use the parameter-averaging approach introduced in Algorithm 9. Recall that this algorithm uses higher-order information of the loss function when solving a subproblem on each minibatch.
EMSO-CD The only difference between this algorithm and EMSO-GD is that it uses coordinate descent to update the parameters. For the two EMSO variants, we set λ = 0 and optimize γ over the range {10⁰, . . . , 10⁵}.

Minibatch Size and Convergence. A first sanity check is to ascertain the convergence behavior of the minibatch methods. For this purpose we set the batch size to 10³, 10⁴, and 10⁵ and examine the value of the objective function after processing 10⁷ examples.

¹ http://www.csie.ntu.edu.tw/~cjlin/liblinear/


[Figure 7.1 panels: (a) Dataset KDD14, (b) Dataset URL, (c) Dataset CTRa — objective versus minibatch size for SGD, EMSO-GD, and EMSO-CD.]

Figure 7.1: Objective value versus minibatch size after in total 10⁷ examples are processed on a single node. Here CTRa is downsampled to 4 million examples due to the limited capacity of a single node.

Figure 7.2 shows that the value of the objective function for minibatch SGD significantly increases with the minibatch size, as it does not efficiently use the minibatches. This problem is worst on the dataset KDD14, which is dense and extremely unbalanced in terms of the labels. Our algorithm EMSO-GD, which performs 5 iterations of gradient descent on a minibatch, shows much better convergence when the batch size increases, as it makes better use of the minibatch by potentially extracting more information from the multiple passes. Moreover, Figure 7.2 shows that the convergence is even more stable when solving the conservative subproblems using coordinate descent. A possible explanation is that, even though each minibatch is processed only twice in EMSO-CD, the parameter from the previous iteration serves as a warm start that improves the solution quality of the conservative subproblem. It seems that this advantage of coordinate descent can offset the negative effect of the minibatch size on convergence.


[Figure 7.2 panels: (a) Dataset KDD14, (b) Dataset URL, (c) Dataset CTRa (downsampled to 4 million examples due to the limited capacity of a single machine) — objective versus minibatch size for SGD, EMSO-GD, and EMSO-CD.]

Figure 7.2: Value of the objective function versus minibatch size after in total 10⁷ examples are processed on each machine.

Run time and Convergence. We compare the five algorithms listed in Table 7.1 by tracking the value of the objective function over time. We use a batch size of 10⁵ for EMSO-CD and 10³ for the rest. Figure 7.3 shows that the convergence behaviors of L-BFGS and LibLinear are similar: slow at the beginning and faster towards the end. Observe that EMSO-GD is comparable to SGD. Note that even though EMSO-GD converges faster in terms of the number of minibatch iterations, it requires 5 times more computational time than SGD. Notably, even with a larger minibatch size, EMSO-CD is at least 10 times faster than the other algorithms.


[Figure 7.3 panels: (a) Dataset KDD14, (b) Dataset URL, (c) Dataset CTRa — objective versus time (sec.) for L-BFGS, LibLinear, SGD, EMSO-GD, and EMSO-CD.]

Figure 7.3: Value of the objective function versus run time.

Minibatch Size and the Synchronization Cost. Recall that one advantage of using a large minibatch size is the potential reduction in the synchronization cost of communication between machines. Figure 7.4 shows the proportion of synchronization cost in the overall runtime. The proportion is considerable even with just 12 machines, and it decreases with the minibatch size. This is due to the fact that both EMSO-GD and EMSO-CD solve a more complex optimization problem on each minibatch, and thus the amount of actual computation between the synchronization passes increases. In addition, although EMSO-CD passes over a minibatch just twice in our experiment, it requires significantly more exponentiation operations than EMSO-GD (a fast special-functions library would probably address this issue). As a result, it consumes more CPU time in actual computation than EMSO-GD and has a smaller fraction of synchronization cost.
In Figure 7.5 we compare the convergence of SGD, EMSO-GD, and EMSO-CD with various minibatch sizes. In the left subfigure, we fix the total number of examples processed to 5 × 10⁶. Similar to the single-machine results in Figure 7.2, EMSO-GD slightly improves upon SGD while EMSO-CD is much better than the others.


[Figure 7.4: synchronization cost (%) versus minibatch size for SGD, EMSO-GD, and EMSO-CD.]

Figure 7.4: The fraction of synchronization cost as a function of minibatch size when using 12 machines.

[Figure 7.5 panels: objective versus minibatch size for SGD, EMSO-GD, and EMSO-CD; left and right subfigures as described in the caption below.]

Figure 7.5: Value of the objective function versus minibatch size. 12 machines are used. Left: the total number of examples is fixed to 5 × 10⁶. Right: the runtime is fixed to 1,000 seconds.

In the right subfigure, we fix the run time to 1,000 seconds. In this setting, with a large minibatch size, the proportion of synchronization cost decreases and more time is allocated to the actual computation. In particular, there is a clear advantage for EMSO-CD in using a large batch size in distributed computing.

Scalability. We conclude our experimental evaluation by assessing the performance for varying numbers of machines in the distributed computing environment, with a comparison between EMSO-CD and L-BFGS shown in Figure 7.6. We can see that both algorithms benefit from an increase in the number of machines, while L-BFGS gains more than EMSO-CD. This is because L-BFGS passes over the whole training data in each iteration, and thus its portion of synchronization cost is only 15%, compared to 30% for EMSO-CD. However, EMSO-CD is still 10 times faster than L-BFGS for all numbers of machines.


[Figure 7.6: objective versus time (sec.) for EMSO-CD and L-BFGS with 5, 10, and 20 nodes.]

Figure 7.6: Value of the objective function versus run time for EMSO-CD and L-BFGS using different numbers of machines.

           objective = 0.2        objective = 0.1972
#nodes     time      speedup      time      speedup
5          879s      1.00x        2439s     1.00x
10         499s      1.76x        1367s     1.78x
20         363s      2.42x        962s      2.54x

Table 7.2: Run time and speedup for EMSO-CD to reach the same value of the objective function when running on 5, 10, and 20 machines.

Table 7.2 shows the speedup for EMSO-CD to reach specific objective values. When the number of nodes doubles from 5 to 10, there is a 1.75x speedup on average, and when the number of nodes increases by 4 times, we obtain a 2.5x speedup.

7.4 Proof of Theorem 7

For convenience, we define the regularized minibatch loss h_t(w) = f_{I_t}(w) + (γ_t/2)‖w − w_{t−1}‖₂². Our proof relies on three lemmas. First, we upper bound ‖w_t − w̄_t‖₂, where w̄_t is defined like w_t except that it optimizes over all examples; the two differ through the gradient discrepancy ‖∇f(w̄_t) − ∇f_{I_t}(w̄_t)‖₂. Next, we show that the expectation of the latter, namely the variance of the gradient over a minibatch, is bounded from above by A²/b. Finally, we characterize the progress from time t − 1 to t.

Lemma 8 Let w¯t be the minimizer of the conservative version of the expected risk, namely

    w̄_t = argmin_{w∈Ω} [ f(w) + (γ_t/2) ‖w − w_{t−1}‖₂² ].   (7.8)


We can bound the difference between the solution w̄_t and the solution w_t obtained with a minibatch as in (7.4) by:

    ‖w_t − w̄_t‖₂ ≤ (1/γ_t) ‖∇f(w̄_t) − ∇f_{I_t}(w̄_t)‖₂.

Proof. Since w_t = argmin_{w∈Ω} h_t(w), the first-order KKT condition at w_t gives:

    ∇h_t(w_t)ᵀ(w_t − w̄_t) ≤ 0.

In addition, the first-order KKT condition of (7.8) at w̄_t, combined with the fact that h_t(w) = f_{I_t}(w) + (γ_t/2)‖w − w_{t−1}‖₂², implies that

    (∇h_t(w̄_t) + ∇f(w̄_t) − ∇f_{I_t}(w̄_t))ᵀ(w_t − w̄_t) ≥ 0.

By subtracting the first inequality from the second and rearranging terms, we obtain:

    (∇h_t(w_t) − ∇h_t(w̄_t))ᵀ(w_t − w̄_t) ≤ (∇f(w̄_t) − ∇f_{I_t}(w̄_t))ᵀ(w_t − w̄_t).   (7.9)

By additivity of Bregman divergences we have

    D_{h_t}(w̄_t; w_t) = D_{f_{I_t}}(w̄_t; w_t) + (γ_t/2) ‖w̄_t − w_t‖₂²,  hence  D_{h_t}(w̄_t; w_t) ≥ (γ_t/2) ‖w̄_t − w_t‖₂².

Similarly, D_{h_t}(w_t; w̄_t) ≥ (γ_t/2) ‖w̄_t − w_t‖₂². It follows that

    γ_t ‖w_t − w̄_t‖₂² ≤ D_{h_t}(w̄_t; w_t) + D_{h_t}(w_t; w̄_t)
                       = (∇h_t(w_t) − ∇h_t(w̄_t))ᵀ(w_t − w̄_t)
                       ≤ (∇f(w̄_t) − ∇f_{I_t}(w̄_t))ᵀ(w_t − w̄_t)
                       ≤ ‖∇f(w̄_t) − ∇f_{I_t}(w̄_t)‖₂ ‖w_t − w̄_t‖₂,

where the second inequality is due to (7.9) and the third inequality is the Cauchy-Schwarz inequality.

Lemma 9 Assume that we randomly choose a minibatch I of size b. We can bound the deviation between the gradient of the minibatch risk and that of the full risk by:

    E_I ‖∇f_I(w) − ∇f(w)‖₂² = ((n − b)/(n − 1)) (B²/b) ≤ A²/b,

where B² = (1/n) Σ_{i=1}^{n} ‖∇f_i(w) − ∇f(w)‖₂².

Proof. This bound is essentially a conversion of variances from a minibatch I to the full set when sampling without replacement. To simplify notation we use the abbreviations ψ_i := ∇f_i(w) − ∇f(w) and ψ_I := ∇f_I(w) − ∇f(w). Note that by construction

    E_i[ψ_i] = E_I[ψ_I] = ψ̄,   (7.10)

and therefore B² = E_i[‖ψ_i‖²] and B² ≤ A². The latter inequality follows since A² is a uniform upper bound on the variance over all w ∈ Ω. This yields

    E_I[‖ψ_I‖₂²] = E_I[ ‖(1/b) Σ_{i∈I} ψ_i‖₂² ] = (1/b²) E_I[ Σ_{i,j∈I} ψ_iᵀψ_j ]
                 = (1/b²) E_I[ Σ_{i≠j∈I} ψ_iᵀψ_j ] + B²/b
                 = ((b − 1)/(b n(n − 1))) Σ_{i≠j} ψ_iᵀψ_j + B²/b
                 = ((b − 1)/(b n(n − 1))) Σ_{i,j} ψ_iᵀψ_j + B²/b − (B²/b)((b − 1)/(n − 1))
                 = 0 + (B²/b)((n − b)/(n − 1)) ≤ A²/b.

The last equality uses the fact that ψ_i has zero mean, so that Σ_{i,j} ψ_iᵀψ_j = ‖Σ_i ψ_i‖² = 0.

Lemma 10 For each iteration, we can bound the expected improvement in terms of Bregman divergences by

    E[D_{h_t}(w*, w_t) − D_{h_t}(w*, w_{t−1})] ≤ f(w*) − E[f(w_t)] − E[D_f(w*; w_{t−1})] + (1/γ_t)(A²/b).   (7.11)

Proof. We have

    D_{h_t}(w*, w_t) − D_{h_t}(w*, w_{t−1})
      = D_{h_t}(w_{t−1}, w_t) + (∇h_t(w_{t−1}) − ∇h_t(w_t))ᵀ(w* − w_t) + (∇h_t(w_{t−1}) − ∇h_t(w_t))ᵀ(w_t − w_{t−1})
      ≤ D_{h_t}(w_{t−1}, w_t) + ∇f_{I_t}(w_{t−1})ᵀ(w* − w_t) + (∇h_t(w_{t−1}) − ∇h_t(w_t))ᵀ(w_t − w_{t−1})
      = f(w*) − f(w_t) − D_f(w*; w_{t−1}) − (∇f(w_{t−1}) − ∇f_{I_t}(w_{t−1}))ᵀ(w* − w_t) + (D_f(w_t; w_{t−1}) − D_{h_t}(w_t; w_{t−1})),

95 January 4, 2017 DRAFT where the equalities follow from algebraic manipulations and the definition of Bregman diver- gence; in the inequality, we used the first order KKT condition of (7.4) at wt, implying that

> (∇fIt (wt−1) + ∇ht(wt) − ∇ht(wt−1)) (w∗ − wt) > = ∇ht(wt) (w∗ − wt) ≥ 0.

Taking expectation, we have

    E[D_{h_t}(w*, w_t)] − E[D_{h_t}(w*, w_{t−1})]
      ≤ f(w*) − E[f(w_t)] − E[D_f(w*; w_{t−1})] − E[(∇f(w_{t−1}) − ∇f_{I_t}(w_{t−1}))ᵀ(w* − w_t)] + E[D_f(w_t; w_{t−1}) − D_{h_t}(w_t; w_{t−1})]
      ≤ f(w*) − E[f(w_t)] − E[D_f(w*; w_{t−1})] − E[(∇f(w_{t−1}) − ∇f_{I_t}(w_{t−1}))ᵀ(w* − w_t)]
      = f(w*) − E[f(w_t)] − E[D_f(w*; w_{t−1})] − E[(∇f(w_{t−1}) − ∇f_{I_t}(w_{t−1}))ᵀ(w̄_t − w_t)],   (7.12)

where the second inequality follows from E[D_f(w_t; w_{t−1}) − D_{h_t}(w_t; w_{t−1})] ≤ 0, which is a consequence of Assumption 6. The equality holds because

    E[(∇f(w_{t−1}) − ∇f_{I_t}(w_{t−1}))ᵀ w*] = E[(∇f(w_{t−1}) − ∇E_{I_t|w_{t−1}}[f_{I_t}(w_{t−1})])ᵀ w*] = 0 = E[(∇f(w_{t−1}) − ∇f_{I_t}(w_{t−1}))ᵀ w̄_t].

Note further that

    −E[(∇f(w_{t−1}) − ∇f_{I_t}(w_{t−1}))ᵀ(w̄_t − w_t)]
      ≤ √( E‖∇f(w_{t−1}) − ∇f_{I_t}(w_{t−1})‖₂² · E‖w̄_t − w_t‖₂² )
      ≤ √( E‖∇f(w_{t−1}) − ∇f_{I_t}(w_{t−1})‖₂² · E‖∇f(w̄_t) − ∇f_{I_t}(w̄_t)‖₂² ) / γ_t
      ≤ A²/(γ_t b),

where the first inequality follows from the Cauchy-Schwarz inequality, the second is due to Lemma 8, and the third is due to Lemma 9. Plugging the above estimate into (7.12), we obtain the desired bound.

Finally, we can prove Theorem 7 below.
Proof of Theorem 7. Under the assumption that f_i is λ-strongly convex, it follows by construction that h_t is strongly convex with modulus γ_t + λ. Consequently the Bregman divergence is bounded by

    D_{h_t}(w*, w_t) ≥ ((γ_t + λ)/2) ‖w* − w_t‖₂².

Together with Lemma 10, we have

    E[ f(w_t) − f(w*) + (γ_{t+1}/2) ‖w* − w_t‖₂² ]
      = E[ f(w_t) − f(w*) + ((γ_t + λ)/2) ‖w* − w_t‖₂² ]
      ≤ E[f(w_t)] − f(w*) + E[D_{h_t}(w*, w_t)]
      ≤ E[D_{h_t}(w*, w_{t−1}) − D_f(w*; w_{t−1})] + A²/(bγ_t)
      = E[D_{f_{I_t}}(w*, w_{t−1}) − D_f(w*; w_{t−1})] + (γ_t/2) E[‖w* − w_{t−1}‖₂²] + A²/(bγ_t)
      = (γ_t/2) E[‖w* − w_{t−1}‖₂²] + A²/(bγ_t).

Here the first equality follows from the definition of γ_t; the second equality follows from the definition of h_t and simple algebra; the third equality uses the fact that we draw I_t independently, hence

    E_{I_t|w_{t−1}}[D_{f_{I_t}}(w*, w_{t−1})] = D_f(w*; w_{t−1}).

Summing over t = 1, . . . , T, we obtain the desired bound.


Chapter 8

AdaDelay: Delay Adaptive Stochastic Optimization

8.1 Introduction

In this chapter we continue studying stochastic gradient descent for distributed optimization. Recall that the objective function to optimize is as follows (see also (7.1)):

    w* = argmin_{w∈Ω} f(w),  where f(w) = (1/n) Σ_{i=1}^{n} f_i(w).   (8.1)

Instead of using synchronous iterations as studied in Chapter 7, in this chapter we focus on asynchronous SGD and study how to improve convergence by taking into account the actual delay caused by asynchronous updates.
Our work is motivated by the need to precisely model the delay properties of real-world cloud services, whose behavior is quite different from what one may observe on small clusters. In particular, computing resources in the cloud are shared by many users who run very different types of tasks. Compared to an environment where resources are shared by a small number of individuals, the cloud computing environment will inevitably be more diverse in terms of the availability of key resources such as CPUs, disks, and network bandwidth, which all contribute to the delays in the computation process. Therefore, it is of great value to both service providers and end users of large-scale cloud services to be able to accommodate variable delays.
In light of this background, we investigate delay-sensitive asynchronous SGD. In particular, instead of just using global “bounded delay” arguments as in Chapter 6, which can be too pessimistic, we aim to adapt the computation to the observed delays.

Contributions. A promising and intuitive approach to exploiting the actual observed delay is as follows. At the early stage of the iterations, the server updates the model parameters whenever it receives a gradient from any machine, with a weight inversely proportional to the actual delay observed. Towards the end, to reduce the bias caused by the initial aggressive steps, the server takes larger update steps upon receiving gradients from machines that send infrequent updates, and smaller steps when the updates come from machines that update very frequently.

In this chapter, we propose and analyze a new asynchronous SGD algorithm, named AdaDelay (Adaptive Delay), which uses step sizes that depend on the actual delays observed. It requires a more intricate convergence analysis for two reasons. First, the step sizes are no longer guaranteed to be monotonically decreasing. Second, the residuals that measure progress are not independent over iterations, as they are coupled by the random delays. The computation framework that we use to implement the algorithm is the Parameter Server (for more details see Chapter 3), where a central server maintains the global parameters, and the worker nodes compute stochastic gradients using their share of the data and communicate them back to the central server to update the model. We validate our theoretical framework with experiments on large-scale data sets. The experiments reveal that our assumptions on network delay lead to a reasonable approximation of the observed delays in practice. In the regime of large delays, using delay-sensitive steps is very helpful for obtaining fast convergence.

Related Work. Of particular relevance to our work is the recent work on delay-adaptive gradient scaling in an AdaGrad-like framework [98], which claims substantial improvements under specialized settings over [50], a work that exploits data sparsity in a distributed asynchronous setting. Our experiments confirm the claim in [98] that their best learning rate is insensitive to the maximum delay. However, in our experience the method of [98] overly smooths the optimization path, which can have adverse effects on real-world data (see Section 8.3). To our knowledge, all previous works on asynchronous SGD assume monotonically diminishing step sizes. Our analysis shows that, rather than using worst-case delay bounds, using the exact delays to control step sizes can be remarkably beneficial in realistic settings, such as when there are stragglers that slow down progress for all the machines in a worst-case delay model.
Algorithmically, the work [4] is the one most related to ours; the authors of [4] consider using delay information to adjust the step size. However, the most important difference is that they only use the worst possible delays, which again might be too conservative. The work in [79] investigates two variants of update schemes, both of which operate with delays, yet it does not exploit the actual delays either. There are other interesting works studying specific scenarios, for example [50], which focuses on sparse data. However, our framework is more general and thus capable of covering more applications. We build on the groundwork laid by [4, 103]; like them, we also consider optimizing (8.1) under a delayed gradient model.

Notation. We denote a random delay at time t by τt, denote step sizes by η(t, τt), and denote delayed gradients by g(t−τt). For a differentiable convex function h, the corresponding Bregman divergence is defined as Dh(x, y) := h(x) − h(y) − h∇h(y), x − yi. We also interchangeably use xt and x(t) when it is clear in the context.

8.2 AdaDelay Algorithm

8.2.1 Model Assumptions

To simplify the exposition of our key ideas, we consider only smooth objective functions. Straightforward, albeit laborious, extensions are possible to non-smooth functions, strongly convex costs, mirror descent versions, and proximal splitting versions. Following [4], we make the following standard assumptions on the objective function f(x):

Assumption 11 (Lipschitz gradients) The function f has locally L-Lipschitz gradients, namely

    ‖∇f(x) − ∇f(y)‖ ≤ L ‖x − y‖,  ∀ x, y ∈ Ω.

Assumption 12 (Bounded variance) There exists a constant σ > 0 such that

    E_ξ[‖∇f(x) − ∇F(x; ξ)‖²] ≤ σ²,  ∀ x ∈ Ω.

Assumption 13 (Compact domain) Let x∗ be a global minimum of the function f(x). Then,

    max_{x∈Ω} ‖x − x*‖ ≤ R.

Assumption 14 (Bounded gradient) For all x ∈ Ω,

    ‖∇f(x)‖ ≤ G.

These assumptions are typically reasonable for machine learning problems where the values of the data examples are bounded. For instance, logistic-regression losses and least-squares costs all satisfy them. We consider the following two delay models:

Uniform delay model. In this model, we assume that τ_t, the delay at time t, follows an i.i.d. uniform distribution U({0, . . . , 2τ̄}). This is a reasonable approximation of the observed delays after a random start-up time of the network.

Scaled delay model. We assume that for every t there is a θ_t ∈ (0, 1) such that τ_t < θ_t t. Moreover, assume that E[τ_t] = τ̄_t and E[τ_t²] = B_t², where τ̄_t and B_t are constants that do not grow with t. This delay model includes all delay processes with bounded first and second moments, and it is more general than the uniform model.
Our analysis seems general enough to cover many other delay distributions by combining the above two delay models. For example, the Gaussian model (where τ_t obeys a Gaussian distribution whose support must be truncated so that τ_t ≥ 0) may be seen as a combination of the following arguments: when t ≥ C (a suitable constant), the Gaussian assumption implies τ_t < θ_t t, which falls under our second delay model; when 0 ≤ t ≤ C, our proof technique with bounded support (as in the uniform model) applies.

8.2.2 Algorithm

Under the above two delay models, we consider the following projected stochastic gradient iteration:

    w_{t+1} = proj_Ω [ w_t − η_t(τ_t) g_{t−τ_t} ],  t = 1, 2, . . . ,   (8.2)

where g_{t−τ_t} is the gradient that the server receives at time t with an observed delay τ_t, and the learning rate η_t(τ_t) at iteration t is sensitive to the actual delay τ_t observed. Iteration (8.2) generates a sequence {w_t}_{t≥1}, and the server maintains the averaged iterate

    w̄(T) := (1/T) Σ_{t=1}^{T} w_{t+1}.   (8.3)

8.2.3 Convergence Analysis

Next, we analyze the convergence rate of the iterative algorithm (8.2). The learning rate we use is of the form

    η_t(τ_t) = (L + α_t(τ_t))⁻¹,

where the step offsets α_t(τ_t) are chosen to be sensitive to the actual delay of the incoming gradients. We typically use

    α_t(τ_t) = c √(t + τ_t)   (8.4)

for some constant c. For clarity of presentation we first let c be independent of t; later we also consider time-varying c_t (see Corollary 18). The constant c serves to trade off the effect of the noise variance σ against the effect of the radius bound R on the error bound. Note that if there is no delay, namely τ_t = 0, the method reduces to standard synchronous SGD.
Our convergence analysis builds upon that of [4], whereas the key difference is that our step size η_t(τ_t) depends on the actual delay τ_t observed. These delay-dependent step sizes necessitate a more intricate analysis. The primary complexity arises from η_t(τ_t) depending on the actual delay τ_t, and thus no longer being monotonically decreasing, as is typically assumed in most convergence analyses of SGD.
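A minimal sketch of the AdaDelay iteration (8.2) with the step offset (8.4) is given below. The stream of (gradient, observed delay) pairs and the projection are placeholders of our own; this illustrates the update rule only, not the Parameter Server implementation.

import numpy as np

def adadelay(delayed_grads, project, w0, c, L=1.0):
    # delayed_grads yields (g_{t - tau_t}, tau_t) for t = 1, 2, ...
    w = w0.copy()
    avg = np.zeros_like(w0)
    for t, (g, tau) in enumerate(delayed_grads, start=1):
        eta = 1.0 / (L + c * np.sqrt(t + tau))  # eta_t(tau_t) with alpha_t(tau_t) = c*sqrt(t+tau_t)
        w = project(w - eta * g)                 # update (8.2)
        avg += (w - avg) / t                     # running average, i.e. w_bar(T) in (8.3)
    return avg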

Theorem 15 Let w_t be generated according to (8.2). In the uniform delay model, we have

    E[ Σ_{t=1}^{T} (f(w_{t+1}) − f(w*)) ] ≤ (2cR²τ̄ + σ²/c) √T + (LG²(4τ̄ + 3)(τ̄ + 1)/(6c²)) log T
                                           + (1/2)(L + c)R² + τ̄GR + LG²τ̄(τ̄ + 1)(2τ̄ + 1)²/(6(L² + c²)).

In the scaled delay model, we have

    E[ Σ_{t=1}^{T} (f(w_{t+1}) − f(w*)) ] ≤ (σ²/c) √T + (cR²/2) Σ_{t=2}^{T} (τ̄_t + 1)/√(2t − 1) + GR (1 + Σ_{t=1}^{T−1} B_t²/(T − t)²)
                                           + Σ_{t=1}^{T} (B_t² + 1 + τ̄_t)(1 + G²)/(L² + c²(1 − θ_t)t) + (1/2)R²(L + c).

The detailed proof of Theorem 15 is technical and long. For the coherence of the thesis, we omit it here and direct the interested reader to the full version of our paper [127].
Theorem 15 has several implications. Corollaries 16 and 17 indicate that both of our delay models share a similar convergence rate, while Corollary 18 shows that these results still hold even when we replace the constant c with a bounded sequence {c_t}. Finally, Corollary 19 covers a simple variant that uses step offsets proportional to (t + τ_t)^η for 0 < η < 1, and it also highlights the fact that for η = 0.5 our algorithm achieves the best theoretical convergence.

Corollary 16 Under the uniform delay model, we have

    E[f(w̄_T) − f*] = O( D₁ √T/T + D₂ (log T)/T + D₃ /T ),

where the constants are

    D₁ = 2cR²τ̄ + σ²/c,   D₂ = LG²(4τ̄ + 3)(τ̄ + 1)/(6c²),   D₃ = (1/2)(L + c)R² + τ̄GR + LG²τ̄(τ̄ + 1)(2τ̄ + 1)²/(6(L² + c²)).

Corollary 17 Under the scaled delay model, let τ̄_t = τ, θ_t = θ, and B_t = B for all t. We have that

    E[f(w̄_T) − f*] = O( D₄ √T/T + D₅ log(1 + c²(1 − θ)T/L²)/T + D₆ /T ),

where the constants are

    D₄ = (1/√2)(cR²(τ̄ + 1) + σ²/c),   D₅ = G²(B² + τ + 1)/(c²(1 − θ)),   D₆ = (1/2)(L + c)R² + GR(1 + π²B²/6).

Corollary 18 If α_t(τ_t) = c_t √(t + τ_t) with 0 < M₁ ≤ c_t ≤ M₂, then the conclusions of Theorem 15 and Corollaries 16 and 17 still hold, except that the term c is replaced by M₂ and the term 1/c by 1/M₁.

If we wish to use step-size offsets α_t(τ_t) = c_t(t + τ_t)^γ where 0 < γ < 1, we get a result of the following form.

Corollary 19 Let α_t(τ_t) = c_t(t + τ_t)^γ with 0 < M₁ ≤ c_t ≤ M₂ and 0 < γ < 1. Then there exists a constant D₇ such that

    E[f(w̄_T) − f*] = O( D₇ / T^{min(γ, 1−γ)} ).

8.3 Experiments

8.3.1 Setup

We evaluate the performance of the proposed algorithm AdaDelay in a distributed computing environment.

We compare AdaDelay with two related methods, AsyncAdaGrad [4] and AdaptiveRevision [98]. Let [η_t(τ_t)]_j denote the j-th coordinate of η_t(τ_t), and let [g_t(τ_t)]_j denote the delayed gradient on the j-th feature. AsyncAdaGrad uses a scaled learning rate given by

    [η_t(τ_t)]_j = ( Σ_{i=1}^{t} [g_i(τ_i)]_j² )^{−1/2}.

AdaptiveRevision takes into account the actual delays by considering [g_t^{bak}(τ_t)]_j = Σ_{i=t−τ_t}^{t−1} [g_i(τ_i)]_j, and uses a learning rate given by

    [η_t(τ_t)]_j = ( Σ_{i=1}^{t} [g_i(τ_i)]_j² + 2 [g_t(τ_t)]_j [g_t^{bak}(τ_t)]_j )^{−1/2}.

The key novelty of our algorithm AdaDelay is the delay-dependent learning rate given by

    η_t(τ_t) = α₀ (L + α_t(τ_t))⁻¹.

We follow the common practice of fixing L = 1 while searching for the best α₀. Similar to AsyncAdaGrad and AdaptiveRevision, we use a scaled learning rate in AdaDelay to better model the nonuniform sparsity of the data set (this step-size choice falls within the purview of Corollary 18). In other words, we set

    [α_t(τ_t)]_j = c_j √(t + τ_t),

where c_j = ( (1/t) Σ_{i=1}^{t} (i/(i + τ_i)) [g_i(τ_i)]_j² )^{1/2} serves to average the weighted delayed gradients.
We test the algorithms on the objective function of logistic regression on the datasets CTRa and Criteo. For Criteo, the first 8 days are used for training while the following 2 days are used for validation. For CTRa, we sampled another 20 million examples to test the trained model. All the experiments were run on the campus cluster CampusA.
We implement the three methods in the Parameter Server framework introduced in Chapter 3. In each iteration, a worker node first reads a minibatch of data from a distributed file system, pulls the required working set of parameters from the server nodes, computes the gradients on the minibatch, and then pushes the gradients to the server nodes. Let t(k, i) denote the time when worker k pulls the weights for minibatch i, and let t'(k, i) denote the time when the server nodes update the weights upon receiving the gradients for this minibatch. The actual delay caused by processing this minibatch is thus given by τ_t = t'(k, i) − t(k, i).
The server nodes maintain the model parameters. For each feature, both AsyncAdaGrad and AdaDelay need to store the weight and the accumulated gradient to compute the learning rate, while AdaptiveRevision needs two more entries, including g_j^{bak} for each feature j. If we send g_j^{bak} over the network following the method in [98], the total network communication increases by 50%. Instead, we store g_j^{bak} at the server nodes while processing the minibatch, which incurs no network communication overhead at the cost of additional memory consumption.
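To illustrate the per-coordinate step size above, here is a sketch of the server-side bookkeeping: the delay is measured from the pull and update times, and each coordinate keeps a running sum of the delay-weighted squared gradients to form c_j. All names are our own; the real server is part of the C++ Parameter Server implementation and is not shown here.

import numpy as np

class AdaDelayServer:
    def __init__(self, dim, alpha0=1.0, L=1.0):
        self.w = np.zeros(dim)
        self.acc = np.zeros(dim)   # running sum of (i / (i + tau_i)) * [g_i]_j^2
        self.t = 0
        self.alpha0, self.L = alpha0, L

    def update(self, g, pull_time):
        self.t += 1
        tau = self.t - pull_time                        # observed delay t'(k, i) - t(k, i)
        self.acc += (self.t / (self.t + tau)) * g ** 2
        c = np.sqrt(self.acc / self.t)                  # c_j for every coordinate j
        eta = self.alpha0 / (self.L + c * np.sqrt(self.t + tau))
        self.w -= eta * g                               # delayed gradient step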


[Figure 8.1 panels: delay versus time; (a) Dataset Criteo with 1,600 workers, (b) Dataset CTRa with 400 workers.]

Figure 8.1: The first 3,000 observed delays on one server node.

[Figure 8.2 panels: count (%) versus delay; (a) Dataset Criteo with 1,600 workers, (b) Dataset CTRa with 400 workers.]

Figure 8.2: Histogram of all observed delays.

8.3.2 Results

Delay observed. Figures 8.1 and 8.2 visualize the actual delays observed at the server nodes for AdaDelay. We observe that the delay τ_t is roughly linear in t at the early stage of the training process, with different slopes for different tasks. For example, the slope is around 0.2 for the Criteo data set with 1,600 workers, and it increases to around 1 for the CTRa data set with 400 workers. At a later stage of the training process, after the delay increases to around half of the number of workers, it roughly follows a unimodal distribution.

Loss and AUC. Next, we present the comparison of the three algorithms while varying the number of workers. Following [98], we use the online LogLoss as the criterion. That is, given an example with feature vector d and label y ∈ {+1, −1}, we calculate the loss function f(x, (d, y)) = log(1 + exp(−y⟨d, x⟩)) before updating x using (y, d).


[Figure 8.3 panels: relative LogLoss (%) versus # workers for AdaDelay, AsyncAdaGrad, and AdaptiveRevision; (a) Dataset Criteo, (b) Dataset CTRa.]

Figure 8.3: Relative online LogLoss (% worsening) as a function of maximal delays (lower is better).

Similar to [98], we report the average LogLoss over the second half of the training data to ignore the possibly large values at the start of training. Figure 8.3 reports the relative change in online LogLoss for the three algorithms (smaller is better). On the Criteo dataset, AdaDelay performs better than AsyncAdaGrad, though AdaptiveRevision is slightly better than both. However, on the larger CTRa dataset, both AdaDelay and AsyncAdaGrad are substantially better than AdaptiveRevision. The reason our results differ from [98] is probably that the datasets we use are 1,000 times larger than the ones reported in [98], and that we evaluate the algorithms in a distributed environment rather than in a simulated setting, where a large minibatch size is necessary. However, as reported in [98], we also observe that AdaptiveRevision's best learning rate is insensitive to the number of workers. AdaDelay seems to have a small edge over AsyncAdaGrad, and as predicted by our theory, this edge grows much bigger when there are large delays (e.g., due to stragglers); we report on this in greater detail in the next paragraph.
Besides the LogLoss, AUC is another important metric for computational advertising; it measures the ranking ability of the model, and often a 1% difference is significant for click-through-rate estimation. We made a separate validation dataset for calculating the AUC and show the results in Figure 8.4. As can be seen, the test AUC results are consistent with the online LogLoss.

Stragglers. The previous experiments indicate that AdaDelay improves upon AsyncAdaGrad when a large number of workers (greater than 400) is used, which means the delay-adaptive learning rate takes effect when the delay can be large. To further investigate this phenomenon, we simulated an overloaded cluster in which several stragglers may produce large delays; we do this by slowing down half of the workers by a random factor in [1, 4] when computing gradients. The test AUC results are shown in Figure 8.5.¹


[Figure 8.4 panels: relative AUC (%) versus # workers for AdaDelay, AsyncAdaGrad, and AdaptiveRevision; (a) Dataset Criteo, (b) Dataset CTRa.]

Figure 8.4: Relative test AUC (higher is better) as a function of maximal delays.

[Figure 8.5 panels: relative AUC (%) versus # workers for AdaDelay and AsyncAdaGrad; (a) Dataset Criteo, (b) Dataset CTRa.]

Figure 8.5: Relative test AUC (higher is better) as a function of maximal delays in the presence of stragglers.


[Figure 8.6: speedup (×) versus # workers for AdaDelay compared with the ideal linear speedup.]

Figure 8.6: The speedup of AdaDelay. The results of AsyncAdaGrad and AdaptiveRevision are almost identical to AdaDelay and therefore omitted.

           AdaDelay   AsyncAdaGrad   AdaptiveRevision
Criteo     24 GB      24 GB          55 GB
CTRa       97 GB      97 GB          200 GB

Table 8.1: Total memory used by the server nodes.

As can be seen, AdaDelay consistently outperforms AsyncAdaGrad, which shows that adaptive modeling of the actual delay is better than using a constant worst-case delay when the variance of the delays is large.

Scalability. Finally, we report the system performance. We first present the speedup from 1 machine to 16 machines, where each machine runs 100 workers. We observe a near-linear speedup of AdaDelay, which is shown in Figure 8.6. The main reason is the asynchronous updating, which removes the dependencies between worker nodes. In addition, using multiple workers within a machine can fully utilize the computational resources by hiding the overhead of reading data and communicating the parameters. The results of AsyncAdaGrad and AdaptiveRevision are similar to AdaDelay because their computational workloads are identical except for the parameter updating, which affects the overall system performance little.
In the Parameter Server framework, worker nodes only need to cache one or a few data minibatches. Most memory is used by the server nodes to store the model. We summarize the server memory usage of the three algorithms in Table 8.1. As expected, AdaDelay and AsyncAdaGrad have similar memory consumption because the extra storage needed by AdaDelay to track and compute the incurred delays τ_t is tiny.

¹ As before, the results on online LogLoss are similar to the test AUC and are therefore omitted.

108 January 4, 2017 DRAFT AdaptiveRevision doubles memory usage, because of the extra entries that it needs for each feature, and because of the cached delayed gradient gbak.


Chapter 9

Parsa: Data Partition via Submodular Approximation

9.1 Introduction

In this chapter, we address a question that is fundamental to applying today's loosely-coupled “scale-out” cluster computing techniques to important classes of machine learning applications: how should data and parameters be spread around a cluster of machines in order to achieve efficient processing? In our Parameter Server (see Chapter 3), both data and parameters are distributed over the machines to fit the computation power and storage capacity of the nodes. Many other applications and frameworks also face the issue of data and parameter distribution. For instance, in very large scale graph factorization [7], the challenge is to partition a natural graph so that the memory required to store the local state and to cache adjacent variables is bounded by the capacity of a machine. Similar constraints appear in GraphLab [59, 90], where vertex-specific updates are carried out while other variables need to be synchronized between machines. Likewise, in distributed graphical model inference with latent variables [6, 123], state variables must be distributed and synchronized efficiently between machines.
Shared parameters are synchronized via the communication network in a scale-out cluster. The sheer size of the model parameters and the iterative nature of machine learning algorithms often produce a huge amount of network traffic. Figure 9.1 shows that, in a text classification application, if we randomly assign examples (documents) to machines, the total amount of network traffic can be 10 times larger than the size of the training data. Specifically, more than 1 Terabyte of data is communicated for 72 Gigabytes of training data. Given that network bandwidth is typically much smaller than the bandwidth of accessing local memory, this amount of traffic can become a performance bottleneck for many large-scale machine learning applications. In particular, we face three key challenges in order to achieve scalability:
• Memory footprint (RAM) is limited, and hence the amount of storage per machine available for processing and caching model parameters is often quite constrained compared to the size of the model.

2We assume 100 machines and that the optimization algorithm runs for 100 iterations until convergence. The sparse first-order gradients are communicated in each iteration.


[Figure 9.1 plot: network communication (TB) versus data size (TB), comparing Parsa with random partitioning.]

Figure 9.1: The amount of network communication versus the size of the data in a real text classification dataset, for random partitioning2.

One useful observation is that large scale data are often extremely sparse. For example, a document usually contains only a small fraction of the distinct words in the vocabulary; a person usually has only a few friends compared to the total number of people in a social graph; and only a few vertices in a PubSub network have very high degree (e.g., Bieber, Gaga and Kutcher on Twitter). Such non-uniformity and sparsity is both a boon and a challenge: we face a nontrivial problem of partitioning the model for distributed computing, and simple random partitioning typically leads to poor performance.

Previous works Graph partitioning has attracted a lot of research interest in scientific computing [28, 47, 73], scaling out large-scale computations [20, 59, 60, 141, 147, 155], graph databases [39, 122, 144], search and social network analysis [7, 112, 141], and stream processing [105, 130, 131, 140]. Most of the previous works, such as the well-known package METIS [73], are concerned with edge cuts. Only a few of them solve the vertex-cut problem, which is closely related to this chapter, to directly minimize the network traffic. PaToH [28] and Zoltan [47] use multilevel partitioning algorithms related to METIS, while PowerGraph [59] adopts a greedy algorithm. Very recently, [20] studied the relation between edge cut and vertex cut.

Our contribution Even though partitioning problems are often NP-hard [131], given the practical importance of this problem it is worth seeking scalable solutions that outperform random partitioning. We model the data and parameter placement problem as a graph partitioning problem, and we propose Parsa, a PARallel Submodular Approximation algorithm, for solving it. We also provide a fast and practical implementation using an efficient vertex selection data structure. We propose a novel method based on submodular approximation to solve the vertex-cut partitioning problem, and provide theoretical analysis of the partition quality. Our implementation has runtime O(k|E|), where k is the number of partitions and |E| denotes the number of edges in the graph. This is nontrivial compared to the straightforward implementation, which has runtime O(k|E|^2). We further describe techniques including sampling, initialization, and parallelization to improve the partitioning quality and efficiency. On several large scale datasets, we show that the proposed algorithm outperforms the state-of-the-art methods, including METIS [73], PaToH [28] and Zoltan [47], in terms of both partition quality and speed. Moreover, the proposed method can significantly accelerate Parameter Server when dealing with hundreds of gigabytes of data and billions of model parameters.

9.2 Problem Formulation

Many inference problems in machine learning have graph-structured dependencies. For instance, we can rewrite the objective of risk minimization ((1.1) in Chapter 1) as

\min_w \sum_{i=1}^{n} f(x_i, y_i, w).   (9.1)

In many applications, data and model parameters are correlated via the nonzero entries of the sparse vectors x_i. For example, in spam filtering for email services these entries correspond to words and attributes in emails, while in computational advertising they can model typical words in ads and user behavior patterns. For inference problems in undirected graphical models [18, 76], the correlations between random variables can be encoded by clique potentials ψ_C(w_C), each of which only depends on the subset of variables w_C belonging to a clique C. The applications, ranging from message passing and optimization to sampling, all require extensive manipulation of the parameters w_C involved in a given clique. That is, one needs to solve problems of the form

\min_w \sum_{C \in \mathcal{C}} \psi_C(w_C),   (9.2)

where \mathcal{C} is a set of cliques of variables.
Similar problems also occur in the context of inference on natural graphs [7, 9, 59]. Again, we have sets of interacting parameters which can be represented by vertices of a graph; manipulating a vertex affects all of its neighbors in terms of computation. In the rest of this section, we first show how the parameter dependencies can be modeled using bipartite graphs. Then we formulate the parameter and data placement problem as a graph partition problem.


[Figure 9.2 illustration: data vertices U = {x_1 = (.1, _, _), x_2 = (_, .3, _), x_3 = (_, .4, .3), x_4 = (_, .9, _)} connected to parameter vertices V = {w_1, w_2, w_3} through their nonzero entries.]

Figure 9.2: The dependencies are modeled as a bipartite graph.


Figure 9.3: Each machine is assigned a server and a worker, and gets part of the vertex sets U and V. The inter-machine dependencies (edges) are highlighted, and the communication costs for the three machines are 1, 3, and 3, respectively. Note that moving the 3rd vertex in V to either machine 0 or machine 1 would reduce the cost.

Bipartite Graphs The dependencies in the inference problems can be modeled with a bipartite graph. Let G(U, V, E) denote a bipartite graph, where U and V are the two vertex sets and E is the edge set with entries (u, v) ∈ E for u ∈ U and v ∈ V. Figure 9.2 illustrates how the dependencies in the risk minimization problem can be modeled with a bipartite graph. Here U corresponds to the data examples {u_i = (x_i, y_i)}_{i=1}^m and V corresponds to the model parameters w. There is an edge (u_i, v_j) ∈ E if and only if the j-th entry of x_i is nonzero. Therefore, {w_j : (u_i, v_j) ∈ E} is the working set of parameters needed for evaluating the loss function ℓ(x_i, y_i, w). In other applications such as graphical models and natural graph inference, where there is a natural undirected graph G(V, E), we can easily convert the graph into a bipartite graph G(U', V, E'). One way is to make U' a copy of the original vertex set V and include (u_i, v_j) in the edge set E' if (v_i, v_j) ∈ E. Alternatively, we can define U' to be the set of all cliques of the original graph G and include the edge (u_i, v_j) in E' if and only if the vertex v_j is contained in the clique u_i.
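To make the construction concrete, the following sketch builds such a bipartite dependency graph from a list of sparse examples stored as (feature index, value) pairs. The dictionary-of-sets representation and the function name are purely illustrative and are not part of the Parsa implementation.

from collections import defaultdict

def build_bipartite_graph(examples):
    """examples: list of sparse feature vectors, each a dict {feature_id: value}.
    Returns two adjacency maps: N(u_i) for each data vertex and N(v_j) for each
    parameter vertex."""
    data_to_param = defaultdict(set)   # data vertex i -> parameters it touches
    param_to_data = defaultdict(set)   # parameter j  -> data vertices using it
    for i, x in enumerate(examples):
        for j, value in x.items():
            if value != 0:
                data_to_param[i].add(j)
                param_to_data[j].add(i)
    return data_to_param, param_to_data

# Toy example mirroring Figure 9.2: four examples over three features.
examples = [{0: 0.1}, {1: 0.3}, {1: 0.4, 2: 0.3}, {1: 0.9}]
U_adj, V_adj = build_bipartite_graph(examples)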

Graph Partition In distributed inference, we need to partition the problem and assign the computational tasks to m machines. Next, we show that this can be formulated as a problem of partitioning the dependency graph. Take Parameter Server as an example. Without loss of generality, assume that there are m workers and m servers, and that each machine is associated with one worker and one server.

Note that we can combine multiple workers or servers assigned to the same machine into a single node without affecting the following analysis. To partition the graph, we divide the vertex set V for the parameters into m non-overlapping subsets, namely V = ∪_{i=1}^m V_i, and assign each subset to a server. Similarly, we partition the vertex set U for the data into m non-overlapping subsets, namely U = ∪_{i=1}^m U_i, and assign them to the workers. Figure 9.3 shows an example for m = 3. Next, we formalize the three goals we want to achieve with an efficient graph partition.

Balance the computational load We assume that each example u_i incurs roughly the same computational workload. In order to ensure that each machine is assigned approximately the same workload, we try to keep max_i |U_i| small:

minimize \; \max_i |U_i|.   (9.3)

Satisfy the memory constraints Inference algorithms need to frequently access the model parameters, which are stored in the limited local memory of the workers. Let N(u_i) = {v_j : (u_i, v_j) ∈ E} denote the neighbor set of the (data) vertex u_i. We define N(U_i) = ∪_{u ∈ U_i} N(u) to be the working set of parameters that worker i needs for its computation. Assuming that each parameter v_j has the same storage cost, we try to limit each worker's memory footprint by:

minimize \; \max_i |N(U_i)|.   (9.4)

In the following, we define the function q(U) := |N(U)|.

Minimize the communication cost The communication cost of worker i scales with |N(U_i)|. To reduce this cost, we place server i on the same machine where worker i resides, so that the data communication between this worker and this server enjoys a large bandwidth. The communication cost of worker i then only scales with |N(U_i) \ V_i|. Moreover, if the model parameter v_j is not used by worker i, server i does not need to store it. In other words, we enforce V_i ⊆ N(U_i), and thus the communication cost of worker i scales with |N(U_i)| − |V_i|. The communication cost of server i scales with Σ_{j ≠ i} |V_i ∩ N(U_j)|, which is incurred when other workers request parameters from this server. Overall, we try to reduce the maximal communication cost over the machines by

minimize \; \max_i \Big( |N(U_i)| − |V_i| + \sum_{j \neq i} |V_i ∩ N(U_j)| \Big).   (9.5)
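As a sanity check, the three quantities above can be evaluated directly for any candidate partition. The following sketch computes the maximal load (9.3), the maximal memory footprint (9.4), and the per-machine communication cost appearing inside the max of (9.5); it reuses the adjacency map from the previous sketch and assumes V_i ⊆ N(U_i), as enforced in the text. The function name is illustrative.

def partition_costs(data_to_param, U_parts, V_parts):
    """U_parts[i], V_parts[i]: sets of data / parameter vertices on machine i."""
    m = len(U_parts)
    # Working set N(U_i) of each worker.
    N = [set().union(*[data_to_param[u] for u in U_parts[i]]) for i in range(m)]
    max_load = max(len(U_parts[i]) for i in range(m))        # objective (9.3)
    max_memory = max(len(N[i]) for i in range(m))            # objective (9.4)
    comm = []
    for i in range(m):
        worker_cost = len(N[i]) - len(V_parts[i])            # remote pulls by worker i
        server_cost = sum(len(V_parts[i] & N[j]) for j in range(m) if j != i)
        comm.append(worker_cost + server_cost)
    return max_load, max_memory, max(comm)                   # max inside (9.5)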

9.3 Algorithm

In this section we present Parsa, our algorithm for solving the graph partition problem. Note that the optimization problem in (9.4) is equivalent to an m-way graph partition problem on the vertex set U with vertex cut as the merit, which has been shown to be NP-complete [28]. Instead of trying to solve the three hard optimization problems (9.3)-(9.5) exactly and simultaneously, the proposed algorithm provides approximate solutions. In particular, Parsa proceeds in two steps: first, it partitions the vertex set U for the data by solving (9.3) and (9.4) approximately, namely assigning the examples to worker nodes to balance the CPU load and minimize the memory footprint; then, given the partition of U, it divides the vertex set V for the parameters by solving (9.5) exactly, namely assigning the parameters to server nodes to minimize the inter-machine communication cost.

Algorithm 10 Partition U via submodular approximation
1: Input: graph G, number of partitions m, maximal number of iterations T, residue θ, and improvement α
2: Output: partitions of U = ∪_{i=1}^m U_i
3: for i = 1, ..., m do
4:   U_i ← ∅
5:   define g_i(S) := q(S ∪ U_i) − α|S ∪ U_i|
6: end for
7: for t = 1, ..., T do
8:   if |U| ≤ mθ then break
9:   find i ← argmin_j |U_j|
10:  draw R ⊆ U by choosing each u ∈ U with probability n/(|U| m)
11:  if |R| > 2T/m then continue
12:  solve S* = argmin_{S ⊆ R} g_i(S)
13:  if g_i(S*) ≤ 0 then
14:    U_i ← U_i ∪ S* and U ← U \ S*
15:  end if
16: end for
17: if |U| > mθ then return fail
18: evenly assign the remaining U to the U_i

9.3.1 Partition the data vertex set U

Note that q(U) = |N(U)| is a submodular function of U. Although the problem of minimizing max_i q(U_i) is NP-complete, there exist a few approximation algorithms. For example, [133] proposed an algorithm with the guarantee that all U_i have approximately equal size. In Algorithm 10 we sketch our algorithm for solving (9.3) and (9.4). It is based on the method in [133]. The key difference is that we build up the sets U_i incrementally, which is essential for guaranteeing both partition quality and computational efficiency at a later stage. Our algorithm proceeds as follows. Each time, we choose the smallest subset U_i and find a subset of vertices to add to it. More specifically, we draw a small set of candidate vertices R and select the best subset of R by solving the following optimization problem:

S^* = \arg\min_{T \subseteq R} g_i(T) := q(U_i ∪ T) − α |U_i ∪ T|,   (9.6)

where the constant α is the minimum-increment weight. If the optimal solution S^* to the above optimization satisfies q(U_i ∪ S^*) < α |U_i ∪ S^*|, i.e. the cost for increasing U_i is small enough, we assign S^* to the subset U_i. The quality of the obtained partition of U is analyzed in the following theorem.


Theorem 20 Assume that there exists some partitioning U_i^* that satisfies max_i q(U_i^*) ≤ B. Let θ = \sqrt{T/\log T}, c = (32π)^{-1/2}, α = mB/\sqrt{T \log T}, and τ = c^{-1} S^2 \log\frac{1}{1-p}. Then Algorithm 10 succeeds with probability at least p and generates a feasible solution with cost

\max_i q(U_i) ≤ 4B \sqrt{T/\log T}.

Proof. The proof closely follows that in [133].
For a given iteration, assume that U_1^* maximizes |U_j^* ∩ U| over all j, and define υ = |U_1^* ∩ U|. Since S^* is the optimal solution at the current iteration, we have g_i(S^*) ≤ g_i(U_1^* ∩ U) = q((U_1^* ∩ U) ∪ U_i) − α |(U_1^* ∩ U) ∪ U_i|. Further note that by monotonicity and submodularity q((U_1^* ∩ U) ∪ U_i) ≤ q(U_1^* ∩ U) + q(U_i). Moreover, U ∩ U_i = ∅ holds since U contains only the leftovers; consequently |(U_1^* ∩ U) ∪ U_i| = |U_1^* ∩ U| + |U_i|. Finally, the algorithm only increases the size of U_i when g_i(S^*) ≤ 0, hence q(U_i) − α|U_i| ≤ 0. Combining these yields

g_i(S^*) ≤ g_i(U_1^* ∩ U) ≤ q(U_1^* ∩ U) − α|U_1^* ∩ U| ≤ B − αυ.

Using Theorem 5.4 from [133], we know that υ ≥ B/α with probability at least c/S^2, and therefore g_i(S^*) ≤ 0 happens with probability at least c/S^2. Hence the probability of removing at least one vertex from U within an iteration is greater than c/S^2. Chernoff bounds show that after τ = c^{-1} S^2 \log(1/(1-p)) iterations the algorithm terminates with probability at least p, since the residual U is then small, i.e., |U| ≤ mθ. The algorithm never selects a U_i for augmentation unless |U_i| ≤ T/m (there would always be a smaller set), and the maximum increment at any given time is 2T/m. Hence |U_i| ≤ 3T/m and therefore q(U_i) ≤ 3Tα/m. Finally, the contribution of the unassigned residual U is at most θB, since each U_i is incremented by at most θ elements and q(u) ≤ B for all u ∈ U. In summary, this yields q(U_i) ≤ 3Tα/m + Bθ = 4B\sqrt{T/\log T}.

9.3.2 Partition the parameter vertex set V

Given a partition of the data vertex set U, to find an assignment of the parameters in the vertex set V to the servers, we reformulate the problem in (9.5) as a convex integer program with totally unimodular constraints. This is then solved by a sequential optimization algorithm that performs a sweep through the variables.
We define indicator variables v_ij ∈ {0, 1} to specify whether server i maintains the model parameter corresponding to vertex j, with the straightforward constraint Σ_i v_ij = 1. Also, let u_ij ∈ {0, 1} denote whether j ∈ N(U_i). We rewrite (9.5) as a convex integer program:

minimize_v \; \max_i \Big[ |N(U_i)| + \sum_j v_{ij} \big( -1 + \sum_{l \neq i} u_{lj} \big) \Big]   (9.7a)
subject to \; \sum_i v_{ij} = 1, \; v_{ij} \in \{0, 1\}, \; and \; v_{ij} \le u_{ij}.   (9.7b)

Algorithm 11 Partition V for given {U_i}_{i=1}^m
1: Input: the neighbor sets {N(U_i)}_{i=1}^m
2: Output: partitions V = ∪_{i=1}^m V_i
3: for i = 1, ..., m do
4:   V_i ← ∅
5:   cost_i ← |N(U_i)|, which is the communication cost of machine i
6: end for
7: for all j ∈ V do
8:   ξ ← argmin_{i : u_ij ≠ 0} cost_i
9:   V_ξ ← V_ξ ∪ {j}
10:  cost_ξ ← cost_ξ − 1 + Σ_{i ≠ ξ} u_ij
11: end for

Here we exploit the fact that Σ_j v_ij u_lj = |V_i ∩ N(U_l)| and that Σ_j v_ij = |V_i|. It turns out that the constraints satisfy the total unimodularity conditions discussed in [66]. As a consequence, we can relax the constraint v_ij ∈ {0, 1} to v_ij ∈ [0, 1] and obtain a convex optimization problem. While the relaxed problem is still formidable, with possibly billions of variables and thousands of partitions, it tells us that the objective has a unique minimum value. Furthermore, sequential minimization over the assignment of one parameter at a time converges to the optimal solution. Algorithm 11 performs a single sweep over (9.7), finding a locally optimal assignment of one variable at a time. We observe that this method is sufficient for a near-optimal solution. Repeated sweeps over the assignment space are straightforward and continue to improve the objective. Further note that we need not store the full neighbor sets in memory; instead, we can perform the assignment in a streaming fashion.
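A minimal sketch of the single sweep in Algorithm 11, written against the adjacency maps from the earlier sketches; the variable names are illustrative and the streaming aspect mentioned above is omitted.

def assign_parameters(param_to_data, U_membership, m):
    """Greedy single sweep of Algorithm 11.
    param_to_data[j]: set of data vertices using parameter j.
    U_membership[u]: index of the worker that data vertex u was assigned to."""
    # u_ij != 0 iff parameter j is needed by worker i.
    needed_by = {j: {U_membership[u] for u in users} for j, users in param_to_data.items()}
    cost = [sum(1 for j in needed_by if i in needed_by[j]) for i in range(m)]  # |N(U_i)|
    V_parts = [set() for _ in range(m)]
    for j, workers in needed_by.items():
        xi = min(workers, key=lambda i: cost[i])     # cheapest machine that needs j
        V_parts[xi].add(j)
        cost[xi] += -1 + (len(workers) - 1)          # line 10 of Algorithm 11
    return V_parts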

9.4 Efficient Implementation

In this section, we discuss the details of our efficient implementation (Algorithm 12) of the proposed algorithm (Algorithm 10). We first show how to find the optimal S^* and how to sample the subset R in (9.6); we then address parallelization with our Parameter Server; finally, we comment on the initialization of the neighbor sets.

9.4.1 Find the Solution to (9.6)

The most expensive operation in the inner loop of Algorithm 10 is line 12, which solves (9.6) to find S^*, the optimal set of vertices to add to the partition. Standard submodular minimization algorithms have runtime on the order of O(n^6) [107], which is impractical for large n. A key approximation in our implementation of Parsa is to add only a single vertex at a time instead of a set of vertices. In other words, instead of solving (9.6) exactly, given a vertex set R and a partition i, we find the vertex u^* as the solution to the following problem:

u^* = \arg\min_{u \in R} g_i(u) := |N({u} ∪ U_i)| − α |{u} ∪ U_i|.   (9.8)

Algorithm 12 Parsa: PARallel Submodular Approximation
1: Input: graph G, initial neighbor sets {S_i}_{i=1}^m, number of partitions m, maximal delay τ, number of subgraphs a used for initialization, number of subgraphs b.
2: Output: partitions U = ∪_{i=1}^m U_i

Scheduler:
1: divide G into b subgraphs
2: ask all workers to partition G with parameters (a, τ, true, false)
3: ask all workers to partition G with parameters (b, τ, false, true)

Server:
1: start with a part of {S_i}_{i=1}^m
2: if receiving a pull request then
3:   reply with the requested neighbor sets {S_i}_{i=1}^m
4: end if
5: if receiving a push request containing {S_i^new}_{i=1}^m then
6:   if initializing then
7:     S_i ← S_i^new for i = 1, ..., m
8:   else
9:     S_i ← S_i ∪ S_i^new for i = 1, ..., m
10:  end if
11: end if

Worker:
1: receive parameters (S, τ, initializing, output)
2: for t = 1, ..., T do
3:   load a subgraph of G
4:   wait until all pushes before time t − τ have finished
5:   pull from the servers the part of the neighbor sets {S_i}_{i=1}^m that is contained in V
6:   obtain partitions {U_i^new}_{i=1}^m and updated neighbor sets {S_i^new}_{i=1}^m using Algorithm 13
7:   if initializing = false and t > 1 then
8:     S_i^new ← S_i^new \ S_i for all i = 1, ..., m
9:   end if
10:  push {S_i^new}_{i=1}^m to the servers
11:  if output then U_i ← U_i ∪ U_i^new for i = 1, ..., m
12: end for

An additional advantage of this approximation is that since we assign only one vertex to the smallest subset, the CPU load can be perfectly balanced for (9.3).

A naive way to solve (9.8) is to compute g_i(u) for all vertices and then determine the optimal u^*. However, if the size of R is a constant fraction of the entire graph, the resulting O(|U||E|) runtime remains impractical for large graphs. We further speed up this computation by pre-computing and storing all vertex costs, and by creating a data structure that allows us to efficiently locate the lowest-cost vertex in memory. Algorithm 13 sketches our method. The inputs to Algorithm 13 are the bipartite graph G = (U, V, E), the number of partitions m, and m subsets

Algorithm 13 Partition the vertex set U
1: Input: graph G(U, V, E), number of partitions m, and initial neighbor sets {S_i}_{i=1}^m
2: Output: partitioned U = ∪_{i=1}^m U_i and updated neighbor sets equal to {S_i ∪ N(U_i)}_{i=1}^m
3: for i = 1, ..., m do
4:   U_i ← ∅
5:   for all u ∈ U do
6:     A_i(u) ← |N(u) \ S_i|
7:   end for
8:   A_i.min ← argmin_{u ∈ U} A_i(u), which always points to the lowest-cost vertex
9: end for
10: while |U| > 0 do
11:  pick the partition i ← argmin_j |S_j|
12:  pick the lowest-cost vertex u* ← A_i.min
13:  assign u* to partition i: U_i ← U_i ∪ {u*}
14:  remove u* from U: U ← U \ {u*}
15:  for j = 1, ..., m do
16:    remove u* from A_j by updating the links
17:  end for
18:  for v ∈ N(u*) \ S_i do
19:    S_i ← S_i ∪ {v}
20:    for u ∈ N(v) ∩ U do
21:      A_i(u) ← A_i(u) − 1
22:      update the links in A_i
23:    end for
24:  end for
25: end while


Figure 9.4: The data structure used to store the vertex costs. It is an array whose i-th entry corresponds to vertex u_i; assigned vertices are marked in gray. The head pointers and the doubly-linked list provide faster access to the data.

S_i ⊆ V, which are the neighbor sets assigned to the partitions. The outputs are the m subsets that form the partition U = ∪_i U_i and the updated subsets S_i.

Storing the vertex costs Note that if we subtract the constant |N(U_i)| − α(|U_i| + 1) from g_i(u),

we obtain the cost associated with vertex u as

cost_i(u) := |N(U_i ∪ {u})| − |N(U_i)|,   (9.9)

which reflects the number of new vertices that would be added to the neighbor set of partition U_i by adding vertex u to U_i. When adding a vertex u to the subset U_i, this cost changes only for a few other vertices. In other words, letting Δ_i := {v ∈ N(u) : v ∉ N(U_i)} denote the set of new vertices added to the neighbor set of U_i, the only vertices whose costs decrease are those connected to vertices in Δ_i. Due to the sparsity of the graph, this is often a small subset. Therefore, the overhead of storing and updating the vertex costs can be much smaller than re-computing all the costs each time.
Fast-to-locate data structure Figure 9.4 illustrates the data structure for storing the vertex costs. For each subset U_i, we use an array A_i to store the vertex costs, where the j-th entry is for vertex j, namely A_i(u_j) = cost_i(u_j). On top of the array, we impose a doubly-linked list sorted by increasing cost so that we can locate the lowest-cost vertex quickly. When the cost of a vertex is modified, we update the doubly-linked list to restore the ordering. Note that the cost of a vertex is no larger than its degree, and most large scale graphs have a heavy-tailed degree distribution; therefore the costs of a large portion of the vertices are small integers. We store a small array of "head" pointers to the locations in the list where the cost jumps to 0, 1, 2, ..., θ. These pointers speed up locating elements in the list when updating the costs. In practice, we found that setting θ = 1000 is sufficient to cover more than 99% of the vertex costs.

The initial A_i(u)'s can be computed in O(|E|) time and then ordered in O(|U|) time by counting sort, as they are integers upper bounded by the maximal vertex degree. The most expensive part of Algorithm 13 is updating A_i in line 22, which is executed at most m|E| times, as a vertex v ∈ V together with its neighbors is accessed at most once per partition. The time complexity of updating the doubly-linked lists is O(1). Accessing the j-th vertex costs O(1) due to the sequential storage in an array. Finding a vertex with the minimal value or removing a vertex from the list also takes O(1) time because of the double links. Keeping the list ordered after decreasing a vertex cost by 1 is O(1) in most cases (O(|U|) in the worst case), as discussed above, by using the cached head pointers. Therefore, the average time complexity of Algorithm 13 is O(m|E|).
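For illustration, the following sketch implements the greedy loop of Algorithm 13 in Python. It replaces the bucketed doubly-linked list described above with lazy per-partition min-heaps (stale entries are skipped on pop), which changes the constant factors but not the greedy decisions; the neighbor-set initialization is omitted and all names are illustrative.

import heapq

def greedy_partition_U(data_to_param, param_to_data, m):
    """Greedy rendering of Algorithm 13. data_to_param[u] is N(u);
    param_to_data[v] is the set of data vertices adjacent to parameter v."""
    S = [set() for _ in range(m)]                 # neighbor sets S_i
    U_parts = [set() for _ in range(m)]
    unassigned = set(data_to_param)
    cost = [{u: len(data_to_param[u]) for u in unassigned} for _ in range(m)]
    heaps = [[(cost[i][u], u) for u in unassigned] for i in range(m)]
    for h in heaps:
        heapq.heapify(h)
    while unassigned:
        i = min(range(m), key=lambda j: len(S[j]))            # line 11
        c, u = heapq.heappop(heaps[i])
        while u not in unassigned or c != cost[i][u]:
            c, u = heapq.heappop(heaps[i])                    # skip stale heap entries
        U_parts[i].add(u)                                     # lines 13-14
        unassigned.discard(u)
        for v in data_to_param[u] - S[i]:                     # lines 18-23
            S[i].add(v)
            for w in param_to_data[v]:
                if w in unassigned:
                    cost[i][w] -= 1
                    heapq.heappush(heaps[i], (cost[i][w], w)) # re-push updated cost
    return U_parts, S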

9.4.2 Divide into Subgraphs

In line 10 of Algorithm 10 we sample a subset R in order to keep the partitions balanced. The constraint |S| = 1 introduced in Section 9.4.1 ensures that only a single vertex is assigned at a time, so the balance of the partition is already guaranteed. In this case, we would like to sample as many vertices as possible to enlarge the search range for the optimal u^*. However, sampling is still appealing, since a proper sample size can significantly reduce the computation time while only slightly affecting the partition quality.
To address this issue, we first randomly divide U into b blocks, and then construct b subgraphs by including the neighbor vertices from V and the corresponding edges. Let {G_j}_{j=1}^b denote the b subgraphs, and let {S_i}_{i=1}^m denote the initialized neighbor sets. We then apply


Algorithm 13 to partition these subgraphs sequentially. Specifically, the inputs are G_j and the S_i, and the outputs are the partition ∪_{i=1}^m U_{i,j} of G_j and the updated sets S_i. Finally, we take the union of the solutions over all subgraphs to obtain the final partition of U, namely U_i = ∪_{j=1}^b U_{i,j}.
Note that instead of sampling a new subgraph R for each single vertex assignment, as in line 10 of Algorithm 10, our implementation fixes the subgraphs from the beginning. This strategy has several advantages. First, it exploits the efficient implementation of Algorithm 13 to partition each subgraph. Second, it is convenient to parallelize the algorithm over the subgraphs. Finally, this strategy is I/O efficient, as we only keep the current subgraph in memory, which makes it possible to partition graphs much larger than the physical memory.
Also note that the number of subgraphs b trades off partition quality against computational efficiency. In the case b = 1, the vertex assigned to a partition is the best one among all unassigned vertices; it is, however, the most time consuming setting. At the other extreme, setting b = |U| recovers the random partition, with the time complexity reduced to O(|E|). A moderate b balances computation time and partition quality. A small sketch of this blocking step is given below.
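A minimal sketch of the subgraph construction, assuming the data vertices are shuffled and cut into b blocks; the induced subgraph of each block is simply the adjacency map restricted to that block, and the function name is illustrative.

import random

def divide_into_subgraphs(data_to_param, b, seed=0):
    """Randomly split the data vertices U into b blocks and return the induced
    bipartite subgraphs, each represented by its restricted adjacency map."""
    vertices = list(data_to_param)
    random.Random(seed).shuffle(vertices)
    blocks = [vertices[j::b] for j in range(b)]        # round-robin split of U
    return [{u: data_to_param[u] for u in block} for block in blocks]

# Each subgraph can then be fed to greedy_partition_U, sharing the neighbor sets S_i.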

9.4.3 Parallelization with Parameter Server

For very large graphs, parallelizing the algorithm is desirable. Parsa uses Parameter Server (see Chapter 3) to parallelize the partitioning process by having multiple nodes work on different subgraphs in parallel using shared neighbor sets. More specifically, we define three kinds of nodes. The scheduler issues partitioning tasks to workers and monitors their progress. Server nodes maintain the globally shared neighbor sets and process the push and pull requests from the worker nodes. Worker nodes solve the subgraph partitioning problems in parallel: a worker node reads a subgraph from the (distributed) file system, pulls the newest neighbor sets associated with this subgraph from the server nodes, applies Algorithm 13 to partition the subgraph, and finally pushes the modified neighbor sets to the servers.
Several details are worth noting. First, the data communication in Parameter Server is asynchronous, and we impose a maximal allowed delay τ to control the data consistency. Second, a worker node may push only the changes of the neighbor sets to the servers in order to save communication traffic. Finally, a worker node can start a separate data pre-fetching thread to run steps 3, 4, and 5 of Algorithm 12 in order to improve computation efficiency.

9.4.4 Initialize the Neighbor Sets

Note that the neighbor sets play a role similar to that of cluster centers in clustering methods, and a good initialization of the neighbor sets leads to better partition results. One typical choice of initialization is the empty set, which prefers to assign the vertices with small degrees first. Partitioning these vertices, however, often helps little, or even degrades the subsequent assignments. Next, we discuss a few heuristics for initializing the neighbor sets.
Individual initialization Suppose the graph has been divided into b subgraphs. For a fixed constant a, we sequentially partition the first a subgraphs, and each time we only use the

partition of the preceding subgraph to initialize the neighbor sets for the current subgraph. We then discard the partitioning results of the first (a − 1) subgraphs. Starting with the partition of the a-th subgraph, we sequentially partition the other (b − 1) subgraphs, and each time we use the partitions of all the processed subgraphs to initialize the neighbor sets for the current subgraph.
Global initialization In parallel partitioning, we first sample a small subgraph and solve the partition problem for this subgraph. The neighbor sets obtained from this subgraph partition are then used as the initialization by all other worker nodes.
Incremental partitioning When data arrive in an incremental or streaming fashion, we can use the partition results from outdated data as an initialization of the neighbor sets.

9.5 Experiments

9.5.1 Setup

We implement Parsa on top of our Parameter Server; the source code is available at https://github.com/mli/parsa. We compare Parsa with the popular vertex-cut graph partitioning toolboxes Zoltan3 and PaToH4. We also benchmark against the well-known graph partitioning package METIS5 and the greedy algorithm adopted by PowerGraph6. All of these algorithms are implemented in C/C++.
We compare the runtime and partition results of these algorithms. The default measure of partition quality is the maximal individual traffic volume. We measure the improvement over random partitioning by

(random-partition − proposed-partition) / proposed-partition × 100%.   (9.10)

Thus a 100% improvement means that the traffic volume or memory footprint is 50% of that achieved by random partitioning. The default number of partitions is 16. As Parsa is a randomized algorithm, we average the results over 10 trials. For the single-thread experiments we use the desktop machine Desktop with an Intel i7 3.4GHz CPU, while for the parallel experiments we use the university cluster CampusA with 16 machines, each with an Intel Xeon 2.4GHz CPU and 1 Gigabit Ethernet.
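For completeness, the improvement metric (9.10) as a small helper; a value of 100 means the measured quantity is half of what random partitioning achieves.

def improvement(random_value, proposed_value):
    """Relative improvement (%) of a proposed partition over random partitioning, per (9.10)."""
    return (random_value - proposed_value) / proposed_value * 100.0

# Example: 2.0 TB of traffic under random partitioning vs 1.0 TB under the
# proposed partition yields a 100% improvement.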

9.5.2 Performance of Parsa

3 http://www.cs.sandia.gov/Zoltan/
4 http://bmi.osu.edu/~umit/software.html
5 http://glaros.dtc.umn.edu/gkhome/metis/metis/overview
6 http://graphlab.org/downloads/

Table 9.1: Improvements (%) over random partitioning in the maximal individual memory footprint M_max, the maximal individual traffic volume T_max, and the total traffic volume T_sum, together with running times (in seconds), for 16-way partitions of RCV1, News20, CTRa, KDD14, LiveJournal, and Orkut using Parsa, Zoltan, PaToH, METIS, and PowerGraph. The best results are marked in red and the second best in green. Only 1% of dataset CTRa is used.


[Figure 9.5 panels: improvement (%) versus running time (sec); (a) text datasets, (b) social networks.]

Figure 9.5: Visualization of Table 9.1.

In Table 9.1, we compare the performance of Parsa with the other algorithms on different datasets. We record the CPU time for running each algorithm; the time for loading the data is not included, as it varies with the data formats specific to the algorithms. The improvements are measured according to (9.10) for the maximal individual memory footprint and traffic volume, together with the total traffic volume, which is the objective of both PaToH and Zoltan. Since neither METIS nor PowerGraph handles general sparse matrices, we only report their results on the social network datasets. The number of partitions is 16, and the hyper-parameters of Parsa are set to a = b = 16; recall that b is the number of subgraphs and a is the number of subgraphs used for individual initialization.
In Figure 9.5, we plot the improvement in the maximal individual traffic volume against the runtime. Observe that Parsa is not only 20x faster than PaToH and Zoltan, but also produces more stable partition results with a reduced memory footprint. METIS outperforms Parsa on one of the two social networks but consumes twice the CPU time. PowerGraph is the fastest but has poor partition quality. Parsa also outperforms the other algorithms in terms of both the maximal individual traffic volume and the total traffic volume.
As shown in Figure 9.6, when the number of subsets in the partition increases, the runtime of the recursive-bisection-based algorithms (METIS, PaToH, and Zoltan) does not change much, but their partition quality degrades. In contrast, the runtime of Parsa and PowerGraph increases linearly with m, but their partition quality actually improves. Next, we tune the hyper-parameters of Parsa and observe how its performance changes.

Vary the number of subgraphs and the percentage of data used for initialization We examine the effect of the number of subgraphs (Section 9.4.2) and of the neighbor set initialization (Section 9.4.4). We first consider the single-thread case with empty neighbor set initialization. We vary the number of subgraphs b as well as the number of subgraphs a used for individual initialization. Figure 9.7 shows the results on the representative text dataset CTRa and the social network dataset LiveJournal, where the x-axis is a/b × 100%, namely the percentage of data used for constructing the initialization.


[Figure 9.6 panels: improvement (%) and running time (sec) versus the number of partitions (2 to 64) for Parsa, PaToH, Zoltan, METIS, and PowerGraph; (a) 1% of CTRa, (b) LiveJournal.]

Figure 9.6: Partition quality and runtime for different numbers of partitions.

We observe that using more data for initialization improves the partition quality. In particular, for b > 1, there is a stable 20% improvement when at least 100% of the data are used. With empty neighbor set initialization, using small subgraphs has a positive effect on the partition results for LiveJournal, but not for CTRa. The reason is that Parsa tends to assign vertices with small degrees first when starting with empty neighbor sets, and those vertices offer little or no benefit for the subsequent assignment. LiveJournal has many more sparse vertices than CTRa due to the power-law distribution of its degrees, and partitioning small subgraphs helps to prevent the sparse vertices from entering the partition too early.
Our nontrivial individual initialization addresses this issue by using only those partition results to set the initial neighbor sets. Notably, with a > b/2 in Figure 9.7, small subgraphs improve the partition results on both CTRa and LiveJournal. This occurs because, when using the same percentage of data for initialization, the neighbor sets are reset more often with a small b.
Figure 9.7 also shows that the runtime actually decreases with b. This is because splitting into more blocks (larger b) narrows the search range for adding vertices and thus reduces the cost of operating the doubly-linked lists. On the other hand, the runtime increases linearly as we use


[Figure 9.7 panels: improvement (%) and running time (sec) versus the percentage of samples used for initialization, for b = 1, 4, 16; (a) 1% of CTRa, (b) LiveJournal.]

Figure 9.7: Partition quality and runtime for different percentages of data used in the initialization (single-thread implementation).

more subgraphs for individual initialization, but the partition quality benefits of doing so appear worthwhile up to performing two passes (100% of the samples).
We also examine the performance of the parallel implementation with non-empty initial neighbor sets. We use 4 workers to partition a subset of CTRa containing 1 billion edges, and use one worker to partition an even smaller subgraph of size n_o to obtain the starting neighbor sets. Figure 9.8 shows that the partition quality is significantly improved even when only 0.1% of the data are used for the initialization. In addition, although the initialization takes extra time, the total runtime is minimized at n_o > 0. This is because a good initialization of the neighbor sets reduces the cost of operating the doubly-linked lists, thereby reducing the overall runtime.

Scalability We test the scalability of Parsa on the full CTRa dataset with 10 billion edges. We run 4 workers and 4 servers on each machine. Figure 9.9 shows that the speedup scales almost linearly with the number of machines and is close to the ideal case. In particular, we obtain a 13.7x


[Figure 9.8 panels: improvement (%) and running time (sec) versus the percentage of data used for initialization, for b = 4, 8, 16, 32.]

Figure 9.8: Partition quality and runtime for different percentages of data used in the initialization (parallel implementation).

[Figure 9.9 plot: speedup (x) versus the number of machines, Parsa compared with the ideal linear speedup.]

Figure 9.9: Speedup of Parsa as the number of machines increases (on dataset CTRa).

speedup when increasing the number of machines from 1 to 16. The main reason that Parsa scales well is the eventual consistency model (τ = ∞), in which a worker does not need to wait until the results from previous iterations have been pushed successfully. Therefore, workers fully utilize the computing resources and the network bandwidth, without the overhead of data synchronization.
This consistency model potentially leads to inconsistency of the neighbor sets between workers, to which Parsa is quite robust. In our experiment, when the number of machines increases from 1 to 16 (and the number of workers from 4 to 64), the quality of the partition result decreases by at most 5%. We believe the reason is twofold. First, the initial neighbor sets obtained with a small subgraph give all workers a consistent initialization, which may already contain most of the hubs (large-degree vertices) in V. Second, after a worker partitions a subgraph, the updates to the neighbor sets mainly involve vertices with small degree; thanks to the sparsity of the low-degree vertices, the conflicts among workers are small, and the inconsistency does not drastically affect the final result.

method    partition    inference    total
random    0            1.43         1.43
Parsa     0.07         0.84         0.91

Table 9.2: Time in hours for solving ℓ1-regularized logistic regression. We run 45 data passes using 16 machines on the dataset CTRa.

Accelerate Distributed Machine Learning Finally, we examine how much Parsa can accelerate distributed machine learning applications through better data and parameter placement. We consider ℓ1-regularized logistic regression with the distributed inference algorithm DBPG (see Chapter 6 for more details). As the baseline, we run DBPG on CTRa using 16 machines with all optimization options enabled. We then apply Parsa to partition CTRa into 16 parts and run the algorithm again. Table 9.2 compares the two settings. With random partitioning, DBPG stops after passing over the data 45 times and takes 1.43 hours. In contrast, Parsa takes 4 minutes to partition the data and then accelerates DBPG to 0.84 hours, an overall 1.6x speedup. Under random partitioning, only 6% of the network traffic between servers and workers happens locally.
Although Chapter 6 reports nearly zero communication cost for DBPG, here we observe that a significant part of the time is spent on data synchronization. The reason is twofold. First, in Chapter 6 we pre-process CTRb to remove tail features (tail vertices in V); the feature-to-example ratio is 2.83 for CTRa but 0.38 for CTRb. Here, instead, we feed the raw data to the algorithm and let the ℓ1 regularizer perform feature selection automatically, which often leads to a better machine learning model but induces more network traffic. Second, the network bandwidth of the university cluster used in this experiment is 20 times lower than that of the industrial data center used in Chapter 6, so the communication cost cannot be ignored here. We observe that with our problem partitioning, the total data communication is reduced from 4.5TB to 4TB, and the fraction of traffic that stays within a machine increases from 6% to 92%. Overall, the inter-machine communication decreases by more than 90%.


Chapter 10

DiFacto: Scaling Distributed Factorization Machines

10.1 Introduction

Nonlinear models for recommendation and estimation have spurred significant interest in recent years. The growing bodies of work on deep learning [42], kernel methods [81], and decision trees [148] bear witness to this development. For recommender systems and polynomial generalized linear models, Factorization Machines (FM) [114, 115] offer a computationally efficient and powerful alternative. They achieve excellent results by using a low-rank expansion of the higher-degree polynomial terms. Moreover, they offer a principled framework for a number of feature space heuristics in recommender systems, such as biases, features, cold-start strategies, and temporal models. This framework makes them a very attractive target for statistical modeling on the high-dimensional sparse data occurring in settings such as computational advertising, personalization, user profiling, recommendation, and search.
Unfortunately, despite the low-rank expansion, the memory cost remains tremendous in real-world settings, because each feature, each user, and each object needs to be embedded into a low-dimensional space. A quick calculation even on modest datasets such as Criteo shows that we have up to 1 billion features. Realistic problems, such as CTRb, can have up to 10^11 terms. Even a modest 100-dimensional representation would require data (parameters, preconditioners, auxiliary key storage) on the order of 1TB. This is computationally intractable on a single machine, especially when tackling industrial scale problems. Moreover, even multiple machines are heavily taxed by the large memory footprint. In order to address the above issues, we need:
1. A compact model with low computation and communication cost which still provides high-quality embeddings;
2. An efficient mechanism for problem partitioning, optimization, and high-performance communication, targeted at distributed optimization.
In this chapter, we first make the key observation that the importance of features in real datasets is not uniform, which calls for adaptive model capacity. We propose a refined FM model, called DiFacto (Distributed Factorization). DiFacto is based on the Parameter Server framework (for

details please see Chapter 3). The algorithm adaptively chooses the effective embedding dimension and regularization for each feature, based on the feature frequency count and sparse regularization. The resulting model is up to 100x more compact than a conventional FM and even provides better generalization accuracy. We prove that the proposed algorithm converges for this highly challenging nonconvex objective function, even under asynchronous updates. Experiments show that for sparse data, DiFacto scales to billions of examples and features. To the best of our knowledge, our results describe the largest statistical model ever computed, exceeding that of [7] by at least one order of magnitude. Moreover, our algorithms require only modest resources on a per-machine basis.

10.2 Background

10.2.1 Objectives

In this chapter we discuss two related goals: recommendation and prediction. The distinction between the two problems is partly due to different notational conventions, and partly due to slightly different objectives.
In recommender systems [15] one is typically given pairs of entities, say user u and movie m, for which a rating y(u, m) needs to be estimated. The quality of an estimate of the rating, denoted by g(u, m), is evaluated, e.g., by the squared discrepancy between rating and estimate, i.e. \frac{1}{2}(y(u, m) − g(u, m))^2, or by the quality of the relative ordering of ratings, or by the per-session ordering of preferences. For an exhaustive summary of such objectives see, e.g., [101]. In short, we are interested in the performance of g(u, m) relative to y(u, m). Sometimes, when we have additional features z_u, z_m for u and m and a context c, we include them in the estimate and denote the estimate by g(u, z_u, m, z_m, c). In the following we use the shorthand

x := (u, z_u, m, z_m, c)  and  g(x) := g(u, z_u, m, z_m, c)   (10.1)

to indicate the function that takes the features in x as inputs and outputs the estimate g(x).
In prediction we are interested in the slightly more mundane goal of obtaining g(x), given pairs of features x and labels y. Typical goals are to minimize the squared deviation \frac{1}{2}(y − g(x))^2 or the logistic loss (negative log-likelihood) \log(1 + e^{−y g(x)}). Note that y and g(x) need not be of the same data type, and this was exploited substantially in structured estimation [136].
In the following we will not make major distinctions between the two problems of recommendation and prediction. Instead, we simply assume that x is either a set of covariates, such as the words occurring in a document for which the click-through rate (CTR) of an advertisement is sought, or a set of recommendation and rating parameters. The quality is evaluated via a loss function ℓ(x, y, g(x)), yielding the risk functional

\frac{1}{n} \sum_{i=1}^{n} \ell(x_i, y_i, g(x_i)).   (10.2)

10.2.2 Factorization Machine

The typical setting in our context is that the covariates x ∈ R^p live in a very high-dimensional space and are sparse. For instance, in recommender systems x might contain only two nonzero entries: an indicator for the user and another indicator for the movie, i.e. x_u = 1 and x_m = 1. The consequence of such a setting is that linear models are of limited use. For instance, in a recommender system this would amount to

g(x) = \langle w, x \rangle = w_u + w_m.   (10.3)

In other words, this is the trivial case where recommendations are simply the sum of user and movie biases. Nonetheless, due to the very high dimensionality of the data and the associated uncertainty in determining even just the linear parameters, such models are quite popular in CTR estimation problems [99]. At the same time, even a quadratic model, which requires O(p^2) parameters, is too expensive to estimate, since typically the number of observations n ≪ p^2. Rendle and coauthors [114, 115] introduced a principled strategy for alleviating this problem via a low-rank expansion instead of a general high-dimensional expansion. That is, they propose

g(x) = \langle w, x \rangle + \sum_{i<j} x_i x_j \, \mathrm{tr}\left( V_i^{(2)} \otimes V_j^{(2)} \right) + \sum_{i<j<k} x_i x_j x_k \, \mathrm{tr}\left( V_i^{(3)} \otimes V_j^{(3)} \otimes V_k^{(3)} \right) + \ldots   (10.4)

Restricting the expansion to pairwise interactions and rewriting the ordered sum, one obtains

g(x) = \langle w, x \rangle + \frac{1}{2} \left( \| V x \|_2^2 − \sum_{i=1}^{d} x_i^2 \| V_i \|_2^2 \right),   (10.5)

where we use the shorthand V := V^{(2)} and k_2 = k, and V_i is the i-th column of V. Here we used the polarization identity to rewrite the ordered sum, together with the fact that tr(a ⊗ b) = ⟨a, b⟩. Note that (10.5) can be computed efficiently in O(pk) time, requiring O(pk) storage. This also illustrates a shortcoming of Factorization Machines.
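A minimal sketch of evaluating (10.5) for a sparse input in time proportional to the number of nonzeros times k; V is stored as a dictionary mapping a feature index to its k-dimensional column, and all names are illustrative.

import numpy as np

def fm_predict(x, w, V, k):
    """x: dict {feature_id: value}; w: dict {feature_id: float};
    V: dict {feature_id: np.ndarray of shape (k,)} holding the column V_i."""
    linear = sum(w.get(i, 0.0) * xi for i, xi in x.items())
    vx = np.zeros(k)
    sq = 0.0
    for i, xi in x.items():
        vi = V.get(i)
        if vi is None:
            continue                        # feature without an embedding (k_i = 0)
        vx += xi * vi                       # accumulates V x
        sq += xi * xi * float(vi @ vi)      # accumulates x_i^2 ||V_i||^2
    return linear + 0.5 * (float(vx @ vx) - sq)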

Lemma 21 Assume that we estimate (10.5) with an embedding dimension k. If for feature i we observe x_i ≠ 0 fewer than k + 1 times, then the problem of estimating (w_i, V_i) is underdetermined.

Proof. This follows directly from linear algebra: whenever x_i = 0, the terms w_i and V_i do not occur in (10.5). To solve a linear system in k + 1 variables we need at least the same number of constraints.

However, we would still use O(k) memory even for variables that hardly ever occur. Given the power-law nature of many natural feature distributions, this is clearly undesirable. Large numbers of variables are not just hard to estimate and hard to store, they are also cumbersome for optimization purposes, since they slow down convergence and need to be addressed via additional stabilizers (regularization). Moreover, for large scale problems the number of parameters usually far exceeds the amount of memory available on a single computer, which calls for distributed representations and optimization algorithms.

10.3 Statistical Model

DiFacto considers the following optimization problem:

\min_{w, V} \; \frac{1}{n} \sum_{i=1}^{n} \ell(g(x_i), y_i) + \lambda_1 \|w\|_1 + \frac{1}{2} \sum_{i=1}^{p} \left( \lambda_i w_i^2 + \mu_i \|V_i\|_2^2 \right),   (10.6)

subject to V_{ij} = 0 for j > k_i,   (10.7)

where the constants λ_1, λ_i, μ_i and k_i are regularization coefficients. This model differs in two key aspects from standard FM models. First, we add frequency adaptive regularization to better model the possibly nonuniform sparsity pattern in the data. Second, we use sparsity regularization and memory adaptive constraints to control the model size, which benefits both the statistical model and the system performance. In the rest of this section, we discuss the details of the above formulation and introduce the model capacity control strategies adopted by DiFacto.

10.3.1 Memory Adaptive Constraints

The first step in making factorization machines tractable is to modify the expansion in (10.4) in line with Lemma 21. Whenever we have insufficient evidence of a feature occurring, it makes no sense to allocate much capacity to it. More to the point, instead of allocating a fixed number of dimensions k to each feature i regardless of how frequently x_i is nonzero, we make this number depend on the number of times x_i ≠ 0 in the training set. Define n_i to be:

n_i := |\{x : x_i \neq 0\}|.   (10.8)

Given a set of dimensionalities {k_i}, we impose the constraint V_{ij} = 0 for all j > k_i. This forces the infrequently occurring features to assume a lower-dimensional embedding, using only k_i dimensions rather than the full k > k_i. A simple choice for the dimensions is

k_i = k if n_i > r, and k_i = 0 otherwise,   (10.9)

where, for instance, we can use two levels and set k^{(1)} = r^{(1)} = 10^3 and k^{(2)} = r^{(2)} = 100; this differentiated setting yields a larger embedding for the more frequent keys.
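A small sketch of the frequency counting and the thresholding rule above, using a hypothetical two-level schedule; the thresholds are the ones quoted in the text and everything else is illustrative.

from collections import Counter

def adaptive_dims(examples, levels=((1000, 1000), (100, 100))):
    """examples: list of sparse dicts {feature_id: value}.
    levels: (r_threshold, k_dim) pairs, checked from largest to smallest threshold.
    Returns the counts n_i and the chosen embedding dimension k_i per feature."""
    n = Counter()
    for x in examples:
        n.update(j for j, v in x.items() if v != 0)
    k = {}
    for j, nj in n.items():
        k[j] = 0
        for r, dim in levels:
            if nj > r:
                k[j] = dim
                break
    return n, k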

10.3.2 Sparse Regularization

Note that the above strategy is entirely independent of the actual problem we solve. We also need a data adaptive capacity control mechanism that depends on the predictive power of features relative to the labels y_i. A popular choice is ℓ1 regularization [139], which has been widely used in linear models for high-dimensional data such as computational advertising [99]. In ℓ1 regularization, we add a penalty function to the optimization objective:

h(w) = \lambda_1 \|w\|_1,

where λ_1 controls the degree of sparsity. The sparse model induced by the ℓ1 regularization not only penalizes complex models, but also reduces the computational cost of the gradients and saves communication traffic. It results in a smaller final model, which in turn makes it easier to deploy for online serving. We can apply a similar structured sparsity penalty [104] on V to encourage sparse solutions:

h(w, V) = \sum_i \left[ \left( w_i^2 + \|V_i\|_2^2 \right)^{\frac{1}{2}} + \|V_i\|_2^2 \right].   (10.10)

In this case the derivative of the penalty with respect to w_i vanishes at w_i = 0, provided that V_i ≠ 0. Unfortunately, this penalty is harder to handle from the systems perspective. Hence we replace it by a rather straightforward approximation that directly implements the intent of (10.10) by enforcing V_i = 0 whenever w_i = 0. Our experiments show that this strategy is by no means detrimental to the overall outcome of the estimation, and it closely resembles the ANOVA decomposition hierarchy proposed by [145].

10.3.3 Frequency Adaptive Regularization

It is well known that adaptively penalizing terms that occur at different frequencies can lead to improved generalization performance [128]. For instance, in collaborative filtering, the strategy of penalizing frequent terms more aggressively, e.g., by performing shrinkage only whenever a user (or a movie) is being updated, has proven successful. In DiFacto, we use a slightly more general and more flexible method for capacity control: we shrink a feature's parameters whenever the feature occurs in a minibatch. The consequence is that the parameters associated with infrequently occurring features are not over-regularized. More specifically, we define the penalty function with feature-specific coefficients as:

h(w, V) = \frac{1}{2} \sum_i \left( \lambda_i w_i^2 + \mu_i \|V_i\|_2^2 \right).   (10.11)

Lemma 22 (Dynamic Regularization) Assume that we solve a factorization machines problem

with the following updates in a minibatch setting: for each feature i,

I_i ← I\{ [x_j]_i \neq 0 \text{ for some } (x_j, y_j) \in B \},   (10.12)
w_i ← w_i − \frac{\eta_t}{b} \sum_{(x_j, y_j) \in B} \partial_{w_i} \ell(g(x_j), y_j) − \eta_t \lambda w_i I_i,   (10.13)
V_i ← V_i − \frac{\eta_t}{b} \sum_{(x_j, y_j) \in B} \partial_{V_i} \ell(g(x_j), y_j) − \eta_t \mu V_i I_i,   (10.14)

where B denotes the minibatch of data, b is the size of the minibatch, and I_i is the indicator of whether feature i appears in the minibatch. Then the effective regularization coefficients are given by:

\lambda_i = \lambda \rho_i  and  \mu_i = \mu \rho_i,

where \rho_i = 1 − (1 − n_i/n)^b ≈ n_i b / n.

Proof. The probability that a particular feature occurs in a random minibatch is ρ_i = 1 − (1 − n_i/n)^b. Observe that while the amounts of regularization applied across minibatches are not independent, they are additive (and exchangeable). Hence the expected regularization coefficients are ρ_i λ and ρ_i μ, respectively. Expanding (1 − n_i/n)^b to first order in n_i/n yields the approximation.
Note that for b = 1 we obtain the conventional frequency-dependent regularization, whereas for b = n we obtain the plain Frobenius regularization. Choosing the minibatch size appropriately allows us to interpolate between the two extreme cases.
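A compact sketch of the minibatch update (10.12)-(10.14) with the occurrence indicator; the gradients are taken abstractly through a user-supplied loss_grad function and all names are illustrative.

def minibatch_step(w, V, batch, loss_grad, eta, lam, mu, b):
    """batch: list of (x, y) with sparse x as dict {feature_id: value}.
    loss_grad(x, y, w, V) returns (dw, dV) as dicts over the features present in x."""
    gw, gV, touched = {}, {}, set()
    for x, y in batch:
        dw, dV = loss_grad(x, y, w, V)
        for i, g in dw.items():
            gw[i] = gw.get(i, 0.0) + g
        for i, g in dV.items():
            gV[i] = gV.get(i, 0.0) + g
        touched.update(x)                  # I_i = 1 for features seen in this minibatch
    for i in touched:
        wi = w.get(i, 0.0)
        w[i] = wi - (eta / b) * gw.get(i, 0.0) - eta * lam * wi       # (10.13)
        if i in V:                         # features with k_i = 0 have no embedding
            V[i] = V[i] - (eta / b) * gV.get(i, 0.0) - eta * mu * V[i]  # (10.14)
    return w, V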

10.4 Distributed Optimization

The optimization problem in (10.6) is challenging, especially since large models incur a large communication cost in distributed computing. We focus on asynchronous optimization methods, which hide the synchronization cost by communicating and computing in parallel.

10.4.1 Asynchronous Stochastic Gradient Descent

To solve the optimization problem in (10.6), we use asynchronous stochastic gradient descent (SGD). In a distributed computing environment, the workers calculate the gradients of the loss function. First, a worker computes the partial gradients of the estimator g(·) defined in (10.5):

\partial_{w_i} g(x, w, V) = x_i,   (10.15)
\partial_{V_{ij}} g(x, w, V) = x_i [V x]_j − x_i^2 V_{ij}.   (10.16)

Note that the term V x can be pre-computed. Invoking the chain rule, we can then compute the partial gradients of the loss function ℓ(·). Let w_t and V_t denote the model stored on the servers at time t. We denote the partial gradients of the loss function that the servers receive from a worker node at time t by:

g_t^{\theta} ← \partial_{\theta} \ell(g(x, w_{t−\tau}, V_{t−\tau}), y),   (10.17)

where θ can be either w or V, and τ is the time delay, indicating that the gradients were computed by the worker using the possibly outdated model of time (t − τ).
As discussed in Section 10.3.3, we use frequency adaptive regularization for w and V, and the sparsity-inducing ℓ1 regularization for w. In addition, we use AdaGrad [49] to better handle the possibly nonuniform sparsity in the data. Upon receiving the gradients, the server updates V_{ij} by:

n_{ij} ← n_{ij} + \left( [g_t^V]_{ij} \right)^2,
[V_{t+1}]_{ij} ← [V_t]_{ij} − \frac{\eta_V}{\beta_V + \sqrt{n_{ij}}} \left( [g_t^V]_{ij} + \mu [V_t]_{ij} \right),   (10.18)

where η_V and β_V are scalar constants, and n_{ij} is initialized to 0 at time 0.
The update rule for w is slightly different from that of V due to the nonsmooth ℓ1 regularizer. We adopt FTRL [97], which solves a "smoothed" proximal operator based on AdaGrad. Let η_w and β_w denote the learning-rate constants for w; the server updates w_i by:

\sigma_i ← \frac{1}{\eta_w} \left( \sqrt{n_i + ([g_t^w]_i)^2} − \sqrt{n_i} \right),
z_i ← z_i − [g_t^w]_i + \sigma_i [w_t]_i,
n_i ← n_i + ([g_t^w]_i)^2,   (10.19)
[w_{t+1}]_i ← 0 \text{ if } |z_i| \le \lambda_1, \text{ and otherwise } [w_{t+1}]_i ← \left( \frac{\beta_w + \sqrt{n_i}}{\eta_w} + \lambda_2 \right)^{-1} \left( z_i − \mathrm{sgn}(z_i) \lambda_1 \right),

where both n_i and z_i are set to 0 at time 0.
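A sketch of the server-side updates (10.18)-(10.19) for a single coordinate, written as plain Python functions; the state dictionaries and hyper-parameter names are illustrative and follow the symbols above.

import math

def server_update_V(Vij, g, state, eta_V, beta_V, mu):
    """AdaGrad update (10.18) for one entry V_ij; state holds the accumulator n_ij."""
    state['n'] = state.get('n', 0.0) + g * g
    return Vij - eta_V / (beta_V + math.sqrt(state['n'])) * (g + mu * Vij)

def server_update_w(wi, g, state, eta_w, beta_w, lam1, lam2):
    """FTRL update (10.19) for one entry w_i; state holds the accumulators n_i and z_i."""
    n, z = state.get('n', 0.0), state.get('z', 0.0)
    sigma = (math.sqrt(n + g * g) - math.sqrt(n)) / eta_w
    z = z - g + sigma * wi
    n = n + g * g
    state['n'], state['z'] = n, z
    if abs(z) <= lam1:
        return 0.0
    return (z - math.copysign(lam1, z)) / ((beta_w + math.sqrt(n)) / eta_w + lam2)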

10.4.2 Convergence Analysis

In Algorithm 14 we sketch DiFacto's procedure for solving (10.6). The convergence analysis for this non-convex and non-smooth optimization problem is highly non-trivial, especially in the face of asynchronous updates. Below, we provide a preliminary analysis for the special case λ_1 = 0. For notational simplicity we assume a fixed learning rate; the results can be extended to the adaptive learning rate by assuming that n_{ij} in (10.18) is bounded (see [84] for more details).
In particular, we first show that the simplified algorithm has an O(1/\sqrt{t}) ergodic convergence rate (see, e.g., [56]). Due to the non-convexity, our results do not imply convergence to a KKT point. However, they are stronger than typical non-convex analyses in the sense that there is an explicit convergence rate. By carefully incorporating the specific properties of our problem, we obtain a convergence bound in Theorem 26 that explicitly and intuitively reveals the dependence on the data sparsity and the size of the minibatch, as well as on the more typical maximum-delay parameter of the asynchronous updates.
First, the following lemma shows that Algorithm 14 can be reduced to asynchronous SGD.

Lemma 23 Let η be the fixed learning rate for both w and V, and let λ1 = 0. Then the update equations (10.18) and (10.19) for solving problem (10.6) become plain asynchronous SGD.


Proof. First of all, we note that despite the memory-adaptive constraints V_ij = 0 for some i, j, we are essentially solving an unconstrained smooth optimization problem within a pre-defined coordinate subspace, which admits a simple (non-projected) SGD algorithm. It remains to show that the stochastic gradient is unbiased. Since the minibatch is picked at random, the expectation of the loss part of the partial stochastic gradient with respect to V is exactly the corresponding gradient, and since the regularization weights are frequency-adaptively chosen, they do not affect the expectation of the gradient. Now we turn to the partial gradient with respect to w. Under the assumption that λ1 = 0 and without AdaGrad, we have n_i = 0. Hence for every t ∈ {1, . . . , T} the update equation (10.19) becomes

$$[w_{t+1}]_i \leftarrow \frac{\eta_w}{\beta_w}\,[z_{t+1}]_i \quad\text{where}\quad [z_{t+1}]_i \leftarrow [z_t]_i - \bigl[g_t^w\bigr]_i.$$

Substituting $[z_1]_i = \frac{\beta_w}{\eta_w}[w_1]_i$, this recursion is exactly the SGD update equation

$$[w_{t+1}]_i \leftarrow [w_t]_i - \frac{\eta_w}{\beta_w}\,\bigl[g_t^w\bigr]_i$$
with learning rate $\eta = \eta_w/\beta_w$. That the stochastic gradient is unbiased is again immediate, since the data points are picked uniformly at random and the regularization weights µ_i are frequency adaptive.

Our analysis builds upon a generic argument in [88], which we state below.

Theorem 24 ([88, Theorem 2]) Assume that the stochastic gradient is unbiased and has variance bounded by σ², and that the gradient ∇f(·) is L-Lipschitz. Moreover, assume that τ_t, the delay at time t, is upper bounded by τ, and that a global optimal solution x* exists with objective value f* > −∞. If the stepsize satisfies

$$L\eta_t + 2L^2\tau\,\eta_t\sum_{\kappa=1}^{\tau}\eta_{t+\kappa} \le 1 \qquad \text{for all } t = 1, 2, \ldots,$$
then we have an ergodic convergence rate for the following algorithm:
1. Pick i uniformly at random from the data points 1, . . . , n.

2. Update $w_{t+1} = w_t - \eta_t \nabla f_i(w_{t-\tau_t})$.
In particular, the following bound holds:

$$\sum_{t=1}^{T}\eta_t\,\mathbb{E}\bigl(\|\nabla f(w_t)\|^2\bigr) \;\le\; 2\bigl(f(w_1) - f(w^*)\bigr) + \sum_{t=1}^{T}\Bigl(\eta_t^2 L + 2L^2\eta_t\sum_{j=t-\tau}^{t-1}\eta_j^2\Bigr)\sigma^2. \qquad (10.20)$$
Dividing both sides by $\sum_t \eta_t$, we can state our convergence result as a corollary of the above theorem after verifying all the conditions. It simplifies slightly with a fixed learning rate.

Corollary 25 ([88, Corollary 2]) Under the same assumptions as in Theorem 24, if we use a constant stepsize
$$\eta := \sqrt{\frac{f(x_1) - f(x^*)}{L T \sigma^2}},$$
then for every integer $T \ge \frac{4L\bigl(f(x_1) - f(x^*)\bigr)}{\sigma^2}(\tau + 1)^2$ the output of SGD obeys

$$\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\bigl(\|\nabla f(w_t)\|^2\bigr) \;\le\; 4\sqrt{\frac{\bigl(f(w_1) - f(w^*)\bigr)L}{T}}\;\sigma. \qquad (10.21)$$

We apply the above results to our problem setup to prove the following theorem, where the constants explicitly depend on the average sparsity of the data and size of the minibatch we use.

Theorem 26 Define $\phi_i(w) = \ell(g(x_i), y_i)$ and

$$\Phi(w) = \frac{1}{n}\sum_{i=1}^{n}\phi_i(w) + \sum_{j=1}^{p}\bigl[\lambda_j w_j^2 + \mu_j\|V_j\|^2\bigr].$$

Let the delay of the asynchronous gradient satisfy τ_t ≤ τ for every update t, and let the minibatch size be b. Assume that the data X obeys $\|X\|_\infty \le 1$ and $s_i = \|x_i\|_0 \le s$.

Assume that we work within a sublevel set of the problem such that $\bigl\|[w_{\mathrm{supp}(x_i)},\, V_{\mathrm{supp}(x_i \otimes x_i)}]\bigr\| \le B_i$. Furthermore, assume that the regularization parameters are sufficiently small that the regularization terms do not outweigh the data terms in the gradients. Then there is a universal constant C and a data-(sparsity-)dependent constant
$$K = \frac{1}{nb}\,\max_i (s_i B_i)^2 \sum_{i=1}^{n}(s_i B_i),$$
such that setting the stepsize $\eta = \sqrt{C\,\Phi(w_1)/(\tau K)}$ results in a stochastic sequence of parameters that satisfies

$$\min_{t\in\{1,\ldots,T\}} \mathbb{E}\|\nabla \Phi(w_t)\|^2 \;\le\; \frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\|\nabla \Phi(w_t)\|^2 \;\le\; 4\sqrt{\frac{\Phi(w_1)\,\tau K}{T}}. \qquad (10.22)$$

We first interpret the result before stating the proof. First, the expected magnitude of the gradient is an intuitive measure of the distance from stationarity. It is weaker than typical measures in convex optimization, such as primal suboptimality or the mean-square distance to the optimal solution, but for “nice” loss functions it should be in the same ballpark (see, e.g., the discussions in [56, 88]). Second, when the data is sparse or the minibatch is large, we can afford a larger learning rate and hence obtain faster convergence. Note that the result captures the average sparsity, so the quantity remains small even when a few data points are dense; this is especially important in the face of the power-law distributions common in practice.

Proof of Theorem 26. The proof is a specialization of Corollary 25 to our problem: it only requires calculating the gradient Lipschitz constant and the variance of the stochastic gradient, and upper bounding them by the intuitive quantities of interest. By the chain rule,
$$\partial_w \phi_i(w) = \frac{\partial \ell(g(x_i), y_i)}{\partial g(x_i)}\,\frac{\partial g(x_i)}{\partial w}.$$


Since ℓ is the logistic loss, $\bigl|\frac{\partial}{\partial g(x_i)}\ell(g(x_i), y)\bigr| \le 1$, and it follows from (10.15) and (10.16) that
$$\begin{aligned}
\|\partial_w\phi_i(w)\| &\le 1\cdot\sqrt{\sum_j\bigl(\partial_{w_j} g(x_i)\bigr)^2 + \sum_{j,\ell}\bigl(\partial_{V_{j\ell}} g(x_i)\bigr)^2}\\
&= \sqrt{\sum_j x_{ij}^2 + \sum_{j,\ell}\bigl(x_{ij}[Vx_i]_\ell - x_{ij}^2 V_{j\ell}\bigr)^2}\\
&\le \sqrt{s_i}\,\|x_i\|_\infty + s_i\,\bigl\|x_i [V x_i]^\top\bigr\|_{2,\infty} \;\le\; s_i B_i.
\end{aligned}$$

Since the objective is differentiable, the Lipschitz constant can be obtained by upper bounding the gradient:

$$\begin{aligned}
\|\partial_w\Phi(w)\| &\le \frac{1}{n}\sum_{i=1}^{n}\|\partial_w\phi_i(w)\| + 2\|\Lambda w\|\\
&\le \frac{1}{n}\sum_{i=1}^{n}s_i B_i + 2\|\Lambda w\|_2 \;\le\; \frac{2}{n}\sum_{i=1}^{n}s_i B_i,
\end{aligned}$$
where Λ is a large diagonal matrix containing the frequency-adaptive regularization weights λ_j and µ_j, and the last inequality holds when these weights are small. The variance of the stochastic gradient is taken over an i.i.d. minibatch of size b, and it can be bounded using the boundedness of the gradient:

$$\sigma^2 \;\le\; \frac{\max_i\|\partial_w\phi_i(w)\|^2}{b} \;\le\; \frac{\max_i\bigl(s_i B_i + B_i\bigr)^2}{b} \;\le\; \frac{4\max_i\bigl(s_i B_i\bigr)^2}{b}.$$
The last inequality is again a simplification under the assumption that the regularization terms are small. Finally, we note that Φ(w) ≥ 0 for any w, so Φ(w*) ≥ 0. We arrive at (10.22) by substituting these bounds into (10.21) of Corollary 25. The proof is completed by noting that the minimum of a sequence of numbers is never larger than its mean.

10.4.3 Implementation

Algorithm 14 sketches our algorithm DiFacto. It consists of three parts: the scheduler node runs the control logic, the server nodes coordinate the workers and update the model, and the worker nodes compute gradients on their local data.

The scheduler issues commands, such as “process examples I” or “save the model”, to the worker and server nodes. It also monitors the progress of the workers: once a straggler or a dead node is detected, the scheduler re-issues the command to another available node. The scheduler also checks whether the stopping criterion has been reached.

Server nodes maintain and update the model w and V. Due to the adaptive memory constraints, V contains many zero entries, so to reduce the memory footprint a server node allocates physical memory only for the nonzero elements of V_i. The same applies when a

Algorithm 14 Implementing DiFacto in the parameter server

Scheduler node:
1: Assume the data is partitioned into s parts: ⋃_{k=1}^{s} I_k = {1, . . . , n}
2: for t = 1 to T do
3:   I = {I_1, . . . , I_s} and A = ∅
4:   while I ≠ ∅ do
5:     switch detected event from worker k do
6:       case idle:
7:         Pick I_i ∈ I \ A and assign I_i to worker k
8:         A = A ∪ {I_i}
9:       case finished I_i:
10:        I = I \ {I_i}
11:      case dead or timeout:
12:        A = A \ {I_i}
13:  end while
14: end for

Worker k:
1: Receive the command “process I_i” from the scheduler
2: while reading a minibatch from I_i do
3:   Pull w_j and V_j from the server nodes for all features j that appear in this minibatch
4:   Compute the gradients based on (10.15) and (10.16)
5:   Push the gradients back to the servers
6: end while

Server i:
1: if a gradient is received from a worker then
2:   Update w and V using (10.19) and (10.18)
3: end if

server node handles model pull requests from worker nodes. Upon receiving gradients from the worker nodes, the server node updates the model using the update rules discussed in Section 10.4.1.

Worker nodes perform most of the actual computation. After receiving a “process examples I” command from the scheduler, a worker repeatedly reads minibatches from I, which is stored as files in a distributed file system. For each minibatch B ⊆ I, the worker first finds the supporting feature indices J = {j : x_ij ≠ 0 for some i ∈ B}, then pulls (prefetches) the corresponding working set of model parameters {w_j, V_j : j ∈ J} from the server nodes, and finally computes the gradients on the minibatch and pushes them to the servers. Note that a worker node only needs to cache the minibatches currently being processed and the associated working set of parameters, which minimizes memory consumption. It also overlaps data IO, computation, and communication to hide the IO cost. In addition, a worker node uses several filters to reduce the amount of data that needs to be communicated.

We adopt the standard filters provided by the parameter server [83], including key caching and data compression. In addition, we use fixed-point encoding: we convert the floating-point weights and gradients into shorter fixed-point integers during communication. The code of our DiFacto implementation is publicly available in the DMLC project1.

Figure 10.1: Number of non-zero entries in V as a function of the dimension k, for (a) Criteo and (b) CTRa, under no memory adaption, frequency thresholding, and frequency + ℓ1 shrinkage.
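As a concrete illustration of the worker loop described above, the sketch below processes one data block. The pull/push helpers and the sparse-dict representation of examples are placeholder assumptions standing in for the actual parameter-server client API; the gradients follow (10.15) and (10.16) via the chain rule for the logistic loss, with V x precomputed per example.

```python
import numpy as np

def process_block(minibatches, pull, push):
    """Sketch of the worker step of Algorithm 14 for one data block.

    `minibatches` yields lists of (y, x) pairs with y in {-1, +1} and
    x a dict {feature_index: value}; `pull(J)` returns (w, V) for the
    working set J and `push(J, g_w, g_V)` sends gradients back.
    Both helpers are hypothetical, not the real client calls.
    """
    for batch in minibatches:
        J = sorted({j for _, x in batch for j in x})   # supporting feature indices
        w, V = pull(J)                                 # w[j]: float, V[j]: length-k array
        k = len(next(iter(V.values())))
        g_w = {j: 0.0 for j in J}
        g_V = {j: np.zeros(k) for j in J}
        for y, x in batch:
            vx = sum(v * V[j] for j, v in x.items())   # precompute V x (length-k vector)
            score = sum(v * w[j] for j, v in x.items()) + 0.5 * (
                vx @ vx - sum(v * v * (V[j] @ V[j]) for j, v in x.items()))
            d = -y / (1.0 + np.exp(y * score))         # d loss / d score for logistic loss
            for j, v in x.items():
                g_w[j] += d * v                        # chain rule with (10.15)
                g_V[j] += d * (v * vx - v * v * V[j])  # chain rule with (10.16)
        push(J, g_w, g_V)
```

In the real system the pulls are prefetched and overlapped with the gradient computation, and the pushed values pass through the filters described above.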

10.5 Experiments

We run DiFacto with 100 workers and 100 servers on 10 physical machines (Amazon EC2 c4.8xlarge instances). We fix the dimension k = 16 and use minibatch sizes of 10^4 for Criteo and 10^3 for CTRa; for the smaller datasets, we decrease the minibatch sizes by a factor of 10. We use a fixed regularizer for w with constants λ1 = 4 and λ = 0, and we tune the constant µ by searching over {0, 10^-5, 10^-4, 10^-3} on an additional validation set. The elements of V are initialized uniformly at random in [−0.01, 0.01]. The learning rate ηw is selected between 0.001 and 0.1, with ηV = ηw and βw = βV = 1. Finally, we terminate the algorithm when the objective value on the validation set stops decreasing.

10.5.1 Adaptive memory

We first study the effectiveness of adaptive memory. We compare the following three settings, which use different memory-adaptive constraints in (10.6):

No memory adaption: no constraint is applied, i.e., we do not force any elements of V_i to 0.
Frequency threshold: we impose the constraint V_i = 0 for infrequent features with n_i < k (recall that n_i is the number of occurrences of feature i in the data).
Frequency threshold + ℓ1 shrinkage: in addition to the frequency threshold, we impose V_i = 0 whenever w_i = 0, i.e., we mark V_i as inactive if w_i is set to 0 by the sparsity-inducing ℓ1 regularization.
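As an illustration, the sketch below marks which latent vectors V_i are kept under each setting; the function name, mode strings, and dense count/weight arrays are assumptions for exposition, not DiFacto's actual data structures.

```python
import numpy as np

def active_rows(counts, w, k, mode):
    """Return a boolean mask of features i whose latent vector V_i is kept.

    counts[i] = n_i, the number of occurrences of feature i;
    w         = current linear weights (after the L1/FTRL update);
    k         = embedding dimension;
    mode      = 'none' | 'freq' | 'freq+l1shrk'.
    """
    counts = np.asarray(counts)
    if mode == 'none':                      # no memory adaption
        return np.ones_like(counts, dtype=bool)
    keep = counts >= k                      # frequency threshold: drop V_i when n_i < k
    if mode == 'freq+l1shrk':
        keep &= np.asarray(w) != 0          # additionally drop V_i when w_i was shrunk to 0
    return keep
```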

1http://dmlc.github.io/

Figure 10.2: Runtime for one iteration as a function of the dimension k, for (a) Criteo and (b) CTRa, under the three memory-adaption settings.

Figure 10.3: Relative test logloss compared to logistic regression (k = 0, i.e., 0% relative loss), as a function of the dimension k, for (a) Criteo and (b) CTRa.

We use the same hyper-parameters for the three settings and vary the dimension k from 1 to 64. Figure 10.1 shows that the memory-adaptive constraints effectively reduce the model size: the reduction is about 100x for Criteo and 300x for CTRa when k = 64. Figure 10.2 shows that they also bring about a 20% reduction in runtime. These constraints also significantly decrease the server nodes' memory consumption and network traffic. Especially when k is large, the resulting amount of communication may overwhelm the system, so reducing the model size substantially benefits system performance. A more interesting observation, from Figure 10.3, is that the memory-adaptive constraints do not hurt accuracy. On the contrary, we even see a slight improvement when the dimension k is greater than 8 for CTRa. The reason could be that model capacity control is of great importance when the dimension k is large, and these memory-adaptive constraints provide additional capacity control beyond the ℓ2 and ℓ1

regularizers.

10.5.2 Fixed-point Compression

We evaluate lossy fixed-point compression for data communication. By default, both model and gradient entries are represented as 32-bit floating-point numbers. In this experiment, we compress these values into lower-precision integers. More specifically, given a bin size b and a number of bits n, we represent x by the n-bit integer
$$z := \left\lfloor \frac{x}{b} \times 2^n \right\rfloor + \sigma, \qquad (10.23)$$
where σ is a Bernoulli random variable chosen such that E[z] = 2^n x / b. We implemented fixed-point compression as a user-defined filter in the parameter server framework. Since multiple numbers are communicated in each round, we choose b to be the maximum absolute value of these numbers. In addition, we used the key-caching and lossless data compression (via LZ4) filters.

The results for n = 8, 16, 24 are shown in Figures 10.4 and 10.5. As expected, fixed-point compression linearly reduces the volume of network traffic, which is dominated by communicating the model and gradients. We even obtain a 4.2x compression rate when going from 32-bit floating point to 8-bit fixed point on Criteo, because the fixed-point representation improves the compression rate of the subsequent lossless LZ4 pass. The effect on accuracy differs between the two datasets: CTRa is robust to the numeric precision, while Criteo incurs a 6% increase in logloss when using only a 1-byte representation. A moderate compression rate even improves the model accuracy slightly; the reason might be that the lossy compression acts as a regularizer on the objective function.
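A minimal sketch of this stochastic fixed-point encoding, following (10.23); the function names are illustrative, and sign/overflow handling is simplified relative to the actual filter.

```python
import numpy as np

def fixedpoint_encode(x, n_bits, rng=np.random):
    """Encode an array of floats as n-bit fixed-point integers via (10.23).

    The bin size b is the maximum absolute value of the batch; sigma is a
    Bernoulli variable that makes the encoding unbiased, E[z] = 2^n x / b.
    """
    x = np.asarray(x, dtype=np.float64)
    b = float(np.max(np.abs(x))) or 1.0                   # avoid dividing by zero on all-zero input
    scaled = x / b * (2 ** n_bits)
    low = np.floor(scaled)
    sigma = rng.random_sample(x.shape) < (scaled - low)   # P(sigma = 1) = fractional part
    return (low + sigma).astype(np.int64), b

def fixedpoint_decode(z, b, n_bits):
    """Unbiased reconstruction of the original values."""
    return z.astype(np.float64) * b / (2 ** n_bits)
```

The resulting integers are then passed through the lossless LZ4 filter, which is where the extra gain beyond the nominal 4x (for 8-bit encoding) comes from.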

10.5.3 Comparison with LibFM

To the best of our knowledge, there is no publicly released distributed FM solver, so we compare DiFacto to the popular single-machine package LibFM developed by Rendle [114]. We report results only on smaller datasets sampled from Criteo and CTRa on a single machine, since LibFM fails on the two larger datasets. We perform a grid search over the hyper-parameters similar to the one used for DiFacto. As LibFM uses only a single thread, we run DiFacto with 1 worker and 1 server in sequential execution order; we also report the performance with 10 workers and 10 servers on a single machine for reference.

Figure 10.6 shows that DiFacto converges significantly faster than LibFM: it needs half as many iterations to reach the best model. This is because the adaptive learning rate used in DiFacto better models the data sparsity, and the adaptive regularization and constraints further accelerate convergence. In particular, the latter results in a lower test logloss on the CTRa dataset, where the number of features exceeds the number of examples and improved capacity control is therefore required.

Also note that DiFacto with a single worker is twice as slow as LibFM per iteration, because the data communication overhead between the worker and the server cannot be ignored under sequential execution.

Figure 10.4: Total data sent by the workers in one iteration, as a function of the number of bytes per entry. The compression rates from 4-byte to 1-byte encoding are 4.2x and 2.9x for Criteo and CTRa, respectively.

Figure 10.5: Relative test logloss compared to no fixed-point compression, as a function of the number of bytes per entry.

Figure 10.6: Comparison with LibFM on a single machine: test logloss versus training time for (a) 46 million examples from Criteo and (b) 0.3 million examples from CTRa, comparing LibFM, DiFacto with 1 worker, and DiFacto with 10 workers.

More importantly, DiFacto does not require any data preprocessing to map the arbitrary 64-bit integer and string feature indices used in both Criteo and CTRa to consecutive integer indices. The cost of this preprocessing step, which LibFM requires but which is not shown in Figure 10.6, even exceeds the cost of the actual training (1,400 seconds for Criteo). Nevertheless, DiFacto with a single worker still outperforms LibFM, and it is 10 times faster than LibFM when using 10 workers on the same machine.

Figure 10.7: The speedup from 1 machine to 16 machines, where each machine runs 10 workers and 10 servers; the ideal linear speedup is shown for reference.

10.5.4 Scalability

Finally, we study the scalability of DiFacto by varying the number of physical machines used in training. We assign 10 worker nodes and 10 server nodes to each machine and increase the number of machines from 1 to 16. Figure 10.7 shows that DiFacto achieves an 8x speedup when the number of machines grows from 1 to 16, for both Criteo and CTRa. The reason for this satisfactory performance is twofold. First, asynchronous SGD eliminates the need for synchronization between workers and has a high tolerance for stragglers. Even though we used dedicated machines, they still share network bandwidth with others; in particular, we observed a large variation in read speed when streaming data from Amazon's S3 service, despite using the IO-optimized c4.8xlarge series of machines. Second, DiFacto uses several filters that effectively reduce the amount of network traffic. We observe that even though CTRa produces 10 times more network traffic than Criteo, the two datasets show similar speedups.

There is a longstanding suspicion that the convergence of asynchronous SGD degrades as the number of workers increases. Nonetheless, we do not observe a substantial difference in model accuracy: the relative difference in test logloss is below 0.5% when increasing the number of workers from 10 to 160. The reason might be that the datasets we use are highly sparse and the features are not strongly correlated, so the inconsistency caused by concurrent updates from multiple workers has little effect. These observations are consistent with our convergence analysis of the algorithm.


Chapter 11

Conclusion



Bibliography

[1] Geforce 10 series. https://en.wikipedia.org/wiki/GeForce 10 series. 2.1 [2] Detailed specifications of the intel xeon e5-2600v4 broadwell-ep processors. https://www.microway.com/knowledge-center-articles/detailed-specifications-of-the- intel-xeon-e5-2600v4-broadwell-ep-processors/. 2.1 [3] Top500 lists, 2016. https://www.top500.org/lists/2016/06/. 2.1 [4] A. Agarwal and J. C. Duchi. Distributed delayed stochastic optimization. In IEEE Con- ference on Decision and Control, pages 5451–5452. IEEE, 2012. 8.1, 8.2.1, 8.2.3, 8.3.1 [5] A. Ahmed, Y. Low, M. Aly, V. Josifovski, and A. J. Smola. Scalable inference of dynamic user interests for behavioural targeting. In Knowledge Discovery and Data Mining, 2011. 3.4.2 [6] A. Ahmed, Mohamed Aly, Joseph Gonzalez, Shravan Narayanamurthy, and A. J. Smola. Scalable inference in latent variable models. In Proceedings of The 5th ACM International Conference on Web Search and Data Mining (WSDM), 2012. 3.1.3, 3.4.2, 9.1 [7] A. Ahmed, Nino Shervashidze, Shravan Narayanamurthy, Vanja Josifovski, and Alexan- der J. Smola. Distributed large-scale natural graph factorization. In World Wide Web Conference, Rio de Janeiro, 2013. 6.1, 9.1, 9.1, 9.2, 10.1 [8] Amazon. Amazon web services. https://aws.amazon.com/. 1, 1.1.2 [9] R. Andersen, D.F. Gleich, and V.S. Mirrokni. Overlapping clusters for distributed com- putation. In E. Adar, J. Teevan, E. Agichtein, and Y. Maarek, editors, Proceedings of the Fifth International Conference on Web Search and Web Data Mining, WSDM 2012, Seattle, WA, USA, February 8-12, 2012, pages 273–282. ACM, 2012. URL http://doi.acm.org/10.1145/2124295.2124330. 9.2 [10] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen. LAPACK Users’ Guide. SIAM, Philadelphia, second edition, 1995. 3.2.1

[11] Galen Andrew and Jianfeng Gao. Scalable training of l1-regularized log-linear models. In Proceedings of the International Conference on Machine Learning, pages 33–40, New York, NY, USA, 2007. ACM. 6.3.1 [12] Apache Foundation. Mahout project, 2012. http://mahout.apache.org. 3.1.3 [13] L. A. Barroso and H. Holzle.¨ The datacenter as a computer: An introduction to the design of warehouse-scale machines. Synthesis lectures on computer architecture, 4(1):1–108,

149 January 4, 2017 DRAFT 2009. 2.2, 2.2 [14] Fred´ eric´ Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian Goodfellow, Ar- naud Bergeron, Nicolas Bouchard, David Warde-Farley, and Yoshua Bengio. Theano: new features and speed improvements. arXiv preprint arXiv:1211.5590, 2012. 4.1.1, 4.1.2 [15] R. M. Bell and Y. Koren. Lessons from the netflix prize challenge. SIGKDD Explorations, 9(2):75–79, 2007. URL http://doi.acm.org/10.1145/1345448.1345465. 10.2.1 [16] R. Berinde, G. Cormode, P. Indyk, and M.J. Strauss. Space-optimal heavy hitters with strong error bounds. In J. Paredaens and J. Su, editors, Proceedings of the Twenty-Eigth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS, pages 157–166. ACM, 2009. URL http://doi.acm.org/10.1145/1559795. 1559819. 3.4.3 [17] D. Bertsekas and J. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, 1989. 5 [18] J. Besag. Spatial interaction and the statistical analysis of lattice systems (with discussion). Journal of the Royal Statistical Society. Series B, 36(2):192–236, 1974. 9.2 [19]L eon´ Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838, 2016. 5 [20] Florian Bourse, Marc Lelarge, and Milan Vojnovic. Balanced graph edge partition. In ACM KDD 2014, August 2014. 9.1 [21] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1–123, 2010. 6.3.2 [22] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, England, 2004. 5 [23] J.K. Bradley, A. Kyrola, D. Bickson, and C. Guestrin. Parallel coordinate descent for L1- regularized loss minimization. In L. Getoor and T. Scheffer, editors, Proceedings of the 28th International Conference on Machine Learning, pages 321–328. Omnipress, 2011. 6.1, 6.2.2, 6.3.1 [24] J. Byers, J. Considine, and M. Mitzenmacher. Simple load balancing for distributed hash tables. In Peer-to-peer systems II, pages 80–87. Springer, 2003. 3.3.3 [25] RH Byrd, SL Hansen, Jorge Nocedal, and Y Singer. A stochastic quasi-newton method for large-scale optimization. arXiv preprint arXiv:1401.7020, 2014. 7.1.3 [26] Richard H Byrd, Gillian M Chin, Jorge Nocedal, and Yuchen Wu. Sample size selection in optimization methods for machine learning. Mathematical programming, 134(1):127– 155, 2012. 7.1.2 [27] K. Canini. Sibyl: A system for large scale supervised machine learning. Technical Talk, 2012. URL http://users.soe.ucsc.edu/˜niejiazhong/slides/ chandra.pdf. 1, 1.1.1, 3.4.1

150 January 4, 2017 DRAFT [28] U.V. Catalyurek and C. Aykanat. Hypergraph-partitioning-based decomposition for par- allel sparse-matrix vector multiplication. IEEE Transactions on Parallel and Distributed Systems, 10(7):673–693, 1999. 9.1, 9.1, 9.3 [29] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. Bigtable: a distributed storage system for structured data. In OSDI ’06: Proceedings of the 7th symposium on Operating systems design and implementation, pages 205–218, Berkeley, CA, USA, 2006. USENIX Association. 2.2 [30] Kai-Wei Chang and Dan Roth. Selective block minimization for faster convergence of limited memory large-scale linear models. In Conference on Knowledge Discovery and Data Mining, pages 699–707, 2011. 7.1.3 [31] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015. 1.3, 4.3.2, 4.5 [32] Byung-Gon Chun, Tyson Condie, Carlo Curino, Chris Douglas, Sergiy Matusevych, Bran- don Myers, Shravan Narayanamurthy, Raghu Ramakrishnan, Sriram Rao, Josh Rosen, et al. Reef: Retainable evaluator execution framework. Proceedings of the VLDB Endow- ment, 6(12):1370–1373, 2013. 3.1.2 [33] Ronan Collobert, Koray Kavukcuoglu, and Clement´ Farabet. Torch7: A matlab-like en- vironment for machine learning. In BigLearn, NIPS Workshop, number EPFL-CONF- 192376, 2011. 4.1.1, 4.1.2 [34] P. L. Combettes and J. C. Pesquet. Proximal splitting methods in signal processing. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pages 185–212. Springer, 2011. 5.1 [35] The Kubernetes Community. Kubernetes: Production-grade container orchestration, 2016. http://kubernetes.io. 2.2 [36] Minerva contributors. Minerva: a fast and flexible system for deep learning. https://github.com/dmlc/minerva. 4.6 [37] G. Cormode and S. Muthukrishnan. Summarizing and mining skewed data streams. In SDM, 2005. 3.4.3, 3.4.3 [38] Andrew Cotter, Ohad Shamir, Nati Srebro, and Karthik Sridharan. Better mini-batch algorithms via accelerated gradient methods. In NIPS, volume 24, pages 1647–1655, 2011. 7.1.3 [39] Michael Curtiss, Iain Becker, Tudor Bosman, Sergey Doroshenko, Lucian Grijincu, Tom Jackson, Sandhya Kunnatur, Soren Lassen, Philip Pronin, Sriram Sankar, et al. Unicorn: A system for searching the social graph. VLDB, 6(11):1150–1161, 2013. 9.1 [40] Wei Dai, Jinliang Wei, Xun Zheng, Jin Kyu Kim, Seunghak Lee, Junming Yin, Qirong Ho, and Eric P Xing. Petuum: A framework for iterative-convergent distributed ml. arXiv preprint arXiv:1312.7651, 2013. 3.1.2, 3.2.1

151 January 4, 2017 DRAFT [41] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. CACM, 51(1):107–113, 2008. URL http://doi.acm.org/10.1145/1327452. 1327492. 1.1.2, 2.2, 3.1.3 [42] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Se- nior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Neural Information Processing Systems, 2012. 3.1.3, 3.2, 6.3.2, 10.1 [43] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Siva- subramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon’s highly available key- value store. In T. C. Bressoud and M. F. Kaashoek, editors, Symposium on Operat- ing Systems Principles, pages 205–220. ACM, 2007. ISBN 978-1-59593-591-5. URL http://doi.acm.org/10.1145/1294261.1294281. 2.2, 3.3.1 [44] Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction using mini-batches. Technical report, http://arxiv.org/abs/1012. 1367, 2010. 7.1.2, 7.1.3, 7.1.4 [45] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hier- archical image database. In Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2009. 4.5.3 [46] CXXNet Developers. Cxxnet: fast, concise, distributed deep learning framework, 2015. https://github.com/dmlc/cxxnet. 4.6 [47] K.D. Devine, E.G. Boman, R.T. Heaphy, R.H. Bisseling, and U.V. Catalyurek. Parallel hypergraph partitioning for scientific computing. In Parallel and Distributed Processing Symposium, pages 10–pp. IEEE, 2006. 9.1, 9.1 [48] J. J. Dongarra, J. Du Croz, S. Hammarling, and R. J. Hanson. An extended set of fortran basic linear algebra subprograms. ACM Transactions on Mathematical Software, 14:18– 32, 1988. 3.2.1 [49] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2010. 10.4.1 [50] John Duchi, Michael I Jordan, and Brendan McMahan. Estimation, optimization, and parallelism when data is sparse. In NIPS 26, pages 2832–2840, 2013. 8.1 [51] R.-E. Fan, J.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, Au- gust 2008. 6.3.1, 7.3 [52] The Apache Software Foundation. Apache hadoop nextgen mapreduce (yarn). http: //hadoop.apache.org/. 2.2, 3.3.5 [53] The Apache Software Foundation. Apache hadoop, 2009. http://hadoop.apache.org/core/. 2.2, 3.1.3 [54] The Apache Software Foundation. Cassandra: Manage massive amounts of data, fast, without losing sleep, 2016. http://cassandra.apache.org/. 2.2 [55] The Apache Software Foundation. Flink: open source platform for distributed stream and

152 January 4, 2017 DRAFT batch data processing, 2016. http://flink.apache.org/. 2.2 [56] Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for non- convex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013. 10.4.2, 10.4.2 [57] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The google file system. SIGOPS Operating Systems Review, 37(5):29–43, 2003. 2.2 [58] Kevin Gimpel, Dipanjan Das, and Noah A Smith. Distributed asynchronous online learn- ing for natural language processing. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning, pages 213–222. Association for Computa- tional Linguistics, 2010. 7.1.3 [59] Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. Pow- ergraph: Distributed graph-parallel computation on natural graphs. In Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2012), 2012. 9.1, 9.1, 9.2 [60] Joseph E Gonzalez, Reynold S Xin, Ankur Dave, Daniel Crankshaw, Michael J Franklin, and Ion Stoica. Graphx: Graph processing in a distributed dataflow framework. In OSDI, 2014. 9.1 [61] Google. Google cloud. https://cloud.google.com/. 1, 1.1.2 [62] T.L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101:5228–5235, 2004. 3.4.2 [63] Steinar H. Gunderson. Snappy: A fast compressor/decompressor. https://code. google.com/p/snappy/. 3.3.2 [64] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015. URL http://arxiv.org/abs/ 1512.03385. 1, 1.1.1 [65] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. CoRR, abs/1603.05027, 2016. URL http://arxiv.org/abs/ 1603.05027. 4.5, 4.5.3 [66] I. Heller and C. Tompkins. An extension of a theorem of dantzig’s. In H.W. Kuhn and A.W. Tucker, editors, Linear Inequalities and Related Systems, volume 38 of Annals of Mathematics Studies. AMS, 1956. 9.3.2 [67] Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D Joseph, Randy Katz, Scott Shenker, and Ion Stoica. Mesos: A platform for fine-grained resource sharing in the data center. In Proceedings of the 8th USENIX conference on Networked systems design and implementation, pages 22–22, 2011. 2.2, 3.3.5 [68] Q. Ho, J. Cipar, H. Cui, S. Lee, J. Kim, P. Gibbons, G. Gibson, G. Ganger, and E. Xing. More effective distributed ml via a stale synchronous parallel parameter server. In NIPS, 2013. 3.1.3 [69] M. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. In International Conference on Machine Learning, 2012. 3.4.2

153 January 4, 2017 DRAFT [70] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Gir- shick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the ACM International Conference on Multimedia, pages 675–678. ACM, 2014. 4.1.1, 4.1.2 [71] T. Joachims. Making large-scale SVM learning practical. In B. Scholkopf,¨ C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods — Support Vector Learning, pages 169–184, Cambridge, MA, 1999. MIT Press. 6.3.1 [72] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315– 323, 2013. 7.1.3 [73] G. Karypis and V. Kumar. Multilevel k-way partitioning scheme for irregular graphs. J. Parallel Distrib. Comput., 48(1), 1998. 9.1, 9.1 [74] Larry Kim. How many ads does google serve in a day?, 2012. URL http://goo.gl/ oIidXO. http://goo.gl/oIidXO. 1.1.1 [75] Tim Kraska, Ameet Talwalkar, John C Duchi, Rean Griffith, Michael J Franklin, and Michael I Jordan. Mlbase: A distributed machine-learning system. In CIDR, 2013. 3.1.2, 3.2.1 [76] Frank Kschischang, Brendan J. Frey, and Hans-Andrea Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2):498–519, 2001. 9.2 [77] Brian Kulis and Peter L Bartlett. Implicit online learning. In Proc. Intl. Conf. Machine Learning, 2010. 7.1.3 [78] L. Lamport. Paxos made simple. ACM Sigact News, 32(4):18–25, 2001. 3.3.1 [79] J. Langford, A. J. Smola, and M. Zinkevich. Slow learners are fast. In Neural Information Processing Systems, 2009. URL http://arxiv.org/abs/0911.0491. 6.1, 8.1 [80] Q.V. Le, A. Karpenko, J. Ngiam, and A.Y. Ng. ICA with reconstruction cost for efficient overcomplete feature learning. Advances in Neural Information Processing Systems, 24: 1017–1025, 2011. 6.1, 6.3.2 [81] Q.V. Le, T. Sarlos, and A. J. Smola. Fastfood — computing hilbert space expansions in loglinear time. In International Conference on Machine Learning, 2013. 10.1 [82] M. Li, D. G. Andersen, and A. J. Smola. Distributed delayed proximal gradient methods. In NIPS Workshop on Optimization for Machine Learning, 2013. 3.1.3 [83] M. Li, D. G. Andersen, J. Park, A. J. Smola, A. Amhed, V. Josifovski, J. Long, E. Shekita, and B. Y. Su. Scaling distributed machine learning with the parameter server. In OSDI, 2014. 1.3, 10.4.3 [84] M. Li, D. G. Andersen, A. J. Smola, and K. Yu. Communication efficient distributed machine learning with the parameter server. In Neural Information Processing Systems, 2014. 1.3, 10.4.2 [85] Mu Li, Tong Zhang, Yuqiang Chen, and Alexander J Smola. Efficient mini-batch train-

154 January 4, 2017 DRAFT ing for stochastic optimization. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 661–670. ACM, 2014. 1.3 [86] Mu Li, Dave G Andersen, and Alexander J Smola. Graph partitioning via parallel submodular approximation to accelerate distributed machine learning. arXiv preprint arXiv:1505.04636, 2015. 1.3 [87] Mu Li, Ziqi Liu, Alexander J Smola, and Yu-Xiang Wang. Difacto: Distributed factoriza- tion machines. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, pages 377–386. ACM, 2016. 1.3 [88] Xiangru Lian, Yijun Huang, Yuncheng Li, and Ji Liu. Asynchronous parallel stochastic gradient for nonconvex optimization. arXiv preprint arXiv:1506.08272, 2015. 10.4.2, 24, 25, 10.4.2 [89] Dong C. Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(3):503–528, 1989. 5.1, 6.3.1, 7.3, 7.3 [90] Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein. GraphLab: A new parallel framework for machine learning. In Conference on Uncertainty in Artificial Intelligence, 2010. URL https://select. cs.cmu.edu/code/graphlab/index.html. 9.1 [91] Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein. Distributed graphlab: A framework for machine learning and data mining in the cloud. In PVLDB, 2012. 3.1.2, 3.1.3 [92] David G Lowe. Object recognition from local scale-invariant features. In Computer vision, 1999. The proceedings of the seventh IEEE international conference on, volume 2, pages 1150–1157. Ieee, 1999. 1.1.1 [93] Dhruv Mahajan, S Sathiya Keerthi, S Sundararajan, and Leon Bottou. A parallel sgd method with strong convergence. arXiv preprint arXiv:1311.0636, 2013. 7.1.3 [94] G. Mann, R. McDonald, M. Mohri, N. Silberman, and D. Walker. Efficient large-scale dis- tributed training of conditional maximum entropy models. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1231–1239, 2009. 7.1.3 [95] Abadi Martn, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Good- fellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mane, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viegas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. Tensorflow: Large-scale machine learning on heterogeneous systems. 2015. 4.1.1, 4.1.2 [96] S. Matsushima, S.V.N. Vishwanathan, and A.J. Smola. Linear support vector machines via dual cached loops. In Q. Yang, D. Agarwal, and J. Pei, editors, The 18th ACM SIGKDD In-

155 January 4, 2017 DRAFT ternational Conference on Knowledge Discovery and Data Mining, KDD, pages 177–185. ACM, 2012. URL http://dl.acm.org/citation.cfm?id=2339530. 6.3.1, 7.1.3 [97] B. McMahan. Follow-the-regularized-leader and mirror descent: Equivalence theorems and l1 regularization. In International Conference on Artificial Intelligence and Statistics, pages 525–533, 2011. 10.4.1 [98] Brendan McMahan and Matthew Streeter. Delay-tolerant algorithms for asynchronous distributed online learning. In Advances in Neural Information Processing Systems, pages 2915–2923, 2014. 8.1, 8.3.1, 8.3.2, 8.3.2 [99] H Brendan McMahan, Gary Holt, D Sculley, Michael Young, Dietmar Ebner, Julian Grady, Lan Nie, Todd Phillips, Eugene Davydov, and Daniel Golovin. Ad click prediction: a view from the trenches. In KDD, 2013. 1.1.1, 10.2.2, 10.3.2 [100] Microsoft. Microsfot azure. https://azure.microsoft.com/. 1, 1.1.2 [101] T. Moon, A. J. Smola, Y. Chang, and Z. Zheng. Intervalrank: isotonic regression with listwise and pairwise constraints. In B.D. Davison, T. Suel, N. Craswell, and B. Liu, editors, Proceedings of the Third International Conference on Web Search and Web Data Mining, WSDM, pages 151–160. ACM, 2010. 10.2.1 [102] Derek G Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Mar- tin Abadi. Naiad: a timely dataflow system. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pages 439–455. ACM, 2013. 3.1.2, 3.2.1 [103] A Nedic,´ Dimitri P Bertsekas, and Vivek S Borkar. Distributed asynchronous incremental subgradient methods. Studies in Computational Mathematics, 8:381–407, 2001. 8.1 [104] S. Negahban, P. Ravikumar, M.J. Wainwright, and B. Yu. A unified framework for high- dimensional analysis of m-estimators with decomposable regularizers. arXiv preprint arXiv:1010.2731, 2010. 10.3.2 [105] Joel Nishimura and Johan Ugander. Restreaming graph partitioning: Simple versatile algorithms for advanced balancing. In ACM SIGKDD, pages 1106–1114. ACM, 2013. 9.1 [106] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. Springer Series in Oper- ations Research. Springer, 2nd edition, 2006. 5 [107] J.B. Orlin. A faster strongly polynomial time algorithm for submodular function mini- mization. Mathematical Programming, 118(2):237–251, 2009. 9.4.1 [108] N. Parikh and S. Boyd. Proximal algorithms. To appear in Foundations and Trends in Optimization, 2013. 5.1, 6.1 [109] K. B. Petersen and M. S. Pedersen. The matrix cookbook, 2008. URL http://www2. imm.dtu.dk/pubdb/p.php?3274. Version 20081110. 6.3.2 [110] Amar Phanishayee, David G Andersen, Himabindu Pucha, Anna Povzner, and Wendy Belluomini. Flex-kv: Enabling high-performance and flexible kv systems. In Proceedings of the 2012 workshop on Management of big data systems, pages 19–24. ACM, 2012. 2

156 January 4, 2017 DRAFT [111] R. Power and J. Li. Piccolo: Building fast, distributed programs with par- titioned tables. In R. H. Arpaci-Dusseau and B. Chen, editors, Operating Systems Design and Implementation, OSDI, pages 293–306. USENIX Associa- tion, 2010. URL http://www.usenix.org/event/osdi10/tech/full_ papers/osdi10_proceedings.pdf. 3.1.3 [112] Josep M Pujol, Vijay Erramilli, Georgos Siganos, Xiaoyuan Yang, Nikos Laoutaris, Par- minder Chhabra, and Pablo Rodriguez. The little engine (s) that could: scaling online social networks. ACM SIGCOMM Computer Communication Review, 41(4):375–386, 2011. 9.1 [113] B. Recht, C. Re, S.J. Wright, and F. Niu. Hogwild: A lock-free approach to paralleliz- ing stochastic gradient descent. In Peter Bartlett, Fernando Pereira, Richard Zemel, John Shawe-Taylor, and Kilian Weinberger, editors, Advances in Neural Information Process- ing Systems 24, pages 693–701, 2011. URL http://books.nips.cc/nips24. html. 6.1, 6.2.2 [114] S. Rendle and L. Schmidt-Thieme. Pairwise interaction tensor factorization for per- sonalized tag recommendation. In WSDM ’10: Proceedings of the third ACM inter- national conference on Web search and data mining, pages 81–90. ACM, 2010. doi: http://doi.acm.org/10.1145/1718487.1718498. 10.1, 10.2.2, 10.5.3 [115] Steffen Rendle. Time-Variant Factorization Models Context-Aware Ranking with Factor- ization Models. volume 330 of Studies in Computational Intelligence, chapter 9, pages 137–153. 2011. ISBN 978-3-642-16897-0. 10.1, 10.2.2 [116] P. Richtarik´ and M. Taka´c.ˇ Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, pages 1–38, 2012. 6.1 [117] H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951. 1.1.3 [118] Antony Rowstron and Peter Druschel. Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems. In IFIP/ACM International Conference on Distributed Systems Platforms (Middleware), pages 329–350, Heidelberg, Germany, November 2001. 3.3.3 [119] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhi- heng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y. 4.1.1 [120] Shai Shalev-Shwartz and Tong Zhang. Accelerated mini-batch stochastic dual coordinate ascent. In Advances in Neural Information Processing Systems, pages 378–385, 2013. 7.1.3 [121] Amar Shan. Heterogeneous processing: a strategy for augmenting moore’s law. Linux Journal, 2006(142):7, 2006. 2.1 [122] Bin Shao, Haixun Wang, and Yatao Li. Trinity: A distributed graph engine on a memory

157 January 4, 2017 DRAFT cloud. In ACM SIGMOD, pages 505–516. ACM, 2013. 9.1 [123] A. J. Smola and S. Narayanamurthy. An architecture for parallel topic models. In Very Large Databases (VLDB), 2010. 3.1, 3.1.3, 3.2.5, 9.1 [124] E. Sparks, A. Talwalkar, V. Smith, J. Kottalam, X. Pan, J. Gonzalez, M. J. Franklin, M. I. Jordan, and T. Kraska. Mli: An api for distributed machine learning. 2013. 3.1.3 [125] S. Sra. Scalable nonconvex inexact proximal splitting. In Neural Information Processing Systems, 2012. 6.1 [126] Suvrit Sra, Sebastian Nowozin, and Stephen J Wright. Optimization for machine learning. 2012. 5 [127] Suvrit Sra, Adams Wei Yu, Mu Li, and Alexander J Smola. Adadelay: Delay adaptive distributed stochastic convex optimization. arXiv preprint arXiv:1508.05003, 2015. 1.3, 8.2.3 [128] N. Srebro, N. Alon, and T. Jaakkola. Generalization error bounds for collaborative pre- diction with low-rank matrices. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, Cambridge, MA, 2005. MIT Press. 10.3.3 [129] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014. 6.2.2 [130] I. Stanton. Streaming balanced graph partitioning for random graphs. CoRR, abs/1212.1121, 2012. 9.1 [131] I. Stanton and G. Kliot. Streaming graph partitioning for large distributed graphs. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1222–1230. ACM, 2012. 9.1, 9.1 [132] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan. Chord: A scal- able peer-to-peer lookup service for internet applications. ACM SIGCOMM Computer Communication Review, 31(4):149–160, 2001. 3.3 [133] Z. Svitkina and L. Fleischer. Submodular approximation: Sampling-based algorithms and lower bounds. SIAM Journal on Computing, 40(6):1715–1737, 2011. 9.3.1, 9.3.1 [134] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014. URL http://arxiv.org/abs/ 1409.4842. 1.1.1 [135] Martin Taka´c,ˇ Avleen Bijral, Peter Richtarik,´ and Nathan Srebro. Mini-batch primal and dual methods for svms. arXiv preprint arXiv:1303.2314, 2013. 7.1.3 [136] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In S. Thrun, L. Saul, and B. Scholkopf,¨ editors, Advances in Neural Information Processing Systems 16, pages 25–32, Cambridge, MA, 2004. MIT Press. 10.2.1 [137] Christina Teflioudi, Faraz Makari, and Rainer Gemulla. Distributed matrix completion. In Data Mining (ICDM), 2012 IEEE 12th International Conference on, pages 655–664.

158 January 4, 2017 DRAFT IEEE, 2012. 6.1 [138] Choon Hui Teo, S. V. N. Vishwanthan, A. J. Smola, and Quoc V. Le. Bundle methods for regularized risk minimization. Journal of Machine Learning Research, 11:311–365, January 2010. 6.1 [139] R. Tibshirani. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol., 58:267–288, 1996. 10.3.2 [140] Charalampos Tsourakakis, Christos Gkantsidis, Bozidar Radunovic, and Milan Vojnovic. Fennel: Streaming graph partitioning for massive scale graphs. In WSDM, pages 333–342. ACM, 2014. 9.1 [141] Johan Ugander and Lars Backstrom. Balanced label propagation for partitioning massive graphs. In Proceedings of the sixth ACM international conference on Web search and data mining, pages 507–516. ACM, 2013. 9.1 [142] Robbert van Renesse and Fred B Schneider. Chain replication for supporting high through- put and availability. In OSDI, volume 4, pages 91–104, 2004. 3.3 [143] V. Vapnik. Statistical Learning Theory. John Wiley and Sons, New York, 1998. 1.1.3 [144] Venkateshwaran Venkataramani, Zach Amsden, Nathan Bronson, George Cabrera III, Prasad Chakka, Peter Dimov, Hui Ding, Jack Ferris, Anthony Giardullo, Jeremy Hoon, et al. Tao: how facebook serves the social graph. In ACM SIGMOD, pages 791–792. ACM, 2012. 9.1 [145] G. Wahba. Spline Models for Observational Data, volume 59 of CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, Philadelphia, 1990. 10.3.2 [146] R.C. Whaley, A. Petitet, and J.J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1–2):3–35, 2001. 3.2.1 [147] Shengqi Yang, Xifeng Yan, Bo Zong, and Arijit Khan. Towards effective partition man- agement for large graphs. In ACM SIGMOD, pages 517–528, 2012. 9.1 [148] J. Ye, J.-H. Chow, J. Chen, and Z. Zheng. Stochastic gradient boosted distributed decision trees. In D. W.-L. Cheung, I.-Y. Song, W.W. Chu, X. Hu, and J.J. Lin, editors, Conference on Information and Knowledge Management, CIKM, pages 2061–2064. ACM, 2009. URL http://doi.acm.org/10.1145/1645953.1646301. 10.1 [149] Dong Yu, Adam Eversole, Mike Seltzer, Kaisheng Yao, Zhiheng Huang, Brian Guenter, Oleksii Kuchaiev, Yu Zhang, Frank Seide, Huaming Wang, et al. An introduction to com- putational networks and the computational network toolkit. Technical report, Technical report, Tech. Rep. MSR, Microsoft Research, 2014, 2014. research. microsoft. com/app- s/pubs, 2014. 4.1.2 [150] H.-F. Yu, C.-J. Hsieh, K.-W. Chang, and C.-J. Lin. Large linear classification when data cannot fit in memory. In B. Rao, B. Krishnapuram, A. Tomkins, and Q. Yang, editors, Knowledge Discovery and Data Mining, pages 833–842. ACM, 2010. URL http:// doi.acm.org/10.1145/1835804.1835910. 7.1.2, 7.1.3, 7.2.3 [151] G. X. Yuan, K. W. Chang, C. J. Hsieh, and C. J. Lin. A comparison of optimization meth- ods and software for large-scale l1-regularized linear classification. Journal of Machine

159 January 4, 2017 DRAFT Learning Research, pages 3183–3234, 2010. 6.1 [152] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. Spark: Cluster computing with working sets. In HotCloud 2010, June 2010. 2.2 [153] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy Mccauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. Fast and interactive analyt- ics over hadoop data with spark. USENIX ;login:, 37(4):45–51, August 2012. 3.1.3 [154] Tong Zhang and Frank J. Oles. Text categorization based on regularized linear classifica- tion methods. Information Retrieval, 4:5–31, 2001. 6.3.1 [155] Jingren Zhou, Nicolas Bruno, and Wei Lin. Advanced partitioning techniques for mas- sively distributed computation. In ACM SIGMOD, pages 13–24, 2012. 9.1 [156] M. Zinkevich. Online convex programming and generalised infinitesimal gradient ascent. In Proceedings of the International Conference on Machine Learning, pages 928–936, 2003. 7.1.2 [157] Martin Zinkevich, A. J. Smola, Markus Weimer, and Lihong Li. Parallelized stochastic gradient descent. In nips23e, editor, nips23, pages 2595–2603, 2010. 6.1, 7.1.3
