
Efficient and Programmable Distributed Systems for Training

Jinliang Wei

August 16, 2018

School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213

Thesis Committee:
Garth A. Gibson, Co-chair
Eric P. Xing, Co-chair
Phillip B. Gibbons
Vijay Vasudevan, Google Brain

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Copyright © 2018 Jinliang Wei

Keywords: machine learning, distributed shared memory, static analysis, parallelization, scheduling, deep learning

Abstract

Machine learning training involves frequent and often sparse updates to a large number of numerical values called model parameters. Many distributed training systems have resorted to distributed shared memory (DSM) (e.g. Parameter Server) for efficient sparse access and in-place updates. Compared to traditional programs, machine learning applications tolerate bounded error, which presents opportunities for trading off learning progress for higher computation throughput. In this thesis, I develop efficient and programmable distributed learning systems by exploiting this trade-off in the design of distributed shared memory systems, along with parallelization and static and dynamic scheduling.

Thanks to this tolerance to bounded error, a machine learning program can often be parallelized without strictly preserving data dependence. Parallel workers may thus observe inconsistent model parameter values compared to a serial execution. More frequent communication to propagate updates and fresher parameter values may reduce such inconsistency, while incurring higher communication overhead. I present a communication management mechanism that automates communication using spare network bandwidth and prioritizes messages according to their importance, in order to reduce error due to inconsistency while retaining high computation throughput.

When each model update reads and writes only a subset of the model parameters, it is possible to achieve an efficient parallelization that preserves critical data dependence by exploiting sparse parameter access. Existing systems require substantial programmer effort to take advantage of this opportunity. In order to achieve dependence-preserving parallelization without imposing a huge burden on application programmers, I present Orion, a system that provides parallel for-loops on distributed shared memory and parallelizes loops with loop-carried dependence.

Finally, I propose to explore dynamic scheduling for dynamic control flow in dataflow systems such as TensorFlow. In TensorFlow, the computation graph is statically partitioned and assigned to computation devices. Static device placement is suboptimal when the operators' load can no longer be determined statically due to dynamic control flow, and may result in imbalanced load and extra communication. It is promising to address this deficiency of static device placement by dynamically scheduling operations based on their load at runtime.

Contents

1 Introduction
  1.1 Performance Metrics for Distributed Training Systems
  1.2 Frequent Sparse Updates in Machine Learning Training
  1.3 Distributed Shared Memory for Efficient Updates
  1.4 Scheduling for Preserving Data Dependence
  1.5 Programming Model Expressive Power
  1.6 Thesis Statement and Contributions
  1.7 Systems for Distributed Offline Learning
    1.7.1 Dataflow Systems
    1.7.2 Distributed Shared Memory Systems
    1.7.3 Graph Processing Systems

2 Managed Communication for Data-Parallel Training [76]
  2.1 Data-Parallel Distributed Training and Synchronization
    2.1.1 Data-Parallel Machine Learning Training
    2.1.2 Parameter Server Synchronization Schemes
  2.2 Bösen Architecture and API
    2.2.1 Bösen Client API
    2.2.2 Bounded Staleness Consistency
    2.2.3 Client Library And Server Partitions
  2.3 Managed Communication in Bösen
    2.3.1 Bandwidth-Driven Communication
    2.3.2 Update Prioritization
  2.4 Evaluation
    2.4.1 Evaluation Setup
    2.4.2 Evaluating Managed Communication
    2.4.3 Comparing with Manual Mini-batch Size Tuning
  2.5 Conclusion

3 Dependence-Aware Parallelization
  3.1 Introduction
    3.1.1 Machine Learning Training
    3.1.2 ......
    3.1.3 Dependence-aware Parallelization
  3.2 Motivation
    3.2.1 Batch Dataflow Systems
    3.2.2 TensorFlow


    3.2.3 Graph Processing Systems
    3.2.4 Parameter Server Systems
    3.2.5 Orion Design Summary
  3.3 Orion's Programming Model
    3.3.1 Scripting for Productivity With Small Performance Loss
    3.3.2 Distributed Arrays
    3.3.3 Parallel For-Loop
    3.3.4 Distributed Array Buffers
    3.3.5 Program Annotations
    3.3.6 Additional Features
    3.3.7 Putting Everything Together
  3.4 Static Parallelization And Code Generation
    3.4.1 Parallelization
    3.4.2 Static Scheduling
    3.4.3 Partitioning DistArrays for Random Access
    3.4.4 Dealing with Skewed Data Distribution
  3.5 Preliminary Evaluation
    3.5.1 Ease of Use
    3.5.2 Orion Abstraction Overhead
    3.5.3 Orion vs. Manual Data Parallelism
    3.5.4 Orion vs. Manual Data Parallelism w/ Managed Communication
    3.5.5 Orion vs. Manual Model Parallelism
    3.5.6 Orion vs. Dataflow Systems
  3.6 Summary and Planned Case Studies

4 Dynamic Scheduling for Dynamic Control Flow
  4.1 Distributed Machine Learning w/ Dataflow Graph
  4.2 Conditional Computation for Outrageously Large Neural Networks
  4.3 Preliminary Experimental Investigation
    4.3.1 Limited of MoE Networks
    4.3.2 Load Distribution among Expert Networks
    4.3.3 Throughput Drop
  4.4 Dynamic Scheduling for Dynamic Control Flow in Dataflow Graphs
    4.4.1 Incorporating Static and Dynamic Device Placement
    4.4.2 Operator Scheduling Strategies
    4.4.3 Operator Profiling
    4.4.4 Memory Swapping and Dynamic Batching

Appendices

A Orion Application Program Examples
  A.1 Stochastic Gradient Descent Matrix Factorization
  A.2 Sparse Logistic Regression

Bibliography


Chapter 1

Introduction

The range of Machine Learning (ML) applications is fast growing. Examples include personalized recommendations, visual and language understanding, game playing, and autonomous driving, just to name a few. At the core of an ML application is an expert-suggested model, whose parameters are determined by training the model to fit a set of observed data samples. Typically training starts with a random guess of the model parameters and refines them to optimize an objective function by iteratively processing a training dataset. This iterative process stops when the quality of the model as measured by the objective function reaches certain criteria or stops improving, which is referred to as convergence. Fig. 1.1 is a cartoon that demonstrates the learning progress. Depending on the application, the objective function may be maximized (e.g. maximizing log-likelihood in topic modeling) or minimized (e.g. minimizing loss in regression). In both cases, the objective function value approaches an asymptote.

Due to the high computational complexity of training, its iterative-refinement nature, and the large size of training datasets, training a model typically requires a large amount of computing power. The demand for computing power increases as more complex models are invented to serve more challenging real-world tasks. It is thus increasingly important to parallelize and distribute the training computation to reduce training time. In distributed machine learning training, the learning algorithm is parallelized across a set of worker machines which all contribute to refining a model's parameters. The model parameters can be shared among workers using a distributed shared memory architecture (e.g. parameter servers). The iterative and convergent nature of the learning algorithms allows errors made in earlier iterations to be corrected in following iterations when the error is properly bounded.

Figure 1.1: A cartoon depicting the learning progress

This tolerance to bounded error renders strongly consistent parameter storage an overkill at the expense of computation throughput. In order to achieve high computation throughput, distributed workers read parameter values from a local cache and buffer updates until enough computation has been done to compensate for the synchronization overhead, which makes Bulk Synchronous Parallel [73] a commonly used parallel execution model. Even more relaxed synchronization schemes, such as Stale Synchronous Parallel [43] or even total asynchrony, may find acceptable model solutions while further improving computation throughput. Under the above synchronization models, workers typically observe an inconsistent view of the model parameters compared to a serial execution. This inconsistency can be viewed as error relative to a serial algorithm. Even though the algorithm tolerates bounded error, as error grows, the model quality improvement from each pass over the data typically shrinks, impeding learning progress. We refer to this tension between throughput and consistency as the consistency-throughput trade-off. In this thesis, we develop system techniques that reduce such inconsistency when exploiting this consistency-throughput trade-off, and we build systems with expressive programming models that take advantage of those techniques without requiring heavy programmer effort.

1.1 Performance Metrics for Distributed Training Systems

Traditional distributed batch processing systems perform deterministic computation, and their performance is usually measured by computation throughput, i.e., the amount of data processed per second. However, in a distributed machine learning training system, the computation is usually nondeterministic. The number of data samples or the number of data passes¹ needed to achieve a certain model quality may differ depending on how the system manages consistency. Thus throughput alone cannot measure a distributed training system's performance. Instead we measure system performance by time to convergence, which is the time needed for the learning system to achieve a certain quality for the given model as measured by its objective function. Time to convergence depends on both the system's computation throughput and its convergence rate, i.e. the number of data items or data passes that need to be processed to reach a certain model quality.
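As a rough illustration of this decomposition (my own formulation, not an equation from the thesis), the time per data pass can be written as workload over throughput:

\[
T_{\text{convergence}} \;\approx\;
\underbrace{N_{\text{passes}}(\text{consistency})}_{\text{convergence rate}}
\;\times\;
\underbrace{\frac{|D|}{\text{computation throughput}}}_{\text{time per data pass}}
\]

where N_passes depends on how much inconsistency the system admits and |D| is the number of data samples processed per pass.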

1.2 Frequent Sparse Updates in Machine Learning Training

In iterative machine learning training, the training dataset is usually divided into mini-batches and the model parameters are updated after each mini-batch. It is often desirable to use small mini-batches to update the model frequently, as doing so speeds up the learning progress (fewer data samples need to be processed to achieve a certain model quality). In distributed training, however, applying parameter updates while maintaining a consistent view among workers incurs expensive coordination among workers and inter-machine network communication. If sequential consistency were retained, the desired small mini-batch sizes would incur updates so frequent that the cost of coordination and communication outweighs the benefit of parallelization.

In many applications, the updates are sparse: each update operation reads and updates only a small subset of the parameters. Updates that access different parameters may thus be executed in parallel without conflicts. Efficiently exploiting such sparsity is key to achieving high throughput with low inconsistency.
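As a concrete illustration (a hypothetical example, not an application from this thesis), a sparse logistic regression update reads and writes only the weights of a sample's non-zero features, so samples with disjoint feature sets could be processed in parallel without conflict:

    # Minimal sketch (illustration only): a sparse SGD update touches only the
    # parameters corresponding to a sample's non-zero features, so two samples
    # with disjoint feature sets could be processed in parallel without conflict.
    import math

    def sparse_sgd_update(weights, sample, label, step_size):
        """weights: dict feature_id -> value; sample: dict feature_id -> value."""
        # The forward pass reads only the touched coordinates.
        margin = sum(weights.get(f, 0.0) * v for f, v in sample.items())
        pred = 1.0 / (1.0 + math.exp(-margin))
        # The backward pass writes only the same small set of coordinates.
        for f, v in sample.items():
            weights[f] = weights.get(f, 0.0) - step_size * (pred - label) * v

    weights = {}
    sparse_sgd_update(weights, {3: 1.0, 17: 0.5}, 1.0, 0.1)   # touches {3, 17}
    sparse_sgd_update(weights, {42: 2.0}, 0.0, 0.1)           # touches {42}; no conflict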

¹ Throughout this thesis, we use the terms data pass, iteration, and epoch interchangeably.

1.3 Distributed Shared Memory for Efficient Updates

As opposed to message passing, Distributed Shared Memory (DSM) provides a global address space for variables distributed among a cluster of compute nodes; network communication is abstracted away from application programmers and is used by the system to implement the consistency protocol. Classic DSMs provide data access via primitives such as memory allocation and pointer dereferencing [47, 52, 68], while modern examples [67] provide a key-value interface.

As Resilient Distributed Datasets (RDDs) [82] are immutable once created, they can be shared among subsequent operations without conflicting access, which simplifies parallelization. However, updating a single record of an RDD requires creating a new RDD with the rest of the original RDD copied. In contrast, DSM efficiently supports frequent changes that each update a small fraction of the stored values. This makes DSM well suited for storing shared model parameters in distributed training without too much inconsistency, as workers often perform irregular, sparse accesses to large collections of model parameters.

The Parameter Server has become a pivotal component in distributed machine learning training. A typical use case of a parameter server is data-parallel training, where the training dataset is randomly partitioned among the worker machines and the workers process their local data partitions to update the shared model parameters. Under data parallelism, conflicting access arises from different workers reading and updating the same parameters. In order to provide high-throughput, low-latency access to parameter values on top of the low-bandwidth, high-latency data center network, parameter servers actively exploit the error-tolerance property to buffer updates and serve potentially stale parameter values from workers' local caches, in spite of conflicting accesses. Due to the consistency-throughput trade-off, computing with stale values slows the learning progress while improving computation throughput. In Chapter 2 we present a communication management mechanism that manages this trade-off to improve learning progress without losing computation throughput.
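A toy single-process contrast of the two storage models (illustration only; neither snippet is Spark's nor any DSM system's actual API):

    # Immutable, RDD-like collection: "updating" one record materializes a new copy.
    old_params = tuple(0.0 for _ in range(1_000_000))
    new_params = old_params[:42] + (old_params[42] + 0.5,) + old_params[43:]

    # Mutable key-value store in the spirit of DSM: a sparse update is applied in place.
    params = {i: 0.0 for i in range(1_000_000)}
    params[42] += 0.5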

1.4 Scheduling for Preserving Data Dependence

Existing approaches that exploit such sparsity for parallelization randomly assign data partitions to workers, which works when the probability of conflict among data partitions is low [36, 69]. But random data partitioning does not guarantee conflict-free parallelization. Although more frequent updates may reduce the probability of conflicts, they require high network bandwidth and low network latency. Conflict-free parallelization would be ideal, but existing systems do not achieve it because it requires accurately capturing the fine-grained data dependence among updates and scheduling computation accordingly.

As shown by Kim et al. [51], properly scheduling the computation to minimize conflicting accesses can significantly improve the overall speed of learning. Existing systems offer little support for application programmers to easily exploit these opportunities. STRADS [51] provides an API for dispatching computation to specific workers, but it is left to the application programmers to design the parallelization and scheduling strategy. Some graph processing systems [57] schedule computation on vertices according to a user-provided graph, which works if there happens to be a convertible existing data graph. However, when such a graph does not already exist, users are required to manufacture an appropriate dependence graph for the model and data at hand.
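A toy sketch of the idea (my own illustration, not the scheduling algorithm of STRADS or Orion): updates are assigned to rounds so that updates in the same round touch disjoint parameter blocks, and no update runs before an earlier update it conflicts with.

    # Dependence-aware scheduling sketch (illustration only).
    def schedule_rounds(updates):
        """updates: ordered list of (update_id, set of parameter blocks touched).
        Returns rounds such that (1) updates within a round touch disjoint blocks
        and (2) an update never runs before an earlier conflicting update."""
        rounds = []            # list of lists of (update_id, blocks)
        last_round_of = {}     # block -> index of the latest round touching it
        for uid, blocks in updates:
            # Earliest legal round: after every round that touched these blocks.
            start = max((last_round_of.get(b, -1) for b in blocks), default=-1) + 1
            if start == len(rounds):
                rounds.append([])
            rounds[start].append((uid, blocks))
            for b in blocks:
                last_round_of[b] = start
        return rounds

    updates = [("u0", {0}), ("u1", {1}), ("u2", {0, 2}), ("u3", {2})]
    for i, r in enumerate(schedule_rounds(updates)):
        print("round", i, [uid for uid, _ in r])
    # Updates within a round access disjoint blocks and could run in parallel.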

1.5 Programming Model Expressive Power

For efficient execution, existing distributed computing frameworks usually adopt a specialized programming model that restricts expressive power compared to programming on general-purpose shared memory. For example, batch dataflow systems require their input data and intermediate results to be immutable once created; graph processing systems restrict data accesses to neighboring vertices; and parameter server systems do not provide dynamic control flow. Also, dataflow systems adopt a declarative programming model [64, 82], while imperative programming is much more popular for shared memory programming. Circumventing such restrictions, where possible at all, typically requires substantial programmer effort when converting a shared memory program to a distributed one. Moreover, the result may suffer a great performance penalty, as operations that are efficient on shared memory, such as random memory access, might not be efficiently supported by the distributed system. For example, machine learning applications on Spark typically represent their model parameters as local variables in the driver program, and it is expensive to broadcast the parameters and collect updates from distributed workers [63].

1.6 Thesis Statement and Contributions

This thesis studies mechanisms to minimize inconsistency in distributed machine learning without losing the computation throughput advantage of inconsistency, under an expressive programming model. Based on my previous research, I propose to support the following thesis:

Efficient distributed machine learning, even with an expressive programming model that substantially reduces application programmer effort, can be realized by enhancing distributed shared memory with parallelization and scheduling.

In support of the above statement, I have made the following research contributions:
• I develop a mechanism for distributed shared memory that improves inter-machine network communication efficiency to reduce inconsistency due to conflicting accesses. This mechanism is generally applicable to data-parallel training. (Chapter 2)
• I develop the Bösen Parameter Server and implement the communication management mechanism in Bösen. I evaluate communication management's effectiveness on a number of real-world machine learning applications such as collaborative filtering, topic modeling, and multinomial logistic regression. (Chapter 2)²
• I design a programming model and parallelization mechanism that extends parallel for-loops to imperative programming on distributed shared memory. Parallelization of the loop preserves loop-carried dependence and is designed to expose and take natural and powerful hints that improve computation throughput. (Chapter 3)
• I develop a distributed system, Orion, that implements the imperative programming model above. Orion is evaluated on a number of machine learning applications and compared with both Parameter Server and dataflow systems.

In order to complete the thesis, I propose to complete the following contributions:
• Perform case studies on using Orion to parallelize a number of popular ML applications that have traditionally required specialized implementations. (Chapter 3)
• Realize dynamic scheduling for dynamic control flow in dataflow graphs (TensorFlow) to reduce memory footprint and improve computation efficiency. (Chapter 4)

² Wei Dai contributed to the development of the Bösen parameter server and some of its machine learning applications.


The thesis organization is depicted in Fig. 1.2.

Figure 1.2: Distributed machine learning system and thesis organization

1.7 Systems for Distributed Offline Learning

To perform distributed machine learning training, people have used general-purpose batch processing systems [6, 82], developed machine-learning-oriented systems that exploit the consistency-throughput trade-off [28, 43, 53], developed more specialized frameworks that take advantage of application-specific features of certain subclasses of machine learning [39, 57], and even engineered point solutions to specific learning problems [15, 24, 48].

Table 1.1 summarizes representative and influential distributed batch processing and learning systems proposed in recent years. We focus on systems that proposed new programming models or revised existing ones for performance improvements. Such improvements typically come from exploiting application- or data-specific properties for balancing computation and communication, minimizing disk I/O or network communication, mitigating stragglers, and employing more efficient in-memory data structures.

1.7.1 Dataflow Systems

Closely related to functional programming, dataflow systems [13, 35, 46, 64, 65, 80, 82] represent computation as a directed graph whose vertices are side-effect-free operations. Thus in most dataflow systems, the input data and intermediate results are represented as immutable objects flowing along the directed edges. TensorFlow introduces operations that have mutable state, i.e. variables and queues. These operations produce a reference handle to their mutable state, which can be passed as input to other operations that modify the state. Some systems [82] implicitly partition each data object among distributed nodes to distribute computation and storage, while other systems [13, 64] require the application to partition a data object in order for it to be distributed among machines.

These frameworks typically exploit two forms of parallelism: (1) each vertex operation can be parallelized across multiple cores or machines; and (2) independent vertices can be executed in parallel. Exploiting parallelism within a vertex requires a large input to the vertex, and a worker's results or updates are not visible to other workers until the entire vertex finishes computing, which slows the learning progress.


Dist. Shared Memory:
  Piccolo [67]: control functions, table and key-value abstraction, table iterator, global barrier; benchmarked on PageRank, K-means, N-body, matrix multiply
  IterStore [29]: virtual iteration; benchmarked on PageRank, LDA, SGD MF
  Parameter Server [53]: push-pull, driver program, user-defined filters, server-side UDFs; benchmarked on sparse LR, LDA, count-min sketch
  LazyTable [28, 43], Bösen [76]

Dataflow:
  MapReduce [35]: map-reduce, shuffle; benchmarked on grep, sort
  [46]: arbitrary DAGs; benchmarked on a select-join-join query, histogram
  DryadLINQ [80]: LINQ expressions, static compilation; benchmarked on TeraSort, PageRank, SQL query, EM
  CIEL [64]: dynamic control flow, dynamic task dependencies, lazy evaluation; benchmarked on grep, k-means, Smith-Waterman, binomial option pricing
  Spark [82]: RDDs, static dependencies; benchmarked on logistic regression, k-means, PageRank, analytical queries, traffic modeling, spam classification, iterative queries
  TensorFlow [13]: mutable states, dense tensors as data representation; benchmarked on image classification, language modeling
  Naiad [65], GraphX [40]

Vertex Programming:
  Pregel [58]: vertex programming; benchmarked on SSSP
  PowerGraph [39]: Gather-Apply-Scatter; benchmarked on PageRank, SSSP, graph coloring, triangle counting, LDA
  PowerLyra [23], Gemini [84], CUBE [83], TuX2 [79]

Filter-Process:
  Arabesque [72]: filter-process; benchmarked on motifs, cliques, FSM

Task Dispatching:
  STRADS [51]: task dispatching; benchmarked on Lasso, SGD MF, LDA

Table 1.1: Summary of Existing Systems

Exploiting parallelism among vertices requires application programmers to decompose the computation into many small vertices, which is a substantial burden and may result in a graph whose size is proportional to the data size. TensorFlow additionally supports concurrent execution of overlapping subgraphs. Different subgraphs may update the shared states concurrently, and thus an execution may read stale values compared to a serial execution.

1.7.2 Distributed Shared Memory Systems

In order to facilitate applications that perform frequent updates, each operating on a small number of the stored values, such as ML training, distributed shared memory (e.g. Parameter Server) is widely used both as an integral component of other systems [13] and as standalone systems [28, 29, 43, 53, 67]. Such systems typically expose a key-value interface for shared states which are hosted by a number of servers. In contrast to traditional databases, they usually offer a specialized and relaxed consistency model that takes advantage of the application's ability to tolerate inconsistency for high throughput. In a distributed shared memory system, applications are implemented as a worker program that executes on each worker machine or core. Although some systems [51, 53, 67] support a driver process that dispatches tasks to workers, it is still left to the application programmers to design and implement a proper scheduling strategy.

1.7.3 Graph Processing Systems

Some systems are designed to handle special subsets of applications, such as graph processing and graph mining; they abstract the application computation into a specialized workflow and allow user-defined functions as hooks to customize the computation. For example, the vertex programming model [58] requires applications to provide a data graph and a vertex program that executes on each vertex. The vertex program is restricted to access only its neighboring edges and vertices. With the known data graph, such restrictions allow the framework to rely on the program's data access pattern to prefetch values before computation and to partition the graph to balance workload and minimize inter-machine communication. While some work [57] uses graph coloring to execute independent vertices in parallel, most such systems still rely on a Bulk Synchronous Parallel or asynchronous model that either delays the updates until all vertices have executed or provides no consistency guarantees.
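A toy, single-process rendition of the vertex-programming restriction (illustration only; this is not the API of Pregel or of any system listed above):

    # Toy synchronous vertex-programming loop in the spirit of Pregel [58].
    # A vertex program sees only its own state and messages from its neighbors;
    # all vertex programs are applied in bulk per superstep, so updates only
    # become visible to other vertices at superstep boundaries.

    def pagerank_vertex(messages, num_vertices, damping=0.85):
        return (1 - damping) / num_vertices + damping * sum(messages)

    def run(edges, num_vertices, supersteps=20):
        values = {v: 1.0 / num_vertices for v in range(num_vertices)}
        out_deg = {v: 0 for v in range(num_vertices)}
        for src, _ in edges:
            out_deg[src] += 1
        for _ in range(supersteps):
            # Messages are generated from the previous superstep's values ...
            inbox = {v: [] for v in range(num_vertices)}
            for src, dst in edges:
                inbox[dst].append(values[src] / out_deg[src])
            # ... and every vertex program is applied in bulk.
            values = {v: pagerank_vertex(inbox[v], num_vertices)
                      for v in range(num_vertices)}
        return values

    print(run([(0, 1), (1, 2), (2, 0), (2, 1)], num_vertices=3))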


Chapter 2

Managed Communication for Data-Parallel Training [76]¹

In a data-parallel execution, inconsistency arises from conflicting parameter accesses by workers. While more sophisticated parallelization and scheduling may preserve dependence and avoid conflicting access, realizing such parallelization requires exploiting the application's computation pattern, which typically requires nontrivial programmer effort. In this chapter, we present a mechanism for distributed shared memory that reduces inconsistency errors due to conflicting access in machine learning training and thus improves overall convergence time. Our key insights are, first, that spare network bandwidth exists due to bursty communication in distributed machine learning training; and, second, that some updates are more important than others for learning algorithm convergence, so one can prioritize the communication of important updates using spare network bandwidth to make up-to-date values available sooner.

2.1 Data-Parallel Distributed Training and Synchronization

2.1.1 Data-Parallel Machine Learning Training

Training a model is essentially finding a set of model parameter values that optimize a certain objective function. This is typically done using an iterative convergent learning algorithm, which can be described by Alg. 1.

Algorithm 1: Serial Execution

t ← 0
for iteration = 1, ..., T do
    for i = 1, ..., N do
        A^(t+1) ← A^t ⊕ ∆(A^t, D_i)
        t ← t + 1

In Alg. 1, A^t denotes the parameter values at time step t, and D_i denotes the i-th mini-batch in the training dataset D = {D_i | 1 ≤ i ≤ N}. D_i may contain one or multiple data items. The update function ∆() computes the model updates from a mini-batch of data items and the current parameter values, which are applied to generate a new set of parameter values. ∆ may include some tunable hyper-parameters, such as the step size in a gradient descent algorithm, which require manual or automatic tuning for the algorithm to work well.

¹ Most content in this chapter is borrowed from my SoCC'15 paper [76].

⊕ represents the operation that applies parameter updates, which is usually addition. The algorithm repeats for many iterations until A stops changing, i.e., converges. In each iteration, the algorithm takes a full pass over the training dataset.²

Data parallelism distributes K data mini-batches {D_k | 1 ≤ k ≤ K} to K workers. The workers compute ∆(A^t, D_k) in parallel using the current parameter values. A common practice is to pre-partition the dataset D into K subsets D^1, ..., D^K and assign them to workers. The workers synchronize with each other by sending their updates to a master copy of the parameters and fetching updated parameter values. We refer to each synchronization as a clock, which may consist of one or many mini-batches (i.e. parameter updates).

The size of a clock may be tuned to control the network communication overhead. For applications that require a small amount of computation per mini-batch, a single clock may contain hundreds or thousands of data mini-batches in order to amortize the communication overhead. Much existing work [28, 29, 79] synchronizes once per iteration, i.e. after a worker has processed all data samples in its data partition. When the updates are additive, data-parallel training can be expressed as Alg. 2, where workers synchronize once per pass over the local data partition.

Algorithm 2: Data Parallelism

t ← 0
for iteration = 1, ..., T do
    for i ← 1 to K in parallel do
        u_i ← Σ_{j=1}^{|D^i|} ∆(A^t, D^i_j)
        synchronize with master model copy
    A^(t+1) ← A^t + Σ_{i=1}^{K} u_i
    t ← t + 1
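A minimal, single-process Python sketch of Alg. 2 (my own illustration; real workers run on separate machines and synchronize through a master copy, e.g. a parameter server). The least-squares gradient used for ∆ below is a hypothetical stand-in; any iterative-convergent update fits the same pattern.

    import numpy as np

    def delta(params, minibatch, step_size=0.02):
        # Hypothetical update function Delta(A, D_i): a least-squares gradient step.
        X, y = minibatch
        return -step_size * X.T @ (X @ params - y) / len(y)

    def data_parallel_pass(params, partitions):
        snapshot = params.copy()              # every worker reads the same A^t
        updates = [sum(delta(snapshot, mb) for mb in part) for part in partitions]
        return params + sum(updates)          # synchronize: A^(t+1) = A^t + sum_k u_k

    rng = np.random.default_rng(0)
    true_w = rng.normal(size=5)

    def make_minibatch(n=32):
        X = rng.normal(size=(n, 5))
        return X, X @ true_w

    partitions = [[make_minibatch() for _ in range(10)] for _ in range(4)]   # K = 4
    params = np.zeros(5)
    for _ in range(30):
        params = data_parallel_pass(params, partitions)
    print(np.linalg.norm(params - true_w))    # shrinks toward 0 across passes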

2.1.2 Parameter Server Synchronization Schemes

Parameter Server is the name used to refer to a commonly used software architecture that provides a global address space for shared model parameters. A classic scheme for synchronization in distributed computing is Bulk Synchronous Parallel (BSP), which enforces a synchronization barrier based on the number of data items processed. In the case of data-parallel training, at each synchronization barrier all workers communicate their updates to the parameter server, and after all updates are received by the server, the server broadcasts the updated parameters back to the workers to start the next clock. However, BSP suffers from the well-studied straggler problem [17, 18, 26, 28], in which the synchronization barrier between clocks causes the computation to proceed only at the rate of the slowest worker in each clock.

Another example is Total Asynchronous Parallelization (TAP), where the parameter values a worker uses are whatever instantaneous state in the server results from the aggregation of out-of-sync (hence inconsistent) updates from different workers. Although highly efficient and sometimes accurate, TAP does not enjoy a theoretical guarantee and can diverge due to late updates disrupting training.

In this chapter, we explore a middle ground between BSP and TAP, namely bounded staleness parallelization [43, 53], in which each worker maintains a possibly stale local copy of A, where the degree of staleness is bounded by a target staleness threshold S, i.e., no worker is allowed to be more than S clock units ahead of the slowest worker.

² Iteration has also been referred to as epoch in some literature.


Get(key): Read the parameter indexed by key.
GetRow(row_key): Read a row of parameters indexed by row_key. A row consists of a static group of parameters.
Inc(key, delta): Increment the parameter key by delta.
IncRow(row_key, deltas): Increment the row row_key by deltas.
Clock(): Signal the end of a clock.

Figure 2.1: Bösen System Architecture and Client API. (a) Bösen Parameter Server architecture; (b) Bösen client API.

This idea has been shown experimentally to considerably improve time to convergence [28, 53], and theoretical analysis of convergence rates and error bounds has been reported for some applications [32, 43].

Although ML algorithms may tolerate bounded staleness and still converge correctly, algorithm performance (i.e., convergence per data sample processed) may suffer from staleness, resulting in suboptimal performance. Previous implementations of bounded staleness either trigger communication upon reaching the staleness threshold or at synchronization barriers [28, 29], or rely on explicit application control of communication [53]. Our goal is to implement a system that supports sophisticated yet effective consistency models and resource-driven communication.

Some parameter server implementations expose a Push/Pull interface and let applications determine when and what to communicate [53]. Others hide communication from application programmers via a Get/Update interface and additionally provide a Clock API for the application to report its progress in support of the consistency model [28, 29]; network communication is internally invoked by the Parameter Server to satisfy the consistency model. To provide a Parameter Server API that abstracts network communication away from users, we adopt the Get/Update interface.

2.2 Bösen Architecture and API

Bösen is a Parameter Server with an ML-consistent, bounded-staleness parallel scheme that uses bandwidth-managed communication mechanisms. It realizes bounded staleness consistency, which offers theoretical guarantees for iterative convergent ML programs (unlike TAP), while enjoying high iteration throughput that is better than BSP and close to TAP systems. Additionally, Bösen transmits model updates and up-to-date model parameters proactively without exceeding a bandwidth limit (i.e. before the end of a clock if spare network bandwidth is available), while making better use of the bandwidth by scheduling the bandwidth budget based on the contribution of the messages to algorithm progress — thus improving per-data-sample convergence compared to an agnostic communication strategy, that is, one that simply transmits all messages or a random subset.



Figure 2.2: Exemplar execution under bounded staleness (without communication management). The system consists of 5 workers, with staleness threshold S = 3; "clock t" refers to the clock starting from t. Worker 2 is currently running in clock 4 and thus, according to bounded staleness, it is guaranteed to observe all updates generated before (exclusively) clock 4 − 3 = 1 (black). It may also observe local updates (green), as updates can be optionally applied to the local parameter cache. Updates that are generated in completed clocks by other workers (blue) are highly likely visible, as they are propagated at the end of each clock. Updates generated in incomplete clocks (white) are not visible, as they are not yet communicated. Such updates could be made visible under managed communication depending on the bandwidth budget.

2.2.1 Bösen Client API

Bösen PS consists of a client library and parameter server partitions (Figure 2.1a); the former provides the Application Programming Interface (API) for reading/updating model parameters, and the latter stores and maintains the model A. In order to reduce network communication, where possible the client library caches model parameter values and buffers updates. In terms of usage, Bösen closely follows other key-value stores: once an ML program process is linked against the client library, any thread in that process may read/update model parameters concurrently. The user runs a Bösen ML program by invoking as many server partitions and ML application compute processes (which use the client library) as needed, across multiple machines.

Bösen's API abstracts consistency management and networking operations away from the application and presents a simple key-value interface (Figure 2.1b). Get() is used to read parameters and Inc() is used to increment a parameter by some delta. To maintain consistency, the client application signals the end of a unit of work, e.g. a full pass over a worker's data partition or a fixed number of data samples processed, via Clock(). In order to exploit locality in ML applications and thus amortize the overhead of operating on concurrent data structures and network messaging, Bösen allows applications to statically partition the parameters into batches called rows – a row is a set of parameters that are (usually) accessed together. A row can also be the unit of communication between client and server. GetRow() is provided to read a row by its key, and IncRow() applies a set of deltas to multiple elements in a row. Bösen supports user-defined "stored procedures" to be executed on each server; these can be used to alter the default increment behavior of Inc() and IncRow().
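A schematic sketch of how a data-parallel worker might use this Get/Inc/Clock-style interface (illustration only; FakePS is a single-process stand-in dictionary, not Bösen's client library, and compute_update is any application-supplied function mapping a row and a mini-batch to deltas):

    class FakePS:
        def __init__(self, row_size=8):
            self.rows, self.clock, self.row_size = {}, 0, row_size
        def GetRow(self, row_key):
            return list(self.rows.setdefault(row_key, [0.0] * self.row_size))
        def IncRow(self, row_key, deltas):
            row = self.rows.setdefault(row_key, [0.0] * self.row_size)
            for i, d in enumerate(deltas):
                row[i] += d
        def Clock(self):
            self.clock += 1     # a real client also triggers communication here

    def worker_pass(ps, local_minibatches, compute_update):
        """One clock: process the local data partition, then signal progress."""
        for minibatch in local_minibatches:
            for row_key in minibatch["rows"]:
                row = ps.GetRow(row_key)                             # read (possibly cached/stale)
                ps.IncRow(row_key, compute_update(row, minibatch))   # buffered, additive update
        ps.Clock()

    ps = FakePS()
    worker_pass(ps, [{"rows": [3, 17]}], lambda row, mb: [-0.01 * v for v in row])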


2.2.2 Bounded Staleness Consistency

The Bösen client library and server partitions include a consistency manager, which is responsible for enforcing consistency requirements on the local parameter cache Ã, thus ensuring correct ML program execution even under worst-case delays (whether in computation or communication). The ML program's tolerance to stale parameters is specified as the staleness threshold, a non-negative integer supplied by the user.

The consistency manager works by blocking client worker threads when reading parameters, unless the local parameter cache Ã has been updated to meet the consistency requirements. Bounded staleness puts constraints on parameter age; Bösen blocks if any parameter in Ã is older than the worker's current clock by S or more (i.e., CurrentClock(worker) − Age(Ã) ≥ S), where S is the user-defined staleness threshold. Ã's age is defined as the oldest clock such that some updates generated within that clock are missing from Ã.

The bounded staleness model enjoys BSP-like ML execution guarantees, theoretically explored in [32, 43], which are absent from total asynchronous parallelization (TAP). Bösen's bounded staleness consistency is closest to ESSP [32]. Similar to ESSP, Bösen propagates updates upon completion of each clock even though they are not yet required. With eager end-of-clock communication, parameters read typically have staleness 1 regardless of the staleness threshold, as the end-of-clock communication typically completes within one clock. Managed communication allows updates to be propagated even earlier, before a clock completes. Our experiments used a staleness threshold of 2 unless otherwise mentioned, which has been demonstrated to be effective by Cui et al. [28]. An exemplar execution of 5 workers under bounded staleness is depicted in Fig. 2.2.

Bulk Synchronous Parallel (BSP). When the staleness threshold is set to 0, bounded staleness consistency reduces to the classic BSP model. The BSP model requires all updates computed in previous clocks to be made visible before the current clock starts. A conventional BSP implementation may use a global synchronization barrier; Bösen's consistency manager achieves the same result by requiring calls to PS.Get() and PS.GetRow() at clock t to reflect all updates, made by any thread, before its (t − 1)-th call to PS.Clock() — otherwise, the call to PS.Get() or PS.GetRow() blocks until the required updates are received.
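A minimal sketch of the blocking rule above (my restatement for illustration, not Bösen's implementation):

    import threading

    class BoundedStalenessCache:
        def __init__(self, staleness_threshold):
            self.S = staleness_threshold
            self.age = 0                      # oldest clock with updates still missing
            self.values = {}
            self.cond = threading.Condition()

        def get(self, key, current_clock):
            with self.cond:
                # Block while CurrentClock(worker) - Age(cache) >= S (Section 2.2.2).
                while current_clock - self.age >= self.S:
                    self.cond.wait()
                return self.values.get(key, 0.0)

        def server_refresh(self, new_values, new_age):
            """Called by communication threads when fresher parameters arrive."""
            with self.cond:
                self.values.update(new_values)
                self.age = max(self.age, new_age)
                self.cond.notify_all()

    cache = BoundedStalenessCache(staleness_threshold=2)
    cache.server_refresh({"w[0]": 1.5}, new_age=1)
    print(cache.get("w[0]", current_clock=2))   # 2 - 1 < 2, so the read proceeds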
2.2.3 Client Library And Server Partitions

The client library provides access to the model parameters A on the server partitions, and also caches elements of the model A for faster access, while cooperating with server processes in order to maintain consistency guarantees and manage bandwidth. This is done through three components: (1) a parameter cache that caches a partial or complete image of the model, Ã, at the client, in order to serve read requests made by compute threads; (2) an update buffer that buffers updates applied by compute threads via PS.Inc() and PS.IncRow(); and (3) a group of client communication threads (distinct from compute threads) that synchronize the local model parameter cache with the servers' master copy while the compute threads execute the application training algorithm. The parameters cached at a client are hash partitioned among the client communication threads. Each client communication thread needs to access only its own parameter partition when reading the computed updates and applying up-to-date parameter values, which minimizes lock contention. The client parameter cache and update buffer allow concurrent reads and writes from worker threads; similar to Cui et al. [29], the cache and buffer use static data structures and pre-allocate memory for repeatedly accessed parameters to minimize the overhead of maintaining a concurrent hash table.

In each compute process, locks are needed for shared access to parameters and buffered update entries. In order to amortize the runtime cost of concurrency control, we allow applications to define parameter key ranges called rows (as noted above). Parameters in the same row share one lock for accesses to their parameter caches, and one lock for accesses to their update buffers.

When serving read requests (Get() and GetRow()) from worker threads, the client parameter cache is searched first, and a read request is sent to the server processes only if the requested parameter is either not in the cache or the cached parameter's staleness is not within the staleness threshold. The reading compute thread blocks until the parameter's staleness is within the threshold. When writes are invoked, updates are inserted into the update buffer and, optionally, the client's own parameter cache is also updated.

Communication threads run on the same CPUs that the compute threads run on and take CPU cycles away from computation for marshalling. Client compute threads also compete with communication threads for mutually exclusive access to the parameter cache and update buffer. Communication threads visit the parameter cache and update buffer more often when communicating more frequently. With a high bandwidth budget, the high-frequency communication may slow computation due to lock contention.

Once all compute threads in a client process have called PS.Clock() to signal the end of a unit of work (e.g. one clock), the client communication threads release buffered model updates to servers. Note that buffered updates may be released sooner under managed communication if the system detects spare network bandwidth to use.

The master copy of the model's parameters, A, is hash partitioned, and each partition is assigned to one parameter server thread. The server threads may be distributed across multiple server processes and physical machines. As model updates are received from client processes, the addressed server thread updates the master copy of its model partition. When a client read request is received, the corresponding server thread registers a callback for that request; once a server thread has applied all updates from all clients for a given unit of work, it walks through its callbacks and sends up-to-date model parameter values.

Bounded staleness is ensured by coordination of clients and server partitions using clock messages. On an individual client, as soon as all updates generated before and in clock t have been sent to server partitions and no more updates before or in that clock can be generated (because all compute threads have advanced beyond that clock), the client's communication threads send a client clock message to each server partition, indicating "all updates generated before and in clock t by this client have been made visible to this server partition" (assuming reliable, ordered message delivery). After a server partition sends out all parameters modified in clock t, it sends a server clock message to each client communication thread, indicating "all updates generated before and in clock t in this parameter partition have been made visible to this client". Upon receiving such a clock message, the client communication thread updates the age of the corresponding parameters and permits the relevant blocked compute threads, if any, to proceed with their reads.
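A sketch of this clock-message bookkeeping (illustration only, not Bösen's code; in the real system these messages travel over the network between client communication threads and server partitions):

    class ServerPartition:
        def __init__(self, client_ids):
            self.client_clock = {c: -1 for c in client_ids}

        def on_client_clock(self, client_id, t):
            # "All updates generated before and in clock t by this client are visible."
            self.client_clock[client_id] = t
            # Once every client has reported clock t, this partition can send its
            # updated parameters followed by a server clock message for t.
            return min(self.client_clock.values()) >= t

    class ClientCommThread:
        def __init__(self, num_partitions):
            self.num_partitions = num_partitions
            self.server_clock = {}
            self.cache_age = 0          # oldest clock whose updates may be missing

        def on_server_clock(self, partition_id, t):
            self.server_clock[partition_id] = t
            if len(self.server_clock) == self.num_partitions:
                # Updates through clock t are now visible from every partition.
                self.cache_age = min(self.server_clock.values()) + 1
                # A real client would now wake compute threads blocked on reads.

    server = ServerPartition(client_ids=[0, 1])
    assert not server.on_client_clock(0, t=0)   # still waiting for client 1
    assert server.on_client_clock(1, t=0)       # clock 0 complete; push parameters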
2.3 Managed Communication in Bösen

Bösen's client library and server partitions feature a communication manager. The communication manager has two objectives: (1) communicate as many updates per second as possible (full utilization of the bandwidth budget) without overusing the network (which could delay update delivery and increase message-processing overhead); and (2) prioritize more important model updates to improve ML learning progress per data pass. The first objective is achieved via bandwidth-driven communication with rate limiting, while the second is achieved by choosing a proper prioritization strategy.

2.3.1 Bandwidth-Driven Communication

In order to fully utilize the given bandwidth budget without exceeding it, Bösen communication threads query the communication manager for opportunities to communicate and send small packets (from hundreds of KBs to a few MBs, referred to as the "queue size") each time spare bandwidth is available. The communication manager keeps track of the number of bytes communicated and derives the time of the next communication event based on the network bandwidth budget.

Coping with network fluctuations: In real clouds or data centers with multiple users, the available network bandwidth may fluctuate and fail to live up to the bandwidth budget B. Hence, the Bösen communication manager regularly checks whether the network is overused by monitoring how many messages were sent without acknowledgement in a recent time window (i.e. message non-delivery). If too many messages fail to be acknowledged, the communication manager assumes that the network is overused and waits until the window becomes clear before permitting new messages to be sent.

Update quantization: Since ML applications are error tolerant, training may use low-precision floating-point or fixed-point numbers, as demonstrated by a number of empirical studies [27, 41, 70]. Bösen applications have the option to communicate updates and parameter values as 16-bit floating-point numbers, cutting bandwidth consumption in half compared to 32-bit floats.

2.3.2 Update Prioritization

Bösen spends available bandwidth on communicating the information that contributes the most to convergence. For example, gradient-based algorithms (including logistic regression) are iterative-convergent procedures in which the fastest-changing parameters are often the largest contributors to solution quality — in this case, we prioritize communication of fast-changing parameters, with the largest-magnitude changes going out first. When there is an opportunity to communicate due to spare bandwidth, the server or client communication threads pick a subset of parameter values or updates to send. The prioritization strategy determines which subset is picked at each communication event. By picking the right subset to send, the prioritization strategy alters the communication frequency of different parameters, effectively allocating more network bandwidth to more important updates. It should be noted that the end-of-clock communication must still send all up-to-date parameters or updates older than a certain clock number to ensure the consistency guarantees.

Bösen's bandwidth manager supports multiple prioritization strategies. The simplest strategies are Randomized, where communication threads send out randomly chosen rows, and Round-Robin, where communication threads repeatedly walk through the rows in a fixed order and send out all non-zero updates or updated parameters encountered. These strategies are baselines; better strategies prioritize according to significance to convergence progress. We study the following two strategies.

Absolute Magnitude prioritization: Updates/parameters are sorted by their accumulated change in the buffer, |δ|.

Relative Magnitude prioritization: Same as absolute magnitude, but the sorting criterion is |δ/a|, i.e. the accumulated change normalized by the current parameter value, a.
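A minimal sketch combining the two mechanisms (illustration only, not Bösen's implementation): send events are spaced out so the average rate stays within the per-node bandwidth budget, and each event sends the buffered updates with the largest relative magnitude first. The class and function names are hypothetical.

    import time

    class CommunicationManager:
        def __init__(self, bandwidth_budget_bytes_per_sec, queue_size):
            self.budget = bandwidth_budget_bytes_per_sec
            self.queue_size = queue_size          # max rows per send event
            self.next_send_time = time.monotonic()

        def try_send(self, update_buffer, param_cache, send_fn):
            now = time.monotonic()
            if now < self.next_send_time or not update_buffer:
                return                            # no spare bandwidth, or nothing to send
            # Relative-magnitude prioritization: largest |delta / value| first.
            keys = sorted(update_buffer,
                          key=lambda k: abs(update_buffer[k] / (param_cache.get(k) or 1.0)),
                          reverse=True)[:self.queue_size]
            payload = {k: update_buffer.pop(k) for k in keys}
            bytes_sent = send_fn(payload)
            # Derive the next send opportunity from the bytes just consumed.
            self.next_send_time = now + bytes_sent / self.budget

    # Example: 100 Mbps budget, sending at most 500 rows per event.
    mgr = CommunicationManager(bandwidth_budget_bytes_per_sec=100e6 / 8, queue_size=500)
    mgr.try_send({3: 0.5, 7: -2.0}, {3: 10.0, 7: 0.1}, send_fn=lambda p: 8 * len(p))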
2.4 Evaluation

2.4.1 Evaluation Setup

Cluster setup: Most of our experiments were conducted on the PRObE Nome cluster [38], consisting of 200 server computers running Ubuntu 14.04. Our experiments used different numbers of computers, varying from 8 to 64. Each machine contains 4 × quad-core AMD Opteron 8354 CPUs (16 physical cores per machine) and 32GB of RAM. The machines are distributed over multiple racks and connected via 1 Gbps Ethernet and 20 Gbps Infiniband.


Application & Dataset | Cluster | # Machines | Per-node Bandwidth Budgets | Queue Size (client, server) | Initial Step Size
SGD MF, Netflix | Nome | 8 | 200Mbps, 800Mbps | 100, 100 | 0.08
LDA, NYTimes | Nome | 16 | 320Mbps, 640Mbps, 1280Mbps | 5000, 500 | N/A
LDA, ClueWeb10% | Nome | 64 | 800Mbps | 5000, 500 | N/A
MLR, ImageNet5% | Susitna | 4 | 100Mbps, 200Mbps, 1600Mbps | 1000, 500 | 1

Table 2.1: Bösen system and application configurations. The queue size (in number of rows) upper bounds the send size to control burstiness; the first number is for the client and the second for the server. LDA experiments used hyper-parameters α = β = 0.1. Experiments whose bandwidth budget exceeds 1 Gbps used 20 Gbps Infiniband or 40 Gbps Ethernet; otherwise they used 1 Gbps Ethernet.

Application | Dataset | Workload | Description | # Rows | Data Size
SGD MF | Netflix | 100M ratings | 480K users, 18K movies, rank=400 | 480K | 1.3GB
LDA | NYTimes | 99.5M tokens | 300K documents, 100K words, 1K topics | 100K | 0.5GB
LDA | ClueWeb10% | 10B tokens | 50M webpages, 160K words, 1K topics | 160K | 80GB
MLR | ImageNet5% | 65K samples | 1000 classes, 21K feature dimensions | 1K | 5.1GB

Table 2.2: Descriptions of ML models and datasets. Workload refers to the total number of data samples in the input dataset.

A few experiments were conducted on the PRObE Susitna cluster [38]. Each Susitna machine contains 4 × 16-core AMD Opteron 6272 CPUs (64 physical cores per machine) and 128GB of RAM. The machines are distributed over two racks and connected to two networks: 1 Gbps Ethernet and 40 Gbps Ethernet. In both clusters, every machine is used to host Bösen servers, the client library, and worker threads (i.e. servers and clients are collocated and evenly distributed).

Performance metrics: Our evaluation measures performance as the absolute convergence rate on the training objective value; that is, our goal is to reach convergence to an estimate of the model parameters A that best represents the training data (as measured by the training objective value) in the shortest time. Once trained, the model's performance on application tasks also depends on the selection of an objective function and proper regularization, which are usually the purview of user expertise and are out of the scope of this research.

Bösen is executed under different modes in this section:

Single Node: The ML application is run on one shared-memory machine linked against one Bösen client library instance with only consistency management. The parameter cache is updated upon write operations, so updates become immediately visible to compute threads. It represents a gold standard when applicable. It is denoted as "SN".

Linear Scalability: An ideal scenario where the single-node application is scaled out and linear scalability is achieved. It is denoted as "LS".

Bounded Staleness: Bösen is executed with only consistency management enabled, that is, communication management is disabled. It is denoted as "BS".

Bounded Staleness + Managed Communication: Bösen is executed with both consistency and communication management enabled. It is denoted as "MC-X-P", where X denotes the per-node bandwidth budget (in Mbps) and P denotes the prioritization strategy: "R" for Randomized, "RR" for Round-Robin, and "RM" for Relative-Magnitude.

Bounded Staleness + Fine-Grained Clock Tick Size: Bösen is executed with only consistency management enabled and communication management disabled. In order to communicate updates and model parameters more frequently, a full pass over the data is divided into multiple clocks. It is denoted as "BS-X", where X is the number of clocks that constitute a data pass.


Figure 2.3: Algorithm performance under managed communication. (a) SGD Matrix Factorization: training loss vs. iterations. (b) Topic Modeling (LDA): number of iterations to convergence.


Figure 2.4: Topic Modeling with Latent Dirichlet Allocation. (a) Model parameter communication frequency CDF. (b) Bösen LDA compared with Yahoo!LDA on the NYTimes dataset (log-likelihood vs. seconds).

Unless otherwise mentioned, we used a staleness threshold of 2. We found that although bounded staleness converges faster than BSP, changing the staleness threshold does not affect average-case performance, as the actual staleness is usually 1 due to the eager end-of-clock communication (Section 2.2). The network waiting time is small enough that a staleness threshold of 2 ensures no blocking. The bounded staleness consistency model allows computation to proceed during synchronization. As long as the workload is balanced and synchronization completes within one clock of computation (which is typically the case), the network waiting time can be completely hidden. The ML applications evaluated and their configurations are described in Table 2.1, and the datasets are described in Table 2.2.

2.4.2 Evaluating Managed Communication

In this section, we show that algorithm performance improves with more immediate communication of updates and model parameters. Moreover, proper bandwidth allocation based on the importance of the messages may achieve better algorithm performance with less bandwidth consumption. To this end, we compared managed communication with the standard static communication schedule (i.e. only the consistency manager is enabled).


Figure 2.5: Overhead of communication management: average per-iteration time and bandwidth consumption. (a) SGD Matrix Factorization. (b) Topic Modeling (LDA).


Figure 2.6: Absolute convergence rate under managed communication. (a) SGD Matrix Factorization. (b) Multi-class Logistic Regression.

The communication management mechanism was tested with different per-node bandwidth budgets and different prioritization strategies.

Effect of increasing bandwidth budget. We demonstrate the effect of a larger bandwidth budget via the Matrix Factorization (MF) and Latent Dirichlet Allocation (LDA) experiments (Fig. 2.3). First of all, we observed that enabling communication management significantly reduces the number of iterations needed to reach convergence (objective value of 2e7 for MF and −1.022e9 for LDA). In MF, communication management with a bandwidth budget of 200Mbps reduces the number of iterations needed to reach 2e7 from 64 (BS) to 24 (MC-200-R). In LDA, a bandwidth budget of 320Mbps reduces the number of iterations to convergence from 740 (BS) to 195 (MC-320-R). Secondly, increasing the bandwidth budget further reduces the number of iterations needed. For example, in LDA, increasing the bandwidth budget from 320Mbps (MC-320-R) to 640Mbps (MC-640-R) reduces the number of iterations needed from 195 to 120.

Effect of prioritization. As shown in Fig. 2.3(b), in the case of LDA, prioritization by Relative Magnitude (RM) consistently improves upon Randomized (R) when using the same amount of bandwidth. For example, with 320Mbps of per-node bandwidth budget, MC-320-RM reduces the number of iterations needed to reach −1.022e9 from 195 (MC-320-R) to 145. Prioritization appears to be less effective for MF. The server-side user-defined function computes the step size, which scales the gradient, altering the gradient by up to orders of magnitude. Since the adaptive revision algorithm tends to apply a larger scaling factor to smaller gradients [60], the raw gradient magnitude is a less effective indicator of significance.


Figure 2.7: Comparing Bösen with simply tuning clock tick size. (a) Time to convergence. (b) Average per-iteration time and bandwidth consumption.

Overhead of communication management and absolute convergence rate. Under managed communication, the increased volume of messages incurs noticeable CPU overheads due to sending and receiving the messages and serializing and deserializing the content. Computing importance also costs CPU cycles. Fig. 2.5 presents the per-iteration runtime and network band- width consumption corresponding to Fig. 2.3. Fig. 2.5(a) shows that Bulk Synchronous Parallel has fast iterations, that do not fully utilize network bandwidth but also makes slow progress per iteration as shown in Fig. 2.3(a). Fig. 2.5 shows, for example, enabling communication management with a 200Mbps band- width budget (MC-200-R) incurs a 12% per-iteration runtime overhead relative to BS. How- ever, the improved algorithm performance significantly outweighs such overheads and results in much higher absolute convergence rate in wall clock time, as shown in Fig. 2.6 (MF and MLR) and Fig. 2.7(a). For example, for MF, we observed a 2.5× in absolute convergence rate using bandwidth budget of 800Mbps and Relative-Magnitude prioritization compared the bounded staleness baseline. Comparison with Yahoo!LDA. We also compare Bösen LDA with the popular Yahoo!LDA us- ing the NYTimes and 10% of the ClueWeb [3] data set, using 1Gbps Ethernet and 20Gbps In- ifiniband respectively. The former is plotted in Fig. 2.4(b). Yahoo!LDA employs a Parameter Server architecture that’s similar to Bösen’s, but uses total asynchronous parallelization. The compute threads of Yahoo!LDA process roughly the same number of data points as Bösen. Each Yahoo!LDA worker (node) runs one synchronizing thread that iterates over and synchronizes all cached parameter in a predefined order. We observed that Bösen significantly outperformed Yahoo!LDA on the NYTimes dataset, but converged at similar rate on the ClueWeb10% data set. 2.4.3 Comparing with Manual Mini-batch Size Tuning In summary, by making full use of the 800Mbps and 640Mbps bandwidth budget, communica- tion management with Randomized prioritization improved the time to convergence of the MF and LDA application by 2.5× and 2.8× in wall clock time and 5.3× and 6.1× in number of it- erations, compared to a bounded staleness execution. Relative-Magnitude prioritization further improves the convergence time of LDA by 25%. Communication management with bandwidth budget of 200Mbps and Relative-Magnitude prioritization improved the convergence time of MLR by 2.5× (Fig 2.6(b)). Another way of reducing parallel error on a BSP or bounded staleness system is to divide a full data pass into multiple clocks to achieve more frequent synchronization, while properly

18 August 16, 2018 DRAFT adjusting the staleness threshold to ensure the same staleness bound. This approach is similar to mini-batch size tuning in ML literature. We compare Bösen’s communication management with application-level clock tick size tuning via the LDA application and the result is plotted in Fig 2.7. For each setting, we adjust the staleness threshold so all runs share the same staleness bound of 2 data passes. Firstly, from Fig. 2.7(b) we observe that as the clock tick size halves, the average bandwidth usage over the first 280 iterations doubles but the average time per iteration doesn’t change significantly. From Fig. 2.7(a), we observe that the increased communication improves the algo- rithm performance. Although simply tuning clock tick size also improves algorithm behavior, it doesn’t enjoy the benefit of prioritization. For example, MC-640-RM used only 63% of the band- width compared to BS-8 but converged 28% faster. The difference is due to careful optimization which cannot be achieved via application-level tuning. 2.5 Conclusion We demonstrated that in data-parallel training, spare network bandwidth and message prior- itization can be eploited to improve convergence rate. In our prototype implementation, too frequent communication reduces application computation throughput due to marshalling that competes for CPU cycles and lock contention. Nevertheless, managing network communication improves overall convergence rate by 2 to 3×. Further improvement is possible by reducing marshalling overhead and more efficient concurrency control on shared data structures.

19 August 16, 2018 DRAFT

Chapter 3

Dependence-Aware Parallelization

We present a system Orion that uses static dependence analysis to automate dependence-aware parallelization of serial machine learning programs for distributed execution, exploiting access sparsity. Besides using classic compiler techniques, Orion supports programmer managable violation of dependence to trade learning progress for computation throughput. By automati- cally parallelizing imperative application programs implemented in a scripting language, Orion reduces the lines of code of application programs by up to 11.6× compared to manually paral- lelized machine learning programs, with a convergence rate that’s at least comparable. Orion parallelized programs achieve a comparable and sometimes lower processing throughput (up to 1.5 − 2.46×) compared to hand-optimized ++ parallel implementations, mainly due to lan- guage differences. 3.1 Introduction 3.1.1 Machine Learning Training Machine Learning (ML) training finds the parameter values of a parametric model that mini- mizes (or maximizes) certain objective function by processing a set of data items. Let At denote the parameter values at time step t, D denote the training dataset and D = {Di|1 ≤ i ≤ N} where Di denotes the i-th mini-batch which may contain one or multiple data items. The core of a typical ML training program can be expressed as sequentially executing an update equation for each mini-batch Di, which repeats until some stopping criteria are met (Alg.3):

Algorithm 3: Serial Execution t ← 0 while not converged do for i = 1, ..., N do At+1 ← At ⊕ ∆ (At, Di) t ← t + 1

In Alg.3, ∆ denotes a generic function that computes parameter refinements using the cur- rent parameter values and a mini-batch of data, such as computing gradients in stochastic gra- dient descent. The parameter values are then updated using the generic operator ⊕. As each mini-batch reads and updates model parameters A, there is data dependence be- tween mini-batches. It is usually equivalent to process mini-batches in any sequential order as different orderings result in different but equally good parameter values. Therefore, data

20 August 16, 2018 DRAFT dependence among mini-batches indicates only mutual exclusiveness but not ordering. When mini-batches are processed in parallel, serializability is sufficient and necessary to ensure equiv- alence between parallel and serial execution [57]. 3.1.2 Data Parallelism

Data parallelism distributes K data mini-batches {Di+k−1|1 ≤ k ≤ K} to K workers. The workers compute ∆ (At, Di+k−1) in parallel using the current parameter values and their assigned data mini-batch (Alg.4). Conceptually, there exists a master copy of parameters that is updated using updates {uk|1 ≤ k ≤ K} computed by workers, and is then distributed to workers for processing the next mini-batch.

Algorithm 4: Data Parallelism t ← 0 while not converged do for i ← 1 to N by K do for k = 1, ..., K in parallel do uk ← ∆ (At, Di+k−1) for k = 1, ..., K do At+1 ← At ⊕ uk t ← t + 1

Technically data parallelism is not equivalent to serial execution because the all K workers compute updates using the same parameter values At and updates are not applied until all K workers have finished processing their mini-batch (not serializable). In a serial execution, parameter values are updated after each mini-batch and the updated parameters are used for processing the next mini-batch. Under data parallelism, updates are computed from a data mini- batch Dt+i using a staler version of parameters compared to a serial execution (At vs. At+i). The staleness grows with increasing number of workers. Nevertheless, data parallelism may still train a valid model. First of all, in some algorithms such as stochastic gradient descent, the mini-batch size |Di| is an algorithmic hyper-parameter and data items within a mini-batch are processed independently from each other. Data par- allelism over K workers is thus equivalent to computing updates from a mini-batch of size K ∑k=1 |Di+k−1|, which can be viewed as a form of hyper-parameter tuning. Secondly, iterative convergent ML algorithms tolerate bounded error [43, 69]. These algorithms still converge to a reasonable (but probably different) solution even when the computation involves bounded error. Since At is usually a close enough approximation of At+i, data parallelism still converges. How- ever, it is known that increasing staleness slows convergence. The, the learning algorithm needs to process more data items (i.e. more iterations) to reach the same model quality [43, 51, 76]. When training neural networks using stochastic gradient descent, it has been widely observed that the model’s performance degrades significantly on test dataset when mini-batch size is too large. This problem is referred to as the generalization gap [44, 50]. 3.1.3 Dependence-aware Parallelization Depending on the model and the training algorithm, many ML programs access parameters sparsely, where the ∆ () computation reads only a subset of the model parameters and generates refinements to a (possibly different) subset of parameters. If each worker is assigned with a

21 August 16, 2018 DRAFT

mini-batch RW a1 a2 a3 a4 a1 a2 a3 a4 D1 a1 a3 W1, ∆ W2, ∆ W1, ∆ W2, ∆ D a1 a4 2 D1 D2 D1 D4

D3 a2 a3 sync barrier sync barrier D D D D D4 a2 a4 3 4 2 3 (a) Read-write sets (b) Data parallelism (c) Dependence-aware Figure 3.1: Data parallelism vs. dependence-aware parallelism mini-batch D 0k such that the read-write sets of all ∆ (At, D 0k) computation are disjoint, then the parallel execution is serializable. Thus the key challenge is to divide the training dataset into groups of mini-batches such that the mini-batches within each group are independent in terms of parameter access. 1 We refer to this style of parallelization that preserves data dependence among mini-batches as dependence-aware parallelization (Alg.5).

Algorithm 5: Dependence-aware Parallelization t ← 0 while not converged do while D not empty do {D 01, ..., D 0k} ← get K independent mini-batches in D D ← D − {D 01, ..., D 0k} for k = 1, ..., K in parallel do uk ← ∆ (At, D 0k) for k = 1, ..., K do At+1 ← At ⊕ uk t ← t + 1

Fig. 3.1 compares data parallelism with dependence-aware parallelization on a toy example. Data parallelism randomly assigns mini-batches to workers W1 and W2 regardless of which pa- rameters are accessed for processing each mini-batch. The parallel execution is not serializable as it is not quivalent to any sequential ordering of the mini-batches. With dependence-aware parallelization, different workers do not access the same parameters at the same time. The par- allel execution is equivalent to sequentially processing mini-batches D1, D4, D2, and D3. This property has been exploited in previous work, which demonstrated that training algo- rithms converge considerably faster when computation is scheduled to avoid conflicting param- eter access compared to data parallelism [51]. However, existing systems such as STRADS [51] require the training program to be manually parallelized to take advantage of this opporunity, which demands heavy programmer effort and is error-prone. In this chapter we present a system Orion that automates dependence-aware parallelization of imperative ML programs for efficient distributed execution, exploiting sparse parameter access. Orion statically analyzes data depen- dence of the serial program and parallelizes the heavy computation (for-loops) while preserving its dependence.

1 Technically, two mini-batches Di and Dj are independent if ReadSet(Di) ∩ WriteSet(Dj) = ∅ and ReadSet(Dj) ∩ WriteSet(Di) = ∅.

22 August 16, 2018 DRAFT

x

Read Read

Process D1 Process D2

AssignAdd AssignAdd

Figure 3.2: A computation DAG that processes D1 and D2 in parallel using data parallelism Category Examples Mutable States Prog. Paradigm App. Prog. Language Dataflow Spark [82], DryadLINQ [80] No dataflow Scala, Python, etc Dataflow w/ mutable states TensorFlow [13] Yes dataflow Python, C++, etc Parameter Server parameter server [53], Bösen [76] Yes imperative C++ PS w/ scheduling STRADS [51] Yes imperative C++ Graph Processing PowerGraph [39], PowerLyra [23] Partial vertex C++ Orion Yes imperative Julia

Table 3.1: Comparing different systems for offline machine learning training.

Distributed machine learning systems commonly use distributed shared memory (e.g. Pa- rameter Server) for efficiently sharing model parameters among workers [13, 51, 76]. While an imperative programming model and a shared memory abstraction is expressive and natural to programmers, parallelization is much harder compared to functional programming with im- mutable objects. The challenge arises from the difficulty of accurately capturing fine-grained de- pendence among memory operations, efficient parallelization and concurrency control. Decades of research on automatic parallelizing compilers have shown that static dependence analysis can enable dependence-preserving parallelization. These techniques are applicable to machine learning programs. However, traditional techniques are not enough to exploit ML-specific prop- erties. Our key insight is that error-tolerant machine learning programs have dependence that’s critical to preserve as well as less critical dependence allowing managable violation of depen- dence to enable parallelization that’s otherwise impossible. We evaluated Orion on a number of machine learning applications and compared Orion to three representative machine learning systems, including a Parameter Server (Bösen [76]), a model scheduling system (STRADS [51]) and a dataflow system (TensorFlow [13]). Programs parallelized by Orion achieve a convergence rate that’s at least comparable to manual paral- lelization, while reducing lines of code needed by up to 90%. 3.2 Motivation Various distributed systems have been developed for offline machine learning training. We cat- egorize those systems according to their programming model and summarize their major differ- ences in Table 3.1. In this section, we discuss why existing systems fail to automate dependence- aware parallelization and motivate Orion’s design. 3.2.1 Batch Dataflow Systems Many systems [13, 64, 80, 82] adopt a dataflow execution model, where a Directed Acyclic Graph (DAG) is constructed first to describe the computation to be performed, and the computation DAG is lazily evaluated when certain output is requested. For example, in Spark [82]’s com- putation DAG, each node represents a set of data records called a Resilient Distributed Dataset

23 August 16, 2018 DRAFT

(RDD) and the edges represent transformation operations that transform one RDD to another. A fundamental limitation of traditional dataflow systems is that in order to achieve deterministic execution, their computation DAG contains no mutable states which makes updating model pa- rameters an expensive operation. For example, aggregating updates and broadcasting updated model parameters to workers takes 20 seconds on 5 EC2 nodes as reported by Moritz et al. [63]. 3.2.2 TensorFlow TensorFlow [13] is a deep learning system which also adopts a dataflow programming model, where nodes of the computation DAG represent operations whose inputs and outputs are multi- dimensional tensors flowing along the direction of the edges. TensorFlow overcomes the limita- tion of traditional dataflow systems by adding variable and queue operations to its computation DAG, which contain mutable states that are shared by different executions on the same graph. A typical TensorFlow program constructs a computation DAG that implements the update equation in Alg.3 for a single mini-batch of data, where variables are used to hold trainable model parameters. During training, the computation DAG is executed repeatedly for all data mini-batches. By using a large set of parallelized operators and executing multiple operation in parallel, TensorFlow utilizes multiple cores of a computing device (e.g. a CPU, GPU or TPU). However, for distributed training, TensorFlow relies on application programmers to manually assign devices to operations in the graph for distributing the computation. Distributed Tensor- Flow applications typically use data parallelism by replicating the computation DAG on each device and sharing the same set of model parameter variables among all replicas. Even if TensorFlow was able to automatically parallelize the computation DAG across de- vices, automating dependence-aware parallelization requires a computation DAG that contains every data mini-batch and their dependence to model parameters. The size of this graph is propotional to training dataset size. Such a graph might be prohibitively expensive to store and analyze. Moreover, in the computation DAG, access to variables or queues alone doesn’t indicate mu- tuable exclusiveness nor ordering between operations that access variables or queues, which would have to be explictly specified by the application program (for example, by using control dependencies [1]). Consider an example where two data mini-batches D1 and D2 are processed. The computation DAG in Fig. 3.2 allows D1 and D2 to be processed in parallel despite their con- flicting access to variable x. Ensuring mutual exclusiveness requires a dependence edge between one mini-batch’s AssignAdd operation that updates variable x and the other mini-batch’s Read operation. As all dependence edges are directed, it is impossible to enforce mutuable exclusive- ness without ordering in TensorFlow, which may restrict the . Constructing a computation DAG that’s amenable to parallelization is labor-intensive and error-prone. 3.2.3 Graph Processing Systems Graph processing systems [23, 39, 56, 57, 79, 83, 84] take a user-provided data graph as input and execute a vertex program on each vertex. Since a vertex program is restricted to access only data that is stored in the vertex’s edges and its neighboring vertices, the graph naturally describes the dependence among vertex programs. This allows some systems to achieve serializability using graph coloring or pessimistic concurrency control [39, 56, 57]. 
There are many applications that deal with non-graph data, which would require users to manually generate a dependence graph and the size of the graph is quadratic to the number of mini-batches. More importantly, it was observed that ensuring serializability using graph coloring or pessimistic concurrency control considerably reduces computation throughput [39], and later systems typically adopt a bulk-

24 August 16, 2018 DRAFT

Figure 3.3: Orion System Overview synchronous or asynchronous engine and give up serializability [23, 79, 83, 84]. 3.2.4 Parameter Server Systems Parameter Server provides a mutable shared memory abstraction. Typically Parameter Server applications are manually parallelized by programmers using data parallelism and application programmers implement a program that executes on each worker machine [29, 53, 76]. In order to exploit dependence-aware parallelization, STRADS [51] proposes to schedule in- dependent update computation to distributed workers. STRADS relies on application program- mers to manually parallelize the program and implement a scheduler. Moreover, the worker program might become more complicated in order to efficiently support a computation sched- ule. For example, the worker program might need to implement pipelined communication to overlap communication and computation time. Although STRADS applications achieve state- of-the-art performance on many sparse learning problems, implementing a STRADS application requires significant programmer effort and STRADS programs typically contain at least several times more lines of code than a corresponding serial implementation. 3.2.5 Orion Design Summary Imperative programming. Section 3.2.2 has discussed the challenges a dataflow system faces for automating dependence-aware parallelization. Moreover, the dataflow model presents a chal- lenge for most programmers who are used to imperative programming. The dataflow model requires dynamic control flow to be expressed within the graph [81] and is more difficult to understand and debug. In order to overcome the usability limitation of the dataflow exeuc- tion model TensorFlow additionally supports eager execution for imperative programming [4]. However, TensorFlow’s eager mode can’t be used in distributed execution [5]. Orion adopts imperative programming. Orion application programmers implement a serial driver program in Julia [21] that executes instructions locally or in Orion’s distributed runtime. Orion is designed as a Julia library without changing the Julia language. The driver program may use any Julia control flow primitive for dynamic control flow. This design also allows the driver program to use any existing Julia library, such as automatic differentiation. Mutable shared memory. As discussed above, shared mutable states is essential for efficiently storing the frequently updated model parameters. Moreover, shared memory offers a con- venient programming abstraction. Orion’s shared memory abstraction is a distributed multi- dimensional array, which is refered to as DistArray. A DistArray supports both set operations such as map and groupBy and random access. Distributed parallel for-loops. Orion provides several macros for executing Julia instructions in the distributed cluster. Among them the most important one is @parallel_for that par- allelizes a serial for-loop whose body reads and writes to DistArrays. As shown in Alg.3, a

25 August 16, 2018 DRAFT for-loop is natural for describing the loop that sequentially processes the data mini-batches, where the set of mini-batches constitute the loop’s iteration space. It is left to Orion to trans- form the serial for-loop to a dependence-aware parallelization (Alg.5), where the key is to find independent mini-batches. Static dependence analysis. Automatic parallelizing compilers parallelizes serial for-loops while preserving loop-carried dependence using static dependence analysis. Static dependence anal- ysis uses dependence vectors to concisely represent data dependence. A single dependence vector represents a dependence between any two iterations whose difference matches the de- pendence vector. Compared to materializing a full dependence graph, dependence vectors are much more compact and more efficient to analyze. However, it might introduce false positive dependences that potentially restricts parallelization. Semantic relaxation and programmer hints. Orion supports multiple natural relaxations and programmer hints to take advantage of programmers’ domain knowledge for generating effi- cient distributed programs. Among them one important relaxation is to allow programmers to selectively discard some write accesses for dependence analysis. This permits parallelization of programs that aren’t parallelizable otherwise by violating less sensitive dependence. In the extreme, discarding all write accesses enables regular data parallelism. Application programming in a scripting language (Julia [21]). Scripting languages greatly improves programmer produtivity at the cost of system efficiency [66]. We choose Julia for its productivity and efficiency. Although a Julia program is often still slower than its well optimized C or C++ counterpart. We believe a 2 to 3× overhead is reasonable to pay considering the substantial improvement in programmer productivity. 3.3 Orion’s Programming Model Orion provides an application library and a runtime system. The runtime system consists of a master process and a number of distributed worker executors. Application programmers write a driver program in Julia [21] that uses the application library to dispatch computation and manipulate data stored in the distributed executors by executing commands on Orion’s runtime master. 3.3.1 Scripting for Productivity With Small Performance Loss Python is probably the most popular programming language for machine learning today. As a scripting language, Python brings a rich set of language features and libraries and allows pro- grammers to quickly “glue” functions from various libraries to create an application. The pro- ductivity gain from scripting languages has been advocated before [66], and Python’s popularity makes a great argument for it. However, Python suffers low performance due to its interpreter overhead and there are serveral high-profile but yet failed attempts to use JIT compilation to improve Python performance (Unladen Swallow [11] and Pyston [9]). In order to bridge that performance gap, Python programmers often rely on efficient kernels implemented in system languages for heavy computation, such as NumPy [8]. Conceptually, the techniques presented in this chapter can be implemented for any imper- ative programming language including Python. Orion currently supports a programming in- terface in Julia in order to achieve computation efficiency while enjoying the high productivity of a scripting language. 
Julia offers a rich set of language features with a simple syntax and uses a sophisticated type inference algorithm and JIT compilation to generate efficient native in- structions. The productivity gain and efficiency makes Julia a promising language for machine learning [45]. Fig. 3.6 shows a function that computes some statistics of a random matrix imple-

26 August 16, 2018 DRAFT

1 def randmatstat(t): 2 n = 5 1 function randmatstat(t) 3 v = zeros(t) 2 n = 5 4 w = zeros(t) 3 v = zeros(t) 5 for i in range(t): 4 w = zeros(t) 6 a = randn(n, n) 5 for i=1:t 7 b = randn(n, n) 6 a = randn(n,n) 8 c = randn(n, n) 7 b = randn(n,n) 9 d = randn(n, n) 8 c = randn(n,n) 10 P = concatenate((a, b, c, d), 9 d = randn(n,n) axis=1) 10 P = [a b c d] 11 Q = concatenate((concatenate(( 11 Q = [a b; c d] a, b), axis=1), 12 v[i] = trace((P'*P)^4) concatenate((c, d), axis 13 w[i] = trace((Q'*Q)^4) =1)), axis=0) 14 end 12 v[i] = trace(matrix_power(dot( 15 return (std(v)/mean(v), std(w)/ P.T,P), 4)) mean(w)) 13 w[i] = trace(matrix_power(dot( 16 end Q.T,Q), 4)) 14 return (std(v)/mean(v), std(w)/ mean(w)) Figure 3.4: Julia Implementation

Figure 3.5: Python Implementation Figure 3.6: Comparing Julia and Python syntax - computing random matrix statistics

mented in both Python and Julia published by Julia developers [7]. The Julia program executes 10× faster than Python’s (Python uses NumPy [8] for linear algebraic operations). A greater gap of 39× was shown on quick sort. Julia offers a number of features that are particularly attractive for Orion: 1. A Julia program is JIT compiled. To be more specific, when a global statement is first encountered, the statement is compiled and executed before its following statements are compiled. JIT compilation allows compilation to access the program’s runtime information (such as DistArray’s size), which Orion use to generate efficient parallelization. 2. Julia provides a powerful meta-programming system via macros. A Julia macro is a func- tion that takes an abstract syntax tree (AST) as input and produces a new AST. When a Julia statement is compiled, the macros that are applied to it are invoked first to rewrite that statement. Orion takes advantage of this feature to implement many of its functional- ities, including the @parallel_for macro that rewrites a serial for-loop into a sequence of statements that distributes the loop’s computation to Orion’s distributed runtime. 3. Because of JIT compilation, Julia allows generating Julia code at runtime, including macro expansion. This makes it easy for Orion to generate a number of functions, including bulk prefetching to improve execution efficiency without burderning the application program- mer. However, one major limitation of Julia is that it does not support multi-threading. 2 Orion’s runtime is implemented in C++ to achieve greater efficiency,

2This was true when the project Orion started (Julia 0.5). Julia recently added an experimental interface for multi- threading in its latest release.

27 August 16, 2018 DRAFT 3.3.2 Distributed Arrays Orion’s main abstraction for distributed shared memory is a multi-dimensional matrix, which we refer to as a Distributed Array or DistArray. An application program typically uses multiple DistArrays. DistArray implements Julia’s native AbstractArray interface for compatibility. Ele- ments of a DistArray can be of any Julia type and a DistArray can be either dense or sparse. A DistArray is automatically partitioned and stored across multiple distributed executors. DistAr- rays support both set transformations and random access. Set Transformations. Similar to RDDs, DistArrays can be created by loading from text files using a user-defined parser or by transforming an existing DistArray using operations like map and groupBy. Text file loading and map operations are recorded by Orion and not evaluated until the driver program calls materialize. This allows fusing the user-defined functions across operations and avoids memory allocation for intermediate results. The application program may thus load data into memory once and extract different features into different DistArrays using different maps. Unlike RDDs, set operations that may cause shuffling such as groupBy that are evaluated eagerly for simplicity. Random Access. The driver program may use the @parallel_for primitive (see Section 3.3.3) to distribute the computation of a for-loop to a cluster of machines. The for-loop’s loop body may randomly read and write to elements of DistArrays that have been materialized before the loop. During parallelization of the for-loop, Orion repartitions the accessed DistArrays to minimize remote accesses. 3.3.3 Parallel For-Loop The driver program may iterate over a DistArray’s elements using a vanilla Julia for-loop and the loop body of different iterations can be distributed among Orion workers by applying Orion’s @parallel_for macro. Thus the DistArray’s key space naturally represents the loop’s iteration space. As DistArrays are multi-dimensional, such a for-loop conveniently represents a perfectly nested loop. Since the loop body may randomly read and write to elements of any DistArray that has been created before the loop, different iterations may have a loop-carried dependence between them if their computation accesses the same DistArray element and one of the accesses is a write. In Orion’s parallelization, the unit of scheduling is one iteration, which is the same as in many automaitc parallelizing compilers including [77] and [31]. Scheduling individual in- structions [22, 54] may potentially be able to parallelize cases where Orion couldn’t. However, implementation requires much more effort and the additional beneift is low for machine learn- ing programs according to our experience. Since the loop body is always executed serially in Orion, loop independent dependences (i.e. dependence that exists between different statements within the loop body) are always preserved. Let P = {(p1, p2, ..., pn)|∀i ∈ [1, n] : 1 ≤ pi ≤ si} represent the iteration space of a n- dimension DistArray, where (p1, p2, ..., pn) represents the index vector of an iteration, and the size of the iteration space’s i-th dimension is si. For any two iterations ~p = (p1, p2, ..., pn) and ~p0 = (p01, p02, ..., p0n), Orion is able to find a parallel execution that ensures the loop-carried dependence if one of the following cases is true:

1. 1D Parallelization: There exists a dimension i such that when pi 6= p0i, iteration ~p and iteration ~p0 do not have any loop-carried dependence between them;

2. 2D Parallelization: There exist two dimensions i and j such that when pi 6= p0i and pj 6= p0j, iteration ~p and iteration ~p0 do not have any loop-carried dependence;

28 August 16, 2018 DRAFT

3. 2D Parallelization w/ Unimodular Transformation: The dependence vectors have no el- ements that are negative infinity or infinity. In this case, Orion applies a classic technique that uses unimodular transformations to transform the iteration space to a parallelizable form [77]. The 2D parallelization can be applied.

In machine learning training programs, for-loops typically traverse a dataset or dimensions of the parameter space. Even though different ordering may result in a different set of numerical values as the learned model parameters, they are usually equally acceptable. This freedom permits an opportunity to reorder loop iterations for improving parallelism. By default, Orion assumes the loop to be parallelized is unordered. This means dependence between iterations only indicates mutual exclusivity, not ordering. Application programmers may enforce loop ordering by applying the ordered argument to @parallel_for. The loop body may also read variables that are defined in the loop’s parent scope. In order to resemble shared memory programming, Orion automatically takes a snapshot of these vari- ables and broadcasts their values to Orion executors that execute the loop body. The snapshot and broadcast are performed by statements generated by Orion, which are inserted before the parallel loop itself. Since a parallel loop may be executed multiple times (e.g. the parallel loop itself is executed inside a loop), this makes sure that Orion correctly captures the up-to-date variable values for distributed execution. It should also be noted that the driver program may contain any number of parallel for- loops, iterating over and accessing the same or different DistArrays. This is a lot more flexible than graph processing systems such as PowerGraph [39] which is essentially a single for-loop that iterates over the vertices or edges; and Parameter Server systems, which allow applications to iterate over the datasets but usually not model parameters stored in the server.

3.3.4 Distributed Array Buffers

Orion applies static dependence analysis on DistArray accesses to find independent computa- tion for parallelization. This means that each DistArray access may create a dependence vector, making parallelization harder. Application programmers may selectively exempt less critical DistArray writes from dependence analysis by applying them to Distributed Array Buffers (i.e. DistArray Buffers) instead. A DistArray Buffer is a write-back buffer for each worker. DistArray Buffers share the same programming interface for random access as DistArrays. An application may create different DistArray Buffers to buffer writes for different DistArrays and may customize how the writes are applied by supplying a user-defined function. This function works on each individual pair of write and DistArray element and may additionally read and update elements of other Dis- tArrays with the same matrix index. This allows applications to implement more sophisticated schemes to apply writes such as using adaptive gradient algorithms and deal with conflicting writes. By default, the buffered writes are applied to corresponding DistArrays at the end of each partition of the iteration space. The driver program may optionally specify an upper bound in number of loop iterations on how long the writes may be delayed. In essence, using DistArray Buffers resembles update buffering that is commonly used in data-parallel training via parame- ter servers. Unlike typical data-parallel training where all dependencies suffer violation, Orion preserves the critical dependency while trading less critical dependencies for high parallelism.

29 August 16, 2018 DRAFT 3.3.5 Program Annotations Application programmers may optionally provide additional information to Orion’s static anal- ysis via annotations, which may be used to improve execution throughput. For example, the parallel for-loop may be annotated with a repeated argument, indicating the for-loop is ex- pected to be executed many times. This enables Orion runtime to cache the for-loop execution context as well as runtime information collected from earlier executions to improve performance for later iterations. 3.3.6 Additional Features Arbitrary Statements. Thanks to Julia’s JIT compilation, the driver program may execute any arbitrary Julia statement on all workers and collect their results. This enables the driver program to define functions, global variables or change the value of a global variable or even read the value of a workers’ local variable. Accumulators. The driver program may obtain runtime information by querying the value of workers’ local variables, or read elements from a DistArray. To facilitate common use cases, Orion additionally provides accumulator variables. An accumulator variable creates a variable in each worker’s local memory, which can be written during any worker computation, such as DistArray transformations or parallel for-loops. The driver may query the aggregated value of different variable instances using a user-defined commutative and associative operator. 3.3.7 Putting Everything Together Fig. 3.7 shows a machine learning program, Stochastic Gradient Descent Matrix Factorization (SGD MF), parallelized by Orion. The whole program consists of 87 lines of code while an implementation based on parameter server consists of 300 to 500 lines of code. Compared to a serial Julia implementation, the Orion program loads the training dataset into a DistArray using a custom parse function. The model parameters (W and H) are initialized as random matrices. The progran contains two parallel for-loops. One loop implements the SGD algorithm and the other loop evaluates the model on the training dataset. The training error is collected using an accumulator variable. The training error can be used to implement more a sophisticated stopping criteria based on the learning progress, using Julia’s native dynamic control flow. 3.4 Static Parallelization And Code Generation The @parallel_for construct is implemented as a Julia macro. The macro is expanded once at runtime when the for-loop statement is compiled. Orion requires that all DistArrays and Dis- tArray Buffers that are accessed by the loop body are materialized before the loop, which allows our macro function to use runtime information of the related DistArrays during parallelization. The macro function takes as input the AST of the loop to be parallelized and an optional list of arguments. The macro applies static analysis and generates a sequence of statements that distribute the loop’s iterations to distributed workers. Static analysis for loop parallelization takes the following steps: 1. Context Snapshot. Orion identifies all variables used in the loop body and classifies them into inherited and local variables (variables that are defined within the loop body). Orion generates statements to take a snapshot of inherited variables and broadcast their values to workers, before each execution of the loop. 2. Parallelization and Static Scheduling. Orion extracts all random accesses to DistArrays within the loop body and parallelizes the loop via dependence analysis, which determines

30 August 16, 2018 DRAFT

1 # Define local variables like step_size and functions like parse_line 2 Orion.@dist_array ratings = 3 Orion.text_file(data_path, parse_line) 4 Orion.materialize(ratings) 5 dim_x, dim_y = size(ratings) 6 Orion.@dist_array W = Orion.randn(K, dim_x) 7 Orion.@dist_array W = 8 Orion.map(W, map_init_param, map_values = true) 9 Orion.materialize(W) 10 # Create DistArrayH 11 Orion.@accumulator err = Float32(0.0) 12 for iteration = 1:num_iterations 13 Orion.@parallel_for for (key, rv) in ratings 14 x_idx = key[1] 15 y_idx = key[2] 16 W_row = @view W[:, x_idx] 17 H_row = @view H[:, y_idx] 18 # ComputeW andH updates 19 W[:, x_idx] .= W_row + W_updates 20 H[:, y_idx] .= H_row + H_updates 21 end 22 Orion.@parallel_for for (key, rv) in ratings 23 # Compute the predicted rating 24 err += abs2(rv - pred) 25 end 26 err = Orion.get_aggregated_value(:err, :+) 27 Orion.reset_accumulator(:err) 28 println("iteration = ", iteration, " err = ", err) 29 end

Figure 3.7: Stochatic Gradient Descent Matrix Factorization Implemented On Orion

31 August 16, 2018 DRAFT

how the iteration space is partitioned and the dependence among partitions. According to the dependence among partitions, Orion fits the loop execution to a static schedule. 3. Repartition DistArrays for Random Access. Based on the DistArrays’ access pattern, Orion determines how each accessed DistArrays are partitioned to minimize remote ac- cesses. When possible, a DistArray is partitioned among the Orion executors that executes the for loop computation (referred to as workers) and thus no remote accesses are needed during loop exeuction. Otherwise, the DistArray is partitioned and served by a set of ex- ecutors referred to as servers which build global indices to provide efficient random access. 4. Code Generation. Orion generates a function that applies the loop body computation to a partition of the iteration space, which is executed by workers. If needed, Orion gen- erates additional functions such as bulk prefereching to improve execution throughput. The generated functions are broadcasted to all executors involved in the loop exeuction. Orion also generates statements to repartition the iteration space and randomly accessed DistArrays. Finally Orion generates statements that command the Orion runtime master to execute the the choosen schedule. 3.4.1 Parallelization During parallelization, Orion partitions the iteration space and identifies dependence among the partitions, which is done using static dependence analysis. Dependence Vectors. A lexicographically positive vector3 d~ denotes a dependence vector of ~ an n-loop nest if there exist two dependent iterations p~1 and p~2 such that p~1 = p~2 + d. Thus each dependence vector represents a set of edges and the number of dependence vectors is independent of the size of the iteration space. The elements of a dependence vector could be numbers, positive or negative infinity or infinity4. Dependence vectors are the basis on top of which dependence analysis is performed. Many previous works discussed how to compute dependence vectors [49, 59]. Parallelization Strategy. Orion computes dependence vectors5 to determine whether the loop is parallelizable and how it is parallelized. Let D denote the set of dependence vectors of an n- dimensional iteration space. Orion determines the parallelization strategy of the for-loop with the following procedure. ~ 1. If there exists a dimension i such that ∀d = (d1, d2, ..., dn) ∈ D, di = 0, then any two itera- tions ~p = (p1, p2, ..., pn) and ~p0 = (p01, p02, ..., p0n) are independent if pi 6= p0i. Partitioning by dimension i ensures that any two iterations ~p and ~p0 from two different partitions are in- dependent. This is referred to as 1-dimensional (i.e. 1D) parallelization. Note that all such dimensions i that satify the above condition are candidate dimensions for partitioning the iteration space. ~ 2. Else if there exist two dimensions i and j such that ∀d = (d1, d2, ..., dn) ∈ D, di = 0, dj = 0, then any two iterations ~p = (p1, p2, ..., pn) and ~p0 = (p01, p02, ..., p0n) are independent if pi 6= p0i and pj 6= p0j. In this case, the loop can be parallelized by partitioning the iteration space by dimensions i and j, which we refer to as 2-dimensional (i.e. 2D) parallelization.

3 ~ A vector d is lexicographically positive if ∃i : di > 0 and ∀j < i : dj ≥ 0 4 ~ Given a dependence vector d = (d1, d2, ..., dn), di = ∞ means ∀x ∈ R, (d1, d2, ..., di−1, x, di+1, ...dn) is a depen- dence vecotr. 5Currently, Orion computes dependence vector for subscript expressions that involve a single loop induction variable whose coefficent is 1, otherwise the corresponding dependence vector element is assigned infinity.

32 August 16, 2018 DRAFT

All such dimension pairs of i and j that satisfy the above condition are candidate dimen- sions for partitioning hte iteration space. 3. Else if none of the dependence vectors contains a negative infinity or infinity component, then 2D parallelization with unimodular transformation can be applied. Parallelizing for- loops using unimodular transformations was introduced by [77]. The algorithm firstly skews the iteration space to transform the loop to a fully permutable loop nest. Then the wavefront transformation is applied to the fully permutable loop nest. Let D 0 denote the ~ set of dependence vectors after transformation, ∀d = (d1, d2, ..., dn) ∈ D 0 : d1 > 0. In other words, let L1, L2, ...Ln denote the transformed iteration space, there’s no dependence between iterations of the innermost loop nest L2, L3, ...Ln. Thus the for-loop can be par- allelized by partitioning the transformed iteration space by the outermost dimension and any combination of the inner loop dimensions. By reversing the transformation, we can derive a 2 dimensional partitioning of the original iteration space. In order to satisify the ordering constriants of the loop iterations, the iteration space is range partitioned in the above cases. Note that each parallelization strategy may propose several can- didate dimensions for partitioning the iteration space, Orion uses a simple heuristic to choose the partitioning dimension(s) that minimizes the number of DistArray elements that need to be communicated among Orion executors during loop execution. 3.4.2 Static Scheduling 1D parallelization can be easily scheduled by assigning different iteration space partitions to different workers. Fig. 3.8a shows a 2 dimensional iteration space and the dependence among iterations. This loop can be parallelized by partitioning the iteration space by dimension j. The workers execute their local partitions without synchronizing with each other until the end of the loop. Under 2D parallelizaiton and 2D parallelization with unimodular transformation, the loop nest collapses into a 2-dimensional loop nest, where the outer loop iterations are executed se- quentially and inner loop iterations are executed in parallel. Thus the outer loop can be exe- cuted as a sequence of global time steps and within each time step, the inner loop is partitioned among workers. We refer to the outer dimension as the time dimension and inner dimension as the space dimension. We observe that in both cases, a partition depends only on two partitions from the previous time step. Thus by properly assigning partitions to workers, Orion avoids global synchronization barriers between global time steps and a worker depends on only one other worker. Fig. 3.8b shows a 2 dimensional iteration space with slightly more complicated de- pendence among iterations. When 2D parallelization is applied, each space partition is assigned to a worker and iterations of the same color can be executed in parallel. Relaxing the ordering constraints. In a traditional compiler, dependence indicates the ordering in which dependent iterations should be executed, such as shown in Fig. 3.8b. Such ordering is necessary for many applications to produce correct results. With the ordering constraints, simultaneous execution of two iterations might not be possible even when they do not share any data. For example, in Fig. 3.8b, iteration (1, 1) and (4, 2) doesn’t share any data, but (4, 2) cannot be executed in parallel with (1, 1). 
This is because (4, 2) depends on iteration (4, 1) which in turn depends on (1, 1). Thus when a worker is executing iteration (1, 1), all other workers remains idel. This parallel for-loop requires 7 sequential time steps to finish. In many machine learning programs, loop iterations can be executed in any order even when there are dependences among them. Different orderings produce different but equally good

33 August 16, 2018 DRAFT i i i

(1, 4) (2, 4) (3, 4) (4, 4) (1, 4) (2, 4) (3, 4) (4, 4) (1, 4) (2, 4) (3, 4) (4, 4)

(1, 3) (2, 3) (3, 3) (4, 3) (1, 3) (2, 3) (3, 3) (4, 3) (1, 3) (2, 3) (3, 3) (4, 3)

(1, 2) (2, 2) (3, 2) (4, 2) (1, 2) (2, 2) (3, 2) (4, 2) (1, 2) (2, 2) (3, 2) (4, 2)

(1, 1) (2, 1) (3, 1) (4, 1) (1, 1) (2, 1) (3, 1) (4, 1) (1, 1) (2, 1) (3, 1) (4, 1)

space partition space partition space partition j j j

(a) A 4 × 4 iteration space and (b) A 4 × 4 iteration space and (c) A 4 × 4 iteration space and 1D parallelization 2D parallelization. unordered 2D parallelization. Figure 3.8: Parallelization of a 4 × 4 iteration space. results. By relaxing the ordering constraints, Orion schedules workers to start from different time steps instead of leaving some workers idle. This allows reducing the number of sequential time steps by about 50%. As shown in Fig. 3.8c, the for-loop completes in 4 sequential time steps. Note that this relaxation is applicable only when the loop is 2D parallelized. 3.4.3 Partitioning DistArrays for Random Access Generally, all DistArrays that are accessed by a parallel for-loop can be stored and served by a cluster of server processes via a key-value interface. However each random access would potentially result in one remote access over the inter-machine network. The overhead of such random accesses is significant even when workers cache DistArray values and buffer updates. Locality. Under 1D and 2D parallelization, suppose the iteration space is partitioned by space dimension i. Given an iteration ~p = (p1, p2, ..., pn) if for all accesses to DistArray A in the form of A[s1, s2, ..., sm], there exists a dimension j such that sj matches pi then A can be range partitioned among workers such that its elements are served locally throughout the loop execution. Pipelining. Under 2D parallelization, suppose the time dimension is i. Given an iteration ~p = (p1, p2, ..., pn) if for all accesses to DistArray D in the form of D[s1, s2, ..., sm], there exists a dimension j such that sj matches pi then D can be range partitioned among workers such that its elements are served locally within one time step. As pi’s value changes upon beginning a new time step, the worker retrieves a different partition of D from another worker. A careful sched- ule ensures a worker always receives the new partition from one statically designated worker called its predecessor. Thus the communication of D forms a pipelined ring pattern. With dou- ble buffering, the communication of D may be overlapped with computation. Fig. 3.9 shows the execution of a 2D parallelized unordered for-loop with DistArray D circulated among workers. The ring communication pattern was also described in STRADS [51], but Orion automates the decision and implementation while STRADS relies on application programmers to determine and implement this communication strategy. Bulk Prefetching. When local partitions can’t be realized for a DistArray, or a DistArray Buffer is used to buffer its updates, the DistArray is partitioned among server processes. In order to minimize the overhead of random remote accesses, Orion prefetches the read DistArray values in bulk. In order to preserve dependence among iteration space partitions and avoid running

34 August 16, 2018 DRAFT

(2, 3) (3, 5) (2, 4) (3, 6) (2, 4) (3, 6) 7 8 D 3 4 D 5 6 D 4 D 6 D 5 D 7

3 5 4 6

1 2 (1, 1) (3, 7) (1, 2) (3, 8) (1, 2) (3, 8)

D 1 2 D 7 8 D 2 D 8 D 3 D 1 (a) Clock = 1 (b) Clock = 2 (c) Clock = 3 Figure 3.9: Scheduling a 2D parallelized unordered loop. The iteration space is par- titioned among 4 workers. Each circle represents a worker executing a partition of (space_partition_id, time_partition_id). D is the DistArray being circulated among workers and the color distinguishes the partitions that are being accessed locally (green) or being communi- cated between workers (cyan). out of memory, Orion prefetches only the read values for each partition. In order to accurately determine which values to prefetch, existing systems rely on program- mers’ effort. For example, IterStore [29] requires the application program to implement and execute a virtual iteration at the beginning of the worker program. The virtual iteration is sup- posed to access parameter values as if it is the actual computation but the parameter server returns an arbitrary value so that the access pattern can be recorded. This approach is likely to face limitations when the program contains different loops that have vastly different access patterns. Li et. al. [53] exposes a communication (Push/Pull) interface, and relies on the appli- cation programmer to implement bulk prefetch, including determining which values to prefetch and cache management. Both approaches rely on programmers to be careful to separate indirect accesses from accesses that can be prefetched. Orion introduces an algorithm to automatically generate a function that computes which DistArray values should be prefetched with indirect accesses excluded. The generated func- tion takes a partition of the iteration space as input and produces a list of matrix indices to be prefetched for each accessed DistArray that is partitioned among the server processes. The pseudocode for this algorithm is presented in Alg.6. The algorithm takes as input the control flow graph of the loop body that’s already in Static-Single-Assignment (SSA) form as well as a symbol table that contains the information for all SSA variables. The algorithm firstly eliminates DistArray access subscripts that have a data or control flow dependence on DistArray reads. Then it starts from the rest of the access subscripts, recursively add statements and control flows that they depend on, similar to dead code elimination. 3.4.4 Dealing with Skewed Data Distribution As the iteration space is often sparse with a skewed data distribution (for example when iterat- ing over a skewed dataset), partitioning the iteration space into equal ranges results in an im- balanced work partition among workers. Orion DistArrays support a randomize operation that randomizes a DistArray along one or multiple dimensions to achieve a more balanced distribu- tion of iterations. Further more, during parallelization, Orion computes a histogram along each partitioning dimension. The histogram approximates the data distribution along a dimension and is used to find a more balanced partitioning of the iteration space. We found that histogram

35 August 16, 2018 DRAFT

Algorithm 6: Code generation to compute DistArray indices for bulk prefetching. input : The set of DistArrays to be prefetched D input : A control flow graph loop_body of the parallel for-loop’s loop body input : A symbol table sym_tab that contains the information for all SSA variables output: A function which computes the set of DistArray indices to be prefetched given a partition of the iteration space function MarkStmt(stmt): /* Marks the statement stmt so it will be included in the generated code. If stmt contains DistArray reads, stmt is transformed so that computation that depends on the read values is eliminated. */ syms_deleted ← Set() ; new_delete ← true ; while new_delete do new_delete ← f alse ; for sym ∈ sym_tab do if sym depends on a DistArray or DistArray Buffer access or a symbol ∈ syms_deleted then syms_deleted ← syms_deleted ∪ {sym} ; new_delete ← true ;

syms_in_use ← Set() ; new_use ← true ; while new_use do new_use ← f alse ; for each statement stmt in loop_body do if ¬isempty(stmt.uses ∩ syms_deleted) then continue ; if stmt contains a read from a DistArray ∈ D or ¬isempty(stmt.de f s ∩ syms_in_use) then syms_in_use ← syms_in_use ∩ stmt.de f s ; new_use ← true ; MarkStmt(stmt) ;

/* Reconstruct a function from the marked statments. The function is executed with virtual DistArrays which only records the accessed indices without returning a value. */

36 August 16, 2018 DRAFT

Application System LOC Normalized Orion 118 (Julia) 1.0 Sparse Logist Regression Serial 107 (Julia) 0.9 Sparse LR with Adaptive Revision Orion 143 (Julia) 1.0 Orion 87 (Julia) 1.0 Serial 79 (Julia) 0.9 Bosen [76] 517 (C++) 5.94 SGD Matrix Factorization IterStore [29] 297 (C++) 3.41 PowerGraph [39] 417 (C++) 4.78 TensorFlow [13] 161 (Python, shared memory mutli-core) 1.87 SGD Matrix Factorization Orion 108 (Julia) 1.0 w/ adaptive gradient/revision Bosen 547 + 200 (C++) 6.98 Orion 398 (Julia) 1.0 Serial 376 (Julia) 0.94 Latent Dirichlet Allocation Bosen 1408 (C++) 3.54 STRADS [51] 4512 (C++) 11.34 YahooLDA! [15] 10408 (C++) 26.16

Table 3.2: Comparing the lines of code needed to implement the same application. based partitioning can reduce the per-iteration runtime of Latent Dirichlet Allocation by 10% on a skewed dataset (PubMed) even after randomization. 3.5 Preliminary Evaluation Orion is implemented in about 17,000 lines of C++ code and 6,300 lines of Julia code (mostly for static analysis and code generation). In this section, we present a preliminary evaluation of Orion that evaluates both ease of use and execution efficiency. For execution efficiency, we conduct experiments on a 42-node cluster where each machine contains two Intel E5-2698Bv3 Xeon CPUs and 64GiB of memory. Each CPU contains 16 cores and 32 hardware threads. The machines are connected with 40Gbps ethernet. 3.5.1 Ease of Use It is difficult to objectively measure programmer effort. We instead report the lines of code (LOC) needed to implement various applications on Orion and compare it with implementa- tions on other open-source distributed learning systems in Table 3.2. While in most cases, most programmer effort is spent on algorithm design, parallelization and performance tuning rather than typing, a shorter program often indicates fewer details to worry about and thus lower pro- grammer effort. The reported applications are provided in the system’s open-source package except for SGD matrix factorization on TensorFlow, which was implemented by the authors. The lines of code was counted by the Count Lines Of Code [2] tool, which excluded comments and empty lines. Parallelization. Our benchmark consists of a logistic regression model and a matrix factoriza- tion model solved using Stochastic Gradient Descent (SGD) and a Latent Dirichlet Allocation model solved using Gibbs Sampling, including their variants using adaptive gradients when applicable. We implemented serial Julia programs for the above applications and parallelized them by applying Orion’s primitives. In general, the parallelization requires adding 10 to 20 lines of code to connect the driver program to Orion’s runtime master and a small number of lines of code to randomize the dataset, collect runtime results, etc. The major code changes in the driver program code include: 1) creating DistArrays instead of local matrices; 2) apply the @parallel_for macro; 3) execute necessary expression on workers, mostly for sharing function

Orion's parallelization reduces the time taken to perform one data pass on the Netflix dataset from 750 seconds to 11 seconds using 4 machines (64 physical cores, with hyperthreading), achieving a 65× speedup without harming per-iteration learning progress.

Comparing with Other Systems. Implementations of SGD matrix factorization and Latent Dirichlet Allocation on parameter server systems (Bösen and IterStore) are parallelized by manually partitioning the training data file and implementing a client program that executes on each distributed machine. The PowerGraph implementation requires a preprocessing step that converts the input data to a proper graph format. Partitioning and converting data formats require additional code that is not included in the line count. In contrast, the Orion implementations include user-defined parsers that read and parse raw data files, and these are included in the line count.

None of the implementations on parameter server systems, PowerGraph, and TensorFlow preserves data dependence between updates. As we show in Section 3.5.3, violating these dependencies slows per-data-pass learning progress by 10× or more. The C++ implementations generally require 3 to 6× more lines of code to implement the same program. The difference is the result of Orion's automation, which avoids manual mechanisms such as explicit virtual iterations or parameter subscription; it also shows the benefit of a scripting language. Yahoo!LDA is a standalone distributed asynchronous LDA implementation; its large LOC includes code for network communication, cache management, coordination, etc.

Comparing with STRADS. Besides Orion, STRADS is the only system whose applications use dependence-aware parallelization, but it requires the programmer to parallelize manually. The STRADS LDA program consists of a scheduler program and a worker program that execute as separate processes. The scheduler program monitors workers' progress and assigns tasks accordingly. The worker program implements the training algorithm and evaluation, and also implements a ring topology for pipelining communication of the model parameters. The manual dependence-aware parallelization requires more than 4500 lines of code, which is 11.34× more than Orion's automatic parallelization.

Bulk Prefetching. When training sparse logistic regression using stochastic gradient descent, each data sample reads a number of weight values, depending on which of its features are non-zero, and updates those weights accordingly. Thus, which weights are accessed is unknown until the data sample is processed. Fig. 3.10 shows how the training algorithm sequentially iterates over the non-zero features of a data sample and reads weight values (weights[fid], where weights is a DistArray) according to the feature ID. This sequence of DistArray accesses causes a sequence of inter-process communication, possibly over the inter-machine network. The KDD2010 (Algebra) [37] dataset contains 8,407,752 data samples with on average 36.3 non-zero features per data sample. When executed on a single machine with 16 worker processes and 16 server processes, each data pass takes 7682 seconds. Orion can automatically generate a function that computes the indices of the remote DistArray values accessed during a number of to-be-executed for-loop iterations and prefetch those values in bulk (see Section 3.4.3), with no additional effort by the application programmer. Bulk prefetching reduces the per-data-pass runtime to 9.2 seconds. Since the application takes repeated data passes over the same, unchanged dataset, the DistArray indices for prefetching only need to be computed once and can be re-used in later data passes. This can be provided as a programmer hint to Orion by applying the repeated argument to @parallel_for, which further reduces the per-data-pass time to 6.3 seconds.
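As an illustration of what such a generated index-computation function amounts to for the loop in Fig. 3.10, the sketch below walks one partition of data samples and collects the weights subscripts they would touch. It is a hand-written approximation of Algorithm 6's output (assuming the sample layout of the program in Appendix A.2), not code emitted by Orion.

    # Hand-written approximation of the index-computation function that bulk
    # prefetching relies on, for the sparse logistic regression loop of Fig. 3.10.
    # Each sample is ((line_number,), (label, feature_vec)), as in Appendix A.2.
    function compute_prefetch_indices(partition)
        weights_indices = Set{Int64}()
        for sample in partition
            features = sample[2][2]
            for feature in features
                fid = feature[1]
                push!(weights_indices, fid)   # only the subscript is recorded
            end
        end
        return weights_indices
    end

The computation on the read values (the dot product and the weight update) is sliced away, which is exactly the elimination performed by MarkStmt in Algorithm 6.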


for feature in features
    fid = feature[1]
    fval = feature[2]
    sum += weights[fid] * fval
end

Figure 3.10: Sparse logistic regression iterates over non-zero features.

Application                             Implementation                  Avg. Time Per Iteration in seconds (overhead vs. serial)
Latent Dirichlet Allocation (K = 100)   Serial Julia                    72.92
                                        Serial Julia w/ compact keys    82.23 (12.8%)
                                        Orion Parallelized (1 thread)   91.85 (20.61%)
Latent Dirichlet Allocation (K = 1000)  Serial Julia                    263.11
                                        Serial Julia w/ compact keys    284.7 (8.2%)
                                        Orion Parallelized (1 thread)   295.128 (10.85%)

Table 3.3: Orion abstraction overhead. Comparison between a serial Julia program and the same program running on Orion with a single thread. Parallelization itself took 8.7 and 9.7 seconds respectively, which includes computing the histogram along the two dimensions of the iteration space.

3.5.2 Orion Abstraction Overhead

Compared to a serial Julia program, executing an Orion program involves scheduling, accessing DistArrays as opposed to native Julia Arrays, and conversion between custom and Orion's data representations. In order to understand the overhead of Orion's abstraction, we compared the per-iteration runtime of the serial LDA program and the Orion-parallelized program executed with a single thread. The results are presented in Table 3.3. The reported time is the per-iteration average from iteration 2 through iteration 8; the first iteration is excluded to avoid parallelization and JIT compilation overhead. Note that the relative abstraction overhead of Orion decreases as the computation per iteration increases.

In the LDA application, the dataset is represented as a three-dimensional matrix. Orion represents each matrix index as a compact 64-bit integer rather than 3 integers, which is converted on the fly when each iteration is executed. Each conversion involves several floating-point divisions, which are slow operations. In order to understand the runtime cost of this trade-off, we implemented a serial program that uses the same compact keys. We found that this conversion constitutes the majority of Orion's overhead.

3.5.3 Orion vs. Manual Data Parallelism

We first evaluate the importance of dependence-aware scheduling and Orion's execution efficiency by comparing Orion's automatic parallelization with manual data parallelism using Bösen, on two machine learning applications, SGD matrix factorization and Latent Dirichlet Allocation (LDA), with various datasets. All experiments used 12 machines. The results are presented in Fig. 3.11, Fig. 3.12, Fig. 3.13, and Fig. 3.14. We observed that in all cases Orion's parallelization achieves faster per-iteration learning progress than manual data parallelism, due to its dependence-preserving parallelization. Manual data parallelism on Bösen achieved higher computation throughput (e.g., 2.48 vs. 6.04 seconds per iteration on SGD MF and 4.44 vs. 4.81 on LDA with NYTimes, for Bösen and Orion respectively). Orion still converges faster in time in all cases except for LDA with the PubMed dataset.

39 August 16, 2018 DRAFT

Figure 3.11: Comparing w/ Bösen Parameter Server on SGD Matrix Factorization. (a) Convergence over time (training loss vs. seconds); (b) convergence over iterations. Curves: manual data parallelism on Bösen, w/ managed communication & AdaRevision on Bösen, automatic parallelization by Orion, w/ AdaRevision on Orion.

Figure 3.12: Comparing w/ Bösen Parameter Server on LDA (dataset: NYTimes). (a) Convergence over time (training log-likelihood vs. seconds); (b) convergence over iterations. Curves: manual data parallelism on Bösen, w/ managed communication on Bösen, automatic parallelization by Orion.

Figure 3.13: Comparing w/ Bösen Parameter Server on LDA (dataset: PubMed). (a) Convergence over time (training log-likelihood vs. seconds); (b) convergence over iterations.


Figure 3.14: Comparing w/ Bösen Parameter Server on LDA (dataset: ClueWeb25M). (a) Convergence over time (training log-likelihood vs. seconds); (b) convergence over iterations.

3.5.4 Orion vs. Manual Data Parallelism w/ Managed Communication

As discussed in Chapter 2, managed communication improves learning progress for data parallelism by communicating updates and updated model parameters more frequently to reduce error. In essence, it automates the common practice of manual mini-batch size tuning and prioritizes communication for more sensitive parameters.

Fig. 3.11, Fig. 3.12, Fig. 3.13, and Fig. 3.14 also include comparisons with managed communication on top of manual data parallelism. In the case of SGD matrix factorization, the Bösen implementation uses the adaptive revision algorithm [60] to auto-tune the learning rate and thereby take full advantage of managed communication. Applying adaptive revision on top of the Orion implementation requires adding about 20 lines of code, while it took the Bösen implementation about 200 lines of additional code to implement server-side UDFs. We assigned each Bösen node a bandwidth budget of 1600Mbps and 2560Mbps for SGD matrix factorization and LDA respectively.

In both applications, communication management substantially improves learning progress for data-parallel execution, closing the gap between manual data parallelism and Orion's dependence-aware parallelization. This results in a similar convergence rate in time between Orion and managed communication on Bösen in most cases. For LDA with ClueWeb25M, however, managed communication converges about 2× slower in time than Orion, as the frequent communication reduced Bösen's computation throughput.

In order to minimize inconsistency, communication management communicates model updates and parameter values aggressively under the given bandwidth budget. Fig. 3.15 shows the bandwidth consumed by one node of Orion and of Bösen with bandwidth management, for both SGD matrix factorization and LDA. 4 and 8 machines are used respectively in the two experiments, and Bösen is configured with an 800Mbps and a 1280Mbps bandwidth budget per machine. Bösen blindly communicates updates and parameter values with no knowledge of when those values will be needed by the receiver; values that are not immediately used could have been buffered to allow further coalescing, reducing bandwidth consumption. Orion, on the other hand, schedules computation according to dependence analysis and communicates values only when they are needed. As Fig. 3.15 shows, Orion consumed much less network bandwidth than Bösen.
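To give a sense of the roughly 20 lines involved on the Orion side, the sketch below adds an AdaGrad-style per-parameter step size to the W update of the SGD matrix factorization program in Appendix A.1. It illustrates an adaptive-gradient variant only, not the exact adaptive revision algorithm of [60] (which additionally accounts for update delay), and the W_accum DistArray and the Orion.fill constructor are assumptions.

    # AdaGrad-style adaptive step size for the W update (sketch; H is analogous).
    Orion.@dist_array W_accum = Orion.fill(Float32(1e-8), K, dim_x)  # assumed constructor
    Orion.materialize(W_accum)

    Orion.@parallel_for for rating in ratings
        x_idx = rating[1][1]
        y_idx = rating[1][2]
        rv = rating[2]
        W_row = @view W[:, x_idx]
        H_row = @view H[:, y_idx]
        diff = rv - dot(W_row, H_row)
        W_grad = -2 * diff .* H_row
        W_accum[:, x_idx] .= W_accum[:, x_idx] .+ W_grad .^ 2   # accumulate squared gradients
        W[:, x_idx] .= W_row .- step_size .* W_grad ./ sqrt.(W_accum[:, x_idx])
    end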


Figure 3.15: Outbound bandwidth consumption (Mbps over time) for Orion and Bösen's managed communication. (a) SGD matrix factorization on Netflix; (b) Latent Dirichlet Allocation on NYTimes.

Figure 3.16: Comparing w/ STRADS Scheduler on LDA (dataset: NYTimes). (a) Convergence over time (training log-likelihood vs. seconds); (b) convergence over iterations. Curves: manual parallelization on STRADS, automatic parallelization by Orion.

Figure 3.17: Comparing w/ STRADS Scheduler on LDA (dataset: PubMed). (a) Convergence over time; (b) convergence over iterations.


Figure 3.18: Comparing w/ STRADS Scheduler on LDA (dataset: ClueWeb25M). (a) Convergence over time; (b) convergence over iterations.

Dataset      #Tokens   #Parameters   Seconds Per Iteration   (De-)Serialization Time   Wait Time
NYTimes      99.5M     102M          5.93                    1.60                      1.2
PubMed       737.8M    1.4B          40.00                   3.23                      17.5
ClueWeb25M   9.93B     2B            175.28                  27.16                     26.07

Table 3.4: Orion execution runtime breakdown. All experiments are executed using 12 machines (384 hyper-threading cores in total). Time is measured in seconds.

3.5.5 Orion vs. Manual Model Parallelism

With the STRADS framework [51], application programmers may manually parallelize a serial training program and implement a dependence-aware scheduler themselves. Implementing a scheduler requires non-trivial programmer effort: for example, it took 4512 lines of code on STRADS but 384 lines of code on Orion to implement the same LDA application. On the other hand, a custom scheduler allows the programmer to implement more application-specific optimizations. We compared the LDA program parallelized by Orion against the LDA implementation on STRADS using various datasets and present the results in Fig. 3.16, Fig. 3.17, and Fig. 3.18. In all experiments, Orion's automatic parallelization achieves similar per-iteration convergence progress to manual parallelization on STRADS, while Orion's computation throughput is 1.5× to 2.5× lower.

In order to better understand Orion's performance bottleneck, we present a breakdown of the LDA program's execution time in Table 3.4. For a small dataset (NYTimes), serialization and deserialization of the model parameters circulated among workers (see Fig. 3.9) constitute 27% of the total runtime. The relative overhead of serialization/deserialization shrinks as the data size grows and the ratio of parameter size to computation decreases. The STRADS implementation uses shared-memory multi-threading on each worker machine; communicating model parameters between workers on the same machine only requires passing a pointer to shared memory, which eliminates the serialization/deserialization and memory-copy overhead. This explains the larger performance gap (2×) on NYTimes. With the PubMed dataset, 44% of a worker's runtime was spent waiting for its predecessor to finish and send the corresponding model parameters, which is likely due to a highly skewed data distribution in PubMed.


Figure 3.19: Comparing w/ TensorFlow on Matrix Factorization. (a) Convergence over time (training loss vs. seconds) for Orion and TensorFlow; (b) seconds per iteration/epoch for Orion, TF_mb_4, and TF_mb_128.

3.5.6 Orion vs. Dataflow Systems

TensorFlow [13] has become a widely popular framework for training deep neural networks. Its dataflow execution model is well suited to deep learning, where most of the computation is dense matrix multiplication. Fig. 3.19 shows that TensorFlow is inefficient for applications that perform frequent and sparse updates, such as solving matrix factorization using stochastic gradient descent. The experiments were executed on a single machine using CPUs. The TensorFlow implementation of SGD MF expresses the update equation for a single mini-batch of training data as a DAG and repeatedly executes the DAG for all mini-batches in the training dataset. The DAG consists of parallelized operators, which are the major source of parallelism, and the available parallelism thus depends on the mini-batch size.

Fig. 3.19b shows TensorFlow's per-iteration runtime when an iteration is divided into 4 (TF_mb_4) or 128 (TF_mb_128) mini-batches, compared to Orion's per-iteration runtime. Larger mini-batches cause TensorFlow to run out of memory. First, reducing the mini-batch size increases TensorFlow's per-iteration runtime due to reduced parallelism. Second, even with the largest possible mini-batch size, TensorFlow spends 2.3× more time on one iteration than Orion does. This is because TensorFlow performs dense matrix multiplication, where some of the computation is redundant for computing gradients since the data matrix is sparse. In order to retain high computation throughput, the TensorFlow program uses mini-batches containing millions of data items (i.e., ratings), while the Orion program uses mini-batches that contain a single data item. The Orion program therefore updates model parameters much more frequently than TensorFlow, leading to much faster learning progress per iteration and thus a much faster convergence rate, as shown in Fig. 3.19a.

3.6 Summary and Planned Case Studies

So far I have demonstrated Orion's effectiveness on several representative machine learning applications previously used for benchmarking learning systems. I plan to extend the evaluation with case studies on applications that have traditionally required specialized systems.

Word2Vec. Word2Vec [61, 62] is a set of popular models that produce word embeddings for use in NLP applications. Google has open-sourced an implementation in C that provides parallelism on shared-memory multi-core machines [12]. TensorFlow also provides a shared-memory implementation. Parallelizing Word2Vec on Orion yields a distributed Word2Vec program and allows us to compare Orion with two different implementations.


Gradient Boosted Trees. Gradient boosted trees are a popular model for classification and regression and have served as an essential building block in the winning solutions of many machine learning competitions [24]. Unlike the above ML applications, training gradient boosted trees does not iteratively refine a fixed set of parameters. Instead, it finds a sequence of splits that partition the set of data samples into a number of clusters. To compute each split, the algorithm evaluates and compares a set of candidates, and different methods exist for selecting and evaluating candidates. Existing implementations of gradient boosted trees include specialized systems [24, 48] as well as implementations on top of general-purpose dataflow systems [14]. This special computation pattern makes gradient boosted trees an interesting case study for Orion.

Neural Network Training. Data parallelism is the commonly used approach for distributed neural network training. Existing frameworks [13, 25] typically have a built-in parameter server implemented in C++. One challenge with existing systems is that implementing a new optimization algorithm (e.g., adaptive revision) requires implementing the core computation in C++, especially if the algorithm requires server-side computation. Orion allows all computation to be implemented in Julia, automatically distributes the computation among servers, and relies on Julia's JIT compilation to achieve native performance.


Chapter 4

Dynamic Scheduling for Dynamic Control Flow

4.1 Distributed Machine Learning w/ Dataflow Graphs

Dataflow has become a popular programming model for machine learning because of the rich library support of systems such as TensorFlow [13]. Applications of such systems describe their computation as a dataflow graph (e.g., Fig. 4.1a), where nodes are operations and edges carry the operations' inputs and outputs. By requesting an operation's output, an application triggers the computation of all operations that the output value depends on. In order to implement a distributed TensorFlow application, the programmer assigns a device (CPU, GPU, etc.) to each operation in the dataflow graph, effectively partitioning the graph among distributed computing devices. Given the graph partition, TensorFlow automatically inserts Send and Recv operations on the cut edges. Note that graph partitioning and device assignment are done by the application program before the graph is executed and are fixed over the lifetime of the graph. Fig. 4.1b shows a dataflow graph partitioned across two devices.

TensorFlow supports dynamic control flow within a computation graph via a cond and a while_loop operator. Dynamic control flow allows different parts of the graph to be executed, or a part of the graph to be executed a variable number of times, depending on runtime values. Moreover, TensorFlow allows independent operations from different loop iterations to be executed in parallel.

Optimal device placement of graph operations is tricky. The problem becomes much harder with dynamic control flow, as which operations are executed and the operations' input sizes are no longer known statically. A statically partitioned and placed graph can thus suffer from load skew and unnecessary cross-device communication compared to a dataflow graph that is dynamically scheduled.

4.2 Conditional Computation for Outrageously Large Neural Networks

To concretely demonstrate the deficiency of static device placement in the presence of dynamic control flow, we study conditional computation in neural networks as an example. It is understood that, with a sufficiently large training dataset, increasing the capacity (number of parameters) of a neural network improves model performance [71]. This has been observed in various application domains, including language translation [78], image classification [42], and speech recognition [16]. Training large neural networks demands high computing power and large memory. In a typical neural network model, the entire model is activated for each data item, so the computational complexity increases along with the number of parameters.

In order to resolve this computational challenge, a number of works have proposed conditional computation, such that each data item activates only some chunks of a large neural network [19, 20, 33, 71].


Figure 4.1: Computation represented as a dataflow graph. (a) A simple dataflow graph; (b) the graph partitioned across two devices.

Figure 4.2: An MoE layer embedded in a recurrent neural network. The gating network selects two expert networks to process each sample; their outputs are modulated by the outputs of the gating network (figure courtesy of [71]).

To the best of my knowledge, the state-of-the-art result was achieved by Shazeer et al. [71], who introduced a neural network layer called the Sparsely-Gated Mixture-of-Experts (MoE) layer. The MoE layer consists of a learned gating network and a number of (up to hundreds of thousands [71]) learned expert networks. The gating network can be any function that outputs a sparse n-dimensional vector, where n is the number of experts; the expert networks can be arbitrary functions, and an MoE layer may contain different experts as long as they accept same-sized inputs and produce same-sized outputs. The MoE layer can even be hierarchical, where each expert is itself an MoE layer. The nonzero elements of the gating network's output vector determine which expert networks are activated by the input data item, as shown in Fig. 4.2. A neural network may adopt MoE as some of its layers, allowing it to have a much larger number of parameters without dramatically increasing computational complexity. With the MoE layer, Shazeer et al. [71] successfully trained neural networks with 1000× more model parameters, achieving considerably better results than the previous state of the art in both language modeling and machine translation at lower computational complexity. The importance of exploring coarse-grained sparsity was also reinforced by a discussion in [34].
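The routing decision that drives the placement issues discussed below can be summarized by a small top-k gating sketch. This is a simplified illustration (a softmax over a learned projection, restricted to the k largest scores), not the exact noisy gating function of [71]; the function and argument names are assumptions.

    using LinearAlgebra

    # Simplified top-k gating for a Mixture-of-Experts layer (illustrative sketch).
    # x: input vector (length d), Wg: learned gating weights (d x n), k: experts per sample.
    function moe_gate(x::Vector{Float32}, Wg::Matrix{Float32}, k::Int)
        logits = Wg' * x                                  # one score per expert
        topk = partialsortperm(logits, 1:k, rev = true)   # indices of the k largest scores
        gates = zeros(Float32, length(logits))
        w = exp.(logits[topk] .- maximum(logits[topk]))
        gates[topk] = w ./ sum(w)                         # sparse softmax over the selected experts
        return topk, gates                                # which experts run, and their mixing weights
    end

Only the experts returned in topk process a given sample, which is why the per-expert load, and hence the ideal device placement, changes from mini-batch to mini-batch.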


In Shazeer et al. [71], a single MoE layer contains several billion parameters, which cannot fit in a single GPU. To fit the model in GPU memory, Shazeer et al. [71] statically partition the MoE layer: the gating network is replicated for data-parallel training and each GPU is statically assigned a number of expert networks. As all network parameters reside in GPU memory throughout training, a large model requires a large number of GPUs to fit. Such an architecture faces a number of challenges regarding computational efficiency.

1. Additional inter-machine communication. If a model fits in a single GPU, the forward and backward propagation on a mini-batch of data samples can be executed without communication between GPUs. When an MoE layer is statically partitioned and assigned to distributed GPUs, the input to an expert network, which is the output of the previous neural network layer, may come from a remote GPU. Similarly, the gradients of the expert network need to be communicated back to the previous neural network layer during backpropagation. The time spent on this additional network communication increases as the mini-batch size and the number of parameters increase.

2. Mini-batch size limitation. A large MoE layer requires a high number of GPUs to fit. It is well known that in SGD the mini-batch size cannot be scaled indefinitely, due to diminishing returns in learning progress and the generalization gap [44, 50]. The limited mini-batch size may cause the expensive GPUs to be severely underutilized. As reported by Shazeer et al. [71], a large hierarchical MoE layer consisting of 131072 experts required 128 Tesla K40 GPUs to fit, and the computational efficiency was only 0.30 TFLOPS/GPU while a theoretical maximum of 4.29 TFLOPS/GPU was claimed by NVIDIA. Newer GPUs such as the Titan X (11.29 TFLOPS), GTX 1080 Ti (11.3 TFLOPS), and Tesla V100 (14 TFLOPS) have considerably higher computational throughput with only a small increase in memory capacity, so computational efficiency is expected to worsen following this trend. For cost effectiveness, it is desirable to allow the training algorithm to freely choose the mini-batch size and the number of GPUs for high computational efficiency, as opposed to being restricted by the model's memory footprint.

3. Load imbalance. Shazeer et al. [71] observed that the gating network tends to assign examples to the same few experts. This imbalance is self-reinforcing, as the favored experts are trained more rapidly and thus more likely to be selected by the gating network. Because the expert networks are statically assigned to GPUs, this imbalance causes an imbalance in both computation and memory across GPUs; a large input may even cause an expert network's GPU to run out of memory. In order to achieve a balanced load among GPUs, Shazeer et al. [71] added two terms to the model's loss function to encourage a balanced assignment. However, balancing load is unlikely to be aligned with the true objective of the learning task, possibly harming model performance. Moreover, the two loss terms do not necessarily guarantee a balanced load or that the model fits in memory.

4.3 Preliminary Experimental Investigation

We perform a preliminary experimental study to investigate the deficiencies of existing software systems in supporting conditional computation such as Mixture of Experts. Our experiments were conducted on a cluster of 4 machines, each of which is equipped with an NVIDIA Titan X GPU; the machines are connected by 40Gbps Ethernet.
We used TensorFlow 1.8 [10] and neural network models implemented in Tensor2Tensor [75]. Tensor2Tensor provides neural network models for various machine learning tasks, such as translation and language modeling. Each model has multiple variants that differ in their neural network architecture hyper-parameters, such as the number of hidden layers, the number of attention heads, and so on.


Figure 4.3: Distributed training (n = 4). Figure 4.4: Single-machine training. Figure 4.5: Model: TransformerMoe, HParams: TransformerMoe2K, Dataset: TranslateEndeWmt32k. X-axis is the number of global steps.

Users may choose from an existing collection of hyper-parameter sets called HParams or provide custom hyper-parameter values. Tensor2Tensor contains two models, TransformerMoe and AttentionLmMoe, that use Mixture of Experts in one or some of their layers. TransformerMoe is a variant of the transformer model [74] for translation, and AttentionLmMoe is a model that uses the attention mechanism [74] for language modeling. The implementation of the MoE layer in Tensor2Tensor does not use TensorFlow's dynamic control flow operators. Instead, it uses the gather operator to gather the inputs to different expert networks into one tensor and uses the split operator to split that tensor and dispatch each partition to the corresponding expert network. Nevertheless, an MoE layer is statically partitioned and the GPUs are assigned different expert networks. Our experiments show a number of deficiencies of TensorFlow in efficiently supporting Mixture of Experts:

1. A neural network scales poorly when one of its layers is partitioned and placed among a set of distributed GPUs while the other layers are replicated, compared to fully replicating a similar neural network among the same set of GPUs.

2. A heavily loaded expert network may receive 10× more data samples than a lightly loaded expert network in one mini-batch. When training one of the model variants, one GPU ran out of memory after 1231 updates due to the imbalanced load.

3. Although Yu et al. [81] report that TensorFlow implements memory swapping to offload parameter or activation values to host memory when the GPU runs out of memory, this technique did not take effect in our experiments.

4.3.1 Limited Scalability of MoE Networks

We investigate the weak scalability of MoE networks by comparing the training throughput, in terms of the number of data samples processed per second, on 1 machine and on 4 machines while fixing the per-machine/per-GPU mini-batch size. Our experiments used two small variants of TransformerMoe and AttentionLmMoe so that the networks fit in the memory of a single GPU. As mentioned above, the MoE layer is statically partitioned and the worker GPUs are assigned different expert networks; the other layers are replicated among workers with shared model parameters. The distributed workers are synchronized at each mini-batch, which is referred to as a global step.


Figure 4.6: Distributed training (n = 4). Figure 4.7: Single-machine training. Figure 4.8: Model: AttentionLmMoe, HParams: AttentionLmMoeSmall, Dataset: LanguageModelLm1b32k. X-axis is the number of global steps.

Figure 4.9: Distributed training (n = 4). Figure 4.10: Single-machine training. Figure 4.11: Model: Transformer, HParams: TransformerL4, Dataset: TranslateEndeWmt32k.

Fig. 4.5 and Fig. 4.8 compare the throughput of a single machine and of 4 distributed machines when training a TransformerMoe model and an AttentionLmMoe model respectively, where the number of global steps executed per second is reported. Since 4 worker machines process 4× more data samples per global step, we observed a 50% decrease in throughput, in terms of the number of data items processed per second, when scaling from 1 to 4 machines for TransformerMoe, while a 1.55× increase was observed for AttentionLmMoe.

To the best of my knowledge, it is difficult if not impossible to measure the time spent communicating activation values and gradients between the distributed experts and their connected layers using existing TensorFlow profiling tools. In order to understand the overhead of statically distributed MoE layers, we therefore report the throughput of two similar networks that do not employ Mixture of Experts and are thus fully replicated among workers. Fig. 4.11 and Fig. 4.14 compare the throughput of a single machine and of 4 distributed machines for training a 4-layer Transformer model and an AttentionLmMoe model with the MoE layer removed. We observed a 1.3× and 1.7× throughput increase respectively, in terms of the number of data items processed per second, when scaling from 1 to 4 worker machines. The networks without MoE layers thus achieve higher speedups on 4 machines than their counterparts that employ MoE layers.


Figure 4.12: Distributed training (n = 4). Figure 4.13: Single-machine training. Figure 4.14: Model: AttentionLmMoe, HParams: AttentionLmNoMoeSmall, Dataset: LanguageModelLm1b32k.

Figure 4.15: Model: AttentionLmMoe, HParams: AttentionLmMoeLarge, Dataset: LanguageModelLm1b32k.

4.3.2 Load Distribution among Expert Networks

We examine the load distribution among expert networks at different global steps for three variants of the AttentionLmMoe model; the results are shown in Fig. 4.15, Fig. 4.16, and Fig. 4.17. In each figure, the horizontal axis depicts the number of global steps and the vertical axis shows the load distribution across expert networks in terms of the number of data items. At each global step, the highest data point shows the number of data samples assigned to the heaviest expert network and the lowest data point shows the load of the lightest expert network; the other data points are various percentiles. Note that AttentionLmMoeImbalance (Fig. 4.16) is an AttentionLmMoe model with the load-balancing loss removed. In all cases, we observed global steps in which the heaviest expert network received 10× more data samples than the lightest expert network.

4.3.3 Throughput Drop

We additionally report the training throughput of AttentionLmMoeImbalance in Fig. 4.18: the training throughput dropped by 80% after approximately 6000 global steps. While the cause of this problem is unclear, I suspect it is due to the skewed load distribution and imbalanced load, which get worse as the model stabilizes.


Figure 4.16: Model: AttentionLmMoe, HParams: AttentionLmMoeImbalance, Dataset: LanguageModelLm1b32k.

Figure 4.17: Model: AttentionLmMoe, HParams: AttentionLmMoeTranslation, Dataset: LanguageModelLm1b32k.


Figure 4.18: Model: AttentionLmMoe, HParams: AttentionLmMoeImbalance, Dataset: LanguageModelLm1b32k.

4.4 Dynamic Scheduling for Dynamic Control Flow in Dataflow Graphs

Motivated by the problems discussed above, I propose to study dynamic scheduling for dynamic control flow in dataflow graphs, particularly targeting conditional computation, in order to reduce memory footprint and improve computation throughput.

4.4.1 Incorporating Static and Dynamic Device Placement

The goal of dynamic scheduling is to balance the computational load among GPUs under the GPUs' memory constraints; a stretch goal is to also minimize cross-device communication. Given a computation graph, determining its optimal partition and device placement is challenging. Dynamic scheduling additionally requires the scheduling decisions to be made at low latency, which becomes harder as the graph grows. Moreover, it is desirable to remain fully compatible with the existing TensorFlow API.

In order to incorporate both static and dynamic scheduling, we introduce a concept called virtual devices. I propose an unbounded number of virtual devices, including one special anonymous virtual device. Scheduling maps each virtual GPU or CPU device to a physical GPU or CPU. The anonymous virtual device can additionally be partitioned into multiple virtual devices or replicated, as determined by the dynamic scheduler. An application program may assign a virtual device to the operators it wants scheduled dynamically and assign the remaining operators to physical devices as usual. Application programmers may thus enforce additional scheduling constraints; for example, an application may force some operators onto the same physical device by assigning them the same virtual device.

4.4.2 Operator Scheduling Strategies

Finding the optimal device placement for operators assigned to the anonymous virtual device is the most challenging case, as the degree of freedom is the largest. Ideally, the entire computation graph may be assigned to the anonymous virtual device. An additional scheduling strategy for operators of the anonymous virtual device is replication: an operator that receives an exceedingly large input may be replicated over N devices, with each replica receiving 1/N of the input. Consider the example of Mixture of Experts. The expert networks can be placed onto devices dynamically according to the gating network's output; moreover, given spare GPU memory, popular expert networks can be replicated on multiple GPUs to reduce the network communication between an expert network and its connected layers. A sketch of such a load-based placement policy is given at the end of this chapter.

4.4.3 Operator Profiling

Scheduling requires runtime information about the operators, including execution time, memory footprint, and output size under different input sizes. I plan to explore collecting operator profiles both online and offline.

4.4.4 Memory Swapping and Dynamic Batching

Memory swapping has been proposed as a technique to reduce GPU memory footprint by leveraging the layer-wise computation pattern to offload parameters and activation values of some layers of the neural network to host (CPU) memory [30, 81]. Dynamic batching [55] reduces the number of GPU calls and improves GPU utilization by batching several calls to the same operator into a single call over the combined input of the individual calls. These techniques create additional opportunities for the scheduler to reduce memory footprint and improve execution efficiency.
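As referenced in Section 4.4.2, the following is a minimal sketch of a load-based placement heuristic for expert networks: experts are considered in decreasing order of the number of samples routed to them in the current mini-batch and are greedily assigned to the least-loaded GPU that still has memory for them. It is purely illustrative of the proposed direction, not an implemented scheduler; the cost model (load proportional to routed samples, a single memory figure per expert) and all names are assumptions.

    # Greedy, load-based placement of expert networks onto GPUs (illustrative sketch).
    # expert_load[e] : number of samples routed to expert e in this mini-batch
    # expert_mem[e]  : memory required to host expert e (bytes)
    function place_experts(expert_load::Vector{Int}, expert_mem::Vector{Int},
                           num_gpus::Int, gpu_mem_budget::Int)
        gpu_load = zeros(Int, num_gpus)
        gpu_mem  = fill(gpu_mem_budget, num_gpus)
        placement = Dict{Int, Int}()                       # expert index -> GPU index

        # Heaviest experts first (longest-processing-time-first heuristic).
        for e in sortperm(expert_load, rev = true)
            candidates = [g for g in 1:num_gpus if gpu_mem[g] >= expert_mem[e]]
            isempty(candidates) && error("expert $e does not fit on any GPU")
            g = candidates[argmin(gpu_load[candidates])]   # least-loaded feasible GPU
            placement[e] = g
            gpu_load[g] += expert_load[e]
            gpu_mem[g]  -= expert_mem[e]
        end
        return placement
    end

Replication could be layered on top of this by splitting the routed samples of an overly popular expert across several placements, subject to the same memory budget.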


Appendices


Appendix A

Orion Application Program Examples

A.1 Stochastic Gradient Descent Matrix Factorization

include("/path/to/orion/src/julia/orion.jl")
Orion.set_lib_path("/path/to/orion/lib/liborion_driver.so")

const master_ip = "10.117.1.17"
const master_port = 10000
const comm_buff_capacity = 1024
const num_executors = 64
const num_servers = 1

Orion.glog_init()
Orion.init(master_ip, master_port, comm_buff_capacity,
           num_executors, num_servers)

const data_path = "file:///path/to/data.csv"
const K = 1000
const num_iterations = 256
const step_size = Float32(0.01)

Orion.@accumulator err = 0
Orion.@accumulator line_cnt = 0

Orion.@share function parse_line(line::AbstractString)
    global line_cnt
    line_cnt += 1
    tokens = split(line, ',')
    @assert length(tokens) == 3
    key_tuple = (parse(Int64, String(tokens[1])),
                 parse(Int64, String(tokens[2])))
    value = parse(Float32, String(tokens[3]))
    return (key_tuple, value)
end

Orion.@share function map_init_param(value::Float32)::Float32
    return value / 10
end


Orion.@dist_array ratings = Orion.text_file(data_path, parse_line)
Orion.materialize(ratings)
dim_x, dim_y = size(ratings)

println((dim_x, dim_y))
line_cnt = Orion.get_aggregated_value(:line_cnt, :+)
println("number of lines read = ", line_cnt)

Orion.@dist_array W = Orion.randn(K, dim_x)
Orion.@dist_array W = Orion.map(W, map_init_param, map_values = true)
Orion.materialize(W)

Orion.@dist_array H = Orion.randn(K, dim_y)
Orion.@dist_array H = Orion.map(H, map_init_param, map_values = true)
Orion.materialize(H)

error_vec = Vector{Float64}()
time_vec = Vector{Float64}()
start_time = now()

W_grad = zeros(K)
H_grad = zeros(K)

@time for iteration = 1:num_iterations
    Orion.@parallel_for for rating in ratings
        x_idx = rating[1][1]
        y_idx = rating[1][2]
        rv = rating[2]

        W_row = @view W[:, x_idx]
        H_row = @view H[:, y_idx]
        pred = dot(W_row, H_row)
        diff = rv - pred
        W_grad .= -2 * diff .* H_row
        H_grad .= -2 * diff .* W_row
        W[:, x_idx] .= W_row .- step_size .* W_grad
        H[:, y_idx] .= H_row .- step_size .* H_grad
    end
    @time if iteration % 4 == 1 ||
             iteration == num_iterations
        println("evaluate model")
        Orion.@parallel_for for rating in ratings
            x_idx = rating[1][1]
            y_idx = rating[1][2]
            rv = rating[2]
            W_row = @view W[:, x_idx]
            H_row = @view H[:, y_idx]
            pred = dot(W_row, H_row)
            err += (rv - pred) ^ 2
        end
        err = Orion.get_aggregated_value(:err, :+)
        curr_time = now()
        elapsed = Int(Dates.value(curr_time - start_time)) / 1000
        println("iteration = ", iteration, " elapsed = ", elapsed, " err = ", err)
        Orion.reset_accumulator(:err)
        push!(error_vec, err)


        push!(time_vec, elapsed)
    end
end
println(error_vec)
println(time_vec)
Orion.stop()
exit()

A.2 Sparse Logistic Regression

include("/path/to/orion/src/julia/orion.jl")
Orion.set_lib_path("/path/to/orion/lib/liborion_driver.so")

const master_ip = "127.0.0.1"
const master_port = 10000
const comm_buff_capacity = 1024
const num_executors = 16
const num_servers = 16

Orion.glog_init()
Orion.init(master_ip, master_port, comm_buff_capacity, num_executors,
           num_servers)

const data_path = "file:///proj/BigLearning/jinlianw/data/kdda"
const num_iterations = 64
const step_size = Float32(0.00001)
const num_features = 20216830

Orion.@accumulator err = Float32(0)
Orion.@accumulator loss = Float32(0)
Orion.@accumulator line_cnt = 0

Orion.@share function parse_line(index::Int64, line::AbstractString)
    global line_cnt += 1
    tokens = split(strip(line), ' ')
    label = parse(Int64, tokens[1])
    if label == -1
        label = 0
    end
    i = 1
    feature_vec = Vector{Tuple{Int64, Float32}}(length(tokens) - 1)
    for token in tokens[2:end]
        feature = split(token, ":")
        feature_id = parse(Int64, feature[1])
        @assert feature_id >= 1
        feature_val = parse(Float32, feature[2])
        feature_vec[i] = (feature_id, feature_val)
        i += 1
    end
    return ((index,), (label, feature_vec))
end


Orion.@dist_array samples_mat = Orion.text_file(data_path,
                                                parse_line,
                                                is_dense = true,
                                                with_line_number = true,
                                                new_keys = true,
                                                num_dims = 1)
Orion.materialize(samples_mat)

line_cnt = Orion.get_aggregated_value(:line_cnt, :+)
println("number of lines read = ", line_cnt)

Orion.@dist_array weights = Orion.rand(num_features)
Orion.materialize(weights)

Orion.@share function sigmoid(z)
    return Float32(1.0) ./ (Float32(1.0) .+ exp(-z))
end

Orion.@share function safe_log(x)
    if abs(x) < Float32(1e-15)
        x = Float32(1e-15)
    end
    return log(x)
end

Orion.@dist_array weights_buf = Orion.create_sparse_dist_array_buffer((weights.dims...,), Float32(0.0))
Orion.materialize(weights_buf)

Orion.@share function apply_buffered_update(key, weight, update)
    return weight + update
end

Orion.set_write_buffer(weights_buf, weights, apply_buffered_update)

error_vec = Vector{Float32}()
loss_vec = Vector{Float32}()
time_vec = Vector{Float64}()
start_time = now()

for iteration = 1:num_iterations
    Orion.@parallel_for for sample in samples_mat
        sum = 0.0
        label = sample[2][1]
        features = sample[2][2]
        for feature in features
            fid = feature[1]
            fval = feature[2]
            sum += weights[fid] * fval
        end
        diff = sigmoid(sum) - label
        for feature in features
            fid = feature[1]
            fval = feature[2]
            weights_buf[fid] -= step_size * fval * diff
        end
    end


    if iteration % 1 == 0 ||
       iteration == num_iterations
        Orion.@parallel_for for sample in samples_mat
            sum = 0.0
            label = sample[2][1]
            features = sample[2][2]
            for feature in features
                fid = feature[1]
                fval = feature[2]
                sum += weights[fid] * fval
            end

            if label == 1
                loss += -safe_log(sigmoid(sum))
            else
                loss += -safe_log(1 - sigmoid(sum))
            end
            diff = sigmoid(sum) - label
            err += abs2(diff)
        end
        err = Orion.get_aggregated_value(:err, :+)
        loss = Orion.get_aggregated_value(:loss, :+)
        curr_time = now()
        elapsed = Int(Dates.value(curr_time - start_time)) / 1000
        println("iteration = ", iteration, " elapsed = ", elapsed, " err = ", err, " loss = ", loss)
        push!(error_vec, err)
        push!(loss_vec, loss)
        push!(time_vec, elapsed)
        Orion.reset_accumulator(:err)
        Orion.reset_accumulator(:loss)
    end
end

println(error_vec)
println(loss_vec)
println(time_vec)
Orion.stop()
exit()


Bibliography

[1] TensorFlow: control dependencies. https://www.tensorflow.org/api_docs/ python/tf/control_dependencies. 3.2.2 [2] Count Lines of Code. http://cloc.sourceforge.net/. 3.5.1 [3] ClueWeb. https://lemurproject.org/clueweb12/. 2.4.2 [4] TensorFlow Eager Execution. https://www.tensorflow.org/guide/eager/,. 3.2.5 [5] Eager: Distributed Execution. https://github.com/tensorflow/tensorflow/ issues/14129,. 3.2.5 [6] . https://hadoop.apache.org/. 1.7 [7] Julia Micro-Benchmark. https://julialang.org/benchmarks/. 3.3.1 [8] NumPy. http://www.numpy.org//. 3.3.1 [9] Pyston 0.6.1 released, and future plans. https://blog.pyston.org/2017/01/31/ pyston-0-6-1-released-and-future-plans/. 3.3.1 [10] TensorFlow. https://github.com/tensorflow/tensorflow. 4.3 [11] Google searches for holy grail of Python performance. https://arstechnica.com/ information-technology/2009/03/google-launches-project-to-- python-performance-by-5x/. 3.3.1 [12] Word2Vec. https://code.google.com/archive/p/word2vec/. 3.6 [13] Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016. URL https://www.usenix. org/system/files/conference/osdi16/osdi16-abadi.pdf. 1.7.1, ??, 1.7.2, ??, 3.1.3, 3.2.1, 3.2.2, ??, 3.5.6, 3.6, 4.1 [14] Firas Abuzaid, Joseph K. Bradley, Feynman T. Liang, Andrew Feng, Lee Yang, Matei Za- haria, and Ameet S. Talwalkar. Yggdrasil: An optimized system for training deep deci- sion trees at scale. In Advances in Neural Information Processing Systems 29: Annual Confer- ence on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 3810–3818, 2016. URL http://papers.nips.cc/paper/6366-yggdrasil- an-optimized-system-for-training-deep-decision-trees-at-scale. 3.6 [15] A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. J. Smola. Scalable inference in latent variable models. In WSDM ’12: Proceedings of the fifth ACM international conference on


Web search and data mining, pages 123–132, New York, NY, USA, 2012. ACM. 1.7, ?? [16] Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catan- zaro, Jingdong Chen, Mike Chrzanowski, Adam Coates, Greg Diamos, Erich Elsen, Jesse Engel, Linxi Fan, Christopher Fougner, Awni Y. Hannun, Billy Jun, Tony Han, Patrick LeGresley, Xiangang Li, Libby Lin, Sharan Narang, Andrew Y. Ng, Sherjil Ozair, Ryan Prenger, Sheng Qian, Jonathan Raiman, Sanjeev Satheesh, David Seetapun, Shubho Sen- gupta, Chong Wang, Yi Wang, Zhiqian Wang, Bo Xiao, Yan Xie, Dani Yogatama, Jun Zhan, and Zhenyao Zhu. Deep speech 2 : End-to-end speech recognition in english and mandarin. In Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 173–182, 2016. URL http: //jmlr.org/proceedings/papers/v48/amodei16.html. 4.2 [17] Ganesh Ananthanarayanan, Srikanth Kandula, Albert Greenberg, Ion Stoica, Yi Lu, Bikas Saha, and Edward Harris. Reining in the outliers in map-reduce clusters using mantri. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, OSDI’10, pages 1–16, Berkeley, CA, USA, 2010. USENIX Association. 2.1.2 [18] Ganesh Ananthanarayanan, Ali Ghodsi, Scott Shenker, and Ion Stoica. Effective straggler mitigation: Attack of the clones. In Presented as part of the 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13), pages 185–198, Lombard, IL, 2013. USENIX. 2.1.2 [19] Emmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, and Doina Precup. Conditional com- putation in neural networks for faster models. CoRR, abs/1511.06297, 2015. URL http: //arxiv.org/abs/1511.06297. 4.2 [20] Yoshua Bengio, Nicholas Léonard, and Aaron C. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. CoRR, abs/1308.3432, 2013. URL http://arxiv.org/abs/1308.3432. 4.2 [21] Jeff Bezanson, Alan Edelman, Stefan Karpinski, and Viral B. Shah. Julia: A fresh approach to numerical computing. SIAM Review, 59(1):65–98, 2017. doi: 10.1137/141000671. URL https://doi.org/10.1137/141000671. 3.2.5, 3.3 [22] Simone Campanoni, Timothy Jones, Glenn Holloway, Vijay Janapa Reddi, Gu-Yeon Wei, and David Brooks. Helix: Automatic parallelization of irregular programs for chip multi- processing. In Proceedings of the Tenth International Symposium on Code Generation and Opti- mization, CGO ’12, pages 84–93, New York, NY, USA, 2012. ACM. ISBN 978-1-4503-1206- 6. doi: 10.1145/2259016.2259028. URL http://doi.acm.org/10.1145/2259016. 2259028. 3.3.3 [23] Rong Chen, Jiaxin Shi, Yanzhe Chen, and Haibo Chen. Powerlyra: Differentiated graph computation and partitioning on skewed graphs. In Proceedings of the Tenth European Con- ference on Computer Systems, EuroSys ’15, 2015. ??, ??, 3.2.3 [24] Tianqi Chen and Carlos Guestrin. : A scalable tree boosting system. In Proceedings of the 22Nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pages 785–794, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-4232-2. doi: 10.1145/2939672.2939785. URL http://doi.acm.org/10.1145/2939672.2939785. 1.7, 3.6 [25] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. Mxnet: A flexible and efficient machine learning library


for heterogeneous distributed systems. CoRR, abs/1512.01274, 2015. URL http://dblp. uni-trier.de/db/journals/corr/corr1512.html#ChenLLLWWXXZZ15. 3.6 [26] James Cipar, Qirong Ho, Jin Kyu Kim, Seunghak Lee, Gregory R. Ganger, Garth Gibson, Kimberly Keeton, and Eric Xing. Solving the straggler problem with bounded staleness. In Presented as part of the 14th Workshop on Hot Topics in Operating Systems, Berkeley, CA, 2013. USENIX. 2.1.2 [27] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Low precision arithmetic for deep learning. CoRR, abs/1412.7024, 2014. URL http://arxiv.org/abs/1412.7024. 2.3.1 [28] Henggang Cui, James Cipar, Qirong Ho, Jin Kyu Kim, Seunghak Lee, Abhimanu Kumar, Jinliang Wei, Wei Dai, Gregory R. Ganger, Phillip B. Gibbons, Garth A. Gibson, and Eric P. Xing. Exploiting bounded staleness to speed up big data analytics. In 2014 USENIX Annual Technical Conference (USENIX ATC 14), pages 37–48, Philadelphia, PA, June 2014. USENIX Association. 1.7, ??, 1.7.2, 2.1.1, 2.1.2, 2.2.2 [29] Henggang Cui, Alexey Tumanov, Jinliang Wei, Lianghong Xu, Wei Dai, Jesse Haber- Kucharsky, Qirong Ho, Gregory R. Ganger, Phillip B. Gibbons, Garth A. Gibson, and Eric P. Xing. Exploiting iterative-ness for parallel ml computations. In Proceedings of the ACM Sym- posium on , SOCC ’14, pages 5:1–5:14, New York, NY, USA, 2014. ACM. ??, 1.7.2, 2.1.1, 2.1.2, 2.2.3, 3.2.4, 3.4.3, ?? [30] Henggang Cui, Hao Zhang, Gregory R. Ganger, Phillip B. Gibbons, and Eric P. Xing. Geeps: Scalable deep learning on distributed gpus with a gpu-specialized parameter server. In Proceedings of the Eleventh European Conference on Computer Systems, EuroSys ’16, pages 4:1– 4:16, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-4240-7. doi: 10.1145/2901318. 2901323. URL http://doi.acm.org/10.1145/2901318.2901323. 4.4.4 [31] Leonardo Dagum and Ramesh Menon. Openmp: An industry-standard for shared- memory programming. IEEE Comput. Sci. Eng., 5(1):46–55, January 1998. ISSN 1070-9924. doi: 10.1109/99.660313. URL https://doi.org/10.1109/99.660313. 3.3.3 [32] Wei Dai, Abhimanu Kumar, Jinliang Wei, Qirong Ho, Garth A. Gibson, and Eric P. Xing. High-performance distributed ML at scale through parameter server consistency models. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 25-30, 2015, Austin, Texas, USA., pages 79–87, 2015. 2.1.2, 2.2.2 [33] Andrew S. Davis and Itamar Arel. Low-rank approximations for conditional feedforward computation in deep neural networks. CoRR, abs/1312.4461, 2013. URL http://arxiv. org/abs/1312.4461. 4.2 [34] Jeff Dean, David A. Patterson, and Cliff Young. A new golden age in computer architecture: Empowering the machine-learning revolution. IEEE Micro, 38(2):21–29, 2018. doi: 10.1109/ MM.2018.112130030. URL https://doi.org/10.1109/MM.2018.112130030. 4.2 [35] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: Simplified data processing on large clus- ters. In Proceedings of the 6th Conference on Symposium on Opearting Systems Design & Imple- mentation - Volume 6, OSDI’04, pages 10–10, Berkeley, CA, USA, 2004. USENIX Association. 1.7.1, ?? [36] John C. Duchi, Michael I. Jordan, and H. Brendan McMahan. Estimation, optimization, and parallelism when data is sparse. In Advances in Neural Information Processing Sys-


tems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States., pages 2832–2840, 2013. URL http://papers.nips.cc/paper/4939-estimation-optimization- and-parallelism-when-data-is-sparse. 1.4 [37] Hsiang fu Yu, Hung yi Lo, Hsun ping Hsieh, Jing kai Lou, Todd G. Mckenzie, Jung wei Chou, Po han Chung, Chia hua Ho, Chun fu Chang, Jui yu Weng, En syu Yan, Che wei Chang, Tsung ting Kuo, Po Tzu Chang, Chieh Po, Chien yuan Wang, Yi hung Huang, Yu xun Ruan, Yu shi Lin, Shou de Lin, Hsuan tien Lin, and Chih jen Lin. Feature en- gineering and classifier ensemble for kdd cup 2010. In In JMLR Workshop and Conference Proceedings, 2011. 3.5.1 [38] Garth Gibson, Gary Grider, Andree Jacobson, and Wyatt Lloyd. Probe: A thousand-node experimental cluster for computer systems research. volume 38, June 2013. 2.4.1 [39] Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. Power- graph: Distributed graph-parallel computation on natural graphs. In Presented as part of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI 12), pages 17–30, Hollywood, CA, 2012. USENIX. 1.7, ??, ??, 3.2.3, 3.3.3, ?? [40] Joseph E. Gonzalez, Reynold S. Xin, Ankur Dave, Daniel Crankshaw, Michael J. Franklin, and Ion Stoica. Graphx: Graph processing in a distributed dataflow framework. In Proceed- ings of the 11th USENIX Conference on Operating Systems Design and Implementation, OSDI’14, pages 599–613, Berkeley, CA, USA, 2014. USENIX Association. ISBN 978-1-931971-16-4. URL http://dl.acm.org/citation.cfm?id=2685048.2685096. ?? [41] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. CoRR, abs/1502.02551, 2015. URL http: //arxiv.org/abs/1502.02551. 2.3.1 [42] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015. URL http://arxiv.org/abs/1512.03385. 4.2 [43] Qirong Ho, James Cipar, Henggang Cui, Seunghak Lee, Jin Kyu Kim, Phillip B. Gibbons, Garth A Gibson, Greg Ganger, and Eric P Xing. More effective distributed ml via a stale synchronous parallel parameter server. In C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahra- mani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 1223–1231. Curran Associates, Inc., 2013.1, 1.7, ??, 1.7.2, 2.1.2, 2.2.2, 3.1.2 [44] Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 1729–1739, 2017. URL http://papers.nips.cc/paper/6770-train-longer-generalize-better- closing-the-generalization-gap-in-large-batch-training-of-neural- networks. 3.1.2,2 [45] Mike Innes, Stefan Karpinski, Viral Shah, David Barber, Pontus Stenetorp, Tim Besard, James Bradbury, Valentin Churavy, Simon Danisch, Alan Edelman, Jon Malmaud, Jarrett Revels, and Deniz Yuret. On machine learning and programming languages. SysML ’15, Palo Alto, California, Feb 2018. 3.3.1 [46] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. Dryad: Dis-


[47] K. L. Johnson, M. F. Kaashoek, and D. A. Wallach. Crl: High-performance all-software distributed shared memory. In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles, SOSP ’95, pages 213–226, New York, NY, USA, 1995. ACM. ISBN 0-89791-715-4. doi: 10.1145/224056.224073. URL http://doi.acm.org/10.1145/224056.224073. 1.3
[48] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. Lightgbm: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 3149–3157, 2017. URL http://papers.nips.cc/paper/6907-lightgbm-a-highly-efficient-gradient-boosting-decision-tree. 1.7, 3.6
[49] Ken Kennedy and John R. Allen. Optimizing Compilers for Modern Architectures: A Dependence-based Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2002. ISBN 1-55860-286-0. 3.4.1
[50] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. CoRR, abs/1609.04836, 2016. URL http://arxiv.org/abs/1609.04836. 3.1.2, 2
[51] Jin Kyu Kim, Qirong Ho, Seunghak Lee, Xun Zheng, Wei Dai, Garth A. Gibson, and Eric P. Xing. Strads: A distributed framework for scheduled model parallel machine learning. In Proceedings of the Eleventh European Conference on Computer Systems, EuroSys ’16, pages 5:1–5:16, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-4240-7. doi: 10.1145/2901318.2901331. URL http://doi.acm.org/10.1145/2901318.2901331. 1.4, ??, 1.7.2, 3.1.2, 3.1.3, ??, 3.2.4, 3.4.3, ??, 3.5.5
[52] Kai Li and Paul Hudak. Memory coherence in shared virtual memory systems. ACM Trans. Comput. Syst., 7(4):321–359, November 1989. ISSN 0734-2071. doi: 10.1145/75104.75105. URL http://doi.acm.org/10.1145/75104.75105. 1.3
[53] Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. Scaling distributed machine learning with the parameter server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 583–598, Broomfield, CO, October 2014. USENIX Association. 1.7, ??, 1.7.2, 2.1.2, ??, 3.2.4, 3.4.3
[54] Amy W. Lim and Monica S. Lam. Maximizing parallelism and minimizing synchronization with affine transforms. In Proceedings of the 24th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL ’97, pages 201–214, New York, NY, USA, 1997. ACM. ISBN 0-89791-853-3. doi: 10.1145/263699.263719. URL http://doi.acm.org/10.1145/263699.263719. 3.3.3
[55] Moshe Looks, Marcello Herreshoff, DeLesley Hutchins, and Peter Norvig. Deep learning with dynamic computation graphs. CoRR, abs/1702.02181, 2017. URL http://arxiv.org/abs/1702.02181. 4.4.4


[56] Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein. Graphlab: A new framework for parallel machine learning. In UAI, 2010. 3.2.3
[57] Yucheng Low, Danny Bickson, Joseph Gonzalez, Carlos Guestrin, Aapo Kyrola, and Joseph M. Hellerstein. Distributed graphlab: A framework for machine learning and data mining in the cloud. Proc. VLDB Endow., 5(8):716–727, April 2012. 1.4, 1.7, 1.7.3, 3.1.1, 3.2.3
[58] Grzegorz Malewicz, Matthew H. Austern, Aart J.C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: A system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD ’10, pages 135–146, New York, NY, USA, 2010. ACM. ??, 1.7.3
[59] Dror E. Maydan, John L. Hennessy, and Monica S. Lam. Efficient and exact data dependence analysis. In Proceedings of the ACM SIGPLAN 1991 Conference on Programming Language Design and Implementation, PLDI ’91, pages 1–14, New York, NY, USA, 1991. ACM. ISBN 0-89791-428-7. doi: 10.1145/113445.113447. URL http://doi.acm.org/10.1145/113445.113447. 3.4.1
[60] H. Brendan McMahan and Matthew Streeter. Delay-tolerant algorithms for asynchronous distributed online learning. Advances in Neural Information Processing Systems (NIPS), 2014. 2.4.2, 3.5.4
[61] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013. URL http://dblp.uni-trier.de/db/journals/corr/corr1301.html#abs-1301-3781. 3.6
[62] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. CoRR, abs/1310.4546, 2013. URL http://arxiv.org/abs/1310.4546. 3.6
[63] Philipp Moritz, Robert Nishihara, Ion Stoica, and Michael I. Jordan. Sparknet: Training deep networks in spark. CoRR, abs/1511.06051, 2015. URL http://arxiv.org/abs/1511.06051. 1.5, 3.2.1
[64] Derek G. Murray, Malte Schwarzkopf, Christopher Smowton, Steven Smith, Anil Madhavapeddy, and Steven Hand. Ciel: A universal execution engine for distributed data-flow computing. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, NSDI’11, pages 113–126, Berkeley, CA, USA, 2011. USENIX Association. URL http://dl.acm.org/citation.cfm?id=1972457.1972470. 1.5, 1.7.1, ??, 3.2.1
[65] Derek G. Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martín Abadi. Naiad: A timely dataflow system. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP ’13, pages 439–455, New York, NY, USA, 2013. ACM. 1.7.1, ??
[66] John K. Ousterhout. Scripting: Higher-level programming for the 21st century. IEEE Computer, 31(3):23–30, 1998. URL http://dblp.uni-trier.de/db/journals/computer/computer31.html#Ousterhout98. 3.2.5, 3.3.1
[67] Russell Power and Jinyang Li. Piccolo: Building fast, distributed programs with partitioned tables. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation, OSDI’10, pages 1–14, Berkeley, CA, USA, 2010. USENIX Association. 1.3, ??, 1.7.2


[68] Jelica Protic, Milo Tomasevic, and Veljko Milutinovic. Distributed shared memory: Concepts and systems. IEEE Parallel Distrib. Technol., 4(2):63–79, June 1996. ISSN 1063-6552. doi: 10.1109/88.494605. URL http://dx.doi.org/10.1109/88.494605. 1.3
[69] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In J. Shawe-Taylor, R.S. Zemel, P.L. Bartlett, F. Pereira, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 693–701. Curran Associates, Inc., 2011. 1.4, 3.1.2
[70] A. W. Savich and M. Moussa. Resource efficient arithmetic effects on rbm neural network solution quality using mnist. In 2011 International Conference on Reconfigurable Computing and FPGAs (ReConFig 2011), volume 00, pages 35–40, 11 2011. doi: 10.1109/ReConFig.2011.79. URL doi.ieeecomputersociety.org/10.1109/ReConFig.2011.79. 2.3.1
[71] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. CoRR, abs/1701.06538, 2017. URL http://arxiv.org/abs/1701.06538. 4.2, 4.2, 2, 3
[72] Carlos H. C. Teixeira, Alexandre J. Fonseca, Marco Serafini, Georgos Siganos, Mohammed J. Zaki, and Ashraf Aboulnaga. Arabesque: A system for distributed graph mining. In Proceedings of the 25th Symposium on Operating Systems Principles, SOSP ’15, pages 425–440, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3834-9. doi: 10.1145/2815400.2815410. URL http://doi.acm.org/10.1145/2815400.2815410. ??
[73] Leslie G. Valiant. A bridging model for parallel computation. Commun. ACM, 33(8):103–111, August 1990. 1
[74] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017. URL http://arxiv.org/abs/1706.03762. 4.3
[75] Ashish Vaswani, Samy Bengio, Eugene Brevdo, Francois Chollet, Aidan N. Gomez, Stephan Gouws, Llion Jones, Łukasz Kaiser, Nal Kalchbrenner, Niki Parmar, Ryan Sepassi, Noam Shazeer, and Jakob Uszkoreit. Tensor2tensor for neural machine translation. CoRR, abs/1803.07416, 2018. URL http://arxiv.org/abs/1803.07416. 4.3
[76] Jinliang Wei, Wei Dai, Aurick Qiao, Qirong Ho, Henggang Cui, Gregory R. Ganger, Phillip B. Gibbons, Garth A. Gibson, and Eric P. Xing. Managed communication and consistency for fast data-parallel iterative analytics. In Proceedings of the Sixth ACM Symposium on Cloud Computing, SoCC ’15, pages 381–394, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3651-2. doi: 10.1145/2806777.2806778. URL http://doi.acm.org/10.1145/2806777.2806778. (document), ??, 2, 1, 3.1.2, ??, 3.1.3, 3.2.4, ??
[77] Michael E. Wolf and Monica S. Lam. A loop transformation theory and an algorithm to maximize parallelism. IEEE Transactions on Parallel and Distributed Systems, 2(2):452–472, October 1991. 3.3.3, 3, 3
[78] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144, 2016. URL http://arxiv.org/abs/1609.08144. 4.2


[79] Wencong Xiao, Jilong Xue, Youshan Miao, Zhen Li, Cheng Chen, Ming Wu, Wei Li, and Lidong Zhou. Tux2: Distributed graph computation for machine learning. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17), pages 669–682, Boston, MA, 2017. USENIX Association. ISBN 978-1-931971-37-9. URL https://www.usenix.org/conference/nsdi17/technical-sessions/presentation/xiao. ??, 2.1.1, 3.2.3
[80] Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, and Jon Currey. Dryadlinq: A system for general-purpose distributed data-parallel computing using a high-level language. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, OSDI’08, pages 1–14, Berkeley, CA, USA, 2008. USENIX Association. URL http://dl.acm.org/citation.cfm?id=1855741.1855742. 1.7.1, ??, ??, 3.2.1
[81] Yuan Yu, Martín Abadi, Paul Barham, Eugene Brevdo, Mike Burrows, Andy Davis, Jeff Dean, Sanjay Ghemawat, Tim Harley, Peter Hawkins, Michael Isard, Manjunath Kudlur, Rajat Monga, Derek Gordon Murray, and Xiaoqiang Zheng. Dynamic control flow in large-scale machine learning. In Proceedings of the Thirteenth EuroSys Conference, EuroSys 2018, Porto, Portugal, April 23-26, 2018, pages 18:1–18:15, 2018. doi: 10.1145/3190508.3190551. URL http://doi.acm.org/10.1145/3190508.3190551. 3.2.5, 3, 4.4.4
[82] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauly, Michael J. Franklin, Scott Shenker, and Ion Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Presented as part of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 12), pages 15–28, San Jose, CA, 2012. USENIX. 1.3, 1.5, 1.7, 1.7.1, ??, ??, 3.2.1
[83] Mingxing Zhang, Yongwei Wu, Kang Chen, Xuehai Qian, Xue Li, and Weimin Zheng. Exploring the hidden dimension in graph processing. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 285–300, Savannah, GA, 2016. USENIX Association. ISBN 978-1-931971-33-1. URL https://www.usenix.org/conference/osdi16/technical-sessions/presentation/zhang-mingxing. ??, 3.2.3
[84] Xiaowei Zhu, Wenguang Chen, Weimin Zheng, and Xiaosong Ma. Gemini: A computation-centric distributed graph processing system. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 301–316, Savannah, GA, 2016. USENIX Association. ISBN 978-1-931971-33-1. URL https://www.usenix.org/conference/osdi16/technical-sessions/presentation/zhu. ??, 3.2.3
