DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS STOCKHOLM, SWEDEN 2018

Analysis and Comparison of Distributed Training Techniques for Deep Neural Networks in a Dynamic Environment

ERMIAS GEBREMESKEL

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Master in Computer Science
Date: June 26, 2018
Supervisors: Håkan Lane, Jim Dowling, and Robin Andersson
Examiner: Örjan Ekeberg
Swedish title: Analys och jämförelse av distribuerade träningstekniker för djupa neurala nätverk i en dynamisk miljö
School of Computer Science and Communication


Abstract

Deep learning models’ prediction accuracy tends to improve with the size of the model. The implication is that the amount of computational power needed to train models is continuously increasing. Distributed training tries to address this issue by spreading the computational load onto several devices. In theory, distributing computation onto N devices should give an N-fold performance improvement, yet in reality the improvement is rarely N-fold, due to communication and other overheads. This thesis studies the communication overhead incurred when distributing deep learning training. Hopsworks is a platform designed for data science. The purpose of this work is to explore a feasible way of deploying distributed deep learning training on a shared cluster and to analyze the performance of different distributed deep learning algorithms to be used on this platform. The findings of this study show that bandwidth-optimal communication algorithms like ring all-reduce scale better than many-to-one communication algorithms like parameter server, but are less fault tolerant. Furthermore, the system usage statistics collected revealed a network bottleneck when training is distributed over multiple machines. This work also shows that it is possible to run MPI on a Hadoop cluster, by building a prototype that orchestrates resource allocation, deployment, and monitoring of MPI-based training jobs. Even though the experiments did not cover different cluster configurations, the results are still relevant in showing what considerations need to be made when distributing deep learning training.

Keywords: deep learning, large scale distributed deep learning, data parallelism.

Sammanfattning

Träffsäkerheten hos djupinlärningsmodeller tenderar att förbättras i relation med storleken på modellen. Implikationen blir att mängden beräkningskraft som krävs för att träna modeller ökar kontinuerligt. Distribuerad djupinlärning försöker lösa detta problem genom att distribuera beräkningsbelastningen på flera enheter. Att distribuera beräkningarna på N enheter skulle i teorin innebära en linjär skalbarhet (xN). I verkligheten stämmer detta sällan på grund av overhead från nätverkskommunikation eller I/O. Hopsworks är en dataanalys- och maskininlärningsplattform. Syftet med detta arbete är att utforska ett möjligt sätt att utföra distribuerad djupinlärningsträning på ett delat datorkluster, samt analysera prestandan hos olika algoritmer för distribuerad djupinlärning att använda i plattformen. Resultaten i denna studie visar att nätverksoptimala algoritmer såsom ring all-reduce skalar bättre för distribuerad djupinlärning än många-till-en-kommunikationsalgoritmer såsom parameter server, men är inte lika feltoleranta. Insamlad data från experimenten visade på en flaskhals i nätverket vid träning på flera maskiner. Detta arbete visar även att det är möjligt att exekvera MPI-program på ett hadoopkluster genom att bygga en prototyp som orkestrerar resursallokering, distribution och övervakning av exekvering. Trots att experimenten inte täcker olika klusterkonfigurationer så visar resultaten på vilka faktorer som bör tas hänsyn till vid distribuerad träning av djupinlärningsmodeller.

Contents

1 Introduction
  1.1 Research Question
  1.2 Scope
  1.3 Sustainability and Relevance

2 Background
  2.1 Training Neural Networks
    2.1.1 Stochastic Gradient Descent (SGD)
  2.2 Distributed Training in Deep Learning
    2.2.1 Distributing SGD using MapReduce
  2.3 Algorithms for Collective Communication
    2.3.1 Message Passing Interface (MPI)
  2.4 Data Parallelism
    2.4.1 Synchronous SGD
    2.4.2 Asynchronous SGD
    2.4.3 Parameter Server
    2.4.4 Ring All-Reduce
  2.5 Model Parallelism
    2.5.1 Partitioning Neural Network Model Graphs
    2.5.2 Device Placement Optimization
  2.6 Hybrid Data and Model Parallelism
  2.7 TensorFlow
    2.7.1 Distributed TensorFlow
  2.8 Resource Management in Hops-YARN
  2.9 Spark on YARN
    2.9.1 TensorFlow On Spark
  2.10 Evaluation
  2.11 Related Work

3 Method
  3.1 Datasets and Models
  3.2 Cluster Setups
    3.2.1 Hardware Specification
    3.2.2 Distributed Deep Learning and Big Data Frameworks
  3.3 Experiment Design
    3.3.1 Batch Size
    3.3.2 Number of Workers
    3.3.3 System Monitoring
  3.4 Model Deployment
    3.4.1 Parameter Server with TensorFlowOnSpark
    3.4.2 Ring All-Reduce with Horovod
    3.4.3 Evaluation of Model Deployment
  3.5 Data Collection

4 Results
  4.1 Parameter Server
    4.1.1 Synchronous Update
    4.1.2 Asynchronous Update
    4.1.3 Multi-Node Synchronous Update
  4.2 Ring All-Reduce
  4.3 Scalability
  4.4 Resource Utilization
  4.5 Model Deployment

5 Discussion and Conclusion
  5.1 Scalability
    5.1.1 Number of Parameter Servers
    5.1.2 Asynchronous Update
    5.1.3 Ring All-Reduce
  5.2 Fault Tolerance
  5.3 Resource Utilization
  5.4 Model Deployment
  5.5 Possible Sources of Error
  5.6 Conclusion
  5.7 Future Work

Bibliography

A Complete Results

List of Figures

2.1 Parameter server model
2.2 Model parallelism
3.1 Cluster setup
3.2 Starting and monitoring MPI
4.1 Processed images/sec using parameter server sync update mode
4.2 Processed images/sec using parameter server async update mode
4.3 System usage parameter server multi-node
4.4 Processed images/sec using ring all-reduce
4.5 Speedup with parameter server
4.6 Speedup with ring all-reduce
4.7 Ring all-reduce and parameter server system usage
4.8 One and two parameter servers system usage
5.1 InfiniBand usage in ring all-reduce

List of Tables

3.1 Models used in experiments
3.2 Experiment setup
A.1 Parameter server sync update results
A.2 Two parameter servers sync update results
A.3 Parameter servers async update results
A.4 Ring all-reduce results

List of Listings

1 Sample cluster specification
2 Sample mpirun hostfile
3 Sample mpirun script
4 System usage commands

Chapter 1

Introduction

Statistical machine learning initially operated on handcrafted features extracted from datasets by human experts. The handcrafted features were then used to train a model using methods like maximum likelihood, Support Vector Machines, k-Nearest Neighbors, k-means, decision trees, and regression algorithms. This approach works for small tasks where the informative features of a dataset are easy to identify. Unfortunately, identifying informative features for real-world problems like image recognition, speech recognition, and natural language processing is extremely difficult. Deep neural networks attempt to solve this problem by including feature extraction in the learning process. This learning of hierarchical features from raw data with no task-specific prior knowledge is much harder than training prediction models, therefore requiring significantly more training data and computational resources. The recent success of deep learning (DL) is thus largely attributed to advances in computing capability and the availability of large amounts of labeled data [18, 26, 27]. Even when done on GPUs, training deep learning models on large datasets can take an excessively long time if done on a single machine. Furthermore, training on a single machine with limited resources restricts the size of the model that can be trained. Distributed deep learning tries to address these limitations by decomposing a machine learning problem onto multiple machines and devices. The decomposition can be done across two dimensions: (1) the data dimension and (2) the model dimension. However, scaling up is not merely adding computational resources. The main consideration to make when distributing computation onto multiple machines is


communication cost, in particular the ratio of computation to data management housekeeping. Distributed deep learning training is only efficient if the parameters being communicated are also computationally expensive. Amdahl [4] speculates that if data management overhead in a parallel program averages around 10% of an operation, it will force at least 25% of the computation to be sequential. In this paper, the performance and scalability of state-of-the-art distributed training algorithms will be empirically analyzed and their performance compared in a shared cluster environment with container-based resource management. The rest of this paper is organized as follows. Chapter 2 gives the necessary background to follow this paper and some related work, Chapter 3 presents the research design and performed experiments, and Chapter 4 reports the findings of these experiments. Finally, Chapter 5 discusses the results and presents conclusions based on the analysis.

1.1 Research Question

The accuracy of deep learning models improves with the amount of data used for training and the size of the model [17, 11, 10, 28], requiring more computational resources. Researchers and professionals that have access to GPU clusters typically use them for: (1) running parallel experiments (for example, to establish good hyper-parameters, learning rate, number of layers, choice of model architecture, etc.) [6] and (2) expensive training jobs, distributed over many GPUs on many servers [23]. Available deep learning frameworks like TensorFlow [1] and others [24, 9, 12] use different approaches to scale out and speed up training through distribution. Some, like TensorFlow, Caffe2, and Torch, have native support for distributed training, while others, like Caffe (1.0), rely on Apache Spark [47], a cluster computing framework that uses a programming model similar to MapReduce [14], to handle distribution. However, choosing the right distributed training system architecture for a given model is far from trivial. When training deep neural networks in a distributed manner, one needs to know:

1. Whether performance is affected by the number of parameter servers.

2. Whether updating parameters in an asynchronous manner improves performance.

3. How the performance scales with each added worker when using ring all-reduce compared to parameter server.

4. How easy it is to deploy a model, with the chosen distributed training system architecture, on a cluster with shared resources.

In this work, these questions will be analyzed for two distributed training system architectures, to identify bottlenecks and limitations in the methods. This paper will also explore mechanisms for deploying distributed deep learning training on a cluster with shared resources.

1.2 Scope

Running parallel experiments to establish good hyper-parameters and/or model architecture is a less interesting problem to analyze because: (1) there is no communication between processes running in parallel and (2) the amount of training data needed is significantly smaller than for the actual model training. Thus, the analysis in this paper will only be concerned with expensive training jobs that can benefit from distributed training, focusing mainly on the communication overhead incurred when training is distributed. When distributing deep learning training some loss of accuracy might occur, or more training may be needed to achieve the same level of accuracy as in non-distributed training. In this work the loss of accuracy that might occur when distributing training is not investigated. Moreover, the performance of distributed deep learning training can be affected by other factors, like disk I/O, that are ignored in this work. Although deep learning can refer to any neural network with a large number of layers and parameters, here we will only be concerned with models trained using some variant of Stochastic Gradient Descent (SGD) to limit the scope, but the analysis should generalize to most deep learning models.

1.3 Sustainability and Relevance

Distributing training requires a lot of computational resources. Thus, the economic and environmental costs of using these resources need to be

considered. This work will help in identifying the cost of distribution and whether this cost is justified by the gain. This is done by showing how much speedup is achieved with each added computational resource. This study will also help in identifying bottlenecks in system setups that can hamper the performance gain that can be obtained by distribution. The company Logical Clocks AB is a startup with the product Hadoop Open Platform-as-a-Service (Hops), a distribution of Apache Hadoop [46] that provides a web-based application to manage datasets and interactively analyze them. The Hops web-based application (Hopsworks) also provides CPU, GPU, and storage as managed resources on a cluster. This work is interesting to Logical Clocks because distributed deep learning is one of the services on Hopsworks, and they need to identify any limitations or bottlenecks in the cluster currently hosting the platform. This work is also relevant for anyone that wants to distribute deep learning training, by showing the limitations and the considerations that need to be made when choosing a distributed training architecture.

Chapter 2

Background

Artificial Neural Networks (ANN) are computational models (algorithms or actual hardware) that are modeled after a highly simplified mammalian cerebral cortex neuronal structure. The basic computational unit of a neural network is the perceptron, a simplified neuron model capable of finding a decision surface for linearly separable patterns. By layering two or more of these perceptrons in a feed-forward manner, neural networks can approximate arbitrary functions [22]. These multilayered perceptrons are usually trained using backpropagation, first introduced by Rumelhart et al. [36]. Backpropagation works faster than previously known approaches to learning, making deeper networks feasible. Deeper networks increase the ANN’s memorization capacity, making it possible to learn data representations, which is the basis for deep learning. Deep learning, also modeled after a highly simplified neocortical neuronal structure, attempts to mimic neural coding. The success of deep learning models such as deep neural networks (DNN), deep belief networks (DBN), and recurrent neural networks (RNN) depends on their ability to learn data representations. Training these models is mainly done through backpropagation using gradient descent and requires significantly more training data and computational resources than shallow networks.

2.1 Training Neural Networks

Training neural networks in a supervised manner is done through the backpropagation algorithm, comprising two phases: the forward pass,


which maps input to output, and the backward pass, which takes the output, calculates the error by comparing it with the desired output, and propagates the error back all the way to the input nodes. The weights associated with individual links in the network are updated proportionally to the portion of the error propagated back to them. The weight updates can be done either in batches (where all the training samples pass through the algorithm before the weights are updated) or after each training sample, presented in random order. Stochastic Gradient Descent (or some flavor of it, like AdaGrad, RMSProp, or Adam) is the most commonly used update rule (optimizer) for training deep neural networks [10, 7]. This process is repeated for all training samples multiple times (epochs) until convergence, i.e. when the error drops below a given value or when the validation error starts to go up. Most of the operations in the backpropagation algorithm can be performed as matrix-vector operations, which are highly parallel and numerically intensive. They can thus be offloaded to the many cores of a GPU, which can perform floating-point arithmetic at a much higher rate than a CPU.

2.1.1 Stochastic Gradient Descent (SGD)

The batch gradient descent algorithm starts with a randomly initialized parameter θ and repeatedly updates it with the gradient of the objective function J(θ):

\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)    (for j = 0, ..., n)    (2.1)

where α is the learning rate. For the least mean squares (LMS) cost function:

J(\theta) = \frac{1}{2} \left( h_\theta(x) - y \right)^2    (2.2)

\frac{\partial}{\partial \theta_j} J(\theta) = \left( h_\theta(x) - y \right) x_j    (2.3)

For m training examples the update rule can be written as:

\theta_j := \theta_j + \alpha \frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}    (for all j)    (2.4)

Therefore, the computational cost of gradient descent scales linearly with the training dataset size (m in 2.4). Stochastic Gradient Descent approximates gradient descent by sampling a single training example uniformly at random and performing the update. For a randomly chosen i ∈ {1, ..., m} the update rule is given by:

\theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}    (for all j)    (2.5)

Mini-batch gradient descent is another update rule that uses b random examples to perform the parameter update. For mini-batch gradient descent, equation 2.5 can be rewritten as:

\theta_j := \theta_j + \alpha \frac{1}{b} \sum_{k=i}^{i+b} \left( y^{(k)} - h_\theta(x^{(k)}) \right) x_j^{(k)}    (for all j)    (2.6)

for i := 0, ..., (m − b) and 1 < b < m. When b = 1 this algorithm is identical to SGD (equation 2.5) and when b = m it is batch gradient descent (equation 2.4).
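The update rules above translate directly into code. The following NumPy sketch of equation 2.6 is an illustration only (hypothetical function and variable names, assuming the linear hypothesis h_θ(x) = θᵀx used in the LMS cost above); it is not the benchmark code used later in this thesis.

import numpy as np

def minibatch_sgd(X, y, alpha=0.01, b=32, epochs=10):
    """Mini-batch gradient descent for the LMS cost (equation 2.6).

    X: (m, n) matrix of training examples, y: (m,) vector of targets.
    Assumes the linear hypothesis h_theta(x) = theta^T x.
    """
    m, n = X.shape
    theta = np.zeros(n)                      # randomly initialized in practice
    for _ in range(epochs):
        perm = np.random.permutation(m)      # present examples in random order
        for i in range(0, m - b + 1, b):
            batch = perm[i:i + b]
            error = y[batch] - X[batch] @ theta          # (y - h_theta(x))
            theta += alpha * (X[batch].T @ error) / b    # averaged gradient step
        # b = 1 recovers SGD (eq. 2.5); b = m recovers batch GD (eq. 2.4)
    return theta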

2.2 Distributed Training in Deep Learning

Deep learning models need to be trained on big datasets. Even when training is done on GPUs, it can take days or even weeks on a single machine. There are two dimensions over which deep learning model training can be parallelized: data parallelism (across the data dimension), where each machine contains a complete model replica but processes only part of the data samples, and model parallelism (across the model dimension), where different parts of a model run on different machines in parallel. In both schemes some kind of synchronization between workers is required: in model parallelism neuron activities need to be communicated, while in data parallelism model parameters (weights and biases) are communicated to ensure all models are trained evenly [25]. The performance of these parallelization schemes is highly dependent on the model architecture. Notably, parallelism is only efficient if the parameters (units) being communicated are also computationally expensive. Therefore, when the per-weight computation is high, data parallelism is more efficient, while model parallelism is more efficient when the neuron activity is computationally expensive [25].

To make full use of distributed training, both dimensions of parallelism should be exploited to a degree that takes into account the communication architecture of the model. Other factors like weight update schemes, batch size, and communication algorithms can also affect the performance of these parallelization schemes.

2.2.1 Distributing SGD using MapReduce

The MapReduce [14] approach is applicable to problems that can be expressed as computing sums of functions over a training dataset. MapReduce refers to two separate tasks: (1) the map job and (2) the reduce job. The map job takes the data and computes a result in parallel on multiple workers that can be spread across multiple devices or machines. The reduce job then takes the output from all mappers and combines them to give the cumulative result. The mini-batch gradient descent algorithm given in equation 2.6 can be distributed using the map-reduce paradigm onto n machines as follows:

\mathrm{Machine}_j^{(l)} := \sum_{k=i}^{i+b/n} \left( y^{(k)} - h_\theta(x^{(k)}) \right) x_j^{(k)}    (for all j)    (2.7)

where l := 1, ..., n. Then a single reduce job will calculate:

\theta_j := \theta_j + \alpha \frac{1}{b} \sum_{l=1}^{n} \mathrm{Machine}_j^{(l)}    (for all j)    (2.8)

The parameter θ is then broadcast to all machines to calculate the next iteration of the algorithm. This would potentially give an n-fold speedup if there were no network latency.
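As a minimal illustration of equations 2.7 and 2.8, the sketch below simulates the map and reduce jobs on a single machine; the function names are hypothetical, and no real network communication or MapReduce framework is involved.

import numpy as np

def map_partial_gradient(X_part, y_part, theta):
    """Map job (eq. 2.7): one machine computes the gradient sum
    for its shard of the current mini-batch."""
    error = y_part - X_part @ theta
    return X_part.T @ error

def reduce_and_update(theta, partials, alpha, b):
    """Reduce job (eq. 2.8): sum the partial gradients and update theta.
    The updated theta would then be broadcast back to all machines."""
    return theta + alpha * np.sum(partials, axis=0) / b

# Example: a mini-batch of size b split across n simulated machines
n, b, d = 4, 32, 10
X = np.random.randn(b, d)
y = np.random.randn(b)
theta = np.zeros(d)
shards = zip(np.array_split(X, n), np.array_split(y, n))
partials = [map_partial_gradient(Xp, yp, theta) for Xp, yp in shards]
theta = reduce_and_update(theta, partials, alpha=0.01, b=b)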

2.3 Algorithms for Collective Communication

Collective communication algorithms are a set of algorithms designed for communication involving multiple processes. The three collective communication algorithms of interest for this study are: one-to-all broadcast (when the parameter server broadcasts the global model), all-to-one reduction (when all workers push their gradients to be reduced by the parameter server), and all-reduce. All algorithms in collective

communication try to minimize the overhead of point-to-point message passing. The communication overhead of point-to-point message passing is given by:

\mathrm{overhead} = t_s + t_w m    (2.9)

where t_s is the latency, t_w is the per-word transfer time (inverse bandwidth), and m is the message size in number of words. If the link is bi-directional and used by more than one message, then t_w → t_w / M, where M is the number of messages that share the link (i.e. the transfer time is shared by M messages). This shows that distributed training is highly dependent on the amount of data that needs to be sent, the throughput of the communication channel, and the number of workers using the communication channel. Unlike all-to-one reduction, all-reduce algorithms are independent of the number of processes that need to communicate. In all-reduce, each worker communicates with two neighbors: it receives a message from the neighbor on one side, combines it with its local message, and passes the result to the neighbor on the other side. Hence the number of messages that share the same link does not grow with the number of processes, and t_w → t_w / 1 for all M.

2.3.1 Message Passing Interface (MPI)

The Message Passing Interface (MPI) standard is a cross-platform API that allows distributed-memory parallel programs to exchange data by abstracting the underlying network procedures. MPI is available on single compute nodes for inter-process communication and on clusters for communication between nodes. MPI can take advantage of high-performance interconnects, with high throughput and low latency, such as InfiniBand, Intel Omni-Path, and Cray interconnects, which are not available in Spark, MapReduce, and gRPC (Google RPC, used by distributed TensorFlow). MPI has corresponding functions for all the collective communication algorithms mentioned above: MPI_Bcast (one-to-all broadcast), MPI_Reduce (all-to-one reduction), MPI_Allreduce (all-reduce), and others. Open MPI [15] is an open-source implementation of MPI developed and maintained by the High Performance Computing community. Open MPI is used in implementations of parallel and distributed deep learning

training frameworks like Baidu’s contribution to TensorFlow (tensorflow-allreduce), Theano-MPI [31], and Horovod [38].
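For illustration, the short script below averages a gradient-like NumPy array across processes with MPI_Allreduce. It uses mpi4py, which is an assumption for this sketch only; the experiments in this thesis use Open MPI through Horovod, not mpi4py. It could be launched with, e.g., `mpirun -np 4 python allreduce_demo.py`.

# allreduce_demo.py -- average an array across all MPI ranks
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

local_grad = np.full(5, float(rank))      # stand-in for a locally computed gradient
global_sum = np.empty_like(local_grad)

# MPI_Allreduce: every rank ends up with the element-wise sum
comm.Allreduce(local_grad, global_sum, op=MPI.SUM)
avg_grad = global_sum / size

if rank == 0:
    print("averaged gradient:", avg_grad)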

2.4 Data Parallelism

Scalability challenges that arise from the massive data volumes used to train deep learning models can be alleviated by leveraging data parallelism. In data parallelism, replicas of the same model run on multiple workers, each on a different subset of the training data, in parallel. Thus, data parallelism requires keeping a global model and some way of updating its parameters by gathering results from workers. Stochastic gradient descent (SGD) is an inherently sequential algorithm [49]. To perform SGD in parallel, a global model is maintained and workers send their gradients, which are aggregated, to the global model. This gradient update can be performed in two different ways:

• Synchronous SGD, where gradient updates are made when all workers are done calculating their respective gradients, and

• Asynchronous SGD, where gradient updates are made incrementally as individual workers finish calculating their respective gradients.

Finally, the updated global model is broadcast to all workers. Thus, techniques common in collective communication and high-performance computing (HPC) are highly relevant for model gradient update and propagation.

2.4.1 Synchronous SGD

In synchronous SGD, a global model is kept that is updated with the aggregate of all worker gradients. This updated global model is then sent to all workers, making synchronous SGD a true mini-batch stochastic gradient descent, where the mini-batch size is the sum of the mini-batches of all workers. Synchronous SGD has two obvious drawbacks: (1) the update time depends on the slowest worker and (2) the scalability of the approach depends on the batch size of each worker (i.e. bigger mini-batches on workers will limit the number of workers that can be added, because very big batch sizes can affect the convergence rate of SGD), although training with mini-batch sizes of up to 32k with no loss in accuracy has been demonstrated [16, 3].

2.4.2 Asynchronous SGD

Downpour SGD, first introduced in [13], used asynchronous stochastic gradient descent, where each worker pushes its gradients ∆w and gets back an updated global model W independently of other workers. Asynchronous SGD has two advantages: (1) the performance of the algorithm is not affected by a slow worker and (2) it is fault tolerant. Fault tolerance is a challenge faced when implementing any distributed framework. Long-running applications, like deep learning, are susceptible to faults and require robust and fault-tolerant frameworks. The task failure rate of 10k machine-hour jobs of batch machine learning tasks was reported to be 24.7% in [30]. While asynchronous SGD solves the bottleneck introduced by the slowest worker in synchronous SGD, it suffers from the problem of delayed gradients. This is encountered when a slow worker pushes its gradients to the global model after the model has already been updated by other workers. Different techniques have been proposed to solve the problem of delayed gradients [48, 8]. While the delayed gradients can create some noise in the global model and delay convergence, deep neural networks can recover and learn successfully [8].

2.4.3 Parameter Server

The parameter server framework introduced in [42] is widely adopted [2, 21, 30] as an efficient solution to scale machine learning algorithms. Parameter server frameworks distribute data and model parameters across multiple nodes to spread the workload. In the parameter server architecture (shown in Figure 2.1) a distributed key-value store is used for synchronizing parameters between workers. Parameter servers store all parameters, while workers are stateless but can cache parameters across iterations. In a straightforward configuration with a single parameter server and multiple workers, each worker computes a gradient on its subset of the mini-batch and sends it to the parameter server, which then takes the average of all the gradients and broadcasts it back to all workers. If we imagine a neural network with 100 million trainable parameters (which is not uncommon in deep learning), where each parameter is four bytes, we need to communicate roughly 400 megabytes of data per worker. In the above configuration, where all workers share

the same bandwidth, the communication cost grows linearly with the number of workers. To make further parallelization practical, the synchronization constraint can be removed (asynchronous SGD) or we can use communication algorithms whose cost is independent of the number of workers (all-reduce). The synchronization constraint is the main bottleneck in this framework, so some implementations loosen this constraint to reduce communication overhead [20, 30], while others, like Baidu’s ring all-reduce, take advantage of bandwidth-optimal communication algorithms without loosening the synchronization constraint.

Figure 2.1: Parameter server model. Training data is divided across machines, each containing a model replica. The parameter server collects model gradients ∆W from each machine and returns an aggregated model parameter W that is used to update every model replica. (image source [13])
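A schematic of one synchronous parameter server step as described above, with hypothetical function names and no real networking, is sketched below; in practice this logic lives inside frameworks such as distributed TensorFlow rather than user code.

import numpy as np

def parameter_server_step(theta, worker_batches, compute_gradient, alpha):
    """One synchronous step: every worker pushes a gradient (all-to-one
    reduction), the server averages them and applies the update, and the
    new parameters are broadcast back (one-to-all broadcast)."""
    gradients = [compute_gradient(theta, batch) for batch in worker_batches]
    avg_grad = np.mean(gradients, axis=0)
    theta = theta - alpha * avg_grad
    return theta        # broadcast to all workers for the next iteration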

2.4.4 Ring All-Reduce

The bottleneck created by sending data to a single parameter server in all-to-one reduction can be alleviated with ring all-reduce, an algorithm common in the field of high-performance computing. In ring all-reduce each worker is assigned two neighboring workers: one to send data to and another to receive from. The algorithm as presented in [50] consists of two stages:

1. Scatter-reduce: in this stage, each worker sends a chunk to and receives a chunk from its direct neighbors. After the first data transfer iteration, each worker aggregates the chunk it received with its local copy and repeats the data transfer stage until each worker holds some part of the aggregated final value that includes contributions from all workers.

2. All-gather: in this stage each worker receives a final value from its (sender) neighbor. After each data transfer, the worker replaces its value with the newly received one and continues the data transfer until every worker has received the contributions from all other workers.

The speed of the ring all-reduce algorithm is independent of the number of workers; instead, it is limited by the slowest communication link between neighboring workers. The ring all-reduce algorithm can be applied to deep learning in the same way as the parameter server framework, but instead of sending gradients to a single parameter server they are sent to immediate neighbors. Furthermore, communication can be overlapped with the gradient computation, by sending the gradients of the output layer while those of the other layers are still being computed [50].
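The two stages can be made concrete with a small single-process simulation (a hypothetical helper, no real communication): each of the N workers holds a vector split into N chunks, and after N−1 scatter-reduce steps plus N−1 all-gather steps every worker holds the full element-wise sum.

import numpy as np

def ring_allreduce(worker_values):
    """Simulate ring all-reduce over a list of equally shaped 1-D arrays."""
    n = len(worker_values)
    # Each worker splits its vector into n chunks
    chunks = [np.array_split(v.astype(float), n) for v in worker_values]

    # Stage 1: scatter-reduce -- after n-1 steps, worker w owns the fully
    # reduced chunk (w + 1) % n
    for step in range(n - 1):
        for w in range(n):
            send_idx = (w - step) % n
            dst = (w + 1) % n
            chunks[dst][send_idx] += chunks[w][send_idx]

    # Stage 2: all-gather -- circulate the reduced chunks around the ring
    # until every worker has all of them
    for step in range(n - 1):
        for w in range(n):
            idx = (w + 1 - step) % n
            dst = (w + 1) % n
            chunks[dst][idx] = chunks[w][idx].copy()

    return [np.concatenate(c) for c in chunks]

# Usage: four workers, each contributing a different vector
workers = [np.arange(8, dtype=float) * (w + 1) for w in range(4)]
results = ring_allreduce(workers)
assert all(np.allclose(r, sum(workers)) for r in results)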

2.5 Model Parallelism

Deep learning models can have billions of parameters [13]. At four bytes per parameter, these models can reach gigabytes in size and cannot fit into the memory of a single GPU. Model parallelism can be used to address this problem. In model parallelism, synchronization between workers is done when one worker needs neuron activities output by another worker as input. To minimize the communication overhead, a neural network model graph can be partitioned in such a way that the edges running between the separated components of the model (shown as thicker lines in Figure 2.2) are few or the amount of data that flows through them is low. However, model parallelism with the goal of minimizing execution time cannot be achieved solely by partitioning. Operations also need to run in a particular order, to make sure outputs from a task are available when they are needed as input by other tasks (scheduling).

Figure 2.2: Model parallelism. Gray boxes represent devices (GPU or CPU) or machines. The thicker edges show communication across machines or devices. (image source [13])

Both finding an optimal partitioning of a dataflow graph and scheduling are NP-complete problems [32]. However, there are suboptimal heuristics-based algorithms both for scheduling and for partitioning computational graphs.

2.5.1 Partitioning Neural Network Model Graphs

Graph partitioning can be used for load balancing tasks onto multiple devices while minimizing communication. Graph partitioning can be defined as follows: given a directed acyclic graph G = (N, E, W_N, W_E), where N are the nodes, E the edges, W_N the node weights, and W_E the edge weights, find a partition that distributes the load W_N evenly while minimizing the sum of all edge weights connecting the different partitions. In a neural network computational graph, N represents the computational operations, W_N is the cost of an operation in N, and an edge E(i,j) with weight W_E(i,j) is the amount of data that flows between i and j.
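Written with the notation above, a standard balanced k-way partitioning formulation of this problem is the following (stated here as an illustration, not quoted from [32]):

\min_{P_1, \dots, P_k} \; \sum_{(i,j) \in E,\; P(i) \neq P(j)} W_E(i,j)
\qquad \text{subject to} \qquad
\sum_{n \in P_l} W_N(n) \;\approx\; \frac{1}{k} \sum_{n \in N} W_N(n), \quad l = 1, \dots, k

where the parts P_1, ..., P_k cover all nodes and P(i) denotes the part that node i is assigned to; the objective is the total weight of edges cut by the partition, and the constraint expresses the even load balance.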

2.5.2 Device Placement Optimization

As shown in [33], graph partitioning algorithms do not produce satisfactory placements, because modeling cost estimates for these graphs is expensive. Instead, [33] proposes a reinforcement learning model for device placement optimization that outperforms both graph partitioning algorithms and expert human placements.

Reinforcement Learning for Device Placement Optimization

Reinforcement learning is a part of machine learning that is inspired by the reward system (dopamine pathway) of the brain. Unlike supervised learning, where training data is labeled with the true class, reinforcement learning agents learn from experience. By using the execution time of a proposed placement, executed on actual hardware, as a reward signal, a reinforcement learning agent can learn to optimize device placement [33]. The agent trained this way was reported to be 3.5 times faster than SCOTCH’s [35] graph-partitioning-based placement and up to 20% faster than human experts’ placements.

2.6 Hybrid Data and Model Parallelism

No one dimension of parallelism is better than the other. Which scheme to use should be informed by the communication architecture of a model, the amount of training data, and the size of the model. Many distributed deep learning frameworks use both schemes [8, 34]. One useful observation made in [25] is that convolutional neural networks have two types of layers: convolutional layers contain 90-95% of the computation but only about 5% of the parameters, while fully-connected layers contain only about 5-10% of the computation but 95% of the parameters. This shows that data parallelism is more appropriate for convolutional layers, and model parallelism for fully-connected layers.

2.7 TensorFlow

TensorFlow is an open-source framework for building and deploying machine learning models. TensorFlow is a second-generation framework derived from DistBelief [13], a Google Brain project. TensorFlow

has support for both CPUs and GPUs (using NVIDIA CUDA) on one node or on a cluster with multiple nodes. Currently it is the only machine learning framework supported on hops-hadoop’s data management and analysis platform. The TensorFlow programming framework contains three basic concepts (a minimal example follows the list):

1. Tensors are the main computational units in TensorFlow. A tensor is a multidimensional array with a rank that represents its dimension.

2. Graphs represent the dataflow between computations in a TensorFlow program, where graph edges and vertices represent dataflow and operations, respectively.

3. Session holds information about the TensorFlow graph, runs the computations described by the graph, and provides access to hardware resources on local or remote devices.
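The three concepts fit together as in the minimal sketch below (TensorFlow 1.x API, matching the version 1.7 used in the experiments; the tensor values are arbitrary and chosen only for illustration).

import tensorflow as tf

# Graph: operations (vertices) connected by tensors (edges)
a = tf.constant([[1.0, 2.0]])          # rank-2 tensor, shape (1, 2)
b = tf.constant([[3.0], [4.0]])        # rank-2 tensor, shape (2, 1)
product = tf.matmul(a, b)              # an operation node in the graph

# Session: runs the computation described by the graph on the
# available local (or remote) devices
with tf.Session() as sess:
    print(sess.run(product))           # [[11.]]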

2.7.1 Distributed TensorFlow

TensorFlow supports distributed training both for data and model parallelism, and it allows both synchronous and asynchronous training. TensorFlow also supports ring all-reduce on a single node using NCCL (NVIDIA’s library for collective communication). TensorFlow distributes training by creating a cluster of tasks that will execute a TensorFlow graph. Distributed execution is achieved using two objects:

1. Server is used to create a session for each task.

2. Worker executes operations in a graph.

A specification dictionary is used to map job names to network addresses. These job names can then be used in TensorFlow code to specify which part of the execution will run on which worker, and the network addresses are used for communication.
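A minimal illustration of these two objects with the TensorFlow 1.x distributed API is shown below; the addresses are placeholders and mirror the cluster specification format shown later in Listing 1, and the variables are only stand-ins.

import tensorflow as tf

# Specification dictionary: job names mapped to network addresses
cluster = tf.train.ClusterSpec({
    "ps":     ["10.0.1.15:4287"],
    "worker": ["10.0.1.16:4062", "10.0.1.18:4129"],
})

# Every task creates one Server with its own job name and index;
# a parameter server task typically just blocks on server.join().
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# Job names are used in the code to pin operations to tasks
with tf.device("/job:ps/task:0"):
    weights = tf.Variable(tf.zeros([784, 10]), name="weights")
with tf.device("/job:worker/task:0"):
    logits = tf.matmul(tf.zeros([32, 784]), weights)

# A session connected to this task's server would then execute the graph:
# with tf.Session(server.target) as sess: ...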

2.8 Resource Management in Hops-YARN

Hops-YARN is a distribution of Apache Hadoop YARN [45] that has moved the StateStore to a transactional in-memory database (MySQL

NDB Cluster). An application that is submitted to a hadoop cluster is assigned the required resources and monitored through YARN (Yet Another Resource Negotiator). YARN cluster management includes resource management, monitoring, and scheduling. These tasks are performed by three separate components:

1. ResourceManager (RM) is the central authority responsible for mediating resources for applications running in the cluster. Its tasks include scheduling and resource management.

2. NodeManager (NM) is tasked with reporting resource availability and faults, and with container lifecycle management (e.g., starting and killing containers) on each node in a cluster.

3. ApplicationMaster (AM) manages all lifecycle aspects of a job. The AM can run arbitrary user code and communicates with the RM to issue resource requests that can contain locality preferences.

The Hadoop version (2.8) used in Hops has support for managing memory and CPU as resources, but GPUs are not managed natively by this version of Hadoop YARN. Hops-hadoop YARN added GPUs as managed resources in 2017 [5].

2.9 Spark on YARN

Apache Spark [47] is a cluster computing framework that uses a programming model similar to MapReduce [14]. Launching an application with Spark involves five components:

1. Driver launches the job and provides the code that will run on the workers.

2. Cluster manager is used by Spark to request resources. The cluster manager can be standalone, Mesos, or YARN.

3. Workers are containers that provide resources to the application.

4. Executors are JVM processes created on worker nodes by Spark.

5. Tasks are threads inside an executor.

A Spark application can be deployed on YARN in two modes: cluster mode or client mode. In cluster mode, the driver process runs in the YARN AM, while in client mode the driver runs in the client that started the Spark job.

2.9.1 TensorFlow On Spark

TensorFlowOnSpark (TFoS) [29] is an open-source framework that enables distributed training and inference on Spark by using TensorFlow’s own distributed deep learning capabilities. This allows TFoS to support all the distributed deep learning methods in TensorFlow. TFoS has support for communication using both Ethernet and RDMA over Converged Ethernet (RoCE). The GPU support in TFoS is not managed, i.e. the program will take any available GPU on a cluster. This makes it unsuitable for shared clusters where GPUs are managed resources. The Hops team has released a distribution of TFoS that complies with the YARN resource management restrictions, for use with the hops-hadoop platform.

2.10 Evaluation

The performance of a distributed training algorithm is measured by the amount of time it takes to train a model with no significant loss of accuracy. Scalability is another metric that can be used to evaluate distributed training algorithms. A distributed training algorithm is scalable if the running time of the algorithm is reduced with every added computational resource and the additional cost is justified by the performance gain. With ideal scaling, the running time of the algorithm improves proportionally to the added computational resources (linear scalability). Benchmarks for distributed training are usually done both with synthetic and real data. The synthetic data helps to remove disk I/O from the equation. Real data can then be used to verify the result and measure the disk I/O impact.
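A common way to make this precise, consistent with how the experiments in Chapter 3 compare measured throughput against ideal scaling (stated here as a definition rather than a formula taken verbatim from the thesis), is:

S(N) = \frac{T(1)}{T(N)} = \frac{\text{images/sec with } N \text{ workers}}{\text{images/sec with one worker}}, \qquad E(N) = \frac{S(N)}{N}

where T(N) is the time to process a fixed workload with N workers, S(N) is the speedup, and E(N) the efficiency; linear scalability corresponds to S(N) = N, i.e. E(N) = 1.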

2.11 Related Work

Distributed implementations of deep learning training are important for efficiency, achieving better accuracy, and quicker trial and error in experiments. Given this fact, much work has been put into building distributed frameworks for training deep networks and analyzing their performance in single-GPU, multi-GPU, and multi-node environments [39, 40]. Frameworks like Project Adam [8] and SINGA [34] attempt to solve the performance and scalability issues of deep learning training by building a highly customized solution. Others, like TensorFlowOnSpark, integrate deep learning frameworks with a general-purpose batch computation framework like Spark, and Horovod [38] brings HPC techniques to deep learning. As shown in [16, 3], HPC techniques like ring all-reduce can achieve nearly linear scalability, but there is no native support in MPI for YARN-based resource management. Generally available machine learning frameworks like TensorFlow, Caffe2, and Torch, which support distributed training natively, have no built-in support for deploying a model in a shared cluster environment. There is a clear mismatch between shared clusters, managed by YARN, and machine learning frameworks, which is addressed by this project. This work explored the feasibility of running machine learning frameworks and HPC techniques in a shared cluster managed by YARN, and analyzed the performance of different distributed deep learning algorithms in a shared cluster environment. The frameworks used in the analyses are TensorFlowOnSpark and Horovod, customized for container-based resource management. The tests were performed on a 2-node cluster where each node was configured with 10 GPUs. The GPUs in the two nodes are GeForce GTX 1080 Ti (for a total of 20 GeForce GTX 1080 Ti). The nodes are connected via a 40Gb/s InfiniBand connection and a 1Gb/s Ethernet cable.

Chapter 3

Method

Distributing the training of deep learning models is important for achieving better accuracy, allowing for quicker trial and error in experiments, and general efficiency. But to be efficient a distributed training algorithm must take into consideration the communication cost incurred when training is done in parallel. There are many distributed training architectures that try to address this cost. Here, two widely used distributed training algorithms, parameter server and ring all-reduce, are analyzed using image recognition models on a hops-hadoop cluster with 20 GPUs. The analysis is done on the run-time performance of the models with different numbers of parameter servers, update modes, and communication algorithms. Given that hops-hadoop clusters are designed for multi-tenancy, a mechanism for deploying distributed training on a shared cluster is also presented.

3.1 Datasets and Models

ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [37] winning models are widely used for performance benchmarking in distributed deep learning [3, 16, 39, 40]. InceptionV4 [43], ResNet-50 [19], AlexNet [26], ResNet152_v2 [19], and VGG16(19) [41] are used in the experiments in this study, with a synthetic ImageNet dataset, to analyze the performance of different distributed training schemes and measure the impact of the communication cost. Synthetic datasets are randomly generated pixel values that match the dimensions of the desired dataset (256 × 256 for ImageNet). Using synthetic datasets allows us to test the computational performance and communication overhead by removing disk I/O from the equation. For the purpose of this paper the disk I/O impact on scalability is not considered; therefore no tests with real data were carried out. The models used for the experiments are implementations of the original models in TensorFlow benchmarks [44]. The models were chosen to represent different architectures and numbers of parameters. The models and their respective number of parameters, floating point operations, and accuracy are shown in Table 3.1.

Table 3.1: Models used in experiments with number of parameters, operations, and accuracy

Models       # Parameters   Flops       Top-5 accuracy (%)   Top-1 accuracy (%)
AlexNet      ~60M           ~2.27 Bn    80.3                 57.0
VGG16        ~138M          ~30.94 Bn   90.0                 70.5
VGG19        ~143M          ~39 Bn      92.7                 71.3
Resnet50     ~25M           ~10 Bn      92.9                 75.8
Resnet152    ~60M           ~29.4 Bn    93.8                 77.6
googlenet    ~10M           ~3 Bn       92.1                 <70
inception4   ~65M           ~20 Bn      95.0                 80.0

3.2 Cluster Setups

A 2-node GPU cluster was used for the experiments. In the next two sections the hardware and software setups of both nodes are specified.

3.2.1 Hardware Specification

The two machines in the cluster used for the experiments have identical hardware specifications, shown below:

• CPU: 2x Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz

• Cores per socket: 8

• Threads per core: 2

• CPU MHz: 1273.699, min 1200, max 3000

• Memory: 256 GB

• GPU: 10x GeForce GTX 1080 Ti, PCIe Gen3, x16 (16 lanes)

• InfiniBand: QLogic Corp. IBA7322 QDR InfiniBand HCA, PCIe Gen2, x8 (8 lanes)

This gives a total of 64 cores, 512 GB of memory, and 20 GPUs. The 10 GPUs in one machine are divided into two groups of 5, placed on different PCIe buses as shown in Figure 3.1, with a connection traversing a PCIe switch as well as a PCIe host bridge (typically the CPU) between GPUs in different groups. Within a group, each GPU is connected to the others through a single PCIe switch. The two nodes are connected with a 40Gb/s InfiniBand connection and a 1Gb/s Ethernet cable. InfiniBand performance might be affected by the bandwidth limitation introduced by PCIe. The maximum possible bandwidth of the PCIe bus is calculated by multiplying the PCIe width and speed, minus ~1Gb/s for error correction protocols and 20% PCIe header overhead. The speed and width of the PCIe expansion bus the InfiniBand card supports are 5GT/s (GT/s stands for "billion transactions per second") and x8, respectively. The maximum possible PCIe bandwidth is therefore 5G ∗ 8 ∗ (1 − 1/5) − 1G = 40G ∗ 0.8 − 1G ≈ 31Gb/s.

3.2.2 Distributed Deep Learning and Big Data Frameworks

Both machines in the cluster were running CentOS 7.2 and were installed with the following software and distributed deep learning frameworks.

• Open MPI: version 3.0.1

• CUDA Toolkit: version 9.0

• NCCL: version 2.1.15

• cuDNN: version 7.0

• Hops-YARN: version 2.7

• Spark: version 2.2.1

• TensorFlow: version 1.7

Figure 3.1: Two nodes with 10 GPUs each and a 40Gb/s InfiniBand connection running between them. The InfiniBand cards are using a PCIe gen2 expansion bus with a speed of 5GT/s and width of x8. The GPUs are mounted on PCIe gen3 expansion bus with a speed of 8GT/s and width of x16.

• TensorFlowOnSpark: version 1.3.5 (hopshadoop fork, https://github.com/hopshadoop/TensorFlowOnSpark)

• horovod: version 0.12.1

3.3 Experiment Design

The running time performance of the distributed deep learning frameworks is measured by the number of images per second that different models are able to process in distributed mode. Because the batch size can affect the number of images per second that can be processed, all models were tested with the same batch size. The scalability of the models is then compared with the ideal scaling (i.e. images/sec for one worker multiplied by the number of workers) to give a performance metric that is independent of the model used in the experiment. This was done for the parameter server and all-reduce algorithms with the two chosen frameworks, TensorFlowOnSpark and Horovod.

3.3.1 Batch Size

The mini-batch size used to train a model affects: (1) how fast a model converges, or whether it converges at all, (2) the amount of time spent on output calculation by each replica of a model, and (3) the amount of data that needs to be read from disk before each mini-batch is processed (only in the case of real-data experiments). In this work none of these three properties were analyzed, so a mini-batch size of 32 per worker is used for all experiments. This is because 32 was the biggest batch size that could fit in memory for some of the big models.

3.3.2 Number of Workers

The number of workers in the experiments was limited by the number of GPUs available in the cluster. In the parameter server experiments all 20 GPUs could not be used, because of the low bandwidth available between the two nodes and because TFoS does not support InfiniBand. All parameter server experiments were therefore limited to 10 GPUs. An experiment involving two nodes is presented in section 4.1.3 to show the network bandwidth bottleneck. Given the limited number of GPUs available, only one and two parameter servers are tested. In the hops-hadoop version of TensorFlowOnSpark GPUs are not assigned to parameter servers, allowing for 10-worker experiments.

3.3.3 System Monitoring

System usage statistics were collected using the NVIDIA System Management Interface (nvidia-smi) and Collectl. Collectl was used to collect usage data on CPU, network, and InfiniBand, while nvidia-smi was used to get GPU usage statistics. Both monitoring services were started on both nodes at the start of each experiment and stopped when the experiment finished.

3.4 Model Deployment

Model deployment and resource allocation are done through Spark with the YARN cluster manager, both for TensorFlowOnSpark and Horovod.

TensorFlowOnSpark is built on top of Spark, and thus needs little modification to work with hops-hadoop YARN. Horovod, on the other hand, only works with MPI. A framework around MPI, which uses Spark for resource allocation with YARN, was built to allow MPI to spawn processes in the cluster and have access to resources. In the next sections the design of the deployment architecture for both frameworks is described.

3.4.1 Parameter Server with TensorFlowOnSpark

TensorFlowOnSpark’s GPU support is not managed. After each executor is started by Spark, it checks the GPU utilization by running nvidia-smi (NVIDIA System Management Interface) and takes any GPU with low memory utilization. This clearly creates a race condition where two or more executors check the utilization at the same time and start using the same GPU. On hops-hadoop, YARN manages GPUs, and executors only get access to GPUs allocated exclusively to the container they are running in. Resource allocation and model deployment for TFoS on hops-hadoop includes the following steps:

1. Spark makes a resource request to YARN with the number of executors, CPU cores per executor, memory per executor, and number of GPUs per executor.

2. When the resource request is fulfilled, TFoS starts a coordination server on the driver and sends its address to all executors along with the client code. This server waits until all executors have registered with their respective address and port.

3. After all executors are registered, TFoS creates a cluster specification containing parameter server and worker addresses. This is then used by TensorFlow to create a session for each task. A sample cluster specification with one parameter server and three workers is shown in Listing 1.

4. Parameter updates and gradient broadcasting are then handled by distributed TensorFlow.

Cluster spec: {
    'ps': ['10.0.1.15:4287'],
    'worker': ['10.0.1.16:4062', '10.0.1.16:3426',
               '10.0.1.18:4129']
}

Listing 1: Sample cluster specification with one parameter server and three workers. The cluster specification is a Python dictionary used by TensorFlow to create a session for each task.

3.4.2 Ring All-Reduce with Horovod

Horovod relies on MPI to spawn processes and on NCCL (NVIDIA’s library for collective communication) to handle communication between processes. After MPI has started all processes and assigned a rank to each process, Horovod uses the global ranks to create a communication link between neighboring workers. This creates the ring in the ring all-reduce collective communication algorithm. Horovod broadcasts variables from the process with rank 0 to ensure consistent initialization. Each process pins a single GPU using its local rank and adds it to TensorFlow’s device list, which maps physical GPU ids to virtual GPU ids. This ensures that no two processes are assigned the same GPU. MPI processes are created using a script similar to the one shown in Listing 3. The mpirun program creates processes on remote machines using ssh. A hostfile can be used to specify where to create the processes or to limit the number of processes that can be created on each host, as shown in Listing 2.

#------ Host file ------
node1 max_slots=20
node2 max_slots=20
#-----------------------

Listing 2: Sample mpirun hostfile with 2 nodes, allowing a maximum of 20 processes per node. The hostfile is a newline-separated text file with one node per line.

% mpirun \
    -hostfile hostfile \
    -np 2 \
    -H node1 \
    -wdir WORKING_DIR \
    -x CUDA_VISIBLE_DEVICES="0,1" \
    python main.py : \
    -np 2 \
    -H node2 \
    -wdir WORKING_DIR \
    -x CUDA_VISIBLE_DEVICES="2,4" \
    python main.py

Listing 3: Sample mpirun bash script to start 2 processes per node on 2 nodes. The wdir argument changes the working directory of the process and CUDA_VISIBLE_DEVICES makes the specified GPU ids visible to the process. The hostfile is similar to the one shown in Listing 2 and is used to limit the number of processes that can be created on each node.
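The user program launched by mpirun in Listing 3 (main.py) would typically be a Horovod training script. A minimal, hypothetical sketch of such a script using the Horovod TensorFlow API is shown below; the tiny model is a stand-in and is not the benchmark code used in the experiments.

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()                                    # ranks were assigned by mpirun

# Pin exactly one GPU to this process using its local rank
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Tiny stand-in model (the experiments use the TensorFlow benchmark models)
x = tf.random_normal([32, 10])
w = tf.Variable(tf.zeros([10, 1]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - 1.0))

global_step = tf.train.get_or_create_global_step()
opt = tf.train.GradientDescentOptimizer(0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)           # gradients averaged with ring all-reduce
train_op = opt.minimize(loss, global_step=global_step)

hooks = [hvd.BroadcastGlobalVariablesHook(0),  # rank 0 broadcasts initial weights
         tf.train.StopAtStepHook(last_step=100)]

with tf.train.MonitoredTrainingSession(config=config, hooks=hooks) as sess:
    while not sess.should_stop():
        sess.run(train_op)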

Monitoring MPI processes

MPI processes and their resource usage can be monitored from the machine that runs the mpirun program using ompi-top. This MPI utility program takes the process id of the mpirun program (as shown in Listing 4) and returns usage statistics for all processes started by it.

% ompi-top -pid <pid>
% nvidia-smi --query-compute-apps=gpu_uuid,pid \
    --format=csv

Listing 4: Bash commands used for collecting system usage statistics. The first command is similar to the top command and displays system usage of all processes started by the mpirun process identified by its pid. The second command lists the gpu_uuid of each GPU in use together with the pid of the process using it.

Monitoring GPU Usage

The GPU usage on the cluster is monitored using the NVIDIA System Management Interface (nvidia-smi). The command shown in Listing 4 returns all GPUs that are in use along with the process id of the program using them.

MPI Wrapper

In the next three sections the complete architecture of the deployment mechanism for ring all-reduce (Horovod) training jobs is presented. To deploy an MPI job on a hadoop cluster:

1. Resources need to be allocated.

2. An MPI process needs to be launched with a system user that has ssh access to all nodes in the cluster.

3. The processes spawned by MPI need to be monitored.

Resource Allocation

Resource allocation is done through Spark on YARN. After Spark acquires all the required resources, the driver starts a simple socket server to coordinate all workers. Then the driver launches the client code on every executor with the address of the coordination server and waits until all workers report back with the necessary information. Each executor's client code collects its working directory, host name or ip address, assigned GPU uuid (gpu_uuid: NVIDIA GPU universally unique identifier), and any environment variables set for the executor, and sends them to the coordination server. Finally, after the coordination server has received all the responses, it makes a call to the MPI wrapper application's REST endpoint with the information from the executors and a path to the user program that MPI should launch.
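As a rough illustration of the executor-side step described above, the sketch below collects the same pieces of information and pushes them to the coordination server. The helper name, message format, and use of a plain TCP socket are all assumptions made for this sketch; the actual Hopsworks client code is not reproduced in this thesis.

import json
import os
import socket
import subprocess

def report_to_coordinator(coord_host, coord_port):
    """Collect the executor's working directory, host name, GPU uuid, and
    selected environment variables and send them to the coordination
    server started by the Spark driver (hypothetical message format)."""
    gpu_uuid = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=uuid", "--format=csv,noheader"]
    ).decode().strip()

    info = {
        "workdir": os.getcwd(),
        "host": socket.gethostname(),
        "gpu_uuid": gpu_uuid,
        "env": {"CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES", "")},
    }
    with socket.create_connection((coord_host, coord_port)) as conn:
        conn.sendall(json.dumps(info).encode())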

Launching MPI Processes

The MPI wrapper is an application that can be deployed on any web server that supports Java EE applications. The wrapper exposes REST endpoints for:

• Starting an MPI process: given a valid YARN application id and a program to run, it will start the MPI process and return its process id.

• Stopping running MPI process: given a valid YARN application id and MPI process id, it will stop the running MPI process.

• Getting status of MPI process: given a valid YARN application id and MPI process id, it will return the current status of the job (Running or Stopped).

• Getting log of MPI process: given a valid YARN application id and MPI process id, it will return the stdout and stderr of the MPI process.

• Getting all running MPI processes: given sufficient access level (Administrator), it will return all running MPI jobs.

To access any of the endpoints listed above, a user needs to have a valid authentication token or session key. This is used to identify the user making the request and to perform any necessary access control. In addition, the YARN application id is used to check whether the authenticated user has access to the application she/he is trying to run an MPI process for. When a start MPI process request is received and all the access control checks have passed:

1. An mpirun script similar to the one in Listing 3 is constructed and executed.

2. The process id of the mpirun process is saved in a database along with the application id of the Spark job and the resources allocated to the application (i.e. CPU cores, GPUs, and memory).

3. A status OK with the process id is sent back as a response.

The Spark driver can subsequently query the MPI wrapper for the status of the running MPI process and retrieve logs using the process id and application id. If the Spark job is stopped properly, it can call stop MPI process before exiting.
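As an illustration of the driver's side of this interaction, the sketch below issues a start request with Python's requests library. The base URL, endpoint path, payload fields, and authentication header are assumptions for the example, not the wrapper's documented API.

    import requests

    WRAPPER_URL = "https://example.com/mpi-wrapper"   # hypothetical base URL

    def start_mpi_job(app_id, program_path, executors, auth_token):
        # Ask the MPI wrapper to construct and execute an mpirun script for
        # the given YARN application (endpoint and payload are assumed).
        resp = requests.post(
            WRAPPER_URL + "/jobs",
            headers={"Authorization": "Bearer " + auth_token},
            json={"appId": app_id,
                  "program": program_path,
                  "executors": executors},            # info collected by the coordination server
            timeout=30)
        resp.raise_for_status()
        return resp.json()["pid"]                     # process id of the started mpirun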

Monitoring MPI Processes
Monitoring the MPI process is necessary for two reasons: (1) because the MPI process is running outside a managed container, it needs to be monitored for malicious activities, and (2) if the Spark job is terminated suddenly without a proper cleanup, stray MPI processes will remain in the system consuming resources. The monitoring thread is an Enterprise Java Beans (EJB) timer that runs periodically to check process resource usage and clean up stray MPI processes. On every run the timer thread:

1. reads all entries in the table containing the appid and pid of the MPI job. The appid is used to query YARN for the status of the application. If the application is in any of the states [FINISHED, FAILED, KILLED], the MPI process is killed, if still running, and the row is removed from the table.

2. for each row remaining in the table, the usage is compared with the assigned resources stored in the table. If the usage exceeds the resources assigned, the process is killed and the user is added to a table that can be used to block users.

The process of starting and monitoring MPI processes is shown in Figure 3.2.
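In the prototype this logic is implemented as a Java EE timer; the Python sketch below is only an illustration of one monitoring round, assuming the jobs table is available as a list of dicts and that per-process usage is read with the psutil library (simplified here to memory only).

    import os
    import re
    import signal
    import subprocess

    import psutil  # assumed available for per-process resource usage

    END_STATES = {"FINISHED", "FAILED", "KILLED"}

    def yarn_app_state(app_id):
        # Read the application state with the standard `yarn application -status` command.
        out = subprocess.check_output(["yarn", "application", "-status", app_id],
                                      universal_newlines=True)
        match = re.search(r"State\s*:\s*(\w+)", out)
        return match.group(1) if match else "UNKNOWN"

    def kill_if_exists(pid):
        # Kill the mpirun process if it is still running.
        if psutil.pid_exists(pid):
            os.kill(pid, signal.SIGKILL)

    def cleanup_round(jobs):
        # One round over the MPI jobs table; each job is a dict with the keys
        # appid, pid, allocated_mem_mb and user (mirroring the database table).
        remaining = []
        for job in jobs:
            if yarn_app_state(job["appid"]) in END_STATES:
                kill_if_exists(job["pid"])                # application is done: remove stray process
            elif (psutil.pid_exists(job["pid"]) and
                  psutil.Process(job["pid"]).memory_info().rss / 1e6 > job["allocated_mem_mb"]):
                kill_if_exists(job["pid"])                # over the allocated resources: kill
                print("resource abuse by user", job["user"])
            else:
                remaining.append(job)
        return remaining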

3.4.3 Evaluation of Model Deployment
The distributed deep learning model deployment method for ring all-reduce needs to fulfill three requirements to run on a hops-hadoop cluster:

1. The user should have access to program output.

2. The resource allocation should be managed by YARN.

3. Any process created by a user should be isolated and monitored.

These requirements need to be fulfilled by all user-submitted programs on a hops-hadoop cluster.

Figure 3.2: Starting and monitoring MPI processes using the MPI wrapper module. (The figure shows the Spark driver and executors on the four nodes, the mpi processes they start, the MPI wrapper with its database and monitoring timer, and the pseudocode of the monitoring loop over the MPI jobs table.)

3.5 Data Collection

A run with a given number of batch iterations, mini-batch size, number of workers, number of parameter servers (only for TFoS), and a given model constitutes a single experiment. Every experiment is run three to five times and the average of the runs is taken as the result of the experiment. The cluster used for the experiments was isolated to remove any interference from other processes. This kept the number of experiments needed to get a result with low variance manageable. The experiments were performed using the TensorFlow benchmarks [44] with the setup given in Table 3.2.

Table 3.2: Experiment setup

                        Parameter server (TFoS)               Ring all-reduce
                        Sync Update       Async Update        (Horovod)
# workers               3-10              3-10                3-20
# parameter servers     1, 2              1                   N/A
Batch size              32/worker         32/worker           32/worker
Dataset (ImageNet)      synthetic         synthetic           synthetic
# batches               100-200           100-200             100-200

Chapter 4

Results

In this chapter the results of the experiments run with the two distributed deep learning architectures, parameter server and ring all-reduce, are presented. Furthermore, the model deployment system discussed in section 3.4.2 (Ring All-Reduce with Horovod) is assessed against the evaluation criteria given in section 3.4.3.

4.1 Parameter Server

For the parameter server experiments with TFoS, runs with one and two parameter servers and workers ranging between 3-10 and 4-10 respectively were performed. The experiments were limited to 10 workers (i.e. a single node) because distributed TensorFlow (version 1.7) only supports RDMA over Converged Ethernet (RoCE), which is not available on the cluster used for the experiments. However, some system monitoring results from a run on two nodes are reported in section 4.1.3 to show network and system usage.

4.1.1 Synchronous Update
Synchronous update, as discussed in section 2.4.1, requires all workers to update a global model before gradients can be broadcast to all workers. In the experiments with synchronous updating, one and two parameter servers were used to investigate the effect of the extra network bandwidth. Figure 4.1 shows images per second processed by different models with one and two parameter servers.


Here, the two-parameter-server runs showed slightly better performance than the single-parameter-server runs.

4.1.2 Asynchronous Update
The two models that scaled best in the previous experiment were used for comparison in the asynchronous gradient updating experiments. Figure 4.2 shows images per second processed by the inception4 and resnet50 models with one parameter server in asynchronous parameter update mode. The asynchronous parameter update mode showed no significant performance improvement over synchronous parameter update mode.

4.1.3 Multi-Node Synchronous Update
A multi-node experiment with four workers and a single parameter server is presented in this section to show the network bottleneck that prevented the parameter server experiments from using two nodes. In this experiment two workers were placed on one node and two more workers plus the parameter server on another. The system usage of this experiment is shown in Figure 4.3. This run did not complete and had to be killed after running for 36 min (as can be seen on the x-axis of Figure 4.3). The main thing to notice in Figure 4.3 is the network I/O, which clearly shows the 1Gb/s network bottleneck. All multi-node parameter server runs failed to complete. Thus the remaining parameter server experiments were forced to run on one node by decommissioning the other node (i.e. stopping the NM on that node), which forced YARN to schedule all jobs on the remaining node.
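A rough calculation makes the bottleneck plausible. The sketch below is my own illustration, not a measurement from the experiments; it assumes FP32 gradients and a resnet50-sized model of roughly 25 million parameters, with the worker placement described above.

    # Back-of-envelope estimate (illustration only) of the cross-node traffic
    # per step with synchronous updates in the setup above.
    params = 25e6                      # approximate resnet50 parameter count
    grad_mb = params * 4 / 1e6         # ~100 MB of FP32 gradients per worker per step
    remote_workers = 2                 # workers on the node without the parameter server
    link_mb_per_s = 125                # 1 Gb/s Ethernet is at most ~125 MB/s

    in_mb = grad_mb * remote_workers   # gradients sent to the parameter server
    out_mb = grad_mb * remote_workers  # updated parameters sent back
    print(f"~{in_mb:.0f} MB in and ~{out_mb:.0f} MB out over the 1 Gb/s link per step")
    print(f"communication alone takes on the order of {in_mb / link_mb_per_s:.1f} s per step")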

4.2 Ring All-Reduce

The ring all-reduce (horovod) experiments were deployed using an MPI script similar to the one shown in Listing 3, where workers 3 to 10 were placed on the first node and workers 11 to 20 on the second node. The ring formed on a single node has peer-to-peer connections between all workers, while the ring with 11 or more workers needs to traverse an InfiniBand interconnect (NET/IB), as can be seen in Figure 3.1.

Figure 4.1: Number of images processed per second with different models and numbers of workers in synchronous gradient update mode. (a) shows results using one parameter server and (b) shows results using two parameter servers.

Figure 4.2: Number of images processed per second with different models and numbers of workers in asynchronous gradient update mode.

The results from the ring all-reduce (horovod) experiments are shown in Figure 4.4. All models except vgg16 and vgg19 show improvement with each added worker. Moreover, the number of images processed per second for the same number of workers has almost doubled compared to parameter server.

4.3 Scalability

The speedup in the number of images per second processed with each added computational resource (GPU) is used here to show the scalability of the algorithms. Figure 4.5 shows the speedup in parameter server mode compared to the ideal scaling shown in green: Figure 4.5 (a) shows one parameter server with synchronous update, (b) one parameter server with asynchronous update, and (c) two parameter servers with synchronous update. The speedups for the ring all-reduce (horovod) experiments are shown in Figure 4.6.
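The speedup plotted in these figures is assumed here to follow the usual definition, i.e. the throughput relative to the single-GPU baseline,

    speedup(n) = (images/sec with n GPUs) / (images/sec with 1 GPU),

so that ideal (linear) scaling corresponds to speedup(n) = n.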

Figure 4.3: System usage in parameter server mode with multiple nodes: two workers running on Machine 1 and the parameter server along with two workers running on Machine 2. The network usage in (a) shows how the messages from the two workers to the parameter server are choking the 1Gb/s Ethernet connection. (Each panel plots CPU usage, network I/O in MB, and GPU utilization over time.)

Figure 4.4: Number of images processed per second with different models and numbers of workers with ring all-reduce (horovod).

4.4 Resource Utilization

The resource utilization is directly affected by the time spent on communication, as can be seen in Figures 4.7 and 4.8. Figure 4.7 (a) shows GPU usage close to 100% for the entire run in ring all-reduce, while (b) shows usage that bounces up and down with the network I/O. A similar trend is seen in Figure 4.8, where the GPU utilization increases when network I/O is low.

4.5 Model Deployment

The prototype built for ring all-reduce model deployment fulfilled two of the requirements, but the third involved adding an extra service. The extra service monitors the activity of the user program and performs cleanup of stray processes. The isolation requirement was met by running all processes as a system user with limited privileges.

Figure 4.5: Speedup of model training with each added worker (GPU) in parameter server mode. (a) Synchronous update mode with one parameter server, (b) asynchronous update mode with one parameter server, and (c) synchronous update mode with two parameter servers. Speedup values (y-axis) of zero mark missing data, where no experiment was done for that number of workers (x-axis).

Figure 4.6: Speedup of model training with each added worker (GPU) with ring all-reduce (horovod).

Figure 4.7: Ring all-reduce (a) and parameter server (b) system usage comparison on inception4 with 10 workers. (Each panel plots CPU usage, network I/O in MB, and GPU utilization over time.)

Figure 4.8: One (a) and two (b) parameter servers system usage for inception4 with 10 workers. (Each panel plots CPU usage, network I/O in MB, and GPU utilization over time.)

Chapter 5

Discussion and Conclusion

Training deep learning models requires big data, making it computationally intensive and time consuming. Distributing computation on multiple devices helps reduce the time needed to train deep learning models. In this paper two widely used distributed deep learning training algorithms are analyzed with respect to scalability, fault tolerance, resource utilization, and ease of deployment in a shared hadoop cluster environment. The main focus of the study was data parallelism. Experiments on the parameter server and ring all-reduce algorithms, using the TensorFlowOnSpark and horovod frameworks, were performed on a cluster with two nodes and 20 GPUs. The goal was to find limitations and bottlenecks in the algorithms when deployed on a hadoop cluster. A method for deploying distributed training using MPI was also explored and a prototype developed and tested. In this chapter the results presented in the previous chapter are analyzed and discussed, and conclusions are given on the feasibility of running distributed deep learning algorithms as-a-service in a shared cluster environment.

5.1 Scalability

From the results shown in Figures 4.5 and 4.6 it is easy to see that ring all-reduce (horovod) scales better on all models. As discussed in section 2.3, ring all-reduce is a bandwidth-optimal algorithm and this played a role in its scalability. Figure 4.7 shows the CPU usage, network I/O, and GPU utilization of inception4 with 10 workers for ring all-reduce and parameter server.


In the case of ring all-reduce the workers have a high-bandwidth P2P PCIe link, so the network I/O is not relevant. Parameter server, on the other hand, uses gRPC, and the network I/O shows how much data is being transferred between the parameter server and workers. Another thing that is noteworthy in Figure 4.7 is the GPU utilization. In ring all-reduce the utilization stays at 100% for the majority of the execution, while in parameter server it bounces up and down with the network I/O. This also suggests that parameter server is spending much more time on communication, during which the GPUs are not in use. These experiments were only done using synthetic data. If real data was used, the GPU utilization of the ring all-reduce algorithm would have been lower: when all workers try to read data from disk, a bottleneck similar to the one in the parameter server model would be created on the disk I/O. By overlapping network and disk I/O, parameter server can probably maintain similar performance, which is not possible in ring all-reduce.

5.1.1 Number of Parameter Servers
The difference in speedup between one and two parameter servers is not that significant, as can be seen in Figures 4.1 and 4.5, and can be explained by the extra network bandwidth available for workers to communicate with the parameter servers. Figure 4.8 shows a clear difference in the network I/O (in MB total) between one and two parameter servers with ten workers. The one-parameter-server run (Figure 4.8 a) has network I/O that is slightly higher than the two-parameter-servers run (Figure 4.8 b) and that lasted for the entire run. The two-parameter-servers run shows wider dips with no network I/O, which suggests that communication is taking less time. The results here only show that there is a performance gain to be made by adding parameter servers; more experiments are needed to establish a good ratio of parameter servers to workers. The added CPU core on the second parameter server might also account for some of the performance gain.

5.1.2 Asynchronous Update
Asynchronous update did not show any speedup in performance compared to synchronous update. For this experiment inception4 and resnet50 were used and the results are shown in Figures 4.2 and 4.5b. One reason for the asynchronous update performing similarly to the synchronous one can be the fact that all workers were started at the same time and have the same computational resources. Thus, even without any synchronization constraint, all workers will finish and try to update the parameter server at the same time.

5.1.3 Ring All-Reduce
The ring all-reduce experiments scaled almost linearly on some models but did not do as well for the bigger models like vgg16 and vgg19. The results for the ring all-reduce experiments are shown in Figures 4.4 and 4.6. There are two dips in the graph of Figure 4.6, when moving from 10 to 11 and from 5 to 6 workers (on the x-axis). The first one is caused by the InfiniBand link in the ring formed across machines and it is more visible on the bigger models vgg16 and vgg19. The second one, between 5 and 6 on the x-axis, is only visible on the big models and is caused by the PCIe Host Bridge (typically the CPU) between GPUs in different groups (see section 3.2.1 for GPU placement). As discussed in section 2.4.4, the speed of the ring all-reduce algorithm is limited by the slowest communication link between neighbors, which can be seen clearly in this experiment. This shows that for big models even a 40Gb/s (32Gb/s effective bandwidth) link can be inadequate. Figure 5.1 shows the data sent via InfiniBand for inception4, which was least affected by the multi-node ring, and vgg16, which was affected the most. For vgg16, between 1000MB and 1700MB of total traffic, both in and out, is registered every second for the entire run. This is a clear sign that the processing power of the GPU is substantially outstripping the capabilities of the network I/O (InfiniBand + PCIe) on big models (i.e. models with a large number of parameters compared to the number of operations).
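To put the vgg16 numbers in perspective, a rough estimate can be made from the publicly known parameter count of vgg16 (about 138 million) and the fact that a bandwidth-optimal ring all-reduce sends roughly 2*(N-1)/N bytes per byte of gradients per worker. The Python sketch below is my own illustration, not a measurement from the experiments, but it lands in the same order of magnitude as the 1000MB to 1700MB per second observed on the InfiniBand link.

    # Rough estimate (illustration only) of per-worker ring all-reduce traffic
    # for vgg16, assuming ~138 million FP32 parameters and the 2*(N-1)/N bytes
    # sent per byte of gradients in a bandwidth-optimal ring all-reduce.
    params = 138e6                              # approximate vgg16 parameter count
    grad_mb = params * 4 / 1e6                  # ~552 MB of FP32 gradients
    workers = 11

    sent_per_worker_mb = 2 * (workers - 1) / workers * grad_mb
    print(f"~{grad_mb:.0f} MB of gradients per step")
    print(f"~{sent_per_worker_mb:.0f} MB sent per worker per all-reduce step")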

Figure 5.1: InfiniBand usage (in MB) for inception4 and vgg16 with 11 workers, using the ring all-reduce algorithm (horovod). (a) Machine 1 running workers 1 to 10 and (b) Machine 2 running worker 11.

5.2 Fault Tolerance

Neither ring all-reduce nor parameter server with synchronous update is fault tolerant, i.e. if one worker fails the job cannot recover. On the other hand, parameter server with asynchronous update can cope with the loss of a worker because workers update model gradients independently. This makes it more suitable for a dynamic and shared cluster environment where workers can be killed or die due to resource starvation. Most implementations of deep learning training checkpoint progress and can recover from a checkpoint after failure. YARN and Spark both have support for re-submitting failed jobs by setting a property, which controls the number of retries, when starting an application. So by checkpointing training progress and using the retry capability of YARN and Spark, some level of fault tolerance can be achieved even when synchronous training methods are used.
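As a minimal sketch (assuming the training job is submitted through PySpark; the value 3 is only an example), the retry property can be set on the Spark configuration when the application is created. The effective number of attempts is also capped by the cluster-side yarn.resourcemanager.am.max-attempts setting.

    from pyspark import SparkConf, SparkContext

    # Ask YARN to re-attempt the application a few times after a failure; the
    # training code is then expected to restore from its latest checkpoint.
    conf = (SparkConf()
            .setAppName("distributed-training")
            .set("spark.yarn.maxAppAttempts", "3"))

    sc = SparkContext(conf=conf)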

5.3 Resource Utilization

GPUs are a valuable resource on a cluster, but cannot be shared by multiple jobs; even when a GPU is underutilized, the job that reserved it might be using all available memory on the GPU. This makes underutilized GPUs wasted computational resources. As Figures 4.7 and 4.8 show, the available communication bandwidth affects the GPU utilization considerably, and thus should be the main consideration when training is distributed. However, the communication bandwidth required for optimal utilization of the GPUs is not only determined by the size of the model to be trained. Other factors like the number of floating point operations needed to compute the gradients and the computational power of the GPUs also determine the amount of data communicated, and consequently the bandwidth required. Therefore, when configuring a cluster for distributed training, the model parameter size and floating point operations, the GPU power, the network bandwidth, and the disk I/O need to be examined to maximize GPU utilization.

5.4 Model Deployment

The MPI wrapper prototype built and used to deploy deep learning training with horovod works for the purpose it was built for, but it adds an extra point of failure and an extra service to maintain in a system that already contains many services and is complicated. Being able to run MPI processes inside a YARN container would eliminate the need for an extra system to monitor MPI processes. This would also give stronger process isolation.

5.5 Possible Sources of Error

The results reported in this paper did not account for all factors that can affect the performance of distributed training. Disk I/O is one such factor that was not considered, but it can have a considerable impact on the scalability of the architectures. In particular, disk I/O will have more impact on ring all-reduce, mainly because there are no other considerable bottlenecks in that architecture. To eliminate any interference from other processes, all experiments were performed in an isolated cluster environment. This helped in getting results with low variance in fewer experiments (all results with standard deviation can be seen in Appendix A). However, this also made the results less relevant to a real-world dynamic cluster environment.

5.6 Conclusion

Distributing deep learning training did improve performance in most of the experiments done for this work. Although there is a clear difference in performance between the algorithms, there is no one method that is better than the others in all aspects. Ring all-reduce scales best in all tests, but parameter server is easier to deploy in a hadoop cluster, and parameter server with asynchronous update mode is more fault tolerant and thus well suited for a dynamic environment. The choice of distributed deep learning algorithm should take into consideration:

• The size of the model to be trained and the resources available.

For example, to train a big model like vgg16 the network bandwidth, PCIe bus speed, and GPU placement need to be considered, but for a smaller one we might only need to consider the network bandwidth.

• The trade-off between scalability and fault tolerance. If the cluster in use is not shared, fault tolerance might not be as important to consider and we can choose a scalable algorithm like ring all-reduce. On the other hand, in a shared dynamic environment workers might run slow or even die because of resource starvation or network congestion. In these situations fault tolerance may be more desirable.

• When it comes to deployment, both algorithms use a well-known framework (Spark) that is relatively easy to use, but ring all-reduce involves extra services like MPI and the MPI wrapper. This might require additional effort for installation and maintenance.

These conclusions are based on the limited number of experiments that were performed on a single cluster configuration. Furthermore, not all the variables that can affect the performance were included in the experiments (e.g. disk I/O). Despite these limitations, this work contributes by identifying possible bottlenecks and pointing out considerations to be made when training is distributed.

5.7 Future Work

This work only explored the effects of communication on distributed deep learning training; the effects of disk I/O have not been considered. In the future, disk I/O with different disk types, distributed file systems, and reading methods like Spark or TensorFlow readers can be investigated. The ratio of parameter servers to workers is something that needs more experimentation but was not considered in this work. It was shown in this paper that adding parameter servers can improve performance. As future work this ratio can be studied for different network configurations. Parameter server with asynchronous update was only tested with workers running on devices with the same computational power and that were started simultaneously.

It would be interesting to investigate the performance improvement if workers were to run on devices with different power and/or were started out of sync. The experiments did not include model parallelism, which is interesting for big models that cannot fit in the memory of a single GPU. The effects of communication, disk I/O, device placement, and scheduling on model parallelism could be interesting to look into in the future. Model deployment for MPI processes uses simple resource monitoring on top of an existing well-tested YARN resource manager. Thus, a method that would allow MPI processes to run inside YARN containers is worth investigating. Finally, by performing similar experiments on different hardware setups, clear guidelines can be developed. These guidelines could be used to predict the expected speedup and resource utilization of a distributed deep learning training given as input: a hardware setup, model size, and the operational complexity of a model.

Bibliography

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. J. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Józefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. G. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. A. Tucker, V. Vanhoucke, V. Vasudevan, F. B. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. "TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems". In: Computing Research Repository abs/1603.04467 (2016). arXiv: 1603.04467. URL: http://arxiv.org/abs/1603.04467.
[2] A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. J. Smola. "Scalable Inference in Latent Variable Models". In: Proceedings of the Fifth ACM International Conference on Web Search and Data Mining. WSDM '12. Seattle, Washington, USA: ACM, 2012, pp. 123-132. ISBN: 978-1-4503-0747-5. DOI: 10.1145/2124295.2124312. URL: http://doi.acm.org/10.1145/2124295.2124312.
[3] T. Akiba, S. Suzuki, and K. Fukuda. "Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes". In: Computing Research Repository abs/1711.04325 (2017). arXiv: 1711.04325. URL: http://arxiv.org/abs/1711.04325.
[4] G. M. Amdahl. "Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities". In: Proceedings of the April 18-20, 1967, Spring Joint Computer Conference. AFIPS '67 (Spring). Atlantic City, New Jersey: ACM, 1967, pp. 483-485. DOI: 10.1145/1465482.1465560. URL: http://doi.acm.org/10.1145/1465482.1465560.
[5] R. Andersson. "GPU integration for Deep Learning on YARN". MA thesis. KTH, School of Information and Communication Technology (ICT), 2017.
[6] B. Baker, O. Gupta, N. Naik, and R. Raskar. "Designing Neural Network Architectures using Reinforcement Learning". In: Computing Research Repository abs/1611.02167 (2016). arXiv: 1611.02167. URL: http://arxiv.org/abs/1611.02167.
[7] L. Bottou. "Stochastic gradient learning in neural networks". In: Proceedings of Neuro-Nîmes 91.8 (1991), p. 0.
[8] T. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman. "Project Adam: Building an Efficient and Scalable Deep Learning Training System". In: 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). Broomfield, CO: USENIX Association, 2014, pp. 571-582. ISBN: 978-1-931971-16-4. URL: https://www.usenix.org/conference/osdi14/technical-sessions/presentation/chilimbi.
[9] F. Chollet. keras. https://github.com/fchollet/keras. 2015.
[10] D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber. "Deep Big Simple Neural Nets Excel on Handwritten Digit Recognition". In: Computing Research Repository abs/1003.0358 (2010). arXiv: 1003.0358. URL: http://arxiv.org/abs/1003.0358.
[11] A. Coates, A. Ng, and H. Lee. "An analysis of single-layer networks in unsupervised feature learning". In: Proceedings of the fourteenth international conference on artificial intelligence and statistics. 2011, pp. 215-223.
[12] R. Collobert, K. Kavukcuoglu, and C. Farabet. "Torch7: A Matlab-like Environment for Machine Learning". In: BigLearn, NIPS Workshop. 2011, pp. 1-6. URL: http://publications.idiap.ch/downloads/papers/2011/Collobert_NIPSWORKSHOP_2011.pdf.
[13] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. aurelio Ranzato, A. Senior, P. Tucker, K. Yang, Q. V. Le, and A. Y. Ng. "Large Scale Distributed Deep Networks". In: Advances in Neural Information Processing Systems 25. Ed. by F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger. Curran Associates, Inc., 2012, pp. 1223-1231. URL: http://papers.nips.cc/paper/4687-large-scale-distributed-deep-networks.pdf.
[14] J. Dean and S. Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters". In: Commun. ACM 51.1 (Jan. 2008), pp. 107-113. ISSN: 0001-0782. DOI: 10.1145/1327452.1327492. URL: http://doi.acm.org/10.1145/1327452.1327492.
[15] E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine, R. H. Castain, D. J. Daniel, R. L. Graham, and T. S. Woodall. "Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation". In: Proceedings, 11th European PVM/MPI Users' Group Meeting. Budapest, Hungary, 2004, pp. 97-104.
[16] P. Goyal, P. Dollár, R. B. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour". In: Computing Research Repository abs/1706.02677 (2017). arXiv: 1706.02677. URL: http://arxiv.org/abs/1706.02677.
[17] S. Gupta, W. Zhang, and F. Wang. "Model Accuracy and Runtime Tradeoff in Distributed Deep Learning: A Systematic Study". In: Proceedings of the 26th International Joint Conference on Artificial Intelligence. IJCAI'17. Melbourne, Australia: AAAI Press, 2017, pp. 4854-4858. ISBN: 978-0-9992411-0-3. URL: http://dl.acm.org/citation.cfm?id=3171837.3171972.
[18] A. Halevy, P. Norvig, and F. Pereira. "The Unreasonable Effectiveness of Data". In: IEEE Intelligent Systems 24.2 (2009), pp. 8-12. ISSN: 1541-1672. DOI: 10.1109/MIS.2009.36.
[19] K. He, X. Zhang, S. Ren, and J. Sun. "Deep Residual Learning for Image Recognition". In: Computing Research Repository abs/1512.03385 (2015). arXiv: 1512.03385. URL: http://arxiv.org/abs/1512.03385.
[20] Q. Ho, J. Cipar, H. Cui, J. K. Kim, S. Lee, P. B. Gibbons, G. A. Gibson, G. R. Ganger, and E. P. Xing. "More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server". In: Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 1. NIPS'13. Lake Tahoe, Nevada: Curran Associates Inc., 2013, pp. 1223-1231. URL: http://dl.acm.org/citation.cfm?id=2999611.2999748.
[21] Q. Ho, J. Cipar, H. Cui, S. Lee, J. K. Kim, P. B. Gibbons, G. A. Gibson, G. Ganger, and E. P. Xing. "More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server". In: Advances in Neural Information Processing Systems 26. Ed. by C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger. Curran Associates, Inc., 2013, pp. 1223-1231. URL: http://papers.nips.cc/paper/4894-more-effective-distributed-ml-via-a-stale-synchronous-parallel-parameter-server.pdf.
[22] K. Hornik. "Approximation capabilities of multilayer feedforward networks". In: Neural Networks 4.2 (1991), pp. 251-257. ISSN: 0893-6080. DOI: https://doi.org/10.1016/0893-6080(91)90009-T. URL: http://www.sciencedirect.com/science/article/pii/089360809190009T.
[23] M. Jaderberg, V. Dalibard, S. Osindero, W. M. Czarnecki, J. Donahue, A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan, C. Fernando, and K. Kavukcuoglu. "Population Based Training of Neural Networks". In: Computing Research Repository abs/1711.09846 (2017). arXiv: 1711.09846. URL: http://arxiv.org/abs/1711.09846.
[24] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. B. Girshick, S. Guadarrama, and T. Darrell. "Caffe: Convolutional Architecture for Fast Feature Embedding". In: Computing Research Repository abs/1408.5093 (2014). arXiv: 1408.5093. URL: http://arxiv.org/abs/1408.5093.
[25] A. Krizhevsky. "One weird trick for parallelizing convolutional neural networks". In: Computing Research Repository abs/1404.5997 (2014). arXiv: 1404.5997. URL: http://arxiv.org/abs/1404.5997.
[26] A. Krizhevsky, I. Sutskever, and G. E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks". In: Advances in Neural Information Processing Systems 25. Ed. by F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger. Curran Associates, Inc., 2012, pp. 1097-1105. URL: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.
[27] Q. V. Le, R. Monga, M. Devin, G. Corrado, K. Chen, M. Ranzato, J. Dean, and A. Y. Ng. "Building high-level features using large scale unsupervised learning". In: Computing Research Repository abs/1112.6209 (2011). arXiv: 1112.6209. URL: http://arxiv.org/abs/1112.6209.
[28] Q. V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, and A. Y. Ng. "On optimization methods for deep learning". In: Proceedings of the 28th International Conference on International Conference on Machine Learning. Omnipress. 2011, pp. 265-272.
[29] Y. Lee, S. Jun, C. Bobbie, F. Andy, and Yahoo Big ML team. TensorFlowOnSpark. https://github.com/yahoo/TensorFlowOnSpark. 2016.
[30] M. Li, D. G. Anderson, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su. "Scaling Distributed Machine Learning with the Parameter Server". In: Operating Systems Design and Implementation (OSDI). 2014, pp. 583-598.
[31] H. Ma, F. Mao, and G. W. Taylor. "Theano-MPI: a Theano-based Distributed Training Framework". In: Computing Research Repository abs/1605.08325 (2016). arXiv: 1605.08325. URL: http://arxiv.org/abs/1605.08325.
[32] R. Mayer, C. Mayer, and L. Laich. "The TensorFlow Partitioning and Scheduling Problem: It's the Critical Path!" In: Computing Research Repository abs/1711.01912 (2017). arXiv: 1711.01912. URL: http://arxiv.org/abs/1711.01912.
[33] A. Mirhoseini, H. Pham, Q. V. Le, B. Steiner, R. Larsen, Y. Zhou, N. Kumar, M. Norouzi, S. Bengio, and J. Dean. "Device Placement Optimization with Reinforcement Learning". In: Computing Research Repository abs/1706.04972 (2017). arXiv: 1706.04972. URL: http://arxiv.org/abs/1706.04972.
[34] B. C. Ooi, K.-L. Tan, S. Wang, W. Wang, Q. Cai, G. Chen, J. Gao, Z. Luo, A. K. H. Tung, Y. Wang, Z. Xie, M. Zhang, and K. Zheng. "SINGA: A Distributed Deep Learning Platform". In: ACM Multimedia. 2015.
[35] F. Pellegrini. "Distillating knowledge about SCOTCH". In: Combinatorial Scientific Computing. Ed. by U. Naumann, O. Schenk, H. D. Simon, and S. Toledo. Dagstuhl Seminar Proceedings 09061. Dagstuhl, Germany: Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, Germany, 2009. URL: http://drops.dagstuhl.de/opus/volltexte/2009/2091.
[36] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. "Learning Internal Representations by Error Propagation". In: Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1 (1986), pp. 318-362. URL: http://dl.acm.org/citation.cfm?id=104279.104293.
[37] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F. Li. "ImageNet Large Scale Visual Recognition Challenge". In: Computing Research Repository abs/1409.0575 (2014). arXiv: 1409.0575. URL: http://arxiv.org/abs/1409.0575.
[38] A. Sergeev and M. D. Balso. "Horovod: fast and easy distributed deep learning in TensorFlow". In: Computing Research Repository abs/1802.05799 (2018). arXiv: 1802.05799. URL: http://arxiv.org/abs/1802.05799.
[39] S. Shams, R. Platania, K. Lee, and S. J. Park. "Evaluation of Deep Learning Frameworks Over Different HPC Architectures". In: 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS). 2017, pp. 1389-1396. DOI: 10.1109/ICDCS.2017.259.
[40] S. Shi and X. Chu. "Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs". In: Computing Research Repository abs/1711.05979 (2017). arXiv: 1711.05979. URL: http://arxiv.org/abs/1711.05979.
[41] K. Simonyan and A. Zisserman. "Very Deep Convolutional Networks for Large-Scale Image Recognition". In: Computing Research Repository abs/1409.1556 (2014). arXiv: 1409.1556. URL: http://arxiv.org/abs/1409.1556.
[42] A. Smola and S. Narayanamurthy. "An Architecture for Parallel Topic Models". In: Proc. VLDB Endow. 3.1-2 (Sept. 2010), pp. 703-710. ISSN: 2150-8097. DOI: 10.14778/1920841.1920931. URL: http://dx.doi.org/10.14778/1920841.1920931.
[43] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. "Rethinking the Inception Architecture for Computer Vision". In: Computing Research Repository abs/1512.00567 (2015). arXiv: 1512.00567. URL: http://arxiv.org/abs/1512.00567.
[44] TensorFlow benchmarks. https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks. 2018.
[45] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, B. Saha, C. Curino, O. O'Malley, S. Radia, B. Reed, and E. Baldeschwieler. "Apache Hadoop YARN: Yet Another Resource Negotiator". In: Proceedings of the 4th Annual Symposium on Cloud Computing. SOCC '13. Santa Clara, California: ACM, 2013, 5:1-5:16. ISBN: 978-1-4503-2428-1. DOI: 10.1145/2523616.2523633. URL: http://doi.acm.org/10.1145/2523616.2523633.
[46] Welcome to Apache Hadoop!. URL: http://hadoop.apache.org/ (visited on 01/30/2018).
[47] M. Zaharia, R. S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, M. J. Franklin, A. Ghodsi, J. Gonzalez, S. Shenker, and I. Stoica. "Apache Spark: A Unified Engine for Big Data Processing". In: Commun. ACM 59.11 (Oct. 2016), pp. 56-65. ISSN: 0001-0782. DOI: 10.1145/2934664. URL: http://doi.acm.org/10.1145/2934664.
[48] S. Zheng, Q. Meng, T. Wang, W. Chen, N. Yu, Z. Ma, and T. Liu. "Asynchronous Stochastic Gradient Descent with Delay Compensation for Distributed Deep Learning". In: Computing Research Repository abs/1609.08326 (2016). arXiv: 1609.08326. URL: http://arxiv.org/abs/1609.08326.
[49] M. A. Zinkevich, M. Weimer, A. Smola, and L. Li. "Parallelized Stochastic Gradient Descent". In: Proceedings of the 23rd International Conference on Neural Information Processing Systems - Volume 2. NIPS'10. Vancouver, British Columbia, Canada: Curran Associates Inc., 2010, pp. 2595-2603. URL: http://dl.acm.org/citation.cfm?id=2997046.2997185.
[50] L. Zou. Bringing HPC Techniques to Deep Learning. Feb. 2017. URL: http://research.baidu.com/bringing-hpc-techniques-deep-learning/ (visited on 03/03/2018).

Appendix A

Complete Results

The average images/sec results shown in the histograms are presented here in tabular form with the standard deviation. Models that were not included in the results, because their high images/sec values made the plots hard to read, are also presented here.

Table A.1: Average Images/sec of 3-5 runs with standard deviation for each model with parameter server in synchronous update mode (TFoS).

(each cell gives images/sec with the standard deviation in parentheses)

# workers | AlexNet          | VGG16         | VGG19         | Resnet50      | Resnet152      | googlenet        | inception4
1         | 1432.81 (0.00)   | 128.90 (0.00) | 109.85 (0.00) | 188.56 (0.00) |  82.03 (0.00)  |  440.02 (0.00)   |  63.80 (0.00)
2         |  105.27 (0.15)   |   0.0 (0.0)   |  27.34 (9.27) |   0.0 (0.0)   | 119.67 (0.00)  |  699.19 (0.00)   | 104.86 (0.00)
3         |  131.81 (28.11)  |  67.86 (1.43) |  64.88 (1.50) | 355.68 (0.99) | 150.39 (4.11)  |  944.22 (15.99)  | 145.05 (0.41)
4         |  120.27 (20.76)  |  84.06 (0.20) |  80.03 (1.77) | 402.40 (0.43) | 167.64 (6.72)  | 1123.82 (28.62)  | 178.42 (0.34)
5         |  136.27 (22.35)  |  95.79 (0.09) |  91.41 (1.47) | 432.12 (3.67) | 179.95 (8.67)  | 1243.83 (29.63)  | 194.71 (7.20)
6         |  148.18 (23.77)  | 105.66 (1.36) | 102.64 (1.80) | 455.15 (1.86) | 187.93 (9.71)  | 1306.78 (45.10)  | 211.75 (8.94)
7         |  155.85 (24.44)  | 107.85 (1.71) | 106.27 (3.59) | 461.55 (5.68) | 196.67 (11.76) | 1361.72 (41.84)  | 215.75 (12.75)
8         |  260.30 (0.00)   | 114.73 (3.38) | 111.24 (3.13) | 471.17 (2.39) | 193.01 (10.50) | 1393.88 (52.32)  | 227.42 (11.03)
9         |  153.94 (23.26)  | 116.74 (1.66) | 114.03 (1.63) | 457.71 (6.06) | 185.67 (8.43)  | 1308.25 (81.43)  | 222.10 (10.06)
10        |  153.66 (4.43)   | 116.34 (2.49) | 114.34 (0.72) | 466.43 (3.90) | 185.43 (2.75)  | 1370.13 (7.31)   | 221.94 (1.66)


Table A.2: Average Images/sec of 3-5 runs with standard deviation for each model with two parameter servers in synchronous update mode (TFoS).

(each cell gives images/sec with the standard deviation in parentheses)

# workers | AlexNet         | VGG16         | VGG19         | Resnet50      | Resnet152      | googlenet        | inception4
1         | 1432.81 (0.00)  | 128.90 (0.00) | 109.85 (0.00) | 188.56 (0.00) |  82.03 (0.00)  |  440.02 (0.00)   |  63.80 (0.00)
3         |  164.08 (0.00)  |   0.0 (0.0)   |  64.80 (0.20) |   0.0 (0.0)   | 178.81 (0.00)  | 1020.55 (0.00)   | 155.45 (0.00)
4         |  211.52 (2.84)  |  85.00 (1.08) |  68.15 (3.24) | 455.91 (3.13) | 194.71 (7.60)  | 1249.83 (0.00)   | 187.90 (1.58)
5         |  236.42 (4.22)  |  95.77 (1.29) |  77.44 (4.11) | 489.91 (4.18) | 207.11 (10.20) | 1343.26 (33.11)  | 216.91 (3.06)
6         |  257.29 (3.79)  | 104.80 (0.61) |  85.99 (3.92) | 512.80 (5.59) | 215.97 (10.74) | 1432.41 (32.33)  | 238.84 (4.18)
7         |  266.63 (5.08)  | 110.76 (5.31) |  93.03 (3.56) | 524.54 (2.43) | 217.64 (11.08) | 1482.79 (44.52)  | 249.46 (7.26)
8         |  274.71 (3.98)  | 108.08 (0.71) | 105.37 (8.02) | 532.28 (0.59) | 221.90 (7.86)  | 1455.12 (67.11)  | 263.29 (8.74)
9         |    0.0 (0.0)    |  96.72 (0.82) | 114.39 (0.08) | 526.31 (5.90) | 216.96 (1.19)  | 1451.30 (3.70)   | 260.49 (0.31)
10        |  276.99 (4.77)  |  97.71 (0.02) | 115.59 (0.09) | 529.85 (1.69) | 217.35 (0.17)  | 1458.97 (3.01)   | 264.38 (2.20)

Table A.3: Average Images/sec of 3-5 runs with standard deviation for each model with one parameter server in asynchronous update mode (TFoS).

(each cell gives images/sec with the standard deviation in parentheses)

# workers | AlexNet         | VGG16         | Resnet50      | inception4
1         | 1432.81 (0.00)  | 128.90 (0.00) | 188.56 (0.00) |  63.80 (0.00)
3         |  166.77 (0.26)  |  70.71 (0.13) | 349.65 (0.52) | 139.60 (0.10)
4         |  207.14 (0.30)  |  87.68 (0.18) | 395.99 (0.88) | 172.60 (0.37)
5         |  217.33 (0.19)  |  96.33 (0.07) | 449.37 (0.85) | 198.29 (0.30)
6         |  248.82 (0.30)  | 109.54 (0.02) | 455.33 (1.11) | 219.02 (0.35)
7         |  266.32 (0.24)  | 113.63 (0.11) | 470.66 (0.63) | 227.57 (0.36)
8         |  246.53 (0.19)  | 112.33 (0.15) | 463.88 (0.80) | 227.75 (0.29)
9         |  272.31 (0.33)  | 113.23 (0.22) | 481.72 (0.67) | 223.26 (0.36)
10        |  283.62 (0.27)  | 115.62 (0.25) | 484.82 (0.57) | 225.68 (0.23)

Table A.4: Average Images/sec of 3-5 runs with standard deviation for each model with ring all-reduce (horovod).

(each cell gives images/sec with the standard deviation in parentheses)

# workers | AlexNet           | VGG16          | VGG19          | Resnet50        | Resnet152       | googlenet         | inception4
1         | 1432.81 (0.00)    | 128.90 (0.00)  | 109.85 (0.00)  |  188.56 (0.00)  |   82.03 (0.00)  |  440.02 (0.00)    |  63.80 (0.00)
2         |    0.0 (0.0)      |   0.0 (0.0)    |   0.0 (0.0)    |    0.0 (0.0)    |    0.0 (0.0)    |    0.0 (0.0)      |   0.0 (0.0)
3         | 2007.18 (1.89)    | 335.42 (0.41)  | 294.34 (0.62)  |  504.38 (0.61)  |  224.90 (0.42)  | 1149.66 (6.22)    | 178.95 (0.25)
4         | 2414.63 (8.72)    | 439.02 (0.63)  | 387.88 (0.70)  |  647.68 (0.71)  |  293.05 (0.50)  | 1497.01 (7.08)    | 231.73 (0.57)
5         | 3039.77 (5.68)    | 543.53 (0.95)  | 482.45 (0.61)  |  784.26 (1.92)  |  358.87 (0.42)  | 1816.52 (4.24)    | 283.13 (0.32)
6         | 2717.42 (8.20)    | 499.80 (1.08)  | 436.19 (0.99)  |  908.32 (2.95)  |  399.34 (0.63)  | 2113.93 (12.79)   | 330.07 (1.28)
7         | 3099.85 (5.31)    | 582.00 (1.60)  | 506.65 (1.70)  | 1050.55 (1.62)  |  460.18 (0.56)  | 2481.44 (10.41)   | 383.24 (0.24)
8         | 3487.69 (4.58)    | 670.69 (1.16)  | 584.85 (2.46)  | 1188.63 (1.98)  |  523.12 (1.93)  | 2826.91 (15.74)   | 433.26 (0.62)
9         | 3879.63 (13.60)   | 762.74 (0.61)  | 668.27 (0.47)  | 1318.68 (4.28)  |  584.56 (0.87)  | 3125.01 (5.76)    | 481.81 (0.94)
10        | 4746.46 (168.26)  | 848.39 (1.12)  | 751.36 (0.72)  | 1443.92 (2.54)  |  639.42 (0.97)  | 3461.96 (14.83)   | 527.54 (1.05)
11        | 1234.02 (12.97)   | 459.20 (7.82)  | 453.44 (5.24)  | 1379.67 (5.42)  |  650.78 (0.58)  | 3300.61 (47.66)   | 556.51 (2.84)
12        | 1276.70 (66.53)   | 497.18 (9.45)  | 489.14 (6.99)  | 1520.42 (5.44)  |  709.47 (0.41)  | 3554.50 (31.63)   | 606.54 (2.02)
13        | 1434.08 (7.19)    | 535.77 (12.78) | 521.51 (8.08)  | 1652.02 (6.88)  |  766.07 (1.53)  | 3705.32 (108.70)  | 657.27 (1.30)
14        | 1485.50 (78.80)   | 587.30 (15.31) | 565.86 (3.83)  | 1771.81 (3.32)  |  823.95 (3.07)  | 4052.64 (105.83)  | 706.63 (2.52)
15        | 1616.53 (11.30)   | 618.42 (9.02)  | 587.61 (8.59)  | 1890.97 (5.72)  |  875.53 (3.78)  | 4250.37 (116.31)  | 753.88 (2.76)
16        | 1695.17 (18.24)   | 629.99 (8.87)  | 607.07 (4.36)  | 2024.21 (4.60)  |  934.79 (2.33)  | 4489.39 (230.90)  | 804.55 (2.91)
17        | 1805.59 (14.00)   | 640.09 (19.58) | 606.11 (6.86)  | 2152.22 (3.95)  |  990.19 (1.73)  | 4826.58 (149.50)  | 852.91 (1.67)
18        | 1915.58 (6.32)    | 672.51 (7.51)  | 625.13 (2.40)  | 2278.12 (4.39)  | 1039.37 (3.50)  | 5056.34 (156.32)  | 899.75 (2.38)
19        | 1997.52 (6.99)    | 688.86 (6.19)  | 649.60 (3.51)  | 2380.86 (5.29)  | 1095.00 (2.35)  | 5257.19 (216.22)  | 946.83 (1.63)
20        | 2397.04 (70.11)   | 725.06 (3.34)  | 668.09 (6.68)  | 2491.28 (11.02) | 1145.63 (3.42)  | 5718.07 (114.83)  | 992.43 (2.98)