DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS STOCKHOLM, SWEDEN 2018

Analysis and Comparison of Distributed Training Techniques for Deep Neural Networks in a Dynamic Environment

ERMIAS GEBREMESKEL

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Master in Computer Science
Date: June 26, 2018
Supervisors: Håkan Lane, Jim Dowling, and Robin Andersson
Examiner: Örjan Ekeberg
Swedish title: Analys och jämförelse av distribuerade träningstekniker för djupa neurala nätverk i en dynamisk miljö
School of Computer Science and Communication


Abstract

Deep learning models’ prediction accuracy tends to improve with the size of the model. The implication is that the amount of computational power needed to train models is continuously increasing. Distributed training tries to address this issue by spreading the computational load onto several devices. In theory, distributing computation onto N devices should give an N-fold performance improvement, yet in reality the improvement is rarely N-fold, due to communication and other overheads. This thesis studies the communication overhead incurred when distributing deep learning training. Hopsworks is a platform designed for data science. The purpose of this work is to explore a feasible way of deploying distributed deep learning training on a shared cluster and to analyze the performance of different distributed deep learning algorithms to be used on this platform. The findings of this study show that bandwidth-optimal communication algorithms like ring all-reduce scale better than many-to-one communication algorithms like parameter server, but are less fault tolerant. Furthermore, the system usage statistics collected revealed a network bottleneck when training is distributed over multiple machines. This work also shows that it is possible to run MPI on a Hadoop cluster, by building a prototype that orchestrates resource allocation, deployment, and monitoring of MPI-based training jobs. Even though the experiments did not cover different cluster configurations, the results are still relevant in showing what considerations need to be made when distributing deep learning training.

Keywords: deep learning, large scale distributed deep learning, data parallelism.

Sammanfattning

Träffsäkerheten hos djupinlärningsmodeller tenderar att förbättras i relation med storleken på modellen. Implikationen blir att mängden beräkningskraft som krävs för att träna modeller ökar kontinuerligt. Distribuerad djupinlärning försöker lösa detta problem genom att distribuera beräkningsbelastningen på flera enheter. Att distribuera beräkningarna på N enheter skulle i teorin innebära en linjär skalbarhet (xN). I verkligheten stämmer detta sällan på grund av overhead från nätverkskommunikation eller I/O. Hopsworks är en dataanalys- och maskininlärningsplattform. Syftet med detta arbete är att utforska ett möjligt sätt att utföra distribuerad djupinlärningsträning på ett delat datorkluster, samt analysera prestandan hos olika algoritmer för distribuerad djupinlärning att använda i plattformen. Resultaten i denna studie visar att nätverksoptimala algoritmer såsom ring all-reduce skalar bättre för distribuerad djupinlärning än många-till-en-kommunikationsalgoritmer såsom parameter server, men är inte lika feltoleranta. Insamlad data från experimenten visade på en flaskhals i nätverket vid träning på flera maskiner. Detta arbete visar även att det är möjligt att exekvera MPI-program på ett hadoopkluster genom att bygga en prototyp som orkestrerar resursallokering, distribution och övervakning av exekvering. Trots att experimenten inte täcker olika klusterkonfigurationer så visar resultaten på vilka faktorer som bör tas hänsyn till vid distribuerad träning av djupinlärningsmodeller.

Contents

1 Introduction
  1.1 Research Question
  1.2 Scope
  1.3 Sustainability and Relevance

2 Background
  2.1 Training Neural Networks
    2.1.1 Stochastic Gradient Descent (SGD)
  2.2 Distributed Training in Deep Learning
    2.2.1 Distributing SGD using MapReduce
  2.3 Algorithms for Collective Communication
    2.3.1 Message Passing Interface (MPI)
  2.4 Data Parallelism
    2.4.1 Synchronous SGD
    2.4.2 Asynchronous SGD
    2.4.3 Parameter Server
    2.4.4 Ring All-Reduce
  2.5 Model Parallelism
    2.5.1 Partitioning Neural Network Model Graphs
    2.5.2 Device Placement Optimization
  2.6 Hybrid Data and Model Parallelism
  2.7 TensorFlow
    2.7.1 Distributed TensorFlow
  2.8 Resource Management in Hops-YARN
  2.9 Spark on YARN
    2.9.1 TensorFlow On Spark
  2.10 Evaluation
  2.11 Related Work

3 Method
  3.1 Datasets and Models
  3.2 Cluster Setups
    3.2.1 Hardware Specification
    3.2.2 Distributed Deep Learning and Big Data Frameworks
  3.3 Experiment Design
    3.3.1 Batch Size
    3.3.2 Number of Workers
    3.3.3 System Monitoring
  3.4 Model Deployment
    3.4.1 Parameter Server with TensorFlowOnSpark
    3.4.2 Ring All-Reduce with Horovod
    3.4.3 Evaluation of Model Deployment
  3.5 Data Collection

4 Results
  4.1 Parameter Server
    4.1.1 Synchronous Update
    4.1.2 Asynchronous Update
    4.1.3 Multi-Node Synchronous Update
  4.2 Ring All-Reduce
  4.3 Scalability
  4.4 Resource Utilization
  4.5 Model Deployment

5 Discussion and Conclusion
  5.1 Scalability
    5.1.1 Number of Parameter Servers
    5.1.2 Asynchronous Update
    5.1.3 Ring All-Reduce
  5.2 Fault Tolerance
  5.3 Resource Utilization
  5.4 Model Deployment
  5.5 Possible Sources of Error
  5.6 Conclusion
  5.7 Future Work

Bibliography

A Complete Results

List of Figures

2.1 Parameter server model
2.2 Model parallelism
3.1 Cluster setup
3.2 Starting and monitoring MPI
4.1 Processed images/sec using parameter server sync update mode
4.2 Processed images/sec using parameter server async update mode
4.3 System usage parameter server multi-node
4.4 Processed images/sec using ring all-reduce
4.5 Speedup with parameter server
4.6 Speedup with ring all-reduce
4.7 Ring all-reduce and parameter server system usage
4.8 One and two parameter servers system usage
5.1 InfiniBand usage in ring all-reduce

List of Tables

3.1 Models used in experiments
3.2 Experiment setup
A.1 Parameter server sync update results
A.2 Two parameter servers sync update results
A.3 Parameter servers async update results
A.4 Ring all-reduce results

List of Listings

1 Sample cluster specification
2 Sample mpirun hostfile
3 Sample mpirun script
4 System usage commands

Chapter 1

Introduction

Statistical machine learning initially operated on handcrafted features extracted from datasets by human experts. The handcrafted features were then used to train a model using methods like maximum likelihood, Support Vector Machines, k-Nearest Neighbors, k-means, decision trees, and regression algorithms. This approach works for small tasks where the informative features of a dataset are easy to identify. Unfortunately, identifying informative features for real-world problems like image recognition, speech recognition, and natural language processing is extremely difficult. Deep neural networks attempt to solve this problem by including feature extraction in the learning process. This learning of hierarchical features from raw data with no task-specific prior knowledge is much harder than training prediction models, therefore requiring significantly more training data and computational resources. The recent success of deep learning (DL) is thus largely attributed to advances in computing capability and the availability of large amounts of labeled data [18, 26, 27]. Even when done on GPUs, training deep learning models on large datasets can take an excessively long time if done on a single machine. Furthermore, training on a single machine with limited resources restricts the size of the model that can be trained. Distributed deep learning tries to address these limitations by decomposing a machine learning problem onto multiple machines and devices. The decomposition can be done across two dimensions: (1) the data dimension and (2) the model dimension. However, scaling up is not merely adding computational resources. The main consideration to make when distributing computation onto multiple machines is


communication cost, in particular the ratio of computation to data management housekeeping. Distributed deep learning training is only efficient if the parameters being communicated are also computationally expensive. Amdahl [4] speculates that if data management overhead in a parallel program averages around 10% of an operation, it will force at least 25% of the computation to be sequential. In this paper, the performance and scalability of state-of-the-art distributed training algorithms will be empirically analyzed and their performance compared in a shared cluster environment with container-based resource management. The rest of this paper is organized as follows. Chapter 2 gives the necessary background to follow this paper and some related work, Chapter 3 presents the research design and performed experiments, and Chapter 4 reports the findings of these experiments. Finally, Chapter 5 discusses the results and presents conclusions based on the analysis.

1.1 Research Question

The accuracy of deep learning models improves with the amount of data used for training and the size of the model [17, 11, 10, 28], requiring more computational resources. Researchers and professionals that have access to GPU clusters typically use them for: (1) running parallel experiments (for example, to establish good hyper-parameters, learning rate, number of layers, choice of model architecture, etc.) [6] and (2) expensive training jobs, distributed over many GPUs on many servers [23]. Available deep learning frameworks like TensorFlow [1] and others [24, 9, 12] use different approaches to scale out and speed up training through distribution. Some, like TensorFlow, Caffe2, and Torch, have native support for distributed training, while others, like Caffe (1.0), rely on Apache Spark [47], a cluster computing framework that uses a programming model similar to MapReduce [14], to handle distribution. However, choosing the right distributed training system architecture for a given model is far from trivial. When training deep neural networks in a distributed manner, one needs to know:

1. Whether performance is affected by the number of parameter servers.

2. Whether updating parameters in an asynchronous manner improves performance.

3. How the performance scales with each added worker when using ring all-reduce compared to parameter server.

4. How easy it is to deploy a model, with the chosen distributed training system architecture, on a cluster with shared resources.

In this work, these questions will be analyzed for two distributed training system architectures, to identify bottlenecks and limitations in the methods. This paper will also explore mechanisms for deploying distributed deep learning training on a cluster with shared resources.

1.2 Scope

Running parallel experiments to establish good hyper-parameters and/or model architecture is a less interesting problem to analyze because: (1) there is no communication between processes running in parallel and (2) the amount of training data needed is significantly smaller than for the actual model training. Thus, the analysis in this paper will only be concerned with expensive training jobs that can benefit from distributed training, focusing mainly on the communication overhead incurred when training is distributed. When distributing deep learning training some loss of accuracy might occur, or more training may be needed to achieve the same level of accuracy as in non-distributed training. In this work the loss of accuracy that might occur when distributing training is not investigated. Moreover, the performance of distributed deep learning training can be affected by other factors, like disk I/O, that are ignored in this work. Although deep learning can refer to any neural network with a large number of layers and parameters, here we will only be concerned with models trained using some variant of Stochastic Gradient Descent (SGD) to limit the scope, but the analysis should generalize to most deep learning models.

1.3 Sustainability and Relevance

Distributing training requires a lot of computational resources. Thus, the economic and environmental costs of using these resources need to be

considered. This work will help in identifying the cost of distribution and whether this cost is justified by the gain. This is done by showing how much speedup is achieved with each added computational resource. This study will also help in identifying bottlenecks in system setups that can hamper the performance gain that can be obtained by distribution. The company Logical Clocks AB is a startup with the product Hadoop Open Platform-as-a-Service (Hops), a distribution of Apache Hadoop [46] that provides a web-based application to manage datasets and interactively analyze them. The Hops web-based application (Hopsworks) also provides CPU, GPU, and storage as managed resources on a cluster. This work is interesting to Logical Clocks because distributed deep learning is one of the services on Hopsworks, and they need to identify any limitations or bottlenecks in the cluster currently hosting the platform. This work is also relevant for anyone that wants to distribute deep learning training, by showing the limitations and the considerations that need to be made when choosing a distributed training architecture.

Chapter 2

Background

Artificial Neural Networks (ANN) are computational models (algorithms or actual hardware) that are modeled after a highly simplified mammalian cerebral cortex neuronal structure. The basic computational unit of a neural network is the perceptron, a simplified neuron model capable of finding a decision surface for linearly separable patterns. By layering two or more of these perceptrons in a feed-forward manner, neural networks can approximate arbitrary functions [22]. These multilayered perceptrons are usually trained using backpropagation, first introduced by Rumelhart et al. [36]. Backpropagation works faster than previously known approaches to learning, making deeper networks feasible. Deeper networks increase the ANN’s memorization capacity, making it possible to learn data representations, which is the basis for deep learning. Deep learning, also modeled after a highly simplified neocortical neuronal structure, attempts to mimic neural coding. The success of deep learning models such as deep neural networks (DNN), deep belief networks (DBN), and recurrent neural networks (RNN) depends on their ability to learn data representations. Training these models is mainly done through backpropagation using gradient descent and requires significantly more training data and computational resources than shallow networks.

2.1 Training Neural Networks

Training neural networks in a supervised manner is done through the backpropagation algorithm, comprising two phases: the forward pass,


which maps input to output, and the backward pass, which takes the output, calculates the error by comparing it with the desired output, and propagates the error back all the way to the input nodes. The weights associated with individual links in the network are updated proportionally to the portion of the error propagated back to them. The weight updates can be done either in batches (where all the training samples pass through the algorithm before the weights are updated) or after each training sample, presented in random order. Stochastic Gradient Descent (or some flavor of it, like AdaGrad, RMSProp, or Adam) is the most commonly used update rule (optimizer) for training deep neural networks [10, 7]. This process is repeated for all training samples multiple times (epochs) until convergence, i.e. when the error drops below a given value or when the validation error starts to go up. Most of the operations in the backpropagation algorithm can be performed as matrix-vector operations, which are highly parallel and numerically intensive. They can thus be offloaded to the many cores of a GPU, which can perform floating-point arithmetic at a much higher rate than a CPU.

2.1.1 Stochastic Gradient Descent (SGD)

The batch gradient descent algorithm starts with a randomly initialized parameter θ and repeatedly updates it with the gradient of the objective function J(θ):

\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)    (for j = 0, ..., n)    (2.1)

where α is the learning rate. For the least mean squares (LMS) cost function:

J(\theta) = \frac{1}{2} \left( h_\theta(x) - y \right)^2    (2.2)

\frac{\partial}{\partial \theta_j} J(\theta) = \left( h_\theta(x) - y \right) x_j    (2.3)

For m training examples the update rule can be written as:

\theta_j := \theta_j + \alpha \frac{1}{m} \sum_{i=1}^{m} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}    (for all j)    (2.4)

Therefore, the computational cost of gradient descent scales linearly with the training dataset size (m in 2.4). Stochastic Gradient Descent approximates gradient descent by sampling a single training example uniformly at random and performing the update. For a randomly chosen i ∈ {1, ..., m} the update rule is given by:

\theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}    (for all j)    (2.5)

Mini-batch gradient descent is another update rule that uses b random examples to perform the parameter update. For mini-batch gradient descent, equation 2.5 can be rewritten as:

\theta_j := \theta_j + \alpha \frac{1}{b} \sum_{k=i}^{i+b} \left( y^{(k)} - h_\theta(x^{(k)}) \right) x_j^{(k)}    (for all j)    (2.6)

for i := 0, ..., (m − b) and 1 < b < m. When b = 1 this algorithm is identical to SGD (equation 2.5) and when b = m it is batch gradient descent (equation 2.4).
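The update rules above translate directly into code. The following NumPy sketch of equation 2.6 is an illustration only (hypothetical function and variable names, assuming the linear hypothesis h_θ(x) = θᵀx used in the LMS cost above); it is not the benchmark code used later in this thesis.

import numpy as np

def minibatch_sgd(X, y, alpha=0.01, b=32, epochs=10):
    """Mini-batch gradient descent for the LMS cost (equation 2.6).

    X: (m, n) matrix of training examples, y: (m,) vector of targets.
    Assumes the linear hypothesis h_theta(x) = theta^T x.
    """
    m, n = X.shape
    theta = np.zeros(n)                      # randomly initialized in practice
    for _ in range(epochs):
        perm = np.random.permutation(m)      # present examples in random order
        for i in range(0, m - b + 1, b):
            batch = perm[i:i + b]
            error = y[batch] - X[batch] @ theta          # (y - h_theta(x))
            theta += alpha * (X[batch].T @ error) / b    # averaged gradient step
        # b = 1 recovers SGD (eq. 2.5); b = m recovers batch GD (eq. 2.4)
    return theta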

2.2 Distributed Training in Deep Learning

Deep learning models need to be trained on big datasets. Even when training is done on GPUs, it can take days or even weeks on a single machine. There are two dimensions over which deep learning model training can be parallelized: data parallelism (across the data dimension), where each machine contains a complete model replica but processes only part of the data samples, and model parallelism (across the model dimension), where different parts of a model run on different machines in parallel. In both schemes some kind of synchronization between workers is required: in model parallelism neuron activities need to be communicated, while in data parallelism model parameters (weights and biases) are communicated to ensure all models are trained evenly [25]. The performance of these parallelization schemes is highly dependent on the model architecture. Notably, parallelism is only efficient if the parameters (units) being communicated are also computationally expensive. Therefore, when the per-weight computation is high, data parallelism is more efficient, while model parallelism is more efficient when the neuron activity is computationally expensive [25].

To make full use of distributed training, both dimensions of parallelism should be exploited to a degree that takes into account the communication architecture of the model. Other factors like weight update schemes, batch size, and communication algorithms can also affect the performance of these parallelization schemes.

2.2.1 Distributing SGD using MapReduce

The MapReduce [14] approach is applicable to problems that can be expressed as computing sums of functions over a training dataset. MapReduce refers to two separate tasks: (1) the map job and (2) the reduce job. The map job takes the data and computes a result in parallel on multiple workers that can be spread across multiple devices or machines. The reduce job then takes the output from all mappers and combines them to give the cumulative result. The mini-batch gradient descent algorithm given in equation 2.6 can be distributed using the map-reduce paradigm onto n machines as follows:

\mathrm{Machine}_j^{(l)} := \sum_{k=i}^{i+b/n} \left( y^{(k)} - h_\theta(x^{(k)}) \right) x_j^{(k)}    (for all j)    (2.7)

where l := 1, ..., n. Then a single reduce job will calculate:

\theta_j := \theta_j + \alpha \frac{1}{b} \sum_{l=1}^{n} \mathrm{Machine}_j^{(l)}    (for all j)    (2.8)

The parameter θ is then broadcast to all machines to calculate the next iteration of the algorithm. This would potentially give an n-fold speedup if there were no network latency.
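As a minimal illustration of equations 2.7 and 2.8, the sketch below simulates the map and reduce jobs on a single machine; the function names are hypothetical, and no real network communication or MapReduce framework is involved.

import numpy as np

def map_partial_gradient(X_part, y_part, theta):
    """Map job (eq. 2.7): one machine computes the gradient sum
    for its shard of the current mini-batch."""
    error = y_part - X_part @ theta
    return X_part.T @ error

def reduce_and_update(theta, partials, alpha, b):
    """Reduce job (eq. 2.8): sum the partial gradients and update theta.
    The updated theta would then be broadcast back to all machines."""
    return theta + alpha * np.sum(partials, axis=0) / b

# Example: a mini-batch of size b split across n simulated machines
n, b, d = 4, 32, 10
X = np.random.randn(b, d)
y = np.random.randn(b)
theta = np.zeros(d)
shards = zip(np.array_split(X, n), np.array_split(y, n))
partials = [map_partial_gradient(Xp, yp, theta) for Xp, yp in shards]
theta = reduce_and_update(theta, partials, alpha=0.01, b=b)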

2.3 Algorithms for Collective Communication

Collective communication algorithms are a set of algorithms designed for communication involving multiple processes. The three collective communication algorithms of interest for this study are: one-to-all broadcast (when the parameter server broadcasts the global model), all-to-one reduction (when all workers push their gradients to be reduced by the parameter server), and all-reduce. All algorithms in collective

communication try to minimize the overhead of point-to-point message passing. The communication overhead of point-to-point message passing is given by:

\mathrm{overhead} = t_s + t_w m    (2.9)

where t_s is the latency, t_w is the per-word transfer time (inverse bandwidth), and m is the message size in number of words. If the link is bi-directional and used by more than one message, then t_w → t_w / M, where M is the number of messages that share the link (i.e. the transfer time is shared by M messages). This shows that distributed training is highly dependent on the amount of data that needs to be sent, the throughput of the communication channel, and the number of workers using the communication channel. Unlike all-to-one reduction, all-reduce algorithms are independent of the number of processes that need to communicate. In all-reduce, each worker communicates with two neighbors: it receives a message from the neighbor on one side, combines it with its local message, and passes the result to the neighbor on the other side. Hence the number of messages that share the same link does not grow with the number of processes, and t_w → t_w / 1 for all M.

2.3.1 Message Passing Interface (MPI)

The Message Passing Interface (MPI) standard is a cross-platform API that allows distributed-memory parallel programs to exchange data by abstracting the underlying network procedures. MPI is available on single compute nodes for inter-process communication and on clusters for communication between nodes. MPI can take advantage of high-performance interconnects, with high throughput and low latency, such as InfiniBand, Intel Omni-Path, and Cray interconnects, which are not available in Spark, MapReduce, and gRPC (Google RPC, used by distributed TensorFlow). MPI has corresponding functions for all the collective communication algorithms mentioned above: MPI_Bcast (one-to-all broadcast), MPI_Reduce (all-to-one reduction), MPI_Allreduce (all-reduce), and others. Open MPI [15] is an open-source implementation of MPI developed and maintained by the High Performance Computing community. Open MPI is used in implementations of parallel and distributed deep learning

training frameworks like Baidu’s contribution to TensorFlow (tensorflow-allreduce), Theano-MPI [31], and Horovod [38].
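For illustration, the short script below averages a gradient-like NumPy array across processes with MPI_Allreduce. It uses mpi4py, which is an assumption for this sketch only; the experiments in this thesis use Open MPI through Horovod, not mpi4py. It could be launched with, e.g., `mpirun -np 4 python allreduce_demo.py`.

# allreduce_demo.py -- average an array across all MPI ranks
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

local_grad = np.full(5, float(rank))      # stand-in for a locally computed gradient
global_sum = np.empty_like(local_grad)

# MPI_Allreduce: every rank ends up with the element-wise sum
comm.Allreduce(local_grad, global_sum, op=MPI.SUM)
avg_grad = global_sum / size

if rank == 0:
    print("averaged gradient:", avg_grad)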

2.4 Data Parallelism

Scalability challenges that arise from the massive data volumes used to train deep learning models can be alleviated by leveraging data parallelism. In data parallelism, replicas of the same model run on multiple workers, each on a different subset of the training data, in parallel. Thus, data parallelism requires keeping a global model and some way of updating its parameters by gathering results from workers. Stochastic gradient descent (SGD) is an inherently sequential algorithm [49]. To perform SGD in parallel, a global model is maintained and workers send their gradients, which are aggregated, to the global model. This gradient update can be performed in two different ways:

• Synchronous SGD, where gradient updates are made when all workers are done calculating their respective gradients, and

• Asynchronous SGD, where gradient updates are made incrementally as individual workers finish calculating their respective gradients.

Finally, the updated global model is broadcast to all workers. Thus, techniques common in collective communication and high-performance computing (HPC) are highly relevant for model gradient update and propagation.

2.4.1 Synchronous SGD

In synchronous SGD, a global model is kept that is updated with the aggregate of all worker gradients. This updated global model is then sent to all workers, making synchronous SGD a true mini-batch stochastic gradient descent, where the mini-batch size is the sum of the mini-batches of all workers. Synchronous SGD has two obvious drawbacks: (1) the update time depends on the slowest worker and (2) the scalability of the approach depends on the batch size of each worker (i.e. bigger mini-batches on workers will limit the number of workers that can be added, because very big batch sizes can affect the convergence rate of SGD), although training with mini-batch sizes of up to 32k with no loss in accuracy has been demonstrated [16, 3].

2.4.2 Asynchronous SGD

Downpour SGD, first introduced in [13], used asynchronous stochastic gradient descent, where each worker pushes its gradients ∆w and gets back an updated global model W independently of other workers. Asynchronous SGD has two advantages: (1) the performance of the algorithm is not affected by a slow worker and (2) it is fault tolerant. Fault tolerance is a challenge faced when implementing any distributed framework. Long-running applications, like deep learning, are susceptible to faults and require robust and fault-tolerant frameworks. The task failure rate of 10k machine-hour jobs of batch machine learning tasks was reported to be 24.7% in [30]. While asynchronous SGD solves the bottleneck introduced by the slowest worker in synchronous SGD, it suffers from the problem of delayed gradients. This is encountered when a slow worker pushes its gradients to the global model after the model has already been updated by other workers. Different techniques have been proposed to solve the problem of delayed gradients [48, 8]. While the delayed gradients can create some noise in the global model and delay convergence, deep neural networks can recover and learn successfully [8].

2.4.3 Parameter Server

The parameter server framework introduced in [42] is widely adopted [2, 21, 30] as an efficient solution to scale machine learning algorithms. Parameter server frameworks distribute data and model parameters across multiple nodes to spread the workload. In the parameter server architecture (shown in Figure 2.1) a distributed key-value store is used for synchronizing parameters between workers. Parameter servers store all parameters, while workers are stateless but can cache parameters across iterations. In a straightforward configuration with a single parameter server and multiple workers, each worker computes a gradient on its subset of the mini-batch and sends it to the parameter server, which then takes the average of all the gradients and broadcasts it back to all workers. If we imagine a neural network with 100 million trainable parameters (which is not uncommon in deep learning), where each parameter is four bytes, we need to communicate roughly 400 megabytes of data per worker. In the above configuration, where all workers share

the same bandwidth, the communication cost grows linearly with the number of workers. To make further parallelization practical, the synchronization constraint can be removed (asynchronous SGD) or we can use communication algorithms whose cost is independent of the number of workers (all-reduce). The synchronization constraint is the main bottleneck in this framework, so some implementations loosen this constraint to reduce communication overhead [20, 30], while others, like Baidu’s ring all-reduce, take advantage of bandwidth-optimal communication algorithms without loosening the synchronization constraint.

Figure 2.1: Parameter server model. Training data is divided across machines, each containing a model replica. The parameter server collects model gradients ∆W from each machine and returns an aggregated model parameter W that is used to update every model replica. (image source [13])
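A schematic of one synchronous parameter server step as described above, with hypothetical function names and no real networking, is sketched below; in practice this logic lives inside frameworks such as distributed TensorFlow rather than user code.

import numpy as np

def parameter_server_step(theta, worker_batches, compute_gradient, alpha):
    """One synchronous step: every worker pushes a gradient (all-to-one
    reduction), the server averages them and applies the update, and the
    new parameters are broadcast back (one-to-all broadcast)."""
    gradients = [compute_gradient(theta, batch) for batch in worker_batches]
    avg_grad = np.mean(gradients, axis=0)
    theta = theta - alpha * avg_grad
    return theta        # broadcast to all workers for the next iteration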

2.4.4 Ring All-Reduce

The bottleneck created by sending data to a single parameter server in all-to-one reduction can be alleviated with ring all-reduce, an algorithm common in the field of high-performance computing. In ring all-reduce each worker is assigned two neighboring workers: one to send data to and another to receive from. The algorithm as presented in [50] consists of two stages:

1. Scatter-reduce: in this stage, each worker sends a chunk to and receives a chunk from its direct neighbors. After the first data transfer iteration, each worker aggregates the chunk it received with its local copy and repeats the data transfer stage until each worker holds some part of the aggregated final value that includes contributions from all workers.

2. All-gather: in this stage each worker receives a final value from its (sender) neighbor. After each data transfer, the worker replaces its value with the newly received one and continues the data transfer until every worker has received the contributions from all other workers.

The speed of the ring all-reduce algorithm is independent of the number of workers; instead, it is limited by the slowest communication link between neighboring workers. The ring all-reduce algorithm can be applied to deep learning in the same way as the parameter server framework, but instead of sending gradients to a single parameter server they are sent to immediate neighbors. Furthermore, communication can be overlapped with the gradient computation, by sending the gradients of the output layer while those of the other layers are still being computed [50].
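The two stages can be made concrete with a small single-process simulation (a hypothetical helper, no real communication): each of the N workers holds a vector split into N chunks, and after N−1 scatter-reduce steps plus N−1 all-gather steps every worker holds the full element-wise sum.

import numpy as np

def ring_allreduce(worker_values):
    """Simulate ring all-reduce over a list of equally shaped 1-D arrays."""
    n = len(worker_values)
    # Each worker splits its vector into n chunks
    chunks = [np.array_split(v.astype(float), n) for v in worker_values]

    # Stage 1: scatter-reduce -- after n-1 steps, worker w owns the fully
    # reduced chunk (w + 1) % n
    for step in range(n - 1):
        for w in range(n):
            send_idx = (w - step) % n
            dst = (w + 1) % n
            chunks[dst][send_idx] += chunks[w][send_idx]

    # Stage 2: all-gather -- circulate the reduced chunks around the ring
    # until every worker has all of them
    for step in range(n - 1):
        for w in range(n):
            idx = (w + 1 - step) % n
            dst = (w + 1) % n
            chunks[dst][idx] = chunks[w][idx].copy()

    return [np.concatenate(c) for c in chunks]

# Usage: four workers, each contributing a different vector
workers = [np.arange(8, dtype=float) * (w + 1) for w in range(4)]
results = ring_allreduce(workers)
assert all(np.allclose(r, sum(workers)) for r in results)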

2.5 Model Parallelism

Deep learning models can have billions of parameters [13]. At four bytes per parameter, these models can reach gigabytes in size and cannot fit into the memory of a single GPU. Model parallelism can be used to address this problem. In model parallelism, synchronization between workers is done when one worker needs neuron activities output by another worker as input. To minimize the communication overhead, a neural network model graph can be partitioned in such a way that the edges running between the separated components of the model (shown as thicker lines in Figure 2.2) are few or the amount of data that flows through them is low. However, model parallelism with the goal of minimizing execution time cannot be achieved solely by partitioning. Operations also need to run in a particular order, to make sure outputs from a task are available when they are needed as input by other tasks (scheduling).

Figure 2.2: Model parallelism. Gray boxes represent devices (GPU or CPU) or machines. The thicker edges show communication across machines or devices. (image source [13])

Both finding an optimal partitioning of a dataflow graph and scheduling are NP-complete problems [32]. However, there are suboptimal heuristics-based algorithms both for scheduling and for partitioning computational graphs.

2.5.1 Partitioning Neural Network Model Graphs

Graph partitioning can be used for load balancing tasks onto multiple devices while minimizing communication. Graph partitioning can be defined as follows: given a directed acyclic graph G = (N, E, W_N, W_E), where N are the nodes, E the edges, W_N the node weights, and W_E the edge weights, find a partition that distributes the load W_N evenly while minimizing the sum of all edge weights connecting the different partitions. In a neural network computational graph, N represents the computational operations, W_N is the cost of an operation in N, and an edge E(i,j) with weight W_E(i,j) is the amount of data that flows between i and j.
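Written with the notation above, a standard balanced k-way partitioning formulation of this problem is the following (stated here as an illustration, not quoted from [32]):

\min_{P_1, \dots, P_k} \; \sum_{(i,j) \in E,\; P(i) \neq P(j)} W_E(i,j)
\qquad \text{subject to} \qquad
\sum_{n \in P_l} W_N(n) \;\approx\; \frac{1}{k} \sum_{n \in N} W_N(n), \quad l = 1, \dots, k

where the parts P_1, ..., P_k cover all nodes and P(i) denotes the part that node i is assigned to; the objective is the total weight of edges cut by the partition, and the constraint expresses the even load balance.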

2.5.2 Device Placement Optimization

As shown in [33], graph partitioning algorithms do not produce satisfactory placements, because modeling cost estimates for these graphs is expensive. Instead, [33] proposes a reinforcement learning model for device placement optimization that outperforms both graph partitioning algorithms and expert human placements.

Reinforcement Learning for Device Placement Optimization

Reinforcement learning is a part of machine learning that is inspired by the reward system (dopamine pathway) of the brain. Unlike supervised learning, where training data is labeled with the true class, reinforcement learning agents learn from experience. By using the execution time of a proposed placement, executed on actual hardware, as a reward signal, a reinforcement learning agent can learn to optimize device placement [33]. The agent trained this way was reported to be 3.5 times faster than SCOTCH’s [35] graph-partitioning-based placement and up to 20% faster than human experts’ placements.

2.6 Hybrid Data and Model Parallelism

No one dimension of parallelism is better than the other. Which scheme to use should be informed by the communication architecture of a model, the amount of training data, and the size of the model. Many distributed deep learning frameworks use both schemes [8, 34]. One useful observation made in [25] is that convolutional neural networks have two types of layers: convolutional layers contain 90-95% of the computation but only about 5% of the parameters, while fully-connected layers contain only about 5-10% of the computation but 95% of the parameters. This shows that data parallelism is more appropriate for convolutional layers, and model parallelism for fully-connected layers.

2.7 TensorFlow

TensorFlow is an open-source framework for building and deploying machine learning models. TensorFlow is a second-generation framework derived from DistBelief [13], a Google Brain project. TensorFlow

has support for both CPUs and GPUs (using NVIDIA CUDA) on one node or on a cluster with multiple nodes. Currently it is the only machine learning framework supported on hops-hadoop’s data management and analysis platform. The TensorFlow programming framework contains three basic concepts (a minimal example follows the list):

1. Tensors are the main computational units in TensorFlow. A tensor is a multidimensional array with a rank that represents its dimension.

2. Graphs represent the dataflow between computations in a TensorFlow program, where graph edges and vertices represent dataflow and operations, respectively.

3. Session holds information about the TensorFlow graph, runs the computations described by the graph, and provides access to hardware resources on local or remote devices.
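The three concepts fit together as in the minimal sketch below (TensorFlow 1.x API, matching the version 1.7 used in the experiments; the tensor values are arbitrary and chosen only for illustration).

import tensorflow as tf

# Graph: operations (vertices) connected by tensors (edges)
a = tf.constant([[1.0, 2.0]])          # rank-2 tensor, shape (1, 2)
b = tf.constant([[3.0], [4.0]])        # rank-2 tensor, shape (2, 1)
product = tf.matmul(a, b)              # an operation node in the graph

# Session: runs the computation described by the graph on the
# available local (or remote) devices
with tf.Session() as sess:
    print(sess.run(product))           # [[11.]]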

2.7.1 Distributed TensorFlow

TensorFlow supports distributed training both for data and model parallelism, and it allows both synchronous and asynchronous training. TensorFlow also supports ring all-reduce on a single node using NCCL (NVIDIA’s library for collective communication). TensorFlow distributes training by creating a cluster of tasks that will execute a TensorFlow graph. Distributed execution is achieved using two objects:

1. Server is used to create a session for each task.

2. Worker executes operations in a graph.

A specification dictionary is used to map job names to network addresses. These job names can then be used in TensorFlow code to specify which part of the execution will run on which worker, and the network addresses are used for communication.
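A minimal illustration of these two objects with the TensorFlow 1.x distributed API is shown below; the addresses are placeholders and mirror the cluster specification format shown later in Listing 1, and the variables are only stand-ins.

import tensorflow as tf

# Specification dictionary: job names mapped to network addresses
cluster = tf.train.ClusterSpec({
    "ps":     ["10.0.1.15:4287"],
    "worker": ["10.0.1.16:4062", "10.0.1.18:4129"],
})

# Every task creates one Server with its own job name and index;
# a parameter server task typically just blocks on server.join().
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# Job names are used in the code to pin operations to tasks
with tf.device("/job:ps/task:0"):
    weights = tf.Variable(tf.zeros([784, 10]), name="weights")
with tf.device("/job:worker/task:0"):
    logits = tf.matmul(tf.zeros([32, 784]), weights)

# A session connected to this task's server would then execute the graph:
# with tf.Session(server.target) as sess: ...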

2.8 Resource Management in Hops-YARN

Hops-YARN is a distribution of Apache Hadoop YARN [45] that has moved the StateStore to a transactional in-memory database (MySQL

NDB Cluster). An application that is submitted to a hadoop cluster is assigned the required resources and monitored through YARN (Yet Another Resource Negotiator). YARN cluster management includes resource management, monitoring, and scheduling. These tasks are performed by three separate components:

1. ResourceManager (RM) is the central authority responsible for mediating resources for applications running in the cluster. Its tasks include scheduling and resource management.

2. NodeManager (NM) is tasked with reporting resource availability and faults, and with container lifecycle management (e.g., starting and killing containers) on each node in a cluster.

3. ApplicationMaster (AM) manages all lifecycle aspects of a job. The AM can run arbitrary user code and communicates with the RM to issue resource requests that can contain locality preferences.

The Hadoop version (2.8) used in Hops has support for managing memory and CPU as resources, but GPUs are not managed natively by this version of Hadoop YARN. Hops-hadoop YARN added GPUs as managed resources in 2017 [5].

2.9 Spark on YARN

Apache Spark [47] is a cluster computing framework that uses a programming model similar to MapReduce [14]. Launching an application with Spark involves five components:

1. Driver launches the job and provides the code that will run on the workers.

2. Cluster manager is used by Spark to request resources. The cluster manager can be standalone, Mesos, or YARN.

3. Workers are containers that provide resources to the application.

4. Executors are JVM processes created on worker nodes by Spark.

5. Tasks are threads inside an executor.

A Spark application can be deployed on YARN in two modes: cluster mode or client mode. In cluster mode, the driver process runs in the YARN AM, while in client mode the driver runs in the client that started the Spark job.

2.9.1 TensorFlow On Spark

TensorFlowOnSpark (TFoS) [29] is an open-source framework that enables distributed training and inference on Spark by using TensorFlow’s own distributed deep learning capabilities. This allows TFoS to support all the distributed deep learning methods in TensorFlow. TFoS has support for communication using both Ethernet and RDMA over Converged Ethernet (RoCE). The GPU support in TFoS is not managed, i.e. the program will take any available GPU on a cluster. This makes it unsuitable for shared clusters where GPUs are managed resources. The Hops team has released a distribution of TFoS that complies with the YARN resource management restrictions, for use with the hops-hadoop platform.

2.10 Evaluation

The performance of a distributed training algorithm is measured by the amount of time it takes to train a model with no significant loss of accuracy. Scalability is another metric that can be used to evaluate distributed training algorithms. A distributed training algorithm is scalable if the running time of the algorithm is reduced with every added computational resource and the additional cost is justified by the performance gain. With ideal scaling, the running time of the algorithm improves proportionally to the added computational resources (linear scalability). Benchmarks for distributed training are usually done both with synthetic and real data. The synthetic data helps to remove disk I/O from the equation. Real data can then be used to verify the result and measure the disk I/O impact.
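A common way to make this precise, consistent with how the experiments in Chapter 3 compare measured throughput against ideal scaling (stated here as a definition rather than a formula taken verbatim from the thesis), is:

S(N) = \frac{T(1)}{T(N)} = \frac{\text{images/sec with } N \text{ workers}}{\text{images/sec with one worker}}, \qquad E(N) = \frac{S(N)}{N}

where T(N) is the time to process a fixed workload with N workers, S(N) is the speedup, and E(N) the efficiency; linear scalability corresponds to S(N) = N, i.e. E(N) = 1.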

2.11 Related Work

Distributed implementations of deep learning training are important for efficiency, achieving better accuracy, and quicker trial and error in experiments. Given this fact, much work has been put into building distributed frameworks for training deep networks and analyzing their performance in single-GPU, multi-GPU, and multi-node environments [39, 40]. Frameworks like Project Adam [8] and SINGA [34] attempt to solve the performance and scalability issues of deep learning training by building a highly customized solution. Others, like TensorFlowOnSpark, integrate deep learning frameworks with a general-purpose batch computation framework like Spark, and Horovod [38] brings HPC techniques to deep learning. As shown in [16, 3], HPC techniques like ring all-reduce can achieve nearly linear scalability, but there is no native support in MPI for YARN-based resource management. Generally available machine learning frameworks like TensorFlow, Caffe2, and Torch, which support distributed training natively, have no built-in support for deploying a model in a shared cluster environment. There is a clear mismatch between shared clusters, managed by YARN, and machine learning frameworks, which is addressed by this project. This work explored the feasibility of running machine learning frameworks and HPC techniques in a shared cluster managed by YARN, and analyzed the performance of different distributed deep learning algorithms in a shared cluster environment. The frameworks used in the analyses are TensorFlowOnSpark and Horovod, customized for container-based resource management. The tests were performed on a 2-node cluster where each node was configured with 10 GPUs. The GPUs in the two nodes are GeForce GTX 1080 Ti (for a total of 20 GeForce GTX 1080 Ti). The nodes are connected via a 40Gb/s InfiniBand connection and a 1Gb/s Ethernet cable.

Chapter 3

Method

Distributing the training of deep learning models is important for achieving better accuracy, allowing for quicker trial and error in experiments, and general efficiency. But to be efficient a distributed training algorithm must take into consideration the communication cost incurred when training is done in parallel. There are many distributed training architectures that try to address this cost. Here, two widely used distributed training algorithms, parameter server and ring all-reduce, are analyzed using image recognition models on a hops-hadoop cluster with 20 GPUs. The analysis is done on the run-time performance of the models with different numbers of parameter servers, update modes, and communication algorithms. Given that hops-hadoop clusters are designed for multi-tenancy, a mechanism for deploying distributed training on a shared cluster is also presented.

3.1 Datasets and Models

ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [37] winning models are widely used for performance benchmarking in distributed deep learning [3, 16, 39, 40]. InceptionV4 [43], ResNet-50 [19], AlexNet [26], ResNet152_v2 [19], and VGG16(19) [41] are used in the experiments in this study, with a synthetic ImageNet dataset, to analyze the performance of different distributed training schemes and measure the impact of the communication cost. Synthetic datasets are randomly generated pixel values that match the dimensions of the desired dataset (256 × 256 for ImageNet). Using synthetic datasets allows us to test the computational performance and communication overhead by removing disk I/O from the equation. For the purpose of this paper the disk I/O impact on scalability is not considered; therefore no tests with real data were carried out. The models used for the experiments are implementations of the original models in TensorFlow benchmarks [44]. The models were chosen to represent different architectures and numbers of parameters. The models and their respective number of parameters, floating point operations, and accuracy are shown in Table 3.1.

Table 3.1: Models used in experiments with number of parameters, operations, and accuracy

Models       # Parameters   Flops       Top-5 accuracy (%)   Top-1 accuracy (%)
AlexNet      ~60M           ~2.27 Bn    80.3                 57.0
VGG16        ~138M          ~30.94 Bn   90.0                 70.5
VGG19        ~143M          ~39 Bn      92.7                 71.3
Resnet50     ~25M           ~10 Bn      92.9                 75.8
Resnet152    ~60M           ~29.4 Bn    93.8                 77.6
googlenet    ~10M           ~3 Bn       92.1                 <70
inception4   ~65M           ~20 Bn      95.0                 80.0

3.2 Cluster Setups

A 2-node GPU cluster was used for the experiments. In the next two sections the hardware and software setups of both nodes are specified.

3.2.1 Hardware Specification

The two machines in the cluster used for the experiments have identical hardware specifications, shown below:

• CPU: 2x Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz

• Cores per socket: 8

• Threads per core: 2

• CPU MHz: 1273.699, min 1200, max 3000

• Memory: 256 GB

• GPU: 10x GeForce GTX 1080 Ti, PCIe Gen3, x16 (16 lanes)

• InfiniBand: QLogic Corp. IBA7322 QDR InfiniBand HCA, PCIe Gen2, x8 (8 lanes)

This gives a total of 64 cores, 512 GB of memory, and 20 GPUs. The 10 GPUs in one machine are divided into two groups of 5, placed on different PCIe buses as shown in Figure 3.1, with a connection traversing a PCIe switch as well as a PCIe host bridge (typically the CPU) between GPUs in different groups. Within a group, each GPU is connected to the others through a single PCIe switch. The two nodes are connected with a 40Gb/s InfiniBand connection and a 1Gb/s Ethernet cable. InfiniBand performance might be affected by the bandwidth limitation introduced by PCIe. The maximum possible bandwidth of the PCIe bus is calculated by multiplying the PCIe width and speed, minus ~1Gb/s for error correction protocols and 20% PCIe header overhead. The speed and width of the PCIe expansion bus the InfiniBand card supports are 5GT/s (GT/s stands for "billion transactions per second") and x8, respectively. The maximum possible PCIe bandwidth is therefore 5G ∗ 8 ∗ (1 − 1/5) − 1G = 40G ∗ 0.8 − 1G ≈ 31Gb/s.

3.2.2 Distributed Deep Learning and Big Data Frameworks

Both machines in the cluster were running CentOS 7.2 and were installed with the following software and distributed deep learning frameworks.

• Open MPI: version 3.0.1

• CUDA Toolkit: version 9.0

• NCCL: version 2.1.15

• cuDNN: version 7.0

• Hops-YARN: version 2.7

• Spark: version 2.2.1

• TensorFlow: version 1.7

Figure 3.1: Two nodes with 10 GPUs each and a 40Gb/s InfiniBand connection running between them. The InfiniBand cards are using a PCIe gen2 expansion bus with a speed of 5GT/s and width of x8. The GPUs are mounted on PCIe gen3 expansion bus with a speed of 8GT/s and width of x16.

• TensorFlowOnSpark: version 1.3.5 (hopshadoop fork, https://github.com/hopshadoop/TensorFlowOnSpark)

• horovod: version 0.12.1

3.3 Experiment Design

The running time performance of the distributed deep learning frameworks is measured by the number of images per second that different models are able to process in distributed mode. Because the batch size can affect the number of images per second that can be processed, all models were tested with the same batch size. The scalability of the models is then compared with the ideal scaling (i.e. images/sec for one worker multiplied by the number of workers) to give a performance metric that is independent of the model used in the experiment. This was done for the parameter server and all-reduce algorithms with the two chosen frameworks, TensorFlowOnSpark and Horovod.

3.3.1 Batch Size

The mini-batch size used to train a model affects: (1) how fast a model converges, or whether it converges at all, (2) the amount of time spent on output calculation by each replica of a model, and (3) the amount of data that needs to be read from disk before each mini-batch is processed (only in the case of real-data experiments). In this work none of these three properties were analyzed, so a mini-batch size of 32 per worker is used for all experiments. This is because 32 was the biggest batch size that could fit in memory for some of the big models.

3.3.2 Number of Workers

The number of workers in the experiments was limited by the number of GPUs available in the cluster. In the parameter server experiments all 20 GPUs could not be used, because of the low bandwidth available between the two nodes and because TFoS does not support InfiniBand. All parameter server experiments were therefore limited to 10 GPUs. An experiment involving two nodes is presented in section 4.1.3 to show the network bandwidth bottleneck. Given the limited number of GPUs available, only one and two parameter servers are tested. In the hops-hadoop version of TensorFlowOnSpark GPUs are not assigned to parameter servers, allowing for 10-worker experiments.

3.3.3 System Monitoring

System usage statistics were collected using the NVIDIA System Management Interface (nvidia-smi) and Collectl. Collectl was used to collect usage data on CPU, network, and InfiniBand, while nvidia-smi was used to get GPU usage statistics. Both monitoring services were started on both nodes at the start of each experiment and stopped when the experiment finished.

3.4 Model Deployment

Model deployment and resource allocation are done through Spark with the YARN cluster manager, both for TensorFlowOnSpark and Horovod.

TensorFlowOnSpark is built on top of Spark, and thus needs little modification to work with hops-hadoop YARN. Horovod, on the other hand, only works with MPI. A framework around MPI, which uses Spark for resource allocation with YARN, was built to allow MPI to spawn processes in the cluster and have access to resources. In the next sections the design of the deployment architecture for both frameworks is described.

3.4.1 Parameter Server with TensorFlowOnSpark

TensorFlowOnSpark’s GPU support is not managed. After each executor is started by Spark, it checks the GPU utilization by running nvidia-smi (NVIDIA System Management Interface) and takes any GPU with low memory utilization. This clearly creates a race condition where two or more executors check the utilization at the same time and start using the same GPU. On hops-hadoop, YARN manages GPUs, and executors only get access to GPUs allocated exclusively to the container they are running in. Resource allocation and model deployment for TFoS on hops-hadoop includes the following steps:

1. Spark makes a resource request to YARN with the number of executors, CPU cores per executor, memory per executor, and number of GPUs per executor.

2. When the resource request is fulfilled, TFoS starts a coordination server on the driver and sends its address to all executors along with the client code. This server waits until all executors have registered with their respective address and port.

3. After all executors are registered, TFoS creates a cluster specification containing parameter server and worker addresses. This is then used by TensorFlow to create a session for each task. A sample cluster specification with one parameter server and three workers is shown in Listing 1.

4. Parameter updates and gradient broadcasting are then handled by distributed TensorFlow.

Cluster spec: {
    'ps': ['10.0.1.15:4287'],
    'worker': ['10.0.1.16:4062', '10.0.1.16:3426',
               '10.0.1.18:4129']
}

Listing 1: Sample cluster specification with one parameter server and three workers. The cluster specification is a Python dictionary used by TensorFlow to create a session for each task.

3.4.2 Ring All-Reduce with Horovod

Horovod relies on MPI to spawn processes and on NCCL (NVIDIA’s library for collective communication) to handle communication between processes. After MPI has started all processes and assigned a rank to each process, Horovod uses the global ranks to create a communication link between neighboring workers. This creates the ring in the ring all-reduce collective communication algorithm. Horovod broadcasts variables from the process with rank 0 to ensure consistent initialization. Each process pins a single GPU using its local rank and adds it to TensorFlow’s device list, which maps physical GPU ids to virtual GPU ids. This ensures that no two processes are assigned the same GPU. MPI processes are created using a script similar to the one shown in Listing 3. The mpirun program creates processes on remote machines using ssh. A hostfile can be used to specify where to create the processes or to limit the number of processes that can be created on each host, as shown in Listing 2.

#------ Host file ------
node1 max_slots=20
node2 max_slots=20
#-----------------------

Listing 2: Sample mpirun hostfile with 2 nodes, allowing a maximum of 20 processes per node. The hostfile is a newline-separated text file with one node per line.

% mpirun \
    -hostfile hostfile \
    -np 2 \
    -H node1 \
    -wdir WORKING_DIR \
    -x CUDA_VISIBLE_DEVICES="0,1" \
    python main.py : \
    -np 2 \
    -H node2 \
    -wdir WORKING_DIR \
    -x CUDA_VISIBLE_DEVICES="2,4" \
    python main.py

Listing 3: Sample mpirun bash script to start 2 processes per node on 2 nodes. The wdir argument changes the working directory of the process and CUDA_VISIBLE_DEVICES makes the specified GPU ids visible to the process. The hostfile is similar to the one shown in Listing 2 and is used to limit the number of processes that can be created on each node.
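The user program launched by mpirun in Listing 3 (main.py) would typically be a Horovod training script. A minimal, hypothetical sketch of such a script using the Horovod TensorFlow API is shown below; the tiny model is a stand-in and is not the benchmark code used in the experiments.

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()                                    # ranks were assigned by mpirun

# Pin exactly one GPU to this process using its local rank
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Tiny stand-in model (the experiments use the TensorFlow benchmark models)
x = tf.random_normal([32, 10])
w = tf.Variable(tf.zeros([10, 1]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - 1.0))

global_step = tf.train.get_or_create_global_step()
opt = tf.train.GradientDescentOptimizer(0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)           # gradients averaged with ring all-reduce
train_op = opt.minimize(loss, global_step=global_step)

hooks = [hvd.BroadcastGlobalVariablesHook(0),  # rank 0 broadcasts initial weights
         tf.train.StopAtStepHook(last_step=100)]

with tf.train.MonitoredTrainingSession(config=config, hooks=hooks) as sess:
    while not sess.should_stop():
        sess.run(train_op)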

Monitoring MPI processes

MPI processes and their resource usage can be monitored from the machine that runs the mpirun program using ompi-top. This MPI utility program takes the process id of the mpirun program (as shown in Listing 4) and returns usage statistics for all processes started by it.

% ompi-top -pid <pid>
% nvidia-smi --query-compute-apps=gpu_uuid,pid \
    --format=csv

Listing 4: Bash commands used for collecting system usage statistics. The first command is similar to the top command and displays system usage of all processes started by the mpirun process identified by its pid. The second command lists the gpu_uuid of each GPU in use together with the pid of the process using it.

Monitoring GPU Usage

The GPU usage on the cluster is monitored using the NVIDIA System Management Interface (nvidia-smi). The command shown in Listing 4 returns all GPUs that are in use along with the process id of the program using them.

MPI Wrapper

In the next three sections the complete architecture of the deployment mechanism for ring all-reduce (Horovod) training jobs is presented. To deploy an MPI job on a hadoop cluster:

1. Resources need to be allocated.

2. An MPI process needs to be launched with a system user that has ssh access to all nodes in the cluster.

3. The processes spawned by MPI need to be monitored.

Resource Allocation

Resource allocation is done through Spark on YARN. After Spark acquires all the required resources, the driver starts a simple socket server to coordinate all workers. Then the driver launches the client code on every executor with the address of the coordination server and waits until all workers report back with the necessary information. Each executor's client code collects its working directory, host name or ip address, assigned GPU uuid (gpu_uuid: NVIDIA GPU universally unique identifier), and any environment variables set for the executor, and sends them to the coordination server. Finally, after the coordination server has received all the responses, it makes a call to the MPI wrapper application's REST endpoint with the information from the executors and a path to the user program that MPI should launch.
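As a rough illustration of the executor-side step described above, the sketch below collects the same pieces of information and pushes them to the coordination server. The helper name, message format, and use of a plain TCP socket are all assumptions made for this sketch; the actual Hopsworks client code is not reproduced in this thesis.

import json
import os
import socket
import subprocess

def report_to_coordinator(coord_host, coord_port):
    """Collect the executor's working directory, host name, GPU uuid, and
    selected environment variables and send them to the coordination
    server started by the Spark driver (hypothetical message format)."""
    gpu_uuid = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=uuid", "--format=csv,noheader"]
    ).decode().strip()

    info = {
        "workdir": os.getcwd(),
        "host": socket.gethostname(),
        "gpu_uuid": gpu_uuid,
        "env": {"CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES", "")},
    }
    with socket.create_connection((coord_host, coord_port)) as conn:
        conn.sendall(json.dumps(info).encode())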

Launching MPI Processes

The MPI wrapper is an application that can be deployed on any web server that supports Java EE applications. The wrapper exposes REST endpoints for:

• Starting an MPI process: given a valid YARN application id and a program to run, it will start the MPI process and return its process id.

• Stopping running MPI process: given a valid YARN application id and MPI process id, it will stop the running MPI process.

• Getting status of MPI process: given a valid YARN application id and MPI process id, it will return the current status of the job (Running or Stopped).

• Getting log of MPI process: given a valid YARN application id and MPI process id, it will return the stdout and stderr of the MPI process.

• Getting all running MPI processes: given sufficient access level (Administrator), it will return all running MPI jobs.

To access any of the endpoints listed above, a user needs to have a valid authentication token or session key. This is used to identify the user making the request and to perform any necessary access control. In addition, the YARN application id is used to check whether the authenticated user has access to the application she/he is trying to run an MPI process for. When a start MPI process request is received and all the access control checks have passed:

1. An mpirun script similar to the one in Listing 3 is constructed and executed.

2. The process id of the mpirun process is saved in a database along with the application id of the Spark job and the resources allocated to the application (i.e. CPU cores, GPUs, and memory).

3. A status OK with the process id is sent back as a response.

The Spark driver can subsequently query the MPI wrapper for the status of the running MPI process and retrieve logs using the process id and application id. If the Spark job is stopped properly, it can call stop MPI process before exiting.
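As an illustration of the driver's side of this interaction, the sketch below issues a start request with Python's requests library. The base URL, endpoint path, payload fields, and authentication header are assumptions for the example, not the wrapper's documented API.

    import requests

    WRAPPER_URL = "https://example.com/mpi-wrapper"   # hypothetical base URL

    def start_mpi_job(app_id, program_path, executors, auth_token):
        # Ask the MPI wrapper to construct and execute an mpirun script for
        # the given YARN application (endpoint and payload are assumed).
        resp = requests.post(
            WRAPPER_URL + "/jobs",
            headers={"Authorization": "Bearer " + auth_token},
            json={"appId": app_id,
                  "program": program_path,
                  "executors": executors},            # info collected by the coordination server
            timeout=30)
        resp.raise_for_status()
        return resp.json()["pid"]                     # process id of the started mpirun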

Monitoring MPI Processes
Monitoring the MPI process is necessary for two reasons: (1) because the MPI process is running outside a managed container, it needs to be monitored for malicious activities, and (2) if the Spark job is terminated suddenly without a proper cleanup, stray MPI processes will remain in the system consuming resources. The monitoring thread is an Enterprise Java Beans (EJB) timer that runs periodically to check process resource usage and clean up stray MPI processes. On every run the timer thread:

1. reads all entries in the table containing the appid and pid of the MPI job. The appid is used to query YARN for the status of the application. If the application is in any of the states [FINISHED, FAILED, KILLED], the MPI process is killed, if still running, and the row is removed from the table.

2. for each row remaining in the table, the usage is compared with the assigned resources stored in the table. If the usage exceeds the resources assigned, the process is killed and the user is added to a table that can be used to block users.

The process of starting and monitoring MPI processes is shown in Figure 3.2.
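In the prototype this logic is implemented as a Java EE timer; the Python sketch below is only an illustration of one monitoring round, assuming the jobs table is available as a list of dicts and that per-process usage is read with the psutil library (simplified here to memory only).

    import os
    import re
    import signal
    import subprocess

    import psutil  # assumed available for per-process resource usage

    END_STATES = {"FINISHED", "FAILED", "KILLED"}

    def yarn_app_state(app_id):
        # Read the application state with the standard `yarn application -status` command.
        out = subprocess.check_output(["yarn", "application", "-status", app_id],
                                      universal_newlines=True)
        match = re.search(r"State\s*:\s*(\w+)", out)
        return match.group(1) if match else "UNKNOWN"

    def kill_if_exists(pid):
        # Kill the mpirun process if it is still running.
        if psutil.pid_exists(pid):
            os.kill(pid, signal.SIGKILL)

    def cleanup_round(jobs):
        # One round over the MPI jobs table; each job is a dict with the keys
        # appid, pid, allocated_mem_mb and user (mirroring the database table).
        remaining = []
        for job in jobs:
            if yarn_app_state(job["appid"]) in END_STATES:
                kill_if_exists(job["pid"])                # application is done: remove stray process
            elif (psutil.pid_exists(job["pid"]) and
                  psutil.Process(job["pid"]).memory_info().rss / 1e6 > job["allocated_mem_mb"]):
                kill_if_exists(job["pid"])                # over the allocated resources: kill
                print("resource abuse by user", job["user"])
            else:
                remaining.append(job)
        return remaining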

3.4.3 Evaluation of Model Deployment
The distributed deep learning model deployment method for ring all-reduce needs to fulfill three requirements to run on a hops-hadoop cluster:

1. The user should have access to program output.

2. The resource allocation should be managed by YARN.

3. Any process created by a user should be isolated and monitored.

These requirements need to be fulfilled by all user-submitted programs on a hops-hadoop cluster.

Figure 3.2: Starting and monitoring MPI processes using the MPI wrapper module. (The figure shows the Spark driver and executors on the four nodes, the mpi processes they start, the MPI wrapper with its database and monitoring timer, and the pseudocode of the monitoring loop over the MPI jobs table.)

3.5 Data Collection

A run with a given number of batch iterations, mini-batch size, number of workers, number of parameter servers (only for TFoS), and a given model constitutes a single experiment. Every experiment is run three to five times and the average of the runs is taken as the result of the experiment. The cluster used for the experiments was isolated to remove any interference from other processes. This kept the number of experiments needed to get a result with low variance manageable. The experiments were performed using the TensorFlow benchmarks [44] with the setup given in Table 3.2.

Table 3.2: Experiment setup

                        Parameter server (TFoS)               Ring all-reduce
                        Sync Update       Async Update        (Horovod)
# workers               3-10              3-10                3-20
# parameter servers     1, 2              1                   N/A
Batch size              32/worker         32/worker           32/worker
Dataset (ImageNet)      synthetic         synthetic           synthetic
# batches               100-200           100-200             100-200

Chapter 4

Results

In this chapter the results of the experiments run with the two distributed deep learning architectures, parameter server and ring all-reduce, are presented. Furthermore, the model deployment system discussed in section 3.4.2 (Ring All-Reduce with Horovod) is assessed against the evaluation criteria given in section 3.4.3.

4.1 Parameter Server

For the parameter server experiments with TFoS, runs with one and two parameter servers and workers ranging between 3-10 and 4-10 respectively were performed. The experiments were limited to 10 workers (i.e. a single node) because distributed TensorFlow (version 1.7) only supports RDMA over Converged Ethernet (RoCE), which is not available on the cluster used for the experiments. However, some system monitoring results from a run on two nodes are reported in section 4.1.3 to show network and system usage.

4.1.1 Synchronous Update
Synchronous update, as discussed in section 2.4.1, requires all workers to update a global model before gradients can be broadcast to all workers. In the experiments with synchronous updating, one and two parameter servers were used to investigate the effect of the extra network bandwidth. Figure 4.1 shows images per second processed by different models with one and two parameter servers.


Here, the two-parameter-server runs showed slightly better performance than the single-parameter-server runs.

4.1.2 Asynchronous Update
The two models that scaled best in the previous experiment were used for comparison in the asynchronous gradient updating experiments. Figure 4.2 shows images per second processed by the inception4 and resnet50 models with one parameter server in asynchronous parameter update mode. The asynchronous parameter update mode showed no significant performance improvement over synchronous parameter update mode.

4.1.3 Multi-Node Synchronous Update
A multi-node experiment with four workers and a single parameter server is presented in this section to show the network bottleneck that prevented the parameter server experiments from using two nodes. In this experiment two workers were placed on one node and two more workers plus the parameter server on another. The system usage of this experiment is shown in Figure 4.3. This run did not complete and had to be killed after running for 36 min (as can be seen on the x-axis of Figure 4.3). The main thing to notice in Figure 4.3 is the network I/O, which clearly shows the 1Gb/s network bottleneck. All multi-node parameter server runs failed to complete. Thus the remaining parameter server experiments were forced to run on one node by decommissioning the other node (i.e. stopping the NM on that node), which forced YARN to schedule all jobs on the remaining node.
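A rough calculation makes the bottleneck plausible. The sketch below is my own illustration, not a measurement from the experiments; it assumes FP32 gradients and a resnet50-sized model of roughly 25 million parameters, with the worker placement described above.

    # Back-of-envelope estimate (illustration only) of the cross-node traffic
    # per step with synchronous updates in the setup above.
    params = 25e6                      # approximate resnet50 parameter count
    grad_mb = params * 4 / 1e6         # ~100 MB of FP32 gradients per worker per step
    remote_workers = 2                 # workers on the node without the parameter server
    link_mb_per_s = 125                # 1 Gb/s Ethernet is at most ~125 MB/s

    in_mb = grad_mb * remote_workers   # gradients sent to the parameter server
    out_mb = grad_mb * remote_workers  # updated parameters sent back
    print(f"~{in_mb:.0f} MB in and ~{out_mb:.0f} MB out over the 1 Gb/s link per step")
    print(f"communication alone takes on the order of {in_mb / link_mb_per_s:.1f} s per step")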

4.2 Ring All-Reduce

The ring all-reduce (horovod) experiments were deployed using an MPI script similar to the one shown in Listing 3, where workers 3 to 10 were placed on the first node and workers 11 to 20 on the second node. The ring formed on a single node has peer-to-peer connections between all workers, while the ring with 11 or more workers needs to traverse an InfiniBand interconnect (NET/IB), as can be seen in Figure 3.1.

Figure 4.1: Number of images processed per second with different models and numbers of workers in synchronous gradient update mode. (a) shows results using one parameter server and (b) shows results using two parameter servers.

Figure 4.2: Number of images processed per second with different models and numbers of workers in asynchronous gradient update mode.

The results from the ring all-reduce (horovod) experiments are shown in Figure 4.4. All models except vgg16 and vgg19 show improvement with each added worker. Moreover, the number of images processed per second for the same number of workers has almost doubled compared to parameter server.

4.3 Scalability

The speedup in the number of images per second processed with each added computational resource (GPU) is used here to show the scalability of the algorithms. Figure 4.5 shows the speedup in parameter server mode compared to the ideal scaling shown in green: Figure 4.5 (a) shows one parameter server with synchronous update, (b) one parameter server with asynchronous update, and (c) two parameter servers with synchronous update. The speedups for the ring all-reduce (horovod) experiments are shown in Figure 4.6.
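The speedup plotted in these figures is assumed here to follow the usual definition, i.e. the throughput relative to the single-GPU baseline,

    speedup(n) = (images/sec with n GPUs) / (images/sec with 1 GPU),

so that ideal (linear) scaling corresponds to speedup(n) = n.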

Figure 4.3: System usage in parameter server mode with multiple nodes: two workers running on Machine 1 and the parameter server along with two workers running on Machine 2. The network usage in (a) shows how the messages from the two workers to the parameter server are choking the 1Gb/s Ethernet connection. (Each panel plots CPU usage, network I/O in MB, and GPU utilization over time.)

Figure 4.4: Number of images processed per second with different models and numbers of workers with ring all-reduce (horovod).

4.4 Resource Utilization

The resource utilization is directly affected by the time spent on communication, as can be seen in Figures 4.7 and 4.8. Figure 4.7 (a) shows GPU usage close to 100% for the entire run in ring all-reduce, while (b) shows usage that bounces up and down with the network I/O. A similar trend is seen in Figure 4.8, where the GPU utilization increases when network I/O is low.

4.5 Model Deployment

The prototype built for ring all-reduce model deployment fulfilled two of the requirements, but the third involved adding an extra service. The extra service monitors the activity of the user program and performs cleanup of stray processes. The isolation requirement was met by running all processes as a system user with limited privileges.

Figure 4.5: Speedup of model training with each added worker (GPU) in parameter server mode. (a) Synchronous update mode with one parameter server, (b) asynchronous update mode with one parameter server, and (c) synchronous update mode with two parameter servers. Speedup values (y-axis) of zero mark missing data, where no experiment was done for that number of workers (x-axis).

Figure 4.6: Speedup of model training with each added worker (GPU) with ring all-reduce (horovod).

Figure 4.7: Ring all-reduce (a) and parameter server (b) system usage comparison on inception4 with 10 workers. (Each panel plots CPU usage, network I/O in MB, and GPU utilization over time.)

Figure 4.8: One (a) and two (b) parameter servers system usage for inception4 with 10 workers. (Each panel plots CPU usage, network I/O in MB, and GPU utilization over time.)

Chapter 5

Discussion and Conclusion

Training deep learning models requires big data, making it computationally intensive and time consuming. Distributing computation on multiple devices helps reduce the time needed to train deep learning models. In this paper two widely used distributed deep learning training algorithms are analyzed with respect to scalability, fault tolerance, resource utilization, and ease of deployment in a shared hadoop cluster environment. The main focus of the study was data parallelism. Experiments on the parameter server and ring all-reduce algorithms, using the TensorFlowOnSpark and horovod frameworks, were performed on a cluster with two nodes and 20 GPUs. The goal was to find limitations and bottlenecks in the algorithms when deployed on a hadoop cluster. A method for deploying distributed training using MPI was also explored and a prototype developed and tested. In this chapter the results presented in the previous chapter are analyzed and discussed, and conclusions are given on the feasibility of running distributed deep learning algorithms as-a-service in a shared cluster environment.

5.1 Scalability

From the results shown in Figures 4.5 and 4.6 it is easy to see that ring all-reduce (horovod) scales better on all models. As discussed in section 2.3, ring all-reduce is a bandwidth-optimal algorithm and this played a role in its scalability. Figure 4.7 shows the CPU usage, network I/O, and GPU utilization of inception4 with 10 workers for ring all-reduce and parameter server.


In the case of ring all-reduce the workers have a high-bandwidth P2P PCIe link, so the network I/O is not relevant. Parameter server, on the other hand, uses gRPC, and the network I/O shows how much data is being transferred between the parameter server and workers. Another thing that is noteworthy in Figure 4.7 is the GPU utilization. In ring all-reduce the utilization stays at 100% for the majority of the execution, while in parameter server it bounces up and down with the network I/O. This also suggests that parameter server is spending much more time on communication, during which the GPUs are not in use. These experiments were only done using synthetic data. If real data was used, the GPU utilization of the ring all-reduce algorithm would have been lower: when all workers try to read data from disk, a bottleneck similar to the one in the parameter server model would be created on the disk I/O. By overlapping network and disk I/O, parameter server can probably maintain similar performance, which is not possible in ring all-reduce.

5.1.1 Number of Parameter Servers
The difference in speedup between one and two parameter servers is not that significant, as can be seen in Figures 4.1 and 4.5, and can be explained by the extra network bandwidth available for workers to communicate with the parameter servers. Figure 4.8 shows a clear difference in the network I/O (in MB total) between one and two parameter servers with ten workers. The one-parameter-server run (Figure 4.8 a) has network I/O that is slightly higher than the two-parameter-servers run (Figure 4.8 b) and that lasted for the entire run. The two-parameter-servers run shows wider dips with no network I/O, which suggests that communication is taking less time. The results here only show that there is a performance gain to be made by adding parameter servers; more experiments are needed to establish a good ratio of parameter servers to workers. The added CPU core on the second parameter server might also account for some of the performance gain.

5.1.2 Asynchronous Update
Asynchronous update did not show any speedup in performance compared to synchronous update. For this experiment inception4 and resnet50 were used and the results are shown in Figures 4.2 and 4.5b. One reason for the asynchronous update performing similarly to the synchronous one can be the fact that all workers were started at the same time and have the same computational resources. Thus, even without any synchronization constraint, all workers will finish and try to update the parameter server at the same time.

5.1.3 Ring All-Reduce
The ring all-reduce experiments scaled almost linearly on some models but did not do as well for the bigger models like vgg16 and vgg19. The results for the ring all-reduce experiments are shown in Figures 4.4 and 4.6. There are two dips in the graph of Figure 4.6, when moving from 10 to 11 and from 5 to 6 workers (on the x-axis). The first one is caused by the InfiniBand link in the ring formed across machines and it is more visible on the bigger models vgg16 and vgg19. The second one, between 5 and 6 on the x-axis, is only visible on the big models and is caused by the PCIe Host Bridge (typically the CPU) between GPUs in different groups (see section 3.2.1 for GPU placement). As discussed in section 2.4.4, the speed of the ring all-reduce algorithm is limited by the slowest communication link between neighbors, which can be seen clearly in this experiment. This shows that for big models even a 40Gb/s (32Gb/s effective bandwidth) link can be inadequate. Figure 5.1 shows the data sent via InfiniBand for inception4, which was least affected by the multi-node ring, and vgg16, which was affected the most. For vgg16, between 1000MB and 1700MB of total traffic, both in and out, is registered every second for the entire run. This is a clear sign that the processing power of the GPU is substantially outstripping the capabilities of the network I/O (InfiniBand + PCIe) on big models (i.e. models with a large number of parameters compared to the number of operations).
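To put the vgg16 numbers in perspective, a rough estimate can be made from the publicly known parameter count of vgg16 (about 138 million) and the fact that a bandwidth-optimal ring all-reduce sends roughly 2*(N-1)/N bytes per byte of gradients per worker. The Python sketch below is my own illustration, not a measurement from the experiments, but it lands in the same order of magnitude as the 1000MB to 1700MB per second observed on the InfiniBand link.

    # Rough estimate (illustration only) of per-worker ring all-reduce traffic
    # for vgg16, assuming ~138 million FP32 parameters and the 2*(N-1)/N bytes
    # sent per byte of gradients in a bandwidth-optimal ring all-reduce.
    params = 138e6                              # approximate vgg16 parameter count
    grad_mb = params * 4 / 1e6                  # ~552 MB of FP32 gradients
    workers = 11

    sent_per_worker_mb = 2 * (workers - 1) / workers * grad_mb
    print(f"~{grad_mb:.0f} MB of gradients per step")
    print(f"~{sent_per_worker_mb:.0f} MB sent per worker per all-reduce step")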

Figure 5.1: InfiniBand usage (in MB) for inception4 and vgg16 with 11 workers, using the ring all-reduce algorithm (horovod). (a) Machine 1 running workers 1 to 10 and (b) Machine 2 running worker 11.

5.2 Fault Tolerance

Neither ring all-reduce nor parameter server with synchronous update is fault tolerant, i.e. if one worker fails the job cannot recover. On the other hand, parameter server with asynchronous update can cope with the loss of a worker because workers update model gradients independently. This makes it more suitable for a dynamic and shared cluster environment where workers can be killed or die due to resource starvation. Most implementations of deep learning training checkpoint progress and can recover from a checkpoint after failure. YARN and Spark both have support for re-submitting failed jobs by setting a property, which controls the number of retries, when starting an application. So by checkpointing training progress and using the retry capability of YARN and Spark, some level of fault tolerance can be achieved even when synchronous training methods are used.
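As a minimal sketch (assuming the training job is submitted through PySpark; the value 3 is only an example), the retry property can be set on the Spark configuration when the application is created. The effective number of attempts is also capped by the cluster-side yarn.resourcemanager.am.max-attempts setting.

    from pyspark import SparkConf, SparkContext

    # Ask YARN to re-attempt the application a few times after a failure; the
    # training code is then expected to restore from its latest checkpoint.
    conf = (SparkConf()
            .setAppName("distributed-training")
            .set("spark.yarn.maxAppAttempts", "3"))

    sc = SparkContext(conf=conf)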

5.3 Resource Utilization

GPUs are a valuable resource on a cluster, but cannot be shared by multiple jobs; even when a GPU is underutilized, the job that reserved it might be using all available memory on the GPU. This makes underutilized GPUs wasted computational resources. As Figures 4.7 and 4.8 show, the available communication bandwidth affects the GPU utilization considerably, and thus should be the main consideration when training is distributed. However, the communication bandwidth required for optimal utilization of the GPUs is not only determined by the size of the model to be trained. Other factors like the number of floating point operations needed to compute the gradients and the computational power of the GPUs also determine the amount of data communicated, and consequently the bandwidth required. Therefore, when configuring a cluster for distributed training, the model parameter size and floating point operations, the GPU power, the network bandwidth, and the disk I/O need to be examined to maximize GPU utilization.

5.4 Model Deployment

The MPI wrapper prototype built and used to deploy deep learning training with horovod works for the purpose it was built for, but it adds an extra point of failure and an extra service to maintain in a system that already contains many services and is complicated. Being able to run MPI processes inside a YARN container would eliminate the need for an extra system to monitor MPI processes. This would also give stronger process isolation.

5.5 Possible Sources of Error

The results reported in this paper did not account for all factors that can affect the performance of distributed training. Disk I/O is one such factor that was not considered, but it can have a considerable impact on the scalability of the architectures. In particular, disk I/O will have more impact on ring all-reduce, mainly because there are no other considerable bottlenecks in that architecture. To eliminate any interference from other processes, all experiments were performed in an isolated cluster environment. This helped in getting results with low variance in fewer experiments (all results with standard deviation can be seen in Appendix A). However, this also made the results less relevant to a real-world dynamic cluster environment.

5.6 Conclusion

Distributing deep learning training did improve performance in most of the experiments done for this work. Although there is a clear difference in performance between the algorithms, there is no one method that is better than the others in all aspects. Ring all-reduce scales best in all tests, but parameter server is easier to deploy in a hadoop cluster, and parameter server with asynchronous update mode is more fault tolerant and thus well suited for a dynamic environment. The choice of distributed deep learning algorithm should take into consideration:

• The size of the model to be trained and the resources available.

For example, to train a big model like vgg16 the network bandwidth, PCIe bus speed, and GPU placement need to be considered, but for a smaller one we might only need to consider the network bandwidth.

• The trade-off between scalability and fault tolerance. If the cluster in use is not shared, fault tolerance might not be as important to consider and we can choose a scalable algorithm like ring all-reduce. On the other hand, in a shared dynamic environment workers might run slow or even die because of resource starvation or network congestion. In these situations fault tolerance may be more desirable.

• When it comes to deployment, both algorithms use a well-known framework (Spark) that is relatively easy to use, but ring all-reduce involves extra services like MPI and the MPI wrapper. This might require additional effort for installation and maintenance.

These conclusions are based on the limited number of experiments that were performed on a single cluster configuration. Furthermore, not all the variables that can affect the performance were included in the experiments (e.g. disk I/O). Despite these limitations, this work contributes by identifying possible bottlenecks and pointing out considerations to be made when training is distributed.

5.7 Future Work

This work only explored the effects of communication on distributed deep learning training; the effects of disk I/O have not been considered. In the future, disk I/O with different disk types, distributed file systems, and reading methods like Spark or TensorFlow readers can be investigated. The ratio of parameter servers to workers is something that needs more experimentation but was not considered in this work. It was shown in this paper that adding parameter servers can improve performance. As future work this ratio can be studied for different network configurations. Parameter server with asynchronous update was only tested with workers running on devices with the same computational power and that were started simultaneously.

It would be interesting to investigate the performance improvement if workers were to run on devices with different power and/or were started out of sync. The experiments did not include model parallelism, which is interesting for big models that cannot fit in the memory of a single GPU. The effects of communication, disk I/O, device placement, and scheduling on model parallelism could be interesting to look into in the future. Model deployment for MPI processes uses simple resource monitoring on top of an existing well-tested YARN resource manager. Thus, a method that would allow MPI processes to run inside YARN containers is worth investigating. Finally, by performing similar experiments on different hardware setups, clear guidelines can be developed. These guidelines could be used to predict the expected speedup and resource utilization of a distributed deep learning training given as input: a hardware setup, model size, and the operational complexity of a model.

Bibliography

[1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. J. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Józefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. G. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. A. Tucker, V. Vanhoucke, V. Vasudevan, F. B. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. "TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems". In: Computing Research Repository abs/1603.04467 (2016). arXiv: 1603.04467. URL: http://arxiv.org/abs/1603.04467.
[2] A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. J. Smola. "Scalable Inference in Latent Variable Models". In: Proceedings of the Fifth ACM International Conference on Web Search and Data Mining. WSDM '12. Seattle, Washington, USA: ACM, 2012, pp. 123-132. ISBN: 978-1-4503-0747-5. DOI: 10.1145/2124295.2124312. URL: http://doi.acm.org/10.1145/2124295.2124312.
[3] T. Akiba, S. Suzuki, and K. Fukuda. "Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes". In: Computing Research Repository abs/1711.04325 (2017). arXiv: 1711.04325. URL: http://arxiv.org/abs/1711.04325.
[4] G. M. Amdahl. "Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities". In: Proceedings of the April 18-20, 1967, Spring Joint Computer Conference. AFIPS '67 (Spring). Atlantic City, New Jersey: ACM, 1967, pp. 483-485. DOI: 10.1145/1465482.1465560. URL: http://doi.acm.org/10.1145/1465482.1465560.
[5] R. Andersson. "GPU integration for Deep Learning on YARN". MA thesis. KTH, School of Information and Communication Technology (ICT), 2017.
[6] B. Baker, O. Gupta, N. Naik, and R. Raskar. "Designing Neural Network Architectures using Reinforcement Learning". In: Computing Research Repository abs/1611.02167 (2016). arXiv: 1611.02167. URL: http://arxiv.org/abs/1611.02167.
[7] L. Bottou. "Stochastic gradient learning in neural networks". In: Proceedings of Neuro-Nîmes 91.8 (1991), p. 0.
[8] T. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman. "Project Adam: Building an Efficient and Scalable Deep Learning Training System". In: 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14). Broomfield, CO: USENIX Association, 2014, pp. 571-582. ISBN: 978-1-931971-16-4. URL: https://www.usenix.org/conference/osdi14/technical-sessions/presentation/chilimbi.
[9] F. Chollet. keras. https://github.com/fchollet/keras. 2015.
[10] D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber. "Deep Big Simple Neural Nets Excel on Handwritten Digit Recognition". In: Computing Research Repository abs/1003.0358 (2010). arXiv: 1003.0358. URL: http://arxiv.org/abs/1003.0358.
[11] A. Coates, A. Ng, and H. Lee. "An analysis of single-layer networks in unsupervised feature learning". In: Proceedings of the fourteenth international conference on artificial intelligence and statistics. 2011, pp. 215-223.
[12] R. Collobert, K. Kavukcuoglu, and C. Farabet. "Torch7: A Matlab-like Environment for Machine Learning". In: BigLearn, NIPS Workshop. 2011, pp. 1-6. URL: http://publications.idiap.ch/downloads/papers/2011/Collobert_NIPSWORKSHOP_2011.pdf.
[13] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. aurelio Ranzato, A. Senior, P. Tucker, K. Yang, Q. V. Le, and A. Y. Ng. "Large Scale Distributed Deep Networks". In: Advances in Neural Information Processing Systems 25. Ed. by F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger. Curran Associates, Inc., 2012, pp. 1223-1231. URL: http://papers.nips.cc/paper/4687-large-scale-distributed-deep-networks.pdf.
[14] J. Dean and S. Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters". In: Commun. ACM 51.1 (Jan. 2008), pp. 107-113. ISSN: 0001-0782. DOI: 10.1145/1327452.1327492. URL: http://doi.acm.org/10.1145/1327452.1327492.
[15] E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine, R. H. Castain, D. J. Daniel, R. L. Graham, and T. S. Woodall. "Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation". In: Proceedings, 11th European PVM/MPI Users' Group Meeting. Budapest, Hungary, 2004, pp. 97-104.
[16] P. Goyal, P. Dollár, R. B. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour". In: Computing Research Repository abs/1706.02677 (2017). arXiv: 1706.02677. URL: http://arxiv.org/abs/1706.02677.
[17] S. Gupta, W. Zhang, and F. Wang. "Model Accuracy and Runtime Tradeoff in Distributed Deep Learning: A Systematic Study". In: Proceedings of the 26th International Joint Conference on Artificial Intelligence. IJCAI'17. Melbourne, Australia: AAAI Press, 2017, pp. 4854-4858. ISBN: 978-0-9992411-0-3. URL: http://dl.acm.org/citation.cfm?id=3171837.3171972.
[18] A. Halevy, P. Norvig, and F. Pereira. "The Unreasonable Effectiveness of Data". In: IEEE Intelligent Systems 24.2 (2009), pp. 8-12. ISSN: 1541-1672. DOI: 10.1109/MIS.2009.36.
[19] K. He, X. Zhang, S. Ren, and J. Sun. "Deep Residual Learning for Image Recognition". In: Computing Research Repository abs/1512.03385 (2015). arXiv: 1512.03385. URL: http://arxiv.org/abs/1512.03385.
[20] Q. Ho, J. Cipar, H. Cui, J. K. Kim, S. Lee, P. B. Gibbons, G. A. Gibson, G. R. Ganger, and E. P. Xing. "More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server". In: Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 1. NIPS'13. Lake Tahoe, Nevada: Curran Associates Inc., 2013, pp. 1223-1231. URL: http://dl.acm.org/citation.cfm?id=2999611.2999748.
[21] Q. Ho, J. Cipar, H. Cui, S. Lee, J. K. Kim, P. B. Gibbons, G. A. Gibson, G. Ganger, and E. P. Xing. "More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server". In: Advances in Neural Information Processing Systems 26. Ed. by C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger. Curran Associates, Inc., 2013, pp. 1223-1231. URL: http://papers.nips.cc/paper/4894-more-effective-distributed-ml-via-a-stale-synchronous-parallel-parameter-server.pdf.
[22] K. Hornik. "Approximation capabilities of multilayer feedforward networks". In: Neural Networks 4.2 (1991), pp. 251-257. ISSN: 0893-6080. DOI: https://doi.org/10.1016/0893-6080(91)90009-T. URL: http://www.sciencedirect.com/science/article/pii/089360809190009T.
[23] M. Jaderberg, V. Dalibard, S. Osindero, W. M. Czarnecki, J. Donahue, A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan, C. Fernando, and K. Kavukcuoglu. "Population Based Training of Neural Networks". In: Computing Research Repository abs/1711.09846 (2017). arXiv: 1711.09846. URL: http://arxiv.org/abs/1711.09846.
[24] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. B. Girshick, S. Guadarrama, and T. Darrell. "Caffe: Convolutional Architecture for Fast Feature Embedding". In: Computing Research Repository abs/1408.5093 (2014). arXiv: 1408.5093. URL: http://arxiv.org/abs/1408.5093.
[25] A. Krizhevsky. "One weird trick for parallelizing convolutional neural networks". In: Computing Research Repository abs/1404.5997 (2014). arXiv: 1404.5997. URL: http://arxiv.org/abs/1404.5997.
[26] A. Krizhevsky, I. Sutskever, and G. E. Hinton. "ImageNet Classification with Deep Convolutional Neural Networks". In: Advances in Neural Information Processing Systems 25. Ed. by F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger. Curran Associates, Inc., 2012, pp. 1097-1105. URL: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.
[27] Q. V. Le, R. Monga, M. Devin, G. Corrado, K. Chen, M. Ranzato, J. Dean, and A. Y. Ng. "Building high-level features using large scale unsupervised learning". In: Computing Research Repository abs/1112.6209 (2011). arXiv: 1112.6209. URL: http://arxiv.org/abs/1112.6209.
[28] Q. V. Le, J. Ngiam, A. Coates, A. Lahiri, B. Prochnow, and A. Y. Ng. "On optimization methods for deep learning". In: Proceedings of the 28th International Conference on International Conference on Machine Learning. Omnipress. 2011, pp. 265-272.
[29] Y. Lee, S. Jun, C. Bobbie, F. Andy, and Yahoo Big ML team. TensorFlowOnSpark. https://github.com/yahoo/TensorFlowOnSpark. 2016.
[30] M. Li, D. G. Anderson, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su. "Scaling Distributed Machine Learning with the Parameter Server". In: Operating Systems Design and Implementation (OSDI). 2014, pp. 583-598.
[31] H. Ma, F. Mao, and G. W. Taylor. "Theano-MPI: a Theano-based Distributed Training Framework". In: Computing Research Repository abs/1605.08325 (2016). arXiv: 1605.08325. URL: http://arxiv.org/abs/1605.08325.
[32] R. Mayer, C. Mayer, and L. Laich. "The TensorFlow Partitioning and Scheduling Problem: It's the Critical Path!" In: Computing Research Repository abs/1711.01912 (2017). arXiv: 1711.01912. URL: http://arxiv.org/abs/1711.01912.
[33] A. Mirhoseini, H. Pham, Q. V. Le, B. Steiner, R. Larsen, Y. Zhou, N. Kumar, M. Norouzi, S. Bengio, and J. Dean. "Device Placement Optimization with Reinforcement Learning". In: Computing Research Repository abs/1706.04972 (2017). arXiv: 1706.04972. URL: http://arxiv.org/abs/1706.04972.
[34] B. C. Ooi, K.-L. Tan, S. Wang, W. Wang, Q. Cai, G. Chen, J. Gao, Z. Luo, A. K. H. Tung, Y. Wang, Z. Xie, M. Zhang, and K. Zheng. "SINGA: A Distributed Deep Learning Platform". In: ACM Multimedia. 2015.
[35] F. Pellegrini. "Distillating knowledge about SCOTCH". In: Combinatorial Scientific Computing. Ed. by U. Naumann, O. Schenk, H. D. Simon, and S. Toledo. Dagstuhl Seminar Proceedings 09061. Dagstuhl, Germany: Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, Germany, 2009. URL: http://drops.dagstuhl.de/opus/volltexte/2009/2091.
[36] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. "Learning Internal Representations by Error Propagation". In: Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1 (1986), pp. 318-362. URL: http://dl.acm.org/citation.cfm?id=104279.104293.
[37] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F. Li. "ImageNet Large Scale Visual Recognition Challenge". In: Computing Research Repository abs/1409.0575 (2014). arXiv: 1409.0575. URL: http://arxiv.org/abs/1409.0575.
[38] A. Sergeev and M. D. Balso. "Horovod: fast and easy distributed deep learning in TensorFlow". In: Computing Research Repository abs/1802.05799 (2018). arXiv: 1802.05799. URL: http://arxiv.org/abs/1802.05799.
[39] S. Shams, R. Platania, K. Lee, and S. J. Park. "Evaluation of Deep Learning Frameworks Over Different HPC Architectures". In: 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS). 2017, pp. 1389-1396. DOI: 10.1109/ICDCS.2017.259.
[40] S. Shi and X. Chu. "Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs". In: Computing Research Repository abs/1711.05979 (2017). arXiv: 1711.05979. URL: http://arxiv.org/abs/1711.05979.
[41] K. Simonyan and A. Zisserman. "Very Deep Convolutional Networks for Large-Scale Image Recognition". In: Computing Research Repository abs/1409.1556 (2014). arXiv: 1409.1556. URL: http://arxiv.org/abs/1409.1556.
[42] A. Smola and S. Narayanamurthy. "An Architecture for Parallel Topic Models". In: Proc. VLDB Endow. 3.1-2 (Sept. 2010), pp. 703-710. ISSN: 2150-8097. DOI: 10.14778/1920841.1920931. URL: http://dx.doi.org/10.14778/1920841.1920931.
[43] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. "Rethinking the Inception Architecture for Computer Vision". In: Computing Research Repository abs/1512.00567 (2015). arXiv: 1512.00567. URL: http://arxiv.org/abs/1512.00567.
[44] TensorFlow benchmarks. https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks. 2018.
[45] V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, T. Graves, J. Lowe, H. Shah, S. Seth, B. Saha, C. Curino, O. O'Malley, S. Radia, B. Reed, and E. Baldeschwieler. "Apache Hadoop YARN: Yet Another Resource Negotiator". In: Proceedings of the 4th Annual Symposium on Cloud Computing. SOCC '13. Santa Clara, California: ACM, 2013, 5:1-5:16. ISBN: 978-1-4503-2428-1. DOI: 10.1145/2523616.2523633. URL: http://doi.acm.org/10.1145/2523616.2523633.
[46] Welcome to Apache Hadoop!. URL: http://hadoop.apache.org/ (visited on 01/30/2018).
[47] M. Zaharia, R. S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, M. J. Franklin, A. Ghodsi, J. Gonzalez, S. Shenker, and I. Stoica. "Apache Spark: A Unified Engine for Big Data Processing". In: Commun. ACM 59.11 (Oct. 2016), pp. 56-65. ISSN: 0001-0782. DOI: 10.1145/2934664. URL: http://doi.acm.org/10.1145/2934664.
[48] S. Zheng, Q. Meng, T. Wang, W. Chen, N. Yu, Z. Ma, and T. Liu. "Asynchronous Stochastic Gradient Descent with Delay Compensation for Distributed Deep Learning". In: Computing Research Repository abs/1609.08326 (2016). arXiv: 1609.08326. URL: http://arxiv.org/abs/1609.08326.
[49] M. A. Zinkevich, M. Weimer, A. Smola, and L. Li. "Parallelized Stochastic Gradient Descent". In: Proceedings of the 23rd International Conference on Neural Information Processing Systems - Volume 2. NIPS'10. Vancouver, British Columbia, Canada: Curran Associates Inc., 2010, pp. 2595-2603. URL: http://dl.acm.org/citation.cfm?id=2997046.2997185.
[50] L. Zou. Bringing HPC Techniques to Deep Learning. Feb. 2017. URL: http://research.baidu.com/bringing-hpc-techniques-deep-learning/ (visited on 03/03/2018).

Appendix A

Complete Results

The average images/sec results shown in the histograms are presented here in tabular form with the standard deviation. Models that were not included in the results, because their high images/sec values made the plots hard to read, are also presented here.

Table A.1: Average Images/sec of 3-5 runs with standard deviation for each model with parameter server in synchronous update mode (TFoS).

(each cell gives images/sec with the standard deviation in parentheses)

# workers | AlexNet          | VGG16         | VGG19         | Resnet50      | Resnet152      | googlenet        | inception4
1         | 1432.81 (0.00)   | 128.90 (0.00) | 109.85 (0.00) | 188.56 (0.00) |  82.03 (0.00)  |  440.02 (0.00)   |  63.80 (0.00)
2         |  105.27 (0.15)   |   0.0 (0.0)   |  27.34 (9.27) |   0.0 (0.0)   | 119.67 (0.00)  |  699.19 (0.00)   | 104.86 (0.00)
3         |  131.81 (28.11)  |  67.86 (1.43) |  64.88 (1.50) | 355.68 (0.99) | 150.39 (4.11)  |  944.22 (15.99)  | 145.05 (0.41)
4         |  120.27 (20.76)  |  84.06 (0.20) |  80.03 (1.77) | 402.40 (0.43) | 167.64 (6.72)  | 1123.82 (28.62)  | 178.42 (0.34)
5         |  136.27 (22.35)  |  95.79 (0.09) |  91.41 (1.47) | 432.12 (3.67) | 179.95 (8.67)  | 1243.83 (29.63)  | 194.71 (7.20)
6         |  148.18 (23.77)  | 105.66 (1.36) | 102.64 (1.80) | 455.15 (1.86) | 187.93 (9.71)  | 1306.78 (45.10)  | 211.75 (8.94)
7         |  155.85 (24.44)  | 107.85 (1.71) | 106.27 (3.59) | 461.55 (5.68) | 196.67 (11.76) | 1361.72 (41.84)  | 215.75 (12.75)
8         |  260.30 (0.00)   | 114.73 (3.38) | 111.24 (3.13) | 471.17 (2.39) | 193.01 (10.50) | 1393.88 (52.32)  | 227.42 (11.03)
9         |  153.94 (23.26)  | 116.74 (1.66) | 114.03 (1.63) | 457.71 (6.06) | 185.67 (8.43)  | 1308.25 (81.43)  | 222.10 (10.06)
10        |  153.66 (4.43)   | 116.34 (2.49) | 114.34 (0.72) | 466.43 (3.90) | 185.43 (2.75)  | 1370.13 (7.31)   | 221.94 (1.66)


Table A.2: Average Images/sec of 3-5 runs with standard deviation for each model with two parameter servers in synchronous update mode (TFoS).

(each cell gives images/sec with the standard deviation in parentheses)

# workers | AlexNet         | VGG16         | VGG19         | Resnet50      | Resnet152      | googlenet        | inception4
1         | 1432.81 (0.00)  | 128.90 (0.00) | 109.85 (0.00) | 188.56 (0.00) |  82.03 (0.00)  |  440.02 (0.00)   |  63.80 (0.00)
3         |  164.08 (0.00)  |   0.0 (0.0)   |  64.80 (0.20) |   0.0 (0.0)   | 178.81 (0.00)  | 1020.55 (0.00)   | 155.45 (0.00)
4         |  211.52 (2.84)  |  85.00 (1.08) |  68.15 (3.24) | 455.91 (3.13) | 194.71 (7.60)  | 1249.83 (0.00)   | 187.90 (1.58)
5         |  236.42 (4.22)  |  95.77 (1.29) |  77.44 (4.11) | 489.91 (4.18) | 207.11 (10.20) | 1343.26 (33.11)  | 216.91 (3.06)
6         |  257.29 (3.79)  | 104.80 (0.61) |  85.99 (3.92) | 512.80 (5.59) | 215.97 (10.74) | 1432.41 (32.33)  | 238.84 (4.18)
7         |  266.63 (5.08)  | 110.76 (5.31) |  93.03 (3.56) | 524.54 (2.43) | 217.64 (11.08) | 1482.79 (44.52)  | 249.46 (7.26)
8         |  274.71 (3.98)  | 108.08 (0.71) | 105.37 (8.02) | 532.28 (0.59) | 221.90 (7.86)  | 1455.12 (67.11)  | 263.29 (8.74)
9         |    0.0 (0.0)    |  96.72 (0.82) | 114.39 (0.08) | 526.31 (5.90) | 216.96 (1.19)  | 1451.30 (3.70)   | 260.49 (0.31)
10        |  276.99 (4.77)  |  97.71 (0.02) | 115.59 (0.09) | 529.85 (1.69) | 217.35 (0.17)  | 1458.97 (3.01)   | 264.38 (2.20)

Table A.3: Average Images/sec of 3-5 runs with standard deviation for each model with one parameter server in asynchronous update mode (TFoS).

(each cell gives images/sec with the standard deviation in parentheses)

# workers | AlexNet         | VGG16         | Resnet50      | inception4
1         | 1432.81 (0.00)  | 128.90 (0.00) | 188.56 (0.00) |  63.80 (0.00)
3         |  166.77 (0.26)  |  70.71 (0.13) | 349.65 (0.52) | 139.60 (0.10)
4         |  207.14 (0.30)  |  87.68 (0.18) | 395.99 (0.88) | 172.60 (0.37)
5         |  217.33 (0.19)  |  96.33 (0.07) | 449.37 (0.85) | 198.29 (0.30)
6         |  248.82 (0.30)  | 109.54 (0.02) | 455.33 (1.11) | 219.02 (0.35)
7         |  266.32 (0.24)  | 113.63 (0.11) | 470.66 (0.63) | 227.57 (0.36)
8         |  246.53 (0.19)  | 112.33 (0.15) | 463.88 (0.80) | 227.75 (0.29)
9         |  272.31 (0.33)  | 113.23 (0.22) | 481.72 (0.67) | 223.26 (0.36)
10        |  283.62 (0.27)  | 115.62 (0.25) | 484.82 (0.57) | 225.68 (0.23)

Table A.4: Average Images/sec of 3-5 runs with standard deviation for each model with ring all-reduce (horovod).

(each cell gives images/sec with the standard deviation in parentheses)

# workers | AlexNet           | VGG16          | VGG19          | Resnet50        | Resnet152       | googlenet         | inception4
1         | 1432.81 (0.00)    | 128.90 (0.00)  | 109.85 (0.00)  |  188.56 (0.00)  |   82.03 (0.00)  |  440.02 (0.00)    |  63.80 (0.00)
2         |    0.0 (0.0)      |   0.0 (0.0)    |   0.0 (0.0)    |    0.0 (0.0)    |    0.0 (0.0)    |    0.0 (0.0)      |   0.0 (0.0)
3         | 2007.18 (1.89)    | 335.42 (0.41)  | 294.34 (0.62)  |  504.38 (0.61)  |  224.90 (0.42)  | 1149.66 (6.22)    | 178.95 (0.25)
4         | 2414.63 (8.72)    | 439.02 (0.63)  | 387.88 (0.70)  |  647.68 (0.71)  |  293.05 (0.50)  | 1497.01 (7.08)    | 231.73 (0.57)
5         | 3039.77 (5.68)    | 543.53 (0.95)  | 482.45 (0.61)  |  784.26 (1.92)  |  358.87 (0.42)  | 1816.52 (4.24)    | 283.13 (0.32)
6         | 2717.42 (8.20)    | 499.80 (1.08)  | 436.19 (0.99)  |  908.32 (2.95)  |  399.34 (0.63)  | 2113.93 (12.79)   | 330.07 (1.28)
7         | 3099.85 (5.31)    | 582.00 (1.60)  | 506.65 (1.70)  | 1050.55 (1.62)  |  460.18 (0.56)  | 2481.44 (10.41)   | 383.24 (0.24)
8         | 3487.69 (4.58)    | 670.69 (1.16)  | 584.85 (2.46)  | 1188.63 (1.98)  |  523.12 (1.93)  | 2826.91 (15.74)   | 433.26 (0.62)
9         | 3879.63 (13.60)   | 762.74 (0.61)  | 668.27 (0.47)  | 1318.68 (4.28)  |  584.56 (0.87)  | 3125.01 (5.76)    | 481.81 (0.94)
10        | 4746.46 (168.26)  | 848.39 (1.12)  | 751.36 (0.72)  | 1443.92 (2.54)  |  639.42 (0.97)  | 3461.96 (14.83)   | 527.54 (1.05)
11        | 1234.02 (12.97)   | 459.20 (7.82)  | 453.44 (5.24)  | 1379.67 (5.42)  |  650.78 (0.58)  | 3300.61 (47.66)   | 556.51 (2.84)
12        | 1276.70 (66.53)   | 497.18 (9.45)  | 489.14 (6.99)  | 1520.42 (5.44)  |  709.47 (0.41)  | 3554.50 (31.63)   | 606.54 (2.02)
13        | 1434.08 (7.19)    | 535.77 (12.78) | 521.51 (8.08)  | 1652.02 (6.88)  |  766.07 (1.53)  | 3705.32 (108.70)  | 657.27 (1.30)
14        | 1485.50 (78.80)   | 587.30 (15.31) | 565.86 (3.83)  | 1771.81 (3.32)  |  823.95 (3.07)  | 4052.64 (105.83)  | 706.63 (2.52)
15        | 1616.53 (11.30)   | 618.42 (9.02)  | 587.61 (8.59)  | 1890.97 (5.72)  |  875.53 (3.78)  | 4250.37 (116.31)  | 753.88 (2.76)
16        | 1695.17 (18.24)   | 629.99 (8.87)  | 607.07 (4.36)  | 2024.21 (4.60)  |  934.79 (2.33)  | 4489.39 (230.90)  | 804.55 (2.91)
17        | 1805.59 (14.00)   | 640.09 (19.58) | 606.11 (6.86)  | 2152.22 (3.95)  |  990.19 (1.73)  | 4826.58 (149.50)  | 852.91 (1.67)
18        | 1915.58 (6.32)    | 672.51 (7.51)  | 625.13 (2.40)  | 2278.12 (4.39)  | 1039.37 (3.50)  | 5056.34 (156.32)  | 899.75 (2.38)
19        | 1997.52 (6.99)    | 688.86 (6.19)  | 649.60 (3.51)  | 2380.86 (5.29)  | 1095.00 (2.35)  | 5257.19 (216.22)  | 946.83 (1.63)
20        | 2397.04 (70.11)   | 725.06 (3.34)  | 668.09 (6.68)  | 2491.28 (11.02) | 1145.63 (3.42)  | 5718.07 (114.83)  | 992.43 (2.98)