
Optimizing Network Performance in Distributed Machine Learning

Luo Mai (Imperial College London), Chuntao Hong (Microsoft Research), Paolo Costa (Microsoft Research)

Abstract

To cope with the ever growing availability of training data, there have been several proposals to scale machine learning computation beyond a single server and distribute it across a cluster. While this enables reducing the training time, the observed speed-up is often limited by network bottlenecks.

To address this, we design MLNET, a host-based communication layer that aims to improve the network performance of distributed machine learning systems. This is achieved through a combination of traffic reduction techniques (to diminish network load in the core and at the edges) and traffic management (to reduce average training time). A key feature of MLNET is its compatibility with existing hardware and software infrastructure so it can be immediately deployed.

We describe the main techniques underpinning MLNET and show through simulation that the overall training time can be reduced by up to 78%. While preliminary, our results indicate the critical role played by the network and the benefits of introducing a new communication layer to increase the performance of distributed machine learning systems.

1 Introduction

Over the last decade, machine learning has witnessed an increasing wave of popularity across several domains, including web search, image and speech recognition, text processing, gaming, and health care. A key factor causing this trend is the availability of large amounts of data that can be used for training purposes. This has led to the appearance of several proposals aiming at scaling out the computation by distributing it across many servers [1, 13, 16, 24, 34, 35].

Typically, these systems adopt an approach referred to as data parallelism [16]. Rather than training a single model with all the available input data, they replicate the model across many servers and feed each replica with a subset of the input data. Since the model replicas are trained using different input data, their model parameters will typically diverge. To reconcile these parameters and ensure that all model replicas eventually converge, each replica periodically pushes its set of parameter values to a centralized server, called the parameter server [24]. The latter aggregates all the received updates for each parameter (e.g., by averaging them) and then sends back to all replicas the newly computed set of values, which will be used at the beginning of the next iteration. As the total number of parameters can be very high (up to 10^12 [9]), multiple parameter servers are used, with each one being responsible for a subset of the parameters.
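To make the aggregation step above concrete, the following sketch shows a parameter server averaging the values pushed by its replicas before sending the new set back to all of them; the function and data layout are our own illustration, not code from any of the cited systems.

```python
import numpy as np

def aggregate(updates):
    """Average the parameter values pushed by all model replicas.

    `updates` is a list of equally-shaped arrays, one per replica, each
    holding that replica's current values for one parameter shard.
    """
    return np.mean(updates, axis=0)

# Three replicas push diverging values for a shard of four parameters;
# the averaged result is what the parameter server sends back to everyone.
replica_updates = [
    np.array([0.9, 1.2, 0.1, 0.4]),
    np.array([1.1, 1.0, 0.2, 0.5]),
    np.array([1.0, 1.1, 0.3, 0.3]),
]
new_values = aggregate(replica_updates)
```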
A major challenge of this approach is the high communication cost. Model replicas must frequently read and write global shared parameters. This generates a large amount of network traffic and, due to the sequential nature of many of the machine learning algorithms used, it may also stall the computation if the synchronization frequency is high. Therefore, the network is often regarded as one of the main bottlenecks for distributed machine learning systems [13, 16, 25]. To alleviate this issue, these systems are often deployed on high-performance network fabrics such as InfiniBand or RoCE [13, 23], while others have proposed to trade off training efficiency for system performance by introducing asynchronous communication [10, 24], thus removing some of the barriers. Unfortunately, neither of these approaches is completely satisfactory, as the former significantly increases infrastructure costs while the latter reduces overall training efficiency.

In this paper, we explore a different yet complementary point of the design space. We argue that network bottlenecks can be greatly reduced through a customized communication layer. To demonstrate this, we designed MLNET, a novel communication layer for distributed machine learning. MLNET uses tree-based overlays to implement distributed aggregation and multicast and reduce network traffic, and relies on traffic control and prioritization to improve average training time.

A key constraint underlying our design is that we wanted MLNET to be a drop-in solution for existing machine learning deployments. Therefore, we implemented MLNET as a user-space process running on hosts, without requiring any changes in the network hardware, in the OS stack, or in the training algorithm code. Further, by sitting in between workers and parameter servers, it decouples these two classes of servers, enabling scaling each one independently and efficiently masking network and server failures. We evaluate its effectiveness in Section 4 by means of large-scale simulations with 800 servers and 50 to 400 parameter servers. The results show that MLNET reduces the training time by a factor of up to 5x.

2 Background

In this section, we provide a brief introduction to machine learning, motivate the need for its distributed execution, and discuss the use of parameter servers for scaling distributed machine learning.

2.1 Machine Learning

The goal of a machine learning algorithm is to construct a prediction model that extracts useful knowledge from training data and uses it to make inferences about future, unseen data. This can be formalized as an optimization problem: given a set of training data X, it tries to find a model W that minimizes the error of a prediction function F(X,W).
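One common way to write this objective down explicitly (the notation below is ours; the text does not commit to a particular loss function) is as empirical risk minimization over the N training samples:

```latex
% X = {(x_i, y_i)}, i = 1..N, is the training set, F(x; W) the prediction
% function, and \ell a per-sample error measure (e.g., squared loss).
\begin{equation*}
  W^{\star} \,=\, \operatorname*{arg\,min}_{W} \;
      \frac{1}{N} \sum_{i=1}^{N} \ell\bigl(F(x_i; W),\, y_i\bigr)
\end{equation*}
```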
Typically, a machine learning algorithm approaches this problem iteratively, starting from a randomly generated W and then refining its solution gradually as more and more data are processed.

Complex models are usually able to capture the knowledge hidden in training data. At the extreme, a sufficiently complex model can "memorize" all the information contained in the data. In this case, it can give the correct prediction for any sample it has seen before, but it may perform poorly on unseen samples. This is called over-fitting: a model fits its training data well but does not generalize to other data. This is why a large amount of training data is necessary for machine learning: by using more data, a model can generalize sufficiently, reducing the risk of over-fitting.

2.2 Distributed Machine Learning

As the size of training data can significantly affect prediction accuracy, it has become common practice to train models with large datasets. To speed up these training tasks, they are often distributed across many servers.

In a distributed setting, a server iteratively refines a shared model by learning from a local data partition, and periodically synchronizes this model with the other servers. More specifically, after each iteration, it calculates a refinement ∆W_i to the model W. To make sure that all servers eventually converge to the same model, they can synchronize every iteration, every n iterations, or completely asynchronously. When machine learning algorithms are implemented on traditional distributed platforms such as Hadoop [38] or Spark [39], servers have to synchronize every iteration. This requires placing a barrier at the end of each iteration, incurring increasing overhead as the system scales.

Parameter Server [24] is another approach to implement synchronization in distributed machine learning. It outperforms the aforementioned platforms thanks to domain-specific engineering and algorithmic optimizations. In this approach, a set of servers act as parameter servers that store the model W. The other servers process the training data and act as workers. After the ∆W_i are calculated, workers do not communicate with each other directly, but push ∆W_i to the parameter servers and then pull a new W to be used in the next iteration. By tuning push/pull frequencies, programmers can balance training efficiency and system performance. For example, in the Stale Synchronous Parallel (SSP) model [12, 15], workers are allowed to cache W and use it in the next iteration while sending the ∆W_i of the previous iteration to servers, as long as the cached version is within a staleness threshold s. In this way, communication can be overlapped with computation. Nevertheless, as workers still have to periodically synchronize the model, the network can quickly become a bottleneck. In the next section, we show how MLNET can alleviate this problem using a customized communication layer.
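A minimal sketch of the bounded-staleness rule described above, assuming progress is tracked with per-worker iteration counters (clocks); the helper name and its arguments are illustrative and not taken from [12, 15].

```python
def can_use_cached_model(worker_clock, all_worker_clocks, staleness):
    """Return True if the worker may reuse its cached copy of W.

    A worker at iteration `worker_clock` may proceed with a cached model
    as long as the slowest worker is at most `staleness` iterations behind,
    i.e. min(all_worker_clocks) >= worker_clock - staleness.
    """
    model_clock = min(all_worker_clocks)
    return model_clock >= worker_clock - staleness

# staleness = 0 reduces to fully synchronous training: every worker must
# wait for all the others before starting the next iteration.
assert can_use_cached_model(10, [10, 9, 8], staleness=2)
assert not can_use_cached_model(10, [10, 9, 7], staleness=2)
```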

3 MLNET Design

We begin the section by describing the MLNET architecture, and then we show how this makes it easy to implement our two techniques to optimize network performance, namely traffic reduction and traffic prioritization. While we believe that this is a contribution per se, we also see it as a first step towards a deeper rethinking of the network design for distributed machine learning. We will elaborate on this point in Section 5.

3.1 Architecture

MLNET is a communication layer, running as a local process on workers and parameter servers, similar to Facebook's mcrouter setup [30]. These local processes behave like proxies, intercepting all exchanges between workers and parameter servers. To push training results, a worker initiates a normal TCP connection to a parameter server that is actually emulated by a local MLNET process. Regardless of the actual numbers of parameter servers and workers, MLNET maintains a single parameter server abstraction towards workers and, symmetrically, a single worker abstraction towards parameter servers. This design decouples workers and parameter servers and allows them to scale in and out independently. Further, it makes it easy to change the communication logic, e.g., to which server and when to send traffic, without requiring modifications in the worker's or parameter server's code.

To achieve transparency w.r.t. both workers and parameter servers, MLNET inherits the standard communication APIs from the Parameter Server [24], where data is sent between nodes using push and pull operations:

• weights = pull(modelId, staleness): Pull the weights of a model within a staleness gap.
• push(modelId, gradients): Push the gradients of the weights of a model.
• clock(): Increment the clock of a worker process.

We give an example of using this interface. The training of a shared model consists of multiple iterations. At the end of an iteration, a worker pushes the newly calculated gradients of the model weights to a parameter server and increases its local clock c_worker by calling clock(). The parameter server aggregates all pushes to update the model weights and maintains a vector of the clocks of all workers. A model clock c_model is defined as the minimum worker clock in the vector. To start the next iteration, the worker then pulls the model weights, which have to be within a staleness threshold s (see Section 2.2), by checking if c_model ≥ c_worker − s.
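The fragment below sketches how a worker's training loop could drive the three calls listed above. The `mlnet` client object, the model identifier, and `compute_gradients` are placeholders introduced here for illustration; they are not part of the MLNET interface itself.

```python
STALENESS = 2           # staleness threshold s (see Section 2.2)
MODEL_ID = "sparse-lr"  # illustrative model identifier

def compute_gradients(weights, batch):
    """Placeholder for the learning algorithm's actual gradient computation."""
    return [0.0] * len(weights)

def train(mlnet, data_batches):
    """Worker-side training loop built on the pull/push/clock interface."""
    for batch in data_batches:
        # Blocks only if the freshest available weights are more than
        # STALENESS iterations behind this worker's clock c_worker.
        weights = mlnet.pull(MODEL_ID, STALENESS)
        gradients = compute_gradients(weights, batch)
        mlnet.push(MODEL_ID, gradients)
        mlnet.clock()  # advance c_worker; the server tracks c_model = min over workers
```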
3.2 Distributed Aggregation and Multicast

The push and pull phases are the primary sources of network traffic in distributed machine learning. In the push phase, workers send the gradients of model weights to parameter servers, while in the pull phase they receive the new weights generated after aggregating the gradients from all model replicas. These two operations can generate high congestion at servers and, if the network fabric is over-subscribed, in the core of the network too. Adding more parameter servers would reduce the edge congestion by spreading the traffic (albeit at the expense of increasing the overall server count), but it is of little help to reduce congestion in the core. We propose, instead, a different yet complementary approach that aims at reducing, rather than simply re-routing, network traffic by exploiting its domain-specific characteristics.

Aggregation and multicast tree. The aggregation functions used by parameter servers are typically associative and commutative, e.g., an average function is often used. This means that gradients can be aggregated incrementally and the final result is still correct. We leverage this property to reduce traffic during the push phase. For each parameter server, MLNET uses it as the root and builds a spanning tree connecting all workers. The workers in the leaves send gradients to their parent nodes. The latter aggregate all received gradients and push the results upstream towards the root, where the last step of aggregation is performed.

Assuming that a node has an in-degree d, for each parameter this node receives d values and transmits only one value (i.e., the aggregated result), thus drastically reducing the network traffic at each hop. This not only helps alleviate the load on parameter servers (avoiding the so-called in-cast effect), but also reduces the load in the network core, which is beneficial in case of over-subscribed networks. As shown in Section 4, depending on the network fabric characteristics and the system load, different values of d may be preferable.

A dual technique is also used during the pull phase. In this case, rather than aggregating values on path, we use the same tree to multicast the weights to all replicas. By using a multicast tree, only d values per server are transmitted, which reduces the load on the outbound link of a parameter server (we expect d ≪ W, where W is the number of workers in the system).

Synchronous vs. asynchronous operations. MLNET uses the staleness threshold s to determine how push/pull operations are performed. In a synchronous setting, i.e., s = 0, a worker needs to wait for all its child nodes on the spanning tree in each iteration. In this case, a MLNET process uses its position on the tree to figure out the number of pushes in an aggregation window as well as the pull responses to multicast.

When s > 0, a MLNET process needs to decide when to perform push/pull operations. If a pull request can be satisfied with the cached weights (c_model ≥ c_worker − s), the process responds to it immediately, without incurring extra upstream network traffic. If not, the request is forwarded upstream until it is satisfied with a cached weight or it waits on the parameter server. When receiving a push message, the process first updates its own model with this message. It then pushes this message upstream if and only if the gradients carried by this message are not within the staleness threshold compared to its parent node. This means that the degree of asynchrony is controlled by MLNET and does not have to be hard-coded in a training algorithm.

Fault tolerance. To detect failures, a MLNET process uses heartbeats to check the liveness of its parent node on the spanning tree. If a node fails, the downstream nodes are connected to the parent of this failed node. They then exchange model information to ensure that staleness bounds are not broken. If the root of the tree, i.e., a parameter server, fails, its children wait for it to be replaced and then re-initiate a connection. If a failed node comes back, it asks its neighbors to restore its local model and re-enters the tree. In case of multiple concurrent failures, the above mechanism may incur high overhead. MLNET then tears down the tree entirely and reverts to the traditional setup where workers directly communicate with parameter servers.
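To make the d-to-1 reduction described at the beginning of this subsection concrete, here is a simplified sketch (our own, covering only the synchronous push path) of what an intermediate MLNET process with in-degree d does; the dual pull path would simply copy the received weights to each of the d children.

```python
import numpy as np

class AggregationNode:
    """Interior node of the spanning tree rooted at a parameter server."""

    def __init__(self, in_degree, upstream):
        self.in_degree = in_degree  # number of child nodes, d
        self.upstream = upstream    # parent node (or the parameter server proxy)
        self.window = []            # gradients buffered in the current iteration

    def on_push(self, gradients):
        """Buffer one child's gradients; forward one message once all d arrived."""
        self.window.append(np.asarray(gradients))
        if len(self.window) == self.in_degree:
            # Summation is associative and commutative, so partial aggregation
            # at each hop does not change the final result computed at the root.
            aggregated = np.sum(self.window, axis=0)
            self.window = []
            self.upstream.on_push(aggregated)  # d messages in, 1 message out
```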

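A rough sketch of the failover behaviour described under "Fault tolerance" above, with illustrative names; the `transport` helper (heartbeats, reconnection, model exchange) is assumed, and timeouts are elided.

```python
import time

HEARTBEAT_INTERVAL_S = 1.0  # illustrative value; not specified in the text

def monitor_parent(node, transport):
    """Heartbeat the parent on the spanning tree and fail over when it dies."""
    while True:
        parent = node.parent
        if not transport.heartbeat_ok(parent):
            if parent.is_parameter_server():
                # Root failure: wait for the parameter server to be replaced,
                # then re-initiate the connection.
                new_parent = transport.wait_for_replacement(parent)
            else:
                # Interior failure: attach to the failed node's own parent and
                # exchange model state so staleness bounds are not broken.
                new_parent = parent.parent
            transport.reconnect(node, new_parent)
            node.parent = new_parent
        time.sleep(HEARTBEAT_INTERVAL_S)
```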
3.3 Network Prioritization

The second technique used by MLNET is network prioritization. Mainstream congestion control protocols such as TCP strive to provide per-flow fairness. This can be particularly detrimental in a multi-tenant environment where different models are trained in parallel, because contention on network resources will delay the training time of all models sharing that bottleneck.

To address this issue, we implemented a network prioritization mechanism in MLNET to arbitrate access to network resources. We do not constrain how priorities are defined. For example, a model with relatively smaller communication cost could be given a high priority in order to complete earlier, leading to a shorter average model training time. Hereafter, we only assume that the MLNET process has a way to extract the priority from a flow (e.g., this can be encoded in the first transmitted byte).

A key challenge of implementing this feature is how to achieve this functionality without requiring any change in the existing network infrastructure. Recent proposals, e.g., [3, 18, 36], require custom switch hardware and, hence, do not fulfill our requirements.

In contrast, we opted for a software-only solution. If the network fabric provides full bisection bandwidth, e.g., through a fat-tree design [2, 20], contention only occurs at the edge [22], i.e., at either the worker's or the parameter server's uplink. When the destination machine, either a worker or a parameter server, receives a new TCP connection, it can inspect the relative priority (e.g., by looking at the first byte) in order to decide whether to reject the connection or to accept it, possibly dropping some of the existing ones if they have lower priority than the new one. As we show in Section 4, this simple mechanism is effective in reducing the median training time without hurting the tail performance.

If the network is over-subscribed, the above mechanism is not sufficient, as congestion can occur in the network core. If switches support enough priority queues, we can extend the above mechanism to take advantage of them, similar to recently proposed solutions [6, 28]. For the cases in which switch queues are not available, an alternative approach is to extend our previous work on bandwidth guarantees in a multi-tenant data center [8] to allocate bandwidth to flows according to their priority. This, however, would require knowledge of worker and parameter server locations and, hence, it might not be suitable for public cloud deployments. We are currently working on a decentralized solution that does not suffer from this limitation, possibly reusing some ideas from recent work on coflow scheduling [11, 18].
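The receiver-side admission logic described in this section could look roughly like the sketch below, assuming the flow's priority is carried in the first byte of each new connection and that a larger value means higher priority; the class and its bookkeeping are our own simplification.

```python
class PriorityAdmission:
    """Keep at most `max_flows` incoming flows, preferring higher priorities."""

    def __init__(self, max_flows):
        self.max_flows = max_flows
        self.active = []  # list of (priority, connection) pairs

    def on_new_connection(self, conn):
        priority = conn.recv(1)[0]  # priority encoded in the first transmitted byte
        if len(self.active) < self.max_flows:
            self.active.append((priority, conn))
            return
        victim = min(self.active, key=lambda pc: pc[0])
        if priority > victim[0]:
            # Drop the lowest-priority existing flow in favour of the new one.
            self.active.remove(victim)
            victim[1].close()
            self.active.append((priority, conn))
        else:
            conn.close()  # reject: all current flows have equal or higher priority
```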
4 Preliminary Results

We use simulations to evaluate the performance of MLNET. While preliminary, the results indicate that our design can i) reduce the end-to-end training time (makespan) by performing distributed aggregation and multicast, and ii) shorten the median makespan of concurrent machine learning tasks by prioritizing flows.

We use OMNeT++ [44], a discrete event simulator, to model a mid-size cluster with a three-tier, multi-rooted network topology based on popular scalable data center architectures [2, 20]. The cluster comprises 1,024 servers connected by 320 16-port switches via 10 Gbps links. We model TCP max/min fair bandwidth sharing and use the standard Equal Cost Multi Path (ECMP) routing protocol.

We adopt a synthetic workload modeled after a recently published machine learning algorithm for sparse logistic regression [24]. The model has 65 billion parameters and is trained with a 141 TB dataset. The dataset is evenly partitioned and consumed by 800 workers. We vary the number of parameter servers between 50 and 400. The parameter space is equally divided across the parameter servers. Both workers and parameter servers are randomly deployed on the 1,024 servers.

Based on our experience in training models with Minerva [35], we assume that workers process training data at a rate uniformly randomly chosen between 100 MB/s and 200 MB/s. They use synchronized push and pull, and communicate with the parameter servers every 30 s.
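For quick reference, the simulation setup above can be summarized as the configuration below; the values are taken from the text, while the dictionary layout is simply our shorthand and not an OMNeT++ input format.

```python
SIMULATION_SETUP = {
    "topology": "three-tier, multi-rooted data center network",
    "servers": 1024,
    "switches": 320,                     # 16-port switches
    "link_speed_gbps": 10,
    "routing": "ECMP",
    "workers": 800,
    "parameter_servers": (50, 400),      # varied between 50 and 400
    "model_parameters": 65_000_000_000,  # sparse logistic regression [24]
    "dataset_size_tb": 141,
    "worker_rate_mb_per_s": (100, 200),  # drawn uniformly at random per worker
    "sync_period_s": 30,                 # synchronous push/pull every 30 s
}
```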

4 ��� �� ���� ��� ����������� ����� �� �������� �� ���� �� ��� ���� ����� ����������� �� �������� ���������������� �� �� �� ��� ���� ���� ���� ���� ���� ���� ���� �� ���� ���� ���� ���� ����� ����� ��������������������������� ���������������� (a) No over-subscription. Figure 2: CDF of link traffic for different configurations. ��� �������� ��� ���� ����������� �������������� ��� �������� ���������������� �� ��������������������������������� �� �� �� ����� ��

���������������� ���� �� ��� �� ��� ���� ���� ���� ���� ���� ���� ���� ����� ��������������������������� �� �� �� �� �� �� ��� ��� ��� (b) 1:4 over-subscription. ����������������

Figure 1: Training makespan against the number of pa- Figure 3: CDF of model makespans for different config- rameter servers. urations.

However, with a large number of parameter servers, the load is already sufficiently spread and our solutions become less effective compared to BASELINE. Finally, when the number of parameter servers is very high (≥ 300), the cost of re-directing traffic through the tree dominates and the performance of both BINARY and RACK gets worse than BASELINE. This shows the trade-off between the number of parameter servers and traffic reduction. Having more parameter servers helps to spread the load across the fabric and alleviate network bottlenecks. However, this comes at the cost of using more machines and is only effective when the network is not oversubscribed.

We then repeated the same experiment assuming a more realistic network setup with an over-subscription ratio of 1:4. In this case, congestion occurs in the network core too and, hence, unlike in the non-oversubscribed case, just increasing the number of parameter servers does not help in removing the main bottlenecks. In contrast, by reducing the overall network traffic, MLNET is also able to reduce the congestion in the core. This explains why, as shown in Figure 1b, RACK and BINARY outperform BASELINE in all configurations. For example, with 50 parameter servers, RACK reduces the makespan compared to the baseline by 73% (resp. 66% for BINARY), and with 400 parameter servers, the makespan is reduced by 71% (resp. 40% for BINARY). Interestingly, in this configuration, BINARY achieves worse performance than RACK. The reason is that, since the height of BINARY's trees is higher, the path length increases accordingly and, hence, BINARY consumes more bandwidth than RACK. While this is not an issue in a non-oversubscribed network in which the core bandwidth is abundant, it becomes problematic in this setup where the core bandwidth is scarce.

In Figure 2, we show the CDFs of traffic across network links. In this experiment, we consider a non-oversubscribed network with 800 workers and 200 parameter servers. As expected, both RACK and BINARY exhibit lower traffic per link, which explains the overall improvement in performance. We observed a qualitatively similar result in the over-subscribed case too (omitted for space reasons).

Flow Prioritization. Next, we explore the impact of network prioritization. We consider a workload in which 20 different models are trained concurrently. To reduce the computation load of our simulations, we scale down each model to use only 200 workers and 50 parameter servers, and reduce the input data size proportionally. We randomly assign a priority to each model and we start all the models at the same time. In Figure 3, to understand the impact of using prioritization alone and combined with traffic reduction, we show the CDF of the makespan for these configurations. As a comparison, we also report the CDF achieved when using traffic reduction alone and when using our TCP-based baseline.

Just using network prioritization reduces the median makespan by 25% while only increasing the tail by 2%. Combining traffic reduction and prioritization further improves the performance, reducing the median by 60% (resp. 54% for the 95th percentile) compared to the baseline, and reducing the median by 13.9% compared to using traffic reduction only.

5 Discussion and Research Directions

We conclude the paper by highlighting current limitations and future research directions.

Model parallelism. The current design of MLNET targets the data parallelism model, in which multiple model replicas are trained concurrently, each using a different subset of the input data. A complementary form of parallelism is the so-called model parallelism [16], in which a large neural network is partitioned across multiple workers. This exhibits a different communication pattern than the data parallelism model explored in this paper. While some of the techniques discussed here might also be beneficial in this context, our next step is to understand the peculiarity of these patterns and design tailored solutions. One possibility is to extend recently proposed solutions for graph processing [19, 32]. However, despite some similarities, deep neural networks differ from ordinary graphs as they are more structured and constrained, but their sizes can be 10-100x larger.

Adaptive communication. Existing frameworks, e.g., [12, 15, 24, 25], trade off training efficiency for network performance by tolerating some degree of asynchrony through the staleness threshold s. The value of s, however, is usually determined a priori and cannot be easily changed at runtime. In contrast, we are currently exploring the feasibility of dynamically tuning this value based on i) the network load and ii) the convergence rate of individual parameters / models. This is particularly important for long-running training models, executing on public cloud infrastructure, in which the network performance is highly variable [8]. For example, in case of a highly loaded network, higher values of s (i.e., higher asynchrony) can be selected while, when network utilization drops, more synchronous configurations can be used. A similar approach can also be adopted to identify the best value of the node in-degree d. This direction is complementary to recent work investigating similar trade-offs in the context of computation resources [21, 29] and memory bandwidth [37].

Network infrastructure. A key goal of our design was to maintain compatibility with existing network deployments. Next, however, we want to explore the trade-offs of customizing the hardware as well. A first step in this direction is to investigate the feasibility of offloading our distributed aggregation to the switch hardware, e.g., using recently available programmable switch platforms [17, 40, 42, 43] or middleboxes [4, 26, 27, 33]. In the longer term, we intend to assess the potential benefits of designing a machine learning rack-scale appliance from the ground up, including chip design, network fabric, and storage infrastructure, following the example of recently proposed rack-scale architectures for computation and storage [5, 7, 14, 31, 41].

References

[1] Ahmed, A., Aly, M., Gonzalez, J., Narayanamurthy, S., and Smola, A. J. Scalable inference in latent variable models. In WSDM (2012).
[2] Al-Fares, M., Loukissas, A., and Vahdat, A. A Scalable, Commodity Data Center Network Architecture. In SIGCOMM (2008).
[3] Alizadeh, M., Yang, S., Sharif, M., Katti, S., McKeown, N., Prabhakar, B., and Shenker, S. pFabric: Minimal Near-optimal Datacenter Transport. In SIGCOMM (2013).
[4] Anderson, J. W., Braud, R., et al. xOMB: Extensible Open Middleboxes with Commodity Servers. In ANCS (2012).
[5] Asanovic, K. FireBox: A Hardware Building Block for 2020 Warehouse-Scale Computers. In FAST (2014). Keynote.
[6] Bai, W., Chen, L., Chen, K., Han, D., Tian, C., and Wang, H. Information-Agnostic Flow Scheduling for Commodity Data Centers. In NSDI (2015).
[7] Balakrishnan, S., Black, R., Donnelly, A., England, P., Glass, A., Harper, D., Legtchenko, S., Ogus, A., Peterson, E., and Rowstron, A. Pelican: A Building Block for Exascale Cold Data Storage. In OSDI (2014).
[8] Ballani, H., Costa, P., Karagiannis, T., and Rowstron, A. Towards Predictable Datacenter Networks. In SIGCOMM (2011).
[9] Canini, K., Chandra, T., Ie, E., McFadden, J., Goldman, K., Gunter, M., Harmsen, J., LeFevre, K., Lepikhin, D., Llinares, T. L., Mukherjee, I., Pereira, F., Redstone, J., Shaked, T., and Singer, Y. Sibyl: A system for large scale supervised machine learning, 2012. Machine Learning Summer School, Santa Cruz, CA.
[10] Chilimbi, T., Suzue, Y., Apacible, J., and Kalyanaraman, K. Project Adam: Building an Efficient and Scalable Deep Learning Training System. In OSDI (2014).
[11] Chowdhury, M., Zhong, Y., and Stoica, I. Efficient Coflow Scheduling with Varys. In SIGCOMM (2014).
[12] Cipar, J., Ho, Q., Kim, J. K., Lee, S., Ganger, G. R., Gibson, G., Keeton, K., and Xing, E. Solving the Straggler Problem with Bounded Staleness. In HotOS (2013).
[13] Coates, A., Huval, B., Wang, T., Wu, D., Catanzaro, B., and Andrew, N. Deep learning with COTS HPC systems. In ICML (2013).
[14] Costa, P., Ballani, H., Razavi, K., and Kash, I. R2C2: A Network Stack for Rack-scale Computers. In SIGCOMM (2015).
[15] Cui, H., Cipar, J., Ho, Q., Kim, J. K., Lee, S., Kumar, A., Wei, J., Dai, W., Ganger, G. R., Gibbons, P. B., Gibson, G. A., and Xing, E. P. Exploiting Bounded Staleness to Speed Up Big Data Analytics. In ATC (2014).
[16] Dean, J., Corrado, G. S., Monga, R., Chen, K., Devin, M., Le, Q. V., Mao, M. Z., Ranzato, M., Senior, A., Tucker, P., Yang, K., and Ng, A. Y. Large Scale Distributed Deep Networks. In NIPS (2012).
[17] Dobrescu, M., Egi, N., Argyraki, K., Chun, B.-G., Fall, K., Iannaccone, G., Knies, A., Manesh, M., and Ratnasamy, S. RouteBricks: Exploiting Parallelism To Scale Software Routers. In SOSP (2009).
[18] Dogar, F. R., Karagiannis, T., Ballani, H., and Rowstron, A. Decentralized Task-aware Scheduling for Data Center Networks. In SIGCOMM (2014).
[19] Gonzalez, J. E., Low, Y., Gu, H., Bickson, D., and Guestrin, C. PowerGraph: Distributed graph-parallel computation on natural graphs. In OSDI (2012).

[20] Greenberg, A., Hamilton, J. R., Jain, N., Kandula, S., Kim, C., Lahiri, P., Maltz, D. A., Patel, P., and Sengupta, S. VL2: A Scalable and Flexible Data Center Network. In SIGCOMM (2009).
[21] Huang, B., Boehm, M., Tian, Y., Reinwald, B., Tatikonda, S., and Reiss, F. R. Resource Elasticity for Large-Scale Machine Learning. In SIGMOD (2015).
[22] Jeyakumar, V., Alizadeh, M., Mazières, D., Prabhakar, B., Kim, C., and Greenberg, A. EyeQ: Practical Network Performance Isolation at the Edge. In NSDI (2013).
[23] Li, H., Kadav, A., Kruus, E., and Ungureanu, C. MALT: Distributed Data-parallelism for Existing ML Applications. In EuroSys (2015).

[24] Li, M., Andersen, D. G., Park, J. W., Smola, A. J., Ahmed, A., Josifovski, V., Long, J., Shekita, E. J., and Su, B.-Y. Scaling Distributed Machine Learning with the Parameter Server. In OSDI (2014).
[25] Li, M., Andersen, D. G., Smola, A., and Yu, K. Communication Efficient Distributed Machine Learning with the Parameter Server. In NIPS (2014).
[26] Mai, L., Rupprecht, L., Alim, A., Costa, P., Migliavacca, M., Pietzuch, P., and Wolf, A. L. NetAgg: Using Middleboxes for Application-specific On-path Aggregation in Data Centres. In CoNEXT (2014).
[27] Martins, J., Ahmed, M., et al. ClickOS and the Art of Network Function Virtualization. In NSDI (2014).
[28] Munir, A., Baig, G., Irteza, S. M., Qazi, I. A., Liu, A. X., and Dogar, F. R. Friends, Not Foes: Synthesizing Existing Transport Strategies for Data Center Networks. In SIGCOMM (2014).
[29] Narayanamurthy, S., Weimer, M., Mahajan, D., Condie, T., Sellamanickam, S., and Keerthi, S. S. Towards Resource-Elastic Machine Learning. In NIPS 2013 BigLearn Workshop (2013).
[30] Nishtala, R., Fugal, H., Grimm, S., Kwiatkowski, M., Lee, H., Li, H. C., McElroy, R., Paleczny, M., Peek, D., Saab, P., Stafford, D., Tung, T., and Venkataramani, V. Scaling Memcache at Facebook. In NSDI (2013).
[31] Putnam, A., Caulfield, A., Chung, E., Chiou, D., Constantinides, K., Demme, J., Esmaeilzadeh, H., Fowers, J., Gopal, G. P., Gray, J., Haselman, M., Hauck, S., Heil, S., Hormati, A., Kim, J.-Y., Lanka, S., Larus, J., Peterson, E., Pope, S., Smith, A., Thong, J., Xiao, P. Y., and Burger, D. A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services. In ISCA (2014).
[32] Salihoglu, S., and Widom, J. GPS: A Graph Processing System. In SSDBM (2013).
[33] Sekar, V., Egi, N., Ratnasamy, S., Reiter, M. K., and Shi, G. Design and Implementation of a Consolidated Middlebox Architecture. In NSDI (2012).
[34] Smola, A. J., and Narayanamurthy, S. An Architecture for Parallel Topic Models. In VLDB (2010).
[35] Wang, M., Xiao, T., Li, J., Zhang, J., Hong, C., and Zhang, Z. Minerva: A scalable and highly efficient training platform for deep learning. In NIPS Workshop on Distributed Machine Learning and Matrix Computations (2014).
[36] Wilson, C., Ballani, H., Karagiannis, T., and Rowtron, A. Better Never Than Late: Meeting Deadlines in Datacenter Networks. In SIGCOMM (2011).
[37] Zhang, C., and Re, C. DimmWitted: A Study of Main-Memory Statistical Analytics. In VLDB (2014).
[38] Apache Hadoop. https://hadoop.apache.org.
[39] Apache Spark. https://spark.apache.org/.
[40] Arista EOS+ Platform for Network Programmability. http://www.arista.com/en/solutions/eos-platform.
[41] HP Moonshot System. http://bit.ly/1mZD4yJ.
[42] NetFPGA. http://netfpga.org/.
[43] Netronome FlowNICs. http://netronome.com/product/flownics/.
[44] OMNeT++ Discrete Event Simulator. https://omnetpp.org.
