Pytorch Distributed: Experiences on Accelerating Data Parallel Training

PyTorch Distributed: Experiences on Accelerating Data Parallel Training

Shen Li† Yanli Zhao† Rohan Varma† Omkar Salpekar† Pieter Noordhuis∗ Teng Li† Adam Paszke‡ Jeff Smith† Brian Vaughan† Pritam Damania† Soumith Chintala†

{shenli, yanlizhao, rvarm1, osalpekar}@fb.com, [email protected], [email protected], [email protected], {jeffksmith, bvaughan, pritam.damania, soumith}@fb.com †Facebook AI ‡University of Warsaw

ABSTRACT 1. INTRODUCTION This paper presents the design, implementation, and evalu- Deep Neural Networks (DNN) have powered a wide spec- ation of the PyTorch distributed data parallel module. Py- trum of applications, ranging from image recognition [20], Torch is a widely-adopted scientific computing package used language translation [15], anomaly detection [16], content in deep learning research and applications. Recent advances recommendation [38], to drug discovery [33], art genera- in deep learning argue for the value of large datasets and tion [28], game play [18], and self-driving cars [13]. Many large models, which necessitates the ability to scale out applications pursue higher intelligence by optimizing larger model training to more computational resources. Data par- models using larger datasets, craving advances in distributed allelism has emerged as a popular solution for distributed training systems. Among existing solutions, distributed data training thanks to its straightforward principle and broad parallel is a dominant strategy due to its minimally intru- applicability. In general, the technique of distributed data sive nature. This paper presents the design, implementa- parallelism replicates the model on every computational re- tion, and evaluation of the distributed data parallel package source to generate gradients independently and then com- in PyTorch v1.5 [30]. municates those gradients at each iteration to keep model Training a DNN model usually repeatedly conducts three replicas consistent. Despite the conceptual simplicity of steps [26], the forward pass to compute loss, the backward the technique, the subtle dependencies between computa- pass to compute gradients, and the optimizer step to update tion and communication make it non-trivial to optimize the parameters. The concept of data parallelism is universally distributed training efficiency. As of v1.5, PyTorch natively applicable to such frameworks. Applications can create mul- provides several techniques to accelerate distributed data tiple replicas of a model, with each model replica working on parallel, including bucketing gradients, overlapping compu- a portion of training data and performing the forward and tation with communication, and skipping gradient synchro- backward passes independently. After that, model replicas nization. Evaluations show that, when configured appropri- can synchronize either their gradients or updated parame- ately, the PyTorch distributed data parallel module attains ters depending on the algorithm. It’s nominally possible to near-linear scalability using 256 GPUs. build a working version of data parallel purely on the application side, as it only requires inserting appropriate com- PVLDB Reference Format: munications into every iteration. However, squeezing out Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter No- ordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pri- the last bit of performance takes an enormous amount of ef- tam Damania, Soumith Chintala. PyTorch Distributed: Experi- fort in design and tuning. Providing native distributed data ences on Accelerating Data Parallel Training. PVLDB, 13(12): parallel APIs on the platform side would help application 3005-3018, 2020. developers focus on optimizing their models, while the plat- DOI: https://doi.org/10.14778/3415478.3415530 form developing team could continuously and transparently improve the training speed. To provide a general distributed data parallel package, the challenges are three-fold. ∗This work was conducted when Pieter Noordhuis was an employee at Facebook. • Mathematical equivalence: The purpose of data parallel is to speed up training on large datasets. Ap- plications expect to harvest the same result model as if This work is licensed under the Creative Commons Attribution- NonCommercial-NoDerivatives 4.0 International License. To view a copy all training had been performed locally without model of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For replication. This requires mathematical equivalence to any use beyond those covered by this license, obtain permission by emailing local training despite its distributed nature. [email protected]. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment. • Non-intrusive and interceptive API: Application Proceedings of the VLDB Endowment, Vol. 13, No. 12 ISSN 2150-8097. developments usually start from local models and then DOI: https://doi.org/10.14778/3415478.3415530 scale out when necessary. To avoid the exorbitant

3005 hurdles during the transition, the API must be non- 2. BACKGROUND intrusive in application code. On the other hand, the Before diving into distributed training, let us briefly dis- API needs to allow the internal implementation to cuss the implementation and execution of local model train- timely intercept signals to carry out communications ing using PyTorch. Then, we explain and justify the idea of and system optimizations. data parallelism and describe communication primitives. • High performance: Data parallel training is sub- 2.1 PyTorch ject to subtle dependencies between computations and PyTorch organizes values into Tensors which are generic communications. The design and implementation have n-dimensional arrays with a rich set of data manipulating to explore the solution space to efficiently convert more operations. A Module defines a transform from input val- resources into higher training throughput. ues to output values, and its behavior during the forward PyTorch provides distributed data parallel as an nn.Module pass is specified by its forward member function. A Module class, where applications provide their model at construction can contain Tensors as parameters. For example, a Linear time as a sub-module. To guarantee mathematical equiva- Module contains a weight parameter and a bias parameter, lence, all replicas start from the same initial values for model whose forward function generates the output by multiplying parameters and synchronize gradients to keep parameters the input with the weight and adding the bias. An appli- consistent across training iterations. To minimize the intru- cation composes its own Module by stitching together native siveness, the implementation exposes the same forward [7] Modules (e.g., linear, convolution, etc.) and Functions (e.g., API as the user model, allowing applications to seamlessly relu, pool, etc.) in the custom forward function. A typi- replace subsequent occurrences of a user model with the dis- cal training iteration contains a forward pass to generate tributed data parallel model object with no additional code losses using inputs and labels, a backward pass to compute changes. Several techniques are integrated into the design to gradients for parameters, and an optimizer step to update deliver high-performance training, including bucketing gra- parameters using gradients. More specifically, during the dients, overlapping communication with computation, and forward pass, PyTorch builds an autograd graph to record skipping synchronization. actions performed. Then, in the backward pass, it uses the Evaluations were conducted on an exclusive 32-GPU clus- autograd graph to conduct backpropagation to generate gra- ter and on 256 GPUs from a much larger shared entitlement. dients. Finally, the optimizer applies the gradients to update We developed benchmarks to evaluate the distributed pack- parameters. The training process repeats these three steps age across different scales to present an in-depth view of until the model converges. the performance implications of different optimization tech- 2.2 Data Parallelism niques and configurations. Experiments also cover the com- parison between NCCL and Gloo communication libraries. PyTorch offers several tools to facilitate distributed train- The results show that 1) communication is the dominant ing, including DataParallel for single-process multi-thread training latency contributor, and its impact increases with data parallel training using multiple GPUs on the same model sizes; 2) bucket sizes considerably affect communica- machine, DistributedDataParallel for multi-process data tion efficiency, which could lead to more than 2X speedup if parallel training across GPUs and machines, and RPC [6] configured properly; 3) skipping synchronizations appropri- for general distributed model parallel training (e.g., param- ately would significantly reduce amortized communication eter server [27]). The remainder of this paper focuses on overhead without noticeably degrading convergence speed. DistributedDataParallel. Data parallelism enables dis- Techniques described in this paper were first released in tributed training by communicating gradients before the op- PyTorch v1.1. During the past year, we have seen significant timizer step to make sure that parameters of all model repli- adoption both internally and externally. Within Facebook, cas are updated using exactly the same set of gradients, and a workload study from 05/11/20 to 06/05/20 shows that hence model replicas can stay consistent across iterations. more than 60% of production GPU hours during that period Parameter averaging is another popular technique to scale were spent on the PyTorch distributed data parallel pack- out model training. Similarly, it can launch multiple pro- age across a wide variety of applications, including speech, cesses across multiple machines, but instead of synchroniz- vision, mobile vision, translation, etc. There are three main ing gradients, parameter averaging directly computes the contributions in this paper. First, this paper reveals the average of all model parameters. This occurs after the lo- design and implementation of a widely adopted industrial cal optimizer step, meaning that parameter averaging can state-of-the-art distributed training solution. Second, this be implemented completely as an auxiliary step and does paper highlights real-world caveats (e.g., due to pluralized not need to interact with local training steps at all, which is graphs) that were overlooked by prior work. Third, we share attractive as it can easily and cleanly decouple the code of performance tuning experiences collected from serving in- distributed training and local iterations. There are several ternal teams and open-source community users and summa- caveats with parameter averaging. rized several directions for future improvements. • Parameter averaging can produce vastly different re- The remainder of the paper is organized as follows. Sec- sults compared to local training, which, sometimes, tion 2 briefly introduces PyTorch and data parallelism. Sec- can be detrimental to model accuracy. The root cause tion 3 elaborates the design for the PyTorch distributed data is that parameter averaging is not mathematically equ- parallel module. Implementations and evaluations are pre- ivalent to processing all input data locally, especially sented in Section 4 and Section 5 respectively. Then, Sec- when the optimizer relies on past local gradients val- tion 6 discusses lessons learned and opportunities for future ues (e.g., momentum). As different model replicas are improvements, and Section 7 surveys related work. Finally, likely to see different gradients, the states in optimiz- Section 8 concludes the paper. ers can gradually diverge, causing conflicting gradient

3006 descent directions. This can result in inexplicable dif- library. The following ferences in performance when switching from locally sections are presented DistributedDataParallel optimized models to large scale deployed models. in the top-down order of this stack graph. Python API • The structure of parameter averaging orchestrates com- Section 3.1 presents putation (i.e., backward pass) and communication (i.e., general principles that computing average) into non-overlapping phases, using Gradient Reduction drive DDP API de- optimizer step() functions as a hard separation point. sign. Section 3.2 ex- Regardless of how vigorously we optimize the compu- plains gradient reduc- tation or communication, one type of resource will stay Collective Communication tion techniques used idle at any given time instance, giving up a substantial in the PyTorch dis- NCCL Gloo MPI performance optimization opportunity. tributed data parallel package. Finally, Sec- Given the above fundamental pitfalls, we decided to im- tion 3.3 discusses the plement distributed training using data parallelism to syn- collective communica- chronize gradients instead of parameters. Note that, ap- Figure 1: Building Blocks tion backend options plications can still easily build parameter averaging using for DDP. PyTorch. In fact, the collective communication feature described in Section 3.3 is an appropriate solution for this use 3.1 API case. Applications just need to explicitly launch AllReduce When designing the API, we have defined two design goals operations to calculate averaged parameters accordingly. to achieve the necessary functionality. 2.3 AllReduce • Non-intrusive: The API must be non-intrusive to AllReduce is the primitive communication API used by applications. Application developers usually start from DistributedDataParallel to compute gradient summation writing local training scripts, and scale out when hit- across all processes. It is supported by multiple communi- ting the resource limit on a single machine. At that cation libraries, including NCCL [2], Gloo [1], and MPI [4]. point, it is unacceptable to ask developers to rewrite The AllReduce operation expects each participating pro- the entire application to enable distributed data par- cess to provide an equally-sized tensor, collectively applies allel training. Instead, the developer should be able to a given arithmetic operation (e.g., sum, prod, min, max) to reuse the local training script with minimal modifica- input tensors from all processes, and returns the same re- tions. sult tensor to each participant. A naive implementation could simply let every process broadcast its input tensor • Interceptive: The API needs to allow the implemen- to all peers and then apply the arithmetic operation in- tation to intercept various signals and trigger appro- dependently. However, as AllReduce has significant im- priate algorithms promptly. Distributed data parallel pact on distributed training speed, communication libraries aims at accelerating training by using more compu- have implemented more sophisticated and more efficient al- tational resources. This process requires subtle opti- gorithms, such as ring-based AllReduce [2] and tree-based mizations in both computations and communications AllReduce [23]. As one AllReduce operation cannot start to achieve the best performance. Hence, the API must until all processes join, it is considered to be a synchronized expose as many optimization opportunities as possible communication, as opposed to the P2P communication used to the internal implementation. in parameter servers [27]. Given the above requirements, we implemented distributed 3. SYSTEM DESIGN data parallel as an nn.Module, which takes the local model as 1 a constructor argument and transparently synchronizes gra- PyTorch [30] provides a DistributedDataParallel (DDP ) dients in the backward pass. The code snippet below shows module to help easily parallelize training across multiple pro- an example of using DDP module. This example uses an cesses and machines. During distributed training, each pro- nn.Linear layer to create a local model on line 10. Then, it cess has its own local model replica and local optimizer. converts the local model into a distributed training model on In terms of correctness, distributed data parallel training line 11 and sets up the optimizer on line 12. Line 14 through and local training must be mathematically equivalent. DDP 23 are typical forward pass, backward pass, and optimizer guarantees the correctness by making sure that all model step implementations. In this toy distributed training ex- replicas start from the exact same model state, and see ample, line 11 is the only difference that converts a local the same parameter gradients after every backward pass. training application into a distributed one, which satisfies Therefore, even though optimizers from different processes the non-intrusive requirement. It also fulfills the intercep- are all independent, they should be able to bring their local tive requirement. The constructor allows DDP to inspect the model replicas to the same state at the end of every itera- 2 model structure and parameters. After construction, the lo- tion . Fig. 1 illustrates building blocks of DDP, which con- cal model is replaced by the distributed one, which can then tains a Python API frontend, C++ gradient reduction core easily intercept the forward() call to perform necessary ac- algorithm, and employs the c10d collective communication tions accordingly. For the backward pass, DDP relies on back- 1For simplicity, the rest of the paper uses the acronym DDP ward hooks to trigger gradient reduction, which will be in- to represent DistributedDataParallel henceforth. voked by the autograd engine when executing backward() 2For optimizers with intrinsic randomness, different pro- on the loss tensor. cesses can initialize their states using the same random seed.

3007 0 1 after the local backward pass and before updating local pa- 10 10 rameters. However, the API shown in Section 3.1 does not

1 10 provide an explicit entry point for this phase as there is nothing between backward() and step(). Fortunately, the 2 0 10 10 PyTorch autograd engine accepts custom backward hooks. DDP can register autograd hooks to trigger computation after 3 10 every backward pass. When ﬁred, each hook scans through Total Gloo Execution Time (Sec) Total NCCL Execution Time (Sec) all local model parameters, and retrieves the gradient tensor 4 10 1 10 from each parameter. Then, it uses the AllReduce collec- 1K 10K 100K 1M 10M 1K 10K 100K 1M 10M Number of Parameters per AllReduce Number of Parameters per AllReduce tive communication call to calculate the average gradients (a) NCCL (b) GLOO on each parameter across all processes, and writes the result

0.25 back to the gradient tensor. Measured Range Measured Range 6 Median Time Median Time The naive solution works correctly, but there are two per- 0.20 5 formance concerns.

0.15 4 • Collective communication performs poorly on small 3 tensors, which will be especially prominent on large 0.10 2 models with a massive number of small parameters.

0.05 1 • Separating gradient computation and synchronization Time Elapsed in Backward on GPU (Sec) Time Elapsed in Backward on CPU (Sec) 0.00 0 forfeits the opportunity to overlap computation with 0 1 2 3 4 5 6 0 1 2 3 4 5 6 Number of Parameters 1e7 Number of Parameters 1e7 communication due to the hard boundary in between. (c) GPU (d) CPU The following sections elucidates solutions to address the Figure 2: Communication vs Computation Delay above two concerns. 3.2.2 Gradient Bucketing

1 import torch The idea of gradient bucketing is motivated by the ob- 2 import torch.nn as nn servation that collective communications are more eﬃcient 3 import torch.nn.parallel as par on large tensors. Fig. 2 (a) and (b) provide a quantitative 4 import torch.optim as optim 5 view, which show the total execution time to AllReduce 6 # initialize torch.distributed properly 60M torch.float32 parameters with diﬀerent numbers of 7 # with init_process_group parameters per AllReduce. To maximize the bandwidth uti- 8 9 # setup model and optimizer lization, AllReduce operations are launched asynchronously 10 net = nn.Linear(10, 10) and block waiting on all of them together, mimicking DDP’s 11 net = par .DistributedDataParallel(net) gradient reduction algorithm. The experiments are con- 12 opt = optim.SGD(net.parameters(), lr=0.01) 13 ducted on an NVLink [3] enabled server with two NVIDIA 14 # run forward pass Quadro GP100 GPUs. NCCL [2] AllReduce runs on CUDA 15 inp = torch.randn(20, 10) input tensors directly, while Gloo [1] AllReduce runs on 16 exp = torch.randn(20, 10) CPU input tensors to eliminate the overhead of copying be- 17 out = net(inp) 18 tween CUDA memory to CPU memory when using Gloo 19 # run backward pass backend. The total communication time clearly decreases 20 nn.MSELoss()(out, exp).backward() when using larger input tensors, for both NCCL and Gloo. 21 22 # update parameters Gloo reaches pinnacle speed at around 500K parameters per 23 opt . step () input tensor, while there is no clear saturation signal for NCCL on NVLink with even 20M-parameter GPU tensors. These experiments suggest that, instead of launching a 3.2 Gradient Reduction dedicated AllReduce immediately when each gradient ten- The gradient reduction algorithm in DDP has evolved over sor becomes available, DDP can achieve higher throughput the past releases. To introduce the structure of the current and lower latency if it waits for a short period of time and implementation, let us start from a naive solution, gradually buckets multiple gradients into one AllReduce operation. introduce more complexities, and land in the current version This would be especially helpful for models with many small as of today in PyTorch v1.5.0. This will also explain how parameters. However, DDP should not communicate all gra- the same simple API described in Section 3.1 allows us to dients in one single AllReduce, otherwise, no communication install various performance optimization algorithms. can start before the computation is over. Fig. 2 (c) and (d) show the GPU and CPU backward computations time of a 3.2.1 A Naive Solution ResNet152 [20] that contains roughly 60M parameters. The As mentioned in the beginning of Section 3, DDP guaran- X axis is the number of ready gradients and the Y axis the tees correctness by letting all training processes (1) start time elapsed since the beginning of the backward pass. The from the same model state and (2) consume the same gra- backward on GPU takes about 250ms to complete, which is dients in every iteration. The former can be achieved by in the same order of magnitude as NCCL on NVLink. This broadcasting model states from one process to all others at conclusion also applies to Gloo and CPU backward. These the construction time of DDP. To implement the latter, a measurements herald that, with relatively small bucket sizes, naive solution can insert a gradient synchronization phase DDP can launch AllReduce operations concurrently with the

3008 Ready Skipped Ready Bucket AllReduce bucket as ready. As a result, the backward pass could hang. Gradient Gradient Time Fig. 3 (b) shows an example, where the parameter corre- t t t t sponding to gradient g3 is skipped in one iteration, leading g1 g1 g1 g1 to the absent of the ready signal for g3. To address this g2 g2 g2 g2 problem, DDP traverses the autograd graph from the output tensors of the forward pass to find all participating param- g g g g 3 3 3 3 eters. The readiness of those participating tensors is a suf- g4 g4 g4 g4 ficient signal to conclude the completion of the backward pass. Therefore, DDP can avoid waiting for the rest of the Process 1 Process 2 Process 1 Process 2 parameter gradients by proactively marking them ready at (a) (b) the end of the forward pass. Note that, this change does not prevent us from developing non-intrusive APIs, because ap- Figure 3: Gradient Synchronization Failures plication directly invokes the forward function on DDP and hence DDP can easily insert this step in its member function. backward pass to overlap communication with computation, which would make a difference in per iteration latency. Algorithm 1: DistributedDataParallel Input: Process rank r, bucket size cap c, local model 3.2.3 Overlap Computation with Communication net 1 Function constructor(net): The AllReduce operation on gradients can start before 2 if r=0 then the local backward pass finishes. With bucketing, DDP only 3 broadcast net states to other processes needs to wait for all contents in the same bucket before 4 init buckets, allocate parameters to buckets in the launching communications. Under such settings, trigger- reverse order of net.parameters() ing AllReduce at the end of the backward pass is no longer 5 for p in net.parameters() do sufficient. It needs to react to more frequent signals and 6 acc ← p.grad accumulator launches AllReduce more promptly. Therefore, DDP regis- 7 acc → add post hook(autograd hook) ters one autograd hook for each gradient accumulator. The 8 Function forward(inp): hook fires after its corresponding accumulator updating the 9 out = net(inp) gradients, and will inspect the bucket it pertains. If hooks 10 traverse autograd graph from out and mark of all gradients in the same buckets have fired, the last hook unused parameters as ready will trigger an asynchronous AllReduce on that bucket. 11 return out Two caveats require caution. First, the reducing order 12 Function autograd hook(param index): must be the same across all processes, otherwise, AllReduce 13 get bucket bi and bucket offset using param index contents might mismatch, resulting in incorrect reduction 14 get parameter var using param index result or program crash. However, PyTorch dynamically 15 view ← bi.narrow(offset, var.size()) builds the autograd graph in every forward pass, and differ- 16 view.copy (var.grad) 17 if all grads in b are ready then ent processes might not agree on the gradient ready order. i 18 mark bi as ready Fig. 3 (a) shows one example, where the two vertical axes represent time and dotted lines indicate when a gradient is 19 launch AllReduce on ready buckets in order 20 if all buckets are ready then ready. In process 1, the four gradients are computed in or- 21 block waiting for all AllReduce ops der, but the gradient g2 are computed after g3 and g4 on process 2. In this case, if all processes AllReduce buckets as soon as they become ready, the AllReduce content would mismatch. Therefore, all processes must use the same buck- Algorithm 1 presents the pseudo-code of DDP. The con- eting order, and no process can launch AllReduce on bucket structor contains two major steps, broadcasting model states i+1 before embarking bucket i. If bucket 0 is the last one and installing autograd hooks. DDP’s forward function is a that becomes ready, there is no way that communication can simple wrapper of the local model’s forward. It traverses overlap with computation. PyTorch v1.5.0 addresses this the autograd graph to mark unused parameters accordingly. problem by using the reverse order of model.parameters() The autograd hook takes the internal parameter index as in- as the bucketing order, assuming that, layers are likely regis- put, which helps to find the parameter tensor and its belong- tered according to the same order as they are invoked in the ing bucket. It writes the local gradient to the correct offset in forward pass. Hence, the reverse order should approximately the bucket and then launches the asynchronous AllReduce represent the gradient computation order in the backward operation. There is an additional finalizing step omitted in pass. Admittedly, this is not a perfect solution, but is an ap- the pseudo-code that waits for AllReduce operations and proximation that we can rely on with minimum engineering writes the value back to gradients at the end of the back- overhead. ward pass. Fig. 4 elucidates how DDP interacts with the local Second, it is possible that one training iteration only in- model during the forward and backward passes. volves a sub-graph in the model and the sub-graph can be The above solution works for most use cases. However, as different from iteration to iteration, meaning that some gra- DDP always computes the average of all gradients and writes dients might be skipped in some iterations. However, as them back to parameter .grad field, an optimizer cannot gradient-to-bucket mapping is determined at the construc- distinguish whether a gradient has participated in the last tion time, those absent gradients would leave some buckets backward pass or not. Due to the decoupled design of DDP never seeing the final autograd hook and failing to mark the and the optimizer, there is no side channel for DDP to allude

3009 local model 1 DDP1 DDP2 local model 2 will synchronize the accumulated gradients altogether. The information of globally unused parameters also accumulates g g w1 gw1 gb1 b1 w1 allreduce2 w1 b1 gb1 gw1 w1 in the bitmap, and serves when the next communication

addmm1 bucket2 gb1 gb1 bucket2 addmm1 takes place. Below is an example code snippet.

w2 gw2 gb2 b2 g b2 gb2 gw2 w2 gw2 allreduce1 w2 1 ddp = DistributedDataParallel(net) addmm2 g addmm2 2 with ddp.no_sync(): bucket1 gb2 b2 bucket1 3 for inp, exp in zip(inputs, expected_outputs): 4 # no synchronization, accumulate grads mse_loss mse_loss 5 loss_fn(ddp(inp), exp).backward() 6 # synchronize grads 7 loss_fn(ddp(another_inp), another_exp).backward() loss loss 8 opt . step () Process 1 Process 2

Parameter Gradient Autograd Edge Copy Communication 3.3 Collective Communication Distributed data parallel training uses a special communi- Figure 4: Distributed Gradient Reduction cation pattern, where every participant provides an equally- sized tensor and collects the global sum across all participants. This can certainly be implemented as a gather oper- that information to the optimizer. Without this informa- ator followed by local reductions on every participant using tion, the training process could suffer from regressions on point-to-point communication, but that would forfeit op- model accuracy, e.g., when the optimizer uses gradient ab- portunities for performance optimizations [23]. DDP is built sence information to skip updating momentum values. To on top of collective communication libraries, including three tackle this problem, DDP should only touch gradients that options, NCCL [2], Gloo [1], and MPI [4]. 3 DDP takes the are indeed involved in the backward pass. Nevertheless, this APIs from the three libraries and wraps them into the same information cannot be extracted from the local autograd ProcessGroup API. The name heralds that ProcessGroup graph alone, because locally absent gradients might still be expects multiple processes to work collectively as a group. involved in the forward/backward pass in a peer DDP process. All ProcessGroup instances construct at the same time by Therefore, DDP uses a bitmap to keep track of local param- using a rendezvous service, where the first arrival will block eter participants and launches one additional AllReduce to waiting until the last instance joins. For NCCL backend, the collect globally unused parameters. Unfortunately, DDP can- ProcessGroup maintains a dedicated set of CUDA streams not coalesce this bitmap into other gradient AllReduce oper- for communication, so that communications will not block ations due to the potential mismatch in element types. Such the computation in the default stream. As all communica- additional overhead only materializes when the application tions are collective operations, subsequent operations on all explicitly tells DDP to look for unused parameters, and hence ProcessGroup instances must match in size and type and the price is only paid when necessary. follow the same order. Using the same ProcessGroup API 3.2.4 Gradient Accumulation for all libraries allows us to experiment with different communication algorithms with the same DDP implementation. One common technique to speed up distributed data par- For example, PyTorch v1.5 provides a composite round- allel training is to reduce gradient synchronization frequen- robin ProcessGroup implementation, which takes a list of cies. Instead of launching AllReduce in every iteration, the ProcessGroup instances and dispatches collective communi- application can conduct n local training iterations before cations to those ProcessGroup instances in a round-robin synchronizing gradients globally. This is also helpful if the manner. By using round-robin ProcessGroups, DDP can at- input batch is too large to fit into a device, where the ap- tain higher bandwidth utilization if a single NCCL, Gloo, or plication could split one input batch into multiple micro- MPI ProcessGroup is unable to saturate the link capacity. batches, run local forward and backward passes on every micro-batch, and only launch gradient synchronization at the boundaries of large batches. Theoretically, this should 4. IMPLEMENTATION produce the same results as if all data in the large batch The implementation of DDP has evolved several times in is processed in one shot, as gradients will simply be accu- the past few releases. This section focus on the current mulated to the same tensor. However, this conflicts with status as of PyTorch v1.5.0. DDP implementation lives in the gradient reduction algorithm discussed in Section 3.2.3 both Python and C++ files, with Python exposing the API to some degree. That algorithm would mark unused pa- and composing non-performance-critical components, and rameters as ready at the end of every forward pass, while C++ serving the core gradient reduction algorithm. The those unused parameters in one iteration still could partici- Python API calls into C++ core through Pybind11 [5]. pate in subsequent iterations. Moreover, DDP cannot distin- 4.1 Python Front-end guish whether the application plans to immediately invoke optimizer.step() after backward or accumulate gradients The DDP nn.module is implemented in distributed.py, through multiple iterations. Therefore, we need to introduce which contains user-facing components, including the con- one additional interface (i.e., no sync) for this use case. structor, the forward function, and the no sync context Under the hood, the implementation for no sync is very manager. Besides the general ideas highlighted in Section 3, simple. The context manager just toggles a flag on entering there are several implementation details in the Python front- and exiting the context, and the flag is consumed in the end that shapes the behavior of DDP. forward function of DDP. In no sync mode, all DDP hooks 3Please refer to documents of the three libraries for their are disabled, and the first backward pass out of the context design and implementation.

3010 Configurable Knobs are exposed in the DDP constructor dients. Each post-hook function decrements the count, and API, including 1) process group to specify a process group DDP marks a bucket as ready when that count reaches zero. instance for DDP to run AllReduce, which helps to avoid In the next forward pass, DDP replenishes the pending gra- messing up with the default process group, 2) bucket cap mb dient count for every bucket. to control the AllReduce bucket size, where applications Bucket AllReduce is the main source of communication should tune this knob to optimize training speed, and 3) overhead in DDP. On one hand, packing more gradients into find unused parameters to toggle whether DDP should de- the same bucket would reduce the amortized system over- tect unused parameters by traversing the autograd graph. head of communication. One the other hand, using a large Model Device Affinity in the local model also governs bucket size would result in longer lead time for reduction, as DDP’s behavior, especially if the model spans multiple de- each bucket needs to wait for more gradients. Hence, bucket vices, which is common when the model is too large to fit size is the key trade-off. By default, each bucket is 25MB in into a single device. For large models, applications can place size. Applications should measure their impact empirically different layers of the model onto difference devices, and use and set it to the optimal value for their use cases. Tensor.to(device) API to move intermediate output from Globally Unused Parameters’ gradients should stay one device to another. DDP also works with multi-device intact during the forward and the backward passes. Detect- models. As long as the device ids argument is None or ing unused parameters requires global information, as one an empty list, DDP will inspect the model, perform sanity parameter could be absent in one DDP process during one it- checks and apply configurations accordingly. Then, it treats eration, but participates training in the same iteration in an- the multi-device model as one entirety. other process. DDP maintains local unused parameter infor- Model Buffers are necessary when layers need to keep mation in a bitmap, and launches an additional AllReduce track of states like the running variance and the running to gather a global bitmap. As the bitmap is much smaller mean (e.g., BatchNorm). DDP supports model buffers by let- than tensor sizes, instead of creating per-bucket bitmaps, ting the process with the rank 0 to take the authority. If the all parameters in the model share the same bitmap. The model contains buffers, DDP will broadcast the buffer values bitmap lives on CPU to avoid launching dedicated CUDA from rank 0 process to all other processes before starting kernels for each update. However, some ProcessGroup back- the forward pass on the local model. This behavior is also ends might not be able to run AllReduce on CPU ten- compatible with the no sync mode. When no sync mode is sors. For example, ProcessGroupNCCL only supports CUDA enabled, it sets a flag in the forward pass properly to indi- tensors. Moreover, as DDP should work with any custom cate whether it expects gradient reductions in the immediate ProcessGroup backend, it cannot make assumptions that backward pass. If the communication takes place, DDP will all backends support CPU tensors. To address this prob- then broadcast buffers prior to the subsequent forward pass. lem, DDP maintains another bitmap on the same device as the first model parameter, and invokes a non-blocking copy 4.2 Core Gradient Reduction to move the CPU bitmap to the device bitmap for collective communications. Major development efforts are spent in gradient reduction as it is the most performance-critical step in DDP. The implementation lives in reducer.cpp which consists of four main components, namely, building parameter-to-bucket map, in- 5. EVALUATION stalling autograd hooks, launching bucket AllReduce, and This section presents the evaluation results of PyTorch detecting globally unused parameters. This section expati- DDP using an exclusive 32 GPU cluster and a shared enti- ates on these four components. tlement. Fig. 5 shows the interconnection of the 8 GPUs Parameter-to-Bucket Mapping has a considerable im- within the same server. In the exclusive cluster, the GPUs pact on DDP speed. In every backward pass, tensors are are located on 4 servers, connected using Mellanox MT27700 copied from all parameter gradients to buckets, and aver- ConnectX-4 100GB/s NIC. All 4 servers reside in the same aged gradients are copied back after AllReduce. To acceler- rack, and each server is equipped with 8 NVIDIA Tesla V100 ate copy operations, buckets are always created on the same GPUs. We only use device as the parameters. If the model spans multiple de- the shared entitlement vices, DDP takes device affinity into consideration to make when a set of exper- NV2 NV1 NODE sure that all parameters in the same bucket are on the same iments require more than device. The order of AllReduce also makes a difference, as 32 GPUs. In the shared 7 it dictates how much communication can overlap with com- entitlement, we submit 6 putation. DDP launches AllReduce in the reverse order of jobs to run on different 5 model.parameters(). numbers of GPUs where 4 Autograd Hook is the entry point for DDP in the back- different jobs can run ward pass. During construction, DDP loops over all param- on different machines, GPUs 3 eters in the model, finds the gradient accumulator on every and hence the hardware 2 parameter, and installs the same post-hook function to ev- and network connectiv- 1 ery gradient accumulator. The gradient accumulator will ity can vary from job to 0 fire post hooks when the corresponding gradient is ready, job. Although the dis- 0 1 2 3 4 5 6 7 and DDP will figure out when an entire bucket is ready to parity in the test envi- GPUs launch an AllReduce operation. However, as there is no ronment can lead to dif- guarantee on the order of gradient readiness, DDP cannot se- ferent latency measures lectively pick parameters to install hooks. In the current even for the same code, Figure 5: GPU Topology implementation, each bucket keeps a count of pending gra- we pack the same set of

3011 is 25MB, which is our best effort estimation based experiences. The following experiments also confirm this is a rea- sonable choice for ResNet50 and BERT. This section compares per iteration latency across different bucket sizes using 16 GPUs on two machines. Zero bucket size means each gradient will be communicated on its own as soon as it is ready. This serves as a baseline on one extreme of the bucket size spectrum. The other extreme is communication all gradients in one short, which is skipped as results in Fig. 7 and Fig. 8 clearly show the best option for both ResNet50 and Figure 6: Per Iteration Latency Breakdown BERT is somewhere in the middle. Fig. 7 (a) uses box-whisker to illustrate how bucket size affects per iteration latency on ResNet50 with NCCL back- experiments into the same job, so that the trend shown in end. The x-axis is the bucket size in MBs, and Y-axis per the same curve is still meaningful. iteration latency in seconds. The outliers are the tiny delay We measure DDP per iteration latency and scalability us- spikes at 100 iteration boundaries caused by DDP instance ing two popular models, ResNet50 [20] and BERT [15], to re-construction and input data regeneration. Other than represent typical vision and NLP applications. Most ex- that, delays of most iterations concentrate in a very nar- periments use randomly generated synthetic inputs and la- row time range, which also agrees with the results shown bels, which are sufficient as the purpose is to compare per in Fig. 6 (a). The results show that the highest speed is iteration latency instead of model accuracy. Experiments achieved between 10MB and 25MB bucket sizes. Fig. 7 (b) compute losses using the CrossEntropyLoss function and presents the same measurements for Gloo backend. The re- update parameters using the SGD optimizer. Configurations sults are different from NCCL backend in two ways, 1) per for accuracy-related experiments will be explained in detail iteration latency falls into a large range, 2) the 5MB bucket close to their presentations. size attains higher speed compared to 10MB and 25MB. The 5.1 Latency Breakdown first difference matches with Fig. 6 (b). To understand the second difference, let us revisit Fig. 2 (b) on Gloo AllReduce A typical training iteration contains three steps: forward latency across different tensor sizes. It’s clear that the total pass to compute loss, backward pass to compute gradients, AllReduce time fluctuates around the same level when the and optimizer step to update parameters. With DDP, the bucket size is larger than 512KB. Therefore, larger bucket backward pass involves local computation and AllReduce sizes beyond 512KB with Gloo backend would only mean communication. To demonstrate the effectiveness of over- longer waiting time for gradients, which leads to longer per lapping computation with communication, Fig. 6 plots the iteration latency. Fig. 7 (c) and (d) show the measurements latency breakdown when using NCCL and Gloo backends for for BERT model. As BERT model contains 15X more pa- ResNet50 and BERT models respectively. All experiments rameters compared to ResNet50, intuitively, it should ben- are conducted using 32 GPUs across 4 machines. To visu- efit from larger buckets as larger communication overheads ally compare the speedup on different model and backend would dwarf the waiting time for the first bucket. The re- combinations, we normalize the total latency to 1 for all non- sults verified the intuition with NCCL backend, where 50MB overlapping cases. The results demonstrate that the back- bucket size leads to the best performance. However, with ward pass is the most time-consuming step with PyTorch Gloo backend, 5MB bucket size still wins with the lowest DDP training, as AllReduce communications (i.e., gradient per iteration latency. synchronization) are completed in this step. This observa- Fig. 8 presents the results of the same set of experiments tion justifies that the DDP backward pass deserves the most but on 32 GPUs. In this case, the outliers span a larger efforts for improvements. Within the backward pass, the range, which is not surprising as synchronizations usually communication step takes more than half of the total delay take longer with more participants and the impact of stran- and this is exacerbated with the increase of the model size. gler is more prominent. Fig. 8 (a) and (b) both suggest Between these two backends, NCCL is considerably faster that 0MB bucket size leads to obviously longer per itera- than GLOO. The speedup is most effective when the com- tion latency on 32 GPUs compared to 16 GPUs, as per- putation and communication take roughly the same amount gradient reductions on a larger cluster are expected to be of time as they can overlap more. The overlapping approach slower. However, when bucket size is set to above 5MB, helps ResNet and BERT on NCCL attain 38.0% and 35.2% scaling from 16 GPUs to 32 GPUs does not lead to a notice- speedup. With GLOO backend, the gain shrinks to 26.8% able speed regression. This is probably because although and 21.5% respectively, as GLOO communication becomes individual AllReduce operations is expected to be slower, the dominating delay in the backward pass. asynchronous execution and parallelism could help to hide 5.2 Bucket Size the overall delay. To avoid launching an excessive number of AllReduce operations, DDP organizes small gradients into larger buckets 5.3 Scalability and synchronizes each bucket using an AllReduce opera- To understand the scalability of DDP, we measure per iter- tion. With this design, bucket size is an important configu- ation training latency of ResNet50 and BERT using NCCL ration knob. DDP exposes this knob to applications through and Gloo backend on up to 256 GPUs in the shared enti- bucket cap mb argument. No single bucket size can best tlement. Results are presented in Fig. 9. The X-axis is the serve all applications. This value should be measured and number of GPUs, and Y-axis the latency. Figure 9 (a) shows determined empirically. The default value of bucket cap mb that the per iteration latency steadily increases as it scales

3012 0.40 0.40 1.4 1.4

0.35 0.35 1.2 1.2

0.30 0.30 1.0 1.0

0.25 0.25 0.8 0.8

0.20 0.20 0.6 0.6 Per Iteration Latency (Sec) Per Iteration Latency (Sec) Per Iteration Latency (Sec) Per Iteration Latency (Sec) 0.15 0.15 0.4 0.4

0.10 0.10 0.2 0.2 0 5 10 25 50 0 5 10 25 50 0 5 10 25 50 100 200 0 5 10 25 50 100 200 Bucket Size (MB) Bucket Size (MB) Bucket Size (MB) Bucket Size (MB) (a) ResNet50 on NCCL (b) ResNet50 on Gloo (c) BERT on NCCL (d) BERT on Gloo

Figure 7: Per Iteration Latency vs Bucket Size on 16 GPUs

0.40 0.40 1.4 1.4

0.35 0.35 1.2 1.2

0.30 0.30 1.0 1.0

0.25 0.25 0.8 0.8

0.20 0.20 0.6 0.6 Per Iteration Latency (Sec) Per Iteration Latency (Sec) Per Iteration Latency (Sec) Per Iteration Latency (Sec) 0.15 0.15 0.4 0.4

Figure 8: Per Iteration Latency vs Bucket Size on 32 GPUs out. Using 256 GPUs leads to 100% slow down in each it- ify if the acceleration might be erased by convergence slow- eration compared to local training, meaning that the real down. The experiments use MNIST [25] dataset to train the scaling factor is 256 × 50% = 128. With the BERT model, ResNet. The learning rate is set to 0.02 and the batch size the per-iteration latency significantly increases due to the is 8. Results are plotted in Fig. 11 (a), which only contains larger model size. Another observation is that the 16-GPU the measurements for NCCL backend as the communica- case suffers a longer per-iteration delay compared to the 32- tion layer does not change the convergence speed. X-axis is GPU case in Figure 9 (c). We suspect this is because either the number of iterations and Y-axis the loss. Please note the 16-GPU experiments were on a slow or congested link that the goal of this experiment is not developing the best or there are other workflows in the shared entitlement com- model for MNIST, instead, it only aims to show the im- peting for resources with our job. Fig. 9 (b) and (d) show pact of skipping synchronization on the model convergence. the results for Gloo backend and the per-iteration slowdown The raw loss data oscillate severely, which are presented by is about 3X for ResNet and 6X for BERT when using 256 the tiny dots. Directly connecting them into a line would GPUs. The deteriorated training speed with larger model result in the last curve covering all previous drawn ones, sizes indicates that the network is the bottleneck resource making them less visible. Therefore, we apply an order 3 when using Gloo backend in this experiment. low pass filter by using filtfilt from SciPy [8] and plot In general, scaling out to more GPUs slows down indi- the smoothed loss curve. The figure confirms that using vidual iterations. One option to mitigate the overhead is no sync in this case only leads to negligible exacerbation to skipping gradient synchronizations, i.e., perform gradient the convergence speed. However, we must emphasize that reduction every n iterations. This approach helps to con- the impact of no sync could depend on the configuration. siderably reduce the amortized latency. Fig. 10 depicts the Fig. 11 (b) shows similar measurements by replacing batch average per iteration latency for conducting gradient reduc- size to 256 and learning rate to 0.06. As highlighted by the tion every 1, 2, 4, and 8 iterations. To visually compare red box in the right bottom corner, no sync hurts the fi- the effectiveness of this method, we consolidated different nal training loss. It is because large batch size and no sync skipping configurations for the same model and backend cause more gradients to be accumulated between consecu- combination into the same figure. ResNet50 on NCCL and tive communications and optimizer steps, which implicitly Gloo sees 38% and 57% speed up with 256 GPUs when con- requires using a smaller learning rate. In summary, when ducting gradient sync every 8 iterations. There is a sudden skipping synchronizations properly, DDP attains near linear jump in delay with NCCL backend when scaling from 128 scalability with negligible accuracy penalty. to 256 and this occurs to all experiments shown in this figure. We believe this is caused by slow or congested links among some of those 256 nodes which are not included in 5.4 Round-Robin Process Group the 128-GPU experiments. Besides the per iteration latency, Another technique to speed up training is to use multiple it’s also crucial to measure the convergence speed to ver- process groups to work around subtle intrinsic concurrency limitations in process group backend implementations. The

3013 0.6 0.6 3.0 3.0

2.5 2.5 0.5 0.5

2.0 2.0 0.4 0.4 1.5 1.5 0.3 0.3 1.0 1.0

Per Iteration Latency (Sec) 0.2 Per Iteration Latency (Sec) 0.2 Per Iteration Latency (Sec) Per Iteration Latency (Sec) 0.5 0.5

0.1 0.1 0.0 0.0 1 2 4 8 16 32 64 128 256 1 2 4 8 16 32 64 128 256 1 2 4 8 16 32 64 128 256 1 2 4 8 16 32 64 128 256 Number of GPUs Number of GPUs Number of GPUs Number of GPUs (a) ResNet50 on NCCL (b) ResNet50 on Gloo (c) BERT on NCCL (d) BERT on Gloo

Figure 9: Scalability

0.32 0.6 nccl gloo nccl nccl 0.30 no_sync_2 2.2 no_sync_2 2.2 no_sync_2 no_sync_2 0.5 0.28 no_sync_4 no_sync_4 no_sync_4 no_sync_4 0.26 no_sync_8 no_sync_8 no_sync_8 no_sync_8 0.4 2.0 2.0 0.24

0.22 Loss Loss 0.3 1.8 1.8 0.20

0.18 0.2 1.6 1.6 0.16 Average Per Iteration Latency (Sec) Average Per Iteration Latency (Sec) 0.14 0.1 1 2 4 8 16 32 64 128 256 1 2 4 8 16 32 64 128 256 0 50 100 150 200 250 300 350 0 50 100 150 200 250 300 350 Number of GPUs Number of GPUs Number Iterations Number Iterations (a) ResNet50 on NCCL (b) ResNet50 on Gloo (a) Batch Size = 8 (b) Batch Size = 256

Figure 10: Skip Gradient Synchronization Figure 11: Accuracy with Skipping Synchronization

0.14 0.22 0.50 1.2 rr1 rr1 rr1 rr1 0.45 0.13 rr3 0.20 rr3 rr3 1.0 rr3 rr5 rr5 0.40 rr5 rr5 0.18 0.12 0.35 0.8 0.16 0.11 0.30 0.14 0.6 0.25 0.10 0.12 0.20 0.4 0.09 0.10 0.15 Medium Per Iteration Latency (Sec) Medium Per Iteration Latency (Sec) Medium Per Iteration Latency (Sec)

Medium Per Iteration Latency (Sec) 0.2 0.08 0.08 0.10 1 2 4 8 16 24 32 1 2 4 8 16 24 32 1 2 4 8 16 24 32 1 2 4 8 16 24 32 Number of GPUs Number of GPUs Number of GPUs Number of GPUs (a) ResNet50 on NCCL (b) ResNet50 on Gloo (c) BERT on NCCL (d) BERT on Gloo

Figure 12: Round-Robin Process Group concurrency limitations could come from NCCL streams or 6. DISCUSSION Gloo threads, depending on the type of the backend, which This section discusses lessons learned from our experi- might prevent one process group instance to fully utilize ments and past experiences. We then present several ideas all link bandwidth. The PyTorch distributed package sup- for future improvements. ports composing a Round-Robin process group with multiple NCCL or Gloo process groups, which dispatches collective communications to different process group instances in 6.1 Lessons Learned Robin-Robin order. Fig. 12 plots the per iteration latency Distributed data parallel training is a conceptually simple of Round-Robin process group using 1, 3, and 5 NCCL or but practically subtle framework. There are various tech- Gloo process groups, where rrx stands for Round-Robin niques to improve the training speed, creating a complex with x process group instances. ResNet50 on NCCL back- configuration space. Based on our observations, there is end sees negligible differences with different amounts of pro- no single configuration that would work for all use cases, cess groups, meaning that for relatively small models like as it would highly depend on the model size, model struc- ResNet50, bandwidth is not the bottleneck resource. No- ture, network link bandwidth, etc. However, on individual ticeable difference can be observed in ResNet50 on Gloo, configuration dimensions, we summarized intuitions to help where rr3 consistently outperforms rr1. The most promi- application developers to quickly navigate to a small range nent acceleration occurs in BERT model with NCCL back- which likely contains the optimal solution for a given use end, where rr3 achieves 33% speedup compared to rr1 on case. The specific value of the configuration needs to be de- 16 GPUs, revealing that one NCCL group is incompetent to termined using empirical measurements for every different saturate the link capacity. deployment.

3014 • Communication Backend: NCCL is considerably instead of parameters to buckets and all processes skip the faster than Gloo in most use cases. When available, same bucket in the same iteration. Both options require applications should seek to use NCCL as the primary extra coordination across all DDP processes, which can be collective communication backend. implemented by using the same random seed or having an authority process to broadcast the plan. • Bucket Size: Both excessively small or large bucket sizes are detrimental to communication performance. 6.2.3 Gradient Compression The optimal value lives in between and depends on Another potential improvement for DDP is to reduce the the type of communication backend employed. The volume of data for communication by compressing gradients. optimal bucket sizes are likely to increase with the size The absolute value of gradients are usually small, which of the model in a sub-linear manner. might not require float32 or float64 types. Current DDP • Resource Allocation: There is a significant slow- implementation always uses the parameter type as the gra- down with NCCL backend when scaling models across dient type that can become an overkill especially when the machine boundaries, if the bandwidth across machines model is approaching convergence. In this case, DDP would is considerably lower than that between same-machine benefit from adaptive compression levels by only communi- GPUs. In such cases, it is recommended to keep the cating gradients with the necessary precision. Some recent DDP group within the same machine. If the train- research work [34] even proposes more aggressive compres- ing requires larger scale, developers can explore en- sion schemes, where by trading a tiny amount of model ac- abling no sync mode if it attains acceptable conver- curacy, applications can significantly accelerate distributed gence speed. training by communicating just 1 bit for each gradient. 6.2 Future Improvements 7. RELATED WORK While we implement and maintain the DDP package, sev- Distributed training algorithms can be categorized into eral ideas for improvements popped up. This section dis- different types from different perspectives. Below are three cusses the basic ideas behind those improvements. popular categorizations.

6.2.1 Gradient Order Prediction • Synchronous update vs Asynchronous update: With Although DDP cannot deterministically detect the back- the former, all model replicas can use AllReduce to col- ward computation order on all parameters at construction lectively communicate gradients or parameters, while time, the order usually does not change that often in prac- the asynchronous scheme employs P2P communication tice. One viable solution is to trace the backward order using to update gradients or parameters independently. autograd hooks and update parameter to bucket mapping • Cross-iteration vs Intra-iteration: Cross-iteration par- accordingly. As bucket re-allocation will introduce notice- allelism (e.g., pipeline parallelism) allows the lifetime able overhead, it should be conducted infrequently. Given of multiple iterations to overlap with each other, while the existing complexities in DDP, tracing overhead should be intra-iteration scheme focuses on parallelizing training negligible. Nevertheless, if there are disparities among trac- within one iteration. ing results from different iterations, additional complexities will be necessary to reach a consensus. • Data parallel vs Model parallel: Data parallel training distributes input data to multiple model replicas, 6.2.2 Layer Dropping while model parallelism divides the model into smaller One technique to accelerate training and avoid overfit- pieces, which is especially helpful when the model is ting is to randomly drop layers during the forward pass [17]. too large to fit in one device or machine. This works well with local training. As every forward pass would build a new autograd graph, those skipped layers will Table 1 summarizes some recent distributed training so- not participate in the backward pass either. This idea also lutions by marking which scheme they can support. Be- works with DDP, because parameters in skipped layers can be sides advances in training schemes, prior work has also ex- marked as ready in the forward pass and DDP will not wait for plored different communication algorithms, including tree- their autograd hooks during the backward pass. Although based AllReduce [23], heterogeneity-aware interconnection DDP would produce the correct result, this technique alone structure [39], and AllReduce decomposition [14]. As this is inadequate to accelerate distributed data parallel training paper focuses on DDP, the remainder of this section only elab- the same way as local training due to the fixed parameter- orates and compares closely related techniques, i.e., Syn- to-bucket mapping. As AllReduce uses a bucket as the min- chronous, Intra-iteration, and Data parallel training schemes. imum granularity, it cannot judiciously react to vacancies in The techniques presented in this paper were first imple- buckets (i.e., skipped layers or parameters). Consequently, mented and released in PyTorch v1.1. Similar computation- regardless of how the forward pass skips layers, there is al- communication overlap techniques are also introduced in ways the same amount of data to be communicated across TensorFlow v2.2 as the Multi Worker Mirrored Strategy [10]. the wire during the backward pass. Besides, DDP cannot af- This technique is researched in academia as well. Gradi- ford the luxury to adjust all buckets to cooperate with ran- entFlow [37] combines bucketing AllReduce with skipping domly skipped layers, as that would result in unacceptable parameter synchronizations. Compared to PyTorch DDP, memory allocation overhead. To tackle this problem, one so- instead of skipping the entire synchronization step in one lution is to keep bucket buffers intact but modify parameter- iteration, GradientFlow selectively communicates a subset to-bucket mappings accordingly. Another option is to per- of gradients. Although this strategy helps to reduce com- form layer skips at the bucket level, i.e., DDP can map layers munication overhead for gradients, it requires an additional

3015 communication phase to attain consensus on which gradients to synchronize. As a result, the overhead incurred to Table 1: Distributed Training Solutions: Six schemes are Synchronous-Update vs Asynchronous-Update, Cross- acquire consensus might overshadow the speedup achieved ¯ ¯ ¯ in gradient synchronizations, especially for small models or Iteration vs Intra-Iteration, Data-Parallel vs Model-Parallel ¯Scheme ¯ SACIDM¯ large network round-trip delays. √ √ √ PT DDP [9] Another approach to speed up distributed training is pre- √ √ √ √ √ PT RPC [6] empting and prioritizing communications based on the or- √ √ √ TF Mirrored Worker [10] der of downstream computations. Jayarajan et al. [22] pro- √ √ √ √ TF ParameterServer [11] posed to prioritize gradient synchronizations and parameter √ √ √ √ Mesh TensorFlow [36] updates based on the forward order instead of the back- √ √ √ GPipe [21] ward order, meaning that gradient buckets containing the √ √ √ Horovod [35] initial layers should receive higher priorities than those in √ √ √ GradientFlow [37] the final layers. Communications should still start from fi- √ √ √ SlowMo [40] nal layer gradients, as they will become ready earlier, but √ √ √ √ √ PipeDream [29] higher priority gradients (i.e., in initial layers) can preempt √ √ √ √ lower priority ones. This design allows the forward pass in ZeRO [32] √ √ √ √ √ the next iteration to start sooner, even before finishing gradi- Parallax [24] √ √ √ √ ents communications in the previous iteration, creating more ByteScheduler [31] √ √ √ √ opportunities to overlap computations and communications. TicTac [19] √ √ √ ByteScheduler [31] explored scheduling communications for PACE [12] distributed data parallel training as well. However, instead of binding with a single framework, ByteScheduler works for multiple frameworks by inserting a common core scheduler duce considerable overhead. Hence, applications can choose between framework APIs and framework engines and uses which techniques to use based on the size of the given model per-engine plugins to intercept communication invocations. and available resources. PipeDream [29] employs a different To integrate with PyTorch, ByteScheduler builds on top of approach where the model stack is decomposed into multiple Horovod [35] which launches communication in the opti- stages, where data parallelism is applied within one stage mizer. One downside of this approach is that, there is a hard and pipeline with model parallelisms govern the workload barrier between the backward pass and the optimizer step. across stages. One subtle detail is that to attain high train- As a result, communication can only overlap with the next ing speed, PipeDream slightly sacrifices accuracy by using forward pass instead of the current backward pass. With dy- the latest gradients from multiple concurrent passes. Al- namic graphs, the next iteration might touch a different set though the gradient might not be derived from the current of parameters, which would invalidate the schedule derived parameter states, the authors show that this mismatch is tol- from the previous iteration. PACE [12] computes the op- erable in practice. Parallax [24] explored a hybrid structure timal communication schedule and implements preemption that combines parameter-server [27] and collective commu- by segmenting primitive AllReduce operations into smaller nications. Models are partitioned based on sparsity, where pieces. Although segmenting can indeed mimic preemption, dense parameters are communicated using AllReduce and it will on the other hand hurt the total communication time sparse tensors are placed to parameter servers. This design as we have seen in Fig. 2. A more efficient approach would avoids densifying sparse tensors and communicating empty be to natively support prioritization in the communication values, which is especially helpful for NLP models. libraries (e.g., NCCL and Gloo). 8. CONCLUSION The mixture of different parallelism scheme fosters even This paper explained the design and implementation of more powerful training paradigms. Mesh-TensorFlow [36] the distributed data parallel module in PyTorch v1.5, and combines data parallelism with model parallelism. It verti- conducted performance evaluations on NCCL and Gloo back- cally divides some layers by dimensions and replicating other end using ResNet50 and BERT models. DDP accelerates layers where the given dimension is absent. ZeRO [32] also training by aggregating gradients into buckets for communi- combines data parallelism with model parallelism, but with cation, overlapping communication with computation, and minimum model replication to support fast training on su- skipping synchronizations. We also highlighted real-world per large models. The authors observed that main memory caveats in gradient synchronization which are important for consumption contributors are input data, model parame- broad adoption. Results showed that DDP with NCCL back- ters, gradients, optimizer states, and activations. Splitting end can achieve near-linear scalability on 256 GPUs when input data is trivial. However, model parameters and ac- configured properly. The measurements also revealed that tivations are compulsory ingredients for backward passes. the backward pass in DDP is the most expensive step in train- ZeRO addressed this problem by partitioning parameters, ing and requires efforts from both framework developers to gradients, and optimizer states on each DDP instance. Pa- enable optimization algorithms and application developers rameters are broadcast from the owner DDP instance to all to empirically configure the knobs. Based on our obser- others when necessary. Activations are recomputed during vations, we shared lessons learned from serving a variety the backward pass. Compared to PyTorch DDP, ZeRO can of application, discussed potential future improvements for scale to much larger models as each process only needs to distributed data parallel training, and enthusiastically en- maintain a small partition of the model. The high scalabil- courage open source community to experiment with more ity is achieved by sacrificing the training speed, as the ad- novel ideas. ditional re-computation, broadcast, and gather would intro-

3016 9. REFERENCES [19] S. H. Hashemi, S. A. Jyothi, and R. H. Campbell. Tictac: Accelerating distributed deep learning with [1] Gloo: a collective communications library. communication scheduling. arXiv preprint https://github.com/facebookincubator/gloo, 2019. arXiv:1803.03288, 2018. [2] NVIDIA Collective Communications Library (NCCL). [20] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual https://developer.nvidia.com/nccl, 2019. learning for image recognition. In Proceedings of the [3] NVLINK AND NVSWITCH: The Building Blocks of IEEE conference on computer vision and pattern Advanced Multi-GPU Communication. https: recognition, pages 770–778, 2016. //www.nvidia.com/en-us/data-center/nvlink/, [21] Y. Huang, Y. Cheng, A. Bapna, O. Firat, D. Chen, 2019. M. Chen, H. Lee, J. Ngiam, Q. V. Le, Y. Wu, et al. [4] Open MPI: A High Performance Message Passing Gpipe: Eﬃcient training of giant neural networks Library. https://www.open-mpi.org/, 2019. using pipeline parallelism. In Advances in Neural [5] Pybind11: Seamless operability between C++11 and Information Processing Systems, pages 103–112, 2019. Python. https://pybind11.readthedocs.io/, 2019. [22] A. Jayarajan, J. Wei, G. Gibson, A. Fedorova, and [6] PyTorch Distributed RPC Framework. G. Pekhimenko. Priority-based parameter propagation https://pytorch.org/docs/master/rpc.html, 2019. for distributed dnn training. In Proceedings of Machine [7] PyTorch Module forward Function. Learning and Systems 2019, pages 132–145, 2019. https://pytorch.org/docs/stable/nn.html#torch. [23] S. Jeaugey. Massively Scale Your Deep Learning nn.Module.forward, 2019. Training with NCCL 2.4. [8] SciPy: open-source software for mathematics, science, https://devblogs.nvidia.com/ and engineering. https://docs.scipy.org/, 2019. massively-scale-deep-learning-training-nccl-2-4/, [9] PyTorch DistributedDataParallel. February 2019. https://pytorch.org/docs/stable/nn.html#torch. [24] S. Kim, G.-I. Yu, H. Park, S. Cho, E. Jeong, H. Ha, nn.parallel.DistributedDataParallel, 2020. S. Lee, J. S. Jeong, and B.-G. Chun. Parallax: [10] TensorFlow Distributed Training Sparsity-aware data parallel training of deep neural MultiWorkerMirroredStrategy. networks. In Proceedings of the Fourteenth EuroSys https://www.tensorflow.org/guide/distributed_ Conference 2019, pages 1–15, 2019. training#multiworkermirroredstrategy, 2020. [25] Y. LeCun, C. Cortes, and C. Burges. The MNIST [11] TensorFlow Distributed Training Database. http://yann.lecun.com/exdb/mnist/, ParameterServerStrategy. 1999. https://www.tensorflow.org/guide/distributed_ [26] Y. LeCun, D. Touresky, G. Hinton, and T. Sejnowski. training#parameterserverstrategy, 2020. A theoretical framework for back-propagation. In [12] Y. Bao, Y. Peng, Y. Chen, and C. Wu. Preemptive Proceedings of the 1988 connectionist models summer all-reduce scheduling for expediting distributed dnn school, volume 1, pages 21–28. CMU, Pittsburgh, Pa: training. In IEEE INFOCOM, 2020. Morgan Kaufmann, 1988. [13] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, [27] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and U. Muller, J. Zhang, et al. End to end learning for B.-Y. Su. Scaling distributed machine learning with self-driving cars. arXiv preprint arXiv:1604.07316, the parameter server. In 11th {USENIX} Symposium 2016. on Operating Systems Design and Implementation [14] M. Cho, U. Finkler, M. Serrano, D. Kung, and ({OSDI} 14), pages 583–598, 2014. H. Hunter. Blueconnect: Decomposing all-reduce for [28] H. Mao, M. Cheung, and J. She. Deepart: Learning deep learning on heterogeneous network hierarchy. joint representations of visual arts. In Proceedings of IBM Journal of Research and Development, 63(6):1–1, the 25th ACM international conference on Multimedia, 2019. pages 1183–1191, 2017. [15] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. [29] D. Narayanan, A. Harlap, A. Phanishayee, Bert: Pre-training of deep bidirectional transformers V. Seshadri, N. R. Devanur, G. R. Ganger, P. B. for language understanding. arXiv preprint Gibbons, and M. Zaharia. Pipedream: generalized arXiv:1810.04805, 2018. pipeline parallelism for dnn training. In Proceedings of [16] M. Du, F. Li, G. Zheng, and V. Srikumar. Deeplog: the 27th ACM Symposium on Operating Systems Anomaly detection and diagnosis from system logs Principles, pages 1–15, 2019. through deep learning. In Proceedings of the 2017 [30] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, ACM SIGSAC Conference on Computer and G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, Communications Security, pages 1285–1298, 2017. L. Antiga, A. Desmaison, A. Kopf, E. Yang, [17] A. Fan, E. Grave, and A. Joulin. Reducing Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, transformer depth on demand with structured B. Steiner, L. Fang, J. Bai, and S. Chintala. Pytorch: dropout. arXiv preprint arXiv:1909.11556, 2019. An imperative style, high-performance deep learning [18] X. Guo, S. Singh, H. Lee, R. L. Lewis, and X. Wang. library. In Advances in Neural Information Processing Deep learning for real-time atari game play using Systems 32, pages 8024–8035. Curran Associates, Inc., oﬄine monte-carlo tree search planning. In Advances 2019. in neural information processing systems, pages [31] Y. Peng, Y. Zhu, Y. Chen, Y. Bao, B. Yi, C. Lan, 3338–3346, 2014. C. Wu, and C. Guo. A generic communication

3017 scheduler for distributed dnn training acceleration. In M. Hong, C. Young, et al. Mesh-tensorflow: Deep Proceedings of the 27th ACM Symposium on Operating learning for supercomputers. In Advances in Neural Systems Principles, pages 16–29, 2019. Information Processing Systems, pages 10414–10423, [32] S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He. 2018. Zero: Memory optimization towards training a trillion [37] P. Sun, Y. Wen, R. Han, W. Feng, and S. Yan. parameter models. arXiv preprint arXiv:1910.02054, Gradientflow: Optimizing network performance for 2019. large-scale distributed dnn training. IEEE [33] B. Ramsundar, P. Eastman, P. Walters, and Transactions on Big Data, 2019. V. Pande. Deep learning for the life sciences: applying [38] A. Van den Oord, S. Dieleman, and B. Schrauwen. deep learning to genomics, microscopy, drug discovery, Deep content-based music recommendation. In and more. ” O’Reilly Media, Inc.”, 2019. Advances in neural information processing systems, [34] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu. 1-bit pages 2643–2651, 2013. stochastic gradient descent and its application to [39] G. Wang, S. Venkataraman, A. Phanishayee, data-parallel distributed training of speech dnns. In J. Thelin, N. Devanur, and I. Stoica. Blink: Fast and Fifteenth Annual Conference of the International generic collectives for distributed ml. arXiv preprint Speech Communication Association, 2014. arXiv:1910.04940, 2019. [35] A. Sergeev and M. D. Balso. Horovod: fast and easy [40] J. Wang, V. Tantia, N. Ballas, and M. Rabbat. distributed deep learning in TensorFlow. arXiv Slowmo: Improving communication-efficient preprint arXiv:1802.05799, 2018. distributed sgd with slow momentum. arXiv preprint [36] N. Shazeer, Y. Cheng, N. Parmar, D. Tran, arXiv:1910.00643, 2019. A. Vaswani, P. Koanantakool, P. Hawkins, H. Lee,

3018