<<

Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

Deepak Narayanan‡★, Mohammad Shoeybi†, Jared Casper†, Patrick LeGresley†, Mostofa Patwary†, Vijay Korthikanti†, Dmitri Vainbrand†, Prethvi Kashinkunti†, Julie Bernauer†, Bryan Catanzaro†, Amar Phanishayee∗, Matei Zaharia‡
†NVIDIA  ‡Stanford University  ∗Microsoft Research
★Work done as an intern at NVIDIA.

ABSTRACT

Large language models have led to state-of-the-art accuracies across several tasks. However, training these models efficiently is challenging because: a) GPU memory capacity is limited, making it impossible to fit large models on even a multi-GPU server, and b) the number of compute operations required can result in unrealistically long training times. Consequently, new methods of model parallelism such as tensor and pipeline parallelism have been proposed. Unfortunately, naive usage of these methods leads to scaling issues at thousands of GPUs. In this paper, we show how tensor, pipeline, and data parallelism can be composed to scale to thousands of GPUs. We propose a novel interleaved pipelining schedule that can improve throughput by 10+% with memory footprint comparable to existing approaches. Our approach allows us to perform training iterations on a model with 1 trillion parameters at 502 petaFLOP/s on 3072 GPUs (per-GPU throughput of 52% of theoretical peak).

[Figure 1 plot: number of parameters (in billions, log scale) versus year (2018–2021) for ELMo (94M), BERT-L (340M), GPT-2 (1.5B), Megatron-LM (8.3B), Turing-NLG (17.2B), and GPT-3 (175B).]

Figure 1: Trend of sizes of state-of-the-art Natural Language Processing (NLP) models with time. The number of floating-point operations to train these models is increasing at an exponential rate.

1 INTRODUCTION

Transformer-based language models [13, 27, 33–35, 42, 46] in Natural Language Processing (NLP) have driven rapid progress in recent years as computation at scale has become more available and datasets have become larger. Recent work [11, 40] has shown large language models to be effective zero- or few-shot learners, with high accuracy on many NLP tasks and datasets. These large language models have a number of exciting downstream applications such as client feedback summarization, automatic dialogue generation, semantic search, and code autocompletion [1, 4, 5]. As a result, the number of parameters in state-of-the-art NLP models has grown at an exponential rate (Figure 1). Training such models, however, is challenging for two reasons: (a) it is no longer possible to fit the parameters of these models in the main memory of even the largest GPU (NVIDIA recently released 80GB-A100 cards), and (b) even if we are able to fit the model in a single GPU (e.g., by swapping parameters between host and device memory [38]), the high number of compute operations required can result in unrealistically long training times (e.g., training GPT-3 with 175 billion parameters [11] would require approximately 288 years with a single V100 NVIDIA GPU). This calls for parallelism. Data-parallel scale-out usually works well, but suffers from two limitations: a) beyond a point, the per-GPU batch size becomes too small, reducing GPU utilization and increasing communication cost, and b) the maximum number of devices that can be used is the batch size, limiting the number of accelerators that can be used for training.

Various model parallelism techniques have been proposed to address these two challenges. For example, recent work [39, 40] has shown how tensor (intra-layer) model parallelism, where matrix multiplications within each transformer layer are split over multiple GPUs, can be used to overcome these limitations. Although this approach works well for models of sizes up to 20 billion parameters on NVIDIA DGX A100 servers (with 8 80GB-A100 GPUs), it breaks down for larger models. Larger models need to be split across multiple multi-GPU servers, which leads to two problems: (a) the all-reduce communication required for tensor parallelism needs to go through inter-server links, which are slower than the high-bandwidth NVLink [9] available within a multi-GPU server, and (b) a high degree of model parallelism can create small matrix multiplications (GEMMs), potentially decreasing GPU utilization.

Pipeline model parallelism [14, 20, 23, 29, 30, 45] is another technique to support the training of large models, where layers of a model are striped over multiple GPUs. A batch is split into smaller microbatches, and execution is pipelined across these microbatches. Layers can be assigned to workers in various ways, and various schedules for the forward and backward passes of inputs can be used. The layer assignment and scheduling strategy results in different performance tradeoffs. Regardless of schedule, to preserve strict optimizer semantics, optimizer steps need to be synchronized across devices, leading to a pipeline flush at the end of every batch, where microbatches are allowed to complete execution (and no new microbatches are injected). As much as 50% of time can be spent flushing the pipeline depending on the number of microbatches injected into the pipeline. The larger the ratio of number of microbatches to the pipeline size, the smaller the time spent in the pipeline flush. Therefore, to achieve high efficiency, a larger batch size is often necessary. In this work, we also introduce a new pipeline schedule that improves efficiency at small batch sizes.

Users can thus train their large models using various techniques, each with different tradeoffs. Moreover, these techniques can be combined. However, combining these techniques leads to non-trivial interactions, which need to be reasoned through carefully for good performance. In this paper, we address the following question: How should parallelism techniques be combined to maximize the training throughput of large models given a batch size while retaining strict optimizer semantics?
In particular, we show how to combine pipeline, tensor, and data parallelism, a technique we call PTD-P, to train large language models with good computational performance (52% of peak device throughput) on 1000s of GPUs. Our method leverages the combination of pipeline parallelism across multi-GPU servers, tensor parallelism within a multi-GPU server, and data parallelism, to practically train models with a trillion parameters with graceful scaling in an optimized cluster environment with high-bandwidth links between GPUs on the same server and across servers. We can use similar ideas to train larger models as well, given more training resources. In our experiments, we demonstrate close to linear scaling to 3072 A100 GPUs, with an achieved end-to-end training throughput of 163 teraFLOP/s per GPU (including communication, data processing, and optimization), and an aggregate throughput of 502 petaFLOP/s, on a GPT model [11] with a trillion parameters using mixed precision. This throughput facilitates practical training times: we estimate end-to-end training of this model to take ∼3 months. We believe this is the fastest training throughput achieved for this size of model: past systems [29, 40] cannot train such large models since they do not combine pipeline and tensor parallelism. We also compared to ZeRO [36], and found that our approach outperforms ZeRO-3 by 70% for models with 175 and 530 billion parameters due to less cross-node communication. These models are too large to fit on a multi-GPU server.

Achieving this throughput at scale required innovation and careful engineering along multiple axes: efficient kernel implementations that allowed most of the computation to be compute-bound as opposed to memory-bound, smart partitioning of computation graphs over the devices to reduce the number of bytes sent over network links while also limiting device idle periods, domain-specific communication optimization, and fast hardware (state-of-the-art GPUs and high-bandwidth links between GPUs on the same and different servers). We are hopeful that our open-sourced software (available at https://github.com/nvidia/megatron-lm) will enable other groups to train large NLP models efficiently at scale.

In addition, we studied the interaction between the various components affecting throughput, both empirically and analytically when possible. Based on these studies, we offer the following guiding principles on how to configure distributed training:

• Different forms of parallelism interact in non-trivial ways: the parallelization strategy has an impact on the amount of communication, the compute efficiency with which kernels are executed, as well as the idle time workers spend waiting for computation due to pipeline flushes (pipeline bubbles). For example, in our experiments, we found that sub-optimal combinations of tensor and pipeline model parallelism can lead to up to 2× lower throughput, even with high-bandwidth network links between servers; tensor model parallelism is effective within a multi-GPU server, but pipeline model parallelism must be used for larger models.
• The schedule used for pipeline parallelism has an impact on the amount of communication, the pipeline bubble size, and memory used to store activations. We propose a novel interleaved schedule that can improve throughput by as much as 10% compared to previously-proposed schedules [20, 30] with comparable memory footprint.
• Values of hyperparameters such as microbatch size have an impact on the memory footprint, the arithmetic efficiency of kernels executed on the worker, and the pipeline bubble size. In our experiments, the optimal value of the microbatch size is problem-dependent and can increase throughput by 15%.
• At scale, distributed training is communication-intensive. When training a trillion-parameter model on 3072 GPUs, our implementation used an effective bisection bandwidth of 892 GB/s for pipeline-parallel communication, and 13 TB/s for data-parallel communication. Using slower inter-node interconnects or more communication-intensive partitionings would hinder scaling performance.

We should note that we do not automatically explore the search space of parallelism strategies (such as FlexFlow [22], PipeDream [29], Tarnawski et al. [41], and DAPPLE [14]), but instead suggest heuristics (in §3) that we found work well in practice.

2 MODES OF PARALLELISM

In this section, we discuss the parallelism techniques that facilitate the efficient training of large models that do not fit in the memory of a single GPU. In this work, we combine pipeline model parallelism and tensor model parallelism (combination shown in Figure 2) with data parallelism. We call this PTD-P for short.

2.1 Data Parallelism

With data parallelism [25, 43], each worker has a copy of the full model, the input dataset is sharded, and workers aggregate their gradients periodically to ensure that all workers see a consistent version of the weights. For large models which do not fit on a single worker, data parallelism can be used on smaller model shards.

2.2 Pipeline Model Parallelism

With pipeline parallelism, the layers of a model are sharded across multiple devices. When used on models with the same transformer block repeated, each device can be assigned an equal number of transformer layers. We do not consider more asymmetric model architectures, where assignment of layers to pipeline stages is harder; we defer to related work [22, 29, 41] to solve this problem.

A batch is split into smaller microbatches; execution is then pipelined across microbatches. Pipelining schemes need to ensure that inputs see consistent weight versions across forward and backward passes for well-defined synchronous weight update semantics. Specifically, naive pipelining can lead to an input seeing weight updates in the backward pass not seen in the forward pass.

To retain strict optimizer semantics exactly, we introduce periodic pipeline flushes so that optimizer steps are synchronized across devices. At the start and end of every batch, devices are idle. We call this idle time the pipeline bubble, and want to make it as small as possible. Asynchronous and bounded-staleness approaches such as PipeMare, PipeDream, and PipeDream-2BW [23, 29, 30, 45] do away with flushes completely, but relax weight update semantics. We defer consideration of such schemes to future work.
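To make strict optimizer semantics concrete, the sketch below shows a single-device analogue of a pipeline flush (an illustrative assumption, not the actual multi-GPU implementation): a batch is split into microbatches, gradients accumulate across all of them, and the optimizer steps exactly once per batch — the synchronization point that a pipelined implementation realizes with a flush.

```python
import torch

# Minimal single-process sketch of strict optimizer semantics with microbatches.
# Gradients accumulate over every microbatch of a batch; the optimizer steps once
# at the end, which is the point a pipelined implementation must flush.
torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(32, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

batch = torch.randn(64, 32)          # global batch seen by one data-parallel rank
target = torch.randn(64, 1)
microbatch_size = 8
num_microbatches = batch.size(0) // microbatch_size

optimizer.zero_grad()
for x, y in zip(batch.split(microbatch_size), target.split(microbatch_size)):
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / num_microbatches).backward()   # accumulate gradients across microbatches
# All microbatches have completed (the "flush"); apply exactly one weight update.
optimizer.step()
```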

[Figure 2 diagram: two transformer layers; each layer is split into tensor MP partitions #1 and #2, and transformer layers #1 and #2 form pipeline MP partitions #1 and #2, respectively.]

Figure 2: Combination of tensor and pipeline model parallelism (MP) used in this work for transformer-based models.
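As a concrete illustration of how the dimensions in Figure 2 compose, the sketch below maps a global GPU rank to its pipeline-stage, data-parallel, and tensor-parallel coordinates for a given (p, t, d) with p·t·d = n. This is an illustrative helper, not the Megatron-LM API; keeping tensor-parallel ranks innermost (so that a tensor-parallel group stays within one multi-GPU server) is an assumption that mirrors the guidance given later in §3.

```python
def parallel_coords(rank: int, p: int, t: int, d: int):
    """Map a global rank in [0, p*t*d) to (pipeline_stage, data_parallel_rank,
    tensor_parallel_rank). Tensor-parallel ranks are innermost so that each
    tensor-parallel group of size t stays within one multi-GPU server."""
    assert 0 <= rank < p * t * d
    tensor_rank = rank % t
    data_rank = (rank // t) % d
    pipeline_stage = rank // (t * d)
    return pipeline_stage, data_rank, tensor_rank

if __name__ == "__main__":
    p, t, d = 2, 2, 2  # 8 GPUs split across pipeline, tensor, and data dimensions
    for rank in range(p * t * d):
        print(rank, parallel_coords(rank, p, t, d))
```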


Figure 3: GPipe pipeline schedule with forward passes (blue) for all microbatches (represented by numbers) followed by backward passes (green). The gray area represents the pipeline bubble. For simplicity, we assume that the backward pass takes twice as long as the forward pass. The efficiency of the pipeline schedule does not depend on this factor. Each batch in this example consists of 8 microbatches, and the numbers in each blue or green box are unique identifiers given to the corresponding microbatch (in particular, the first batch consists of microbatches 1 − 8, the second batch consists of microbatches 9 − 16, and so on). The optimizer is stepped and weight parameters updated at the pipeline flush to ensure strict optimizer semantics, leading to idle devices and a pipeline bubble.


Figure 4: Default and interleaved 1F1B pipeline schedules. The top figure shows the default non-interleaved 1F1B schedule. The bottom figure shows the interleaved 1F1B schedule, where each device is assigned multiple chunks (in this case, 2). Dark colors show the first chunk and light colors show the second chunk. The size of the pipeline bubble is smaller (the pipeline flush happens sooner in the interleaved timeline).

There are several possible ways of scheduling forward and backward microbatches across devices; each approach offers different tradeoffs between pipeline bubble size, communication, and memory footprint. We discuss two such approaches in this section.

2.2.1 Default Schedule. GPipe [20] proposes a schedule where the forward passes for all microbatches in a batch are first executed, followed by backward passes for all microbatches (shown in Figure 3). We can quantify the size of GPipe's pipeline bubble (t_pb). We denote the number of microbatches in a batch as m, the number of pipeline stages (number of devices used for pipeline parallelism) as p, the ideal time per iteration as t_id (assuming perfect or ideal scaling), and the time to execute a single microbatch's forward and backward pass as t_f and t_b. In this schedule, the pipeline bubble consists of p − 1 forward passes at the start of a batch, and p − 1 backward passes at the end. The total amount of time spent in the pipeline bubble is then t_pb = (p − 1) · (t_f + t_b). The ideal processing time for the batch is t_id = m · (t_f + t_b). Therefore, the fraction of ideal computation time spent in the pipeline bubble is:

Bubble time fraction (pipeline bubble size) = t_pb / t_id = (p − 1) / m.

For the bubble time fraction to be small, we thus need m ≫ p. However, for such large m, this approach has a high memory footprint as it requires stashed intermediate activations (or just input activations for each pipeline stage when using activation recomputation) to be kept in memory for all m microbatches through the lifetime of a training iteration.

Instead, we use the PipeDream-Flush schedule [30]. In this schedule, we first enter a warm-up phase where workers perform differing numbers of forward passes as shown in Figure 4 (top). This schedule limits the number of in-flight microbatches (the number of microbatches for which the backward pass is outstanding and activations need to be maintained) to the depth of the pipeline, instead of the number of microbatches in a batch. After the warm-up phase, each worker then enters a steady state, where workers perform one forward pass followed by one backward pass (1F1B for short). Finally, at the end of a batch, we complete backward passes for all remaining in-flight microbatches. The time spent in the bubble is the same for this new schedule, but the number of outstanding forward passes is at most the number of pipeline stages for the PipeDream-Flush schedule. As a result, this schedule requires activations to be stashed for p or fewer microbatches (compared to m microbatches for the GPipe schedule). Consequently, when m ≫ p, PipeDream-Flush is much more memory-efficient than GPipe.
2.2.2 Schedule with Interleaved Stages. To reduce the size of the pipeline bubble, each device can perform computation for multiple subsets of layers (called a model chunk), instead of a single contiguous set of layers. For example, if each device had 4 layers before (i.e., device 1 had layers 1 − 4, device 2 had layers 5 − 8, and so on), we could have each device perform computation for two model chunks (each with 2 layers), i.e., device 1 has layers 1, 2, 9, 10; device 2 has layers 3, 4, 11, 12; and so on. With this scheme, each device in the pipeline is assigned multiple pipeline stages (each pipeline stage has less computation compared to before).

As before, we can use an "all-forward, all-backward" version of this schedule, but this has a high memory footprint (proportional to m). Instead, we developed an interleaved schedule that adapts the memory-efficient 1F1B schedule from before. This new schedule is shown in Figure 4, and requires the number of microbatches in a batch to be an integer multiple of the degree of pipeline parallelism (number of devices in the pipeline). For example, with 4 devices, the number of microbatches in a batch must be a multiple of 4.

As shown in Figure 4, the pipeline flush for the same batch size happens sooner in the new schedule. If each device has v stages (or model chunks), then the forward and backward time for a microbatch for each stage or chunk will now be t_f / v and t_b / v. The pipeline bubble time thus reduces to t_pb^int. = (p − 1) · (t_f + t_b) / v, and the bubble time fraction is then:

Bubble time fraction (pipeline bubble size) = t_pb^int. / t_id = (1 / v) · (p − 1) / m.

This means that the new schedule reduces the bubble time by v. This reduced pipeline bubble size, however, does not come for free: this schedule requires extra communication. Quantitatively, the amount of communication also increases by v. In the next section, we discuss how we can utilize the 8 InfiniBand networking cards in a multi-GPU server (e.g., a DGX A100 node) to reduce the impact of this extra communication.
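The default and interleaved bubble fractions above can be sanity-checked in a few lines (illustrative sketch only):

```python
def bubble_fraction(p: int, m: int, v: int = 1) -> float:
    """Fraction of ideal compute time spent in the pipeline bubble.
    p: pipeline stages, m: microbatches per batch, v: model chunks per device
    (v = 1 recovers the non-interleaved bubble of (p - 1) / m)."""
    return (p - 1) / (v * m)

# With p = 4 devices and m = 8 microbatches (as in Figure 3), 37.5% of the ideal
# time is bubble; interleaving with v = 2 chunks per device halves that fraction.
print(bubble_fraction(p=4, m=8))        # 0.375
print(bubble_fraction(p=4, m=8, v=2))   # 0.1875
```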
2.3 Tensor Model Parallelism

With tensor model parallelism, individual layers of the model are partitioned over multiple devices. In this paper, we use the particular partitioning strategy used by Megatron [40] for transformer layers, the bedrock of language models. We can apply similar ideas to other types of models, like CNNs, as well. We briefly outline this strategy, illustrated in Figure 5, below.

A transformer layer consists of a self-attention block followed by a two-layer multi-layer perceptron (MLP). Further details of the transformer layer can be found in Vaswani et al. [42].

The MLP block consists of two GEMMs and a GeLU non-linearity:

Y = GeLU(XA),  Z = Dropout(YB).

We can split A along its columns, A = [A1, A2]. This partitioning allows the GeLU non-linearity to be independently applied to the output of each partitioned GEMM:

[Y1, Y2] = [GeLU(XA1), GeLU(XA2)].

This is advantageous as it removes the need for synchronization (needed if A is split along its rows, since GeLU is non-linear).

The second weight matrix B can then be split along its rows to remove the need for any communication between the GEMMs (shown in Figure 5a):

B = [B1; B2],  Y = [Y1, Y2],

so that YB = Y1 B1 + Y2 B2. The output of the second GEMM is then reduced across the GPUs before the dropout layer.

We exploit the inherent parallelism in the multi-head attention operation to partition the self-attention block (shown in Figure 5b). The key (K), query (Q), and value (V) matrices can be partitioned in a column-parallel fashion. The output linear layer can then directly operate on the partitioned output of the attention operation (weight matrix partitioned across rows).

This approach splits GEMMs in the MLP and self-attention blocks across GPUs while requiring only two all-reduce operations in the forward pass (g operator) and two all-reduces in the backward pass (f operator). We implemented f and g in a few lines of code.
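Below is a minimal sketch of the column-/row-parallel MLP split described above, using a torch.distributed all-reduce for the g operator in the forward pass. It is an illustration under simplifying assumptions (full weights materialized on every rank and sharded on the fly, gloo backend so it runs on CPUs), not the Megatron-LM implementation; launch with, e.g., `torchrun --nproc_per_node=2 tp_mlp_sketch.py`.

```python
# Sketch of the column-/row-parallel MLP partitioning (illustrative only).
import torch
import torch.distributed as dist
import torch.nn.functional as F

def tensor_parallel_mlp(x, a_full, b_full, tp_rank, tp_world):
    # Split A along columns and B along rows; each rank keeps one shard.
    a_shard = a_full.chunk(tp_world, dim=1)[tp_rank]   # [h, 4h/t]
    b_shard = b_full.chunk(tp_world, dim=0)[tp_rank]   # [4h/t, h]
    y_local = F.gelu(x @ a_shard)                      # GeLU applied independently per shard
    z_local = y_local @ b_shard                        # partial output Y_i B_i
    dist.all_reduce(z_local, op=dist.ReduceOp.SUM)     # the "g" operator in the forward pass
    return F.dropout(z_local, p=0.1, training=True)    # dropout after the reduction

if __name__ == "__main__":
    dist.init_process_group(backend="gloo")            # CPU-friendly backend for the sketch
    rank, world = dist.get_rank(), dist.get_world_size()
    torch.manual_seed(0)                               # same full weights on every rank
    h, s = 8, 4
    x = torch.randn(s, h)
    a_full = torch.randn(h, 4 * h)
    b_full = torch.randn(4 * h, h)
    z = tensor_parallel_mlp(x, a_full, b_full, rank, world)
    print(rank, z.shape)
    dist.destroy_process_group()
```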

[Figure 5 diagrams: (a) the MLP block and (b) the self-attention block, each split across two tensor-parallel GPUs with the f and g communication operators.]

Figure 5: Blocks of transformer model partitioned with tensor model parallelism (figures borrowed from Megatron [40]). f and g are conjugate. f is the identity operator in the forward pass and all-reduce in the backward pass, while g is the reverse.

[Figure 6 plot: pipeline bubble size versus data-parallel size d (1–64), for (n=32, b′=32), (n=32, b′=128), (n=128, b′=128), and (n=128, b′=512).]

Figure 6: Fraction of time spent idling due to pipeline flush (pipeline bubble size) versus data-parallel size (d), for different numbers of GPUs (n) and ratio of batch size to microbatch size (b′ = B/b).

3 PERFORMANCE ANALYSIS OF PARALLELIZATION CONFIGURATIONS

In this section, we consider the performance implications of combining pipeline and tensor model parallelism with data parallelism. Given a fixed budget of GPUs and batch size, one can use different degrees of the parallelism types in PTD-P to train models; each dimension exposes tradeoffs between memory footprint, device utilization, and amount of communication.

We discuss these tradeoffs in the rest of this section, and then show empirical results in §5.4. We present analytical models where relevant for the pipeline bubble size. We qualitatively describe how communication time behaves and present cost models for amount of communication; however, we do not present direct cost models for communication time, which is harder to model for a hierarchical network topology where interconnects between GPUs on the same server have higher bandwidth than interconnects between servers. To the best of our knowledge, this is the first work to analyze the performance interactions of these parallelization dimensions.

3.1 Notation

We use the following notation in this section:
• (p, t, d): Parallelization dimensions. p for the pipeline-model-parallel size, t for the tensor-model-parallel size, and d for the data-parallel size.
• n: Number of GPUs. We require p · t · d = n.
• B: Global batch size (provided as input).
• b: Microbatch size.
• m = (1/b) · (B/d): Number of microbatches in a batch per pipeline.

3.2 Tensor and Pipeline Model Parallelism

Tensor and pipeline model parallelism can both be used to partition a model's parameters over multiple GPUs. As stated earlier, using pipeline parallelism with periodic flushes results in a pipeline bubble of size (p − 1)/m. Let us assume that d = 1 (data-parallel size); consequently, t · p = n. The pipeline bubble size in terms of t is:

(p − 1)/m = (n/t − 1)/m.

As t increases, the pipeline bubble thus decreases for fixed B, b, and d (m = B/(b · d) is fixed as well).

The amount of communication performed between different GPUs is also affected by the values of p and t. Pipeline model parallelism features cheaper point-to-point communication. Tensor model parallelism, on the other hand, uses all-reduce communication (two all-reduce operations each in the forward and backward pass, see §2.3). With pipeline parallelism, the total amount of communication that needs to be performed between every pair of consecutive devices (for either the forward or backward pass) for each microbatch is bsh, where s is the sequence length and h is the hidden size. With tensor model parallelism, tensors of total size bsh need to be all-reduced among t model replicas twice each in the forward and backward pass for each layer, leading to a total communication of 8bsh · ((t − 1)/t) per layer per device for each microbatch. Each device typically has multiple layers; the total amount of tensor-parallel communication per device for each microbatch is then l^stage · 8bsh · ((t − 1)/t), where l^stage is the number of layers in a pipeline stage.

Consequently, we see that tensor model parallelism increases the amount of communication between devices. Thus, when t is larger than the number of GPUs in a single node, the overhead of performing tensor model parallelism across slower inter-node links can be impractical. We see these results empirically in §5.4.

Takeaway #1: When considering different forms of model parallelism, tensor model parallelism should generally be used up to degree g when using g-GPU servers, and then pipeline model parallelism can be used to scale up to larger models across servers.

3.3 Data and Model Parallelism

We also want to consider the interaction between data parallelism and the two types of model parallelism. In this section, we consider these interactions independently for simplicity.

3.3.1 Pipeline Model Parallelism. Let t = 1 (tensor-model-parallel size). The number of microbatches per pipeline is m = B/(d · b) = b′/d, where b′ := B/b. With total number of GPUs n, the number of pipeline stages is p = n/(t · d) = n/d. The pipeline bubble size is:

(p − 1)/m = (n/d − 1)/(b′/d) = (n − d)/b′.
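The bubble-size expressions in §3.2 and §3.3.1 are straightforward to evaluate; the small sketch below (illustrative only) reproduces the trend plotted in Figure 6.

```python
def bubble_vs_tensor_parallel(n: int, t: int, m: int) -> float:
    """Pipeline bubble fraction with d = 1, so p = n / t (Section 3.2)."""
    p = n // t
    return (p - 1) / m

def bubble_vs_data_parallel(n: int, d: int, b_prime: int) -> float:
    """Pipeline bubble fraction with t = 1, so p = n / d and m = b' / d (Section 3.3.1)."""
    return (n - d) / b_prime

# Figure 6 trend: for n = 32 GPUs and b' = 128, the bubble shrinks as the
# data-parallel size d grows.
for d in (1, 2, 4, 8, 16, 32):
    print(d, bubble_vs_data_parallel(n=32, d=d, b_prime=128))
print(bubble_vs_tensor_parallel(n=64, t=8, m=16))  # 0.4375
```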

[Figures 7 and 8 plots: achieved teraFLOP/s per GPU and normalized estimated throughput versus microbatch size (1–16), for batch sizes 128 and 512.]

Figure 7: Per-GPU throughput versus microbatch size for a GPT model with a billion parameters (128 attention heads, hidden size of 4096, 4 transformer layers).

Figure 8: Behavior of normalized estimated throughput (time computed as t = (b′/b + p − 1) · (t_f(b) + t_b(b))) with respect to the microbatch size b for the same GPT model from Figure 7.
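Equation (1), restated in Figure 8's caption, can be evaluated directly to pick a microbatch size. The sketch below uses hypothetical t_f(b) and t_b(b) timing functions (assumptions for illustration; in practice they would be measured on the target hardware).

```python
def batch_time(b, b_prime, p, t_f, t_b):
    """Equation (1): time to process a batch, ignoring communication cost."""
    m = b_prime / b                        # microbatches per pipeline
    return (m + p - 1) * (t_f(b) + t_b(b))

# Hypothetical per-microbatch timings: a fixed kernel-launch overhead plus a term
# linear in the microbatch size (real values would come from profiling).
t_f = lambda b: 1.0 + 0.45 * b
t_b = lambda b: 2.0 + 0.90 * b

b_prime, p = 128, 8
best = min((batch_time(b, b_prime, p, t_f, t_b), b) for b in (1, 2, 4, 8, 16))
print("best microbatch size:", best[1])
```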

As d becomes larger, n − d becomes smaller, and thus the pipeline bubble becomes smaller. Figure 6 shows the behavior of the pipeline bubble size for various values of d, n, and b′. It might not be possible to increase d all the way to n for all models, since a model's full training memory footprint might be larger than the memory capacity of a single accelerator.

Overall throughput will thus increase if the all-reduce communication needed for data parallelism does not drastically increase with higher d, which should hold since the communication time for a ring-based implementation scales with (d − 1)/d = 1 − 1/d.

We can also analyze the impact of increasing the batch size B. For a given parallel configuration, as the batch size B increases, b′ = B/b increases, (n − d)/b′ decreases, consequently increasing throughput. All-reduce communication required by data parallelism also becomes more infrequent, further increasing throughput.

3.3.2 Data and Tensor Model Parallelism. With tensor model parallelism, all-reduce communication needs to be performed for every microbatch. This can be expensive across multi-GPU servers. On the other hand, data parallelism only needs to perform expensive all-reduce communication once per batch. Moreover, with tensor model parallelism, each model-parallel rank performs a subset of the computation in each model layer, and thus for insufficiently-large layers, modern GPUs might not perform these sub-matrix computations with peak efficiency.

Takeaway #2: When using data and model parallelism, a total model-parallel size of M = t · p should be used so that the model's parameters and intermediate metadata fit in GPU memory; data parallelism can be used to scale up training to more GPUs.

3.4 Microbatch Size

The choice of the microbatch size b also affects model-training throughput. For example, we see in Figure 7 that per-GPU throughput increases by up to 1.3× with a larger microbatch size on a single GPU. We now want to determine the optimal microbatch size b given a parallel configuration (p, t, d) and batch size B. The amount of data-parallel communication will be the same regardless of the microbatch size. Given functions t_f(b) and t_b(b) that map the microbatch size to the forward and backward computation times for a single microbatch, the total time spent computing a batch, ignoring communication cost, is (as before, define b′ as B/d):

(b′/b + p − 1) · (t_f(b) + t_b(b)).   (1)

The microbatch size thus affects both the arithmetic intensity of operations as well as the pipeline bubble size (by affecting m). Figure 8 shows estimated throughput (equation (1) used to estimate processing time) for a GPT model with a billion parameters and (p, t) = (8, 8). The optimal b for both batch sizes is 4.

Takeaway #3: The optimal microbatch size b depends on the throughput and memory footprint characteristics of the model, as well as the pipeline depth p, data-parallel size d, and batch size B.

3.5 Activation Recomputation

Activation recomputation [12, 18, 20, 21] is an optional technique that trades off an increase in the number of compute operations performed for additional memory footprint, by running the forward pass a second time just before the backward pass (and stashing only the input activations for a given pipeline stage, as opposed to the entire set of intermediate activations, which is much larger). Activation recomputation is required to train reasonably large models with pipeline parallelism to keep memory footprint acceptably low. Previous work like PipeDream-2BW [30] has looked at the performance ramifications of activation recomputation.

The number of activation checkpoints does not impact throughput, but impacts memory footprint. Let A_input be the size of the input activations of a layer, and A_intermediate be the size of intermediate activations per layer. If a model stage has l layers, and if c is the number of checkpoints, the total memory footprint is going to be c · A_input + (l/c) · A_intermediate. The minimum value of this function is obtained when c = sqrt(l · A_intermediate / A_input). In practice, we measure A_intermediate empirically. For most cases, checkpointing every 1 or 2 transformer layers is optimal.

Other techniques such as activation partitioning [36] can also be used in conjunction with tensor model parallelism to reduce the memory footprint due to activations further.
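The checkpointing trade-off above is easy to evaluate numerically; a small sketch with hypothetical activation sizes (A_intermediate would be measured empirically, as noted):

```python
import math

def checkpoint_memory(c: int, l: int, a_input: float, a_intermediate: float) -> float:
    """Total activation memory for a stage with l layers and c checkpoints:
    c * A_input + (l / c) * A_intermediate."""
    return c * a_input + (l / c) * a_intermediate

def best_num_checkpoints(l: int, a_input: float, a_intermediate: float) -> int:
    c_star = math.sqrt(l * a_intermediate / a_input)   # continuous minimizer
    candidates = {max(1, math.floor(c_star)), min(l, math.ceil(c_star))}
    return min(candidates, key=lambda c: checkpoint_memory(c, l, a_input, a_intermediate))

# Hypothetical per-layer sizes, in arbitrary units.
print(best_num_checkpoints(l=16, a_input=1.0, a_intermediate=4.0))  # 8
```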
4 IMPLEMENTATION

We implemented PTD-P as an extension to the Megatron-LM codebase. Our implementation is built using PyTorch [32]. We use NCCL [7] for communication between devices. To obtain good performance, we implemented optimizations targeting both communication and computation, which we outline below.

Infiniband Scatter of All-gather of [푠,푏, 푎, ℎ], where 푏, 푠, 푎, and ℎ are batch, sequence, attention-head, and hidden-size dimensions, respectively. Second, we generated 1 3 1 3 NVLink fused kernels for a sequence of element-wise operations (bias + 2 4 2 4 GeLU and bias + dropout + add) using PyTorch JIT [10]. Third, we created two custom kernels to enable the fusion of scale, mask, and (a) W/o scatter/gather optimization. (b) With scatter/gather optimization. softmax (reduction) operations: one to support general masking Figure 9: Scatter/gather communication optimization. Light blue (used in models such as BERT) and another to support implicit blocks are layers in the first pipeline stage, and dark blue blocks causal masking (used in auto-regressive models such as GPT). We are layers in the second pipeline stage. Without the scatter/gather quantify the effect of these optimizations in the next section. optimization, the same tensor is sent redundantly over inter-node InfiniBand links. Instead, at the sender, we can scatter thetensor 5 EVALUATION into smaller chunks, reducing the sizes of tensors sent over Infini- Band links. The final tensor can then be rematerialized at there- In this section, we seek to answer the following questions: ceiver using a gather operation. • How well does PTD-P perform? Does it result in realistic end-to-end training times? 4.1 Communication Optimizations • How well does pipeline parallelism scale for a given model and batch size? How much impact does the interleaved sched- When using pipeline parallelism, we want to send and receive ten- ule have on performance? sors in the forward and backward direction in parallel. Each DGX • How do different parallelization dimensions interact with A100 is equipped with 8 InfiniBand (IB) networking cards. Unfor- each other? What is the impact of hyperparameters such as tunately, sends and receives are point-to-point, and only happen microbatch size? between a pair of GPUs on two servers, making it hard to leverage • What is the impact of the scatter-gather communication all 8 cards for a single communication call within the pipeline. optimization? What types of limits do we put on hardware However, we can leverage the fact that we use both tensor model when running training iterations at scale? parallelism and pipeline model parallelism to reduce the overhead of cross-node communication. In particular, we note that the output All of our results are run with mixed precision on the Selene of each transformer layer is replicated (after 푔 in MLP block, see supercomputer [8]. Each cluster node has 8 NVIDIA 80-GB A100 Figure 5a) across the tensor-parallel ranks. As a result, ranks in two GPUs [6], connected to each other by NVLink and NVSwitch [9]. consecutive pipeline stages that are performing tensor model par- Each node has eight NVIDIA Mellanox 200Gbps HDR Infiniband allelism send and receive the exact same set of tensors (Figure 9a). HCAs for application communication, with an additional two HCAs For large enough models, we use a tensor-model-parallel size per node for dedicated storage. The nodes are connected in a three- of 8. This means we are sending the same set of tensors 8 times level (leaf, spine, core) fat-tree topology with 850 switches. This between corresponding GPUs on adjacent multi-GPU servers. To topology allows efficient all-reduce communication (dominant com- reduce this redundancy, we can instead split the tensor on the send munication pattern in deep learning training). 
The cluster uses an side into equal-sized chunks, and then only send one chunk to all-NVME shared parallel filesystem for high-performance data ac- the corresponding rank on the next node using the rank’s own cess and storage. The peak device throughput of an A100 GPU with InfiniBand card (e.g., rank 1 sends to rank 3 and rank 2sendsto 16-bit precision is 312 teraFLOP/s. For most of our results, we report rank 4 in Figure 9). With 8 tensor-model-parallel ranks, each chunk throughput per GPU. Aggregate throughput can be computed by would be one-eighth smaller. Then, on the receive side, we can multiplying with the number of GPUs used. perform an all-gather over NVLink, which is much faster than the For our experiments, we use GPT models of appropriate sizes. In InfiniBand interconnect, to re-materialize the full tensor. This is particular, for any given microbenchmark, the model needs to fit on shown in Figure 9b. We call this the scatter/gather communication the number of model-parallel GPUs used in the experiment. We use optimization. This optimization helps better leverage the multiple standard model architectures such as GPT-3 [11] when appropriate. IB cards on the DGX A100 servers, and makes more communication- intensive schedules such as the interleaved one feasible. 5.1 End-to-End Performance Quantitatively, with the scatter-gather communication optimiza- We consider the end-to-end performance of our system on GPT tion, the total amount of communication that needs to be performed 푏푠ℎ models ranging from a billion to a trillion parameters, using ten- between every pair of consecutive stages is reduced to 푡 , where sor, pipeline, and data parallelism (degrees picked using heuristics 푡 is the tensor-model-parallel size, 푠 is the sequence length, and ℎ described in §3). In particular, we use the interleaved pipeline sched- is the hidden size (푡 = 8 in our experiments). ule with the scatter/gather optimization enabled. All models use a vocabulary size (denoted by 푉 ) of 51,200 (multiple of 1024) and a 4.2 Computation Optimizations sequence length (푠) of 2048. We vary hidden size (ℎ), number of at- We implemented three model-specific optimizations to the compu- tention heads, and number of layers (푙). The number of parameters tation graph to attain high performance. First, we changed the data in a model, 푃, can be computed as: layout in the transformer layer to avoid memory-intensive trans- pose operations, and to enable the use of strided batched GEMM  13 푉 + 푠  푃 = 12푙ℎ2 1 + + . (2) kernels. Specifically, we changed the data layout from [푏, 푠, 푎, ℎ] to 12ℎ 12푙ℎ Number of Achieved Percentage of Achieved Attention Hidden Number Tensor model- Pipeline model- Number Batch parameters teraFlOP/s theoretical aggregate heads size of layers parallel size parallel size of GPUs size (billion) per GPU peak FLOP/s petaFLOP/s 1.7 24 2304 24 1 1 32 512 137 44% 4.4 3.6 32 3072 30 2 1 64 512 138 44% 8.8 7.5 32 4096 36 4 1 128 512 142 46% 18.2 18.4 48 6144 40 8 1 256 1024 135 43% 34.6 39.1 64 8192 48 8 2 512 1536 138 44% 70.8 76.1 80 10240 60 8 4 1024 1792 140 45% 143.8 145.6 96 12288 80 8 8 1536 2304 148 47% 227.1 310.1 128 16384 96 8 16 1920 2160 155 50% 297.4 529.6 128 20480 105 8 35 2520 2520 163 52% 410.2 1008.0 160 25600 128 8 64 3072 3072 163 52% 502.0

Table 1: Weak-scaling throughput for GPT models ranging from 1 billion to 1 trillion parameters.
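The parameter counts in Table 1 follow directly from equation (2); a quick numeric check (illustrative sketch, using the vocabulary size and sequence length stated above):

```python
def num_parameters(l: int, h: int, V: int = 51200, s: int = 2048) -> float:
    """Equation (2): P = 12*l*h^2 * (1 + 13/(12h) + (V + s)/(12*l*h))."""
    return 12 * l * h**2 * (1 + 13 / (12 * h) + (V + s) / (12 * l * h))

# Largest configuration in Table 1: 128 layers, hidden size 25600.
print(round(num_parameters(l=128, h=25600) / 1e9, 1))  # ~1008.0 billion parameters
```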

As the model size increases, we also increase the batch size (퐵) and ZeRO-3, 175B PTD-P, 175B the number of GPUs (푛). The majority of floating-point operations ZeRO-3, 530B PTD-P, 530B in the model are performed in the matrix multiplications (GEMMs) 200 in the transformer and logit layers. Considering just these GEMMs, 150 the number of FLOPs per iteration is (more details in the Appendix):  푠 푉  100 = 2 + +

퐹 96퐵푠푙ℎ 1 . (3) per GPU 6ℎ 16푙ℎ 50 This is a lower bound for the true FLOP count but should be close

Achieved teraFLOP/s 0 to the actual value. We count a FLOP as a floating-point operation 768 1152 1536 1920 regardless of precision. We also note that equation (3) assumes Number of GPUs activation recomputation and takes into account the floating-point operations associated with the extra forward pass. Figure 10: Throughput per GPU of PTD-P and ZeRO-3 for two differ- ent GPT models (the 175B GPT-3 model is shown with dotted lines, Table 1 shows the model configurations along with the achieved and the 530B model is shown with solid lines). Global batch sizes FLOP/s (both per GPU and aggregate over all GPUs). We see super- are fixed and ZeRO-3 is used without any model parallelism. linear scaling to 3072 A100 GPUs (384 DGX A100 nodes), since GPU utilization improves as the models get larger (larger matrix 5.2 Comparison to ZeRO-3 multiplications) without significant increase in the communication We compare PTD-P to ZeRO-3 [36, 37] in Table 2 and Figure 10 for time relative to computation time. Note that throughput is measured the standard GPT-3 model architecture, as well as the 530-billion- for end-to-end training, i.e., includes all operations including data parameter model from Table 1. The results provide a point of com- loading, optimizer steps, communication, and logging. We achieve parison to a method that does not use model parallelism. We in- 52% of peak device throughput for the largest model, and 44% of tegrated ZeRO into our codebase using the DeepSpeed Python peak device throughput for the smallest model. library [3]. We keep the global batch size the same as we increase Training Time Estimates. Given these throughputs, we can the number of GPUs. With fewer GPUs and a microbatch size of 4, also estimate the total amount of time needed for end-to-end train- PTD-P results in 6% and 24% higher throughput for the 175- and ing on 푇 tokens. Training requires 퐼 = 푇 /(퐵 · 푠) iterations. Using 530-billion-parameter models respectively. As we increase the num- the value of 퐹 from equation (3) and empirical end-to-end through- ber of GPUs, PTD-P scales more gracefully than ZeRO-3 in isolation puts from Table 1 (denoted by 푋), we can estimate total training (see Figure 10). For example, by doubling the number of GPUs (keep- time. We note that for the configurations in Table 1, we have 6ℎ ≫ 푠, ing the batch size the same), PTD-P outperforms ZeRO-3 by 70% 16푙ℎ ≫ (푉 + 푠), and 12푙ℎ ≫ 푉 . Combining these observations with for both models due to less cross-node communication. We note equations (2) and (3), we arrive at that we have only considered ZeRO-3 without tensor parallelism. 8푇푃 End-to-end training time ≈ . (4) ZeRO-3 can be combined with model parallelism to potentially 푛푋 improve its scaling behavior. Let us consider the GPT-3 model with 푃 =175 billion parameters as an example. This model was trained on 푇 = 300 billion tokens. On 5.3 Pipeline Parallelism 푛 = 1024 A100 GPUs using batch size 1536, we achieve 푋 = 140 ter- We now evaluate the weak-scaling performance of pipeline paral- aFLOP/s per GPU. As a result, the time required to train this model lelism in isolation, and also compare the performance of the non- is 34 days. For the 1 trillion parameter model, we assume that 450 interleaved schedule to the interleaved schedule. billion tokens are needed for end-to-end training. With 3072 A100 GPUs, we can achieve a per-GPU throughput of 163 teraFLOP/s, 5.3.1 Weak Scaling. We evaluate the scaling of the default non- and end-to-end training time of 84 days. 
We believe these training interleaved pipeline-parallel schedule using a weak scaling setup, times (using a reasonable number of GPUs) are practical. a GPT model with 128 attention heads and a hidden size of 20480, Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

| Scheme | Number of parameters (billion) | Model-parallel size | Batch size | Number of GPUs | Microbatch size | Achieved teraFLOP/s per GPU | Training time for 300B tokens (days) |
|---|---|---|---|---|---|---|---|
| ZeRO-3 without Model Parallelism | 174.6 | 1 | 1536 | 384 | 4 | 144 | 90 |
| | 174.6 | 1 | 1536 | 768 | 2 | 88 | 74 |
| | 174.6 | 1 | 1536 | 1536 | 1 | 44 | 74 |
| | 529.6 | 1 | 2560* | 640 | 4 | 138 | 169 |
| | 529.6 | 1 | 2240 | 1120 | 2 | 98 | 137 |
| | 529.6 | 1 | 2240 | 2240 | 1 | 48 | 140 |
| PTD Parallelism | 174.6 | 96 | 1536 | 384 | 1 | 153 | 84 |
| | 174.6 | 96 | 1536 | 768 | 1 | 149 | 43 |
| | 174.6 | 96 | 1536 | 1536 | 1 | 141 | 23 |
| | 529.6 | 280 | 2240 | 560 | 1 | 171 | 156 |
| | 529.6 | 280 | 2240 | 1120 | 1 | 167 | 80 |
| | 529.6 | 280 | 2240 | 2240 | 1 | 159 | 42 |

Table 2: Comparison of PTD Parallelism to ZeRO-3 (without model parallelism). The 530-billion-parameter GPT model did not fit on 560 GPUs when using a microbatch size of 4 with ZeRO-3, so we increased the number of GPUs used to 640 and global batch size to 2560 to provide a throughput estimate (relevant row marked in table with a *).
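The training-time column in Table 2 follows from equation (4); a quick check for the first PTD-P row (illustrative sketch):

```python
def training_days(P: float, T: float, n: int, X: float) -> float:
    """Equation (4): end-to-end training time ~ 8 * T * P / (n * X), converted to days."""
    return 8 * T * P / (n * X) / 86400

# GPT-3-sized model from Table 2: P = 174.6B parameters, T = 300B tokens,
# 384 GPUs at 153 teraFLOP/s each -> ~83 days, close to the 84 days reported.
print(round(training_days(P=174.6e9, T=300e9, n=384, X=153e12)))
```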

[Figure 11 plot: batch sizes 8 and 128, pipeline-parallel sizes 1–8. Figure 13 plot: batch sizes 32 and 128, (pipeline-parallel size, tensor-parallel size) from (2, 32) to (32, 2).]

Figure 11: Throughput per GPU of pipeline parallelism using two different batch sizes in a weak-scaling experiment setup (model size increases with the pipeline-parallel size).

Figure 13: Throughput per GPU of various parallel configurations that combine pipeline and tensor model parallelism using a GPT model with 162.2 billion parameters and 64 A100 GPUs.

150 5.3.2 Interleaved versus Non-Interleaved Schedule. Figure 12 shows the per-GPU-throughput for interleaved and non-interleaved sched- 125 ules on the GPT-3 [11] model with 175 billion parameters (96 100 layers, 96 attention heads, hidden size of 12288). The interleaved

per GPU Non-interleaved schedule with the scatter/gather communication optimization has 75 Interleaved higher computational performance than the non-interleaved (de- fault) schedule. This gap closes as the batch size increases due to

Achieved teraFLOP/s 50 12 24 36 48 60 two reasons: (a) as the batch size increases, the bubble size in the Batch size default schedule decreases, and (b) the amount of point-to-point Figure 12: Throughput per GPU of interleaved and non-interleaved communication within the pipeline is proportional to the batch size, schedules for a GPT model (175 billion parameters) on 96 GPUs. and consequently the non-interleaved schedule catches up as the amount of communication increases (the interleaved schedule fea- tures more communication per sample). Without the scatter/gather and a microbatch size of 1. As we increase the number of pipeline optimization, the default schedule performs better than the inter- stages, we also increase the size of the model by proportionally leaved schedule at larger batch sizes (not shown). increasing the number of layers in the model, e.g., with a pipeline- parallel size of 1, we use a model with 3 transformer layers and 15 5.4 Comparison of Parallel Configurations billion parameters, and with a pipeline-parallel size of 8, we use a In this sub-section, we show the various tradeoffs associated with model with 24 transformer layers and 121 billion parameters. We combining different parallelization dimensions. In particular, we use a tensor-parallel size of 8 for all configurations, and vary the show the performance for parallel configurations using the same total number of A100 GPUs used from 8 to 64. Figure 11 shows number of GPUs for a given model and multiple batch sizes. throughput per GPU for two different batch sizes to illustrate the 푝−1 impact of the pipeline bubble, which behaves as 푚 (§2.2.1). As 5.4.1 Tensor versus Pipeline Parallelism. We evaluate the impact of expected, the higher batch size scales better since the pipeline pipeline and tensor model parallelism on performance for a given bubble is amortized over more microbatches. model and batch size. The empirical results in Figure 13 show the 200 (used by PipeDream [30] and others) in isolation can match the Batch size = 32 performance of using both techniques in conjunction. 150 Batch size = 512 100 5.4.2 Pipeline versus Data Parallelism. We evaluate the impact of per GPU 50 data and pipeline model parallelism on performance for a GPT model with 5.9 billion parameters (32 transformer layers, 32 at-

Achieved teraFLOP/s 0 (2, 32) (4, 16) (8, 8) (16, 4) (32, 2) tention heads, hidden size of 3840) in Figure 14. We use a smaller (Pipeline-parallel size, Data-parallel size) model than before since we want to show performance for models that fit when the model-parallel size is only 2. For simplicity, we Figure 14: Throughput per GPU of various parallel configurations keep the microbatch size equal to 1 in these experiments. We see that combine data and pipeline model parallelism using a GPT that for each batch size, the throughput decreases as the pipeline- model with 5.9 billion parameters, three different batch sizes, mi- crobatch size of 1, and 64 A100 GPUs. parallel size increases, matching our analytical model from §3.3. Pipeline model parallelism should be used primarily to support the 200 training of large models that do not fit on a single worker, and data Batch size = 32 parallelism should be used to scale up training. 150 Batch size = 128 100 Batch size = 512 5.4.3 Tensor versus Data Parallelism. We also evaluate the impact per GPU 50 of data and tensor model parallelism on performance for the same GPT model with 5.9 billion parameters in Figure 15 (smaller model

Achieved teraFLOP/s 0 (2, 32) (4, 16) (8, 8) (16, 4) (32, 2) used for same reason as above). As before, we keep the microbatch (Tensor-parallel size, Data-parallel size) size equal to 1 initially. With larger batch sizes and a microbatch size of 1, data-parallel communication is infrequent; the all-to-all Figure 15: Throughput per GPU of various parallel configurations communication required in tensor model parallelism needs to be that combine data and tensor model parallelism using a GPT model performed for every microbatch in a batch. This all-to-all communi- with 5.9 billion parameters, three different batch sizes, microbatch size of 1, and 64 A100 GPUs. cation with tensor model parallelism dominates end-to-end training time, especially when communication needs to be performed across 200 multi-GPU nodes. Additionally, as the tensor-model-parallel size increases, we perform smaller matrix multiplications on every GPU, 150 decreasing utilization on each GPU. 100 We should note that although data parallelism can lead to effi-

per GPU Batch size = 128 50 cient scaling, we cannot use data parallelism in isolation for very Batch size = 512 large models with a limited training batch size because of a) insuffi-

Achieved teraFLOP/s 0 1 2 4 8 cient memory capacity, and b) scaling limitations of data parallelism Microbatch size (e.g., GPT-3 was trained to convergence with a batch size of 1536. Data parallelism thus supports parallelization to only 1536 GPUs; Figure 16: Throughput per GPU of a (푡, 푝) = (8, 8) parallel configura- tion for different microbatch sizes on a GPT model with 91 billion however, roughly 10, 000 GPUs were used to train this model in a parameters, for two different batch sizes using 64 A100 GPUs. reasonable amount of time). importance of using both tensor and pipeline model parallelism in conjunction to train a 161-billion-parameter GPT model (32 trans- 5.5 Microbatch Size former layers to support pipeline-parallel size of 32, 128 attention We evaluate the impact of the microbatch size on the performance heads, hidden size of 20480) with low communication overhead and of parallel configurations that combine pipeline and tensor model high compute resource utilization. We observe that tensor model parallelism in Figure 16 for a model with 91 billion parameters parallelism is best within a node (DGX A100 server) due to its expen- ((푡, 푝) = (8, 8)). We see that the best microbatch size is 2 for this sive all-reduce communication. Pipeline model parallelism, on the model; the optimal microbatch size is different for other models (not other hand, uses much cheaper point-to-point communication that shown in Figure) and model-dependent. For a given batch size, in- can be performed across nodes without bottlenecking the entire creasing the microbatch size decreases the number of microbatches computation. However, with pipeline parallelism, significant time in the pipeline (푚), leading to a larger pipeline bubble; however, can be spent in the pipeline bubble: the total number of pipeline increasing the microbatch size can also improve GPU utilization stages should thus be limited so that the number of microbatches by increasing the arithmetic intensity of executed kernels. These in the pipeline is a reasonable multiple of the number of pipeline two factors are at odds with each other, which makes the choice stages. Consequently, we see peak performance when the tensor- of optimal microbatch size challenging. Our analytical model from parallel size is equal to the number of GPUs in a single node (8 with §3.3 reasonably approximates true performance, and can be used DGX A100 nodes). This result indicates that neither tensor model as a proxy to determine how to pick this hyperparameter value for parallelism (used by Megatron [40]) nor pipeline model parallelism various training configurations and models. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

10.0 5.9 Inter-Node Communication Bandwidth Act. recomputation Our strong results are a byproduct of using an optimized software 7.5 W/o act. recomputation and hardware stack together. In particular, we take advantage of the 5.0 high-bandwidth communication links between GPUs on the same server and across servers. On the trillion-parameter model with

Throughput 2.5 3072 GPUs, we observed that the effective bisection bandwidth of

(sequences/second) 0.0 point-to-point communication among pipeline stages is 892 GB/s, 1 2 4 8 16 32 64 128 256 Batch size while the effective bisection bandwidth of all-reduce operations among data-parallel replicas is 12.9 TB/s. A less-optimized parti- Figure 17: Throughput (in sequences per second) with and without tioning of operators across devices would lead to more inter-node activation recomputation for a GPT model with 145 billion param- communication, hampering scaling performance. eters using 128 A100 GPUs ((푡, 푝) = (8, 16)). 5.10 Checkpoint Loading and Saving 150 An important practical consideration for the training of large mod- 125 els is loading and saving model checkpoints, which are especially large for the models considered in this paper. For example, the 100 trillion-parameter model has a checkpoint of size 13.8 terabytes.

per GPU Unoptimized The initial load of checkpoints for the trillion-parameter model by 75 Scatter/gather optimization all 384 nodes (3072 GPUs) reaches a peak read bandwidth of 1TB/s,

Achieved teraFLOP/s 50 the maximum read throughput possible from the parallel filesystem. 12 24 36 48 60 Batch size Checkpoint saves reach 40% of peak write bandwidth (273 GB/s). Figure 18: Throughput per GPU with and without the scatter/gather 6 RELATED WORK optimization for a GPT model with 175 billion parameters using 96 A100 GPUs and the interleaved schedule. In this section, we discuss other techniques to train models at scale. Parallelism for Large Models. Pipeline model parallelism is a com- mon technique used to train large models. Pipeline parallelism 5.6 Activation Recomputation comes in a few flavors: the mode discussed in this paper uses flushes Figure 17 shows throughput with and without activation recompu- to ensure strict optimizer semantics. TeraPipe [26] exposes fine- tation for a GPT model with 145 billion parameters (80 transformer grained pipeline parallelism across tokens in a single training se- layers, 96 attention heads, hidden size of 12288) using 128 A100 quence for auto-regressive models like GPT. PipeTransformer [19] GPUs, (푡, 푝) = (8, 16), and a range of batch sizes. For small batch elastically adjusts the degree of pipelining and data parallelism sizes, activation recomputation leads to up to 33% lower throughput by freezing layers with “stable” weights, and instead dedicates re- (in sequences per second) due to the extra forward pass that needs sources to train the remaining “active” layers. HetPipe [31] uses a to be executed during the backward pass. However, activation re- combination of pipeline and data parallelism on a set of heteroge- computation is needed to support larger batch sizes. Throughput at neous accelerators. Pipeline parallelism can also be implemented large batch sizes with activation recomputation is up to 2× higher with relaxed semantics: PipeDream-2BW [30] maintains two weight than the best throughput achieved without activation recomputa- versions and guarantees 1-stale weight updates without expen- tion (for a smaller batch size) due to a smaller pipeline bubble. sive flushes, while PipeMare45 [ ] and Kosson et al. [23] use asyn- choronous pipeline parallelism. These techniques have improved 5.7 Scatter-Gather Optimization throughput compared to the techniques with pipeline flushes con- sidered in this paper, but potentially at the cost of convergence rate Figure 18 shows per-GPU-throughput with and without (unop- or final accuracy. Moreover, pipeline parallelism in isolation can timized) the scatter/gather communication optimization for the still only scale to a number of devices equal to the number of layers GPT-3 model with 175 billion parameters. We see an improvement in the model, which is limiting for certain model architectures. of up to 11% in throughput for communication-intensive sched- PipeDream [29] combined pipeline parallelism and data paral- ules (large batch size with interleaving) by reducing the amount of lelism in a principled way to reduce cross-device communication. communication over cross-node links. DeepSpeed [2] combined pipeline parallelism with tensor and data parallelism to train models with up to a trillion parameters, but 5.8 Fused Operators with lower throughput than what was shown in this paper (52% We also evaluate the performance impact of operator fusion de- vs. 36% of peak) for a few reasons: operator fusion to keep most of scribed in §4.2. 
6 RELATED WORK
In this section, we discuss other techniques to train models at scale.

Parallelism for Large Models. Pipeline model parallelism is a common technique used to train large models. Pipeline parallelism comes in a few flavors: the mode discussed in this paper uses flushes to ensure strict optimizer semantics. TeraPipe [26] exposes fine-grained pipeline parallelism across tokens in a single training sequence for auto-regressive models like GPT. PipeTransformer [19] elastically adjusts the degree of pipelining and data parallelism by freezing layers with "stable" weights, and instead dedicates resources to train the remaining "active" layers. HetPipe [31] uses a combination of pipeline and data parallelism on a set of heterogeneous accelerators. Pipeline parallelism can also be implemented with relaxed semantics: PipeDream-2BW [30] maintains two weight versions and guarantees 1-stale weight updates without expensive flushes, while PipeMare [45] and Kosson et al. [23] use asynchronous pipeline parallelism. These techniques have improved throughput compared to the techniques with pipeline flushes considered in this paper, but potentially at the cost of convergence rate or final accuracy. Moreover, pipeline parallelism in isolation can still only scale to a number of devices equal to the number of layers in the model, which is limiting for certain model architectures.

PipeDream [29] combined pipeline parallelism and data parallelism in a principled way to reduce cross-device communication. DeepSpeed [2] combined pipeline parallelism with tensor and data parallelism to train models with up to a trillion parameters, but with lower throughput than what was shown in this paper (52% vs. 36% of peak) for a few reasons: operator fusion to keep most of the operator graph compute-bound, a more efficient pipeline parallelism schedule to minimize the pipeline bubble size, fast hardware (A100 vs. V100 GPUs and high-bandwidth links between GPUs on the same and different servers), and scaling to more GPUs. We want to emphasize that this higher throughput makes estimated training times much more practical (about 3 months); an aggregate throughput of 37.6 petaFLOP/s would take about 40 months to train an equivalently-sized model. We can scale to larger models as well, but would need more GPUs to keep training time practical.

Mesh-TensorFlow [39] proposes a language for easily specifying parallelization strategies that combine data and model parallelism. Switch Transformers [15] used Mesh-TensorFlow to train a sparsely activated expert-based model with 1.6 trillion parameters, with improved pre-training speed over the T5-11B model [35].

Sharded Data Parallelism. As part of performance optimizations for MLPerf 0.6 [28], sharded data parallelism [24, 44], where optimizer state is sharded over data-parallel workers, was introduced. This method has two advantages: (a) it does not introduce extra communication over vanilla data parallelism, and (b) it divides the optimizer's computation and memory cost across the data-parallel partitions. ZeRO [36, 37] extends this idea: weight parameters and gradients are sharded across data-parallel workers as well, and workers fetch relevant state from their "owning" workers before performing computations. This adds additional communication, which can be partially hidden by carefully overlapping computation and communication. However, this can become harder if tensor parallelism is not used or the batch size is not large enough to hide the extra communication overhead (Figure 10). ZeRO-Infinity [37] uses NVMe to efficiently swap parameters, enabling the training of very large models on a small number of GPUs. We note that using a small number of GPUs for training a very large model results in unrealistic training times (e.g., thousands of years to converge).
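For readers unfamiliar with optimizer-state sharding, the toy sketch below shows the core idea for a single flattened parameter. It is not ZeRO's or the MLPerf submissions' implementation; the helper and argument names (sharded_data_parallel_step, full_param, shard, shard_optimizer) are invented for illustration, and an initialized torch.distributed process group spanning the data-parallel ranks is assumed.

```python
# Toy sketch of optimizer-state sharding over data-parallel ranks
# (illustration only, not ZeRO's implementation).
import torch
import torch.distributed as dist

def sharded_data_parallel_step(full_param: torch.nn.Parameter,
                               shard: torch.nn.Parameter,
                               shard_optimizer: torch.optim.Optimizer,
                               world_size: int) -> None:
    """Assumes backward() already populated full_param.grad, `shard` is a
    1-D parameter holding this rank's slice of the flattened weights, and
    `shard_optimizer` (e.g., Adam) was built over [shard] only, so its
    moment buffers cover just 1/world_size of the model."""
    # 1) Reduce-scatter the flattened gradient: each rank receives the sum
    #    over replicas of its own slice. Together with the all-gather in
    #    step 3, this costs about the same as vanilla DP's single all-reduce.
    grad_chunks = list(full_param.grad.view(-1).chunk(world_size))
    grad_shard = torch.empty_like(shard.data)
    dist.reduce_scatter(grad_shard, grad_chunks)
    shard.grad = grad_shard / world_size

    # 2) Update only the locally owned slice of the parameters.
    shard_optimizer.step()
    shard_optimizer.zero_grad(set_to_none=True)

    # 3) All-gather the updated slices so every replica sees identical weights.
    gathered = [torch.empty_like(shard.data) for _ in range(world_size)]
    dist.all_gather(gathered, shard.data)
    with torch.no_grad():
        full_param.copy_(torch.cat(gathered).view_as(full_param))
```

Schemes such as ZeRO go further and also shard the parameters and gradients themselves, fetching them from their owning ranks on demand, which is the source of the extra communication discussed above.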
Automatic Partitioning. FlexFlow [22], PipeDream [29], DAPPLE [14], and Tarnawski et al. [41] all auto-partition model training graphs over multiple devices with the help of cost models. However, none of these considers all the parallelism dimensions considered in this paper: pipeline and tensor model parallelism, data parallelism, microbatch size, and the effect of memory-savings optimizations like activation recomputation on the training of models larger than the memory capacity of an accelerator. These added dimensions increase the search space that needs to be explored. Gholami et al. [16] show how communication costs for combinations of data and model parallelism can be modeled.

HPC for Model Training. Goyal et al. [17] and You et al. [47] both demonstrate the use of High Performance Computing techniques to train highly accurate ImageNet models in minutes. However, the image classification models considered fit comfortably on a single accelerator (rendering model parallelism unnecessary), support very large batch sizes (> 32k) that allow scaling data parallelism to large worker counts with infrequent communication, and are composed of compact convolutional layers that are inherently amenable to data-parallel communication.

7 DISCUSSION AND CONCLUSION
In this paper, we have shown how PTD-P (inter-node pipeline parallelism, intra-node tensor parallelism, and data parallelism) can be composed to achieve high aggregate throughput (502 petaFLOP/s) while training large models with a trillion parameters. This facilitates end-to-end training in reasonable times (estimated time of around 3 months for a trillion-parameter model). We discussed the various tradeoffs associated with each of these types of parallelism, and how the interactions between them need to be considered carefully when combined.

Even though the implementation and evaluation in this paper are GPU-centric, many of these ideas translate to other types of accelerators as well. Concretely, the following ideas are accelerator-agnostic: a) smartly partitioning the model training graph to minimize the amount of communication while still keeping devices active, b) minimizing the number of memory-bound kernels with operator fusion and careful data layout, and c) other domain-specific optimizations (e.g., the scatter-gather optimization).

ACKNOWLEDGEMENTS
We thank the anonymous reviewers, Seonmyeong Bak, Keshav Santhanam, Trevor Gale, Dimitrios Vytiniotis, and Siddharth Karamcheti for their help and feedback that improved this work. This research was supported in part by NSF Graduate Research Fellowship grant DGE-1656518 and NSF CAREER grant CNS-1651570. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors alone.

APPENDIX: FLOATING-POINT OPERATIONS
In this section, we describe how we calculate the number of floating-point operations (FLOPs) in a model. We consider a language model with $l$ transformer layers, hidden size $h$, sequence length $s$, vocabulary size $V$, and training batch size $B$.

An $A_{m \times k} \times X_{k \times n}$ matrix multiplication requires $2m \times k \times n$ FLOPs (the factor of 2 accounts for the multiplies and the adds).

A transformer layer consists of an attention block followed by a 2-layer feed-forward network. For the attention block, the main FLOP contributors are the key, query, and value transformations ($6Bsh^2$ operations), the attention matrix computation ($2Bs^2h$ operations), attention over values ($2Bs^2h$ operations), and the post-attention linear projection ($2Bsh^2$ operations). The feed-forward network increases the hidden size to $4h$ and then reduces it back to $h$; this requires $16Bsh^2$ FLOPs. Summing these together, each transformer layer results in $24Bsh^2 + 4Bs^2h$ FLOPs for the forward pass. The backward pass requires double the number of FLOPs since we need to calculate gradients with respect to both the input and weight tensors. In addition, we use activation recomputation, which requires an additional forward pass before the backward pass. As a result, the total number of FLOPs per transformer layer is $4 \times \left(24Bsh^2 + 4Bs^2h\right) = 96Bsh^2\left(1 + \frac{s}{6h}\right)$.

The other main contributor to the FLOP count is the logit layer in the language model head, which transforms features of dimension $h$ to the vocabulary dimension $V$. The required FLOPs for this operation are $2BshV$ in the forward pass and $4BshV$ in the backward pass, resulting in $6BshV$ FLOPs in total.

Thus, for a transformer model with $l$ transformer layers, the total number of floating-point operations is:
$$96 B s l h^2 \left(1 + \frac{s}{6h} + \frac{V}{16lh}\right).$$
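The formula translates directly into code. The helper below (training_flops_per_batch, a name chosen here for illustration) simply evaluates it; the example arguments are hypothetical configuration values, not configurations or results reported in the paper.

```python
# Direct transcription of the FLOP formula above.
def training_flops_per_batch(l: int, h: int, s: int, V: int, B: int) -> float:
    """Total FLOPs for one training batch of a GPT-style model with
    activation recomputation: 96*B*s*l*h^2 * (1 + s/(6h) + V/(16lh))."""
    return 96 * B * s * l * h**2 * (1 + s / (6 * h) + V / (16 * l * h))

if __name__ == "__main__":
    # Hypothetical GPT-scale configuration, for illustration only.
    flops = training_flops_per_batch(l=96, h=12288, s=2048, V=51200, B=1536)
    print(f"{flops / 1e15:.1f} petaFLOPs per batch")
```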

REFERENCES
[1] Applications of GPT-3. https://openai.com/blog/gpt-3-apps/.
[2] DeepSpeed: Extreme-Scale Model Training for Everyone. https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/.
[3] DeepSpeed Repository. https://www.deepspeed.ai/.
[4] GitHub Copilot. https://copilot.github.com/.
[5] Microsoft Translates Spoken Text to Code. https://techcrunch.com/2021/05/25/microsoft-uses-gpt-3-to-let-you-code-in-natural-language/.
[6] NVIDIA A100 Tensor Core GPU. https://www.nvidia.com/en-us/data-center/a100/.
[7] NVIDIA Collective Communication Library (NCCL). https://developer.nvidia.com/nccl.
[8] NVIDIA Selene Supercomputer. https://www.top500.org/system/179842/.
[9] NVLink and NVSwitch. https://www.nvidia.com/en-us/data-center/nvlink/.
[10] PyTorch JIT. https://pytorch.org/docs/stable/jit.html.
[11] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, et al. Language Models are Few-Shot Learners. arXiv preprint arXiv:2005.14165, 2020.
[12] Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training Deep Nets with Sublinear Memory Cost. arXiv preprint arXiv:1604.06174, 2016.
[13] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805, 2018.
[14] Shiqing Fan, Yi Rong, Chen Meng, Zongyan Cao, Siyu Wang, Zhen Zheng, Chuan Wu, Guoping Long, Jun Yang, Lixue Xia, et al. DAPPLE: A Pipelined Data Parallel Approach for Training Large Models. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 431–445, 2021.
[15] William Fedus, Barret Zoph, and Noam Shazeer. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. arXiv preprint arXiv:2101.03961, 2021.
[16] Amir Gholami, Ariful Azad, Peter Jin, Kurt Keutzer, and Aydin Buluc. Integrated Model, Batch, and Domain Parallelism in Training Neural Networks. In Proceedings of the 30th Symposium on Parallelism in Algorithms and Architectures, pages 77–86, 2018.
[17] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv preprint arXiv:1706.02677, 2017.
[18] Andreas Griewank and Andrea Walther. Revolve: An Implementation of Checkpointing for the Reverse or Adjoint Mode of Computational Differentiation. ACM Transactions on Mathematical Software (TOMS), 26(1):19–45, 2000.
[19] Chaoyang He, Shen Li, Mahdi Soltanolkotabi, and Salman Avestimehr. PipeTransformer: Automated Elastic Pipelining for Distributed Training of Transformers. arXiv preprint arXiv:2102.03161, 2021.
[20] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. In Advances in Neural Information Processing Systems, pages 103–112, 2019.
[21] Paras Jain, Ajay Jain, Aniruddha Nrusimha, Amir Gholami, Pieter Abbeel, Joseph Gonzalez, Kurt Keutzer, and Ion Stoica. Breaking the Memory Wall with Optimal Tensor Rematerialization. In Proceedings of Machine Learning and Systems 2020, pages 497–511, 2020.
[22] Zhihao Jia, Matei Zaharia, and Alex Aiken. Beyond Data and Model Parallelism for Deep Neural Networks. In Proceedings of the 2nd Conference on Machine Learning and Systems (MLSys), 2018.
[23] Atli Kosson, Vitaliy Chiley, Abhinav Venigalla, Joel Hestness, and Urs Köster. Pipelined Backpropagation at Scale: Training Large Models without Batches. Proceedings of Machine Learning and Systems, 2021.
[24] Sameer Kumar, Victor Bitorff, Dehao Chen, Chiachen Chou, Blake Hechtman, HyoukJoong Lee, Naveen Kumar, Peter Mattson, Shibo Wang, Tao Wang, et al. Scale MLPerf-0.6 Models on Google TPU-v3 Pods. arXiv preprint arXiv:1909.09756, 2019.
[25] Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, et al. PyTorch Distributed: Experiences on Accelerating Data Parallel Training. arXiv preprint arXiv:2006.15704, 2020.
[26] Zhuohan Li, Siyuan Zhuang, Shiyuan Guo, Danyang Zhuo, Hao Zhang, Dawn Song, and Ion Stoica. TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models. arXiv preprint arXiv:2102.07988, 2021.
[27] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR, abs/1907.11692, 2019.
[28] Peter Mattson, Christine Cheng, Cody Coleman, Greg Diamos, Paulius Micikevicius, David Patterson, Hanlin Tang, Gu-Yeon Wei, Peter Bailis, Victor Bittorf, et al. MLPerf Training Benchmark. arXiv preprint arXiv:1910.01500, 2019.
[29] Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei Zaharia. PipeDream: Generalized Pipeline Parallelism for DNN Training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 1–15, 2019.
[30] Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, and Matei Zaharia. Memory-Efficient Pipeline-Parallel DNN Training. In International Conference on Machine Learning, pages 7937–7947. PMLR, 2021.
[31] Jay H Park, Gyeongchan Yun, M Yi Chang, Nguyen T Nguyen, Seungmin Lee, Jaesik Choi, Sam H Noh, and Young-ri Choi. HetPipe: Enabling Large DNN Training on (Whimpy) Heterogeneous GPU Clusters through Integration of Pipelined Model Parallelism and Data Parallelism. In 2020 USENIX Annual Technical Conference (USENIX ATC 20), pages 307–321, 2020.
[32] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems, volume 32, 2019.
[33] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving Language Understanding by Generative Pre-Training, 2018.
[34] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language Models are Unsupervised Multitask Learners. OpenAI Blog, 1(8):9, 2019.
[35] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arXiv:1910.10683, 2019.
[36] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory Optimization Towards Training A Trillion Parameter Models. arXiv preprint arXiv:1910.02054, 2019.
[37] Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, and Yuxiong He. ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning. arXiv preprint arXiv:2104.07857, 2021.
[38] Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. ZeRO-Offload: Democratizing Billion-Scale Model Training. arXiv preprint arXiv:2101.06840, 2021.
[39] Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, Ryan Sepassi, and Blake Hechtman. Mesh-TensorFlow: Deep Learning for Supercomputers. In Neural Information Processing Systems, 2018.
[40] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training Multi-Billion Parameter Language Models using GPU Model Parallelism. arXiv preprint arXiv:1909.08053, 2019.
[41] Jakub M Tarnawski, Amar Phanishayee, Nikhil Devanur, Divya Mahajan, and Fanny Nina Paravecino. Efficient Algorithms for Device Placement of DNN Graph Operators. In Advances in Neural Information Processing Systems, pages 15451–15463, 2020.
[42] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is All You Need. arXiv preprint arXiv:1706.03762, 2017.
[43] Eric P Xing, Qirong Ho, Wei Dai, Jin Kyu Kim, Jinliang Wei, Seunghak Lee, Xun Zheng, Pengtao Xie, Abhimanu Kumar, and Yaoliang Yu. Petuum: A New Platform for Distributed Machine Learning on Big Data. IEEE Transactions on Big Data, 1(2):49–67, 2015.
[44] Yuanzhong Xu, HyoukJoong Lee, Dehao Chen, Hongjun Choi, Blake Hechtman, and Shibo Wang. Automatic Cross-Replica Sharding of Weight Updates in Data-Parallel Training. arXiv preprint arXiv:2004.13336, 2020.
[45] Bowen Yang, Jian Zhang, Jonathan Li, Christopher Ré, Christopher Aberger, and Christopher De Sa. PipeMare: Asynchronous Pipeline Parallel DNN Training. Proceedings of Machine Learning and Systems, 2021.
[46] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. XLNet: Generalized Autoregressive Pretraining for Language Understanding. CoRR, abs/1906.08237, 2019.
[47] Yang You, Zhao Zhang, Cho-Jui Hsieh, James Demmel, and Kurt Keutzer. ImageNet Training in Minutes. In Proceedings of the 47th International Conference on Parallel Processing, pages 1–10, 2018.