PipeTransformer: Automated Elastic Pipelining for Distributed Training of Transformers

Chaoyang He¹, Shen Li², Mahdi Soltanolkotabi¹, Salman Avestimehr¹

Abstract

The size of Transformer models is growing at an unprecedented rate. It has taken less than one year to reach trillion-level parameters since the release of GPT-3 (175B). Training such models requires both substantial engineering efforts and enormous computing resources, which are luxuries most research teams cannot afford. In this paper, we propose PipeTransformer, which leverages automated elastic pipelining for efficient distributed training of Transformer models. In PipeTransformer, we design an adaptive on-the-fly freeze algorithm that can identify and freeze some layers gradually during training, and an elastic pipelining system that can dynamically allocate resources to train the remaining active layers. More specifically, PipeTransformer automatically excludes frozen layers from the pipeline, packs active layers into fewer GPUs, and forks more replicas to increase data-parallel width. We evaluate PipeTransformer using Vision Transformer (ViT) on ImageNet and BERT on SQuAD and GLUE datasets. Our results show that compared to the state-of-the-art baseline, PipeTransformer attains up to 2.83-fold speedup without losing accuracy. We also provide various performance analyses for a more comprehensive understanding of our algorithmic and system-wise design. Finally, we have modularized our training system with flexible APIs and made the source code publicly available.

1. Introduction

Large Transformer models (Brown et al., 2020; Lepikhin et al., 2020) have powered accuracy breakthroughs in both natural language processing and computer vision. GPT-3 hit a new record high accuracy for nearly all NLP tasks. Vision Transformer (ViT) (Dosovitskiy et al., 2020) also achieved 89% top-1 accuracy in ImageNet, outperforming state-of-the-art convolutional networks ResNet-152 (He et al., 2016) and EfficientNet (Tan & Le, 2019). To tackle the growth in model sizes, researchers have proposed various distributed training techniques, including parameter servers (Li et al., 2014; Jiang et al., 2020; Kim et al., 2019), pipeline parallelism (Huang et al., 2019; Park et al., 2020; Narayanan et al., 2019), intra-layer parallelism (Lepikhin et al., 2020; Shazeer et al., 2018; Shoeybi et al., 2019), and zero-redundancy data parallelism (Rajbhandari et al., 2019).

Figure 1. Interpretable Freeze Training: DNNs converge bottom-up (results on CIFAR-10 using ResNet). Each pane shows layer-by-layer similarity (SVCCA score, Raghu et al., 2017) at a training checkpoint, T0 (0% trained), T1 (35% trained), T2 (75% trained), and T3 (100% trained), measured against the layers at the end of training.

Existing distributed training solutions, however, only study scenarios where all model weights are required to be optimized throughout the training (i.e., computation and communication overhead remains relatively static over different iterations). Recent works on freeze training (Raghu et al., 2017; Morcos et al., 2018) suggest that parameters in neural networks usually converge from the bottom up (i.e., not all layers need to be trained all the way through training). Figure 1 shows an example of how parameter values gradually stabilize during training. This observation motivates us to utilize freeze training for distributed training of Transformer models: training can be accelerated by dynamically allocating resources to focus on a shrinking set of active layers. Such a layer freezing strategy is especially pertinent to pipeline parallelism, as excluding consecutive bottom layers from the pipeline can reduce computation, memory, and communication overhead.

In this paper, we propose PipeTransformer, an elastic pipelining training acceleration framework that automatically reacts to frozen layers by dynamically transforming the scope of the pipelined model and the number of pipeline replicas. To the best of our knowledge, this is the first paper that studies layer freezing in the context of both pipeline and data-parallel training.

¹University of Southern California  ²Facebook AI. Correspondence to: Chaoyang He.

Preprint. Under Review.

Figure 2. The process of PipeTransformer's automated and elastic pipelining to accelerate distributed training of Transformer models (BERT for NLP and ViT for CV). The figure's tagline explains the name: "Pipeline Transformation for Transformer Models." Panels: 1. Freeze Algorithm; 2. AutoPipe: elastic pipelining; 3. AutoDP: spawning more pipeline replicas; 4. AutoCache: cross-process caching. Two pipelines on server 0 and server 1 are shown with their active and frozen layers from T0 to T3.

Figure 2 demonstrates the benefits of such a combination. First, by excluding frozen layers from the pipeline, the same model can be packed into fewer GPUs, leading to both fewer cross-GPU communications and smaller pipeline bubbles. Second, after packing the model into fewer GPUs, the same cluster can accommodate more pipeline replicas, increasing the width of data parallelism. More importantly, the speedups acquired from these two benefits are multiplicative rather than additive, further accelerating the training.

The design of PipeTransformer faces four major challenges. First, the freeze algorithm must make on-the-fly and adaptive freezing decisions; however, existing work (Raghu et al., 2017) only provides a posterior analysis tool. Second, the efficiency of pipeline re-partitioning is influenced by multiple factors, including partition granularity, cross-partition activation size, and the chunking (the number of micro-batches) of mini-batches, which require reasoning and searching in a large solution space. Third, to dynamically introduce additional pipeline replicas, PipeTransformer must overcome the static nature of collective communications and avoid potentially complex cross-process messaging protocols when onboarding new processes (one pipeline is handled by one process). Finally, caching can save time for repeated forward propagation of frozen layers, but it must be shared between existing pipelines and newly added ones, as the system cannot afford to create and warm up a dedicated cache for each replica.
PipeTransformer is designed with four core building blocks to address the aforementioned challenges. First, we design a tunable and adaptive algorithm to generate signals that guide the selection of layers to freeze over different iterations (Section 3.1). Once triggered by these signals, our elastic pipelining module, AutoPipe, packs the remaining active layers into fewer GPUs by taking both activation sizes and the variance of workloads across heterogeneous partitions (frozen layers and active layers) into account. It then splits a mini-batch into an optimal number of micro-batches based on prior profiling results for different pipeline lengths (Section 3.2). Our next module, AutoDP, spawns additional pipeline replicas to occupy freed-up GPUs and maintains hierarchical communication process groups to attain dynamic membership for collective communications (Section 3.3). Our final module, AutoCache, efficiently shares activations across existing and new data-parallel processes and automatically replaces stale caches during transitions (Section 3.4).

Overall, PipeTransformer combines the Freeze Algorithm, AutoPipe, AutoDP, and AutoCache modules to provide a significant training speedup. We evaluate PipeTransformer using Vision Transformer (ViT) on ImageNet and BERT on GLUE and SQuAD datasets. Our results show that PipeTransformer attains up to 2.83-fold speedup without losing accuracy. We also provide various performance analyses for a more comprehensive understanding of our algorithmic and system-wise design. Finally, we have developed open-source flexible APIs for PipeTransformer which offer a clean separation among the freeze algorithm, model definitions, and training accelerations, allowing for transferability to other algorithms that require similar freezing strategies. The source code is made publicly available.

2. Overview

2.1. Background and Problem Setting

Suppose we aim to train a massive model in a distributed training system that combines pipelined model parallelism with data parallelism, targeting scenarios where either the memory of a single GPU device cannot hold the model or, if loaded, the batch size must be kept small enough to avoid running out of memory. More specifically, we define our settings as follows:

Training task and model definition. We train Transformer models (e.g., Vision Transformer (Dosovitskiy et al., 2020), BERT (Devlin et al., 2018)) on large-scale image or text datasets. The Transformer model F has L layers, in which the i-th layer is composed of a forward computation function f_i and a corresponding set of parameters w_i. With this definition, the overall model is F = f_0(w_0) ∘ ... ∘ f_{L−1}(w_{L−1}). The model size is S, and the batch size is set to N_bs.

Training infrastructure. Assume the training infrastructure contains a GPU cluster that has N GPU servers (i.e., nodes). Each node has I GPUs. Our cluster is homogeneous, meaning that each GPU and each server have the same hardware configuration. Each GPU's memory capacity is M_GPU. Servers are connected by a high-bandwidth network interface such as InfiniBand.

Pipeline parallelism. In each machine, we load a model F into a pipeline P which has K partitions (K also represents the pipeline length). The k-th partition p_k consists of consecutive layers, p_k = f_i(w_i) ∘ ... ∘ f_j(w_j), and P = p_0 ∘ ... ∘ p_{K−1}. We assume each partition is handled by a single GPU device and 1 ≤ K ≤ I, meaning that we can build multiple pipelines for multiple model replicas in a single machine. We also assume all GPU devices in a pipeline belong to the same machine. Our pipeline is a synchronous pipeline, which does not involve stale gradients, and the number of micro-batches is M. In the operating system, each pipeline is handled by a single process. We refer the reader to GPipe (Huang et al., 2019) for more details.

Data parallelism. DDP (Li et al., 2020) is a cross-machine distributed data-parallel process group of R parallel workers. Each worker is a pipeline replica (a single process). The r-th worker's index (ID) is rank r. For any two pipelines P^(r_i) and P^(r_j) in DDP, r_i and r_j can belong to either the same GPU server or different GPU servers, and they can exchange gradients with the AllReduce algorithm.

Under these settings, our goal is to accelerate training by leveraging freeze training, which does not require all layers to be trained throughout the duration of the training. Additionally, freezing layers consecutively may help save computation, communication, and memory cost, and potentially prevent overfitting. However, these benefits can only be achieved by overcoming the four challenges of designing an adaptive freezing algorithm, dynamic pipeline re-partitioning, efficient resource reallocation, and cross-process caching, as discussed in the introduction. We next describe our overall design, named PipeTransformer, which addresses these challenges.
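To make the notation concrete, the sketch below builds one pipeline replica under the assumptions above: it splits the L layers of F into K consecutive partitions on the K GPUs owned by one process, wraps them with PyTorch Pipe, and registers the replica as one DDP worker. This is only a minimal illustration of the setting, not the PipeTransformer implementation; helper names such as build_pipeline_replica are hypothetical, the usual environment variables for process-group initialization are assumed to be set, and the real system maintains a customized Pipe.

```python
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.pipeline.sync import Pipe
from torch.nn.parallel import DistributedDataParallel as DDP

def build_pipeline_replica(rank, world_size, blocks, K, num_micro_batches):
    """One process == one pipeline replica P = p_0 ∘ ... ∘ p_{K-1} on K local GPUs."""
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    # Cut the layer list into K consecutive partitions p_0 ... p_{K-1}.
    per_stage = (len(blocks) + K - 1) // K
    stages = []
    for k in range(K):
        stage = nn.Sequential(*blocks[k * per_stage:(k + 1) * per_stage])
        stages.append(stage.to(torch.device("cuda", k)))
    # The shipped torch Pipe additionally expects the RPC framework to be
    # initialized; PipeTransformer uses its own customized Pipe instead.
    pipe = Pipe(nn.Sequential(*stages), chunks=num_micro_batches)
    # Each pipeline replica is one DDP worker of rank r; gradients are
    # synchronized across the R replicas with AllReduce.
    return DDP(pipe)
```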
2.2. Overall Design

PipeTransformer co-designs an on-the-fly freeze algorithm and an automated elastic pipelining training system that can dynamically transform the scope of the pipelined model and the number of pipeline replicas. The overall system architecture is illustrated in Figure 3. To support PipeTransformer's elastic pipelining, we maintain a customized version of PyTorch Pipe (Kim et al., 2020). For data parallelism, we use PyTorch DDP (Li et al., 2020) as a baseline. Other building blocks are standard operating-system mechanisms (e.g., multi-processing), so no specialized software or hardware customization is required. To ensure the generality of our framework, we have decoupled the training system into four core components: the freeze algorithm, AutoPipe, AutoDP, and AutoCache. The freeze algorithm (grey) samples indicators from the training loop and makes layer-wise freezing decisions, which are shared with AutoPipe (green). AutoPipe is an elastic pipeline module that speeds up training by excluding frozen layers from the pipeline and packing the active layers into fewer GPUs (pink), leading to both fewer cross-GPU communications and smaller pipeline bubbles. Subsequently, AutoPipe passes pipeline length information to AutoDP (purple), which then spawns more pipeline replicas to increase the data-parallel width, if possible. The illustration also includes an example in which AutoDP introduces a new replica (purple). AutoCache (orange edges) is a cross-pipeline caching module, as illustrated by connections between pipelines. The source code architecture is aligned with Figure 3 for readability and generality.

Figure 3. Overview of the PipeTransformer training system.

3. Algorithm and System Design

This section elaborates on the four main algorithmic and system-wise design components of PipeTransformer.

3.1. Freeze Algorithm

The freeze algorithm must be lightweight and able to make decisions on the fly. This excludes existing layer-wise training approaches such as SVCCA (Raghu et al., 2017), which require full training states and heavy posterior measurements. We propose an adaptive on-the-fly freeze algorithm that defines the number of frozen layers L_frozen^(T) at timestep T as follows:

$$L_{\mathrm{frozen}}^{(T)} = \min\Big( L_{\mathrm{frozen}}^{(T-1)} + \alpha\,(L - L_{\mathrm{frozen}}^{(T-1)}),\ \operatorname*{argmin}_{\ell \in \{L_{\mathrm{frozen}}^{(T-1)}, \dots, L\}} \big\|g_{\ell}^{(T)}\big\| \Big), \quad \text{where } T \geq 1,\ L_{\mathrm{frozen}}^{(0)} = 0,\ \alpha \in (0, 1) \qquad (1)$$

where g_ℓ^(T) is the gradient for layer ℓ at timestep T and ||g_ℓ^(T)|| is its norm. The intuition behind the second term in the min function is that the layer with the smallest gradient norm converges first. To stabilize training, we enforce the upper bound L_frozen^(T−1) + α(L − L_frozen^(T−1)) on the number of frozen layers, which forms a geometric sequence governed by a hyper-parameter α. This essentially freezes an α fraction of the remaining active layers. To illustrate the impact of α, we rewrite the equation as $L_{\mathrm{frozen}}^{(T)} = (1-\alpha)^{T}\left[\frac{\alpha L}{1-\alpha} + \sum_{t=2}^{T} \frac{\alpha L}{(1-\alpha)^{t}}\right]$ (see Appendix for the derivation) and draw the curve of this function in Figure 4. As the figure shows, a larger α leads to more aggressive layer freezing. In summary, Equation (1) calculates the number of frozen layers at timestep T using both the gradient norm and a tunable argument α.

Figure 4. Freeze algorithm using different α (number of frozen layers per epoch for α = 0.1, 0.2, 0.3, and 0.5).
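As a concrete illustration of Equation (1), the following sketch computes the new frozen boundary from per-layer gradient norms and marks the selected bottom layers as frozen. The inputs (a per-layer module list and a list of sampled gradient norms) are assumptions about how the indicators are collected, not the exact PipeTransformer API.

```python
def update_frozen_layers(layers, grad_norms, L_frozen_prev, alpha=1 / 3):
    """Minimal sketch of the freeze decision in Equation (1).

    `layers` is a list of per-layer nn.Module objects and `grad_norms[l]` is the
    sampled gradient norm ||g_l|| of layer l at the current timestep.
    """
    L = len(layers)
    if L_frozen_prev >= L:
        return L
    # Upper bound: freeze at most an alpha fraction of the remaining active layers.
    upper_bound = L_frozen_prev + alpha * (L - L_frozen_prev)
    # Second term of the min: the active layer with the smallest gradient norm,
    # which is assumed to have converged first.
    smallest = min(range(L_frozen_prev, L), key=lambda l: grad_norms[l])
    L_frozen = int(min(upper_bound, smallest))
    # Exclude the chosen bottom layers from backward passes and optimizer updates.
    for l in range(L_frozen):
        for p in layers[l].parameters():
            p.requires_grad = False
    return L_frozen
```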
The α parameter controls the trade-off between accuracy and training speed. The algorithm is also analogous to learning rate (LR) decay: both use a scheduler function during training and take the progress of training as an indicator. The difference is that the freeze algorithm above also takes the gradient norm into account, which keeps it simple and effective. Other freezing strategies can easily be plugged into our training system; indeed, we plan to investigate other strategies in future work.

3.2. AutoPipe: Elastic Pipelining

Triggered by the freeze algorithm, AutoPipe can accelerate training by excluding frozen layers from the pipeline and packing the active layers into fewer GPUs. This section elaborates on the key components of AutoPipe that dynamically partition pipelines, minimize the number of pipeline devices, and optimize the mini-batch chunk size accordingly. Algorithm 1 presents the pseudo-code.

3.2.1. Balanced Pipeline Partitioning

Balancing computation time across partitions is critical to pipeline training speed, as skewed workload distributions across stages lead to stragglers, forcing devices with lighter workloads to wait (demonstrated in Section 4.3.1). However, maintaining optimally balanced partitions does not guarantee the fastest training speed because other factors also play a crucial role:

1. Cross-partition communication overhead. Placing a partition boundary in the middle of a skip connection leads to additional communications, since tensors in the skip connection must now be copied to a different GPU. For example, with the BERT partitions in Figure 5, partition k must take intermediate outputs from both partition k−2 and partition k−1. In contrast, if the boundary is placed after the addition layer, the communication overhead between partitions k−1 and k is visibly smaller. Our measurements show that cross-device communication is more expensive than slightly imbalanced partitions (see the Appendix). Therefore, we do not consider breaking skip connections; each attention layer f_ATTi and MLP layer f_MLPi is kept whole (see line 7 in Algorithm 1).

2. Frozen layer memory footprint. During training, AutoPipe must recompute partition boundaries several times to balance two distinct types of layers: frozen layers and active layers. The memory cost of a frozen layer is a fraction of that of an active layer, given that a frozen layer does not need backward activation maps, optimizer states, or gradients. Instead of launching intrusive profilers to obtain thorough metrics on memory and computational cost, we define a tunable cost factor λ_frozen to estimate the memory footprint ratio of a frozen layer over the same active layer. Based on empirical measurements on our experimental hardware, we set λ_frozen to 1/6.

Based on the above two considerations, AutoPipe balances pipeline partitions based on parameter sizes. More specifically, AutoPipe uses a greedy algorithm to allocate all frozen and active layers so that the partitioned sublayers are evenly distributed over K GPU devices. Pseudo-code is given as the load_balance() function in Algorithm 1. The frozen layers are extracted from the original model and kept in a separate model instance F_frozen on the first device of a pipeline. Note that the partition algorithm employed in this paper is not the only option; PipeTransformer is modularized to work with any alternatives.

Figure 5. Partition boundary in the middle of a skip connection: partitions k−2, k−1, and k over a Transformer block (Multi-Head Attention, addition, Layer Norm, FFN, addition), so partition k must receive intermediate outputs from both partition k−2 and partition k−1.

Algorithm 1  AutoPipe Algorithm

1:  Input: model F, layer numbers L and L_frozen, pipeline length K, frozen layer cost factor λ_frozen
2:  Return: model F_frozen, model F_pipe, updated K
3:  def m_partition(F, L, L_frozen):            // see Section 3.2.1
4:      F_frozen = Sequential(); model size S_frozen = 0
5:      F_pipe = Sequential(); per-layer sizes S_pipe = []
6:      for layer index i = L_frozen to L do
7:          f_ATTi, f_MLPi ← f_i
8:          F_pipe.append(f_ATTi); S_pipe.append(m_size(f_ATTi))
9:          F_pipe.append(f_MLPi); S_pipe.append(m_size(f_MLPi))
10:     return F_frozen, S_frozen, F_pipe, S_pipe
11: def load_balance(F_pipe, S_pipe, K):        // Section 3.2.1
12:     B_L = dict(); B_S = dict()              // balanced layer counts and sizes
13:     L_assigned = 0; S_total = sum(S_pipe)
14:     for partition index k = 0 to K do
15:         mean = S_total / (K − k)
16:         var = np.var(S_pipe[L_assigned:]) / (K − k)
17:         for sublayer index i = L_assigned to len(S_pipe) do
18:             S_k = S_pipe[i]
19:             criterion = B_S[k] − S_frozen(1.0 − λ_frozen) + S_k
20:             if criterion < mean + var then
21:                 B_S[k] += S_k; B_L[k] += 1; L_assigned += 1; S_total −= S_k
22:             else
23:                 break
24:     return B_L, B_S
25: F_frozen, S_frozen, F_pipe, S_pipe = m_partition(F, L, L_frozen)
26: while K ≥ 2 do
27:     B_L, B_S = load_balance(F_pipe, S_pipe, K/2)
28:     B_S[0] −= S_frozen(1.0 − λ_frozen)
29:     M_GPU^(T) = max(B_S)                    // Equation 2
30:     if M_GPU^(T) < M_GPU^(0) then
31:         K = K/2
32:     else
33:         break
34: load F_frozen and F_pipe onto K GPUs using B_S and B_L
35: Pipe(F_pipe, chunks = get_optimal_chunks(K))
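For readers who prefer runnable code, here is a compact Python rendering of the greedy load_balance() and halving loop above, under stated assumptions: parameter counts stand in for memory cost, the λ_frozen discount is applied to the first partition (which hosts F_frozen), and names such as sublayer_sizes are illustrative rather than the real PipeTransformer interfaces.

```python
import numpy as np

def load_balance(sublayer_sizes, frozen_size, K, lambda_frozen=1 / 6):
    """Greedy sublayer allocation over K partitions (cf. Algorithm 1, lines 11-24)."""
    B_L, B_S = [0] * K, [0.0] * K            # per-partition layer count and size
    assigned, total = 0, float(sum(sublayer_sizes))
    for k in range(K):
        if assigned >= len(sublayer_sizes):
            break
        mean = total / (K - k)
        var = np.var(sublayer_sizes[assigned:]) / (K - k)
        while assigned < len(sublayer_sizes):
            s = sublayer_sizes[assigned]
            # Assumption: the frozen-layer discount applies to partition 0, which
            # also hosts F_frozen at a fraction lambda_frozen of its full cost.
            discount = frozen_size * (1.0 - lambda_frozen) if k == 0 else 0.0
            if B_S[k] - discount + s < mean + var:
                B_S[k] += s; B_L[k] += 1; assigned += 1; total -= s
            else:
                break
    return B_L, B_S

def maybe_compress(sublayer_sizes, frozen_size, K, M_gpu_0, lambda_frozen=1 / 6):
    """Halve the pipeline length while the largest partition stays within the
    memory proxy observed at timestep 0 (cf. Algorithm 1, lines 26-33)."""
    while K >= 2:
        B_L, B_S = load_balance(sublayer_sizes, frozen_size, K // 2, lambda_frozen)
        B_S[0] -= frozen_size * (1.0 - lambda_frozen)
        if max(B_S) < M_gpu_0:
            K //= 2
        else:
            break
    return K
```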

3.2.2. Pipeline Compression

Pipeline compression helps to free up GPUs to accommodate more pipeline replicas and to reduce the number of cross-device communications between partitions. To determine the timing of compression, we estimate the memory cost of the largest partition after compression and compare it with that of the largest partition of the pipeline at timestep T = 0. To avoid extensive memory profiling, the compression algorithm uses the parameter size as a proxy for the training memory footprint. Based on this simplification, the criterion of pipeline compression is as follows:

$$\text{compress the pipeline if } M_{GPU}^{(T)} \leq M_{GPU}^{(0)}, \quad \text{where } M_{GPU}^{(T)} \Leftrightarrow \max_{k \in \{0,\dots,K-1\}} S_{p_k} \qquad (2)$$

Once a freeze notification is received, AutoPipe always attempts to divide the pipeline length K by 2 (e.g., from 8 to 4, then 2). By using K/2 as the input, the compression algorithm can verify whether the result satisfies the criterion in Equation (2). Pseudo-code is shown in lines 25-33 of Algorithm 1. Note that this compression makes the acceleration ratio increase exponentially during training, meaning that if a GPU server has a larger number of GPUs (e.g., more than 8), the acceleration ratio is further amplified.

Additionally, such a technique can also speed up training by shrinking the size of pipeline bubbles. To explain bubble sizes in a pipeline, Figure 6 depicts how 4 micro-batches run through a 4-device pipeline (K = 4). In general, the total bubble size is (K − 1) times the per-micro-batch forward and backward cost (for further explanation, please refer to the Appendix). Therefore, it is clear that shorter pipelines have smaller bubble sizes.

Figure 6. Pipeline bubble: F_{d,b}, B_{d,b}, and U_d denote forward, backward, and the optimizer update of micro-batch b on device d, respectively. The total bubble size in each iteration is (K − 1) times the per-micro-batch forward and backward cost.

3.2.3. Dynamic Number of Micro-Batches

Prior pipeline parallel systems use a fixed number of micro-batches per mini-batch (M). GPipe suggests M ≥ 4 × K, where K is the number of partitions (pipeline length). However, given that PipeTransformer dynamically configures K, we find it sub-optimal to maintain a static M during training. Moreover, when integrated with DDP, the value of M also affects the efficiency of DDP gradient synchronization. Since DDP must wait for the last micro-batch to finish its backward computation on a parameter before launching its gradient synchronization, finer micro-batches lead to a smaller overlap between computation and communication (see Appendix for an illustration). Hence, instead of using a static value, PipeTransformer searches for the optimal M on the fly in the hybrid DDP environment by enumerating M values ranging from K to 6K. For a specific training environment, the profiling only needs to be done once (see Algorithm 1, line 35). Section 4 provides performance analyses of M selections.
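The on-the-fly search for M can be as simple as a one-time micro-benchmark per pipeline length. The sketch below assumes a hypothetical run_iteration(batch, chunks) helper that executes one forward/backward pass with the given number of micro-batches; it is only meant to illustrate the K-to-6K enumeration, not the exact profiler used in our system.

```python
import time

def get_optimal_chunks(K, sample_batch, run_iteration, warmup=2, trials=5):
    """Profile throughput for M in [K, 6K] and return the best micro-batch count."""
    best_m, best_throughput = K, 0.0
    for M in range(K, 6 * K + 1):
        for _ in range(warmup):                      # discard warm-up iterations
            run_iteration(sample_batch, chunks=M)
        start = time.time()
        for _ in range(trials):
            run_iteration(sample_batch, chunks=M)
        throughput = trials * len(sample_batch) / (time.time() - start)
        if throughput > best_throughput:
            best_m, best_throughput = M, throughput
    return best_m
```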

3.3. AutoDP: Spawning More Pipeline Replicas

As AutoPipe compresses the same pipeline into fewer GPUs, AutoDP can automatically spawn new pipeline replicas to increase the data-parallel width.

Despite the conceptual simplicity, subtle dependencies on communications and states require careful design. The challenges are threefold. 1. DDP communication: collective communications in PyTorch DDP require static membership, which prevents new pipelines from connecting with existing ones. 2. State synchronization: newly activated processes must be consistent with existing pipelines in the training progress (e.g., epoch number and learning rate), weights and optimizer states, the boundary of frozen layers, and the pipeline GPU range. 3. Dataset redistribution: the dataset should be re-balanced to match the dynamic number of pipelines. This not only avoids stragglers but also ensures that gradients from all DDP processes are equally weighted.

To tackle these challenges, we create double communication process groups for DDP. As in the example shown in Figure 7, the message process group (purple) is responsible for light-weight control messages and covers all processes, while the active training process group (yellow) contains only active processes and serves as a vehicle for heavy-weight tensor communications during training. The message group remains static, whereas the training group is dismantled and reconstructed to match the active processes. In T0, only processes 0 and 8 are active. During the transition to T1, process 0 activates processes 1 and 9 (newly added pipeline replicas) and synchronizes the necessary information mentioned above using the message group. The four active processes then form a new training group, allowing static collective communications to adapt to dynamic memberships. To redistribute the dataset, we implement a variant of DistributedSampler that can seamlessly adjust data samples to match the number of active pipeline replicas.

Figure 7. AutoDP: handling dynamic data parallelism with messaging between double process groups (processes 0-7 belong to machine 0, while processes 8-15 belong to machine 1; the messages carry the training progress and pipelining information).

The above design also naturally helps to reduce DDP communication overhead. More specifically, when transitioning from T0 to T1, processes 0 and 1 destroy the existing DDP instances, and the active processes construct a new DDP training group using F_pipe (AutoPipe stores F_frozen and F_pipe separately, as introduced in Section 3.2.1). A discussion of the communication cost can be found in the Appendix.
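A minimal sketch of the double process groups is shown below, assuming torch.distributed is already initialized across all processes; the rank lists and the helper name are illustrative, and the real AutoDP additionally synchronizes training state and re-shards the dataset before the new group starts training.

```python
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def rebuild_groups(all_ranks, active_ranks, pipe_model, message_group=None):
    # Static message group: covers every process and carries light-weight control
    # messages (training progress, frozen boundary, new pipeline length).
    if message_group is None:
        message_group = dist.new_group(ranks=all_ranks)
    # Elastic training group: rebuilt whenever AutoPipe changes the pipeline
    # length, so heavy-weight AllReduce only involves active replicas.
    training_group = dist.new_group(ranks=active_ranks)
    ddp_model = None
    if dist.get_rank() in active_ranks:
        # Wrap only the active part of the pipeline (F_pipe) so that gradient
        # synchronization excludes the frozen layers.
        ddp_model = DDP(pipe_model, process_group=training_group)
    return message_group, training_group, ddp_model
```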

3.4. AutoCache: Cross-pipeline Caching

Caching activation maps from frozen layers can further speed up training. This idea appears straightforward, but several caveats must be carefully addressed.

Cross-process caching. The cache must be shared across processes in real time, as creating and warming up a dedicated cache for each model replica slows down training. This is achieved by spawning a dedicated daemon process that holds the cache in shared memory which all training processes can access in real time. Figure 8 shows an example of the transition from T1 to T2, assuming T1 freezes 3 layers, T2 freezes 4 additional layers, and 5 layers remain active in T2. Immediately after the transition by AutoDP, the cache still holds cached activations from layer 3, which must be replaced by activations from layer 7. Therefore, all processes read their corresponding activations from the cache, feed them to the next 4 layers to compute activations for layer 7, and then replace the existing cache entries with the new activations for their samples accordingly. In this way, AutoCache can gradually update cached activations without running any sample through any frozen layer twice.

When the activations are too large to reside in CPU memory, AutoCache also swaps them to disk and performs pre-fetching automatically. More details on the cross-process cache design can be found in the Appendix.

Figure 8. AutoCache: a caching daemon holds frozen-layer activations in CPU host memory (spilling to disk storage), shares them across training processes, and automates the timing of caching.

The timing of caching is also important, as reading the cache can be slower than running the real forward propagation, especially if few layers are frozen and activations are large. To ensure that our training system can adapt to different hardware, model architecture, and batch size settings, AutoCache also contains a profiler that helps evaluate the appropriate transition at which to enable caching, and it only employs cached activations when the profiler suggests caching can speed up the forward pass. A performance analysis is provided in Section 4.3.5.
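To make the cross-process idea concrete, here is a small sketch in which a multiprocessing manager dictionary stands in for the shared-memory daemon; per-sample entries remember the frozen boundary at which they were cached, so only newly frozen layers are ever re-executed. The class and method names are assumptions for illustration, not the actual AutoCache interface, and the real system additionally handles disk spill-over and pre-fetching.

```python
import torch

class SharedActivationCache:
    """Sketch of cross-process activation caching for frozen layers."""

    def __init__(self, manager):
        # e.g., manager = torch.multiprocessing.Manager(); shared across processes.
        self.store = manager.dict()

    def forward_frozen(self, frozen_layers, num_frozen, sample_ids, batch):
        outputs = []
        for sample_id, x in zip(sample_ids, batch):
            # boundary = how many frozen layers this sample's cached activation covers.
            boundary, act = self.store.get(sample_id, (0, x))
            act = act.to(x.device)
            with torch.no_grad():
                for layer in frozen_layers[boundary:num_frozen]:
                    act = layer(act.unsqueeze(0)).squeeze(0)   # run only newly frozen layers
            # Refresh the cache so no sample ever passes through a frozen layer twice.
            self.store[sample_id] = (num_frozen, act.cpu())
            outputs.append(act)
        return torch.stack(outputs)
```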
4. Experiments

This section first summarizes the experiment setup and then evaluates PipeTransformer on computer vision and natural language processing tasks. More comprehensive results can be found in the Appendix.

4.1. Setup

Hardware. Experiments were conducted on 2 identical machines connected by InfiniBand CX353A (5 GB/s), where each machine is equipped with 8 NVIDIA Quadro RTX 5000 GPUs (16 GB GPU memory). GPU-to-GPU bandwidth within a machine (PCIe 3.0, 16 lanes) is 15.754 GB/s.

Implementation. We used PyTorch Pipe as a building block, which had not yet been officially released at the time of writing this paper. Hence, we used the developer version 1.8.0.dev20201219. The BERT model definition, configuration, and related tokenizer are from HuggingFace 3.5.0. We implemented Vision Transformer in PyTorch by following its TensorFlow implementation. More details can be found in our source code.

Models and Datasets. The experiments employ two representative Transformers in CV and NLP: Vision Transformer (ViT) and BERT. ViT was run on an image classification task, initialized with weights pre-trained on ImageNet21K and fine-tuned on ImageNet and CIFAR-100. BERT was run on two tasks: text classification on the SST-2 dataset from the General Language Understanding Evaluation (GLUE) benchmark, and question answering on the SQuAD v1.1 dataset (Stanford Question Answering), a collection of 100k crowdsourced question/answer pairs.

Training Schemes. Given that large models normally require thousands of GPU-days (e.g., GPT-3) when trained from scratch, fine-tuning downstream tasks using pre-trained models has become a trend in the CV and NLP communities. Moreover, PipeTransformer is a complex training system that involves multiple core components. Thus, for the first version of PipeTransformer system development and algorithmic research, it is not cost-efficient to develop and evaluate from scratch using large-scale pre-training. Therefore, the experiments presented in this section focus on pre-trained models. Note that since the model architectures in pre-training and fine-tuning are the same, PipeTransformer can serve both. We discuss pre-training results in the Appendix.

Baseline. Experiments in this section compare PipeTransformer to the state-of-the-art framework, a hybrid scheme of PyTorch Pipe (PyTorch's implementation of GPipe (Huang et al., 2019)) and PyTorch DDP. Since this is the first paper that studies accelerating distributed training by freezing layers, there are no perfectly aligned counterpart solutions yet.

Hyper-parameters. Experiments use ViT-B/16 (12 transformer layers, 16 × 16 input patch size) for ImageNet and CIFAR-100, BERT-large-uncased (24 layers) for SQuAD 1.1, and BERT-base-uncased (12 layers) for SST-2. With PipeTransformer, ViT and BERT training can set the per-pipeline batch size to around 400 and 64, respectively. Other hyper-parameters (e.g., epochs, learning rate) for all experiments are presented in the Appendix.

4.2. Overall Training Acceleration

We summarize the overall experimental results in Table 1. Note that the speedup we report is based on a conservative value of α (1/3) that obtains comparable or even higher accuracy. A more aggressive α (2/5, 1/2) can obtain a higher speedup but may lead to a slight loss in accuracy (see Section 4.3.3). Note that the model size of BERT (24 layers) is larger than that of ViT-B/16 (12 layers), so it takes more time for communication (see Section 4.3.2 for details).

Table 1. Speedup for ViT and BERT Training

Dataset      Baseline Accuracy    Baseline Training Time    PipeTransformer Accuracy    PipeTransformer Training Time    Training Speedup
ImageNet     80.83 ± 0.05         26h 30m                   82.18 ± 0.32                9h 21m                           2.83×
CIFAR-100    91.21 ± 0.07         35m 6s                    91.33 ± 0.05                12m 23s                          2.44×
SQuAD 1.1    90.71 ± 0.18         5h 7m                     90.69 ± 0.23                2h 26m                           2.10×

Notes: 1. The accuracy is the mean and variance of three independent runs with the same random seed. 2. The training time among different runs is relatively stable (the gap is less than 1 minute). 3. GLUE (SST-2)'s training time is only a few minutes, so we mainly used it for debugging and do not report it. 4. Accuracy metric: ImageNet/CIFAR-100: top-1 accuracy; SQuAD: F1 score.

4.3. Performance Analysis

This section presents evaluation results and analyzes the performance of the different components of PipeTransformer. More experimental results can be found in the Appendix.

4.3.1. Speedup Breakdown

To understand the efficacy of all four components and their impacts on training speed, we experimented with different combinations and used their training sample throughput (samples/second) and speedup ratio as metrics. Results are illustrated in Figure 9. The key takeaways from these experimental results are: 1. the main speedup is the result of elastic pipelining, which is achieved through the joint use of AutoPipe and AutoDP; 2. AutoCache's contribution is amplified by AutoDP; 3. freeze training alone, without system-wise adjustment, can even degrade the training speed (discussed in Section 3.2). We provide additional explanations of these results in the Appendix.

Figure 9. Speedup breakdown (ViT on ImageNet). (a) Sample throughput over epochs; (b) speedup ratio comparison: Freeze + AutoPipe + AutoDP + AutoCache (2.83×), Freeze + AutoPipe + AutoDP (2.27×), Freeze + AutoPipe + AutoCache (1.26×), No Freeze baseline (1.0×), and Freeze Only (0.95×).

4.3.2. Communication Cost

We also analyzed how communication and computation contribute to the overall training time. Since PyTorch DDP overlaps communication with computation, the time difference between a local training iteration and a distributed training iteration does not faithfully represent the communication delay. Moreover, as DDP also organizes parameters into buckets and launches an AllReduce for each bucket, recording the start and finish time of the overall communication also falls short, as there can be time gaps between buckets. To correctly measure the DDP communication delay, we combined the DDP communication hook with the CUDAFuture callback. More details of this measurement are documented in the Appendix. Key takeaways from Table 2: 1. a larger model costs more time on communication (BERT on SQuAD); 2. a higher cross-machine bandwidth can further speed up training, especially for larger models.

Table 2. Communication Cost vs. Computational Cost

Dataset     Overall Cost    Communication Cost    Computation Cost    Communication Cost Ratio
ImageNet    9h 21m          34m                   8h 47m              5.9%
SQuAD       2h 26m          16m 33s               2h 9m               8.8%
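As a rough illustration of this hook-based measurement (not the exact code used for Table 2), the sketch below wraps PyTorch's default allreduce communication hook and accumulates the time from launching each bucket's AllReduce to the completion of its future, which resolves when the underlying NCCL work finishes.

```python
import time
import torch.distributed.algorithms.ddp_comm_hooks.default_hooks as default_hooks

comm_time = {"total_seconds": 0.0}

def timed_allreduce_hook(process_group, bucket):
    start = time.time()
    fut = default_hooks.allreduce_hook(process_group, bucket)

    def record(completed_fut):
        # Fires when the bucket's AllReduce (a CUDAFuture for NCCL) completes,
        # so the elapsed time approximates the per-bucket communication delay.
        comm_time["total_seconds"] += time.time() - start
        return completed_fut.value()

    return fut.then(record)

# Usage (illustrative): ddp_model.register_comm_hook(state=None, hook=timed_allreduce_hook)
```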

4.3.3. Tuning α in the Freeze Algorithm

We ran experiments to show how α in the freeze algorithm influences training speed. The results clearly demonstrate that a larger α (more aggressive freezing) leads to a greater speedup but suffers from a slight performance degradation. In the case shown in Figure 10(a), where α = 1/5, freeze training outperforms normal training and obtains a 2.04-fold speedup. We provide more results in the Appendix.

4.3.4. Optimal Chunks in the Elastic Pipeline

We profiled the optimal number of micro-batches M for different pipeline lengths K. Results are summarized in Figure 10(b). As we can see, different K values lead to different optimal M, and the throughput gaps across different M values are large (as shown when K = 8), which confirms the necessity of an anterior profiler in elastic pipelining.

Figure 10. Results of the performance analysis: (a) tuning α in the freeze algorithm; (b) profiling the optimal chunk number M (throughput for different M when K = 8, and the optimal M for pipeline lengths 2, 4, and 8); (c) timing of caching (throughput with and without AutoCache enabled from epoch 0).

4.3.5. Understanding the Timing of Caching

To evaluate AutoCache, we compared the sample throughput of a training job that activates AutoCache from epoch 0 (blue) with a training job without AutoCache (red). Figure 10(c) shows that enabling caching too early can slow down training, as caching can be more expensive than forward propagation over a small number of frozen layers. After more layers are frozen, cached activations clearly outperform the corresponding forward propagation. As a result, AutoCache uses a profiler to determine the proper time to enable caching. In our system, caching starts from 3 frozen layers for ViT (12 layers) and from 5 frozen layers for BERT (24 layers).

5. Related Works

PipeTransformer combines pipeline parallelism (Huang et al., 2019; Narayanan et al., 2019; Park et al., 2020) and data parallelism (Li et al., 2020). Both techniques have been extensively studied in prior work. GPipe (Huang et al., 2019) parallelizes micro-batches within a mini-batch and enforces synchronization between consecutive mini-batches. The synchronization barrier creates execution bubbles, which are exacerbated as the model spans more devices. PipeDream (Narayanan et al., 2019) and HetPipe (Park et al., 2020) remove or mitigate execution bubbles by allowing a configurable amount of staleness. Although evaluations show that models can still converge with high accuracy, this breaks the mathematical equivalence to local training. PipeTransformer builds on top of the PyTorch pipeline parallel and distributed data-parallel APIs (Li et al., 2020). Compared to prior solutions, PipeTransformer reduces the size of bubbles during training by dynamically packing the active layers into fewer GPUs. Moreover, the communication overhead for data-parallel training, which is the dominant source of delay, also drops when the active model size shrinks.

6. Discussion

We defer the discussion to the Appendix, where we cover pre-training vs. fine-tuning, designing better freeze algorithms, and the versatility of our approach.

7. Conclusion

This paper proposes PipeTransformer, a holistic solution that combines elastic pipeline parallelism and data parallelism for distributed training. More specifically, PipeTransformer incrementally freezes layers in the pipeline, packs the remaining active layers into fewer GPUs, and forks more pipeline replicas to increase the data-parallel width. Evaluations on ViT and BERT models show that, compared to the state-of-the-art baseline, PipeTransformer attains up to 2.83× speedup without accuracy loss.

References

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

Huang, Y., Cheng, Y., Bapna, A., Firat, O., Chen, D., Chen, M., Lee, H., Ngiam, J., Le, Q. V., Wu, Y., and Chen, Z. GPipe: Efficient training of giant neural networks using pipeline parallelism. In Advances in Neural Information Processing Systems, volume 32, pp. 103-112. Curran Associates, Inc., 2019.

Jiang, Y., Zhu, Y., Lan, C., Yi, B., Cui, Y., and Guo, C. A unified architecture for accelerating distributed DNN training in heterogeneous GPU/CPU clusters. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pp. 463-479. USENIX Association, November 2020. URL https://www.usenix.org/conference/osdi20/presentation/jiang.

Kim, C., Lee, H., Jeong, M., Baek, W., Yoon, B., Kim, I., Lim, S., and Kim, S. torchgpipe: On-the-fly pipeline parallelism for training giant models. arXiv preprint arXiv:2004.09910, 2020.

Kim, S., Yu, G.-I., Park, H., Cho, S., Jeong, E., Ha, H., Lee, S., Jeong, J. S., and Chun, B.-G. Parallax: Sparsity-aware data parallel training of deep neural networks. In Proceedings of the Fourteenth EuroSys Conference 2019, pp. 1-15, 2019.

Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., and Chen, Z. GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668, 2020.

Li, M., Andersen, D. G., Park, J. W., Smola, A. J., Ahmed, A., Josifovski, V., Long, J., Shekita, E. J., and Su, B.-Y. Scaling distributed machine learning with the parameter server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pp. 583-598, 2014.

Li, S., Zhao, Y., Varma, R., Salpekar, O., Noordhuis, P., Li, T., Paszke, A., Smith, J., Vaughan, B., Damania, P., et al. PyTorch distributed: Experiences on accelerating data parallel training. Proceedings of the VLDB Endowment, 13(12), 2020.

Morcos, A., Raghu, M., and Bengio, S. Insights on representational similarity in neural networks with canonical correlation. In Advances in Neural Information Processing Systems 31, pp. 5732-5741. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/7815-insights-on-representational-similarity-in-neural-networks-with-canonical-correlation.pdf.

Narayanan, D., Harlap, A., Phanishayee, A., Seshadri, V., Devanur, N. R., Ganger, G. R., Gibbons, P. B., and Zaharia, M. PipeDream: Generalized pipeline parallelism for DNN training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, SOSP '19, pp. 1-15, New York, NY, USA, 2019. Association for Computing Machinery. doi: 10.1145/3341301.3359646.

Park, J. H., Yun, G., Yi, C. M., Nguyen, N. T., Lee, S., Choi, J., Noh, S. H., and Choi, Y.-r. HetPipe: Enabling large DNN training on (whimpy) heterogeneous GPU clusters through integration of pipelined model parallelism and data parallelism. In 2020 USENIX Annual Technical Conference (USENIX ATC 20), pp. 307-321. USENIX Association, July 2020. URL https://www.usenix.org/conference/atc20/presentation/park.

Raghu, M., Gilmer, J., Yosinski, J., and Sohl-Dickstein, J. SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. In NIPS, 2017.

Rajbhandari, S., Rasley, J., Ruwase, O., and He, Y. ZeRO: Memory optimization towards training a trillion parameter models. arXiv preprint arXiv:1910.02054, 2019.

Shazeer, N., Cheng, Y., Parmar, N., Tran, D., Vaswani, A., Koanantakool, P., Hawkins, P., Lee, H., Hong, M., Young, C., Sepassi, R., and Hechtman, B. Mesh-TensorFlow: Deep learning for supercomputers. In Advances in Neural Information Processing Systems, volume 31, pp. 10414-10423. Curran Associates, Inc., 2018.

Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., and Catanzaro, B. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.

Tan, M. and Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pp. 6105-6114. PMLR, 2019.