AVOIDING GPU OOM FOR DYNAMIC COMPUTATIONAL GRAPHS TRAINING

Siyuan Zhuang

ABSTRACT

State-of-the-art models are becoming larger and larger, and GPU memory is becoming a bottleneck. Meanwhile, dynamic computational graph libraries like PyTorch have gained a lot of popularity among researchers, while previous work on overcoming the GPU memory bottleneck has mainly focused on static graphs. In this work, we target out-of-memory issues on GPU for dynamic computational graphs. Specifically, we developed a system that swaps the parameters and intermediate results of the dynamic computational graph between CPU and GPU memory asynchronously. With proper system design and optimizations, our system achieves a 2-3x speedup compared to various baselines, while the overhead is less than 0.5 second even for complex neural networks.

1 INTRODUCTION

Deep learning models are becoming larger and larger. For example, the largest parameter count of NLP models is actively growing: from 340M (BERT-Large) to 8.3B (GPT-2 8B) (Narasimhan, 2019). However, the computational power of GPUs and GPU memory are growing at a different speed. Therefore, GPU memory is becoming a bottleneck for large model development.

Previous works (Narayanan et al., 2019; Huang et al., 2019; Gholami et al., 2018; Wang et al., 2019; Jia et al., 2018; Shazeer et al., 2018; Jain et al., 2019; Chen et al., 2016; Zhang et al., 2019) propose different methods to tackle this issue. However, these works mainly focus on optimizing static/fixed models with a fixed computation resource configuration instead of dynamic ones. This limits their practical usage in both industry and academia. For industry, it is not feasible to use techniques like model parallelism for online serving, and the size of static/fixed models can be further optimized by pruning (Han et al., 2015), knowledge distillation (Hinton et al., 2015), quantization (Courbariaux et al., 2015), or other novel machine learning techniques (Lan et al., 2019). For academia, researchers typically would like to explore novel, dynamic and complex networks for research purposes. The growing interest in and adoption of PyTorch (Paszke et al., 2019), along with much more dynamic and flexible neural networks (Kosiorek et al., 2019; Cai & Vasconcelos, 2018; Yeo et al., 2018), shows that researchers prefer deep learning frameworks with flexibility. However, existing methods are not suitable for dynamic graphs.

Saving GPU memory for dynamic models is inherently harder than for static ones. As shown in Fig. 1, the execution path of dynamic models (as dynamic computational graphs) is highly dependent on the input, and it is impossible to find a time-optimal memory saving solution for general cases because it is impossible to predict the future without the required information.

Figure 1. An example of a dynamic computation graph. The computation order goes from left to right. Unlike static computational graphs, the shadowed areas that we haven't visited or created yet are unknown, and these parts could result in GPU out of memory.

To tackle this issue, in this paper, we propose a new system in which users can train dynamic models without worrying about GPU memory limitations, while exploiting heuristics of the graph structure and the underlying platform to keep performance high. Specifically, given that on modern GPU servers CPU memory is much cheaper and often much larger than GPU memory, we leverage the CPU memory and dynamically move tensors back and forth between CPU and GPU memory. By using a properly designed swapping policy and performing memory swapping asynchronously, our proposed system enables users to train much larger neural networks with little overhead.

We illustrate the design of our system in Fig. 2. The three major components of our system are a tracer, a policy and a memory pool. The tracer is triggered before the execution of a node in the computational graph (which could be a layer or a part of a layer of a dynamic neural network model). The tracer records the current node and execution context and passes them to the policy. The policy analyzes the GPU memory constraint, the execution dependencies and the execution history of previous training iterations, and decides an action to take. For example, if the GPU memory is not sufficient for the current node, the policy will operate on the memory pool to swap out some tensors from GPU memory and reserve enough GPU memory for the new results. The memory pool is responsible for managing the current GPU memory, recording the necessary information for memory control, and performing low-level memory operations. We implemented our system on PyTorch (Paszke et al., 2019) and provide a simple one-line API for end users to make use of our system.

Our results show that we can be 1.8-2.1x faster than a simple GPU memory swapping method on ResNet-152, and 2-3x faster compared to another baseline using conservative policies. Although there are still some gaps from optimal solutions, these results provide initial evidence that there is still very large room for dynamic computational graph memory optimization, and we believe that this will become an active research topic just like the optimizations on static graphs over the last two years. We present ideas about how to further improve performance and prevent OOM for dynamic networks in Section 7.

We summarize our contributions as follows:

• To our knowledge, this is one of the first papers targeting out-of-memory issues on GPU for dynamic neural network models.

• We propose a complete system that performs memory swapping between CPU and GPU memory with an easy-to-use API for end users.

• We show that with a tight connection to the underlying graph execution system (i.e. PyTorch), proper optimization and design, memory swapping can be very effective on dynamic computational graphs.

• As an on-going project, we also discuss challenges and opportunities of this research direction and present some preliminary results as future works.

Correspondence to: Siyuan Zhuang.

2 RELATED WORKS

2.1 Data Parallelism and Gradient Accumulation

The dominant and easiest way to parallelize a deep learning model is data parallelism. Data parallelism splits the input data into different subsets and runs different subsets of the input on different devices. Most deep learning frameworks (e.g. PyTorch (Paszke et al., 2019), TensorFlow (Abadi et al., 2016)) implement data parallelism as an API call. To utilize data parallelism in a single-device setting, gradient accumulation (Ott et al., 2018) performs multiple rounds of forward and backward propagation in a single training iteration to accumulate gradients before a gradient update. This enables large training batch sizes, which is necessary for large model training.

The biggest issue for data parallelism is that each device still needs to hold a full copy of the model and perform the whole model execution on a single device, which is impossible for larger models (Narasimhan, 2019). Also, some neural network layers will behave differently given different input sizes (e.g. batch normalization (Ioffe & Szegedy, 2015)), which will introduce uncertainty during model development.
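For concreteness, the following is a minimal PyTorch-style sketch of gradient accumulation; the model, loss function, optimizer and data are placeholders for illustration, not code from this paper.

import torch

def train_with_accumulation(model, loss_fn, optimizer, batches, accum_steps=4):
    # Accumulate gradients over `accum_steps` micro-batches before one update,
    # emulating a larger batch on a single device.
    optimizer.zero_grad()
    for i, (x, y) in enumerate(batches):
        loss = loss_fn(model(x), y) / accum_steps   # scale so the sum matches the large batch
        loss.backward()                             # gradients are summed into .grad
        if (i + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()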

2.2 Model Parallelism

As a way to run larger and larger deep neural networks, in model parallelism a DNN model is partitioned across the available workers, with each worker evaluating and performing updates for only a subset of the model's parameters for all inputs.

One way to implement model parallelism is to split some dimensions of neural network layers and execute them in a SIMD manner on multiple machines. Early work (Krizhevsky, 2014) parallelizes specific classes of DNN models and is limited in flexibility and generality. The Mesh-TensorFlow library (Shazeer et al., 2018) provides users an interface to specify any tensor dimensions to be split across any dimensions of a multi-dimensional mesh of processors. Recently, a set of works (Gholami et al., 2018; Jia et al., 2018; Wang et al., 2019) proposed to automatically find optimal model-parallel execution plans that achieve the best speed.

Another way is to place different parts of a neural network on different devices. Directly applying this leads to extremely poor performance: since neural network layers depend on each other in a sequential order and, during training, both forward and backward propagation need to be performed, a naive implementation will have only one device running at any given time. GPipe (Huang et al., 2019) mitigates this issue by splitting an input batch into groups and pipelining the execution of the groups on different devices. Although it achieves high speedup, there is still a significant amount of time during which some devices are idle. Pipedream (Narayanan et al., 2019) tackles this issue by relaxing the synchronous training setup of GPipe and reduces the idle time to zero. This type of model parallelism is sometimes referred to as pipeline parallelism.

Although it partly solves the memory issue, the model size for model parallelism is still constrained by the total memory of all devices. Besides, most of the existing model parallel systems require the input network to be static, which is infeasible for the popular dynamic computational graph frameworks (e.g. PyTorch (Paszke et al., 2019)). Also, for pipeline parallelism, systems like Pipedream (Narayanan et al., 2019) will execute a different model or a different training schedule compared to the original one, making it hard to debug during model development.
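To make the second approach concrete, the following is a minimal sketch (not taken from the systems cited above) of placing two halves of a network on different GPUs; layer sizes and device names are arbitrary assumptions.

import torch
import torch.nn as nn

class TwoDeviceNet(nn.Module):
    # The first half lives on cuda:0 and the second half on cuda:1. With a naive
    # forward pass only one device is busy at a time, which is the idle-time
    # problem that GPipe and Pipedream address.
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        h = self.part1(x.to("cuda:0"))
        return self.part2(h.to("cuda:1"))   # activations are copied across devices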

Figure 2. An overview of our framework architecture.

2.3 Efficient Tensor Rematerialization

For neural network training, most of the memory usage actually goes to storing the intermediate activations of the different layers of a network, rather than the parameters and the gradients of the network (Chen et al., 2016). To reduce the memory usage, Chen et al. (2016) proposed to store only O(√n) intermediate layers for an n-layer network at the cost of an extra forward pass. Jain et al. (2019) further extend this idea with a more systematic way to find the optimal checkpointing strategy for a given memory budget.

Relaxing the constraint of computing exactly the same neural network, a line of works designed a special kind of network that enables reversibility: given the output of a layer, one can compute the input of that layer at the computational cost of an extra forward computation. Gomez et al. (2017) modify the original ResNet into a reversible network that reaches nearly identical performance to the original one. Anonymous (2020) extend this idea to the Transformer network for natural language processing and also achieve near-identical performance to the original Transformer network in language modeling.

Compared to data and model parallelism, tensor rematerialization actually enables us to train a larger network and break the original memory bottleneck. However, tensor rematerialization trades off memory cost against extra computation cost, making the already slow execution of large networks even slower. Also, for reversible networks, the change of network structure introduces new uncertainty to the model, which makes new model development harder. Furthermore, most of the optimizations here require knowledge of the whole network in advance and are infeasible for dynamic frameworks and dynamic neural networks.
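As an illustration of the checkpointing idea of Chen et al. (2016), the sketch below uses PyTorch's torch.utils.checkpoint utility on an assumed toy chain of layers; it is not the implementation used by the works above or by this paper.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# A 64-block chain; only activations at segment boundaries are kept, and the
# rest are recomputed during backward (memory saved at the cost of extra forwards).
blocks = nn.Sequential(*[nn.Sequential(nn.Linear(512, 512), nn.ReLU()) for _ in range(64)])
x = torch.randn(32, 512, requires_grad=True)
out = checkpoint_sequential(blocks, 8, x)   # 8 segments, roughly the sqrt(n) trade-off
out.sum().backward()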
2.4 Memory Swapping

On modern GPU servers, CPU memory is much cheaper and often much larger than GPU memory. Therefore, the memory load of neural network training can be reduced by swapping tensors to CPU memory when they are not in use, and swapping them back to GPU memory right before their next access. Ideally, the communication in both directions should be hidden under computation (via separate streams) to minimize the communication overhead (He & Yu, 2011).

In GeePS (Cui et al., 2016), the decision of which layer or which tensor to swap is made by the end user. It requires the end user to have a good understanding of the model, including the execution order, memory and time consumption of each layer. SuperNeurons (Wang et al., 2018) restricts swapping to convolutional layers only; big tensors of other layers are not considered for swapping, and it also requires the end user's intervention. Zhang et al. (2019) proposed a near-optimal way to swap memory automatically. The overhead of these methods is relatively small, but most previous works based their analysis on a static neural network, which is infeasible for the recently popular dynamic neural networks.

3 BACKGROUND

3.1 Dynamic Computational Graph

Most deep learning libraries follow one of two execution patterns: static computational graphs (e.g. TensorFlow (Abadi et al., 2016), MXNet (Chen et al., 2015), Theano (Team et al., 2016)) or dynamic computational graphs (e.g. PyTorch (Paszke et al., 2019)). The static frameworks construct a static dataflow graph that represents the computation and can then be applied repeatedly to batches of data. Users need to specify the whole graph before execution. This approach provides visibility into the whole computation ahead of time and can theoretically be leveraged to improve performance and scalability. However, it comes at the cost of ease of use, ease of debugging, and flexibility of the types of computation that can be represented.

Therefore, dynamic frameworks like PyTorch are becoming more and more popular among machine learning researchers and have been adopted more in applications (Paszke et al., 2019). Dynamic frameworks define the computational graph at the same time as the actual network computation. The computational graph is defined during execution and is recorded for back-propagation. Early dynamic frameworks (Tokui et al.; Collobert et al., 2002; Neubig et al., 2017) often suffer from high computational overhead or use a less expressive programming language. However, PyTorch (Paszke et al., 2019) overcame these issues and gained a lot of popularity in the machine learning research community.
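The following toy module illustrates why such graphs are dynamic: the number of executed blocks depends on the input values, so the recorded autograd graph (and its memory footprint) differs across iterations. The module itself is an illustrative assumption, not one of the networks evaluated in this paper.

import torch
import torch.nn as nn

class DynamicDepthNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.block = nn.Linear(256, 256)

    def forward(self, x):
        steps = int(x.abs().mean().item() * 10) % 5 + 1   # data-dependent control flow
        for _ in range(steps):
            x = torch.relu(self.block(x))
        return x

out = DynamicDepthNet()(torch.randn(8, 256))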
3.2 PyTorch Execution Flow

To support this growing complexity, PyTorch (Paszke et al., 2019) foregoes the potential benefits of a graph-metaprogramming based approach to preserve the imperative programming model of Python. PyTorch applies this idea to all aspects of the deep learning workflow. For instance, neural network layers are typically expressed as Python classes whose constructors create and initialize their parameters, and whose forward methods process an input activation. Similarly, models are usually represented as classes that compose individual layers. The tight integration with the Python language makes PyTorch easy to debug and makes it very dynamic.

For back-propagation, PyTorch uses the operator overloading approach, which builds up a representation of the computed function every time it is executed. During back-propagation, PyTorch performs reverse-mode automatic differentiation, which computes the gradient of a scalar output with respect to a multivariate input in the reverse order.

For the execution of the computational graph, PyTorch maintains a strict separation between its control and data flow. PyTorch is designed to execute operators asynchronously on GPU by leveraging the CUDA stream mechanism (Cook, 2012) to queue CUDA kernel invocations to the GPU's hardware FIFO. This allows the system to overlap the execution of Python code on CPU with tensor operators on GPU, which saturates the GPU and reaches peak performance even in an interpreted language with fairly high overhead like Python.

4 OUR METHODS

Naive approach To address the GPU out-of-memory issue for dynamic computational graphs, a naive approach is to redirect all computations to CPU when the GPU memory is used up. However, even this naive solution is non-trivial. For example, we need to detach GPU 'leaf' tensors which require gradients from the graph to prevent creating GPU gradient tensors (which would cause OOM) during back-propagation. These GPU leaf tensors are replaced with CPU ones whose gradients are accumulated back to the corresponding GPU leaf tensors after the full back-propagation. This naive approach proved too slow to meet our expectations, so it is treated as the fallback or conservative policy when the major policy fails. It is also used as a baseline in our experiments.

Tensor swapping between GPU and CPU memory We implemented a major and more advanced method, based on the observation that CPU memory is typically much larger than GPU memory on deep-learning-oriented servers. The basic idea is simple: when there is not enough GPU memory, we pick some previous GPU tensors, copy their data to the CPU memory, and then release them to reserve enough memory for the following computations. Tensor re-materialization is required when such a tensor will serve as the input of a node later, or when it will be used to compute a gradient during back-propagation. The main performance bottleneck for this method is the device-host memory copying speed. This paper introduces several important optimizations. One of the most important optimizations is based on a deep understanding of the workflow of PyTorch, which lets us perform asynchronous memory copies without affecting the ongoing GPU computation. Building on this optimization, we further make use of the memory footprint history to schedule memory copies ahead of time during the forward phase, and we also use trace information during the back-propagation phase to pre-materialize GPU tensors. As we will show in our experiments, these optimizations greatly improve our performance. We describe the detailed implementation in Section 5.

5 IMPLEMENTATION

5.1 API

Previous frameworks like JANUS (Jeong et al., 2019) require modifications of the Python interpreter or the underlying deep learning platform, making them less practical for real-world applications. In contrast, our library only needs one line to prevent a dynamic model from running out of memory. For example, in Fig. 3, all a user needs to do is put part of the training code under the scope of the 'with' clause. This greatly benefits the adoption of our method.
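The exact API is shown in Fig. 3. As a rough sketch of what such a one-line entry point could look like, the snippet below uses a hypothetical memory_saver module and argument names that are our own assumptions for illustration, not the actual interface of our library.

import torch
import memory_saver   # hypothetical package name; see Fig. 3 for the actual API

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
data = [(torch.randn(64, 1024), torch.randn(64, 1024)) for _ in range(10)]

for x, y in data:
    # One 'with' clause around the memory-intensive part of the training step;
    # the context-manager name and its argument are illustrative assumptions.
    with memory_saver.enable(gpu_memory_limit="6GB"):
        loss = torch.nn.functional.mse_loss(model(x.cuda()), y.cuda())
        loss.backward()
    optimizer.step()
    optimizer.zero_grad()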

Figure 3. An example usage of our API.

5.2 Execution Tracer

The execution tracer traces the execution of the dynamic computational graph for both the forward pass and the back-propagation pass. Before each training iteration, the tracer replaces low-level PyTorch operations with a callback hook, so that they can be caught by our tracer during the forward pass. Because the PyTorch frontend loses track of a tensor once it goes out of scope in Python, we manually build a mirrored graph to maintain all tensors and their dependencies, which ensures the correctness of memory management and tensor re-materialization.

On the forward pass, the tracer creates a corresponding node in the mirrored graph and increases the execution order. Then it passes the node and the execution order to the forward policy.

On the back-propagation pass, we make use of a PyTorch API to register hooks that fire after the gradient of a tensor is computed. The tracer then receives a callback from the backend, with a tensor and its gradient as the arguments. The tracer finds the parent node of the tensor in the mirrored graph and passes it to the policy module.
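A much-simplified stand-in for the tracer can be built from public PyTorch hook APIs; the real tracer wraps lower-level operations and maintains the mirrored graph, which the sketch below omits.

import torch
import torch.nn as nn

class SimpleTracer:
    # Simplified: module-level forward pre-hooks and per-tensor gradient hooks
    # approximate the forward and backward callbacks described above.
    def __init__(self):
        self.execution_order = 0

    def attach(self, model: nn.Module):
        for module in model.modules():
            if len(list(module.children())) == 0:          # leaf modules only
                module.register_forward_pre_hook(self._before_forward)

    def _before_forward(self, module, inputs):
        self.execution_order += 1                          # here the policy would be invoked
        for t in inputs:
            if isinstance(t, torch.Tensor) and t.requires_grad:
                t.register_hook(lambda grad, m=module: self._after_grad(m, grad))

    def _after_grad(self, module, grad):
        pass                                               # backward-pass callback with the gradient

tracer = SimpleTracer()
model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 1))
tracer.attach(model)
model(torch.randn(4, 16, requires_grad=True)).sum().backward()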

5.3 Policy

The policy module is responsible for taking actions according to the current state captured by the tracer. There is an internal field called policy type which controls the behavior of the policy. Currently, we support 3 policy types: NULL, CONSERVATIVE and GPU MEMORY SWAP.

5.3.1 Policy Type: NULL

This is the initial policy of our framework before each training iteration. This policy indicates that we haven't encountered any GPU out-of-memory event yet. This policy tries to minimize the overhead of its execution by skipping the check of whether the inputs have been materialized. This policy is also responsible for checking the training history to decide whether it is necessary to pre-copy GPU memory to CPU. Pre-copying memory can be helpful in reducing the memory copying overhead if a GPU OOM is going to happen in the future.

5.3.2 Policy Type: CONSERVATIVE

This policy ensures that no OOM can happen even under the most extreme cases. If this policy type is turned on, it forwards all further computations to the CPU and prevents creating GPU gradients for parameters during the back-propagation pass. However, its cost is high because the CPU is typically 10-20x slower when dealing with common neural network computations, so we only use it as a fallback solution when all other policies have failed.

5.3.3 Policy Type: GPU MEMORY SWAP

If the current policy type is 'NULL' and we detect that executing the current node would cause OOM, the policy type switches to 'GPU MEMORY SWAP'. It is our major policy in case of GPU OOM. This policy estimates how much memory we are going to use for forwarding the current node, then it tries to reserve at least that amount of memory from the memory pool. The memory pool moves tensors to CPU and releases the GPU memory. We describe this in detail in subsection 5.4.

During the back-propagation pass, if our policy type was or currently is GPU MEMORY SWAP, our policy checks the dependencies of the current node to see which tensors need to be re-materialized. If the current policy is GPU MEMORY SWAP, we further enable pre-materialization, which we describe in subsection 5.4.4.

To put them together, we illustrate the basic ideas of the two algorithms for the forward pass (Alg. 1) and the back-propagation pass (Alg. 2) separately.
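The sketch below captures the policy-type transitions described above in a simplified form; the function signature and the use of a simple boolean fallback flag are assumptions for illustration only.

from enum import Enum, auto

class PolicyType(Enum):
    NULL = auto()
    CONSERVATIVE = auto()
    GPU_MEMORY_SWAP = auto()

def next_policy(current, estimated_forward_bytes, free_gpu_bytes, swap_failed=False):
    # Simplified transition rules: stay in NULL while memory suffices, switch to
    # GPU_MEMORY_SWAP once the next node would not fit, and fall back to
    # CONSERVATIVE only if swapping cannot reserve enough memory.
    if swap_failed:
        return PolicyType.CONSERVATIVE
    if current is PolicyType.NULL and estimated_forward_bytes > free_gpu_bytes:
        return PolicyType.GPU_MEMORY_SWAP
    return current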

Figure 4. An illustration of asynchronous memory copy. Note that the pinned memory allocation + async memory copy can be started ahead of time without blocking the main CPU thread. This enables further optimizations like pre-copying tensors. Copying a tensor from CPU to GPU memory works similarly.

5.4 Memory Pool

The memory pool component makes use of low-level CUDA operations to move data between GPU and CPU efficiently. This is the basis of most of our optimizations. We describe the two most important sub-modules here.

5.4.1 Victim Tensors Lookup

Victim tensors are tensors to be swapped out from GPU memory to CPU memory when GPU memory is insufficient. We look up victim tensors based on several heuristics which we found to improve our performance (a sketch follows the list below):

• Neural networks are basically linear due to the efficient reuse of deep features, so we use FIFO scheduling to pick victim tensors.

• Bigger tensors consume most of the GPU memory, so we exclude tensors whose size is smaller than 4K from swapping to reduce the frequency of tensor swapping.

• If a dematerialized tensor is reused as an input, it could be reused again, so after rematerialization we append it back to the tail of the FIFO scheduler.
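A minimal sketch of such a FIFO victim lookup, assuming the 4K size threshold mentioned above; the class and method names are our own, not the system's actual data structure.

from collections import deque
import torch

MIN_SWAP_BYTES = 4 * 1024   # tensors smaller than ~4K are never swapped

class VictimQueue:
    def __init__(self):
        self.queue = deque()   # FIFO order approximates the execution order

    def push(self, tensor: torch.Tensor):
        if tensor.element_size() * tensor.nelement() >= MIN_SWAP_BYTES:
            self.queue.append(tensor)

    def pop_victims(self, bytes_needed: int):
        # Pop tensors from the head until enough GPU memory would be freed.
        victims, freed = [], 0
        while self.queue and freed < bytes_needed:
            t = self.queue.popleft()
            victims.append(t)
            freed += t.element_size() * t.nelement()
        return victims

    def on_rematerialized(self, tensor: torch.Tensor):
        self.push(tensor)   # a tensor brought back may be reused soon, requeue at the tail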

5.4.2 Asynchronous Memory Copy

Asynchronous memory copy is the key supporting mechanism for the performance improvements in our paper. Synchronized copy using the original PyTorch API is very slow. This is what happens in that case:

1. PyTorch has to synchronize the default PyTorch CUDA stream to wait for all previous operations to be completed.

2. The main thread starts copying data to the CPU tensor. This blocks the submission of all further operations.

3. The main thread recovers from the blocking and starts to submit the next operations to the now-idle CUDA stream.

To improve the speed, we implemented asynchronous memory copy. It can be further decomposed into several sub-goals (a minimal sketch follows this list):

1. Avoid synchronizing the whole PyTorch CUDA stream. The synchronization happens because PyTorch doesn't know whether the operations related to the target tensor are ready; we only know that the operation must have been submitted to the PyTorch CUDA stream. Synchronizing the CUDA stream ensures that we will not copy unready data, but it is an overkill for our purpose. Here we use an important property of the computational graph: a tensor is ready if and only if (a) it is a 'leaf' tensor or (b) the node that outputs the tensor has completed its operation. Thus, we insert CUDA events after each operation using the tracer. Then we only need to wait for the event instead of the whole stream.

2. Avoid blocking the main thread while copying data. This is done by allocating a specific kind of CPU memory called pinned memory. Pinned memory is locked in physical memory and the GPU can perform DMA without triggering page faults in the OS, so memory copying can be done both asynchronously and faster. Although allocating pinned memory has a larger overhead compared to normal memory, its latency can be overlapped by concurrent execution with the CUDA stream, and we show that we get a net positive performance improvement in our experiments. We append a CUDA event after the asynchronous memory copy, so our main thread can synchronize with the copy progress.

The asynchronous memory copy is summarized in Fig. 4.
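A minimal sketch of the two sub-goals using public PyTorch CUDA primitives (a side stream, per-operation events, and pinned host buffers); it is an approximation of the mechanism in Fig. 4, not our system's actual code.

import torch

copy_stream = torch.cuda.Stream()   # a side stream so the copy does not block compute

def swap_out_async(gpu_tensor, ready_event):
    # `ready_event` was recorded right after the operation that produced the tensor,
    # so we wait only for that producer instead of synchronizing the whole stream.
    cpu_buffer = torch.empty(gpu_tensor.shape, dtype=gpu_tensor.dtype,
                             device="cpu", pin_memory=True)
    with torch.cuda.stream(copy_stream):
        copy_stream.wait_event(ready_event)
        cpu_buffer.copy_(gpu_tensor, non_blocking=True)   # async DMA into pinned memory
        done = torch.cuda.Event()
        done.record()                                     # signals when the copy has finished
    return cpu_buffer, done

def swap_in_async(cpu_buffer):
    with torch.cuda.stream(copy_stream):
        gpu_tensor = cpu_buffer.to("cuda", non_blocking=True)
        done = torch.cuda.Event()
        done.record()
    return gpu_tensor, done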

Algorithm 1 Policy for the Forward Pass
Input: graph G, history history, tensor memory pool pool, node nd, execution order exo, GPU memory used gmu
Variables: victim tensors vts, reserved memory size rs, node input tensors its, estimated forward memory efm

  if G.policy_type = NULL then
    // pre-copy GPU memory to CPU according to history
    rs ← history.get_reserve_size(exo, gmu)
    vts ← pool.pop_victim_tensors(rs)
    for tensor in vts do
      pool.enqueue_move_tensor_to_cpu(tensor)
    end for
  end if
  // rematerialize inputs
  its ← get_input_tensors(nd)
  for tensor in its do
    if not is_materialized(tensor) then
      pool.rematerialize(tensor)
    end if
  end for
  // memory check
  efm ← estimate_forward_memory(nd)
  if G.policy_type = NULL then
    if not pool.can_allocate_memory(efm) then
      G.policy_type ← GPU_MEMORY_SWAP
    end if
  else
    if G.policy_type = GPU_MEMORY_SWAP then
      pool.reserve_gpu_memory(efm)
    end if
  end if
  // forward the node and fetch outputs
  outputs ← forward(nd)
  return outputs

Algorithm 2 Policy for the Back-propagation Pass
Input: graph G, node nd, tensor memory pool pool, free memory utilization ratio fmr

  if G.policy_type = GPU_MEMORY_SWAP then
    free_mem ← pool.get_free_gpu_memory()
    // pre-materialize tensors up to the free memory
    rts ← pool.pop_dematerialized_tensors(free_mem * fmr)
    for tensor in rts do
      enqueue_move_tensor_to_gpu(tensor)
    end for
  end if
  // rematerialize the inputs necessary to compute the gradient
  for input in nd.inputs do
    rematerialize(input)
    if is_inplace_op(input.parent) then
      for pinput in input.parent.inputs do
        rematerialize(pinput)
      end for
    end if
  end for
  // back-propagation will continue after this routine exits

5.4.3 Make Use of the History

As indicated by Fig. 4, we could further accelerate data copying if we scheduled these memory copies earlier. We record the past memory footprint and average the memory usage at each execution order. We allocate a large pinned memory buffer for pre-copying, and it is reused for each training iteration. Before executing each node, we keep copying data to CPU unless (a) the copied memory size is already sufficient to cover the gap between the peak memory in history and the GPU memory limit, (b) less than 40% of GPU memory remains uncopied, or (c) the policy type is no longer 'NULL'.

5.4.4 Pre-materialization During Back-propagation

We observed that during the back-propagation pass, the GPU memory usage drops quickly. Thus we can also restore GPU tensors ahead of time, so that there is less overhead when re-materializing the tensors required to compute the gradients.
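As a rough sketch of the history mechanism of Section 5.4.3 (the get_reserve_size call in Algorithm 1), the class below averages the recorded memory usage per execution order and reserves the gap between the predicted peak and the memory limit; the estimator actually used in our system differs in its details.

from collections import defaultdict

class MemoryHistory:
    # Simplified version of history.get_reserve_size in Algorithm 1.
    def __init__(self, gpu_memory_limit_bytes):
        self.limit = gpu_memory_limit_bytes
        self.samples = defaultdict(list)   # execution order -> observed GPU usage in bytes

    def record(self, execution_order, gpu_bytes_used):
        self.samples[execution_order].append(gpu_bytes_used)

    def get_reserve_size(self, execution_order, gpu_bytes_used):
        # The real estimator also conditions on the current execution order and usage;
        # here we only use the historical peak for simplicity.
        if not self.samples:
            return 0
        predicted_peak = max(sum(v) / len(v) for v in self.samples.values())
        # Pre-copy enough to cover the gap between the historical peak and the limit.
        return max(0, int(predicted_peak) - self.limit)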

Figure 5. ResNet-152 performance ablation study under different memory limits. The 'sync tensor swapping' variant is the bare version that only performs synchronous tensor swapping between CPU and GPU without any of our optimizations.

6 EVALUATION

6.1 Metrics

Our metrics focus on how fast our solution is against different baselines. There are three different baselines used in our paper:

• A conservative baseline that we mentioned above, which computes on GPU first and offloads computations to CPU on OOM.

• A medium baseline (or bare solution) that implements tensor swapping between GPU and CPU without our optimizations.

• An optimal baseline that assumes no GPU memory limit and no overhead caused by the framework.

The unit we use for speed is the average running time per training iteration; smaller means faster.

6.2 Experimental Setup

We perform our experiments on a local server equipped with 2 Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz CPUs, 252 GB of CPU memory, and a Volta 100 GPU with a maximum of 12 GB memory. For benchmarking purposes, we can set the memory limit lower than 12 GB in our experiments.

Several typical network structures (Huang et al., 2017; Newell et al., 2016; He et al., 2016) are used here as the experimental targets or as building blocks of dynamic networks.

6.3 Memory Footprint

To demonstrate the correctness and effectiveness of our methods, we plot the memory footprint of ResNet-152. An input shape of (84, 3, 224, 224) is used here to represent a typical ImageNet training batch, and the memory limit is 6 GB (the red line). The green line separates the forward pass and the back-propagation pass.

We can witness strong evidence that:

1. The GPU memory stays below the 6 GB red line. This shows that our method has successfully defended against OOM.

2. The CPU memory increases at an earlier stage for our fully optimized model, which is evidence that our history-exploiting method works.

3. The GPU memory is filled up at the back-propagation stage, showing the effect of pre-materialization.

6.4 Microbenchmark

We constructed a simple microbenchmark to evaluate the performance of our solution against the other baselines.

The microbenchmark applies a 3x3 convolution to an input tensor of shape (128, 16, 224, 244) to get an output of the same shape, then continues this process using the current output as the input. The convolution is applied 32 times in total. After that, it treats the average value of the last output as the loss and backpropagates it through the constructed chain. The peak GPU memory during a training iteration is 9.57 GB.

Table 1 contains our results. The unit in this table is average time per training iteration, and we also show the speedup against the conservative baseline. Limited by the GPU-CPU memory copying speed, the bare solution does not have a clear edge over the baseline, especially when the GPU memory is relatively sufficient, while our fully optimized solution is clearly faster than either of them. This further indicates the effectiveness of our solution. We will perform an ablation study in the next section to find out how each optimization contributes to the final result.
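For reference, a sketch of the microbenchmark as described above; whether a single convolution module is reused or 32 separate ones are created is an assumption on our part.

import torch
import torch.nn as nn

conv = nn.Conv2d(16, 16, kernel_size=3, padding=1).cuda()   # 3x3 conv that preserves the shape

def one_iteration():
    x = torch.randn(128, 16, 224, 244, device="cuda")
    for _ in range(32):          # chain the convolution 32 times
        x = conv(x)
    x.mean().backward()          # average of the last output is the loss

one_iteration()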

Figure 6. Memory footprint for ResNet-152. Execution order is the node execution index in the graph. The left side is the footprint of the bare solution, and the right side is the footprint of our fully optimized solution. We can see that the GPU memory area is moved to the right due to our tensor pre-materialization, and the CPU memory area is moved to the left due to our exploitation of the training history.

6.5 Ablation Study

We use ResNet-152 for our ablation study. Although it is not an inherently dynamic model, its static property makes it suitable for analysis.

Fig. 5 shows the performance of the different ablated methods under different memory constraints. Peak memory size is the maximum amount of memory this model requires during a single training iteration, and memory limit is the GPU memory limit.

When peak memory size / memory limit = 1, no memory overflow would happen. In this case, our overhead compared to the optimal baseline is 0.35 s.

When peak memory size / memory limit > 1, our framework starts to deal with GPU out of memory. Our fully optimized version is 1.8-2.1 times faster than the bare version. Major performance gains come from the asynchronous GPU memory swapping, and making use of the history as well as pre-materialization also introduces promising performance improvements. In addition, the figure shows that the effect of exploiting history and pre-materialization is additive, which is reasonable since they are applied in separate stages.

6.6 Results on Dynamic Computational Graphs

In this section we construct three typical structures of a dynamic network. Almost all dynamic models can be constructed using a combination of the three structures, so the performance of our method on these structures can be used to indicate our performance on general dynamic models.

Linear Structure This kind of structure can be treated as a static neural network when isolated from a whole dynamic model, but inside a dynamic graph it is necessary for connecting different components. Since we have already tested ResNet in the ablation study, we use DenseNet-121 (Huang et al., 2017), a network with more complex interconnections, here. The GPU memory limit is set to 6 GB, and the input size is the same as in the previous ResNet experiment.

Branching Structure This kind of dynamic structure produces a scalar variable before branching and chooses one branch according to the variable. Since the variable depends on both the current input and the model weights, we cannot predict which branch we are going to use before running it. This structure is the core of hierarchical models. One application could be using the longer branch for hard examples to get better accuracy and using the shorter branch for simple examples to reduce latency (a sketch of such a block follows below).

For our experimental setting, the longer branch causes OOM while the shorter one does not. We use the top-1 and top-5 error rates on ImageNet to estimate the percentage of hard examples, and we simply choose 10%, a number between the top-1 and top-5 errors, for simplicity. We use the conservative baseline here for comparison. The conservative baseline has the same training iteration latency when there is no OOM.
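A minimal sketch of such a branching block; the channel counts, gate form and threshold are illustrative assumptions, not the exact experimental configuration.

import torch
import torch.nn as nn

class BranchingBlock(nn.Module):
    def __init__(self, channels=64, long_depth=6):
        super().__init__()
        self.gate = nn.Linear(channels, 1)
        self.short = nn.Conv2d(channels, channels, 3, padding=1)
        self.long = nn.Sequential(*[nn.Conv2d(channels, channels, 3, padding=1)
                                    for _ in range(long_depth)])

    def forward(self, x):
        # A scalar computed from both the input and the weights decides the branch,
        # so the branch taken is unknown before execution.
        score = torch.sigmoid(self.gate(x.mean(dim=(2, 3)))).mean()
        return self.long(x) if score > 0.9 else self.short(x)

y = BranchingBlock()(torch.randn(2, 64, 32, 32))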

Table 1. Microbenchmark results. Average time per training iteration; speedup over the conservative baseline in parentheses.

GPU memory limit   Conservative baseline   Bare solution     Our solution
2 GB               11.21 s                 5.15 s (2.17x)    1.40 s (8.01x)
4 GB                8.84 s                 5.87 s (1.51x)    1.02 s (8.67x)
6 GB                6.31 s                 5.27 s (1.19x)    1.32 s (4.78x)
8 GB                4.51 s                 3.29 s (1.37x)    1.21 s (3.73x)

The branching structure used in our experiment shares the same structure as ResNet-18 for its first, second and fourth stages, while its third stage is branched. The longer branch is 6 times deeper than the shorter one. Like in the other experiments, we use the 224 x 224 input resolution. The memory limit is still 6 GB for this experiment.

Recurrent Structure Recurrent structures are known to improve performance in a few famous works (Cai & Vasconcelos, 2018; Newell et al., 2016). The number of recurrent iterations affects accuracy, latency, and memory consumption, so it could be adjusted dynamically according to the current time limit and accuracy requirements.

We implemented a recurrent structure based on the Hourglass structure (Newell et al., 2016). The input size is (12, 3, 128, 128), which is typical for some keypoint detection problems. With the 6 GB memory limit, the network will go OOM if the iteration number is larger than 1. A 10% OOM rate is chosen, following the previous experiment.

Results Results are shown in Table 2. From the results we can conclude that even when OOM is a minor event during dynamic model training, our method is still significantly faster than the conservative baseline. This shows the importance of proper system design and optimizations for this problem.

Table 2. Experimental results for dynamic structures.

Structure    Baseline   Our method   Speedup
Linear       3.599 s    1.822 s      1.98x
Branching    1.221 s    0.921 s      1.33x
Recurrent    2.322 s    0.924 s      2.51x

7 CONCLUSIONS AND FUTURE WORKS

Deep learning models are becoming larger and larger, and GPU memory is becoming a bottleneck. While dynamic computational graph libraries like PyTorch are being widely adopted by machine learning researchers, previous works on the memory issue still mainly focus on static computational graphs. In this work, we target out-of-memory issues on GPU for dynamic computational graphs by developing a system that swaps the parameters and intermediate results of the dynamic computational graph between CPU and GPU memory asynchronously. Our system achieved great results for training neural networks under constrained memory settings.

As an ongoing project, our final goal is to build a system that overcomes the GPU memory barrier with a mixed strategy, including:

• Swapping memory back and forth between CPU and GPU memory, which is the method proposed in this paper.

• Rematerializing part of the tensors using the inputs of a node instead of swapping them back from CPU. In cases where computation is much faster than data movement, this strategy will accelerate the whole training.

• Automatically performing gradient accumulation if we find that all layer computations are independent along the batch dimension.

• More advanced methods (e.g. machine learning) for exploiting the history.

• If we have multiple devices and one of them meets an OOM issue, sending the intermediate results to another device and finishing the rest of the computation there.

For the last point, we surprisingly found that there does not exist a library to perform point-to-point communication for PyTorch GPU tensors. Therefore, we also developed a library named TensorTransfer to implement efficient point-to-point communication, and we plan to release it as an open-source project soon.
In the future, we will try to implement all of the previous strategies and build a system that can be really beneficial for new large model development.

REFERENCES

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

Anonymous. Reformer: The efficient transformer. In Submitted to International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rkgNKkHtvB. Under review.

Cai, Z. and Vasconcelos, N. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6154-6162, 2018.

Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., and Zhang, Z. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.

Chen, T., Xu, B., Zhang, C., and Guestrin, C. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016.

Collobert, R., Bengio, S., and Mariethoz, J. Torch: a modular machine learning library. Technical report, Idiap, 2002.

Cook, S. CUDA Programming: A Developer's Guide to Parallel Computing with GPUs. Newnes, 2012.

Courbariaux, M., Bengio, Y., and David, J.-P. BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pp. 3123-3131, 2015.

Cui, H., Zhang, H., Ganger, G. R., Gibbons, P. B., and Xing, E. P. GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter server. In Proceedings of the Eleventh European Conference on Computer Systems, pp. 4. ACM, 2016.

Gholami, A., Azad, A., Jin, P., Keutzer, K., and Buluc, A. Integrated model, batch, and domain parallelism in training neural networks. In Proceedings of the 30th Symposium on Parallelism in Algorithms and Architectures, pp. 77-86. ACM, 2018.

Gomez, A. N., Ren, M., Urtasun, R., and Grosse, R. B. The reversible residual network: Backpropagation without storing activations. In Advances in Neural Information Processing Systems, pp. 2214-2224, 2017.

Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.

He, B. and Yu, J. X. High-throughput transaction executions on graphics processors. Proceedings of the VLDB Endowment, 4(5):314-325, 2011.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700-4708, 2017.

Huang, Y., Cheng, Y., Bapna, A., Firat, O., Chen, D., Chen, M., Lee, H., Ngiam, J., Le, Q. V., Wu, Y., et al. GPipe: Efficient training of giant neural networks using pipeline parallelism. In Advances in Neural Information Processing Systems, pp. 103-112, 2019.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

Jain, P., Jain, A., Nrusimha, A., Gholami, A., Abbeel, P., Keutzer, K., Stoica, I., and Gonzalez, J. E. Checkmate: Breaking the memory wall with optimal tensor rematerialization. arXiv preprint arXiv:1910.02653, 2019.

Jeong, E., Cho, S., Yu, G.-I., Jeong, J. S., Shin, D.-J., and Chun, B.-G. JANUS: Fast and flexible deep learning via symbolic graph execution of imperative programs. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19), pp. 453-468, 2019.

Jia, Z., Lin, S., Qi, C. R., and Aiken, A. Exploring hidden dimensions in parallelizing convolutional neural networks. arXiv preprint arXiv:1802.04924, 2018.

Kosiorek, A. R., Sabour, S., Teh, Y. W., and Hinton, G. E. Stacked capsule autoencoders. arXiv preprint arXiv:1906.06818, 2019.

Krizhevsky, A. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.

Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., and Soricut, R. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942, 2019.

Narasimhan, S. NVIDIA clocks world's fastest BERT training time and largest transformer based model, paving path for advanced conversational AI. Technical report, 2019. URL https://devblogs.nvidia.com/training-bert-with-gpus/.

Narayanan, D., Harlap, A., Phanishayee, A., Seshadri, V., Devanur, N. R., Ganger, G. R., Gibbons, P. B., and Zaharia, M. PipeDream: Generalized pipeline parallelism for DNN training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pp. 1-15. ACM, 2019.

Neubig, G., Dyer, C., Goldberg, Y., Matthews, A., Ammar, W., Anastasopoulos, A., Ballesteros, M., Chiang, D., Clothiaux, D., Cohn, T., et al. DyNet: The dynamic neural network toolkit. arXiv preprint arXiv:1701.03980, 2017.

Newell, A., Yang, K., and Deng, J. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pp. 483-499. Springer, 2016.

Ott, M., Edunov, S., Grangier, D., and Auli, M. Scaling neural machine translation. arXiv preprint arXiv:1806.00187, 2018.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024-8035, 2019.

Shazeer, N., Cheng, Y., Parmar, N., Tran, D., Vaswani, A., Koanantakool, P., Hawkins, P., Lee, H., Hong, M., Young, C., et al. Mesh-TensorFlow: Deep learning for supercomputers. In Advances in Neural Information Processing Systems, pp. 10414-10423, 2018.

Team, T. T. D., Al-Rfou, R., Alain, G., Almahairi, A., Angermueller, C., Bahdanau, D., Ballas, N., Bastien, F., Bayer, J., Belikov, A., et al. Theano: A Python framework for fast computation of mathematical expressions. arXiv preprint arXiv:1605.02688, 2016.

Tokui, S., Oono, K., and Hido, S. Chainer: a next-generation open source framework for deep learning.

Wang, L., Ye, J., Zhao, Y., Wu, W., Li, A., Song, S. L., Xu, Z., and Kraska, T. SuperNeurons: Dynamic GPU memory management for training deep neural networks. In ACM SIGPLAN Notices, volume 53, pp. 41-53. ACM, 2018.

Wang, M., Huang, C.-c., and Li, J. Supporting very large models using automatic dataflow graph partitioning. In Proceedings of the Fourteenth EuroSys Conference 2019, pp. 26. ACM, 2019.

Yeo, H., Jung, Y., Kim, J., Shin, J., and Han, D. Neural adaptive content-aware internet video delivery. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pp. 645-661, 2018.

Zhang, J., Yeung, S. H., Shu, Y., He, B., and Wang, W. Efficient memory management for GPU-based deep learning systems. arXiv preprint arXiv:1903.06631, 2019.