AVOIDING GPU OOM FOR DYNAMIC COMPUTATIONAL GRAPHS TRAINING

Siyuan Zhuang

ABSTRACT

State-of-the-art models are becoming larger and larger, and GPU memory is becoming a bottleneck. Meanwhile, dynamic computational graph libraries like PyTorch have gained a lot of popularity among researchers, while previous work on overcoming the GPU memory bottleneck has mainly focused on static graphs. In this work, we target out-of-memory issues on GPU for dynamic computational graphs. Specifically, we developed a system that swaps the parameters and intermediate results of the dynamic computational graph between CPU and GPU memory asynchronously. With proper system design and optimizations, our system achieves a 2-3x speedup compared to various baselines, while the overhead is less than 0.5 second even for complex neural networks.

1 INTRODUCTION

Deep learning models are becoming larger and larger. For example, the largest parameter count of NLP models is actively growing: from 340M (BERT-Large) to 8.3B (GPT-2 8B) (Narasimhan, 2019). However, the computational power of GPUs and GPU memory are growing at a different speed. Therefore, GPU memory is becoming a bottleneck for large model development.

Previous works (Narayanan et al., 2019; Huang et al., 2019; Gholami et al., 2018; Wang et al., 2019; Jia et al., 2018; Shazeer et al., 2018; Jain et al., 2019; Chen et al., 2016; Zhang et al., 2019) propose different methods to tackle this issue. However, these works mainly focus on optimizing static/fixed models with a fixed computation resource configuration instead of dynamic ones. This limits their practical usage in both industry and academia. For industry, it is not feasible to use techniques like model parallelism for online serving, and the size of static/fixed models can be further optimized by pruning (Han et al., 2015), knowledge distillation (Hinton et al., 2015), quantization (Courbariaux et al., 2015), or other novel machine learning techniques (Lan et al., 2019). For academia, researchers typically would like to explore novel, dynamic and complex networks for research purposes. The growing interest in and adoption of PyTorch (Paszke et al., 2019), along with much more dynamic and flexible neural networks (Kosiorek et al., 2019; Cai & Vasconcelos, 2018; Yeo et al., 2018), shows that researchers prefer deep learning frameworks with flexibility. However, existing methods are not suitable for dynamic graphs.

Saving GPU memory for dynamic models is inherently harder than for static ones. As shown in Fig. 1, the execution path of dynamic models (as dynamic computational graphs) is highly dependent on the input, and it is impossible to find a time-optimal memory saving solution for general cases because it is impossible to predict the future without the required information.

Figure 1. An example of a dynamic computation graph. The computation order goes from left to right. Unlike static computational graphs, the shadowed areas that we haven't visited or created yet are unknown, and these parts could result in GPU out of memory.

To tackle this issue, in this paper, we propose a new system in which users can train dynamic models without worrying about GPU memory limitations, while exploiting heuristics of the graph structure and the underlying platform to keep performance high. Specifically, given that on modern GPU servers CPU memory is much cheaper and often much larger than GPU memory, we leverage the CPU memory and dynamically move tensors back and forth between CPU and GPU memory. By using a properly designed swapping policy and performing memory swapping asynchronously, our proposed system enables users to train much larger neural networks with little overhead.

We illustrate the design of our system in Fig. 2. The three major components of our system are a tracer, a policy and a memory pool. The tracer is triggered before the execution of a node in the computational graph (which could be a layer or a part of a layer of a dynamic neural network model). The tracer records the current node and execution context and passes them to the policy. The policy analyzes the GPU memory constraint, the execution dependencies and the execution history of previous training iterations, and decides an action to take. For example, if the GPU memory is not sufficient for the current node, the policy will operate on the memory pool to swap out some tensors from GPU memory and reserve enough GPU memory for the new results. The memory pool is responsible for managing the current GPU memory, recording the necessary information for memory control, and performing low-level memory operations. We implemented our system on PyTorch (Paszke et al., 2019) and provide a simple one-line API for end users to make use of our system.

Our results show that we can be 1.8-2.1x faster than a simple GPU memory swapping method on ResNet-152, and 2-3x faster compared to another baseline using conservative policies. Although there are still some gaps from optimal solutions, these results provide initial evidence that there is still very large room for dynamic computational graph memory optimization, and we believe that this will become an active research topic just like the optimizations on static graphs over the last two years. We present ideas about how to further improve performance and prevent OOM for dynamic networks in Section 7.

We summarize our contributions as follows:

• To our knowledge, this is one of the first papers targeting out-of-memory issues on GPU for dynamic neural network models.

• We propose a complete system that performs memory swapping between CPU and GPU memory with an easy-to-use API for end users.

• We show that with a tight connection to the underlying graph execution system (i.e. PyTorch), proper optimization and design, memory swapping can be very effective on dynamic computational graphs.

• As an on-going project, we also discuss challenges and opportunities of this research direction and present some preliminary results as future works.

Correspondence to: Siyuan Zhuang.

2 RELATED WORKS

2.1 Data Parallelism and Gradient Accumulation

The dominant and easiest way to parallelize a deep learning model is data parallelism. Data parallelism splits the input data into different subsets and runs different subsets of the input on different devices. Most deep learning frameworks (e.g. PyTorch (Paszke et al., 2019), TensorFlow (Abadi et al., 2016)) implement data parallelism as an API call. To utilize data parallelism in a single-device setting, gradient accumulation (Ott et al., 2018) performs multiple rounds of forward and backward propagation in a single training iteration to accumulate gradients before a gradient update. This enables large training batch sizes, which is necessary for large model training.

The biggest issue for data parallelism is that each device still needs to hold a full copy of the model and perform the whole model execution on a single device, which is impossible for larger models (Narasimhan, 2019). Also, some neural network layers will behave differently given different input sizes (e.g. batch normalization (Ioffe & Szegedy, 2015)), which will introduce uncertainty during model development.
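For concreteness, the following is a minimal PyTorch-style sketch of gradient accumulation; the model, loss function, optimizer and data are placeholders for illustration, not code from this paper.

import torch

def train_with_accumulation(model, loss_fn, optimizer, batches, accum_steps=4):
    # Accumulate gradients over `accum_steps` micro-batches before one update,
    # emulating a larger batch on a single device.
    optimizer.zero_grad()
    for i, (x, y) in enumerate(batches):
        loss = loss_fn(model(x), y) / accum_steps   # scale so the sum matches the large batch
        loss.backward()                             # gradients are summed into .grad
        if (i + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()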

2.2 Model Parallelism

As a way to run larger and larger deep neural networks, in model parallelism a DNN model is partitioned across the available workers, with each worker evaluating and performing updates for only a subset of the model's parameters for all inputs.

One way to implement model parallelism is to split some dimensions of neural network layers and execute them in a SIMD manner on multiple machines. Early work (Krizhevsky, 2014) parallelizes specific classes of DNN models and is limited in flexibility and generality. The Mesh-TensorFlow library (Shazeer et al., 2018) provides users an interface to specify any tensor dimensions to be split across any dimensions of a multi-dimensional mesh of processors. Recently, a set of works (Gholami et al., 2018; Jia et al., 2018; Wang et al., 2019) proposed to automatically find optimal model-parallel execution plans that achieve the best speed.

Another way is to place different parts of a neural network on different devices. Directly applying this leads to extremely poor performance: since neural network layers depend on each other in a sequential order and, during training, both forward and backward propagation need to be performed, a naive implementation will have only one device running at any given time. GPipe (Huang et al., 2019) mitigates this issue by splitting an input batch into groups and pipelining the execution of the groups on different devices. Although it achieves high speedup, there is still a significant amount of time during which some devices are idle. Pipedream (Narayanan et al., 2019) tackles this issue by relaxing the synchronous training setup of GPipe and reduces the idle time to zero. This type of model parallelism is sometimes referred to as pipeline parallelism.

Although it partly solves the memory issue, the model size for model parallelism is still constrained by the total memory of all devices. Besides, most of the existing model parallel systems require the input network to be static, which is infeasible for the popular dynamic computational graph frameworks (e.g. PyTorch (Paszke et al., 2019)). Also, for pipeline parallelism, systems like Pipedream (Narayanan et al., 2019) will execute a different model or a different training schedule compared to the original one, making it hard to debug during model development.
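To make the second approach concrete, the following is a minimal sketch (not taken from the systems cited above) of placing two halves of a network on different GPUs; layer sizes and device names are arbitrary assumptions.

import torch
import torch.nn as nn

class TwoDeviceNet(nn.Module):
    # The first half lives on cuda:0 and the second half on cuda:1. With a naive
    # forward pass only one device is busy at a time, which is the idle-time
    # problem that GPipe and Pipedream address.
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        h = self.part1(x.to("cuda:0"))
        return self.part2(h.to("cuda:1"))   # activations are copied across devices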

Figure 2. An overview of our framework architecture.

2.3 Efficient Tensor Rematerialization

For neural network training, most of the memory usage actually goes to storing the intermediate activations of the different layers of a network, rather than the parameters and the gradients of the network (Chen et al., 2016). To reduce the memory usage, Chen et al. (2016) proposed to store only O(√n) intermediate layers for an n-layer network at the cost of an extra forward pass. Jain et al. (2019) further extend this idea with a more systematic way to find the optimal checkpointing strategy for a given memory budget.

Relaxing the constraint of computing exactly the same neural network, a line of works designed a special kind of network that enables reversibility: given the output of a layer, one can compute the input of that layer at the computational cost of an extra forward computation. Gomez et al. (2017) modify the original ResNet into a reversible network that reaches nearly identical performance to the original one. Anonymous (2020) extend this idea to the Transformer network for natural language processing and also achieve near-identical performance to the original Transformer network in language modeling.

Compared to data and model parallelism, tensor rematerialization actually enables us to train a larger network and break the original memory bottleneck. However, tensor rematerialization trades off memory cost against extra computation cost, making the already slow execution of large networks even slower. Also, for reversible networks, the change of network structure introduces new uncertainty to the model, which makes new model development harder. Furthermore, most of the optimizations here require knowledge of the whole network in advance and are infeasible for dynamic frameworks and dynamic neural networks.
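As an illustration of the checkpointing idea of Chen et al. (2016), the sketch below uses PyTorch's torch.utils.checkpoint utility on an assumed toy chain of layers; it is not the implementation used by the works above or by this paper.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# A 64-block chain; only activations at segment boundaries are kept, and the
# rest are recomputed during backward (memory saved at the cost of extra forwards).
blocks = nn.Sequential(*[nn.Sequential(nn.Linear(512, 512), nn.ReLU()) for _ in range(64)])
x = torch.randn(32, 512, requires_grad=True)
out = checkpoint_sequential(blocks, 8, x)   # 8 segments, roughly the sqrt(n) trade-off
out.sum().backward()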
2.4 Memory Swapping

On modern GPU servers, CPU memory is much cheaper and often much larger than GPU memory. Therefore, the memory load of neural network training can be reduced by swapping tensors to CPU memory when they are not in use, and swapping them back to GPU memory right before their next access. Ideally, the communication in both directions should be hidden under computation (via separate streams) to minimize the communication overhead (He & Yu, 2011).

In GeePS (Cui et al., 2016), the decision of which layer or which tensor to swap is made by the end user. It requires the end user to have a good understanding of the model, including the execution order, memory and time consumption of each layer. SuperNeurons (Wang et al., 2018) restricts swapping to convolutional layers only; big tensors of other layers are not considered for swapping, and it also requires the end user's intervention. Zhang et al. (2019) proposed a near-optimal way to swap memory automatically. The overhead of these methods is relatively small, but most previous works based their analysis on a static neural network, which is infeasible for the recently popular dynamic neural networks.

3 BACKGROUND

3.1 Dynamic Computational Graph

Most deep learning libraries follow one of two execution patterns: static computational graphs (e.g. TensorFlow (Abadi et al., 2016), MXNet (Chen et al., 2015), Theano (Team et al., 2016)) or dynamic computational graphs (e.g. PyTorch (Paszke et al., 2019)). The static frameworks construct a static dataflow graph that represents the computation and can then be applied repeatedly to batches of data. Users need to specify the whole graph before execution. This approach provides visibility into the whole computation ahead of time and can theoretically be leveraged to improve performance and scalability. However, it comes at the cost of ease of use, ease of debugging, and flexibility of the types of computation that can be represented.

Therefore, dynamic frameworks like PyTorch are becoming more and more popular among machine learning researchers and have been adopted more in applications (Paszke et al., 2019). Dynamic frameworks define the computational graph at the same time as the actual network computation. The computational graph is defined during execution and is recorded for back-propagation. Early dynamic frameworks (Tokui et al.; Collobert et al., 2002; Neubig et al., 2017) often suffer from high computational overhead or use a less expressive programming language. However, PyTorch (Paszke et al., 2019) overcame these issues and gained a lot of popularity in the machine learning research community.
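The following toy module illustrates why such graphs are dynamic: the number of executed blocks depends on the input values, so the recorded autograd graph (and its memory footprint) differs across iterations. The module itself is an illustrative assumption, not one of the networks evaluated in this paper.

import torch
import torch.nn as nn

class DynamicDepthNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.block = nn.Linear(256, 256)

    def forward(self, x):
        steps = int(x.abs().mean().item() * 10) % 5 + 1   # data-dependent control flow
        for _ in range(steps):
            x = torch.relu(self.block(x))
        return x

out = DynamicDepthNet()(torch.randn(8, 256))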
3.2 PyTorch Execution Flow

To support this growing complexity, PyTorch (Paszke et al., 2019) foregoes the potential benefits of a graph-metaprogramming based approach to preserve the imperative programming model of Python. PyTorch applies this idea to all aspects of the deep learning workflow. For instance, neural network layers are typically expressed as Python classes whose constructors create and initialize their parameters, and whose forward methods process an input activation. Similarly, models are usually represented as classes that compose individual layers. The tight integration with the Python language makes PyTorch easy to debug and makes it very dynamic.

For back-propagation, PyTorch uses the operator overloading approach, which builds up a representation of the computed function every time it is executed. During back-propagation, PyTorch performs reverse-mode automatic differentiation, which computes the gradient of a scalar output with respect to a multivariate input in the reverse order.

For the execution of the computational graph, PyTorch maintains a strict separation between its control and data flow. PyTorch is designed to execute operators asynchronously on GPU by leveraging the CUDA stream mechanism (Cook, 2012) to queue CUDA kernel invocations to the GPU's hardware FIFO. This allows the system to overlap the execution of Python code on CPU with tensor operators on GPU, which saturates the GPU and reaches peak performance even in an interpreted language with fairly high overhead like Python.

4 OUR METHODS

Naive approach To address the GPU out-of-memory issue for dynamic computational graphs, a naive approach is to redirect all computations to CPU when the GPU memory is used up. However, even this naive solution is non-trivial. For example, we need to detach GPU 'leaf' tensors which require gradients from the graph to prevent creating GPU gradient tensors (which would cause OOM) during back-propagation. These GPU leaf tensors are replaced with CPU ones whose gradients are accumulated back to the corresponding GPU leaf tensors after the full back-propagation. This naive approach proved too slow to meet our expectations, so it is treated as the fallback or conservative policy when the major policy fails. It is also used as a baseline in our experiments.

Tensor swapping between GPU and CPU memory We implemented a major and more advanced method, based on the observation that CPU memory is typically much larger than GPU memory on deep-learning-oriented servers. The basic idea is simple: when there is not enough GPU memory, we pick some previous GPU tensors, copy their data to the CPU memory, and then release them to reserve enough memory for the following computations. Tensor re-materialization is required when such a tensor will serve as the input of a node later, or when it will be used to compute a gradient during back-propagation. The main performance bottleneck for this method is the device-host memory copying speed. This paper introduces several important optimizations. One of the most important optimizations is based on a deep understanding of the workflow of PyTorch, which lets us perform asynchronous memory copies without affecting the ongoing GPU computation. Building on this optimization, we further make use of the memory footprint history to schedule memory copies ahead of time during the forward phase, and we also use trace information during the back-propagation phase to pre-materialize GPU tensors. As we will show in our experiments, these optimizations greatly improve our performance. We describe the detailed implementation in Section 5.

5 IMPLEMENTATION

5.1 API

Previous frameworks like JANUS (Jeong et al., 2019) require modifications of the Python interpreter or the underlying deep learning platform, making them less practical for real-world applications. In contrast, our library only needs one line to prevent a dynamic model from running out of memory. For example, in Fig. 3, all a user needs to do is put part of the training code under the scope of the 'with' clause. This greatly benefits the adoption of our method.
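The exact API is shown in Fig. 3. As a rough sketch of what such a one-line entry point could look like, the snippet below uses a hypothetical memory_saver module and argument names that are our own assumptions for illustration, not the actual interface of our library.

import torch
import memory_saver   # hypothetical package name; see Fig. 3 for the actual API

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
data = [(torch.randn(64, 1024), torch.randn(64, 1024)) for _ in range(10)]

for x, y in data:
    # One 'with' clause around the memory-intensive part of the training step;
    # the context-manager name and its argument are illustrative assumptions.
    with memory_saver.enable(gpu_memory_limit="6GB"):
        loss = torch.nn.functional.mse_loss(model(x.cuda()), y.cuda())
        loss.backward()
    optimizer.step()
    optimizer.zero_grad()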

Figure 3. An example usage of our API.

5.2 Execution Tracer

The execution tracer traces the execution of the dynamic computational graph for both the forward pass and the back-propagation pass. Before each training iteration, the tracer replaces low-level PyTorch operations with a callback hook, so that they can be caught by our tracer during the forward pass. Because the PyTorch frontend loses track of a tensor once it goes out of scope in Python, we manually build a mirrored graph to maintain all tensors and their dependencies, which ensures the correctness of memory management and tensor re-materialization.

On the forward pass, the tracer creates a corresponding node in the mirrored graph and increases the execution order. Then it passes the node and the execution order to the forward policy.

On the back-propagation pass, we make use of a PyTorch API to register hooks that fire after the gradient of a tensor is computed. The tracer then receives a callback from the backend, with a tensor and its gradient as the arguments. The tracer finds the parent node of the tensor in the mirrored graph and passes it to the policy module.
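A much-simplified stand-in for the tracer can be built from public PyTorch hook APIs; the real tracer wraps lower-level operations and maintains the mirrored graph, which the sketch below omits.

import torch
import torch.nn as nn

class SimpleTracer:
    # Simplified: module-level forward pre-hooks and per-tensor gradient hooks
    # approximate the forward and backward callbacks described above.
    def __init__(self):
        self.execution_order = 0

    def attach(self, model: nn.Module):
        for module in model.modules():
            if len(list(module.children())) == 0:          # leaf modules only
                module.register_forward_pre_hook(self._before_forward)

    def _before_forward(self, module, inputs):
        self.execution_order += 1                          # here the policy would be invoked
        for t in inputs:
            if isinstance(t, torch.Tensor) and t.requires_grad:
                t.register_hook(lambda grad, m=module: self._after_grad(m, grad))

    def _after_grad(self, module, grad):
        pass                                               # backward-pass callback with the gradient

tracer = SimpleTracer()
model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 1))
tracer.attach(model)
model(torch.randn(4, 16, requires_grad=True)).sum().backward()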

5.3 Policy

The policy module is responsible for taking actions according to the current state captured by the tracer. There is an internal field called policy type which controls the behavior of the policy. Currently, we support 3 policy types: NULL, CONSERVATIVE and GPU MEMORY SWAP.

5.3.1 Policy Type: NULL

This is the initial policy of our framework before each training iteration. This policy indicates that we haven't encountered any GPU out-of-memory event yet. This policy tries to minimize the overhead of its execution by skipping the check of whether the inputs have been materialized. This policy is also responsible for checking the training history to decide whether it is necessary to pre-copy GPU memory to CPU. Pre-copying memory can be helpful in reducing the memory copying overhead if a GPU OOM is going to happen in the future.

5.3.2 Policy Type: CONSERVATIVE

This policy ensures that no OOM can happen even under the most extreme cases. If this policy type is turned on, it forwards all further computations to the CPU and prevents creating GPU gradients for parameters during the back-propagation pass. However, its cost is high because the CPU is typically 10-20x slower when dealing with common neural network computations, so we only use it as a fallback solution when all other policies have failed.

5.3.3 Policy Type: GPU MEMORY SWAP

If the current policy type is 'NULL' and we detect that executing the current node would cause OOM, the policy type switches to 'GPU MEMORY SWAP'. It is our major policy in case of GPU OOM. This policy estimates how much memory we are going to use for forwarding the current node, then it tries to reserve at least that amount of memory from the memory pool. The memory pool moves tensors to CPU and releases the GPU memory. We describe this in detail in subsection 5.4.

During the back-propagation pass, if our policy type was or currently is GPU MEMORY SWAP, our policy checks the dependencies of the current node to see which tensors need to be re-materialized. If the current policy is GPU MEMORY SWAP, we further enable pre-materialization, which we describe in subsection 5.4.4.

To put them together, we illustrate the basic ideas of the two algorithms for the forward pass (Alg. 1) and the back-propagation pass (Alg. 2) separately.
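The sketch below captures the policy-type transitions described above in a simplified form; the function signature and the use of a simple boolean fallback flag are assumptions for illustration only.

from enum import Enum, auto

class PolicyType(Enum):
    NULL = auto()
    CONSERVATIVE = auto()
    GPU_MEMORY_SWAP = auto()

def next_policy(current, estimated_forward_bytes, free_gpu_bytes, swap_failed=False):
    # Simplified transition rules: stay in NULL while memory suffices, switch to
    # GPU_MEMORY_SWAP once the next node would not fit, and fall back to
    # CONSERVATIVE only if swapping cannot reserve enough memory.
    if swap_failed:
        return PolicyType.CONSERVATIVE
    if current is PolicyType.NULL and estimated_forward_bytes > free_gpu_bytes:
        return PolicyType.GPU_MEMORY_SWAP
    return current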

Figure 4. An illustration of asynchronous memory copy. Note that the pinned memory allocation + async memory copy can be started ahead of time without blocking the main CPU thread. This enables further optimizations like pre-copying tensors. Copying a tensor from CPU to GPU memory works similarly.

5.4 Memory Pool

The memory pool component makes use of low-level CUDA operations to move data between GPU and CPU efficiently. This is the basis of most of our optimizations. We describe the two most important sub-modules here.

5.4.1 Victim Tensors Lookup

Victim tensors are tensors to be swapped out from GPU memory to CPU memory when GPU memory is insufficient. We look up victim tensors based on several heuristics which we found to improve our performance (a sketch follows the list below):

• Neural networks are basically linear due to the efficient reuse of deep features, so we use FIFO scheduling to pick victim tensors.

• Bigger tensors consume most of the GPU memory, so we exclude tensors whose size is smaller than 4K from swapping to reduce the frequency of tensor swapping.

• If a dematerialized tensor is reused as an input, it could be reused again, so after rematerialization we append it back to the tail of the FIFO scheduler.
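A minimal sketch of such a FIFO victim lookup, assuming the 4K size threshold mentioned above; the class and method names are our own, not the system's actual data structure.

from collections import deque
import torch

MIN_SWAP_BYTES = 4 * 1024   # tensors smaller than ~4K are never swapped

class VictimQueue:
    def __init__(self):
        self.queue = deque()   # FIFO order approximates the execution order

    def push(self, tensor: torch.Tensor):
        if tensor.element_size() * tensor.nelement() >= MIN_SWAP_BYTES:
            self.queue.append(tensor)

    def pop_victims(self, bytes_needed: int):
        # Pop tensors from the head until enough GPU memory would be freed.
        victims, freed = [], 0
        while self.queue and freed < bytes_needed:
            t = self.queue.popleft()
            victims.append(t)
            freed += t.element_size() * t.nelement()
        return victims

    def on_rematerialized(self, tensor: torch.Tensor):
        self.push(tensor)   # a tensor brought back may be reused soon, requeue at the tail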

5.4.2 Asynchronous Memory Copy

Asynchronous memory copy is the key supporting mechanism for the performance improvements in our paper. Synchronized copy using the original PyTorch API is very slow. This is what happens in that case:

1. PyTorch has to synchronize the default PyTorch CUDA stream to wait for all previous operations to be completed.

2. The main thread starts copying data to the CPU tensor. This blocks the submission of all further operations.

3. The main thread recovers from the blocking and starts to submit the next operations to the now-idle CUDA stream.

To improve the speed, we implemented asynchronous memory copy. It can be further decomposed into several sub-goals (a minimal sketch follows this list):

1. Avoid synchronizing the whole PyTorch CUDA stream. The synchronization happens because PyTorch doesn't know whether the operations related to the target tensor are ready; we only know that the operation must have been submitted to the PyTorch CUDA stream. Synchronizing the CUDA stream ensures that we will not copy unready data, but it is an overkill for our purpose. Here we use an important property of the computational graph: a tensor is ready if and only if (a) it is a 'leaf' tensor or (b) the node that outputs the tensor has completed its operation. Thus, we insert CUDA events after each operation using the tracer. Then we only need to wait for the event instead of the whole stream.

2. Avoid blocking the main thread while copying data. This is done by allocating a specific kind of CPU memory called pinned memory. Pinned memory is locked in physical memory and the GPU can perform DMA without triggering page faults in the OS, so memory copying can be done both asynchronously and faster. Although allocating pinned memory has a larger overhead compared to normal memory, its latency can be overlapped by concurrent execution with the CUDA stream, and we show that we get a net positive performance improvement in our experiments. We append a CUDA event after the asynchronous memory copy, so our main thread can synchronize with the copy progress.

The asynchronous memory copy is summarized in Fig. 4.
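A minimal sketch of the two sub-goals using public PyTorch CUDA primitives (a side stream, per-operation events, and pinned host buffers); it is an approximation of the mechanism in Fig. 4, not our system's actual code.

import torch

copy_stream = torch.cuda.Stream()   # a side stream so the copy does not block compute

def swap_out_async(gpu_tensor, ready_event):
    # `ready_event` was recorded right after the operation that produced the tensor,
    # so we wait only for that producer instead of synchronizing the whole stream.
    cpu_buffer = torch.empty(gpu_tensor.shape, dtype=gpu_tensor.dtype,
                             device="cpu", pin_memory=True)
    with torch.cuda.stream(copy_stream):
        copy_stream.wait_event(ready_event)
        cpu_buffer.copy_(gpu_tensor, non_blocking=True)   # async DMA into pinned memory
        done = torch.cuda.Event()
        done.record()                                     # signals when the copy has finished
    return cpu_buffer, done

def swap_in_async(cpu_buffer):
    with torch.cuda.stream(copy_stream):
        gpu_tensor = cpu_buffer.to("cuda", non_blocking=True)
        done = torch.cuda.Event()
        done.record()
    return gpu_tensor, done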

Algorithm 1 Policy for the Forward Pass
Input: graph G, history history, tensor memory pool pool, node nd, execution order exo, GPU memory used gmu
Variables: victim tensors vts, reserved memory size rs, node input tensors its, estimated forward memory efm

  if G.policy_type = NULL then
    // pre-copy GPU memory to CPU according to history
    rs ← history.get_reserve_size(exo, gmu)
    vts ← pool.pop_victim_tensors(rs)
    for tensor in vts do
      pool.enqueue_move_tensor_to_cpu(tensor)
    end for
  end if
  // rematerialize inputs
  its ← get_input_tensors(nd)
  for tensor in its do
    if not is_materialized(tensor) then
      pool.rematerialize(tensor)
    end if
  end for
  // memory check
  efm ← estimate_forward_memory(nd)
  if G.policy_type = NULL then
    if not pool.can_allocate_memory(efm) then
      G.policy_type ← GPU_MEMORY_SWAP
    end if
  else
    if G.policy_type = GPU_MEMORY_SWAP then
      pool.reserve_gpu_memory(efm)
    end if
  end if
  // forward the node and fetch outputs
  outputs ← forward(nd)
  return outputs

Algorithm 2 Policy for the Back-propagation Pass
Input: graph G, node nd, tensor memory pool pool, free memory utilization ratio fmr

  if G.policy_type = GPU_MEMORY_SWAP then
    free_mem ← pool.get_free_gpu_memory()
    // pre-materialize tensors up to the free memory
    rts ← pool.pop_dematerialized_tensors(free_mem * fmr)
    for tensor in rts do
      enqueue_move_tensor_to_gpu(tensor)
    end for
  end if
  // rematerialize the inputs necessary to compute the gradient
  for input in nd.inputs do
    rematerialize(input)
    if is_inplace_op(input.parent) then
      for pinput in input.parent.inputs do
        rematerialize(pinput)
      end for
    end if
  end for
  // back-propagation will continue after this routine exits

5.4.3 Make Use of the History

As indicated by Fig. 4, we could further accelerate data copying if we scheduled these memory copies earlier. We record the past memory footprint and average the memory usage at each execution order. We allocate a large pinned memory buffer for pre-copying, and it is reused for each training iteration. Before executing each node, we keep copying data to CPU unless (a) the copied memory size is already sufficient to cover the gap between the peak memory in history and the GPU memory limit, (b) less than 40% of GPU memory remains uncopied, or (c) the policy type is no longer 'NULL'.

5.4.4 Pre-materialization During Back-propagation

We observed that during the back-propagation pass, the GPU memory usage drops quickly. Thus we can also restore GPU tensors ahead of time, so that there is less overhead when re-materializing the tensors required to compute the gradients.
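As a rough sketch of the history mechanism of Section 5.4.3 (the get_reserve_size call in Algorithm 1), the class below averages the recorded memory usage per execution order and reserves the gap between the predicted peak and the memory limit; the estimator actually used in our system differs in its details.

from collections import defaultdict

class MemoryHistory:
    # Simplified version of history.get_reserve_size in Algorithm 1.
    def __init__(self, gpu_memory_limit_bytes):
        self.limit = gpu_memory_limit_bytes
        self.samples = defaultdict(list)   # execution order -> observed GPU usage in bytes

    def record(self, execution_order, gpu_bytes_used):
        self.samples[execution_order].append(gpu_bytes_used)

    def get_reserve_size(self, execution_order, gpu_bytes_used):
        # The real estimator also conditions on the current execution order and usage;
        # here we only use the historical peak for simplicity.
        if not self.samples:
            return 0
        predicted_peak = max(sum(v) / len(v) for v in self.samples.values())
        # Pre-copy enough to cover the gap between the historical peak and the limit.
        return max(0, int(predicted_peak) - self.limit)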

Figure 5. ResNet-152 performance ablation study under different memory limits. The 'sync tensor swapping' variant is the bare version that only performs synchronous tensor swapping between CPU and GPU without any of our optimizations.

6 EVALUATION

6.1 Metrics

Our metrics focus on how fast our solution is against different baselines. There are three different baselines used in our paper:

• A conservative baseline that we mentioned above, which computes on GPU first and offloads computations to CPU on OOM.

• A medium baseline (or bare solution) that implements tensor swapping between GPU and CPU without our optimizations.

• An optimal baseline that assumes no GPU memory limit and no overhead caused by the framework.

The unit we use for speed is the average running time per training iteration; smaller means faster.

6.2 Experimental Setup

We perform our experiments on a local server equipped with 2 Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz CPUs, 252 GB of CPU memory, and a Volta 100 GPU with a maximum of 12 GB memory. For benchmarking purposes, we can set the memory limit lower than 12 GB in our experiments.

Several typical network structures (Huang et al., 2017; Newell et al., 2016; He et al., 2016) are used here as the experimental targets or as building blocks of dynamic networks.

6.3 Memory Footprint

To demonstrate the correctness and effectiveness of our methods, we plot the memory footprint of ResNet-152. An input shape of (84, 3, 224, 224) is used here to represent a typical ImageNet training batch, and the memory limit is 6 GB (the red line). The green line separates the forward pass and the back-propagation pass.

We can witness strong evidence that:

1. The GPU memory stays below the 6 GB red line. This shows that our method has successfully defended against OOM.

2. The CPU memory increases at an earlier stage for our fully optimized model, which is evidence that our history-exploiting method works.

3. The GPU memory is filled up at the back-propagation stage, showing the effect of pre-materialization.

6.4 Microbenchmark

We constructed a simple microbenchmark to evaluate the performance of our solution against the other baselines.

The microbenchmark applies a 3x3 convolution to an input tensor of shape (128, 16, 224, 244) to get an output of the same shape, then continues this process using the current output as the input. The convolution is applied 32 times in total. After that, it treats the average value of the last output as the loss and backpropagates it through the constructed chain. The peak GPU memory during a training iteration is 9.57 GB.

Table 1 contains our results. The unit in this table is average time per training iteration, and we also show the speedup against the conservative baseline. Limited by the GPU-CPU memory copying speed, the bare solution does not have a clear edge over the baseline, especially when the GPU memory is relatively sufficient, while our fully optimized solution is clearly faster than either of them. This further indicates the effectiveness of our solution. We will perform an ablation study in the next section to find out how each optimization contributes to the final result.
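For reference, a sketch of the microbenchmark as described above; whether a single convolution module is reused or 32 separate ones are created is an assumption on our part.

import torch
import torch.nn as nn

conv = nn.Conv2d(16, 16, kernel_size=3, padding=1).cuda()   # 3x3 conv that preserves the shape

def one_iteration():
    x = torch.randn(128, 16, 224, 244, device="cuda")
    for _ in range(32):          # chain the convolution 32 times
        x = conv(x)
    x.mean().backward()          # average of the last output is the loss

one_iteration()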

Figure 6. Memory footprint for ResNet-152. Execution order is the node execution index in the graph. The left side is the footprint of the bare solution, and the right side is the footprint of our fully optimized solution. We can see that the GPU memory area is moved to the right due to our tensor pre-materialization, and the CPU memory area is moved to the left due to our exploitation of the training history.

6.5 Ablation Study

We use ResNet-152 for our ablation study. Although it is not an inherently dynamic model, its static property makes it suitable for analysis.

Fig. 5 shows the performance of the different ablated methods under different memory constraints. Peak memory size is the maximum amount of memory this model requires during a single training iteration, and memory limit is the GPU memory limit.

When peak memory size / memory limit = 1, no memory overflow would happen. In this case, our overhead compared to the optimal baseline is 0.35 s.

When peak memory size / memory limit > 1, our framework starts to deal with GPU out of memory. Our fully optimized version is 1.8-2.1 times faster than the bare version. Major performance gains come from the asynchronous GPU memory swapping, and making use of the history as well as pre-materialization also introduces promising performance improvements. In addition, the figure shows that the effect of exploiting history and pre-materialization is additive, which is reasonable since they are applied in separate stages.

6.6 Results on Dynamic Computational Graphs

In this section we construct three typical structures of a dynamic network. Almost all dynamic models can be constructed using a combination of the three structures, so the performance of our method on these structures can be used to indicate our performance on general dynamic models.

Linear Structure This kind of structure can be treated as a static neural network when isolated from a whole dynamic model, but inside a dynamic graph it is necessary for connecting different components. Since we have already tested ResNet in the ablation study, we use DenseNet-121 (Huang et al., 2017), a network with more complex interconnections, here. The GPU memory limit is set to 6 GB, and the input size is the same as in the previous ResNet experiment.

Branching Structure This kind of dynamic structure produces a scalar variable before branching and chooses one branch according to the variable. Since the variable depends on both the current input and the model weights, we cannot predict which branch we are going to use before running it. This structure is the core of hierarchical models. One application could be using the longer branch for hard examples to get better accuracy and using the shorter branch for simple examples to reduce latency (a sketch of such a block follows below).

For our experimental setting, the longer branch causes OOM while the shorter one does not. We use the top-1 and top-5 error rates on ImageNet to estimate the percentage of hard examples, and we simply choose 10%, a number between the top-1 and top-5 errors, for simplicity. We use the conservative baseline here for comparison. The conservative baseline has the same training iteration latency when there is no OOM.
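A minimal sketch of such a branching block; the channel counts, gate form and threshold are illustrative assumptions, not the exact experimental configuration.

import torch
import torch.nn as nn

class BranchingBlock(nn.Module):
    def __init__(self, channels=64, long_depth=6):
        super().__init__()
        self.gate = nn.Linear(channels, 1)
        self.short = nn.Conv2d(channels, channels, 3, padding=1)
        self.long = nn.Sequential(*[nn.Conv2d(channels, channels, 3, padding=1)
                                    for _ in range(long_depth)])

    def forward(self, x):
        # A scalar computed from both the input and the weights decides the branch,
        # so the branch taken is unknown before execution.
        score = torch.sigmoid(self.gate(x.mean(dim=(2, 3)))).mean()
        return self.long(x) if score > 0.9 else self.short(x)

y = BranchingBlock()(torch.randn(2, 64, 32, 32))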

Table 1. Microbenchmark results. Average time per training iteration; speedup over the conservative baseline in parentheses.

GPU memory limit   Conservative baseline   Bare solution     Our solution
2 GB               11.21 s                 5.15 s (2.17x)    1.40 s (8.01x)
4 GB                8.84 s                 5.87 s (1.51x)    1.02 s (8.67x)
6 GB                6.31 s                 5.27 s (1.19x)    1.32 s (4.78x)
8 GB                4.51 s                 3.29 s (1.37x)    1.21 s (3.73x)

The branching structure used in our experiment shares the same structure as ResNet-18 for its first, second and fourth stages, while its third stage is branched. The longer branch is 6 times deeper than the shorter one. Like in the other experiments, we use the 224 x 224 input resolution. The memory limit is still 6 GB for this experiment.

Recurrent Structure Recurrent structures are known to improve performance in a few famous works (Cai & Vasconcelos, 2018; Newell et al., 2016). The number of recurrent iterations affects accuracy, latency, and memory consumption, so it could be adjusted dynamically according to the current time limit and accuracy requirements.

We implemented a recurrent structure based on the Hourglass structure (Newell et al., 2016). The input size is (12, 3, 128, 128), which is typical for some keypoint detection problems. With the 6 GB memory limit, the network will go OOM if the iteration number is larger than 1. A 10% OOM rate is chosen, following the previous experiment.

Results Results are shown in Table 2. From the results we can conclude that even when OOM is a minor event during dynamic model training, our method is still significantly faster than the conservative baseline. This shows the importance of proper system design and optimizations for this problem.

Table 2. Experimental results for dynamic structures.

Structure    Baseline   Our method   Speedup
Linear       3.599 s    1.822 s      1.98x
Branching    1.221 s    0.921 s      1.33x
Recurrent    2.322 s    0.924 s      2.51x

7 CONCLUSIONS AND FUTURE WORKS

Deep learning models are becoming larger and larger, and GPU memory is becoming a bottleneck. While dynamic computational graph libraries like PyTorch are being widely adopted by machine learning researchers, previous works on the memory issue still mainly focus on static computational graphs. In this work, we target out-of-memory issues on GPU for dynamic computational graphs by developing a system that swaps the parameters and intermediate results of the dynamic computational graph between CPU and GPU memory asynchronously. Our system achieved great results for training neural networks under constrained memory settings.

As an ongoing project, our final goal is to build a system that overcomes the GPU memory barrier with a mixed strategy, including:

• Swapping memory back and forth between CPU and GPU memory, which is the method proposed in this paper.

• Rematerializing part of the tensors using the inputs of a node instead of swapping them back from CPU. In cases where computation is much faster than data movement, this strategy will accelerate the whole training.

• Automatically performing gradient accumulation if we find that all layer computations are independent along the batch dimension.

• More advanced methods (e.g. machine learning) for exploiting the history.

• If we have multiple devices and one of them meets an OOM issue, sending the intermediate results to another device and finishing the rest of the computation there.

For the last point, we surprisingly found that there does not exist a library to perform point-to-point communication for PyTorch GPU tensors. Therefore, we also developed a library named TensorTransfer to implement efficient point-to-point communication, and we plan to release it as an open-source project soon.
In the future, we will try to implement all of the previous strategies and build a system that can be really beneficial for new large model development.

REFERENCES

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

Anonymous. Reformer: The efficient transformer. In Submitted to International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rkgNKkHtvB. Under review.

Cai, Z. and Vasconcelos, N. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6154-6162, 2018.

Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., and Zhang, Z. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.

Chen, T., Xu, B., Zhang, C., and Guestrin, C. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016.

Collobert, R., Bengio, S., and Mariethoz, J. Torch: a modular machine learning library. Technical report, Idiap, 2002.

Cook, S. CUDA Programming: A Developer's Guide to Parallel Computing with GPUs. Newnes, 2012.

Courbariaux, M., Bengio, Y., and David, J.-P. BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pp. 3123-3131, 2015.

Cui, H., Zhang, H., Ganger, G. R., Gibbons, P. B., and Xing, E. P. GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter server. In Proceedings of the Eleventh European Conference on Computer Systems, pp. 4. ACM, 2016.

Gholami, A., Azad, A., Jin, P., Keutzer, K., and Buluc, A. Integrated model, batch, and domain parallelism in training neural networks. In Proceedings of the 30th Symposium on Parallelism in Algorithms and Architectures, pp. 77-86. ACM, 2018.

Gomez, A. N., Ren, M., Urtasun, R., and Grosse, R. B. The reversible residual network: Backpropagation without storing activations. In Advances in Neural Information Processing Systems, pp. 2214-2224, 2017.

Han, S., Mao, H., and Dally, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.

He, B. and Yu, J. X. High-throughput transaction executions on graphics processors. Proceedings of the VLDB Endowment, 4(5):314-325, 2011.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700-4708, 2017.

Huang, Y., Cheng, Y., Bapna, A., Firat, O., Chen, D., Chen, M., Lee, H., Ngiam, J., Le, Q. V., Wu, Y., et al. GPipe: Efficient training of giant neural networks using pipeline parallelism. In Advances in Neural Information Processing Systems, pp. 103-112, 2019.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

Jain, P., Jain, A., Nrusimha, A., Gholami, A., Abbeel, P., Keutzer, K., Stoica, I., and Gonzalez, J. E. Checkmate: Breaking the memory wall with optimal tensor rematerialization. arXiv preprint arXiv:1910.02653, 2019.

Jeong, E., Cho, S., Yu, G.-I., Jeong, J. S., Shin, D.-J., and Chun, B.-G. JANUS: Fast and flexible deep learning via symbolic graph execution of imperative programs. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19), pp. 453-468, 2019.

Jia, Z., Lin, S., Qi, C. R., and Aiken, A. Exploring hidden dimensions in parallelizing convolutional neural networks. arXiv preprint arXiv:1802.04924, 2018.

Kosiorek, A. R., Sabour, S., Teh, Y. W., and Hinton, G. E. Stacked capsule autoencoders. arXiv preprint arXiv:1906.06818, 2019.

Krizhevsky, A. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.

Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., and Soricut, R. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942, 2019.

Narasimhan, S. NVIDIA clocks world's fastest BERT training time and largest transformer based model, paving path for advanced conversational AI. Technical report, 2019. URL https://devblogs.nvidia.com/training-bert-with-gpus/.

Narayanan, D., Harlap, A., Phanishayee, A., Seshadri, V., Devanur, N. R., Ganger, G. R., Gibbons, P. B., and Zaharia, M. PipeDream: Generalized pipeline parallelism for DNN training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pp. 1-15. ACM, 2019.

Neubig, G., Dyer, C., Goldberg, Y., Matthews, A., Ammar, W., Anastasopoulos, A., Ballesteros, M., Chiang, D., Clothiaux, D., Cohn, T., et al. DyNet: The dynamic neural network toolkit. arXiv preprint arXiv:1701.03980, 2017.

Newell, A., Yang, K., and Deng, J. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pp. 483-499. Springer, 2016.

Ott, M., Edunov, S., Grangier, D., and Auli, M. Scaling neural machine translation. arXiv preprint arXiv:1806.00187, 2018.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024-8035, 2019.

Shazeer, N., Cheng, Y., Parmar, N., Tran, D., Vaswani, A., Koanantakool, P., Hawkins, P., Lee, H., Hong, M., Young, C., et al. Mesh-TensorFlow: Deep learning for supercomputers. In Advances in Neural Information Processing Systems, pp. 10414-10423, 2018.

Team, T. T. D., Al-Rfou, R., Alain, G., Almahairi, A., Angermueller, C., Bahdanau, D., Ballas, N., Bastien, F., Bayer, J., Belikov, A., et al. Theano: A Python framework for fast computation of mathematical expressions. arXiv preprint arXiv:1605.02688, 2016.

Tokui, S., Oono, K., and Hido, S. Chainer: a next-generation open source framework for deep learning.

Wang, L., Ye, J., Zhao, Y., Wu, W., Li, A., Song, S. L., Xu, Z., and Kraska, T. SuperNeurons: Dynamic GPU memory management for training deep neural networks. In ACM SIGPLAN Notices, volume 53, pp. 41-53. ACM, 2018.

Wang, M., Huang, C.-c., and Li, J. Supporting very large models using automatic dataflow graph partitioning. In Proceedings of the Fourteenth EuroSys Conference 2019, pp. 26. ACM, 2019.

Yeo, H., Jung, Y., Kim, J., Shin, J., and Han, D. Neural adaptive content-aware internet video delivery. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pp. 645-661, 2018.

Zhang, J., Yeung, S. H., Shu, Y., He, B., and Wang, W. Efficient memory management for GPU-based deep learning systems. arXiv preprint arXiv:1903.06631, 2019.