Cavs: An Efficient Runtime System for Dynamic Neural Networks

Shizhen Xu†1,2, Hao Zhang†1,3, Graham Neubig1,3, Wei Dai3, Jin Kyu Kim1, Zhijie Deng2, Qirong Ho3, Guangwen Yang2, Eric P. Xing3
1 Carnegie Mellon University, 2 Tsinghua University, 3 Petuum Inc. († equal contributions)

Abstract

Recent deep learning (DL) models are moving more and more to dynamic neural network (NN) architectures, where the NN structure changes for every data sample. However, existing DL programming models are inefficient in handling dynamic network architectures because of: (1) substantial overhead caused by repeating dataflow graph construction and processing for every example; (2) difficulties in batched execution of multiple samples; (3) inability to incorporate graph optimization techniques such as those used in static graphs. In this paper, we present "Cavs", a runtime system that overcomes these bottlenecks and achieves efficient training and inference of dynamic NNs. Cavs represents a dynamic NN as a static vertex function F and a dynamic instance-specific graph G. It avoids the overhead of repeated graph construction by only declaring and constructing F once, and allows for the use of static graph optimization techniques on pre-defined operations in F. Cavs performs training and inference by scheduling the execution of F following the dependencies in G, hence naturally exposing batched execution opportunities over different samples. Experiments comparing Cavs to state-of-the-art frameworks for dynamic NNs (TensorFlow Fold, PyTorch and DyNet) demonstrate the efficacy of our approach: Cavs achieves a near one order of magnitude speedup on training of dynamic NN architectures, and ablations verify the effectiveness of our proposed design and optimizations.

1 Introduction

Deep learning (DL), which refers to a class of neural networks (NNs) with deep architectures, is now a workhorse powering state-of-the-art results on a wide spectrum of tasks [53, 54, 30]. One reason for its widespread adoption is the variety and quality of software toolkits, such as Caffe [23], TensorFlow [1], PyTorch [36] and DyNet [33, 34], which ease programming of DL models, and speed computation by harnessing modern computing hardware (e.g. GPUs), software libraries (e.g. CUDA, cuDNN [6]), and compute clusters [56, 57, 7].

One dominant programming paradigm, adopted by DL toolkits such as Caffe and TensorFlow, is to represent a neural network as a static dataflow graph [32, 1], where computation functions in the NN are associated with nodes in the graph, and the inputs and outputs of the computations map to edges. It requires DL programmers to define the network architecture (i.e. the dataflow graph) using symbolic expressions, once before beginning execution. Then, for a given graph and data samples, the software toolkits can automatically derive the correct algorithm for training or inference, following backpropagation [21] and auto-differentiation rules. With proper optimization, the execution of these static dataflow graphs can be highly efficient: as the dataflow graph is fixed for all data, the evaluation of multiple samples through one graph can be naturally batched, leveraging the improved parallelization capability of modern hardware (e.g. GPUs). Moreover, by separating model declaration and execution, it makes it possible for the graph to be optimized once at declaration time [1], with these optimizations benefiting the efficiency of processing arbitrary input data batches at execution time.
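For readers less familiar with this paradigm, the declare-once, execute-many idea can be summarized in a few lines of plain Python/NumPy. This is a toy sketch under our own naming, not the API of Caffe or TensorFlow: the "graph" is a fixed composition of operators built once, and every batch of identically shaped samples flows through the same structure.

    import numpy as np

    # A toy "static dataflow graph": a fixed pipeline of operators declared once.
    def declare_static_graph(d_in, d_hidden, d_out):
        rng = np.random.default_rng(0)
        W1 = rng.standard_normal((d_in, d_hidden)) * 0.1
        W2 = rng.standard_normal((d_hidden, d_out)) * 0.1
        def forward(x_batch):              # the same structure for every batch
            h = np.tanh(x_batch @ W1)      # node 1: tanh(matmul)
            return h @ W2                  # node 2: matmul
        return forward

    graph = declare_static_graph(16, 32, 4)    # declared once, before execution
    for step in range(3):                      # many batches, one graph
        x = np.random.randn(64, 16)            # 64 fixed-shape samples
        y = graph(x)                           # naturally batched evaluation

Because the structure never changes, the per-batch work is a handful of large, batched kernel calls, which is exactly the efficiency that dynamic NNs lose.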

While the dataflow graph has major efficiency advantages, its applicability relies on a key assumption: the graph (i.e. the NN architecture) is fixed throughout the runtime. This assumption breaks for dynamic NNs, where the network architecture conditionally changes with every input sample, such as NNs that compute over sequences of variable lengths [22, 43], trees [45], and graphs [26].

Due to the growing interest in these sorts of dynamic models, recent years have seen an increase in the popularity of frameworks based on dynamic declaration [49, 33, 11], which declare a different dataflow graph per sample. While dynamic declaration is convenient to developers, as it removes the restriction that computation be completely specified before training begins, it exhibits a few limitations. First, constructing a graph for every sample results in substantial overhead, which grows linearly with the number of input instances. In fact, we find graph construction takes longer than the computation in some frameworks (see §5.2). It also prevents the application of complex static graph optimization techniques (see §3.4). Moreover, since each sample owns a dataflow graph specifying its unique computational pattern, batching together similarly shaped computations across instances is non-trivial. Without batching, the computation is inefficient due to its inability to exploit modern computational hardware. While some progress has been made in recent research [34, 27], how to automatically batch the computational operations from different graphs remains a difficult problem.

To address these challenges, we present Cavs, an efficient runtime system that exploits the recurrent and recursive nature of dynamic NNs. Instead of declaring a dataflow graph per sample, it decomposes a dynamic NN into two components: a static vertex function F that is only declared (by the user) and optimized once before execution, and an input-specific graph G obtained via I/O at runtime. Cavs inherits the flexibility of symbolic programming [1, 12, 33] for DL; it requires users to define F by writing symbolic expressions in the same way as in static declaration. With F and G, the workflow of training or testing a dynamic NN is cast as scheduling the execution of F following the structure of the input graph G. Cavs will perform auto-differentiation, schedule the execution following the dependencies in G, and guarantee efficiency and correctness.

Cavs' design allows for highly efficient computation on dynamic graphs for a number of reasons. First, it allows the vertex function to be defined and constructed only once for any type of structured data, hence avoiding the overhead of repeated dataflow graph construction. Second, as the dataflow graph encoded by the vertex function is static throughout the runtime, it can benefit from various static graph optimizations [1, 5, 12, 18] (§3.4), which is not the case in the scenario of dynamic declaration (§2.2). Moreover, it naturally exposes opportunities for batched computation, i.e. we are able to parallelize the execution of F over multiple vertices from different input graphs (§3.2) with the support of our proposed memory management strategy (§3.3).

To evaluate Cavs' performance, we compare it to several state-of-the-art systems supporting dynamic NNs. We focus our experiments on GPU training, and verify that both Fold and DyNet suffer from substantial overhead caused by repeated graph preprocessing or construction, which is bypassed by Cavs (§5.2). In a comparison with unbatched dynamic graphs in PyTorch and DyNet, two widely-used dynamic NN libraries, we verify that batching is essential for efficient processing. In a comparison with TensorFlow Fold and DyNet autobatching, two libraries that allow for the use of dynamic NNs with automatic operation batching, we find that Cavs has significant performance advantages: on static graphs it performs equivalently or slightly better, and on dynamic NNs with difficult-to-batch workloads (e.g. Tree-LSTM [45] and Tree-FC [27]), Cavs demonstrates near one order of magnitude speedups across multiple dataset and hyper-parameter settings (§5.1). We further investigate the effectiveness of our design choices: Cavs benefits not only from our proposed memory management strategy, but also from various optimizations on graph execution, which were originally developed for static dataflow graphs and are not applicable in dynamic declaration.

To summarize, we make three primary contributions in this paper: (1) We propose a novel representation for dynamic NNs, based on which we design four APIs and implement the Cavs runtime system (§3.1); (2) We propose several novel strategies in Cavs for efficient training and inference of dynamic NNs: the batching policy (§3.2), a memory management mechanism to guarantee memory coalescing (§3.3), and multiple graph execution optimizations (§3.4); (3) We compare Cavs to state-of-the-art systems for dynamic NNs (§5). We reveal the problems of existing systems, and report near 10x speedups for Cavs across various experimental settings. We also verify the effectiveness of our proposed design strategies, and quantify their contributions to the final performance.

2 Background

2.1 Dynamic Neural Networks

Successful NN models generally exhibit suitable architectures that capture the structures of the input data. For example, convolutional neural networks [24, 53], which apply fixed-structured operations to fixed-sized images, are highly effective precisely because they capture the spatial invariance common in computer vision domains [39, 44]. However, apart from images, many forms of data are structurally complex and cannot be readily captured by fixed-structured NNs. Appropriately reflecting these structures in the NN design has shown effective in sentiment analysis [45], semantic similarity between sentence pairs [40], and image segmentation [26].

To see this, we take the constituency parsing problem as an example. Sentences in natural languages are often represented by their constituency parse trees [45, 31], whose structure varies depending on the content of the sentence itself (Fig. 1(a)). Constituency parsing is an important problem in natural language processing that aims to determine the corresponding grammar type of all internal nodes given the parse tree of a sentence. Fig. 1(b) shows an example of a network that takes into account this syntactic structure, generating representations for the sentence by traversing the parse tree bottom-up and combining the representations of each sub-tree, using a dynamic NN called the Tree-structured Long Short-Term Memory (Tree-LSTM) [45]. In particular, each node of the tree maps to an LSTM function [22]. The internal computations and parameters of the LSTM function are defined in Fig. 4. At each node, it takes a variable number of inputs and returns to the parent node a vector representing the parsing semantics up to that point, until the root LSTM node returns a vector representing the semantics of the entire sentence.

Figure 1: An example of a dynamic NN: (a) a constituency parsing tree, (b) the corresponding Tree-LSTM network. We use the following abbreviations in (a): S for sentence, N for noun, VP for verb phrase, NP for noun phrase, D for determiner, and V for verb.

The important observation is that the NN structure varies with the underlying parsing tree of each input sample, but the same LSTM cell is constant in shape and repeated at each internal node. Similar examples can be found for graph inputs [25, 26] and sequences of variable lengths [43, 2]. We refer to these NNs that exhibit different structures for different input samples as dynamic neural networks, in contrast to static networks that have a fixed network architecture for all samples.

2.2 Programming Dynamic NNs

There is a natural connection between NNs and directed graphs: we can map the graph nodes to the computational operations or parameters in NNs, and let the edges indicate the direction of the data being passed between the nodes. In this case, we can represent the process of training NNs as batches of data flowing through computational graphs, i.e. dataflow graphs [3, 1, 33].

Static declaration. As mentioned previously, static declaration is one dominant paradigm for programming NNs [3, 1, 5]. Fig. 2(a) summarizes its workflow, which assumes all data samples share a fixed NN structure declared symbolically in a dataflow graph D. Static declaration, using a single dataflow graph D, cannot express dynamic NNs whose structures change with the data samples. A primary remedy to this problem is to forgo the efficiency gains of static dataflow graphs and instead use a dynamic declaration framework.

Dynamic declaration. Fig. 2(b) illustrates the workflow of dynamic declaration. By creating a unique dataflow graph D_k^p for each sample x_k^p according to its associated structure, dynamic declaration is able to express sample-dependent dataflow graphs. It however causes extra overhead on graph construction and puts constraints on runtime optimization, which usually lead to inefficient execution. Particularly, since a dataflow graph D_k^p needs to be constructed per sample, the overhead increases linearly with the number of samples, and sometimes yields degraded performance [27] (§5.2), even for frameworks with optimized graph construction implementations [33]. Moreover, we can hardly benefit from any well-established dataflow graph optimization (§3.4): we would have to perform graph processing and optimization for each dataflow graph and every single sample, but incorporating this optimization itself has a non-negligible overhead. More importantly, as we are unable to batch the computation of differently structured graphs, the single-instance computation D_k^p(x_k^p) in Fig. 2(b) would be very inefficient in the absence of batched computation.

Dynamic batching. To address the batching problem, some recent efforts, notably TensorFlow Fold [27] and DyNet [34], propose dynamic batching, which dynamically groups similarly shaped operations from different graphs and batches their execution whenever possible. Fold turns dynamic dataflow graphs into a static control flow graph to enable batched execution, but introduces a complicated functional programming-like interface and a large graph preprocessing overhead. As we will show in §5.2, graph construction sometimes slows down the computation by 4x. DyNet proposes an auto-batching strategy that searches for batching opportunities by profiling every fine-grained operator, while this step itself has non-negligible overhead (§5.2). It is also not open to dataflow-graph-level optimizations.

In summary, there are three major challenges that prevent the efficient execution of dynamic neural networks: (1) non-negligible graph construction overhead; (2) difficulties in parallel execution; (3) unavailability of graph execution optimization.

2.3 Motivation

Our motivation for Cavs comes from a key property of dynamic NNs: most dynamic NNs are designed to exhibit a recursive structure; within the recursive structure, a static computational function is applied following the topological order over instance-specific graphs. For instance, if we denote the constituency parsing tree in §2.1 as a graph G, where each node of the tree maps to a vertex in G, we note the Tree-LSTM can be interpreted as follows: a computational cell function, specified in advance, is applied from the leaves to the root, following the dependencies in G. G might change with input samples, but the cell function itself is always static: it is parametrized by a fixed set of learnable parameters and interacts in the same way with its neighbors when applied at different vertices of G.
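This recursion can be made concrete with a short sketch in plain Python (hypothetical names, no framework API): a single cell function is written once with a fixed parameter set, and each sample only supplies a tree whose vertices the cell is applied to in children-before-parent order.

    import numpy as np

    D = 8
    rng = np.random.default_rng(0)
    W, U = rng.standard_normal((D, D)) * 0.1, rng.standard_normal((D, D)) * 0.1

    def cell(x, child_states):
        # the static part: one function, fixed learnable parameters W, U
        h_children = sum(child_states) if child_states else np.zeros(D)
        return np.tanh(x @ W + h_children @ U)

    def evaluate(tree, inputs):
        # the dynamic part: an instance-specific tree, given in topological order
        states = {}
        for v, children in tree.items():
            states[v] = cell(inputs[v], [states[c] for c in children])
        return states

    # one sample's structure; another sample may supply a completely different tree
    tree = {0: [], 1: [], 2: [0, 1], 3: [], 4: [2, 3]}   # leaves first, root last
    inputs = {v: rng.standard_normal(D) for v in tree}
    root_state = evaluate(tree, inputs)[4]

Only the tree changes from sample to sample; the cell, and hence the dataflow graph it encodes, stays fixed.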

  (a) static declaration:
      # all samples must share one graph
      declare a static dataflow graph D
      for p = 1 -> P:
          read the p-th data batch {x_k^p}, k = 1..K
          batched computation: D({x_k^p})

  (b) dynamic declaration:
      for p = 1 -> P:
          read the p-th data batch {x_k^p}, k = 1..K
          for k = 1 -> K:
              declare a dataflow graph D_k^p for x_k^p
              single-instance computation: D_k^p(x_k^p)

  (c) our proposed vertex-centric model:
      declare a symbolic vertex function F
      for p = 1 -> P:
          read the p-th data batch {x_k^p}, k = 1..K
          read their associated graphs {G_k^p}, k = 1..K
          compute F over {G_k^p} with inputs {x_k^p}

Figure 2: The workflows of (a) static declaration, (b) dynamic declaration, (c) Cavs. Notations: D notates both the dataflow graph itself and the computational function implied by it; p is the index of a batch while k is the index of a sample in the batch.

These observations motivate us to decompose a dynamic NN into two parts: (1) a static computational vertex function F that needs to be declared by the programmer once before runtime, and (2) a dynamic input graph G that changes with every input sample.¹ With this representation, the workflow of training a dynamic NN can be cast as scheduling the evaluation of the symbolic construct encoded by F, following the graph dependencies of G, as illustrated in Fig 2(c). This representation exploits the property of dynamic NNs to address the aforementioned issues in the following ways:

Minimize graph construction overhead. Cavs only requires users to declare F using symbolic expressions, and to construct it once before execution. This bypasses repeated construction of multiple dataflow graphs, avoiding the associated overhead. While it is still necessary to create an I/O function to read the input graph G for each sample, this must be done by any method, and only once before training commences, and it can be shared across samples.

Batched execution. With the proposed representation, Cavs transforms the problem of evaluating data samples {x_k^p} (at the p-th batch) on different dataflow graphs {D_k^p} [27, 34] into a simpler form: scheduling the execution of the vertex function F following the dependencies in the input graphs {G_k^p}. For the latter problem, we can easily batch the execution of F on multiple vertices at runtime (§3.2), leveraging the batched computational capability of modern hardware and libraries.

Open to graph optimizations. Since the vertex function F encodes a dataflow graph which is static throughout the runtime, it can benefit from various graph optimizations originally developed for static declaration, such as kernel fusion, streaming, and our proposed lazy batching, which are not effective in dynamic declaration.

Based on this motivation, we next describe the Cavs system. Cavs faces the following challenges in system design: (1) how to design minimal APIs in addition to the symbolic programming interface to minimize user code; (2) how to schedule the execution of F over multiple input graphs to enable batched computation; (3) how to manage memory to support the dynamic batching; (4) how to incorporate static graph optimization in Cavs's execution engine to exploit more parallelism.

Figure 3: Cavs represents a dynamic structure as a dynamic input graph G (left) and a static vertex function F (right).

3 Cavs Design and Optimization

3.1 Programming Interface

Conventional dataflow graph-based programming models usually entangle the computational workflow in F with the structure in G, and require users to express them as a whole in a single dataflow graph. Instead, Cavs separates the static vertex function F from the input graph G (see Fig 3). While users use the same set of symbolic operators [1, 11] to assemble the computational workflow in F, Cavs proposes four additional APIs, gather, scatter, pull, push, to specify how messages shall be passed between connected vertices in G:

• gather(child_idx): gather accepts the index of a child vertex, gets its output, and returns a list of symbols that represent the output of the child.

• scatter(op): scatter reverses gather. It sets the output of the current vertex as op. If this vertex is gathered, the content of op will be returned.

gather and scatter are motivated by the GAS model in graph computing [14]: both are vertex-centric APIs that help users express the overall computational patterns by thinking locally like a vertex. gather receives messages from dependent vertices, while scatter updates information to parent vertices (see discussion in §6). However, in dynamic NNs, the vertex function F usually takes input not only from the internal vertices of G (internal data path in Fig 3), but also from the external environment, e.g. an RNN can take inputs from a CNN feature extractor or some external I/O (external data path in Fig 3). Cavs therefore provides another two APIs to express such semantics:

• pull(): pull grabs inputs from the external of the current dynamic structure, e.g. another NN, or I/O.

• push(op): push reverses pull. It sets the output of the current vertex as op. If this vertex is pulled by others, the content of op will be returned.

¹In the following text, we will distinguish the term vertex from node. We use vertex to denote a vertex in the input graph, and node to denote an operator/variable in a dataflow graph. Hence, a vertex function can have many nodes, as it itself represents a dataflow graph.
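To show how the four primitives fit together, the following is a minimal, self-contained sketch with NumPy and hypothetical Python names of our own (ToyVertexContext is not a Cavs class); the real interface is symbolic and is shown for the N-ary child-sum Tree-LSTM in Fig. 4 below. The example cell is a plain child-sum RNN, deliberately simpler than the Tree-LSTM.

    import numpy as np

    class ToyVertexContext:
        def __init__(self, child_states, external_input):
            self._children = child_states      # outputs scattered by child vertices
            self._external = external_input    # what pull() reads (external data path)
            self.scattered = None              # what parent vertices will gather
            self.pushed = None                 # what the outside world will see

        def gather(self, child_idx):
            return self._children[child_idx]

        def scatter(self, value):
            self.scattered = value

        def pull(self):
            return self._external

        def push(self, value):
            self.pushed = value

    def vertex_function(ctx, W, U, b):
        # a plain child-sum RNN cell, far simpler than the Tree-LSTM of Fig. 4
        h_sum = sum(ctx.gather(k) for k in range(len(ctx._children)))
        x = ctx.pull()
        h = np.tanh(x @ W + h_sum @ U + b)
        ctx.scatter(h)     # internal data path: visible to parent vertices
        ctx.push(h)        # external data path: visible to, e.g., a downstream NN

    D = 4
    rng = np.random.default_rng(0)
    W, U, b = rng.standard_normal((D, D)), rng.standard_normal((D, D)), np.zeros(D)
    ctx = ToyVertexContext([np.ones(D), np.zeros(D)], rng.standard_normal(D))
    vertex_function(ctx, W, U, b)

The point of the split is that the body of vertex_function never mentions a concrete tree or sequence; the runtime decides at which vertices, and in which batched groups, it is evaluated.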

940 2018 USENIX Annual Technical Conference USENIX Association 1 def F(): declaration that one has to define a vertex function for 2 fork in range(N): 3 S= gather(k) # gather states of child vertices each vertex of each input graph. Fortunately, dynamic 4 ck, hk = split(S,2) # get hidden states c and h NNs in this scenario are usually avoided because of the 5 x= pull() # pull the first external input x 6 difficulties in design, programming and learning. 7 # specify the computation N−1 8 h= ∑k=0 hk 3.2 Scheduling (i) (i) (i) 9 i= sigmoid(W × x+U × h+b ) 10 fork in range(N): Once F is defined and G is obtained from I/O, Cavs will ( f ) ( f ) ( f ) 11 fk = sigmoid(W × x+U × hk +b ) perform computation by scheduling the evaluation of F (o) (o) (o) 12 o= sigmoid(W × x + U × h + b ) N N (u) (u) (u) over data samples {x } and their input graphs {G } . 13 u= tanh(W × x+U × h + b ) i i=1 i i=1 N−1 14 c=i ⊗ u+ ∑k=0 fk ⊗ ck Forward pass. For a sample xi with its input graph Gi, 15 h=o ⊗ tanh(c) the scheduler starts the forward pass from the input ver- 16 17 scatter(concat([c, h],1)) # scatter c, h to parents tices of Gi, and proceeds following the direction indi- 18 push(h) # push to external connectors cated by the edges in Gi: at each sub-step, the sched- Figure 4: The vertex function of an N-ary child-sum TreeL- uler figures out the next activated vertex in Gi, and eval- STM [45] in Cavs. Within F, users declare a computational uates all expressions in F at this vertex. It then marks dataflow graph using symbolic operators. The defined F will this vertex as evaluated, and proceeds with the next ac- be evaluated on each vertex of G following graph dependencies. tivated vertex until reaching a terminal vertex (e.g. the Once F declared, together with an input graph G, loss function). A vertex of G is activated if and only if all they encode a recursive dataflow graph structure, which its dependent vertices have been evaluated. maps to a subgraph of the implicit full dataflow graph Backward pass. The backward pass is continued right of the model that may needs to be explicitly declared in after the forward. The scheduler first resets the status of traditional programming models. Via push and pull, all vertices as not evaluated, then scans the graph in a Cavs allows users to connect any external static dataflow reverse direction, starting from the ending point of the graph to a dynamic structure encoded by (F,G), to ex- forward pass. It evaluates ∂F at each vertex until all press more complex model architectures, such as the vertices have been evaluated in the backward pass. LRCN [9] (i.e. connecting a CNN to an RNN), or an To train a NN to convergence, the above process has to N encoder-decoder LSTM network [43] (i.e. connecting be iterated on all samples {xi}i=1 and their input graphs N two different recursive structures). With these four APIs, {Gi}i=1, for many epochs. We next describe our batched we present in Fig 4 an example user program how the N- execution policy to speed the computation. K N ary child-sum Tree-LSTM [45] can be simply expressed Batching policy. Given a data batch {xk}k=1 ⊆ {xi}i=1 K by using them and other mathematical operators. and associated graphs {Gk}k=1, this policy groups mul- Auto-differentiation. Given a vertex function F Cavs tiple vertices and performs batched evaluation of F in order to reduce kernel launches and exploit parallelism. 
derives ∂F following the auto-differentiation rules: for K Specifically, a forward pass over a batch {xk} are per- each math expression such as sl = op(sr) in F, Cavs k=1 formed in multiple steps. At each step t, Cavs analyzes generates a corresponded backward expression ∇sr = K {Gk} at runtime and determines a set Vt that contains grad op(∇sl,sl,sr) in ∂F. For the four proposed opera- k=1 {G }K tors, we note scatter is the gradient operator of gather all activated vertices in graphs k k=1. It then evalu- ates F over these vertices by creating a batched execu- in the sense that if gather collects inputs from child ver- 2 tex written by scatter at the forward pass, a scatter tion task, with the task ID set to t . The task is exe- needs to be performed to write the gradients for its de- cuted by the Cavs execution engine (§3.4). Meanwhile, pendent vertices to gather at the backward pass. Hence, the scheduler records this task by pushing Vt into a stack S. To perform backward pass, the scheduler pops out an for an expression like sl = gather(child idx) in F, element Vt from S at each step – the execution engine Cavs will generate a backward expression scatter(∇sl) in ∂F. Similarly, the gradient operator of scatter is will evaluate the derivative function ∂F over vertices in V {G }K gather. The same rules apply for push and pull. t , until all vertices of k k=1 are evaluated. We note the batching policy is similar to the dynamic Expressiveness. With these four APIs, Cavs can be seen batching in Fold [27] and DyNet [33]. However, Cavs as a middle ground between static and dynamic decla- determines how to batch fully dynamically during run- ration. In the best case that the NN is fully recursive time using simple breadth-first search with negligible (e.g. most recurrent or recursive NNs), it can be repre- cost (instead of analyzing full dataflow graphs before ev- sented by a single vertex function and an input graph. ery iteration of the execution). Since batched computa- While in the worst case, that every sample has a unique tion requires the inputs to an expression over multiple input graph while every vertex in the graph has a unique 2 way to interact with its neighboring vertices (i.e. the NN Whenever the context is clear, we use Vt to denote both the set of is dynamic but non-recursive), Cavs reduces to dynamic vertices to be batched together, and the batched execution task itself.
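The batching policy described above can be summarized as a breadth-first sweep over all input graphs of a batch: at each step, every vertex whose children have all been evaluated, in any graph, joins one batched evaluation of F. The sketch below (plain Python, hypothetical names; the real scheduler is part of the C++ runtime and also records each task for the backward pass) computes these batching tasks.

    from collections import defaultdict

    def batching_tasks(graphs):
        # graphs: list of {vertex: [child vertices]} with graph-local vertex ids
        indeg, parents = [], []
        for g in graphs:
            d, p = {v: len(ch) for v, ch in g.items()}, defaultdict(list)
            for v, ch in g.items():
                for c in ch:
                    p[c].append(v)            # reverse edges: child -> parents
            indeg.append(d)
            parents.append(p)

        # initially activated vertices: those with no unevaluated children (leaves)
        frontier = [(i, v) for i, d in enumerate(indeg) for v, k in d.items() if k == 0]
        tasks = []
        while frontier:
            tasks.append(frontier)            # one batched evaluation of F
            nxt = []
            for i, v in frontier:
                for p in parents[i][v]:
                    indeg[i][p] -= 1
                    if indeg[i][p] == 0:      # all children evaluated: activated
                        nxt.append((i, p))
            frontier = nxt
        return tasks                          # replayed in reverse for the backward pass

    two_trees = [{0: [], 1: [], 2: [0, 1]}, {0: [], 1: [], 2: [0, 1], 3: [2]}]
    for t, task in enumerate(batching_tasks(two_trees)):
        print("task", t, "->", task)

For the two toy trees, the first task batches all four leaves across both graphs, the second batches the two internal vertices, and the third contains only the remaining root of the deeper tree.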

[Figure 5 (diagram): an example vertex function (a pull, two gathers, a matmul-based cell, a scatter, and a push) evaluated over two input trees; the vertices are grouped into four batching tasks, and the gather/scatter, pull, and push buffers hold the batched inputs and outputs. See the caption below.]
0 1 8 9 0 1 2 3 4 5 6 7 8 9 10 11 12 13 push buffer Figure 5: The memory management at the forward pass of F (top-left) over two input trees (bottom-left). Cavs first analyzes F and 3 inputs – it creates four dynamic tensors {αn}n=0, and figures out there will be four batch tasks (dash-lined boxes). Starting from the first task (orange vertices {0,1,2,5,6,7,8,9}), Cavs performs batched evaluation of each expression in F. For example, for the pull expression α0 = pull(), it indexes the content of α0 on all vertices from the pull buffer using their IDs, and copies them to α0 continuously; for scatter and push expressions, it scatters a copy of the output (α3) to the gather buffer, and pushes them to the push buffer, respectively. Cavs then proceeds to the next batching task (blue vertices). At this task, Cavs evaluates each expression of F once again for vertices {3,10,11}. (e.g. for a pull expression α0 = pull(), it pulls the content of α0 from pull buffer again; for a gather expression α2 = gather(1) at vertex 3, it gathers the output of the second child of 3, which is 1); it writes results continuously at the end of each dynamic tensor. It proceeds until all batching tasks are finished. vertices to be placed on a continuous memory buffer, we then performs batched evaluation of each expression 3 develop a new memory management support for it. in F. For an expression sl = op(sr) , Cavs first ac- 3.3 Memory Management cesses αr (the dynamic tensor of the RHS sr) – it offsets αr.p by αr.offset, and reads a block of In static declaration [1, 33], a symbol in the user program Mt ∏ αr.shape[i] elements, and presents it as a tensor usually corresponds to a fixed-sized tensor object with a i with batch size Mt and other dimensions as αr.shape. It batch size dimension. While in Cavs, each batching task then applies batched computational kernels of the oper- Vt is determined at runtime. For the batched computation ator op over this memory block, and writes the results to be efficient, Cavs must guarantee for each batching to αl (the dynamic tensor of the LHS symbol sl) on the task, the inputs to each expression of F over a group of continuous block in between [αl.p + αl.offset,αl.p + runtime-determined vertices coalescing in memory. αl.offset + Mt ∏i αl.shape[i]]. Upon the completion Cavs proposes a novel data struct DynamicTensor { N of Vt , the scheduler increases offset of all {αn} by structure dynamic tensor to ad- vector shape; n=1 int bs; Mt ∏i αn.shape[i], respectively. It then starts the next dress this challenge (Fig 6). A int offset; task Vt+1. Hence, intermediate results generated in each void* p; }; dynamic tensor is a wrapper batching task at forward pass are stored continuously in of a multi-dimensional array [1, Figure 6: Dynamic ten- sor. the dynamic tensors, and their offsets are recorded. 52]. It contains four attributes: M At the entrance of F, the vertices {v } t in V need shape, bs, a pointer p to a chunk of memory, and m m=1 t to interact with its dependent vertices in previous V to offset. shape is an array of integers representing the t−1 gather their outputs as inputs (L3 of Figure 4), or pull specific shape of the tensor excluding the batch dimen- inputs from the external (L5 of Figure 4). Cavs main- sion. It can be inferred from the user program and set be- tains memory buffers to enable this (Figure 5). It records fore execution. 
The batch size bs is dynamically set by the offsets of the dynamic tensors for each v ∈ V , and the scheduler at runtime at the beginning of a batching m t therefore during the execution of gather operator, the task. To access a dynamic tensor, one moves p forward memory slices of specific children can be indexed. As with the value of offset, and reads/writes number of el- shown in Figure 5, gather and scatter share the same ements equal to bs · shape[i]. Therefore, bs together ∏i temporary buffer for memory re-organization, but push with offset provide a view of the tensor, and the state of and pull operate on external memory buffers. the tensor will vary based on their values. Given a vertex Algorithm 1 summarizes the memory management function F, Cavs creates dynamic tensors {α }N for n n=1 during forward pass. The backward execution follows an each non-parameter symbol s (n = 1,...,N) in F, and n exactly reverse order of the forward pass (§3.2 ), which also {∇α }N as their gradients, while it creates static n n=1 we skip in the text. With this strategy, Cavs guaran- tensors for model parameters. tees memory continuity for any batched computation of Fig 5 illustrates how the memory is assigned during F and ∂F. Compared to dynamic batching in DyNet, the forward pass by manipulating dynamic tensors. In Cavs performs memory movement only at the entrance particular, in a training iteration, for a batching task N Vt , the scheduler sets bs of all {αn}n=1 to Mt = |Vt | 3Note that the user-defined expressions can be arbitrary, e.g. with (the number of vertices in Vt ). The execution engine more than one argument or return values
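The dynamic tensor of Fig. 6 can be mocked up in a few lines of Python to show how bs and offset define a contiguous view into one growing buffer, so that each batching task reads and writes a single memory block. This is illustrative only; in Cavs the structure is a C++ struct holding shape, bs, offset and a raw pointer, and the capacity below is a made-up parameter of the sketch.

    import numpy as np

    class DynamicTensor:
        def __init__(self, shape, capacity):
            self.shape = tuple(shape)          # per-vertex shape (no batch dimension)
            self.bs = 0                        # batch size of the current task
            self.offset = 0                    # vertices consumed by earlier tasks
            self.buf = np.empty((capacity, *self.shape))   # backing storage

        def view(self):
            # the block [offset, offset + bs) seen as a batched tensor
            return self.buf[self.offset:self.offset + self.bs]

        def advance(self):
            # called by the scheduler when a batching task completes
            self.offset += self.bs

    alpha = DynamicTensor(shape=(8,), capacity=16)
    for task_vertices in ([0, 1, 2, 5], [3, 4]):           # two batching tasks
        alpha.bs = len(task_vertices)
        alpha.view()[:] = np.random.randn(alpha.bs, 8)      # a batched kernel writes here
        alpha.advance()
    print(alpha.offset)   # 6: results of both tasks sit back-to-back in buf

Because each task only appends to the end of the buffer, the intermediate results of the whole forward pass stay contiguous and can be revisited by offset during the backward pass.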

942 2018 USENIX Annual Technical Conference USENIX Association Fused Kernel gather and scatter operator, respectively. A node that embedding pull matmul SYNC linear lookup add sig push has g as its dependent and is not on any path from g to s transformation add tan mul is a lazy operator. A node that has s as its ancestor and is gather slice matmul add slice mul add sig concat scatter not on any path from g to s is an eager operator. mul add sig slice mul Fig 7 illustrates a forward dataflow graph of the vertex function of Tree-LSTM, with eager and lazy operators parameter eager operator lazy operator lazy batching stream1 stream2 colored. A property of them is that their evaluation is Figure 7: The dataflow graph encoded by F of Tree-LSTM. not fully subject to the dependency reflected by the in- and exit of F, instead of for each expression (operator). put graph G. For instance, the pull operator in Fig 7 is We empirically find this significantly reduces overhead eager and can be executed in prior – even before F has of memory operations (§5.3). been evaluated at the vertices that gather tries to inter- act with; the push operator is lazy, so we can defer its Algorithm 1 Memory management at forward pass. execution without impacting the evaluation of F at par- T N ent vertices. Similarly, in ∂F, the gradient derivation 1: function FORWARD({Vt }t=1,{αn}n=1,F) 2: for t = 1 → T do for model parameters are mostly lazy – their execution 3: for n = 1 → N do αn.bs ← Mt end for can be deferred as long as the gradients of hidden states 4: for each expression like sl = op(sr) in F do 5: if op ∈ {gather,pull} then are derived and propagated in time. Cavs leverages this 6: C ← ∏i αl .shape[i],q ← αl .p + αl .offset. property and proposes the lazy batching strategy. It de- 7: for vm ∈ Vt (m = 1 → Mt ) do fers the execution of all lazy operators in F and ∂F until 8: src ← IndexBuffer(op,m),dest ← q + (m − 1)C. T all batching tasks {Vt } has finished. It then performs 9: memcpy(dest,src,C). t=1 10: end for a batched execution of these lazy operators over all ver- K 11: else if op ∈ {scatter,push} then tices of {Gk}k=1. These operators includes, but is not 12: C ← ∏i αr.shape[i],q ← αr.p + αr.offset. limited to, the push operator that is doing memory copy, 13: for vm ∈ Vt (m = 1 → Mt ) do and operators for computing gradients of model param- 14: dest ← IndexBuffer(op,m),src ← q + (m − 1)C. 15: memcpy(dest,src,C). eters. Lazy batching helps exploit more parallelism and 16: end for significantly reduces kernel launches. Empirically lazy 17: else batching brings 20% overall improvement (§5.3). 18: perform batched computation: αl = op kernel(αr). 19: end if To leverage the exhibited parallelization opportunity 20: end for between eager operators and the operators on the path 21: for n = 1 → N do αn.offset+ = Mt ∏i αn.shape[i] end for from gather to scatter (Figure 7), Cavs proposes a 22: end for streaming strategy that pipelines the execution of these 23: end function two groups of operators. It allocates two streams, and 3.4 Optimizing Execution Engine puts the eager operators on one stream, and the rest (ex- cluding lazy operators) on the other. Hence, independent Since Cavs separates out a static dataflow graph encoded operators in two streams run in parallel, while for those by F, we can replace the original F with an optimized operators that depend on an eager operator, this depen- one that runs more efficiently, as long as maintaining cor- dency is respected by synchronization barriers (Fig 7). 
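The eager/lazy distinction used by lazy batching and streaming boils down to reachability on the dataflow graph of F: operators on a path from the gather node g to the scatter node s must follow the per-vertex dependencies, operators that only feed s (e.g. pull or an embedding lookup) can run ahead of time, and operators that only depend on g (e.g. push or parameter-gradient operators in the backward graph) can be deferred and batched once at the end. The sketch below (plain Python, toy node names of our own; the real detection runs on Cavs' internal graph representation) implements that test.

    def reachable(adj, start):
        seen, stack = set(), [start]
        while stack:
            u = stack.pop()
            for v in adj.get(u, []):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen

    def classify(edges, g, s):
        fwd, rev = {}, {}
        for u, v in edges:                    # edge u -> v: v consumes u's output
            fwd.setdefault(u, []).append(v)
            rev.setdefault(v, []).append(u)
        down_g = reachable(fwd, g)            # nodes whose evaluation depends on gather
        up_s = reachable(rev, s)              # nodes that scatter transitively depends on
        labels = {}
        for n in {x for e in edges for x in e}:
            if n in (g, s) or (n in down_g and n in up_s):
                labels[n] = "on_path"
            elif n in down_g:
                labels[n] = "lazy"            # defer until all batching tasks finish
            elif n in up_s:
                labels[n] = "eager"           # may run ahead on a separate stream
            else:
                labels[n] = "other"
        return labels

    edges = [("pull", "cell"), ("gather", "cell"), ("cell", "scatter"),
             ("cell", "push"), ("cell", "param_grad")]
    print(classify(edges, g="gather", s="scatter"))

In this toy graph, pull is classified as eager, while push and param_grad are lazy, matching the coloring of Fig. 7.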
rectness. We next described our optimization strategies. Lazy batching and streaming4. In addition to batched Automatic kernel fusion. Given F, before execution, execution of F, the lazy batching and streaming explore Cavs will run a fusion detector [20] to scan its cor- potential parallelism for a certain group of finer-grained responded dataflow graph and report all fuse-able sub- operators in F or ∂F called lazy and eager operators. graphs therein, i.e. all nodes in a fuse-able subgraph Definition. An operator in F (∂F) is a lazy operator if at can be fused as a single operator that behaves equiva- K lently but takes less execution time (e.g. with fewer ker- the forward (backward) pass, for ∀v ∈ G,∀G ∈ {Gk}k=1, the evaluation of F (∂F) at any parent (dependent) ver- nel launches and I/O, or faster computation). Currently, tex of v does not rely on the evaluation of F at v. It is an we only detect groups of directly linked elementwise op- eager operator if the evaluation at v does not rely on the erators, such as +,sigmoid, as shown in Fig 7, and we evaluation of F (∂F) at any dependents (parents) of v. use a simple union-find algorithm to detect the largest possible fuse-able subgraphs. Given a fuse-able sub- Proposition. Denote DF (D F ) as the dataflow graph ∂ graph, Cavs adopts de facto automatic code generation encoded by F (∂F), and g,s ∈ DF (D F ) as nodes of ∂ techniques [37, 8, 38, 35] to generate lower-level kernel 4Streaming is a borrowed terminology from CUDA programming implementations. Replacing the original fuse-able sub- which means executing different commands concurrently with respect graphs with fused operators during execution is benefi- to each other on different GPU streams. As Cavs’ optimizations are agnostic to the low-level hardware, we use streaming interchangeably cial in many aspects: (1) it reduces the number of kernel with multi-threading if the underlying computing hardware is CPU. launches; (2) on some devices such as GPUs, kernel fu-

USENIX Association 2018 USENIX Annual Technical Conference 943 sion transform device memory access into faster device ablation studies, and discuss Cavs’ advantages over other registers access. We empirically report another 20% im- DL systems for dynamic dataflow graphs. provement with automatic kernel fusion (§5.3). Environment. We perform all experiments in this paper on a single machine with an NVIDIA Titan X (GM200) 4 Implementation GPU, a 16-core CPU, and CUDA .0 and cuDNN v6 Cavs is implemented as a C++ and integrable with installed. As modern DL models are mostly trained us- existing DL frameworks to enhance their support for dy- ing GPUs, we focus our evaluation on GPUs, but note namic NNs. It is composed of three major layers (which Cavs’ design and implementation do not rely on a spe- is the case for most popular frameworks [3, 1, 33]): (1) cific type of device. We mainly compare Cavs to Tensor- a frontend that provides device-agnostic symbolic pro- Flow v1.2 [1] with XLA [18] and its variant Fold [27], gramming interface; (2) an intermediate layer that im- PyTorch v0.3.0 [11], and DyNet v2.0 [33] with auto- plements the core execution logic; (3) a backend with batching [34], as they have reported better performance device-specific kernels for all symbolic operators. than other frameworks [5, 50] on dynamic NNs. We fo- Frontend. In addition to the four APIs, Cavs pro- cus on metrics for system performance, e.g. time to scan vides a macro operator VertexFunction. Users in- one epoch of data. Cavs produces exactly the same nu- stantiate it by writing symbolic expressions and speci- merical results with other frameworks, hence the same fying methods to read input graphs. It encapsulates scat- per-epoch convergence ter/gather semantics, so users can continue using higher Models and dataset. We experiment on the follow- level APIs. To construct more complex NN architectures ing models with increasing difficulty to batch: (a) (e.g. encoder-decoder LSTM [43], LRCN [9])), users Fixed-LSTM language model (LM): a static sequence employ push and pull to connect multiple vertex func- LSTM with fixed steps for language modeling [42, 43, tions, or to external structures. 55]. We train it using the PTB dataset [48] that contains Intermediate Layer. Cavs has its core runtime logic at over 10K different words. We set the number of steps as this layer, i.e. the batching scheduler, the memory man- 64, i.e. at each iteration of training, the model takes a 64- agement, and the execution engine, etc. word sentence from the training corpus, and predicts the Backend. Following practice [1, 33, 12], we imple- next word of each word therein. Obviously, the compu- ment device-specific operator kernels at this layer. Cavs tation can be by nature batched easily, as each sentence has optimized implementations for the four proposed op- has exactly the same size. (b) Var-LSTM LM: that ac- erators (gather, scatter, pull, push). Specifically, cepts variable-length inputs. At each iteration the model gather and pull index different slices of a tensor and takes a batch of natural sentences with different length puts them together continuously on memory; scatter from PTB, and predicts the next words; (c) Tree-FC: and push by contrast splits a tensor along its batch di- the benchmarking model used in [27] with a single fully- mension, and copy different slices to different places. connected layer as its cell function. 
Following the same Cavs implements customized memcpy kernels for there setting in [27], we train it over synthetic samples gener- four operators, so that copying multiple slices from (or ated by their code [47] – each sample is associated with a to) different places is performed within one kernel. complete binary tree with 256 leaves (therefore 511 ver- Distributed Execution. While Cavs’s implementations tices per graph); (d) Tree-LSTM: a family of dynamic are focused on improving the efficiency on a single node, NNs widely adopted for text analysis [26, 51]. We im- they are compatible with most data-parallel distributed plement the binary child-sum Tree-LSTM in [45], and systems for deep learning [56, 7, 1], and can also benefit train it as a sentiment classifier using Stanford sentiment distributed execution on multiple nodes. treebank (SST) dataset [40]. The dataset contains 8544 training sentences, each associated with a human anno- 5 Evaluation tated grammar tree, and the longest one has 54 words. In this section, we evaluate Cavs on multiple NNs and datasets, obtaining the following major findings: (1) 5.1 Overall Performance Cavs has little overhead: on static NNs, Cavs demon- We first verify the viability of our design on the easiest- strates equal performance on training and inference with to-batch case: Fixed-LSTM language model. We com- other systems; On several NNs with notably difficult-to- pare Cavs to the following three strong baselines: (1) batch structures, Cavs outperforms all existing frame- CuDNN [6]: a CuDNN-based fixed-step sequence LSTM, works by a large margin. (2) We confirm the graph which is highly optimized by NVIDIA using handcrafted construction overhead is substantial in both Fold [27] kernels and stands as the best performed implementation and dynamic declaration [33], while Cavs bypasses it by on NVIDIA GPUs; (2) TF: the official implementation loading input graphs through I/O. (3) We verify the ef- of Fixed-LSTM LM in TensorFlow repository [46] based fectiveness of our proposed design and optimization via on static declaration; (3) DyNet: we implement a 64-step

[Figure 8 (plots): per-epoch training time (x1e3 s) of CuDNN, TF, Fold, DyNet and Cavs; panels (a)-(d) show Fixed-LSTM, Var-LSTM, Tree-FC and Tree-LSTM at h = 512 with the batch size varying from 1 to 256, and panels (e)-(h) show the same models at bs = 64 with the hidden size varying from 64 to 2048. See the caption below.]
64 128 256 512 1024 2048 64 128 256 512 1024 2048 64 128 256 512 1024 2048 64 128 256 512 1024 2048 Figure 8: Comparing five systems on the averaged time to finish one epoch of training on four models: Fixed-LSTM, Var-LSTM, Tree-FC and Tree-LSTM. In (a)-(d) we fix the hidden size h and vary the batch size bs, while in (e)-(h) we fix bs and vary h. LSTM in DyNet based on dynamic declaration – we de- namic NNs, with a depth-based dynamic batching strat- clare a dataflow graph per sample, and train with the au- egy. To enable the batching, it however needs to prepro- tobatching [34] enabled; (4) Cavs with batching policy, cess the input graphs, translate them into intermediate and all input samples have a same input graph – a 64- representations and pass them to lower-level TensorFlow node chain. We train the model to converge, and report control flow engine for execution. We report the results the average time per epoch in Fig 8(a)(e), where in (a) we in Figure 8(c)(g) with varying bs and h, respectively. For fix the hidden size h of the LSTM unit as 512 and vary all systems, we allocate a single CPU for graph the batch size bs, and in (e) we fix bs = 64 and vary h. preprocessing or construction. Cavs shows at least an Empirically, CuDNN performs best in all cases, but note order of magnitude speedups than Fold and DyNet at it is highly inflexible. Cavs performs slightly better than h ≤ 512. Because the size of the synthetic trees is large, TF in various settings, verifying that our system has little one major advantage of Cavs over them is the allevia- overhead handling fully static graphs, though it is spe- tion of graph preprocessing/construction overhead. With cialized for dynamic ones. We also conclude from Fig 8 a single CPU thread, Fold takes even more time on graph that batching is essential for GPU-based DL: bs = 128 is preprocessing than computation (§5.3). nearly one order of magnitude faster than bs = 1 regard- Finally, we compare three frameworks on Tree-LSTM less of used frameworks. For Cavs, the batching policy in Figure 8(d)(h): Cavs is 8-10x faster than Fold, and is 1.7x, 3.8x, 7.0x, 12x, 15x, 25x, 36x faster than non- consistently outperforms DyNet. One difference in this batched at bs = 2,4,8,16,32,64,128, respectively. experiment is that we allocate as many CPU threads as Next, we experiment with Var-LSTM, the most com- possible (32 on our machine) to accelerate graph pre- monly used RNN for variable-length sequences. We processing for Fold, otherwise it will take much longer compare the following three implementations (CuDNN- time. Further, we note DyNet performs much better here based LSTM cannot handle variable-length inputs): (1) than on Tree-FC, as the size of the input graphs in SST TF: an official TensorFlow implementation based on the (maximally 54 leaves) is much smaller than the synthetic dynamic unroll approach described in §6; (2) DyNet: an ones (256 leaves each) in Tree-FC experiments. We official implementation from DyNet benchmark reposi- observe DyNet needs more time on graph construction tory based on dynamic declaration [10]; (3) Cavs: where for large input graphs, and DyNet’s dynamic batching is each input sentence is associated with a chain graph that less effective on larger input graphs, as it has to perform has number of vertices equal to the number of words. We frequent memory checks to support its dynamic batch- vary h and bs, and report the results in Figure 8(b)(f), re- ing, which we will discuss in §5.3. We also compare spectively. 
Although all three systems perform batched Cavs with PyTorch – its per-epoch time on Tree-LSTM computation in different ways, Cavs is consistently 2-3 is 542s, 290x slower than Cavs when the batch size is times faster than TF, and outperforms DyNet by a large 256. Compared to other systems, PyTorch cannot batch margin. Compared to TF, Cavs saves computational re- the execution of dynamic NNs. sources. TF dynamically unrolls the LSTM unit accord- 5.2 Graph Construction and Computation ing to the longest sentence in the current batch, but it In this section, we investigate the graph construction cannot prevent unnecessary computation for those sen- overhead in Fold and DyNet. To batch computation of tences that are shorter than the longest one. different graphs, Fold analyzes the input graphs to recog- We then turn to Tree-FC, a dynamic model for bench- nize batch-able dynamic operations, then translates them marking. Since vanilla TensorFlow is unable to batch into intermediate instructions, with which, TensorFlow its computation, we compare Cavs to (1) DyNet and (2) generates appropriate control flow graphs for evaluation Fold, a specialized library built upon TensorFlow for dy- – we will treat the overhead caused in both steps as

Fold's graph construction overhead. As a typical dynamic declaration framework, DyNet has to construct as many dataflow graphs as the number of samples. Though DyNet has optimized its graph construction to make it lightweight, the overhead still grows with the training set size and the size of the input graphs. By contrast, Cavs takes constant time to construct the small dataflow graph encoded by F, and then reads input graphs through I/O. To quantify the overhead, we separate graph construction from computation, and visualize in Figure 9(a) how it changes with the average number of leaves (graph size) of the training graphs on Tree-FC. We compare Cavs, DyNet, and Fold-1, which is Fold with a single graph processing thread, and plot both the (averaged) absolute construction time per epoch and its percentage of the overall time. Clearly, all three systems take increasingly more time as the size of the input graphs grows, but Cavs, which loads the graphs through I/O, takes roughly 20x less time in all settings. In terms of relative time, Fold-1 and DyNet unfortunately waste at least 50% of the epoch at 32 leaves and 80% at 1024 leaves, while Cavs spends only 10% and 20%, respectively.

We also wonder how the overhead is related to the batch size when the per-sample computational workload is fixed. We report in Figure 9(b) the same metrics when training Tree-LSTM with batch sizes bs = 1, 16, 32, 64, 128, 256. We add another baseline, Fold-32, which is Fold with 32 threads for graph preprocessing. Except Fold-1 (whose construction times we report here instead of in the figure: 1.1, 7.14, 31.35, 40.1, 46.13, 48.77), all systems take an almost constant graph construction time per epoch, regardless of bs. Nevertheless, the percentage of this overhead becomes more prominent at larger bs, because a larger batch size yields better batched computational efficiency and therefore less time to finish one epoch. This, from one perspective, reflects that graph construction, whose cost grows with the number of training samples, is a main obstacle that prevents efficient training of dynamic NNs in existing frameworks, while Cavs successfully overcomes this barrier.

Figure 9: The averaged graph construction overhead per epoch when training (a) Tree-FC (bs = 64, h = 512) with different sizes of input graphs and (b) Tree-LSTM (h = 512) with different batch sizes. The curves show absolute time in seconds (left y-axis), and the bar graphs show its percentage of the overall time (right y-axis).

Apart from graph construction, we report in Table 1 the computation-only time. Cavs shows maximally 5.4x/9.7x and 7.2x/2.4x speedups over Fold/DyNet on Tree-FC and Tree-LSTM, respectively. The advantages stem from two main sources: an optimized graph execution engine, and a better-suited memory management strategy, which we investigate next.

[Table 1: the averaged computation time and the speedup of Cavs over Fold/DyNet for training one epoch, on Tree-FC with varying size of the input trees (left part) and on Tree-LSTM with varying batch size (right part).]

5.3 Optimizations

Graph Execution Engine. To reveal how much each optimization in §3.4 contributes to the final performance, we disable lazy batching, fusion and streaming, and set this configuration as a baseline (speedup = 1). We then turn on one optimization at a time, record the computation-only time to train one epoch, and report in Fig 10 the averaged speedup over the baseline on Fixed-LSTM and Tree-LSTM, with bs = 64 and varying h. Lazy batching and fusion consistently deliver nontrivial improvement: lazy batching is more beneficial with a larger h, while fusion is more effective at smaller h. The advantages are expected: lazy batching mainly parallelizes matrix-wise operations (e.g. matmul), commonly with O(h^2) or higher complexity, while fusion mostly works on elementwise operations with O(h) complexity [19]. Streaming, compared to the other strategies, is less effective on Tree-LSTM, as we have found the depth of the input trees in SST has high variance, i.e. some trees are much deeper than others. In this case, many batching tasks have only one vertex to be evaluated; the computation is highly fragmented and its efficiency is bounded by kernel launching latency. Lazy batching and fusion still help, as they both reduce kernel launches, while streaming, which tries to pipeline the execution of multiple kernels, can hardly yield obvious improvement.

Memory Management. Cavs' performance advantage also credits to its memory management, which reduces memory movement (e.g. memcpy) during dynamic batching while guaranteeing memory continuity. Quantitatively, it is difficult to compare Cavs to Fold, whose memory management is highly coupled with the TensorFlow runtime it relies on. Qualitatively, we find Cavs requires less memory movement than other systems. Built upon TensorFlow, whenever Fold performs depth-
  Programming Model                        | Frameworks        | Expressiveness | Batching | Graph Construction Overhead | Graph Optimization
  static declaration                       | Caffe, TensorFlow | x              | √        | low                         | beneficial
  dynamic declaration (eager evaluation)   | PyTorch, Chainer  | √              | x        | N/A                         | unavailable
  dynamic declaration (lazy evaluation)    | DyNet             | √              | √        | high                        | limited benefits
  Fold                                     | TensorFlow-Fold   | √              | √        | high                        | unknown
  vertex-centric                           | Cavs              | √              | √        | low                         | beneficial

Table 2: A side-by-side comparison of existing programming models for dynamic NNs, and their advantages and disadvantages.

[Figure 10: Improvement of each optimization strategy (lazy batching, fusion, streaming) on the execution engine over a baseline configuration (speedup = 1), for (a) Fixed-LSTM (bs = 64) and (b) Tree-LSTM (bs = 64); x-axis: hidden size h from 64 to 2048, y-axis: speedup (x).]

Memory Management. Cavs' improvement also comes from its memory management, which reduces memory movements while guaranteeing memory continuity. Quantitatively, it is difficult to compare Cavs with Fold, whose memory management is highly coupled with other system aspects; qualitatively, we find Cavs requires less memory movement (e.g. memcpy) during dynamic batching. Built upon TensorFlow, Fold relies on TensorFlow's dataflow operators: whenever Fold performs depth-based batching at depth d, it has to move all the contents of nodes in the dataflow graph at depth d − 1 to a desired location, as the control flow does not support cross-depth memory indexing. This results in redundant memcpy, especially when the graphs are highly skewed. By contrast, Cavs only copies the contents that are necessary to the batching task. DyNet has a specialized memory management strategy for dynamic NNs; compared to Cavs, however, it suffers substantial overhead from repeated checks of memory continuity: whenever DyNet wants to batch operators with the same signature, it checks whether their inputs are contiguous in memory [34]. The checking overhead increases with bs and is more prominent on GPUs. Thanks to the simplicity of both systems, we are able to profile the memory-related overhead during both training and inference and separate it from computation. We compare them on Tree-LSTM and report the breakdown of time per epoch in Table 3 under different bs. We observe that the improvement is significant (2x–3x) at larger bs, especially during inference, where DyNet's continuity checks are concentrated.
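As a rough illustration of the contrast drawn above (array names and shapes are invented; this is not DyNet's or Cavs' internal code), the sketch below shows a gather that copies only the rows a batching task needs, next to a contiguity check of the kind DyNet performs before deciding whether a copy can be avoided [34].

    import numpy as np

    states = np.random.randn(10000, 256)        # one row of state per graph node
    needed = np.array([12, 13, 14, 907, 4123])  # nodes feeding the current batched op

    # Gather only what the batching task needs: one targeted copy.
    batch = states[needed]

    # Signature-based batching with a contiguity check: batch in place only if the
    # inputs already sit contiguously; otherwise fall back to a copy. The repeated
    # check itself is overhead that grows with the batch size.
    def contiguous(idx: np.ndarray) -> bool:
        return idx.size > 0 and bool(np.all(np.diff(idx) == 1))

    inputs = states[needed[0]:needed[-1] + 1] if contiguous(needed) else states[needed]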

    | Memory operations (s), Cavs / DyNet | Computation (s), Cavs / DyNet
 bs | Train       | Inference             | Train      | Inference
 16 | 1.14 / 1.33 | 0.6 / 1.33            | 9.8 / 12   | 2.9 / 8.53
 32 | 0.67 / 0.87 | 0.35 / 0.87           | 6.1 / 9.8  | 1.9 / 5.35
 64 | 0.39 / 0.6  | 0.21 / 0.6            | 4.0 / 7.4  | 1.3 / 3.48
128 | 0.25 / 0.44 | 0.13 / 0.44           | 2.9 / 5.9  | 0.97 / 2.52
256 | 0.17 / 0.44 | 0.09 / 0.44           | 2.3 / 5.4  | 0.77 / 2.58

Table 3: Breakdowns of average time per epoch on memory-related operations and on computation, comparing Cavs to DyNet on training and inference of Tree-LSTM with varying bs.

6 Related Work

DL programming models. In addition to §2.2, we summarize in Table 2 the major programming models and frameworks for dynamic NNs, and their pros and cons, in contrast to Cavs. Within static frameworks, there are also efforts to adapt static declaration to support sequence RNNs, such as static unrolling [17], bucketing [15] and dynamic unrolling [16]. The idea is to pad zeros at the end of samples so that they all have the same structure (i.e. the same length) for batched computation. However, these approaches result in unnecessary computation and cannot express structures more complex than sequences.
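The padding idea is simple enough to show in a few lines; the sketch below (our own illustration, not any framework's API) batches variable-length sequences by zero-padding, which is exactly where the wasted computation and the restriction to sequence-shaped inputs come from.

    import numpy as np

    seqs = [[4, 9, 1], [7], [2, 2, 5, 8]]              # variable-length samples
    T = max(len(s) for s in seqs)                      # common unrolled length
    batch = np.zeros((len(seqs), T), dtype=np.int64)   # zero-padded to one shape
    for row, s in enumerate(seqs):
        batch[row, :len(s)] = s

    # Every unrolled step now runs over all rows, padded positions included, so
    # short samples pay for the longest one; and a flat (sample, step) layout
    # cannot encode trees or graphs.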

Asynchronous model-parallelism [13] enables the concurrent execution of different graphs, similar to batched execution in Cavs; however, it may suffer from insufficient cache reuse and from the overhead of multiple kernel launches (on GPUs).

Execution optimization. A variety of techniques developed in other areas (e.g. kernel fusion, constant folding) have been adapted to speed up the computation of DL dataflow graphs [1, 5, 12, 18]. Cavs separates the static vertex function from the dynamically varying input graph, so it benefits from most of the aforementioned optimizations. We learn from these strategies and reflect them in Cavs' execution engine, and we further propose lazy batching and concurrent execution to exploit the additional parallelism exposed by our APIs.

Graph-based systems. The vertex-centric programming model has been extensively developed in graph computing [29, 14, 4, 41]. Cavs draws insights from the GAS model [14] but is fundamentally different: gather and scatter in Cavs are fully symbolic, allowing backpropagation through them; and graph computing systems compute on large natural graphs, while Cavs addresses problems where each sample has a unique graph and training is iterative over batches of samples. In terms of system design, Cavs also faces different challenges, such as scheduling the batched execution of different graphs and guaranteeing memory continuity. There are also graph-based ML systems, such as GraphLab [28], but they do not handle instance-based graphs and do not offer batching advantages for dynamic DL workloads.
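For readers unfamiliar with the vertex-centric style, the following schematic (plain NumPy, not Cavs' API; in Cavs the gather/scatter are symbolic operators that support backpropagation) evaluates a tiny tree bottom-up with one shared vertex function, with the instance-specific graph determining only the order of calls.

    import numpy as np

    h = 128
    W_child = np.random.randn(h, h)
    W_input = np.random.randn(h, h)

    def vertex_fn(child_states, x):
        """Combine gathered child states with this vertex's own input."""
        gathered = child_states.sum(axis=0) if child_states.size else np.zeros(h)
        return np.tanh(gathered @ W_child + x @ W_input)  # value scattered to the parent

    # A two-leaf tree, evaluated leaves-first; every vertex runs the same function.
    x1, x2, xr = (np.random.randn(h) for _ in range(3))
    s1 = vertex_fn(np.empty((0, h)), x1)
    s2 = vertex_fn(np.empty((0, h)), x2)
    root = vertex_fn(np.stack([s1, s2]), xr)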
7 Conclusion

We present Cavs, an efficient system for dynamic neural networks. With a novel representation, a carefully designed scheduling policy, a memory management strategy, and a set of graph execution optimizations, Cavs avoids substantial graph construction overhead, allows for batched computation over differently structured graphs, and can benefit from well-established graph optimization techniques. We compare Cavs to state-of-the-art systems for dynamic NNs and report a nearly one-order-of-magnitude speedup across various dynamic NN architectures and settings.

References

[1] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. TensorFlow: A system for large-scale machine learning. arXiv preprint arXiv:1605.08695 (2016).
[2] Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
[3] Bergstra, J., Bastien, F., Breuleux, O., Lamblin, P., Pascanu, R., Delalleau, O., Desjardins, G., Warde-Farley, D., Goodfellow, I. J., Bergeron, A., and Bengio, Y. Theano: Deep learning on GPUs with Python. In NIPSW (2011).
[4] Chen, R., Shi, J., Chen, Y., and Chen, H. PowerLyra: Differentiated graph computation and partitioning on skewed graphs. In Proceedings of the Tenth European Conference on Computer Systems (2015), ACM, p. 1.
[5] Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., and Zhang, Z. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274 (2015).
[6] Chetlur, S., Woolley, C., Vandermersch, P., Cohen, J., Tran, J., Catanzaro, B., and Shelhamer, E. cuDNN: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759 (2014).
[7] Cui, H., Zhang, H., Ganger, G. R., Gibbons, P. B., and Xing, E. P. GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter server. In Proceedings of the Eleventh European Conference on Computer Systems (2016), ACM, p. 4.
[8] Dave, C., Bae, H., Min, S.-J., Lee, S., Eigenmann, R., and Midkiff, S. Cetus: A source-to-source compiler infrastructure for multicores. Computer 42, 12 (2009).
[9] Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., and Darrell, T. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), pp. 2625–2634.
[10] DyNet variable length LSTM. https://github.com/neulab/dynet-benchmark.
[11] Facebook. http://pytorch.org/.
[12] Facebook Open Source. Caffe2 is a lightweight, modular, and scalable deep learning framework. https://github.com/caffe2/caffe2, 2017.
[13] Gaunt, A., Johnson, M., Riechert, M., Tarlow, D., Tomioka, R., Vytiniotis, D., and Webster, S. AMPNet: Asynchronous model-parallel training for dynamic neural networks. arXiv preprint arXiv:1705.09786 (2017).
[14] Gonzalez, J. E., Low, Y., Gu, H., Bickson, D., and Guestrin, C. PowerGraph: Distributed graph-parallel computation on natural graphs.
[15] Google. TensorFlow bucketing. https://www.tensorflow.org/versions/r0.12/api_docs/python/contrib.training/bucketing.
[16] Google. TensorFlow dynamic RNN. https://www.tensorflow.org/api_docs/python/tf/nn/dynamic_rnn.
[17] Google. TensorFlow static RNN. https://www.tensorflow.org/api_docs/python/tf/nn/static_rnn.
[18] Google TensorFlow XLA. https://www.tensorflow.org/performance/xla/.
[19] Gustafson, J. L. Reevaluating Amdahl's law. Communications of the ACM 31, 5 (1988), 532–533.
[20] Gysi, T., Osuna, C., Fuhrer, O., Bianco, M., and Schulthess, T. C. STELLA: A domain-specific tool for structured grid methods in weather and climate models. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC) (2015), IEEE, pp. 1–12.
[21] Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A.-R., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine 29, 6 (2012), 82–97.
[22] Hochreiter, S., and Schmidhuber, J. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[23] Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093 (2014).
[24] Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In NIPS (2012).
[25] Liang, X., Hu, Z., Zhang, H., Gan, C., and Xing, E. P. Recurrent topic-transition GAN for visual paragraph generation. arXiv preprint arXiv:1703.07022 (2017).
[26] Liang, X., Shen, X., Feng, J., Lin, L., and Yan, S. Semantic object parsing with graph LSTM. In European Conference on Computer Vision (2016), Springer, pp. 125–143.
[27] Looks, M., Herreshoff, M., Hutchins, D., and Norvig, P. Deep learning with dynamic computation graphs. arXiv preprint arXiv:1702.02181 (2017).
[28] Low, Y., Gonzalez, J. E., Kyrola, A., Bickson, D., Guestrin, C. E., and Hellerstein, J. GraphLab: A new framework for parallel machine learning. arXiv preprint arXiv:1408.2041 (2014).
[29] Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, I., Leiser, N., and Czajkowski, G. Pregel: A system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (2010), ACM, pp. 135–146.
[30] Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
[31] Mitchell, D. C. Sentence parsing. Handbook of Psycholinguistics (1994), 375–409.
[32] Murray, D. G., McSherry, F., Isaacs, R., Isard, M., Barham, P., and Abadi, M. Naiad: A timely dataflow system. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (2013), ACM, pp. 439–455.
[33] Neubig, G., Dyer, C., Goldberg, Y., Matthews, A., Ammar, W., Anastasopoulos, A., Ballesteros, M., Chiang, D., Clothiaux, D., Cohn, T., et al. DyNet: The dynamic neural network toolkit. arXiv preprint arXiv:1701.03980 (2017).
[34] Neubig, G., Goldberg, Y., and Dyer, C. On-the-fly operation batching in dynamic computation graphs. arXiv preprint arXiv:1705.07860 (2017).
[35] NVIDIA. http://docs.nvidia.com/cuda/nvrtc/index.html.
[36] Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch.
[37] Quinlan, D. ROSE: Compiler support for object-oriented frameworks. Parallel Processing Letters 10, 2–3, 215–226.
[38] Ragan-Kelley, J., Barnes, C., Adams, A., Paris, S., Durand, F., and Amarasinghe, S. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. ACM SIGPLAN Notices 48, 6 (2013), 519–530.
[39] Simonyan, K., and Zisserman, A. Very deep convolutional networks for large-scale image recognition. In ICLR (2015).
[40] Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A., and Potts, C. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (2013), pp. 1631–1642.
[41] Sundaram, N., Satish, N., Patwary, M. M. A., Dulloor, S. R., Anderson, M. J., Vadlamudi, S. G., Das, D., and Dubey, P. GraphMat: High performance graph analytics made productive. Proceedings of the VLDB Endowment 8, 11 (2015), 1214–1225.
[42] Sundermeyer, M., Schlüter, R., and Ney, H. LSTM neural networks for language modeling. In Thirteenth Annual Conference of the International Speech Communication Association (2012).
[43] Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (2014), pp. 3104–3112.
[44] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A., et al. Going deeper with convolutions.
[45] Tai, K. S., Socher, R., and Manning, C. D. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075 (2015).
[46] TensorFlow fixed-sized LSTM language model. https://github.com/tensorflow/models/blob/master/tutorials/rnn/ptb/ptb_word_lm.py.
[47] TensorFlow Fold benchmark code. https://github.com/tensorflow/fold/tree/master/tensorflow_fold/loom/benchmarks.
[48] The Penn Tree Bank (PTB) dataset. http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz.
[49] Tokui, S., Oono, K., Hido, S., and Clayton, J. Chainer: A next-generation open source framework for deep learning. In Proceedings of Workshop on Machine Learning Systems (LearningSys) in the Twenty-ninth Annual Conference on Neural Information Processing Systems (NIPS) (2015), vol. 5.
[50] Tokui, S., Oono, K., Hido, S., and Clayton, J. Chainer: A next-generation open source framework for deep learning. In Proceedings of Workshop on Machine Learning Systems (LearningSys) in the Twenty-ninth Annual Conference on Neural Information Processing Systems (NIPS) (2015).
[51] Vinyals, O., Kaiser, Ł., Koo, T., Petrov, S., Sutskever, I., and Hinton, G. Grammar as a foreign language. In Advances in Neural Information Processing Systems (2015), pp. 2773–2781.
[52] van der Walt, S., Colbert, S. C., and Varoquaux, G. The NumPy array: A structure for efficient numerical computation. Computing in Science & Engineering 13, 2 (2011), 22–30.
[53] Yan, Z., Zhang, H., Jagadeesh, V., DeCoste, D., Di, W., and Piramuthu, R. HD-CNN: Hierarchical deep convolutional neural network for image classification. In ICCV (2015).
[54] Yan, Z., Zhang, H., Wang, B., Paris, S., and Yu, Y. Automatic photo adjustment using deep neural networks. ACM Transactions on Graphics (TOG) 35, 2 (2016), 11.
[55] Zaremba, W., Sutskever, I., and Vinyals, O. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329 (2014).
[56] Zhang, H., Hu, Z., Wei, J., Xie, P., Kim, G., Ho, Q., and Xing, E. Poseidon: A system architecture for efficient GPU-based deep learning on multiple machines. arXiv preprint arXiv:1512.06216 (2015).
[57] Zhang, H., Zheng, Z., Xu, S., Dai, W., Ho, Q., Liang, X., Hu, Z., Wei, J., Xie, P., and Xing, E. P. Poseidon: An efficient communication architecture for distributed deep learning on GPU clusters. In 2017 USENIX Annual Technical Conference (USENIX ATC 17) (Santa Clara, CA, 2017), USENIX Association, pp. 181–193.