Software Frameworks for Deep Learning at Scale
James Fox (Georgia Institute of Technology), Yiming Zou (Indiana University Bloomington), Judy Qiu (Indiana University Bloomington)
[email protected], [email protected], [email protected]

ABSTRACT

The study and adoption of deep learning methods has led to significant progress in different application domains. As deep learning continues to show promise and its utilization matures, so does the infrastructure and software needed to support it. Various frameworks have been developed in recent years to facilitate both implementation and training of deep learning networks. As deep learning has also evolved to scale across multiple machines, there is a growing need for frameworks that can provide full parallel support. While deep learning frameworks restricted to running on a single machine have been studied and compared, frameworks which support parallel deep learning at scale are relatively less known and well-studied. This paper seeks to bridge that gap by surveying, summarizing, and comparing frameworks which currently support distributed execution, including but not limited to Tensorflow, CNTK, Deeplearning4j, MXNet, H2O, Caffe, Theano, and Torch.

1. INTRODUCTION

Deep learning has been quite successful in improving predictive power in domains such as computer vision and natural language processing. State-of-the-art performance in computer vision is driven by the convolutional neural network model, a special kind of feed-forward deep learning model. The high-level idea is to learn image filters for extracting meaningful features and making predictions. Natural language processing, on the other hand, has had a lot of success applying recurrent neural networks, a type of feedback model well-suited for learning order- and context-sensitive sequences (as in natural languages).

As accuracies continue to increase in both domains, so do the complexity of network architectures and the size of the parameter space. Google's network for unsupervised learning of image features reached a billion parameters [22], and was increased to 11 billion parameters in a separate experiment at Stanford [27]. In the NLP space, Digital Reasoning Systems fairly recently trained a 160 billion parameter network [29]. Handling problems of this size involves looking beyond the single machine, which Google first demonstrated through its distributed DistBelief framework [21].

The goal of this paper is to survey the landscape of deep learning frameworks with full support for parallelization. Three levels of parallelization exist at the hardware level: within a GPU, between GPUs on a single node, and between nodes. Two forms of parallelism also exist at the application level: model and data parallelism. Other aspects of frameworks include release date, core language, user-facing API, synchronization model, communication model, supported deep learning model types, programming paradigm, fault tolerance, and visualization. This choice of criteria is explained in detail in Section 2. Tensorflow, CNTK, Deeplearning4j, MXNet, H2O, Caffe, Theano, and Torch do not necessarily encompass the entire space of frameworks for deep learning, but were selected by a combination of factors: their being open-source, level of documentation, maturity as a complete product, and level of adoption by the community. These frameworks, as well as some others not included in the Section 2 chart, are examined in detail in Section 3. Section 4 discusses finer points of parallelism and scalability in deep learning which may escape attention but are pertinent to the framework discussion. Section 5 offers concluding remarks.

2. FRAMEWORK COMPARISON

The relevance of release date, core language, and user-facing APIs is self-explanatory. The synchronization model specifies how data consistency is maintained during execution, i.e. whether updates are synchronous or asynchronous. In the context of optimization kernels like stochastic gradient descent (SGD), synchronous execution has better convergence guarantees because it maintains consistency or near-consistency with sequential execution. Asynchronous SGD can exploit more parallelism and train faster, but with weaker guarantees on convergence speed. Frameworks like Tensorflow and MXNet leave this tradeoff as a choice to the user.
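To make the tradeoff concrete, the following minimal sketch in plain Python/NumPy (not taken from any of the surveyed frameworks; worker_gradient, synchronous_step, and asynchronous_step are made-up helper names, and the least-squares gradient is purely illustrative) contrasts a synchronous step, which waits for every worker's gradient before applying one aggregate update, with an asynchronous step, which applies each gradient as it arrives against already-updated parameters.

    import numpy as np

    learning_rate = 0.01

    def worker_gradient(params, batch):
        # Stand-in for one worker's gradient on its own mini-batch
        # (here: least-squares gradient for a linear model).
        data, targets = batch
        return data.T @ (data @ params - targets) / len(targets)

    def synchronous_step(params, batches):
        # Synchronous SGD: collect all worker gradients, then apply a
        # single consistent update, mimicking sequential execution.
        grads = [worker_gradient(params, b) for b in batches]
        return params - learning_rate * np.mean(grads, axis=0)

    def asynchronous_step(params, batches):
        # Asynchronous SGD: apply each gradient as soon as it is available;
        # later gradients are computed against already-updated ("stale")
        # parameters, which is what weakens the convergence guarantees.
        for b in batches:
            params = params - learning_rate * worker_gradient(params, b)
        return params

    rng = np.random.default_rng(0)
    batches = [(rng.normal(size=(8, 10)), rng.normal(size=8)) for _ in range(4)]
    params = np.zeros(10)
    print(synchronous_step(params, batches))
    print(asynchronous_step(params, batches))

The asynchronous loop is written sequentially here only to show the update semantics; in a real framework the workers run concurrently, which is where the extra parallelism (and the staleness) comes from.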
The communication model tries to categorize the nature of across-machine execution according to well-known paradigms. There are three possible levels of parallelism at the hardware level: across cores within a CPU/GPU device, across multiple devices (usually GPUs for deep learning), and across machines. Most lower-level library kernels (e.g. for linear algebra) are designed to use multiple cores of a device by default, so this is not a major point of comparison. At this point, all the frameworks also support parallelism across multiple GPUs. Theano and Torch do not yet support multi-machine parallelism.

Data and model parallelism are the two prevalent opportunities for parallelism when training deep learning networks at the distributed level. In data parallelism, copies of the model, or parameters, are each trained on their own subset of the training data while updating the same global model. In model parallelism, the model itself is partitioned and trained in parallel.

Table 1: Open-source Frameworks (platforms compared: Tensorflow, CNTK, Deeplearning4j, MXNet, H2O, Caffe, Theano, Torch)

  Release Date: Tensorflow 2016; CNTK 2016; Deeplearning4j 2015; MXNet 2015; H2O 2014; Caffe 2014; Theano 2010; Torch 2011 (deep learning).
  Core Language: Tensorflow C++; CNTK C++; Deeplearning4j C++; MXNet C++; H2O Java; Caffe C++; Theano C++; Torch C.
  API: Tensorflow C++, Python; CNTK NDL; Deeplearning4j Java, Scala; MXNet Python, R, Scala, Matlab, Javascript, Go, Julia; H2O R, Python, Java, Scala, Javascript, web-UI; Caffe C++, Python, Matlab; Theano Python; Torch Lua.
  Synchronization Model: Tensorflow sync or async; CNTK sync; Deeplearning4j sync; MXNet sync or async; H2O async; Caffe sync; Theano async; Torch sync.
  Communication Model: Tensorflow parameter server; CNTK MPI; Deeplearning4j iterative MapReduce; MXNet parameter server; H2O distributed fork-join; Caffe, Theano, Torch N/A.
  Multi-GPU: supported by all eight frameworks.
  Multi-node: supported by Tensorflow, CNTK, Deeplearning4j, MXNet, and H2O; not supported by Caffe, Theano, or Torch.
  Data Parallelism: supported by all eight frameworks.
  Model Parallelism: supported by Tensorflow, MXNet, Theano, and Torch; not supported by Deeplearning4j, H2O, or Caffe; N/A for CNTK.
  Deep Learning Models: DBN, CNN, and RNN for all frameworks except H2O (DBN only).
  Programming Paradigm: Tensorflow imperative; CNTK imperative; Deeplearning4j declarative; MXNet both; H2O declarative; Caffe declarative; Theano imperative; Torch imperative.
  Fault Tolerance: checkpoint-and-resume for Tensorflow, CNTK, Deeplearning4j, MXNet, and H2O; checkpoint-and-recovery for Caffe; N/A for Theano and Torch.
  Visualization: Tensorflow graph (interactive) and training monitoring; CNTK none; Deeplearning4j training monitoring; MXNet graph (static); H2O summary statistics; Caffe none; Theano graph (static); Torch plots.

Deep learning models can be categorized into three major types: deep-belief networks (DBNs), convolutional neural networks (CNNs), and recurrent neural networks (RNNs). CNNs and RNNs were briefly described in the introduction. DBNs are less domain-specific compared to CNNs and RNNs, and could be considered a precursor to CNNs, but are fundamental nonetheless.

Programming paradigm falls into the categories of imperative, declarative, or a mix of both. Conventionally, imperative programming specifies how a computation is done, whereas declarative programming specifies what needs to be done. There is plenty of gray area, but the distinction is made in this paper based on whether the API exposes the user to computation details that require some understanding of the inner math of neural networks (imperative), or whether the abstraction is yet higher (declarative); a short sketch contrasting the two styles appears at the end of this section.

Fault tolerance is included for two reasons. Distributed execution tends to be more failure-prone, especially at scale. Furthermore, any failures (not necessarily limited to distributed execution) that interrupt training part-way can be very costly if all the progress made on the model is simply lost.

Finally, UI/visualization is a feature supported to very different degrees across the frameworks studied. The ability to monitor the progress of training and the internal state of networks over time could be useful for debugging or hyperparameter tuning, and could be an interesting direction for further development. Tensorflow and Deeplearning4j both support this kind of visualization.
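The sketch below illustrates the imperative/declarative distinction under assumptions of our own: the imperative half spells out a layer's forward pass as explicit matrix operations, the kind of primitives exposed by Tensorflow, Theano, or Torch, while the declarative half hides the numerics behind a small, hypothetical Network class standing in for higher-level interfaces such as Deeplearning4j's or H2O's. Neither half is taken from a real framework's API.

    import numpy as np

    rng = np.random.default_rng(0)

    # Imperative style: the user works directly with the underlying math.
    def dense_forward(x, weights, bias):
        # Explicit matrix multiply followed by an element-wise ReLU.
        return np.maximum(0.0, x @ weights + bias)

    x = rng.normal(size=(32, 784))                   # a batch of 32 inputs
    w = rng.normal(scale=0.01, size=(784, 128))
    b = np.zeros(128)
    hidden = dense_forward(x, w, b)

    # Declarative style: the user states what the network is and leaves the
    # computation details to the library. This Network class is hypothetical.
    class Network:
        def __init__(self, input_size):
            self.sizes = [input_size]
            self.layers = []

        def add_dense(self, units):
            w = rng.normal(scale=0.01, size=(self.sizes[-1], units))
            self.layers.append((w, np.zeros(units)))
            self.sizes.append(units)

        def predict(self, x):
            for w, b in self.layers:
                x = dense_forward(x, w, b)
            return x

    model = Network(input_size=784)
    model.add_dense(units=128)
    model.add_dense(units=10)
    outputs = model.predict(x)

MXNet is listed as "both" in Table 1 because it lets the user mix the two styles in one program.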
3.1 Tensorflow

Tensorflow was released by Google Research as open source in November 2015, and included distributed support in 2016. The user-facing APIs are C++ and Python. Programming with Tensorflow leans more imperative. While plenty of abstraction is available in its library, the user will probably also be working with wrappers around computational primitives such as matrix operations, element-wise math operators, and looping control. In other words, the user is exposed to some of the internal workings of deep learning networks.

Tensorflow treats networks as a directed graph of nodes encapsulating dataflow computation and the required dependencies [15]. Each node, or computation, gets mapped to a device (CPU or GPU) according to some cost function. This partitions the overall graph into subgraphs, one per device. Cross-device edges are replaced with communication operations that encode the necessary synchronization between device pairs. Distributed execution appears to be a natural extension of this arrangement, except that TCP or Remote Direct Memory Access (RDMA) is used for communication between devices on separate machines. This approach of mapping subgraphs onto devices also offers potential scalability, because each worker can schedule its own subgraph at runtime instead of relying on a centralized master [15].

Parallelism in Tensorflow can be expressed at several levels, notably both data parallelism and model parallelism. Data parallelism can happen both across and within workers, by training separate batches of data on replicas of the model. Model parallelism is expressed by assigning different parts of the model graph to different devices or machines.
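As a rough illustration of how this looks in code, the sketch below uses the 1.x-era graph API described above to pin two halves of a small model onto different devices, leaving the Tensorflow runtime to insert the cross-device communication. The device strings, layer sizes, and random input batch are illustrative assumptions, and soft placement is enabled so the graph can still run on a machine without two GPUs.

    import numpy as np
    import tensorflow as tf   # written against the 1.x graph-style API

    # One model whose two halves are pinned to different devices; Tensorflow
    # replaces the cross-device edges with the communication and
    # synchronization it needs (model parallelism by graph placement).
    with tf.device("/cpu:0"):
        x = tf.placeholder(tf.float32, shape=[None, 784], name="inputs")

    with tf.device("/gpu:0"):
        w1 = tf.Variable(tf.random_normal([784, 256], stddev=0.01))
        h1 = tf.nn.relu(tf.matmul(x, w1))        # first half of the model

    with tf.device("/gpu:1"):
        w2 = tf.Variable(tf.random_normal([256, 10], stddev=0.01))
        logits = tf.matmul(h1, w2)               # second half of the model

    batch = np.random.rand(32, 784).astype("float32")   # illustrative input
    config = tf.ConfigProto(allow_soft_placement=True)  # fall back if no GPU
    with tf.Session(config=config) as sess:
        sess.run(tf.global_variables_initializer())
        out = sess.run(logits, feed_dict={x: batch})

Data parallelism, by contrast, would replicate this whole graph on each worker, with every replica consuming its own batch while the shared variables are served through parameter-server tasks, consistent with the communication model listed for Tensorflow in Table 1.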