Comparative Study of Deep Learning Software Frameworks
Soheil Bahrampour, Naveen Ramakrishnan, Lukas Schott, Mohak Shah
Research and Technology Center, Robert Bosch LLC
{Soheil.Bahrampour, Naveen.Ramakrishnan, fixed-term.Lukas.Schott, Mohak.Shah}@us.bosch.com

arXiv:1511.06435v3 [cs.LG] 30 Mar 2016

ABSTRACT

Deep learning methods have resulted in significant performance improvements in several application domains, and as such, several software frameworks have been developed to facilitate their implementation. This paper presents a comparative study of five deep learning frameworks, namely Caffe, Neon, TensorFlow, Theano, and Torch, on three aspects: extensibility, hardware utilization, and speed. The study is performed on several types of deep learning architectures, and we evaluate the performance of the above frameworks when employed on a single machine for both (multi-threaded) CPU and GPU (Nvidia Titan X) settings. The speed performance metrics used here include the gradient computation time, which is important during the training phase of deep networks, and the forward time, which is important from the deployment perspective of trained networks. For convolutional networks, we also report how each of these frameworks supports various convolutional algorithms and their corresponding performance. From our experiments, we observe that Theano and Torch are the most easily extensible frameworks. We observe that Torch is best suited for any deep architecture on CPU, followed by Theano. It also achieves the best performance on the GPU for large convolutional and fully connected networks, followed closely by Neon. Theano achieves the best performance on GPU for training and deployment of LSTM networks. Caffe is the easiest for evaluating the performance of standard deep architectures. Finally, TensorFlow is a very flexible framework, similar to Theano, but its performance is currently not competitive compared to the other studied frameworks.

1. INTRODUCTION

Deep learning methods have recently influenced several application domains, namely computer vision [14, 20], speech recognition [10, 12], and natural language processing [8], where they have enjoyed significant performance improvements compared to state-of-the-art methods in the respective domains. For the latest list of domains and challenges on benchmark datasets where deep learning performed better than the existing state-of-the-art, see http://deeplearning4j.org/accuracy.html. Most of the successful deep learning architectures are composed of a combination of different types of layers, such as fully connected, convolutional, and recurrent layers, and are usually trained with a variant of the stochastic gradient descent algorithm along with various regularization techniques such as dropout and weight decay [2].

As the popularity of deep learning methods has increased over the last few years, several deep learning software frameworks have appeared to enable efficient development and implementation of these methods. The list of available frameworks includes, but is not limited to, Caffe, DeepLearning4J, deepmat, Eblearn, Neon, PyLearn, TensorFlow, Theano, Torch, etc. Different frameworks try to optimize different aspects of training or deployment of a deep learning algorithm. For instance, Caffe emphasises ease of use, where standard layers can be easily configured without hard-coding, while Theano provides automatic differentiation capabilities that make it flexible to modify an architecture for research and development. Several of these frameworks have received wide attention from the research community and are well developed, allowing efficient training of deep networks with billions of parameters, thanks to their strong GPU backends. Developers have constantly improved these frameworks by adding more features (e.g. support for different types of convolution algorithms) and speed improvements to attract more users and foster research [4, 7, 21, 1, 13].
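To make the automatic-differentiation point above concrete, the following minimal Theano sketch (illustrative only and not taken from the paper's benchmark code; the tiny logistic-regression model and all variable names are our own) shows how Theano derives gradient expressions from a symbolically defined loss:

```python
import numpy as np
import theano
import theano.tensor as T

# Symbolic inputs: a mini-batch of features and binary labels.
x = T.matrix('x')
y = T.vector('y')

# Shared (trainable) parameters of a tiny logistic-regression model.
w = theano.shared(np.zeros(100, dtype=theano.config.floatX), name='w')
b = theano.shared(np.asarray(0., dtype=theano.config.floatX), name='b')

# Symbolic forward pass and loss.
p = T.nnet.sigmoid(T.dot(x, w) + b)
loss = T.nnet.binary_crossentropy(p, y).mean()

# Theano derives the gradient expressions automatically.
grad_w, grad_b = T.grad(loss, [w, b])

# Compile callable functions for the forward pass and the gradients.
forward_fn = theano.function([x, y], loss)
grad_fn = theano.function([x, y], [grad_w, grad_b])
```

Changing the model only requires editing the symbolic expressions; the gradient code is regenerated automatically, which is the flexibility referred to above.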
Recently, the efficacy of several deep learning frameworks has been evaluated in [5]. However, that comparison only focused on the speed performance of the convolutional frameworks. Hence, this paper expands on the previous benchmarks and evaluates five deep learning frameworks, namely: Caffe, Neon, TensorFlow, Theano, and Torch. Among the available software frameworks, Caffe, Theano, and Torch are indeed the top three well-developed and widely used frameworks in the deep learning community. The reason for including Neon in this study is its recently reported state-of-the-art performance for training several deep learning architectures [5]. Finally, TensorFlow has received much attention since its recent first release, and we have included it in our benchmarking for completeness, even though it does not yet (as of the time of submission of this paper) officially support cuDNNv3, which is the latest official version supported by all the other studied frameworks. We evaluate these frameworks from the perspective of practitioners, on the following aspects:

• Extensibility: Their capability to incorporate different types of deep learning architectures (convolutional, fully-connected, and recurrent networks), different training procedures (unsupervised layer-wise pre-training and supervised learning), and different convolutional algorithms (e.g. the FFT-based algorithm).

• Hardware utilization: Their efficacy in utilizing hardware resources in either a (multi-threaded) CPU or GPU setting.

• Speed: Their speed performance from both training and deployment perspectives.

The study will provide users and enterprises with a broad picture of the strengths and (current) limitations of the studied deep learning frameworks, to enable them to assess suitability in the context of their requirements. Moreover, the discussions highlight the current limitations of the respective frameworks, which can be addressed in their future developments. We plan to share the code for all the frameworks in the near future through a publicly available webpage.

The rest of the paper is organized as follows: Section 2 gives a brief overview of the software frameworks we focus on in this paper; Section 3 describes the benchmarking setup, which is followed by results and conclusions in Section 4 and Section 5, respectively.

2. OVERVIEW OF THE DEEP LEARNING FRAMEWORKS

With deep learning methods gaining popularity in many application domains over the last few years, there has been quite a lot of interest from many academic (e.g. Univ. of California Berkeley, NYU) and industry groups (e.g. Google, Facebook) in developing software frameworks (e.g. Theano, Caffe) that help easily create and test various deep architectures. At the time this paper was written, some of the widely used software frameworks for deep learning were: Caffe, Theano, Torch, Neon, TensorFlow, Chainer, DeepLearning4J, deepmat, Eblearn, MXNet, etc. (for a more complete list of deep learning software, see http://deeplearning.net/software_links/).

Many of these frameworks are already mature as of today and are very fast in training deep networks with billions of parameters, thanks to their strong CUDA backends. Today, almost every group training deep networks uses Graphical Processing Units (GPUs) to accelerate the training process, and this has led to the joint development of software libraries (e.g. cuDNN) between academic (e.g. Berkeley, NYU) and industry players (e.g. Nvidia). Table 1 shows the number of users in Google groups and the number of contributors for each of the frameworks in their corresponding GitHub repositories. It is clear that the top three widely developed and supported deep learning frameworks are Caffe, Theano, and Torch, and they are thus selected in this paper for benchmarking purposes. We also evaluated the Neon framework from Nervana as it has recently shown state-of-the-art performance for training convolutional networks [5]. We have also included TensorFlow from Google in our experiments.

3. BENCHMARKING SETUP

3.1 Evaluation Metrics

We use the two following evaluation metrics to obtain a holistic understanding of the speed of the five deep learning frameworks under various system scenarios and application domains:

• Forward Time: We measure the time it takes for an input batch of a pre-selected batch size, for a given dataset and network, to flow through the entire network and produce the corresponding output. This is important as it indicates the latency of a deep network when deployed in the real world.

• Gradient Computation Time: We also measure the time it takes to get the gradients for each measurable parameter in the deep network for a given input batch. This is an important indicator of the training time. Note that, in most cases (e.g. Torch), this gradient computation time is the sum of the times spent in calling the corresponding forward and backward functions, as these two functions must be called consecutively to compute the gradients. For Theano, however, this gradient computation time is measured by calling a Theano function that is compiled to generate the gradients given the inputs to the network, and which implicitly performs the forward and backward steps through computational graphs. It should be noted that the gradient computation time we report does not include the time taken to update the network parameters, such as computation of the learning rate, weight decay, momentum term, etc.
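The following Python sketch (not part of the paper's released benchmark code; forward_fn, backward_fn, the batch data, and the optional synchronize hook are placeholder assumptions) illustrates how these two times can be measured for a framework that, like Torch, exposes separate forward and backward calls:

```python
import time

def measure_forward_time(forward_fn, batch, num_runs=50, warmup=5, synchronize=None):
    """Average wall-clock time for a full forward pass over one input batch."""
    for _ in range(warmup):           # warm-up runs (memory allocation, autotuning)
        forward_fn(batch)
    if synchronize:
        synchronize()                 # e.g. wait for pending GPU kernels before timing
    start = time.time()
    for _ in range(num_runs):
        forward_fn(batch)
    if synchronize:
        synchronize()
    return (time.time() - start) / num_runs

def measure_gradient_time(forward_fn, backward_fn, batch, grad_output,
                          num_runs=50, warmup=5, synchronize=None):
    """Average time to obtain gradients: forward followed by backward,
    mirroring frameworks (e.g. Torch) where both calls are required."""
    for _ in range(warmup):
        forward_fn(batch)
        backward_fn(batch, grad_output)
    if synchronize:
        synchronize()
    start = time.time()
    for _ in range(num_runs):
        forward_fn(batch)                  # gradient time includes the forward pass...
        backward_fn(batch, grad_output)    # ...plus the backward pass; no parameter update
    if synchronize:
        synchronize()
    return (time.time() - start) / num_runs
```

For Theano, the forward/backward pair in measure_gradient_time would be replaced by a single call to the compiled gradient function; in GPU runs, a synchronization hook is needed so that asynchronous kernel launches are not mistaken for completed work.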
For Theano, one initially needs to compile the forward and gradient computation functions before calling them during execution. To provide a complete picture, these compilation times are also reported (see Table 8). We also report the GPU memory usage for large networks.
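As a hedged illustration of this compilation step (the toy softmax model below is our own and only stands in for the benchmarked networks), a Theano user compiles the forward and gradient functions once, outside the timed region, and then times only the subsequent calls:

```python
import time
import numpy as np
import theano
import theano.tensor as T

# A toy single-layer softmax classifier standing in for the benchmarked architectures.
x = T.matrix('x')
y = T.ivector('y')
W = theano.shared(np.random.randn(784, 10).astype(theano.config.floatX), name='W')
b = theano.shared(np.zeros(10, dtype=theano.config.floatX), name='b')

probs = T.nnet.softmax(T.dot(x, W) + b)
loss = T.nnet.categorical_crossentropy(probs, y).mean()
grads = T.grad(loss, [W, b])

# Compilation happens here, outside the timed region; these are the
# compilation times reported separately in the paper (see Table 8).
t0 = time.time()
forward_fn = theano.function([x, y], loss)
t1 = time.time()
grad_fn = theano.function([x, y], grads)   # implicitly performs forward and backward
t2 = time.time()
print('forward compile: %.2fs, gradient compile: %.2fs' % (t1 - t0, t2 - t1))

# Only the calls below would enter the forward-time and gradient-time measurements.
batch_x = np.random.randn(64, 784).astype(theano.config.floatX)
batch_y = np.random.randint(0, 10, size=64).astype('int32')
forward_fn(batch_x, batch_y)
grad_fn(batch_x, batch_y)
```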
3.2 System setup

All the experiments are performed on a single machine running Ubuntu 14.04 with an Intel Xeon CPU E5-1650 v2 @ 3.50GHz × 12; an Nvidia GeForce GTX Titan X/PCIe/SSE2; 32 GiB of DDR3 memory; and an SSD hard drive. We used OpenCV 3.0 and OpenBLAS 0.2.14 with commit ID 3684706. For Caffe, commit ID 8c8e832 is used. For Neon, version 1.0.0.rc1 (2015-09-08) with commit ID a6766ff is used. For TensorFlow, we installed version 0.6.0 using pip installation. The Theano version 0.7.0.dev and the Torch7 used here have commit IDs 662ea98 and 8c8e832, respectively.