Benchmarking Frameworks: Design Considerations, Metrics and Beyond

Ling Liu, Yanzhao Wu, Wenqi Wei, Wenqi Cao, Semih Sahin, Qi Zhang
Georgia Institute of Technology

Abstract—With an increasing number of open-source deep learning (DL) software tools made available, benchmarking DL software frameworks and systems is in high demand. This paper presents design considerations, metrics and challenges towards developing an effective benchmark for DL software frameworks, and illustrates our observations through a comparative study of three popular DL frameworks: TensorFlow, Caffe, and Torch. First, we show that these deep learning frameworks are optimized with their default configuration settings. However, the default configuration optimized on one specific dataset may not work effectively for other datasets with respect to runtime performance and learning accuracy. Second, the default configuration optimized on a dataset by one DL framework does not work well for another DL framework on the same dataset. Third, we show through experiments that different DL frameworks exhibit different levels of robustness against adversarial examples. Through this study, we conjecture that effectively benchmarking deep learning software frameworks and systems is significantly more challenging than traditional performance-driven benchmarks.

I. INTRODUCTION

Deep learning (DL) applications and systems have blossomed in recent years as more varieties of data enter cyberspace and an increasing number of open-source deep learning (DL) software frameworks are made available. Benchmarking DL frameworks and systems is in high demand [1], [2], [3], [4], [5], [6], [7], [8], [9]. However, benchmarking deep learning software frameworks and systems is notably more difficult than traditional performance-driven benchmarks. This is simply because big-data-powered deep learning systems are inherently both computation-intensive and data-intensive, demanding intelligent integration of massive data parallelism and massive computation parallelism at all levels of a deep learning framework.
For instance, a deep learning framework typically has a large set of model parameters and system parameters that need to be configured and tuned. Many of these parameters interact with one another in a complex manner, from both the model learning perspective and the system runtime optimization perspective, making the tuning of such a large space of parameters substantially more tricky than what has been experimented with and understood from systems administration of conventional computer systems, software tools and applications.

This paper presents design considerations, metrics and insights towards benchmarking DL software frameworks through a comparative study of three popular deep learning frameworks: TensorFlow [10], Caffe [11] and Torch [12]. First, we show that although deep learning software frameworks are optimized with their default configuration settings, for a given DL framework, its default configuration optimized to train on one dataset may not work effectively for other datasets. Second, the default configuration optimized on a dataset by one DL framework may not work well when used to train on the same dataset by another DL framework. Hence, it may not be meaningful to compare different DL frameworks under the same configuration. Third, different DL frameworks exhibit different levels of robustness in response to adversarial behaviors, and different sensitivity boundaries over potential biases or noise levels inherent in different training datasets. Through this experimental study, we show that system runtime performance, learning accuracy, and model robustness against adversarial behaviors and the consequences of overfitting are three sets of metrics that are equally important for effectively configuring, measuring and comparing different deep learning software frameworks.

II. DEEP LEARNING REFERENCE MODEL

Deep learning software frameworks are scalable software implementations of deep neural networks (DNN) on modern computers with many-core CPUs, with or without GPUs.

A. DNN Model

A deep neural network refers to an N-layer neural network with N ≥ 2. Learning over the input data to a DNN is typically performed through a sequence of transformations of the input data layer by layer, with each layer representing a network of neurons extracted from the input data. Although different layers extract different representations of features of the input data, each layer learns to extract more complex, deeper features from its previous layer. A typical layer of a neural network consists of weights w, biases b and an activation function act(), in addition to the set of loosely connected neurons (parameters). It takes as input the neuron map produced by its previous layer, and produces the output as y = act(w ∗ x + b), where x is the input of the layer. By connecting these layers together, a deep neural network is a function y = F(x) in which x ∈ R^n is an n-dimensional input and y ∈ R^m is an m-dimensional output vector.

The neural network model F consists of many model parameters θ_F. The values of the parameters θ_F are tuned during the training phase, where a large number of input-output pairs, as the training dataset, is fed into the neural network. The training process is conducted in multiple rounds/iterations. During each round, the neural network uses the parameters from the previous iteration with the training data input to predict the output forwardly on the N-layer neural network. Then, the DNN computes a loss function between the predicted output and the real (pre-labeled) output. Using the loss function, the DNN updates the parameters using backpropagation with an optimizer, e.g., stochastic gradient descent (SGD) [13] or Adam [14], which minimizes the pre-defined loss function.
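To make the per-layer transformation y = act(w ∗ x + b) and the iterative update concrete, the following minimal NumPy sketch (our own illustration, not code from any of the three frameworks; the layer sizes, data and learning rate are arbitrary assumptions) performs one forward pass and one SGD step on the last layer:

```python
import numpy as np

def act(z):
    return np.maximum(z, 0.0)  # ReLU activation

# A toy 2-layer network: each layer computes y = act(W x + b).
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 8)) * 0.1, np.zeros(16)   # layer 1 parameters
W2, b2 = rng.normal(size=(4, 16)) * 0.1, np.zeros(4)    # layer 2 parameters

def forward(x):
    h = act(W1 @ x + b1)         # first transformation of the input
    return act(W2 @ h + b2), h   # output and the intermediate neuron map

# One SGD round: forward pass, loss, gradient, parameter update.
x, y_true = rng.normal(size=8), np.ones(4)   # one (input, label) pair
lr = 0.05                                    # learning rate (assumed)
y_pred, h = forward(x)
loss = 0.5 * np.sum((y_pred - y_true) ** 2)  # pre-defined loss function

# Backpropagate through the last layer only, for brevity; a real
# framework differentiates through every layer automatically.
delta = (y_pred - y_true) * (y_pred > 0)     # dL/dz2 through ReLU
W2 -= lr * np.outer(delta, h)                # SGD update: theta -= lr * grad
b2 -= lr * delta
```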
The general principle of defining a loss function is to measure the difference between the computed output and the ground truth (real output).

In addition to the network parameters tuned in the training process, hyperparameters also need to be tuned. The learning rate and batch size are among the most important ones. Both are used to control the extent of parameter updates in each iteration. In general, a larger learning rate and/or a larger batch size will bring about faster convergence. However, if the learning rate is too large, the training process may not be fine-grained enough and may suffer from fluctuation. If the batch size is too large to fit each mini-batch in the memory of the GPU or CPU core, the training process may take much longer to complete. A smaller learning rate leads to slower convergence but makes the training process more fine-grained. A smaller batch size leads to a larger number of mini-batches (bags), and thus a larger number of iterations per epoch, and may result in lower accuracy, but it can avoid the runtime performance penalty induced by running out of memory (OOM). Another key hyperparameter is the number of kernels (weight filters), which determines the number of feature maps produced at each layer of the neural network. Usually, more feature maps enable a deep learning model to give the input data a more refined representation, but at a higher runtime cost. Other hyperparameters, such as the network architecture, the number of layers, paddings, strides, the kernel sizes of layers, the type of activation function, the optimizer, and the regularization method, also influence the performance of the deep learning model, each from their own perspective. Note that the regularization method can reduce overfitting. Overfitting occurs when the deep learning model is able to achieve high accuracy on the training data but such high accuracy cannot be generalized to the testing data.

Once the DNN is trained, the parameters θ_F are fixed, producing a deep learning model that is used in the testing phase to make predictions and classifications. Usually, the trained DNN model may be re-trained before its actual deployment, to ensure that it passes the validation test and that the system using the trained DNN model can provide sufficiently accurate results. The testing phase refers to both the validation and the use of a trained DNN model in a real application system.

B. Reference DL Frameworks

Three mainstream DL frameworks, TensorFlow, Caffe and Torch, are selected for this study.

TensorFlow [10] is an open source software library implemented based on a data flow graph: the nodes are used to represent mathematical operations, the edges represent the data flow, and tensors (data arrays) flow between nodes via the edges (a minimal sketch of this graph style appears at the end of this subsection). The tensor-based data flow makes TensorFlow an ideal API and implementation tool for Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). However, TensorFlow does not support dynamic input sizes, which are crucial for applications like NLP.

Caffe [11] supports many different types of deep learning architectures (CNN, RCNN, LSTM and fully connected neural networks) geared towards image classification and segmentation. Caffe can be used simply from the command line: it builds neural network models by transforming the data to LMDB format, defining the network architecture in a .prototxt file, defining hyperparameters such as the learning rate and training epochs in the "solver" file, and then training. Caffe works layer-wise for deep learning applications.

Torch [12] is an open source machine learning library that provides a wide range of algorithms for DL. Its computing framework is based on the Lua programming language [15], a scripting language. Torch has the most comprehensive set of convolutions, and it supports temporal convolution with variable input length.
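Returning to the data-flow-graph abstraction described above, here is a minimal sketch against the TensorFlow 1.x API (the version benchmarked in this study); the shapes and the zero-initialized batch are our own assumptions, not framework defaults:

```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x, as benchmarked in this study

# Build the graph: nodes are operations; tensors flow along the edges.
x = tf.placeholder(tf.float32, shape=[None, 784], name="input")  # flattened 28x28
W = tf.Variable(tf.zeros([784, 10]), name="weights")
b = tf.Variable(tf.zeros([10]), name="biases")
logits = tf.matmul(x, W) + b  # an operation node consuming three tensors

# The graph is only executed inside a session.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    batch = np.zeros((32, 784), dtype=np.float32)  # dummy input (assumed)
    print(sess.run(logits, feed_dict={x: batch}).shape)  # (32, 10)
```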

C. Evaluation Metrics

Metrics are a critical component of a benchmark. The following metrics are used in this measurement study.

Training Time. This is a key performance indicator for DL frameworks. Training time is the time spent on building a DNN model over the training dataset. For optimization purposes, models need to be trained several times with different parameters in order to find the parameters that achieve the optimal design. Although pre-trained models are made available on many platforms, such as the Caffe Model Zoo [16], training is still essential for new model development and for retraining over new datasets or incrementally enhanced datasets.

Testing Time. This is another important performance indicator for DL frameworks. Testing time is the time spent on testing the trained model using a validation dataset. It indicates the potential latency of using the trained model for prediction- or classification-based inference when the model is deployed in real-world applications. Thus, testing time affects the user experience to a large extent and affects the performance of actual applications. Both training and testing time can be influenced by the configurations of system-specific and model-specific parameters.

Learning/Prediction Accuracy. The learning accuracy metric measures the utility of the training framework in the training phase, and the prediction accuracy measures the utility of the trained DNN model in the testing phase. Accuracy measurement is highly sensitive both to data-specific parameters, such as the type of dataset and its characteristics (e.g., the number of classes and the number of training samples per class), and to the type of deep learning architecture and machine learning library used, such as the collection of algorithms/optimizations included and the configurations of many model-specific parameters.

Adversarial Robustness. This metric is designed to measure the resilience of the DL framework and its trained DNN model against adversarial behaviors, including targeted attacks and random (untargeted) attacks, as well as the effect of overfitting against potential biases and noise levels in the training dataset during the testing phase. It can also be used as a measure to evaluate the effectiveness of the regularization techniques deployed in different DL frameworks. For example, TensorFlow uses dropout, while Caffe has weight decay. In this paper, we use the success rate of crafting adversarial examples as a measure of adversarial robustness.

Two types of adversarial attacks are considered in this study: the untargeted Fast Gradient Sign Method (FGSM) [17] and targeted Jacobian-based attacks [18]. An adversarial example x′ consists of an input x and its adversarial perturbation δ_x. With some perturbation, the adversarial example is classified as a new class different from its original class, i.e., x′ = x + δ_x, F(x) = y and F(x′) = y′ both hold while y′ ≠ y.

FGSM: a simple and fast way to generate untargeted adversarial examples:

    x′ = x + ε · sign(∇_x L(x, y)),  (1)

where L(x, y) is the loss function of the original input and the true label, ε controls the perturbation magnitude, and sign() is the mathematical function with sign(x) = 1 if x > 0, sign(x) = 0 if x = 0, and sign(x) = −1 if x < 0 (a short sketch of this computation appears at the end of this subsection).

Jacobian-based attacks: a targeted attack launched by adversaries to generate adversarial examples such that they are classified as a targeted class t instead of their truly legitimate source class. Concretely, for each feature i, the perturbations δ_x are formed using the saliency map S(x, t)[i] of the network rather than the loss function, e.g.,

    S(x, t)[i] = 0, if ∂F_t(x)/∂x_i < 0 or Σ_{j≠t} ∂F_j(x)/∂x_i > 0;
    S(x, t)[i] = (∂F_t(x)/∂x_i) · |Σ_{j≠t} ∂F_j(x)/∂x_i|, otherwise,  (2)

where the matrix J_F = [∂F_j/∂x_i]_{i,j} is the Jacobian matrix of the neural network function. The goal of Equation (2) is to reject input features with a negative target derivative or an overall positive derivative on classes other than the targeted class. In fact, saliency maps exploit input features that contribute to increasing the probability of the target class, or decreasing that of the source class or other classes significantly, or both.
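As a concrete reading of Equation (1), the following NumPy sketch crafts an untargeted FGSM example. It is an illustration only: the function grad_loss_x, returning ∇_x L(x, y), is an assumed hook into a framework's automatic differentiation, and the [0, 1] pixel clipping is our own assumption.

```python
import numpy as np

def fgsm(x, y, grad_loss_x, epsilon=0.001):
    """Untargeted FGSM per Equation (1): x' = x + epsilon * sign(grad_x L(x, y)).

    grad_loss_x(x, y) is assumed to return the gradient of the loss with
    respect to the input x (e.g., via a framework's autodiff). The default
    epsilon matches the value used in the experiments of Section III.E.
    """
    x_adv = x + epsilon * np.sign(grad_loss_x(x, y))
    return np.clip(x_adv, 0.0, 1.0)  # keep pixels in a valid range (assumption)
```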
III. EXPERIMENTS

We conduct three sets of experiments to answer the following questions: (i) How effective is the default setting (configuration) of one DL framework in comparison with that of another DL framework for the same datasets? (ii) How efficient will the default setting used to train one dataset be when it is deployed to train another dataset using the same DL framework? (iii) Would the default setting used by one DL framework be effective when used by another DL framework to train the same dataset (i.e., dataset-dependent default configuration)? (iv) How well can the default setting of a DL framework, which is optimized for training on one dataset, perform when it is used by another DL framework to train on the same dataset (i.e., framework-dependent default configuration)?

TABLE I: Deep Learning Software Frameworks and Basic Properties
Frameworks | Version | Hash Tag | Library | Interface | LoC | License | Website
TensorFlow | 1.3.0 | ab0fcac | Eigen & CUDA | Java, Python, Go, R | 1281085 | Apache | https://www.tensorflow.org/
Caffe | 1.0.0 | c430690 | OpenBLAS & CUDA | Python, Matlab | 69608 | BSD | http://caffe.berkeleyvision.org/
Torch | torch7 | 0219027 | optim & CUDA | Lua | 29750 | BSD | http://torch.ch/

All experiments are conducted on an Intel Xeon(R) E5-1620 server with CPU: 3.6GHz, Memory: DDR3 1600MHz 8GB × 4 (32GB), Hard drive: SSD 256GB, GPU: Nvidia GeForce GTX 1080 Ti (11GB), installed with Ubuntu 16.04 LTS, CUDA 8.0 and cuDNN 6.0.

DL Frameworks. TensorFlow [10], Caffe [11], and Torch [12] are selected for this measurement study. Table I shows some of their statistics.

Datasets. Datasets play a definitive role in the performance of deep learning in most cases [19], [20], [21], [22]. For the objectives of this study, we choose two classic datasets: MNIST [23] and CIFAR-10 [24], as all 3 DL frameworks have tuned their configurations for these two datasets. MNIST consists of 70,000 images of ten handwritten digits, each image 28 × 28 in size. CIFAR-10 consists of 60,000 color images of 10 classes, each 32 × 32 in size.

A. Default Settings in DL Frameworks

We compare the primary hyperparameter settings used for MNIST and CIFAR-10 in Table II and Table III respectively.

TABLE II: Default training parameters on MNIST
Framework | TensorFlow | Caffe | Torch
Algorithm | Adam | SGD | SGD
Base Learning Rate | 0.0001 | 0.01 | 0.05
Batch Size | 50 | 64 | 10
#Max Iterations | 20,000 | 10,000 | 120,000
#Epochs | 16.67 | 10.67 | 20

TABLE III: Default training parameters on CIFAR-10
Framework | TensorFlow | Caffe | Torch
Algorithm | SGD | SGD | SGD
Base Learning Rate | 0.1 | 0.001 → 0.0001 | 0.001
Batch Size | 128 | 100 | 1
#Max Iterations | 1,000,000 | 5,000 | 100,000
#Epochs | 2560 | 8+2 | 20

All three DL frameworks select their own preferred default training parameters for both datasets. For MNIST, TensorFlow prefers Adam [14] as its optimizer while Caffe and Torch use SGD [13]. TensorFlow uses the smallest base learning rate, Caffe uses the largest batch size along with the smallest number of training epochs, and Torch uses the largest base learning rate and a larger number of training epochs. For MNIST, TensorFlow sets its maximum steps to 20,000 and Caffe sets its max iterations to 10,000; thus, by #Epochs = max steps × batch size / #Training Samples, we obtain 20,000 × 50 / 60,000 = 16.67 epochs for TensorFlow and 10,000 × 64 / 60,000 = 10.67 epochs for Caffe. For Torch, the max #Epochs is manually set to 20 and the #max iterations is set to (20 × 60,000) / 10 = 120,000. For CIFAR-10, SGD is used by all three as their optimizer. Caffe adopts a two-phase training: the learning rate for its first phase is 0.001 and 0.0001 for the second phase, and Caffe uses 8 epochs for the first phase of training and 2 epochs for the second phase. Using the same formula, TensorFlow has its maximum steps set to (2,560 × 50,000) / 128 = 1,000,000, Caffe sets its max iterations to (10 × 50,000) / 100 = 5,000, and Torch sets it to 20 × 5,000 / 1 = 100,000 for CIFAR-10, since the #Training Samples of CIFAR-10 is 50,000 whereas it is 60,000 for MNIST.
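The epoch arithmetic above can be captured in a one-line helper; the sketch below (illustrative only) reproduces the MNIST numbers in Table II:

```python
def epochs(max_iterations, batch_size, num_training_samples):
    """#Epochs = max iterations x batch size / #Training Samples."""
    return max_iterations * batch_size / num_training_samples

# Reproducing Table II (MNIST has 60,000 training samples):
print(epochs(20000, 50, 60000))   # TensorFlow: ~16.67 epochs
print(epochs(10000, 64, 60000))   # Caffe: ~10.67 epochs
```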
Next, we compare the primary default parameters for the neural network structures, which are configured by the framework creators to work optimally with MNIST and CIFAR-10, shown in Table IV and Table V respectively. For MNIST, the three frameworks adopt a similar network structure with 2 convolution layers and 2 fully connected layers, derived from LeNet [23]. All set their kernel size as 5 × 5; the notation 1 → 32 indicates that the number of input feature maps is 1 and the number of output feature maps is 32. However, other parameters are selected differently for MNIST, such as the activation operation and the number of kernels/feature maps extracted. In comparison, for CIFAR-10, the network structures vary more significantly from the ones configured for MNIST. Caffe and TensorFlow use a 5-layer network structure and Torch uses a 4-layer one instead; Caffe employs 3 convolution layers while TensorFlow and Torch both use 2 convolution layers.

We make two observations from Table IV and Table V. First, a DL framework may vary its default parameters for different datasets, as shown. We refer to this type of default setting variation as dataset-dependent default settings. Second, different frameworks may NOT use the same default configuration to train the same dataset, because each framework may optimize its performance using a different setting of model-tuning and system-tuning parameters, even when trained over the same dataset. We call this type of default setting variation framework-dependent default settings, which refers to the default settings used by different DL frameworks to train on a specific dataset.

TABLE IV: Primary Default Neural Network Parameters on MNIST
Framework | TensorFlow | Caffe | Torch
1st Layer (conv) | 5×5, 1→32; ReLU, MaxPooling(2×2) | 5×5, 1→20; MaxPooling(2×2) | 5×5, 1→32; Tanh, MaxPooling(3×3)
2nd Layer (conv) | 5×5, 32→64; ReLU, MaxPooling(2×2) | 5×5, 20→50; MaxPooling(2×2) | 5×5, 32→64; Tanh, MaxPooling(3×3)
3rd Layer (fc) | 7×7×64→1024; ReLU | 4×4×50→500; ReLU | 3×3×64→200; Tanh
4th Layer (fc) | 1024→10 | 500→10 | 200→10

TABLE V: Primary Default Neural Network Parameters on CIFAR-10
Framework | TensorFlow | Caffe | Torch
1st Layer (conv) | 5×5, 3→64; ReLU, MaxPooling(3×3), Normalization | 5×5, 3→32; MaxPooling(3×3), ReLU | 5×5, 3→16; Tanh, MaxPooling(2×2)
2nd Layer | 5×5, 64→64; ReLU, Normalization, MaxPooling(3×3) (conv) | 5×5, 32→32; ReLU, AveragePooling(3×3) (conv) | 5×5, 16→256; Tanh, MaxPooling(2×2) (conv)
3rd Layer | 7×7×64→384; fc, ReLU | 5×5, 32→64; conv, ReLU, AveragePooling(3×3) | 5×5×256→128; fc, Tanh
4th Layer (fc) | 384→192; ReLU | 4×4×64→64 | 128→10
5th Layer (fc) | 192→10 | 64→10 | —
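For concreteness, the TensorFlow column of Table IV corresponds to a network along the lines of the following sketch, written against the tf.layers API of TensorFlow 1.x; this is our reading of the table, not the framework's shipped example code:

```python
import tensorflow as tf  # TensorFlow 1.x

def tf_mnist_net(images):  # images: [batch, 28, 28, 1]
    # 1st layer (conv): 5x5, 1 -> 32, ReLU, MaxPooling(2x2)
    c1 = tf.layers.conv2d(images, 32, 5, padding="same", activation=tf.nn.relu)
    p1 = tf.layers.max_pooling2d(c1, pool_size=2, strides=2)
    # 2nd layer (conv): 5x5, 32 -> 64, ReLU, MaxPooling(2x2)
    c2 = tf.layers.conv2d(p1, 64, 5, padding="same", activation=tf.nn.relu)
    p2 = tf.layers.max_pooling2d(c2, pool_size=2, strides=2)
    # 3rd layer (fc): 7x7x64 -> 1024, ReLU
    flat = tf.reshape(p2, [-1, 7 * 7 * 64])
    fc1 = tf.layers.dense(flat, 1024, activation=tf.nn.relu)
    # 4th layer (fc): 1024 -> 10 logits
    return tf.layers.dense(fc1, 10)
```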

Interesting questions arise:
• Can the default setting optimized to train one dataset be effective to train a different dataset using the same DL software framework? (same framework on different datasets)
• Can the default setting used to train a dataset by one DL framework be used effectively to configure another framework to train the same dataset? (different frameworks on the same dataset)
• Will the default setting, optimized by one framework, work effectively for other DL frameworks? (different frameworks on different datasets)

B. Impact of Default Settings

We first study the impact of default settings by one framework on different datasets to answer the first question (same framework on different datasets). This set of experiments compares the performance of the three DL frameworks on MNIST using their own default settings optimized for MNIST (Figure 1), with a similar comparison on CIFAR-10 (Figure 2). We highlight three observations. (1) For both datasets, Torch spent the longest time in testing, as well as in training on MNIST, while TensorFlow has the highest accuracy. Even though Torch's accuracy is higher than Caffe's and lower than TensorFlow's for MNIST, its accuracy for CIFAR-10 is the worst of the three. One reason that Caffe has shorter training and testing time might be that Caffe was trained for the smallest number of epochs and fewer feature maps were extracted for inference, whereas TensorFlow has the largest number of feature maps, which helps it achieve the highest accuracy with low testing time. (2) For MNIST, TensorFlow and Caffe show similar training and testing times, but Caffe has the lowest accuracy. For CIFAR-10, Caffe spent the least time on training in both the CPU and GPU settings and the least testing time in the GPU setting. TensorFlow has significantly longer training time and its GPU testing time doubles that of Caffe, but TensorFlow has the highest accuracy, with Caffe ranking second, followed by Torch. Note that Torch GPU has slightly lower accuracy (65.61%) than Torch CPU (66.16%). One reason could be that Torch uses SpatialConvolutionMap [25] on CPU for CIFAR-10, but it lacks the corresponding implementation on GPU; thus, SpatialConvolutionMM [26] is used as the default. (3) All three frameworks shorten their training and testing time with GPU for both datasets. Concretely, with GPU acceleration for MNIST, TensorFlow is faster by 16 times and 10 times, Caffe is faster by 5 times and 6 times, and Torch is faster by 28 times and 32 times, in training time and testing time respectively. However, TensorFlow CPU and Torch CPU obtain slightly higher accuracy on MNIST compared to TensorFlow GPU and Torch GPU respectively, though the accuracies of all settings on MNIST are above 99%, with the highest by TF CPU (99.28%) and the lowest by Caffe CPU (99.03%). For CIFAR-10, Torch CPU has slightly higher accuracy than Torch GPU. It is worth noting that TensorFlow CPU with 256 epochs can achieve 86.6% accuracy with a training time of 21673.81 sec (about 6.02 hours), whereas TensorFlow CPU with 2560 epochs achieves 86.90% accuracy at the cost of 60.88 hours of training time. This result demonstrates that GPU acceleration shortens the training/testing time but may not ensure high accuracy, due to multiple factors such as the mini-batch size, the bagging algorithm, and the per-unit memory capacity of the GPU. For more detail, see [27].

In summary, the deep learning models demonstrate much better accuracy on the sparse and gray-scale MNIST dataset than on the color-rich and content-rich CIFAR-10 dataset. The sparseness and gray scale of MNIST give the data low entropy. We attribute the better

accuracy performance to the lower entropy of the data, since it is easier for the deep learning model to learn. The low entropy also makes DNN model training and testing on MNIST much faster than on CIFAR-10.

C. Impact of Dataset-dependent Default Settings

The next set of experiments studies the performance variations of dataset-dependent default settings and their impact on all three DL frameworks. We first compare the performance of the three DL frameworks on MNIST. To be fair, we choose the default settings preferred by each of the three DL frameworks and compare the results. Figure 3 shows the performance. In Figure 3a, we compare the training time of the three frameworks on MNIST. There are two colored bars for each framework: the blue bar is the training time using its own MNIST default setting and the red bar is the training time using its own CIFAR-10 default setting. Similar comparisons for testing time and accuracy are in Figure 3b and Figure 3c. We make two observations. First, all frameworks perform worst, with longer training and testing time, when using their own CIFAR-10 default settings on MNIST. Recall Table IV and Table V: all three frameworks choose deeper and more complex neural network structures for CIFAR-10, hence the longer CIFAR-10 training and testing time.

Fig. 1: Experimental Results on MNIST, using MNIST Default Settings by 3 frameworks: (a) Training Time, (b) Testing Time, (c) Accuracy.

Fig. 2: Experimental Results on CIFAR-10, using CIFAR-10 Default Settings by 3 frameworks: (a) Training Time, (b) Testing Time, (c) Accuracy.

Second, surprisingly, TensorFlow and Torch using their own CIFAR-10 default settings to train on MNIST achieve accuracy almost identical to the best accuracy they produced when using their own MNIST default settings. However, Caffe does not enjoy the same result, and its accuracy on MNIST worsened when using its own CIFAR-10 setting. In summary, longer training time and more complex NN structures do not necessarily guarantee higher accuracy. A possible explanation is that over-training may bring about the worst consequence of overfitting.

Similarly, for the CIFAR-10 dataset, Figure 4 shows the comparison results for dataset-dependent default settings. Figure 4a and Figure 4b show a similar trend: all three frameworks have much shorter training and testing time when using their default MNIST settings to train and test on CIFAR-10. This is simply because all frameworks choose simpler neural network structures for the MNIST dataset; thus, the MNIST default setting runs faster. Now we look at how testing accuracy responds to the simpler NN structure and shorter training time. Figure 4c shows that both TensorFlow and Caffe suffer from lower accuracy when using their own MNIST default settings to train on the CIFAR-10 dataset, whereas Torch shows very similar accuracy whether it uses its own MNIST default setting or its own CIFAR-10 default setting.


Torch 65.61 Torch 99.18 rmwr eal etnswe rie ntesame the on trained different when using of settings impact default the framework compare setting and evaluate default We framework-dependent datasets. of other Impact for D. well optimized work setting not default may own dataset its one framework, for DL a for the provides only iteration. Caffe each since for iterations not statistics varying of its will by loss number dataset using training the CIFAR-10 the measures Caffe on 5 Figure train that converge. almost to indicating is setting CIFAR-10 87.34%, MNIST using on of at setting loss constant, default training Caffe MNIST the of Caffe while 5,000 rate to (expected), proceeds loss training iterations the training as declines the training during dataset, CIFAR-10 setting default progresses on CIFAR-10 Caffe process the Using training experiment. shows the this 5 as in very Caffe Figure a for phrase. in loss testing training resulting at converge, such also CIFAR-10 not accuracy on does low train training to the fails that it its setting, dataset-dependent uses MNIST Caffe its when own CIFAR-10, For to Thus, sensitive setting. default Caffe’s very that noting is own worth its similar performance is or It very default setting. MNIST on default shows own CIFAR-10 test its Torch and either but using train accuracy dataset, to CIFAR-10 settings MNIST the default own their nsmay hstost feprmnsso that show experiments of sets two this summary, In

Fig. 3: Experimental Results on MNIST (Dataset-dependent Default Settings on GPU): (a) Training Time, (b) Testing Time, (c) Accuracy.

Fig. 4: Experimental Results on CIFAR-10 (Dataset-dependent Default Settings on GPU): (a) Training Time, (b) Testing Time, (c) Accuracy.

D. Impact of Framework-dependent Default Settings

We evaluate and compare the impact of using different framework default settings when training on the same dataset. The results show that the default setting optimized for training on a dataset by one framework may not work effectively for other frameworks to train on the same dataset.

We first conduct experiments on MNIST. Figure 6 shows the results. In Figure 6a, the first set of three bars compares the training time of TensorFlow on MNIST with three different MNIST default settings: (i) using its own MNIST default setting (blue bar), (ii) using the Caffe MNIST default setting (red bar), and (iii) using the Torch MNIST default setting (orange bar). Similarly, the first set of three bars in Figure 6b shows the testing time of TensorFlow using the three framework-dependent default settings on MNIST, and the first set of three bars in Figure 6c shows the accuracy of TensorFlow using the three framework settings.

i.5: Fig. CIFAR-10 1.36 CIFAR-10 1.47 IA-0wt t NS n IA-0default CIFAR-10 and MNIST its with CIFAR-10 Torch

Training Loss Torch MNIST 1.76

87.5 88.5 89.5 MNIST 87 88 89 90

0 1 3.47 0020 0040 0060 0080 0010000 9000 8000 7000 6000 5000 4000 3000 2000 1000 0 riigLs cnegne aeo af on Caffe of rate (convergence) Loss Training CIFAR-10 CIFAR-10 3.66 3.70

TrainingLossofCaffeonCIFAR-10 Accuracy (%) 100

20 40 60 80 Accuracy (%) 0 101 91 93 95 97 99 Dataset DefaultSettings TensorFlow Training Iteration

MNIST TensorFlow 69.76 MNIST (c) (c) 99.22 CIFAR-10 Accuracy Accuracy 87.00 CIFAR-10 99.31

Caffe MNIST 11.03

Caffe MNIST

CIFAR-10-Settings MNIST-Settings 99.13 CIFAR-10 CIFAR-10 91.79 75.52

Torch MNIST 66.40 Torch MNIST 99.18 CIFAR-10 65.61 CIFAR-10 99.17 sdb af n oc otano IA-0dataset. CIFAR-10 on train to when Torch time and training Caffe long by very used with poorly the default works CIFAR-10 setting number own is TensorFlow’s least three and epochs. setting structures training NN highlight default training of simple CIFAR-10 uses the We it own as that shortest 7. Caffe’s shows of Figure 7a time Figure in First, results observations. show of time. and setting both default training considering the dataset the accuracy. that MNIST and say twice time for to training/testing than optimal fair is higher more still Caffe a is of has it cost Caffe Thus, but the MNIST, setting, at default on MNIST TensorFlow using test when settings accuracy and highest default MNIST train the own to their simple. achieved using relatively Torch when is the is accuracy and structure that setting TensorFlow NN MNIST is its Second, Caffe reason and in smallest, One epochs the MNIST. training on of time We number time training 6. shorter testing second have Figure and frameworks the in three respectively, in all default MNIST bars setting, Caffe shown using 3 First, as observations. of two make Torch, set and third are and Caffe measurements the for same using The done TensorFlow settings. of framework accuracy three the shows 6c Figure et ecnuttesm xeiet o CIFAR-10 for experiments same the conduct we Next,

Fig. 6: Experimental Results on MNIST (Framework-dependent Default Settings on GPU): (a) Training Time, (b) Testing Time, (c) Accuracy.

Fig. 7: Experimental Results on CIFAR-10 (Framework-dependent Default Settings on GPU): (a) Training Time, (b) Testing Time, (c) Accuracy.

Caffe TF-MNIST 0.71 Caffe h xeietlrslswt framework-dependent with results settings. default present VIIc experimental Table dataset-dependent and the VIc of Table while performance settings, default the Table compared and VIb three Table VIIb these baseline. the of as CIFAR-10 settings serving frameworks, default and the with MNIST show results VIIa on Table experimental and far VIa Table so respectively. datasets reported results the 5. Figure those training to and the similar III.C is as Section reason in again The explained model converge. not DNN Caffe did a CIFAR-10, process own generate on TensorFlow’s to train using failed to setting is default Caffe than CIFAR-10 when training setting Finally, on cost time. huge a default at default, default CIFAR-10 CIFAR-10 own Torch’s CIFAR-10 TensorFlow’s own accuracy using higher much their achieved Torch However, offer using settings. Caffe of accuracy and TensorFlow complexity higher Third, of structures. levels framework-specific NN the their different time. to and testing attributed CIFAR-10, and be implementation training on can longer reason TensorFlow not much The by observe took does used we it setting because when 7b, default well Figure CIFAR-10 work and own Torch’s 7a that Figure from Second, Caffe-CIFAR-10 1.36 Caffe-MNIST 0.55 al IadTbeVIpoieasmayof summary a provide VII Table and VI Table Torch-CIFAR-10 0.58 Torch-MNIST 0.76


TABLE VI: Configurations for Training MNIST using TensorFlow (TF), Caffe and Torch

(a) Baseline Default Comparison
Framework | Training Time (s) | Testing Time (s) | Accuracy (%)
TF-CPU | 1114.34 | 2.73 | 99.28
Caffe-CPU | 512.18 | 3.33 | 99.03
Torch-CPU | 16096.62 | 56.62 | 99.20
TF-GPU | 68.51 | 0.26 | 99.22
Caffe-GPU | 97.02 | 0.55 | 99.13
Torch-GPU | 563.28 | 1.76 | 99.18

(b) Dataset-dependent Default Comparison
Framework (GPU) | Default Settings | Training Time (s) | Testing Time (s) | Accuracy (%)
TF | TF MNIST | 68.51 | 0.26 | 99.22
TF | TF CIFAR-10 | 14273.59 | 0.60 | 99.31
Caffe | Caffe MNIST | 97.02 | 0.55 | 99.13
Caffe | Caffe CIFAR-10 | 164.68 | 1.47 | 91.79
Torch | Torch MNIST | 563.28 | 1.76 | 99.18
Torch | Torch CIFAR-10 | 2978.52 | 3.70 | 99.17

(c) Framework Default Comparison
Framework (GPU) | Default Settings | Training Time (s) | Testing Time (s) | Accuracy (%)
TF | TF MNIST | 68.51 | 0.26 | 99.22
TF | Caffe MNIST | 21.32 | 0.12 | 98.51
TF | Torch MNIST | 176.23 | 0.13 | 99.10
Caffe | TF MNIST | 206.66 | 0.71 | 99.94
Caffe | Caffe MNIST | 97.02 | 0.55 | 99.13
Caffe | Torch MNIST | 235.57 | 0.76 | 94.14
Torch | TF MNIST | 321.63 | 1.53 | 99.11
Torch | Caffe MNIST | 187.54 | 1.37 | 98.78
Torch | Torch MNIST | 563.28 | 1.76 | 99.18

TABLE VII: Configurations for Training CIFAR-10 using TensorFlow (TF), Caffe and Torch

(a) Baseline Default Comparison
Framework | Training Time (s) | Testing Time (s) | Accuracy (%)
TF-CPU | 219169.14 | 4.80 | 86.90
Caffe-CPU | 1730.89 | 14.35 | 75.39
Torch-CPU | 38268.67 | 121.11 | 66.16
TF-GPU | 12477.05 | 2.34 | 87.00
Caffe-GPU | 163.51 | 1.36 | 75.52
Torch-GPU | 722.15 | 3.66 | 65.61

(b) Dataset-dependent Default Comparison
Framework (GPU) | Default Settings | Training Time (s) | Testing Time (s) | Accuracy (%)
TF | TF MNIST | 151.67 | 1.32 | 69.76
TF | TF CIFAR-10 | 12477.05 | 2.34 | 87.00
Caffe | Caffe MNIST | 115.30 | 0.64 | 11.03
Caffe | Caffe CIFAR-10 | 163.51 | 1.36 | 75.52
Torch | Torch MNIST | 638.00 | 3.47 | 66.40
Torch | Torch CIFAR-10 | 722.15 | 3.66 | 65.61

(c) Framework Default Comparison
Framework (GPU) | Default Settings | Training Time (s) | Testing Time (s) | Accuracy (%)
TF | TF CIFAR-10 | 12477.05 | 2.34 | 87.00
TF | Caffe CIFAR-10 | 32.98 | 1.40 | 55.96
TF | Torch CIFAR-10 | 2100.61 | 7.10 | 55.04
Caffe | TF CIFAR-10 | 33908.43 | 0.91 | 10.10
Caffe | Caffe CIFAR-10 | 163.51 | 1.36 | 75.52
Caffe | Torch CIFAR-10 | 682.58 | 0.58 | 59.27
Torch | TF CIFAR-10 | 126304.27 | 4.18 | 73.74
Torch | Caffe CIFAR-10 | 396.86 | 4.11 | 31.47
Torch | Torch CIFAR-10 | 722.15 | 3.66 | 65.61

Fig. 8: Experimental Results on Untargeted Attacks: (a) TensorFlow Model under Untargeted Attack, (b) Caffe Model under Untargeted Attack, (c) Success Rate Difference on Untargeted FGSM Attacks.

E. The Impact of Adversarial Behaviors

In this section, we generate adversarial examples in TensorFlow and Caffe on MNIST with their default settings and compare the effectiveness of these adversarial examples in terms of the success rate. We first use untargeted FGSM to launch adversarial attacks on the neural network (NN) models trained by TensorFlow and Caffe respectively, and measure and compare the attack success rates on these models. The parameter ε in the experiment is set to 0.001 (recall Formula (1) in Section II.C). Figure 8a and Figure 8b show the success rates of the ten digits for the TensorFlow trained DNN model and the Caffe trained DNN model respectively. It is observed that some digits tend to be crafted more easily into specific classes than into other classes. For instance, consider digit 5: for the TensorFlow trained MNIST model, the attack can successfully change its class to digit 3 with the highest probability, followed by digit 8, then digit 2, and with small probability to digit 9 and so forth. Similarly, for the Caffe trained MNIST model, the FGSM attack has a non-zero probability of misclassifying digit 5 to all other 9 classes with different probabilities, but the top 4 highest are the same as for TensorFlow: 3, 8, 2, 9. More in-depth analysis can be found in [28].

TABLE VIII: Average Crafting Time of Targeted Attacks on MNIST
Framework | TF | TF | Caffe | Caffe
Parameter | TF | Caffe | TF | Caffe
Average time | 113 min | 92 min | 187 min | 134 min

Fig. 9: Success Rate of Crafting digit 1 (into each of the other nine classes).
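The Jacobian-based targeted attacks measured in Table VIII perturb input features ranked by the saliency map of Equation (2). Below is a minimal NumPy sketch of that saliency computation, assuming the Jacobian J[i, j] = ∂F_j(x)/∂x_i has already been obtained from the framework's automatic differentiation:

```python
import numpy as np

def saliency_map(jacobian, t):
    """Targeted saliency map per Equation (2).

    jacobian: array with jacobian[i, j] = dF_j(x)/dx_i (assumed given).
    t: index of the target class.
    """
    dFt = jacobian[:, t]                      # dF_t(x)/dx_i for each feature i
    d_other = jacobian.sum(axis=1) - dFt      # sum over j != t of dF_j(x)/dx_i
    s = dFt * np.abs(d_other)                 # otherwise branch of Equation (2)
    s[(dFt < 0) | (d_other > 0)] = 0.0        # reject features per Equation (2)
    return s
```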

TABLE IX: Impact of Default Feature Maps/Regularization Methods on MNIST (success rate of crafting digit 1 into each target class)
Framework | third-layer | Regularization | 0 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
TF (TF) | 3136 → 1024 | drop out | 0.014 | 0.802 | 0.596 | 0.421 | 0.022 | 0.070 | 0.633 | 0.991 | 0.271
TF (Caffe) | 800 → 500 | drop out | 0.018 | 0.721 | 0.482 | 0.377 | 0.025 | 0.113 | 0.582 | 0.823 | 0.119
Caffe (TF) | 3136 → 1024 | weight decay | 0.584 | 0.893 | 0.802 | 0.721 | 0.046 | 0.533 | 0.912 | 0.925 | 0.327
Caffe (Caffe) | 800 → 500 | weight decay | 0.924 | 0.995 | 0.995 | 0.993 | 0.049 | 0.870 | 0.982 | 0.998 | 0.441

Figure 8c compares the average success rates of the ten digits by subtracting the success rate of the TensorFlow trained MNIST model (Figure 8a) from the success rate of the Caffe trained MNIST model (Figure 8b). We make an interesting observation: in general, the success rate of generating adversarial examples against the TensorFlow trained DNN model is lower than that of Caffe. The high success rate for digit 2 shows that digit 2 is the most likely class into which an untargeted adversarial example crafted by the FGSM method would fall.

We then study the impact of the Jacobian-based targeted attack on the deep learning models trained on MNIST by TensorFlow and Caffe respectively. To understand whether the size of the feature maps may impact the attack success rate, for these two sets of experiments we reduced the feature maps of both models at the third layer by the same percentage: for TensorFlow, the feature maps are reduced from 3136 to 1024, and for Caffe, the feature maps are condensed from 800 to 500. We compare the two different sizes of feature maps for both models. It is observed that the smaller number of feature maps tends to result in a much faster rate of crafting adversarial examples, no matter whether the network model is trained by TensorFlow or Caffe. The result is shown in Table VIII. First, the TensorFlow MNIST model is much faster than the Caffe MNIST model when all the parameters are the same (either using the TF parameters: 113 mins vs. 187 mins, or the Caffe parameters: 92 mins vs. 134 mins). Second, the smaller number of feature maps accelerates the crafting process even more: the Caffe MNIST model with 500 feature maps could generate twice as many adversarial examples as TensorFlow with 1024 feature maps in the same amount of time. For the MNIST dataset, the DNN model trained by TensorFlow is to some extent more robust against both types of attacks than the DNN model trained by Caffe. Figure 9 shows the success rate of crafting digit 1 into the other nine classes. Table IX shows the success rate results for digit 1 with the default regularization methods under different numbers of feature maps. The notation TF (TF) means that the TensorFlow framework uses the TensorFlow default parameters, and TF (Caffe) means that the TensorFlow framework uses the Caffe default parameter setting.

We observe that a larger number of feature maps, in most cases, introduces higher robustness regardless of the framework. Also, the TensorFlow trained model demonstrates higher robustness than that of Caffe. One possible reason is that the dropout in TensorFlow is a slightly weaker regularization than the weight decay in Caffe. Such a difference may affect the inductive bias of algorithms using one regularizer or the other. For further study on this subject, see [28], [27].

IV. RELATED WORK AND CONCLUSION

This paper rethinks the problems of benchmarking deep learning software frameworks. Several DL benchmark efforts have been put forward [8], [1], [7], [9], [3], [6], [2], [4], [5], each studying a small subset of the popular DL frameworks. However, these proposals do not examine model-specific parameters about both the neural network structure and the hyperparameters of DL frameworks, and their interactions with system runtime performance parameters.

We presented a comparative study of TensorFlow, Caffe and Torch with respect to training and testing time, learning and prediction accuracy, as well as model robustness against adversarial examples [29], [30], [18], [31], [32], [33], [17]. We highlight three observations from our in-depth experiments: (1) These deep learning software frameworks are optimized with their default configuration settings. However, the default configuration optimized on one specific dataset may not work effectively for other datasets. (2) The default configuration optimized for one framework to train on a dataset may not work well when used by another DL framework to train on the same dataset. (3) Different DL frameworks exhibit different levels of robustness against adversarial examples. Our study demonstrates that benchmarking deep learning software frameworks is significantly more challenging than traditional performance-driven benchmarks.

REFERENCES

[1] S. Bahrampour, N. Ramakrishnan, L. Schott, and M. Shah, "Comparative study of caffe, neon, theano, and torch for deep learning," CoRR, vol. abs/1511.06435, 2015. [Online]. Available: http://arxiv.org/abs/1511.06435
[2] S. Shams, R. Platania, K. Lee, and S. J. Park, "Evaluation of deep learning frameworks over different hpc architectures," in 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS), June 2017, pp. 1389–1396.
[3] P. Xu, S. Shi, and X. Chu, "Performance evaluation of deep learning tools in docker containers," ArXiv e-prints, Nov. 2017.
[4] C. Coleman, D. Narayanan, D. Kang, T. Zhao, J. Zhang, L. Nardi, P. Bailis, K. Olukotun, C. Ré, and M. Zaharia, "Dawnbench: An end-to-end deep learning benchmark and competition," in NIPS ML Systems Workshop, 2017.
[5] B. Research, "Benchmarking deep learning operations on different hardware," https://github.com/baidu-research/DeepBench, 2017, [Online; accessed 28-Jan-2018].
[6] H. Kim, H. Nam, W. Jung, and J. Lee, "Performance analysis of cnn frameworks for gpus," in 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), April 2017, pp. 55–64.
[7] S. Shi, Q. Wang, P. Xu, and X. Chu, "Benchmarking state-of-the-art deep learning software tools," CoRR, vol. abs/1608.07249, 2016. [Online]. Available: http://arxiv.org/abs/1608.07249
[8] S. Kovalev, "Performance of deep learning frameworks: Caffe, deeplearning4j, tensorflow, theano, and torch," https://www.altoros.com/performance-deep-learning-frameworks-caffe-deeplearning4j-tensorflow-theano-torch.html, 2016, [Online; accessed 04-Dec-2017].
[9] S. Shi and X. Chu, "Performance modeling and evaluation of distributed deep learning frameworks on gpus," ArXiv e-prints, Nov. 2017.
[10] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. J. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Józefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. G. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. A. Tucker, V. Vanhoucke, V. Vasudevan, F. B. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "Tensorflow: Large-scale machine learning on heterogeneous distributed systems," CoRR, vol. abs/1603.04467, 2016. [Online]. Available: http://arxiv.org/abs/1603.04467
[11] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in Proceedings of the 22nd ACM International Conference on Multimedia, ser. MM '14. New York, NY, USA: ACM, 2014, pp. 675–678. [Online]. Available: http://doi.acm.org/10.1145/2647868.2654889
[12] R. Collobert and K. Kavukcuoglu, "Torch7: A matlab-like environment for machine learning," in BigLearn, NIPS Workshop, 2011.
[13] L. Bottou, "Stochastic gradient descent tricks," in Neural Networks: Tricks of the Trade. Springer, 2012, pp. 421–436.
[14] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," CoRR, vol. abs/1412.6980, 2014. [Online]. Available: http://arxiv.org/abs/1412.6980
[15] L. Developers, "The programming language lua," https://www.lua.org/, 2018, [Online; accessed 15-Jan-2018].
[16] Y. Jia, "Caffe model zoo," http://caffe.berkeleyvision.org/model_zoo.html, 2017, [Online; accessed 04-Dec-2017].
[17] I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," arXiv preprint arXiv:1412.6572, 2014.
[18] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami, "The limitations of deep learning in adversarial settings," in 2016 IEEE European Symposium on Security and Privacy (EuroS&P). IEEE, 2016, pp. 372–387.
[19] X. W. Chen and X. Lin, "Big data deep learning: Challenges and perspectives," IEEE Access, vol. 2, pp. 514–525, 2014.
[20] O. Y. Al-Jarrah, P. D. Yoo, S. Muhaidat, G. K. Karagiannidis, and K. Taha, "Efficient machine learning for big data: A review," Big Data Research, vol. 2, no. 3, pp. 87–93, 2015. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S2214579615000271
[21] T. Condie, P. Mineiro, N. Polyzotis, and M. Weimer, "Machine learning on big data," in 2013 IEEE 29th International Conference on Data Engineering (ICDE), April 2013, pp. 1242–1244.
[22] D. C. Cireşan, U. Meier, L. M. Gambardella, and J. Schmidhuber, "Deep, big, simple neural nets for handwritten digit recognition," Neural Computation, vol. 22, no. 12, pp. 3207–3220, 2010. [Online]. Available: https://doi.org/10.1162/NECO_a_00052
[23] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov 1998.
[24] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," 2009.
[25] T. Developers, "Convolutional layers - SpatialConvolutionMap," https://nn.readthedocs.io/en/rtd/convolution/index.html#nn.SpatialConvolutionMap, 2018, [Online; accessed 15-Jan-2018].
[26] T. Developers, "SpatialConvolutionMM source code," https://github.com/torch/nn/blob/master/SpatialConvolutionMM.lua, 2018, [Online; accessed 15-Jan-2018].
[27] Y. Wu, L. Liu, C. Pu, and W. Wei, "GIT DLBench: A benchmark suite for deep learning frameworks: Characterizing performance, accuracy and adversarial robustness," Georgia Institute of Technology, School of Computer Science, Tech. Rep., 02 2018.
[28] W. Wei, L. Liu, S. Truex, L. Yu, E. Gursoy, and Y. Wu, "Demystifying adversarial behaviors in deep learning," Georgia Institute of Technology, School of Computer Science, Tech. Rep., 02 2018.
[29] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, "Intriguing properties of neural networks," arXiv preprint arXiv:1312.6199, 2013.
[30] A. Nguyen, J. Yosinski, and J. Clune, "Deep neural networks are easily fooled: High confidence predictions for unrecognizable images," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 427–436.
[31] R. Feinman, R. R. Curtin, S. Shintre, and A. B. Gardner, "Detecting adversarial samples from artifacts," arXiv preprint arXiv:1703.00410, 2017.
[32] N. Carlini and D. Wagner, "Towards evaluating the robustness of neural networks," in 2017 IEEE Symposium on Security and Privacy (SP). IEEE, 2017, pp. 39–57.
[33] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, "Towards deep learning models resistant to adversarial attacks," arXiv preprint arXiv:1706.06083, 2017.
