Benchmarking Frameworks: Design Considerations, Metrics and Beyond

Ling Liu, Yanzhao Wu, Wenqi Wei, Wenqi Cao, Semih Sahin, Qi Zhang
Georgia Institute of Technology

Abstract—With an increasing number of open-source deep learning (DL) software tools made available, benchmarking DL software frameworks and systems is in high demand. This paper presents design considerations, metrics and challenges towards developing an effective benchmark for DL software frameworks, and illustrates our observations through a comparative study of three popular DL frameworks: TensorFlow, Caffe, and Torch. First, we show that these deep learning frameworks are optimized with their default configuration settings. However, the default configuration optimized on one specific dataset may not work effectively for other datasets with respect to runtime performance and learning accuracy. Second, the default configuration optimized on a dataset by one DL framework does not work well for another DL framework on the same dataset. Third, we show through experiments that different DL frameworks exhibit different levels of robustness against adversarial examples. Through this study, we conjecture that effectively benchmarking deep learning software frameworks and systems is significantly more challenging than traditional performance-driven benchmarks.

I. INTRODUCTION

Deep learning (DL) applications and systems have blossomed in recent years as more varieties of data enter cyberspace and an increasing number of open-source deep learning (DL) software frameworks are made available. Benchmarking DL frameworks and systems is in high demand [1], [2], [3], [4], [5], [6], [7], [8], [9]. However, benchmarking deep learning software frameworks and systems is notably more difficult than traditional performance-driven benchmarks. This is simply because big-data-powered deep learning systems are inherently both computation-intensive and data-intensive, demanding intelligent integration of massive data parallelism and massive computation parallelism at all levels of a deep learning framework.
For instance, a deep learning framework typically has a large set of model parameters and system parameters that need to be configured and tuned. Many of these parameters interact with one another in a complex manner, from both the model learning perspective and the system runtime optimization perspective, making the tuning of such a large space of parameters substantially more tricky than what has been experimented with and understood from systems administration of conventional computer systems, software tools and applications.

This paper presents design considerations, metrics and insights towards benchmarking DL software frameworks through a comparative study of three popular deep learning frameworks: TensorFlow [10], Caffe [11] and Torch [12]. First, we show that although deep learning software frameworks are optimized with their default configuration settings, for a given DL framework, its default configuration optimized to train on one dataset may not work effectively for other datasets. Second, the default configuration optimized on a dataset by one DL framework may not work well when used to train on the same dataset by another DL framework. Hence, it may not be meaningful to compare different DL frameworks under the same configuration. Third, different DL frameworks exhibit different levels of robustness in response to adversarial behaviors, and different sensitivity boundaries over potential biases or noise levels inherent in different training datasets. Through this experimental study, we show that system runtime performance, learning accuracy, and model robustness against adversarial behaviors and the consequences of overfitting are three sets of metrics that are equally important for effectively configuring, measuring and comparing different deep learning software frameworks.

II. DEEP LEARNING REFERENCE MODEL

Deep learning software frameworks are scalable software implementations of deep neural networks (DNN) on modern computers with many-core CPUs, with or without GPUs.

A. DNN Model

A deep neural network refers to an N-layer neural network with N ≥ 2. Learning over the input data to a DNN is typically performed through a sequence of transformations of the input data layer by layer, with each layer representing a network of neurons extracted from the input data. Although different layers extract different representations of features of the input data, each layer learns to extract more complex, deeper features from its previous layer. A typical layer of a neural network consists of weights w, biases b and an activation function act(), in addition to the set of loosely connected neurons (parameters). It takes as input the neuron map produced by its previous layer, and produces the output as y = act(w ∗ x + b), where x is the input of the layer. By connecting these layers together, a deep neural network is a function y = F(x) in which x ∈ R^n is an n-dimensional input and y ∈ R^m is an m-dimensional output vector.

The neural network model F consists of many model parameters θ_F. The values of the parameters θ_F are tuned during the training phase, where a large number of input-output pairs, as the training dataset, is fed into the neural network. The training process is conducted in multiple rounds/iterations. During each round, the neural network uses the parameters from the previous iteration with the training data input to predict the output forwardly on the N-layer neural network. Then, the DNN computes a loss function between the predicted output and the real (pre-labeled) output. Using the loss function, the DNN updates the parameters using backpropagation with an optimizer, e.g., stochastic gradient descent (SGD) [13] or Adam [14], which minimizes the pre-defined loss function.
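To make the per-layer transformation y = act(w ∗ x + b) and the iterative update concrete, the following minimal NumPy sketch (our own illustration, not code from any of the three frameworks; the layer sizes, data and learning rate are arbitrary assumptions) performs one forward pass and one SGD step on the last layer:

```python
import numpy as np

def act(z):
    return np.maximum(z, 0.0)  # ReLU activation

# A toy 2-layer network: each layer computes y = act(W x + b).
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 8)) * 0.1, np.zeros(16)   # layer 1 parameters
W2, b2 = rng.normal(size=(4, 16)) * 0.1, np.zeros(4)    # layer 2 parameters

def forward(x):
    h = act(W1 @ x + b1)         # first transformation of the input
    return act(W2 @ h + b2), h   # output and the intermediate neuron map

# One SGD round: forward pass, loss, gradient, parameter update.
x, y_true = rng.normal(size=8), np.ones(4)   # one (input, label) pair
lr = 0.05                                    # learning rate (assumed)
y_pred, h = forward(x)
loss = 0.5 * np.sum((y_pred - y_true) ** 2)  # pre-defined loss function

# Backpropagate through the last layer only, for brevity; a real
# framework differentiates through every layer automatically.
delta = (y_pred - y_true) * (y_pred > 0)     # dL/dz2 through ReLU
W2 -= lr * np.outer(delta, h)                # SGD update: theta -= lr * grad
b2 -= lr * delta
```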
The general principle of defining a loss function is to measure the difference between the computed output and the ground truth (real output).

In addition to the network parameters tuned in the training process, hyperparameters also need to be tuned. The learning rate and batch size are among the most important ones. Both are used to control the extent of parameter updates in each iteration. In general, a larger learning rate and/or a larger batch size will bring about faster convergence. However, if the learning rate is too large, the training process may not be fine-grained enough and may suffer from fluctuation. If the batch size is too large to fit each mini-batch in the memory of the GPU or CPU core, the training process may take much longer to complete. A smaller learning rate leads to slower convergence but makes the training process more fine-grained. A smaller batch size leads to a larger number of mini-batches (bags), and thus a larger number of iterations per epoch, and may result in lower accuracy, but it can avoid the runtime performance penalty induced by running out of memory (OOM). Another key hyperparameter is the number of kernels (weight filters), which determines the number of feature maps produced at each layer of the neural network. Usually, more feature maps enable a deep learning model to give the input data a more refined representation, but at a higher runtime cost. Other hyperparameters, such as the network architecture, the number of layers, paddings, strides, the kernel sizes of layers, the type of activation function, the optimizer, and the regularization method, also influence the performance of the deep learning model, each from their own perspective. Note that the regularization method can reduce overfitting. Overfitting occurs when the deep learning model is able to achieve high accuracy on the training data but such high accuracy cannot be generalized to the testing data.

Once the DNN is trained, the parameters θ_F are fixed, producing a deep learning model that is used in the testing phase to make predictions and classifications. Usually, the trained DNN model may be re-trained before its actual deployment, to ensure that it passes the validation test and that the system using the trained DNN model can provide sufficiently accurate results. The testing phase refers to both the validation and the use of a trained DNN model in a real application system.

B. Reference DL Frameworks

Three mainstream DL frameworks, TensorFlow, Caffe and Torch, are selected for this study.

TensorFlow [10] is an open source software library implemented based on a data flow graph: the nodes are used to represent mathematical operations, the edges represent the data flow, and tensors (data arrays) flow between nodes via the edges (a minimal sketch of this graph style appears at the end of this subsection). The tensor-based data flow makes TensorFlow an ideal API and implementation tool for Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). However, TensorFlow does not support dynamic input sizes, which are crucial for applications like NLP.

Caffe [11] supports many different types of deep learning architectures (CNN, RCNN, LSTM and fully connected neural networks) geared towards image classification and segmentation. Caffe can be used simply from the command line: it builds neural network models by transforming the data to LMDB format, defining the network architecture in a .prototxt file, defining hyperparameters such as the learning rate and training epochs in the "solver" file, and then training. Caffe works layer-wise for deep learning applications.

Torch [12] is an open source machine learning library that provides a wide range of algorithms for DL. Its computing framework is based on the Lua programming language [15], a scripting language. Torch has the most comprehensive set of convolutions, and it supports temporal convolution with variable input length.
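Returning to the data-flow-graph abstraction described above, here is a minimal sketch against the TensorFlow 1.x API (the version benchmarked in this study); the shapes and the zero-initialized batch are our own assumptions, not framework defaults:

```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x, as benchmarked in this study

# Build the graph: nodes are operations; tensors flow along the edges.
x = tf.placeholder(tf.float32, shape=[None, 784], name="input")  # flattened 28x28
W = tf.Variable(tf.zeros([784, 10]), name="weights")
b = tf.Variable(tf.zeros([10]), name="biases")
logits = tf.matmul(x, W) + b  # an operation node consuming three tensors

# The graph is only executed inside a session.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    batch = np.zeros((32, 784), dtype=np.float32)  # dummy input (assumed)
    print(sess.run(logits, feed_dict={x: batch}).shape)  # (32, 10)
```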

C. Evaluation Metrics

Metrics are a critical component of a benchmark. The following metrics are used in this measurement study.

Training Time. This is a key performance indicator for DL frameworks. Training time is the time spent on building a DNN model over the training dataset. For optimization purposes, models need to be trained several times with different parameters in order to find the parameters that achieve the optimal design. Although pre-trained models are made available on many platforms, such as the Caffe Model Zoo [16], training is still essential for new model development and for retraining over new datasets or incrementally enhanced datasets.

Testing Time. This is another important performance indicator for DL frameworks. Testing time is the time spent on testing the trained model using a validation dataset. It indicates the potential latency of using the trained model for prediction- or classification-based inference when the model is deployed in real-world applications. Thus, testing time affects the user experience to a large extent and affects the performance of actual applications. Both training and testing time can be influenced by the configurations of system-specific and model-specific parameters.

Learning/Prediction Accuracy. The learning accuracy metric measures the utility of the training framework in the training phase, and the prediction accuracy measures the utility of the trained DNN model in the testing phase. Accuracy measurement is highly sensitive both to data-specific parameters, such as the type of dataset and its characteristics (e.g., the number of classes and the number of training samples per class), and to the type of deep learning architecture and machine learning library used, such as the collection of algorithms/optimizations included and the configurations of many model-specific parameters.

Adversarial Robustness. This metric is designed to measure the resilience of the DL framework and its trained DNN model against adversarial behaviors, including targeted attacks and random (untargeted) attacks, as well as the effect of overfitting against potential biases and noise levels in the training dataset during the testing phase. It can also be used as a measure to evaluate the effectiveness of the regularization techniques deployed in different DL frameworks. For example, TensorFlow uses dropout, while Caffe has weight decay. In this paper, we use the success rate of crafting adversarial examples as a measure of adversarial robustness.

Two types of adversarial attacks are considered in this study: the untargeted Fast Gradient Sign Method (FGSM) [17] and targeted Jacobian-based attacks [18]. An adversarial example x′ consists of an input x and its adversarial perturbation δ_x. With some perturbation, the adversarial example is classified as a new class different from its original class, i.e., x′ = x + δ_x, F(x) = y and F(x′) = y′ both hold while y′ ≠ y.

FGSM: a simple and fast way to generate untargeted adversarial examples:

    x′ = x + ε · sign(∇_x L(x, y)),  (1)

where L(x, y) is the loss function of the original input and the true label, ε controls the perturbation magnitude, and sign() is the mathematical function with sign(x) = 1 if x > 0, sign(x) = 0 if x = 0, and sign(x) = −1 if x < 0 (a short sketch of this computation appears at the end of this subsection).

Jacobian-based attacks: a targeted attack launched by adversaries to generate adversarial examples such that they are classified as a targeted class t instead of their truly legitimate source class. Concretely, for each feature i, the perturbations δ_x are formed using the saliency map S(x, t)[i] of the network rather than the loss function, e.g.,

    S(x, t)[i] = 0, if ∂F_t(x)/∂x_i < 0 or Σ_{j≠t} ∂F_j(x)/∂x_i > 0;
    S(x, t)[i] = (∂F_t(x)/∂x_i) · |Σ_{j≠t} ∂F_j(x)/∂x_i|, otherwise,  (2)

where the matrix J_F = [∂F_j/∂x_i]_{i,j} is the Jacobian matrix of the neural network function. The goal of Equation (2) is to reject input features with a negative target derivative or an overall positive derivative on classes other than the targeted class. In fact, saliency maps exploit input features that contribute to increasing the probability of the target class, or decreasing that of the source class or other classes significantly, or both.
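As a concrete reading of Equation (1), the following NumPy sketch crafts an untargeted FGSM example. It is an illustration only: the function grad_loss_x, returning ∇_x L(x, y), is an assumed hook into a framework's automatic differentiation, and the [0, 1] pixel clipping is our own assumption.

```python
import numpy as np

def fgsm(x, y, grad_loss_x, epsilon=0.001):
    """Untargeted FGSM per Equation (1): x' = x + epsilon * sign(grad_x L(x, y)).

    grad_loss_x(x, y) is assumed to return the gradient of the loss with
    respect to the input x (e.g., via a framework's autodiff). The default
    epsilon matches the value used in the experiments of Section III.E.
    """
    x_adv = x + epsilon * np.sign(grad_loss_x(x, y))
    return np.clip(x_adv, 0.0, 1.0)  # keep pixels in a valid range (assumption)
```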
III. EXPERIMENTS

We conduct three sets of experiments to answer the following questions: (i) How effective is the default setting (configuration) of one DL framework in comparison with that of another DL framework for the same datasets? (ii) How efficient will the default setting used to train one dataset be when it is deployed to train another dataset using the same DL framework? (iii) Would the default setting used by one DL framework be effective when used by another DL framework to train the same dataset (i.e., dataset-dependent default configuration)? (iv) How well can the default setting of a DL framework, which is optimized for training on one dataset, perform when it is used by another DL framework to train on the same dataset (i.e., framework-dependent default configuration)?

TABLE I: Deep Learning Software Frameworks and Basic Properties
Frameworks | Version | Hash Tag | Library | Interface | LoC | License | Website
TensorFlow | 1.3.0 | ab0fcac | Eigen & CUDA | Java, Python, Go, R | 1281085 | Apache | https://www.tensorflow.org/
Caffe | 1.0.0 | c430690 | OpenBLAS & CUDA | Python, Matlab | 69608 | BSD | http://caffe.berkeleyvision.org/
Torch | torch7 | 0219027 | optim & CUDA | Lua | 29750 | BSD | http://torch.ch/

All experiments are conducted on an Intel Xeon(R) E5-1620 server with CPU: 3.6GHz, Memory: DDR3 1600MHz 8GB × 4 (32GB), Hard drive: SSD 256GB, GPU: Nvidia GeForce GTX 1080 Ti (11GB), installed with Ubuntu 16.04 LTS, CUDA 8.0 and cuDNN 6.0.

DL Frameworks. TensorFlow [10], Caffe [11], and Torch [12] are selected for this measurement study. Table I shows some of their statistics.

Datasets. Datasets play a definitive role in the performance of deep learning in most cases [19], [20], [21], [22]. For the objectives of this study, we choose two classic datasets: MNIST [23] and CIFAR-10 [24], as all 3 DL frameworks have tuned their configurations for these two datasets. MNIST consists of 70,000 images of ten handwritten digits, each image 28 × 28 in size. CIFAR-10 consists of 60,000 color images of 10 classes, each 32 × 32 in size.

A. Default Settings in DL Frameworks

We compare the primary hyperparameter settings used for MNIST and CIFAR-10 in Table II and Table III respectively.

TABLE II: Default training parameters on MNIST
Framework | TensorFlow | Caffe | Torch
Algorithm | Adam | SGD | SGD
Base Learning Rate | 0.0001 | 0.01 | 0.05
Batch Size | 50 | 64 | 10
#Max Iterations | 20,000 | 10,000 | 120,000
#Epochs | 16.67 | 10.67 | 20

TABLE III: Default training parameters on CIFAR-10
Framework | TensorFlow | Caffe | Torch
Algorithm | SGD | SGD | SGD
Base Learning Rate | 0.1 | 0.001 → 0.0001 | 0.001
Batch Size | 128 | 100 | 1
#Max Iterations | 1,000,000 | 5,000 | 100,000
#Epochs | 2560 | 8+2 | 20

All three DL frameworks select their own preferred default training parameters for both datasets. For MNIST, TensorFlow prefers Adam [14] as its optimizer while Caffe and Torch use SGD [13]. TensorFlow uses the smallest base learning rate, Caffe uses the largest batch size along with the smallest number of training epochs, and Torch uses the largest base learning rate and a larger number of training epochs. For MNIST, TensorFlow sets its maximum steps to 20,000 and Caffe sets its max iterations to 10,000; thus, by #Epochs = max steps × batch size / #Training Samples, we obtain 20,000 × 50 / 60,000 = 16.67 epochs for TensorFlow and 10,000 × 64 / 60,000 = 10.67 epochs for Caffe. For Torch, the max #Epochs is manually set to 20 and the #max iterations is set to (20 × 60,000) / 10 = 120,000. For CIFAR-10, SGD is used by all three as their optimizer. Caffe adopts a two-phase training: the learning rate for its first phase is 0.001 and 0.0001 for the second phase, and Caffe uses 8 epochs for the first phase of training and 2 epochs for the second phase. Using the same formula, TensorFlow has its maximum steps set to (2,560 × 50,000) / 128 = 1,000,000, Caffe sets its max iterations to (10 × 50,000) / 100 = 5,000, and Torch sets it to 20 × 5,000 / 1 = 100,000 for CIFAR-10, since the #Training Samples of CIFAR-10 is 50,000 whereas it is 60,000 for MNIST.
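The epoch arithmetic above can be captured in a one-line helper; the sketch below (illustrative only) reproduces the MNIST numbers in Table II:

```python
def epochs(max_iterations, batch_size, num_training_samples):
    """#Epochs = max iterations x batch size / #Training Samples."""
    return max_iterations * batch_size / num_training_samples

# Reproducing Table II (MNIST has 60,000 training samples):
print(epochs(20000, 50, 60000))   # TensorFlow: ~16.67 epochs
print(epochs(10000, 64, 60000))   # Caffe: ~10.67 epochs
```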
Next, we compare the primary default parameters for the neural network structures, which are configured by the framework creators to work optimally with MNIST and CIFAR-10, shown in Table IV and Table V respectively. For MNIST, the three frameworks adopt a similar network structure with 2 convolution layers and 2 fully connected layers, derived from LeNet [23]. All set their kernel size as 5 × 5; the notation 1 → 32 indicates that the number of input feature maps is 1 and the number of output feature maps is 32. However, other parameters are selected differently for MNIST, such as the activation operation and the number of kernels/feature maps extracted. In comparison, for CIFAR-10, the network structures vary more significantly from the ones configured for MNIST. Caffe and TensorFlow use a 5-layer network structure and Torch uses a 4-layer one instead; Caffe employs 3 convolution layers while TensorFlow and Torch both use 2 convolution layers.

We make two observations from Table IV and Table V. First, a DL framework may vary its default parameters for different datasets, as shown. We refer to this type of default setting variation as dataset-dependent default settings. Second, different frameworks may NOT use the same default configuration to train the same dataset, because each framework may optimize its performance using a different setting of model-tuning and system-tuning parameters, even when trained over the same dataset. We call this type of default setting variation framework-dependent default settings, which refers to the default settings used by different DL frameworks to train on a specific dataset.

TABLE IV: Primary Default Neural Network Parameters on MNIST
Framework | TensorFlow | Caffe | Torch
1st Layer (conv) | 5×5, 1→32; ReLU, MaxPooling(2×2) | 5×5, 1→20; MaxPooling(2×2) | 5×5, 1→32; Tanh, MaxPooling(3×3)
2nd Layer (conv) | 5×5, 32→64; ReLU, MaxPooling(2×2) | 5×5, 20→50; MaxPooling(2×2) | 5×5, 32→64; Tanh, MaxPooling(3×3)
3rd Layer (fc) | 7×7×64→1024; ReLU | 4×4×50→500; ReLU | 3×3×64→200; Tanh
4th Layer (fc) | 1024→10 | 500→10 | 200→10

TABLE V: Primary Default Neural Network Parameters on CIFAR-10
Framework | TensorFlow | Caffe | Torch
1st Layer (conv) | 5×5, 3→64; ReLU, MaxPooling(3×3), Normalization | 5×5, 3→32; MaxPooling(3×3), ReLU | 5×5, 3→16; Tanh, MaxPooling(2×2)
2nd Layer | 5×5, 64→64; ReLU, Normalization, MaxPooling(3×3) (conv) | 5×5, 32→32; ReLU, AveragePooling(3×3) (conv) | 5×5, 16→256; Tanh, MaxPooling(2×2) (conv)
3rd Layer | 7×7×64→384; fc, ReLU | 5×5, 32→64; conv, ReLU, AveragePooling(3×3) | 5×5×256→128; fc, Tanh
4th Layer (fc) | 384→192; ReLU | 4×4×64→64 | 128→10
5th Layer (fc) | 192→10 | 64→10 | —
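For concreteness, the TensorFlow column of Table IV corresponds to a network along the lines of the following sketch, written against the tf.layers API of TensorFlow 1.x; this is our reading of the table, not the framework's shipped example code:

```python
import tensorflow as tf  # TensorFlow 1.x

def tf_mnist_net(images):  # images: [batch, 28, 28, 1]
    # 1st layer (conv): 5x5, 1 -> 32, ReLU, MaxPooling(2x2)
    c1 = tf.layers.conv2d(images, 32, 5, padding="same", activation=tf.nn.relu)
    p1 = tf.layers.max_pooling2d(c1, pool_size=2, strides=2)
    # 2nd layer (conv): 5x5, 32 -> 64, ReLU, MaxPooling(2x2)
    c2 = tf.layers.conv2d(p1, 64, 5, padding="same", activation=tf.nn.relu)
    p2 = tf.layers.max_pooling2d(c2, pool_size=2, strides=2)
    # 3rd layer (fc): 7x7x64 -> 1024, ReLU
    flat = tf.reshape(p2, [-1, 7 * 7 * 64])
    fc1 = tf.layers.dense(flat, 1024, activation=tf.nn.relu)
    # 4th layer (fc): 1024 -> 10 logits
    return tf.layers.dense(fc1, 10)
```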

Interesting questions arise:
• Can the default setting optimized to train one dataset be effective to train a different dataset using the same DL software framework? (same framework on different datasets)
• Can the default setting used to train a dataset by one DL framework be used effectively to configure another framework to train the same dataset? (different frameworks on the same dataset)
• Will the default setting, optimized by one framework, work effectively for other DL frameworks? (different frameworks on different datasets)

B. Impact of Default Settings

We first study the impact of default settings by one framework on different datasets to answer the first question (same framework on different datasets). This set of experiments compares the performance of the three DL frameworks on MNIST using their own default settings optimized for MNIST (Figure 1), with a similar comparison on CIFAR-10 (Figure 2). We highlight three observations. (1) For both datasets, Torch spent the longest time in testing, as well as in training on MNIST, while TensorFlow has the highest accuracy. Even though Torch's accuracy is higher than Caffe's and lower than TensorFlow's for MNIST, its accuracy for CIFAR-10 is the worst of the three. One reason that Caffe has shorter training and testing time might be that Caffe was trained for the smallest number of epochs and fewer feature maps were extracted for inference, whereas TensorFlow has the largest number of feature maps, which helps it achieve the highest accuracy with low testing time. (2) For MNIST, TensorFlow and Caffe show similar training and testing times, but Caffe has the lowest accuracy. For CIFAR-10, Caffe spent the least time on training in both the CPU and GPU settings and the least testing time in the GPU setting. TensorFlow has significantly longer training time and its GPU testing time doubles that of Caffe, but TensorFlow has the highest accuracy, with Caffe ranking second, followed by Torch. Note that Torch GPU has slightly lower accuracy (65.61%) than Torch CPU (66.16%). One reason could be that Torch uses SpatialConvolutionMap [25] on CPU for CIFAR-10, but it lacks the corresponding implementation on GPU; thus, SpatialConvolutionMM [26] is used as the default. (3) All three frameworks shorten their training and testing time with GPU for both datasets. Concretely, with GPU acceleration for MNIST, TensorFlow is faster by 16 times and 10 times, Caffe is faster by 5 times and 6 times, and Torch is faster by 28 times and 32 times, in training time and testing time respectively. However, TensorFlow CPU and Torch CPU obtain slightly higher accuracy on MNIST compared to TensorFlow GPU and Torch GPU respectively, though the accuracies of all settings on MNIST are above 99%, with the highest by TF CPU (99.28%) and the lowest by Caffe CPU (99.03%). For CIFAR-10, Torch CPU has slightly higher accuracy than Torch GPU. It is worth noting that TensorFlow CPU with 256 epochs can achieve 86.6% accuracy with a training time of 21673.81 sec (about 6.02 hours), whereas TensorFlow CPU with 2560 epochs achieves 86.90% accuracy at the cost of 60.88 hours of training time. This result demonstrates that GPU acceleration shortens the training/testing time but may not ensure high accuracy, due to multiple factors such as the mini-batch size, the bagging algorithm, and the per-unit memory capacity of the GPU. For more detail, see [27].

In summary, the deep learning models demonstrate much better accuracy on the sparse and gray-scale MNIST dataset than on the color-rich and content-rich CIFAR-10 dataset. The sparseness and gray scale of MNIST give the data low entropy. We attribute the better

accuracy performance to the lower entropy of the data, since it is easier for the deep learning model to learn. The low entropy also makes DNN model training and testing on MNIST much faster than on CIFAR-10.

C. Impact of Dataset-dependent Default Settings

The next set of experiments studies the performance variations of dataset-dependent default settings and their impact on all three DL frameworks. We first compare the performance of the three DL frameworks on MNIST. To be fair, we choose the default settings preferred by each of the three DL frameworks and compare the results. Figure 3 shows the performance. In Figure 3a, we compare the training time of the three frameworks on MNIST. There are two colored bars for each framework: the blue bar is the training time using its own MNIST default setting and the red bar is the training time using its own CIFAR-10 default setting. Similar comparisons for testing time and accuracy are in Figure 3b and Figure 3c. We make two observations. First, all frameworks perform worst, with longer training and testing time, when using their own CIFAR-10 default settings on MNIST. Recall Table IV and Table V: all three frameworks choose deeper and more complex neural network structures for CIFAR-10, hence the longer CIFAR-10 training and testing time.

Fig. 1: Experimental Results on MNIST, using MNIST Default Settings by 3 frameworks: (a) Training Time, (b) Testing Time, (c) Accuracy.

Fig. 2: Experimental Results on CIFAR-10, using CIFAR-10 Default Settings by 3 frameworks: (a) Training Time, (b) Testing Time, (c) Accuracy.

Second, surprisingly, TensorFlow and Torch using their own CIFAR-10 default settings to train on MNIST achieve accuracy almost identical to the best accuracy they produced when using their own MNIST default settings. However, Caffe does not enjoy the same result, and its accuracy on MNIST worsened when using its own CIFAR-10 setting. In summary, longer training time and more complex NN structures do not necessarily guarantee higher accuracy. A possible explanation is that over-training may bring about the worst consequence of overfitting.

Similarly, for the CIFAR-10 dataset, Figure 4 shows the comparison results for dataset-dependent default settings. Figure 4a and Figure 4b show a similar trend: all three frameworks have much shorter training and testing time when using their default MNIST settings to train and test on CIFAR-10. This is simply because all frameworks choose simpler neural network structures for the MNIST dataset; thus, the MNIST default setting runs faster. Now we look at how testing accuracy responds to the simpler NN structure and shorter training time. Figure 4c shows that both TensorFlow and Caffe suffer from lower accuracy when using their own MNIST default settings to train on the CIFAR-10 dataset, whereas Torch shows very similar accuracy whether it uses its own MNIST default setting or its own CIFAR-10 default setting.


Torch 65.61 Torch 99.18 rmwr eal etnswe rie ntesame the on trained different when using of settings impact default the framework compare setting and evaluate default We framework-dependent datasets. of other Impact for D. well optimized work setting not default may own dataset its one framework, for DL a for the provides only iteration. Caffe each since for iterations not statistics varying of its will by loss number dataset using training the CIFAR-10 the measures Caffe on 5 Figure train that converge. almost to indicating is setting CIFAR-10 87.34%, MNIST using on of at setting loss constant, default training Caffe MNIST the of Caffe while 5,000 rate to (expected), proceeds loss training iterations the training as declines the training during dataset, CIFAR-10 setting default progresses on CIFAR-10 Caffe process the Using training experiment. shows the this 5 as in very Caffe Figure a for phrase. in loss testing training resulting at converge, such also CIFAR-10 not accuracy on does low train training to the fails that it its setting, dataset-dependent uses MNIST Caffe its when own CIFAR-10, For to Thus, sensitive setting. default Caffe’s very that noting is own worth its similar performance is or It very default setting. MNIST on default shows own CIFAR-10 test its Torch and either but using train accuracy dataset, to CIFAR-10 settings MNIST the default own their nsmay hstost feprmnsso that show experiments of sets two this summary, In

Fig. 3: Experimental Results on MNIST (Dataset-dependent Default Settings on GPU): (a) Training Time, (b) Testing Time, (c) Accuracy.

Fig. 4: Experimental Results on CIFAR-10 (Dataset-dependent Default Settings on GPU): (a) Training Time, (b) Testing Time, (c) Accuracy.

D. Impact of Framework-dependent Default Settings

We evaluate and compare the impact of using different framework default settings when training on the same dataset. The results show that the default setting optimized for training on a dataset by one framework may not work effectively for other frameworks to train on the same dataset.

We first conduct experiments on MNIST. Figure 6 shows the results. In Figure 6a, the first set of three bars compares the training time of TensorFlow on MNIST with three different MNIST default settings: (i) using its own MNIST default setting (blue bar), (ii) using the Caffe MNIST default setting (red bar), and (iii) using the Torch MNIST default setting (orange bar). Similarly, the first set of three bars in Figure 6b shows the testing time of TensorFlow using the three framework-dependent default settings on MNIST, and the first set of three bars in Figure 6c shows the accuracy of TensorFlow using the three framework settings.

i.5: Fig. CIFAR-10 1.36 CIFAR-10 1.47 IA-0wt t NS n IA-0default CIFAR-10 and MNIST its with CIFAR-10 Torch

Training Loss Torch MNIST 1.76

87.5 88.5 89.5 MNIST 87 88 89 90

0 1 3.47 0020 0040 0060 0080 0010000 9000 8000 7000 6000 5000 4000 3000 2000 1000 0 riigLs cnegne aeo af on Caffe of rate (convergence) Loss Training CIFAR-10 CIFAR-10 3.66 3.70

TrainingLossofCaffeonCIFAR-10 Accuracy (%) 100

20 40 60 80 Accuracy (%) 0 101 91 93 95 97 99 Dataset DefaultSettings TensorFlow Training Iteration

MNIST TensorFlow 69.76 MNIST (c) (c) 99.22 CIFAR-10 Accuracy Accuracy 87.00 CIFAR-10 99.31

Caffe MNIST 11.03

Caffe MNIST

CIFAR-10-Settings MNIST-Settings 99.13 CIFAR-10 CIFAR-10 91.79 75.52

Torch MNIST 66.40 Torch MNIST 99.18 CIFAR-10 65.61 CIFAR-10 99.17 sdb af n oc otano IA-0dataset. CIFAR-10 on train to when Torch time and training Caffe long by very used with poorly the default works CIFAR-10 setting number own is TensorFlow’s least three and epochs. setting structures training NN highlight default training of simple CIFAR-10 uses the We it own as that shortest 7. Caffe’s shows of Figure 7a time Figure in First, results observations. show of time. and setting both default training considering the dataset the accuracy. that MNIST and say twice time for to training/testing than optimal fair is higher more still Caffe a is of has it cost Caffe Thus, but the MNIST, setting, at default on MNIST TensorFlow using test when settings accuracy and highest default MNIST train the own to their simple. achieved using relatively Torch when is the is accuracy and structure that setting TensorFlow NN MNIST is its Second, Caffe reason and in smallest, One epochs the MNIST. training on of time We number time training 6. shorter testing second have Figure and frameworks the in three respectively, in all default MNIST bars setting, Caffe shown using 3 First, as observations. of two make Torch, set and third are and Caffe measurements the for same using The done TensorFlow settings. of framework accuracy three the shows 6c Figure et ecnuttesm xeiet o CIFAR-10 for experiments same the conduct we Next,

Fig. 6: Experimental Results on MNIST (Framework-dependent Default Settings on GPU): (a) Training Time, (b) Testing Time, (c) Accuracy.

Fig. 7: Experimental Results on CIFAR-10 (Framework-dependent Default Settings on GPU): (a) Training Time, (b) Testing Time, (c) Accuracy.

Caffe TF-MNIST 0.71 Caffe h xeietlrslswt framework-dependent with results settings. default present VIIc experimental Table dataset-dependent and the VIc of Table while performance settings, default the Table compared and VIb three Table VIIb these baseline. the of as CIFAR-10 settings serving frameworks, default and the with MNIST show results VIIa on Table experimental and far VIa Table so respectively. datasets reported results the 5. Figure those training to and the similar III.C is as Section reason in again The explained model converge. not DNN Caffe did a CIFAR-10, process own generate on TensorFlow’s to train using failed to setting is default Caffe than CIFAR-10 when training setting Finally, on cost time. huge a default at default, default CIFAR-10 CIFAR-10 own Torch’s CIFAR-10 TensorFlow’s own accuracy using higher much their achieved Torch However, offer using settings. Caffe of accuracy and TensorFlow complexity higher Third, of structures. levels framework-specific NN the their different time. to and testing attributed CIFAR-10, and be implementation training on can longer reason TensorFlow not much The by observe took does used we it setting because when 7b, default well Figure CIFAR-10 work and own Torch’s 7a that Figure from Second, Caffe-CIFAR-10 1.36 Caffe-MNIST 0.55 al IadTbeVIpoieasmayof summary a provide VII Table and VI Table Torch-CIFAR-10 0.58 Torch-MNIST 0.76


TABLE VI: Configurations for Training MNIST using TensorFlow (TF), Caffe and Torch

(a) Baseline Default Comparison
Framework | Training Time (s) | Testing Time (s) | Accuracy (%)
TF-CPU | 1114.34 | 2.73 | 99.28
Caffe-CPU | 512.18 | 3.33 | 99.03
Torch-CPU | 16096.62 | 56.62 | 99.20
TF-GPU | 68.51 | 0.26 | 99.22
Caffe-GPU | 97.02 | 0.55 | 99.13
Torch-GPU | 563.28 | 1.76 | 99.18

(b) Dataset-dependent Default Comparison
Framework (GPU) | Default Settings | Training Time (s) | Testing Time (s) | Accuracy (%)
TF | TF MNIST | 68.51 | 0.26 | 99.22
TF | TF CIFAR-10 | 14273.59 | 0.60 | 99.31
Caffe | Caffe MNIST | 97.02 | 0.55 | 99.13
Caffe | Caffe CIFAR-10 | 164.68 | 1.47 | 91.79
Torch | Torch MNIST | 563.28 | 1.76 | 99.18
Torch | Torch CIFAR-10 | 2978.52 | 3.70 | 99.17

(c) Framework Default Comparison
Framework (GPU) | Default Settings | Training Time (s) | Testing Time (s) | Accuracy (%)
TF | TF MNIST | 68.51 | 0.26 | 99.22
TF | Caffe MNIST | 21.32 | 0.12 | 98.51
TF | Torch MNIST | 176.23 | 0.13 | 99.10
Caffe | TF MNIST | 206.66 | 0.71 | 99.94
Caffe | Caffe MNIST | 97.02 | 0.55 | 99.13
Caffe | Torch MNIST | 235.57 | 0.76 | 94.14
Torch | TF MNIST | 321.63 | 1.53 | 99.11
Torch | Caffe MNIST | 187.54 | 1.37 | 98.78
Torch | Torch MNIST | 563.28 | 1.76 | 99.18

TABLE VII: Configurations for Training CIFAR-10 using TensorFlow (TF), Caffe and Torch

(a) Baseline Default Comparison
Framework | Training Time (s) | Testing Time (s) | Accuracy (%)
TF-CPU | 219169.14 | 4.80 | 86.90
Caffe-CPU | 1730.89 | 14.35 | 75.39
Torch-CPU | 38268.67 | 121.11 | 66.16
TF-GPU | 12477.05 | 2.34 | 87.00
Caffe-GPU | 163.51 | 1.36 | 75.52
Torch-GPU | 722.15 | 3.66 | 65.61

(b) Dataset-dependent Default Comparison
Framework (GPU) | Default Settings | Training Time (s) | Testing Time (s) | Accuracy (%)
TF | TF MNIST | 151.67 | 1.32 | 69.76
TF | TF CIFAR-10 | 12477.05 | 2.34 | 87.00
Caffe | Caffe MNIST | 115.30 | 0.64 | 11.03
Caffe | Caffe CIFAR-10 | 163.51 | 1.36 | 75.52
Torch | Torch MNIST | 638.00 | 3.47 | 66.40
Torch | Torch CIFAR-10 | 722.15 | 3.66 | 65.61

(c) Framework Default Comparison
Framework (GPU) | Default Settings | Training Time (s) | Testing Time (s) | Accuracy (%)
TF | TF CIFAR-10 | 12477.05 | 2.34 | 87.00
TF | Caffe CIFAR-10 | 32.98 | 1.40 | 55.96
TF | Torch CIFAR-10 | 2100.61 | 7.10 | 55.04
Caffe | TF CIFAR-10 | 33908.43 | 0.91 | 10.10
Caffe | Caffe CIFAR-10 | 163.51 | 1.36 | 75.52
Caffe | Torch CIFAR-10 | 682.58 | 0.58 | 59.27
Torch | TF CIFAR-10 | 126304.27 | 4.18 | 73.74
Torch | Caffe CIFAR-10 | 396.86 | 4.11 | 31.47
Torch | Torch CIFAR-10 | 722.15 | 3.66 | 65.61

Fig. 8: Experimental Results on Untargeted Attacks: (a) TensorFlow Model under Untargeted Attack, (b) Caffe Model under Untargeted Attack, (c) Success Rate Difference on Untargeted FGSM Attacks.

E. The Impact of Adversarial Behaviors

In this section, we generate adversarial examples in TensorFlow and Caffe on MNIST with their default settings and compare the effectiveness of these adversarial examples in terms of the success rate. We first use untargeted FGSM to launch adversarial attacks on the neural network (NN) models trained by TensorFlow and Caffe respectively, and measure and compare the attack success rates on these models. The parameter ε in the experiment is set to 0.001 (recall Formula (1) in Section II.C). Figure 8a and Figure 8b show the success rates of the ten digits for the TensorFlow trained DNN model and the Caffe trained DNN model respectively. It is observed that some digits tend to be crafted more easily into specific classes than into other classes. For instance, consider digit 5: for the TensorFlow trained MNIST model, the attack can successfully change its class to digit 3 with the highest probability, followed by digit 8, then digit 2, and with small probability to digit 9 and so forth. Similarly, for the Caffe trained MNIST model, the FGSM attack has a non-zero probability of misclassifying digit 5 to all other 9 classes with different probabilities, but the top 4 highest are the same as for TensorFlow: 3, 8, 2, 9. More in-depth analysis can be found in [28].

TABLE VIII: Average Crafting Time of Targeted Attacks on MNIST
Framework | TF | TF | Caffe | Caffe
Parameter | TF | Caffe | TF | Caffe
Average time | 113 min | 92 min | 187 min | 134 min

Fig. 9: Success Rate of Crafting digit 1 (into each of the other nine classes).
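The Jacobian-based targeted attacks measured in Table VIII perturb input features ranked by the saliency map of Equation (2). Below is a minimal NumPy sketch of that saliency computation, assuming the Jacobian J[i, j] = ∂F_j(x)/∂x_i has already been obtained from the framework's automatic differentiation:

```python
import numpy as np

def saliency_map(jacobian, t):
    """Targeted saliency map per Equation (2).

    jacobian: array with jacobian[i, j] = dF_j(x)/dx_i (assumed given).
    t: index of the target class.
    """
    dFt = jacobian[:, t]                      # dF_t(x)/dx_i for each feature i
    d_other = jacobian.sum(axis=1) - dFt      # sum over j != t of dF_j(x)/dx_i
    s = dFt * np.abs(d_other)                 # otherwise branch of Equation (2)
    s[(dFt < 0) | (d_other > 0)] = 0.0        # reject features per Equation (2)
    return s
```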

TABLE IX: Impact of Default Feature Maps/Regularization Methods on MNIST (success rate of crafting digit 1 into each target class)
Framework | third-layer | Regularization | 0 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
TF (TF) | 3136 → 1024 | drop out | 0.014 | 0.802 | 0.596 | 0.421 | 0.022 | 0.070 | 0.633 | 0.991 | 0.271
TF (Caffe) | 800 → 500 | drop out | 0.018 | 0.721 | 0.482 | 0.377 | 0.025 | 0.113 | 0.582 | 0.823 | 0.119
Caffe (TF) | 3136 → 1024 | weight decay | 0.584 | 0.893 | 0.802 | 0.721 | 0.046 | 0.533 | 0.912 | 0.925 | 0.327
Caffe (Caffe) | 800 → 500 | weight decay | 0.924 | 0.995 | 0.995 | 0.993 | 0.049 | 0.870 | 0.982 | 0.998 | 0.441

Figure 8c compares the average success rates of the ten digits by subtracting the success rate of the TensorFlow trained MNIST model (Figure 8a) from the success rate of the Caffe trained MNIST model (Figure 8b). We make an interesting observation: in general, the success rate of generating adversarial examples against the TensorFlow trained DNN model is lower than that of Caffe. The high success rate for digit 2 shows that digit 2 is the most likely class into which an untargeted adversarial example crafted by the FGSM method would fall.

We then study the impact of the Jacobian-based targeted attack on the deep learning models trained on MNIST by TensorFlow and Caffe respectively. To understand whether the size of the feature maps may impact the attack success rate, for these two sets of experiments we reduced the feature maps of both models at the third layer by the same percentage: for TensorFlow, the feature maps are reduced from 3136 to 1024, and for Caffe, the feature maps are condensed from 800 to 500. We compare the two different sizes of feature maps for both models. It is observed that the smaller number of feature maps tends to result in a much faster rate of crafting adversarial examples, no matter whether the network model is trained by TensorFlow or Caffe. The result is shown in Table VIII. First, the TensorFlow MNIST model is much faster than the Caffe MNIST model when all the parameters are the same (either using the TF parameters: 113 mins vs. 187 mins, or the Caffe parameters: 92 mins vs. 134 mins). Second, the smaller number of feature maps accelerates the crafting process even more: the Caffe MNIST model with 500 feature maps could generate twice as many adversarial examples as TensorFlow with 1024 feature maps in the same amount of time. For the MNIST dataset, the DNN model trained by TensorFlow is to some extent more robust against both types of attacks than the DNN model trained by Caffe. Figure 9 shows the success rate of crafting digit 1 into the other nine classes. Table IX shows the success rate results for digit 1 with the default regularization methods under different numbers of feature maps. The notation TF (TF) means that the TensorFlow framework uses the TensorFlow default parameters, and TF (Caffe) means that the TensorFlow framework uses the Caffe default parameter setting.

We observe that a larger number of feature maps, in most cases, introduces higher robustness regardless of the framework. Also, the TensorFlow trained model demonstrates higher robustness than that of Caffe. One possible reason is that the dropout in TensorFlow is a slightly weaker regularization than the weight decay in Caffe. Such a difference may affect the inductive bias of algorithms using one regularizer or the other. For further study on this subject, see [28], [27].

IV. RELATED WORK AND CONCLUSION

This paper rethinks the problems of benchmarking deep learning software frameworks. Several DL benchmark efforts have been put forward [8], [1], [7], [9], [3], [6], [2], [4], [5], each studying a small subset of the popular DL frameworks. However, these proposals do not examine model-specific parameters about both the neural network structure and the hyperparameters of DL frameworks, and their interactions with system runtime performance parameters.

We presented a comparative study of TensorFlow, Caffe and Torch with respect to training and testing time, learning and prediction accuracy, as well as model robustness against adversarial examples [29], [30], [18], [31], [32], [33], [17]. We highlight three observations from our in-depth experiments: (1) These deep learning software frameworks are optimized with their default configuration settings. However, the default configuration optimized on one specific dataset may not work effectively for other datasets. (2) The default configuration optimized for one framework to train on a dataset may not work well when used by another DL framework to train on the same dataset. (3) Different DL frameworks exhibit different levels of robustness against adversarial examples. Our study demonstrates that benchmarking deep learning software frameworks is significantly more challenging than traditional performance-driven benchmarks.

REFERENCES

[1] S. Bahrampour, N. Ramakrishnan, L. Schott, and M. Shah, "Comparative study of caffe, neon, theano, and torch for deep learning," CoRR, vol. abs/1511.06435, 2015. [Online]. Available: http://arxiv.org/abs/1511.06435
[2] S. Shams, R. Platania, K. Lee, and S. J. Park, "Evaluation of deep learning frameworks over different hpc architectures," in 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS), June 2017, pp. 1389–1396.
[3] P. Xu, S. Shi, and X. Chu, "Performance evaluation of deep learning tools in docker containers," ArXiv e-prints, Nov. 2017.
[4] C. Coleman, D. Narayanan, D. Kang, T. Zhao, J. Zhang, L. Nardi, P. Bailis, K. Olukotun, C. Ré, and M. Zaharia, "Dawnbench: An end-to-end deep learning benchmark and competition," in NIPS ML Systems Workshop, 2017.
[5] B. Research, "Benchmarking deep learning operations on different hardware," https://github.com/baidu-research/DeepBench, 2017, [Online; accessed 28-Jan-2018].
[6] H. Kim, H. Nam, W. Jung, and J. Lee, "Performance analysis of cnn frameworks for gpus," in 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), April 2017, pp. 55–64.
[7] S. Shi, Q. Wang, P. Xu, and X. Chu, "Benchmarking state-of-the-art deep learning software tools," CoRR, vol. abs/1608.07249, 2016. [Online]. Available: http://arxiv.org/abs/1608.07249
[8] S. Kovalev, "Performance of deep learning frameworks: Caffe, deeplearning4j, tensorflow, theano, and torch," https://www.altoros.com/performance-deep-learning-frameworks-caffe-deeplearning4j-tensorflow-theano-torch.html, 2016, [Online; accessed 04-Dec-2017].
[9] S. Shi and X. Chu, "Performance modeling and evaluation of distributed deep learning frameworks on gpus," ArXiv e-prints, Nov. 2017.
[10] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. J. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Józefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. G. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. A. Tucker, V. Vanhoucke, V. Vasudevan, F. B. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "Tensorflow: Large-scale machine learning on heterogeneous distributed systems," CoRR, vol. abs/1603.04467, 2016. [Online]. Available: http://arxiv.org/abs/1603.04467
[11] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, "Caffe: Convolutional architecture for fast feature embedding," in Proceedings of the 22nd ACM International Conference on Multimedia, ser. MM '14. New York, NY, USA: ACM, 2014, pp. 675–678. [Online]. Available: http://doi.acm.org/10.1145/2647868.2654889
[12] R. Collobert and K. Kavukcuoglu, "Torch7: A matlab-like environment for machine learning," in BigLearn, NIPS Workshop, 2011.
[13] L. Bottou, "Stochastic gradient descent tricks," in Neural Networks: Tricks of the Trade. Springer, 2012, pp. 421–436.
[14] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," CoRR, vol. abs/1412.6980, 2014. [Online]. Available: http://arxiv.org/abs/1412.6980
[15] L. Developers, "The programming language lua," https://www.lua.org/, 2018, [Online; accessed 15-Jan-2018].
[16] Y. Jia, "Caffe model zoo," http://caffe.berkeleyvision.org/model_zoo.html, 2017, [Online; accessed 04-Dec-2017].
[17] I. J. Goodfellow, J. Shlens, and C. Szegedy, "Explaining and harnessing adversarial examples," arXiv preprint arXiv:1412.6572, 2014.
[18] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami, "The limitations of deep learning in adversarial settings," in 2016 IEEE European Symposium on Security and Privacy (EuroS&P). IEEE, 2016, pp. 372–387.
[19] X. W. Chen and X. Lin, "Big data deep learning: Challenges and perspectives," IEEE Access, vol. 2, pp. 514–525, 2014.
[20] O. Y. Al-Jarrah, P. D. Yoo, S. Muhaidat, G. K. Karagiannidis, and K. Taha, "Efficient machine learning for big data: A review," Big Data Research, vol. 2, no. 3, pp. 87–93, 2015. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S2214579615000271
[21] T. Condie, P. Mineiro, N. Polyzotis, and M. Weimer, "Machine learning on big data," in 2013 IEEE 29th International Conference on Data Engineering (ICDE), April 2013, pp. 1242–1244.
[22] D. C. Cireşan, U. Meier, L. M. Gambardella, and J. Schmidhuber, "Deep, big, simple neural nets for handwritten digit recognition," Neural Computation, vol. 22, no. 12, pp. 3207–3220, 2010. [Online]. Available: https://doi.org/10.1162/NECO_a_00052
[23] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov 1998.
[24] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," 2009.
[25] T. Developers, "Convolutional layers - SpatialConvolutionMap," https://nn.readthedocs.io/en/rtd/convolution/index.html#nn.SpatialConvolutionMap, 2018, [Online; accessed 15-Jan-2018].
[26] T. Developers, "SpatialConvolutionMM source code," https://github.com/torch/nn/blob/master/SpatialConvolutionMM.lua, 2018, [Online; accessed 15-Jan-2018].
[27] Y. Wu, L. Liu, C. Pu, and W. Wei, "GIT DLBench: A benchmark suite for deep learning frameworks: Characterizing performance, accuracy and adversarial robustness," Georgia Institute of Technology, School of Computer Science, Tech. Rep., 02 2018.
[28] W. Wei, L. Liu, S. Truex, L. Yu, E. Gursoy, and Y. Wu, "Demystifying adversarial behaviors in deep learning," Georgia Institute of Technology, School of Computer Science, Tech. Rep., 02 2018.
[29] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, "Intriguing properties of neural networks," arXiv preprint arXiv:1312.6199, 2013.
[30] A. Nguyen, J. Yosinski, and J. Clune, "Deep neural networks are easily fooled: High confidence predictions for unrecognizable images," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 427–436.
[31] R. Feinman, R. R. Curtin, S. Shintre, and A. B. Gardner, "Detecting adversarial samples from artifacts," arXiv preprint arXiv:1703.00410, 2017.
[32] N. Carlini and D. Wagner, "Towards evaluating the robustness of neural networks," in 2017 IEEE Symposium on Security and Privacy (SP). IEEE, 2017, pp. 39–57.
[33] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, "Towards deep learning models resistant to adversarial attacks," arXiv preprint arXiv:1706.06083, 2017.
