
DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS, STOCKHOLM, SWEDEN 2018

Benchmarking TensorFlow on a personal computer not specialised for machine learning

SEBASTIAN GEDIN

JAKOB HOLM

KTH SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Bachelor of Computer Science
Date: June 1, 2018
Supervisor: Stefano Markidis
Examiner: Örjan Ekeberg
Swedish title: Prestandatestning av TensorFlow på en dator ej specialiserad för maskininlärning
School of Electrical Engineering and Computer Science (EECS)


Abstract

Many recent advancements of modern technologies can be attributed to the rapid growth of the machine learning field, and especially of deep learning. A big challenge for deep learning is that the learning process can be very time-consuming. TensorFlow is a framework which allows developers to make use of GPUs and other processing units in order to tackle this and other tasks involved in machine learning. In this study we benchmark and investigate the performance of TensorFlow, in terms of images per second, on a personal computer not specialised for machine learning. We investigate how the performance when training different convolutional neural networks is affected by batch size and available GPU memory capacity. We also profile the execution. Our results suggest that increasing the memory capacity of the GPU can be beneficial both for being able to train networks using larger batch sizes and for improving performance in some situations. We also conclude that improving the GPU, rather than the CPU, has greater potential for improving performance, but to what extent differs significantly between network models. Furthermore, using a power of two for batch size appears to be advantageous for performance.

Sammanfattning

Advances in machine learning, and especially deep learning, have given rise to many technological achievements in recent years. One problem in deep learning, however, is that the training phase can be very time-consuming. TensorFlow is a framework developed to solve exactly this problem, by making it easy for developers to use graphics cards and other kinds of processors for the computations used in machine learning. In this study we benchmark TensorFlow by finding the maximum number of images per second when training different convolutional neural networks on a computer not built for machine learning. We investigate how performance is affected by batch size and the amount of available graphics memory. In addition, we also profile the execution of the benchmarks. Our results indicate that a larger amount of graphics memory both enables larger batch sizes during training and leads to higher performance in some situations. We also draw the conclusion that improvements to the graphics card have greater potential than improvements to the processor for increasing performance, but to what degree varies considerably between the different convolutional networks. Furthermore, it appears to be advantageous for performance to use a power of two as the batch size.

Contents

1 Introduction
  1.1 Research Question

2 Background
  2.1 TensorFlow
    2.1.1 Dataflow graphs
    2.1.2 Interface
    2.1.3 Execution
  2.2 Processing units for deep learning

3 Methods
  3.1 Environment
  3.2 Benchmarking Script
  3.3 Investigation
    3.3.1 Batch size
    3.3.2 Benchmark
    3.3.3 GPU Memory capacity
    3.3.4 Profiling

4 Results
  4.1 Batch size
  4.2 Benchmarks
  4.3 GPU memory capacity
  4.4 Profiling

5 Discussion
  5.1 GPU vs. CPU
  5.2 GPU Memory Capacity
  5.3 Limitations and future work


6 Conclusions

Bibliography

Chapter 1

Introduction

Recent advancements of many modern technologies, such as the improved cameras in mobile phones, smart personal digital assistants and especially autonomous vehicles, can be attributed to the rapid growth of the machine learning field [1]. Several approaches exist within the field of machine learning, one of which is called deep learning, and in recent years it has proven to be highly effective [5]. A common application of deep learning is image recognition using deep neural networks; one may for example use deep neural networks to classify pictures of suspected skin cancer as either malignant or benign [4]. However, the neural network cannot perform this task without initial training, which means processing large amounts of already classified images. One of the biggest challenges in deep learning is that this training process can be very time-consuming. In 2012, Krizhevsky et al. published a seminal paper on image classification using deep learning, in which they showed that GPUs can be used to significantly speed up training [9]. Following that discovery, many different frameworks have been developed to allow developers to easily utilise the computational power of one or more GPUs in their machine learning programs. One of these frameworks, which has become very popular in the last couple of years, is TensorFlow. Based on GitHub and Stack Overflow activity, as well as search activity, TensorFlow is dominating the market for deep learning libraries [2]. TensorFlow operates on large-scale data sets in heterogeneous environments [1]. Machine learning models are represented as dataflow graphs in which nodes describe operations and edges between nodes represent the data flowing between operations. Scalability has been


taken into great consideration in the design of TensorFlow, meaning that as the number of CPUs and GPUs in computers rises, the performance of TensorFlow will increase. To achieve this, TensorFlow has algorithms that assign and distribute the work on machines that contain multiple CPUs and GPUs [5].

1.1 Research Question

As open-source software, TensorFlow is likely to be used by many students and hobby programmers. It is therefore relevant to investigate how TensorFlow performs when run on computers that such users may have access to: personal computers that are not specialised for the workload involved in machine learning computations. The purpose of this study is to provide benchmarks for TensorFlow performance on such a computer and to identify limiting factors on performance, knowledge of which is crucial for efficiently upgrading hardware and software to achieve better performance. This study therefore aims to answer the following question: How does TensorFlow perform on a personal computer not specialised for machine learning and what are the limiting factors?

Chapter 2

Background

2.1 TensorFlow

In this section we discuss the principles of TensorFlow. We describe what dataflow graphs are and how they are used in TensorFlow to model computations involved in machine learning algorithms. We then continue to describe the interface provided by TensorFlow to allow users to create such graphs. Finally, we explain how dataflow graphs are mapped to the available processing units for execution.

2.1.1 Dataflow graphs

A dataflow or computational graph is a directed graph in which nodes represent units of calculation and edges represent data being consumed and produced by those units. As a simple example, if z is to be the result of adding x and y, a dataflow graph like the one in Figure 2.1 can be used to represent this computation. Note that, as in this example, a variable or constant is considered to be a unit of calculation and is therefore represented by a node in the graph. This is one of the strengths of a dataflow graph: what a node represents can be very general. A node can therefore represent many different things, such as constants, variables, mathematical operations, functions or I/O operations: basically anything with zero or more inputs which produces zero or more outputs. In this sense, a constant is a unit of calculation with zero inputs which always produces one output corresponding to the constant it represents.


Figure 2.1: Example of a dataflow graph representing the addition of two input variables x and y with the result z.
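The graph in Figure 2.1 can be sketched in a few lines of plain Python (a hypothetical minimal graph structure for illustration only, not TensorFlow's actual API; the `Node` and `constant` names are our own):

```python
# Minimal dataflow-graph sketch: nodes are units of calculation, edges are
# the inputs each node consumes. A node can run once its inputs are ready.

class Node:
    def __init__(self, op, *inputs):
        self.op = op          # callable producing this node's output
        self.inputs = inputs  # upstream nodes whose outputs we consume

    def evaluate(self):
        # Evaluate dependencies first; in TensorFlow, independent
        # subgraphs such as x and y could be executed in parallel.
        return self.op(*(n.evaluate() for n in self.inputs))

def constant(value):
    # A constant is a node with zero inputs that always yields one output.
    return Node(lambda: value)

x = constant(2)
y = constant(3)
z = Node(lambda a, b: a + b, x, y)  # the graph in Figure 2.1

print(z.evaluate())  # 5
```

The key property the sketch demonstrates is that `z` only describes *how* to compute the result; nothing runs until `evaluate` is called, mirroring TensorFlow's separation of graph construction and execution.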

2.1.2 Interface

The following paragraphs outline how dataflow graphs are used in TensorFlow to model computations involved in machine learning algorithms and how a user can construct such graphs.

Operations In TensorFlow, units of calculation are referred to as operations. Each operation is associated with an implementation, which is referred to as a kernel. The kernel specifies how to execute the operation and on which kind of hardware device (e.g., CPU or GPU) the implementation can be executed.

Tensors The data flowing along the edges of the graph is modelled as tensors. Tensors are n-dimensional arrays with homogeneous elements of primitive types (such as int32, float32 or string). The number of dimensions of the array representing the tensor is the rank of the tensor, and a tensor's shape describes the number of elements in each dimension. For example, scalars, vectors and matrices are represented as tensors of rank 0, 1 and 2 respectively, and a 3 × 4 matrix has the shape [3,4]. Operations can be seen as calculations that may consume and/or produce tensors.
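Rank and shape can be illustrated with nested Python lists (an illustrative sketch only; TensorFlow does not store tensors this way, and `shape_of` is a hypothetical helper):

```python
# Derive the shape of a nested-list "tensor" by walking down its first
# element at each level; the rank is simply the length of the shape.

def shape_of(tensor):
    shape = []
    while isinstance(tensor, list):
        shape.append(len(tensor))
        tensor = tensor[0]
    return shape

scalar = 7                            # rank 0, shape []
vector = [1, 2, 3]                    # rank 1, shape [3]
matrix = [[0] * 4 for _ in range(3)]  # rank 2, shape [3, 4]

print(shape_of(matrix))       # [3, 4]
print(len(shape_of(matrix)))  # rank: 2
```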

Sessions A session is used to execute the computations defined in graphs. The purpose of a session is to provide a connection between the client program and the C++ runtime. It allocates the physical resources used for the computations and holds any intermediate values during execution. The graph, consisting of operations and tensors, defines the computations, while the session is used to execute them.

Variables Variables are used to represent a shared, persistent state that can be manipulated by the program. They can be accessed and modified by multiple sessions simultaneously. As with other TensorFlow operations, the interface makes it possible to place variables on particular devices, which gives the user ample control over memory management.

2.1.3 Execution

The strength of using dataflow graphs for modelling machine learning algorithms comes from how they facilitate parallel and scalable execution. If the inputs of a node are available, the operation for that node can be executed independently of the rest of the graph, which allows for parallel execution of such operations. The execution is also scalable since the dataflow graph does not depend on how operations are performed; as long as the input is transformed to the correct output, it does not matter how the operation was implemented or on what device it was executed. Operations can even be distributed among several processing units of different types. In TensorFlow, execution is divided among the client, the master, a set of workers and the available devices. The client is the program written by the user that defines the computational graph. It requests evaluation of the graph by creating a session, which is sent to the master. The master is responsible for partitioning the graph and delegating the partitions to the workers. Each worker is in turn responsible for one or more devices, which are the available physical processing units. Workers use the kernels of the operations to determine how and on what device they can be executed. A device is often a CPU or a GPU, but TensorFlow also allows the user to register additional types of execution units, such as the tensor processing unit (TPU), a special processing unit developed by Google that is optimised for tensor computations [13].
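The kernel-based placement described above might be sketched as follows (a hypothetical toy scheduler, not TensorFlow's actual placement algorithm; the device names and the kernel table are assumptions for illustration):

```python
# Toy device placement: each operation's kernels determine which device
# types can run it; devices are tried in order of preference.

AVAILABLE_DEVICES = ["gpu:0", "cpu:0"]  # listed in order of preference

# Hypothetical table: operation name -> device types with a kernel for it.
KERNELS = {
    "Conv2D": {"gpu", "cpu"},
    "MatMul": {"gpu", "cpu"},
    "ReadFile": {"cpu"},  # I/O ops typically only have CPU kernels
}

def place(op_name):
    for device in AVAILABLE_DEVICES:
        if device.split(":")[0] in KERNELS[op_name]:
            return device
    raise ValueError(f"no kernel for {op_name}")

assignment = {op: place(op) for op in KERNELS}
print(assignment)
# {'Conv2D': 'gpu:0', 'MatMul': 'gpu:0', 'ReadFile': 'cpu:0'}
```

The sketch captures why the dataflow model scales: the graph itself never changes, only the mapping of operations onto whatever devices happen to be available.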

2.2 Processing units for deep learning

Most computers have two different kinds of processing units: a central processing unit (CPU) and a graphics processing unit (GPU). The CPU consists of one or a few cores and is optimised for serial processing, while a GPU has thousands of small cores and is designed to efficiently handle many small tasks in parallel. Traditionally the GPU, as

suggested by its name, is used to manipulate computer graphics, but the emergence of deep learning has given rise to a new type of work for GPUs. As explained above, the execution of a dataflow graph involves many independent calculations that can be run in parallel. Thus, GPUs have much to offer when it comes to speeding up the execution of such graphs. In 2012 Alex Krizhevsky and his team trained AlexNet in five to six days on two consumer-grade gaming GPUs [9]. Today the same task can be done in 18 minutes [12]. This improvement can largely be attributed to the development of processing units specialised for machine learning computations. Gaming GPUs are built with rendering graphics in mind, but the aforementioned 18-minute training was not done on gaming GPUs; instead it was done on a DGX-2 system, which is built with AI and deep learning in mind. As an alternative to a GPU, Google has developed the TPU, a processing unit specialised for working with neural networks. One way in which the TPU is optimised for neural network calculations is called quantization. It allows neural network calculations to be done with 8-bit integers (rather than the 32-bit or 16-bit floating-point numbers used by most CPUs and GPUs) while still maintaining acceptable accuracy. Google claims that by using quantization the memory needed for the image recognition model Inception is reduced to one-fourth of its original size [13].

Chapter 3

Methods

3.1 Environment

All tests are conducted on a system with the specifications listed below. In order to get reliable measurements we also decided to run TensorFlow over SSH, thus avoiding any load on the GPU and CPU from processes that would otherwise run when operating the computer through the graphical user interface.

• CPU: Intel® Core™ i5-2500K @ 3.30 GHz

• Installed memory (RAM): 8 GB

• GPU: 1 x NVIDIA® GeForce® GTX 970

• OS: Ubuntu 16.04.4 LTS

• CUDA / cuDNN: 9.0 / 7.0.5

• TensorFlow version: 1.7.0-dev20180318

• Python version: Python 3.6.1 :: Anaconda 4.4.0 (64-bit)

• Benchmark GitHub hash: 5ed3045

• Disk: Intel® SSDSA2M040G2GC @ 40 GB


3.2 Benchmarking Script

We measured the performance of TensorFlow by using a benchmarking script which is also used by TensorFlow developers [3]. The script measures the performance, in terms of images processed per second during network training, when using TensorFlow to train various convolutional neural networks (CNNs). Unless otherwise stated, 'performance' in this paper refers to the speed at which images are processed during training. The script allows us to specify the network model to use as well as the batch size for training. For each run of the script, 10 warm-up steps are done, followed by 100 steps for which the performance is measured in terms of images per second. In each step the network is trained with one batch of images. By default, the script uses synthetic data for training, but it can be run with real data if supplied by the user. Due to hardware limitations we have conducted our tests on synthetic data; running the script using actual ImageNet data would have required more disk space than our environment had available. The script is open-source and available on the TensorFlow GitHub page¹.
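A typical invocation might look like the following (a sketch: `--model` selects the network and `--batch_size` the batch size, as described in Section 3.3.1; the exact set of flags depends on the version of the benchmarks repository listed in Section 3.1):

```shell
# Hypothetical example run: train ResNet-50 with batch size 32 using the
# script's defaults of 10 warm-up steps and 100 measured steps on
# synthetic data.
python tf_cnn_benchmarks.py --model=resnet50 --batch_size=32
```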

3.3 Investigation

In order to identify potential limiting factors when running TensorFlow on our setup we investigated the performance in four different ways. Firstly, we investigated how the performance depends on the batch size, identifying the maximum batch size possible for training with the setup and at which batch size the performance peaks. Secondly, we benchmarked the performance at the batch size where performance was found to peak. Thirdly, we investigated how limiting the available memory on the GPU affects the performance. Finally, we profiled the training performed by the script. All tests were performed for the following network models: InceptionV3 [15], ResNet-50 [6], ResNet-152 [6], AlexNet [10] and VGG16 [14]. The following sections describe in detail how the four investigations were conducted.

¹https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks

3.3.1 Batch size

The batch size is the number of images processed simultaneously in each iteration of training. There is a non-trivial relationship between the batch size and the performance of TensorFlow (note that the batch size also affects the accuracy of the trained model, but we are not concerned with that in this study). The batch size to be used in the script can easily be specified via the --batch_size flag. By using this flag we found, for each network on our test environment, the maximum batch size at which the script is able to execute. We measured performance for each batch size from one up to the maximum batch size for all network models except AlexNet, for which we ran the script on batch sizes from two up to the maximum in steps of two in order to save time (training AlexNet allows significantly higher batch sizes than the other network models).

3.3.2 Benchmark

The script was used to benchmark the performance of the system with the configuration described in Section 3.1. The results from Section 3.3.1 were used to determine the batch size at which each network model had optimal performance on our setup. The purpose of this part of the investigation is to provide best-case benchmarks for our setup on various network models that can be used for comparison with other setups. In order to get a reliable result and to investigate potential fluctuations in the performance, we run the script 10 times per network and take the arithmetic average of the results. We do this by modifying the script to output the time for each test and then calculating the total number of images processed in the 10 runs. The total number of images processed divided by the sum of the times from the 10 runs gives the arithmetic mean of the performance in images/sec. We also calculate the standard error of the mean using the times collected from each of the 10 runs.
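The aggregation described above can be sketched as follows (the per-run times below are illustrative placeholders, not our measured values):

```python
# Sketch of the benchmark aggregation: 10 runs, each processing a fixed
# number of images (100 steps at a given batch size).

from statistics import stdev

IMAGES_PER_RUN = 100 * 32  # 100 measured steps at batch size 32
times = [57.4, 57.5, 57.3, 57.6, 57.4,
         57.5, 57.4, 57.3, 57.5, 57.4]  # seconds per run (illustrative)

# Arithmetic mean of the performance: total images / total time.
images_per_sec = len(times) * IMAGES_PER_RUN / sum(times)

# Standard error of the mean, from the per-run rates.
rates = [IMAGES_PER_RUN / t for t in times]
sem = stdev(rates) / len(rates) ** 0.5

print(f"{images_per_sec:.2f} ± {sem:.2f} images/sec")
```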

3.3.3 GPU Memory capacity

We endeavoured to test how GPU memory capacity impacts the performance of TensorFlow. The script makes such tests possible, without us needing to use many different physical GPUs, through the

--gpu_memory_frac_for_testing flag. By utilising this flag we could specify the fraction of the total GPU memory that was to be used by the script. We verified that this flag worked as intended by monitoring the memory usage of the GPU while running the script. For each network, we found a batch size equal to a power of two which did not result in a warning message stating that memory shortage could be influencing performance (see Section 4.1). Our intention was to use a batch size for which memory was abundant when using 100% of the available GPU memory. We then found the lowest fraction of GPU memory for which the script would still execute at that batch size. We then ran the script making various amounts of memory available for execution. The granularity of our measurements was adjusted based on when the available amount of memory seemed to influence performance. For all networks we increased the granularity close to the lower limit of available memory needed to run the script, because we expected memory capacity to have the greatest effect on performance in that range.

3.3.4 Profiling

TensorFlow allows users to profile the execution. Through the script's --tfprof_file flag we gained access to this data. The script saves the relevant profiling information in a proto file. The profiler gives information about which nodes are used when training the various network models and how much time and memory is spent on each node. It also states how much of the execution time for each node is spent on the accelerator (GPU) versus the CPU. The profiler divides the result into five columns: node, requested bytes, total execution time, accelerator execution time and CPU execution time. All five network models were profiled in order to identify similarities and differences in the workload.

Chapter 4

Results

The results section is divided into four parts corresponding to the different areas of investigation described in Section 3.3.

4.1 Batch size

In this section we present the data from our batch size tests. The data presented in Figure 4.1 is from varying the batch size within the limitations of our test environment; see Table 4.1 for the upper limit for each network model. The performance of all network models follows the same trend as the batch size changes: it generally increases as the batch size increases, peaks at batch sizes equal to a power of two, and drops to some extent close to the batch size limit. After peaking at a power of two, the performance

Table 4.1: Batch size (upper) limit for the different network models.

Network      Batch size
AlexNet      753
InceptionV3  36
ResNet-50    42
ResNet-152   17
VGG16        34


Figure 4.1: The performance on the benchmark script for different network models with varying batch size.


is reduced immediately at the next batch size. The peaks in performance at batch sizes equal to a power of two (in our case 16, 32 and 512) could be due to how memory is organised: memory architecture in both CPUs and GPUs is usually organised in powers of two. It is interesting to note that the largest batch size does not lead to the best result. On the contrary, the performance drops at the largest batch sizes. How close to the limit this drop occurs and its magnitude vary among the different network models. This may be due to memory limitations within the GPU, which is suggested by a warning message that reads: "Allocator (G_0_bfc) ran out of memory trying to allocate 1.52GiB. The caller indicates that this is not a failure, but

may mean that there could be performance gains if more memory were available." The message occurs for larger batch sizes for all network models, but the batch size at which the message first occurs does not correspond to where the performance starts to drop.

4.2 Benchmarks

The average peak performance for each network on our setup is presented in Table 4.2. The relative uncertainty of around 0.03% or less for all of the measurements suggests that the performance has not been significantly affected by any momentary external factors.

Table 4.2: Performance for each network model with optimal batch size (in terms of images/sec).

Network      Batch size  Images/sec
AlexNet      512         1119.74 ± 0.32
InceptionV3  32          55.73 ± 0.01
ResNet-50    32          82.24 ± 0.02
ResNet-152   16          31.93 ± 0.01
VGG16        32          52.49 ± 0.01

4.3 GPU memory capacity

The batch size used for each network model in these tests and the lowest memory fraction for which the script could be executed are presented in Table 4.3. Figure 4.2 shows how the performance is affected when limiting the amount of available memory. Measurements are taken from 100% of total memory down to 10% above the lower limit presented in Table 4.3 in steps of 20%; the step size is then reduced to 1% for the final 10 measurements. Limiting the memory capacity seems to affect the performance of training AlexNet significantly, but not that of any of the other network models.

Table 4.3: Batch size and the lowest percentage of GPU memory for which the benchmark script still executes.

Network      Batch size  Memory limit (%)
AlexNet      256         40
InceptionV3  8           24
ResNet-50    8           22
ResNet-152   4           30
VGG16        4           43

Figure 4.2: Graphs demonstrating the impact on performance for different network models from only having a fraction of GPU memory available.


Table 4.4: GPU proportion of total execution time for each network model.

Network      GPU execution time
AlexNet      98%
VGG16        98%
ResNet-50    75%
InceptionV3  67%
ResNet-152   58%

4.4 Profiling

Profiles of each network model are presented in Tables 4.5-4.9. The batch sizes that resulted in the best performance, as found in Section 4.1, were used when profiling the execution. Using the profiling information we can deduce that for all network models the bulk of the execution is performed on the GPU rather than the CPU, although the ratio varies between the different network models (see Table 4.4). This suggests that improving the performance of the GPU will be more beneficial than improving that of the CPU. The large difference in the ratio of GPU to CPU work between the network models (58% for ResNet-152 versus 98% for AlexNet and VGG16) can be attributed to differences in architecture depth and computational complexity, but a more in-depth analysis of such differences is out of the scope of this investigation. For all network models the three nodes Conv2DBackpropFilter, Conv2DBackpropInput and Conv2D request the most memory and have the highest total and GPU execution times. These nodes are responsible for various key computations in training CNNs.

Table 4.5: Profile of ResNet50 with batch size 32.

Node name  Requested bytes  Total execution time  GPU execution time  CPU execution time
Conv2DBackpropFilter  1733.55MB (10.87%)  121.66ms (23.73%)  94.38ms (24.53%)  27.26ms (21.36%)
Conv2DBackpropInput  5436.45MB (34.08%)  88.99ms (17.36%)  68.77ms (17.87%)  20.21ms (15.83%)
Conv2D  5692.69MB (35.69%)  68.25ms (13.31%)  64.10ms (16.66%)  4.13ms (3.23%)
FusedBatchNormGrad  1399.98MB (8.78%)  62.79ms (12.25%)  44.09ms (11.46%)  18.68ms (14.64%)
FusedBatchNorm  1395.88MB (8.75%)  30.95ms (6.04%)  28.56ms (7.42%)  2.37ms (1.85%)
ReluGrad  0B (0.00%)  30.67ms (5.98%)  25.07ms (6.51%)  5.58ms (4.37%)
AddN  0B (0.00%)  36.21ms (7.06%)  17.27ms (4.49%)  18.90ms (14.81%)
Relu  0B (0.00%)  17.25ms (3.36%)  16.37ms (4.25%)  864µs (0.68%)
Add  0B (0.00%)  14.95ms (2.92%)  14.63ms (3.80%)  309µs (0.24%)
MaxPoolGrad  102.76MB (0.64%)  3.08ms (0.60%)  3.02ms (0.79%)  57µs (0.04%)
ApplyGradientDescent  0B (0.00%)  17.22ms (3.36%)  2.37ms (0.61%)  14.83ms (11.62%)
Mul  104.08MB (0.65%)  7.53ms (1.47%)  1.82ms (0.47%)  5.68ms (4.45%)
L2Loss  142.08KB (0.00%)  5.85ms (1.14%)  1.25ms (0.33%)  4.57ms (3.58%)
MaxPool  25.69MB (0.16%)  1.18ms (0.23%)  1.14ms (0.30%)  39µs (0.03%)
Transpose  19.27MB (0.12%)  467µs (0.09%)  430µs (0.11%)  37µs (0.03%)
Pad  20.31MB (0.13%)  373µs (0.07%)  327µs (0.08%)  46µs (0.04%)
Tile  12.85MB (0.08%)  340µs (0.07%)  316µs (0.08%)  23µs (0.02%)
AssignSub  0B (0.00%)  2.10ms (0.41%)  224µs (0.06%)  1.85ms (1.45%)
Sub  0B (0.00%)  2.25ms (0.44%)  222µs (0.06%)  2.01ms (1.57%)
SparseSoftmaxCrossEntropyWithLogits  768B (0.00%)  309µs (0.06%)  219µs (0.06%)  89µs (0.07%)
MatMul  8.63MB (0.05%)  304µs (0.06%)  213µs (0.06%)  90µs (0.07%)

Table 4.6: Profile of AlexNet with batch size 512.

Node name  Requested bytes  Total execution time  GPU execution time  CPU execution time
Conv2DBackpropFilter  6670.88MB (26.64%)  126.59ms (27.18%)  125.73ms (27.40%)  852µs (12.45%)
Conv2D  9062.38MB (36.20%)  100.62ms (21.61%)  99.77ms (21.75%)  847µs (12.38%)
Conv2DBackpropInput  7464.78MB (29.82%)  78.12ms (16.77%)  77.38ms (16.87%)  734µs (10.73%)
MatMul  288.98MB (1.15%)  45.50ms (9.77%)  45.09ms (9.83%)  408µs (5.96%)
ReluGrad  0B (0.00%)  23.81ms (5.11%)  23.58ms (5.14%)  227µs (3.32%)
BiasAdd  0B (0.00%)  16.86ms (3.62%)  16.61ms (3.62%)  244µs (3.57%)
Relu  0B (0.00%)  15.10ms (3.24%)  14.91ms (3.25%)  190µs (2.78%)
MaxPoolGrad  771.75MB (3.08%)  14.89ms (3.20%)  14.71ms (3.21%)  180µs (2.63%)
BiasAddGrad  43.52KB (0.00%)  10.03ms (2.15%)  9.66ms (2.11%)  366µs (5.35%)
MaxPool  196.31MB (0.78%)  8.64ms (1.85%)  8.49ms (1.85%)  147µs (2.15%)
Transpose  316.59MB (1.26%)  7.09ms (1.52%)  7.05ms (1.54%)  42µs (0.61%)
AddN  0B (0.00%)  5.69ms (1.22%)  5.15ms (1.12%)  539µs (7.88%)
ApplyGradientDescent  0B (0.00%)  5.63ms (1.21%)  5.12ms (1.12%)  508µs (7.43%)
Mul  265.03MB (1.06%)  4.06ms (0.87%)  3.55ms (0.77%)  506µs (7.40%)
L2Loss  35.84KB (0.00%)  2.31ms (0.50%)  1.67ms (0.36%)  640µs (9.36%)
SparseSoftmaxCrossEntropyWithLogits  6.14KB (0.00%)  514µs (0.11%)  338µs (0.07%)  175µs (2.56%)
RandomUniformInt  2.05KB (0.00%)  105µs (0.02%)  7µs (0.00%)  98µs (1.43%)
Sum  256B (0.00%)  44µs (0.01%)  3µs (0.00%)  41µs (0.60%)
Select  0B (0.00%)  38µs (0.01%)  3µs (0.00%)  35µs (0.51%)
RealDiv  0B (0.00%)  32µs (0.01%)  2µs (0.00%)  30µs (0.44%)
Add  0B (0.00%)  34µs (0.01%)  2µs (0.00%)  32µs (0.47%)

Table 4.7: Profile of InceptionV3 with batch size 32.

Node name  Requested bytes  Total execution time  GPU execution time  CPU execution time
Conv2DBackpropFilter  2469.67MB (12.17%)  219.63ms (26.32%)  154.18ms (27.60%)  65.40ms (23.78%)
Conv2D  4083.31MB (20.12%)  118.63ms (14.22%)  108.59ms (19.44%)  10.00ms (3.63%)
Conv2DBackpropInput  8280.24MB (40.79%)  167.32ms (20.06%)  107.92ms (19.32%)  59.36ms (21.58%)
FusedBatchNormGrad  1197.85MB (5.90%)  68.69ms (8.23%)  37.98ms (6.80%)  30.66ms (11.15%)
AvgPoolGrad  1498.48MB (7.38%)  26.51ms (3.18%)  26.13ms (4.68%)  376µs (0.14%)
ReluGrad  0B (0.00%)  35.06ms (4.20%)  24.57ms (4.40%)  10.45ms (3.80%)
FusedBatchNorm  1179.38MB (5.81%)  28.60ms (3.43%)  24.22ms (4.33%)  4.35ms (1.58%)
Relu  0B (0.00%)  17.98ms (2.15%)  16.22ms (2.90%)  1.72ms (0.62%)
AddN  0B (0.00%)  39.38ms (4.72%)  13.16ms (2.36%)  26.15ms (9.51%)
MaxPoolGrad  428.70MB (2.11%)  12.51ms (1.50%)  12.29ms (2.20%)  221µs (0.08%)
AvgPool  239.92MB (1.18%)  12.32ms (1.48%)  12.03ms (2.15%)  285µs (0.10%)
MaxPool  115.02MB (0.57%)  4.94ms (0.59%)  4.77ms (0.85%)  165µs (0.06%)
Slice  355.26MB (1.75%)  11.39ms (1.37%)  4.75ms (0.85%)  6.62ms (2.41%)
ConcatV2  316.55MB (1.56%)  5.96ms (0.71%)  4.59ms (0.82%)  1.36ms (0.50%)
ApplyGradientDescent  0B (0.00%)  25.51ms (3.06%)  2.25ms (0.40%)  23.21ms (8.44%)
Mul  100.36MB (0.49%)  21.20ms (2.54%)  1.80ms (0.32%)  19.36ms (7.04%)
L2Loss  138.24KB (0.00%)  6.66ms (0.80%)  1.50ms (0.27%)  5.12ms (1.86%)
Transpose  34.33MB (0.17%)  801µs (0.10%)  770µs (0.14%)  31µs (0.01%)
AssignSub  0B (0.00%)  5.48ms (0.66%)  384µs (0.07%)  5.07ms (1.84%)
Sub  0B (0.00%)  5.44ms (0.65%)  379µs (0.07%)  5.03ms (1.83%)
SparseSoftmaxCrossEntropyWithLogits  768B (0.00%)  304µs (0.04%)  219µs (0.04%)  84µs (0.03%)

Table 4.8: Profile of ResNet152 with batch size 16.

Node name  Requested bytes  Total execution time  GPU execution time  CPU execution time
Conv2DBackpropFilter  3808.29MB (16.85%)  207.50ms (24.42%)  127.27ms (26.02%)  80.16ms (22.29%)
Conv2DBackpropInput  7863.78MB (34.80%)  164.10ms (19.31%)  99.13ms (20.27%)  64.90ms (18.05%)
Conv2D  7547.40MB (33.40%)  117.42ms (13.82%)  97.45ms (19.92%)  19.91ms (5.54%)
FusedBatchNormGrad  1509.94MB (6.68%)  69.57ms (8.19%)  36.38ms (7.44%)  33.12ms (9.21%)
ReluGrad  0B (0.00%)  39.84ms (4.69%)  27.32ms (5.58%)  12.45ms (3.46%)
FusedBatchNorm  1528.18MB (6.76%)  32.75ms (3.85%)  25.12ms (5.14%)  7.55ms (2.10%)
AddN  0B (0.00%)  64.55ms (7.60%)  23.26ms (4.76%)  41.19ms (11.46%)
Relu  0B (0.00%)  21.40ms (2.52%)  18.01ms (3.68%)  3.32ms (0.92%)
Add  0B (0.00%)  18.32ms (2.16%)  17.33ms (3.54%)  971µs (0.27%)
ApplyGradientDescent  0B (0.00%)  39.04ms (4.59%)  5.78ms (1.18%)  33.17ms (9.23%)
Mul  242.82MB (1.07%)  25.02ms (2.94%)  4.38ms (0.89%)  20.51ms (5.71%)
L2Loss  416.51KB (0.00%)  14.31ms (1.68%)  3.44ms (0.70%)  10.78ms (3.00%)
MaxPoolGrad  51.38MB (0.23%)  1.58ms (0.19%)  1.52ms (0.31%)  59µs (0.02%)
AssignSub  0B (0.00%)  17.49ms (2.06%)  659µs (0.13%)  16.77ms (4.66%)
Sub  0B (0.00%)  14.21ms (1.67%)  651µs (0.13%)  13.50ms (3.76%)
MaxPool  12.85MB (0.06%)  619µs (0.07%)  577µs (0.12%)  42µs (0.01%)
Transpose  9.63MB (0.04%)  252µs (0.03%)  220µs (0.04%)  31µs (0.01%)
MatMul  8.40MB (0.04%)  297µs (0.03%)  197µs (0.04%)  98µs (0.03%)
Tile  6.42MB (0.03%)  514µs (0.06%)  164µs (0.03%)  350µs (0.10%)
Pad  10.16MB (0.04%)  234µs (0.03%)  163µs (0.03%)  70µs (0.02%)
SparseSoftmaxCrossEntropyWithLogits  768B (0.00%)  718µs (0.08%)  119µs (0.02%)  599µs (0.17%)

Table 4.9: Profile of VGG16 with batch size 32.

Node name                           | Requested bytes     | Total execution time | GPU execution time | CPU execution time
Conv2DBackpropFilter                | 7160.58MB (25.23%)  | 178.88ms (28.84%)    | 177.44ms (29.09%)  | 1.43ms (14.06%)
Conv2D                              | 10093.91MB (35.57%) | 129.75ms (20.92%)    | 128.33ms (21.04%)  | 1.42ms (13.99%)
Conv2DBackpropInput                 | 9008.68MB (31.75%)  | 123.57ms (19.92%)    | 122.27ms (20.04%)  | 1.29ms (12.65%)
ReluGrad                            | 0B (0.00%)          | 39.98ms (6.45%)      | 39.64ms (6.50%)    | 336µs (3.30%)
BiasAdd                             | 0B (0.00%)          | 27.69ms (4.46%)      | 27.29ms (4.47%)    | 401µs (3.94%)
Relu                                | 0B (0.00%)          | 25.17ms (4.06%)      | 24.84ms (4.07%)    | 322µs (3.16%)
MaxPoolGrad                         | 834.90MB (2.94%)    | 19.80ms (3.19%)      | 19.56ms (3.21%)    | 233µs (2.29%)
BiasAddGrad                         | 53.76KB (0.00%)     | 16.41ms (2.65%)      | 15.87ms (2.60%)    | 537µs (5.27%)
MatMul                              | 500.99MB (1.77%)    | 12.83ms (2.07%)      | 12.52ms (2.05%)    | 300µs (2.95%)
AddN                                | 0B (0.00%)          | 12.77ms (2.06%)      | 12.01ms (1.97%)    | 752µs (7.39%)
ApplyGradientDescent                | 0B (0.00%)          | 12.29ms (1.98%)      | 11.58ms (1.90%)    | 712µs (6.99%)
Mul                                 | 560.80MB (1.98%)    | 8.74ms (1.41%)       | 7.92ms (1.30%)     | 812µs (7.97%)
MaxPool                             | 197.39MB (0.70%)    | 6.66ms (1.07%)       | 6.45ms (1.06%)     | 204µs (2.00%)
L2Loss                              | 66.56KB (0.00%)     | 4.77ms (0.77%)       | 3.67ms (0.60%)     | 1.09ms (10.68%)
Transpose                           | 19.27MB (0.07%)     | 461µs (0.07%)        | 424µs (0.07%)      | 37µs (0.36%)
SparseSoftmaxCrossEntropyWithLogits | 768B (0.00%)        | 319µs (0.05%)        | 216µs (0.04%)      | 102µs (1.00%)
RandomUniformInt                    | 256B (0.00%)        | 100µs (0.02%)        | 7µs (0.00%)        | 92µs (0.90%)
Sum                                 | 256B (0.00%)        | 35µs (0.01%)         | 3µs (0.00%)        | 32µs (0.31%)
Select                              | 0B (0.00%)          | 32µs (0.01%)         | 3µs (0.00%)        | 29µs (0.28%)
Add                                 | 0B (0.00%)          | 29µs (0.00%)         | 3µs (0.00%)        | 26µs (0.26%)
RealDiv                             | 0B (0.00%)          | 26µs (0.00%)         | 2µs (0.00%)        | 24µs (0.24%)

Chapter 5

Discussion

5.1 GPU vs. CPU

Based on the profiling data presented in Section 4.4, we conclude that improvements to the GPU, rather than the CPU, have greater potential for improving overall performance on our setup. For some networks, however, especially ResNet-152, the difference in potential is small. Note that greater potential for improvement in this sense does not mean that improvement is easy or even possible; it means, for example, that improving the CPU can affect the overall performance of AlexNet or VGG16 only through the roughly 2% of the work that is executed on the CPU. Furthermore, the distribution of computations between the processing units depends on the implementation, especially since TensorFlow gives the user considerable control over which processing unit is used.
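The limit described above is an instance of Amdahl's law: the overall speedup from improving one processing unit is bounded by the fraction of runtime spent on it. A minimal sketch of the calculation, where the 2%/98% CPU/GPU split is the approximate AlexNet/VGG16 figure from Section 4.4 and the 2x GPU speedup is an illustrative assumption:

```python
def amdahl_speedup(improvable_fraction, factor):
    """Overall speedup when a fraction of the runtime is sped up by
    `factor` (Amdahl's law)."""
    return 1.0 / ((1.0 - improvable_fraction) + improvable_fraction / factor)

# Roughly 2% of a training step runs on the CPU for AlexNet/VGG16.
# Even an infinitely fast CPU barely improves a whole step:
print(amdahl_speedup(0.02, float("inf")))  # ~1.02

# Speeding up the GPU-side 98% by an assumed factor of 2 helps far more:
print(amdahl_speedup(0.98, 2.0))           # ~1.96
```

For ResNet-152, where a larger share of the step runs on the CPU, the same formula yields a correspondingly smaller gap between the two options.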

5.2 GPU Memory Capacity

Our results suggest that GPU memory capacity influences performance in two ways. Firstly, more memory allows one to use larger batch sizes during training. However, whether a large batch size is advantageous for performance is a nontrivial question. A large batch size usually means better performance in terms of images/sec, but the model will need to process each image in the training set more times to reach the same accuracy compared to on-line learning or a smaller batch size. Large batch sizes tend to outperform smaller ones in throughput because they increase the computational parallelism [11]. Furthermore, research suggests that training with a large batch size reduces the quality of the trained model in terms of generalisation [8][7]. Thus, the ability to use larger batch sizes does not necessarily lead to better performance, but we cannot conclude that it is not useful in some situations.

Secondly, our results on how performance is affected by batch size suggest that performance drops significantly when the GPU is working close to the limit of its memory capacity. Although this can be observed to some extent for all networks in Figure 4.1, the results presented in Section 4.3 only show this behaviour for AlexNet. These conflicting results suggest that the drop in performance caused by working close to the memory limit depends on the batch size. Nevertheless, we can conclude that in some situations increased memory capacity can significantly improve the performance of training CNNs in TensorFlow. A possible explanation is that, even though the memory is sufficient for each computation in isolation, the constrained memory reduces the opportunity for parallel execution. To determine the specific cause, one would have to examine in greater detail when during execution the limited memory capacity acts as a bottleneck and why that is the case.
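The first effect can be made concrete with a back-of-the-envelope estimate: activation memory grows roughly linearly with batch size, so available GPU memory caps the largest usable batch. A hedged sketch of the estimate, where the layer shapes are illustrative and not taken from the benchmarked networks, and memory reuse by the framework is ignored:

```python
def activation_bytes(batch, feature_maps, dtype_bytes=4):
    """Rough activation memory for a list of (height, width, channels)
    feature maps, assuming float32 values and no memory reuse."""
    return batch * dtype_bytes * sum(h * w * c for (h, w, c) in feature_maps)

# Illustrative VGG-like early layers (not the exact benchmark shapes).
maps = [(224, 224, 64), (112, 112, 128), (56, 56, 256), (28, 28, 512)]

for batch in (16, 32, 64):
    gib = activation_bytes(batch, maps) / 2**30
    print(f"batch {batch:3d}: ~{gib:.2f} GiB of activations")
```

Doubling the batch size doubles this estimate, which is why the maximum batch size in Section 4.2 differs so much between the memory-hungry and the lightweight models.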

5.3 Limitations and future work

In this section we discuss the limitations of our study and their effects on the conclusions that can be drawn from it. We also propose ways in which future work could give more conclusive answers to our research question.

Because we have conducted our study on a single system, we can strictly draw conclusions only about that exact system. We can, however, expect the conclusions to apply to some extent to similar systems. By including several different systems that can also be considered personal computers not specialised for machine learning, we could have reached conclusions that are more widely applicable.

As mentioned in Section 3.2, we did not use real ImageNet data in our tests. However, running the benchmark on synthetic data does not seem to skew the results: TensorFlow’s own benchmarks [3] tested this, and their results show little difference in performance between real and synthetic data. This suggests that our results from benchmarking with synthetic data are applicable to a real-world scenario.

We could also have added a dimension to our investigation by changing the compute power of our test environment through ”clocking” various hardware components, such as increasing the clock speed of the CPU and/or the GPU. This could possibly have an impact on performance.

A more in-depth analysis of the profiling information would plausibly lead to additional conclusions regarding possible performance improvements. We were able to identify that the same types of computation demand the most resources for all of the network models we investigated. A more in-depth analysis of how the load these computations put on the system can be mitigated – for example through specialised hardware or a better implementation – could therefore lead to significant performance improvements for all of the network models tested.

Chapter 6

Conclusions

We have provided a benchmark for TensorFlow on our system, which can serve as a reference when comparing with other systems. Moreover, the results of this study indicate that in some situations increased GPU memory capacity leads to increased performance. Furthermore, for the five network models tested, our results suggest that when running TensorFlow with maximum images per second in mind, picking a batch size equal to a power of two is advisable. We also conclude that for our system the performance is limited by the GPU rather than the CPU. Considering the amount of execution time spent on the GPU compared with the CPU, our advice to anyone running TensorFlow on a system similar to ours would be to prioritise upgrading the GPU rather than the CPU.

Bibliography

[1] Martin Abadi et al. “TensorFlow: A System for Large-Scale Machine Learning”. In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). Savannah, GA: USENIX Association, 2016, pp. 265–283. ISBN: 978-1-931971-33-1. URL: https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi.

[2] Rachel Allen and Michael Li. Ranking Popular Deep Learning Libraries for Data Science. Oct. 2017. URL: https://www.kdnuggets.com/2017/10/ranking-popular-deep-learning-libraries-data-science.html.

[3] Benchmarks | TensorFlow. https://www.tensorflow.org/performance/benchmarks. Accessed: 2018-03-23.

[4] Andre Esteva et al. “Dermatologist-level classification of skin cancer with deep neural networks”. In: Nature 542.7639 (2017), p. 115.

[5] Peter Goldsborough. “A Tour of TensorFlow”. In: CoRR abs/1610.01178 (2016). arXiv: 1610.01178. URL: http://arxiv.org/abs/1610.01178.

[6] Kaiming He et al. “Deep Residual Learning for Image Recognition”. In: CoRR abs/1512.03385 (2015). arXiv: 1512.03385. URL: http://arxiv.org/abs/1512.03385.

[7] Elad Hoffer, Itay Hubara, and Daniel Soudry. “Train longer, generalize better: closing the generalization gap in large batch training of neural networks”. In: Advances in Neural Information Processing Systems. 2017, pp. 1729–1739.


[8] Nitish Shirish Keskar et al. “On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima”. In: CoRR abs/1609.04836 (2016). arXiv: 1609.04836. URL: http://arxiv.org/abs/1609.04836.

[9] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. “ImageNet Classification with Deep Convolutional Neural Networks”. In: Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1. NIPS’12. Lake Tahoe, Nevada: Curran Associates Inc., 2012, pp. 1097–1105. URL: http://dl.acm.org/citation.cfm?id=2999134.2999257.

[10] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. “ImageNet Classification with Deep Convolutional Neural Networks”. In: Advances in Neural Information Processing Systems 25. Ed. by F. Pereira et al. Curran Associates, Inc., 2012, pp. 1097–1105. URL: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.

[11] Dominic Masters and Carlo Luschi. “Revisiting Small Batch Training for Deep Neural Networks”. In: arXiv preprint arXiv:1804.07612 (2018).

[12] NVIDIA. GTC 2018 Keynote with NVIDIA CEO Jensen Huang. Youtube. 2018. URL: https://youtu.be/95nphvtVf34?t=1h9m54s.

[13] Kaz Sato, Cliff Young, and David Patterson. An in-depth look at Google’s first Tensor Processing Unit (TPU) | Google Cloud Big Data and Machine Learning Blog | Google Cloud. May 2017. URL: https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at--first-tensor-processing-unit-tpu.

[14] Karen Simonyan and Andrew Zisserman. “Very Deep Convolutional Networks for Large-Scale Image Recognition”. In: CoRR abs/1409.1556 (2014). arXiv: 1409.1556. URL: http://arxiv.org/abs/1409.1556.

[15] Christian Szegedy et al. “Rethinking the Inception Architecture for Computer Vision”. In: CoRR abs/1512.00567 (2015). arXiv: 1512.00567. URL: http://arxiv.org/abs/1512.00567.