
DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS, STOCKHOLM, SWEDEN 2018

Benchmarking TensorFlow on a personal computer not specialised for machine learning

SEBASTIAN GEDIN

JAKOB HOLM

KTH SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE


Bachelor of Computer Science
Date: June 1, 2018
Supervisor: Stefano Markidis
Examiner: Örjan Ekeberg
Swedish title: Prestandatestning av TensorFlow på en dator ej specialiserad för maskininlärning
School of Electrical Engineering and Computer Science (EECS)


Abstract

Many recent advancements of modern technologies can be attributed to the rapid growth of the machine learning field, and especially of deep learning. A big challenge for deep learning is that the learning process can be very time-consuming. TensorFlow is a framework which allows developers to make use of GPUs and other processing units in order to tackle this and other tasks involved in machine learning. In this study we benchmark and investigate the performance of TensorFlow, in terms of images per second, on a personal computer not specialised for machine learning. We investigate how the performance when training different convolutional neural networks is affected by batch size and available GPU memory capacity. We also profile the execution. Our results suggest that increasing the memory capacity of the GPU can be beneficial both for being able to train networks using larger batch sizes and for improving performance in some situations. We also conclude that improving the GPU, rather than the CPU, has greater potential for improving performance, but to what extent differs significantly between network models. Furthermore, using a power of two for batch size appears to be advantageous for performance.

Sammanfattning

Advances in machine learning, and especially deep learning, have given rise to many technological achievements in recent years. One problem in deep learning, however, is that the training phase can be very time-consuming. TensorFlow is a framework developed to solve exactly this problem, by making it easy for developers to use graphics cards and other kinds of processors for the computations used in machine learning. In this study we benchmark TensorFlow by finding the maximum number of images per second when training different convolutional neural networks on a computer not built for machine learning. We investigate how performance is affected by batch size and the amount of available graphics memory. In addition, we also profile the execution of the benchmarks. Our results indicate that a larger amount of graphics memory both enables larger batch sizes during training and leads to higher performance in some situations. We also draw the conclusion that improvements to the graphics card have greater potential than improvements to the processor for increasing performance, but to what degree varies considerably between the different convolutional networks. Furthermore, it appears to be advantageous for performance to use a power of two as the batch size.

Contents

1 Introduction
  1.1 Research Question

2 Background
  2.1 TensorFlow
    2.1.1 Dataflow graphs
    2.1.2 Interface
    2.1.3 Execution
  2.2 Processing units for deep learning

3 Methods
  3.1 Environment
  3.2 Benchmarking Script
  3.3 Investigation
    3.3.1 Batch size
    3.3.2 Benchmark
    3.3.3 GPU Memory capacity
    3.3.4 Profiling

4 Results
  4.1 Batch size
  4.2 Benchmarks
  4.3 GPU memory capacity
  4.4 Profiling

5 Discussion
  5.1 GPU vs. CPU
  5.2 GPU Memory Capacity
  5.3 Limitations and future work


6 Conclusions

Bibliography

Chapter 1

Introduction

Recent advancements of many modern technologies, such as the improved cameras in mobile phones, smart personal digital assistants and especially autonomous vehicles, can be attributed to the rapid growth of the machine learning field [1]. Several approaches exist within the field of machine learning, one of which is called deep learning, and in recent years it has proven to be highly effective [5]. A common application of deep learning is image recognition using deep neural networks; one may for example use deep neural networks to classify pictures of suspected skin cancer as either malignant or benign [4]. However, the neural network cannot perform this task without initial training, which means processing large amounts of already classified images. One of the biggest challenges in deep learning is that this training process can be very time-consuming. In 2012, Krizhevsky et al. published a seminal paper on image classification using deep learning, in which they showed that GPUs can be used to significantly speed up training [9]. Following that discovery, many different frameworks have been developed to allow developers to easily utilise the computational power of one or more GPUs in their machine learning programs. One of these frameworks, which has become very popular in the last couple of years, is TensorFlow. Based on GitHub and Stack Overflow activity, as well as search activity, TensorFlow is dominating the market for deep learning libraries [2]. TensorFlow operates on large-scale data sets in heterogeneous environments [1]. Machine learning models are represented as dataflow graphs in which nodes describe operations and edges between nodes represent the data flowing between operations. Scalability has been


taken into great consideration in the design of TensorFlow, meaning that as the number of CPUs and GPUs in computers rises, the performance of TensorFlow will increase. To achieve this, TensorFlow has algorithms that assign and distribute the work on machines that contain multiple CPUs and GPUs [5].

1.1 Research Question

As open-source software, TensorFlow is likely to be used by many students and hobby programmers. It is therefore relevant to investigate how TensorFlow performs when run on computers that such users may have access to: personal computers that are not specialised for the workload involved in machine learning computations. The purpose of this study is to provide benchmarks for TensorFlow performance on such a computer and to identify limiting factors on performance, knowledge of which is crucial for efficiently upgrading hardware and software to achieve better performance. This study therefore aims to answer the following question: How does TensorFlow perform on a personal computer not specialised for machine learning and what are the limiting factors?

Chapter 2

Background

2.1 TensorFlow

In this section we discuss the principles of TensorFlow. We describe what dataflow graphs are and how they are used in TensorFlow to model computations involved in machine learning algorithms. We then continue to describe the interface provided by TensorFlow to allow users to create such graphs. Finally, we explain how dataflow graphs are mapped to the available processing units for execution.

2.1.1 Dataflow graphs

A dataflow or computational graph is a directed graph in which nodes represent units of calculation and edges represent data being consumed and produced by those units. As a simple example, if z is to be the result of adding x and y, a dataflow graph like the one in Figure 2.1 can be used to represent this computation. Note that, as in this example, a variable or constant is considered to be a unit of calculation and is therefore represented by a node in the graph. This is one of the strengths of a dataflow graph: what a node represents can be very general. A node can therefore represent many different things, such as constants, variables, mathematical operations, functions or I/O operations: basically anything with zero or more inputs which produces zero or more outputs. In this sense, a constant is a unit of calculation with zero inputs which always produces one output corresponding to the constant it represents.


Figure 2.1: Example of a dataflow graph representing the addition of two input variables x and y with the result z.
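The graph in Figure 2.1 can be sketched in a few lines of plain Python (a hypothetical minimal graph structure for illustration only, not TensorFlow's actual API; the `Node` and `constant` names are our own):

```python
# Minimal dataflow-graph sketch: nodes are units of calculation, edges are
# the inputs each node consumes. A node can run once its inputs are ready.

class Node:
    def __init__(self, op, *inputs):
        self.op = op          # callable producing this node's output
        self.inputs = inputs  # upstream nodes whose outputs we consume

    def evaluate(self):
        # Evaluate dependencies first; in TensorFlow, independent
        # subgraphs such as x and y could be executed in parallel.
        return self.op(*(n.evaluate() for n in self.inputs))

def constant(value):
    # A constant is a node with zero inputs that always yields one output.
    return Node(lambda: value)

x = constant(2)
y = constant(3)
z = Node(lambda a, b: a + b, x, y)  # the graph in Figure 2.1

print(z.evaluate())  # 5
```

The key property the sketch demonstrates is that `z` only describes *how* to compute the result; nothing runs until `evaluate` is called, mirroring TensorFlow's separation of graph construction and execution.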

2.1.2 Interface

The following paragraphs outline how dataflow graphs are used in TensorFlow to model computations involved in machine learning algorithms and how a user can construct such graphs.

Operations In TensorFlow, units of calculation are referred to as operations. Each operation is associated with an implementation, which is referred to as a kernel. The kernel specifies how to execute the operation and on which kind of hardware device (e.g., CPU or GPU) the implementation can be executed.

Tensors The data flowing along the edges of the graph is modelled as tensors. Tensors are n-dimensional arrays with homogeneous elements of primitive types (such as int32, float32 or string). The number of dimensions of the array representing the tensor is the rank of the tensor, and a tensor's shape describes the number of elements in each dimension. For example, scalars, vectors and matrices are represented as tensors of rank 0, 1 and 2 respectively, and a 3 × 4 matrix has the shape [3,4]. Operations can be seen as calculations that may consume and/or produce tensors.
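Rank and shape can be illustrated with nested Python lists (an illustrative sketch only; TensorFlow does not store tensors this way, and `shape_of` is a hypothetical helper):

```python
# Derive the shape of a nested-list "tensor" by walking down its first
# element at each level; the rank is simply the length of the shape.

def shape_of(tensor):
    shape = []
    while isinstance(tensor, list):
        shape.append(len(tensor))
        tensor = tensor[0]
    return shape

scalar = 7                            # rank 0, shape []
vector = [1, 2, 3]                    # rank 1, shape [3]
matrix = [[0] * 4 for _ in range(3)]  # rank 2, shape [3, 4]

print(shape_of(matrix))       # [3, 4]
print(len(shape_of(matrix)))  # rank: 2
```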

Sessions A session is used to execute the computations defined in graphs. The purpose of a session is to provide a connection between the client program and the C++ runtime. It allocates the physical resources used for the computations and holds any intermediate values during execution. The graph, consisting of operations and tensors, defines the computations, while the session is used to execute them.

Variables Variables are used to represent a shared, persistent state that can be manipulated by the program. They can be accessed and modified by multiple sessions simultaneously. As with other TensorFlow operations, the interface makes it possible to place variables on particular devices, which gives the user ample control over memory management.

2.1.3 Execution

The strength of using dataflow graphs for modelling machine learning algorithms comes from how they facilitate parallel and scalable execution. If the inputs of a node are available, the operation for that node can be executed independently of the rest of the graph, which allows for parallel execution of such operations. The execution is also scalable since the dataflow graph does not depend on how operations are performed; as long as the input is transformed to the correct output, it does not matter how the operation was implemented or on what device it was executed. Operations can even be distributed among several processing units of different types. In TensorFlow, execution is divided among the client, the master, a set of workers and the available devices. The client is the program written by the user that defines the computational graph. It requests evaluation of the graph by creating a session, which is sent to the master. The master is responsible for partitioning the graph and delegating the partitions to the workers. Each worker is in turn responsible for one or more devices, which are the available physical processing units. Workers use the kernels of the operations to determine how and on what device they can be executed. A device is often a CPU or a GPU, but TensorFlow also allows the user to register additional types of execution units, such as the tensor processing unit (TPU), a special processing unit developed by Google that is optimised for tensor computations [13].
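The kernel-based placement described above might be sketched as follows (a hypothetical toy scheduler, not TensorFlow's actual placement algorithm; the device names and the kernel table are assumptions for illustration):

```python
# Toy device placement: each operation's kernels determine which device
# types can run it; devices are tried in order of preference.

AVAILABLE_DEVICES = ["gpu:0", "cpu:0"]  # listed in order of preference

# Hypothetical table: operation name -> device types with a kernel for it.
KERNELS = {
    "Conv2D": {"gpu", "cpu"},
    "MatMul": {"gpu", "cpu"},
    "ReadFile": {"cpu"},  # I/O ops typically only have CPU kernels
}

def place(op_name):
    for device in AVAILABLE_DEVICES:
        if device.split(":")[0] in KERNELS[op_name]:
            return device
    raise ValueError(f"no kernel for {op_name}")

assignment = {op: place(op) for op in KERNELS}
print(assignment)
# {'Conv2D': 'gpu:0', 'MatMul': 'gpu:0', 'ReadFile': 'cpu:0'}
```

The sketch captures why the dataflow model scales: the graph itself never changes, only the mapping of operations onto whatever devices happen to be available.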

2.2 Processing units for deep learning

Most computers have two different kinds of processing units: a central processing unit (CPU) and a graphics processing unit (GPU). The CPU consists of one or a few cores and is optimised for serial processing, while a GPU has thousands of small cores and is designed to efficiently handle many small tasks in parallel. Traditionally the GPU, as

suggested by its name, is used to manipulate computer graphics, but the emergence of deep learning has given rise to a new type of work for GPUs. As explained above, the execution of a dataflow graph involves many independent calculations that can be run in parallel. Thus, GPUs have much to offer when it comes to speeding up the execution of such graphs. In 2012 Alex Krizhevsky and his team trained AlexNet in five to six days on two consumer-grade gaming GPUs [9]. Today the same task can be done in 18 minutes [12]. This improvement can largely be attributed to the development of processing units specialised for machine learning computations. Gaming GPUs are built with rendering graphics in mind, but the aforementioned 18-minute training was not done on gaming GPUs; instead it was done on a DGX-2 system, which is built with AI and deep learning in mind. As an alternative to a GPU, Google has developed the TPU, a processing unit specialised for working with neural networks. One way in which the TPU is optimised for neural network calculations is called quantization. It allows neural network calculations to be done with 8-bit integers (rather than the 32-bit or 16-bit floating-point numbers used by most CPUs and GPUs) while still maintaining acceptable accuracy. Google claims that by using quantization the memory needed for the image recognition model Inception is reduced to one-fourth of its original size [13].

Chapter 3

Methods

3.1 Environment

All tests are conducted on a system with the specifications listed below. In order to get reliable measurements we also decided to run TensorFlow over SSH, thus avoiding any load on the GPU and CPU from processes that would otherwise run when operating the computer through the graphical user interface.

• CPU: Intel® Core™ i5-2500K @ 3.30 GHz

• Installed memory (RAM): 8 GB

• GPU: 1 x NVIDIA® GeForce® GTX 970

• OS: Ubuntu 16.04.4 LTS

• CUDA / cuDNN: 9.0 / 7.0.5

• TensorFlow version: 1.7.0-dev20180318

• Python version: Python 3.6.1 :: Anaconda 4.4.0 (64-bit)

• Benchmark GitHub hash: 5ed3045

• Disk: Intel® SSDSA2M040G2GC @ 40 GB


3.2 Benchmarking Script

We measured the performance of TensorFlow by using a benchmarking script which is also used by TensorFlow developers [3]. The script measures the performance, in terms of images processed per second during network training, when using TensorFlow to train various convolutional neural networks (CNNs). Unless otherwise stated, 'performance' in this paper refers to the speed at which images are processed during training. The script allows us to specify the network model to use as well as the batch size for training. For each run of the script, 10 warm-up steps are done, followed by 100 steps for which the performance is measured in terms of images per second. In each step the network is trained with one batch of images. By default, the script uses synthetic data for training, but it can be run with real data if supplied by the user. Due to hardware limitations we have conducted our tests on synthetic data; running the script using actual ImageNet data would have required more disk space than our environment had available. The script is open-source and available on the TensorFlow GitHub page¹.
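A typical invocation might look like the following (a sketch: `--model` selects the network and `--batch_size` the batch size, as described in Section 3.3.1; the exact set of flags depends on the version of the benchmarks repository listed in Section 3.1):

```shell
# Hypothetical example run: train ResNet-50 with batch size 32 using the
# script's defaults of 10 warm-up steps and 100 measured steps on
# synthetic data.
python tf_cnn_benchmarks.py --model=resnet50 --batch_size=32
```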

3.3 Investigation

In order to identify potential limiting factors when running TensorFlow on our setup we investigated the performance in four different ways. Firstly, we investigated how the performance depends on the batch size, identifying the maximum batch size possible for training with the setup and at which batch size the performance peaks. Secondly, we benchmarked the performance at the batch size where performance was found to peak. Thirdly, we investigated how limiting the available memory on the GPU affects the performance. Finally, we profiled the training performed by the script. All tests were performed for the following network models: InceptionV3 [15], ResNet-50 [6], ResNet-152 [6], AlexNet [10] and VGG16 [14]. The following sections describe in detail how the four investigations were conducted.

¹https://github.com/tensorflow/benchmarks/tree/master/scripts/tf_cnn_benchmarks

3.3.1 Batch size

The batch size is the number of images processed simultaneously in each iteration of training. There is a non-trivial relationship between the batch size and the performance of TensorFlow (note that the batch size also affects the accuracy of the trained model, but we are not concerned with that in this study). The batch size to be used in the script can easily be specified via the --batch_size flag. By using this flag we found, for each network on our test environment, the maximum batch size at which the script is able to execute. We measured performance for each batch size from one up to the maximum batch size for all network models except AlexNet, for which we ran the script on batch sizes from two up to the maximum in steps of two in order to save time (training AlexNet allows significantly higher batch sizes than the other network models).

3.3.2 Benchmark

The script was used to benchmark the performance of the system with the configuration described in Section 3.1. The results from Section 3.3.1 were used to determine the batch size at which each network model had optimal performance on our setup. The purpose of this part of the investigation is to provide best-case benchmarks for our setup on various network models that can be used for comparison with other setups. In order to get a reliable result and to investigate potential fluctuations in the performance, we run the script 10 times per network and take the arithmetic average of the results. We do this by modifying the script to output the time for each test and then calculating the total number of images processed in the 10 runs. The total number of images processed divided by the sum of the times from the 10 runs gives the arithmetic mean of the performance in images/sec. We also calculate the standard error of the mean using the times collected from each of the 10 runs.
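The aggregation described above can be sketched as follows (the per-run times below are illustrative placeholders, not our measured values):

```python
# Sketch of the benchmark aggregation: 10 runs, each processing a fixed
# number of images (100 steps at a given batch size).

from statistics import stdev

IMAGES_PER_RUN = 100 * 32  # 100 measured steps at batch size 32
times = [57.4, 57.5, 57.3, 57.6, 57.4,
         57.5, 57.4, 57.3, 57.5, 57.4]  # seconds per run (illustrative)

# Arithmetic mean of the performance: total images / total time.
images_per_sec = len(times) * IMAGES_PER_RUN / sum(times)

# Standard error of the mean, from the per-run rates.
rates = [IMAGES_PER_RUN / t for t in times]
sem = stdev(rates) / len(rates) ** 0.5

print(f"{images_per_sec:.2f} ± {sem:.2f} images/sec")
```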

3.3.3 GPU Memory capacity

We endeavoured to test how GPU memory capacity impacts the performance of TensorFlow. The script makes such tests possible, without us needing to use many different physical GPUs, through the

--gpu_memory_frac_for_testing flag. By utilising this flag we could specify the fraction of the total GPU memory that was to be used by the script. We verified that this flag worked as intended by monitoring the memory usage of the GPU while running the script. For each network, we found a batch size equal to a power of two which did not result in a warning message stating that memory shortage could be influencing performance (see Section 4.1). Our intention was to use a batch size for which memory was abundant when using 100% of the available GPU memory. We then found the lowest fraction of GPU memory for which the script would still execute at that batch size. We then ran the script making various amounts of memory available for execution. The granularity of our measurements was adjusted based on when the available amount of memory seemed to influence performance. For all networks we increased the granularity close to the lower limit of available memory needed to run the script, because we expected memory capacity to have the greatest effect on performance in that range.

3.3.4 Profiling

TensorFlow allows users to profile the execution. Through the script's --tfprof_file flag we gained access to this data. The script saves the relevant profiling information in a proto file. The profiler gives information about which nodes are used when training the various network models and how much time and memory is spent on each node. It also states how much of the execution time for each node is spent on the accelerator (GPU) versus the CPU. The profiler divides the result into five columns: node, requested bytes, total execution time, accelerator execution time and CPU execution time. All five network models were profiled in order to identify similarities and differences in the workload.

Chapter 4

Results

The results section is divided into four parts corresponding to the different areas of investigation described in Section 3.3.

4.1 Batch size

In this section we present the data from our batch size tests. The data presented in Figure 4.1 is from varying the batch size within the limitations of our test environment; see Table 4.1 for the upper limit for each network model. The performance of all network models follows the same trend as the batch size changes: it generally increases as the batch size increases, peaks at batch sizes equal to a power of two, and drops to some extent close to the batch size limit. After peaking at a power of two, the performance

Table 4.1: Batch size (upper) limit for the different network models.

Network      Batch size
AlexNet      753
InceptionV3  36
ResNet-50    42
ResNet-152   17
VGG16        34


Figure 4.1: The performance on the benchmark script for different network models with varying batch size.


is reduced immediately at the next batch size. The peaks in performance at batch sizes equal to a power of two (in our case 16, 32 and 512) could be due to how memory is organised: memory architecture in both CPUs and GPUs is usually organised in powers of two. It is interesting to note that the largest batch size does not lead to the best result. On the contrary, the performance drops at the largest batch sizes. How close to the limit this drop occurs and its magnitude vary among the different network models. This may be due to memory limitations within the GPU, which is suggested by a warning message that reads: "Allocator (G_0_bfc) ran out of memory trying to allocate 1.52GiB. The caller indicates that this is not a failure, but

may mean that there could be performance gains if more memory were available." The message occurs for larger batch sizes for all network models, but the batch size at which the message first occurs does not correspond to where the performance starts to drop.

4.2 Benchmarks

The average peak performance for each network on our setup is presented in Table 4.2. The relative uncertainty of around 0.03% or less for all of the measurements suggests that the performance has not been significantly affected by any momentary external factors.

Table 4.2: Performance for each network model with optimal batch size (in terms of images/sec).

Network      Batch size  Images/sec
AlexNet      512         1119.74 ± 0.32
InceptionV3  32          55.73 ± 0.01
ResNet-50    32          82.24 ± 0.02
ResNet-152   16          31.93 ± 0.01
VGG16        32          52.49 ± 0.01

4.3 GPU memory capacity

The batch size used for each network model in these tests and the lowest memory fraction for which the script could be executed are presented in Table 4.3. Figure 4.2 shows how the performance is affected when limiting the amount of available memory. Measurements are taken from 100% of total memory down to 10% above the lower limit presented in Table 4.3 in steps of 20%; the step size is then reduced to 1% for the final 10 measurements. Limiting the memory capacity seems to affect the performance of training AlexNet significantly, but not that of any of the other network models.

Table 4.3: Batch size and the lowest percentage of GPU memory for which the benchmark script still executes.

Network      Batch size  Memory limit (%)
AlexNet      256         40
InceptionV3  8           24
ResNet-50    8           22
ResNet-152   4           30
VGG16        4           43

Figure 4.2: Graphs demonstrating the impact on performance for different network models from only having a fraction of GPU memory available.


Table 4.4: GPU proportion of total execution time for each network model.

Network      GPU execution time
AlexNet      98%
VGG16        98%
ResNet-50    75%
InceptionV3  67%
ResNet-152   58%

4.4 Profiling

Profiles of each network model are presented in Tables 4.5-4.9. The batch sizes that resulted in the best performance, as found in Section 4.1, were used when profiling the execution. Using the profiling information we can deduce that for all network models the bulk of the execution is performed on the GPU rather than the CPU, although the ratio varies between the different network models (see Table 4.4). This suggests that improving the performance of the GPU will be more beneficial than improving that of the CPU. The large difference in the ratio of GPU to CPU work between the network models (58% for ResNet-152 versus 98% for AlexNet and VGG16) can be attributed to differences in architecture depth and computational complexity, but a more in-depth analysis of such differences is out of the scope of this investigation. For all network models the three nodes Conv2DBackpropFilter, Conv2DBackpropInput and Conv2D request the most memory and have the highest total and GPU execution times. These nodes are responsible for various key computations in training CNNs.

Table 4.5: Profile of ResNet50 with batch size 32.

Node name  Requested bytes  Total execution time  GPU execution time  CPU execution time
Conv2DBackpropFilter  1733.55MB (10.87%)  121.66ms (23.73%)  94.38ms (24.53%)  27.26ms (21.36%)
Conv2DBackpropInput  5436.45MB (34.08%)  88.99ms (17.36%)  68.77ms (17.87%)  20.21ms (15.83%)
Conv2D  5692.69MB (35.69%)  68.25ms (13.31%)  64.10ms (16.66%)  4.13ms (3.23%)
FusedBatchNormGrad  1399.98MB (8.78%)  62.79ms (12.25%)  44.09ms (11.46%)  18.68ms (14.64%)
FusedBatchNorm  1395.88MB (8.75%)  30.95ms (6.04%)  28.56ms (7.42%)  2.37ms (1.85%)
ReluGrad  0B (0.00%)  30.67ms (5.98%)  25.07ms (6.51%)  5.58ms (4.37%)
AddN  0B (0.00%)  36.21ms (7.06%)  17.27ms (4.49%)  18.90ms (14.81%)
Relu  0B (0.00%)  17.25ms (3.36%)  16.37ms (4.25%)  864µs (0.68%)
Add  0B (0.00%)  14.95ms (2.92%)  14.63ms (3.80%)  309µs (0.24%)
MaxPoolGrad  102.76MB (0.64%)  3.08ms (0.60%)  3.02ms (0.79%)  57µs (0.04%)
ApplyGradientDescent  0B (0.00%)  17.22ms (3.36%)  2.37ms (0.61%)  14.83ms (11.62%)
Mul  104.08MB (0.65%)  7.53ms (1.47%)  1.82ms (0.47%)  5.68ms (4.45%)
L2Loss  142.08KB (0.00%)  5.85ms (1.14%)  1.25ms (0.33%)  4.57ms (3.58%)
MaxPool  25.69MB (0.16%)  1.18ms (0.23%)  1.14ms (0.30%)  39µs (0.03%)
Transpose  19.27MB (0.12%)  467µs (0.09%)  430µs (0.11%)  37µs (0.03%)
Pad  20.31MB (0.13%)  373µs (0.07%)  327µs (0.08%)  46µs (0.04%)
Tile  12.85MB (0.08%)  340µs (0.07%)  316µs (0.08%)  23µs (0.02%)
AssignSub  0B (0.00%)  2.10ms (0.41%)  224µs (0.06%)  1.85ms (1.45%)
Sub  0B (0.00%)  2.25ms (0.44%)  222µs (0.06%)  2.01ms (1.57%)
SparseSoftmaxCrossEntropyWithLogits  768B (0.00%)  309µs (0.06%)  219µs (0.06%)  89µs (0.07%)
MatMul  8.63MB (0.05%)  304µs (0.06%)  213µs (0.06%)  90µs (0.07%)

Table 4.6: Profile of AlexNet with batch size 512.

Node name  Requested bytes  Total execution time  GPU execution time  CPU execution time
Conv2DBackpropFilter  6670.88MB (26.64%)  126.59ms (27.18%)  125.73ms (27.40%)  852µs (12.45%)
Conv2D  9062.38MB (36.20%)  100.62ms (21.61%)  99.77ms (21.75%)  847µs (12.38%)
Conv2DBackpropInput  7464.78MB (29.82%)  78.12ms (16.77%)  77.38ms (16.87%)  734µs (10.73%)
MatMul  288.98MB (1.15%)  45.50ms (9.77%)  45.09ms (9.83%)  408µs (5.96%)
ReluGrad  0B (0.00%)  23.81ms (5.11%)  23.58ms (5.14%)  227µs (3.32%)
BiasAdd  0B (0.00%)  16.86ms (3.62%)  16.61ms (3.62%)  244µs (3.57%)
Relu  0B (0.00%)  15.10ms (3.24%)  14.91ms (3.25%)  190µs (2.78%)
MaxPoolGrad  771.75MB (3.08%)  14.89ms (3.20%)  14.71ms (3.21%)  180µs (2.63%)
BiasAddGrad  43.52KB (0.00%)  10.03ms (2.15%)  9.66ms (2.11%)  366µs (5.35%)
MaxPool  196.31MB (0.78%)  8.64ms (1.85%)  8.49ms (1.85%)  147µs (2.15%)
Transpose  316.59MB (1.26%)  7.09ms (1.52%)  7.05ms (1.54%)  42µs (0.61%)
AddN  0B (0.00%)  5.69ms (1.22%)  5.15ms (1.12%)  539µs (7.88%)
ApplyGradientDescent  0B (0.00%)  5.63ms (1.21%)  5.12ms (1.12%)  508µs (7.43%)
Mul  265.03MB (1.06%)  4.06ms (0.87%)  3.55ms (0.77%)  506µs (7.40%)
L2Loss  35.84KB (0.00%)  2.31ms (0.50%)  1.67ms (0.36%)  640µs (9.36%)
SparseSoftmaxCrossEntropyWithLogits  6.14KB (0.00%)  514µs (0.11%)  338µs (0.07%)  175µs (2.56%)
RandomUniformInt  2.05KB (0.00%)  105µs (0.02%)  7µs (0.00%)  98µs (1.43%)
Sum  256B (0.00%)  44µs (0.01%)  3µs (0.00%)  41µs (0.60%)
Select  0B (0.00%)  38µs (0.01%)  3µs (0.00%)  35µs (0.51%)
RealDiv  0B (0.00%)  32µs (0.01%)  2µs (0.00%)  30µs (0.44%)
Add  0B (0.00%)  34µs (0.01%)  2µs (0.00%)  32µs (0.47%)

Table 4.7: Profile of InceptionV3 with batch size 32.

Node name  Requested bytes  Total execution time  GPU execution time  CPU execution time
Conv2DBackpropFilter  2469.67MB (12.17%)  219.63ms (26.32%)  154.18ms (27.60%)  65.40ms (23.78%)
Conv2D  4083.31MB (20.12%)  118.63ms (14.22%)  108.59ms (19.44%)  10.00ms (3.63%)
Conv2DBackpropInput  8280.24MB (40.79%)  167.32ms (20.06%)  107.92ms (19.32%)  59.36ms (21.58%)
FusedBatchNormGrad  1197.85MB (5.90%)  68.69ms (8.23%)  37.98ms (6.80%)  30.66ms (11.15%)
AvgPoolGrad  1498.48MB (7.38%)  26.51ms (3.18%)  26.13ms (4.68%)  376µs (0.14%)
ReluGrad  0B (0.00%)  35.06ms (4.20%)  24.57ms (4.40%)  10.45ms (3.80%)
FusedBatchNorm  1179.38MB (5.81%)  28.60ms (3.43%)  24.22ms (4.33%)  4.35ms (1.58%)
Relu  0B (0.00%)  17.98ms (2.15%)  16.22ms (2.90%)  1.72ms (0.62%)
AddN  0B (0.00%)  39.38ms (4.72%)  13.16ms (2.36%)  26.15ms (9.51%)
MaxPoolGrad  428.70MB (2.11%)  12.51ms (1.50%)  12.29ms (2.20%)  221µs (0.08%)
AvgPool  239.92MB (1.18%)  12.32ms (1.48%)  12.03ms (2.15%)  285µs (0.10%)
MaxPool  115.02MB (0.57%)  4.94ms (0.59%)  4.77ms (0.85%)  165µs (0.06%)
Slice  355.26MB (1.75%)  11.39ms (1.37%)  4.75ms (0.85%)  6.62ms (2.41%)
ConcatV2  316.55MB (1.56%)  5.96ms (0.71%)  4.59ms (0.82%)  1.36ms (0.50%)
ApplyGradientDescent  0B (0.00%)  25.51ms (3.06%)  2.25ms (0.40%)  23.21ms (8.44%)
Mul  100.36MB (0.49%)  21.20ms (2.54%)  1.80ms (0.32%)  19.36ms (7.04%)
L2Loss  138.24KB (0.00%)  6.66ms (0.80%)  1.50ms (0.27%)  5.12ms (1.86%)
Transpose  34.33MB (0.17%)  801µs (0.10%)  770µs (0.14%)  31µs (0.01%)
AssignSub  0B (0.00%)  5.48ms (0.66%)  384µs (0.07%)  5.07ms (1.84%)
Sub  0B (0.00%)  5.44ms (0.65%)  379µs (0.07%)  5.03ms (1.83%)
SparseSoftmaxCrossEntropyWithLogits  768B (0.00%)  304µs (0.04%)  219µs (0.04%)  84µs (0.03%)

Table 4.8: Profile of ResNet152 with batch size 16.

Node name  Requested bytes  Total execution time  GPU execution time  CPU execution time
Conv2DBackpropFilter  3808.29MB (16.85%)  207.50ms (24.42%)  127.27ms (26.02%)  80.16ms (22.29%)
Conv2DBackpropInput  7863.78MB (34.80%)  164.10ms (19.31%)  99.13ms (20.27%)  64.90ms (18.05%)
Conv2D  7547.40MB (33.40%)  117.42ms (13.82%)  97.45ms (19.92%)  19.91ms (5.54%)
FusedBatchNormGrad  1509.94MB (6.68%)  69.57ms (8.19%)  36.38ms (7.44%)  33.12ms (9.21%)
ReluGrad  0B (0.00%)  39.84ms (4.69%)  27.32ms (5.58%)  12.45ms (3.46%)
FusedBatchNorm  1528.18MB (6.76%)  32.75ms (3.85%)  25.12ms (5.14%)  7.55ms (2.10%)
AddN  0B (0.00%)  64.55ms (7.60%)  23.26ms (4.76%)  41.19ms (11.46%)
Relu  0B (0.00%)  21.40ms (2.52%)  18.01ms (3.68%)  3.32ms (0.92%)
Add  0B (0.00%)  18.32ms (2.16%)  17.33ms (3.54%)  971µs (0.27%)
ApplyGradientDescent  0B (0.00%)  39.04ms (4.59%)  5.78ms (1.18%)  33.17ms (9.23%)
Mul  242.82MB (1.07%)  25.02ms (2.94%)  4.38ms (0.89%)  20.51ms (5.71%)
L2Loss  416.51KB (0.00%)  14.31ms (1.68%)  3.44ms (0.70%)  10.78ms (3.00%)
MaxPoolGrad  51.38MB (0.23%)  1.58ms (0.19%)  1.52ms (0.31%)  59µs (0.02%)
AssignSub  0B (0.00%)  17.49ms (2.06%)  659µs (0.13%)  16.77ms (4.66%)
Sub  0B (0.00%)  14.21ms (1.67%)  651µs (0.13%)  13.50ms (3.76%)
MaxPool  12.85MB (0.06%)  619µs (0.07%)  577µs (0.12%)  42µs (0.01%)
Transpose  9.63MB (0.04%)  252µs (0.03%)  220µs (0.04%)  31µs (0.01%)
MatMul  8.40MB (0.04%)  297µs (0.03%)  197µs (0.04%)  98µs (0.03%)
Tile  6.42MB (0.03%)  514µs (0.06%)  164µs (0.03%)  350µs (0.10%)
Pad  10.16MB (0.04%)  234µs (0.03%)  163µs (0.03%)  70µs (0.02%)
SparseSoftmaxCrossEntropyWithLogits  768B (0.00%)  718µs (0.08%)  119µs (0.02%)  599µs (0.17%)

Table 4.9: Profile of VGG16 with batch size 32.

Node name                           | Requested bytes     | Total execution time | GPU execution time | CPU execution time
Conv2DBackpropFilter                | 7160.58MB (25.23%)  | 178.88ms (28.84%)    | 177.44ms (29.09%)  | 1.43ms (14.06%)
Conv2D                              | 10093.91MB (35.57%) | 129.75ms (20.92%)    | 128.33ms (21.04%)  | 1.42ms (13.99%)
Conv2DBackpropInput                 | 9008.68MB (31.75%)  | 123.57ms (19.92%)    | 122.27ms (20.04%)  | 1.29ms (12.65%)
ReluGrad                            | 0B (0.00%)          | 39.98ms (6.45%)      | 39.64ms (6.50%)    | 336µs (3.30%)
BiasAdd                             | 0B (0.00%)          | 27.69ms (4.46%)      | 27.29ms (4.47%)    | 401µs (3.94%)
Relu                                | 0B (0.00%)          | 25.17ms (4.06%)      | 24.84ms (4.07%)    | 322µs (3.16%)
MaxPoolGrad                         | 834.90MB (2.94%)    | 19.80ms (3.19%)      | 19.56ms (3.21%)    | 233µs (2.29%)
BiasAddGrad                         | 53.76KB (0.00%)     | 16.41ms (2.65%)      | 15.87ms (2.60%)    | 537µs (5.27%)
MatMul                              | 500.99MB (1.77%)    | 12.83ms (2.07%)      | 12.52ms (2.05%)    | 300µs (2.95%)
AddN                                | 0B (0.00%)          | 12.77ms (2.06%)      | 12.01ms (1.97%)    | 752µs (7.39%)
ApplyGradientDescent                | 0B (0.00%)          | 12.29ms (1.98%)      | 11.58ms (1.90%)    | 712µs (6.99%)
Mul                                 | 560.80MB (1.98%)    | 8.74ms (1.41%)       | 7.92ms (1.30%)     | 812µs (7.97%)
MaxPool                             | 197.39MB (0.70%)    | 6.66ms (1.07%)       | 6.45ms (1.06%)     | 204µs (2.00%)
L2Loss                              | 66.56KB (0.00%)     | 4.77ms (0.77%)       | 3.67ms (0.60%)     | 1.09ms (10.68%)
Transpose                           | 19.27MB (0.07%)     | 461µs (0.07%)        | 424µs (0.07%)      | 37µs (0.36%)
SparseSoftmaxCrossEntropyWithLogits | 768B (0.00%)        | 319µs (0.05%)        | 216µs (0.04%)      | 102µs (1.00%)
RandomUniformInt                    | 256B (0.00%)        | 100µs (0.02%)        | 7µs (0.00%)        | 92µs (0.90%)
Sum                                 | 256B (0.00%)        | 35µs (0.01%)         | 3µs (0.00%)        | 32µs (0.31%)
Select                              | 0B (0.00%)          | 32µs (0.01%)         | 3µs (0.00%)        | 29µs (0.28%)
Add                                 | 0B (0.00%)          | 29µs (0.00%)         | 3µs (0.00%)        | 26µs (0.26%)
RealDiv                             | 0B (0.00%)          | 26µs (0.00%)         | 2µs (0.00%)        | 24µs (0.24%)

Chapter 5

Discussion

5.1 GPU vs. CPU

Based on the profiling data presented in Section 4.4, we conclude that improvements to the GPU, rather than the CPU, have greater potential for improving overall performance on our setup. For some networks, however, especially ResNet-152, the difference in potential is small. Note that greater potential for improvement in this sense does not mean that improvement is easy or even possible; it means, for example, that improving the CPU can affect the overall performance of AlexNet or VGG16 only through the roughly 2% of the work that is executed on the CPU. Furthermore, the distribution of computations between the processing units depends on the implementation, especially since TensorFlow gives the user considerable control over which processing unit is used.
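The limit described above is an instance of Amdahl's law: the overall speedup from improving one processing unit is bounded by the fraction of runtime spent on it. A minimal sketch of the calculation, where the 2%/98% CPU/GPU split is the approximate AlexNet/VGG16 figure from Section 4.4 and the 2x GPU speedup is an illustrative assumption:

```python
def amdahl_speedup(improvable_fraction, factor):
    """Overall speedup when a fraction of the runtime is sped up by
    `factor` (Amdahl's law)."""
    return 1.0 / ((1.0 - improvable_fraction) + improvable_fraction / factor)

# Roughly 2% of a training step runs on the CPU for AlexNet/VGG16.
# Even an infinitely fast CPU barely improves a whole step:
print(amdahl_speedup(0.02, float("inf")))  # ~1.02

# Speeding up the GPU-side 98% by an assumed factor of 2 helps far more:
print(amdahl_speedup(0.98, 2.0))           # ~1.96
```

For ResNet-152, where a larger share of the step runs on the CPU, the same formula yields a correspondingly smaller gap between the two options.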

5.2 GPU Memory Capacity

Our results suggest that GPU memory capacity influences performance in two ways. Firstly, more memory allows one to use larger batch sizes during training. However, whether a large batch size is advantageous for performance is a nontrivial question. A large batch size usually means better performance in terms of images/sec, but the model will need to process each image in the training set more times to reach the same accuracy compared to on-line learning or a smaller batch size. Large batch sizes tend to outperform smaller ones in throughput because they increase the computational parallelism [11]. Furthermore, research suggests that training with a large batch size reduces the quality of the trained model in terms of generalisation [8][7]. Thus, the ability to use larger batch sizes does not necessarily lead to better performance, but we cannot conclude that it is not useful in some situations.

Secondly, our results on how performance is affected by batch size suggest that performance drops significantly when the GPU is working close to the limit of its memory capacity. Although this can be observed to some extent for all networks in Figure 4.1, the results presented in Section 4.3 only show this behaviour for AlexNet. These conflicting results suggest that the drop in performance caused by working close to the memory limit depends on the batch size. Nevertheless, we can conclude that in some situations increased memory capacity can significantly improve the performance of training CNNs in TensorFlow. A possible explanation is that, even though the memory is sufficient for each computation in isolation, the constrained memory reduces the opportunity for parallel execution. To determine the specific cause, one would have to examine in greater detail when during execution the limited memory capacity acts as a bottleneck and why that is the case.
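The first effect can be made concrete with a back-of-the-envelope estimate: activation memory grows roughly linearly with batch size, so available GPU memory caps the largest usable batch. A hedged sketch of the estimate, where the layer shapes are illustrative and not taken from the benchmarked networks, and memory reuse by the framework is ignored:

```python
def activation_bytes(batch, feature_maps, dtype_bytes=4):
    """Rough activation memory for a list of (height, width, channels)
    feature maps, assuming float32 values and no memory reuse."""
    return batch * dtype_bytes * sum(h * w * c for (h, w, c) in feature_maps)

# Illustrative VGG-like early layers (not the exact benchmark shapes).
maps = [(224, 224, 64), (112, 112, 128), (56, 56, 256), (28, 28, 512)]

for batch in (16, 32, 64):
    gib = activation_bytes(batch, maps) / 2**30
    print(f"batch {batch:3d}: ~{gib:.2f} GiB of activations")
```

Doubling the batch size doubles this estimate, which is why the maximum batch size in Section 4.2 differs so much between the memory-hungry and the lightweight models.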

5.3 Limitations and future work

In this section we discuss the limitations of our study and their effects on the conclusions that can be drawn from it. We also propose ways in which future work could give more conclusive answers to our research question.

Because we have conducted our study on a single system, we can strictly draw conclusions only about that exact system. We can, however, expect the conclusions to apply to some extent to similar systems. By including several different systems that can also be considered personal computers not specialised for machine learning, we could have reached conclusions that are more widely applicable.

As mentioned in Section 3.2, we did not use real ImageNet data in our tests. However, running the benchmark on synthetic data does not seem to skew the results: TensorFlow’s own benchmarks [3] tested this, and their results show little difference in performance between real and synthetic data. This suggests that our results from benchmarking with synthetic data are applicable to a real-world scenario.

We could also have added a dimension to our investigation by changing the compute power of our test environment through ”clocking” various hardware components, such as increasing the clock speed of the CPU and/or the GPU. This could possibly have an impact on performance.

A more in-depth analysis of the profiling information would plausibly lead to additional conclusions regarding possible performance improvements. We were able to identify that the same types of computation demand the most resources for all of the network models we investigated. A more in-depth analysis of how the load these computations put on the system can be mitigated – for example through specialised hardware or a better implementation – could therefore lead to significant performance improvements for all of the network models tested.

Chapter 6

Conclusions

We have provided a benchmark for TensorFlow on our system, which can serve as a reference when comparing with other systems. Moreover, the results of this study indicate that in some situations increased GPU memory capacity leads to increased performance. Furthermore, for the five network models tested, our results suggest that when running TensorFlow with maximum images per second in mind, picking a batch size equal to a power of two is advisable. We also conclude that for our system the performance is limited by the GPU rather than the CPU. Considering the amount of execution time spent on the GPU compared with the CPU, our advice to anyone running TensorFlow on a system similar to ours would be to prioritise upgrading the GPU rather than the CPU.

Bibliography

[1] Martin Abadi et al. “TensorFlow: A System for Large-Scale Machine Learning”. In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). Savannah, GA: USENIX Association, 2016, pp. 265–283. ISBN: 978-1-931971-33-1. URL: https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadi.

[2] Rachel Allen and Michael Li. Ranking Popular Deep Learning Libraries for Data Science. Oct. 2017. URL: https://www.kdnuggets.com/2017/10/ranking-popular-deep-learning-libraries-data-science.html.

[3] Benchmarks | TensorFlow. https://www.tensorflow.org/performance/benchmarks. Accessed: 2018-03-23.

[4] Andre Esteva et al. “Dermatologist-level classification of skin cancer with deep neural networks”. In: Nature 542.7639 (2017), p. 115.

[5] Peter Goldsborough. “A Tour of TensorFlow”. In: CoRR abs/1610.01178 (2016). arXiv: 1610.01178. URL: http://arxiv.org/abs/1610.01178.

[6] Kaiming He et al. “Deep Residual Learning for Image Recognition”. In: CoRR abs/1512.03385 (2015). arXiv: 1512.03385. URL: http://arxiv.org/abs/1512.03385.

[7] Elad Hoffer, Itay Hubara, and Daniel Soudry. “Train longer, generalize better: closing the generalization gap in large batch training of neural networks”. In: Advances in Neural Information Processing Systems. 2017, pp. 1729–1739.


[8] Nitish Shirish Keskar et al. “On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima”. In: CoRR abs/1609.04836 (2016). arXiv: 1609.04836. URL: http://arxiv.org/abs/1609.04836.

[9] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. “ImageNet Classification with Deep Convolutional Neural Networks”. In: Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1. NIPS’12. Lake Tahoe, Nevada: Curran Associates Inc., 2012, pp. 1097–1105. URL: http://dl.acm.org/citation.cfm?id=2999134.2999257.

[10] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. “ImageNet Classification with Deep Convolutional Neural Networks”. In: Advances in Neural Information Processing Systems 25. Ed. by F. Pereira et al. Curran Associates, Inc., 2012, pp. 1097–1105. URL: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.

[11] Dominic Masters and Carlo Luschi. “Revisiting Small Batch Training for Deep Neural Networks”. In: arXiv preprint arXiv:1804.07612 (2018).

[12] NVIDIA. GTC 2018 Keynote with NVIDIA CEO Jensen Huang. Youtube. 2018. URL: https://youtu.be/95nphvtVf34?t=1h9m54s.

[13] Kaz Sato, Cliff Young, and David Patterson. An in-depth look at Google’s first Tensor Processing Unit (TPU) | Google Cloud Big Data and Machine Learning Blog | Google Cloud. May 2017. URL: https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at--first-tensor-processing-unit-tpu.

[14] Karen Simonyan and Andrew Zisserman. “Very Deep Convolutional Networks for Large-Scale Image Recognition”. In: CoRR abs/1409.1556 (2014). arXiv: 1409.1556. URL: http://arxiv.org/abs/1409.1556.

[15] Christian Szegedy et al. “Rethinking the Inception Architecture for Computer Vision”. In: CoRR abs/1512.00567 (2015). arXiv: 1512.00567. URL: http://arxiv.org/abs/1512.00567.