DEPARTMENT OF INFORMATICS

TECHNISCHE UNIVERSITÄT MÜNCHEN

Bachelor’s Thesis in Informatics

Evaluation of Machine Learning Inference Workloads on Heterogeneous Edge Devices

Evaluierung von Inferenzaufgaben für maschinelles Lernen auf heterogenen Edge-Geräten

Author: Marco Rubin
Supervisor: Prof. Dr. rer. nat. Martin Schulz
Advisors: Dai Yang, Amir Raoofy
Submission Date: 16th of March 2020

I confirm that this bachelor’s thesis in informatics is my own work and I have documented all sources and material used.

Munich, 16th of March 2020
Marco Rubin

Abstract

Machine learning has traditionally been the domain of powerful workstations and even supercomputers. Recent increases in computational power and the development of specialized architectures have made it possible to perform machine learning, especially inference, on the edge. Such edge devices include small computers like the Nvidia Jetson Nano and accelerators like the Neural Compute Sticks. In this thesis these devices and their respective frameworks are compared with each other and with an Apple MacBook Pro, which represents a more traditional machine learning device. The comparisons are made using an almost identical inference script across all platforms. The maximum possible throughput is also tested with tools specialized for the respective framework. Four models with different numbers of convolutional layers were used to put a special emphasis on this kind of layer. The precision of the models dropped after the conversion to OpenVINO for the Neural Compute Sticks, but remained equally high on the other frameworks. Regarding performance, the MacBook was still the fastest device overall, but lost proportionally more performance than the edge devices as the number of convolutional layers increased. Amongst the edge devices, the Jetson Nano was the fastest, followed by the 2nd gen Neural Compute Stick and the 1st gen Neural Compute Stick.

Contents

Abstract

1 Introduction
  1.1 Introduction
  1.2 A brief Overview of Neural Networks
    1.2.1 Structure
    1.2.2 Training
    1.2.3 Inferencing
    1.2.4 Layers
  1.3 Goal of the Thesis

2 Implementation
  2.1 Experiences
  2.2 Final Implementation
    2.2.1 Training
    2.2.2 Conversion to OpenVINO
    2.2.3 Inference
    2.2.4 Limitations of the Final Implementation

3 Experiments
  3.1 Setup
    3.1.1 Hardware
    3.1.2 Software
  3.2 Benchmarking Methods
  3.3 Limitations of the Benchmarking Methods

4 Results
  4.1 TensorFlow
    4.1.1 Results of the Evaluate Script
    4.1.2 Results of the Predict Script
  4.2 OpenVINO
    4.2.1 Results of the Benchmark App
    4.2.2 Results of the Predict Script
  4.3 ONNX Runtime
  4.4 Comparison of the Results between Frameworks
    4.4.1 Accuracy
    4.4.2 Runtimes

5 Conclusion and Outlook
  5.1 Conclusion
  5.2 Outlook

List of Figures

List of Tables

Bibliography

1 Introduction

1.1 Introduction

Since the last century, informatics has been a driving force for all kinds of related industries. However, in all that time, computers were not able to gain knowledge while processing data. Learning from experience is a typically human behaviour, enabling people to get better at their tasks. Machine learning tries to replicate this. It is believed to be able to solve problems in automation, classification, recognition and many more domains. Its applications include, amongst others, recognizing specific persons' faces in pictures and analysing speech samples. Furthermore, it can be used in robots to detect objects, or to warn self-driving cars when an obstacle is recognized. While the idea of machine learning is not new (the term was already used in 1959 by Arthur Samuel while working at IBM [1]), the ongoing development in processor technology has made it increasingly useful for the variety of applications it serves today. These developments enabled its use on increasingly smaller devices, in contrast to the supercomputers and workstations needed before. The latest target devices are the so-called edge devices, which include small computers like the Raspberry Pi and Nvidia Jetson and accelerator devices like the Intel Movidius NCS, short for Neural Compute Stick [2]. Their market is quite heterogeneous, as for many applications there is a specialised device. Edge devices do not possess the massive computing power of a supercomputer, but can provide real-time data processing close to the data source. One application field would be facial recognition performed decentrally, close to the security camera. Because of the small size of an edge device, it could even be integrated into the security camera.

1.2 A brief Overview of Neural Networks

The following introduction to neural networks covers the structure, the training process and inference of neural networks, as well as a short explanation of different layers. As this thesis only uses classification neural networks, all of the following explanations refer to this kind of network.

1.2.1 Structure

A neural network consists of neurons which loosely resemble the neurons of a brain [3]. Neurons use an input and an activation function to compute an output. They are aggregated in layers. Between layers, the neurons are connected by edges with different weights as parameters.

Figure 1.1: A simple neural network with three layers
source: https://commons.wikimedia.org/wiki/File:Neural_network_example.svg

These weights model the probability that an edge is taken when a signal runs through the network. A higher value indicates a higher probability of taking the edge [4]. Networks consist of an input layer, which handles the input (for example a picture), and an output layer, which gives the probabilities for the respective classes. In between these layers there may be one or more hidden layers. In figure 1.1, each layer is indicated by a different colour, for example green for the input layer. The weights are indicated by the thickness of the arrows. The term deep neural network is used when the network has more than one hidden layer [5].

1.2.2 Training

In the training process of a neural network, the network learns by predicting test data. For every sample the predicted result is compared to the provided label. If they match, the weights of the used edges are increased; if not, they are decreased. The input shape specifies the way data is presented to the network. For example, the input shape could specify the batch size, the height and width of the picture and the number of channels. The batch size specifies the number of samples which are processed simultaneously. After each batch, the weights of the network are updated [6]. Height and width of the picture specify the number of pixels in the respective direction. The channel count depends on the colours of the picture. For example, an RGB picture takes three channels, while one in greyscale only requires one [7]. An epoch represents a training run over the whole input dataset. Typically, multiple epochs are used for training. More epochs usually lead to better results, but too many make the network prone to overfitting, which leads to worse accuracy on unknown data as the network is too closely adapted to the training data. Accuracy is defined as the number of correct predictions divided by the total number of predictions [8]. A trained neural network can be saved as a model which includes the nodes and edges as well as the weights. The model can later be loaded for further training or inference runs.

1.2.3 Inferencing

Inferencing describes the process of using a previously trained network to make predictions on new data. Just like in the training process, the network is presented with an input which it processes. Unlike in the training process, the output cannot be compared to a label to verify it. Instead, the class with the highest probability is taken as the prediction of what the input is. More complex models often have higher accuracy but tend to be slower. Batching can also be used in inference to increase performance.

1.2.4 Layers

Figure 1.2: Visual representation of one of the used models in this thesis


Neural networks consist of multiple layers which serve different purposes. Some of the most important ones are described below. Knowing the different layers and their impact on accuracy and inference times is crucial to understand the benchmark results in the following chapters of the thesis. In figure 1.2 the layers of the network are shown top-down. Additional parameters are in the boxes underneath the name of the layer.

Convolutional layers

One of the latest revolutions in neural networks was the introduction of convolutional layers. Their purpose is the extraction of features, such as the detection of edges, by considering smaller squares of a picture. This is done by sliding a smaller matrix, called the filter, over the matrix representing the picture. The filter is filled with values which describe the shape of the feature to be extracted; in a trained network these values are learned. For each position of the filter, the dot product of the filter and the underlying part of the picture is computed. This is shown in figure 1.3, where the blue lines refer to the multiplication of the elements of the input matrix and the filter. The green lines represent the addition of the previously multiplied values. The stride determines how many pixels the filter slides each step.

Figure 1.3: Functionality of a convolutional layer
source: https://www.researchgate.net/figure/Outline-of-the-convolutional-layer_fig1_323792694

In order to enable sliding the filter closer to the border, Zero-padding is used. This process extends the original image and fills the newly created pixels with zeros [7].
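To make the sliding dot product described above concrete, the following minimal NumPy sketch applies a single 3x3 filter to a toy 5x5 picture; the filter values and the picture are illustrative assumptions and are not taken from the models used later in this thesis.

    import numpy as np

    def convolve2d(image, kernel, stride=1):
        # Slide the filter over the picture; at each position multiply
        # element-wise with the underlying patch and sum the products.
        kh, kw = kernel.shape
        out_h = (image.shape[0] - kh) // stride + 1
        out_w = (image.shape[1] - kw) // stride + 1
        out = np.zeros((out_h, out_w))
        for y in range(out_h):
            for x in range(out_w):
                patch = image[y * stride:y * stride + kh, x * stride:x * stride + kw]
                out[y, x] = np.sum(patch * kernel)
        return out

    # Toy 5x5 "picture" containing a vertical edge and a 3x3 vertical-edge filter
    image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float)
    vertical_edge = np.array([[-1, 0, 1],
                              [-1, 0, 1],
                              [-1, 0, 1]], dtype=float)
    print(convolve2d(image, vertical_edge))  # strongest response where the filter straddles the edge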

Pooling layers

Pooling layers are typically used after convolutional layers. They are used to prevent overfitting by reducing the parameter count, which also reduces the computational complexity. They operate by taking the maximum, the average, or the sum of the values in parts of the input matrix. A possible size of one part would be 2x2. Probably the most well-known variant is MaxPooling, which selects the maximum of all values in the part [7].


Flatten layers

Flatten layers are used to transform an input matrix to a vector [9].

Dense layers

The simplest layers are Dense layers, also called Fully Connected layers. They connect every neuron of one layer with every neuron of another layer [9].

Dropout layers

Dropout layers are used to prevent overfitting by randomly dropping out neurons and their edges during the training of a neural network. Thereby otherwise less-used paths are taken and their weights refined [10].
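As a minimal sketch of how these layer types fit together in Keras (the interface used later in this thesis), the following toy classifier combines one of each; the layer sizes are illustrative and do not exactly match the four models used for the benchmarks.

    from tensorflow import keras
    from tensorflow.keras import layers

    # Toy MNIST classifier combining the layer types described above
    model = keras.Sequential([
        layers.Conv2D(32, kernel_size=(3, 3), activation="relu",
                      input_shape=(28, 28, 1)),   # convolutional layer: feature extraction
        layers.MaxPooling2D(pool_size=(2, 2)),    # pooling layer: fewer parameters
        layers.Flatten(),                         # flatten layer: matrix to vector
        layers.Dense(128, activation="relu"),     # dense (fully connected) layer
        layers.Dropout(0.2),                      # dropout layer: counter overfitting
        layers.Dense(10, activation="softmax"),   # output layer: class probabilities
    ])
    model.summary()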

1.3 Goal of the Thesis

The goal of this Bachelor thesis is to further investigate the current state of edge accelerators, including small computers and accelerator devices alike, by using a simple neural network. The network is trained on one device and then ported to the other devices to benchmark the inference process on the respective device. Using multiple models trained with the same parameters on different devices is not an option because the training process is not deterministic and therefore does not lead to equal models. Both the benchmarking and the process of porting the model are subjects of investigation. For a meaningful comparison, the most important frameworks and edge accelerators from Google, Nvidia and Intel are included.

2 Implementation

2.1 Experiences

Early stages

The first task was to define the hardware to be tested. The initial selection included the Nvidia Jetson and the Intel Neural Compute Sticks (NCS) of the first and second generation. For comparison, an Apple MacBook Pro 2015 13-inch was used. Raspberry Pis of the 3rd and 4th generation hosted the NCSs. Therefore, the target frameworks were TensorRT, OpenVINO and TensorFlow. It had to be decided whether to get a pretrained model or to train one myself in TensorFlow. The model should be simple and widely known to provide good comparability. After some research I decided to focus on MNIST, a database of handwritten digits [11]. Due to the non-availability of a fitting model across all platforms, I decided to train the model myself. The software of the Nvidia Jetson was updated to JetPack version 4.2.2 for better compatibility with TensorFlow. The first training script [12] which I tried proved incompatible with the Jetson, as it only supported TensorFlow 2, whereas the edge device was running TensorFlow 1.15. For that reason, the choice fell on an older script [13]. This script uses the native TensorFlow API. After successful training, the model was saved in the pb data format. The pb format is created using protocol buffers, a method to serialize structured data developed by Google [14]. As a next step, this model was to be converted to OpenVINO's model format.

Conversion of the model to OpenVINO

Converting the exported pb file using the official OpenVINO model_optimizer failed, because it could not read the encoding, which was neither UTF-8 nor binary. I worked around this issue by saving the model as a pbtxt, a UTF-8 encoded text file (by setting the flag "as_text = true" in the save method of the training script). After setting the corresponding flags in the model optimizer, the conversion was successful. The next task was to create an inference script for OpenVINO, which I based on a sample script [15]. As input data I used MNIST saved as PNG pictures [16]. These were imported in grayscale using OpenCV. In a first experiment, single pictures were predicted. This was successful, but I discovered that both generations of the NCS only support FP16 (16-bit floating point numbers), so I had to convert the model to FP16 precision. When using FP16 across all devices, the NCSs were less precise than the MacBook. While the difference from the NCS1 to the MacBook was slight, the NCS2 was far less precise than the first generation for an unknown reason. Figures 2.1 and 2.2 display this difference in precision.

Figure 2.1: Precision of the NCS2
Figure 2.2: Precision of the MacBook

These screenshots show the probabilities of the classes, in this case the numbers from zero to nine, when I ran an inference with the same sample image on both devices. The probabilities were not normalized, so there are positive and negative values. Starting with those tests, the MacBook was used as host for the sticks instead of the Raspberry Pis. To perform inference on all of the 10,000 test pictures, a few more changes were made to the inference script: the input was now a list of n-dimensional arrays which contained one picture each, plus a second list for the labels. A loop compared the label with the prediction of highest probability and incremented the number of right or wrong predictions respectively. However, because of the input shape of the converted model, the batch size was fixed to 1.

Focusing on TensorFlow inference

After getting a working inference script in OpenVINO, I focused on doing the same in TensorFlow. First, I wanted to determine which input format TensorFlow expects, in order to adjust the import of the pictures accordingly. It uses inputs in the (?,28,28) format and stores the pixel values as floats in greyscale, whereby 1.0 represents black. Inferencing in TensorFlow posed major problems, as the original pb model file could not be imported and further inference attempts using saved_model.loader and Predictor (from the tensorflow.contrib package) failed.

General observations

While experimenting with the TensorFlow inference, I made multiple observations. Models trained on the Jetson using the TensorFlow MNIST training script did not work on the MacBook and ran with poor accuracy on the Jetson. When converting these models to OpenVINO, runtime was poor on the MacBook. In contrast, models trained on the MacBook ran much faster in OpenVINO.

Getting a pretrained ONNX model

Because training in TensorFlow did not work out, it was decided to get a pretrained ONNX model and convert it to TensorFlow and OpenVINO models. However, this did not work using TensorFlow's original API, because the model missed the "serve" tag. Inference in OpenVINO was working fine and the precision issue of the NCS2 did not appear anymore. Due to inconsistent results of the inference runtime in OpenVINO, I decided to include the built-in benchmark app (the C++ implementation) in the benchmarks, since the Python version did not run.

Using Keras for the training script

After having the previously mentioned issues with TensorFlow's native API, I decided to give Keras a try. For a first implementation of a training script I used a tutorial [17] where I only exchanged the Fashion-MNIST dataset with the regular MNIST dataset. Training worked well, and the prediction also worked fine for the first time in TensorFlow, so I decided to use Keras for the whole TensorFlow part from then on. The model was saved in h5 and, using keras2onnx [18], in ONNX for the conversion to OpenVINO, which was also successful. For the first time I was able to run a consistent model across all devices.

Solving the TensorFlow accuracy problem (mentioned in general observations)

While adapting the TensorFlow inference script for Keras, I fixed the issue which led to the poor accuracy. It directly relates to a bitwise not operation in the script, which was used to mimic the way the test pictures of the original implementation were saved (inverted grayscale [19]); a small sketch of the operation is shown below.
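The following lines are a rough sketch of the preprocessing involved (the path is a placeholder and the exact scaling follows the respective script); applying the inversion when the model was trained on non-inverted data silently ruins accuracy without producing any error.

    import cv2
    import numpy as np

    # placeholder path into the MNIST png test set
    img = cv2.imread("mnist_png/testing/8/61.png", cv2.IMREAD_GRAYSCALE)  # uint8, 28x28

    inverted = cv2.bitwise_not(img)   # flips every pixel value: 0 <-> 255 (inverted grayscale)

    # shape one sample for a (N, 28, 28, 1) model input
    sample = img.astype(np.float32).reshape(1, 28, 28, 1)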

Experimenting with Async inference in OpenVINO

After discovering a sample script for asynchronous inference in OpenVINO [20], I started experimenting with it. However, there were no performance gains, probably due to the batch size being fixed to 1.

Adding multiple models to the comparison

As the previous experiments have shown, depending on the model, different platforms have their benefits regarding throughput and accuracy. Therefore, multiple models should be included in the final benchmarks. Four models were chosen, three of which are named after their number of convolutional layers, 0_conv [17], 1_conv [7] and 2_conv [21], as well as a slightly changed version of the popular LeNet-5 network [22]. For compatibility reasons, all of them use (?,28,28,1) inputs. The number of epochs for training was optimised, ending up at 40 epochs for 0_conv and 1_conv and 35 epochs for 2_conv and LeNet-5.

Experimenting with batch size in TF using built-in evaluate

After the decision to additionally use the built-in evaluate function of the Keras API in TensorFlow, I had the chance to experiment with the batch size, which was fixed to 1 in the other inference scripts. The evaluate function uses 32 as the default batch size. However, runtimes can be optimised using larger batches; for example, 512 proved to be best on the MacBook for LeNet-5.


Since the other scripts only support batch size 1, I decided not to investigate this topic further.

Focusing on TensorRT

With TensorFlow and OpenVINO ready for the final benchmarks, it was time to focus on the last part, TensorRT. It needs a CUDA-enabled GPU to run, so running it on the MacBook was not an option. Using Windows with the popular data science platform Anaconda and experimenting with Docker containers also did not result in working solutions. So, the last option was to build it on the Jetson. Due to version incompatibilities with TensorRT 5 on the Jetson Nano, its firmware was upgraded to JetPack 4.3, which includes TensorRT 6 and is compatible with the onnx_tensorrt [23] version we could get. This extension was used to successfully convert the model to the TRT format. However, given the complexity of TensorRT's API, I also wanted to use ONNX as frontend in conjunction with TensorRT's backend. This was not successful, as importing the model failed.

Experimenting with onnx-tf

During this timeframe I also experimented with onnx-tf [24], which uses ONNX as frontend for TensorFlow. A problem occurred here, as onnx-tf did not support convolutional layers. The only model without convolutional layers, 0_conv, was executed, but took around 45 minutes on the MacBook and far over one hour on the Jetson. Accuracy was fine. The problem seemed to be that it reloads the session for every inference request.

Final benchmarking

With all planned experiments done, I started the final benchmarking session with five runs for every measurement. During the benchmark runs, the NCSs of both generations disconnected after a couple of runs. The cause of this is still not known to me.

Adding ONNX runtime to the benchmarks

As the last framework, the ONNX runtime was added. It worked out of the box on the MacBook and produced faster inference times there than both TensorFlow and OpenVINO, with comparable accuracy. Installation on the Jetson using TensorRT as accelerator is also officially supported. After fixing some issues in the building process, it ran on the Jetson with competitive execution times as well. Given these results, I decided to include benchmarks with the ONNX runtime in this thesis.


2.2 Final Implementation

All of the scripts mentioned below are written in Python unless stated otherwise. Figure 2.3 gives an overview of the scripts used (grey) and models (white).

Figure 2.3: Overview of the training and inference scripts

2.2.1 Training

For the final implementation, four models were trained on the MacBook. All of them were created and trained using the Keras API in TensorFlow 1.15.0. After training, the models were saved in Keras' h5 format as well as in the more universal ONNX data format for inference with the ONNX runtime and for conversion to OpenVINO. Saving in ONNX is done by first converting the Keras model to ONNX using the keras2onnx extension and then saving it with ONNX. As the training dataset, MNIST was imported from the predefined datasets of Keras. The models were trained using FP32 precision. The four models mainly differ in the number of convolutional layers. Therefore, the names 0_conv, 1_conv, 2_conv and LeNet-5 are used. The parameters of the layers are kept similar across the models to strengthen the effect of the layers themselves. In figure 2.4 all of the models are shown side by side to give an impression of their different complexity. As mentioned before in the introduction, the layers of each model are arranged top-down in the figure. The attributes of the layers, shown in the boxes underneath the names of the layers, are also of great importance. While the LeNet-5 model has the most layers of these four models, 1_conv and 2_conv work on more complex data during a run through the model, as can be seen when comparing the kernel sizes of the Dense layers following the Flatten layer in the respective networks. Therefore the computational effort is 0_conv < LeNet-5 < 1_conv < 2_conv.


(a) 0_conv

(b) 1_conv (c) 2_conv

(d) LeNet-5

Figure 2.4: The four models side by side to show the added complexity


Input format

All of the models expect inputs in the (?,28,28,1) format. The LeNet-5 model was slightly modified to accept inputs in this format instead of the normally specified (?,32,32,1) format. The "?" denotes the batch size, which is left undeclared so that it can be set at runtime. The "28,28" (respectively "32,32") gives the pixels of the picture, in this case 28 in height and 28 in width. The "1" is the number of channels used and stands for grayscale. This input format is generally known as (N,H,W,C).
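A small sketch of this layout, assuming NumPy arrays as they are used by the inference scripts later on:

    import numpy as np

    # A batch of eight 28x28 grayscale pictures in (N, H, W, C) layout
    batch = np.zeros((8, 28, 28, 1), dtype=np.float32)
    print(batch.shape)   # (8, 28, 28, 1)

    # The inference scripts in this thesis feed one picture per request, i.e. N = 1
    single = batch[0:1]  # shape (1, 28, 28, 1)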

Compilation and fitting

The models were compiled using the "Adam" optimizer, "sparse_categorical_crossentropy" as the loss function and accuracy as the metric. Fitting the models was done with the batch size set to 128. For the 0_conv and 1_conv models I used 40 epochs; 2_conv and LeNet-5 were trained with 35 epochs for optimal accuracy.
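A condensed sketch of the training and export steps described above; the model definition is only a stand-in roughly corresponding to 0_conv (see figure 2.4 for the real layer configurations), and the file names and pixel scaling are assumptions.

    import onnx
    import keras2onnx
    from tensorflow import keras

    # MNIST from the predefined Keras datasets, reshaped to (N, 28, 28, 1)
    (x_train, y_train), _ = keras.datasets.mnist.load_data()
    x_train = x_train.reshape(-1, 28, 28, 1).astype("float32") / 255.0

    # Stand-in model roughly corresponding to 0_conv
    model = keras.Sequential([
        keras.layers.Flatten(input_shape=(28, 28, 1)),
        keras.layers.Dense(128, activation="relu"),
        keras.layers.Dense(10, activation="softmax"),
    ])

    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_train, y_train, batch_size=128, epochs=40)  # 35 epochs for 2_conv and LeNet-5

    # Save in h5 and, via keras2onnx, in ONNX for the ONNX runtime and OpenVINO
    model.save("0_conv.h5")
    onnx.save_model(keras2onnx.convert_keras(model, model.name), "0_conv.onnx")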

2.2.2 Conversion to OpenVINO

To convert the models trained in TensorFlow to OpenVINO, I used the official "model_optimizer" provided by Intel. As input, the model in the ONNX format was used. The "input_shape" parameter was set to (1,28,28,1). I saved a version in FP16 as well as one in FP32 to examine whether there are any differences in accuracy.
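The conversion calls could look roughly like this (file and directory names are placeholders; mo.py ships with the OpenVINO toolkit in the model optimizer directory). Each call produces an .xml/.bin pair, the intermediate representation that the inference engine loads.

    python3 mo.py --input_model 0_conv.onnx --input_shape [1,28,28,1] --data_type FP32 --output_dir ir/fp32
    python3 mo.py --input_model 0_conv.onnx --input_shape [1,28,28,1] --data_type FP16 --output_dir ir/fp16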

2.2.3 Inference

MNIST dataset

The test dataset of MNIST contains 10,000 pictures of handwritten digits from 0 to 9. In this thesis I used a dataset consisting of pictures in the png data format which I got from a repository on GitHub [16]. The colour values of the pixels are saved as integer values between 0 and 255. The pictures are saved in corresponding folders named from 0 to 9. All of the pictures are 28x28 pixels in grayscale.

Figure 2.5: Picture of a handwritten 8 from the MNIST dataset

source: https://github.com/myleott/MNIST_png


TensorFlow

For testing the performance in TensorFlow I used two scripts. Both use the Keras API. The first one imports the model in the h5 format and gets the MNIST dataset from the predefined datasets of Keras. It then uses the built-in evaluate function of the Keras interface, which takes a set of pictures and labels as well as a batch size as input parameters. For my tests, the batch size was left at the default value of 32. The second script also imports the model as a Keras model in the h5 format, but loads the pictures from the MNIST png dataset. They are imported in grayscale using OpenCV. The colour values of the pixels are converted to float values and each resulting image is saved in a list of numpy arrays in the format of (1,28,28,1). The script then loops through the list using the built-in predict function of Keras. From the output, the highest probability value is picked, and its class is compared to the one in a label list created beforehand. This script serves as the basis for the implementations in OpenVINO and ONNX; a condensed sketch of both scripts is shown below.
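The following sketch condenses the two TensorFlow scripts described above; paths and the exact pixel scaling are assumptions, and the real scripts additionally time their runs.

    import glob
    import cv2
    import numpy as np
    from tensorflow import keras

    model = keras.models.load_model("0_conv.h5")

    # Script 1: built-in evaluate on the predefined Keras MNIST test set, batch size 32
    (_, _), (x_test, y_test) = keras.datasets.mnist.load_data()
    x_test = x_test.reshape(-1, 28, 28, 1).astype("float32") / 255.0
    loss, accuracy = model.evaluate(x_test, y_test, batch_size=32)

    # Script 2: predict loop over the MNIST png pictures, batch size fixed to 1
    samples, labels = [], []
    for digit in range(10):
        for path in glob.glob("mnist_png/testing/%d/*.png" % digit):
            img = cv2.imread(path, cv2.IMREAD_GRAYSCALE).astype(np.float32)
            samples.append(img.reshape(1, 28, 28, 1))
            labels.append(digit)

    correct = 0
    for sample, label in zip(samples, labels):
        probs = model.predict(sample)              # one picture per call
        correct += int(np.argmax(probs) == label)
    print("accuracy:", correct / len(labels))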

OpenVINO

For benchmarking in OpenVINO, two scripts are used as well. The first one is the official benchmark app of OpenVINO, which is written in C++. It imports the model and predicts inputs filled with random values for one minute. I used the benchmark app to test the theoretical throughput of the network. It uses the batch size of the model, in this case 1. For direct comparison, the second OpenVINO script is based on the second TensorFlow script and only has some minor differences: the colour values of the pixels are left unchanged as integers, and the model import and the inference calls are changed to the ones used by OpenVINO.
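A sketch of the OpenVINO variant of the predict loop, based on the classification sample it was derived from; the exact class and method names vary slightly between OpenVINO releases, the file names are placeholders, and device_name would be "CPU" on the MacBook or "MYRIAD" for the NCSs.

    import cv2
    import numpy as np
    from openvino.inference_engine import IENetwork, IECore

    ie = IECore()
    net = IENetwork(model="0_conv.xml", weights="0_conv.bin")  # IR from the model optimizer
    input_blob = next(iter(net.inputs))
    output_blob = next(iter(net.outputs))
    exec_net = ie.load_network(network=net, device_name="MYRIAD")

    img = cv2.imread("mnist_png/testing/8/61.png", cv2.IMREAD_GRAYSCALE)
    sample = img.reshape(1, 28, 28, 1)   # integer pixel values, layout as set during conversion

    result = exec_net.infer(inputs={input_blob: sample})
    prediction = int(np.argmax(result[output_blob]))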

ONNX

The ONNX script is also based on the second TensorFlow script. The only changed parts are the import of the model and the inference call; a sketch is shown below. On the Jetson, it uses TensorRT as accelerator.
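The ONNX runtime variant differs only in the session setup and the inference call; a sketch under the same assumptions as before (the file name is a placeholder):

    import numpy as np
    import onnxruntime as ort

    # On the Jetson, onnxruntime is built with the TensorRT execution provider
    session = ort.InferenceSession("0_conv.onnx")
    input_name = session.get_inputs()[0].name

    sample = np.zeros((1, 28, 28, 1), dtype=np.float32)  # one preprocessed MNIST picture
    probs = session.run(None, {input_name: sample})[0]
    prediction = int(np.argmax(probs))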

2.2.4 Limitations of the Final Implementation

To correctly interpret the results of the benchmarks, an understanding of the limitations of the implementation is crucial. In my self-written inference script, the batch size is fixed to 1. This is due to the implementation with its direct one-by-one comparison of prediction and label. With a larger batch size, higher performance is almost certain, as the experiments mentioned in the "Experiences" section showed; however, the whole implementation would have to be changed for that. The same problem exists for the asynchronous inference mode in OpenVINO, which, as also mentioned in the "Experiences" section, does not have any benefit when using batch size 1.

The second limitation is the use of TensorRT only as an accelerator for the ONNX runtime.


Thereby the performance could be reduced compared to a native TensorRT implementation. In contrast, using the ONNX runtime frontend provides the benefit of compatibility as the inference script is usable across multiple platforms.

3 Experiments

3.1 Setup

3.1.1 Hardware

For the benchmarks, the Nvidia Jetson Nano, the first and second generation of the Intel Movidius Neural Compute Stick and an Apple MacBook Pro 13-inch 2015 are used. In the following, those four devices are described in more detail.

Nvidia Jetson Nano

Figure 3.1: Nvidia Jetson Nano
source: https://upload.wikimedia.org/wikipedia/commons/c/c6/NVIDIA_Jetson_Nano_Developer_Kit_%2847616885631%29.jpg

The Jetson Nano was announced in March 2019 and is targeted at the hobbyist market as a cheap entry into machine learning, with a retail price of $99 [25]. Its specifications include a 64-bit quad-core ARM Cortex-A57 CPU and an Nvidia Maxwell GPU with 128 CUDA cores. The GPU delivers 472 GFLOPS on float operations. The 4 GB of LPDDR4 RAM are shared between CPU and GPU. Its energy consumption is specified as 10 watts. The board runs a derivative of Ubuntu. The version of Nvidia's JetPack installed in this thesis is 4.3, which includes Ubuntu 18.04 LTS aarch64 as well as multiple preinstalled tools, like OpenCV, compiled specifically for the Jetson Nano [26].


The Jetson was operated from the command line using an SSH connection during benchmarking to minimize load on the GPU.

Intel Movidius Neural Compute Stick (NCS1)

Figure 3.2: Intel Movidius Neural Compute Stick

The first generation of the Movidius NCS was released in July 2017. The NCS is a USB device which can be plugged into a USB port of the host device. Its price is about the same as that of the Nvidia Jetson Nano. The energy consumption is about one watt, excluding the host machine. The NCS is based on a Myriad 2 VPU which contains 12 SHAVE VLIW 128-bit vector processors running at 600 MHz at 0.9 V. It supports FP16 and FP32 operations as well as integer operations with 8, 16 and 32 bit precision [27]. The NCS was hosted by the MacBook for the benchmarking.

Intel Movidius Neural Compute Stick 2 (NCS2)

Figure 3.3: Intel Movidius Neural Compute Stick 2

The successor to the Movidius NCS was announced in November 2018. Like the original NCS, it can be plugged into a USB port of a host computer and supports hosts running Ubuntu 16.04.3 LTS (64 bit), Windows 10 (64 bit) and CentOS 7.4 (64 bit). Its suggested price is $69 USD as of July 14, 2019. It uses the Intel Movidius Myriad X VPU, which contains 16 SHAVE vector processors and a dedicated neural compute engine. It natively supports FP16 and 8-bit integer operations. Intel claims the processing power to be eight times that of the first-generation NCS [28]. The NCS2 was hosted by the MacBook for the benchmarking.


Apple MacBook Pro 13-inch 2015

The MacBook Pro 13-inch 2015 used was announced in March 2015 and runs macOS 10.15 (Catalina). The CPU is an Intel Core i5-5257U with two cores and a clock frequency of 2.70 GHz. 8 GB of LPDDR3-1866 RAM are installed. Throughout the benchmarks it was useful for the comparisons because it runs every used framework apart from TensorRT [29]. For all benchmarks on the MacBook, the device was plugged in to prevent throttling due to power constraints. The same programs were always running: Excel, iTerm2, PCalc, Boostnote, Dropbox, Google Drive, Magnet, DeepL and Macs Fan Control.

3.1.2 Software

The frameworks to use were determined by the hardware described in the previous subsection. This subsection gives an overview of them.

TensorFlow

TensorFlow is a machine learning framework developed by Google's Google Brain research team and was released as a free and open source software library in November 2015. It supports 64-bit Linux, macOS and Windows as well as Android and iOS. There are special variants of TensorFlow for GPUs, and TensorFlow Lite for edge devices. Since 2017 the open-source neural network library Keras, developed by François Chollet, has been part of TensorFlow, where it acts as an easy, high-level interface. Keras was released in March 2015. TensorFlow provides stable APIs for Python and C. The current version is 2.0.0, but in this thesis mostly 1.14 and 1.15 were used [30, 31].

OpenVINO

OpenVINO, short for Open Visual Inference & Neural Network Optimization, is a machine learning framework developed by Intel which was released in 2018. It was formerly known as the Intel Computer Vision SDK and primarily optimises the performance of neural networks on Intel hardware including CPUs, GPUs, VPUs (like the Movidius NCS) and FPGAs. Supported operating systems are Windows 10 (64 bit), Ubuntu 18.04.3 LTS (64 bit), CentOS 7.4 (64 bit), Yocto Project version Poky Jethro 2.0.3 (64 bit) and macOS (64 bit), but depending on the acceleration hardware used, not all operating systems can be used. It provides APIs for C and Python. The OpenVINO toolkit contains the model optimizer, which is used to import models from other frameworks like TensorFlow and to optimize them specifically for OpenVINO, and the inference engine, which runs the actual inference of the model. The inference engine is also able to run different layers on different hardware, for example the GPU and the CPU. The current OpenVINO version is 2019 R3.1; in this thesis 2019 R3 was used [32].


ONNX

ONNX, short for Open Neural Network Exchange, is an open source ecosystem for machine learning. It was announced in 2017 by Facebook and Microsoft. It is also supported by Intel, AMD, ARM and other companies. It provides its own format for the computation graph with built-in operators and data types. Supported frameworks include, among others, Caffe2, OpenVINO and OpenCV; TensorFlow and TensorRT are supported via an extension. ONNX models can also be directly executed using the ONNX runtime. Apart from the default CPU acceleration, MLAS (Microsoft Linear Algebra Subprograms) plus Eigen, it supports Nvidia CUDA, TensorRT, OpenVINO and more [33, 34].

TensorRT

TensorRT is a machine learning framework developed by Nvidia which mainly optimizes inference for their graphics processors. It is built on Nvidia's CUDA and provides support for FP32, FP16 and INT8 optimizations. TensorRT selects specific kernels depending on the target platform, for example Tesla GPUs or the Jetson Nano. It can import models from most machine learning frameworks including TensorFlow, Matlab and ONNX. ONNX can also be used as a frontend for TensorRT. The current version is TensorRT 7; however, in this thesis TensorRT 6 was used for compatibility reasons [35, 23]. In this thesis, TensorRT is only used as an accelerator for the ONNX runtime.

3.2 Benchmarking Methods

I ran the benchmarks using the frameworks and hardware mentioned in the Setup section. The scripts and models were described in the Final Implementation section. The last variable is the precision. For the benchmarks, FP32 and FP16 were used. FP32, also referred to as single-precision floating point, is a floating point data format which uses 32 bits, of which one is used for the sign, eight for the exponent and 23 for the fraction [36]. FP16, also called half-precision floating point, uses 16 bits. One of those is used for the sign, five for the exponent and 10 for the fraction [36]. Comparing those two formats, FP32 is more accurate, but calculations can be slower depending on the chip architecture used. Some chip architectures do not even support FP32 calculations.
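The effect of the smaller fraction can be seen directly in NumPy; a tiny illustration (not taken from the thesis scripts):

    import numpy as np

    third_fp32 = np.float32(1.0) / np.float32(3.0)  # 8-bit exponent, 23-bit fraction
    third_fp16 = np.float16(1.0) / np.float16(3.0)  # 5-bit exponent, 10-bit fraction

    print("%.10f" % third_fp32)  # close to 1/3
    print("%.10f" % third_fp16)  # noticeably larger rounding error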

Here is a list of all the benchmarks that were run:

• TensorFlow using the evaluate script with FP32 precision on the MacBook and the Jetson

• TensorFlow using the predict script with FP32 precision on the MacBook and the Jetson

• OpenVINO using the predict script with FP32 precision on the MacBook


• OpenVINO using the predict script with FP16 precision on the MacBook and the NCSs of both generations

• OpenVINO using the Benchmark App with FP32 precision on the MacBook

• OpenVINO using the Benchmark App with FP16 precision on the MacBook and the NCSs of both generations

• ONNX runtime using the predict script with FP32 precision on the MacBook using the CPU and, on the Jetson, using TensorRT as accelerator

All these benchmarks were run on the four models mentioned before, 0_conv, 1_conv, 2_conv and LeNet-5. For each model five runs were executed, and an average was calculated.

3.3 Limitations of the Benchmarking Methods

While most tests were run with FP32 precision, the NCSs could only be tested with FP16 because of hardware limitations. The frameworks were also limited to certain devices; running TensorFlow on one of the NCSs, for example, was not possible, as Intel only provides the NCS interface and drivers for its own software OpenVINO. The same applies to TensorRT, which specifically targets the CUDA cores of Nvidia's own GPUs. The MacBook shows a bit more runtime variation because of its more complex system and the greater number of background programs running, which require more scheduling. This should be averaged out by using multiple runs.

4 Results

This chapter evaluates the results of the three frameworks used. The first subchapter focuses on TensorFlow as this was the framework used for training the models in the first place.

4.1 TensorFlow

4.1.1 Results of the Evaluate Script

The evaluate script measures the runtime per sample in microseconds. From this, I calculated the runtime for the whole test set of 10,000 pictures in seconds myself (for example, 35.4 µs per sample × 10,000 samples = 0.354 s for 0_conv on the MacBook).

                             0_conv    1_conv     2_conv     LeNet-5
Average runtime per sample   35.4µs    127.8µs    340.2µs    91.6µs
Average runtime              0.354s    1.278s     3.402s     0.916s
Accuracy                     98.23%    98.99%     99.19%     98.88%

Table 4.1: Accuracy and average runtime on the MacBook in TensorFlow with FP32 precision in the evaluate script

                             0_conv    1_conv        2_conv        LeNet-5
Average runtime per sample   271.8µs   1ms           1ms           1.4ms
Average runtime              2.718s    approx. 10s   approx. 10s   approx. 14s
Accuracy                     98.23%    98.99%        99.19%        98.88%

Table 4.2: Accuracy and average runtime on the Jetson Nano in TensorFlow with FP32 precision in the evaluate script

The times for the 1_conv, 2_conv and LeNet-5 models in table 4.2 are given in milliseconds, as opposed to microseconds for the other results. These results show the MacBook as the faster device. Based on the results of the 0_conv model, the MacBook was around 7.7 times faster than the Jetson Nano.


4.1.2 Results of the Predict Script

The predict script measures the runtime for the whole test set of 10,000 pictures in seconds.

                  0_conv   1_conv   2_conv   LeNet-5
Average runtime   20.67s   27.85s   35.25s   23.68s
Accuracy          98.23%   98.99%   99.19%   98.88%

Table 4.3: Accuracy and average runtime on the MacBook in TensorFlow with FP32 precision in the predict script

                  0_conv    1_conv    2_conv    LeNet-5
Average runtime   110.97s   144.49s   166.95s   138.33s
Accuracy          98.23%    98.99%    99.19%    98.88%

Table 4.4: Accuracy and average runtime on the Jetson Nano in TensorFlow with FP32 precision in the predict script

Compared to the evaluate script, the runtimes of the predict script are much higher. Comparing 0_conv on the MacBook, the evaluate script was 58 times faster. Using the predict script, the MacBook is also faster than the Jetson; comparing the results for 0_conv again, it is more than 5 times faster. Comparing the results among the models themselves on the Jetson and the MacBook respectively, the results are much closer in percentage terms than with the evaluate script. One of the reasons for this difference is probably batching, as the evaluate script uses batch size 32 compared to 1 in the predict script. Generally, the predict script is far less optimised, which should explain the big difference.


4.2 OpenVINO

4.2.1 Results of the Benchmark App

The benchmark app measures the processed frames per second (FPS) using random data inputs.

              0_conv     1_conv    2_conv    LeNet-5
Average FPS   12642.21   2822.01   1340.37   8543.43

Table 4.5: Average FPS on the MacBook using FP32 precision in the OpenVINO Benchmark App

              0_conv     1_conv    2_conv    LeNet-5
Average FPS   12651.72   2824.04   1348.12   8559.28

Table 4.6: Average FPS on the MacBook using FP16 precision in the OpenVINO Benchmark App

As the two tables above show, there is no difference in runtime between FP32 and FP16 precision on the MacBook apart from measurement inaccuracies. This is no surprise when considering the CPU architecture of the MacBook: due to the floating point units of the machine, every 16-bit floating point number is treated like a 32-bit floating point number, and therefore no performance gains can be achieved. More interesting are the results compared to the NCSs, with both generations running FP16 precision.

              0_conv   1_conv   2_conv   LeNet-5
Average FPS   517.66   359.74   260.54   390.07

Table 4.7: Average FPS on the NCS1 using FP16 precision in the OpenVINO Benchmark App

Between the two NCSs, the NCS2 is faster with every model. However, the NCS2 has an especially big advantage when using 2_conv, the model with two convolutional layers. Compared to the 11% advantage in 0_conv, 2_conv runs nearly 53% faster on the newer stick. The marketing of optimised performance for convolutional layers seems to hold true in this case.

When comparing the MacBook to the NCSs, the MacBook is clearly faster. For 0_conv, the MacBook is 24 times faster than the 1st gen stick and around 22 times faster than the 2nd generation. 2_conv again shows a different picture.

              0_conv   1_conv   2_conv   LeNet-5
Average FPS   575.91   451.42   398.07   491.10

Table 4.8: Average FPS on the NCS2 using FP16 precision in the OpenVINO Benchmark App

For 2_conv, the MacBook is only 5 times faster than the NCS1 and 3.3 times faster than the NCS2. Looking at these results, the NCSs seem to process convolutional layers much more efficiently.

4.2.2 Results of the Predict Script

The predict script measures the runtime for the whole test set in seconds.

                  0_conv   1_conv   2_conv   LeNet-5
Average runtime   2.00s    5.92s    9.97s    2.76s
Accuracy          98.09%   92.36%   99.05%   98.65%

Table 4.9: Accuracy and average runtime on the MacBook in OpenVINO with FP32 precision in the predict script

                  0_conv   1_conv   2_conv   LeNet-5
Average runtime   2.02s    5.90s    9.99s    2.60s
Accuracy          98.09%   92.34%   99.05%   98.65%

Table 4.10: Accuracy and average runtime on the MacBook in OpenVINO with FP16 precision in the predict script

As seen with the Benchmark App before, there are no real differences regarding runtime on the MacBook between FP16 and FP32 precision. An interesting detail to note is the inconsistent accuracy of 1_conv using the FP16 model.


                  0_conv   1_conv   2_conv   LeNet-5
Average runtime   20.71s   29.10s   39.85s   27.07s
Accuracy          98.09%   92.35%   99.05%   98.65%

Table 4.11: Accuracy and average runtime on the NCS1 in OpenVINO with FP16 precision in the predict script

                  0_conv   1_conv   2_conv   LeNet-5
Average runtime   18.77s   23.45s   26.45s   21.64s
Accuracy          98.08%   92.2%    99.05%   95.0%

Table 4.12: Accuracy and average runtime on the NCS2 in OpenVINO with FP16 precision in the predict script

Like in the Benchmark App, the NCS2 is faster across the board compared to the NCS1. For 0_conv there is a 10% performance increase, for 2_conv it increases by 50%. The accuracy varies as well. While 0_conv is slightly less accurate on the NCS2, the LeNet-5 accuracy drops significantly. Also, for 1_conv it drops by a small amount (less than 1%). Comparing runtimes to the MacBook, the MacBook is still the faster device with every model. For 0_conv it is around 10 times faster than the NCS1 and 9.4 times faster than the NCS2. The advantage decreases with more convolutional layers being added. For 2_conv it is around 4 times faster than the NCS1 and 2.6 times faster than the NCS2.

4.3 ONNX Runtime

The predict script measures the runtime for the whole test set in seconds.

                  0_conv   1_conv   2_conv    LeNet-5
Average runtime   0.61s    5.11s    10.18s    1.59s
Accuracy          98.23%   98.99%   99.19%    98.88%

Table 4.13: Accuracy and average runtime on the MacBook in the ONNX runtime with FP32 precision in the predict script

While accuracy is consistent across both devices, the MacBook again proves to be faster. For 0_conv it is 10 times faster than the Jetson, and for 2_conv it is 1.8 times faster. An unexpected behaviour is shown by the Jetson, which is reproducibly quite slow with 1_conv.


                  0_conv   1_conv    2_conv    LeNet-5
Average runtime   6.15s    24.44s    18.25s    8.83s
Accuracy          98.23%   98.99%    99.19%    98.88%

Table 4.14: Accuracy and average runtime on the Jetson Nano in the ONNX runtime with FP32 precision in the predict script

As the ONNX runtime uses TensorRT as an accelerator and optimizer, it could be that TensorRT was not able to apply as much optimization to 1_conv as to the other models.

4.4 Comparison of the Results between Frameworks

4.4.1 Accuracy For accuracy, Tensorflow is the benchmark as it was the framework in which all the models were trained and saved. ONNX and OpenVINO use converted versions of the models.

Figure 4.1: Accuracies

As chart 4.1 shows, TensorFlow and ONNX have the same accuracies across all the models. In OpenVINO, however, accuracy is lost. This holds true for FP32 as well as FP16. The 1_conv model especially has a significant drop in accuracy. On the NCS2, the LeNet-5 model additionally drops by a significant margin.


4.4.2 Runtimes

Given the complexity and size of the networks, the runtimes usually line up in the following order: 0_conv < LeNet-5 < 1_conv < 2_conv. The following diagrams compare the results of the predict script model by model. All runtimes are shown in seconds on the y-axis. The x-axis lists the different devices.

Figure 4.2: Runtimes of 0_conv Figure 4.3: Runtimes of 1_conv (lower is better) (lower is better)

Figure 4.4: Runtimes of 2_conv Figure 4.5: Runtimes of LeNet-5 (lower is better) (lower is better)


Starting with the 0_conv runtimes in figure 4.2, the comparison shows big differences between the devices and frameworks. While the MacBook using TensorFlow and both NCSs are almost equally fast, the Jetson Nano is more than 5 times slower. The MacBook using ONNX is the fastest, followed by the MacBook with OpenVINO and the Jetson with ONNX. The runtimes of 1_conv in figure 4.3 show a similar picture. The NCS2 gains a bit more advantage over the NCS1. The Jetson Nano is slower than expected, while the fastest device is still the MacBook using the ONNX runtime. In figure 4.4, the 2_conv runtimes show a somewhat clearer picture. While the NCS2 further gains against the NCS1, the Jetson performs as expected in ONNX and beats both NCSs. The fastest device this time is the MacBook using OpenVINO, which seems to process the additional convolutional layer faster than ONNX. LeNet-5, shown in figure 4.5, performs almost like 1_conv but is a bit faster overall, the only bigger difference being the Jetson with ONNX performing as expected and being the third-fastest device in the benchmark.

Apart from the predict script, which does not use the full performance potential as discussed previously, I have done some more tests to measure maximum performance. The following diagram shows the highest frames per second achieved in the respective framework on the MacBook and the edge devices. I used the results of the evaluate script for TensorFlow and the results of the Benchmark App for OpenVINO. Important to note is that there was no specific tool for ONNX runtime, so I used the results of the predict script.

Figure 4.6: FPS on the MacBook in optimized scripts (higher is better)

As can be seen in figure 4.6, TensorFlow has the highest throughput on all models and holds a big lead on all models apart from LeNet-5.


ONNX runtime seems to be worse than the other frameworks at processing convolutional layers: it holds the second spot for 0_conv but loses it to OpenVINO on the other models. OpenVINO is restricted by batch size 1 and synchronous inference. Even with its convolutional layer, LeNet-5 has a considerable advantage over the other two models with convolutional layers.

Figure 4.7: FPS on the Edge Devices in optimized scripts (higher is better)

In chart 4.7, the FPS of the edge devices are compared. There are only results for 0_conv with TensorFlow, as the other results were too inaccurate to include. TensorFlow on the Jetson is the fastest, followed by the Jetson with ONNX and TensorRT acceleration. OpenVINO has the same restrictions as mentioned above. Comparing the devices, the Jetson is faster than both NCSs. Its worse performance in 1_conv is probably due to the lacking optimization mentioned before. However, with a growing number of convolutional layers the gap closes due to the more efficient processing in OpenVINO, as seen in the diagram for the MacBook before. Compared to the MacBook, all the edge devices perform considerably worse across the board, but they close the gap when more convolutional layers are used.

5 Conclusion and Outlook

The following sections conclude the thesis and give an outlook on future work and interesting additions to the experiments made. The conclusion is done separately for the software and the hardware.

5.1 Conclusion

There are multiple takeaways from my experiences in implementing and benchmarking in the used frameworks. Beginning with TensorFlow, the native API is lower level and therefore less intuitive and, without well-founded knowledge, prone to bugs. It is easier to use a higher-level API like Keras or PyTorch. When running unoptimised code with the Keras API, the inference times are not as good as on the other two tested frameworks. However, after optimising inference and batch size, it proved to be the best performing framework in the comparison. OpenVINO is the second fastest framework, both with optimised code and with my own script. Compared to ONNX, it has better performance when used with convolutional layers, but falls short of TensorFlow. Without the restrictions of batch size 1 and synchronous inference, the results of the optimised script would probably be higher. ONNX, more specifically the ONNX runtime, has the best performance when using unoptimised code. Another advantage is the high portability of the code, which works with CPU, TensorRT and OpenVINO as accelerator without further optimisations. With optimisations, the performance can probably be improved further. When focusing on the accuracy of the models, TensorFlow is the benchmark as it was used for training. ONNX has the same accuracy after conversion across the board. Converting to OpenVINO proved lossy on all models I have tested. The drop in accuracy depends on the model and ranges from barely measurable (less than 1%) to a rather significant few percentage points (between 5% and 6%).

When comparing devices, the MacBook proved to be the fastest across all the benchmarks. However, with a higher number of convolutional layers, the edge devices closed the gap. This is the case because their architecture is better optimised for neural networks and convolutional layers in particular [37]. Comparing the edge devices with each other, the most obvious observation was the higher throughput of the second generation of the Neural Compute Stick compared to the first generation. Still, the NCS2 showed problems with accuracy when used with certain models. The Jetson Nano was far slower than the competition when used with unoptimised code in TensorFlow. However, when using TensorRT as accelerator for the ONNX runtime, the Jetson was faster than the NCSs with every model but one.

5.2 Outlook

While this thesis provides a look at the current state of machine learning on the edge, there are additional topics in this subject area that deserve further investigation. This chapter names a few of them and should motivate future work based on this thesis. A topic which came up multiple times was the influence of the batch size. The restriction to a batch size of one leaves potential unused, as some experiments in the thesis showed. The batch size should be adjusted for every device and framework to ensure the best performance. This would, for example, make asynchronous inference in OpenVINO worthwhile. Possibly the edge devices would perform better with more parallelism, as their architecture is specifically designed to deal with it [37]. Another topic is the influence of different precisions like FP32 and FP16 on the accuracy of a neural network. The few experiments in the thesis hint at a minor influence, but depending on the specific model and data there might be a bigger difference. Runtime was similar as well, which can be explained by the CPU architecture of the MacBook. More interesting would be a comparison on a device like the Jetson Nano. The precision could even be dropped further to INT8, for example in TensorRT [35]. This promises even better performance in exchange for a lower precision. Accuracy issues on the NCS2 occurred multiple times during this thesis. As they are specific to certain models, it was not possible to identify the cause. More research would be needed to fully understand the underlying issue. As the ONNX runtime is a very portable interface, a more in-depth comparison between the different accelerators would show throughput differences between the underlying frameworks. Specifically, comparisons using OpenVINO would give further insight. Also, a proper implementation of a TensorRT script could add to a more complete overview, as the use as an accelerator only might degrade performance. Through parallelisation of inference across multiple devices, throughput could be improved considerably. For example, OpenVINO supports inference using multiple devices simultaneously [38]. Multi-node setups are also used to improve performance in the training process [39]. Lastly, Google Coral should be mentioned. Two versions of this device exist, a developer board and a USB accelerator. While the board is a fully functional computer like the Jetson Nano, the USB accelerator competes with the Intel NCSs. Both support TensorFlow Lite, a customised TensorFlow version for mobile devices [40]. Their performance seems to stack up well against the Jetson Nano and the NCSs according to a publication by Mattia Antonini et al. [2]. Overall, edge devices can be a big improvement when used with the right models or in environments where a full-size desktop computer or a notebook is not suitable. As the market is quite heterogeneous and compatibility between frameworks is not always guaranteed, the best device and framework first and foremost depend on the specific use case.

List of Figures

1.1 A simple neural network with three layers
1.2 Visual representation of one of the used models in this thesis
1.3 Functionality of a convolutional layer
2.1 Precision of the NCS2
2.2 Precision of the MacBook
2.3 Overview of the training and inference scripts
2.4 The four models side by side to show the added complexity
2.5 Picture of a handwritten 8 from the MNIST dataset
3.1 Nvidia Jetson Nano
3.2 Intel Movidius Neural Compute Stick
3.3 Intel Movidius Neural Compute Stick 2
4.1 Accuracies
4.2 Runtimes of 0_conv (lower is better)
4.3 Runtimes of 1_conv (lower is better)
4.4 Runtimes of 2_conv (lower is better)
4.5 Runtimes of LeNet-5 (lower is better)
4.6 FPS on the MacBook in optimized scripts (higher is better)
4.7 FPS on the Edge Devices in optimized scripts (higher is better)

List of Tables

4.1 Accuracy and average runtime on the MacBook in TensorFlow with FP32 precision in the evaluate script
4.2 Accuracy and average runtime on the Jetson Nano in TensorFlow with FP32 precision in the evaluate script
4.3 Accuracy and average runtime on the MacBook in TensorFlow with FP32 precision in the predict script
4.4 Accuracy and average runtime on the Jetson Nano in TensorFlow with FP32 precision in the predict script
4.5 Average FPS on the MacBook using FP32 precision in the OpenVINO Benchmark App
4.6 Average FPS on the MacBook using FP16 precision in the OpenVINO Benchmark App
4.7 Average FPS on the NCS1 using FP16 precision in the OpenVINO Benchmark App
4.8 Average FPS on the NCS2 using FP16 precision in the OpenVINO Benchmark App
4.9 Accuracy and average runtime on the MacBook in OpenVINO with FP32 precision in the predict script
4.10 Accuracy and average runtime on the MacBook in OpenVINO with FP16 precision in the predict script
4.11 Accuracy and average runtime on the NCS1 in OpenVINO with FP16 precision in the predict script
4.12 Accuracy and average runtime on the NCS2 in OpenVINO with FP16 precision in the predict script
4.13 Accuracy and average runtime on the MacBook in the ONNX runtime with FP32 precision in the predict script
4.14 Accuracy and average runtime on the Jetson Nano in the ONNX runtime with FP32 precision in the predict script

Bibliography

[1] N. N. Learning Machines. McGraw Hill, 1965.

[2] M. Antonini et al. "Resource Characterisation of Personal-Scale Sensing Models on Edge Accelerators". In: AIChallengeIoT. 2019.

[3] Y.-Y. C. Y.-H. L. C.-C. K. M.-H. C. I.-H. Yen. "Design and Implementation of Cloud Analytics-Assisted Smart Power Meters Considering Advanced Artificial Intelligence as Edge Analytics in Demand-Side Management for Smart Homes". In: Sensors (2019).

[4] The Machine Learning Dictionary. url: http://www.cse.unsw.edu.au/~billw/mldict.html.

[5] Deep Learning Vs Neural Networks - What's The Difference? url: https://bernardmarr.com/default.asp?contentID=1789.

[6] What is batch size in neural network? url: https://stats.stackexchange.com/questions/153531/what-is-batch-size-in-neural-network.

[7] S. Kothawade. Build your first image classification model with MNIST dataset. 2019. url: https://medium.com/analytics-vidhya/build-your-1st-deep-learning-classification-model-with-mnist-dataset-1eb27227746b.

[8] Classification: Accuracy. url: https://developers.google.com/machine-learning/crash-course/classification/accuracy.

[9] Neuronale Netze. url: https://user.phil.hhu.de/~petersen/SoSe17_Teamprojekt/AR/neuronalenetze.html.

[10] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever and R. Salakhutdinov. "Dropout: A Simple Way to Prevent Neural Networks from Overfitting". In: Journal of Machine Learning Research (2014).

[11] THE MNIST DATABASE of handwritten digits. url: http://yann.lecun.com/exdb/mnist/.

[12] Image Classification. url: https://github.com/tensorflow/models/tree/master/official/vision/image_classification.

[13] MNIST in TensorFlow. url: https://github.com/tensorflow/models/tree/master/official/r1/MNIST.

[14] Protocol Buffers. url: https://developers.google.com/protocol-buffers.

[15] Image Classification Python Sample. url: https://github.com/opencv/dldt/tree/2019/inference-engine/ie_bridges/python/sample/classification_sample.

[16] MNIST_png. url: https://github.com/myleott/MNIST_png.

[17] Basic classification: Classify images of clothing. url: https://www.tensorflow.org/tutorials/keras/classification.

[18] keras2onnx. url: https://github.com/onnx/keras-onnx.

[19] Bitwise operation: NOT. url: https://docs.opencv.org/master/d0/d86/tutorial_py_image_arithmetics.html.

[20] Image Classification Python Sample Async. url: https://github.com/opencv/dldt/tree/2019/inference-engine/ie_bridges/python/sample/classification_sample_async.

[21] Keras examples directory: MNIST_cnn.py. url: https://github.com/keras-team/keras/blob/master/examples/MNIST_cnn.py.

[22] M. Gazar. LeNet-5 in 9 lines of code using Keras. 2018. url: https://bit.ly/33eoY4O.

[23] TensorRT backend for ONNX. url: https://github.com/onnx/onnx-tensorrt.

[24] Tensorflow Backend for ONNX. url: https://github.com/onnx/onnx-tensorflow.

[25] NVIDIA Announces Jetson Nano: $99 Tiny, Yet Mighty NVIDIA CUDA-X AI Computer That Runs All AI Models. 2019. url: https://nvidianews.nvidia.com/news/nvidia-announces-jetson-nano-99-tiny-yet-mighty-nvidia-cuda-x-ai-computer-that-runs-all-ai-models.

[26] Nvidia Jetson. url: https://www.nvidia.com/de-de/autonomous-machines/embedded-systems/jetson-nano/.

[27] M. Fischer. 1-Watt-Rechenstick: Movidius Neural Compute Stick für maschinelles Sehen. 2017. url: https://www.heise.de/newsticker/meldung/1-Watt-Rechenstick-Movidius-Neural-Compute-Stick-fuer-maschinelles-Sehen-3780324.html.

[28] Intel Neural Compute Stick 2: High Performance, Low Power for AI Inference. url: https://software.intel.com/sites/default/files/managed/80/10/ncs2-data-sheet.pdf.

[29] T. Schönborn. Apple MacBook Pro Retina 13 (Early 2015) Notebook Review. 2015. url: https://www.notebookcheck.net/Apple-MacBook-Pro-Retina-13-Early-2015-Notebook-Review.139621.0.html.

[30] Keras: The Python Deep Learning library. url: https://keras.io/.

[31] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Y. Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu and Xiaoqiang Zheng. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Software available from tensorflow.org. 2015. url: https://www.tensorflow.org/.

[32] OpenVINO: Deploy high-performance, deep learning inference. url: https://software.intel.com/en-us/openvino-toolkit.

[33] ONNX Runtime. url: https://github.com/microsoft/onnxruntime.

[34] S. Shah. Microsoft and Facebook's open AI ecosystem gains more support. 2017. url: https://www.engadget.com/2017/10/11/microsoft-facebooks-ai-onxx-partners/.

[35] NVIDIA TensorRT: Programmable Inference Accelerator. url: https://developer.nvidia.com/tensorrt.

[36] Survey of Floating-Point Formats. url: http://www.mrob.com/pub/math/floatformats.html.

[37] S. Shah. Do we really need GPU for Deep Learning? - CPU vs GPU. 2018. url: https://medium.com/@shachishah.ce/do-we-really-need-gpu-for-deep-learning-47042c02efe2.

[38] OpenVINO: Multi-Device Plugin. url: https://docs.openvinotoolkit.org/latest/_docs_IE_DG_supported_plugins_MULTI.html.

[39] Guide to multi-node training with Intel Distribution of Caffe. url: https://github.com/intel/caffe/wiki/Multinode-guide.

[40] Coral: Products. url: https://coral.ai/products/.