Optimizing deep neural networks for deployment on the Movidius stick

Emiel Deprost Student number: 01503250

Supervisors: Prof. dr. ir. Pieter Simoens, Prof. dr. ir. Bart Dhoedt Counsellor: ing. Sam Leroux

Master's dissertation submitted in order to obtain the academic degree of Master of Science in de industriële wetenschappen: elektronica-ICT

Academic year 2018-2019

Preface

As a final-year student I wanted to expand my knowledge of deep learning. Machine learning already came up during my studies and sparked my interest. Thanks to my master's dissertation I was able to dive deeper into this subject, and it has been a very instructive period.

At the end of this four-year programme, it is also the ideal moment to thank everyone. My thanks go first of all to Sam Leroux for his valuable guidance during my master's dissertation. He supported me from start to finish with numerous pieces of advice and with feedback on the obtained results.

I would also like to thank prof. dr. ir. Pieter Simoens and prof. dr. ir. Bart Dhoedt for the thesis subject I was offered and for the opportunity to carry out this master's dissertation.

I also want to express my gratitude to the lecturers who have taught us over the past four years.

Finally, I want to thank my parents, who gave me the chance to follow this fascinating programme and were always ready to help me.

Emiel Deprost, May 2019

Permission for use

”The author(s) gives (give) permission to make this master dissertation available for consultation and to copy parts of this master dissertation for personal use. In all cases of other use, the copyright terms have to be respected, in particular with regard to the obligation to state explicitly the source when quoting results from this master dissertation.”

Emiel Deprost, May 2019

Optimizing deep neural networks for deployment on the Movidius stick Emiel Deprost Supervisor(s): Pieter Simoens, Bart Dhoedt, Sam Leroux

Abstract— Deep neural networks are powerful models that allow tackling complex problems. Running those networks requires a lot of computational performance, which poses a problem on battery-powered and computationally constrained devices. A possible solution is to make specific hardware that allows running those networks more efficiently. The Movidius stick is such specific hardware, made by Intel. In this work, we explore the capabilities of the Movidius stick. We look at image classification networks and evaluate which works best on the Movidius stick. We also give guidelines to design efficient neural networks for deployment on the Movidius stick.

Keywords— Movidius stick, neural compute stick, benchmarks, optimization

I. Introduction

Compared to deploying neural networks in the cloud, deploying them on the edge has clear advantages:
• Less latency.
• Lower energy consumption, as there is no energy cost to send the data to the cloud.
• No third party to trust, because all the data can be processed locally.

In recent years, a lot of effort has been made to make neural networks more efficient. Progress has also been made in hardware: as the field of deep learning applications grows, more effort is put into application-specific hardware. In this work we take a closer look at the Movidius stick, which is specific hardware for neural network inference. The main objectives of this work are the following:
• Looking at the capabilities of the Movidius stick, i.e., which kinds of deep neural networks can be executed on the device.
• Benchmarking the performance of the Movidius stick for different deep neural network architectures.
• Looking at the possible software optimizations that would allow running networks more efficiently on the device.

II. The Movidius stick

The Movidius stick is a neural network accelerator made by Intel. It needs to be operated by a host device that sends the input and receives the output of the neural network. It communicates via USB and has the form factor of a large USB stick. This allows the Movidius stick to be easily added to any mini-computer board, such as a Raspberry Pi, allowing faster neural network inference.

A. Movidius architecture

The Movidius stick is based on the Myriad 2 chip. This chip was originally made by Movidius, which was later bought by Intel. They call the Myriad 2 a Vision Processing Unit (VPU): a processor specially made for vision processing tasks. The architecture of the Myriad 2 is described in more detail in figure 1.

[Fig. 1. Myriad 2 VPU block diagram. Borrowed from [1].]

Before deploying a neural network onto the Movidius stick, we need to convert an existing model (a TensorFlow or Caffe model, or a model from any other supported framework) to an intermediate representation (IR). The IR is the model format used by the Movidius stick. This IR is then loaded into the memory of the Movidius stick. The last step is to send the input data to the Movidius stick, which returns the inference results. Note that the Movidius stick is only suitable for inference and not for the training of networks.

The Movidius stick supports most deep learning layers, such as fully connected, convolutional and pooling layers. It does not support networks with memory, so RNNs and LSTMs are not supported.

The Movidius stick only supports 16-bit floating-point precision. Hence, the inference time cannot be accelerated by reducing the precision of the weights and activations.

III. Layer benchmarks

In deep learning, different layers are used as building blocks to build a deep neural network. The most commonly used layers were benchmarked on the Movidius stick for different complexities. The results of those benchmarks allow us to make some interesting observations, which are described here. The benchmarked layers are:
• Fully connected layer
• Convolutional layer
• Depth separable convolution
• ReLU and bias layer

A. ReLU & bias

A ReLU and a bias layer were benchmarked; the inference time of both layers can be measured as they are executed as separate layers on the Movidius stick. Figure 2 shows the inference time of the ReLU and bias layer. The figure shows that the inference time of both layers is almost identical. Below 40k operations, both layers have an almost constant inference time of about 65 µs. This means that even for very small layers with a low number of ReLU and bias operations the minimum inference time will be 130 µs. This inference time is very significant compared to the inference time of a small layer. For example, for a 50x50 depthwise convolution with 128 channels the inference time is 850 µs, and the bias and ReLU take 13% of the inference time.

[Fig. 2. Inference time of a ReLU and bias layer on the Movidius stick as a function of the number of operations. ReLU and bias are each counted as one operation.]

B. Depth separable convolutions

Depth separable convolutions are used to make networks more efficient; an example network that uses them is MobileNet [2]. The idea is that a network with depth separable convolutions achieves almost the same accuracy as a network with full convolutions but with a lower number of parameters and operations. A depth separable convolution splits a full convolution into two parts, a depthwise and a pointwise convolution, and by this achieves a lower number of parameters and operations.

Both depth separable convolutions and full convolutions were benchmarked on the Movidius stick. They were benchmarked for different image sizes and channel numbers; the numbers of input and output channels of the layers are set equal. The speedup that a depth separable convolution offers compared to a full convolution is shown in figure 3. The theoretical speedup in the figure is calculated as the ratio of the number of operations of the full convolution to that of the depth separable convolution. The highest theoretical speedup is 9×. The figure has been divided into 4 zones where different behavior is observed.

[Fig. 3. Speedup of depth separable convolutions compared to the equivalent full convolution. The number of input channels is kept equal to the number of output channels; the used kernel size is 3x3.]

In the first zone the speedup is lower than one, so it is not interesting to use a depth separable convolution. This zone is only present for layers with a very low complexity, as the number of channels is low and the image size is also small.

The second zone starts to be interesting, as here there is an actual speedup, although the speedup is not very large and far lower than the theoretical speedup.

Going to zone 3, something special happens: there is a large increase in speedup for large image sizes and a reduction in speedup for smaller image sizes. What happens here is that the Movidius stick changes its way of computing full convolutions; the inference time becomes smaller for small image sizes and larger for large image sizes.

Going to zone 4, the same computing change happens for the depth separable convolutions, so the speedup is normal again. In zone 4 the achieved speedup is almost as high as the theoretical speedup.

We showed in this part that depth separable convolutions offer a speedup on the Movidius stick, which means those convolutions can be used in networks to reduce the inference time on the Movidius stick.

C. Arithmetic intensity

For every benchmarked layer the maximum achieved performance was measured; table I shows the results. It is clearly visible that the achieved performance is a lot higher for 1x1 and 3x3 convolutions than for FC and depthwise convolutions. This lower computational performance can be linked to the arithmetic intensity of the layer. The arithmetic intensity is the number of FLOPs per byte of memory access needed. This metric shows how memory intensive a layer is; a lower number means more memory intensive. Note that the arithmetic intensity is calculated, not measured.

TABLE I
Computational performance of different layers on the Movidius stick.

Layer type      10x10 (GFLOPS)   25x25 (GFLOPS)   100x100 (GFLOPS)
FC              3.22             3.22             3.22
Depthconv 3x3   4.4              6.0              7.6
Conv 1x1        48.0             55.8             61.3
Conv 3x3        57.5             60.9             63.1

Table II shows the arithmetic intensity of the different layers. The arithmetic intensity of the FC layers and depthwise convolutions is a lot lower than that of the 1x1 and 3x3 convolutions. This means the FC and depthwise convolution layers are a lot more memory intensive, which can explain why those layers achieve a lower computational performance.

TABLE II
Maximum arithmetic intensity of different layers on the Movidius stick. FP16 is used, so 1 memory access is 2 bytes.

Layer type      10x10 (FLOP/byte)   25x25 (FLOP/byte)   100x100 (FLOP/byte)
FC              1                   1                   1
Depthconv 3x3   4.1                 4.2                 4.2
Conv 1x1        100                 625                 10 000
Conv 3x3        100                 625                 10 000

Here we showed that the arithmetic intensity is an important metric that can limit the maximum performance on the Movidius stick if it is too low.

D. Convolution channel ratio

As seen in the previous point, the arithmetic intensity is an important factor to consider. One interesting observation made in the ShuffleNetV2 paper [3] is that using an equal number of input and output channels maximizes the arithmetic intensity. A large difference between the number of input and output channels decreases the arithmetic intensity. As a consequence, the achieved computational performance is higher for equal numbers of input and output channels. This was tested with channel ratios from 1 to 16 on the Movidius stick; the results are in table III.

TABLE III
Computational performance for a 1x1 convolution and a 50x50 image size. The complexity of all the layers is the same.

c1:c2   (c1, c2)    GFLOP/s   Diff
1:1     128, 128    54.2      reference
1:4     64, 256     52.6      -3%
1:8     32, 512     49.2      -9.2%
1:16    16, 1024    36.9      -31.9%

The table shows that for a ratio of 4 the performance reduction is only 3%, but for higher ratios the performance decrease can be very significant.

E. Channel count should be a multiple of 8

We noted a high inference time when using convolutional layers whose numbers of input and output channels are not a multiple of 8. When changing the channel number to the closest multiple of 8, the measured inference time was between 2 and 8× lower. This means that the number of channels should always be chosen as a multiple of 8 to achieve a reasonable inference time on the Movidius stick.

IV. Model benchmarks

The inference time of different models, made for ImageNet [4], was measured on the Movidius stick. Figure 4 shows the top-1 accuracy on ImageNet as a function of the inference time on the Movidius stick. The chart shows that for an accuracy higher than 75% the inference time becomes very large: the first improvement in accuracy already costs double the inference time. If inference time and energy consumption are important in the target application (which is most likely if the Movidius stick is used), we would argue that MobileNetV2 with 75% accuracy is a far better trade-off.

[Fig. 4. Accuracy of the models on ImageNet as a function of the inference time on the Movidius stick. The benchmarked models include MobileNetV1/V2, ShuffleNetV1/V2, AlexNet, SqueezeNet 1.0/1.1, VGG-16, Inception-v2, ResNet-50/101/152, SE-Inception-v2, SE-ResNet-50/101/152 and SE-ResNeXt-50/101.]

In chart 5 we only display the efficient models, which are designed to have few operations but retain a high accuracy. As shown in the chart, for accuracies lower than 68% MobileNetV1 is the better model, while for accuracies higher than 68% MobileNetV2 [5] is better. The ShuffleNet V1 [6] & V2 models are a lot worse than the MobileNets.

[Fig. 5. Accuracy of the efficient models on ImageNet as a function of the inference time on the Movidius stick.]

Chart 6 shows the top-1 accuracy as a function of the complexity (FLOPs) of the model. The complexity of a model should be related to the inference time, so charts 5 and 6 should be similar, but comparing both charts makes it clear that there are significant differences.

[Fig. 6. Accuracy of the efficient models on ImageNet as a function of the required number of FLOPs.]

The complexity of MobileNetV2 is always lower than that of MobileNetV1 for the same accuracy, but as seen before, V1 is faster for accuracies lower than 68%. Chart 6 also shows that ShuffleNet should at least be very competitive with MobileNetV2, but from the inference time we observe that the ShuffleNets are a lot worse than the MobileNets.

This shows that the complexity of a model is not representative of the inference time. To compare the efficiencies of different models, the inference time should always be tested on the target hardware.

A. ShuffleNet models

The previous section showed that the ShuffleNets performed a lot worse than what would be expected; here we explain why. Figure 7 shows the proportion of the inference time for every layer of ShuffleNet V1 & V2. It shows that a very large portion of the inference time is taken by the ”Reshape” operation. This operation is used in the channel shuffle layer. This layer mixes the order of the channels in the network and is implemented with two ”Reshape” and one ”Permute” operation. Although this operation theoretically takes no FLOPs, we can see that it takes a lot of time to execute on the Movidius stick.

[Fig. 7. Proportion of the inference time for the different layer types in ShuffleNet V1 and V2.]

B. Small models are less efficient

Figure 8 shows that MobileNetV1 has a higher computational performance than MobileNetV2. This is the reason that MobileNetV2 is less interesting than MobileNetV1: even if its number of operations is lower for the same achieved accuracy, it is executed at a lower computational performance and is therefore still slower.

[Fig. 8. Computational performance of the model as a function of the model complexity.]

Both model types indicate that larger models achieve a higher computational performance. To explain this we look at the inference time per layer of a small and a large MobileNetV1 model, figure 9. For the smaller model the ReLU, bias and Receive-Tensor layers take a larger proportion of the inference time.

For the bias and ReLU layers, we have seen that for a small number of operations the inference time is almost constant. As the small and large models have an equal number of ReLU and bias layers, the proportion of time taken by those layers will be larger for the smaller model.

A similar conclusion can be made for the Receive-Tensor layer. This is the time needed to send the image to the Movidius stick, which is almost constant. So for a smaller model it will take a larger portion of the inference time.

Figure 9 also shows that even for the larger MobileNet model the bias and ReLU layers take 28% of the inference time. This shows that the inference time of those layers is absolutely non-negligible.

[Fig. 9. Distribution of the inference time for a small (right) and large (left) MobileNetV1 model.]

V. Conclusion

Some guidelines to design efficient neural networks for the Movidius stick were given above; they are summarized here:
• The number of channels always has to be a multiple of 8.
• Depth separable convolutions achieve a reasonable speedup for higher complexities.
• Bias and ReLU operations cannot be neglected, especially for smaller layers/models.
• Layers with a low arithmetic intensity are executed at a lower computational performance.
• The input and output channel ratio should be as close as possible to 1.
• The reshape and permute operations should be avoided on the Movidius stick.

For the model benchmarks, we showed that MobileNet V1 & V2 are clearly the best models for an accuracy lower than 75%. If an accuracy higher than 75% is desired, other models than MobileNet can be used, but those will be a lot slower. We showed that theoretically MobileNetV2 should have a better performance than MobileNetV1, but MobileNetV1 is a better choice for accuracies lower than 68%. The ShuffleNet models should theoretically compete with the MobileNets, but they are not interesting on the Movidius stick as their inference time is slowed down by the reshape and transpose operations.

A conclusion independent of the Movidius stick is that FLOPs is not a good enough metric to evaluate efficient neural network architectures. The FLOPs only give a rough idea of the performance that should be achieved. So, to compare models with equivalent FLOPs the inference time should always be measured on the target hardware.

References

[1] B. Barry, C. Brick, F. Connor, D. Donohoe, D. Moloney, R. Richmond, M. O'Riordan, and V. Toma, "Always-on vision processing unit for mobile applications," vol. 35, no. 2, pp. 56–66.
[2] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications."
[3] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun, "ShuffleNet V2: Practical guidelines for efficient CNN architecture design."
[4] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei, "ImageNet large scale visual recognition challenge," vol. 115, no. 3, pp. 211–252.
[5] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen, "MobileNetV2: Inverted residuals and linear bottlenecks."
[6] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun, "ShuffleNet: An extremely efficient convolutional neural network for mobile devices."

Contents

1 Introduction 22

1.1 Outline ...... 24

2 Deep neural networks 25

2.1 Image classification problem ...... 25

2.2 Neural network layers ...... 26

2.2.1 Fully connected layer ...... 26

2.2.2 Convolutional layer ...... 27

2.2.3 Depth separable convolutions ...... 28

2.3 Common architectures ...... 30

2.3.1 AlexNet ...... 30

2.3.2 VGG ...... 30

2.3.3 GoogleNet inception V1 ...... 31

2.3.4 ResNet ...... 32

3 Efficient neural networks 33

3.1 Hardware ...... 34

3.2 Different execution algorithms ...... 34

3.2.1 Standardise to GeMM ...... 35


3.2.2 FFT convolution ...... 35

3.2.3 Winograd convolutions ...... 36

3.3 Network modifications ...... 36

3.4 Efficient network architectures ...... 36

3.4.1 Mobilenet V1 ...... 37

3.4.2 Mobilenet V2 ...... 38

3.4.3 Shufflenet V1 ...... 40

3.4.4 Shufflenet V2 ...... 43

4 Movidius stick 45

4.1 Movidius stick architecture ...... 45

4.2 OpenVino toolkit ...... 46

4.2.1 Model optimizer ...... 47

4.2.2 Inference engine ...... 49

4.3 NCS capabilities ...... 50

5 Layer’s benchmarks 52

5.1 Fully connected layer benchmark ...... 54

5.1.1 NCS results ...... 55

5.1.2 CPU results ...... 55

5.2 Convolutional layer benchmark ...... 56

5.2.1 NCS results ...... 57

5.2.2 CPU results ...... 57

5.3 Pointwise convolution benchmark ...... 58

5.3.1 NCS results ...... 59

5.3.2 CPU results ...... 59

5.4 Depthwise convolution benchmark ...... 60

5.4.1 NCS results ...... 61

5.4.2 CPU results ...... 61

5.5 Speedup of depth separable convolutions ...... 62

5.6 ReLU & bias ...... 64

5.7 Channel ratio ...... 65

5.8 Comparison of efficiencies ...... 65

5.9 Channel count should be a multiple of 8 ...... 67

6 Model benchmarks 68

6.1 Models comparison ...... 69

6.2 MobileNet V1 & V2 ...... 71

6.2.1 Why MobileNetV2 is slower than V1? ...... 73

6.2.2 Small models are less efficient ...... 74

6.3 ShuffleNet V1 & V2 ...... 76

7 Conclusion 77

Appendices 83

A Model benchmark results 84

Abbreviations

API Application Programming Interface. 47, 51

CNN Convolutional Neural Network. 24, 28, 50

CNNs Convolutional Neural Networks. 28

CPU Central Processing Unit. 15, 17, 18, 32, 33, 41, 44, 47, 51–59, 67, 69, 74, 82–85

FC Fully Connected. 24, 33, 50, 53, 62–64, 76

FFT Fast Fourier Transform. 33, 34

FLOP FLoating point OPeration. 16, 24, 28–30, 60, 63–65, 67–71, 73

FLOPS FLoating point OPerations per Second. 53, 55, 57, 61, 62, 64

FLOPs FLoating point OPerations. 24, 31, 35, 37, 39, 41, 50, 53, 63, 64, 72

FPGA Field-Programmable Gate Array. 44, 47

GeMM General Matrix Multiply. 33

GPU Graphics Processing Unit. 32–34, 41, 44, 47, 48

IR Intermediate Representation. 44, 47, 51

LPDDR Low-Power Double Data Rate. 48

LSTM Long Short-Term Memory. 48

MAC Memory Access Cost. 24, 26, 27, 41, 63, 64

NCS Neural Compute Stick. 15–18, 44, 47–49, 51–60, 62–67, 69, 70, 73–76, 82–85

OPS OPeration per Second. 62


RAM Random Access Memory. 24, 47, 48

ReLU Rectified Linear Unit. 16, 36, 39, 41, 50, 62, 63, 72, 73, 76

RNN Recurrent Neural Network. 48

SDK Software Development Kit. 47

SIMD Single Instruction, Multiple Data. 32

TPU Tensor Processing Unit. 32

VLIW Very Long Instruction Word. 43

VPU Vision Processing Unit. 15, 43, 44, 47, 48

List of Figures

2.1 Schematic representation of a convolutional layer...... 27

2.2 Schematic representation of a depth separable convolution...... 28

2.3 Architecture of AlexNet. Borrowed from [4] ...... 30

2.4 Inception module used in GoogleNet. Borrowed from [5]...... 31

2.5 GoogleNet architecture. Borrowed from [5]...... 32

2.6 ResNet residual block. Borrowed from [6]...... 32

3.1 Convolution converted to a matrix multiplication with Im2Col operation. Borrowed from [14] ...... 35

3.2 Inverted residual with linear bottleneck block. The diagonally hatched structures do not contain non-linearities. Figure borrowed from [26] with some corrections: the Relu6 operation was shown on the bottleneck convolution instead of on the expansion convolution...... 38

3.3 (a) Two stacked group convolutions, channels between the different groups are not shared. (b) Two stacked group convolutions where the channels are shuffled so that every output group fully relates with all the input groups. Borrowed from [22]...... 41

3.4 (a) Basic unit for the shuffleNetV1 network. (b) Basic unit with stride 2. Borrowed from [22]...... 42

3.5 (a) Basic unit for the shuffleNetV2 network. (b) Basic unit with stride 2. Borrowed from [23]...... 44


4.1 Myriad2 VPU block diagram. Borrowed from [29] ...... 46

4.2 Deployment workflow to program the NCS. Image borrowed from the OpenVino documentation[33] ...... 47

4.3 Part of the architecture of ResNet-50. The network before optimization (a). The network after the model optimizer stride optimization (b). Images made with NetScope [35] ...... 48

5.1 Inference time (a) and computational performance (b) of a fully connected layer in function of the number of input and output neurons. Benchmark on NCS. .. 54

5.2 Inference time (a) and computational performance (b) of a fully connected layer in function of the number of input and output neurons. Benchmark on CPU. .. 54

5.3 Inference time (a) and computational performance (b) of a convolutional layer in function of the number of channels. The input and output channels are set equal and the used kernel size is 3x3. The different series are inferences for a different image size. Benchmark on NCS...... 56

5.4 Inference time (a) and computational performance (b) of a convolutional layer in function of the number of channels. The input and output channels are set equal and the used kernel size is 3x3. The different series are inferences for a different image size. Benchmark on CPU...... 56

5.5 Inference time (a) and computational performance (b) of a pointwise (1 × 1) convolution in function of the number of channels. The input and output channels are set equal. The different lines are inferences on a different image size. Benchmark on NCS...... 58

5.6 Inference time (a) and computational performance (b) of a pointwise (1 × 1) convolution in function of the number of channels. The input and output channels are set equal. The different lines are inferences on a different image size. Benchmark on CPU ...... 58

5.7 Inference time (a) and computational performance (b) of a depthwise convolution in function of the number of channels. The used kernel size is 3 × 3. The different lines are inferences on a different image size. Benchmark on NCS ... 60

5.8 Inference time (a) and computational performance (b) of a depthwise convolution in function of the number of channels. The used kernel size is 3 × 3. The different lines are inferences on a different image size. Benchmark on CPU ... 60

5.9 Speedup of a depth separable convolution with respect to a full convolution. Used kernel size is 3 × 3. Benchmarked on the NCS...... 62

5.10 Inference time (a) and computational performance (b) of a bias and ReLU in function of the number of executed operations. Bias and ReLU are considered as one operation each. Benchmark on NCS ...... 64

5.11 Inference time (a) and computational performance (b) of a 3 × 3 convolution in function of the number of channels. The used image size is 25 × 25. Benchmark on the NCS ...... 67

6.1 Top 1 accuracy on ImageNet in function of the number of FLOPs of the model. The exact numbers can be found in appendix A. The FLOPs were calculated from the network architecture generated by the OpenVino model optimizer. So, the number of FLOPs can be slightly lower than those reported in the original papers because of the optimizations, described in part 4.2.1. For the ResNets the FLOPs are significantly lower because of the ResNet stride optimization...... 70

6.2 Top 1 accuracy on Imagenet in function of the inference time on NCS (a) and CPU(b). The exact numbers can be found in appendix A...... 70

6.3 Top 1 accuracy on ImageNet in function of the number of FLOPs of the model. Zoomed in on MobileNet and ShuffleNet...... 72

6.4 Top 1 accuracy on Imagenet in function of the inference time on NCS (a) and CPU (b). Zoomed in on MobileNet and ShuffleNet...... 72

6.5 Complexity (a) and inference time (b) by layer type for MobileNetV1 & V2. The models have similar complexity. Results for the NCS ...... 73

6.6 The computational performance of the model (FLOPs) in function of the complexity of the model (FLOPs). Results for the NCS...... 75

6.7 Proportion of the inference time taken by the different layer types in MobileNetV1, comparing a small network (right) with a larger network (left). Inference on NCS. 75

6.8 Proportion of the inference time for the different layer types of ShuffleNet V1 & V2. Inference on NCS...... 76

List of Tables

3.1 Architecture of MobilenetV1-1.0-224. S is the stride, c the number of output channels, n the number of times the layer is repeated...... 37

3.2 Bottleneck residual block transforming from Cin to Cout channels, with stride s and expansion factor t. Table borrowed from [26] ...... 39

3.3 Architecture of MobilenetV2-1.0-224. S is the stride, c the number of output channels, n the number of times the layer is repeated and t the expansion factor. 40

3.4 Architecture of ShuffleNetV1. S is the stride, c the number of output channels and n the number of times the layer is repeated. The number of groups used for the group convolutions is 3...... 42

3.5 Architecture of ShuffleNetV2 1×. S is the stride, c the number of output channels and n the number of times the layer is repeated. The number of groups used for the group convolutions is 3...... 44

5.1 Conv 1x1, image size 100x100 ...... 65

5.2 Conv 1x1, image size 50x50 ...... 65

5.3 Computational performance of different layers on the NCS...... 66

5.4 Maximum arithmetic intensity of different layers on the NCS. FP16 is used so 1 memory access is 2 bytes...... 66

A.1 ShuffleNet V1 and V2 model benchmarks: complexity, top 1 accuracy on ImageNet, inference time on NCS, inference time on CPU. Accuracy from original papers [23, 22] ...... 84


A.2 MobileNetV2 model benchmarks: complexity, top 1 accuracy on ImageNet, inference time on NCS, inference time on CPU. Accuracy source https://github.com/tensorflow/models/tree/master/research/slim/nets/mobilenet .... 85

A.3 MobileNetV1 model benchmarks: complexity, top 1 accuracy on ImageNet, inference time on NCS, inference time on CPU. Accuracy source https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet_v1.md 86

A.4 Table with the complexity of the model, top 1 accuracy on ImageNet, inference time on NCS, inference time on CPU and accuracy source...... 87

1 Introduction

Until the introduction of the concept of artificial intelligence (AI), the processing of information by computers was completely algorithmic: a deterministic execution of a written program following established rules. These rules were developed to solve problems in a restricted and well-defined application domain.

Although some scientists opened up the first lines of research in the field of artificial intelligence at the beginning of the 1960s, it was from the 1980s onwards that significant research began. The first basic architectures of artificial neural networks were developed, as well as the training methods for them. In the 1990s, multi-layer networks grew in importance, trying to adapt biomimetic architectures, and the methods of learning by backpropagation were refined.

Nevertheless, the state of technology at the time did not yet allow tackling complex applications. The computing power and memory sizes were several tens of thousands of times smaller than today. Time has worked in our favor: the effect of Moore's Law [1] has allowed us to reach much greater processing capacities today. It was necessary to wait for the arrival of cheaper and more powerful processors to allow AI to find its place in various applications.

Today, we can build networks with complex architectures, consisting of many layers. The arrival of GPUs on the market has brought massive parallel processing capabilities. Storage capacities have grown enormously, making it possible to store the reference information useful for the learning processes of the networks.

All this brings us to deep neural networks. Those are networks that can have thousands of millions of connections; one such type of network is a Convolutional Neural Network (CNN). CNNs are extensively used in computer vision related tasks like image classification, object localization, and image segmentation. However, even for today's hardware they still require a lot of computational performance to be run. This poses a problem for deploying such networks on devices with limited processing capabilities and constrained energy consumption. Currently, those networks are mostly trained and deployed in the cloud, where a lot of resources are available.

The main benefits of deploying deep neural networks locally instead of in the cloud are the following:

• Less latency, as there is no data transfer time.

• Lower energy consumption, as sending data to the cloud also costs a lot of energy.

• No third party to trust, because all the data can be processed locally.

To deploy networks locally there are two requirements: the hardware needs to be capable of running the network at a reasonable speed, and the energy consumption has to be low enough for battery-constrained devices. To make this possible there are essentially two elements that can be worked on:

• Improving the hardware, so that it allows running networks faster and at a lower energy cost.

• Improving the software, by making more efficient neural networks that require less com- putation.

In recent years more specific hardware has been developed to run neural networks, which allows running neural networks faster and more efficiently. One of those is the Movidius stick, which is the object of this work. The Movidius stick is made to accelerate the execution of neural networks for low power devices. It comes in a USB stick form factor which allows it to be added easily to a mini computer board like a Raspberry Pi. The main goals of this work are the following:

• Looking at the capabilities of the Movidius stick, i.e., which kind of deep neural network can be executed on the device.

• Benchmarking the performance of the Movidius stick for different deep neural network architectures.

• Looking at the possible software optimizations that would allow running networks more efficiently on the device.

1.1 Outline

In chapter 2, the basic concepts of deep neural networks are introduced and commonly used models are described. Chapter 3 gives an overview of methods to make neural networks more efficient and also describes some efficient models. Chapter 4 explains how to work with the Movidius stick. In chapter 5, we analyze the benchmarks of base layers used in deep learning. In chapter 6, the benchmarks of deep neural network models are analyzed.

2 Deep neural networks

A deep neural network is an artificial neural network that consists of many layers with a lot of connections. Every connection in the network has a weight, and by changing those weights the input of the network can be mapped to the desired output. Learning the weights of the network is done by showing the network a lot of examples of what we want to accomplish. The large number of connections allows deep neural networks to be very good at processing unstructured data like images, speech or plain text.

2.1 Image classification problem

The problem of image classification consists of predicting the correct label of an image. The most well-known challenge in this field is the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [2]. It provides a large database of images, each labeled with one of 1000 different image classes. The ImageNet challenge has become the standard for evaluating the image classification performance of a network.

The problem of image classification is interesting as other computer vision problems, such as object detection and image segmentation, can be solved by reusing techniques used for image classification.


2.2 Neural network layers

Building a deep neural network architecture isn't done from scratch every time; basic layers are reused. By stacking multiple layers one after another, deep neural networks can be built. In this section, some layers used for CNNs are described.

2.2.1 Fully connected layer

In a Fully Connected (FC) layer all the input neurons are connected to all the output neurons; this layer is also called a dense layer. Let Nin and Nout denote the number of input and output neurons. Then the number of FLoating point OPerations (FLOPs) needed to compute a fully connected layer is given by the following equation:

$$\text{FLOPs} = N_{in} \cdot N_{out} + (N_{in} - 1) \cdot N_{out} \tag{2.1}$$

Here the definition is used where multiplications and additions each count as one FLoating point OPeration (FLOP). The first part of the equation is the number of multiply operations, while the second part is the number of add operations. As most of the time Nin >> 1, the equation can be simplified to:

$$\text{FLOPs} \approx 2 \cdot N_{in} \cdot N_{out} \tag{2.2}$$

Another important metric for a layer is the Memory Access Cost (MAC). The definition used here for the MAC is the minimum amount of accesses required, supposing that all the data needed to compute the layer is in RAM and the results have to be stored back in RAM. So, the MAC is the sum of all the memory loads and stores needed to compute the layer. This, of course, is a simplification of the real world, but the goal is to have a metric allowing to quantify how memory intensive a certain layer is. For a FC layer the MAC is given by equation 2.3:

$$\text{MAC} = \underbrace{N_{in}}_{\text{Loading inputs}} + \underbrace{N_{out}}_{\text{Writing outputs}} + \underbrace{N_{in} \cdot N_{out}}_{\text{Loading weights}} \tag{2.3}$$

The first and second parts correspond respectively to the loading of the inputs and the writing of the outputs. The last part represents the loading of the weights.

When working with images, using a fully connected layer is very costly. A digital image is encoded as an array of pixels; every pixel consists of 3 brightness values (Red, Green, and Blue). These values create the color of the pixel. Every brightness value can be seen as an input feature to the neural network. This means that an image of 200 × 200 pixels has 200 ∗ 200 ∗ 3 = 120000 features. If the first layer of the network were a fully connected layer with 1000 output neurons, the number of parameters and FLOPs would be very large (120 million parameters and about 240 MFLOPs).
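To make the arithmetic above concrete, the following small Python sketch evaluates approximations 2.2 and 2.3 for this example; the function names are only illustrative helpers, not part of any benchmark code.

```python
def fc_flops(n_in, n_out):
    """Approximate FLOPs of a fully connected layer (equation 2.2)."""
    return 2 * n_in * n_out

def fc_mac(n_in, n_out):
    """Memory access cost of a fully connected layer (equation 2.3)."""
    return n_in + n_out + n_in * n_out  # load inputs + write outputs + load weights

# A 200x200 RGB image flattened to 120 000 features, followed by a
# fully connected layer with 1000 output neurons.
n_in, n_out = 200 * 200 * 3, 1000
print(fc_flops(n_in, n_out))   # 240 000 000 FLOPs (about 240 MFLOPs)
print(n_in * n_out)            # 120 000 000 weights (parameters)
print(fc_mac(n_in, n_out))     # about 120 million memory accesses
```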

Sparsely connected layers allow reducing the number of parameters and operations. In a sparsely connected layer, not all input features are connected to all output features. An example of such a sparsely connected layer is a convolutional layer. When working with images, convolutional layers are almost always used.

2.2.2 Convolutional layer

In a convolutional layer, the input is represented as a three-dimensional matrix: it has a height (H), a width (W) and a number of channels (Cin), see figure 2.1.

Figure 2.1: Schematic representation of a convolutional layer.

A convolution consists of an elementwise multiplication between a part of the input feature map and the kernel (with dimensions K × K × Cin), followed by the sum of all the elements in the resulting matrix. The kernel moves over the input feature map in two dimensions, the height and the width of the image. That is why this type of convolution is called a 2D convolution. Every convolution of the input feature map results in one channel of the output feature map. Thus, the number of kernels determines the number of output channels (Cout).

A convolutional layer can be seen as a sparsely connected version of a fully connected layer, where only the input features in close spatial proximity are connected to an output feature. This reduces the number of parameters and operations a lot. The computational cost of such a layer is given by:

$$\text{FLOPs} = H \cdot W \cdot C_{out} \cdot C_{in} \cdot K^2 + H \cdot W \cdot C_{out} \cdot (C_{in} \cdot K^2 - 1) \tag{2.4}$$

The left term is the number of multiplication operations and the right term the number of add operations. When Cin ∗ K² >> 1, the equation above can be simplified to:

$$\text{FLOPs} \approx C_{in} \cdot K^2 \cdot H \cdot W \cdot C_{out} \cdot 2 \tag{2.5}$$

The MAC of a convolutional layer is given by the following equation:

$$\text{MAC} = H \cdot W \cdot (C_{in} + C_{out}) + K^2 \cdot C_{in} \cdot C_{out} \tag{2.6}$$

The first term represents the reading of the inputs and the writing of the outputs. The second term is the loading of all the kernel weights.
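As with the fully connected layer, a small sketch can make these formulas concrete; the layer dimensions below are an arbitrary example and the helper names are only illustrative.

```python
def conv_flops(h, w, c_in, c_out, k):
    """Approximate FLOPs of a full convolution (equation 2.5)."""
    return 2 * h * w * c_in * c_out * k * k

def conv_mac(h, w, c_in, c_out, k):
    """Memory access cost of a convolution (equation 2.6)."""
    return h * w * (c_in + c_out) + k * k * c_in * c_out  # feature maps + kernel weights

# Example: a 3x3 convolution on a 25x25 feature map with 128 input and 128 output channels.
print(conv_flops(25, 25, 128, 128, 3))  # 184 320 000 FLOPs
print(conv_mac(25, 25, 128, 128, 3))    # 307 456 memory accesses
```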

2.2.3 Depth separable convolutions

Depth separable convolutions try to have the same effect as a ”full” convolution but with a significant reduction in the number of operations and parameters. This is done by separating the convolution into two parts: a depthwise convolution followed by a pointwise convolution, see Figure 2.2.


Figure 2.2: Schematic representation of a depth separable convolution.

The depthwise convolution is a channel-wise spatial convolution. This means that the kernel has dimensions K × K × 1 and the number of convolutions is equal to the number of input channels Cin. The result after a depthwise convolution is a feature map with dimensions H × W × Cin. At this stage the information of the different channels has not been combined; therefore a pointwise convolution is used. The pointwise convolution has kernel dimensions 1 × 1 × Cin and the number of convolutions is chosen according to the desired number of output channels Cout. This operation effectively combines the information of the different channels.

Compared to full convolutions, depth separable convolutions need fewer parameters and are less complex. The drawback is that they have a slightly lower representational power, which results in a small drop in accuracy [3].

The complexity of a depth separable convolution is given by:

$$\text{FLOPs} = \underbrace{H \cdot W \cdot C_{in} \cdot (2K^2 - 1)}_{\text{Depthwise conv}} + \underbrace{H \cdot W \cdot C_{out} \cdot (2C_{in} - 1)}_{\text{Pointwise conv}} \tag{2.7}$$

The terms correspond to the depthwise and pointwise convolution respectively. As 2 ∗ K² >> 1 and 2 ∗ Cin >> 1 the equation above can be approximated by:

$$\text{FLOPs} \approx \left(\underbrace{H \cdot W \cdot C_{in} \cdot K^2}_{\text{Depthwise conv}} + \underbrace{H \cdot W \cdot C_{out} \cdot C_{in}}_{\text{Pointwise conv}}\right) \cdot 2 \tag{2.8}$$

With approximations 2.5 and 2.8 the complexity reduction can be approximated by:

$$\text{Complexity reduction} \approx \frac{(H \cdot W \cdot C_{in} \cdot K^2 + H \cdot W \cdot C_{out} \cdot C_{in}) \cdot 2}{C_{in} \cdot K^2 \cdot H \cdot W \cdot C_{out} \cdot 2} = \frac{1}{C_{out}} + \frac{1}{K^2} \tag{2.9}$$

This means that for a 3 × 3 kernel the highest achievable reduction in complexity is 9. From equations 2.4 and 2.7 the exact complexity reduction is given by:

$$\text{Complexity reduction} = \frac{C_{in} \cdot (2K^2 - 1)}{C_{out} \cdot (2K^2 \cdot C_{in} - 1)} + \frac{2C_{in} - 1}{2K^2 \cdot C_{in} - 1} \tag{2.10}$$

As the depthwise and pointwise convolution are executed separately the memory accesses will be calculated for both operations individually. The MAC for a depthwise convolution:

$$\text{MAC}_{depthwise} = 2 \cdot H \cdot W \cdot C_{in} + K^2 \cdot C_{in} \tag{2.11}$$

The first term is the loading and writing of the feature maps, the second term is the loading of the kernel weights.

The number of MAC for a pointwise convolution:

$$\text{MAC}_{pointwise} = H \cdot W \cdot (C_{in} + C_{out}) + C_{in} \cdot C_{out} \tag{2.12}$$

Again, the first term is for the loading and writing of the feature maps and the second term is the loading of all the kernel weights.
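The sketch below evaluates approximations 2.5 and 2.8 for one example layer and checks that their ratio matches the prediction of equation 2.9; the chosen dimensions and function names are only illustrative.

```python
def depthsep_flops(h, w, c_in, c_out, k):
    """Approximate FLOPs of a depth separable convolution (equation 2.8)."""
    depthwise = h * w * c_in * k * k
    pointwise = h * w * c_in * c_out
    return 2 * (depthwise + pointwise)

def full_conv_flops(h, w, c_in, c_out, k):
    """Approximate FLOPs of the equivalent full convolution (equation 2.5)."""
    return 2 * h * w * c_in * c_out * k * k

# A 3x3 kernel with 128 input and output channels on a 25x25 feature map:
# equation 2.9 predicts a complexity ratio of roughly 1/C_out + 1/K^2.
h, w, c, k = 25, 25, 128, 3
print(depthsep_flops(h, w, c, c, k) / full_conv_flops(h, w, c, c, k))  # ~0.119
print(1 / c + 1 / k ** 2)                                              # ~0.119
```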

2.3 Common architectures

In this part, some commonly used networks for image classification are introduced. As those networks are made for image classification, all described networks are Convolutional Neural Networks (CNNs). A CNN is a neural network that essentially uses convolutional layers.

2.3.1 AlexNet

The first CNN used for large scale image recognition is AlexNet [4]; this network won the ImageNet [2] competition in 2012. It beat the winner of 2011 by a large margin. Afterward, the competition has always been won by CNNs.

The network starts with an 11x11 convolution; the spatial dimension of the feature map gradually decreases and the number of channels increases. The idea is that further in the network the features become more abstract. The network has a complexity of 1.4 GFLOPs. It was so large that at the time the memory of one GPU (3 GB) was not enough to train the network, so the authors had to split the network into two parts and run them on two GPUs. That is why it is separated into two groups in the figure. AlexNet achieved a top 5 accuracy of 81.8% and a top 1 accuracy of 59.3%.

Figure 2.3: Architecture of AlexNet. Borrowed from [4]

2.3.2 VGG

This network continued building on the same principles as AlexNet: the authors tried to achieve higher accuracies by going deeper (adding more layers). They made multiple architectures with different depths (numbers of layers), from 11 to 19 layers. They showed that deeper networks perform better, but at a greater computational and memory cost.

In their networks, only 3x3 convolutions were used. When multiple 3x3 convolutions are stacked in subsequent layers, their receptive field becomes greater: two stacked 3x3 convolutions have the same receptive field as one 5x5 convolution. The advantage of this is that the same receptive field can be achieved with fewer parameters and operations.

VGG-16 achieved a top 5 accuracy of 92.5% and a top 1 accuracy of 75.2%, for a complexity of 31 GFLOPs.

2.3.3 GoogleNet inception V1

The GoogleNet [5] network went even deeper than VGG. They made a 22-layer-deep network by stacking inception modules, figure 2.4. This module has multiple pathways with different operations. The idea is that while training, the network can choose which operation works best. If the output channels of all four paths were simply stacked, the number of channels would become very large. To solve this they used 1x1 convolutions as bottlenecks; these convolutions simply reduce a high number of input channels to a lower number of output channels.

Figure 2.4: Inception module used in GoogleNet. Borrowed from [5].

Figure 2.5 shows the full architecture of GoogleNet; the network is basically made by stacking multiple inception modules. Another particularity is that it has 2 auxiliary classifiers. Those are only used during training and allow more gradient to flow to the first layers of the network.

GoogleNet achieved a top 5 accuracy of 93.3% for a complexity of 3 GFLOPs.

Figure 2.5: GoogleNet architecture. Borrowed from [5].

2.3.4 ResNet

ResNet [6] won the ILSVRC 2015 competition. It uses an architecture with an impressive 152 layers.

The first thing the authors remarked is that the benefit of going deeper with simple convolution layers stagnates. They did tests with a 56-layer and a 20-layer model and showed that the test error of the 20-layer model was lower. This is unexpected, because a 56-layer model should be at least as good as a 20-layer model. This means that the problem does not lie in the model itself but in the optimization of the model: as a 56-layer model is larger, it is much harder to optimize all those parameters.

To solve this problem they proposed an architecture with residual connections; the base block of their architecture is shown in figure 2.6. The idea is to add ”shortcut connections” that skip some layers and are added back into the network. This means that instead of learning the desired output, the network has to learn the difference between the input and the output.

Figure 2.6: ResNet residual block. Borrowed from [6].

By stacking the residual block of figure 2.6, they made networks with different depths. The best network was ResNet-152, with 152 layers. It achieved a top 5 accuracy of 95.5% and a top 1 accuracy of 80.6% for a computational cost of 22.6 GFLOPs.

3 Efficient neural networks

In chapter 2 the basics of deep neural networks were introduced. To evaluate the performance of the network, the only metric that was looked at was the accuracy. But this is not the only metric to evaluate a neural network. The time to execute the networks (called inference time) and energy consumption are also important metrics. This is especially important for edge devices as they mostly have limited computational resources and are often battery powered [7]. A concrete example would be to deploy a neural network on a phone.

To evaluate how good a network is for deployment, 3 metrics are important:

• The performance of the network on the given task (mostly accuracy for image classifica- tion).

• The inference time

• The energy consumption

The problem with the energy consumption and inference time is that they depend on the hardware the network is running on. What can also be done is to look at the FLOPs instead of the energy consumption and inference time. The assumption here is that the inference time and energy consumption will be highly related to the number of FLOPs.


Efficient processing of neural networks means lowering the inference time and energy consump- tion. This can be done in different ways:

• On the hardware level

• Using different algorithms

• Modifying existing networks

• Creating better network architectures

Those are described in the following sections.

3.1 Hardware

Running a neural network consists almost entirely of multiply and accumulate operations. As those operations are independent, they can be parallelized. This means that using a Single Instruction, Multiple Data (SIMD) architecture is interesting; that is why GPUs are used rather than CPUs for the training and inference of networks. As deep learning is getting more popular, multiple companies make specific hardware for deep learning training and inference. This allows reducing the cost and energy consumption compared to GPUs.

Google made a Tensor Processing Unit (TPU) [8] for the training and inference of networks. Recently they introduced the edge TPU [9]; this one is made for constrained devices and focuses on inference rather than training. Tesla made an AI chip [10] to power its Autopilot system. Intel makes Nervana neural network processors [11]. They also make neural network processors for the edge in the form of USB sticks, namely the neural compute stick v1 [12] and v2 [13]. The neural compute stick v1 is the hardware that is used in this work.

This clearly shows that there is a lot of research going on to make more specific and thus efficient hardware for deep neural networks.

3.2 Different execution algorithms

Different algorithms can be used to compute deep learning layers. Those algorithms allow to reduce the computational cost or allow better exploitation of the hardware resources. As a consequence, choosing the right algorithm can make the execution of a layer more efficient.

3.2.1 Standardise to GeMM

Computing libraries for CPUs and GPUs are highly optimized for the execution of General Matrix Multiply (GeMM) operations. For a FC layer this is not a problem, as the basic operation here is a matrix multiplication. On the other hand, for a convolutional layer the data needs to be reordered so that it can be executed as a GeMM. Reordering the data is done with an image to column (Im2Col) operation, figure 3.1. Basically, every patch of the input image that needs to be multiplied with a kernel is converted into a column. The disadvantage of this is that data has to be reordered and there is duplicate data in the matrices. But the advantages of using GeMM outweigh the disadvantages [14].

Figure 3.1: Convolution converted to a matrix multiplication with Im2Col operation. Borrowed from [14]
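To illustrate the idea, the following minimal NumPy sketch rearranges the patches of a stride-1, unpadded convolution into columns and computes the convolution as a single matrix multiplication. It is only a simplified illustration of the Im2Col approach, not the implementation of any particular library.

```python
import numpy as np

def im2col(x, k):
    """Rearrange all k x k patches of x (H x W x C_in) into columns.

    Returns a matrix of shape (k*k*C_in, H_out*W_out) for a stride-1,
    no-padding convolution.
    """
    H, W, C = x.shape
    H_out, W_out = H - k + 1, W - k + 1
    cols = np.empty((k * k * C, H_out * W_out), dtype=x.dtype)
    for i in range(H_out):
        for j in range(W_out):
            patch = x[i:i + k, j:j + k, :]       # one k x k x C_in patch
            cols[:, i * W_out + j] = patch.reshape(-1)
    return cols

def conv_as_gemm(x, kernels):
    """Convolution expressed as one matrix multiplication (GeMM).

    kernels has shape (C_out, k, k, C_in); the result is H_out x W_out x C_out.
    """
    C_out, k, _, C_in = kernels.shape
    H_out, W_out = x.shape[0] - k + 1, x.shape[1] - k + 1
    cols = im2col(x, k)                          # (k*k*C_in, H_out*W_out)
    W_mat = kernels.reshape(C_out, -1)           # (C_out, k*k*C_in)
    out = W_mat @ cols                           # the GeMM
    return out.reshape(C_out, H_out, W_out).transpose(1, 2, 0)

# Example: a 3x3 convolution on a 10x10 RGB image with 16 output channels.
x = np.random.rand(10, 10, 3).astype(np.float32)
kernels = np.random.rand(16, 3, 3, 3).astype(np.float32)
print(conv_as_gemm(x, kernels).shape)            # (8, 8, 16)
```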

3.2.2 FFT convolution

This method uses the Fast Fourier Transform (FFT) to reduce the number of operations of a convolutional layer. A convolution in the time domain corresponds to an elementwise multiplication in the frequency domain, which costs fewer operations. The inputs and the kernels are transformed to the frequency domain with an FFT, in the frequency domain an elementwise multiplication is computed, and the result is transformed back to the time domain. Even with the cost of the FFT transformations, this method costs less than directly computing the convolution in the time domain [15].
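A minimal NumPy sketch of the idea for a single channel, without the optimizations a real library would apply. Note that deep learning frameworks usually compute a cross-correlation rather than a true convolution, so in practice the kernel would be flipped first; this sketch computes the true convolution.

```python
import numpy as np
from numpy.fft import fft2, ifft2

def fft_conv2d(image, kernel):
    """2D convolution of one channel computed in the frequency domain."""
    H, W = image.shape
    kH, kW = kernel.shape
    out_shape = (H + kH - 1, W + kW - 1)                         # full linear convolution size
    spectrum = fft2(image, out_shape) * fft2(kernel, out_shape)  # elementwise product
    full = np.real(ifft2(spectrum))                              # back to the spatial domain
    return full[kH - 1:H, kW - 1:W]                              # keep only the "valid" region

image = np.random.rand(64, 64)
kernel = np.random.rand(5, 5)
print(fft_conv2d(image, kernel).shape)                           # (60, 60)
```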

3.2.3 Winograd convolutions

Winograd convolutions use Winograd’s minimal filtering algorithms to reduce the number of operations needed to compute a convolution. They showed that for small kernel sizes this method performs better on a GPU compared to the previous FFT method [16].

3.3 Network modifications

Networks can also be made more efficient by simplifying existing networks. One method is to reduce the precision of weights and activations in the network, while another is to remove ”unnecessary” connections.

Reducing the precision can be done by mapping the 32-bit floating-point numbers to 16-bit floating-point numbers. Doing this for the weights and activations lowers the model size and allows for smaller arithmetic operations, which can be processed faster.

Paper [17] shows that by fine-tuning the network the weights and activations can be reduced to 8-bit fixed-point numbers for almost no accuracy loss. Fine tuning means that the network is retrained with the new reduced precision weights.
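As a rough illustration of the idea (not the exact scheme of [17]), the sketch below applies a simple symmetric post-training quantization to a weight tensor; the fine-tuning step described above is not shown and the function names are only illustrative.

```python
import numpy as np

def quantize_int8(weights):
    """Map float weights to int8 with a single symmetric scale factor."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map the int8 values back to floating point for comparison."""
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32) * 0.05
q, scale = quantize_int8(w)
print("max quantization error:", np.abs(w - dequantize(q, scale)).max())
```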

Tests were also made with binarized neural networks[18], where the weights and activations are mapped to +1 or -1. But this technique highly decreases the accuracy of the network.

Another method to make networks more efficient is to remove connections in the network. Connections with small weights can be set to 0 with only a small accuracy loss [19]. This is effectively the same as removing connections. This technique allows lowering the model size and inference time if the hardware supports it. To speed up the inference the hardware has to support skipping multiplications with 0 or benefit from executing sparsely connected layers.
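A minimal sketch of this pruning idea: the smallest-magnitude weights are set to zero. The chosen sparsity level is arbitrary, the retraining step is omitted, and as noted above the speedup only materializes if the hardware can exploit the zeros.

```python
import numpy as np

def prune_by_magnitude(weights, sparsity=0.5):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

w = np.random.randn(512, 512).astype(np.float32)
pruned, mask = prune_by_magnitude(w, sparsity=0.9)
print("remaining connections:", mask.mean())   # about 0.1
```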

3.4 Efficient network architectures

The networks described in section 2.3 are essentially focused on obtaining a higher accuracy for image classification. This is mostly done by stacking more layers and so increasing the complexity of the network. The problem is that this makes the networks more complex to execute and so increases the inference time. In recent years different networks have been proposed that try to be more efficient [20, 21, 22, 23, 24, 25]. This means maintaining an accuracy similar to the state of the art networks while significantly reducing the complexity. Some of those networks are described in the following sections.

3.4.1 Mobilenet V1

The Mobilenet paper [3] proposed a more efficient neural network by replacing full convolutions with depth separable convolutions (described in part 2.2.3). The full network architecture is described in table 3.1. As can be seen in the table, the network applies the same rule as VGG: every time the spatial image size is halved, the number of channels is doubled.

Table 3.1: Architecture of MobilenetV1-1.0-224. s is the stride, c the number of output channels, n the number of times the layer is repeated.

Layer type        s   Input size        c     n
Conv              2   224 × 224 × 3     32    1
Depthsep conv     1   112 × 112 × 32    64    1
Depthsep conv     2   112 × 112 × 64    128   1
Depthsep conv     1   56 × 56 × 128     128   1
Depthsep conv     2   56 × 56 × 128     256   1
Depthsep conv     1   28 × 28 × 256     256   1
Depthsep conv     2   28 × 28 × 256     512   1
Depthsep conv     1   14 × 14 × 512     512   5
Depthsep conv     2   14 × 14 × 512     1024  1
Depthsep conv     2   7 × 7 × 1024      1024  1
Avg pool (7 × 7)  1   7 × 7 × 1024      1024  1
FC                1   1 × 1 × 1024      1000  1

This model achieves 70.6% top-1 accuracy on ImageNet, only 1.1% lower than the same model with full convolutions. It has a complexity of 1137 MFLOPs, which is 8.55× fewer operations than the full convolution model. This showed that depth separable convolutions reduce the complexity considerably without losing much accuracy.

Depending on the application, a different trade-off between accuracy and complexity is desired. Therefore, two global hyperparameters were introduced: a width multiplier and a resolution multiplier. Those hyperparameters allow adjusting the complexity of the created network.

The width multiplier α changes the number of channels for the entire network. The number of input and output channels for all layers become αCin and αCout.

The resolution multiplier ρ changes the resolution of the image. The spatial resolution at every layer of the network will be equal to ρH and ρW .

Conventionally, the width and resolution multiplier are added to the name of the model; e.g., MobileNetV1-0.75-224 has a width multiplier of 0.75 and a resolution of 224x224.
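The effect of both multipliers on the complexity can be illustrated with a small sketch. It uses the usual cost model of a depth separable convolution (one multiply-add counted as two FLOPs); the function is our own illustration and not taken from the paper.

    def depthsep_conv_flops(h, w, c_in, c_out, k=3, alpha=1.0, rho=1.0):
        # Approximate FLOPs of a depth separable convolution under a width
        # multiplier alpha and a resolution multiplier rho.
        h, w = rho * h, rho * w
        c_in, c_out = alpha * c_in, alpha * c_out
        depthwise = 2 * h * w * c_in * k * k          # one k x k filter per input channel
        pointwise = 2 * h * w * c_in * c_out          # 1x1 convolution over all channels
        return depthwise + pointwise

    base = depthsep_conv_flops(56, 56, 128, 128)
    print(depthsep_conv_flops(56, 56, 128, 128, alpha=0.75) / base)     # ~0.57, roughly alpha^2
    print(depthsep_conv_flops(56, 56, 128, 128, rho=192 / 224) / base)  # ~0.73, exactly rho^2

The width multiplier scales the pointwise term quadratically and the depthwise term linearly, while the resolution multiplier scales both terms quadratically.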

3.4.2 Mobilenet V2

Mobilenet V2 [26] is based on Mobilenet V1. It introduces a novel layer module called the inverted residual with linear bottleneck. It consists of a pointwise convolution that expands a low dimensional feature map to a higher dimension, followed by a depthwise convolution. The last operation is a bottleneck operation that reduces the dimension again. The module is illustrated in figure 3.2.

Figure 3.2: Inverted residual with linear bottleneck block. The diagonally hatched structures do not contain non-linearities. Figure borrowed from [26] with some corrections: the Relu6 operation was shown on the bottleneck convolution instead of on the expansion convolution.

Bottleneck module

Non-linearities are needed in a network to be able to represent more than simply a linear combination of the input to the output. The problem with non-linearities is that they discard information. For example, the Rectified Linear Unit (ReLU) sets all negative numbers to zero, so information is lost. When using non-linearities in a low dimensional space (the dimension is the number of channels), information is lost. But when doing this in a high dimensional space, the information might still be present in another channel. So they suggest that non-linearities should only be applied in a high dimensional space to minimize the information loss. On the other hand, using high dimensional spaces has a higher computational cost. To mitigate this problem they proposed the expansion -> operation -> bottleneck structure of figure 3.2.

The first operation is a pointwise convolution that expands the low dimensional layer to a higher dimension; this operation has non-linearities. Then a depthwise convolution is applied, which is a low-cost operation even if there are a lot of channels; this operation also contains non-linearities. The last step is a linear bottleneck layer, a pointwise convolution that reduces the number of channels; here the layer does not have non-linearities as the number of output channels is low. The exact dimensions of this layer module are described in table 3.2.

Table 3.2: Bottleneck residual block transforming from Cin to Cout channels, with stride s and expansion factor t. Table borrowed from [26]

Input                 Operator                       Output
H × W × Cin           1x1 conv, ReLU6                H × W × tCin
H × W × tCin          3x3 dwise, stride s, ReLU6     H/s × W/s × tCin
H/s × W/s × tCin      linear 1x1 conv                H/s × W/s × Cout

Inverted residuals

Mobilenet V2 also uses residual connections; it has been shown that those connections help to increase the accuracy of deeper models [6][27]. As can be seen in figure 3.2, the residual connection connects the two bottlenecks. The intuition behind this is that those bottlenecks contain all the necessary information.

Full network

The full Mobilenet V2 architecture is given in table 3.3. Compared to Mobilenet V1, for an equal spatial image resolution the number of channels in the bottleneck layers is a lot smaller in Mobilenet V2. Just as in Mobilenet V1, a width and resolution multiplier are used to change the model's accuracy-complexity trade-off.

The largest Mobilenet V2 model (width multiplier = 1.4 & resolution = 224x224) achieves a top-1 accuracy of 74.7% on ImageNet and requires 1164 MFLOPs.

Table 3.3: Architecture of MobilenetV2-1.0-224. s is the stride, c the number of output channels, n the number of times the layer is repeated and t the expansion factor.

Layer type     s   Input size        c     t   n
Conv           2   224 × 224 × 3     32    -   1
Bottleneck     1   112 × 112 × 32    16    1   1
Bottleneck     2   112 × 112 × 16    24    6   2
Bottleneck     2   56 × 56 × 24      32    6   3
Bottleneck     2   28 × 28 × 32      64    6   4
Bottleneck     1   14 × 14 × 64      96    6   3
Bottleneck     2   14 × 14 × 96      160   6   3
Bottleneck     1   7 × 7 × 160       320   6   1
Conv 1x1       1   7 × 7 × 320       1280  -   1
Avg pool 7x7   -   7 × 7 × 1280      -     -   1
Conv 1x1       -   1 × 1 × 1280      k     -   -

3.4.3 Shufflenet V1

Depth separable convolutions are useful to reduce the complexity of a model while retaining a high level of accuracy [3, 20]. The problem for small models is that almost all the complexity now lies in the pointwise convolutions. In MobileNetV1-1.0-224 more than 90% of the complexity is in the pointwise layers. ShuffleNet [22] proposes to use group convolutions to reduce this complexity.

The problem with group convolutions is that the output of one group only depends on the input of that same group, i.e., the information between groups is not shared. To solve this problem they introduce a shuffle operation, illustrated in figure 3.3b. This ensures that channels are mixed between the different groups. The operation can be implemented as a reshape -> transpose -> reshape. The tensor of dimensions (N, H, W, C) is first reshaped to (N, H, W, G, C/G), where G is the number of groups. The two last dimensions are then transposed, which creates a tensor of dimensions (N, H, W, C/G, G). This tensor is then reshaped back to its original dimensions.
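A minimal NumPy sketch of this shuffle for an NHWC tensor is shown below (our own illustration):

    import numpy as np

    def channel_shuffle(x, groups):
        # Channel shuffle implemented as reshape -> transpose -> reshape.
        n, h, w, c = x.shape
        assert c % groups == 0
        x = x.reshape(n, h, w, groups, c // groups)   # split channels into (groups, channels per group)
        x = x.transpose(0, 1, 2, 4, 3)                # swap the two last axes
        return x.reshape(n, h, w, c)                  # flatten back to the original shape

    x = np.arange(2 * 4 * 4 * 6).reshape(2, 4, 4, 6)
    y = channel_shuffle(x, groups=3)                  # channel order 0,1,2,3,4,5 becomes 0,2,4,1,3,5

After the shuffle, every group of the next group convolution receives channels coming from all the groups of the previous one.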

As they reduce the complexity of the pointwise convolutions by using groups they can use more channels and still keep the same computational cost. They showed that for models with equal complexity, higher accuracies can be achieved by having more groups and more channels.

The basic unit used in ShuffleNetV1 consists of a 1x1 group convolution, a channel shuffle, a depthwise convolution and another 1x1 group convolution (see figure 3.4a). They also use residual connections in the unit. Figure 3.4b shows the same unit but with stride 2. The differences are: a 3x3 average pooling is added to the residual connection, and the channels of the residual are concatenated instead of added. ShuffleNet only uses one ReLU non-linearity, after the first pointwise convolution, and no ReLU after the depthwise convolution.

Figure 3.3: (a) Two stacked group convolutions, channels between the different groups are not shared. (b) Two stacked group convolutions where the channels are shuffled so that every output group fully relates with all the input groups. Borrowed from [22].

Table 3.4 gives the full architecture of ShuffleNetV1 with 3 groups.

Similarly to the MobileNets, they also use a parameter to increase the complexity of their model by multiplying the number of channels. To denote which width multiplier is used they just write it after the model name, e.g., ShuffleNet 2× uses a width multiplier of 2.

The top-1 accuracy achieved on ImageNet for ShuffleNetV1 2× is 73.7%, for a complexity of only 1048 MFLOPs. They also made a ShuffleNetV1 2× variant where they added Squeeze-and-Excitation layers [21] to the network; this increased the complexity to 1054 MFLOPs and the accuracy to 75.3%.


Figure 3.4: (a) Basic unit of the ShuffleNetV1 network. (b) Basic unit with stride 2. Borrowed from [22].

Table 3.4: Architecture of ShuffleNetV1. s is the stride, c the number of output channels and n the number of times the layer is repeated. The number of groups used for the group convolutions is 3.

Layer type        s   Input size        c     n
Conv              2   224 × 224 × 3     24    1
MaxPool 3x3       2   112 × 112 × 24    24    1
ShuffleNet unit   2   56 × 56 × 24      240   1
ShuffleNet unit   1   28 × 28 × 240     240   3
ShuffleNet unit   2   28 × 28 × 240     480   1
ShuffleNet unit   1   14 × 14 × 480     480   7
ShuffleNet unit   2   14 × 14 × 480     960   1
ShuffleNet unit   1   7 × 7 × 960       960   3
Global pool 7x7   2   7 × 7 × 960       960   1
FC                -   1 × 1 × 960       1000  1

3.4.4 Shufflenet V2

The ShuffleNet V2 paper [23] aims to create an efficient network by also looking at the inference time and not only at the FLOPs. For this, the authors proposed four guidelines for making efficient neural networks, summarized here.

G1) Equal channel width minimizes memory access cost (MAC). The inference time of a layer is not only related to the number of FLOPs but is also affected by the MAC. The MAC of a pointwise convolution is given in equation 2.12. They showed that for a pointwise convolution the number of FLOPs per MAC (called the arithmetic intensity) reaches a maximum when the number of input channels is equal to the number of output channels. They ran benchmarks on a CPU and a GPU with networks with different input and output channel ratios; the network with the same number of input and output channels always had the fastest inference time.

G2) Excessive group convolution increases MAC. Increasing the number of groups also reduces the arithmetic intensity (FLOPs/MAC). They also showed that the inference time is higher for convolutions with more groups, both on a CPU and on a GPU.

G3) Network fragmentation reduces the degree of parallelism. With network fragmentation they mean splitting one large convolution into multiple parts: executing smaller convolutions sequentially (equivalent to adding layers) or doing multiple convolutions in parallel (like a group convolution). They showed that at the same complexity the network with only one large convolution always has the lowest inference time.

G4) Element-wise operations are non-negligible. With element-wise operations they mean ReLU, bias, add, etc. Those layers have a low complexity but a high MAC, so their inference time is not negligible.

Figure 3.5 shows the basic units used in ShuffleNetV2; there is a stride 1 and a stride 2 unit. As there is a channel split operation, every branch only contains half of the channels of the network. So the 1x1 and depthwise convolutions are only executed on half of the channels, which removes the need for group convolutions (follows G2). To mix the information between the two branches, a channel shuffle operation is added after the two branches have been concatenated again. They don't use any expansion or bottleneck layers as those would violate G1.

Table 3.5 describes the full architecture of ShuffleNetV2, the architecture is similar to Shuf- fleNetV1 (table 3.4) but with a different unit.

ShuffleNetV2 2× achieves a top-1 accuracy of 74.9% for a complexity of 1182 MFLOPs.


Figure 3.5: (a) Basic unit of the ShuffleNetV2 network. (b) Basic unit with stride 2. Borrowed from [23].

Table 3.5: Architecture of ShuffleNetV2 1×. s is the stride, c the number of output channels and n the number of times the layer is repeated.

Layer type          s   Input size        c     n
Conv 3x3            2   224 × 224 × 3     24    1
MaxPool 3x3         2   112 × 112 × 24    24    1
ShuffleNetV2 unit   2   56 × 56 × 24      116   1
ShuffleNetV2 unit   1   28 × 28 × 116     116   3
ShuffleNetV2 unit   2   28 × 28 × 116     232   1
ShuffleNetV2 unit   1   14 × 14 × 232     232   7
ShuffleNetV2 unit   2   14 × 14 × 232     464   1
ShuffleNetV2 unit   1   7 × 7 × 464       464   3
Conv 1x1            1   7 × 7 × 464       1024  1
Global pool 7x7     2   7 × 7 × 1024      1024  1
FC                  -   1 × 1 × 1024      1000  1

4 Movidius stick

The Movidius™ stick is a neural network accelerator made by Intel®. It communicates via USB and has the form factor of a USB stick. The Movidius stick allows running neural networks and sending the results back to the host computer.

4.1 Movidius stick architecture

The Movidius™ stick is based on the Myriad 2 Vision Processing Unit (VPU). This chip was developed by the Movidius™ company, which was acquired by Intel® in September 2016 [28]. The Myriad 2 VPU is a co-processor designed to accelerate computer vision related tasks. The architecture is depicted in figure 4.1; it consists of 12 SHAVE Very Large Instruction Word (VLIW) vector processors and programmable hardware accelerators [29]. The programmable hardware accelerators are used for computer vision specific tasks such as lens shading correction, sharpening filters, gamma correction, etc. The Myriad 2 VPU runs at 600 MHz at 0.9 V. Intel used this chip to create the Movidius neural compute stick. It is marketed as a neural network inference device, but it is important to note that the Myriad 2 chip wasn't specifically built for neural network inference [30]; it was built for more general computer vision related tasks.


Figure 4.1: Myriad2 VPU block diagram. Borrowed from [29]

The more common name used by Intel to denote the Movidius stick is the Neural Compute Stick (NCS); from now on we will always use the NCS abbreviation.

4.2 OpenVino toolkit

The OpenVino toolkit is (partially) open source software made by Intel; it is used for the deployment of neural networks on different Intel hardware platforms. Figure 4.2 shows the general development workflow when working with the OpenVino toolkit. The first step is to create and train a network in any supported framework such as Caffe [31] or Tensorflow [32]. Then this model needs to be transformed into an intermediate representation with the OpenVino model optimizer (more about this later). This Intermediate Representation (IR) consists of an XML and a .bin file. The XML file describes the full architecture of the network while the .bin file contains the parameters. As the last step, the OpenVino inference engine can run the IR model on any supported hardware. The supported hardware consists of Intel CPUs, GPUs, FPGAs or VPUs.

Figure 4.2: Deployment workflow to program the NCS. Image borrowed from the OpenVino documentation[33]

4.2.1 Model optimizer

The model optimizer converts a model from another framework, like Tensorflow or Caffe, to an intermediate representation. It doesn't only convert the model but also performs some optimizations. These optimizations include fusing multiple layers into one, the ResNet stride optimization and cutting off parts of the model. They are described below.

Batchnorm layer fusing

The batchnorm layer [34] is used in almost all recent models; it consists of a normalization (equation 4.1) and a scale operation (equation 4.2):

x̂_i = (x_i − µ_B) / σ_B    (4.1)

y_i = γ x̂_i + β    (4.2)

The parameters µ_B, σ_B, γ and β are learned while training the model; at inference time they are constants for any given neuron. The value of x_i is calculated from the previous layer activations by the following equation:

x_i = w_0 a_0 + w_1 a_1 + ... + w_{n−1} a_{n−1} + w_n a_n + b    (4.3)

With a the previous layer activations, w the weights, b the bias and n the number of neurons to which x_i is connected. By combining equation 4.3 with the normalization and scale operations

(4.1 and 4.2) we get the following equation for y_i:

y_i = (γ / σ_B) (w_0 a_0 + w_1 a_1 + ... + w_{n−1} a_{n−1} + w_n a_n + b − µ_B) + β

By transforming the weights and biases of equation 4.3, the batchnorm layer can be fused into the previous layer. The new weights and bias become:

w_new = (γ / σ_B) w    (4.4)

b_new = (γ / σ_B) (b − µ_B) + β    (4.5)

Doing this effectively removes the computational cost of the batchnorm layers.
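The fusion can be verified with a few lines of NumPy. The sketch below applies equations 4.4 and 4.5 to a random fully connected layer and checks that the fused layer produces the same output as the original layer followed by the batchnorm; ε, the small constant usually added to σ_B for numerical stability, is ignored here.

    import numpy as np

    def fuse_batchnorm(w, b, gamma, beta, mu, sigma):
        # Fold the batchnorm parameters into the weights and bias of the previous layer.
        scale = gamma / sigma                  # one scale factor per output neuron
        return w * scale[:, None], scale * (b - mu) + beta

    outputs, inputs = 8, 16
    w, b = np.random.randn(outputs, inputs), np.random.randn(outputs)
    gamma, beta = np.random.randn(outputs), np.random.randn(outputs)
    mu, sigma = np.random.randn(outputs), np.random.rand(outputs) + 0.5
    a = np.random.randn(inputs)                # activations of the previous layer

    reference = gamma * ((w @ a + b - mu) / sigma) + beta
    w_new, b_new = fuse_batchnorm(w, b, gamma, beta, mu, sigma)
    assert np.allclose(w_new @ a + b_new, reference)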

ResNet stride optimization

This optimization is specific to the ResNet architecture; a part of the ResNet architecture is shown in figure 4.3a. The problem here is that the convolutions 3a_1 and 3a_2a are both pointwise convolutions with stride 2. This basically means that half of the input image is not even used, so those points were computed for nothing.


Figure 4.3: Part of the architecture of ResNet-50. The network before optimization (a). The network after the model optimizer stride optimization (b). Images made with NetScope [35]

The idea of stride optimization is to move the strided convolution higher up in the network. It is moved up until the first convolution with a kernel size larger than 1x1 is found, in this case the 2c_2b convolution. Figure 4.3b shows the network after stride optimization. Here the stride has been moved up into the 2c_2b convolution, so the 2c_2c convolution only needs to be computed on a 28x28 feature map. Also, a 1x1 stride 2 pooling operation had to be added to the other branch so that the spatial dimension would be 28x28. This optimization allows reducing the number of operations needed for the ResNet networks. E.g. ResNet-50 goes from 7.72 GFLOPs to 6.97 GFLOPs after optimization, a reduction of 9.7% in FLOPs.

Cutting off model parts

The model optimizer also allows cutting off parts of models. This can be useful when some parts are only needed for the training of the model. An example is the auxiliary classifiers used in the Inception network; they are only used during training to get a better gradient flow through the network.

4.2.2 Inference engine

The inference engine module is basically an API that allows loading an IR model and running it on any supported Intel device. There is a C++ API and a Python API; the Python API is a simplified version of the full C++ API and only supports basic functionality for now. In this work, the Python API has been used for simplicity. The inference engine supports Intel CPUs, GPUs, FPGAs and VPUs, and this is where the open-source character of OpenVino ends. The inference engine code is fully open source, but the libraries used to communicate with the proprietary Intel hardware (called plugins) are not all open source. The CPU and GPU plugins are in the open source OpenVino distribution [36], but the FPGA and VPU plugins are only included when downloaded from the Intel developer website.

To directly program the chip on the NCS (the Myriad 2) the Myriad SDK is needed. This SDK is not publicly available; as we don't have the Myriad SDK we will only be able to make changes to the architecture of the networks and not to the low-level execution algorithms.

Running a neural network with the inference engine follows 3 steps: loading, compiling and executing the network.

1. The first step is to load the XML and .bin file of the network. This creates a CNNNetwork object that represents the network in host memory.

2. The second step is to compile and upload the network to the target device. In case of the NCS, the full network and weights are placed in the RAM of the NCS.

3. The last step is to send the input data to the target device and ask for the inference; once the inference is done the target device returns the outputs of the network.

Asking for inference can be done in two ways: with a sync or an async call. The sync call is blocking and waits until the inference is done, while the async call is non-blocking and so allows issuing another inference request before the previous one is done. As the NCS is operated via USB, sending the input data to the NCS can limit the performance. To mitigate this, Intel recommends having at least four parallel (async) inference requests to fully hide the data transfer costs [37].
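The three steps and the sync/async calls look roughly as follows with the Python API of the OpenVino 2018 R5 release used in this work. The file names are placeholders and the exact class and method names may differ in other OpenVino versions, so this should be read as a sketch rather than a reference.

    import numpy as np
    from openvino.inference_engine import IENetwork, IEPlugin

    # Step 1: load the IR (XML + .bin) into an in-memory network representation.
    net = IENetwork(model="model.xml", weights="model.bin")
    input_blob = next(iter(net.inputs))
    output_blob = next(iter(net.outputs))

    # Step 2: compile and upload the network to the NCS (device "MYRIAD").
    plugin = IEPlugin(device="MYRIAD")
    exec_net = plugin.load(network=net, num_requests=4)    # several requests to hide USB transfers

    image = np.random.rand(*net.inputs[input_blob].shape).astype(np.float32)

    # Step 3, sync: a blocking call that returns when the inference is done.
    result = exec_net.infer(inputs={input_blob: image})[output_blob]

    # Step 3, async: several requests can be in flight at the same time.
    for request_id in range(4):
        exec_net.start_async(request_id=request_id, inputs={input_blob: image})
    for request in exec_net.requests:
        request.wait(-1)                                    # -1 blocks until this request is done
        result = request.outputs[output_blob]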

4.3 NCS capabilities

The NCS doesn't support all the layers that are available in the OpenVino toolkit; all the supported layers can be found in the OpenVino documentation [38]. Basically, what's not supported is everything related to 3D convolution (where the kernel moves in 3 directions) and to memory (so RNNs, LSTMs, etc. are not supported).

The NCS only supports FP16 inference, although the datasheet of the Myriad 2 says it also supports FP32 and 8/16/32-bit integer operations. So this is a limitation of the VPU plugin used in the OpenVino toolkit.

On GPUs it is common practice to use large batch sizes because this largely increases the throughput. On the NCS this isn't true: the batch size has no influence on the throughput. If the batch size is doubled, the inference time is also doubled. This is logical because the NCS is meant to be used in real-time operation where latency is important and so batch sizes of one are used. All the following tests in this work were made with a batch size of one.

The NCS has 0.5 GB of LPDDR3 RAM, offering 400 GB/s of memory bandwidth. This memory is used to store the full network and the activations for every layer. As the NCS can only be used for inference, the only activations that need to be kept are those that are still needed for computations later on. This size should be enough for most networks; for example, VGG19 has 144 million parameters, which represents 0.288 GB of data in 16-bit floating-point precision, a little more than half of the available RAM.

As networks can be simplified by pruning small weights, it would be interesting if the NCS could handle zero weights efficiently. This was tested with AlexNet: the inference speed was measured with random weights and with all-zero weights. No significant difference in inference speed was measured. This means that the pruning of a network cannot be exploited on the NCS.

The power consumption of the NCS was measured: It is 0.30 W at idle. During inference, the power consumption varies between 1.3 and 1.8 W.

Knowing the capabilities of the NCS already allows us to eliminate some of the optimization possibilities described in the previous chapter. Working on the low-level algorithms that execute the deep learning layers (as described in section 3.2) won't be possible, as we don't have access to the low-level implementation of the layers executed on the NCS.

The optimizations that reduce the precision of the weights and activations also won't be possible, as only FP16 is allowed on the NCS. Setting weights to 0 doesn't reduce the inference time either.

This means that the remaining possible optimizations are in the use of better neural network architectures.

5 Layer's benchmarks

In this chapter different layers used in CNNs are benchmarked. The inference time of the layer is measured while the complexity (FLOPs) of the layer is increased. The following layers are benchmarked:

• Fully connected layer

• Convolutional layer

• Depth separable convolution

• ReLU and bias layer

Depending on the layer type, the complexity can be modified with different parameters. For a FC layer the complexity can be modified by changing the number of inputs and outputs of the layer, while for a convolutional layer the image size, kernel size, and input and output channels all have an impact on the complexity. The influence of those different parameters on the inference time is analyzed in this chapter.

The benchmarks are executed with the Intel® OpenVino™ toolkit version 2018 R5. The benchmarks have been executed on 2 hardware platforms supported by OpenVino:


• On the NCS version 1 at 16-bit floating point (FP16) precision

• On an Intel i7 5700HQ CPU on a single thread at 32-bit floating point (FP32) precision. The frequency of the processor was kept constant at 3.5 GHz.

The procedure we followed for the layer benchmarks is the following:

1. We create a Tensorflow model that only consists of the layer of interest, with the parameters that we want to test (kernel size, image height, ...).

2. We convert the model to an IR with the OpenVino Model optimizer.

3. We run the IR on the target hardware (with random input data) at least 10 times and measure the inference time with the get_perf_counts method of the OpenVino inference engine API.

4. We calculate the average inference time and standard deviation for this layer.

A remark on step 3: the get_perf_counts method returns the inference time for each unique layer, but what it returns differs according to the inference hardware used. On the NCS a layer with a bias is executed in 2 parts, the layer itself and the bias operation, so we can measure the time for the bias operation and the layer individually. On the CPU this is only reported as one layer. This means that in the NCS benchmarks the time to compute the bias is not included, while for the CPU it is included.

Another remark: the get_perf_counts method reports the time needed to receive the input data as a separate entry. This means that in the following benchmarks the data transfer time is not included.
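As an illustration of steps 1 and 2, the sketch below builds a graph containing a single 3x3 convolution with the TensorFlow 1.x API used in this work and freezes it to a .pb file; the layer parameters, file names and the model optimizer invocation in the comment are illustrative.

    import tensorflow as tf

    # Step 1: a graph that only contains the layer of interest, here a 3x3
    # convolution on a 25x25 image with 128 input and 128 output channels.
    tf.reset_default_graph()
    inputs = tf.placeholder(tf.float32, shape=[1, 25, 25, 128], name="input")
    conv = tf.layers.conv2d(inputs, filters=128, kernel_size=3, padding="same", name="conv")

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        frozen = tf.graph_util.convert_variables_to_constants(
            sess, sess.graph_def, output_node_names=[conv.op.name])
        tf.train.write_graph(frozen, ".", "conv_3x3.pb", as_text=False)

    # Step 2 (shell): convert the frozen graph to an IR with the model optimizer, e.g.
    #   python mo_tf.py --input_model conv_3x3.pb --data_type FP16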

5.1 Fully connected layer benchmark

Figure 5.1: Inference time (a) and computational performance (b) of a fully connected layer in function of the number of input and output neurons. Benchmark on NCS.

Figure 5.2: Inference time (a) and computational performance (b) of a fully connected layer in function of the number of input and output neurons. Benchmark on CPU.

In this section a FC layer is benchmarked on the NCS and CPU. The inference time of a FC layer in function of the number of input and output neurons is shown in figures 5.1a and 5.2a, on the NCS and CPU respectively. The number of input and output neurons is kept equal (Nin = Nout). Approximation 2.2 shows that the complexity of a fully connected layer is proportional to the number of input neurons multiplied with the number of output neurons, so if Nin = Nout the complexity is proportional to the squared number of neurons N: FLOPs ∼ N². As the inference time is expected to be proportional to the complexity, it can be stated that the inference time should be proportional to N²:

Inference time ∼ N² for Nin = Nout (5.1)

As both axes in charts 5.1a and 5.2a are logarithmic, it is hard to see whether the relation between the number of neurons and the inference time is quadratic. Therefore, the orange lines are added as a reference on the figures. Those lines represent a quadratic relation on the graph and thus show a constant computational performance. The computational performance is expressed in FLoating point OPerations per Second (FLOPS). The line was placed so that it crosses the measuring point with the highest computational performance, which means no measured point is located under the line. The closer the points are to this line, the higher their computational performance.

5.1.1 NCS results

Figure 5.1a shows that for a high number of neurons (above 2000) the quadratic relation is followed, while for a lower number of neurons the inference time is higher than the quadratic relation predicts. This means that a layer with a lower number of neurons is executed less efficiently on the NCS. This is also visible in figure 5.1b, where the number of FLOPS is shown. The chart shows that for a higher number of neurons the computational performance increases rapidly until it plateaus around 2000 neurons. So, for networks with an equivalent amount of FLOPs, it would be better to have fewer layers with a high neuron count than more layers with a low neuron count. The latter would be executed at a lower computational performance and thus run slower on the NCS.

5.1.2 CPU results

The same charts were made for the execution on the CPU, figures 5.2a and 5.2b. Interestingly, the charts show the inverse effect: for an increase in complexity, the computational performance actually decreases. As inference time on the CPU is not the focus of this work, the possible causes weren't analyzed.

5.2 Convolutional layer benchmark

Figure 5.3: Inference time (a) and computational performance (b) of a convolutional layer in function of the number of channels. The input and output channels are set equal and the used kernel size is 3x3. The different series are inferences for a different image size. Benchmark on NCS.

Figure 5.4: Inference time (a) and computational performance (b) of a convolutional layer in function of the number of channels. The input and output channels are set equal and the used kernel size is 3x3. The different series are inferences for a different image size. Benchmark on CPU.

Equation 2.5 shows that the complexity of a convolutional layer is proportional to the number of input channels (Cin), output channels (Cout) and the image size (H & W). It is expected that as those parameters increase, the inference time also increases. Figures 5.3a and 5.4a show this for the NCS and CPU. The x-axis represents the number of channels, where the input channels are set equal to the output channels (Cin = Cout). The different series represent different image sizes (H & W). The graphs show that for an increasing number of channels and image size the inference time increases.

5.2.1 NCS results

Figure 5.3b shows the computational performance on the NCS. Between 8 and 64 channels the performance increases with the complexity of the layer. For a number of channels between 72 and 1024, the same observation can be made. What is really special about this figure is the performance ”jump” between 64 and 72 channels. For images larger than 15x15 there is a large performance hit, while for images smaller than 10x10 there is a performance increase.

This ”jump” is also visible in figure 5.3a, which shows that for an image size of 5x5 the inference time with 72 channels is even lower than the time for 64 channels. This means that it would be inefficient to use a 5x5 image with 64 channels, as more channels can be used for an even lower inference time.

It is clear that the NCS changes its way of computing a convolutional layer at the mark of 64 channels. This modification is beneficial for layers with a small spatial size (lower than 10x10) but harmful for larger spatial sizes (higher than 15x15). In most cases, this won't be a problem for small spatial sizes, because when the spatial size is lower than 10x10 the network often has more than 64 channels. Unfortunately, we cannot modify or know how the NCS is executing those layers as the code running on the NCS is proprietary.

5.2.2 CPU results

Figure 5.4b shows the computational performance in function of the complexity on the CPU. For the CPU the general trend is that an increase in the number of channels increases the computational performance, until it plateaus around 100 channels. An increase in image size also results in a performance increase; figure 5.4b shows that for a larger image size the performance plateaus higher. The computational performance plateaus around 100 GFLOPS for image sizes above 25x25, while only 90 GFLOPS and 80 GFLOPS are achieved for image sizes of 15x15 and 10x10 respectively. With an image size of 5x5, another behavior is observed: the number of FLOPS is at its maximum (80 GFLOPS) at 200 channels, and for an increasing number of channels the performance quickly drops to 60 GFLOPS.

5.3 Pointwise convolution benchmark

Figure 5.5: Inference time (a) and computational performance (b) of a pointwise (1 × 1) convolution in function of the number of channels. The input and output channels are set equal. The different lines are inferences on a different image size. Benchmark on NCS.

Figure 5.6: Inference time (a) and computational performance (b) of a pointwise (1 × 1) convolution in function of the number of channels. The input and output channels are set equal. The different lines are inferences on a different image size. Benchmark on CPU.

A pointwise convolution is basically the same type of convolution as described in part 5.2, the only difference being the kernel size of 1x1. So, the results should be similar to those obtained in part 5.2. This layer is benchmarked because it is used in a depth separable convolution, section 2.2.3.

5.3.1 NCS results

Figure 5.5 shows the inference time and computational performance of a pointwise layer in function of the number of channels and for different image sizes on the NCS. Chart 5.5a shows that for a low number of channels and a small image size the inference time doesn't go lower than 86 µs. We will call this time the minimum overhead time. It is the lowest possible inference time for a pointwise convolution and can be seen as a constant overhead.

Chart 5.5b shows that when the number of channels and the image size increase, the computational performance also increases and reaches a maximum of 61 GFLOPS. Going from 208 to 216 channels, the same performance ”jump” can be seen as for the 3x3 convolutions (chart 5.3b). The difference is that this ”jump” occurs from 208 to 216 channels for a pointwise convolution, while it is from 64 to 72 channels for a 3x3 convolution. This means that if the jump is produced by an algorithmic change on the NCS, the decision is not purely based on the number of channels.

5.3.2 CPU results

Figure 5.6b shows the same benchmark on the CPU. Chart 5.6a shows that at low complexity there is again a lowest possible inference time. The difference is that here this overhead time is only about 1 µs. Chart 5.6b shows that for higher complexities the computational performance stays rather constant between 80 and 100 GFLOPS. We don't see the same large performance difference between low and high complexity layers.

5.4 Depthwise convolution benchmark

Figure 5.7: Inference time (a) and computational performance (b) of a depthwise convolution in function of the number of channels. The used kernel size is 3 × 3. The different lines are inferences on a different image size. Benchmark on NCS.

Figure 5.8: Inference time (a) and computational performance (b) of a depthwise convolution in function of the number of channels. The used kernel size is 3 × 3. The different lines are inferences on a different image size. Benchmark on CPU.

In this part, the benchmarks of a depthwise convolution are analyzed. The complexity of a depthwise convolution is given by the first term of equation 2.7; the difference with a full convolution is that the complexity is independent of the number of output channels. The consequence is that the complexity of a depthwise convolution only increases linearly with the number of channels instead of quadratically as for a full convolution (remember that in the benchmarks we set Cin = Cout).

5.4.1 NCS results

Chart 5.7a shows, again, that for lower complexities there is a minimum overhead time, here 105 µs. With increasing complexity the computational performance increases up to a maximum and then decreases slightly. This maximum happens at a different number of channels depending on the image size: for an image size of 100x100 the maximum is at 200 channels, while for an image size of 10x10 it is at 900.

5.4.2 CPU results

The benchmark on the CPU is visible in figure 5.8. The computational performance graph shows that for the same image size the computational performance stays about constant, and similarly to the NCS a higher computational performance is achieved for larger image sizes.

5.5 Speedup of depth separable convolutions

Depth separable convolutions are used because they offer a smaller complexity and size for almost the same accuracy as full convolutions. In this part, we measure if they effectively have a lower inference time compared to an equivalent full convolution. In the previous benchmarks, the inference time of a depthwise, pointwise and 3x3 convolution was measured. This means that we have all the required information to calculate the speedup. A depth separable convolution consists of a depthwise convolution followed by a pointwise convolution, so by adding the inference times of chart 5.5a and 5.7a we obtain the inference time of a depth separable convolution. The speedup of the depth separable convolution is then calculated as:

Speedup = (Inference time full conv) / (Inference time depthsep conv)

The speedup results in function of the number of channels are shown in figure 5.9. The line represents the theoretical speedup, which is calculated as:

Theoretical speedup = (FLOPs full conv) / (FLOPs depthsep conv)
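The theoretical speedup line can be reproduced with the sketch below; it uses the usual FLOP counts for a full and a depth separable convolution (one multiply-add counted as two FLOPs) and is our own illustration.

    def conv_flops(h, w, c_in, c_out, k):
        return 2 * h * w * c_in * c_out * k * k        # full convolution

    def depthsep_flops(h, w, c_in, c_out, k):
        return 2 * h * w * c_in * (k * k + c_out)      # depthwise part + pointwise part

    # For a 3x3 kernel the theoretical speedup grows with the channel count towards k*k = 9.
    for c in (16, 64, 256, 1024):
        speedup = conv_flops(25, 25, c, c, 3) / depthsep_flops(25, 25, c, c, 3)
        print(c, round(speedup, 2))                    # 5.76, 7.89, 8.69, 8.92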

Figure 5.9: Speedup of a depth separable convolution with respect to a full convolution. Used kernel size is 3 × 3. Benchmarked on the NCS.

Figure 5.9 is divided into four zones; those zones are analysed in the following paragraphs.

Zone 1 shows measurements with a speedup lower than one, meaning that the depth separable convolution is slower than the full convolution. Those are measurements with a low channel count and a small image size, so low complexity layers. The reason is that the depth separable convolution consists of two operations. As seen in the previous parts, both depthwise and pointwise convolutions have a constant overhead time for low complexities. As the depth separable convolution consists of a pointwise and a depthwise operation, it has two times the overhead. This causes a slowdown on the low complexity points.

As the complexity increases the speedup also increases; this is zone 2. This can be explained by two reasons. The first one is that as the complexity increases, the overall inference time increases and the overhead time becomes less significant. Second, at a higher channel count the theoretical speedup of a depth separable layer becomes higher.

Going from zone 2 to zone 3 we can see a sudden speedup increase for large image sizes and a decrease for smaller image sizes. This change happens going from 64 to 72 channels and corresponds to the performance jump in the 3x3 convolutions benchmark, figure 5.3b. As discussed in the 3x3 convolutions analysis (section 5.2.1), there is a drop in performance for image sizes larger than 15x15 and a performance increase for image sizes smaller than 10x10. This, of course, has an impact on the speedup: for image sizes larger than 10x10 there is a clear increase in speedup as the inference time of the full convolution increases, while for image sizes lower than 10x10 there is a speedup decrease.

From zone 3 to zone 4, another sudden change happens. This happens going from 208 to 216 channels and corresponds to the performance jump of the pointwise convolutions, see chart 5.5b. This is the same jump as for the 3x3 convolution (figure 5.3b) but it happens at a higher channel count. As this ”jump” now happens in the depth separable convolution, it cancels the effect of the jump of the 3x3 convolution: the computational performance decreases for large image sizes and increases for small image sizes. The same effect is seen in the speedup, which decreases for image sizes larger than 25x25 and increases for images smaller than 25x25.

Zone 4 shows that for larger image sizes and a higher number of channels the speedup increases slightly. But the measured speedup does not match the theoretical speedup. This is because the depthwise convolutions are executed at a lower computational performance than a 3x3 convolution. The highest achieved performance for depthwise convolutions is 8 GFLOPS, see chart 5.7b, while for the pointwise (chart 5.5b) and 3x3 convolutions (chart 5.3b) a maximum performance of 61 GFLOPS is achieved.

5.6 ReLU & bias

In this part, the inference time of a ReLU and a bias layer are benchmarked. Often the bias isn't seen as a separate layer because it is included in the FC or convolutional layers. But on the NCS the bias is actually computed as a separate layer, so the inference time of only the bias can be measured.

Figure 5.10a shows the inference time for bias and ReLU in function of the number of bias and ReLU operations on the NCS (ReLU and bias are counted as 1 operation each). The first observation is that the inference time is almost the same for both operations. The second observation is that for less than 40 thousand operations the inference time is constant and takes about 65 µs.

Figure 5.10: Inference time (a) and computational performance (b) of a bias and ReLU in function of the number of executed operations. Bias and ReLU are considered as one operation each. Benchmark on NCS.

Figure 5.10b plots the number of operations per second on the NCS. The FLOPS notation isn’t used here because a ReLU isn’t a floating-point operation. For a number of operations higher than 40 thousand the achieved performance is a bit lower than 1 GOPS.

It is important to note that the inference times of the bias and ReLU layers are absolutely not negligible, especially for smaller convolutions. Neural networks often have a bias and a ReLU operation for every FC or convolutional layer, which means that at least 130 µs per layer is used for bias and ReLU. By looking back at the inference times of convolutional and FC layers (charts 5.1a, 5.3a, 5.5a and 5.7a) it is clear that the inference time of smaller layers is in the same order of magnitude as the inference time for the bias and ReLU. Some concrete examples: a 1024x1024 FC layer has an inference time of 905 µs, or 1035 µs with bias and ReLU, so the bias and ReLU take 12.5% of the inference time. For a 50x50 depthwise convolution with 128 channels the inference time is 850 µs, and the bias and ReLU take 13% of the inference time.

5.7 Channel ratio

The first ShuffleNetV2 guideline showed that when using a different number of input channels than output channels, the MAC increases relative to the FLOPs. This results in a lower computational performance for the layer; in other words, an unbalanced channel ratio reduces the arithmetic intensity. This claim was verified on the NCS for image sizes of 100x100 and 50x50, see tables 5.1 and 5.2. Four layers of equal complexity were benchmarked; the only difference is the input and output channel ratio.

Table 5.1: Conv 1x1, image size 100x100

c1:c2   (c1, c2)     GFLOP/s   Diff
1:1     128, 128     59.5      0%
1:4     64, 256      58.0      -2.5%
1:8     32, 512      49.6      -16.6%
1:16    16, 1024     37.6      -36.8%

Table 5.2: Conv 1x1, image size 50x50

c1:c2   (c1, c2)     GFLOP/s   Diff
1:1     128, 128     54.2      0%
1:4     64, 256      52.6      -3%
1:8     32, 512      49.2      -9.2%
1:16    16, 1024     36.9      -31.9%

The measurements show that as the channel ratio increases, the computational performance decreases. This performance decrease starts to be significant for channel ratios larger than 8. The test confirms that using an unbalanced channel ratio reduces the performance on the NCS.

5.8 Comparison of efficiencies

As seen in the previous sections, different layers are executed at a different computational performance on the same hardware. Table 5.3 summarizes the highest achieved performance on the NCS for every layer type. The highest computational performance is achieved for the 1x1 and 3x3 convolutions. Full convolutions are almost 10× more efficient compared to depthwise and FC layers. This shows that the NCS is optimized for executing convolutions.

One could ask why depthwise convolutions and fully connected layers are less efficient on the NCS. One metric that wasn't looked at yet is the arithmetic intensity of the layer. The arithmetic intensity is defined as the number of FLOPs per byte of memory accessed. This is an important metric because the execution speed of a layer can be bottlenecked by the memory. The arithmetic intensity can be calculated by the following equation:

Arithmetic Intensity = FLOPs / MAC    (5.2)

The unit of the arithmetic intensity is FLOPs/byte. The lower the arithmetic intensity, the more memory bandwidth is needed to achieve the same execution speed in FLOPS. Table 5.4 shows the maximum achievable arithmetic intensity for the layers benchmarked in this chapter.

Table 5.3: Computational performance of different layers on the NCS.

Image size       5 × 5    10 × 10   15 × 15   25 × 25   50 × 50   100 × 100
Layer type       (GFLOPS)
FC               3.22     3.22      3.22      3.22      3.22      3.22
Depthconv 3x3    2.0      4.4       5.1       6.0       6.9       7.6
Conv 1x1         28.4     48.0      52.1      55.8      59.9      61.3
Conv 3x3         38.1     57.5      58.4      60.9      62.2      63.1

Table 5.4: Maximum arithmetic intensity of different layers on the NCS. FP16 is used so 1 memory access is 2 bytes.

Image size       5 × 5    10 × 10   15 × 15   25 × 25   50 × 50   100 × 100
Layer type       (FLOP/byte)
FC               1        1         1         1         1         1
Depthconv 3x3    3.6      4.1       4.2       4.2       4.2       4.2
Conv 1x1         25       100       225       625       2500      10 000
Conv 3x3         25       100       225       625       2500      10 000

Table 5.4 shows that the arithmetic intensity of a FC layer and a depthwise convolution are a lot lower than that of full convolutions. We suppose that this is the reason why those layers achieve a significantly lower performance on the NCS. As those layers are more memory intensive, it is harder to supply the compute units with enough data, which would mean that the bottleneck lies in the memory hierarchy architecture.
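As an illustration, the sketch below estimates the arithmetic intensity of a pointwise and a depthwise convolution with a simplified memory model (reading the input feature map and the weights, writing the output feature map, all in FP16). The exact values of table 5.4 follow the MAC definition of equation 2.12 and report the theoretical maximum, so the numbers differ slightly, but the conclusion is the same: depthwise convolutions are strongly memory bound.

    def arithmetic_intensity(flops, memory_elements, bytes_per_element=2):
        # FLOPs per byte of memory traffic; FP16 means 2 bytes per element.
        return flops / (memory_elements * bytes_per_element)

    h, w, c, k = 100, 100, 1024, 3

    # Pointwise (1x1) convolution: many FLOPs per byte moved.
    pw_flops = 2 * h * w * c * c
    pw_memory = h * w * c + c * c + h * w * c          # input + weights + output
    print("pointwise:", arithmetic_intensity(pw_flops, pw_memory))   # hundreds of FLOP/byte

    # Depthwise 3x3 convolution: similar memory traffic, far fewer FLOPs.
    dw_flops = 2 * h * w * c * k * k
    dw_memory = h * w * c + k * k * c + h * w * c
    print("depthwise:", arithmetic_intensity(dw_flops, dw_memory))   # ~4.5 FLOP/byte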

This shows that the inference time is not only related to the complexity of the layer measured in FLOPs. For example, depth separable convolutions have a much higher theoretical speedup than what is actually achieved when measuring the inference time. More intricate structures give the processor less opportunity to brute-force the computation, so the achieved efficiency is lower. So, when deciding on certain more computationally efficient layers, the effective speedup should be considered instead of the complexity reduction (in FLOPs), as the difference can be very large.

5.9 Channel count should be a multiple of 8

In the convolution benchmarks, the number of channels was always chosen to be a multiple of 8. Figure 5.11 shows the computational performance and inference time of a 3x3 convolutional layer, the same as chart 5.3b, but here the number of channels is incremented one by one. As can be seen in figure 5.11a, there are two clear inference time lines, a higher and a lower one. The lower line corresponds to all the channel counts that are a multiple of 8. The inference time of the upper line is between 2 and 10 times higher; it is clear that this is not the intended behavior. So, when designing a network, the number of channels should always be a multiple of 8 so that it can be executed in a reasonable amount of time on the NCS.
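When generating network architectures automatically, a trivial helper such as the one below (our own) can be used to enforce this constraint.

    def round_channels_up(channels, base=8):
        # Round a channel count up to the nearest multiple of `base` (8 for the NCS).
        return ((channels + base - 1) // base) * base

    assert round_channels_up(100) == 104
    assert round_channels_up(128) == 128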

Figure 5.11: Inference time (a) and computational performance (b) of a 3 × 3 convolution in function of the number of channels. The used image size is 25 × 25. Benchmark on the NCS.

6 Model benchmarks

In this chapter, we measure the inference time of different classification networks on the NCS, following the same benchmark procedure as described in chapter 5. All the benchmarked models are made for classification on ImageNet, so their accuracies are known and can be compared. Measuring the inference time of those models on the NCS allows selecting the models with the best accuracy-inference time trade-off. In the second part of this chapter, we further analyze some specific models and try to explain why they are faster or slower than others on the NCS. The benchmarks of the previous chapter will be used for this.


6.1 Models comparison

Figure 6.1 shows the top-1 accuracy of different models as a function of the FLOPs. It shows that accuracies above 75% come at a huge computational cost. MobileNetV2 achieves 75% accuracy for a computational cost of 1.2 GFLOPs; the next significant increase in accuracy is SE-ResNeXt-50 [21] with 79% and a cost of 8.4 GFLOPs. This is an increase of 4% accuracy for 7× the computational cost. If energy consumption and inference time are important in the target application (which is most likely when using the NCS), we would argue that MobileNetV2 with 75% accuracy is a much better trade-off.

The question remains whether a 7× higher complexity also results in a 7× higher inference time. Figure 6.2 shows the accuracy in function of the inference time on the NCS and CPU. Looking again at MobileNetV2 and SE-ResNeXt-50, MobileNetV2 is 6× faster on the NCS and 8.5× faster on the CPU, so the claim that MobileNetV2 is a better trade-off holds.

Another observation is that the Squeeze-and-Excitation networks [21] (SE-networks) are always a lot slower than the non-SE networks. For example, SE-ResNet-50 is 42% slower than ResNet-50 on the NCS and 46% slower on the CPU, although the difference in FLOPs is only 11%. This shows that some operations are more costly than others.

In the following sections, the smaller models will be analyzed more thoroughly, starting with MobileNet.

Figure 6.1: Top 1 accuracy on ImageNet in function of the number of FLOPs of the model. The exact numbers can be found in appendix A. The FLOPs were calculated from the network architecture generated by the OpenVino model optimizer, so the number of FLOPs can be slightly lower than those reported in the original papers because of the optimizations described in part 4.2.1. For the ResNets the FLOPs are significantly lower because of the ResNet stride optimization.

Figure 6.2: Top 1 accuracy on ImageNet in function of the inference time on NCS (a) and CPU (b). The exact numbers can be found in appendix A.

6.2 MobileNet V1 & V2

Figure 6.3 shows the same accuracy-FLOPs chart as before but only for the smaller networks. As claimed in the MobileNetV2 paper, the chart shows that MobileNetV2 is a no-compromise improvement over MobileNetV1. The inference time on the CPU (chart 6.4b) confirms the theoretical improvement by showing that for a desired accuracy MobileNetV2 is faster. But when looking at the inference time on the NCS (chart 6.4a), the inverse can be seen: for models with accuracies under 68%, MobileNetV1 is faster than V2, while for higher accuracies MobileNetV2 takes the lead again. As MobileNetV2 is slower for a lower complexity, this means that it is executed at a lower computational performance (FLOPS).

Figure 6.3: Top 1 accuracy on ImageNet in function of the number of FLOPs of the model. Zoomed in on MobileNet and ShuffleNet.

Figure 6.4: Top 1 accuracy on ImageNet in function of the inference time on NCS (a) and CPU (b). Zoomed in on MobileNet and ShuffleNet.

6.2.1 Why is MobileNetV2 slower than V1?

To answer this question, two models of equal complexity are selected for comparison:

• MobileNetV1-1.0-224

– Complexity: 1.14 GFLOPs
– Accuracy: 70.9%
– Inference time: 42.7 ms

• MobileNetV2-1.4-224

– Complexity: 1.16 GFLOPs
– Accuracy: 75%
– Inference time: 62.6 ms

Figure 6.5a shows the complexity per layer type for both selected models. The chart shows that the complexity is essentially in the pointwise convolutions and that both models have the same complexity for every layer type. As both models have the same complexity we would expect that they have the same inference time but this is not the case. Figure 6.5b shows the inference time of the two selected models by layer type. For every layer type MobileNetV2 has a higher inference time.

[Figure 6.5: bar charts of Complexity (GFLOP) (a) and Inference time (ms) (b) per layer type (Bias, ReLU6, Conv 1x1, DepthConv) for MobileNetV1-1.0-224 and MobileNetV2-1.4-224]

Figure 6.5: Complexity (a) and inference time (b) by layer type for MobileNetV1 & V2. The models have similar complexity. Results for the NCS.

To explain the higher inference time for the bias and ReLU layers, we can look back at the architectures of MobileNetV1 & V2, tables 3.4 and 3.3. MobileNetV2 has 54 bias and 37 ReLU layers, while MobileNetV1 only has 28 bias and 28 ReLU layers. This difference is not visible in the total model complexity, as the required operations for bias and ReLU are negligible compared to the FLOPs of a convolution. But as seen in the layer benchmarks in section 5.6, although the complexity is negligible, the inference time is not. For low complexities, a bias or ReLU layer can be treated as a roughly constant cost, so since MobileNetV2 has more bias and ReLU layers, its inference time is higher. Another factor working against MobileNetV2 is that the ReLU operation is executed on the expanded layers (see section 3.4.2), which means there are actually 6 times more output channels to process than shown in table 3.3.
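A rough calculation makes this concrete. The sketch below compares the FLOPs of a single pointwise convolution with those of a bias addition or ReLU over the same feature map; the layer dimensions are illustrative, not taken from the actual models.

    def pointwise_conv_flops(h, w, c_in, c_out):
        return 2 * h * w * c_in * c_out   # 1x1 convolution, one MAC counted as 2 FLOPs

    def elementwise_flops(h, w, c):
        return h * w * c                  # one operation per value (bias add or ReLU)

    h = w = 14
    c_in, c_out = 96, 576                 # illustrative expanded layer (expansion factor 6)
    conv = pointwise_conv_flops(h, w, c_in, c_out)
    elem = elementwise_flops(h, w, c_out)
    print(conv / 1e6, "MFLOPs for the convolution")
    print(100 * elem / conv, "% of that for the bias (and the same for ReLU6)")  # ~0.5%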

Chart 6.5b also shows that the time for the pointwise convolutions is longer on MobileNetV2, although the complexity of this layer type is equal in both models (figure 6.5a). There are two reasons why the pointwise convolutions are executed at a lower computational performance. The first is that MobileNetV2 has 34 pointwise convolutions while MobileNetV1 only has 13. This means that the average complexity of a single pointwise layer is smaller in MobileNetV2, and as seen in section 5.3, the higher the complexity of a layer, the higher the achieved computational performance.

The second reason is linked to the expansion and bottleneck mechanism used in MobileNetV2. As MobileNetV2 uses an expansion factor of 6, the expansion convolution has 6 times more output channels than input channels. The bottleneck layer is the inverse, with 6 times fewer output channels than input channels. As shown in section 5.7, using an unbalanced number of input and output channels decreases the arithmetic intensity and, as a consequence, decreases the performance.
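The effect can be estimated with a simple model of the arithmetic intensity of a 1×1 convolution (FLOPs divided by bytes moved). The sketch below assumes fp16 activations and weights and ignores any on-chip reuse or caching; the layer sizes are illustrative.

    def arithmetic_intensity(h, w, c_in, c_out, bytes_per_value=2):  # fp16 assumed
        flops = 2 * h * w * c_in * c_out                              # 1x1 convolution
        data = bytes_per_value * (h * w * c_in + h * w * c_out + c_in * c_out)
        return flops / data                                           # FLOPs per byte moved

    # An expansion convolution (factor 6) vs. a balanced layer with roughly the same FLOPs
    print(arithmetic_intensity(14, 14, 96, 576))   # unbalanced: ~58 FLOPs/byte
    print(arithmetic_intensity(14, 14, 235, 235))  # balanced:   ~73 FLOPs/byte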

6.2.2 Small models are less efficient

Chart 6.6 shows the computational performance of the MobileNet models as a function of the model complexity. By increasing the complexity of the same model type, a higher computational performance is achieved (the complexity is changed by varying the width and resolution multipliers).

One reason is that, as a model gets smaller, the complexity per layer is reduced and, as a consequence, so is the computational performance (see chapter 5).

But this is not the only reason. Figure 6.7 compares the inference time per layer type for a small MobileNetV1 model versus a large one. The larger model runs at 23.4 fps and the smaller model at 113.3 fps. The percentage of time spent on the bias and ReLU is 28.2% for the large model and 39.9% for the smaller model. So, the proportion of time spent on bias and ReLU is larger for smaller models. This is logical because the number of bias and ReLU operations stays the same, and as seen in section 5.6, for a small number of channels the inference time of the bias and ReLU can almost be seen as a constant. Another large part of the inference time for the smaller model is the receive-tensor layer, which is the time needed to load the image onto the NCS. As the model executes faster, this constant load time takes up a larger proportion of the total inference time.

As a result of all this, the convolution operation (which contains more than 90% of all operations) only takes 25% of the inference time. This shows that, if we want to go faster on the NCS, at a certain point the limit will be the overhead time, which is not visible when only looking at the complexity of the model.
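The per-layer breakdowns shown in figures 6.5, 6.7 and 6.8 can be obtained from the Inference Engine performance counters. The sketch below groups them by layer type; the config key and method names follow the 2019-era Python API and may differ in other releases, and the model file names are placeholders.

    from collections import defaultdict
    import numpy as np
    from openvino.inference_engine import IECore, IENetwork

    ie = IECore()
    net = IENetwork(model="mobilenet_v1_0.25_128.xml", weights="mobilenet_v1_0.25_128.bin")
    input_blob = next(iter(net.inputs))
    exec_net = ie.load_network(net, "MYRIAD", config={"PERF_COUNT": "YES"})
    exec_net.infer({input_blob: np.zeros(net.inputs[input_blob].shape, dtype=np.float32)})

    per_type = defaultdict(float)                                   # accumulated time per layer type
    for layer, stats in exec_net.requests[0].get_perf_counts().items():
        per_type[stats["layer_type"]] += stats["real_time"]         # microseconds

    total = sum(per_type.values())
    for layer_type, t in sorted(per_type.items(), key=lambda kv: -kv[1]):
        print(f"{layer_type:15s} {t / 1000:7.2f} ms  {100 * t / total:5.1f} %")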

[Figure 6.6: chart of Operations per second (GFLOPS) versus Complexity (GFLOP) for MobileNetV1 and MobileNetV2]

Figure 6.6: The computational performance of the model (FLOPs per second) as a function of the complexity of the model (FLOPs). Results for the NCS.

[Figure 6.7: pie charts of the inference time per layer type (Conv, ReLU6, Bias, Receive-Tensor, DepthConv, FC) for MobileNetV1-1.0-224 and MobileNetV1-0.25-128]

Figure 6.7: Proportion of the inference time taken by the different layer types in MobileNetV1, comparing a small network (right) with a larger network (left). Inference on the NCS.

6.3 ShuffleNet V1 & V2

Looking back at figure 6.3, it can be seen that, based on model complexity, ShuffleNet V1 & V2 should be at least as performant as MobileNet and sometimes even better. But when looking at the inference time on the NCS (chart 6.4a), the inference time is much longer than for the MobileNets. To explain why ShuffleNet is so much slower, we can look at the proportion of the inference time for every layer type, figure 6.8. A large part of the inference time is taken by the reshape operation. This operation is part of the implementation of the shuffle operation (remember that the shuffle operation is implemented by two reshape and one transposition operation, section 3.4.3). The transposition is noted as "permute" in the pie charts. This means that the shuffle operation takes 39.3% and 49.8% of the inference time on ShuffleNetV1 and V2 respectively. This is clearly the reason behind the high inference times.

On the other hand, on the CPU the reshape operation doesn't seem to be a problem, as the inference time is about the same as MobileNetV2 (chart 6.4b). This suggests that the problem on the NCS lies in an inefficient implementation of the reshape operation.
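To make the cost of the shuffle concrete, the sketch below reproduces the channel shuffle exactly as described in section 3.4.3: two reshapes around one transposition. NumPy is used here purely for illustration.

    import numpy as np

    def channel_shuffle(x, groups):
        # x has shape (N, C, H, W); the shuffle is two reshapes and one transposition
        n, c, h, w = x.shape
        x = x.reshape(n, groups, c // groups, h, w)   # reshape 1: split channels into groups
        x = x.transpose(0, 2, 1, 3, 4)                # transposition: interleave the groups
        return x.reshape(n, c, h, w)                  # reshape 2: flatten back to C channels

    x = np.arange(6).reshape(1, 6, 1, 1)
    print(channel_shuffle(x, groups=3).flatten())     # [0 2 4 1 3 5]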

[Figure 6.8: pie charts of the inference time per layer type (Conv, Reshape, Copy, Bias, Permute, Relu, Sum, DepthConv, Receive-Tensor, FC) for ShuffleNetV2-x1.0 and ShuffleNetV1-x1.0]

Figure 6.8: Proportion of the inference time for the different layer types of ShuffleNet V1 & V2. Inference on the NCS.

7 Conclusion

We show that, by programming the NCS with OpenVINO, low-level optimizations of a neural network are not possible. We are limited to a restricted set of supported layers. This limits the possible optimizations to designing more efficient network architectures. So, when using the NCS for a project, one should not expect to apply any significant low-level optimizations.

The layer benchmarks in chapter 5 allow us to give some guidelines for designing neural network architectures for the NCS:

• Layers with a low complexity almost always achieve a lower computational performance on the NCS, so the inference time will be lower for one large layer than for multiple small layers with the same total complexity.

• The NCS changes its way of computing convolutional layers at a certain channel count; depending on the feature map size, this increases or decreases the computational performance.

• The number of channels always has to be a multiple of 8, otherwise the execution efficiency on the NCS is extremely low; in the worst case it can be 10× slower (a small helper for this is sketched after this list).

• Depthwise separable convolutions achieve a reasonable speedup for higher complexities, while for lower complexities they are actually slower than full convolutions.


• Bias and ReLU operations cannot be neglected; their inference time becomes very significant for smaller models.

• Layers with a low arithmetic intensity (FC layers, depthwise convolutions, ...) are executed at a significantly lower computational performance.

• The reshape and transpose/permute operations should be avoided on the NCS, as those take a significant amount of time to execute.

• The ratio between the number of input and output channels should be kept as close to 1 as possible, as a more unbalanced ratio decreases the computational performance.
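As a small illustration of the channel guideline above, a helper like the following (a sketch, not part of any existing tool) can be used when sizing layers for the NCS:

    def round_channels(channels, multiple=8):
        # Round a channel count up to the nearest multiple of 8 for the NCS.
        return ((channels + multiple - 1) // multiple) * multiple

    print([round_channels(c) for c in (3, 24, 57, 100)])  # [8, 24, 64, 104]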

For the model benchmarks, we showed that MobileNet V1 & V2 are clearly the best models for accuracies lower than 75%. If an accuracy higher than 75% is desired, other models than MobileNet can be used, but those will be a lot slower.

We showed that, while MobileNetV2 should in theory always perform better than MobileNetV1, MobileNetV1 is a better choice than V2 for accuracies lower than 68%. The ShuffleNet models, which should theoretically compete with the MobileNets, are also not interesting on the NCS, as their inference time is slowed down by the reshape and transpose operations.

A conclusion independent of the NCS is that the FLOP count is not a good enough metric to evaluate efficient neural network architectures. The FLOPs only give a rough idea of the performance that should be achieved. So, when comparing models with equivalent FLOPs, the inference time should always be measured on the target hardware.

Bibliography

[1] R.R. Schaller. “Moore’s law: past, present and future”. In: IEEE Spectrum 34.6 (June 1997), pp. 52–59. issn: 0018-9235. doi: 10.1109/6.591665. url: http://ieeexplore.ieee.org/document/591665/ (visited on 05/29/2019).
[2] Olga Russakovsky et al. “ImageNet Large Scale Visual Recognition Challenge”. In: International Journal of Computer Vision 115.3 (Dec. 1, 2015), pp. 211–252. issn: 1573-1405. doi: 10.1007/s11263-015-0816-y. url: https://doi.org/10.1007/s11263-015-0816-y (visited on 05/25/2019).
[3] Andrew G. Howard et al. “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications”. In: (Apr. 17, 2017). url: https://arxiv.org/abs/1704.04861v1 (visited on 03/09/2019).
[4] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. “ImageNet classification with deep convolutional neural networks”. In: Communications of the ACM 60.6 (May 24, 2017), pp. 84–90. issn: 00010782. doi: 10.1145/3065386. url: http://dl.acm.org/citation.cfm?doid=3098997.3065386 (visited on 02/14/2019).
[5] Christian Szegedy et al. “Going deeper with convolutions”. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, MA, USA: IEEE, June 2015, pp. 1–9. isbn: 978-1-4673-6964-0. doi: 10.1109/CVPR.2015.7298594. url: http://ieeexplore.ieee.org/document/7298594/ (visited on 05/29/2019).
[6] Kaiming He et al. “Deep residual learning for image recognition”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016, pp. 770–778.
[7] Weisong Shi et al. “Edge Computing: Vision and Challenges”. In: IEEE Internet of Things Journal 3.5 (Oct. 2016), pp. 637–646. issn: 2327-4662. doi: 10.1109/JIOT.2016.2579198. url: http://ieeexplore.ieee.org/document/7488250/ (visited on 03/01/2019).
[8] Norman P. Jouppi et al. “In-Datacenter Performance Analysis of a Tensor Processing Unit”. In: arXiv:1704.04760 [cs] (Apr. 16, 2017). arXiv: 1704.04760. url: http://arxiv.org/abs/1704.04760 (visited on 05/26/2019).


[9] Edge TPU - Run Inference at the Edge | Edge TPU. Google Cloud. url: https://cloud.google.com/edge-tpu/ (visited on 05/26/2019).
[10] Tesla says its new self-driving chip will help make its cars autonomous. MIT Technology Review. url: https://www.technologyreview.com/f/613403/tesla-says-its-new-self-driving-chip-will-help-make-its-cars-autonomous/ (visited on 05/26/2019).
[11] Nervana Neural Network Processor. Intel AI. url: https://www.intel.ai/nervana-nnp/ (visited on 05/26/2019).
[12] admin. Intel® Movidius™ Neural Compute Stick. url: https://software.intel.com/en-us/movidius-ncs (visited on 05/26/2019).
[13] ajolleyx. Intel® Neural Compute Stick 2. url: https://software.intel.com/en-us/neural-compute-stick (visited on 05/26/2019).
[14] Sharan Chetlur et al. “cuDNN: Efficient Primitives for Deep Learning”. In: arXiv:1410.0759 [cs] (Oct. 3, 2014). arXiv: 1410.0759. url: http://arxiv.org/abs/1410.0759 (visited on 05/26/2019).
[15] Michael Mathieu, Mikael Henaff, and Yann LeCun. “Fast Training of Convolutional Networks through FFTs”. In: arXiv:1312.5851 [cs] (Dec. 20, 2013). arXiv: 1312.5851. url: http://arxiv.org/abs/1312.5851 (visited on 05/26/2019).
[16] Andrew Lavin and Scott Gray. “Fast Algorithms for Convolutional Neural Networks”. In: arXiv:1509.09308 [cs] (Sept. 30, 2015). arXiv: 1509.09308. url: http://arxiv.org/abs/1509.09308 (visited on 05/26/2019).
[17] Philipp Gysel, Mohammad Motamedi, and Soheil Ghiasi. “Hardware-oriented Approximation of Convolutional Neural Networks”. In: arXiv:1604.03168 [cs] (Apr. 11, 2016). arXiv: 1604.03168. url: http://arxiv.org/abs/1604.03168 (visited on 05/27/2019).
[18] Matthieu Courbariaux et al. “Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1”. In: arXiv:1602.02830 [cs] (Feb. 8, 2016). arXiv: 1602.02830. url: http://arxiv.org/abs/1602.02830 (visited on 05/27/2019).
[19] Song Han et al. “Learning both Weights and Connections for Efficient Neural Networks”. In: arXiv:1506.02626 [cs] (June 8, 2015). arXiv: 1506.02626. url: http://arxiv.org/abs/1506.02626 (visited on 05/27/2019).
[20] François Chollet. “Xception: Deep learning with depthwise separable convolutions”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2017, pp. 1251–1258.
[21] Jie Hu et al. “Squeeze-and-Excitation Networks”. In: arXiv:1709.01507 [cs] (Sept. 5, 2017). arXiv: 1709.01507. url: http://arxiv.org/abs/1709.01507 (visited on 03/23/2019).

[22] Xiangyu Zhang et al. “ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices”. In: arXiv:1707.01083v2 [cs] (Dec. 7, 2017). url: http://arxiv.org/abs/1707.01083v2.
[23] Ningning Ma et al. “ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design”. In: arXiv:1807.11164v1 [cs] (July 30, 2018). url: http://arxiv.org/abs/1807.11164v1.
[24] Barret Zoph et al. “Learning Transferable Architectures for Scalable Image Recognition”. In: arXiv:1707.07012 [cs, stat] (July 21, 2017). arXiv: 1707.07012. url: http://arxiv.org/abs/1707.07012 (visited on 03/31/2019).
[25] Zheng Qin et al. “FD-MobileNet: Improved MobileNet with a Fast Downsampling Strategy”. In: arXiv:1802.03750v1 [cs] (Feb. 11, 2018). url: http://arxiv.org/abs/1802.03750v1.
[26] Mark Sandler et al. “MobileNetV2: Inverted Residuals and Linear Bottlenecks”. In: arXiv:1801.04381 [cs] (Jan. 12, 2018). arXiv: 1801.04381. url: http://arxiv.org/abs/1801.04381 (visited on 03/09/2019).
[27] Saining Xie et al. “Aggregated Residual Transformations for Deep Neural Networks”. In: arXiv:1611.05431 [cs] (Nov. 16, 2016). arXiv: 1611.05431. url: http://arxiv.org/abs/1611.05431 (visited on 03/22/2019).
[28] Movidius + Intel = Vision for the Future of Autonomous Devices | Machine Vision Technology | Movidius. url: https://web.archive.org/web/20190522090101/https://www.movidius.com/news/ceo-post-september-2016 (visited on 05/22/2019).
[29] B. Barry et al. “Always-on Vision Processing Unit for Mobile Applications”. In: IEEE Micro 35.2 (Mar. 2015), pp. 56–66. issn: 0272-1732. doi: 10.1109/MM.2015.10.
[30] hotchipsvideos. HC28-S8: Dealing with Big Data. Mar. 15, 2017. url: https://youtu.be/Vk1Wr5hwCpQ?t=1375 (visited on 05/28/2019).
[31] Caffe | Deep Learning Framework. url: https://caffe.berkeleyvision.org/ (visited on 05/22/2019).
[32] TensorFlow. TensorFlow. url: https://www.tensorflow.org/ (visited on 05/22/2019).
[33] Model Optimizer Developer Guide - OpenVINO Toolkit. url: https://docs.openvinotoolkit.org/2019_R1.1/_docs_MO_DG_Deep_Learning_Model_Optimizer_DevGuide.html (visited on 05/29/2019).
[34] Sergey Ioffe and Christian Szegedy. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”. In: arXiv:1502.03167 [cs] (Feb. 10, 2015). arXiv: 1502.03167. url: http://arxiv.org/abs/1502.03167 (visited on 02/03/2019).
[35] Quick Start — Netscope CNN Analyzer. url: https://dgschwend.github.io/netscope/quickstart.html (visited on 05/23/2019).

[36] Deep Learning Deployment Toolkit. Contribute to opencv/dldt development by creating an account on GitHub. original-date: 2018-10-15T10:54:40Z. May 23, 2019. url: https://github.com/opencv/dldt (visited on 05/24/2019).
[37] Optimization Guide - OpenVINO Toolkit. url: https://docs.openvinotoolkit.org/2019_R1.1/_docs_optimization_guide_dldt_optimization_guide.html#myriad (visited on 05/24/2019).
[38] Supported Devices - OpenVINO Toolkit. url: https://docs.openvinotoolkit.org/2019_R1.1/_docs_IE_DG_supported_plugins_Supported_Devices.html#supported_layers (visited on 05/25/2019).
[39] albanie/convnet-burden: Memory consumption and FLOP count estimates for convnets. url: https://github.com/albanie/convnet-burden (visited on 06/06/2019).

Appendices

A Model benchmark results

Table A.1: ShuffleNet V1 and V2 model benchmarks: complexity, top 1 accuracy on ImageNet, inference time on NCS, inference time on CPU. Accuracy from the original papers [23, 22].

Model name          Complexity (MFLOPs)   Top 1 acc   NCS (ms)   CPU (ms)
ShuffleNetV1×1             254.1             67.6       106.3       8.3
ShuffleNetV2×0.5            81.7             60.3        43.7       2.8
ShuffleNetV2×1.0           282.5             69.4       136.4       7.8
ShuffleNetV2×1.5           574.0             72.6       131.3      11.2
ShuffleNetV2×2.0           976.5             74.9       168.3      16.5


Table A.2: MobileNetV2 model benchmarks: complexity, top 1 accuracy on ImageNet, inference time on NCS, inference time on CPU. Accuracy source: https://github.com/tensorflow/models/tree/master/research/slim/nets/mobilenet

Model name              Complexity (MFLOPs)   Top 1 acc   NCS (ms)   CPU (ms)
mobilenet_v2_0.35_128           40.4             50.8       14.3       1.3
mobilenet_v2_0.35_160           61.6             55.7       16.1       1.6
mobilenet_v2_0.35_192           87.6             58.2       18.0       2.1
mobilenet_v2_0.35_224          118.3             60.3       20.9       2.8
mobilenet_v2_0.35_96            23.8             45.5       13.3       1.0
mobilenet_v2_0.5_128            65.1             57.7       16.5       1.8
mobilenet_v2_0.5_160           100.2             61.0       18.8       2.3
mobilenet_v2_0.5_192           143.2             63.9       21.4       3.1
mobilenet_v2_0.5_224           194.0             65.4       25.3       4.0
mobilenet_v2_0.5_96             37.7             51.2       14.7       1.2
mobilenet_v2_0.75_128          138.1             63.2       19.5       3.0
mobilenet_v2_0.75_160          214.3             66.4       23.9       3.8
mobilenet_v2_0.75_192          307.5             68.7       28.7       5.0
mobilenet_v2_0.75_224          417.6             69.8       35.4       7.2
mobilenet_v2_0.75_96            78.8             58.8       16.7       2.2
mobilenet_v2_1.0_128           198.0             65.3       21.8       3.6
mobilenet_v2_1.0_160           307.9             68.8       27.4       6.0
mobilenet_v2_1.0_192           442.2             70.7       33.7       7.0
mobilenet_v2_1.0_224           601.0             71.8       42.0       9.7
mobilenet_v2_1.0_96            112.5             60.3       18.2       2.7
mobilenet_v2_1.3_224          1018.0             74.4       57.8      16.2
mobilenet_v2_1.4_224          1163.6             75.0       62.6      17.1

Table A.3: MobileNetV1 model benchmarks: complexity, top 1 accuracy on ImageNet, inference time on NCS, inference time on CPU. Accuracy source: https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet_v1.md

Model name              Complexity (MFLOPs)   Top 1 acc   NCS (ms)   CPU (ms)
mobilenet_v1_0.25_128           27.1             41.5        8.8       0.6
mobilenet_v1_0.25_160           42.1             45.5        9.5       0.9
mobilenet_v1_0.25_192           60.4             47.7       10.3       1.3
mobilenet_v1_0.25_224           82.1             49.8       11.3       1.9
mobilenet_v1_0.5_128            98.3             56.3       10.4       2.1
mobilenet_v1_0.5_160           153.0             59.1       12.7       2.8
mobilenet_v1_0.5_192           219.9             61.7       15.3       3.4
mobilenet_v1_0.5_224           299.0             63.3       18.7       4.3
mobilenet_v1_0.75_128          213.5             62.1       14.0       3.3
mobilenet_v1_0.75_160          332.8             65.3       18.0       5.1
mobilenet_v1_0.75_192          478.5             67.2       22.7       6.2
mobilenet_v1_0.75_224          650.8             68.4       28.7      10.0
mobilenet_v1_1.0_128           372.8             65.2       19.0       6.0
mobilenet_v1_1.0_160           581.4             68         25.6       8.0
mobilenet_v1_1.0_192           836.2             70         33.2      10.8
mobilenet_v1_1.0_224          1137.5             70.9       42.7      15.4

Table A.4: Benchmarks of the larger models: complexity, top 1 accuracy on ImageNet, inference time on NCS, inference time on CPU, and accuracy source.

Model name        Complexity (GFLOPs)   Top 1 acc   NCS (ms)   CPU (ms)   Source
AlexNet                  1.45              58.2        77.9       34.3     [39]
Squeezenet 1.0           1.72              58.1        65.6       20.9     [39]
Squeezenet 1.1           0.78              58.2        30.4        9.6     [39]
VGG-16                  31.0               73         711.9      347.5     [21]
Inception V2             4.0               74.6       132.5       47.3     [21]
ResNet-50                7.0               75.3       189.8       80.4     [21]
ResNet-101              14.4               76.4       365.8      169.2     [21]
ResNet-152              21.8               77         552.6      244.3     [21]
SE-Inception V4          4.1               75.8       150.3       53.7     [21]
SE-ResNet-50             7.7               77.6       270.3      117.6     [21]
SE-ResNet-101           15.2               78.3       479.7      240.2     [21]
SE-ResNet-152           22.6               78.7       710.6      317.9     [21]
SE-ResNeXt-50            8.5               79.0       317.1      145.5     [21]
SE-ResNeXt-101          16.0               80.2       641.1      260.2     [21]