Optimizing deep neural networks for deployment on the Movidius stick

Emiel Deprost Student number: 01503250

Supervisors: Prof. dr. ir. Pieter Simoens, Prof. dr. ir. Bart Dhoedt Counsellor: ing. Sam Leroux

Master's dissertation submitted in order to obtain the academic degree of Master of Science in de industriële wetenschappen: elektronica-ICT

Academic year 2018-2019

Preface

As a final-year student I wanted to expand my knowledge of deep learning. Machine learning already came up during my studies and sparked my interest. Thanks to my master's dissertation I was able to dive deeper into this subject, and it has been a very instructive period.

At the end of this four-year programme, it is also the ideal moment to thank everyone. My thanks go first of all to Sam Leroux for his valuable guidance during my master's dissertation. He supported me from start to finish with numerous pieces of advice and with feedback on the obtained results.

I would also like to thank prof. dr. ir. Pieter Simoens and prof. dr. ir. Bart Dhoedt for the thesis subject I was offered and for the opportunity to carry out this master's dissertation.

I also want to express my gratitude to the lecturers who have taught us over the past four years.

Finally, I want to thank my parents, who gave me the chance to follow this fascinating programme and were always ready to help me.

Emiel Deprost, May 2019

Permission for use

”The author(s) gives (give) permission to make this master dissertation available for consultation and to copy parts of this master dissertation for personal use. In all cases of other use, the copyright terms have to be respected, in particular with regard to the obligation to state explicitly the source when quoting results from this master dissertation.”

Emiel Deprost, May 2019

Optimizing deep neural networks for deployment on the Movidius stick Emiel Deprost Supervisor(s): Pieter Simoens, Bart Dhoedt, Sam Leroux

Abstract— Deep neural networks are powerful models that allow tackling complex problems. Running those networks requires a lot of computational performance, which poses a problem on battery-powered and computationally constrained devices. A possible solution is to make specific hardware that allows running those networks more efficiently. The Movidius stick is such specific hardware, made by Intel. In this work, we explore the capabilities of the Movidius stick. We look at image classification networks and evaluate which works best on the Movidius stick. We also give guidelines to design efficient neural networks for deployment on the Movidius stick.

Keywords— Movidius stick, neural compute stick, benchmarks, optimization

I. Introduction

Compared to deploying neural networks in the cloud, deploying them on the edge has clear advantages:
• Less latency.
• Lower energy consumption, as there is no energy cost to send the data to the cloud.
• No third party to trust, because all the data can be processed locally.

In recent years, a lot of effort has been made to make neural networks more efficient. Progress has also been made in hardware: as the field of deep learning applications grows, more effort is put into application-specific hardware. In this work we take a closer look at the Movidius stick, which is specific hardware for neural network inference. The main objectives of this work are the following:
• Looking at the capabilities of the Movidius stick, i.e., which kinds of deep neural networks can be executed on the device.
• Benchmarking the performance of the Movidius stick for different deep neural network architectures.
• Looking at the possible software optimizations that would allow running networks more efficiently on the device.

II. The Movidius stick

The Movidius stick is a neural network accelerator made by Intel. It needs to be operated by a host device that sends the input and receives the output of the neural network. It communicates via USB and has the form factor of a large USB stick. This allows the Movidius stick to be easily added to any mini-computer board, such as a Raspberry Pi, allowing faster neural network inference.

A. Movidius architecture

The Movidius stick is based on the Myriad 2 chip. This chip was originally made by Movidius, which was later bought by Intel. They call the Myriad 2 a Vision Processing Unit (VPU): a processor specially made for vision processing tasks. The architecture of the Myriad 2 is described in more detail in figure 1.

[Fig. 1. Myriad 2 VPU block diagram. Borrowed from [1].]

Before deploying a neural network onto the Movidius stick, we need to convert an existing model (a TensorFlow or Caffe model, or a model from any other supported framework) to an intermediate representation (IR). The IR is the model format used by the Movidius stick. This IR is then loaded into the memory of the Movidius stick. The last step is to send the input data to the Movidius stick, which returns the inference results. Note that the Movidius stick is only suitable for inference and not for the training of networks.

The Movidius stick supports most deep learning layers, such as fully connected, convolutional and pooling layers. It does not support networks with memory, so RNNs and LSTMs are not supported.

The Movidius stick only supports 16-bit floating-point precision. Hence, the inference time cannot be accelerated by reducing the precision of the weights and activations.

III. Layer benchmarks

In deep learning, different layers are used as building blocks to build a deep neural network. The most commonly used layers were benchmarked on the Movidius stick for different complexities. The results of those benchmarks allow us to make some interesting observations, which are described here. The benchmarked layers are:
• Fully connected layer
• Convolutional layer
• Depth separable convolution
• ReLU and bias layer

A. ReLU & bias

A ReLU and a bias layer were benchmarked; the inference time of both layers can be measured as they are executed as separate layers on the Movidius stick. Figure 2 shows the inference time of the ReLU and bias layer. The figure shows that the inference time of both layers is almost identical. Below 40k operations, both layers have an almost constant inference time of about 65 µs. This means that even for very small layers with a low number of ReLU and bias operations the minimum inference time will be 130 µs. This inference time is very significant compared to the inference time of a small layer. For example, for a 50x50 depthwise convolution with 128 channels the inference time is 850 µs, and the bias and ReLU take 13% of the inference time.

[Fig. 2. Inference time of a ReLU and bias layer on the Movidius stick as a function of the number of operations. ReLU and bias are each counted as one operation.]

B. Depth separable convolutions

Depth separable convolutions are used to make networks more efficient; an example network that uses them is MobileNet [2]. The idea is that a network with depth separable convolutions achieves almost the same accuracy as a network with full convolutions but with a lower number of parameters and operations. A depth separable convolution splits a full convolution into two parts, a depthwise and a pointwise convolution, and by this achieves a lower number of parameters and operations.

Both depth separable convolutions and full convolutions were benchmarked on the Movidius stick. They were benchmarked for different image sizes and channel numbers; the numbers of input and output channels of the layers are set equal. The speedup that a depth separable convolution offers compared to a full convolution is shown in figure 3. The theoretical speedup in the figure is calculated as the ratio of the number of operations of the full convolution to that of the depth separable convolution. The highest theoretical speedup is 9×. The figure has been divided into 4 zones where different behavior is observed.

[Fig. 3. Speedup of depth separable convolutions compared to the equivalent full convolution. The number of input channels is kept equal to the number of output channels; the used kernel size is 3x3.]

In the first zone the speedup is lower than one, so it is not interesting to use a depth separable convolution. This zone is only present for layers with a very low complexity, as the number of channels is low and the image size is also small.

The second zone starts to be interesting, as here there is an actual speedup, although the speedup is not very large and far lower than the theoretical speedup.

Going to zone 3, something special happens: there is a large increase in speedup for large image sizes and a reduction in speedup for smaller image sizes. What happens here is that the Movidius stick changes its way of computing full convolutions; the inference time becomes smaller for small image sizes and larger for large image sizes.

Going to zone 4, the same computing change happens for the depth separable convolutions, so the speedup is normal again. In zone 4 the achieved speedup is almost as high as the theoretical speedup.

We showed in this part that depth separable convolutions offer a speedup on the Movidius stick, which means those convolutions can be used in networks to reduce the inference time on the Movidius stick.

C. Arithmetic intensity

For every benchmarked layer the maximum achieved performance was measured; table I shows the results. It is clearly visible that the achieved performance is a lot higher for 1x1 and 3x3 convolutions than for FC and depthwise convolutions. This lower computational performance can be linked to the arithmetic intensity of the layer. The arithmetic intensity is the number of FLOPs per byte of memory access needed. This metric shows how memory intensive a layer is; a lower number means more memory intensive. Note that the arithmetic intensity is calculated, not measured.

TABLE I
Computational performance of different layers on the Movidius stick.

Layer type      10x10 (GFLOPS)   25x25 (GFLOPS)   100x100 (GFLOPS)
FC              3.22             3.22             3.22
Depthconv 3x3   4.4              6.0              7.6
Conv 1x1        48.0             55.8             61.3
Conv 3x3        57.5             60.9             63.1

Table II shows the arithmetic intensity of the different layers. The arithmetic intensity of the FC layers and depthwise convolutions is a lot lower than that of the 1x1 and 3x3 convolutions. This means the FC and depthwise convolution layers are a lot more memory intensive, which can explain why those layers achieve a lower computational performance.

TABLE II
Maximum arithmetic intensity of different layers on the Movidius stick. FP16 is used, so 1 memory access is 2 bytes.

Layer type      10x10 (FLOP/byte)   25x25 (FLOP/byte)   100x100 (FLOP/byte)
FC              1                   1                   1
Depthconv 3x3   4.1                 4.2                 4.2
Conv 1x1        100                 625                 10 000
Conv 3x3        100                 625                 10 000

Here we showed that the arithmetic intensity is an important metric that can limit the maximum performance on the Movidius stick if it is too low.

D. Convolution channel ratio

As seen in the previous point, the arithmetic intensity is an important factor to consider. One interesting observation made in the ShuffleNetV2 paper [3] is that using an equal number of input and output channels maximizes the arithmetic intensity. A large difference between the number of input and output channels decreases the arithmetic intensity. As a consequence, the achieved computational performance is higher for equal numbers of input and output channels. This was tested with channel ratios from 1 to 16 on the Movidius stick; the results are in table III.

TABLE III
Computational performance for a 1x1 convolution and a 50x50 image size. The complexity of all the layers is the same.

c1:c2   (c1, c2)    GFLOP/s   Diff
1:1     128, 128    54.2      reference
1:4     64, 256     52.6      -3%
1:8     32, 512     49.2      -9.2%
1:16    16, 1024    36.9      -31.9%

The table shows that for a ratio of 4 the performance reduction is only 3%, but for higher ratios the performance decrease can be very significant.

E. Channel count should be a multiple of 8

We noted a high inference time when using convolutional layers whose numbers of input and output channels are not a multiple of 8. When changing the channel number to the closest multiple of 8, the measured inference time was between 2 and 8× lower. This means that the number of channels should always be chosen as a multiple of 8 to achieve a reasonable inference time on the Movidius stick.

IV. Model benchmarks

The inference time of different models, made for ImageNet [4], was measured on the Movidius stick. Figure 4 shows the top-1 accuracy on ImageNet as a function of the inference time on the Movidius stick. The chart shows that for an accuracy higher than 75% the inference time becomes very large: the first improvement in accuracy already costs double the inference time. If inference time and energy consumption are important in the target application (which is most likely if the Movidius stick is used), we would argue that MobileNetV2 with 75% accuracy is a far better trade-off.

[Fig. 4. Accuracy of the models on ImageNet as a function of the inference time on the Movidius stick. The benchmarked models include MobileNetV1/V2, ShuffleNetV1/V2, AlexNet, SqueezeNet 1.0/1.1, VGG-16, Inception-v2, ResNet-50/101/152, SE-Inception-v2, SE-ResNet-50/101/152 and SE-ResNeXt-50/101.]

In chart 5 we only display the efficient models, which are designed to have few operations but retain a high accuracy. As shown in the chart, for accuracies lower than 68% MobileNetV1 is the better model, while for accuracies higher than 68% MobileNetV2 [5] is better. The ShuffleNet V1 [6] & V2 models are a lot worse than the MobileNets.

[Fig. 5. Accuracy of the efficient models on ImageNet as a function of the inference time on the Movidius stick.]

Chart 6 shows the top-1 accuracy as a function of the complexity (FLOPs) of the model. The complexity of a model should be related to the inference time, so charts 5 and 6 should be similar, but comparing both charts makes it clear that there are significant differences.

[Fig. 6. Accuracy of the efficient models on ImageNet as a function of the required number of FLOPs.]

The complexity of MobileNetV2 is always lower than that of MobileNetV1 for the same accuracy, but as seen before, V1 is faster for accuracies lower than 68%. Chart 6 also shows that ShuffleNet should at least be very competitive with MobileNetV2, but from the inference time we observe that the ShuffleNets are a lot worse than the MobileNets.

This shows that the complexity of a model is not representative of the inference time. To compare the efficiencies of different models, the inference time should always be tested on the target hardware.

A. ShuffleNet models

The previous section showed that the ShuffleNets performed a lot worse than what would be expected; here we explain why. Figure 7 shows the proportion of the inference time for every layer of ShuffleNet V1 & V2. It shows that a very large portion of the inference time is taken by the ”Reshape” operation. This operation is used in the channel shuffle layer. This layer mixes the order of the channels in the network and is implemented with two ”Reshape” and one ”Permute” operation. Although this operation theoretically takes no FLOPs, we can see that it takes a lot of time to execute on the Movidius stick.

[Fig. 7. Proportion of the inference time for the different layer types in ShuffleNet V1 and V2.]

B. Small models are less efficient

Figure 8 shows that MobileNetV1 has a higher computational performance than MobileNetV2. This is the reason that MobileNetV2 is less interesting than MobileNetV1: even if its number of operations is lower for the same achieved accuracy, it is executed at a lower computational performance and is therefore still slower.

[Fig. 8. Computational performance of the model as a function of the model complexity.]

Both model types indicate that larger models achieve a higher computational performance. To explain this we look at the inference time per layer of a small and a large MobileNetV1 model, figure 9. For the smaller model the ReLU, bias and Receive-Tensor layers take a larger proportion of the inference time.

For the bias and ReLU layers, we have seen that for a small number of operations the inference time is almost constant. As the small and large models have an equal number of ReLU and bias layers, the proportion of time taken by those layers will be larger for the smaller model.

A similar conclusion can be made for the Receive-Tensor layer. This is the time needed to send the image to the Movidius stick, which is almost constant. So for a smaller model it will take a larger portion of the inference time.

Figure 9 also shows that even for the larger MobileNet model the bias and ReLU layers take 28% of the inference time. This shows that the inference time of those layers is absolutely non-negligible.

[Fig. 9. Distribution of the inference time for a small (right) and large (left) MobileNetV1 model.]

V. Conclusion

Some guidelines to design efficient neural networks for the Movidius stick were given above; they are summarized here:
• The number of channels always has to be a multiple of 8.
• Depth separable convolutions achieve a reasonable speedup for higher complexities.
• Bias and ReLU operations cannot be neglected, especially for smaller layers/models.
• Layers with a low arithmetic intensity are executed at a lower computational performance.
• The input and output channel ratio should be as close as possible to 1.
• The reshape and permute operations should be avoided on the Movidius stick.

For the model benchmarks, we showed that MobileNet V1 & V2 are clearly the best models for an accuracy lower than 75%. If an accuracy higher than 75% is desired, other models than MobileNet can be used, but those will be a lot slower. We showed that theoretically MobileNetV2 should have a better performance than MobileNetV1, but MobileNetV1 is a better choice for accuracies lower than 68%. The ShuffleNet models should theoretically compete with the MobileNets, but they are not interesting on the Movidius stick as their inference time is slowed down by the reshape and transpose operations.

A conclusion independent of the Movidius stick is that FLOPs is not a good enough metric to evaluate efficient neural network architectures. The FLOPs only give a rough idea of the performance that should be achieved. So, to compare models with equivalent FLOPs the inference time should always be measured on the target hardware.

References

[1] B. Barry, C. Brick, F. Connor, D. Donohoe, D. Moloney, R. Richmond, M. O'Riordan, and V. Toma, "Always-on vision processing unit for mobile applications," vol. 35, no. 2, pp. 56–66.
[2] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications."
[3] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun, "ShuffleNet V2: Practical guidelines for efficient CNN architecture design."
[4] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei, "ImageNet large scale visual recognition challenge," vol. 115, no. 3, pp. 211–252.
[5] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen, "MobileNetV2: Inverted residuals and linear bottlenecks."
[6] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun, "ShuffleNet: An extremely efficient convolutional neural network for mobile devices."

Contents

1 Introduction 22

1.1 Outline ...... 24

2 Deep neural networks 25

2.1 Image classification problem ...... 25

2.2 Neural network layers ...... 26

2.2.1 Fully connected layer ...... 26

2.2.2 Convolutional layer ...... 27

2.2.3 Depth separable convolutions ...... 28

2.3 Common architectures ...... 30

2.3.1 AlexNet ...... 30

2.3.2 VGG ...... 30

2.3.3 GoogleNet inception V1 ...... 31

2.3.4 ResNet ...... 32

3 Efficient neural networks 33

3.1 Hardware ...... 34

3.2 Different execution algorithms ...... 34

3.2.1 Standardise to GeMM ...... 35


3.2.2 FFT convolution ...... 35

3.2.3 Winograd convolutions ...... 36

3.3 Network modifications ...... 36

3.4 Efficient network architectures ...... 36

3.4.1 Mobilenet V1 ...... 37

3.4.2 Mobilenet V2 ...... 38

3.4.3 Shufflenet V1 ...... 40

3.4.4 Shufflenet V2 ...... 43

4 Movidius stick 45

4.1 Movidius stick architecture ...... 45

4.2 OpenVino toolkit ...... 46

4.2.1 Model optimizer ...... 47

4.2.2 Inference engine ...... 49

4.3 NCS capabilities ...... 50

5 Layer’s benchmarks 52

5.1 Fully connected layer benchmark ...... 54

5.1.1 NCS results ...... 55

5.1.2 CPU results ...... 55

5.2 Convolutional layer benchmark ...... 56

5.2.1 NCS results ...... 57

5.2.2 CPU results ...... 57

5.3 Pointwise convolution benchmark ...... 58

5.3.1 NCS results ...... 59

5.3.2 CPU results ...... 59

5.4 Depthwise convolution benchmark ...... 60

5.4.1 NCS results ...... 61

5.4.2 CPU results ...... 61

5.5 Speedup of depth separable convolutions ...... 62

5.6 ReLU & bias ...... 64

5.7 Channel ratio ...... 65

5.8 Comparison of efficiencies ...... 65

5.9 Channel count should be a multiple of 8 ...... 67

6 Model benchmarks 68

6.1 Models comparison ...... 69

6.2 MobileNet V1 & V2 ...... 71

6.2.1 Why MobileNetV2 is slower than V1? ...... 73

6.2.2 Small models are less efficient ...... 74

6.3 ShuffleNet V1 & V2 ...... 76

7 Conclusion 77

Appendices 83

A Model benchmark results 84

Abbreviations

API Application Programming Interface. 47, 51

CNN Convolutional Neural Network. 24, 28, 50

CNNs Convolutional Neural Networks. 28

CPU Central Processing Unit. 15, 17, 18, 32, 33, 41, 44, 47, 51–59, 67, 69, 74, 82–85

FC Fully Connected. 24, 33, 50, 53, 62–64, 76

FFT Fast Fourier Transform. 33, 34

FLOP FLoating point OPeration. 16, 24, 28–30, 60, 63–65, 67–71, 73

FLOPS FLoating point OPerations per Second. 53, 55, 57, 61, 62, 64

FLOPs FLoating point OPerations. 24, 31, 35, 37, 39, 41, 50, 53, 63, 64, 72

FPGA Field-Programmable Gate Array. 44, 47

GeMM General Matrix Multiply. 33

GPU Graphics Processing Unit. 32–34, 41, 44, 47, 48

IR Intermediate Representation. 44, 47, 51

LPDDR Low-Power Double Data Rate. 48

LSTM Long Short-Term Memory. 48

MAC Memory Access Cost. 24, 26, 27, 41, 63, 64

NCS Neural Compute Stick. 15–18, 44, 47–49, 51–60, 62–67, 69, 70, 73–76, 82–85

OPS OPeration per Second. 62


RAM Random Access Memory. 24, 47, 48

ReLU Rectified Linear Unit. 16, 36, 39, 41, 50, 62, 63, 72, 73, 76

RNN Recurrent Neural Network. 48

SDK Software Development Kit. 47

SIMD Single Instruction, Multiple Data. 32

TPU Tensor Processing Unit. 32

VLIW Very Long Instruction Word. 43

VPU Vision Processing Unit. 15, 43, 44, 47, 48

List of Figures

2.1 Schematic representation of a convolutional layer...... 27

2.2 Schematic representation of a depth separable convolution...... 28

2.3 Architecture of AlexNet. Borrowed from [4] ...... 30

2.4 Inception module used in GoogleNet. Borrowed from [5]...... 31

2.5 GoogleNet architecture. Borrowed from [5]...... 32

2.6 ResNet residual block. Borrowed from [6]...... 32

3.1 Convolution converted to a matrix multiplication with Im2Col operation. Borrowed from [14] ...... 35

3.2 Inverted residual with linear bottleneck block. The diagonally hatched structures do not contain non-linearities. Figure borrowed from [26] with some corrections: the Relu6 operation was shown on the bottleneck convolution instead of on the expansion convolution...... 38

3.3 (a) Two stacked group convolutions, channels between the different groups are not shared. (b) Two stacked group convolutions where the channels are shuffled so that every output group fully relates with all the input groups. Borrowed from [22]...... 41

3.4 (a) Basic unit for the shuffleNetV1 network. (b) Basic unit with stride 2. Borrowed from [22]...... 42

3.5 (a) Basic unit for the shuffleNetV2 network. (b) Basic unit with stride 2. Borrowed from [23]...... 44


4.1 Myriad2 VPU block diagram. Borrowed from [29] ...... 46

4.2 Deployment workflow to program the NCS. Image borrowed from the OpenVino documentation[33] ...... 47

4.3 Part of the architecture of ResNet-50. The network before optimization (a). The network after the model optimizer stride optimization (b). Images made with NetScope [35] ...... 48

5.1 Inference time (a) and computational performance (b) of a fully connected layer in function of the number of input and output neurons. Benchmark on NCS. .. 54

5.2 Inference time (a) and computational performance (b) of a fully connected layer in function of the number of input and output neurons. Benchmark on CPU. .. 54

5.3 Inference time (a) and computational performance (b) of a convolutional layer in function of the number of channels. The input and output channels are set equal and the used kernel size is 3x3. The different series are inferences for a different image size. Benchmark on NCS...... 56

5.4 Inference time (a) and computational performance (b) of a convolutional layer in function of the number of channels. The input and output channels are set equal and the used kernel size is 3x3. The different series are inferences for a different image size. Benchmark on CPU...... 56

5.5 Inference time (a) and computational performance (b) of a pointwise (1 × 1) convolution in function of the number of channels. The input and output channels are set equal. The different lines are inferences on a different image size. Benchmark on NCS...... 58

5.6 Inference time (a) and computational performance (b) of a pointwise (1 × 1) convolution in function of the number of channels. The input and output channels are set equal. The different lines are inferences on a different image size. Benchmark on CPU ...... 58

5.7 Inference time (a) and computational performance (b) of a depthwise convolution in function of the number of channels. The used kernel size is 3 × 3. The different lines are inferences on a different image size. Benchmark on NCS ... 60

5.8 Inference time (a) and computational performance (b) of a depthwise convolution in function of the number of channels. The used kernel size is 3 × 3. The different lines are inferences on a different image size. Benchmark on CPU ... 60

5.9 Speedup of a depth separable convolution with respect to a full convolution. Used kernel size is 3 × 3. Benchmarked on the NCS...... 62

5.10 Inference time (a) and computational performance (b) of a bias and ReLU in function of the number of executed operations. Bias and ReLU are considered as one operation each. Benchmark on NCS ...... 64

5.11 Inference time (a) and computational performance (b) of a 3 × 3 convolution in function of the number of channels. The used image size is 25 × 25. Benchmark on the NCS ...... 67

6.1 Top 1 accuracy on ImageNet in function of the number of FLOPs of the model. The exact numbers can be found in appendix A. The FLOPs were calculated from the network architecture generated by the OpenVino model optimizer. So, the number of FLOPs can be slightly lower than those reported in the original papers because of the optimizations, described in part 4.2.1. For the ResNets the FLOPs are significantly lower because of the ResNet stride optimization...... 70

6.2 Top 1 accuracy on Imagenet in function of the inference time on NCS (a) and CPU(b). The exact numbers can be found in appendix A...... 70

6.3 Top 1 accuracy on ImageNet in function of the number of FLOPs of the model. Zoomed in on MobileNet and ShuffleNet...... 72

6.4 Top 1 accuracy on Imagenet in function of the inference time on NCS (a) and CPU (b). Zoomed in on MobileNet and ShuffleNet...... 72

6.5 Complexity (a) and inference time (b) by layer type for MobileNetV1 & V2. The models have similar complexity. Results for the NCS ...... 73

6.6 The computational performance of the model (FLOPs) in function of the complexity of the model (FLOPs). Results for the NCS...... 75

6.7 Proportion of the inference time taken by the different layer types in MobileNetV1, comparing a small network (right) with a larger network (left). Inference on NCS. 75

6.8 Proportion of the inference time for the different layer types of ShuffleNet V1 & V2. Inference on NCS...... 76

List of Tables

3.1 Architecture of MobilenetV1-1.0-224. S is the stride, c the number of output channels, n the number of times the layer is repeated...... 37

3.2 Bottleneck residual block transforming from Cin to Cout channels, with stride s and expansion factor t. Table borrowed from [26] ...... 39

3.3 Architecture of MobilenetV2-1.0-224. S is the stride, c the number of output channels, n the number of times the layer is repeated and t the expansion factor. 40

3.4 Architecture of ShuffleNetV1. S is the stride, c the number of output channels and n the number of times the layer is repeated. The number of groups used for the group convolutions is 3...... 42

3.5 Architecture of ShuffleNetV2 1×. S is the stride, c the number of output channels and n the number of times the layer is repeated. The number of groups used for the group convolutions is 3...... 44

5.1 Conv 1x1, image size 100x100 ...... 65

5.2 Conv 1x1, image size 50x50 ...... 65

5.3 Computational performance of different layers on the NCS...... 66

5.4 Maximum arithmetic intensity of different layers on the NCS. FP16 is used so 1 memory access is 2 bytes...... 66

A.1 ShuffleNet V1 and V2 model benchmarks: complexity, top 1 accuracy on ImageNet, inference time on NCS, inference time on CPU. Accuracy from original papers [23, 22] ...... 84


A.2 MobileNetV2 model benchmarks: complexity, top 1 accuracy on ImageNet, inference time on NCS, inference time on CPU. Accuracy source https://github.com/tensorflow/models/tree/master/research/slim/nets/mobilenet .... 85

A.3 MobileNetV1 model benchmarks: complexity, top 1 accuracy on ImageNet, inference time on NCS, inference time on CPU. Accuracy source https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet_v1.md 86

A.4 Table with the complexity of the model, top 1 accuracy on ImageNet, inference time on NCS, inference time on CPU and accuracy source...... 87

1 Introduction

Until the introduction of the concept of artificial intelligence (AI), the processing of information by computers was completely algorithmic: a deterministic execution of a written program following established rules. These rules were developed to solve problems in a restricted and well-defined application domain.

Although some scientists opened up the first lines of research in the field of artificial intelligence at the beginning of the 1960s, it was from the 1980s onwards that significant research began. The first basic architectures of artificial neural networks were developed, as well as the training methods for them. In the 1990s, multi-layer networks grew in importance, trying to adapt biomimetic architectures, and the methods of learning by backpropagation were refined.

Nevertheless, the state of technology at the time did not yet allow tackling complex applications. The computing power and memory sizes were several tens of thousands of times smaller than today. Time has worked in our favor: the effect of Moore's Law [1] has allowed us to reach much greater processing capacities today. It was necessary to wait for the arrival of cheaper and more powerful processors to allow AI to find its place in various applications.

Today, we can build networks with complex architectures, consisting of many layers. The arrival of GPUs on the market has brought massive parallel processing capabilities. Storage capacities have grown enormously, making it possible to store the reference information useful for the learning processes of the networks.

All this brings us to deep neural networks. Those are networks that can have thousands of millions of connections; one such type of network is a Convolutional Neural Network (CNN). CNNs are extensively used in computer vision related tasks like image classification, object localization, and image segmentation. However, even for today's hardware they still require a lot of computational performance to be run. This poses a problem for deploying such networks on devices with limited processing capabilities and constrained energy consumption. Currently, those networks are mostly trained and deployed in the cloud, where a lot of resources are available.

The main benefits of deploying deep neural networks locally instead of in the cloud are the following:

• Less latency, as there is no data transfer time.

• Lower energy consumption, as sending data to the cloud also costs a lot of energy.

• No third party to trust, because all the data can be processed locally.

To deploy networks locally there are two requirements: the hardware needs to be capable of running the network at a reasonable speed, and the energy consumption has to be low enough for battery-constrained devices. To make this possible there are essentially two elements that can be worked on:

• Improving the hardware, so that it allows running networks faster and at a lower energy cost.

• Improving the software, by making more efficient neural networks that require less com- putation.

In recent years more specific hardware has been developed to run neural networks, which allows running neural networks faster and more efficiently. One of those is the Movidius stick, which is the object of this work. The Movidius stick is made to accelerate the execution of neural networks for low power devices. It comes in a USB stick form factor which allows it to be added easily to a mini computer board like a Raspberry Pi. The main goals of this work are the following:

• Looking at the capabilities of the Movidius stick, i.e., which kind of deep neural network can be executed on the device.

• Benchmarking the performance of the Movidius stick for different deep neural network architectures.

• Looking at the possible software optimizations that would allow running networks more efficiently on the device.

1.1 Outline

In chapter 2, the basic concepts of deep neural networks are introduced and commonly used models are described. Chapter 3 gives an overview of methods to make neural networks more efficient and also describes some efficient models. Chapter 4 explains how to work with the Movidius stick. In chapter 5, we analyze the benchmarks of base layers used in deep learning. In chapter 6, the benchmarks of deep neural network models are analyzed.

2 Deep neural networks

A deep neural network is an artificial neural network that consists of many layers with a lot of connections. Every connection in the network has a weight, and by changing those weights the input of the network can be mapped to the desired output. Learning the weights of the network is done by showing the network a lot of examples of what we want to accomplish. The large number of connections allows deep neural networks to be very good at processing unstructured data like images, speech or plain text.

2.1 Image classification problem

The problem of image classification consists of predicting the correct label of an image. The most well-known challenge in this field is the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [2]. It provides a large database of images, each labeled with one of 1000 different image classes. The ImageNet challenge has become the standard for evaluating the image classification performance of a network.

The problem of image classification is interesting as other computer vision problems, such as object detection and image segmentation, can be solved by reusing techniques used for image classification.


2.2 Neural network layers

Building a deep neural network architecture isn't done from scratch every time; basic layers are reused. By stacking multiple layers one after another, deep neural networks can be built. In this section, some layers used for CNNs are described.

2.2.1 Fully connected layer

In a Fully Connected (FC) layer all the input neurons are connected to all the output neurons; this layer is also called a dense layer. Let Nin and Nout denote the number of input and output neurons. Then the number of FLoating point OPerations (FLOPs) needed to compute a fully connected layer is given by the following equation:

$$\text{FLOPs} = N_{in} \cdot N_{out} + (N_{in} - 1) \cdot N_{out} \tag{2.1}$$

Here the definition is used where multiplications and additions each count as one FLoating point OPeration (FLOP). The first part of the equation is the number of multiply operations, while the second part is the number of add operations. As most of the time Nin >> 1, the equation can be simplified to:

$$\text{FLOPs} \approx 2 \cdot N_{in} \cdot N_{out} \tag{2.2}$$

Another important metric for a layer is the Memory Access Cost (MAC). The definition used here for the MAC is the minimum amount of accesses required, supposing that all the data needed to compute the layer is in RAM and the results have to be stored back in RAM. So, the MAC is the sum of all the memory loads and stores needed to compute the layer. This, of course, is a simplification of the real world, but the goal is to have a metric allowing to quantify how memory intensive a certain layer is. For a FC layer the MAC is given by equation 2.3:

$$\text{MAC} = \underbrace{N_{in}}_{\text{Loading inputs}} + \underbrace{N_{out}}_{\text{Writing outputs}} + \underbrace{N_{in} \cdot N_{out}}_{\text{Loading weights}} \tag{2.3}$$

The first and second parts correspond respectively to the loading of the inputs and the writing of the outputs. The last part represents the loading of the weights.

When working with images, using a fully connected layer is very costly. A digital image is encoded as an array of pixels; every pixel consists of 3 brightness values (Red, Green, and Blue). These values create the color of the pixel. Every brightness value can be seen as an input feature to the neural network. This means that an image of 200 × 200 pixels has 200 ∗ 200 ∗ 3 = 120000 features. If the first layer of the network were a fully connected layer with 1000 output neurons, the number of parameters and FLOPs would be very large (120 million parameters and about 240 MFLOPs).
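To make the arithmetic above concrete, the following small Python sketch evaluates approximations 2.2 and 2.3 for this example; the function names are only illustrative helpers, not part of any benchmark code.

```python
def fc_flops(n_in, n_out):
    """Approximate FLOPs of a fully connected layer (equation 2.2)."""
    return 2 * n_in * n_out

def fc_mac(n_in, n_out):
    """Memory access cost of a fully connected layer (equation 2.3)."""
    return n_in + n_out + n_in * n_out  # load inputs + write outputs + load weights

# A 200x200 RGB image flattened to 120 000 features, followed by a
# fully connected layer with 1000 output neurons.
n_in, n_out = 200 * 200 * 3, 1000
print(fc_flops(n_in, n_out))   # 240 000 000 FLOPs (about 240 MFLOPs)
print(n_in * n_out)            # 120 000 000 weights (parameters)
print(fc_mac(n_in, n_out))     # about 120 million memory accesses
```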

Sparsely connected layers allow reducing the number of parameters and operations. In a sparsely connected layer, not all input features are connected to all output features. An example of such a sparsely connected layer is a convolutional layer. When working with images, convolutional layers are almost always used.

2.2.2 Convolutional layer

In a convolutional layer, the input is represented as a three-dimensional matrix: it has a height (H), a width (W) and a number of channels (Cin), see figure 2.1.

Figure 2.1: Schematic representation of a convolutional layer.

A convolution consists of an elementwise multiplication between a part of the input feature map and the kernel (with dimensions K × K × Cin), followed by the sum of all the elements in the resulting matrix. The kernel moves over the input feature map in two dimensions, the height and the width of the image. That is why this type of convolution is called a 2D convolution. Every convolution of the input feature map results in one channel of the output feature map. Thus, the number of kernels determines the number of output channels (Cout).

A convolutional layer can be seen as a sparsely connected version of a fully connected layer, where only the input features in close spatial proximity are connected to an output feature. This reduces the number of parameters and operations a lot. The computational cost of such a layer is given by:

$$\text{FLOPs} = H \cdot W \cdot C_{out} \cdot C_{in} \cdot K^2 + H \cdot W \cdot C_{out} \cdot (C_{in} \cdot K^2 - 1) \tag{2.4}$$

The left term is the number of multiplication operations and the right term the number of add operations. When Cin ∗ K² >> 1, the equation above can be simplified to:

$$\text{FLOPs} \approx C_{in} \cdot K^2 \cdot H \cdot W \cdot C_{out} \cdot 2 \tag{2.5}$$

The MAC of a convolutional layer is given by the following equation:

$$\text{MAC} = H \cdot W \cdot (C_{in} + C_{out}) + K^2 \cdot C_{in} \cdot C_{out} \tag{2.6}$$

The first term represents the reading of the inputs and the writing of the outputs. The second term is the loading of all the kernel weights.
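As with the fully connected layer, a small sketch can make these formulas concrete; the layer dimensions below are an arbitrary example and the helper names are only illustrative.

```python
def conv_flops(h, w, c_in, c_out, k):
    """Approximate FLOPs of a full convolution (equation 2.5)."""
    return 2 * h * w * c_in * c_out * k * k

def conv_mac(h, w, c_in, c_out, k):
    """Memory access cost of a convolution (equation 2.6)."""
    return h * w * (c_in + c_out) + k * k * c_in * c_out  # feature maps + kernel weights

# Example: a 3x3 convolution on a 25x25 feature map with 128 input and 128 output channels.
print(conv_flops(25, 25, 128, 128, 3))  # 184 320 000 FLOPs
print(conv_mac(25, 25, 128, 128, 3))    # 307 456 memory accesses
```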

2.2.3 Depth separable convolutions

Depth separable convolutions try to have the same effect as a ”full” convolution but with a significant reduction in the number of operations and parameters. This is done by separating the convolution into two parts: a depthwise convolution followed by a pointwise convolution, see Figure 2.2.


Figure 2.2: Schematic representation of a depth separable convolution.

The depthwise convolution is a channel-wise spatial convolution. This means that the kernel has dimensions K × K × 1 and the number of convolutions is equal to the number of input channels Cin. The result after a depthwise convolution is a feature map with dimensions H × W × Cin. At this stage the information of the different channels has not been combined; therefore a pointwise convolution is used. The pointwise convolution has kernel dimensions 1 × 1 × Cin and the number of convolutions is chosen according to the desired number of output channels Cout. This operation effectively combines the information of the different channels.

Compared to full convolutions, depth separable convolutions need fewer parameters and are less complex. The drawback is that they have a slightly lower representational power, which results in a small drop in accuracy [3].

The complexity of a depth separable convolution is given by:

$$\text{FLOPs} = \underbrace{H \cdot W \cdot C_{in} \cdot (2K^2 - 1)}_{\text{Depthwise conv}} + \underbrace{H \cdot W \cdot C_{out} \cdot (2C_{in} - 1)}_{\text{Pointwise conv}} \tag{2.7}$$

The terms correspond to the depthwise and pointwise convolution respectively. As 2 ∗ K² >> 1 and 2 ∗ Cin >> 1 the equation above can be approximated by:

$$\text{FLOPs} \approx \left(\underbrace{H \cdot W \cdot C_{in} \cdot K^2}_{\text{Depthwise conv}} + \underbrace{H \cdot W \cdot C_{out} \cdot C_{in}}_{\text{Pointwise conv}}\right) \cdot 2 \tag{2.8}$$

With approximations 2.5 and 2.8 the complexity reduction can be approximated by:

$$\text{Complexity reduction} \approx \frac{(H \cdot W \cdot C_{in} \cdot K^2 + H \cdot W \cdot C_{out} \cdot C_{in}) \cdot 2}{C_{in} \cdot K^2 \cdot H \cdot W \cdot C_{out} \cdot 2} = \frac{1}{C_{out}} + \frac{1}{K^2} \tag{2.9}$$

This means that for a 3 × 3 kernel the highest achievable reduction in complexity is 9. From equations 2.4 and 2.7 the exact complexity reduction is given by:

$$\text{Complexity reduction} = \frac{C_{in} \cdot (2K^2 - 1)}{C_{out} \cdot (2K^2 \cdot C_{in} - 1)} + \frac{2C_{in} - 1}{2K^2 \cdot C_{in} - 1} \tag{2.10}$$

As the depthwise and pointwise convolution are executed separately the memory accesses will be calculated for both operations individually. The MAC for a depthwise convolution:

$$\text{MAC}_{depthwise} = 2 \cdot H \cdot W \cdot C_{in} + K^2 \cdot C_{in} \tag{2.11}$$

The first term is the loading and writing of the feature maps, the second term is the loading of the kernel weights.

The number of MAC for a pointwise convolution:

$$\text{MAC}_{pointwise} = H \cdot W \cdot (C_{in} + C_{out}) + C_{in} \cdot C_{out} \tag{2.12}$$

Again, the first term is for the loading and writing of the feature maps and the second term is the loading of all the kernel weights.
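The sketch below evaluates approximations 2.5 and 2.8 for one example layer and checks that their ratio matches the prediction of equation 2.9; the chosen dimensions and function names are only illustrative.

```python
def depthsep_flops(h, w, c_in, c_out, k):
    """Approximate FLOPs of a depth separable convolution (equation 2.8)."""
    depthwise = h * w * c_in * k * k
    pointwise = h * w * c_in * c_out
    return 2 * (depthwise + pointwise)

def full_conv_flops(h, w, c_in, c_out, k):
    """Approximate FLOPs of the equivalent full convolution (equation 2.5)."""
    return 2 * h * w * c_in * c_out * k * k

# A 3x3 kernel with 128 input and output channels on a 25x25 feature map:
# equation 2.9 predicts a complexity ratio of roughly 1/C_out + 1/K^2.
h, w, c, k = 25, 25, 128, 3
print(depthsep_flops(h, w, c, c, k) / full_conv_flops(h, w, c, c, k))  # ~0.119
print(1 / c + 1 / k ** 2)                                              # ~0.119
```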

2.3 Common architectures

In this part, some commonly used networks for image classification are introduced. As those networks are made for image classification, all described networks are Convolutional Neural Networks (CNNs). A CNN is a neural network that essentially uses convolutional layers.

2.3.1 AlexNet

The first CNN used for large scale image recognition is AlexNet [4]; this network won the ImageNet [2] competition in 2012. It beat the winner of 2011 by a large margin. Afterward, the competition has always been won by CNNs.

The network starts with an 11x11 convolution; the spatial dimension of the feature map gradually decreases and the number of channels increases. The idea is that further in the network the features become more abstract. The network has a complexity of 1.4 GFLOPs. It was so large that at the time the memory of one GPU (3 GB) was not enough to train the network, so the authors had to split the network into two parts and run them on two GPUs. That is why it is separated into two groups in the figure. AlexNet achieved a top 5 accuracy of 81.8% and a top 1 accuracy of 59.3%.

Figure 2.3: Architecture of AlexNet. Borrowed from [4]

2.3.2 VGG

This network continued building on the same principles as AlexNet: the authors tried to achieve higher accuracies by going deeper (adding more layers). They made multiple architectures with different depths (numbers of layers), from 11 to 19 layers. They showed that deeper networks perform better, but at a greater computational and memory cost.

In their networks, only 3x3 convolutions were used. When multiple 3x3 convolutions are stacked in subsequent layers, their receptive field becomes greater: two stacked 3x3 convolutions have the same receptive field as one 5x5 convolution. The advantage of this is that the same receptive field can be achieved with fewer parameters and operations.

VGG-16 achieved a top 5 accuracy of 92.5% and a top 1 accuracy of 75.2%, for a complexity of 31 GFLOPs.

2.3.3 GoogleNet inception V1

The GoogleNet [5] network went even deeper than VGG. They made a 22-layer-deep network by stacking inception modules, figure 2.4. This module has multiple pathways with different operations. The idea is that while training, the network can choose which operation works best. If the output channels of all four paths were simply stacked, the number of channels would become very large. To solve this they used 1x1 convolutions as bottlenecks; these convolutions simply reduce a high number of input channels to a lower number of output channels.

Figure 2.4: Inception module used in GoogleNet. Borrowed from [5].

Figure 2.5 shows the full architecture of GoogleNet; the network is basically made by stacking multiple inception modules. Another particularity is that it has 2 auxiliary classifiers. Those are only used during training and allow more gradient to flow to the first layers of the network.

GoogleNet achieved a top 5 accuracy of 93.3% for a complexity of 3 GFLOPs.

Figure 2.5: GoogleNet architecture. Borrowed from [5].

2.3.4 ResNet

ResNet [6] won the ILSVRC 2015 competition. It uses an architecture with an impressive 152 layers.

The first thing the authors remarked is that the benefit of going deeper with simple convolution layers stagnates. They did tests with a 56-layer and a 20-layer model and showed that the test error of the 20-layer model was lower. This is unexpected, because a 56-layer model should be at least as good as a 20-layer model. This means that the problem does not lie in the model itself but in the optimization of the model: as a 56-layer model is larger, it is much harder to optimize all those parameters.

To solve this problem they proposed an architecture with residual connections; the base block of their architecture is shown in figure 2.6. The idea is to add ”shortcut connections” that skip some layers and are added back into the network. This means that instead of learning the desired output, the network has to learn the difference between the input and the output.

Figure 2.6: ResNet residual block. Borrowed from [6].

By stacking the residual block of figure 2.6, they made networks with different depths. The best network was ResNet-152, with 152 layers. It achieved a top 5 accuracy of 95.5% and a top 1 accuracy of 80.6% for a computational cost of 22.6 GFLOPs.

3 Efficient neural networks

In chapter 2 the basics of deep neural networks were introduced. To evaluate the performance of the network, the only metric that was looked at was the accuracy. But this is not the only metric to evaluate a neural network. The time to execute the networks (called inference time) and energy consumption are also important metrics. This is especially important for edge devices as they mostly have limited computational resources and are often battery powered [7]. A concrete example would be to deploy a neural network on a phone.

To evaluate how good a network is for deployment, 3 metrics are important:

• The performance of the network on the given task (mostly accuracy for image classifica- tion).

• The inference time

• The energy consumption

The problem with the energy consumption and inference time is that they depend on the hardware the network is running on. What can also be done is to look at the FLOPs instead of the energy consumption and inference time. The assumption here is that the inference time and energy consumption will be highly related to the number of FLOPs.


Efficient processing of neural networks means lowering the inference time and energy consump- tion. This can be done in different ways:

• On the hardware level

• Using different algorithms

• Modifying existing networks

• Creating better network architectures

Those are described in the following sections.

3.1 Hardware

Running a neural network consists almost entirely of multiply and accumulate operations. As those operations are independent, they can be parallelized. This means that using a Single Instruction, Multiple Data (SIMD) architecture is interesting; that is why GPUs are used rather than CPUs for the training and inference of networks. As deep learning is getting more popular, multiple companies make specific hardware for deep learning training and inference. This allows reducing the cost and energy consumption compared to GPUs.

Google made a Tensor Processing Unit (TPU) [8] for the training and inference of networks. Recently they introduced the edge TPU [9]; this one is made for constrained devices and focuses on inference rather than training. Tesla made an AI chip [10] to power its Autopilot system. Intel makes Nervana neural network processors [11]. They also make neural network processors for the edge in the form of USB sticks, namely the neural compute stick v1 [12] and v2 [13]. The neural compute stick v1 is the hardware that is used in this work.

This clearly shows that there is a lot of research going on to make more specific and thus efficient hardware for deep neural networks.

3.2 Different execution algorithms

Different algorithms can be used to compute deep learning layers. Those algorithms allow to reduce the computational cost or allow better exploitation of the hardware resources. As a consequence, choosing the right algorithm can make the execution of a layer more efficient.

3.2.1 Standardise to GeMM

Computing libraries for CPUs and GPUs are highly optimized for the execution of General Matrix Multiply (GeMM) operations. For a FC layer this is not a problem, as the basic operation here is a matrix multiplication. On the other hand, for a convolutional layer the data needs to be reordered so that it can be executed as a GeMM. Reordering the data is done with an image to column (Im2Col) operation, figure 3.1. Basically, every patch of the input image that needs to be multiplied with a kernel is converted into a column. The disadvantage of this is that data has to be reordered and there is duplicate data in the matrices. But the advantages of using GeMM outweigh the disadvantages [14].

Figure 3.1: Convolution converted to a matrix multiplication with Im2Col operation. Borrowed from [14]
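To illustrate the idea, the following minimal NumPy sketch rearranges the patches of a stride-1, unpadded convolution into columns and computes the convolution as a single matrix multiplication. It is only a simplified illustration of the Im2Col approach, not the implementation of any particular library.

```python
import numpy as np

def im2col(x, k):
    """Rearrange all k x k patches of x (H x W x C_in) into columns.

    Returns a matrix of shape (k*k*C_in, H_out*W_out) for a stride-1,
    no-padding convolution.
    """
    H, W, C = x.shape
    H_out, W_out = H - k + 1, W - k + 1
    cols = np.empty((k * k * C, H_out * W_out), dtype=x.dtype)
    for i in range(H_out):
        for j in range(W_out):
            patch = x[i:i + k, j:j + k, :]       # one k x k x C_in patch
            cols[:, i * W_out + j] = patch.reshape(-1)
    return cols

def conv_as_gemm(x, kernels):
    """Convolution expressed as one matrix multiplication (GeMM).

    kernels has shape (C_out, k, k, C_in); the result is H_out x W_out x C_out.
    """
    C_out, k, _, C_in = kernels.shape
    H_out, W_out = x.shape[0] - k + 1, x.shape[1] - k + 1
    cols = im2col(x, k)                          # (k*k*C_in, H_out*W_out)
    W_mat = kernels.reshape(C_out, -1)           # (C_out, k*k*C_in)
    out = W_mat @ cols                           # the GeMM
    return out.reshape(C_out, H_out, W_out).transpose(1, 2, 0)

# Example: a 3x3 convolution on a 10x10 RGB image with 16 output channels.
x = np.random.rand(10, 10, 3).astype(np.float32)
kernels = np.random.rand(16, 3, 3, 3).astype(np.float32)
print(conv_as_gemm(x, kernels).shape)            # (8, 8, 16)
```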

3.2.2 FFT convolution

This method uses the Fast Fourier Transform (FFT) to reduce the number of operations of a convolutional layer. A convolution in the time domain corresponds to an elementwise multiplication in the frequency domain, which costs fewer operations. The inputs and the kernels are transformed to the frequency domain with an FFT, in the frequency domain an elementwise multiplication is computed, and the result is transformed back to the time domain. Even with the cost of the FFT transformations, this method costs less than directly computing the convolution in the time domain [15].
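A minimal NumPy sketch of the idea for a single channel, without the optimizations a real library would apply. Note that deep learning frameworks usually compute a cross-correlation rather than a true convolution, so in practice the kernel would be flipped first; this sketch computes the true convolution.

```python
import numpy as np
from numpy.fft import fft2, ifft2

def fft_conv2d(image, kernel):
    """2D convolution of one channel computed in the frequency domain."""
    H, W = image.shape
    kH, kW = kernel.shape
    out_shape = (H + kH - 1, W + kW - 1)                         # full linear convolution size
    spectrum = fft2(image, out_shape) * fft2(kernel, out_shape)  # elementwise product
    full = np.real(ifft2(spectrum))                              # back to the spatial domain
    return full[kH - 1:H, kW - 1:W]                              # keep only the "valid" region

image = np.random.rand(64, 64)
kernel = np.random.rand(5, 5)
print(fft_conv2d(image, kernel).shape)                           # (60, 60)
```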

3.2.3 Winograd convolutions

Winograd convolutions use Winograd’s minimal filtering algorithms to reduce the number of operations needed to compute a convolution. They showed that for small kernel sizes this method performs better on a GPU compared to the previous FFT method [16].

3.3 Network modifications

Networks can also be made more efficient by simplifying existing networks. One method is to reduce the precision of weights and activations in the network, while another is to remove ”unnecessary” connections.

Reducing the precision can be done by mapping the 32-bit floating-point numbers to 16-bit floating-point numbers. Doing this for the weights and activations lowers the model size and allows for smaller arithmetic operations, which can be processed faster.

Paper [17] shows that by fine-tuning the network the weights and activations can be reduced to 8-bit fixed-point numbers for almost no accuracy loss. Fine tuning means that the network is retrained with the new reduced precision weights.
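As a rough illustration of the idea (not the exact scheme of [17]), the sketch below applies a simple symmetric post-training quantization to a weight tensor; the fine-tuning step described above is not shown and the function names are only illustrative.

```python
import numpy as np

def quantize_int8(weights):
    """Map float weights to int8 with a single symmetric scale factor."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map the int8 values back to floating point for comparison."""
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32) * 0.05
q, scale = quantize_int8(w)
print("max quantization error:", np.abs(w - dequantize(q, scale)).max())
```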

Tests were also made with binarized neural networks[18], where the weights and activations are mapped to +1 or -1. But this technique highly decreases the accuracy of the network.

Another method to make networks more efficient is to remove connections in the network. Connections with small weights can be set to 0 with only a small accuracy loss [19]. This is effectively the same as removing connections. This technique allows lowering the model size and inference time if the hardware supports it. To speed up the inference the hardware has to support skipping multiplications with 0 or benefit from executing sparsely connected layers.
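A minimal sketch of this pruning idea: the smallest-magnitude weights are set to zero. The chosen sparsity level is arbitrary, the retraining step is omitted, and as noted above the speedup only materializes if the hardware can exploit the zeros.

```python
import numpy as np

def prune_by_magnitude(weights, sparsity=0.5):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

w = np.random.randn(512, 512).astype(np.float32)
pruned, mask = prune_by_magnitude(w, sparsity=0.9)
print("remaining connections:", mask.mean())   # about 0.1
```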

3.4 Efficient network architectures

The networks described in section 2.3 are essentially focused on obtaining a higher accuracy for image classification. This is mostly done by stacking more layers and so increasing the complexity of the network. The problem is that this makes the networks more complex to execute and so increases the inference time. In recent years different networks have been proposed that try to be more efficient [20, 21, 22, 23, 24, 25]. This means maintaining an accuracy similar to the state of the art networks while significantly reducing the complexity. Some of those networks are described in the following sections.

3.4.1 Mobilenet V1

The Mobilenet paper [3] proposed a more efficient neural network by replacing full convolutions with depth separable convolutions (described in part 2.2.3). The full network architecture is described in table 3.1. As can be seen in the table, the network applies the same rule as VGG: every time the spatial image size is halved, the number of channels is doubled.

Table 3.1: Architecture of MobilenetV1-1.0-224. s is the stride, c the number of output channels, n the number of times the layer is repeated.

Layer type        s   Input size        c     n
Conv              2   224 × 224 × 3     32    1
Depthsep conv     1   112 × 112 × 32    64    1
Depthsep conv     2   112 × 112 × 64    128   1
Depthsep conv     1   56 × 56 × 128     128   1
Depthsep conv     2   56 × 56 × 128     256   1
Depthsep conv     1   28 × 28 × 256     256   1
Depthsep conv     2   28 × 28 × 256     512   1
Depthsep conv     1   14 × 14 × 512     512   5
Depthsep conv     2   14 × 14 × 512     1024  1
Depthsep conv     2   7 × 7 × 1024      1024  1
Avg pool (7 × 7)  1   7 × 7 × 1024      1024  1
FC                1   1 × 1 × 1024      1000  1

This model achieves 70.6% top-1 accuracy on ImageNet, only 1.1% lower than the same model with full convolutions. It has a complexity of 1137 MFLOPs, which is 8.55× fewer operations than the full convolution model. This showed that depth separable convolutions reduce the complexity considerably without losing much accuracy.

Depending on the application, a different trade-off between accuracy and complexity is desired. Therefore, two global hyperparameters were introduced: a width multiplier and a resolution multiplier. Those hyperparameters allow adjusting the complexity of the created network.

The width multiplier α changes the number of channels for the entire network. The number of input and output channels for all layers become αCin and αCout.

The resolution multiplier ρ changes the resolution of the image. The spatial resolution at every layer of the network will be equal to ρH and ρW .

Conventionally, the width and resolution multiplier are added to the name of the model; e.g., MobileNetV1-0.75-224 has a width multiplier of 0.75 and a resolution of 224x224.
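The effect of both multipliers on the complexity can be illustrated with a small sketch. It uses the usual cost model of a depth separable convolution (one multiply-add counted as two FLOPs); the function is our own illustration and not taken from the paper.

    def depthsep_conv_flops(h, w, c_in, c_out, k=3, alpha=1.0, rho=1.0):
        # Approximate FLOPs of a depth separable convolution under a width
        # multiplier alpha and a resolution multiplier rho.
        h, w = rho * h, rho * w
        c_in, c_out = alpha * c_in, alpha * c_out
        depthwise = 2 * h * w * c_in * k * k          # one k x k filter per input channel
        pointwise = 2 * h * w * c_in * c_out          # 1x1 convolution over all channels
        return depthwise + pointwise

    base = depthsep_conv_flops(56, 56, 128, 128)
    print(depthsep_conv_flops(56, 56, 128, 128, alpha=0.75) / base)     # ~0.57, roughly alpha^2
    print(depthsep_conv_flops(56, 56, 128, 128, rho=192 / 224) / base)  # ~0.73, exactly rho^2

The width multiplier scales the pointwise term quadratically and the depthwise term linearly, while the resolution multiplier scales both terms quadratically.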

3.4.2 Mobilenet V2

Mobilenet V2 [26] is based on Mobilenet V1. It introduces a novel layer module called the inverted residual with linear bottleneck. It consists of a pointwise convolution that expands a low dimensional feature map to a higher dimension, followed by a depthwise convolution. The last operation is a bottleneck operation that reduces the dimension again. The module is illustrated in figure 3.2.

Figure 3.2: Inverted residual with linear bottleneck block. The diagonally hatched structures do not contain non-linearities. Figure borrowed from [26] with some corrections: the Relu6 operation was shown on the bottleneck convolution instead of on the expansion convolution.

Bottleneck module

Non-linearities are needed in a network to be able to represent more than simply a linear combination of the input to the output. The problem with non-linearities is that they discard information. For example, the Rectified Linear Unit (ReLU) sets all negative numbers to zero, so information is lost. When using non-linearities in a low dimensional space (the dimension is the number of channels), information is lost. But when doing this in a high dimensional space, the information might still be present in another channel. So they suggest that non-linearities should only be applied in a high dimensional space to minimize the information loss. On the other hand, using high dimensional spaces has a higher computational cost. To mitigate this problem they proposed the expansion -> operation -> bottleneck structure of figure 3.2.

The first operation is a pointwise convolution that expands the low dimensional layer to a higher dimension; this operation has non-linearities. Then a depthwise convolution is applied, which is a low-cost operation even if there are a lot of channels; this operation also contains non-linearities. The last step is a linear bottleneck layer, a pointwise convolution that reduces the number of channels; here the layer does not have non-linearities as the number of output channels is low. The exact dimensions of this layer module are described in table 3.2.

Table 3.2: Bottleneck residual block transforming from Cin to Cout channels, with stride s and expansion factor t. Table borrowed from [26]

Input                 Operator                       Output
H × W × Cin           1x1 conv, ReLU6                H × W × tCin
H × W × tCin          3x3 dwise, stride s, ReLU6     H/s × W/s × tCin
H/s × W/s × tCin      linear 1x1 conv                H/s × W/s × Cout

Inverted residuals

Mobilenet V2 also uses residual connections; it has been shown that those connections help to increase the accuracy of deeper models [6][27]. As can be seen in figure 3.2, the residual connection connects the two bottlenecks. The intuition behind this is that those bottlenecks contain all the necessary information.

Full network

The full Mobilenet V2 architecture is given in table 3.3. Compared to Mobilenet V1, for an equal spatial image resolution the number of channels in the bottleneck layers is a lot smaller in Mobilenet V2. Just as in Mobilenet V1, a width and resolution multiplier are used to change the model's accuracy-complexity trade-off.

The largest Mobilenet V2 model (width multiplier = 1.4 & resolution = 224x224) achieves a top-1 accuracy of 74.7% on ImageNet and requires 1164 MFLOPs.

Table 3.3: Architecture of MobilenetV2-1.0-224. s is the stride, c the number of output channels, n the number of times the layer is repeated and t the expansion factor.

Layer type     s   Input size        c     t   n
Conv           2   224 × 224 × 3     32    -   1
Bottleneck     1   112 × 112 × 32    16    1   1
Bottleneck     2   112 × 112 × 16    24    6   2
Bottleneck     2   56 × 56 × 24      32    6   3
Bottleneck     2   28 × 28 × 32      64    6   4
Bottleneck     1   14 × 14 × 64      96    6   3
Bottleneck     2   14 × 14 × 96      160   6   3
Bottleneck     1   7 × 7 × 160       320   6   1
Conv 1x1       1   7 × 7 × 320       1280  -   1
Avg pool 7x7   -   7 × 7 × 1280      -     -   1
Conv 1x1       -   1 × 1 × 1280      k     -   -

3.4.3 Shufflenet V1

Depth separable convolutions are useful to reduce the complexity of a model while retaining a high level of accuracy [3, 20]. The problem for small models is that almost all the complexity now lies in the pointwise convolutions. In MobileNetV1-1.0-224 more than 90% of the complexity is in the pointwise layers. ShuffleNet [22] proposes to use group convolutions to reduce this complexity.

The problem with group convolutions is that the output of one group only depends on the input of that same group, i.e., the information between groups is not shared. To solve this problem they introduce a shuffle operation, illustrated in figure 3.3b. This ensures that channels are mixed between the different groups. The operation can be implemented as a reshape -> transpose -> reshape. The tensor of dimensions (N, H, W, C) is first reshaped to (N, H, W, G, C/G), where G is the number of groups. The two last dimensions are then transposed, which creates a tensor of dimensions (N, H, W, C/G, G). This tensor is then reshaped back to its original dimensions.
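A minimal NumPy sketch of this shuffle for an NHWC tensor is shown below (our own illustration):

    import numpy as np

    def channel_shuffle(x, groups):
        # Channel shuffle implemented as reshape -> transpose -> reshape.
        n, h, w, c = x.shape
        assert c % groups == 0
        x = x.reshape(n, h, w, groups, c // groups)   # split channels into (groups, channels per group)
        x = x.transpose(0, 1, 2, 4, 3)                # swap the two last axes
        return x.reshape(n, h, w, c)                  # flatten back to the original shape

    x = np.arange(2 * 4 * 4 * 6).reshape(2, 4, 4, 6)
    y = channel_shuffle(x, groups=3)                  # channel order 0,1,2,3,4,5 becomes 0,2,4,1,3,5

After the shuffle, every group of the next group convolution receives channels coming from all the groups of the previous one.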

As they reduce the complexity of the pointwise convolutions by using groups they can use more channels and still keep the same computational cost. They showed that for models with equal complexity, higher accuracies can be achieved by having more groups and more channels.

The basic unit used in ShuffleNetV1 consists of a 1x1 group convolution, a channel shuffle, a depthwise convolution and another 1x1 group convolution (see figure 3.4a). They also use residual connections in the unit. Figure 3.4b shows the same unit but with stride 2. The differences are: a 3x3 average pooling is added to the residual connection, and the channels of the residual are concatenated instead of added. ShuffleNet only uses one ReLU non-linearity, after the first pointwise convolution, and no ReLU after the depthwise convolution.

Figure 3.3: (a) Two stacked group convolutions, channels between the different groups are not shared. (b) Two stacked group convolutions where the channels are shuffled so that every output group fully relates with all the input groups. Borrowed from [22].

Table 3.4 gives the full architecture of ShuffleNetV1 with 3 groups.

Similarly to the MobileNets, they also use a parameter to increase the complexity of their model by multiplying the number of channels. To denote which width multiplier is used they just write it after the model name, e.g., ShuffleNet 2× uses a width multiplier of 2.

The top-1 accuracy achieved on ImageNet for ShuffleNetV1 2× is 73.7%, for a complexity of only 1048 MFLOPs. They also made a ShuffleNetV1 2× variant where they added Squeeze-and-Excitation layers [21] to the network; this increased the complexity to 1054 MFLOPs and the accuracy to 75.3%.


Figure 3.4: (a) Basic unit of the ShuffleNetV1 network. (b) Basic unit with stride 2. Borrowed from [22].

Table 3.4: Architecture of ShuffleNetV1. s is the stride, c the number of output channels and n the number of times the layer is repeated. The number of groups used for the group convolutions is 3.

Layer type        s   Input size        c     n
Conv              2   224 × 224 × 3     24    1
MaxPool 3x3       2   112 × 112 × 24    24    1
ShuffleNet unit   2   56 × 56 × 24      240   1
ShuffleNet unit   1   28 × 28 × 240     240   3
ShuffleNet unit   2   28 × 28 × 240     480   1
ShuffleNet unit   1   14 × 14 × 480     480   7
ShuffleNet unit   2   14 × 14 × 480     960   1
ShuffleNet unit   1   7 × 7 × 960       960   3
Global pool 7x7   2   7 × 7 × 960       960   1
FC                -   1 × 1 × 960       1000  1

3.4.4 Shufflenet V2

The ShuffleNet V2 paper [23] aims to create an efficient network by also looking at the inference time and not only at the FLOPs. For this, the authors proposed four guidelines for making efficient neural networks, summarized here.

G1) Equal channel width minimizes memory access cost (MAC). The inference time of a layer is not only related to the number of FLOPs but is also affected by the MAC. The MAC of a pointwise convolution is given in equation 2.12. They showed that for a pointwise convolution the number of FLOPs per MAC (called the arithmetic intensity) reaches a maximum when the number of input channels is equal to the number of output channels. They ran benchmarks on a CPU and a GPU with networks with different input and output channel ratios; the network with the same number of input and output channels always had the fastest inference time.

G2) Excessive group convolution increases MAC. Increasing the number of groups also reduces the arithmetic intensity (FLOPs/MAC). They also showed that the inference time is higher for convolutions with more groups, both on a CPU and on a GPU.

G3) Network fragmentation reduces the degree of parallelism. With network fragmentation they mean splitting one large convolution into multiple parts: executing smaller convolutions sequentially (equivalent to adding layers) or doing multiple convolutions in parallel (like a group convolution). They showed that at the same complexity the network with only one large convolution always has the lowest inference time.

G4) Element-wise operations are non-negligible. With element-wise operations they mean ReLU, bias, add, etc. Those layers have a low complexity but a high MAC, so their inference time is not negligible.

Figure 3.5 shows the basic units used in ShuffleNetV2; there is a stride 1 and a stride 2 unit. As there is a channel split operation, every branch only contains half of the channels of the network. So the 1x1 and depthwise convolutions are only executed on half of the channels, which removes the need for group convolutions (follows G2). To mix the information between the two branches, a channel shuffle operation is added after the two branches have been concatenated again. They don't use any expansion or bottleneck layers as those would violate G1.

Table 3.5 describes the full architecture of ShuffleNetV2, the architecture is similar to Shuf- fleNetV1 (table 3.4) but with a different unit.

ShuffleNetV2 2× achieves a top-1 accuracy of 74.9% for a complexity of 1182 MFLOPs.


Figure 3.5: (a) Basic unit of the ShuffleNetV2 network. (b) Basic unit with stride 2. Borrowed from [23].

Table 3.5: Architecture of ShuffleNetV2 1×. s is the stride, c the number of output channels and n the number of times the layer is repeated.

Layer type          s   Input size        c     n
Conv 3x3            2   224 × 224 × 3     24    1
MaxPool 3x3         2   112 × 112 × 24    24    1
ShuffleNetV2 unit   2   56 × 56 × 24      116   1
ShuffleNetV2 unit   1   28 × 28 × 116     116   3
ShuffleNetV2 unit   2   28 × 28 × 116     232   1
ShuffleNetV2 unit   1   14 × 14 × 232     232   7
ShuffleNetV2 unit   2   14 × 14 × 232     464   1
ShuffleNetV2 unit   1   7 × 7 × 464       464   3
Conv 1x1            1   7 × 7 × 464       1024  1
Global pool 7x7     2   7 × 7 × 1024      1024  1
FC                  -   1 × 1 × 1024      1000  1

4 Movidius stick

The Movidius™ stick is a neural network accelerator made by Intel®. It communicates via USB and has the form factor of a USB stick. The Movidius stick allows running neural networks and sending the results back to the host computer.

4.1 Movidius stick architecture

The Movidius™ stick is based on the Myriad 2 Vision Processing Unit (VPU). This chip was developed by the Movidius™ company, which was acquired by Intel® in September 2016 [28]. The Myriad 2 VPU is a co-processor designed to accelerate computer vision related tasks. The architecture is depicted in figure 4.1; it consists of 12 SHAVE Very Large Instruction Word (VLIW) vector processors and programmable hardware accelerators [29]. The programmable hardware accelerators are used for computer vision specific tasks such as lens shading correction, sharpening filters, gamma correction, etc. The Myriad 2 VPU runs at 600 MHz at 0.9 V. Intel used this chip to create the Movidius neural compute stick. It is marketed as a neural network inference device, but it is important to note that the Myriad 2 chip wasn't specifically built for neural network inference [30]; it was built for more general computer vision related tasks.


Figure 4.1: Myriad2 VPU block diagram. Borrowed from [29]

The more common name used by Intel to denote the Movidius stick is the Neural Compute Stick (NCS); from now on we will always use the NCS abbreviation.

4.2 OpenVino toolkit

The OpenVino toolkit is (partially) open source software made by Intel; it is used for the deployment of neural networks on different Intel hardware platforms. Figure 4.2 shows the general development workflow when working with the OpenVino toolkit. The first step is to create and train a network in any supported framework such as Caffe [31] or Tensorflow [32]. Then this model needs to be transformed into an intermediate representation with the OpenVino model optimizer (more about this later). This Intermediate Representation (IR) consists of an XML and a .bin file. The XML file describes the full architecture of the network while the .bin file contains the parameters. As the last step, the OpenVino inference engine can run the IR model on any supported hardware. The supported hardware consists of Intel CPUs, GPUs, FPGAs or VPUs.

Figure 4.2: Deployment workflow to program the NCS. Image borrowed from the OpenVino documentation[33]

4.2.1 Model optimizer

The model optimizer converts a model from another framework, like Tensorflow or Caffe, to an intermediate representation. It doesn't only convert the model but also performs some optimizations. These optimizations include fusing multiple layers into one, the ResNet stride optimization and cutting off parts of the model. They are described below.

Batchnorm layer fusing

The batchnorm layer [34] is used in almost all recent models; it consists of a normalization (equation 4.1) and a scale operation (equation 4.2):

x̂_i = (x_i − µ_B) / σ_B    (4.1)

y_i = γ x̂_i + β    (4.2)

The parameters µ_B, σ_B, γ and β are learned while training the model; at inference time they are constants for any given neuron. The value of x_i is calculated from the previous layer activations by the following equation:

x_i = w_0 a_0 + w_1 a_1 + ... + w_{n−1} a_{n−1} + w_n a_n + b    (4.3)

With a the previous layer activations, w the weights, b the bias and n the number of neurons to which x_i is connected. By combining equation 4.3 with the normalization and scale operations

(4.1 and 4.2) we get the following equation for y_i:

y_i = (γ / σ_B) (w_0 a_0 + w_1 a_1 + ... + w_{n−1} a_{n−1} + w_n a_n + b − µ_B) + β

By transforming the weights and biases of equation 4.3, the batchnorm layer can be fused into the previous layer. The new weights and bias become:

w_new = (γ / σ_B) w    (4.4)

b_new = (γ / σ_B) (b − µ_B) + β    (4.5)

Doing this effectively removes the computational cost of the batchnorm layers.
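The fusion can be verified with a few lines of NumPy. The sketch below applies equations 4.4 and 4.5 to a random fully connected layer and checks that the fused layer produces the same output as the original layer followed by the batchnorm; ε, the small constant usually added to σ_B for numerical stability, is ignored here.

    import numpy as np

    def fuse_batchnorm(w, b, gamma, beta, mu, sigma):
        # Fold the batchnorm parameters into the weights and bias of the previous layer.
        scale = gamma / sigma                  # one scale factor per output neuron
        return w * scale[:, None], scale * (b - mu) + beta

    outputs, inputs = 8, 16
    w, b = np.random.randn(outputs, inputs), np.random.randn(outputs)
    gamma, beta = np.random.randn(outputs), np.random.randn(outputs)
    mu, sigma = np.random.randn(outputs), np.random.rand(outputs) + 0.5
    a = np.random.randn(inputs)                # activations of the previous layer

    reference = gamma * ((w @ a + b - mu) / sigma) + beta
    w_new, b_new = fuse_batchnorm(w, b, gamma, beta, mu, sigma)
    assert np.allclose(w_new @ a + b_new, reference)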

ResNet stride optimization

This optimization is specific to the ResNet architecture; a part of the ResNet architecture is shown in figure 4.3a. The problem here is that the convolutions 3a_1 and 3a_2a are both pointwise convolutions with stride 2. This basically means that half of the input image is not even used, so those points were computed for nothing.


Figure 4.3: Part of the architecture of ResNet-50. The network before optimization (a). The network after the model optimizer stride optimization (b). Images made with NetScope [35]

The idea of stride optimization is to move the strided convolution higher up in the network. It is moved up until the first convolution with a kernel size larger than 1x1 is found, in this case the 2c_2b convolution. Figure 4.3b shows the network after stride optimization. Here the stride has been moved up into the 2c_2b convolution, so the 2c_2c convolution only needs to be computed on a 28x28 feature map. Also, a 1x1 stride 2 pooling operation had to be added to the other branch so that the spatial dimension would be 28x28. This optimization allows reducing the number of operations needed for the ResNet networks. E.g. ResNet-50 goes from 7.72 GFLOPs to 6.97 GFLOPs after optimization, a reduction of 9.7% in FLOPs.

Cutting off model parts

The model optimizer also allows cutting off parts of models. This can be useful when some parts are only needed for the training of the model. An example is the auxiliary classifiers used in the Inception network; they are only used during training to get a better gradient flow through the network.

4.2.2 Inference engine

The inference engine module is basically an API that allows loading an IR model and running it on any supported Intel device. There is a C++ API and a Python API; the Python API is a simplified version of the full C++ API and only supports basic functionality for now. In this work, the Python API has been used for simplicity. The inference engine supports Intel CPUs, GPUs, FPGAs and VPUs, and this is where the open-source character of OpenVino ends. The inference engine code is fully open source, but the libraries used to communicate with the proprietary Intel hardware (called plugins) are not all open source. The CPU and GPU plugins are in the open source OpenVino distribution [36], but the FPGA and VPU plugins are only included when downloaded from the Intel developer website.

To directly program the chip on the NCS (the Myriad 2) the Myriad SDK is needed. This SDK is not publicly available; as we don't have the Myriad SDK we will only be able to make changes to the architecture of the networks and not to the low-level execution algorithms.

Running a neural network with the inference engine follows 3 steps: loading, compiling and executing the network.

1. The first step is to load the XML and .bin file of the network. This creates a CNNNetwork object that represents the network in host memory.

2. The second step is to compile and upload the network to the target device. In case of the NCS, the full network and weights are placed in the RAM of the NCS.

3. The last step is to send the input data to the target device and ask for the inference; once the inference is done the target device returns the outputs of the network.

Asking for inference can be done in two ways: with a sync or an async call. The sync call is blocking and waits until the inference is done, while the async call is non-blocking and so allows issuing another inference request before the previous one is done. As the NCS is operated via USB, sending the input data to the NCS can limit the performance. To mitigate this, Intel recommends having at least four parallel (async) inference requests to fully hide the data transfer costs [37].
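The three steps and the sync/async calls look roughly as follows with the Python API of the OpenVino 2018 R5 release used in this work. The file names are placeholders and the exact class and method names may differ in other OpenVino versions, so this should be read as a sketch rather than a reference.

    import numpy as np
    from openvino.inference_engine import IENetwork, IEPlugin

    # Step 1: load the IR (XML + .bin) into an in-memory network representation.
    net = IENetwork(model="model.xml", weights="model.bin")
    input_blob = next(iter(net.inputs))
    output_blob = next(iter(net.outputs))

    # Step 2: compile and upload the network to the NCS (device "MYRIAD").
    plugin = IEPlugin(device="MYRIAD")
    exec_net = plugin.load(network=net, num_requests=4)    # several requests to hide USB transfers

    image = np.random.rand(*net.inputs[input_blob].shape).astype(np.float32)

    # Step 3, sync: a blocking call that returns when the inference is done.
    result = exec_net.infer(inputs={input_blob: image})[output_blob]

    # Step 3, async: several requests can be in flight at the same time.
    for request_id in range(4):
        exec_net.start_async(request_id=request_id, inputs={input_blob: image})
    for request in exec_net.requests:
        request.wait(-1)                                    # -1 blocks until this request is done
        result = request.outputs[output_blob]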

4.3 NCS capabilities

The NCS doesn't support all the layers that are available in the OpenVino toolkit; all the supported layers can be found in the OpenVino documentation [38]. Basically, what's not supported is everything related to 3D convolution (where the kernel moves in 3 directions) and to memory (so RNNs, LSTMs, etc. are not supported).

The NCS only supports FP16 inference, although the datasheet of the Myriad 2 says it also supports FP32 and 8/16/32-bit integer operations. So this is a limitation of the VPU plugin used in the OpenVino toolkit.

On GPUs it is common practice to use large batch sizes because this largely increases the throughput. On the NCS this isn't true: the batch size has no influence on the throughput. If the batch size is doubled, the inference time is also doubled. This is logical because the NCS is meant to be used in real-time operation where latency is important and so batch sizes of one are used. All the following tests in this work were made with a batch size of one.

The NCS has 0.5 GB of LPDDR3 RAM, offering 400 GB/s of memory bandwidth. This memory is used to store the full network and the activations for every layer. As the NCS can only be used for inference, the only activations that need to be kept are those that are still needed for computations later on. This size should be enough for most networks; for example, VGG19 has 144 million parameters, which represents 0.288 GB of data in 16-bit floating-point precision, a little more than half of the available RAM.

As networks can be simplified by pruning small weights, it would be interesting if the NCS could handle zero weights efficiently. This was tested with AlexNet: the inference speed was measured with random weights and with all-zero weights. No significant difference in inference speed was measured. This means that the pruning of a network cannot be exploited on the NCS.

The power consumption of the NCS was measured: It is 0.30 W at idle. During inference, the power consumption varies between 1.3 and 1.8 W.

Knowing the capabilities of the NCS already allows us to eliminate some of the optimization possibilities described in the previous chapter. Working on the low-level algorithms that execute the deep learning layers (as described in section 3.2) won't be possible, as we don't have access to the low-level implementation of the layers executed on the NCS.

The optimizations that reduce the precision of the weights and activations also won't be possible, as only FP16 is allowed on the NCS. Setting weights to 0 doesn't reduce the inference time either.

This means that the remaining possible optimizations are in the use of better neural network architectures.

5 Layer's benchmarks

In this chapter different layers used in CNNs are benchmarked. The inference time of the layer is measured while the complexity (FLOPs) of the layer is increased. The following layers are benchmarked:

• Fully connected layer

• Convolutional layer

• Depth separable convolution

• ReLU and bias layer

Depending on the layer type, the complexity can be modified with different parameters. For a FC layer the complexity can be modified by changing the number of inputs and outputs of the layer, while for a convolutional layer the image size, kernel size, and input and output channels all have an impact on the complexity. The influence of those different parameters on the inference time is analyzed in this chapter.

The benchmarks are executed with the Intel® OpenVino™ toolkit version 2018 R5. The benchmarks have been executed on 2 hardware platforms supported by OpenVino:


• On the NCS version 1 at 16-bit floating point (FP16) precision

• On an Intel i7 5700HQ CPU on a single thread at 32-bit floating point (FP32) precision. The frequency of the processor was kept constant at 3.5 GHz.

The procedure we followed for the layer benchmarks is the following:

1. We create a Tensorflow model that only consists of the layer of interest, with the parameters that we want to test (kernel size, image height, ...).

2. We convert the model to an IR with the OpenVino Model optimizer.

3. We run the IR on the target hardware (with random input data) at least 10 times and measure the inference time with the get_perf_counts method of the OpenVino inference engine API.

4. We calculate the average inference time and standard deviation for this layer.

A remark on step 3: the get_perf_counts method returns the inference time for each unique layer, but what it returns differs according to the inference hardware used. On the NCS a layer with a bias is executed in 2 parts, the layer itself and the bias operation, so we can measure the time for the bias operation and the layer individually. On the CPU this is only reported as one layer. This means that in the NCS benchmarks the time to compute the bias is not included, while for the CPU it is included.

Another remark: the get_perf_counts method reports the time needed to receive the input data as a separate entry. This means that in the following benchmarks the data transfer time is not included.
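As an illustration of steps 1 and 2, the sketch below builds a graph containing a single 3x3 convolution with the TensorFlow 1.x API used in this work and freezes it to a .pb file; the layer parameters, file names and the model optimizer invocation in the comment are illustrative.

    import tensorflow as tf

    # Step 1: a graph that only contains the layer of interest, here a 3x3
    # convolution on a 25x25 image with 128 input and 128 output channels.
    tf.reset_default_graph()
    inputs = tf.placeholder(tf.float32, shape=[1, 25, 25, 128], name="input")
    conv = tf.layers.conv2d(inputs, filters=128, kernel_size=3, padding="same", name="conv")

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        frozen = tf.graph_util.convert_variables_to_constants(
            sess, sess.graph_def, output_node_names=[conv.op.name])
        tf.train.write_graph(frozen, ".", "conv_3x3.pb", as_text=False)

    # Step 2 (shell): convert the frozen graph to an IR with the model optimizer, e.g.
    #   python mo_tf.py --input_model conv_3x3.pb --data_type FP16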

5.1 Fully connected layer benchmark

Figure 5.1: Inference time (a) and computational performance (b) of a fully connected layer in function of the number of input and output neurons. Benchmark on NCS.

Figure 5.2: Inference time (a) and computational performance (b) of a fully connected layer in function of the number of input and output neurons. Benchmark on CPU.

In this section a FC layer is benchmarked on the NCS and CPU. The inference time of a FC layer in function of the number of input and output neurons is shown in figures 5.1a and 5.2a, on the NCS and CPU respectively. The number of input and output neurons is kept equal (Nin = Nout). Approximation 2.2 shows that the complexity of a fully connected layer is proportional to the number of input neurons multiplied with the number of output neurons, so if Nin = Nout the complexity is proportional to the squared number of neurons N: FLOPs ∼ N². As the inference time is expected to be proportional to the complexity, it can be stated that the inference time should be proportional to N²:

Inference time ∼ N² for Nin = Nout (5.1)

As both axes in charts 5.1a and 5.2a are logarithmic, it is hard to see whether the relation between the number of neurons and the inference time is quadratic. Therefore, the orange lines are added as a reference on the figures. Those lines represent a quadratic relation on the graph and thus show a constant computational performance. The computational performance is expressed in FLoating point OPerations per Second (FLOPS). The line was placed so that it crosses the measuring point with the highest computational performance, which means no measured point is located under the line. The closer the points are to this line, the higher their computational performance.

5.1.1 NCS results

Figure 5.1a shows that for a high number of neurons (above 2000) the quadratic relation is followed, while for a lower number of neurons the inference time is higher than the quadratic relation predicts. This means that a layer with a lower number of neurons is executed less efficiently on the NCS. This is also visible in figure 5.1b, where the number of FLOPS is shown. The chart shows that for a higher number of neurons the computational performance increases rapidly until it plateaus around 2000 neurons. So, for networks with an equivalent amount of FLOPs, it would be better to have fewer layers with a high neuron count than more layers with a low neuron count. The latter would be executed at a lower computational performance and thus run slower on the NCS.

5.1.2 CPU results

The same charts were made for the execution on the CPU, figures 5.2a and 5.2b. Interestingly, the charts show the inverse effect: for an increase in complexity, the computational performance actually decreases. As inference time on the CPU is not the focus of this work, the possible causes weren't analyzed.

5.2 Convolutional layer benchmark

Figure 5.3: Inference time (a) and computational performance (b) of a convolutional layer in function of the number of channels. The input and output channels are set equal and the used kernel size is 3x3. The different series are inferences for a different image size. Benchmark on NCS.

Figure 5.4: Inference time (a) and computational performance (b) of a convolutional layer in function of the number of channels. The input and output channels are set equal and the used kernel size is 3x3. The different series are inferences for a different image size. Benchmark on CPU.

Equation 2.5 shows that the complexity of a convolutional layer is proportional to the number of input channels (Cin), output channels (Cout) and the image size (H & W). It is expected that as those parameters increase, the inference time also increases. Figures 5.3a and 5.4a show this for the NCS and CPU. The x-axis represents the number of channels, where the input channels are set equal to the output channels (Cin = Cout). The different series represent different image sizes (H & W). The graphs show that for an increasing number of channels and image size the inference time increases.

5.2.1 NCS results

Figure 5.3b shows the computational performance on the NCS. Between 8 and 64 channels the performance increases with the complexity of the layer. For a number of channels between 72 and 1024, the same observation can be made. What is really special about this figure is the performance ”jump” between 64 and 72 channels. For images larger than 15x15 there is a large performance hit, while for images smaller than 10x10 there is a performance increase.

This ”jump” is also visible in figure 5.3a, which shows that for an image size of 5x5 the inference time with 72 channels is even lower than the time for 64 channels. This means that it would be inefficient to use a 5x5 image with 64 channels, as more channels can be used for an even lower inference time.

It is clear that the NCS changes its way of computing a convolutional layer at the mark of 64 channels. This modification is beneficial for layers with a small spatial size (lower than 10x10) but harmful for larger spatial sizes (higher than 15x15). In most cases, this won't be a problem for small spatial sizes, because when the spatial size is lower than 10x10 the network often has more than 64 channels. Unfortunately, we cannot modify or know how the NCS is executing those layers as the code running on the NCS is proprietary.

5.2.2 CPU results

Figure 5.4b shows the computational performance in function of the complexity on the CPU. For the CPU the general trend is that an increase in the number of channels increases the computational performance, until it plateaus around 100 channels. An increase in image size also results in a performance increase; figure 5.4b shows that for a larger image size the performance plateaus higher. The computational performance plateaus around 100 GFLOPS for image sizes above 25x25, while only 90 GFLOPS and 80 GFLOPS are achieved for image sizes of 15x15 and 10x10 respectively. With an image size of 5x5, another behavior is observed: the number of FLOPS is at its maximum (80 GFLOPS) at 200 channels, and for an increasing number of channels the performance quickly drops to 60 GFLOPS.

5.3 Pointwise convolution benchmark

Figure 5.5: Inference time (a) and computational performance (b) of a pointwise (1 × 1) convolution in function of the number of channels. The input and output channels are set equal. The different lines are inferences on a different image size. Benchmark on NCS.

Figure 5.6: Inference time (a) and computational performance (b) of a pointwise (1 × 1) convolution in function of the number of channels. The input and output channels are set equal. The different lines are inferences on a different image size. Benchmark on CPU.

A pointwise convolution is basically the same type of convolution as described in part 5.2, the only difference being the kernel size of 1x1. So, the results should be similar to those obtained in part 5.2. This layer is benchmarked because it is used in a depth separable convolution, section 2.2.3.

5.3.1 NCS results

Figure 5.5 shows the inference time and computational performance of a pointwise layer in function of the number of channels and for different image sizes on the NCS. Chart 5.5a shows that for a low number of channels and a small image size the inference time doesn't go lower than 86 µs. We will call this time the minimum overhead time. It is the lowest possible inference time for a pointwise convolution and can be seen as a constant overhead.

Chart 5.5b shows that when the number of channels and the image size increase, the computational performance also increases and reaches a maximum of 61 GFLOPS. Going from 208 to 216 channels, the same performance ”jump” can be seen as for the 3x3 convolutions (chart 5.3b). The difference is that this ”jump” occurs from 208 to 216 channels for a pointwise convolution, while it is from 64 to 72 channels for a 3x3 convolution. This means that if the jump is produced by an algorithmic change on the NCS, the decision is not purely based on the number of channels.

5.3.2 CPU results

Figure 5.6b shows the same benchmark on the CPU. Chart 5.6a shows that at low complexity there is again a lowest possible inference time. The difference is that here this overhead time is only about 1 µs. Chart 5.6b shows that for higher complexities the computational performance stays rather constant between 80 and 100 GFLOPS. We don't see the same large performance difference between low and high complexity layers.

5.4 Depthwise convolution benchmark

Figure 5.7: Inference time (a) and computational performance (b) of a depthwise convolution in function of the number of channels. The used kernel size is 3 × 3. The different lines are inferences on a different image size. Benchmark on NCS.

Figure 5.8: Inference time (a) and computational performance (b) of a depthwise convolution in function of the number of channels. The used kernel size is 3 × 3. The different lines are inferences on a different image size. Benchmark on CPU.

In this part, the benchmarks of a depthwise convolution are analyzed. The complexity of a depthwise convolution is given by the first term of equation 2.7; the difference with a full convolution is that the complexity is independent of the number of output channels. The consequence is that the complexity of a depthwise convolution only increases linearly with the number of channels instead of quadratically as for a full convolution (remember that in the benchmarks we set Cin = Cout).

5.4.1 NCS results

Chart 5.7a shows, again, that for lower complexities there is a minimum overhead time, here 105 µs. With increasing complexity the computational performance increases up to a maximum and then decreases slightly. This maximum happens at a different number of channels depending on the image size: for an image size of 100x100 the maximum is at 200 channels, while for an image size of 10x10 it is at 900.

5.4.2 CPU results

The benchmark on the CPU is visible in figure 5.8. The computational performance graph shows that for the same image size the computational performance stays about constant, and similarly to the NCS a higher computational performance is achieved for larger image sizes.

5.5 Speedup of depth separable convolutions

Depth separable convolutions are used because they offer a smaller complexity and size for almost the same accuracy as full convolutions. In this part, we measure if they effectively have a lower inference time compared to an equivalent full convolution. In the previous benchmarks, the inference time of a depthwise, pointwise and 3x3 convolution was measured. This means that we have all the required information to calculate the speedup. A depth separable convolution consists of a depthwise convolution followed by a pointwise convolution, so by adding the inference times of chart 5.5a and 5.7a we obtain the inference time of a depth separable convolution. The speedup of the depth separable convolution is then calculated as:

Speedup = (Inference time full conv) / (Inference time depthsep conv)

The speedup results in function of the number of channels are shown in figure 5.9. The line represents the theoretical speedup, which is calculated as:

Theoretical speedup = (FLOPs full conv) / (FLOPs depthsep conv)
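The theoretical speedup line can be reproduced with the sketch below; it uses the usual FLOP counts for a full and a depth separable convolution (one multiply-add counted as two FLOPs) and is our own illustration.

    def conv_flops(h, w, c_in, c_out, k):
        return 2 * h * w * c_in * c_out * k * k        # full convolution

    def depthsep_flops(h, w, c_in, c_out, k):
        return 2 * h * w * c_in * (k * k + c_out)      # depthwise part + pointwise part

    # For a 3x3 kernel the theoretical speedup grows with the channel count towards k*k = 9.
    for c in (16, 64, 256, 1024):
        speedup = conv_flops(25, 25, c, c, 3) / depthsep_flops(25, 25, c, c, 3)
        print(c, round(speedup, 2))                    # 5.76, 7.89, 8.69, 8.92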

Figure 5.9: Speedup of a depth separable convolution with respect to a full convolution. Used kernel size is 3 × 3. Benchmarked on the NCS.

Figure 5.9 is divided into four zones; those zones are analysed in the following paragraphs.

Zone 1 shows measurements with a speedup lower than one, meaning that the depth separable convolution is slower than the full convolution. Those are measurements with a low channel count and a small image size, so low complexity layers. The reason is that the depth separable convolution consists of two operations. As seen in the previous parts, both depthwise and pointwise convolutions have a constant overhead time for low complexities. As the depth separable convolution consists of a pointwise and a depthwise operation, it has two times the overhead. This causes a slowdown on the low complexity points.

As the complexity increases the speedup also increases; this is zone 2. This can be explained by two reasons. The first one is that as the complexity increases, the overall inference time increases and the overhead time becomes less significant. Second, at a higher channel count the theoretical speedup of a depth separable layer becomes higher.

Going from zone 2 to zone 3 we can see a sudden speedup increase for large image sizes and a decrease for smaller image sizes. This change happens going from 64 to 72 channels and corresponds to the performance jump in the 3x3 convolutions benchmark, figure 5.3b. As discussed in the 3x3 convolutions analysis (section 5.2.1), there is a drop in performance for image sizes larger than 15x15 and a performance increase for image sizes smaller than 10x10. This, of course, has an impact on the speedup: for image sizes larger than 10x10 there is a clear increase in speedup as the inference time of the full convolution increases, while for image sizes lower than 10x10 there is a speedup decrease.

From zone 3 to zone 4, another sudden change happens. This happens going from 208 to 216 channels and corresponds to the performance jump of the pointwise convolutions, see chart 5.5b. This is the same jump as for the 3x3 convolution (figure 5.3b) but it happens at a higher channel count. As this ”jump” now happens in the depth separable convolution, it cancels the effect of the jump of the 3x3 convolution: the computational performance decreases for large image sizes and increases for small image sizes. The same effect is seen in the speedup, which decreases for image sizes larger than 25x25 and increases for images smaller than 25x25.

Zone 4 shows that for larger image sizes and a higher number of channels the speedup increases slightly. But the measured speedup does not match the theoretical speedup. This is because the depthwise convolutions are executed at a lower computational performance than a 3x3 convolution. The highest achieved performance for depthwise convolutions is 8 GFLOPS, see chart 5.7b, while for the pointwise (chart 5.5b) and 3x3 convolutions (chart 5.3b) a maximum performance of 61 GFLOPS is achieved.

5.6 ReLU & bias

In this part, the inference time of a ReLU and a bias layer are benchmarked. Often the bias isn't seen as a separate layer because it is included in the FC or convolutional layers. But on the NCS the bias is actually computed as a separate layer, so the inference time of only the bias can be measured.

Figure 5.10a shows the inference time for bias and ReLU in function of the number of bias and ReLU operations on the NCS (ReLU and bias are counted as 1 operation each). The first observation is that the inference time is almost the same for both operations. The second observation is that for less than 40 thousand operations the inference time is constant and takes about 65 µs.

Figure 5.10: Inference time (a) and computational performance (b) of a bias and ReLU in function of the number of executed operations. Bias and ReLU are considered as one operation each. Benchmark on NCS.

Figure 5.10b plots the number of operations per second on the NCS. The FLOPS notation isn’t used here because a ReLU isn’t a floating-point operation. For a number of operations higher than 40 thousand the achieved performance is a bit lower than 1 GOPS.

It is important to note that the inference times of the bias and ReLU layers are absolutely not negligible, especially for smaller convolutions. Neural networks often have a bias and a ReLU operation for every FC or convolutional layer, which means that at least 130 µs per layer is used for bias and ReLU. By looking back at the inference times of convolutional and FC layers (charts 5.1a, 5.3a, 5.5a and 5.7a) it is clear that the inference time of smaller layers is in the same order of magnitude as the inference time for the bias and ReLU. Some concrete examples: a 1024x1024 FC layer has an inference time of 905 µs, or 1035 µs with bias and ReLU, so the bias and ReLU take 12.5% of the inference time. For a 50x50 depthwise convolution with 128 channels the inference time is 850 µs, and the bias and ReLU take 13% of the inference time.

5.7 Channel ratio

The first ShuffleNetV2 guideline showed that when using a different number of input channels than output channels, the MAC increases relative to the FLOPs. This results in a lower computational performance for the layer; in other words, an unbalanced channel ratio reduces the arithmetic intensity. This claim was verified on the NCS for image sizes of 100x100 and 50x50, see tables 5.1 and 5.2. Four layers of equal complexity were benchmarked; the only difference is the input and output channel ratio.

Table 5.1: Conv 1x1, image size 100x100

c1:c2   (c1, c2)     GFLOP/s   Diff
1:1     128, 128     59.5      0%
1:4     64, 256      58.0      -2.5%
1:8     32, 512      49.6      -16.6%
1:16    16, 1024     37.6      -36.8%

Table 5.2: Conv 1x1, image size 50x50

c1:c2   (c1, c2)     GFLOP/s   Diff
1:1     128, 128     54.2      0%
1:4     64, 256      52.6      -3%
1:8     32, 512      49.2      -9.2%
1:16    16, 1024     36.9      -31.9%

The measurements show that as the channel ratio increases, the computational performance decreases. This performance decrease starts to be significant for channel ratios larger than 8. The test confirms that using an unbalanced channel ratio reduces the performance on the NCS.

5.8 Comparison of efficiencies

As seen in the previous sections, different layers are executed at a different computational performance on the same hardware. Table 5.3 summarizes the highest achieved performance on the NCS for every layer type. The highest computational performance is achieved for the 1x1 and 3x3 convolutions. Full convolutions are almost 10× more efficient compared to depthwise and FC layers. This shows that the NCS is optimized for executing convolutions.

One could ask why depthwise convolutions and fully connected layers are less efficient on the NCS. One metric that wasn't looked at yet is the arithmetic intensity of the layer. The arithmetic intensity is defined as the number of FLOPs per byte of memory accessed. This is an important metric because the execution speed of a layer can be bottlenecked by the memory. The arithmetic intensity can be calculated by the following equation:

Arithmetic Intensity = FLOPs / MAC    (5.2)

The unit of the arithmetic intensity is FLOPs/byte. The lower the arithmetic intensity, the more memory bandwidth is needed to achieve the same execution speed in FLOPS. Table 5.4 shows the maximum achievable arithmetic intensity for the layers benchmarked in this chapter.

Table 5.3: Computational performance of different layers on the NCS.

Image size       5 × 5    10 × 10   15 × 15   25 × 25   50 × 50   100 × 100
Layer type       (GFLOPS)
FC               3.22     3.22      3.22      3.22      3.22      3.22
Depthconv 3x3    2.0      4.4       5.1       6.0       6.9       7.6
Conv 1x1         28.4     48.0      52.1      55.8      59.9      61.3
Conv 3x3         38.1     57.5      58.4      60.9      62.2      63.1

Table 5.4: Maximum arithmetic intensity of different layers on the NCS. FP16 is used so 1 memory access is 2 bytes.

Image size       5 × 5    10 × 10   15 × 15   25 × 25   50 × 50   100 × 100
Layer type       (FLOP/byte)
FC               1        1         1         1         1         1
Depthconv 3x3    3.6      4.1       4.2       4.2       4.2       4.2
Conv 1x1         25       100       225       625       2500      10 000
Conv 3x3         25       100       225       625       2500      10 000

Table 5.4 shows that the arithmetic intensity of a FC layer and a depthwise convolution are a lot lower than that of full convolutions. We suppose that this is the reason why those layers achieve a significantly lower performance on the NCS. As those layers are more memory intensive, it is harder to supply the compute units with enough data, which would mean that the bottleneck lies in the memory hierarchy architecture.
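As an illustration, the sketch below estimates the arithmetic intensity of a pointwise and a depthwise convolution with a simplified memory model (reading the input feature map and the weights, writing the output feature map, all in FP16). The exact values of table 5.4 follow the MAC definition of equation 2.12 and report the theoretical maximum, so the numbers differ slightly, but the conclusion is the same: depthwise convolutions are strongly memory bound.

    def arithmetic_intensity(flops, memory_elements, bytes_per_element=2):
        # FLOPs per byte of memory traffic; FP16 means 2 bytes per element.
        return flops / (memory_elements * bytes_per_element)

    h, w, c, k = 100, 100, 1024, 3

    # Pointwise (1x1) convolution: many FLOPs per byte moved.
    pw_flops = 2 * h * w * c * c
    pw_memory = h * w * c + c * c + h * w * c          # input + weights + output
    print("pointwise:", arithmetic_intensity(pw_flops, pw_memory))   # hundreds of FLOP/byte

    # Depthwise 3x3 convolution: similar memory traffic, far fewer FLOPs.
    dw_flops = 2 * h * w * c * k * k
    dw_memory = h * w * c + k * k * c + h * w * c
    print("depthwise:", arithmetic_intensity(dw_flops, dw_memory))   # ~4.5 FLOP/byte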

This shows that the inference time is not only related to the complexity of the layer measured in FLOPs. For example, depth separable convolutions have a much higher theoretical speedup than what is actually achieved when measuring the inference time. More intricate structures give the processor less opportunity to brute-force the computation, so the achieved efficiency is lower. So, when deciding on certain more computationally efficient layers, the effective speedup should be considered instead of the complexity reduction (in FLOPs), as the difference can be very large.

5.9 Channel count should be a multiple of 8

In the convolution benchmarks, the number of channels was always chosen to be a multiple of 8. Figure 5.11 shows the computational performance and inference time of a 3x3 convolutional layer, the same as chart 5.3b, but here the number of channels is incremented one by one. As can be seen in figure 5.11a, there are two clear inference time lines, a higher and a lower one. The lower line corresponds to all the channel counts that are a multiple of 8. The inference time of the upper line is between 2 and 10 times higher; it is clear that this is not the intended behavior. So, when designing a network, the number of channels should always be a multiple of 8 so that it can be executed in a reasonable amount of time on the NCS.
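When generating network architectures automatically, a trivial helper such as the one below (our own) can be used to enforce this constraint.

    def round_channels_up(channels, base=8):
        # Round a channel count up to the nearest multiple of `base` (8 for the NCS).
        return ((channels + base - 1) // base) * base

    assert round_channels_up(100) == 104
    assert round_channels_up(128) == 128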

Figure 5.11: Inference time (a) and computational performance (b) of a 3 × 3 convolution in function of the number of channels. The used image size is 25 × 25. Benchmark on the NCS.

6 Model benchmarks

In this chapter, we measure the inference time of different classification networks on the NCS, following the same benchmark procedure as described in chapter 5. All the benchmarked models are made for classification on ImageNet, so their accuracies are known and can be compared. Measuring the inference time of those models on the NCS allows selecting the models with the best accuracy-inference time trade-off. In the second part of this chapter, we further analyze some specific models and try to explain why they are faster or slower than others on the NCS. The benchmarks of the previous chapter will be used for this.


6.1 Models comparison

Figure 6.1 shows the top-1 accuracy of different models as a function of the FLOPs. It shows that accuracies above 75% come at a huge computational cost. MobileNetV2 achieves 75% accuracy for a computational cost of 1.2 GFLOPs; the next significant increase in accuracy is SE-ResNeXt-50 [21] with 79% and a cost of 8.4 GFLOPs. This is an increase of 4% accuracy for 7× the computational cost. If energy consumption and inference time are important in the target application (which is most likely when using the NCS), we would argue that MobileNetV2 with 75% accuracy is a much better trade-off.

The question remains whether a 7× higher complexity also results in a 7× higher inference time. Figure 6.2 shows the accuracy in function of the inference time on the NCS and CPU. Looking again at MobileNetV2 and SE-ResNeXt-50, MobileNetV2 is 6× faster on the NCS and 8.5× faster on the CPU, so the claim that MobileNetV2 is a better trade-off holds.

Another observation is that the Squeeze-and-Excitation networks [21] (SE-networks) are always a lot slower than the non-SE networks. For example, SE-ResNet-50 is 42% slower than ResNet-50 on the NCS and 46% slower on the CPU, although the difference in FLOPs is only 11%. This shows that some operations are more costly than others.

In the following sections, the smaller models will be analyzed more thoroughly, starting with MobileNet.

Figure 6.1: Top 1 accuracy on ImageNet in function of the number of FLOPs of the model. The exact numbers can be found in appendix A. The FLOPs were calculated from the network architecture generated by the OpenVino model optimizer, so the number of FLOPs can be slightly lower than those reported in the original papers because of the optimizations described in part 4.2.1. For the ResNets the FLOPs are significantly lower because of the ResNet stride optimization.

Figure 6.2: Top 1 accuracy on ImageNet in function of the inference time on NCS (a) and CPU (b). The exact numbers can be found in appendix A.

6.2 MobileNet V1 & V2

Figure 6.3 shows the same accuracy-FLOPs chart as before but only for the smaller networks. As claimed in the MobileNetV2 paper, the chart shows that MobileNetV2 is a no-compromise improvement over MobileNetV1. The inference time on the CPU (chart 6.4b) confirms the theoretical improvement by showing that for a desired accuracy MobileNetV2 is faster. But when looking at the inference time on the NCS (chart 6.4a), the inverse can be seen: for models with accuracies under 68%, MobileNetV1 is faster than V2, while for higher accuracies MobileNetV2 takes the lead again. As MobileNetV2 is slower for a lower complexity, this means that it is executed at a lower computational performance (FLOPS).

Figure 6.3: Top 1 accuracy on ImageNet in function of the number of FLOPs of the model. Zoomed in on MobileNet and ShuffleNet.

Figure 6.4: Top 1 accuracy on ImageNet in function of the inference time on NCS (a) and CPU (b). Zoomed in on MobileNet and ShuffleNet.

6.2.1 Why is MobileNetV2 slower than V1?

To answer this question, two models of equal complexity are selected for comparison:

• MobileNetV1-1.0-224

– Complexity: 1.14 GFLOPs
– Accuracy: 70.9%
– Inference time: 42.7 ms

• MobileNetV2-1.4-224

– Complexity: 1.16 GFLOPs
– Accuracy: 75%
– Inference time: 62.6 ms

Figure 6.5a shows the complexity per layer type for both selected models. The chart shows that the complexity is essentially in the pointwise convolutions and that both models have the same complexity for every layer type. As both models have the same complexity we would expect that they have the same inference time but this is not the case. Figure 6.5b shows the inference time of the two selected models by layer type. For every layer type MobileNetV2 has a higher inference time.

[Figure 6.5: bar charts of Complexity (GFLOP) (a) and Inference time (ms) (b) per layer type (Bias, ReLU6, Conv 1x1, DepthConv) for MobileNetV1-1.0-224 and MobileNetV2-1.4-224]

Figure 6.5: Complexity (a) and inference time (b) by layer type for MobileNetV1 & V2. The models have similar complexity. Results for the NCS.

To explain the higher inference time for the bias and ReLU layers, we can look back at the architectures of MobileNetV1 & V2, tables 3.4 and 3.3. MobileNetV2 has 54 bias and 37 ReLU layers, while MobileNetV1 only has 28 bias and 28 ReLU layers. This difference is not visible in the total model complexity, as the required operations for bias and ReLU are negligible compared to the FLOPs of a convolution. But as seen in the layer benchmarks in section 5.6, although the complexity is negligible, the inference time is not. For low complexities, a bias or ReLU layer can be treated as a roughly constant cost, so since MobileNetV2 has more bias and ReLU layers, its inference time is higher. Another factor working against MobileNetV2 is that the ReLU operation is executed on the expanded layers (see section 3.4.2), which means there are actually 6 times more output channels to process than shown in table 3.3.
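A rough calculation makes this concrete. The sketch below compares the FLOPs of a single pointwise convolution with those of a bias addition or ReLU over the same feature map; the layer dimensions are illustrative, not taken from the actual models.

    def pointwise_conv_flops(h, w, c_in, c_out):
        return 2 * h * w * c_in * c_out   # 1x1 convolution, one MAC counted as 2 FLOPs

    def elementwise_flops(h, w, c):
        return h * w * c                  # one operation per value (bias add or ReLU)

    h = w = 14
    c_in, c_out = 96, 576                 # illustrative expanded layer (expansion factor 6)
    conv = pointwise_conv_flops(h, w, c_in, c_out)
    elem = elementwise_flops(h, w, c_out)
    print(conv / 1e6, "MFLOPs for the convolution")
    print(100 * elem / conv, "% of that for the bias (and the same for ReLU6)")  # ~0.5%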

Chart 6.5b also shows that the time for the pointwise convolutions is longer on MobileNetV2, although the complexity of this layer type is equal in both models (figure 6.5a). There are two reasons why the pointwise convolutions are executed at a lower computational performance. The first is that MobileNetV2 has 34 pointwise convolutions while MobileNetV1 only has 13. This means that the average complexity of a single pointwise layer is smaller in MobileNetV2, and as seen in section 5.3, the higher the complexity of a layer, the higher the achieved computational performance.

The second reason is linked to the expansion and bottleneck mechanism used in MobileNetV2. As MobileNetV2 uses an expansion factor of 6, the expansion convolution has 6 times more output channels than input channels. The bottleneck layer is the inverse, with 6 times fewer output channels than input channels. As shown in section 5.7, using an unbalanced number of input and output channels decreases the arithmetic intensity and, as a consequence, decreases the performance.
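The effect can be estimated with a simple model of the arithmetic intensity of a 1×1 convolution (FLOPs divided by bytes moved). The sketch below assumes fp16 activations and weights and ignores any on-chip reuse or caching; the layer sizes are illustrative.

    def arithmetic_intensity(h, w, c_in, c_out, bytes_per_value=2):  # fp16 assumed
        flops = 2 * h * w * c_in * c_out                              # 1x1 convolution
        data = bytes_per_value * (h * w * c_in + h * w * c_out + c_in * c_out)
        return flops / data                                           # FLOPs per byte moved

    # An expansion convolution (factor 6) vs. a balanced layer with roughly the same FLOPs
    print(arithmetic_intensity(14, 14, 96, 576))   # unbalanced: ~58 FLOPs/byte
    print(arithmetic_intensity(14, 14, 235, 235))  # balanced:   ~73 FLOPs/byte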

6.2.2 Small models are less efficient

Chart 6.6 shows the computational performance of the MobileNet models as a function of the model complexity. By increasing the complexity of the same model type, a higher computational performance is achieved (the complexity is changed by varying the width and resolution multipliers).

One reason is that, as a model gets smaller, the complexity per layer is reduced and, as a consequence, so is the computational performance (see chapter 5).

But this is not the only reason. Figure 6.7 compares the inference time per layer type for a small MobileNetV1 model versus a large one. The larger model runs at 23.4 fps and the smaller model at 113.3 fps. The percentage of time spent on the bias and ReLU is 28.2% for the large model and 39.9% for the smaller model. So, the proportion of time spent on bias and ReLU is larger for smaller models. This is logical because the number of bias and ReLU operations stays the same, and as seen in section 5.6, for a small number of channels the inference time of the bias and ReLU can almost be seen as a constant. Another large part of the inference time for the smaller model is the receive-tensor layer, which is the time needed to load the image onto the NCS. As the model executes faster, this constant load time takes up a larger proportion of the total inference time.

As a result of all this, the convolution operation (which contains more than 90% of all operations) only takes 25% of the inference time. This shows that, if we want to go faster on the NCS, at a certain point the limit will be the overhead time, which is not visible when only looking at the complexity of the model.
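The per-layer breakdowns shown in figures 6.5, 6.7 and 6.8 can be obtained from the Inference Engine performance counters. The sketch below groups them by layer type; the config key and method names follow the 2019-era Python API and may differ in other releases, and the model file names are placeholders.

    from collections import defaultdict
    import numpy as np
    from openvino.inference_engine import IECore, IENetwork

    ie = IECore()
    net = IENetwork(model="mobilenet_v1_0.25_128.xml", weights="mobilenet_v1_0.25_128.bin")
    input_blob = next(iter(net.inputs))
    exec_net = ie.load_network(net, "MYRIAD", config={"PERF_COUNT": "YES"})
    exec_net.infer({input_blob: np.zeros(net.inputs[input_blob].shape, dtype=np.float32)})

    per_type = defaultdict(float)                                   # accumulated time per layer type
    for layer, stats in exec_net.requests[0].get_perf_counts().items():
        per_type[stats["layer_type"]] += stats["real_time"]         # microseconds

    total = sum(per_type.values())
    for layer_type, t in sorted(per_type.items(), key=lambda kv: -kv[1]):
        print(f"{layer_type:15s} {t / 1000:7.2f} ms  {100 * t / total:5.1f} %")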

[Figure 6.6: chart of Operations per second (GFLOPS) versus Complexity (GFLOP) for MobileNetV1 and MobileNetV2]

Figure 6.6: The computational performance of the model (FLOPs per second) as a function of the complexity of the model (FLOPs). Results for the NCS.

[Figure 6.7: pie charts of the inference time per layer type (Conv, ReLU6, Bias, Receive-Tensor, DepthConv, FC) for MobileNetV1-1.0-224 and MobileNetV1-0.25-128]

Figure 6.7: Proportion of the inference time taken by the different layer types in MobileNetV1, comparing a small network (right) with a larger network (left). Inference on the NCS.

6.3 ShuffleNet V1 & V2

Looking back at figure 6.3, it can be seen that, based on model complexity, ShuffleNet V1 & V2 should be at least as performant as MobileNet and sometimes even better. But when looking at the inference time on the NCS (chart 6.4a), the inference time is much longer than for the MobileNets. To explain why ShuffleNet is so much slower, we can look at the proportion of the inference time for every layer type, figure 6.8. A large part of the inference time is taken by the reshape operation. This operation is part of the implementation of the shuffle operation (remember that the shuffle operation is implemented by two reshape and one transposition operation, section 3.4.3). The transposition is noted as "permute" in the pie charts. This means that the shuffle operation takes 39.3% and 49.8% of the inference time on ShuffleNetV1 and V2 respectively. This is clearly the reason behind the high inference times.

On the other hand, on the CPU the reshape operation doesn't seem to be a problem, as the inference time is about the same as MobileNetV2 (chart 6.4b). This suggests that the problem on the NCS lies in an inefficient implementation of the reshape operation.
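To make the cost of the shuffle concrete, the sketch below reproduces the channel shuffle exactly as described in section 3.4.3: two reshapes around one transposition. NumPy is used here purely for illustration.

    import numpy as np

    def channel_shuffle(x, groups):
        # x has shape (N, C, H, W); the shuffle is two reshapes and one transposition
        n, c, h, w = x.shape
        x = x.reshape(n, groups, c // groups, h, w)   # reshape 1: split channels into groups
        x = x.transpose(0, 2, 1, 3, 4)                # transposition: interleave the groups
        return x.reshape(n, c, h, w)                  # reshape 2: flatten back to C channels

    x = np.arange(6).reshape(1, 6, 1, 1)
    print(channel_shuffle(x, groups=3).flatten())     # [0 2 4 1 3 5]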

[Figure 6.8: pie charts of the inference time per layer type (Conv, Reshape, Copy, Bias, Permute, Relu, Sum, DepthConv, Receive-Tensor, FC) for ShuffleNetV2-x1.0 and ShuffleNetV1-x1.0]

Figure 6.8: Proportion of the inference time for the different layer types of ShuffleNet V1 & V2. Inference on the NCS.

7 Conclusion

We show that, by programming the NCS with OpenVINO, low-level optimizations of a neural network are not possible. We are limited to a restricted set of supported layers. This limits the possible optimizations to designing more efficient network architectures. So, when using the NCS for a project, one should not expect to apply any significant low-level optimizations.

The layer benchmarks in chapter 5 allow us to give some guidelines for designing neural network architectures for the NCS:

• Layers with a low complexity almost always achieve a lower computational performance on the NCS, so the inference time will be lower for one large layer than for multiple small layers with the same total complexity.

• The NCS changes its way of computing convolutional layers at a certain channel count; depending on the feature map size, this increases or decreases the computational performance.

• The number of channels always has to be a multiple of 8, otherwise the execution efficiency on the NCS is extremely low; in the worst case it can be 10× slower (a small helper for this is sketched after this list).

• Depthwise separable convolutions achieve a reasonable speedup for higher complexities, while for lower complexities they are actually slower than full convolutions.


• Bias and ReLU operations cannot be neglected; their inference time becomes very significant for smaller models.

• Layers with a low arithmetic intensity (FC layers, depthwise convolutions, ...) are executed at a significantly lower computational performance.

• The reshape and transpose/permute operations should be avoided on the NCS, as those take a significant amount of time to execute.

• The ratio between the number of input and output channels should be kept as close to 1 as possible, as a more unbalanced ratio decreases the computational performance.
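As a small illustration of the channel guideline above, a helper like the following (a sketch, not part of any existing tool) can be used when sizing layers for the NCS:

    def round_channels(channels, multiple=8):
        # Round a channel count up to the nearest multiple of 8 for the NCS.
        return ((channels + multiple - 1) // multiple) * multiple

    print([round_channels(c) for c in (3, 24, 57, 100)])  # [8, 24, 64, 104]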

For the model benchmarks, we showed that MobileNet V1 & V2 are clearly the best models for accuracies lower than 75%. If an accuracy higher than 75% is desired, other models than MobileNet can be used, but those will be a lot slower.

We showed that, while MobileNetV2 should in theory always perform better than MobileNetV1, MobileNetV1 is a better choice than V2 for accuracies lower than 68%. The ShuffleNet models, which should theoretically compete with the MobileNets, are also not interesting on the NCS, as their inference time is slowed down by the reshape and transpose operations.

A conclusion independent of the NCS is that the FLOP count is not a good enough metric to evaluate efficient neural network architectures. The FLOPs only give a rough idea of the performance that should be achieved. So, when comparing models with equivalent FLOPs, the inference time should always be measured on the target hardware.

Bibliography

[1] R.R. Schaller. “Moore’s law: past, present and future”. In: IEEE Spectrum 34.6 (June 1997), pp. 52–59. issn: 0018-9235. doi: 10.1109/6.591665. url: http://ieeexplore.ieee.org/document/591665/ (visited on 05/29/2019).
[2] Olga Russakovsky et al. “ImageNet Large Scale Visual Recognition Challenge”. In: International Journal of Computer Vision 115.3 (Dec. 1, 2015), pp. 211–252. issn: 1573-1405. doi: 10.1007/s11263-015-0816-y. url: https://doi.org/10.1007/s11263-015-0816-y (visited on 05/25/2019).
[3] Andrew G. Howard et al. “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications”. In: (Apr. 17, 2017). url: https://arxiv.org/abs/1704.04861v1 (visited on 03/09/2019).
[4] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. “ImageNet classification with deep convolutional neural networks”. In: Communications of the ACM 60.6 (May 24, 2017), pp. 84–90. issn: 00010782. doi: 10.1145/3065386. url: http://dl.acm.org/citation.cfm?doid=3098997.3065386 (visited on 02/14/2019).
[5] Christian Szegedy et al. “Going deeper with convolutions”. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Boston, MA, USA: IEEE, June 2015, pp. 1–9. isbn: 978-1-4673-6964-0. doi: 10.1109/CVPR.2015.7298594. url: http://ieeexplore.ieee.org/document/7298594/ (visited on 05/29/2019).
[6] Kaiming He et al. “Deep residual learning for image recognition”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016, pp. 770–778.
[7] Weisong Shi et al. “Edge Computing: Vision and Challenges”. In: IEEE Internet of Things Journal 3.5 (Oct. 2016), pp. 637–646. issn: 2327-4662. doi: 10.1109/JIOT.2016.2579198. url: http://ieeexplore.ieee.org/document/7488250/ (visited on 03/01/2019).
[8] Norman P. Jouppi et al. “In-Datacenter Performance Analysis of a Tensor Processing Unit”. In: arXiv:1704.04760 [cs] (Apr. 16, 2017). arXiv: 1704.04760. url: http://arxiv.org/abs/1704.04760 (visited on 05/26/2019).


[9] Edge TPU - Run Inference at the Edge | Edge TPU. Google Cloud. url: https://cloud.google.com/edge-tpu/ (visited on 05/26/2019).
[10] Tesla says its new self-driving chip will help make its cars autonomous. MIT Technology Review. url: https://www.technologyreview.com/f/613403/tesla-says-its-new-self-driving-chip-will-help-make-its-cars-autonomous/ (visited on 05/26/2019).
[11] Nervana Neural Network Processor. Intel AI. url: https://www.intel.ai/nervana-nnp/ (visited on 05/26/2019).
[12] admin. Intel® Movidius™ Neural Compute Stick. url: https://software.intel.com/en-us/movidius-ncs (visited on 05/26/2019).
[13] ajolleyx. Intel® Neural Compute Stick 2. url: https://software.intel.com/en-us/neural-compute-stick (visited on 05/26/2019).
[14] Sharan Chetlur et al. “cuDNN: Efficient Primitives for Deep Learning”. In: arXiv:1410.0759 [cs] (Oct. 3, 2014). arXiv: 1410.0759. url: http://arxiv.org/abs/1410.0759 (visited on 05/26/2019).
[15] Michael Mathieu, Mikael Henaff, and Yann LeCun. “Fast Training of Convolutional Networks through FFTs”. In: arXiv:1312.5851 [cs] (Dec. 20, 2013). arXiv: 1312.5851. url: http://arxiv.org/abs/1312.5851 (visited on 05/26/2019).
[16] Andrew Lavin and Scott Gray. “Fast Algorithms for Convolutional Neural Networks”. In: arXiv:1509.09308 [cs] (Sept. 30, 2015). arXiv: 1509.09308. url: http://arxiv.org/abs/1509.09308 (visited on 05/26/2019).
[17] Philipp Gysel, Mohammad Motamedi, and Soheil Ghiasi. “Hardware-oriented Approximation of Convolutional Neural Networks”. In: arXiv:1604.03168 [cs] (Apr. 11, 2016). arXiv: 1604.03168. url: http://arxiv.org/abs/1604.03168 (visited on 05/27/2019).
[18] Matthieu Courbariaux et al. “Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1”. In: arXiv:1602.02830 [cs] (Feb. 8, 2016). arXiv: 1602.02830. url: http://arxiv.org/abs/1602.02830 (visited on 05/27/2019).
[19] Song Han et al. “Learning both Weights and Connections for Efficient Neural Networks”. In: arXiv:1506.02626 [cs] (June 8, 2015). arXiv: 1506.02626. url: http://arxiv.org/abs/1506.02626 (visited on 05/27/2019).
[20] François Chollet. “Xception: Deep learning with depthwise separable convolutions”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2017, pp. 1251–1258.
[21] Jie Hu et al. “Squeeze-and-Excitation Networks”. In: arXiv:1709.01507 [cs] (Sept. 5, 2017). arXiv: 1709.01507. url: http://arxiv.org/abs/1709.01507 (visited on 03/23/2019).

[22] Xiangyu Zhang et al. “ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices”. In: arXiv:1707.01083v2 [cs] (Dec. 7, 2017). url: http://arxiv.org/abs/1707.01083v2.
[23] Ningning Ma et al. “ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design”. In: arXiv:1807.11164v1 [cs] (July 30, 2018). url: http://arxiv.org/abs/1807.11164v1.
[24] Barret Zoph et al. “Learning Transferable Architectures for Scalable Image Recognition”. In: arXiv:1707.07012 [cs, stat] (July 21, 2017). arXiv: 1707.07012. url: http://arxiv.org/abs/1707.07012 (visited on 03/31/2019).
[25] Zheng Qin et al. “FD-MobileNet: Improved MobileNet with a Fast Downsampling Strategy”. In: arXiv:1802.03750v1 [cs] (Feb. 11, 2018). url: http://arxiv.org/abs/1802.03750v1.
[26] Mark Sandler et al. “MobileNetV2: Inverted Residuals and Linear Bottlenecks”. In: arXiv:1801.04381 [cs] (Jan. 12, 2018). arXiv: 1801.04381. url: http://arxiv.org/abs/1801.04381 (visited on 03/09/2019).
[27] Saining Xie et al. “Aggregated Residual Transformations for Deep Neural Networks”. In: arXiv:1611.05431 [cs] (Nov. 16, 2016). arXiv: 1611.05431. url: http://arxiv.org/abs/1611.05431 (visited on 03/22/2019).
[28] Movidius + Intel = Vision for the Future of Autonomous Devices | Machine Vision Technology | Movidius. url: https://web.archive.org/web/20190522090101/https://www.movidius.com/news/ceo-post-september-2016 (visited on 05/22/2019).
[29] B. Barry et al. “Always-on Vision Processing Unit for Mobile Applications”. In: IEEE Micro 35.2 (Mar. 2015), pp. 56–66. issn: 0272-1732. doi: 10.1109/MM.2015.10.
[30] hotchipsvideos. HC28-S8: Dealing with Big Data. Mar. 15, 2017. url: https://youtu.be/Vk1Wr5hwCpQ?t=1375 (visited on 05/28/2019).
[31] Caffe | Deep Learning Framework. url: https://caffe.berkeleyvision.org/ (visited on 05/22/2019).
[32] TensorFlow. TensorFlow. url: https://www.tensorflow.org/ (visited on 05/22/2019).
[33] Model Optimizer Developer Guide - OpenVINO Toolkit. url: https://docs.openvinotoolkit.org/2019_R1.1/_docs_MO_DG_Deep_Learning_Model_Optimizer_DevGuide.html (visited on 05/29/2019).
[34] Sergey Ioffe and Christian Szegedy. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”. In: arXiv:1502.03167 [cs] (Feb. 10, 2015). arXiv: 1502.03167. url: http://arxiv.org/abs/1502.03167 (visited on 02/03/2019).
[35] Quick Start — Netscope CNN Analyzer. url: https://dgschwend.github.io/netscope/quickstart.html (visited on 05/23/2019).

[36] Deep Learning Deployment Toolkit. Contribute to opencv/dldt development by creating an account on GitHub. original-date: 2018-10-15T10:54:40Z. May 23, 2019. url: https://github.com/opencv/dldt (visited on 05/24/2019).
[37] Optimization Guide - OpenVINO Toolkit. url: https://docs.openvinotoolkit.org/2019_R1.1/_docs_optimization_guide_dldt_optimization_guide.html#myriad (visited on 05/24/2019).
[38] Supported Devices - OpenVINO Toolkit. url: https://docs.openvinotoolkit.org/2019_R1.1/_docs_IE_DG_supported_plugins_Supported_Devices.html#supported_layers (visited on 05/25/2019).
[39] albanie/convnet-burden: Memory consumption and FLOP count estimates for convnets. url: https://github.com/albanie/convnet-burden (visited on 06/06/2019).

Appendices

A Model benchmark results

Table A.1: ShuffleNet V1 and V2 model benchmarks: complexity, top 1 accuracy on ImageNet, inference time on NCS, inference time on CPU. Accuracy from the original papers [23, 22].

Model name          Complexity (MFLOPs)   Top 1 acc   NCS (ms)   CPU (ms)
ShuffleNetV1×1             254.1             67.6       106.3       8.3
ShuffleNetV2×0.5            81.7             60.3        43.7       2.8
ShuffleNetV2×1.0           282.5             69.4       136.4       7.8
ShuffleNetV2×1.5           574.0             72.6       131.3      11.2
ShuffleNetV2×2.0           976.5             74.9       168.3      16.5


Table A.2: MobileNetV2 model benchmarks: complexity, top 1 accuracy on ImageNet, inference time on NCS, inference time on CPU. Accuracy source: https://github.com/tensorflow/models/tree/master/research/slim/nets/mobilenet

Model name              Complexity (MFLOPs)   Top 1 acc   NCS (ms)   CPU (ms)
mobilenet_v2_0.35_128           40.4             50.8       14.3       1.3
mobilenet_v2_0.35_160           61.6             55.7       16.1       1.6
mobilenet_v2_0.35_192           87.6             58.2       18.0       2.1
mobilenet_v2_0.35_224          118.3             60.3       20.9       2.8
mobilenet_v2_0.35_96            23.8             45.5       13.3       1.0
mobilenet_v2_0.5_128            65.1             57.7       16.5       1.8
mobilenet_v2_0.5_160           100.2             61.0       18.8       2.3
mobilenet_v2_0.5_192           143.2             63.9       21.4       3.1
mobilenet_v2_0.5_224           194.0             65.4       25.3       4.0
mobilenet_v2_0.5_96             37.7             51.2       14.7       1.2
mobilenet_v2_0.75_128          138.1             63.2       19.5       3.0
mobilenet_v2_0.75_160          214.3             66.4       23.9       3.8
mobilenet_v2_0.75_192          307.5             68.7       28.7       5.0
mobilenet_v2_0.75_224          417.6             69.8       35.4       7.2
mobilenet_v2_0.75_96            78.8             58.8       16.7       2.2
mobilenet_v2_1.0_128           198.0             65.3       21.8       3.6
mobilenet_v2_1.0_160           307.9             68.8       27.4       6.0
mobilenet_v2_1.0_192           442.2             70.7       33.7       7.0
mobilenet_v2_1.0_224           601.0             71.8       42.0       9.7
mobilenet_v2_1.0_96            112.5             60.3       18.2       2.7
mobilenet_v2_1.3_224          1018.0             74.4       57.8      16.2
mobilenet_v2_1.4_224          1163.6             75.0       62.6      17.1

Table A.3: MobileNetV1 model benchmarks: complexity, top 1 accuracy on ImageNet, inference time on NCS, inference time on CPU. Accuracy source: https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet_v1.md

Model name              Complexity (MFLOPs)   Top 1 acc   NCS (ms)   CPU (ms)
mobilenet_v1_0.25_128           27.1             41.5        8.8       0.6
mobilenet_v1_0.25_160           42.1             45.5        9.5       0.9
mobilenet_v1_0.25_192           60.4             47.7       10.3       1.3
mobilenet_v1_0.25_224           82.1             49.8       11.3       1.9
mobilenet_v1_0.5_128            98.3             56.3       10.4       2.1
mobilenet_v1_0.5_160           153.0             59.1       12.7       2.8
mobilenet_v1_0.5_192           219.9             61.7       15.3       3.4
mobilenet_v1_0.5_224           299.0             63.3       18.7       4.3
mobilenet_v1_0.75_128          213.5             62.1       14.0       3.3
mobilenet_v1_0.75_160          332.8             65.3       18.0       5.1
mobilenet_v1_0.75_192          478.5             67.2       22.7       6.2
mobilenet_v1_0.75_224          650.8             68.4       28.7      10.0
mobilenet_v1_1.0_128           372.8             65.2       19.0       6.0
mobilenet_v1_1.0_160           581.4             68         25.6       8.0
mobilenet_v1_1.0_192           836.2             70         33.2      10.8
mobilenet_v1_1.0_224          1137.5             70.9       42.7      15.4

Table A.4: Benchmarks of the larger models: complexity, top 1 accuracy on ImageNet, inference time on NCS, inference time on CPU, and accuracy source.

Model name        Complexity (GFLOPs)   Top 1 acc   NCS (ms)   CPU (ms)   Source
AlexNet                  1.45              58.2        77.9       34.3     [39]
Squeezenet 1.0           1.72              58.1        65.6       20.9     [39]
Squeezenet 1.1           0.78              58.2        30.4        9.6     [39]
VGG-16                  31.0               73         711.9      347.5     [21]
Inception V2             4.0               74.6       132.5       47.3     [21]
ResNet-50                7.0               75.3       189.8       80.4     [21]
ResNet-101              14.4               76.4       365.8      169.2     [21]
ResNet-152              21.8               77         552.6      244.3     [21]
SE-Inception V4          4.1               75.8       150.3       53.7     [21]
SE-ResNet-50             7.7               77.6       270.3      117.6     [21]
SE-ResNet-101           15.2               78.3       479.7      240.2     [21]
SE-ResNet-152           22.6               78.7       710.6      317.9     [21]
SE-ResNeXt-50            8.5               79.0       317.1      145.5     [21]
SE-ResNeXt-101          16.0               80.2       641.1      260.2     [21]