Deep learning on embedded and desktop systems with Intel® OpenVINO™

Marco Valdes, Università di Bologna

Corso di Sistemi Digitali M, Stefano Mattoccia, Università di Bologna

Intel® OpenVINO™ Toolkit

Develop applications and solutions that emulate human vision. Based on convolutional neural networks (CNNs), the toolkit extends workloads across Intel® hardware (including accelerators) and maximizes performance [1]

OpenVINO stands for Open Visual Inference and Neural Network Optimization

Goal: optimize CNNs for Intel architectures

Target hardware:

● CPUs
● Integrated graphics (GPUs)
● FPGAs
● Vision Processing Units (VPUs)
● Image Processing Units (IPUs)
● Vision Accelerator Design products

https://software.intel.com/en-us/openvino-toolkit/hardware

Training and inference

The deployment of a typical learning-based system requires two phases:

1. Training: the network is fed with training data to minimize the error on the validation set
2. Inference: once trained, the network is deployed and run on new data

Nonetheless, there are some notable exceptions, such as self-adaptive networks (i.e., training is carried out at inference time, directly from the input data)

With OpenVINO we always start from a pre-trained model

OpenVINO: how does it work?

OpenVINO: Model Optimizer

● Converts models from various frameworks (e.g., TensorFlow, Caffe, MXNet)
● Converts them to a unified model, the Intermediate Representation (IR)
● Optimizes topologies, but not for a specific target device
● Folds constant paths in the graph
● It is a Python application

OpenVINO: Inference Engine

● API for inference across all Intel® architectures
● Allows optimized inference on most Intel hardware targets
● Heterogeneity support allows execution of layers across hardware types
● Asynchronous execution improves performance

OpenVINO: Inference Engine workflow

● Load model and weights
● Load the inference plugin (CPU, graphics accelerator, FPGA, Myriad 2/X)
● Load the network into the plugin
● Allocate input and output buffers
● Fill the input buffer with data
● Run inference
● Interpret the output results
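As a reference, the minimal sketch below maps each step of the workflow above onto the Python API calls used later in the PyDnet case study; the model path, the device name, and the random input are placeholders.

# Minimal sketch of the Inference Engine workflow; paths and shapes are placeholders.
import numpy as np
from openvino.inference_engine import IECore, IENetwork

ie = IECore()

# 1. Load model and weights (IR produced by the Model Optimizer)
net = IENetwork(model="model.xml", weights="model.bin")

# 2./3. Load the network on the chosen plugin ("CPU", "GPU", "MYRIAD", ...)
exec_net = ie.load_network(network=net, device_name="CPU")

# 4./5. Allocate and fill the input buffer with the shape expected by the network
input_blob = next(iter(net.inputs))
n, c, h, w = net.inputs[input_blob].shape
data = np.random.rand(n, c, h, w).astype(np.float32)

# 6. Run inference (infer() is synchronous)
results = exec_net.infer(inputs={input_blob: data})

# 7. Interpret the output results
output_blob = next(iter(net.outputs))
print(results[output_blob].shape)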

Neural Compute Stick

● Neural network accelerator in a USB stick form factor
● TensorFlow and Caffe frameworks supported
● Extended to work with the OpenVINO Toolkit with the FP16 data type
● Features the same Intel® Movidius™ VPU used in drones, VR headsets, and other low-power intelligent and autonomous products

Optimizations with Model Optimizer

Model Optimizer provides several methods to accelerate inference that do not require retraining the model:

● Linear Operations Fusing
● Grouped Convolution Fusing (specific to some models)
● Stride optimization (specific to Caffe models)
● Cutting off parts of the model

Fusing CNN Layers [3]

Conventional Approach:

● iteratively compute each layer and save its temporary results to memory
● stream the saved results back in as the input of the next layer

Optimized Approach:

● fuse two or more layers
● only the input feature map of the first fused layer is transferred from RAM
● all intermediate values of the fused layers are computed on chip
● only the output feature map of the last fused layer is written back to RAM
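The sketch below (plain NumPy, not OpenVINO code) illustrates the idea on a toy example: two element-wise functions stand in for real CNN layers, and a row tile stands in for the on-chip buffer.

# Illustrative sketch of layer fusing [3]; the layer functions and tile size are made up.
import numpy as np

def layer1(x):           # e.g., a linear transformation
    return 0.5 * x + 1.0

def layer2(x):           # e.g., a ReLU activation
    return np.maximum(x, 0.0)

fmap = np.random.randn(256, 512).astype(np.float32)

# Conventional approach: layer1 runs over the whole feature map and its result
# is materialized in memory before layer2 starts.
tmp = layer1(fmap)               # full intermediate feature map stored
out_conventional = layer2(tmp)

# Fused approach: process one tile at a time through both layers, so only
# tile-sized intermediate buffers are needed.
out_fused = np.empty_like(fmap)
tile = 64
for r in range(0, fmap.shape[0], tile):
    out_fused[r:r + tile] = layer2(layer1(fmap[r:r + tile]))

assert np.allclose(out_conventional, out_fused)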

Optimization strategies: Model Optimizer

[Figures: Linear Operations Fusing and Grouped Convolution Fusing]
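As an example of Linear Operations Fusing, the NumPy sketch below shows how a per-channel linear layer (e.g., a BatchNorm/Scale reduced to y = gamma·x + beta) following a convolution can be folded offline into the convolution's weights and bias; as a simplifying assumption, a 1x1 convolution is modelled as a matrix multiply over channels.

# Sketch of Linear Operations Fusing: fold a per-channel scale/shift into the conv.
import numpy as np

c_in, c_out, n_pix = 8, 16, 1024
x = np.random.randn(c_in, n_pix).astype(np.float32)

W = np.random.randn(c_out, c_in).astype(np.float32)   # conv weights (1x1 conv as matmul)
b = np.random.randn(c_out).astype(np.float32)         # conv bias
gamma = np.random.randn(c_out).astype(np.float32)     # scale of the linear op
beta = np.random.randn(c_out).astype(np.float32)      # shift of the linear op

# Two separate layers: convolution, then per-channel linear operation
y_two_layers = gamma[:, None] * (W @ x + b[:, None]) + beta[:, None]

# Fused layer: the scale/shift are folded into new weights and bias
W_fused = gamma[:, None] * W
b_fused = gamma * b + beta
y_fused = W_fused @ x + b_fused[:, None]

assert np.allclose(y_two_layers, y_fused, atol=1e-4)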

Optimization strategies: Inference Engine 1/3

For each supported device, the Inference Engine API applies different optimizations:

Internal CPU plugin optimizations:

● Merging of group convolutions
● Fusing of Convolution with ReLU or ELU: the CPU plugin fuses a ReLU or ELU layer into the Convolution layer that precedes it
● Removing the Power layer: the CPU plugin removes a Power layer from the topology if its parameters are power = 1, scale = 1, offset = 0
● Fusing Convolution + Sum or Convolution + Sum + ReLU
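For example, assuming the usual definition of the Power layer, y = (offset + scale·x)^power, those parameter values make it an identity, which is why it can be dropped:

# Quick check that y = (offset + scale * x) ** power with power=1, scale=1, offset=0
# is an identity, hence removable.
import numpy as np

def power_layer(x, power=1.0, scale=1.0, offset=0.0):
    return (offset + scale * x) ** power

x = np.random.randn(4, 4).astype(np.float32)
assert np.allclose(power_layer(x), x)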

Optimization strategies: Inference Engine 2/3

[Figure: Fusing Convolution + Sum or Convolution + Sum + ReLU]

Optimization strategies: Inference Engine 3/3

The GPU plugin uses the Intel® Compute Library for Deep Neural Networks (clDNN). clDNN is an open-source performance library for Deep Learning (DL) applications, intended to accelerate Deep Learning inference on Intel® Processor Graphics, including Intel® HD Graphics and Intel® Iris® Graphics. The GPU plugin enables the following specific optimizations:

● Fused layers:
  ■ Convolution - Activation
  ■ Deconvolution - Activation
  ■ Eltwise - Activation
  ■ Fully Connected - Activation
● Layers optimized out when conditions allow:
  ■ Crop
  ■ Concatenate
  ■ Reshape
  ■ Flatten
  ■ Split
  ■ Copy

Data type representation

Floating point can represent a wide range of numbers.

In the case of FP32, we use 32 bits: 1 bit for the sign, 8 bits for the exponent, and 23 bits for the fractional part
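A quick way to look at this bit layout from Python (not part of the toolkit):

# Print the sign, exponent and fraction bits of an FP32 number.
import struct

value = -6.25
bits = struct.unpack('>I', struct.pack('>f', value))[0]   # raw 32-bit pattern
bit_string = f'{bits:032b}'

sign, exponent, fraction = bit_string[0], bit_string[1:9], bit_string[9:]
print(sign, exponent, fraction)   # 1 10000001 10010000000000000000000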

Data type representation 2

Most of the data lies in the INT8 range, so we may think of improving performance by using only 8 bits; however, this choice introduces an accuracy loss, so a trade-off must be found
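A minimal sketch of this trade-off with plain NumPy; the symmetric quantization scheme and the scale choice below are illustrative assumptions, not the calibration procedure used by OpenVINO.

# Quantize FP32 data to INT8 and back, and measure the accuracy loss.
import numpy as np

x = np.random.randn(1000).astype(np.float32)      # FP32 data

scale = np.abs(x).max() / 127.0                    # symmetric 8-bit range [-127, 127]
x_int8 = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
x_dequant = x_int8.astype(np.float32) * scale      # values actually used at inference

print("max absolute error:", np.abs(x - x_dequant).max())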

Moving to the INT8 data type

Requirements:
● Intel platforms that support at least one instruction set from the following list:
  ○ Intel AVX-512
  ○ Intel AVX2
  ○ Intel SSE4.2
● The model must contain at least one activation layer of ReLU type
● Object-detection and classification models

Supported layers:
● Convolution
● FullyConnected (AVX-512 only)
● ReLU
● Pooling
● Eltwise
● Concat
● Resample

Case study: monocular depth estimation with PyDnet

● PyDnet [4] is a lightweight deep network for monocular depth perception

● Developed with TensorFlow

● Compared to most other networks, it delivers real-time performance on standard CPUs

● Suited for embedded systems

Case study: freezing the model 1/2

Goal: create an inference graph file

Steps:

● Identify the output nodes of the graph
● Load and set up the TensorFlow graph (from a checkpoint)
● All trainable parameters are represented as variables in the graph
● Use the TensorFlow freeze_graph utility to convert variables into constants

Case study: freezing the model 2/2

…
output_nodes = ['model/resize_images/ResizeBilinear',
                'model/resize_images_1/ResizeBilinear',
                'model/resize_images_1/ResizeBilinear']
…
tf.train.write_graph(sess.graph_def, "frozen_models", 'pydnet.pbtxt')

graph_pbtxt = os.path.join("frozen_model", 'pydnet.pbtxt')
graph_path = os.path.join("frozen_model", 'pydnet.ckpt')

outputs = output_nodes[0]
for name in output_nodes[1:]:
    outputs += ',' + name

frozen_graph_path = os.path.join("frozen_models", 'frozen_pydnet.pb')
freeze_graph.freeze_graph(graph_pbtxt, '', False, graph_path, outputs,
                          'save/restore_all', 'save/Const:0',
                          frozen_graph_path, True, '')

Case study: run the model optimizer 1/2

Once the model is frozen, run the Model Optimizer Python app:

python3 '/opt/intel//deployment_tools/model_optimizer/mo_tf.py' \
    --input_model '/path/to/model/frozen_pydnet.pb' \
    --model_name "IRPydnet" \
    --data_type (FP32,FP16) \
    --output 'model/resize_images/ResizeBilinear','model/resize_images/ResizeBilinear','model/resize_images/ResizeBilinear' \
    --log_level=DEBUG

Case study: run the model optimizer 2/2

The Model Optimizer allows performing general-purpose (i.e., agnostic to the target architecture) optimizations: setting parameters, cutting parts of the model, changing the data type, or inserting custom layer definitions for the input model

In a few seconds it produces the Intermediate Representation of the model, generating three files used by the Inference Engine:

1. IRPydnet.xml
2. IRPydnet.bin
3. IRPydnet.mapping
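As a quick sanity check, the .xml topology file can be inspected with the standard library; this is a sketch that assumes the IR XML layout with a <layers> element containing one <layer> entry per node.

# Sketch: list the layer names and types in the generated IR topology file.
import xml.etree.ElementTree as ET

root = ET.parse("IRPydnet.xml").getroot()
for layer in root.find("layers"):
    print(layer.get("name"), "->", layer.get("type"))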

Case study: input and output blobs 1/2

● In the OpenVINO terminology, a blob is the binary input or output of a network
● It simply consists of a NumPy tensor

For PyDnet:

● The input blob of the network is a single (batch size N=1) RGB (C=3 channels) image of size 256x512 (HxW)
● The output blob is a single image of size 256x512 (HxW) with C=2 channels (depth is encoded with 16 bits)

Case study: input and output blobs 2/2

[Diagram: input blob (C=3, HxW=256x512) → PyDnet → output blob (C=2, HxW=256x512)]

Case study: inference engine 1/2

● Consists in creating a Python wrapper that uses the Inference Engine API to perform inference with the specific model deployed
● The input and output blobs must be processed according to a specific axis order
● For PyDnet, the input blob must have shape [1,3,256,512] (NCHW), and the output blob, of shape [1,2,256,512] (NCHW), has to be transposed into [256,512,2] (HWC) form
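A sketch of this shape handling with OpenCV and NumPy (the file name is a placeholder and PyDnet-specific normalization is omitted):

# Shape handling for the PyDnet blobs.
import cv2
import numpy as np

# Input: read an HxWxC image and turn it into a [1, 3, 256, 512] NCHW blob
img = cv2.imread("image.jpg")                   # H x W x C (BGR)
img = cv2.resize(img, (512, 256))               # -> 256 x 512 x 3
blob = img.transpose((2, 0, 1))[np.newaxis]     # HWC -> CHW, add batch dim -> NCHW
print(blob.shape)                               # (1, 3, 256, 512)

# Output: a [1, 2, 256, 512] NCHW blob back to HWC for post-processing
out = np.zeros((1, 2, 256, 512), dtype=np.float32)   # stands in for the network output
out_hwc = out[0].transpose((1, 2, 0))                # CHW -> HWC
print(out_hwc.shape)                                 # (256, 512, 2)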

Case study: inference engine 2/2

model_xml = "/home/marco/Scaricati/pydnet-master/pydnet-master/IRPydnet.xml"
model_bin = "/home/marco/Scaricati/pydnet-master/pydnet-master/IRPydnet.bin"

ie = IECore()
net = IENetwork(model=model_xml, weights=model_bin)
# Needed only for the Interp layer on CPU
ie.add_extension("/opt/intel/openvino/inference_engine/lib/intel64/libcpu_extension_sse4.so", "CPU")

input_blob = next(iter(net.inputs))
n, c, h, w = net.inputs[input_blob].shape

# Load the specific plugin for the device that you want to test
# exec_net = ie.load_network(network=net, device_name="CPU")
# exec_net = ie.load_network(network=net, device_name="GPU")
exec_net = ie.load_network(network=net, device_name="MYRIAD")

…
# load image with cv2, resize it and preprocess data
img = img.transpose((2, 0, 1))   # HWC -> CHW
…

res = exec_net.infer(inputs={input_blob: img})
out = res['model/resize_images/ResizeBilinear']   # select the output
…
# postprocess and visualize data

Case study: experimental results

The network provides results at three resolutions: Half (H), Quarter (Q), and Eighth (E)

[Figure: qualitative PyDnet results at Q and E resolutions]

Case study: performance evaluation (H res)

● CPU: Intel Core i7-7500U @ 2.7 GHz (power consumption: ~15 W)

● Graphics accelerator: Intel HD Graphics 620 (power consumption: ~15 W)

● Myriad 2: Intel Movidius Neural Compute Stick 1 (power consumption: ~0.5 W)

OpenVINO on embedded systems using Intel’s Movidius Neural Compute Stick

Low power consumption is indispensable for autonomous/unmanned vehicles and IoT (Internet of Things) devices and appliances. In order to develop deep-learning inference applications, we can use Intel’s Movidius USB stick [2]

Deep networks on a Raspberry Pi with the Movidius Stick and OpenVINO

● The OpenVINO toolkit for Raspbian OS includes the Inference Engine and the MYRIAD plugin for the Intel Movidius Neural Compute Stick
● The OpenVINO toolkit for Raspbian OS is an archive with pre-installed header files and libraries
● The following components are installed by default:
  ○ Inference Engine
  ○ OpenCV
● The Model Optimizer is not included; an already converted model is needed

Requirements

Hardware

● Raspberry Pi board with an ARMv7-A CPU and an Intel Movidius VPU

Operating Systems

● Raspbian Buster, 32-bit or Raspbian Stretch, 32-bit

Software

● CMake 3.7.2 or higher and Python 3.5 (32-bit)

Case study: PyDnet

The workflow is the same:

● Load the FP16 model and weights
● Select the plugin (with the Raspberry Pi, the only option is MYRIAD)
● Preprocess the input blob
● Run the Inference Engine
● Postprocess the output blob
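The MYRIAD-specific part reduces to a few lines; the sketch below (IR paths are placeholders) also prints the devices seen by the Inference Engine, assuming the available_devices property of IECore.

# Sketch: the same PyDnet wrapper on a Raspberry Pi with the Movidius stick.
from openvino.inference_engine import IECore, IENetwork

ie = IECore()
print(ie.available_devices)    # should list "MYRIAD" when the stick is plugged in

net = IENetwork(model="IRPydnet.xml", weights="IRPydnet.bin")    # FP16 IR
exec_net = ie.load_network(network=net, device_name="MYRIAD")    # only plugin on the Pi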

Case study: performance evaluation on a Raspberry Pi with the Movidius stick

● Raspberry Pi 3: armv7l Cortex-A53 @ 1.4 GHz

● Myriad 2: Intel Movidius Neural Compute Stick 1 (power consumption: ~0.5 W)

The Inference Engine on the Movidius NCS1 runs in combination with the CPU: CPU usage is about 30% during inference

Conclusions

● OpenVINO is a new and effective framework to optimize deployment of CNNs on Intel architectures

● A single framework for multiple architectures, including embedded devices

● Current limitations:
  ○ Support for custom networks (e.g., PyDnet) should be improved
  ○ Some layers are not supported at all, in particular with the Intel Movidius stick

References

[1] Intel® Distribution of OpenVINO™ Toolkit, https://software.intel.com/en-us/openvino-toolkit

[2] Intel® Movidius™ VPUs, https://www.movidius.com/

[3] M. Alwani, H. Chen, M. Ferdman, P. Milder, “Fused-Layer CNN Accelerators”, MICRO 2016

[4] M. Poggi, F. Aleotti, F. Tosi, S. Mattoccia, “Towards real-time unsupervised monocular depth estimation on CPU”, IROS 2018