Deep learning on embedded and desktop systems with Intel® OpenVINO™

Marco Valdes, Università di Bologna

Corso di Sistemi Digitali M, Stefano Mattoccia, Università di Bologna

Intel® OpenVINO™ Toolkit

Develop applications and solutions that emulate human vision. Based on convolutional neural networks (CNNs), the toolkit extends workloads across Intel® hardware (including accelerators) and maximizes performance [1]

OpenVINO stands for Open Visual Inference and Neural Network Optimization

Goal: optimize CNNs for Intel architectures

Target hardware:

● CPUs
● Integrated graphics (GPUs)
● FPGAs
● Vision Processing Units (VPUs)
● Image Processing Units (IPUs)
● Vision Accelerator Design products

https://software.intel.com/en-us/openvino-toolkit/hardware

Training and inference

The deployment of a typical learning-based system requires two phases:

1. Training: the network is fed with training data to minimize the error on the validation set
2. Inference: once trained, the network is deployed and run on new data

Nonetheless, there are some notable exceptions, such as self-adaptive networks (i.e., training is carried out at inference time, directly from the input data)

With OpenVINO we always start from a pre-trained model

OpenVINO: how does it work?

OpenVINO: Model Optimizer

● Converts models from various frameworks (e.g., TensorFlow, Caffe, MXNet)
● Converts them to a unified model, the Intermediate Representation (IR)
● Optimizes topologies, but not for a specific target device
● Folds constant paths in the graph
● It is a Python application

OpenVINO: Inference Engine

● API for inference across all Intel® architectures
● Allows optimized inference on most Intel hardware targets
● Heterogeneity support allows execution of layers across hardware types
● Asynchronous execution improves performance

OpenVINO: Inference Engine workflow

● Load model and weights
● Load the inference plugin (CPU, graphics accelerator, FPGA, Myriad 2/X)
● Load the network into the plugin
● Allocate input and output buffers
● Fill the input buffer with data
● Run inference
● Interpret the output results
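As a reference, the minimal sketch below maps each step of the workflow above onto the Python API calls used later in the PyDnet case study; the model path, the device name, and the random input are placeholders.

# Minimal sketch of the Inference Engine workflow; paths and shapes are placeholders.
import numpy as np
from openvino.inference_engine import IECore, IENetwork

ie = IECore()

# 1. Load model and weights (IR produced by the Model Optimizer)
net = IENetwork(model="model.xml", weights="model.bin")

# 2./3. Load the network on the chosen plugin ("CPU", "GPU", "MYRIAD", ...)
exec_net = ie.load_network(network=net, device_name="CPU")

# 4./5. Allocate and fill the input buffer with the shape expected by the network
input_blob = next(iter(net.inputs))
n, c, h, w = net.inputs[input_blob].shape
data = np.random.rand(n, c, h, w).astype(np.float32)

# 6. Run inference (infer() is synchronous)
results = exec_net.infer(inputs={input_blob: data})

# 7. Interpret the output results
output_blob = next(iter(net.outputs))
print(results[output_blob].shape)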

Neural Compute Stick

● Neural network accelerator in a USB stick form factor
● TensorFlow and Caffe frameworks supported
● Extended to work with the OpenVINO Toolkit with the FP16 data type
● Features the same Intel® Movidius™ VPU used in drones, VR headsets, and other low-power intelligent and autonomous products

Optimizations with Model Optimizer

Model Optimizer provides several methods to accelerate inference that do not require retraining the model:

● Linear Operations Fusing
● Grouped Convolution Fusing (specific to some models)
● Stride optimization (specific to Caffe models)
● Cutting off parts of the model

Fusing CNN Layers [3]

Conventional Approach:

● iteratively compute each layer and save its temporary results to memory
● stream the saved results back in as the input of the next layer

Optimized Approach:

● fuse two or more layers
● only the input feature map of the first fused layer is transferred from RAM
● all intermediate values of the fused layers are computed on chip
● only the output feature map of the last fused layer is written back to RAM
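The sketch below (plain NumPy, not OpenVINO code) illustrates the idea on a toy example: two element-wise functions stand in for real CNN layers, and a row tile stands in for the on-chip buffer.

# Illustrative sketch of layer fusing [3]; the layer functions and tile size are made up.
import numpy as np

def layer1(x):           # e.g., a linear transformation
    return 0.5 * x + 1.0

def layer2(x):           # e.g., a ReLU activation
    return np.maximum(x, 0.0)

fmap = np.random.randn(256, 512).astype(np.float32)

# Conventional approach: layer1 runs over the whole feature map and its result
# is materialized in memory before layer2 starts.
tmp = layer1(fmap)               # full intermediate feature map stored
out_conventional = layer2(tmp)

# Fused approach: process one tile at a time through both layers, so only
# tile-sized intermediate buffers are needed.
out_fused = np.empty_like(fmap)
tile = 64
for r in range(0, fmap.shape[0], tile):
    out_fused[r:r + tile] = layer2(layer1(fmap[r:r + tile]))

assert np.allclose(out_conventional, out_fused)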

Optimization strategies: Model Optimizer

[Figures: Linear Operations Fusing and Grouped Convolution Fusing]
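As an example of Linear Operations Fusing, the NumPy sketch below shows how a per-channel linear layer (e.g., a BatchNorm/Scale reduced to y = gamma·x + beta) following a convolution can be folded offline into the convolution's weights and bias; as a simplifying assumption, a 1x1 convolution is modelled as a matrix multiply over channels.

# Sketch of Linear Operations Fusing: fold a per-channel scale/shift into the conv.
import numpy as np

c_in, c_out, n_pix = 8, 16, 1024
x = np.random.randn(c_in, n_pix).astype(np.float32)

W = np.random.randn(c_out, c_in).astype(np.float32)   # conv weights (1x1 conv as matmul)
b = np.random.randn(c_out).astype(np.float32)         # conv bias
gamma = np.random.randn(c_out).astype(np.float32)     # scale of the linear op
beta = np.random.randn(c_out).astype(np.float32)      # shift of the linear op

# Two separate layers: convolution, then per-channel linear operation
y_two_layers = gamma[:, None] * (W @ x + b[:, None]) + beta[:, None]

# Fused layer: the scale/shift are folded into new weights and bias
W_fused = gamma[:, None] * W
b_fused = gamma * b + beta
y_fused = W_fused @ x + b_fused[:, None]

assert np.allclose(y_two_layers, y_fused, atol=1e-4)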

Optimization strategies: Inference Engine 1/3

For each supported device, the Inference Engine API applies different optimizations:

Internal CPU plugin optimizations:

● Merging of group convolutions
● Fusing of Convolution with ReLU or ELU: the CPU plugin fuses a ReLU or ELU layer into the Convolution layer that precedes it
● Removing the Power layer: the CPU plugin removes a Power layer from the topology if its parameters are power = 1, scale = 1, offset = 0
● Fusing Convolution + Sum or Convolution + Sum + ReLU
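For example, assuming the usual definition of the Power layer, y = (offset + scale·x)^power, those parameter values make it an identity, which is why it can be dropped:

# Quick check that y = (offset + scale * x) ** power with power=1, scale=1, offset=0
# is an identity, hence removable.
import numpy as np

def power_layer(x, power=1.0, scale=1.0, offset=0.0):
    return (offset + scale * x) ** power

x = np.random.randn(4, 4).astype(np.float32)
assert np.allclose(power_layer(x), x)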

Optimization strategies: Inference Engine 2/3

[Figure: Fusing Convolution + Sum or Convolution + Sum + ReLU]

Optimization strategies: Inference Engine 3/3

The GPU plugin uses the Intel® Compute Library for Deep Neural Networks (clDNN). clDNN is an open-source performance library for Deep Learning (DL) applications, intended to accelerate Deep Learning inference on Intel® Processor Graphics, including Intel® HD Graphics and Intel® Iris® Graphics. The GPU plugin enables the following specific optimizations:

● Fused layers:
  ■ Convolution - Activation
  ■ Deconvolution - Activation
  ■ Eltwise - Activation
  ■ Fully Connected - Activation
● Layers optimized out when conditions allow:
  ■ Crop
  ■ Concatenate
  ■ Reshape
  ■ Flatten
  ■ Split
  ■ Copy

Data type representation

Floating point can represent a wide range of numbers.

In the case of FP32, we use 32 bits: 1 bit for the sign, 8 bits for the exponent, and 23 bits for the fractional part
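A quick way to look at this bit layout from Python (not part of the toolkit):

# Print the sign, exponent and fraction bits of an FP32 number.
import struct

value = -6.25
bits = struct.unpack('>I', struct.pack('>f', value))[0]   # raw 32-bit pattern
bit_string = f'{bits:032b}'

sign, exponent, fraction = bit_string[0], bit_string[1:9], bit_string[9:]
print(sign, exponent, fraction)   # 1 10000001 10010000000000000000000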

Data type representation 2

Most of the data lies in the INT8 range, so we may think of improving performance by using only 8 bits; however, this choice introduces an accuracy loss, so a trade-off must be found
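A minimal sketch of this trade-off with plain NumPy; the symmetric quantization scheme and the scale choice below are illustrative assumptions, not the calibration procedure used by OpenVINO.

# Quantize FP32 data to INT8 and back, and measure the accuracy loss.
import numpy as np

x = np.random.randn(1000).astype(np.float32)      # FP32 data

scale = np.abs(x).max() / 127.0                    # symmetric 8-bit range [-127, 127]
x_int8 = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
x_dequant = x_int8.astype(np.float32) * scale      # values actually used at inference

print("max absolute error:", np.abs(x - x_dequant).max())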

Moving to the INT8 data type

Requirements:
● Intel platforms that support at least one instruction set from the following list:
  ○ Intel AVX-512
  ○ Intel AVX2
  ○ Intel SSE4.2
● The model must contain at least one activation layer of ReLU type
● Object-detection and classification models

Supported layers:
● Convolution
● FullyConnected (AVX-512 only)
● ReLU
● Pooling
● Eltwise
● Concat
● Resample

Case study: monocular depth estimation with PyDnet

● PyDnet [4] is a lightweight deep network for monocular depth perception

● Developed with TensorFlow

● Compared to most other networks, it delivers real-time performance on standard CPUs

● Suited for embedded systems

Case study: freezing the model 1/2

Goal: create an inference graph file

Steps:

● Identify the output nodes of the graph
● Load and set up the TensorFlow graph (from a checkpoint)
● All trainable parameters are represented as variables in the graph
● Use the TensorFlow freeze_graph utility to convert variables into constants

Case study: freezing the model 2/2

…
output_nodes = ['model/resize_images/ResizeBilinear',
                'model/resize_images_1/ResizeBilinear',
                'model/resize_images_1/ResizeBilinear']
…
tf.train.write_graph(sess.graph_def, "frozen_models", 'pydnet.pbtxt')

graph_pbtxt = os.path.join("frozen_model", 'pydnet.pbtxt')
graph_path = os.path.join("frozen_model", 'pydnet.ckpt')

outputs = output_nodes[0]
for name in output_nodes[1:]:
    outputs += ',' + name

frozen_graph_path = os.path.join("frozen_models", 'frozen_pydnet.pb')
freeze_graph.freeze_graph(graph_pbtxt, '', False, graph_path, outputs,
                          'save/restore_all', 'save/Const:0',
                          frozen_graph_path, True, '')

Case study: run the model optimizer 1/2

Once the model is frozen, run the Model Optimizer Python app:

python3 '/opt/intel//deployment_tools/model_optimizer/mo_tf.py' \
    --input_model '/path/to/model/frozen_pydnet.pb' \
    --model_name "IRPydnet" \
    --data_type (FP32,FP16) \
    --output 'model/resize_images/ResizeBilinear','model/resize_images/ResizeBilinear','model/resize_images/ResizeBilinear' \
    --log_level=DEBUG

Case study: run the model optimizer 2/2

The Model Optimizer allows performing general-purpose (i.e., agnostic to the target architecture) optimizations: setting parameters, cutting parts of the model, changing the data type, or inserting custom layer definitions for the input model

In a few seconds it produces the Intermediate Representation of the model, generating three files used by the Inference Engine:

1. IRPydnet.xml
2. IRPydnet.bin
3. IRPydnet.mapping
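As a quick sanity check, the .xml topology file can be inspected with the standard library; this is a sketch that assumes the IR XML layout with a <layers> element containing one <layer> entry per node.

# Sketch: list the layer names and types in the generated IR topology file.
import xml.etree.ElementTree as ET

root = ET.parse("IRPydnet.xml").getroot()
for layer in root.find("layers"):
    print(layer.get("name"), "->", layer.get("type"))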

Case study: input and output blobs 1/2

● In the OpenVINO terminology, a blob is the binary input or output of a network
● It simply consists of a NumPy tensor

For PyDnet:

● The input blob of the network is a single (batch size N=1) RGB (C=3 channels) image of size 256x512 (HxW)
● The output blob is a single image of size 256x512 (HxW) with C=2 channels (depth is encoded with 16 bits)

Case study: input and output blobs 2/2

[Diagram: input blob (C=3, HxW=256x512) → PyDnet → output blob (C=2, HxW=256x512)]

Case study: inference engine 1/2

● Consists in creating a Python wrapper that uses the Inference Engine API to perform inference with the specific model deployed
● The input and output blobs must be processed according to a specific axis order
● For PyDnet, the input blob must have shape [1,3,256,512] (NCHW), and the output blob, of shape [1,2,256,512] (NCHW), has to be transposed into [256,512,2] (HWC) form
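A sketch of this shape handling with OpenCV and NumPy (the file name is a placeholder and PyDnet-specific normalization is omitted):

# Shape handling for the PyDnet blobs.
import cv2
import numpy as np

# Input: read an HxWxC image and turn it into a [1, 3, 256, 512] NCHW blob
img = cv2.imread("image.jpg")                   # H x W x C (BGR)
img = cv2.resize(img, (512, 256))               # -> 256 x 512 x 3
blob = img.transpose((2, 0, 1))[np.newaxis]     # HWC -> CHW, add batch dim -> NCHW
print(blob.shape)                               # (1, 3, 256, 512)

# Output: a [1, 2, 256, 512] NCHW blob back to HWC for post-processing
out = np.zeros((1, 2, 256, 512), dtype=np.float32)   # stands in for the network output
out_hwc = out[0].transpose((1, 2, 0))                # CHW -> HWC
print(out_hwc.shape)                                 # (256, 512, 2)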

Case study: inference engine 2/2

model_xml = "/home/marco/Scaricati/pydnet-master/pydnet-master/IRPydnet.xml"
model_bin = "/home/marco/Scaricati/pydnet-master/pydnet-master/IRPydnet.bin"

ie = IECore()
net = IENetwork(model=model_xml, weights=model_bin)
# Needed only for the Interp layer on CPU
ie.add_extension("/opt/intel/openvino/inference_engine/lib/intel64/libcpu_extension_sse4.so", "CPU")

input_blob = next(iter(net.inputs))
n, c, h, w = net.inputs[input_blob].shape

# Load the specific plugin for the device that you want to test
# exec_net = ie.load_network(network=net, device_name="CPU")
# exec_net = ie.load_network(network=net, device_name="GPU")
exec_net = ie.load_network(network=net, device_name="MYRIAD")

…
# load image with cv2, resize it and preprocess data
img = img.transpose((2, 0, 1))   # HWC -> CHW
…

res = exec_net.infer(inputs={input_blob: img})
out = res['model/resize_images/ResizeBilinear']   # select the output
…
# postprocess and visualize data

Case study: experimental results

The network provides results at three resolutions: Half (H), Quarter (Q), and Eighth (E)

[Figure: qualitative PyDnet results at Q and E resolutions]

Case study: performance evaluation (H res)

● CPU: Intel Core i7-7500U @ 2.7 GHz (power consumption: ~15 W)

● Graphics accelerator: Intel HD Graphics 620 (power consumption: ~15 W)

● Myriad 2: Intel Movidius Neural Compute Stick 1 (power consumption: ~0.5 W)

OpenVINO on embedded systems using Intel’s Movidius Neural Compute Stick

Low power consumption is indispensable for autonomous/unmanned vehicles and IoT (Internet of Things) devices and appliances. In order to develop deep-learning inference applications, we can use Intel’s Movidius USB stick [2]

Deep networks on a Raspberry Pi with the Movidius Stick and OpenVINO

● The OpenVINO toolkit for Raspbian OS includes the Inference Engine and the MYRIAD plugin for the Intel Movidius Neural Compute Stick
● The OpenVINO toolkit for Raspbian OS is an archive with pre-installed header files and libraries
● The following components are installed by default:
  ○ Inference Engine
  ○ OpenCV
● The Model Optimizer is not included; an already converted model is needed

Requirements

Hardware

● Raspberry Pi board with an ARMv7-A CPU and an Intel Movidius VPU

Operating Systems

● Raspbian Buster, 32-bit or Raspbian Stretch, 32-bit

Software

● CMake 3.7.2 or higher and Python 3.5 (32-bit)

Case study: PyDnet

The workflow is the same:

● Load the FP16 model and weights
● Select the plugin (with the Raspberry Pi, the only option is MYRIAD)
● Preprocess the input blob
● Run the Inference Engine
● Postprocess the output blob
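The MYRIAD-specific part reduces to a few lines; the sketch below (IR paths are placeholders) also prints the devices seen by the Inference Engine, assuming the available_devices property of IECore.

# Sketch: the same PyDnet wrapper on a Raspberry Pi with the Movidius stick.
from openvino.inference_engine import IECore, IENetwork

ie = IECore()
print(ie.available_devices)    # should list "MYRIAD" when the stick is plugged in

net = IENetwork(model="IRPydnet.xml", weights="IRPydnet.bin")    # FP16 IR
exec_net = ie.load_network(network=net, device_name="MYRIAD")    # only plugin on the Pi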

Case study: performance evaluation on a Raspberry Pi with the Movidius stick

● Raspberry Pi 3: armv7l Cortex-A53 @ 1.4 GHz

● Myriad 2: Intel Movidius Neural Compute Stick 1 (power consumption: ~0.5 W)

The Inference Engine on the Movidius NCS1 runs in combination with the CPU: CPU usage is about 30% during inference

Conclusions

● OpenVINO is a new and effective framework to optimize deployment of CNNs on Intel architectures

● A single framework for multiple architectures, including embedded devices

● Current limitations:
  ○ Support for custom networks (e.g., PyDnet) should be improved
  ○ Some layers are not supported at all, in particular with the Intel Movidius stick

References

[1] Intel® Distribution of OpenVINO™ Toolkit, https://software.intel.com/en-us/openvino-toolkit

[2] Intel® Movidius™ VPUs, https://www.movidius.com/

[3] M. Alwani, H. Chen, M. Ferdman, P. Milder, “Fused-Layer CNN Accelerators”, MICRO 2016

[4] M. Poggi, F. Aleotti, F. Tosi, S. Mattoccia, “Towards real-time unsupervised monocular depth estimation on CPU”, IROS 2018