The OpenVX Standard for Portable, Efficient Vision Processing

Frank Brill Radha Giduthuri Tom e r Schwartz Cadence AMD Intel September 2017 © Copyright 2017 - Page 1 Khronos Connects Software to Silicon

Industry Consortium creating OPEN STANDARD for hardware acceleration Any company is welcome – one company one vote

ROYALTY-FREE specifications Conformance Tests and Adopters State-of-the art IP framework protects Software Programs for specification integrity members AND the standards and cross-vendor portability

Low-level silicon APIs needed on International, non-profit organization every platform: graphics, parallel Membership and Adopters fees cover compute, vision, neural networks, Silicon operating and engineering expenses augmented and virtual reality…

Strong industry momentum 100s of man years invested by industry experts Well over a BILLION people use Khronos APIs Every Day…

© Copyright Khronos Group 2017 - Page 2 Khronos Open Standards

3D for the Web - VR/AR and games in-browser - Efficiently delivering runtime 3D assets Real-time 2D/3D - Virtual and Augmented Reality displays - Cross-platform gaming and UI - CAD and Product Design

Vision and Neural Networks - Tracking and odometry - Scene analysis/understanding - Neural Network inferencing Parallel Computation - Machine Learning acceleration - Embedded vision processing - High Performance Computing (HPC)

© Copyright Khronos Group 2017 - Page 3 OpenVX Vision Engines Wide range of vision hardware architectures Middleware OpenVX provides a high-level Graph-based abstraction -> Applications Enables Graph-level optimizations! Can be implemented on almost any hardware or processor! -> Portable, Efficient Vision Processing! Software GPU Portability DSP Hardware X100 Dedicated Hardware Vision DSPs X10 Vision GPU Node Compute Vision Vision Node Node Multi-core CNN CPU Nodes Power Efficiency X1 Computation Flexibility Vision Processing Graph

© Copyright Khronos Group 2017 - Page 4 Neural Net Workflow with Khronos Standards

Neural Network Training Third Vision/AI Party Applications Frameworks Tools

Datasets Trained Vision and Neural Network Network Inferencing Runtime Network Model Structure

Desktop and Cloud Hardware Embedded/Mobile Vision/Inferencing Embedded/Mobile Vision/Inferencing Embedded/MobileHardware Vision/Inferencing Embedded/Mobile/HardwareDesktop/Cloud Hardware OpenVX is an open, royalty-free standard for cross platform acceleration Vision/Inferencing Hardware of vision and neural network applications. GPU DSP CPU

NNEF is a Cross-vendor Neural Net file format Custom FPGA Encapsulates network formal semantics, structure, data formats, commonly-used operations (such as convolution, pooling, normalization, etc.) © Copyright Khronos Group 2017 - Page 5 Problem: Neural Network Fragmentation

Neural Network Inferencing Fragmentation toll on Applications

Inference Engine 1 Hardware 1 Vision/AI Hardware 2 Application Inference Engine 2 Hardware 3 Every Application Needs know Inference Engine 3 about Every Accelerator API

Neural Network Training and Inferencing Fragmentation

NN Authoring Framework 1 Inference Engine 1

NN Authoring Framework 2 Inference Engine 2

NN Authoring Framework 3 Inference Engine 3 Every Tool Needs an Exporter to Every Accelerator

© Copyright Khronos Group 2017 - Page 6 Solving Neural Network Fragmentation

Neural Network Inferencing Fragmentation toll on Applications

Inference Engine 1 Hardware 1 Vision/AI Hardware 2 Application Inference Engine 2 Inference Engine 3 Hardware 3

Neural Network Training and Inferencing Fragmentation

NN Authoring Framework 1 Inference Engine 1

NN Authoring Framework 2 Inference Engine 2

NN Authoring Framework 3 Inference Engine 3

Optimization and processing tools

© Copyright Khronos Group 2017 - Page 7 OpenVX in a nut-shell

• OpenVX contains - a set of memory objects that abstract the physical memory - a library of predefined and customizable vision and neural network functions Image processing and image filtering Feature detectors Neural networks Control flow - a graph-based execution model to combine function enabling both task and data-independent execution

• OpenVX is defined as a C API - object-oriented design - synchronous and asynchronous execution model - extend functionality using enum and callback functions

© Copyright Khronos Group 2017 - Page 8 OpenVX - Graph-Level Abstraction • OpenVX developers express a graph of image operations (‘Nodes’) - Using a C API • Nodes can be executed on any hardware or processor coded in any language - Implementers can optimize under the high-level graph abstraction • Graphs are the key to run-time power and performance optimizations - E.g. Node fusion, tiled graph processing for cache efficiency etc.

Pyr Camera OpenVX Nodes t Rendering Input Output

RGB YUV Gray Array of Frame Color Frame Channel Frame Image Optical Keypoints Harris Conversion Extract Pyramid Flow Track

Array of Features Ftr OpenVX Graph t-1

Feature Extraction Example Graph

© Copyright Khronos Group 2017 - Page 9 OpenVX Efficiency through Graphs..

Graph Memory Kernel Data Scheduling Management Fusion Tiling

Split the graph Execute a sub- Reuse execution across Replace a sub- graph at tile pre-allocated the whole graph with a granularity memory for system: CPU / single faster node instead of image multiple GPU / dedicated granularity intermediate data HW

Faster execution Less allocation overhead, Better memory Better use of or lower power more memory for locality, less kernel data cache and consumption other applications launch overhead local memory

© Copyright Khronos Group 2017 - Page 10 Simple Edge Detector in OpenVX

vx_graph g = vxCreateGraph();

vx_image input = vxCreateImage(1920, 1080); Declare Input and Output Images vx_image output = vxCreateImage(1920, 1080); vx_image horiz = vxCreateVirtualImage(g); Declare Intermediate Images vx_image vert = vxCreateVirtualImage(g); vx_image mag = vxCreateVirtualImage(g);

vxSobel3x3Node(g, input, horiz, vert); Construct the Graph topology vxMagnitudeNode(g, horiz, vert, mag); vxThresholdNode(g, mag, THRESH, output); Compile the Graph Execute the Graph

status = vxVerifyGraph(g); h status = vxProcessGraph(g); i S M m T o v

© Copyright Khronos Group 2017 - Page 11 OpenVX Evolution

Conformant Conformant Implementations Implementations

New Functionality Conditional node execution Feature detection Classification operators Expanded imaging operations New Functionality Extensions Under Discussion Neural Network Acceleration New Functionality NNEF Import Expanded Nodes Functionality Graph Save and Restore 16-bit image operation Enhanced Graph Framework Programmable user AMD OpenVX Tools kernels with accelerator offload - Open , highly optimized Safety Critical for x86 CPU and OpenCL for GPU OpenVX 1.1 SC for - “Graph Optimizer” looks at Streaming/pipelining safety-certifiable systems entire processing pipeline and removes, replaces, merges functions to improve performance OpenVX and bandwidth Roadmap - Scripting for rapid prototyping, OpenVX 1.2 without re-compiling, at production performance levels Spec released May 2017 http://gpuopen.com/compute-product/amd-openvx/ OpenVX 1.0 OpenVX 1.1 Spec released October 2014 Spec released May 2016

© Copyright Khronos Group 2017 - Page 12 OpenVX SC - Safety Critical Vision Processing • OpenVX 1.1 - based on OpenVX 1.1 main specification - Enhanced determinism - Specification identifies and numbers requirements • MISRA C clean per KlocWorks v10 • Divides functionality into “development” and “deployment” feature sets - Adds requirement to support import/export extension

OpenVX SC Verify Binary Import OpenVX SC Development Feature format Deployment Feature Set Set (Create Graph) Export (Execute Graph)

Entire graph creation API Implementation No graph creation API dependent format

© Copyright Khronos Group 2017 - Page 13 Benefits of Import-Export

• Minimize overheads on the target system - Faster setup-time - Simpler OpenVX runtime - Simpler certification (ISO 26262, etc..)

• Enable more graph optimizations when offline - No timing/resource constraints - Example: generate & compile specialized code for graphs

© Copyright Khronos Group 2017 - Page 14 How OpenVX Compares to Alternatives

Open standard API designed to be Community-driven, Open standard API designed to be Governance implemented and shipped by IHVs open source library implemented and shipped by IHVs

Programming Graph defined with C API and then Immediate runtime function calls – Explicit kernels are compiled and Model compiled for run-time execution reading to and from memory executed via run-time API

Built-in Vision Small but growing Vast. None. User programs their own or Functionality set of popular functions Mainly on PC/CPU call vision library over OpenCL

Target Any combination of processors or Any heterogeneous combination of Mainly PCs and GPUs Hardware non-programmable hardware IEEE FP-capable processors

Optimization Pre-declared graph enables Each function reads/writes memory. Any execution topology can be Opportunities significant optimizations Power performance inefficient explicitly programmed

Implementations must pass Extensive Test Suite but Implementations must pass Conformance conformance to use trademark no formal Adopters program conformance to use trademark

All core functions must be available Available functions vary depending on All core functions must be available Consistency in conformant implementations implementation / platform in all conformant implementations

© Copyright Khronos Group 2017 - Page 15 Layered Vision/ Neural Net Ecosystem

Implementers may use OpenCL or Vulkan to implement OpenVX nodes on programmable processors OpenVX enables the graph to be extended to include hardware architectures that don’t support programmable APIs

Application Vulkan Roadmap Enhanced compute – especially useful for vision and inferencing on mobile and Android platforms

OpenCL Roadmap Flexible precision for Programmable Vision Dedicated Vision widespread deployment on low- cost embedded processors Processors Hardware

And then implementors can use OpenVX to enable a developer to easily connect those nodes into a graph

The OpenVX graph abstraction enables implementers to optimize execution across diverse hardware architectures for optimal power and performance

© Copyright Khronos Group 2017 - Page 16 Quantization and new features in OpenVX 1.2

© Copyright Khronos Group 2017 - Page 17 Importance of Quantization • OpenVX support 8-bit by default and 16-bit by extension - Why?

• Quantization is very attractive in low power embedded system. - Less BW then float (Faster and less power) - Need less internal memory (Significant reduction in silicon area) - HW is smaller (Int8 and Int16 HW units are smaller then float)

• Quantization drawbacks - Less accurate - Degradation in performance quality (Some applications prefer the degradation if significant better performance is achieved)

© Copyright Khronos Group 2017 - Page 18 Neural Network Quantization • Q78 was the first trivial quantization

• UINT8 quantization was introduced by with GEMMLOWP. And it is in TensorFlow and used by their TPU card ( , ( )) - � = �����(� ∗ 2 ) 0 255 ���� ����

• INT8 with DFP (Dynamic Fixed Point) Is used be a tool called ristretto ( , ( )) - � = �����(� ∗ 2 ) Min_X Max_X -128 127

© Copyright Khronos Group 2017 - Page 19 Other Quantizations in 1.2 • HOG Features are 16 bit • Tensors are 8-bit and 16-bit only • Classification extension input is Tensor and therefore accept only 8-bits or 16 bits. • All Image processing operations are 8-bit or as 16-bit extension • Template Matching input is 8-bit. Output is 16-bit

© Copyright Khronos Group 2017 - Page 20 New OpenVX 1.2 Functions • Feature detection: find features useful for object detection and recognition - Histogram of gradients – HOG • Template matching - Local binary patterns – LBP • Line finding – Hough lines • Classification: detect and recognize objects in an image based on a set of features - Import a classifier model trained offline - Classify objects based on a set of input features • Image Processing: transform an image - Generalized nonlinear filter: Dilate, erode, median with arbitrary kernel shapes - Non maximum suppression: Find local maximum values in an image - Edge-preserving noise reduction – Bilateral Filter • Conditional execution & node predication - Selectively execute portions of a graph based on a true/false predicate A Condition • Many, many minor improvements • New Extensions B C - Import/export: compile a graph; save and run later - 16-bit support: signed 16-bit image data S - Neural networks: Layers are represented as OpenVX nodes If A then S ← B else S ← C

© Copyright Khronos Group 2017 - Page 21 Feature Detectors • Histogram of gradients – HOG - Technique which counts occurrences of gradient orientation in localized portions of an image • Local binary patterns – LBP - A binary pattern that describe that describe the image texture. • Hough Lines – - Find lines in a picture. used in line departure warning algorithms.

© Copyright Khronos Group 2017 - Page 22 Template Matching • a technique in digital image processing for finding small parts of an image which match a template image - Allow very strong denoising - Tracking - Object detection

© Copyright Khronos Group 2017 - Page 23 Classifier Extension (Detection and classification) • Classification is the problem of identifying to which of a set of categories (sub- populations) a new observation belongs. • Feature detection refers to methods that aim at computing abstractions of image information and making local decisions at every image point whether there is an image feature of a given type at that point or not • This extension do both. Possible methods: - SVM - Cascade (Decision Tree)

© Copyright Khronos Group 2017 - Page 24 OpenVX 1.2 and Neural Net Extension • Convolution Neural Network topologies can be represented as OpenVX graphs - Layers are represented as OpenVX nodes - Layers connected by multi-dimensional tensors objects - Layer types include convolution, activation, pooling, fully-connected, soft-max - CNN nodes can be mixed with traditional vision nodes - Only inference is supported • Import/Export Extension - Efficient handling of network Weights/Biases or complete networks • OpenVX will be able to import NNEF files into OpenVX Neural Nets

Vision Node Native Vision Vision Downstream Camera Application Node Node Control CNN Nodes Processing

An OpenVX graph mixing CNN nodes with traditional vision nodes

© Copyright Khronos Group 2017 - Page 25 Networks we are sure as supported Overfeat Alexnet GoogLeNet versions (Inception) LSTM RNN/BiRNN Faster-RCNN FCN

© Copyright Khronos Group 2017 - Page 26 OpenVX Neural Network Extension • Two main parts: (1) a tensor object and (2) a set of CNN layer nodes • A vx_tensor is a multi-dimensional array that supports at least 4 dimensions

• Tensor creation and deletion functions • Simple math for tensors - Element-wise Add, Subtract, Multiply, TableLookup, and Bit-depth conversion - Transposition of dimensions and generalized matrix multiplication - vxCopyTensorPatch, vxQueryTensor (#dims, dims, element type, Q)

© Copyright Khronos Group 2017 - Page 27 OpenVX Neural Network Extension • Tensor types of INT16, INT7.8, INT8, and U8 are supported - Other types may be supported by a vendor - Algorithms for quantization are needed for conversion from trained float models to INT16, INT7.8, INT8, and U8. - Extension is split to 2 variants. Full 16 and 8 bit support. And 8 bit only support. • Conformance tests will be up to some “tolerance” in precision - To allow for optimizations, e.g., weight compression • Eight neural network “layer” nodes:

vxActivationLayer vxConvolutionLayer vxDeconvolutionLayer

vxFullyConnectedLayer vxNormalizationLayer vxPoolingLayer

vxSoftmaxLayer vxROIPoolingLayer …

© Copyright Khronos Group 2017 - Page 28 An Example of Convolution Neural Network optimizations

Conv Activation Pooling Conv Activation Pooling Conv Activation Conv Activation Layer Layer Layer Layer Layer Layer Layer Layer Layer Layer

Conv Activation Pooling Layer Layer Layer Fully Fully Fully Activ. Activ. Activation Connected Connected Connected Layer Layer Layer Layer Layer Layer Conv Activation Pooling Layer Layer Layer

Conv Activation Pooling Conv Activation Pooling Conv Activation Conv Activation Layer Layer Layer Layer Layer Layer Layer Layer Layer Layer

© Copyright Khronos Group 2017 - Page 29 OpenVX Benefits and Resources • Faster development of efficient and portable vision applications - Developers are protected from hardware complexities - No platform-specific performance optimizations needed • Graph description enables significant automatic optimizations - Scheduling, memory management, kernel fusion, and tiling • Performance portability to diverse hardware - Hardware agility for different use case requirements - Application software investment is protected as hardware evolves • OpenVX Resources - OpenVX Overview - https://www.khronos.org/openvx - OpenVX Specifications: current, previous, and extensions - https://www.khronos.org/registry/OpenVX - OpenVX Resources: implementations, tutorials, reference guides, forum, etc. - https://www.khronos.org/openvx/resources - https://forums.khronos.org © Copyright Khronos Group 2017 - Page 30