for Accelerating Vision and Inferencing An Overview of Options and Trade-offs Neil Trevett Khronos President VP Developer Ecosystems [email protected] | @neilt3d

This work is licensed under a Creative Commons Attribution 4.0 International License © The Khronos® Group Inc. 2019 - Page 1 Khronos is an open, non-profit, member-driven industry consortium developing royalty-free standards, and vibrant ecosystems, to harness the power of silicon acceleration for demanding graphics rendering and computationally intensive applications such as inferencing and vision processing

>150 Members ~ 40% US, 30% Europe, 30% Asia

Some Khronos Standards Relevant to Embedded Vision and Inferencing

Vision and Neural Network Heterogeneous Parallel ++ Parallel GPU Inferencing Runtime Exchange Format Parallel Compute Compute IR Compute Acceleration

This work is licensed under a Creative Commons Attribution 4.0 International License © The Khronos® Group Inc. 2019 - Page 2 Embedded Vision and Inferencing Acceleration

Networks trained on Neural Network high-end desktop Training Data and cloud systems Training

Trained Networks Compilation Ingestion Apps link to compiled code or Compiled Inferencing Application Vision inferencing library Code Engine Code Library

Hardware Acceleration APIs Sensor Data

Diverse Embedded DSP Hardware Dedicated (GPUs, DSPs, FPGAs) FPGA Hardware GPU

This work is licensed under a Creative Commons Attribution 4.0 International License © The Khronos® Group Inc. 2019 - Page 3 Neural Network Training Acceleration

Neural Network Training Frameworks

Neural Net Training Network training needs 32- Neural Network Neural Net Training bit floating-point precision NeuralNeural NetFrameworks Net Training Training Authoring Interchange Frameworks Training Data FrameworksFrameworks Neural Network cuDNN MIOpen clDNN Acceleration Libraries Desktop and Cloud Hardware

Hardware Acceleration APIs TPU

This work is licensed under a Creative Commons Attribution 4.0 International License © The Khronos® Group Inc. 2019 - Page 4 Embedded Vision and Inferencing Acceleration

Networks trained on Neural Network high-end desktop Training Data and cloud systems Training

Trained Networks Compilation Ingestion Apps link to compiled code or Compiled Inferencing Application Vision inferencing library Code Engine Code Library

Hardware Acceleration APIs Sensor Data

Diverse Embedded DSP Hardware Dedicated (GPUs, DSPs, FPGAs) FPGA Hardware GPU

This work is licensed under a Creative Commons Attribution 4.0 International License © The Khronos® Group Inc. 2019 - Page 5 Neural Network Exchange Formats

Before - Training and Inferencing Fragmentation After - NN Training and Inferencing Interoperability

Training Framework 1 Inference Engine 1 Training Framework 1 Neural Inference Engine 1 Network Training Framework 2 Inference Engine 2 Training Framework 2 Inference Engine 2 Exchange Format Training Framework 3 Inference Engine 3 Training Framework 3 Inference Engine 3 Every Inferencing Engine needs a Common Optimization custom importer from every Framework and processing tools

Two Neural Network Exchange Format Initiatives

ONNX and NNEF Embedded Inferencing Import Training Interchange are Complementary ONNX moves quickly to track authoring Defined Specification Open Project framework updates NNEF provides a stable bridge from Multi-company Governance at Khronos Initiated by Facebook & Microsoft training into edge inferencing engines

Stability for hardware deployment Software stack flexibility

This work is licensed under a Creative Commons Attribution 4.0 International License © The Khronos® Group Inc. 2019 - Page 6 NNEF and ONNX Industry Support

NNEF V1.0 released in August 2018 ONNX 1.6 Released in September 2019 After positive industry feedback on Provisional Specification. Introduced support for Quantization Maintenance update issued in September 2019 ONNX Runtime being integrated with GPU inferencing Extensions to V1.0 released for expanded functionality engines such as NVIDIA TensorRT

NNEF Working Group Participants ONNX Supporters

This work is licensed under a Creative Commons Attribution 4.0 International License © The Khronos® Group Inc. 2019 - Page 7 NNEF Tools Ecosystem

Compound operations TensorFlow and Live captured by exporting TensorFlow Lite Imminent graph Python script Import/Export NNEF Model Zoo Now available on GitHub. Useful for checking that ingested NNEF produces ONNX Caffe and Import/Export Caffe2 acceptable results on target system Import/Export Files NNEF adopts a rigorous approach to design lifecycle Syntax OpenVX Especially important for safety-critical or Parser and Ingestion and mission-critical applications in automotive, Validator Execution industrial and infrastructure markets

NNEF open source projects hosted on Khronos NNEF GitHub repository under Apache 2.0 https://github.com/KhronosGroup/NNEF-Tools

This work is licensed under a Creative Commons Attribution 4.0 International License © The Khronos® Group Inc. 2019 - Page 8 Embedded Vision and Inferencing Acceleration

Networks trained on Neural Network high-end desktop Training Data and cloud systems Training

Trained Networks Compilation Ingestion Apps link to compiled code or Compiled Inferencing Application Vision inferencing library Code Engine Code Library

Hardware Acceleration APIs Sensor Data

Diverse Embedded DSP Hardware Dedicated (GPUs, DSPs, FPGAs) FPGA Hardware GPU

This work is licensed under a Creative Commons Attribution 4.0 International License © The Khronos® Group Inc. 2019 - Page 9 Primary Machine Learning Compilers

TensorFlow Graph, Caffe, Keras, MXNet, TensorFlow Graph, MXNet, PaddlePaddle, PyTorch, ONNX Import Formats ONNX PyTorch Keras, ONNX

Front-end / IR NNVM / Relay IR nGraph / Stripe IR Glow Core / Glow IR XLA HLO

LLVM, TPU IR, XLA IR OpenCL, LLVM, OpenCL, OpenCL TensorFlow Lite / NNAPI Output CUDA, Metal LLVM, CUDA LLVM (inc. HW accel)

This work is licensed under a Creative Commons Attribution 4.0 International License © The Khronos® Group Inc. 2019 - Page 10 Embedded NN Compilers ML Compiler Steps CEVA Deep Neural Network (CDNN) Cadence Xtensa Neural Network Compiler (XNNC)

Consistent Steps

1.Import Trained Network Description

2. Apply graph-level optimizations e.g. node fusion, node lowering and memory tiling

3. Decompose to primitive instructions and emit programs for accelerated run-times

Fast progress but still area of intense research If compiler optimizations are effective - hardware accelerator APIs can stay ‘simple’ and won’t need complex metacommands (combined primitive commands) like DirectML

Google’s Multi-Level IR Enables multiple domain-specific dialects within a common framework Experimenting with emitting to Vulkan/SPIR-V

This work is licensed under a Creative Commons Attribution 4.0 International License © The Khronos® Group Inc. 2019 - Page 11 SPIR-V Ecosystem SPIR-V Khronos-defined cross-API IR Native graphics and parallel compute support GLSL HLSL Easily parsed/extended 32-bit stream Data object/control flow retained for effective code generation/translation Third party kernel and languages glslang DXC MSL C++ for HLSL SYCL for OpenCL in SPIRV-Cross OpenCL C OpenCL C++ ISO C++ clang GLSL Front-end Front-end Front-end Front-end

SPIR-V (Dis)Assembler clspv

SPIR-V Validator Khronos cooperating LLVM to SPIR-V closely with clang/LLVM Bi-directional LLVM Community Translators SPIRV-opt | SPIRV-remap

Optimization Tools 3rd Party-hosted Open Source Projects

Khronos-hosted EnvironmentIHV spec Driver for Open Source Projects each target APIRuntimes used to drive compilation https://github.com/KhronosGroup/SPIRV-Tools

This work is licensed under a Creative Commons Attribution 4.0 International License © The Khronos® Group Inc. 2019 - Page 12 Embedded Vision and Inferencing Acceleration

Networks trained on Neural Network high-end desktop Training Data and cloud systems Training

Trained Networks Compilation Ingestion Apps link to compiled code or Compiled Inferencing Application Vision inferencing library Code Engine Code Library

Hardware Acceleration APIs Sensor Data

Diverse Embedded DSP Hardware Dedicated (GPUs, DSPs, FPGAs) FPGA Hardware GPU

This work is licensed under a Creative Commons Attribution 4.0 International License © The Khronos® Group Inc. 2019 - Page 13 PC/Mobile Platform Inferencing Stacks

Consistent Steps 1. Import and optimize trained NN model file 2. Ingest graph into ML runtime 3. Accelerate inferencing operations using underlying low-level API

TensorFlow Lite

Core ML Model

Microsoft Windows Android Apple MacOS and iOS Windows Machine Learning (WinML) Neural Network API (NNAPI) CoreML https://docs.microsoft.com/en-us/windows/uwp/machine-learning/ https://developer.android.com/ndk/guides/neuralnetworks/ https://developer.apple.com/documentation/coreml

This work is licensed under a Creative Commons Attribution 4.0 International License © The Khronos® Group Inc. 2019 - Page 14 Inferencing Engines Acceleration APIs Cross-platform Inferencing Engines Platform Engines Android NNAPI Microsoft WinML Apple CoreML

Desktop IHVs AMD MIVisionX over MIOpen OpenVINO over clDNN Both provide NVIDIA TensorRT over cuDNN Inferencing AND Vision acceleration

Mobile / Embedded Huawei Arm Compute Library Huawei MACE Almost all Embedded Qualcomm Neural Processing SDK Inferencing Engines use OpenCL to access Synopsis MetaWare EV Dev Toolkit accelerator silicon TI Deep Learning Library (TIDL)

VeriSilicon VeriSilicon Acuity

This work is licensed under a Creative Commons Attribution 4.0 International License © The Khronos® Group Inc. 2019 - Page 15 OpenVX and OpenCV are Complementary

Callable API implemented, optimized and Community driven open source library Implementation shipped by hardware vendors Tight focus on dozens of core hardware accelerated 100s of imaging and vision functions functions plus extensions and accelerated custom nodes. Scope Multiple camera APIs/interfaces Uses external camera drivers Extensive OpenCV Test Suite but Implementations must pass Khronos Conformance no formal Adopters program Conformance Test Suite to use trademark Protected under Khronos IP Framework - Khronos None. Source code licensed under BSD. members agree not to assert patents against API when IP Protection Some modules require royalties/licensing used in Conformant implementations OpenCV 3.0 Transparent API (or T-API) enables Implementation free to use any underlying API such as Acceleration function offload to OpenCL devices OpenCL. Can use OpenCL for Custom Nodes OpenCV 4.0 G-API graph model for some filters, Graph-based execution of all Nodes. arithmetic/binary operations, and well-defined Efficiency Optimizable computation and data transfer geometrical transformations Deep Neural Network module to construct networks Neural Network layers and operations represented Inferencing from layers for forward pass computations only. directly in the OpenVX Graph. Import from ONNX, TensorFlow, Torch, Caffe NNEF direct import, ONNX through NNEF convertor

This work is licensed under a Creative Commons Attribution 4.0 International License © The Khronos® Group Inc. 2019 - Page 16 OpenVX Cross-Vendor Inferencing OpenVX A high-level graph-based abstraction for portable, efficient vision processing Optimized OpenVX drivers created and shipped by processor vendors Can be implemented on almost any hardware or processor Graph can contain vision processing and NN nodes – enables global optimizations Run-time graph execution can be almost completely autonomously from the host CPU

An OpenVX graph mixing CNN nodes with traditional vision nodes Vision Node Native Vision Vision Downstream Camera Application Node Node Control CNN Nodes Processing

NNEF Translator converts NNEF representation into OpenVX Node Graphs

Performance comparable to hand-optimized, non-portable code Real, complex applications on real, complex hardware Much lower development effort than hand-optimized code Hardware Implementations

This work is licensed under a Creative Commons Attribution 4.0 International License © The Khronos® Group Inc. 2019 - Page 17 Extending OpenVX with Custom Nodes

OpenVX/OpenCL Interop Kernel/Graph Import • Provisional Extension • Provisional Extension • Enables custom OpenCL acceleration to be • Defines container for executable or IR code invoked from OpenVX User Kernels • Enables arbitrary code to be inserted as a • Memory objects can be mapped or copied OpenVX Node in a graph

Application OpenVX user-kernels can access command queue and cl_mem objects to asynchronously Fully asynchronous host-device schedule OpenCL kernel execution operations during data exchange

Copy or export OpenVX data objects cl_mem buffers into OpenVX cl_mem buffers data objects

OpenCL Command Queue

Runtime Map or copy OpenVX data objects into cl_mem buffers Runtime

OpenVX/OpenCL Interop

This work is licensed under a Creative Commons Attribution 4.0 International License © The Khronos® Group Inc. 2019 - Page 18 OpenVX 1.3 Announced This Week!

OpenVX 1.3 Feature Sets Enables deployment flexibility while avoiding fragmentation Implementations with one or more complete feature sets are conformant - Baseline Graph Infrastructure (enables other Feature Sets) - Default Vision Functions - Enhanced Vision Functions (introduced in OpenVX 1.2) Version 1.3 - Neural Network Inferencing (including tensor objects) - NNEF Kernel import (including tensor objects) - Binary Images - Safety Critical (reduced features for easier safety certification) https://www.khronos.org/registry/OpenVX/specs/1.3/html/OpenVX_Specification_1_3.html

Open Source Prototype OpenVX 1.3 Open source OpenVX 1.3 for Conformance Test Suite Raspberry Pi 3 Model B with Raspbian OS Finalization expected before the end of 2019 Automatic optimization of memory access patterns via tiling and chaining https://github.com/KhronosGroup/OpenVX-cts/tree/openvx_1.3 Highly optimized kernels leveraging multimedia instruction set Automatic parallelization for multicore CPUs and GPUs Automatic merging of common kernel sequences Open Source OpenVX Tutorial https://github.com/KhronosGroup/OpenVX-sample-impl/tree/openvx_1.3 and Code Samples https://github.com/rgiduthuri/openvx_tutorial

This work is licensed under a Creative Commons Attribution 4.0 International License © The Khronos® Group Inc. 2019 - Page 19 Embedded Vision and Inferencing Acceleration

Networks trained on Neural Network high-end desktop Training Data and cloud systems Training

Trained Networks Compilation Ingestion Apps link to compiled code or Compiled Inferencing Application Vision inferencing library Code Engine Code Library

Hardware Acceleration APIs Sensor Data

Diverse Embedded DSP Hardware Dedicated (GPUs, DSPs, FPGAs) FPGA Hardware GPU

This work is licensed under a Creative Commons Attribution 4.0 International License © The Khronos® Group Inc. 2019 - Page 20 GPU and Heterogeneous Acceleration APIs GPU APIs Heterogenous APIs Only Vulkan is cross platform API with rendering – Application Application including Embedded and Android Growing number of optimized OpenCL libraries Vision and Imaging Rendering + Compute Compute Only Machine Learning and Inferencing Linear Algebra and Mathematics GPU Rendering APIs Physics gradually expanding compute capabilities GPU e.g. short data types CPU CPUCPU GPUGPU Heterogeneous Many mobile SOCs and Embedded Systems FPGA DSP Compute becoming increasingly heterogeneous Resources Autonomous vehicles Vision and inferencing Custom Hardware

OpenCL provides a programming and runtime framework OpenCL C and C++ kernels can be compiled and loaded to any available device Low-level control over memory allocation and parallel task execution Simpler, relatively lightweight and more flexible than GPU APIs

This work is licensed under a Creative Commons Attribution 4.0 International License © The Khronos® Group Inc. 2019 - Page 21 OpenCL is Widely Deployed and Used

Parallel Computation Languages MATHLAB

Math and Physics Libraries

Hardware Implementations Vision and Imaging Libraries

Machine Learning Inferencing Compilers Modo

Huawei Intel clDNN Desktop Creative Apps Intel Synopsis MetaWare EV Android NNAPI SYCL-DNN CLBlast SYCL-BLAS Arm Compute Library

Linear Algebra Libraries TI Deep Learning Library (TIDL) VeriSilicon Machine Learning Libraries This work is licensed under a Creative Commons Attribution 4.0 International License © The Khronos® Group Inc. 2019 - Page 22 SYCL Single Source C++ Parallel Programming • SYCL 1.2.1 Adopters Program released in July 2018 - https://www.khronos.org/news/press/khronos-releases-conformance-test-suite-for--1.2.1 • Multiple SYCL libraries for vision and inferencing E.g. complex ML frameworks can be - SYCL-BLAS, SYCL-DNN, SYCL-Eigen, SYCL Parallel STL directly compiled and accelerated

Single application source C++ templates and lambda file using STANDARD C++ functions separate host & device code

C++ Kernel Fusion can gives better performance on complex apps and libs than hand-coding

Ideal for accelerating larger C++-based engines and applications

Accelerated code passed into device OpenCL compilers

This work is licensed under a Creative Commons Attribution 4.0 International License © The Khronos® Group Inc. 2019 - Page 23 SYCL Implementations SYCL enables Khronos to influence ISO to enable standard C++ to (eventually) support heterogenous compute

Multiple Backend Support Coming SYCL beginning to be supported on low- level APIs in addition to OpenCL e.g. Vulkan and CUDA http://sycl.tech Intel Adoption Intel’s ‘One API’ Initiative uses SYCL https://newsroom.intel.com/news/intels-one-api-project- delivers-unified-programming-model-across-diverse- architectures/#gs.bydj6z

triSYCL LLVM/clang SYCL Compiler Open-source test-bed to Compiles C++-based SYCL ComputeCpp experiment with the HipSYCL source files into code for both Software's v1.2.1 specification of the OpenCL SYCL 1.2.1 implementation CPU and a wide range of conformant implementation SYCL C++ layer and to give that builds upon NVIDIA compute accelerators available to download today feedback to Khronos CUDA/AMD HIP/ROCm

This work is licensed under a Creative Commons Attribution 4.0 International License © The Khronos® Group Inc. 2019 - Page 24 OpenCL Evolution

OpenCL Extension Specs Scratch-Pad Memory Management Vulkan / OpenCL Interop Integration of Extensions Extended Subgroups plus New Core functionality SPIR-V 1.4 ingestion for compiler efficiency SPIR-V Extended debug info Focus for OpenCL Next is ‘Deployment Flexibility’ Flexible Profile enables embedded vendors to ship targeted functionality for their customers and be officially conformant Regular Maintenance Updates May 2017 Regular updates for spec clarifications, Target 2020 OpenCL 2.2 formatting and bug fixes ‘OpenCL Next’ https://www.khronos.org/registry/OpenCL/

Repeat Cycle for next Core Specification

This work is licensed under a Creative Commons Attribution 4.0 International License © The Khronos® Group Inc. 2019 - Page 25 Pervasive Vulkan Major GPU Companies supporting Vulkan for Desktop and Mobile Platforms

http://vulkan.gpuinfo.org/ Platforms

Apple Desktop Android (via porting Media Players Consoles (Android 7.0+) layers) (Vulkan 1.1 required on Android Q)

Virtual Reality Cloud Services Game Streaming Embedded

Game Engines

Croteam Serious Engine

This work is licensed under a Creative Commons Attribution 4.0 International License © The Khronos® Group Inc. 2019 - Page 26 Safety Critical GPU API Evolution

New Generation Safety Critical APIs for Graphics, Compute and Display OpenGL SC 1.0 - 2005 OpenGL SC 2.0 - April 2016 Fixed function graphics safety Programmable critical subset safety critical subset

OpenGL ES 1.0 - 2003 OpenGL ES 2.0 - 2007 Vulkan 1.0 - 2016 Fixed function graphics Programmable Shaders Explicit Graphics and Compute

Clearly Definable Design Goals to Adapt Vulkan for SC Rendering Compute Display Reduce driver size and complexity Vulkan is Compelling Starting -> Offline pipeline creation, no dynamic Point for SC GPU API Design display resolutions Industry Need • Widely adopted, royalty-free for GPU Acceleration APIs open standard Deterministic Behavior designed to ease system • Low-level explicit API – smaller surface -> No ignored parameters, static memory safety certification is area than OpenGL management, eliminate undefined increasing • Not burdened by debug functionality behaviors ISO 26262 / ASIL-D • Very little internal state • Well-defined thread behavior Robust Error Handling -> Error callbacks so app can respond, Fatal error callbacks for fast recovery initiation

Potential OpenCL SC work will leverage C API - MISRA C Compliance the deployment flexibility of ‘OpenCL Khronos Vulkan SC Working Group Next’ to minimize API surface area started work in February 2019

This work is licensed under a Creative Commons Attribution 4.0 International License © The Khronos® Group Inc. 2019 - Page 27 Need for New Camera Control API Standard? • Khronos suspended work on OpenKCam standard several years ago - Mobile market went proprietary - but embedded market has different needs • OpenKCAM was aiming at advanced control of ISP and camera with cross-platform portability - Generate sophisticated image stream for advanced imaging & vision apps - Portable access to growing sensor diversity: e.g. depth sensors and sensor arrays - Cross sensor synch: e.g. synch of multiple camera and MEMS sensors - Advanced, high-frequency per-frame burst control of camera/sensor: e.g. ROI - Multiple input, output re-circulating streams with RAW, Bayer or YUV Processing

Liaison opportunity for cooperation – Defines control of Sensor, Color Filter Array maybe over OpenKCam and GeniCam? Lens, Flash, Focus, Aperture Work together to integrate next generation Auto Exposure (AE) camera control and acceleration APIs? Auto White Balance (AWB) Auto Focus (AF) Stream of Image Signal images for Processor (ISP) downstream processing

This work is licensed under a Creative Commons Attribution 4.0 International License © The Khronos® Group Inc. 2019 - Page 28 Thank You and Resources • Khronos Standards: OpenVX, NNEF, OpenCL, SPIR, SYCL, Vulkan and more… - Any company is welcome to join Khronos www.khronos.org - OpenVX 1.3: www.khronos.org/openvx/ - SYCL: http://sycl.tech • Compilers for Machine Learning Conference Proceedings: www.c4ml.org • MLIR: www.blog.google/technology/ai/mlir-accelerating-ai-open-source-infrastructure/ • Neil Trevett: [email protected] | @neilt3d

Gain early insights Influence the design and Accelerate your time-to- into industry trends direction of key open standards market with early access and directions that will drive your business to specification drafts

Gather industry Draft Specifications Publicly Release requirements for future Confidential to Specifications and open standards Khronos members Conformance Tests

Network with domain State-of-the-art IP Enhance your company experts from diverse Framework protects your reputation as an industry leader companies in your industry Intellectual Property through Khronos participation Benefits of Khronos membership

This work is licensed under a Creative Commons Attribution 4.0 International License © The Khronos® Group Inc. 2019 - Page 29