Silicon Acceleration Embedded Technology 2016, Yokohama

Neil Trevett Vice President Developer Ecosystem, NVIDIA | President, Khronos [email protected] | @neilt3d November 2016

© Copyright Khronos Group 2016 - Page 1 Khronos and Open Standards

Software

Silicon

Khronos is an Industry Consortium of over 100 companies We create royalty-free, open standard APIs for hardware acceleration of Graphics, Parallel Compute, Neural Networks and Vision

© Copyright Khronos Group 2016 - Page 2 Accelerated API Landscape

Vision Frameworks

Neural Net Libraries OpenVX Neural Net Extension

High-level Language-based Acceleration Frameworks

3D Graphics Explicit Kernels

DSP Dedicated Hardware GPU FPGA

© Copyright Khronos Group 2016 - Page 3 OpenCL – Low-level Parallel Programing • Low level programming of heterogeneous parallel compute resources - One code tree can be executed on CPUs, GPUs, DSPs and FPGA • OpenCL C language to write kernel programs to execute on any compute device - Platform Layer API - to query, select and initialize compute devices - Runtime API - to build and execute kernels programs on multiple devices • New in OpenCL 2.2 - OpenCL C++ kernel language - a static subset of C++14 - Adaptable and elegant sharable code – great for building libraries - Templates enable meta-programming for highly adaptive software - Lambdas used to implement nested/dynamic parallelism

GPU OpenCL CPU KernelOpenCL Kernel code DSP Runtime API CodeKernelOpenCL compiled for CPU loads and executes CPU CodeKernelOpenCL FPGA CodeKernel devices kernels across devices Devices Code Host

© Copyright Khronos Group 2016 - Page 4 OpenCL Conformant Implementations

1.0 | May09 1.1 | Jul11 1.2 | Jun12 1.0 | Aug09 1.1 | Aug10 1.2 | May12 2.0 | Dec14 1.0 | May10 1.1 | Feb11 Desktop 1.1 |Mar11 1.2 | Dec12 2.0 | Jul14 2.1 | Jun15

1.0 | May09 1.1 | Jun10 1.2 | May15

1.1 | Aug12 1.2 | Mar16

1.0 | Feb11 1.2 | Sep13 Mobile

1.1 | Nov12 1.2 | Apr14 2.0 | Nov15

1.1 | Apr12 1.2 | Dec14

1.2 | Sep14

1.1 | May13 1.0 | Jan10 Embedded 1.2 | May15 1.2 | Aug15 1.0 | Jul13 FPGA 1.0 | Dec14

Vendor timelines are Dec08 Jun10 Nov11 Nov13 Nov15 first implementation of OpenCL 1.0 OpenCL 1.1 OpenCL 1.2 OpenCL 2.0 OpenCL 2.1 each spec generation Specification Specification Specification Specification Specification © Copyright Khronos Group 2016 - Page 5 SYCL for OpenCL • Single-source heterogeneous programming using STANDARD C++ - Use C++ templates and lambda functions for host & device code • Aligns the hardware acceleration of OpenCL with direction of the C++ standard - C++14 with open source C++17 Parallel STL hosted by Khronos

Developer Choice The development of the two specifications are aligned so code can be easily shared between the two approaches

C++ Kernel Language Single-source C++ Low Level Control Programmer Familiarity ‘GPGPU’-style separation of Approach also taken by device-side kernel source C++ AMP and OpenMP code and host code

© Copyright Khronos Group 2016 - Page 6 Embedded Vision Acceleration Embedded Technology, Yokohama

Shorin Kyo, Huawei

© Copyright Khronos Group 2016 - Page 7 OpenVX – Low Power Vision Acceleration • Targeted at vision acceleration in real-time, mobile and embedded platforms - Precisely defined API for production deployment • Higher abstraction than OpenCL for performance portability across diverse architectures - Multi-core CPUs, GPUs, DSPs and DSP arrays, ISPs, Dedicated hardware… • Extends portable vision acceleration to very low power domains - Doesn’t require high-power CPU/GPU Complex or OpenCL precision

Vision Engine Vision Processing X100 Dedicated Efficiency Middleware Hardware Application Vision DSPs X10 GPU Compute Multi-core CPU Power Efficiency Power X1 GPU DSP Computation Flexibility Hardware

© Copyright Khronos Group 2016 - Page 8 OpenVX Graphs • OpenVX developers express a graph of image operations (‘Nodes’) - Nodes can be on any hardware or processor coded in any language • Graphs can execute almost autonomously - Possible to Minimize host interaction during frame-rate graph execution • Graphs are the key to run-time optimization opportunities…

Pyr Camera OpenVX Nodes t Rendering Input Output

RGB YUV Gray Array of Frame Color Frame Channel Frame Image Optical Keypoints Harris Conversion Extract Pyramid Flow Track

Array of Features OpenVX Graph Ftrt-1

Feature Extraction Example Graph

© Copyright Khronos Group 2016 - Page 9 OpenVX Efficiency through Graphs..

Graph Memory Kernel Data Scheduling Management Merge Tiling

Execute a sub- Split the graph Reuse Replace a sub- graph at tile execution across pre-allocated graph with a granularity instead the whole system: memory for single faster node of image CPU / GPU / multiple granularity dedicated HW intermediate data

Faster execution Less allocation overhead, Better memory Better use of or lower power more memory for locality, less kernel data cache and consumption other applications launch overhead local memory

© Copyright Khronos Group 2016 - Page 10 Example Relative Performance

10 Relative Performance 9 8.7

8 OpenCV (GPU accelerated) 7 OpenVX (GPU accelerated) NVIDIA 6 implementation experience. Geometric mean of 5 >2200 primitives, grouped into each 4 categories, 2.9 running at different 3 image sizes and 2.5 parameter settings 2 1.5 1.1 1

0 Arithmetic Analysis Filter Geometric Overall

© Copyright Khronos Group 2016 - Page 11 Layered Vision Processing Ecosystem

Implementers may use OpenCL or Compute to implement OpenVX nodes on programmable processors And then developers can use OpenVX to enable a developer to easily connect those nodes into a graph AMD OpenVX - Open source, highly optimized Application for x86 CPU and OpenCL for GPU - “Graph Optimizer” looks at entire processing pipeline and removes/replaces/merges functions to improve C/C++ performance and bandwidth - Scripting for rapid prototyping, without re-compiling, at Programmable Vision Dedicated Vision production performance levels Processors Hardware http://gpuopen.com/compute-product/amd-openvx/

OpenVX enables the graph to be extended to include hardware architectures that don’t support programmable APIs

The OpenVX graph enables implementers to optimize execution across diverse hardware architectures an drive to lower power implementations

© Copyright Khronos Group 2016 - Page 12 OpenVX 1.0 Shipping, OpenVX 1.1 Released! • Multiple OpenVX 1.0 Implementations shipping – spec in October 2014 - Open source sample implementation and conformance tests available • OpenVX 1.1 Specification released 2nd May 2016 - Expands node functionality AND enhances graph framework • OpenVX is EXTENSIBLE - Implementers can add their own nodes at any time to meet customer and market needs

= shipping implementations

© Copyright Khronos Group 2016 - Page 13 OpenVX Neural Net Extension • Convolution Neural Network topologies can be represented as OpenVX graphs - Layers are represented as OpenVX nodes - Layers connected by multi-dimensional tensors objects - Layer types include convolution, activation, pooling, fully-connected, soft-max - CNN nodes can be mixed with traditional vision nodes • Import/Export Extension - Efficient handling of network Weights/Biases or complete networks • The specification is provisional - Welcome feedback from the deep learning community

Vision Node Native Vision Vision Downstream Camera Application Node Node Control CNN Nodes Processing

An OpenVX graph mixing CNN nodes with traditional vision nodes

© Copyright Khronos Group 2016 - Page 14 NNEF - Neural Network Exchange Format

NN Authoring Framework 1 Inference Engine 1

NN Authoring Framework 2 Inference Engine 2

NN Authoring Framework 3 Inference Engine 3

NN Authoring Framework 1 Inference Engine 1 Neural Net NN Authoring Framework 2 Exchange Inference Engine 2 Format NN Authoring Framework 3 Inference Engine 3

NNEF encapsulates neural network structure, data formats, commonly used operations (such as convolution, pooling, normalization, etc.) and formal network semantics

NNEF 1.0 is currently being defined OpenVX will import NNEF files

© Copyright Khronos Group 2016 - Page 15 Tessellation and geometry shaders 32-bit integers and floats ASTC Texture Compression OpenGL ES NPOT, 3D/depth textures Floating point render targets Vertex and Texture arrays Compute Shaders Debug and robustness for security Fixed function Multiple Render Targets Pipeline fragment shaders

Epic’s Rivalry demo using full Unreal Engine 4 https://www.youtube.com/watch?v=jRr-G95GdaM

Driver Silicon Silicon Driver Silicon Driver Update Update Update Update Update Update 2003 2004 2007 2012 2014 2015 2016 ES 1.0 ES 1.1 ES 2.0 ES 3.0 ES 3.1 ES 3.2 AEP N L Close to 2 Billion OpenGL ES devices shipped in 2015

OpenGL ES 2.0

OpenGL ES 3.x

Vertex and Fragment Shaders Compute Shaders http://hwstats.unity3d.com/mobile/gpu.html © Copyright Khronos Group 2016 - Page 16 The Key Principle of Vulkan: Explicit Control

• Application tells the driver what it is going to do You have to supply a LOT of information! - In enough detail that driver doesn’t have to guess - When the driver needs to know it And, you have to supply it early

• In return, driver promises to do - What the application asks for - When it asks for it - Very quickly Efficient, predictable performance

• No driver magic – no surprises

© Copyright Khronos Group 2016 - Page 17 Vulkan Explicit GPU Control

Vulkan Benefits

Resource management in app code: Less driver hitches and surprises Application Single thread per context Simpler drivers: Application Improved efficiency/performance Memory allocation Reduced CPU bottlenecks Thread management Language Front-end Lower latency Synchronization Compilers Increased portability High-level Driver Multi-threaded generation Initially GLSL Abstraction of command buffers Multi-threaded Command Buffers: Context management Command creation can be multi-threaded Memory allocation Multiple CPU cores increase performance Full GLSL compiler SPIR-V pre-compiled shaders Error detection Graphics, compute and DMA queues: Layered GPU Control Thin Driver Work dispatch flexibility Explicit GPU Control Loadable debug and SPIR-V Pre-compiled Shaders: validation layers No front-end compiler in driver GPU GPU Future shading language flexibility Loadable Layers No error handling overhead in Vulkan 1.0 provides access to production code OpenGL ES 3.1 / OpenGL 4.X-class GPU functionality but with increased performance and flexibility

© Copyright Khronos Group 2016 - Page 18 Vulkan Multi-threading Efficiency 1. Multiple threads can construct Command Buffers in parallel Application is responsible for thread management and synch CPU Thread Command Buffer

CPU Thread Command CPU Buffer Thread

CPU Command Thread Command Buffer Queue GPU

CPU Command Thread 2. Command Buffers placed in Command Buffer Queue by separate submission thread

Applications can create graphics, compute CPU and DMA command buffers with a general Thread Command queue model that can be extended to more Buffer heterogeneous processing in the future

© Copyright Khronos Group 2016 - Page 19 Khronos Safety Critical APIs

DO-178B/C

ISO 26262

Experience and Guidelines New Generation APIs for safety certifiable vision, graphics and compute OpenGL SC 1.0 - 2005 OpenGL SC 2.0 - April 2016 Fixed function graphics subset programmable pipeline subset

OpenGL ES 1.0 - 2003 OpenGL ES 2.0 - 2007 Fixed function graphics Shader programmable pipeline

Safety Critical Safety Critical vision heterogeneous Small driver size processing compute Advanced functionality Graphics and compute

https://www.khronos.org/news/press/advisory-panel-to-create-design-guidelines-for-safety-critical-apis © Copyright Khronos Group 2016 - Page 20 Please Consider Joining Khronos!

Understand early Ship products that conform to industry requirements! international standards! Influence how Access draft specs to standards evolve! build products faster!

Khronos is proven to RAPIDLY generate hardware API standards that create significant market opportunities

Any company or organization is welcome to join Khronos for a voice and a vote in any of its standards www.khronos.org

© Copyright Khronos Group 2016 - Page 21 VeriSilicon Computer Vision solution with OpenVX over “VIP”

Gaku Ogura Simon Jones

18 November 2016

© Copyright Khronos Group 2016 - Page 22 VeriSilicon – Who We Are

• Founded in 2001 • Over 650 employees • 70% dedicated to R&D

© Copyright Khronos Group 2016 - Page 23 VeriSilicon – What we can do ?

IP Development & Licensing

Idea to Spec System Architecture Shipping Silicon Design Spec to RTL Test

RTL to Netlist Package

Netlist to Manufacture GDS

© Copyright Khronos Group 2016 - Page 24 Movement of Intelligent Devices

▲ Multi-Media ► Graphics, , Audio, Voice

• Natural User Interface - VISION - Natural Language Processing - Sensors

▲Many applications – need flexibility ▲1000s of algorithms – new one every day ▲Compute intensive – need hardware acceleration ▲Algorithms keep changing – need programmability

© Copyright Khronos Group 2016 - Page 25 Deep Learning: Neural Networks Return with Vengeance • With big data and high density compute, deep neural networks can be trained to solve “impossible” computer vision problems

% Teams using Deep Learning

© Copyright Khronos Group 2016 - Page 26 The Basics of Deep Neural Networks (DNN) • Training: 1016-1022 MACs/dataset

forward Offline (mostly) “Pug” Analogous to compiling

=? “Rottweiler”

backward error 100K-100M Network Coefficients • Inference: 106-1011 MACs/image On-the-fly

Analogous to running s/w forward Embedded systems use this “Rottweiler”

© Copyright Khronos Group 2016 - Page 27 Challenges of Real-Time Inference

Classification Detection Segmentation

100G MAC/s 1T MAC/s 10T MAC/s

• Majority of deep neural net computation is Multiply-Accumulate (MAC) for convolutions and inner products • Memory bandwidth is a bigger challenge - Convolution in deep neural nets works with not just 2D images, but high-dimensional arrays (i.e. “tensor”) - Example: AlexNet (classification) has 60M parameters (240MB FP32), DDR BW with brute force implementation: 35GB/s

© Copyright Khronos Group 2016 - Page 28 Processor Architectures

Programmability

CPU

GPU OpenCL VIP • VeriSilicon OpenVX VeriSilicon • ARM OpenCV • Imagination

DSP • VeriSilicon • CEVA • Videantis • Cadence • Synopsys Custom RTL

Energy

© Copyright Khronos Group 2016 - Page 29 Efficient Processor Architecture to Enable DNN for Embedded Vision

• Programmable engine to cope Vision Applications with new CV algorithms • Future new NN layers can be Vision Deep Learning Framework Frameworks implemented here • Scalable architecture to support different PPA API Custom RTL to accelerate (Performance, Power, Area) mature CV algorithms for requirements 24/7 low power operation Scalable Mature CV Block ProgrammableProgrammable ScalableScalable HW Accelerator I/F to extend HW acceleration Programmable MACs • High utilization MACs for DNN EngineEngine MACsMACs for application specific purpose Engine Mature CV Block • Scalable architecture to HW Accelerator support different PPA Intelligent (Performance, Power, Area) Customer Defined Data Fabrics requirements HW Accelerator For data synchronization between threads/blocks and minimize DDR BW • Handles other DNN functions Local Memory / Cache (normalization, pooling,…) • Handles pruning, compression, batching… to increase MAC utilization and decrease DDR BW © Copyright Khronos Group 2016 - Page 30 Use Case Study: Faster RCNN • One of state-of-the-art object detection CNNs - Network configuration Programmabl NN Cores - 300 region proposals e Engine - 60M parameters, 30G MACs/image TP Fabric • Collaborative computing in VIP - NN Engine: number crunching Local Memory / Cache - Convolution layers, full connected layers - Tensor Processing Fabric: data rearrangementOutput of a layer is queued in DDR if next layer is on different processing engine - LRN, max pooling, ROI pooling - Programmable Engine: precision compute - RPN, NMS, softmax • Real-time performance (30fps HD) under 0.5W @ 28nm HPM

© Copyright Khronos Group 2016 - Page 31 Scalable Performance from VIP8000

VIP8000 Series VIP Nano VIP8000UL VIP8000L VIP8000 Number of Execution cores 1 2 4 8 Clock Frequency SVT @WC125C (MHz) 800 800 800 800 HD Performance@800MHz Perspective Warping 60 fps 120 fps 240 fps 480 fps Optical Flow LK 30 fps 60 fps 120 fps 240 fps Pedestrian Detection 30 fps 60 fps 120 fps 240 fps Convolutional Neural Network (AlexNet) 125 fps 250 fps 500 fps 1000 fps

• Scalable performance with SAME application code on different processor variants

© Copyright Khronos Group 2016 - Page 32 Comprehensive Multi-Media/Pixel Subsystems

VPUCore GC Core VIP Core ZSP CPU ISP (Video) (GPU) (Vision/Image) (Audio/Voice)

AXI Bus VeriSilicon Universal Compression

Local Bus

DMA DC Wireless Mixed IO Controller Memory Controller Engine (Display) Connectivity Signal IPs

Company Proprietary and Confidential © Copyright Khronos Group 2016 - Page 33 [email protected]

© Copyright Khronos Group 2016 - Page 34