Silicon Acceleration APIs Embedded Technology 2016, Yokohama
Neil Trevett Vice President Developer Ecosystem, NVIDIA | President, Khronos [email protected] | @neilt3d November 2016
© Copyright Khronos Group 2016 - Page 1 Khronos and Open Standards
Software
Silicon
Khronos is an Industry Consortium of over 100 companies We create royalty-free, open standard APIs for hardware acceleration of Graphics, Parallel Compute, Neural Networks and Vision
© Copyright Khronos Group 2016 - Page 2 Accelerated API Landscape
Vision Frameworks
Neural Net Libraries OpenVX Neural Net Extension
High-level Language-based Acceleration Frameworks
3D Graphics Explicit Kernels
DSP Dedicated Hardware GPU FPGA
© Copyright Khronos Group 2016 - Page 3 OpenCL – Low-level Parallel Programing • Low level programming of heterogeneous parallel compute resources - One code tree can be executed on CPUs, GPUs, DSPs and FPGA • OpenCL C language to write kernel programs to execute on any compute device - Platform Layer API - to query, select and initialize compute devices - Runtime API - to build and execute kernels programs on multiple devices • New in OpenCL 2.2 - OpenCL C++ kernel language - a static subset of C++14 - Adaptable and elegant sharable code – great for building libraries - Templates enable meta-programming for highly adaptive software - Lambdas used to implement nested/dynamic parallelism
GPU OpenCL CPU KernelOpenCL Kernel code DSP Runtime API CodeKernelOpenCL compiled for CPU loads and executes CPU CodeKernelOpenCL FPGA CodeKernel devices kernels across devices Devices Code Host
© Copyright Khronos Group 2016 - Page 4 OpenCL Conformant Implementations
1.0 | May09 1.1 | Jul11 1.2 | Jun12 1.0 | Aug09 1.1 | Aug10 1.2 | May12 2.0 | Dec14 1.0 | May10 1.1 | Feb11 Desktop 1.1 |Mar11 1.2 | Dec12 2.0 | Jul14 2.1 | Jun15
1.0 | May09 1.1 | Jun10 1.2 | May15
1.1 | Aug12 1.2 | Mar16
1.0 | Feb11 1.2 | Sep13 Mobile
1.1 | Nov12 1.2 | Apr14 2.0 | Nov15
1.1 | Apr12 1.2 | Dec14
1.2 | Sep14
1.1 | May13 1.0 | Jan10 Embedded 1.2 | May15 1.2 | Aug15 1.0 | Jul13 FPGA 1.0 | Dec14
Vendor timelines are Dec08 Jun10 Nov11 Nov13 Nov15 first implementation of OpenCL 1.0 OpenCL 1.1 OpenCL 1.2 OpenCL 2.0 OpenCL 2.1 each spec generation Specification Specification Specification Specification Specification © Copyright Khronos Group 2016 - Page 5 SYCL for OpenCL • Single-source heterogeneous programming using STANDARD C++ - Use C++ templates and lambda functions for host & device code • Aligns the hardware acceleration of OpenCL with direction of the C++ standard - C++14 with open source C++17 Parallel STL hosted by Khronos
Developer Choice The development of the two specifications are aligned so code can be easily shared between the two approaches
C++ Kernel Language Single-source C++ Low Level Control Programmer Familiarity ‘GPGPU’-style separation of Approach also taken by device-side kernel source C++ AMP and OpenMP code and host code
© Copyright Khronos Group 2016 - Page 6 Embedded Vision Acceleration Embedded Technology, Yokohama
Shorin Kyo, Huawei
© Copyright Khronos Group 2016 - Page 7 OpenVX – Low Power Vision Acceleration • Targeted at vision acceleration in real-time, mobile and embedded platforms - Precisely defined API for production deployment • Higher abstraction than OpenCL for performance portability across diverse architectures - Multi-core CPUs, GPUs, DSPs and DSP arrays, ISPs, Dedicated hardware… • Extends portable vision acceleration to very low power domains - Doesn’t require high-power CPU/GPU Complex or OpenCL precision
Vision Engine Vision Processing X100 Dedicated Efficiency Middleware Hardware Application Vision DSPs X10 GPU Compute Multi-core CPU Power Efficiency Power X1 GPU DSP Computation Flexibility Hardware
© Copyright Khronos Group 2016 - Page 8 OpenVX Graphs • OpenVX developers express a graph of image operations (‘Nodes’) - Nodes can be on any hardware or processor coded in any language • Graphs can execute almost autonomously - Possible to Minimize host interaction during frame-rate graph execution • Graphs are the key to run-time optimization opportunities…
Pyr Camera OpenVX Nodes t Rendering Input Output
RGB YUV Gray Array of Frame Color Frame Channel Frame Image Optical Keypoints Harris Conversion Extract Pyramid Flow Track
Array of Features OpenVX Graph Ftrt-1
Feature Extraction Example Graph
© Copyright Khronos Group 2016 - Page 9 OpenVX Efficiency through Graphs..
Graph Memory Kernel Data Scheduling Management Merge Tiling
Execute a sub- Split the graph Reuse Replace a sub- graph at tile execution across pre-allocated graph with a granularity instead the whole system: memory for single faster node of image CPU / GPU / multiple granularity dedicated HW intermediate data
Faster execution Less allocation overhead, Better memory Better use of or lower power more memory for locality, less kernel data cache and consumption other applications launch overhead local memory
© Copyright Khronos Group 2016 - Page 10 Example Relative Performance
10 Relative Performance 9 8.7
8 OpenCV (GPU accelerated) 7 OpenVX (GPU accelerated) NVIDIA 6 implementation experience. Geometric mean of 5 >2200 primitives, grouped into each 4 categories, 2.9 running at different 3 image sizes and 2.5 parameter settings 2 1.5 1.1 1
0 Arithmetic Analysis Filter Geometric Overall
© Copyright Khronos Group 2016 - Page 11 Layered Vision Processing Ecosystem
Implementers may use OpenCL or Compute Shaders to implement OpenVX nodes on programmable processors And then developers can use OpenVX to enable a developer to easily connect those nodes into a graph AMD OpenVX - Open source, highly optimized Application for x86 CPU and OpenCL for GPU - “Graph Optimizer” looks at entire processing pipeline and removes/replaces/merges functions to improve C/C++ performance and bandwidth - Scripting for rapid prototyping, without re-compiling, at Programmable Vision Dedicated Vision production performance levels Processors Hardware http://gpuopen.com/compute-product/amd-openvx/
OpenVX enables the graph to be extended to include hardware architectures that don’t support programmable APIs
The OpenVX graph enables implementers to optimize execution across diverse hardware architectures an drive to lower power implementations
© Copyright Khronos Group 2016 - Page 12 OpenVX 1.0 Shipping, OpenVX 1.1 Released! • Multiple OpenVX 1.0 Implementations shipping – spec in October 2014 - Open source sample implementation and conformance tests available • OpenVX 1.1 Specification released 2nd May 2016 - Expands node functionality AND enhances graph framework • OpenVX is EXTENSIBLE - Implementers can add their own nodes at any time to meet customer and market needs
= shipping implementations
© Copyright Khronos Group 2016 - Page 13 OpenVX Neural Net Extension • Convolution Neural Network topologies can be represented as OpenVX graphs - Layers are represented as OpenVX nodes - Layers connected by multi-dimensional tensors objects - Layer types include convolution, activation, pooling, fully-connected, soft-max - CNN nodes can be mixed with traditional vision nodes • Import/Export Extension - Efficient handling of network Weights/Biases or complete networks • The specification is provisional - Welcome feedback from the deep learning community
Vision Node Native Vision Vision Downstream Camera Application Node Node Control CNN Nodes Processing
An OpenVX graph mixing CNN nodes with traditional vision nodes
© Copyright Khronos Group 2016 - Page 14 NNEF - Neural Network Exchange Format
NN Authoring Framework 1 Inference Engine 1
NN Authoring Framework 2 Inference Engine 2
NN Authoring Framework 3 Inference Engine 3
NN Authoring Framework 1 Inference Engine 1 Neural Net NN Authoring Framework 2 Exchange Inference Engine 2 Format NN Authoring Framework 3 Inference Engine 3
NNEF encapsulates neural network structure, data formats, commonly used operations (such as convolution, pooling, normalization, etc.) and formal network semantics
NNEF 1.0 is currently being defined OpenVX will import NNEF files
© Copyright Khronos Group 2016 - Page 15 Tessellation and geometry shaders 32-bit integers and floats ASTC Texture Compression OpenGL ES NPOT, 3D/depth textures Floating point render targets Vertex and Texture arrays Compute Shaders Debug and robustness for security Fixed function Multiple Render Targets Pipeline fragment shaders
Epic’s Rivalry demo using full Unreal Engine 4 https://www.youtube.com/watch?v=jRr-G95GdaM
Driver Silicon Silicon Driver Silicon Driver Update Update Update Update Update Update 2003 2004 2007 2012 2014 2015 2016 ES 1.0 ES 1.1 ES 2.0 ES 3.0 ES 3.1 ES 3.2 AEP N L Close to 2 Billion OpenGL ES devices shipped in 2015
OpenGL ES 2.0
OpenGL ES 3.x
Vertex and Fragment Shaders Compute Shaders http://hwstats.unity3d.com/mobile/gpu.html © Copyright Khronos Group 2016 - Page 16 The Key Principle of Vulkan: Explicit Control
• Application tells the driver what it is going to do You have to supply a LOT of information! - In enough detail that driver doesn’t have to guess - When the driver needs to know it And, you have to supply it early
• In return, driver promises to do - What the application asks for - When it asks for it - Very quickly Efficient, predictable performance
• No driver magic – no surprises
© Copyright Khronos Group 2016 - Page 17 Vulkan Explicit GPU Control
Vulkan Benefits
Resource management in app code: Less driver hitches and surprises Application Single thread per context Simpler drivers: Application Improved efficiency/performance Memory allocation Reduced CPU bottlenecks Thread management Language Front-end Lower latency Synchronization Compilers Increased portability High-level Driver Multi-threaded generation Initially GLSL Abstraction of command buffers Multi-threaded Command Buffers: Context management Command creation can be multi-threaded Memory allocation Multiple CPU cores increase performance Full GLSL compiler SPIR-V pre-compiled shaders Error detection Graphics, compute and DMA queues: Layered GPU Control Thin Driver Work dispatch flexibility Explicit GPU Control Loadable debug and SPIR-V Pre-compiled Shaders: validation layers No front-end compiler in driver GPU GPU Future shading language flexibility Loadable Layers No error handling overhead in Vulkan 1.0 provides access to production code OpenGL ES 3.1 / OpenGL 4.X-class GPU functionality but with increased performance and flexibility
© Copyright Khronos Group 2016 - Page 18 Vulkan Multi-threading Efficiency 1. Multiple threads can construct Command Buffers in parallel Application is responsible for thread management and synch CPU Thread Command Buffer
CPU Thread Command CPU Buffer Thread
CPU Command Thread Command Buffer Queue GPU
CPU Command Thread 2. Command Buffers placed in Command Buffer Queue by separate submission thread
Applications can create graphics, compute CPU and DMA command buffers with a general Thread Command queue model that can be extended to more Buffer heterogeneous processing in the future
© Copyright Khronos Group 2016 - Page 19 Khronos Safety Critical APIs
DO-178B/C
ISO 26262
Experience and Guidelines New Generation APIs for safety certifiable vision, graphics and compute OpenGL SC 1.0 - 2005 OpenGL SC 2.0 - April 2016 Fixed function graphics subset Shader programmable pipeline subset
OpenGL ES 1.0 - 2003 OpenGL ES 2.0 - 2007 Fixed function graphics Shader programmable pipeline
Safety Critical Safety Critical vision heterogeneous Small driver size processing compute Advanced functionality Graphics and compute
https://www.khronos.org/news/press/advisory-panel-to-create-design-guidelines-for-safety-critical-apis © Copyright Khronos Group 2016 - Page 20 Please Consider Joining Khronos!
Understand early Ship products that conform to industry requirements! international standards! Influence how Access draft specs to standards evolve! build products faster!
Khronos is proven to RAPIDLY generate hardware API standards that create significant market opportunities
Any company or organization is welcome to join Khronos for a voice and a vote in any of its standards www.khronos.org
© Copyright Khronos Group 2016 - Page 21 VeriSilicon Computer Vision solution with OpenVX over “VIP”
Gaku Ogura Simon Jones
18 November 2016
© Copyright Khronos Group 2016 - Page 22 VeriSilicon – Who We Are
• Founded in 2001 • Over 650 employees • 70% dedicated to R&D
© Copyright Khronos Group 2016 - Page 23 VeriSilicon – What we can do ?
IP Development & Licensing
Idea to Spec System Architecture Shipping Silicon Design Spec to RTL Test
RTL to Netlist Package
Netlist to Manufacture GDS
© Copyright Khronos Group 2016 - Page 24 Movement of Intelligent Devices
▲ Multi-Media ► Graphics, Video, Audio, Voice
• Natural User Interface - VISION - Natural Language Processing - Sensors
▲Many applications – need flexibility ▲1000s of algorithms – new one every day ▲Compute intensive – need hardware acceleration ▲Algorithms keep changing – need programmability
© Copyright Khronos Group 2016 - Page 25 Deep Learning: Neural Networks Return with Vengeance • With big data and high density compute, deep neural networks can be trained to solve “impossible” computer vision problems
% Teams using Deep Learning
© Copyright Khronos Group 2016 - Page 26 The Basics of Deep Neural Networks (DNN) • Training: 1016-1022 MACs/dataset
forward Offline (mostly) “Pug” Analogous to compiling
=? “Rottweiler”
backward error 100K-100M Network Coefficients • Inference: 106-1011 MACs/image On-the-fly
Analogous to running s/w forward Embedded systems use this “Rottweiler”
© Copyright Khronos Group 2016 - Page 27 Challenges of Real-Time Inference
Classification Detection Segmentation
100G MAC/s 1T MAC/s 10T MAC/s
• Majority of deep neural net computation is Multiply-Accumulate (MAC) for convolutions and inner products • Memory bandwidth is a bigger challenge - Convolution in deep neural nets works with not just 2D images, but high-dimensional arrays (i.e. “tensor”) - Example: AlexNet (classification) has 60M parameters (240MB FP32), DDR BW with brute force implementation: 35GB/s
© Copyright Khronos Group 2016 - Page 28 Processor Architectures
Programmability
CPU
GPU OpenCL VIP • VeriSilicon OpenVX VeriSilicon • ARM OpenCV • Imagination
DSP • VeriSilicon • CEVA • Videantis • Cadence • Synopsys Custom RTL
Energy
© Copyright Khronos Group 2016 - Page 29 Efficient Processor Architecture to Enable DNN for Embedded Vision
• Programmable engine to cope Vision Applications with new CV algorithms • Future new NN layers can be Vision Deep Learning Framework Frameworks implemented here • Scalable architecture to support different PPA API Custom RTL to accelerate (Performance, Power, Area) mature CV algorithms for requirements 24/7 low power operation Scalable Mature CV Block ProgrammableProgrammable ScalableScalable HW Accelerator I/F to extend HW acceleration Programmable MACs • High utilization MACs for DNN EngineEngine MACsMACs for application specific purpose Engine Mature CV Block • Scalable architecture to HW Accelerator support different PPA Intelligent (Performance, Power, Area) Customer Defined Data Fabrics requirements HW Accelerator For data synchronization between threads/blocks and minimize DDR BW • Handles other DNN functions Local Memory / Cache (normalization, pooling,…) • Handles pruning, compression, batching… to increase MAC utilization and decrease DDR BW © Copyright Khronos Group 2016 - Page 30 Use Case Study: Faster RCNN • One of state-of-the-art object detection CNNs - Network configuration Programmabl NN Cores - 300 region proposals e Engine - 60M parameters, 30G MACs/image TP Fabric • Collaborative computing in VIP - NN Engine: number crunching Local Memory / Cache - Convolution layers, full connected layers - Tensor Processing Fabric: data rearrangementOutput of a layer is queued in DDR if next layer is on different processing engine - LRN, max pooling, ROI pooling - Programmable Engine: precision compute - RPN, NMS, softmax • Real-time performance (30fps HD) under 0.5W @ 28nm HPM
© Copyright Khronos Group 2016 - Page 31 Scalable Performance from VIP8000
VIP8000 Series VIP Nano VIP8000UL VIP8000L VIP8000 Number of Execution cores 1 2 4 8 Clock Frequency SVT @WC125C (MHz) 800 800 800 800 HD Performance@800MHz Perspective Warping 60 fps 120 fps 240 fps 480 fps Optical Flow LK 30 fps 60 fps 120 fps 240 fps Pedestrian Detection 30 fps 60 fps 120 fps 240 fps Convolutional Neural Network (AlexNet) 125 fps 250 fps 500 fps 1000 fps
• Scalable performance with SAME application code on different processor variants
© Copyright Khronos Group 2016 - Page 32 Comprehensive Multi-Media/Pixel Subsystems
VPUCore GC Core VIP Core ZSP CPU ISP (Video) (GPU) (Vision/Image) (Audio/Voice)
AXI Bus VeriSilicon Universal Compression
Local Bus
DMA DC Wireless Mixed IO Controller Memory Controller Engine (Display) Connectivity Signal IPs
Company Proprietary and Confidential © Copyright Khronos Group 2016 - Page 33 [email protected]
© Copyright Khronos Group 2016 - Page 34