Silicon Acceleration Apis Embedded Technology 2016, Yokohama
Total Page:16
File Type:pdf, Size:1020Kb
Silicon Acceleration APIs Embedded Technology 2016, Yokohama Neil Trevett Vice President Developer Ecosystem, NVIDIA | President, Khronos [email protected] | @neilt3d November 2016 © Copyright Khronos Group 2016 - Page 1 Khronos and Open Standards Software Silicon Khronos is an Industry Consortium of over 100 companies We create royalty-free, open standard APIs for hardware acceleration of Graphics, Parallel Compute, Neural Networks and Vision © Copyright Khronos Group 2016 - Page 2 Accelerated API Landscape Vision Frameworks Neural Net Libraries OpenVX Neural Net Extension High-level Language-based Acceleration Frameworks 3D Graphics Explicit Kernels DSP Dedicated Hardware GPU FPGA © Copyright Khronos Group 2016 - Page 3 OpenCL – Low-level Parallel Programing • Low level programming of heterogeneous parallel compute resources - One code tree can be executed on CPUs, GPUs, DSPs and FPGA • OpenCL C language to write kernel programs to execute on any compute device - Platform Layer API - to query, select and initialize compute devices - Runtime API - to build and execute kernels programs on multiple devices • New in OpenCL 2.2 - OpenCL C++ kernel language - a static subset of C++14 - Adaptable and elegant sharable code – great for building libraries - Templates enable meta-programming for highly adaptive software - Lambdas used to implement nested/dynamic parallelism GPU OpenCL CPU KernelOpenCL Kernel code DSP Runtime API CodeKernelOpenCL compiled for CPU loads and executes CPU CodeKernelOpenCL FPGA CodeKernel devices kernels across devices Devices Code Host © Copyright Khronos Group 2016 - Page 4 OpenCL Conformant Implementations 1.0 | May09 1.1 | Jul11 1.2 | Jun12 1.0 | Aug09 1.1 | Aug10 1.2 | May12 2.0 | Dec14 1.0 | May10 1.1 | Feb11 Desktop 1.1 |Mar11 1.2 | Dec12 2.0 | Jul14 2.1 | Jun15 1.0 | May09 1.1 | Jun10 1.2 | May15 1.1 | Aug12 1.2 | Mar16 1.0 | Feb11 1.2 | Sep13 Mobile 1.1 | Nov12 1.2 | Apr14 2.0 | Nov15 1.1 | Apr12 1.2 | Dec14 1.2 | Sep14 1.1 | May13 1.0 | Jan10 Embedded 1.2 | May15 1.2 | Aug15 1.0 | Jul13 FPGA 1.0 | Dec14 Vendor timelines are Dec08 Jun10 Nov11 Nov13 Nov15 first implementation of OpenCL 1.0 OpenCL 1.1 OpenCL 1.2 OpenCL 2.0 OpenCL 2.1 each spec generation Specification Specification Specification Specification Specification © Copyright Khronos Group 2016 - Page 5 SYCL for OpenCL • Single-source heterogeneous programming using STANDARD C++ - Use C++ templates and lambda functions for host & device code • Aligns the hardware acceleration of OpenCL with direction of the C++ standard - C++14 with open source C++17 Parallel STL hosted by Khronos Developer Choice The development of the two specifications are aligned so code can be easily shared between the two approaches C++ Kernel Language Single-source C++ Low Level Control Programmer Familiarity ‘GPGPU’-style separation of Approach also taken by device-side kernel source C++ AMP and OpenMP code and host code © Copyright Khronos Group 2016 - Page 6 Embedded Vision Acceleration Embedded Technology, Yokohama Shorin Kyo, Huawei © Copyright Khronos Group 2016 - Page 7 OpenVX – Low Power Vision Acceleration • Targeted at vision acceleration in real-time, mobile and embedded platforms - Precisely defined API for production deployment • Higher abstraction than OpenCL for performance portability across diverse architectures - Multi-core CPUs, GPUs, DSPs and DSP arrays, ISPs, Dedicated hardware… • Extends portable vision acceleration to very low power domains - Doesn’t require high-power CPU/GPU Complex or OpenCL precision Vision Engine Vision Processing X100 Dedicated Efficiency Middleware Hardware Application Vision DSPs X10 GPU Compute Multi-core CPU Power Efficiency Power X1 GPU DSP Computation Flexibility Hardware © Copyright Khronos Group 2016 - Page 8 OpenVX Graphs • OpenVX developers express a graph of image operations (‘Nodes’) - Nodes can be on any hardware or processor coded in any language • Graphs can execute almost autonomously - Possible to Minimize host interaction during frame-rate graph execution • Graphs are the key to run-time optimization opportunities… Pyr Camera OpenVX Nodes t Rendering Input Output RGB YUV Gray Array of Frame Color Frame Channel Frame Image Optical Keypoints Harris Conversion Extract Pyramid Flow Track Array of Features OpenVX Graph Ftrt-1 Feature Extraction Example Graph © Copyright Khronos Group 2016 - Page 9 OpenVX Efficiency through Graphs.. Graph Memory Kernel Data Scheduling Management Merge Tiling Execute a sub- Split the graph Reuse Replace a sub- graph at tile execution across pre-allocated graph with a granularity instead the whole system: memory for single faster node of image CPU / GPU / multiple granularity dedicated HW intermediate data Faster execution Less allocation overhead, Better memory Better use of or lower power more memory for locality, less kernel data cache and consumption other applications launch overhead local memory © Copyright Khronos Group 2016 - Page 10 Example Relative Performance 10 Relative Performance 9 8.7 8 OpenCV (GPU accelerated) 7 OpenVX (GPU accelerated) NVIDIA 6 implementation experience. Geometric mean of 5 >2200 primitives, grouped into each 4 categories, 2.9 running at different 3 image sizes and 2.5 parameter settings 2 1.5 1.1 1 0 Arithmetic Analysis Filter Geometric Overall © Copyright Khronos Group 2016 - Page 11 Layered Vision Processing Ecosystem Implementers may use OpenCL or Compute Shaders to implement OpenVX nodes on programmable processors And then developers can use OpenVX to enable a developer to easily connect those nodes into a graph AMD OpenVX - Open source, highly optimized Application for x86 CPU and OpenCL for GPU - “Graph Optimizer” looks at entire processing pipeline and removes/replaces/merges functions to improve C/C++ performance and bandwidth - Scripting for rapid prototyping, without re-compiling, at Programmable Vision Dedicated Vision production performance levels Processors Hardware http://gpuopen.com/compute-product/amd-openvx/ OpenVX enables the graph to be extended to include hardware architectures that don’t support programmable APIs The OpenVX graph enables implementers to optimize execution across diverse hardware architectures an drive to lower power implementations © Copyright Khronos Group 2016 - Page 12 OpenVX 1.0 Shipping, OpenVX 1.1 Released! • Multiple OpenVX 1.0 Implementations shipping – spec in October 2014 - Open source sample implementation and conformance tests available • OpenVX 1.1 Specification released 2nd May 2016 - Expands node functionality AND enhances graph framework • OpenVX is EXTENSIBLE - Implementers can add their own nodes at any time to meet customer and market needs = shipping implementations © Copyright Khronos Group 2016 - Page 13 OpenVX Neural Net Extension • Convolution Neural Network topologies can be represented as OpenVX graphs - Layers are represented as OpenVX nodes - Layers connected by multi-dimensional tensors objects - Layer types include convolution, activation, pooling, fully-connected, soft-max - CNN nodes can be mixed with traditional vision nodes • Import/Export Extension - Efficient handling of network Weights/Biases or complete networks • The specification is provisional - Welcome feedback from the deep learning community Vision Node Native Vision Vision Downstream Camera Application Node Node Control CNN Nodes Processing An OpenVX graph mixing CNN nodes with traditional vision nodes © Copyright Khronos Group 2016 - Page 14 NNEF - Neural Network Exchange Format NN Authoring Framework 1 Inference Engine 1 NN Authoring Framework 2 Inference Engine 2 NN Authoring Framework 3 Inference Engine 3 NN Authoring Framework 1 Inference Engine 1 Neural Net NN Authoring Framework 2 Exchange Inference Engine 2 Format NN Authoring Framework 3 Inference Engine 3 NNEF encapsulates neural network structure, data formats, commonly used operations (such as convolution, pooling, normalization, etc.) and formal network semantics NNEF 1.0 is currently being defined OpenVX will import NNEF files © Copyright Khronos Group 2016 - Page 15 Tessellation and geometry shaders 32-bit integers and floats ASTC Texture Compression OpenGL ES NPOT, 3D/depth textures Floating point render targets Vertex and Texture arrays Compute Shaders Debug and robustness for security Fixed function Multiple Render Targets Pipeline fragment shaders Epic’s Rivalry demo using full Unreal Engine 4 https://www.youtube.com/watch?v=jRr-G95GdaM Driver Silicon Silicon Driver Silicon Driver Update Update Update Update Update Update 2003 2004 2007 2012 2014 2015 2016 ES 1.0 ES 1.1 ES 2.0 ES 3.0 ES 3.1 ES 3.2 AEP N L Close to 2 Billion OpenGL ES devices shipped in 2015 OpenGL ES 2.0 OpenGL ES 3.x Vertex and Fragment Shaders Compute Shaders http://hwstats.unity3d.com/mobile/gpu.html © Copyright Khronos Group 2016 - Page 16 The Key Principle of Vulkan: Explicit Control • Application tells the driver what it is going to do You have to supply a LOT of information! - In enough detail that driver doesn’t have to guess - When the driver needs to know it And, you have to supply it early • In return, driver promises to do - What the application asks for - When it asks for it - Very quickly Efficient, predictable performance • No driver magic – no surprises © Copyright Khronos Group 2016 - Page 17 Vulkan Explicit GPU Control Vulkan Benefits Resource management in app code: Less driver hitches and surprises Application Single thread per context Simpler drivers: Application Improved efficiency/performance Memory allocation Reduced CPU bottlenecks Thread management