CTHPC 2016 Workshop on Compiler Techniques and System Software for High-Performance and Embedded Computing

Khronos computing and graphics APIs in smart devices

May 27, 2016

Bor-Sung Liang, Ph.D [email protected]

© Copyright Khronos Group 2016 - Page 1 http://accelerateyourworld.org/

© Copyright Khronos Group 2016 - Page 2 Khronos Open Standards for Graphics and Compute

1990’s Workhorse cross-platform professional 3D apps & gaming

2000’s Ubiquitous mobile gaming & graphics apps

2005 Safety Critical Graphics

2008 Heterogeneous parallel compute

Portable intermediate representation for graphics and parallel compute 2014

High-efficiency GPU graphics and compute for performance critical apps 2015

© Copyright Khronos Group 2016 - Page 3 Khronos Connects Software to Silicon

Industry Consortium creating OPEN STANDARD APIs for hardware acceleration Any company is welcome – one company one vote

ROYALTY-FREE specifications Software Conformance Tests and Adopters State-of-the art IP framework protects Programs for specification integrity members AND the standards and cross-vendor portability

Low- silicon APIs needed on almost every platform: International, non-profit organization graphics, parallel compute, Silicon Membership and Adopters fees cover rich media, vision, sensor operating and engineering expenses and camera processing Strong industry momentum 100s of man years invested by industry experts Well over a BILLION people use Khronos APIs Every Day…

© Copyright Khronos Group 2016 - Page 4 PROMOTER MEMBERS

Over 130 members worldwide Any company is welcome to join

Members in East Asia Members in Taiwan © Copyright Khronos Group 2016 - Page 5 Khronos Cooperative Framework

API Working Groups (Industry and Academic members) $

Conformance Tests and Royalty-Free Documentation, Tools Educator Guidelines $ Adopters Program Specifications SDKs, Code Samples, Courseware Materials

Members Wider Community

Adopters Developers Educators / Certifiers Build conformant Develop applications Create Courses implementation and products using the APIs Training and Certification

© Copyright Khronos Group 2016 - Page 6 Evolution of Vulkan, OpenCL, SPIR-V

© Copyright Khronos Group 2016 - Page 7 “Khronos” “Vulkan” Chronos (Khronos)

Chronos (Khronos) is the personification of Time in pre-Socratic philosophy and later literature.

Greek: Χρόνος, "time," also transliterated as Chronos , Khronos or Latinised as Chronus

Source: Wikipedia

Vulcan

Vulcan is the god of fire including the fire of volcanoes, metalworking, and the forge in ancient Roman religion and myth.

Latin: Volcānus or Vulcānus

Source: Wikipedia

8 Example: Apple Application Processor

SoC A4 A5 A6 A7 A8 A8x A9 A9x

Year 2010 2011 2012 2013 2014 2014 2015 2015

Proc. 45nm 32nm 32nm 28nm 20nm 20nm 16nm(T)/14nm(S) 16nm(T)

Size 53.3 mm2 69.6 mm2 96.71 mm2 102 mm2 89 mm2 128 mm2 104.5 / 96mm2 147mm2

CA8 CA9 Swift Cyclone Cyclone Cyclone Twister Twister Single-core Dual-core Dual-core Dual-core Dual-core Triple-core dual-core dual-core CPU 1.0 GHz 1.0 GHz 1.3 GHz 1.3GHz 1.4 GHz 1.5 GHz 1.85 GHz 2.16~2.26 GHz

CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU CPU

SGX535 SGX543MP2 SGX543MP3 G6430 GX6450 GXA6850 GT7600 7XT Series Single-core dual-core Tri-core Quad-core Quad-core Octa-core hexa-core dodeca-core 200 MHz 250 MHz 266 MHz 450 MHz 450 MH 450 MHz 450 MHz custom design [ GPU 2 GFLOPS 16 GFLOPS 25.5 GFLOPS 115.2 GFLOPS 115.2 GFLOPS 230.4 GFLOPS 172.8 GFLOPS 345.6GFlops

GPU GPU GPU GPU GPU GPU GPU GPU GPU GPU GPU GPU GPU GPU GPU GPU GPU GPU GPU GPU GPU GPU GPU GPU GPU GPU GPU GPU GPU GPU GPU GPU GPU GPU GPU GPU GPU GPU GPU GPU 9 Source: Wikipedia, Apple, Chipworks, TechInsights and public information on internet Performance Gap between CPU & GPU

Driver

Driver overhead affects performance!!

Example: Apple iPad family 10

* Source: Apple, public information on internet New Low overhead API

iOS OSX iOS

Application Application responsible for Traditional graphics memory allocation and thread drivers include significant management to generate command context, memory and error Driver buffers management Overhead Direct GPU Control

GPU GPU

11 Low Overhead API in PC Direct X11 (Traditional) AMD Mantle DirectX 12 Low Overhead API Low Overhead API

Source: GamerBoy PLAYS, Directx 11 VS Mantle VS Directx 12 3Dmark API Overhead Test R9 280x Catalyst 15.200.1012.2 Win 10TP https://www.youtube.com/watch?v=tTh1xOhADuA 12

2014 Status

RenderScript

RSC GLSL OpenCL C Kernel Language Language Kernel language

SPIR

RenderScript OGS/ES OpenCL API /Runtime API /Runtime API /Runtime

13 2016 Status

Low-Overhead Low-Overhead Metal RenderScript

RSC GLSL GLSL OpenCL C/C++ C++ Kernel Language Shader Language Shader Language Kernel language

SPIR-V SPIR-V

Metal RenderScript Vulkan OGS/ES OpenCL API/Runtime API /Runtime API /Runtime API /Runtime API /Runtime

14 Low overhead 3D Graphics API The Genesis of Vulkan

Khronos members from all segments of the graphics industry Significant proposals, IP contributions agree the need for new and engineering effort from many Khronos’ first API generation cross-platform GPU API working group members ‘hard launch’

Including an unprecedented level of 18 months Specification, Conformance Tests, SDKs - all open source…

participation from developers A high-energy Reference Materials, Compiler front-ends, Samples… working group effort Multiple Conformant Drivers on multiple OS

Vulkan Working Group Participants

© Copyright Khronos Group 2016 - Page 16 The Need for a New Generation GPU API • Explicit - Open up the high-level driver abstraction to give direct, low-level GPU control • Streamlined - Faster performance, lower overhead, less latency • Portable - Cloud, desktop, console, mobile and embedded • Extensible - Platform for rapid innovation

OpenGL has evolved over 25 years and GPUs are increasingly programmable and GPUs will accelerate graphics, compute, vision continues to meet industry needs – but there is compute capable + platforms are becoming and deep learning across diverse platforms: a need for a complementary API approach mobile, memory-unified and multi-core FLEXIBILITY and PORTABILITY are key

© Copyright Khronos Group 2016 - Page 17 Vulkan Explicit GPU Control

Vulkan Benefits

Simpler drivers: Improved efficiency/performance Application Reduced CPU bottlenecks Single thread per context Lower latency Application Increased portability Memory allocation Thread management Language Front-end Resource management in app code: Multi-threaded generation Compilers Less hitches and surprises High-level Driver of command buffers Initially GLSL Abstraction Multi-queue work Command Buffers: submission Context management Command creation can be multi-threaded Memory allocation Multiple CPU cores increase performance Full GLSL compiler SPIR-V pre-compiled Error detection Graphics, compute and DMA queues: Layered GPU Control Thin Driver Work dispatch flexibility Explicit GPU Control Loadable debug and SPIR-V Pre-compiled Shaders: validation layers No front-end compiler in driver GPU GPU Future shading language flexibility Loadable Layers No error handling overhead in Vulkan 1.0 provides access to production code OpenGL ES 3.1 / OpenGL 4.X-class GPU functionality but with increased performance and flexibility

© Copyright Khronos Group 2016 - Page 18 Next Generation GPU APIs

Only Windows 10 Only Apple

Cross Platform

© Copyright Khronos Group 2016 - Page 19 Vulkan - No Compromise Performance

Potential Performance Gain

Retains Traditional Binding Model (but missing functionality such as Tessellation and Geometry Shaders)

Amount of work to port from traditional OpenGL and OpenGL ES

© Copyright Khronos Group 2016 - Page 20 Which Developers Should Use Vulkan? • Vulkan puts more work and responsibility into the application - Not every developer will need or want to make that extra investment • For many developers OpenGL and OpenGL ES will remain the most effective API - Khronos actively evolving OpenGL and OpenGL ES in parallel with Vulkan

Can your graphics Is your app Yes work creation be CPU bound? parallelized?

You’ll do whatever Yes Yes it takes to squeeze out maximum Yes performance No You put You can a premium on manage your avoiding Yes graphics resource hitches allocations

Vulkan provides more choice to developers and can be used to create new classes of end-user experience © Copyright Khronos Group 2016 - Page 21 Vulkan vs OpenGL ES

Source: Imagination, PowerVR Rogue GPUs running Gnome Horde demo (Vulkan prototype)

https://www.youtube.com/watch?v=P_I8an8jXuM

Fish demo (nVidia)

• Vulkan and OpenGL ES 3.1 • Can change - # of schools of fish - # of fish per school - # of fish per drawcall • Worker threads create command buffers in Vulkan mode • Reports - Drawcalls/sec - FPS - CPU time per thread - GPU time Source, nVidia, Vulkan API - Multithreaded Fish Demo Download • Android and Windows https://www.youtube.com/watch?v=6DpJY1izxc4 • will be available soon

© Copyright Khronos Group 2016 - Page 23 200K Fishies, 100 fish per draw call drawcalls / sec 200,000 7x = 2K Draw call @ 1 frame 180,000 = 60K Draw call @ 30 frame 160,000 = 120K Draw call @ 60 frame 140,000 120,000 = 180K Draw call @ 90 frame 100,000 OpenGL ES Vulkan 80,000 60,000 40,000 1.5x 20,000 1.2x 0 Geforce GTX 980 SHIELD Android TV SHIELD Tablet K1

© Copyright Khronos Group 2016 - Page 24 200K Fishies, 1 fish per draw call drawcalls / sec 18,000,000 19x = 200K draw calls @ 1 frame 16,000,000 = 6,000K draw calls @ 30 frame 14,000,000 = 12,000K draw calls @ 60 frame 12,000,000

10,000,000 = 18,000K draw calls @ 90 frame OpenGL ES

8,000,000 Vulkan 6,000,000

4,000,000 6x 2,000,000 5x 0 Geforce GTX 980 SHIELD Android TV SHIELD Tablet K1

© Copyright Khronos Group 2016 - Page 25 The Power of a Three Layer Ecosystem

Applications using game engines Application will automatically benefit from Application uses Vulkan’s enhanced performance Applications utility libraries to can use Vulkan speed directly for development maximum Game Engines flexibility and fully optimized control Utility libraries over Vulkan and layers

Rich Area for Innovation • Many utilities and layers will be in open source • Layers to ease transition from OpenGL • Domain specific flexibility

Similar ecosystem dynamic as WebGL A widely pervasive, powerful, flexible foundation layer enables diverse middleware tools and libraries

© Copyright Khronos Group 2016 - Page 26 OpenCL Computing API

© Copyright Khronos Group 2016 - Page 27 OpenCL Ecosystem

OpenCL 2.2 – Top to Bottom C++ Hardware Implementers Working Group Members Desktop/Mobile/Embedded/FPGA Apps/Tools/Tests/Courseware

Single Source C++ Programming

Core API and Language Specs

Portable Kernel Intermediate Language

© Copyright Khronos Group 2016 - Page 28 OpenCL 2.2 • Provisional - seeking industry feedback before finalization at SIGGRAPH or SC16 • OpenCL C++ kernel language into core • SPIR-V 1.1 adds OpenCL C++ support • SYCL 2.2 fully leverages OpenCL 2.2 from a single source file • Runs on any OpenCL 2.0-capable hardware OpenCL C++ Kernel Language SPIR-V 1.1 with C++ support SYCL 2.2 for single source C++

SPIR-V in Core Subgroups into core Subgroup query operations Shared Virtual Memory clCloneKernel On-device dispatch Low-latency device Generic Address Space timer queries 3-component vectors Enhanced Image Support Device partitioning Additional image formats C11 Atomics Separate compilation and linking Multiple hosts and devices Pipes Enhanced image support Buffer region operations Android ICD Built-in kernels / custom devices Enhanced event-driven execution Enhanced DX and OpenGL Interop Additional OpenCL C built-ins Improved OpenGL data/event interop

Dec08 18 months Jun10 18 months Nov11 24 months Nov13 24 months Nov15 7months May16 OpenCL 1.0 OpenCL 1.1 OpenCL 1.2 OpenCL 2.0 OpenCL 2.1 OpenCL 2.2 Specification Specification Specification Specification Specification PROVISIONAL

© Copyright Khronos Group 2016 - Page 29 OpenCL Implementations

1.0 | May09 1.1 | Jul11 1.2 | Jun12 1.0 | Aug09 1.1 | Aug10 1.2 | May12 2.0 | Dec14

1.0 | May10 1.1 | Feb11 Desktop 1.1 |Mar11 1.2 | Dec12 2.0 | Jul14

1.0 | May09 1.1 | Jun10 1.2 | May15

1.1 | Aug12 1.2 | Mar16

1.0 | Feb11 1.2 | Sep13 Mobile

1.1 | Nov12 1.2 | Apr14 2.0 | Nov15

1.1 | Apr12 1.2 | Dec14

1.2 | Sep14

1.1 | May13 1.0 | Jan10 Embedded 1.2 | May15 1.2 | Aug15 1.0 | Jul13 FPGA 1.0 | Dec14

Vendor timelines are Dec08 Jun10 Nov11 Nov13 Nov15 first implementation of OpenCL 1.0 OpenCL 1.1 OpenCL 1.2 OpenCL 2.0 OpenCL 2.1 each spec generation Specification Specification Specification Specification Specification © Copyright Khronos Group 2016 - Page 30 Add Compute to Vulkan? In Discussion…

Desktop Use cases: Video and Image Processing, Gaming Compute HPC, SciViz, Datacenter Roadmap: Vulkan interop, arbitrary precision for Use case: Numerical Simulation, Virtualization increased performance, pre-emption, collective Roadmap: enhanced streaming processing, programming and improved execution model enhanced library support FPGAs Use cases: Network and Stream Processing Vulkan Compute? Roadmap: enhanced execution model, self- Gaming Compute, Pixel Processing, Inference synchronized and self-scheduled graphs, fine- grained synchronization between kernels, Fine grain graphics and compute (no interop needed) DSL in C++ SPIR-V for shading language flexibility – C/C++ Low-latency, fine grain run-time Google Android adoption Embedded Competes well with Metal (=C++/OpenCL 1.2) Use cases: Signal and Pixel Processing Roadmap: arbitrary precision, SVM, Roadmap: arbitrary precision for power efficiency, hard real-time scheduling, dynamic parallelism, pre-emption and QoS scheduling asynch DMA

Vulkan Lessons 1. Engine developer insights were essential during design Mobile 2. Engine prototyping during design was essential during design Use case: Photo and Vision Processing 3. Open sourcing tests, tools, specs drives deeper community engagement Roadmap: arbitrary precision for 4. Explicit API – supports strong middleware ecosystem inference engine and pixel processing efficiency, pre- BUT its ‘just’ a GPU API – still need OpenCL! emption and QoS scheduling for power efficiency

© Copyright Khronos Group 2016 - Page 31 Possible OpenCL Evolution

Increasing language expressiveness Guaranteeing degrees of forward progress Definitions of concurrency

Evolution of OpenCL … … filling the gap between imprecise HLL and imperfect hardware

Increasing parallel hardware flexibility Execution and memory model enhancements Pre-emption, virtual memory, on-device dispatch, synchronization

Should OpenCL evolve to focus on the things that ONLY OpenCL can do… 1. Enable low-level, explicit access to heterogeneous hardware – needed by languages and libraries 2. Provide efficient runtime coordination of tasks, resources, scheduling on target hardware 3. Leverage, synergize and co-exist with Vulkan compute – and learn from Vulkan … 4. Define feature sets so target hardware does not have to implement inappropriate functionality 5. Adopt layered tools architecture to drive tools momentum and decrease run-time overhead 6. Leave usability, portability and performance portability to higher levels in the ecosystem

© Copyright Khronos Group 2016 - Page 32 SPIR – Standard Portable Intermediate Representation

© Copyright Khronos Group 2016 - Page 33 SPIR-V Ecosystem

Khronos has open sourced Open source C++ front-end released these tools and translators GLSL https://github.com/KhronosGroup/SPIR/tree/spirv-1.1

Khronos plans to open Third party kernel and source these tools soon shader Languages OpenCL C OpenCL C++

SPIR-V Tools LLVM to SPIR-V SPIR-V Validator Bi-directional Translator Other SPIR-V (Dis)Assembler Intermediate LLVM Forms

SPIR-V • Khronos defined and controlled cross-API intermediate language • Native support for graphics and parallel constructs • 32-bit Word Stream • Extensible and easily parsed • Retains data object and control IHV Driver flow information for effective Runtimes code generation and translation © Copyright Khronos Group 2016 - Page 34 SPIR-V serves many APIs

Vulkan OpenCL

● Need ongoing coherent design. Talk to us early. ● Aim to serve everyone with common features ● Updates need to be ratified by SPIR WG and client API WG. OpenGL

© Copyright Khronos Group 2016 - Page 35 Evolution of SPIR Family • SPIR–V is first fully specified Khronos-defined SPIR standard - Does not use LLVM to isolate from LLVM roadmap changes - Includes full flow control, graphics and parallel constructs beyond LLVM - Khronos will open source SPIR-V <-> LLVM conversion tools

SPIR 1.2 SPIR 2.0 SPIR-V 1.0

100% Khronos defined LLVM Interaction Uses LLVM 3.2 Uses LLVM 3.4 Round-trip lossless conversion

Compute Constructs Metadata/Intrinsics Metadata/Intrinsics Native

Graphics Constructs No No Native

Supported OpenCL C 1.2 / 2.0 OpenCL C 1.2 Language OpenCL C 1.2 OpenCL C++ OpenCL C 2.0 Feature Set GLSL OpenCL 1.2 OpenCL 2.0 OpenCL 2.1 OpenCL Ingestion Extension Extension Core Vulkan Ingestion - - Vulkan Core

© Copyright Khronos Group 2016 - Page 36 Support for Both SPIR-V and LLVM • LLVM is an SDK, not a formally defined standard - Khronos moved away from trying to use LLVM IR as a standard - Issues with versioning, metadata, etc. • But LLVM is a treasure chest of useful transforms - SPIR-V tools can encapsulation and use LLVM to do useful SPIR-V transforms • SPIR-V tools can all use different rules – and there will be lots of these - May be lossy and only support SPIR-V subset - Internal form is not standardized - May hide LLVM version, metadata GLSL OpenCL C HLSL OpenCL C++

‘Rendezvous’ format Transform Tool for interchange Tool-encapsulated - Compression Native expression of graphics SPIR-V - Optimization and parallel functionality for LLVM - Stripping Khronos APIs - Linker/Merger

Driver © Copyright Khronos Group 2016 - Page 37 Next Step

© Copyright Khronos Group 2016 - Page 38 Khronos Open Standards for Graphics and Compute

LATEST STATUS

Workhorse cross-platform professional 3D apps & gaming New Extensions to enable latest 1990’s desktop graphics capabilities

Ubiquitous mobile gaming & graphics apps OpenGL ES 3.2 released to bring 2000’s AEP functionality to core

Safety Critical Graphics New Safety Critical Working 2005 Group – Call for Participation

Heterogeneous parallel compute OpenCL 2.0 specification update 2008 and C++ Headers released

Portable intermediate representation Provisional Spec Update and for graphics and parallel compute 2014 significant open source activity

High-efficiency GPU graphics and Adopted by Android and other compute for performance critical apps 2015 platforms. Building ecosystem

© Copyright Khronos Group 2016 - Page 39 Roadmap Possibilities

SPIR-V Ingestion for OpenGL and OpenGL ES for shading language flexibility

Thin and predictable graphics and compute for 1. C++ Shading Language safety critical 2. Single source C++ systems Programming from SYCL 3. OpenCL-class Heterogeneous Compute to Vulkan runtime

© Copyright Khronos Group 2016 - Page 40 Virtual Reality Will Influence Graphics APIs • The ability to generate ‘Presence’ is becoming achievable at reasonable cost - Using visual input to generate subconscious belief in a virtual situation - PC-based AND mobile systems • VR Requirements will affect how graphics APIs generate visual imagery - Control over generation of stereo pairs – slightly different view for each eye - Optical system geometric correction in rendering path - Reduce latency through elimination of in-driver buffering - Asynchronously warp framebuffer for instantaneous response to head movement Samsung Gear VR

Many companies driving the VR revolution are participating in the Vulkan working group © Copyright Khronos Group 2016 - Page 41 Khronos Standards for Advanced Processing

3D File formats for AUTHORING and TRANSMISSION of 3D runtime assets

Low-power vision processing for tracking, odometry and scene analysis 3D Graphics for Portable display of augmentations

Image: ScreenMedia and visualizations on every platform

Heterogeneous Processing Acceleration e.g. Neural Net Processing for scene understanding

© Copyright Khronos Group 2016 - Page 42 CTHPC 2016 Workshop on Compiler Techniques and System Software for High-Performance and Embedded Computing

Thank You!

Bor-Sung Liang, Ph.D [email protected]

© Copyright Khronos Group 2016 - Page 43