Agenda
• What is SYCL and why Intel needs it
• SYCL language/API
• SYCL programming model overview
• Kernel execution model
• "Hello world" example
• Language evolution
• DPC++/SYCL compilers
• App compilation and execution flow
• SYCL implementations
• References
Optimization Notice: Copyright © 2019, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.

Heterogeneous Compute Platforms
A modern platform includes:
- One or more CPUs
- One or more GPUs
- DSP processors
- FPGAs
Individual processors have many (possibly heterogeneous) cores.

Programmers want to write a single portable program that uses ALL resources in the heterogeneous platform.

[Diagram: Intel Ice Lake architecture]

What is SYCL?
Single-source heterogeneous programming using STANDARD C++
▪ Use C++ templates and lambda functions for host & device code
▪ Aligns the hardware acceleration of OpenCL with the direction of the C++ standard
Developer Choice: the development of the two specifications is aligned so code can be easily shared between the two approaches
C++ Kernel Language
▪ Low-level control
▪ 'GPGPU'-style separation of device-side kernel source code and host code

Single-source C++
▪ Programmer familiarity
▪ Approach also taken by C++ AMP and OpenMP
Why SYCL? Reactive and Proactive Motivation:
Reactive to OpenCL pros and cons:
• OpenCL has a well-defined, portable execution model.
• OpenCL is prohibitively verbose for many application developers.
• OpenCL remains a C API and only recently supported C++ kernels.
• The just-in-time compilation model and disjoint source code are awkward and contrary to HPC usage models.

Proactive about future C++:
• SYCL is based on purely modern C++ and should feel familiar to C++11 users.
• SYCL is expected to run ahead of C++Next regarding heterogeneity and parallelism; the ISO C++ of tomorrow may look a lot like SYCL.
• Not held back by C99 or C++03 compatibility goals.
Terminology: SYCL platform
A SYCL implementation consists of:
▪ A Host: makes SYCL API calls
▪ SYCL Devices: run SYCL kernels
▪ The Host and the SYCL Devices make up a SYCL Platform
▪ One system may have multiple installed SYCL Platforms
  Example: Intel Platform (CPU) + NVIDIA Platform (GPU)
▪ The SYCL Host Device is a native C++ implementation of the SYCL API on the host

[Diagram: a SYCL Platform containing the Host plus CPU, GPU, FPGA, and other SYCL Devices]
SYCL Software Architecture
Applications create SYCL Command Queues per SYCL Device
▪ Multiple Command Queues can be created for one device, e.g. one per thread
▪ No cross-device Command Queues, i.e. no automatic load balancing

Applications execute SYCL Command Groups via Command Queues
▪ Example commands: copy memory, fill buffers, launch kernels, etc.

[Diagram: CPU and GPU queues on the Host feeding the CPU, GPU, FPGA, and other SYCL Devices]
Kernel Execution Model
• Explicit ND-range for control
• 1, 2, or 3 dimensional range
• Kernel code is SPMD/SIMT

[Diagram: ND-range, a global work space divided into work-groups of work-items]
GEN11 slice architecture diagram
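The decomposition above comes down to simple index arithmetic. The sketch below is plain C++ for illustration only, not the SYCL API; the helper names `global_id` and `num_groups` are invented for the sketch:

```cpp
#include <cstddef>

// In an ND-range, each work-item's global id is determined by its
// work-group id, the work-group size, and its local id within the group.
std::size_t global_id(std::size_t group_id, std::size_t local_id,
                      std::size_t local_size) {
  return group_id * local_size + local_id;
}

// Number of work-groups launched for a given global size and work-group
// size (in SYCL 1.2.1 the global size must divide evenly).
std::size_t num_groups(std::size_t global_size, std::size_t local_size) {
  return global_size / local_size;
}
```

For example, a 1-D range of 12 work-items with work-group size 4 launches 3 work-groups, and work-item 1 of work-group 2 has global id 9 (what `nd_item::get_global_id()` would report).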
Subgroups
• Additional level of execution-model hierarchy that gives access to the SIMD unit
• A GEN EU can execute 16 FP32 operations per clock [2 ALUs x SIMD-4 x 2 ops (Add + Mul)]
[Diagram: sub-group decomposition of a work-group, with work-items grouped into sub-groups within the global work space]
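The sub-group decomposition can be modeled the same way. A plain C++ sketch, assuming work-items map onto SIMD lanes in consecutive local-id order (the helper names are invented; the actual mapping is implementation-defined):

```cpp
#include <cstddef>

// How many sub-groups a work-group of wg_size splits into on hardware
// with SIMD lanes of width simd_width (rounding up for a partial group).
std::size_t sub_group_count(std::size_t wg_size, std::size_t simd_width) {
  return (wg_size + simd_width - 1) / simd_width;
}

// Which sub-group a work-item belongs to, given its local id.
std::size_t sub_group_id(std::size_t local_id, std::size_t simd_width) {
  return local_id / simd_width;
}

// The work-item's lane within its sub-group.
std::size_t sub_group_local_id(std::size_t local_id, std::size_t simd_width) {
  return local_id % simd_width;
}
```

E.g. a work-group of 16 work-items on SIMD-8 hardware decomposes into 2 sub-groups; the work-item with local id 11 is lane 3 of sub-group 1.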
SYCL Constructs to Describe Parallelism
1. Basic data parallel: parallel_for( num_work_items )

2. Explicit ND-range: parallel_for( NDRange( global_size, work_group_size ) )
   ▪ Enables SPMD/SIMT coding for OpenCL and CUDA experts

3. Hierarchical parallelism:
   parallel_for_work_group( num_work_groups ) {
     parallel_for_work_item( items_per_work_group )
   }
   ▪ Exploits scope to control execution granularity
   ▪ Similar to Kokkos hierarchical parallelism

4. Single task: single_task() {}
Hierarchical parallelism (logical view)
parallel_for_work_group (…) {
  …
  parallel_for_work_item (…) {
    …
  }
}
▪ Fundamentally a top-down expression of parallelism
▪ Many embedded features and details, not covered here
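The logical semantics of the two nested constructs can be modeled as ordinary nested loops. A sequential plain C++ sketch (invented names, only to illustrate the scopes): the outer loop body runs once per work-group, the inner loop body once per work-item, and code between them executes at work-group scope.

```cpp
#include <cstddef>
#include <vector>

// Model of the hierarchical form: each "work-group" computes a value at
// group scope, then each "work-item" in it writes one output element.
std::vector<int> hierarchical_fill(std::size_t num_groups,
                                   std::size_t items_per_group) {
  std::vector<int> out(num_groups * items_per_group);
  for (std::size_t g = 0; g < num_groups; ++g) {        // parallel_for_work_group
    int group_offset = static_cast<int>(g) * 100;       // runs once per group
    for (std::size_t i = 0; i < items_per_group; ++i) { // parallel_for_work_item
      out[g * items_per_group + i] = group_offset + static_cast<int>(i);
    }
  }
  return out;
}
```

With 2 groups of 3 items this produces {0, 1, 2, 100, 101, 102}: the group-scope value is shared by all items of a group, which is exactly what the scoping rules of the hierarchical form guarantee.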
Memory Model
• Private Memory: per work-item (kernel invocation)
• Local Memory: shared within a work-group
• Global/Constant Memory: visible to all work-groups on the compute device
• Host Memory: on the host

Memory management is explicit: we must explicitly move data from host -> global -> local and back.

[Diagram: work-items with private memory inside work-groups with local memory, above global/constant memory on the compute device and host memory on the host]
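That staging can be illustrated with a plain C++ model of a per-work-group reduction (a sketch with invented names; no real SYCL memory spaces are involved): each work-group copies its slice from "global" memory into a "local" scratch buffer, reduces it through "private" storage, and writes its partial sum back to "global" memory.

```cpp
#include <cstddef>
#include <vector>

// One partial sum per work-group, staged global -> local -> private -> global.
std::vector<int> group_partial_sums(const std::vector<int>& global_in,
                                    std::size_t wg_size) {
  std::size_t groups = global_in.size() / wg_size;
  std::vector<int> global_out(groups, 0);
  for (std::size_t g = 0; g < groups; ++g) {
    std::vector<int> local(wg_size);            // "local memory", per group
    for (std::size_t i = 0; i < wg_size; ++i)   // global -> local copy
      local[i] = global_in[g * wg_size + i];
    int sum = 0;                                // "private memory" accumulator
    for (std::size_t i = 0; i < wg_size; ++i)
      sum += local[i];
    global_out[g] = sum;                        // local result -> global
  }
  return global_out;
}
```

On a real device the pay-off is that the many accesses during the reduction hit fast local memory instead of global memory.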
SYCL Device/Kernel Features
Supported features:
₊ templates
₊ classes
₊ operator overloading
₊ static polymorphism
₊ lambdas
₊ short vector types (2/3/4/8/16-wide)
₊ rich library of built-in functions

Unsupported features:
₋ dynamic memory allocation
₋ dynamic polymorphism
₋ runtime type information
₋ exception handling
₋ function pointers*
₋ pointer structure members*
₋ static variables*
SYCL Vector Add Example

The example is built up step by step:

void vector_add(const std::vector<float> &A,
                const std::vector<float> &B,
                std::vector<float> &C) {
  // Step 1: create device buffers over the host data
  sycl::buffer<float, 1> d_A(A.data(), sycl::range<1>(A.size()));
  sycl::buffer<float, 1> d_B(B.data(), sycl::range<1>(B.size()));
  sycl::buffer<float, 1> d_C(C.data(), sycl::range<1>(C.size()));

  // Step 2: select a device and create a command queue for it
  sycl::gpu_selector device_selector;
  sycl::queue q(device_selector);

  // Step 3: submit a command group for (asynchronous) execution
  q.submit([&](sycl::handler &h) {
    // Step 4: create buffer accessors; accessors create the DAG that
    // triggers data movement and represents execution dependencies
    auto acc_A = d_A.get_access<sycl::access::mode::read>(h);
    auto acc_B = d_B.get_access<sycl::access::mode::read>(h);
    auto acc_C = d_C.get_access<sycl::access::mode::write>(h);

    // Step 5: launch a data-parallel kernel, one work-item per element
    h.parallel_for<class VectorAdd>(sycl::range<1>(A.size()),
                                    [=](sycl::id<1> i) {
      acc_C[i] = acc_A[i] + acc_B[i];
    });
  });
} // buffer destructors wait for kernel completion and copy results back
Graph of Asynchronous Executions

Three command groups are submitted to the same queue. The accessors each group requests define a DAG of execution dependencies: add1 and add2 are independent and may run concurrently, while add3 depends on their results.

Queue.submit([&](handler& h) {
  auto A = a.get_access<…>(h);
  …
  h.parallel_for<class add1>(…);
});

Queue.submit([&](handler& h) {
  auto A = a.get_access<…>(h);
  …
  h.parallel_for<class add2>(…);
});

Queue.submit([&](handler& h) {
  auto C = c.get_access<…>(h);
  …
  h.parallel_for<class add3>(…);
});
Framing: DPC++ and SYCL
Intel has pushed many improvements / fixes into this spec (now rev5); DPC++ is the Intel product built on it.

SYCL specs:
▪ SYCL 1.2 (2015), aligned with OpenCL 1.2
▪ SYCL 2.2 provisional (2016), aligned with OpenCL 2.2
▪ SYCL 1.2.1 (2017), aligned with OpenCL 1.2; SYCL 1.2.1 = 2 years of evolution over the 2.2 provisional
▪ SYCL 2020 / 2022? …: generalized backend interoperability
Key Intel DPC++ extensions on top of SYCL
Extensions and implementation status:
▪ Unified shared memory (USM): PoC complete
▪ ND-range subgroups: done on GPU, Q3'19 on CPU
▪ Ordered queue: prototyping
▪ Function pointers: prototyping
▪ Data flow pipes (spatial): ETA Q3'19
▪ ND-range reductions: TBD
▪ Lambda naming: prototyping
Unified Shared Memory (USM)
A pointer-based alternative to the buffer model
▪ Defines several capability levels; higher levels require more advanced hardware
▪ Lets the programmer choose the desired level of control:
  – explicit data movement APIs
  – implicit data movement
▪ Simplifies porting to the device: port, profile, tune cycle to get best performance
▪ Aligns with CUDA Unified Memory
Simple USM Example
…
float* a = (float*) sycl_malloc_shared(100 * sizeof(float));
float* b = (float*) sycl_malloc_shared(100 * sizeof(float));
for (int i = 0; i < 100; i++) { a[i] = func(); }
// pointers from sycl_malloc_shared are usable directly in kernels,
// with no buffers or accessors required:
q.parallel_for<…>(…);
…
Ordered Queues
SYCL first solved the hard problem (offload compute graph with dependencies) ▪ This extension makes the simple cases trivial to code
Enables in-order work queues in DPC++
▪ Simplifies common patterns
▪ Reduces boilerplate
▪ Reduces effort to move from CUDA to DPC++
Lambda naming
Separate compilation (different host and device compilers) is first class in SYCL
▪ Powerful, but adds language requirements that are:
  1. annoying for users doing simple things
  2. a problem for libs (e.g. PSTL)
  3. a problem for frameworks (e.g. Kokkos)
▪ Fixing for typical use cases (in Intel compilers) until an ISO C++ solution is invented
SYCL 1.2.1 (kernel name required):

class name;
int main() {
  queue().submit([&](handler& h){
    h.single_task<class name>([]{…});
  });
}

DPC++ extension (unnamed kernel lambdas):

int main() {
  queue().submit([&](handler& h){
    h.single_task([]{…});
  });
}
Build Time (developer workstation), Execution Time (end-user system)
Regular C++ Compilation Flow

[Diagram: main.cpp (with its #includes) -> C++ compiler -> object file -> Linker -> Executable!]
SYCL Mode Compilation Flow

clang++ -fsycl main.cpp -o main.exe -lOpenCL -lsycl

[Diagram: main.cpp -> host compiler -> object file, and -> device compiler -> kernel IR/ISA (SPIR-V, Gen ISA); a bundler/wrapper/linker then produces a fat Executable! embedding the kernels (SPIR-V? ISA?)]
SYCL Execution Flow

[Diagram: Executable (ISV) -> SYCL Runtime -> either the Host Device (a native C++ implementation; the host device is partially supported by Intel DPC++) or a per-device OpenCL Runtime: Intel CPU OpenCL Runtime -> CPU Device, Intel GPU OpenCL Runtime -> GPU Device, plus FPGA, …]
SYCL Implementations (5 are known today)
▪ Intel open source project
▪ Only conformant SYCL implementation today (Codeplay)

* Source: Aksel Alpay: https://twitter.com/illuhad/status/1083863225479892993

SYCL Implementations (2)
▪ Codeplay hired the developer

SYCL Implementations (3)
▪ Intel open source project

Intel DPC++ toolchain
The DPC++ compiler passes 88% of the Khronos SYCL certification tests
• Missing features: image class, hierarchical parallelism

The DPC++ toolchain includes:
• Debugger: GDB support
• Profiler: Intel VTune
• A compatibility tool to simplify migration of legacy CUDA codebases to SYCL
Please try it out and give us your feedback!
Resources
Intel internal
• https://soco.intel.com/groups/dpc-incremental-releases-for-internal/ - Intel oneAPI internal forum (includes releases, useful documents, etc.)

Public
• Intel SYCL compiler: https://github.com/intel/llvm/tree/sycl
• http://sycl.tech/ - digest of SYCL projects, articles, presentations, news, etc.
• https://www.khronos.org/sycl/ - SYCL page at Khronos
• Latest spec: SYCL 1.2.1
Use Cases: TensorFlow “TensorFlow™ is an open source software library for numerical computation using data flow graphs.” For heavy math computation Eigen is used - “a C++ template library for linear algebra: matrices, vectors, numerical solvers, and related algorithms.”
Frontend architecture (TensorFlow): a math expression is represented as a compile-time type via a tree-based DSEL (Eigen's expression templates).

Backend architecture: the SYCL back-end enables OpenCL hardware alongside the existing C++ and CUDA back-ends.

WIP links: Eigen ( https://goo.gl/0tRlSo ); TensorFlow ( https://goo.gl/fngQPb )
Use Cases: SYCL Parallel STL
SYCL Parallel STL is an implementation of the C++17 parallel algorithms from the Standard Template Library.
https://github.com/KhronosGroup/SyclParallelSTL
Function Pointers
Indirection through function pointers is not available on all accelerators
▪ Being enabled in DPC++ for some workloads (e.g. ray tracing)
▪ Basic example:
int foo(int bar) [[device_indirectly_callable]] { return ++bar; }
...
myQueue.submit([&](handler& h) {
  …
  h.parallel_for<…>(…);
});
Enabling Performance
1. Explicit execution model familiar to developers
2. Ability to tune work and scratchpad sizes for architecture/application
   parallel_for( nd_range<2>( range<2>(6, 4), range<2>(3, 2) ) )
3. Primitives libraries
[Diagram: sub-group decomposition of a work-group]

4. Exposing hardware features
   ▪ SIMD and other intrinsics
   ▪ Extensions for device-specific features
OpenCL Model: Explicit ND-Range (2D)

parallel_for( NDRange( global_size, work_group_size ) )
▪ Explicit global + work-group sizes
▪ Direct performance tuning for the architecture

myQueue.submit([&](handler& h) {
  stream os(1024, 80, h);
  range<2> global_range(6, 4);
  range<2> local_range(3, 2);
  h.parallel_for<…>(nd_range<2>(global_range, local_range), …);
});

[Diagram: a 6x4 global range tiled into 3x2 work-groups]
GEN11 Memory Model
[Diagram: GEN11 memory model - work-items with private memory grouped into work-groups sharing local memory, above global/constant memory on the compute device and the host; shown alongside the GEN11 slice architecture diagram]