Agenda

• What is SYCL and why Intel needs it
• SYCL language/API
• SYCL programming model overview
• Kernel execution model
• “Hello world” example
• Language evolution
• DPC++/SYCL compilers
• App compilation and execution flow
• SYCL implementations
• References

Optimization Notice: Copyright © 2019, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.

Heterogeneous Compute Platforms

- A modern platform includes:
  - One or more CPUs
  - One or more GPUs
  - DSP processors
  - FPGAs
- Individual processors have many (possibly heterogeneous) cores

Programmers want to write a single portable program that uses ALL resources in the heterogeneous platform.

[Diagram: Intel Ice Lake architecture]

What is SYCL?

Single-source heterogeneous programming using STANDARD C++
▪ Use C++ templates and lambda functions for host & device code
▪ Aligns the hardware acceleration of OpenCL with the direction of the C++ standard

Developer Choice: the development of the two specifications is aligned so code can be easily shared between the two approaches.

▪ C++ Kernel Language: low-level control; ‘GPGPU’-style separation of device-side kernel source code and host code
▪ Single-source C++: programmer familiarity; approach also taken by C++ AMP and OpenMP

Why SYCL? Reactive and Proactive Motivation:

Reactive to OpenCL pros and cons:
• OpenCL has a well-defined, portable execution model.
• OpenCL is prohibitively verbose for many application developers.
• OpenCL remains a C API and only recently supported C++ kernels.
• The just-in-time compilation model and disjoint source code are awkward and contrary to HPC usage models.

Proactive about future C++:
• SYCL is based on purely modern C++ and should feel familiar to C++11 users.
• SYCL is expected to run ahead of C++Next regarding heterogeneity and parallelism. ISO C++ of tomorrow may look a lot like SYCL.
• Not held back by C99 or C++03 compatibility goals.


Terminology: SYCL platform

A SYCL implementation consists of:
▪ A Host: makes SYCL API calls
▪ SYCL Devices: run SYCL kernels
▪ The Host and SYCL Devices make up a SYCL Platform
▪ One system may have multiple installed SYCL Platforms (example: Intel CPU platform + GPU platform)
▪ The SYCL Host Device is a native C++ implementation of the SYCL API on the host

[Diagram: SYCL Platform containing the Host, a CPU SYCL Device, a GPU SYCL Device, an FPGA SYCL Device, and some other SYCL Device]

SYCL Software Architecture

Applications create SYCL Command Queues per SYCL Device
▪ Can create multiple Command Queues for a device, e.g. one per host thread
▪ No cross-device Command Queues, i.e. no automatic load balancing

Applications execute SYCL Command Groups via Command Queues
▪ Example commands: copy memory, fill buffers, launch kernels, etc.

[Diagram: CPU Queues and a GPU Queue feeding the CPU SYCL Device, GPU SYCL Device, FPGA SYCL Device, and some other SYCL Device]

Kernel Execution Model

• Explicit ND-range for control
• 1-, 2-, or 3-dimensional range
• Kernel code is SPMD/SIMT

[Diagram: ND-range: global work space divided into work-groups of work-items; GEN11 slice architecture]
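The decomposition above can be sketched host-side in plain C++. This is index arithmetic only, not the SYCL API; the `NdIndex` struct and function names are illustrative:

```cpp
#include <cassert>
#include <cstddef>

// Host-side sketch of ND-range index arithmetic: a global range is
// divided into work-groups, and a work-item's global id is derived
// from its group id and its local id within the group.
struct NdIndex {
    std::size_t group;  // which work-group
    std::size_t local;  // position within the work-group
};

// Global id of a work-item, as the runtime would compute it.
std::size_t global_id(const NdIndex& i, std::size_t wg_size) {
    return i.group * wg_size + i.local;
}

// Number of work-groups for a given global size (in SYCL 1.2.1 the
// work-group size must divide the global size evenly).
std::size_t num_groups(std::size_t global_size, std::size_t wg_size) {
    return global_size / wg_size;
}
```

For example, a 1-D global range of 16 with work-group size 4 yields 4 work-groups, and the item at (group 2, local 3) has global id 11.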

Subgroups

• Additional level of the execution model hierarchy: exposes the SIMD unit
• A GEN EU can execute 16 FP32 floating-point operations per clock [2 ALUs × SIMD-4 × 2 ops (Add + Mul)]

[Diagram: sub-group decomposition of a work-group: global work space → work-groups → sub-groups → work-items]
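The sub-group decomposition can be sketched as plain index arithmetic (illustrative host-side C++, not the SYCL API):

```cpp
#include <cassert>
#include <cstddef>

// How a work-group decomposes into sub-groups: the extra hierarchy
// level that maps onto the SIMD unit.
std::size_t num_sub_groups(std::size_t wg_size, std::size_t sg_size) {
    // Round up: a trailing partial sub-group still counts.
    return (wg_size + sg_size - 1) / sg_size;
}

// Which sub-group a work-item with the given local id belongs to.
std::size_t sub_group_id(std::size_t local_id, std::size_t sg_size) {
    return local_id / sg_size;
}
```

E.g. a work-group of 64 items with sub-group size 16 contains 4 sub-groups, and local id 17 falls in sub-group 1.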

SYCL Constructs to Describe Parallelism

1. Basic data parallel: parallel_for( num_work_items )

2. Explicit ND-Range: parallel_for( NDRange( global_size, work_group_size ) )
▪ Enables SPMD/SIMT coding for OpenCL and CUDA experts

3. Hierarchical parallelism:
   parallel_for_work_group( num_work_groups ) {
       parallel_for_work_item( items_per_work_group )
   }
▪ Exploits scope to control execution granularity
▪ Similar to Kokkos hierarchical parallelism

4. Single task: single_task() {}
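The semantics of construct 3 can be sketched serially: the outer body runs once per work-group, and each nested inner body runs once per work-item of that group. This is a plain C++ sketch; the function names are illustrative, not SYCL's:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Serial model of hierarchical parallelism: outer loop over
// work-groups, inner loop over the items of each group.
void for_each_work_group(std::size_t num_groups,
                         const std::function<void(std::size_t)>& body) {
    for (std::size_t g = 0; g < num_groups; ++g) body(g);
}

void for_each_work_item(std::size_t items_per_group,
                        const std::function<void(std::size_t)>& body) {
    for (std::size_t i = 0; i < items_per_group; ++i) body(i);
}
```

Usage: 2 work-groups of 4 items cover a global range of 8, each item writing its global id.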

Hierarchical parallelism (logical view)

parallel_for_work_group (…) {
  …
  parallel_for_work_item (…) { … }
}

▪ Fundamentally a top-down expression of parallelism
▪ Many embedded features and details, not covered here

Memory Model

• Private Memory: per work-item (kernel invocation)
• Local Memory: shared within a work-group
• Global/Constant Memory: visible to all work-groups
• Host Memory: on the host

Memory management is explicit. We must explicitly move data from host -> global -> local and back.

[Diagram: private memory per work-item; local memory per workgroup; global/constant memory per compute device; host memory on the host]

SYCL Device/Kernel Features

Supported features:
₊ templates
₊ classes
₊ operator overloading
₊ static polymorphism
₊ lambdas
₊ short vector types (2/3/4/8/16-wide)
₊ rich library of built-in functions

Unsupported features:
₋ dynamic memory allocation
₋ dynamic polymorphism
₋ runtime type information
₋ exception handling
₋ function pointers*
₋ pointer structure members*
₋ static variables*

SYCL Vector Add Example

void vector_add(const std::vector<float> &A, const std::vector<float> &B, std::vector<float> &C) {
    ...
}

Let’s compute C = A + B on a GPU!

SYCL Vector Add Example

void vector_add(const std::vector<float> &A, const std::vector<float> &B, std::vector<float> &C) {
    sycl::buffer<float, 1> d_A{ A.data(), sycl::range<1>(A.size()) };
    sycl::buffer<float, 1> d_B{ B.data(), sycl::range<1>(B.size()) };
    sycl::buffer<float, 1> d_C{ C.data(), sycl::range<1>(C.size()) };

    ...
}

Step 1: create buffers (represent both host and device memory)

SYCL Vector Add Example

void vector_add(const std::vector<float> &A, const std::vector<float> &B, std::vector<float> &C) {
    sycl::buffer<float, 1> d_A{ A.data(), sycl::range<1>(A.size()) };
    sycl::buffer<float, 1> d_B{ B.data(), sycl::range<1>(B.size()) };
    sycl::buffer<float, 1> d_C{ C.data(), sycl::range<1>(C.size()) };

    sycl::gpu_selector device_selector;
    sycl::queue q(device_selector);

    ...
}

Step 2: create a device queue (the developer can specify a device type via a device selector or use the default selector)

SYCL Vector Add Example

void vector_add(const std::vector<float> &A, const std::vector<float> &B, std::vector<float> &C) {
    sycl::buffer<float, 1> d_A{ A.data(), sycl::range<1>(A.size()) };
    sycl::buffer<float, 1> d_B{ B.data(), sycl::range<1>(B.size()) };
    sycl::buffer<float, 1> d_C{ C.data(), sycl::range<1>(C.size()) };

    sycl::gpu_selector device_selector;
    sycl::queue q(device_selector);

    ...
}

Step 2: create a device queue (the developer can specify a device type via a device selector or use the default selector)

Multiple queues can be created to target multiple devices from the same (or multiple) threads.

SYCL Vector Add Example

void vector_add(const std::vector<float> &A, const std::vector<float> &B, std::vector<float> &C) {
    sycl::buffer<float, 1> d_A{ A.data(), sycl::range<1>(A.size()) };
    sycl::buffer<float, 1> d_B{ B.data(), sycl::range<1>(B.size()) };
    sycl::buffer<float, 1> d_C{ C.data(), sycl::range<1>(C.size()) };

    sycl::gpu_selector device_selector;
    sycl::queue q(device_selector);

    q.submit([&](sycl::handler &h) {
        ...
    });
}

Step 3: submit a command for (asynchronous) execution

SYCL Vector Add Example

void vector_add(const std::vector<float> &A, const std::vector<float> &B, std::vector<float> &C) {
    sycl::buffer<float, 1> d_A{ A.data(), sycl::range<1>(A.size()) };
    sycl::buffer<float, 1> d_B{ B.data(), sycl::range<1>(B.size()) };
    sycl::buffer<float, 1> d_C{ C.data(), sycl::range<1>(C.size()) };

    sycl::gpu_selector device_selector;
    sycl::queue q(device_selector);

    q.submit([&](sycl::handler &h) {
        auto acc_A = d_A.get_access<sycl::access::mode::read>(h);
        auto acc_B = d_B.get_access<sycl::access::mode::read>(h);
        auto acc_C = d_C.get_access<sycl::access::mode::write>(h);

        ...
    });
}

Step 4: create buffer accessors to access buffer data on the device

SYCL Vector Add Example

void vector_add(const std::vector<float> &A, const std::vector<float> &B, std::vector<float> &C) {
    sycl::buffer<float, 1> d_A{ A.data(), sycl::range<1>(A.size()) };
    sycl::buffer<float, 1> d_B{ B.data(), sycl::range<1>(B.size()) };
    sycl::buffer<float, 1> d_C{ C.data(), sycl::range<1>(C.size()) };

    sycl::gpu_selector device_selector;
    sycl::queue q(device_selector);

    q.submit([&](sycl::handler &h) {
        auto acc_A = d_A.get_access<sycl::access::mode::read>(h);
        auto acc_B = d_B.get_access<sycl::access::mode::read>(h);
        auto acc_C = d_C.get_access<sycl::access::mode::write>(h);

        ...
    });
}

Step 4: create buffer accessors to access buffer data on the device

Accessors build a DAG that triggers data movement and represents execution dependencies.

SYCL Vector Add Example

void vector_add(const std::vector<float> &A, const std::vector<float> &B, std::vector<float> &C) {
    sycl::buffer<float, 1> d_A{ A.data(), sycl::range<1>(A.size()) };
    sycl::buffer<float, 1> d_B{ B.data(), sycl::range<1>(B.size()) };
    sycl::buffer<float, 1> d_C{ C.data(), sycl::range<1>(C.size()) };

    sycl::gpu_selector device_selector;
    sycl::queue q(device_selector);

    q.submit([&](sycl::handler &h) {
        auto acc_A = d_A.get_access<sycl::access::mode::read>(h);
        auto acc_B = d_B.get_access<sycl::access::mode::read>(h);
        auto acc_C = d_C.get_access<sycl::access::mode::write>(h);

        h.parallel_for<class VectorAdd>(d_A.get_range(), [=](sycl::id<1> I) {
            ...
        });
    });
}

Step 5: send a kernel (lambda) for execution

SYCL Vector Add Example

void vector_add(const std::vector<float> &A, const std::vector<float> &B, std::vector<float> &C) {
    sycl::buffer<float, 1> d_A{ A.data(), sycl::range<1>(A.size()) };
    sycl::buffer<float, 1> d_B{ B.data(), sycl::range<1>(B.size()) };
    sycl::buffer<float, 1> d_C{ C.data(), sycl::range<1>(C.size()) };

    sycl::gpu_selector device_selector;
    sycl::queue q(device_selector);

    q.submit([&](sycl::handler &h) {
        auto acc_A = d_A.get_access<sycl::access::mode::read>(h);
        auto acc_B = d_B.get_access<sycl::access::mode::read>(h);
        auto acc_C = d_C.get_access<sycl::access::mode::write>(h);

        h.parallel_for<class VectorAdd>(d_A.get_range(), [=](sycl::id<1> I) {
            ...
        });
    });
}

Kernel invocations are executed in parallel.

SYCL Vector Add Example

void vector_add(const std::vector<float> &A, const std::vector<float> &B, std::vector<float> &C) {
    sycl::buffer<float, 1> d_A{ A.data(), sycl::range<1>(A.size()) };
    sycl::buffer<float, 1> d_B{ B.data(), sycl::range<1>(B.size()) };
    sycl::buffer<float, 1> d_C{ C.data(), sycl::range<1>(C.size()) };

    sycl::gpu_selector device_selector;
    sycl::queue q(device_selector);

    q.submit([&](sycl::handler &h) {
        auto acc_A = d_A.get_access<sycl::access::mode::read>(h);
        auto acc_B = d_B.get_access<sycl::access::mode::read>(h);
        auto acc_C = d_C.get_access<sycl::access::mode::write>(h);

        h.parallel_for<class VectorAdd>(d_A.get_range(), [=](sycl::id<1> I) {
            ...
        });
    });
}

The template parameter specifies the kernel name.

SYCL Vector Add Example

void vector_add(const std::vector<float> &A, const std::vector<float> &B, std::vector<float> &C) {
    sycl::buffer<float, 1> d_A{ A.data(), sycl::range<1>(A.size()) };
    sycl::buffer<float, 1> d_B{ B.data(), sycl::range<1>(B.size()) };
    sycl::buffer<float, 1> d_C{ C.data(), sycl::range<1>(C.size()) };

    sycl::gpu_selector device_selector;
    sycl::queue q(device_selector);

    q.submit([&](sycl::handler &h) {
        auto acc_A = d_A.get_access<sycl::access::mode::read>(h);
        auto acc_B = d_B.get_access<sycl::access::mode::read>(h);
        auto acc_C = d_C.get_access<sycl::access::mode::write>(h);

        h.parallel_for<class VectorAdd>(d_A.get_range(), [=](sycl::id<1> I) {
            ...
        });
    });
}

The kernel is invoked for each element of the range.

SYCL Vector Add Example

void vector_add(const std::vector<float> &A, const std::vector<float> &B, std::vector<float> &C) {
    sycl::buffer<float, 1> d_A{ A.data(), sycl::range<1>(A.size()) };
    sycl::buffer<float, 1> d_B{ B.data(), sycl::range<1>(B.size()) };
    sycl::buffer<float, 1> d_C{ C.data(), sycl::range<1>(C.size()) };

    sycl::gpu_selector device_selector;
    sycl::queue q(device_selector);

    q.submit([&](sycl::handler &h) {
        auto acc_A = d_A.get_access<sycl::access::mode::read>(h);
        auto acc_B = d_B.get_access<sycl::access::mode::read>(h);
        auto acc_C = d_C.get_access<sycl::access::mode::write>(h);

        h.parallel_for<class VectorAdd>(d_A.get_range(), [=](sycl::id<1> I) {
            ...
        });
    });
}

Each kernel invocation has access to its invocation id.

SYCL Vector Add Example

void vector_add(const std::vector<float> &A, const std::vector<float> &B, std::vector<float> &C) {
    sycl::buffer<float, 1> d_A{ A.data(), sycl::range<1>(A.size()) };
    sycl::buffer<float, 1> d_B{ B.data(), sycl::range<1>(B.size()) };
    sycl::buffer<float, 1> d_C{ C.data(), sycl::range<1>(C.size()) };

    sycl::gpu_selector device_selector;
    sycl::queue q(device_selector);

    q.submit([&](sycl::handler &h) {
        auto acc_A = d_A.get_access<sycl::access::mode::read>(h);
        auto acc_B = d_B.get_access<sycl::access::mode::read>(h);
        auto acc_C = d_C.get_access<sycl::access::mode::write>(h);

        h.parallel_for<class VectorAdd>(d_A.get_range(), [=](sycl::id<1> I) {
            acc_C[I] = acc_A[I] + acc_B[I];
        });
    });
}

Step 6: write a kernel

SYCL Vector Add Example

void vector_add(const std::vector<float> &A, const std::vector<float> &B, std::vector<float> &C) {
    sycl::buffer<float, 1> d_A{ A.data(), sycl::range<1>(A.size()) };
    sycl::buffer<float, 1> d_B{ B.data(), sycl::range<1>(B.size()) };
    sycl::buffer<float, 1> d_C{ C.data(), sycl::range<1>(C.size()) };

    sycl::gpu_selector device_selector;
    sycl::queue q(device_selector);

    q.submit([&](sycl::handler &h) {
        auto acc_A = d_A.get_access<sycl::access::mode::read>(h);
        auto acc_B = d_B.get_access<sycl::access::mode::read>(h);
        auto acc_C = d_C.get_access<sycl::access::mode::write>(h);

        h.parallel_for<class VectorAdd>(d_A.get_range(), [=](sycl::id<1> I) {
            acc_C[I] = acc_A[I] + acc_B[I];
        });
    });
}

Kernel code (executed on the accelerator) lives in the same source as the rest of the program code executed on the CPU.

SYCL Vector Add Example

void vector_add(const std::vector<float> &A, const std::vector<float> &B, std::vector<float> &C) {
    sycl::buffer<float, 1> d_A{ A.data(), sycl::range<1>(A.size()) };
    sycl::buffer<float, 1> d_B{ B.data(), sycl::range<1>(B.size()) };
    sycl::buffer<float, 1> d_C{ C.data(), sycl::range<1>(C.size()) };

    sycl::gpu_selector device_selector;
    sycl::queue q(device_selector);

    q.submit([&](sycl::handler &h) {
        auto acc_A = d_A.get_access<sycl::access::mode::read>(h);
        auto acc_B = d_B.get_access<sycl::access::mode::read>(h);
        auto acc_C = d_C.get_access<sycl::access::mode::write>(h);

        h.parallel_for<class VectorAdd>(d_A.get_range(), [=](sycl::id<1> I) {
            acc_C[I] = acc_A[I] + acc_B[I];
        });
    });
}

Done! The results are copied to vector `C` at `d_C` buffer destruction.

Graph of Asynchronous Executions

Queue.submit([&](handler& h) {
    auto A = a.get_access<access::mode::read>(h);
    auto B = b.get_access<access::mode::read>(h);
    auto C = c.get_access<access::mode::write>(h);
    h.parallel_for<class add1>(range<2>{N, M}, [=](id<2> index) {
        C[index] = A[index] + B[index];
    });
});

[Diagram: A, B → add1 → C]

Graph of Asynchronous Executions

Queue.submit([&](handler& h) {
    auto A = a.get_access<access::mode::read>(h);
    auto B = b.get_access<access::mode::read>(h);
    auto C = c.get_access<access::mode::write>(h);
    h.parallel_for<class add1>(range<2>{N, M}, [=](id<2> index) {
        C[index] = A[index] + B[index];
    });
});

Queue.submit([&](handler& h) {
    auto C = c.get_access<access::mode::read>(h);
    auto D = d.get_access<access::mode::read>(h);
    auto E = e.get_access<access::mode::write>(h);
    h.parallel_for<class add2>(range<2>{P, Q}, [=](id<2> index) {
        E[index] = C[index] + D[index];
    });
});

[Diagram: A, B → add1 → C; C, D → add2 → E]

Graph of Asynchronous Executions

(same two command groups as above)

No explicit “wait” operation! The SYCL runtime is responsible for synchronization.

[Diagram: A, B → add1 → C; C, D → add2 → E]

Graph of Asynchronous Executions

Queue.submit([&](handler& h) {
    auto A = a.get_access<access::mode::read>(h);
    auto B = b.get_access<access::mode::read>(h);
    auto C = c.get_access<access::mode::write>(h);
    h.parallel_for<class add1>(range<2>{N, M}, [=](id<2> index) {
        C[index] = A[index] + B[index];
    });
});

Queue.submit([&](handler& h) {
    auto A = a.get_access<access::mode::read>(h);
    auto D = d.get_access<access::mode::read>(h);
    auto E = e.get_access<access::mode::write>(h);
    h.parallel_for<class add2>(range<2>{P, Q}, [=](id<2> index) {
        E[index] = A[index] + D[index];
    });
});

Queue.submit([&](handler& h) {
    auto C = c.get_access<access::mode::read>(h);
    auto E = e.get_access<access::mode::read>(h);
    auto F = f.get_access<access::mode::write>(h);
    h.parallel_for<class add3>(range<2>{S, T}, [=](id<2> index) {
        F[index] = C[index] + E[index];
    });
});

[Diagram: A, B → add1 → C; A, D → add2 → E; C, E → add3 → F]

• SYCL queues are out-of-order by default; data dependencies order kernel executions
• In-order queue policies will also be available to simplify porting
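The dataflow these three command groups describe can be sketched serially in plain C++ (not the SYCL API): add1 and add2 are independent, so the runtime may run them in either order or in parallel, while add3 must wait for both because its accessors read C and E:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Elementwise addition, standing in for one kernel launch.
std::vector<int> add(const std::vector<int>& x, const std::vector<int>& y) {
    std::vector<int> r(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) r[i] = x[i] + y[i];
    return r;
}

// Serial model of the dependency graph: add1 and add2 have no data
// dependency on each other; add3 consumes both of their outputs.
std::vector<int> run_graph(const std::vector<int>& A,
                           const std::vector<int>& B,
                           const std::vector<int>& D) {
    std::vector<int> C = add(A, B);  // add1
    std::vector<int> E = add(A, D);  // add2 (independent of add1)
    return add(C, E);                // add3 (depends on both)
}
```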


Framing: DPC++ and SYCL

Intel has pushed many improvements / fixes into this spec (now rev 5). DPC++ is the Intel product.

SYCL spec timeline:
▪ 2015: SYCL 1.2 (aligned with OpenCL 1.2)
▪ 2016: SYCL 2.2 provisional (aligned with OpenCL 2.2)
▪ 2017: SYCL 1.2.1 (aligned with OpenCL 1.2)
▪ 2020, 2022? …: generalized backend interoperability

SYCL 1.2.1 = 2 years of evolution over the 2.2 provisional.

Key Intel DPC++ extensions on top of SYCL

Extension                  | Implementation status
---------------------------|----------------------------
Unified Shared Memory (USM)| PoC complete
ND-range subgroups         | Done on GPU; Q3’19 on CPU
Ordered queue              | Prototyping
Function pointers          | Prototyping
Data flow pipes (spatial)  | ETA Q3’19
ND-range reductions        | TBD
Lambda naming              | Prototyping

Unified Shared Memory (USM)

Pointer-based alternative to the buffer model
▪ Defines several capability levels; higher levels require more advanced hardware
▪ Lets the programmer choose the desired level of control: explicit or implicit data movement
▪ Simplifies porting to the device: port, profile, tune cycle to get best performance
▪ Aligns with CUDA Unified Memory

Simple USM Example

…
float* a = (float*) sycl_malloc_shared(100 * sizeof(float));
float* b = (float*) sycl_malloc_shared(100 * sizeof(float));

for (int i = 0; i < 100; i++) { a[i] = func(); }

q.parallel_for(range<1>{100}, [=](id<1> i) {
    b[i] = 3.14f * a[i];
});
q.wait();

for (int i = 0; i < 100; i++) { … = b[i]; }
…

Ordered Queues

SYCL first solved the hard problem (offload compute graph with dependencies) ▪ This extension makes the simple cases trivial to code

Enable in-order work queues in DPC++ ▪ Simplifies common patterns ▪ Reduces boilerplate ▪ Reduces effort to move from CUDA to DPC++

Lambda naming

Separate compilation (different host and device compilers) is first class in SYCL
▪ Powerful, but adds language requirements that are:
  1. Annoying for users doing simple things
  2. A problem for libraries (e.g. PSTL)
  3. A problem for frameworks (e.g. Kokkos)
▪ Fixing this for typical use cases (in Intel compilers) until an ISO C++ solution is invented

With explicit kernel naming (standard SYCL 1.2.1):

class name;
int main() {
  queue().submit([&](handler& h){
    h.single_task<name>([]{…});
  });
}

With the Intel lambda-naming extension (no name needed):

int main() {
  queue().submit([&](handler& h){
    h.single_task([]{…});
  });
}

Build Time (developer workstation) vs. Execution Time (end-user system)

Regular C++ Compilation Flow

main.cpp:

clang++ main.cpp –o main.exe

#include <CL/sycl.hpp>
#include <iostream>
using namespace cl::sycl;
class Hi;
int main() {
  const size_t array_size = 16;
  int data[array_size];
  {
    buffer<int, 1> resultBuf{ data, range<1>{array_size} };
    queue q;
    q.submit([&](handler& h) {
      auto resultAcc = resultBuf.get_access<access::mode::write>(h);
      h.parallel_for<Hi>(range<1>{array_size}, [=](id<1> i) {
        resultAcc[i] = static_cast<int>(i.get(0));
      });
    });
  }
  for (int i = 0; i < array_size; i++) {
    std::cout << "data[" << i << "] = " << data[i] << std::endl;
  }
  return 0;
}

Flow: main.cpp → C++ standard compiler (MSVC, gcc, clang) → object file (main.o) → linker → executable

SYCL Mode Compilation Flow

main.cpp (same source as above):

clang++ -fsycl main.cpp –o main.exe –lOpenCL -lsycl

Flow: main.cpp → host standard compiler → host object file (main.o); device compiler → kernel IR/ISA (SPIR-V, Gen ISA) → bundler / wrapper / linker → executable

SYCL Execution Flow

Executable (ISV)
→ SYCL Runtime
→ one of:
▪ Host Device (partially supported by Intel DPC++): native execution on the Intel CPU
▪ Intel CPU OpenCL Runtime → CPU device
▪ Intel GPU OpenCL Runtime → GPU device
▪ FPGA, …

SYCL Implementations (5 are known today)

Intel Open Source Project

Only conformant SYCL implementation today ()

* Source: Aksel Alpay: https://twitter.com/illuhad/status/1083863225479892993

SYCL Implementations (2)

Codeplay hired the developer

* Source: Aksel Alpay: https://twitter.com/illuhad/status/1083863225479892993

SYCL Implementations (3)

Intel Open Source Project

Mostly

* Source: Aksel Alpay: https://twitter.com/illuhad/status/1083863225479892993

Intel DPC++ toolchain

The DPC++ compiler passes 88% of the Khronos SYCL certification tests
• Missing features: image class, hierarchical parallelism

The DPC++ toolchain includes:
• Debugger: GDB support
• Profiler: Intel VTune
• Compatibility tool to simplify migration of legacy CUDA codebases to SYCL

Please try it out and give us your feedback!

Resources

Intel internal
• https://soco.intel.com/groups/dpc-incremental-releases-for-internal/ - Intel oneAPI internal forum (includes releases, useful documents, etc.)

Public
• Intel SYCL compiler: https://github.com/intel/llvm/tree/sycl
• http://sycl.tech/ - digest of SYCL projects, articles, presentations, news, etc.
• https://www.khronos.org/sycl/ - SYCL page at Khronos
• Latest spec: SYCL 1.2.1


Use Cases: TensorFlow

“TensorFlow™ is an open source software library for numerical computation using data flow graphs.”

For heavy math computation, Eigen is used: “a C++ template library for linear algebra: matrices, vectors, numerical solvers, and related algorithms.”

Frontend architecture (TensorFlow): a math expression is represented as a compile-time type in a tree-based DSEL (Eigen's expression templates)

Backend architecture: alongside the C++, CUDA, and OpenCL back-ends, the SYCL back-end enables SYCL hardware

Physical hardware: OpenCL device, CPU, GPU

WIP links: Eigen ( https://goo.gl/0tRlSo ); TensorFlow ( https://goo.gl/fngQPb )

Use Cases: SYCL Parallel STL

SYCL Parallel STL is an implementation of the C++17 parallel algorithms from the Standard Template Library.

https://github.com/KhronosGroup/SyclParallelSTL

Function Pointers

Indirection through function pointers not available on all accelerators ▪ Enabling in DPC++ for some workloads (e.g. raytracing) ▪ Basic example:

int foo(int bar) [[device_indirectly_callable]] {
    return ++bar;
}
...
myQueue.submit([&](handler& h) {
    …
    h.parallel_for(range<1>{ 1024 }, [=](id<1> idx) {
        int (*fptr)(int) = foo;
        int result = fptr(1);
        …
    });
});

Enabling Performance

1. Explicit execution model familiar to developers

2. Ability to tune work and scratchpad sizes for architecture/application: parallel_for(nd_range<2>(range<2>(6, 4), range<2>(3, 2)))

3. Primitives libraries

4. Exposing hardware features
▪ SIMD and other intrinsics
▪ Extensions for device-specific features

[Diagram: sub-group decomposition of a work-group]

OpenCL Model: Explicit ND-Range (2D)

parallel_for( NDRange( global_size, work_group_size ) )

▪ Explicit global + work-group sizes
▪ Direct performance tuning for the architecture

myQueue.submit([&](handler & h) {
  stream os(1024, 80, h);
  range<2> global_range(6, 4);
  range<2> local_range(3, 2);
  h.parallel_for(nd_range<2>(global_range, local_range),
    [=] (nd_item<2> index) {
      os << index << "\n";
    });
});

[Diagram: 6×4 global range divided into 3×2 work-groups]

GEN11 Memory Model

[Diagram: GEN11 memory hierarchy: private memory per work-item; local memory per workgroup; global/constant memory per compute device; host; GEN11 slice architecture]
