Cross-Architecture Programming for Accelerated Compute, Freedom of Choice for Hardware
Introduction to heterogeneous programming with Data Parallel C++

One Intel Software & Architecture group Intel Architecture, Graphics & Software March 2021

All information provided in this deck is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.

Agenda

▪ What is DPC++?
▪ Intel Compilers
▪ “Hello World” Example
▪ Basic Concepts: buffer, accessor, queue, kernel, etc.
▪ Error Handling
▪ Device Selection
▪ Compilation and Execution Flow
▪ Demo/Lab – part I

▪ Unified Shared Memory
▪ Sub-groups
▪ Demo/Lab – part II

2 What is DPC++

3 Data Parallel C++ Standards-based, Cross-architecture Language

DPC++ = ISO C++ and Khronos SYCL and community extensions

Freedom of Choice: Future-Ready Programming Model

Direct Programming with Data Parallel C++:
▪ Allows code reuse across hardware targets
▪ Permits custom tuning for a specific accelerator
▪ Open, cross-industry alternative to proprietary languages

DPC++ = ISO C++ and Khronos SYCL and community extensions
▪ Delivers C++ productivity benefits, using common, familiar C and C++ constructs
▪ Adds SYCL from the Khronos Group for data parallelism and heterogeneous programming

Community Project Drives Language Enhancements ▪ Provides extensions to simplify data parallel programming ▪ Continues evolution through open and cooperative development

The open source and Intel DPC++/C++ compilers support Intel CPUs, GPUs, and FPGAs. Codeplay has announced a DPC++ compiler that targets Nvidia GPUs.

Intel® oneAPI DPC++/C++ Compiler: Parallel Programming Productivity & Performance

Compiler to deliver uncompromised parallel programming productivity and performance across CPUs and accelerators

▪ Allows code reuse across hardware targets, while permitting custom tuning for a specific accelerator
▪ Open, cross-industry alternative to single-architecture proprietary languages
▪ Builds upon Intel’s decades of experience in architecture and high-performance compilers

oneAPI DPC++/C++ Compiler and Runtime (DPC++ source code → Clang/LLVM → DPC++ runtime → CPU / GPU / FPGA):
▪ Compiler: https://github.com/intel/llvm
▪ Runtime: https://github.com/intel/compute-runtime

Code samples: tinyurl.com/dpcpp-tests tinyurl.com/oneapi-samples

There will still be a need to tune for each architecture.

DPC++ Extensions
tinyurl.com/dpcpp-ext tinyurl.com/sycl2020-spec

Extension | Purpose | In SYCL 2020
USM (Unified Shared Memory) | Pointer-based programming | ✓
Sub-groups | Cross-lane operations | ✓
Reductions | Efficient parallel primitives | ✓
Work-group collectives | Efficient parallel primitives | ✓
Pipes | Spatial data flow support |
Argument restrict | Optimization |
Optional lambda name for kernels | Simplification | ✓
In-order queues | Simplification | ✓
Class template argument deduction and simplification | Simplification | ✓

tinyurl.com/openmp-and-dpcpp

Intel® Compilers

Intel Compiler | Target | OpenMP Support | OpenMP Offload Support | Included in oneAPI Toolkit
Intel® C++ Compiler, IL0 (icc) | CPU | Yes | No | HPC
Intel® oneAPI DPC++/C++ Compiler (dpcpp) | CPU, GPU, FPGA* | Yes | Yes | Base
Intel® oneAPI DPC++/C++ Compiler (icx) | CPU, GPU* | Yes | Yes | Base
Intel® Fortran Compiler, IL0 (ifort) | CPU | Yes | No | HPC
Intel® Fortran Compiler Beta (ifx) | CPU, GPU* | Yes | Yes | HPC

Cross-compiler binary compatible and linkable!

tinyurl.com/oneAPI-Base-download tinyurl.com/oneAPI-HPC-download

8 Codeplay Launched Data Parallel C++ Compiler for Nvidia GPUs

▪ Developers can retarget and reuse code between NVIDIA and Intel compute accelerators from a single source base
▪ Codeplay is the first oneAPI industry contributor to implement a developer tool based on oneAPI specifications
▪ They leveraged the DPC++ LLVM-based open source project that Intel established
▪ Codeplay is a key driver of the Khronos SYCL standard, upon which DPC++ is based
▪ More details in the Codeplay blog post
▪ Build the DPC++ toolchain with support for NVIDIA CUDA: tinyurl.com/dpcpp-cuda-be tinyurl.com/dpcpp-cuda-be-webinar

Other News:
▪ Codeplay Brings NVIDIA GPU Support to Industry-Standard Math Library
▪ Intel Open Sources the oneAPI Math Kernel Library Interface

Other names and brands may be claimed as the property of others. 9 “Hello World” Example

11 DPC++ Basics

[Diagram] DPC++ execution model: a DPC++ application contains host code, executed on the host (CPU) with its host memory, and device code. The host submits command groups through command queues to a device; device code executes on the device's compute units (CUs), where each work-item has private memory, each work-group shares local memory, and global memory is visible to the whole device.

12 Anatomy of a DPC++ Application

#include <CL/sycl.hpp>
using namespace sycl;

int main() {
  std::vector<int> A(1024), B(1024), C(1024);
  // some data initialization
  {                                                   // host code
    buffer bufA {A}, bufB {B}, bufC {C};
    queue q;
    q.submit([&](handler &h) {
      auto A = bufA.get_access(h, read_only);
      auto B = bufB.get_access(h, read_only);
      auto C = bufC.get_access(h, write_only);
      h.parallel_for(1024, [=](auto i){               // accelerator device code
        C[i] = A[i] + B[i];
      });
    });
  }
  for (int i = 0; i < 1024; i++)                      // host code
    std::cout << "C[" << i << "] = " << C[i] << std::endl;
}

13 Asynchronous Execution

#include <CL/sycl.hpp>
using namespace sycl;

int main() {
  std::vector<int> A(1024), B(1024), C(1024);
  // some data initialization
  {
    buffer bufA {A}, bufB {B}, bufC {C};
    queue q(gpu_selector{});
    q.submit([&](handler &cgh) {
      auto A = bufA.get_access(cgh, access::mode::read );
      auto B = bufB.get_access(cgh, access::mode::read );
      auto C = bufC.get_access(cgh, access::mode::write);
      cgh.parallel_for(1024, [=](auto i){   // enqueues the kernel to the graph; the host keeps going
        C[i] = A[i] + B[i];
      });
    });
  }
  for (int i = 0; i < 1024; i++)
    std::cout << C[i] << std::endl;
}

The graph of submitted work executes asynchronously to the host program: host code only enqueues kernels and continues.

1414 Anatomy of a DPC++ Application

#include <CL/sycl.hpp>
using namespace sycl;

int main() {                                          // application scope
  std::vector<int> A(1024), B(1024), C(1024);
  // some data initialization
  {
    buffer bufA {A}, bufB {B}, bufC {C};
    queue q;
    q.submit([&](handler &h) {                        // command group scope
      auto A = bufA.get_access(h, read_only);
      auto B = bufB.get_access(h, read_only);
      auto C = bufC.get_access(h, write_only);
      h.parallel_for(1024, [=](auto i){               // device scope
        C[i] = A[i] + B[i];
      });
    });
  }
  for (int i = 0; i < 1024; i++)                      // application scope
    std::cout << "C[" << i << "] = " << C[i] << std::endl;
}

DPC++ Basics: Buffers (same vector-add example as above)

▪ Buffers are created from host vectors/pointers: buffer bufA {A}, bufB {B}, bufC {C};
▪ Buffers encapsulate data in a SYCL application, across both devices and the host!

DPC++ Basics: Queues (same vector-add example as above)

▪ A queue submits command groups to be executed by the SYCL runtime: queue q; ... q.submit([&](handler &h) {...});
▪ The queue is the mechanism through which work is submitted to a device.

DPC++ Basics: Accessors (same vector-add example as above)

▪ Accessors are created from buffers: auto A = bufA.get_access(h, read_only);
▪ They are the mechanism by which kernels access buffer data
▪ They create the data dependences in the SYCL graph that order kernel executions

DPC++ Basics: Kernels (same vector-add example as above)

▪ The vector-add kernel is enqueued as a parallel_for task; the 1024 argument is interpreted as range<1>{1024} and the lambda parameter as an id<1>
▪ Pass a function object/lambda to be executed by each work-item

$ dpcpp vector_add.cpp
$ icx -fsycl -fsycl-unnamed-lambda vector_add.cpp

19 Graph of Kernel Executions

Automatic data and control dependence resolution!

int main() {
  auto R = range<1>{ num };
  buffer<int> A{ R }, B{ R };
  queue Q;

  Q.submit([&](handler& h) {      // Kernel 1: writes A
    accessor out(A, h, write_only);
    h.parallel_for(R, [=](id<1> idx) { out[idx] = idx[0]; });
  });

  Q.submit([&](handler& h) {      // Kernel 2: writes A
    accessor out(A, h, write_only);
    h.parallel_for(R, [=](id<1> idx) { out[idx] = idx[0]; });
  });

  Q.submit([&](handler& h) {      // Kernel 3: writes B
    accessor out(B, h, write_only);
    h.parallel_for(R, [=](id<1> idx) { out[idx] = idx[0]; });
  });

  Q.submit([&](handler& h) {      // Kernel 4: reads A, read-writes B
    accessor in(A, h, read_only);
    accessor inout(B, h);
    h.parallel_for(R, [=](id<1> idx) { inout[idx] *= in[idx]; });
  });
}

Resulting graph: Kernel 1 → Kernel 2 (through A), Kernel 2 → Kernel 4 (A), Kernel 3 → Kernel 4 (B), then program completion; the arrows are data dependences.

2020 DPC++ Functions for Invoking Kernels

// Single task: kernel function is executed exactly once, on a single work-item
h.single_task([=]() {
  // ...
});

// Basic data parallel: kernel function is executed over an n-dimensional range (ND-range)
h.parallel_for(
  range<3>(1024,1024,1024),       // using 3D in this example
  [=](id<3> myID) {
    // ...
  });

// Explicit ND-range: global range {1024,1024,1024}, work-group size {16,16,16}
h.parallel_for(
  nd_range<3>({1024,1024,1024},{16,16,16}),   // using 3D in this example
  [=](nd_item<3> myID) {
    // ...
  });

// Hierarchical parallelism
h.parallel_for_work_group(
  range<2>(1024,1024),            // using 2D in this example
  [=](group<2> grp) {
    // kernel function is executed once per work-group
    grp.parallel_for_work_item(
      range<1>(1024),             // using 1D in this example
      [=](h_item<1> myItem) {
        // kernel function is executed once per work-item
      });
  });

21 Basic Parallel Kernels

The functionality of basic parallel kernels is exposed via range, id and item classes

• The range class describes the iteration space of parallel execution
• The id class indexes an individual instance of a kernel in a parallel execution
• The item class represents an individual instance of a kernel function and exposes additional functions to query properties of the execution range

h.parallel_for(range<1>(1024), [=](id<1> idx){
  // CODE THAT RUNS ON DEVICE
});

h.parallel_for(range<1>(1024), [=](item<1> item){
  auto idx = item.get_id();
  auto R = item.get_range();
  // CODE THAT RUNS ON DEVICE
});

2222 Synchronization

23 Synchronization

▪ Synchronization within a kernel function
  • Barriers for synchronizing work-items within a work-group
  • No synchronization primitives across work-groups
▪ Synchronization between host and device
  • Call to the wait() member function of the device queue
  • Buffer destruction synchronizes the data back to host memory
  • A host accessor constructor is a blocking call: it returns only after all enqueued kernels operating on that buffer finish execution
  • DAG construction from command group function objects enqueued into the device queue
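A minimal sketch of the two host/device synchronization points listed above (queue::wait() and buffer destruction), assuming a trivial increment kernel:

#include <CL/sycl.hpp>
using namespace sycl;

int main() {
  std::vector<int> data(1024, 1);
  queue q;
  {
    buffer buf{data};                  // buffer adopts the host vector
    q.submit([&](handler &h) {
      accessor a(buf, h);
      h.parallel_for(1024, [=](auto i) { a[i] += 1; });
    });
    q.wait();                          // option 1: block the host until submitted work completes
  }                                    // option 2: buffer destruction synchronizes results back into 'data'
  std::cout << data[0] << std::endl;   // safe to read on the host here
}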

Host Accessors

▪ An accessor which uses host buffer access target

▪ Created outside of command group scope

▪ The data that this gives access to will be available on the host

▪ Used to synchronize the data back to the host by constructing the host accessor objects

25 Host Accessor

int main() {
  constexpr int N = 100;
  auto R = range<1>(N);
  std::vector<int> v(N, 10);
  queue q;
  buffer buf(v);
  q.submit([&](handler& h) {
    accessor a(buf, h);
    h.parallel_for(R, [=](auto i) {
      a[i] -= 2;
    });
  });

  host_accessor b(buf, read_only);
  for (int i = 0; i < N; i++)
    std::cout << b[i] << "\n";
  return 0;
}

▪ The buffer takes ownership of the data stored in the vector.
▪ Creating a host accessor is a blocking call: it returns only after all enqueued DPC++ kernels that modify the same buffer, in any queue, complete execution, and the data is then available to the host through the host accessor.

26 Error Handling

Error Handling

DPC++ is based on C++:
▪ Errors in C++ are handled through exceptions
▪ SYCL uses exceptions, not return codes!

Synchronous exceptions
▪ Detected immediately
▪ Use a try...catch block

Asynchronous exceptions
▪ Detected later, after an API call has returned
▪ E.g. faults inside a command group or a kernel
▪ Handled by a user-provided error handler function

29 Synchronous exceptions

▪ Thrown immediately when an API call fails • failure to construct an object, e.g. can’t create buffer ▪ Normal C++ exceptions

try {
  device_queue.reset(new queue(device_selector));
} catch (exception const& e) {
  std::cout << "Caught a synchronous SYCL exception: " << e.what();
  return;
}

30 Asynchronous exceptions

▪ Caused by a future failure
  • E.g. an error occurring during execution of a kernel on a device
  • The host program has already moved on to new things!
▪ The programmer provides a processing function, and says when to process exceptions

static auto exception_handler = [](sycl::exception_list e_list) {
  for (std::exception_ptr const &e : e_list) {
    try {
      std::rethrow_exception(e);
    } catch (std::exception const &e) {
#if _DEBUG
      std::cout << "Failure" << std::endl;
#endif
      std::terminate();
    }
  }
};

▪ queue::wait_and_throw(), queue::throw_asynchronous(), event::wait_and_throw()
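A minimal sketch of how such a handler is typically wired up (assuming the exception_handler lambda defined above): pass it to the queue constructor, then call wait_and_throw() so pending asynchronous exceptions are delivered to it.

queue q(default_selector{}, exception_handler);   // attach the asynchronous handler to the queue

q.submit([&](handler &h) {
  h.parallel_for(1024, [=](auto) { /* kernel body that may fault */ });
});

q.wait_and_throw();   // any asynchronous exceptions are passed to exception_handler here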

31 Device Selection

32 Where is my “Hello World” code executed? Device Selector

Get a device (any device):
  queue q;                                // default_selector{}

Create a queue targeting a pre-configured class of devices:
  queue q(cpu_selector{});
  queue q(gpu_selector{});
  queue q(intel::fpga_selector{});
  queue q(accelerator_selector{});
  queue q(host_selector{});

Create a queue targeting a specific device (custom criteria), SYCL 1.2.1 style:
  class custom_selector : public device_selector {
    int operator()(……                     // Any logic you want!
    …
  };
  queue q(custom_selector{});

default_selector:
• The DPC++ runtime scores all devices and picks the one with the highest compute power
• Environment variables:
  export SYCL_DEVICE_TYPE=GPU | CPU | HOST
  export SYCL_DEVICE_FILTER={backend:device_type:device_num}   (*will be available in oneAPI Update 1)

See tinyurl.com/dpcpp-env-vars for more details on environment variables.

DPC++ Device Selection
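The custom_selector above is truncated on the slide. A minimal sketch of what the scoring logic can look like (an illustration, not the original slide's code): prefer a GPU, never pick the host device, and accept anything else with a lower score.

class custom_selector : public device_selector {
public:
  int operator()(const device &dev) const override {
    if (dev.is_gpu())  return 100;   // highest score: choose a GPU when one is available
    if (dev.is_host()) return -1;    // negative score: never select the host device
    return 10;                       // any other device is acceptable
  }
};

queue q(custom_selector{});
std::cout << q.get_device().get_info<info::device::name>() << std::endl;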

35 Compilation and Execution Flow

36 Just-in-Time Compilation (JIT)

$ dpcpp vector_add.cpp

▪ Default mode: the device binary is generated at runtime (source code → compiler → host object + SPIR-V → linker → FAT binary containing the host binary plus SPIR-V)
  + flexibility
  + smaller binary size
  - runtime performance (JIT overhead)
  - errors surface at runtime

37 Ahead-of-Time Compilation (AOT)

$ dpcpp -fsycl-targets=spir64_gen-unknown-unknown-sycldevice vector_add.cpp

▪ The device binary is generated at compile time (source code → compiler → host object + device object code → linker → FAT binary)
  - flexibility
  + runtime performance
  + compile-time checks
  + easier debugging

tinyurl.com/fsycl-targets 38 Runtime Architecture

[Diagram] A DPC++ application calls into the DPC++ runtime library (SYCL API). The runtime sits on top of the DPC++ Runtime Plugin Interface (PI), with PI plugins for Level Zero, OpenCL, and (in the open source compiler) CUDA, each backed by its native runtime and driver. Devices: Intel GPU via Level Zero; Intel GPU, FPGA, and CPU via OpenCL. The backend is selected via the SYCL_BE environment variable (PI_OPENCL, PI_LEVEL0, PI_CUDA), which is being replaced by SYCL_DEVICE_FILTER.

More info: tinyurl.com/dpcpp-pi, tinyurl.com/levelzero-spec 40 Demo/Lab

Data Parallel C++ Essentials

41 Practice time!

▪ Go to oneapi.teams ▪ Git clone https://oneapi.team/ashadrin/oneapi_demo.git

▪ Or try in your own environment: • source /opt/intel/oneapi/setvars.sh • dpcpp --help

43 Device information

▪ oneapi_demo/dpcpp_compiler/src/get-device-info.cpp
▪ Task 1: run with the OpenCL backend
  • Hint: you can ask the PI interface to show debug info using SYCL_PI_TRACE=1
▪ Task 2: run on the CPU only
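The contents of get-device-info.cpp are not reproduced here; the sketch below is a hypothetical, minimal version of that kind of device enumeration, useful for checking which platforms and devices the runtime sees.

#include <CL/sycl.hpp>
using namespace sycl;

int main() {
  // list every platform and device the DPC++ runtime can see
  for (auto &p : platform::get_platforms()) {
    std::cout << "Platform: " << p.get_info<info::platform::name>() << std::endl;
    for (auto &d : p.get_devices()) {
      std::cout << "  Device: " << d.get_info<info::device::name>()
                << (d.is_gpu() ? " [GPU]" : d.is_cpu() ? " [CPU]" : "") << std::endl;
    }
  }
}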

44 VectorAdd Buffers

▪ oneapi_demo/dpcpp_compiler/src/vector-add-buffers.cpp

$ dpcpp -O2 -g -std=c++17 -o vector-add-buffers src/vector-add-buffers.cpp
$ SYCL_DEVICE_TYPE=GPU ./vector-add-buffers
Running on device: Intel(R) Graphics Gen9 [0x3ea5]
Vector size: 10000
[0]: 0 + 0 = 0
[1]: 1 + 1 = 2
[2]: 2 + 2 = 4
...
[9999]: 9999 + 9999 = 19998
Vector add successfully completed on device.

45 JIT \ AOT

▪ oneapi_demo/dpcpp_compiler/src/vector-add-buffers.cpp
▪ Work with -fsycl-targets to build kernels for CPU and GPU
  • -fsycl-targets=spir64_x86_64-unknown-unknown-sycldevice
  • -fsycl-targets=spir64_gen-unknown-unknown-sycldevice

More info: tinyurl.com/aot-compilation 46 Unified Shared Memory

47 Motivation

The SYCL 1.2.1 standard provides a buffer memory abstraction
• Powerful, and elegantly expresses data dependences
However…
• Replacing all pointers and arrays with buffers in a C++ program can be a burden to programmers

USM provides a pointer-based alternative in DPC++
• Simplifies porting to an accelerator
• Gives programmers the desired level of control
• Complementary to buffers

4848 Developer View Of USM

▪ Developers can reference same memory object in host and device code with Unified Shared Memory

4949 Unified Shared Memory Show me the code!

Buffers version:

#include <CL/sycl.hpp>
using namespace sycl;

int main() {
  std::vector<int> A(1024), B(1024), C(1024);
  // some data initialization
  {
    buffer bufA {A}, bufB {B}, bufC {C};
    queue q(gpu_selector{});
    q.submit([&](handler &cgh) {
      auto A = bufA.get_access(cgh, access::mode::read );
      auto B = bufB.get_access(cgh, access::mode::read );
      auto C = bufC.get_access(cgh, access::mode::write);
      cgh.parallel_for(1024, [=](auto i){
        C[i] = A[i] + B[i];
      });
    });
  }
  for (int i = 0; i < 1024; i++){
    std::cout << C[i] << std::endl;
  }
}

USM version:

#include <CL/sycl.hpp>
using namespace sycl;

int main() {
  queue q(gpu_selector{});
  int *A = malloc_shared<int>(1024, q);
  int *B = malloc_shared<int>(1024, q);
  int *C = malloc_shared<int>(1024, q);
  // some data initialization
  q.submit([&](handler &cgh) {
    cgh.parallel_for(1024, [=](auto i){
      C[i] = A[i] + B[i];
    });
  });
  q.wait();
  for (int i = 0; i < 1024; i++){
    std::cout << C[i] << std::endl;
  }
  free(A, q);
  free(B, q);
  free(C, q);
}

DPC++ Unified Shared Memory

Unified Shared Memory provides both explicit and implicit models for managing memory.

Allocation Type | Description | Accessible on Host | Accessible on Device
device | Allocations in device memory (explicit) | NO | YES
host | Allocations in host memory (implicit) | YES | YES
shared | Allocations can migrate between host and device memory (implicit) | YES | YES

Automatic data accessibility and explicit data movement supported
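A minimal sketch of the three allocation kinds from the table above, assuming the templated, queue-based allocation functions:

queue q;

int *dev = malloc_device<int>(1024, q);  // device allocation: dereference only inside kernels
int *hst = malloc_host<int>(1024, q);    // host allocation: accessible on host and device
int *shr = malloc_shared<int>(1024, q);  // shared allocation: migrates between host and device

hst[0] = 1;                              // OK on the host
shr[0] = 2;                              // OK on the host
// dev[0] = 3;                           // not valid on the host; use a kernel or q.memcpy()

free(dev, q);
free(hst, q);
free(shr, q);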

5151 USM - Explicit Data Movement

queue q;
int hostArray[42];
int *deviceArray = (int*) malloc_device(42 * sizeof(int), q);

for (int i = 0; i < 42; i++) hostArray[i] = 42;

// copy hostArray to deviceArray
q.memcpy(deviceArray, &hostArray[0], 42 * sizeof(int));
q.wait();

q.submit([&](handler& h){
  h.parallel_for(42, [=](auto ID) {
    deviceArray[ID]++;
  });
});
q.wait();

// copy deviceArray back to hostArray
q.memcpy(&hostArray[0], deviceArray, 42 * sizeof(int));
q.wait();

free(deviceArray, q);

52 USM - Implicit Data Movement

queue q;
int *hostArray = (int*) malloc_host(42 * sizeof(int), q);
int *sharedArray = (int*) malloc_shared(42 * sizeof(int), q);

for (int i = 0; i < 42; i++) hostArray[i] = 1234;

q.submit([&](handler& h){
  h.parallel_for(42, [=](auto ID) {
    // access sharedArray and hostArray on device
    sharedArray[ID] = hostArray[ID] + 1;
  });
});
q.wait();

for (int i = 0; i < 42; i++) hostArray[i] = sharedArray[i];

free(sharedArray, q);
free(hostArray, q);

53 USM - Data Dependency in Queues

There are no accessors in USM, so dependences must be specified explicitly, using events:
• queue.wait()
• wait on event objects
• use the depends_on method inside a command group

54 USM - Data Dependency in Queues

▪ Explicit wait() is used to ensure the data dependency between the three queue submissions; wait() blocks execution on the host.

queue q;
int* data = malloc_shared<int>(N, q);
for (int i = 0; i < N; i++) data[i] = 10;

q.submit([&] (handler &h){
  h.parallel_for(range<1>(N), [=](id<1> i){
    data[i] += 2;
  });
}).wait();

q.submit([&] (handler &h){
  h.parallel_for(range<1>(N), [=](id<1> i){
    data[i] += 3;
  });
}).wait();

q.submit([&] (handler &h){
  h.parallel_for(range<1>(N), [=](id<1> i){
    data[i] += 5;
  });
}).wait();

for (int i = 0; i < N; i++) std::cout << data[i] << " ";

5555 USM - Data Dependency in Queues

▪ The depends_on() method lets the command group handler know that the specified event must complete before the specified task can execute. (SYCL 2020)

queue q;
int* data = malloc_shared<int>(N, q);
for (int i = 0; i < N; i++) data[i] = 10;

auto e1 = q.submit([&] (handler &h){
  h.parallel_for(range<1>(N), [=](id<1> i){
    data[i] += 2;
  });
});

auto e2 = q.submit([&] (handler &h){
  h.depends_on(e1);
  h.parallel_for(range<1>(N), [=](id<1> i){
    data[i] += 3;
  });
});
// non-blocking; execution of host code is possible

q.submit([&] (handler &h){
  h.depends_on(e2);
  h.parallel_for(range<1>(N), [=](id<1> i){
    data[i] += 5;
  });
}).wait();

for (int i = 0; i < N; i++) std::cout << data[i] << " ";

USM - Data Dependency in Queues

▪ The depends_on() method can also take a list of events that must complete before the specified task can execute.

queue q;
int* data1 = malloc_shared<int>(N, q);
int* data2 = malloc_shared<int>(N, q);
for (int i = 0; i < N; i++) { data1[i] = 10; data2[i] = 10; }

auto e1 = q.submit([&] (handler &h){
  h.parallel_for(range<1>(N), [=](id<1> i){
    data1[i] += 2;
  });
});

auto e2 = q.submit([&] (handler &h){
  h.parallel_for(range<1>(N), [=](id<1> i){
    data2[i] += 3;
  });
});

q.submit([&] (handler &h){
  h.depends_on({e1, e2});
  h.parallel_for(range<1>(N), [=](id<1> i){
    data1[i] += data2[i];
  });
}).wait();

for (int i = 0; i < N; i++) std::cout << data1[i] << " ";

5757 USM - Data Dependency in Queues

A more simplified way of specifying dependencies, as parameters of parallel_for:

queue q;
int* data1 = malloc_shared<int>(N, q);
int* data2 = malloc_shared<int>(N, q);
for (int i = 0; i < N; i++) { data1[i] = 10; data2[i] = 10; }

auto e1 = q.parallel_for(range<1>(N), [=](id<1> i){
  data1[i] += 2;
});

auto e2 = q.parallel_for(range<1>(N), [=](id<1> i){
  data2[i] += 3;
});

q.parallel_for(range<1>(N), {e1, e2}, [=](id<1> i){
  data1[i] += data2[i];
}).wait();

for (int i = 0; i < N; i++) std::cout << data1[i] << " ";

5858 USM - Data Dependency in Queues

▪ Use the in_order queue property for serial execution: kernel executions will not overlap even if the submissions have no data dependency.

queue q{property::queue::in_order()};
int *data = malloc_shared<int>(N, q);
for (int i = 0; i < N; i++) data[i] = 10;

q.submit([&] (handler &h){
  h.parallel_for(range<1>(N), [=](id<1> i){
    data[i] += 2;
  });
});
// non-blocking; execution of host code is possible

q.submit([&] (handler &h){
  h.parallel_for(range<1>(N), [=](id<1> i){
    data[i] += 3;
  });
});
// non-blocking; execution of host code is possible

q.submit([&] (handler &h){
  h.parallel_for(range<1>(N), [=](id<1> i){
    data[i] += 5;
  });
}).wait();

for (int i = 0; i < N; i++) std::cout << data[i] << " ";

5959 USM – What else you need to know?

▪ Remember the tradeoff between convenience and performance
  • Host? Device? Shared?
▪ Remember about the context
▪ Use sycl::free instead of C++ free for USM allocations
▪ There are multiple styles of allocation:
  • C style: void *malloc_device(size_t size, const device &dev, const context &ctxt);
  • C++ style: template <typename T> T *malloc_device(size_t Count, const device &Dev, const context &Ctxt);
  • C++ allocators: template <typename T, usm::alloc AllocKind, size_t Alignment = 0> class usm_allocator;
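A minimal sketch of the C++ allocator style: a std::vector whose storage is USM shared memory, so its elements can be used directly inside a kernel (only the raw pointer is captured by the kernel lambda).

queue q;

usm_allocator<int, usm::alloc::shared> alloc(q);
std::vector<int, decltype(alloc)> data(1024, 1, alloc);

int *ptr = data.data();                   // raw USM pointer, usable on the device
q.parallel_for(range<1>(1024), [=](id<1> i) {
  ptr[i] += 1;
}).wait();

std::cout << data[0] << std::endl;        // prints 2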

60 60 NDRange

61 DPC++ Hierarchy and Mapping

62 DPC++ Thread Hierarchy and Mapping

All work-items in a work-group are scheduled on one Compute Unit, which has its own local memory

All work-items in a sub-group are mapped to vector hardware

63 Shared Local Memory

SLM has limited size, e.g. 64KB SLM for each work-group

std::cout << "Local Memory Size: " << q.get_device().get_info() << std::endl;

Use local_accessor class or target::local accessor specialization

q.submit([&](auto &h) {
  sycl::accessor<int, 1, sycl::access::mode::read_write, sycl::access::target::local>
      slm(sycl::range(32 * 64), h);
  h.parallel_for(sycl::nd_range(sycl::range{N}, sycl::range{32}),
                 [=](sycl::nd_item<1> it) {
    //...
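The snippet above is truncated on the slide. Below is a self-contained sketch under illustrative assumptions (N, B, and the staging pattern are chosen for the example): each work-group stages values in SLM, synchronizes with a work-group barrier, then reads a neighbour's staged value.

constexpr size_t N = 1024, B = 64;
queue q;
int *data = malloc_shared<int>(N, q);
for (size_t i = 0; i < N; i++) data[i] = i;

q.submit([&](handler &h) {
  // one SLM element per work-item in the work-group
  accessor<int, 1, access::mode::read_write, access::target::local> slm(range<1>(B), h);
  h.parallel_for(nd_range<1>(range<1>(N), range<1>(B)), [=](nd_item<1> it) {
    size_t lid = it.get_local_id(0);
    size_t gid = it.get_global_id(0);
    slm[lid] = data[gid];                            // stage into shared local memory
    it.barrier(access::fence_space::local_space);    // make SLM writes visible to the work-group
    data[gid] = slm[(lid + 1) % B];                  // read a neighbour's staged value
  });
}).wait();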

tinyurl.com/SLM-in-SYCL 64 Logical Memory Hierarchy

[Diagram] Logical memory hierarchy: a device contains work-groups; each work-group contains work-items, each with its own private memory; work-items in a work-group share local memory; all work-groups share the device's global memory and constant memory.

6565 ND-range Kernels

▪ Basic parallel kernels are an easy way to parallelize a for-loop, but they do not allow performance optimization at the hardware level.
▪ ND-range kernels are another way to express parallelism; they enable low-level performance tuning by providing access to local memory and by mapping executions to compute units on the hardware.

  • The entire iteration space is divided into smaller groups called work-groups; work-items within a work-group are scheduled on a single compute unit on the hardware.
  • Grouping kernel executions into work-groups allows control of resource usage and load-balanced work distribution.

6666 ND-range Kernels The functionality of nd_range kernels is exposed via nd_range and nd_item classes

h.parallel_for(nd_range<1>(range<1>(1024),   // global size
                           range<1>(64)),    // work-group size
               [=](nd_item<1> item){
  auto idx = item.get_global_id();
  auto local_id = item.get_local_id();
  // CODE THAT RUNS ON DEVICE
});

The nd_range class represents a grouped execution range, using the global execution range and the local execution range of each work-group. The nd_item class represents an individual instance of a kernel function and allows querying the work-group range and index.

6767 Sub-groups

68 Sub-groups

▪ A subset of work-items within a work-group that may map to vector hardware. ▪ Why use Sub-groups? • Work-items in a sub-group can communicate directly using shuffle operations, without explicit memory operations. • Work-items in a sub-group can synchronize using sub-group barriers and guarantee memory consistency using sub-group memory fences. • Work-items in a sub-group have access to sub-group collectives, providing fast implementations of common parallel patterns.

6969 Sub-groups

sub_group class

h.parallel_for(nd_range<1>(N,B), [=](nd_item<1> item) {
  ONEAPI::sub_group sg = item.get_sub_group();
  // KERNEL CODE
});

▪ The sub-group handle can be obtained from the nd_item using get_sub_group()
▪ Once you have the sub-group handle, you can query for more information about the sub-group, do shuffle operations, or use collective functions
▪ Use the explicit kernel attribute [[intel::reqd_sub_group_size(N)]] to control the sub-group size

7070 Sub-groups

The sub-group handle can be queried to get other information:
• get_local_id() returns the index of the work-item within its sub-group
• get_local_range() returns the size of the sub-group
• get_group_id() returns the index of the sub-group
• get_group_range() returns the number of sub-groups within the parent work-group

h.parallel_for(nd_range<1>(N,B), [=](nd_item<1> item){
  ONEAPI::sub_group sg = item.get_sub_group();
  if (sg.get_local_id() == 0) {
    out << "sub_group id: " << sg.get_group_id()[0]
        << " of " << sg.get_group_range()
        << ", size=" << sg.get_local_range()[0]
        << endl;
  }
});
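Besides queries, the sub-group handle exposes shuffle operations. The sketch below follows the ONEAPI::sub_group member-function style used on these slides; exact spellings changed between DPC++ releases, so treat the shuffle_down call as an assumption of that era's API.

constexpr size_t N = 1024, B = 64;
queue q;
int *data = malloc_shared<int>(N, q);
for (size_t i = 0; i < N; i++) data[i] = i;

q.submit([&](handler &h) {
  h.parallel_for(nd_range<1>(range<1>(N), range<1>(B)), [=](nd_item<1> item) {
    ONEAPI::sub_group sg = item.get_sub_group();
    size_t i = item.get_global_id(0);
    // read the next work-item's value directly across the sub-group, with no
    // local/global memory traffic (the last lane of each sub-group is unspecified)
    data[i] = sg.shuffle_down(data[i], 1);
  });
}).wait();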

7171 nd_range & nd_item

▪ Example: process every pixel in a 1920x1080 image
  • Each pixel needs processing; the kernel is executed on each pixel (work-item)
  • 1920 x 1080 = ~2M pixels = global size
  • Not all 2M can run in parallel on the device; there are hardware resource limits
  • We have to split the work into smaller groups of pixel blocks = local size (work-group)

nd_range & nd_item

▪ Example: process every pixel in a 1920x1080 image

• Let the compiler determine the work-group size:

h.parallel_for(range<2>(1920,1080),               // global size
               [=](id<2> item){
  // CODE THAT RUNS ON DEVICE
});

• Programmer specifies the work-group size:

h.parallel_for(nd_range<2>(range<2>(1920,1080),   // global size
                           range<2>(8,8)),        // local size (work-group size)
               [=](nd_item<2> item){
  // CODE THAT RUNS ON DEVICE
});

nd_range & nd_item

▪ Example: process every pixel in a 1920x1080 image
• How do we choose the work-group size?
  • Work-group size of 8x8 divides 1920x1080 equally: GOOD
  • Work-group size of 9x9 does not divide 1920x1080 equally: the compiler will throw an error (invalid work-group size)
  • Work-group size of 10x10 divides 1920x1080 equally: works, but it is always better to use a multiple of 8 for better resource utilization
  • Work-group size of 24x24 divides 1920x1080 equally, but 24x24 = 576 work-items per group will fail, assuming the GPU's maximum work-group size is 256
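One way to check that limit up front is to query the device before choosing the local range; a minimal sketch:

queue q;

// upper bound on work-items per work-group for this device
// (the 24x24 = 576 case above would exceed a limit of 256)
size_t max_wg = q.get_device().get_info<info::device::max_work_group_size>();
std::cout << "Max work-group size: " << max_wg << std::endl;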

75 ▪ Demo/Lab

DPC++ Unified Shared Memory DPC++ Subgroups

77 VectorAdd USM

▪ oneapi_demo/dpcpp_compiler/src/vector-add-usm.cpp

$ dpcpp -O2 -g -std=c++17 -o vector-add-usm src/vector-add-usm.cpp
$ SYCL_DEVICE_TYPE=GPU ./vector-add-usm
Running on device: Intel(R) Graphics Gen9 [0x3ea5]
Vector size: 10000
[0]: 0 + 0 = 0
[1]: 1 + 1 = 2
[2]: 2 + 2 = 4
...
[9999]: 9999 + 9999 = 19998
Vector add successfully completed on device.

78 Lab USM

▪ oneapi_demo/dpcpp_compiler/src/usm-lab.cpp
▪ Task: The code uses USM and has three kernels that are submitted to the device. The first two kernels modify two different memory objects, and the third one depends on the first two. As written, no data dependency is specified between the three queue submissions; fix the code so it produces the desired output of 25.

79 Subgroups

▪ oneapi_demo/dpcpp_compiler/src/sugroup.cpp

Useful Links

Open source projects:
▪ oneAPI Data Parallel C++ compiler: github.com/intel/llvm
▪ Graphics Compute Runtime: github.com/intel/compute-runtime
▪ Graphics Compiler: github.com/intel/intel-graphics-compiler

▪ SYCL 2020: tinyurl.com/sycl2020-spec
▪ DPC++ Extensions: tinyurl.com/dpcpp-ext
▪ Environment Variables: tinyurl.com/dpcpp-env-vars
▪ DPC++ book: tinyurl.com/dpcpp-book
▪ oneAPI training: colfax-intl.com/training/intel-oneapi-training
▪ oneAPI Base Training Modules: devcloud.intel.com/oneapi/get_started/baseTrainingModules/
▪ Code samples: tinyurl.com/dpcpp-tests tinyurl.com/oneapi-samples

81 QUESTIONS?

82 Notices & Disclaimers

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.

Your costs and results may vary.

Intel technologies may require enabled hardware, software or service activation.

Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.

© Intel Corporation. Intel, the Intel logo, Xeon, Core, VTune, OpenVINO, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
