Cross-Architecture Programming for Accelerated Compute, Freedom of Choice for Hardware
Introduction to heterogeneous programming with Data Parallel C++
One Intel Software & Architecture group
Intel Architecture, Graphics & Software
March 2021
All information provided in this deck is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.

Agenda
▪ What is DPC++?
▪ Intel Compilers
▪ “Hello World” Example
▪ Basic Concepts: buffer, accessor, queue, kernel, etc.
▪ Error Handling
▪ Device Selection
▪ Compilation and Execution Flow
▪ Demo/Lab – part I
▪ Unified Shared Memory
▪ Sub-groups
▪ Demo/Lab – part II
What is DPC++?
Data Parallel C++: Standards-Based, Cross-Architecture Language
DPC++ = ISO C++ and Khronos SYCL and community extensions
Freedom of Choice: Future-Ready Programming Model

Direct Programming with Data Parallel C++:
▪ Allows code reuse across hardware targets
▪ Permits custom tuning for a specific accelerator
▪ Open, cross-industry alternative to proprietary language

DPC++ = ISO C++ and Khronos SYCL and community extensions
▪ Delivers C++ productivity benefits, using common, familiar C and C++ constructs
▪ Adds SYCL from the Khronos Group for data parallelism and heterogeneous programming

Community Project Drives Language Enhancements
▪ Provides extensions to simplify data parallel programming
▪ Continues evolution through open and cooperative development
The open source and Intel DPC++/C++ compilers support Intel CPUs, GPUs, and FPGAs. Codeplay announced a DPC++ compiler that targets Nvidia GPUs.

Intel® oneAPI DPC++/C++ Compiler: Parallel Programming Productivity & Performance
A compiler that delivers uncompromised parallel programming productivity and performance across CPUs and accelerators:
▪ Allows code reuse across hardware targets, while permitting custom tuning for a specific accelerator
▪ Open, cross-industry alternative to a single-architecture proprietary language
▪ Builds upon Intel’s decades of experience in architecture and high-performance compilers

oneAPI DPC++/C++ Compiler and Runtime stack: DPC++ source code → Clang/LLVM → DPC++ Runtime → CPU / GPU / FPGA
▪ Compiler: https://github.com/intel/llvm
▪ Runtime: https://github.com/intel/compute-runtime

Code samples: tinyurl.com/dpcpp-tests, tinyurl.com/oneapi-samples
There will still be a need to tune for each architecture.

DPC++ Extensions
tinyurl.com/dpcpp-ext, tinyurl.com/sycl2020-spec
Extension | Purpose | In SYCL 2020
USM (Unified Shared Memory) | Pointer-based programming | ✓
Sub-groups | Cross-lane operations | ✓
Reductions | Efficient parallel primitives | ✓
Work-group collectives | Efficient parallel primitives | ✓
Pipes | Spatial data flow support |
Argument restrict | Optimization |
Optional lambda name for kernels | Simplification | ✓
In-order queues | Simplification | ✓
Class template argument deduction and simplification | Simplification | ✓
tinyurl.com/openmp-and-dpcpp

Intel® Compilers
Intel Compiler | Target | OpenMP Support | OpenMP Offload Support | Included in oneAPI Toolkit
Intel® C++ Compiler, IL0 (icc) | CPU | Yes | No | HPC
Intel® oneAPI DPC++/C++ Compiler (dpcpp) | CPU, GPU, FPGA* | Yes | Yes | Base
Intel® oneAPI DPC++/C++ Compiler (icx) | CPU, GPU* | Yes | Yes | Base
Intel® Fortran Compiler, IL0 (ifort) | CPU | Yes | No | HPC
Intel® Fortran Compiler Beta (ifx) | CPU, GPU* | Yes | Yes | HPC

Cross compilers are binary compatible and linkable!
tinyurl.com/oneAPI-Base-download tinyurl.com/oneAPI-HPC-download
Codeplay Launched Data Parallel C++ Compiler for Nvidia GPUs
▪ Developers can retarget and reuse code between NVIDIA and Intel compute accelerators from a single source base
▪ Codeplay is the first oneAPI industry contributor to implement a developer tool based on oneAPI specifications
▪ They leveraged the DPC++ LLVM-based open source project that Intel established
▪ Codeplay is a key driver of the Khronos SYCL standard, upon which DPC++ is based
▪ More details in the Codeplay blog post
▪ Build the DPC++ toolchain with support for NVIDIA CUDA: tinyurl.com/dpcpp-cuda-be, tinyurl.com/dpcpp-cuda-be-webinar

Other news:
▪ Codeplay Brings NVIDIA GPU Support to Industry-Standard Math Library
▪ Intel Open Sources the oneAPI Math Kernel Library Interface
Other names and brands may be claimed as the property of others.

“Hello World” Example
DPC++ Basics
A DPC++ application is split between host and device:
▪ Host code executes on the host (CPU) and uses host memory; it submits command groups to command queues.
▪ Device code (kernels) executes on a device. A device contains compute units (CUs); each work-item has private memory, each work-group has local memory, and all work-groups share the device’s global memory.
Anatomy of a DPC++ Application
A DPC++ program mixes host code and device code in a single source file: it includes the SYCL header, builds std::vector data on the host, hands it to the device through buffers and accessors, runs a kernel via a queue, and reads the results back on the host (reconstructed below).
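Much of the original listing was lost in extraction; below is a minimal sketch of the vector-add program this slide walks through (the vector size, initialization values, and kernel body are assumptions):

#include <CL/sycl.hpp>
#include <iostream>
#include <vector>
using namespace sycl;

int main() {
  constexpr size_t N = 1024;
  // Host code: ordinary C++ containers.
  std::vector<int> A(N, 1), B(N, 2), C(N, 0);
  {
    // Buffers wrap the host data for the duration of this scope.
    buffer bufA(A), bufB(B), bufC(C);
    queue q;  // default_selector picks a device
    q.submit([&](handler& h) {
      // Accessors declare how the kernel uses each buffer.
      accessor a(bufA, h, read_only);
      accessor b(bufB, h, read_only);
      accessor c(bufC, h, write_only);
      // Device code: executed once per work-item.
      h.parallel_for(range<1>(N), [=](id<1> i) { c[i] = a[i] + b[i]; });
    });
  } // Buffer destruction copies results back to C.
  // Host code again.
  for (size_t i = 0; i < N; i++)
    std::cout << "C[" << i << "] = " << C[i] << std::endl;
  return 0;
}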
Asynchronous Execution
Host code and the kernel-execution graph run asynchronously: each queue submission enqueues work and returns immediately, while the DPC++ runtime executes the graph of kernels and data movements as their dependences are satisfied. Host execution and graph execution proceed independently until the host explicitly synchronizes (see the sketch below).
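A small sketch of this overlap (illustrative only; the print-outs and sizes are assumptions, not part of the original slide):

#include <CL/sycl.hpp>
#include <iostream>
using namespace sycl;

int main() {
  queue q;
  int* data = malloc_shared<int>(1024, q);
  // Submission returns immediately; the kernel may still be running.
  q.parallel_for(range<1>(1024), [=](id<1> i) { data[i] = i; });
  // Host code here overlaps with device execution.
  std::cout << "Host keeps working while the kernel runs\n";
  q.wait();  // explicit synchronization point
  std::cout << "data[42] = " << data[42] << std::endl;
  free(data, q);
  return 0;
}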
Anatomy of a DPC++ Application (continued)

The same vector-add listing can be read as three nested scopes:
▪ Application scope: standard C++ host code that runs on the CPU before and after kernel execution, e.g. printing the result:

  for (int i = 0; i < N; i++)
    std::cout << "C[" << i << "] = " << C[i] << std::endl;

▪ Command-group scope: the function object passed to queue::submit; it declares accessors (and thereby the kernel’s data dependences) and launches the kernel.
▪ Kernel scope: the body of parallel_for, which executes on the device.
Graph of Kernel Executions
The DPC++ runtime builds a graph of kernel executions with automatic data and control dependence resolution:

int main() {
  auto R = range<1>{ num };
  buffer<int> A{R}, B{R};  // buffer declarations truncated in the original; assumed here
  queue Q;
  Q.submit([&](handler& h) {              // writes A
    accessor out(A, h, write_only);
    h.parallel_for(R, [=](id<1> idx) { out[idx] = idx[0]; });
  });
  Q.submit([&](handler& h) {              // writes B
    accessor out(B, h, write_only);
    h.parallel_for(R, [=](id<1> idx) { out[idx] = idx[0]; });
  });
  Q.submit([&](handler& h) {              // reads A, read-writes B
    accessor in(A, h, read_only);
    accessor inout(B, h);
    h.parallel_for(R, [=](id<1> idx) { inout[idx] *= in[idx]; });
  });
}

The kernels writing A and B are independent and may run in parallel; the final kernel has a data dependence on both, and program completion waits for the whole graph.
DPC++ Functions for Invoking Kernels
// Single task: kernel function is executed EXACTLY once, on a SINGLE work-item
h.single_task([=]() {
  // kernel code
});

// Basic data parallel: kernel function is executed over an n-dimensional range
h.parallel_for(range<3>(1024,1024,1024),  // using 3D in this example
  [=](id<3> myID) {
    // kernel code
});

// Explicit ND-range: kernel function is executed over an nd_range (global + local sizes)
h.parallel_for(nd_range<3>({1024,1024,1024},{16,16,16}),  // using 3D in this example
  [=](nd_item<3> myID) {
    // kernel code
});

// Hierarchical parallelism
h.parallel_for_work_group(range<2>(1024,1024),  // using 2D in this example
  [=](group<2> grp) {
    // executed once per work-group
    grp.parallel_for_work_item(range<1>(1024),  // using 1D in this example
      [=](h_item<1> myItem) {
        // executed once per work-item
    });
});
Basic Parallel Kernels
The functionality of basic parallel kernels is exposed via the range, id, and item classes:
• range describes the iteration space of the parallel execution
• id indexes an individual instance of a kernel in a parallel execution
• item represents an individual instance of a kernel function and exposes additional functions to query properties of the execution range

h.parallel_for(range<1>(1024), [=](id<1> idx){
  // CODE THAT RUNS ON DEVICE
});

h.parallel_for(range<1>(1024), [=](item<1> item){
  auto idx = item.get_id();
  auto R = item.get_range();
  // CODE THAT RUNS ON DEVICE
});

A runnable sketch follows.
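A self-contained example putting the pieces together (illustrative; the array size and kernel body are assumptions):

#include <CL/sycl.hpp>
#include <iostream>
#include <vector>
using namespace sycl;

int main() {
  constexpr size_t N = 1024;
  std::vector<int> data(N, 0);
  {
    buffer buf(data);
    queue q;
    q.submit([&](handler& h) {
      accessor a(buf, h, write_only);
      // id<1> indexes one kernel instance inside the range.
      h.parallel_for(range<1>(N), [=](id<1> idx) { a[idx] = idx[0] * 2; });
    });
  } // buffer destructor writes results back to data
  std::cout << "data[10] = " << data[10] << std::endl;  // prints 20
  return 0;
}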
Synchronization
▪ Synchronization within a kernel function:
  • Barriers synchronize work-items within a work-group
  • There are no synchronization primitives across work-groups
▪ Synchronization between host and device (see the sketch below):
  • Call the wait() member function of the device queue
  • Buffer destruction synchronizes the data back to host memory
  • The host accessor constructor is a blocking call: it returns only after all enqueued kernels operating on the buffer finish execution
  • DAG construction from command-group function objects enqueued into the device queue orders dependent kernels automatically
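A minimal sketch of host-device synchronization via buffer scope and wait() (sizes and values are assumptions):

#include <CL/sycl.hpp>
#include <vector>
using namespace sycl;

int main() {
  std::vector<int> v(64, 1);
  queue q;
  {
    buffer buf(v);
    q.submit([&](handler& h) {
      accessor a(buf, h);
      h.parallel_for(range<1>(64), [=](id<1> i) { a[i] += 1; });
    });
  } // leaving the scope destroys buf and synchronizes v with host memory
  q.wait();  // also legal: wait for all work submitted to q
  return 0;
}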
Host Accessors
▪ An accessor created with the host buffer access target
▪ Created outside of command-group scope
▪ Makes the data it covers available on the host
▪ Constructing a host accessor synchronizes the buffer’s data back to the host
Host Accessor
int main() {
  constexpr int N = 100;
  auto R = range<1>(N);
  std::vector<int> v(N, 10);   // vector contents truncated in the original; values assumed
  queue q;
  buffer buf(v);               // buffer takes ownership of the data stored in v
  q.submit([&](handler& h) {
    accessor a(buf, h);
    h.parallel_for(R, [=](id<1> i) { a[i] -= 2; });  // kernel body assumed
  });
  // The host accessor constructor blocks until kernel execution finishes;
  // the data is then available to the host via this accessor.
  host_accessor b(buf, read_only);
  for (int i = 0; i < N; i++)
    std::cout << b[i] << "\n";
  return 0;
}
Error Handling
DPC++ is based on C++:
• Errors in C++ are handled through exceptions
• SYCL uses exceptions, not return codes!

Synchronous exceptions:
• Detected immediately
• Use a try...catch block

Asynchronous exceptions:
• Detected later, after an API call has returned
• E.g. faults inside a command group or a kernel
• Use an error-handler function
Synchronous Exceptions
▪ Thrown immediately when an API call fails
  • e.g. failure to construct an object, such as being unable to create a buffer
▪ Normal C++ exceptions

try {
  device_queue.reset(new queue(device_selector));
} catch (exception const& e) {
  std::cout << "Caught a synchronous SYCL exception: " << e.what();
  return;
}
Asynchronous Exceptions
▪ Caused by a future failure
  • e.g. an error occurring during execution of a kernel on a device
  • The host program has already moved on to other things!
▪ The programmer provides a processing function and says when to process
static auto exception_handler = [](sycl::exception_list e_list) {
  for (std::exception_ptr const& e : e_list) {
    try {
      std::rethrow_exception(e);
    } catch (std::exception const& e) {
#if _DEBUG
      std::cout << "Failure" << std::endl;
#endif
      std::terminate();
    }
  }
};

▪ Processed by queue::wait_and_throw(), queue::throw_asynchronous(), or event::wait_and_throw()
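To show how the handler is wired up, a minimal sketch (the faulting kernel is omitted; the queue constructor taking a selector and an async handler is standard SYCL):

#include <CL/sycl.hpp>
using namespace sycl;
// exception_handler as defined above

int main() {
  queue q(default_selector{}, exception_handler);
  q.submit([&](handler& h) {
    h.single_task([=]() { /* kernel that may fault */ });
  });
  q.wait_and_throw();  // asynchronous exceptions are delivered to exception_handler here
  return 0;
}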
Device Selection
Where is my “Hello World” code executed? The Device Selector
Get a device (any device):
  queue q;  // uses default_selector{}

Create a queue targeting a pre-configured class of devices (SYCL 1.2.1):
  queue q(cpu_selector{});
  queue q(gpu_selector{});
  queue q(intel::fpga_selector{});
  queue q(accelerator_selector{});
  queue q(host_selector{});

Create a queue targeting a specific device (custom criteria):
  class custom_selector : public device_selector {
    int operator()(const device& dev) const override {
      // Any logic you want!
    }
    ...
  };
  queue q(custom_selector{});

default_selector:
• The DPC++ runtime scores all devices and picks the one with the highest compute power
• It can be steered with environment variables:
  export SYCL_DEVICE_TYPE=GPU | CPU | HOST
  export SYCL_DEVICE_FILTER={backend:device_type:device_num}  (available starting oneAPI Update 1)
See tinyurl.com/dpcpp-env-vars for more details on environment variables.

DPC++ Device Selection
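A small sketch that enumerates the platforms and devices the runtime can choose from (standard SYCL queries; output varies by machine):

#include <CL/sycl.hpp>
#include <iostream>
using namespace sycl;

int main() {
  for (auto& p : platform::get_platforms()) {
    std::cout << "Platform: " << p.get_info<info::platform::name>() << "\n";
    for (auto& d : p.get_devices())
      std::cout << "  Device: " << d.get_info<info::device::name>() << "\n";
  }
  return 0;
}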
Compilation and Execution Flow
Just-in-Time Compilation (JIT)
$ dpcpp vector_add.cpp

▪ Default mode: the device binary is generated at runtime
▪ Flow: source code → COMPILER → host object + SPIR-V → LINKER → FAT binary (host binary + SPIR-V); the SPIR-V is JIT-compiled for the device at runtime
  + flexibility
  + smaller binary size
  - runtime performance (JIT overhead)
  - device-code errors surface at runtime
Ahead-of-Time Compilation (AOT)
$ dpcpp -fsycl-targets=spir64_gen-unknown-unknown-sycldevice vector_add.cpp

▪ The device binary is generated at compile time
▪ Flow: source code → COMPILER → host object + device object (native device code) → LINKER → FAT binary
  - flexibility
  + runtime performance
  + compile-time checks
  + easier debugging
tinyurl.com/fsycl-targets

Runtime Architecture
The stack, from top to bottom:
▪ DPC++ application (device code as SPIR-V or native ISA)
▪ DPC++ runtime library (SYCL API)
▪ DPC++ Runtime Plugin Interface (PI) for the open source compiler
▪ PI plugins: PI/Level Zero, PI/OpenCL (and PI/CUDA)
▪ Native runtime and driver: Level Zero Runtime or OpenCL Runtime
▪ Devices: Intel GPU, FPGA, CPU

The backend is controlled via the SYCL_BE environment variable (PI_OPENCL, PI_LEVEL0, PI_CUDA), since replaced with SYCL_DEVICE_FILTER.
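For example, the same binary can be steered between backends from the shell (illustrative invocations; the filter string follows the {backend:device_type:device_num} pattern described earlier):

$ SYCL_DEVICE_FILTER=opencl:cpu ./vector-add-buffers
$ SYCL_DEVICE_FILTER=level_zero:gpu:0 ./vector-add-buffers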
More info: tinyurl.com/dpcpp-pi, tinyurl.com/levelzero-spec

Demo/Lab
Data Parallel C++ Essentials
Practice Time!
▪ Go to oneapi.teams
▪ git clone https://oneapi.team/ashadrin/oneapi_demo.git
▪ Or try in your own environment:
  • source /opt/intel/oneapi/setvars.sh
  • dpcpp --help
Device Information
▪ oneapi_demo/dpcpp_compiler/src/get-device-info.cpp
▪ Task 1: run with the OpenCL backend
  • Hint: you can ask the PI interface to show debug info using SYCL_PI_TRACE=1
▪ Task 2: run on CPU only
VectorAdd with Buffers
▪ oneapi_demo/dpcpp_compiler/src/vector-add-buffers.cpp

$ dpcpp -O2 -g -std=c++17 -o vector-add-buffers src/vector-add-buffers.cpp
$ SYCL_DEVICE_TYPE=GPU ./vector-add-buffers
Running on device: Intel(R) Graphics Gen9 [0x3ea5]
Vector size: 10000
[0]: 0 + 0 = 0
[1]: 1 + 1 = 2
[2]: 2 + 2 = 4
...
[9999]: 9999 + 9999 = 19998
Vector add successfully completed on device.
JIT / AOT
▪ oneapi_demo/dpcpp_compiler/src/vector-add-buffers.cpp
▪ Work with -fsycl-targets to build kernels for CPU and GPU:
  • -fsycl-targets=spir64_x86_64-unknown-unknown-sycldevice
  • -fsycl-targets=spir64_gen-unknown-unknown-sycldevice
More info: tinyurl.com/aot-compilation

Unified Shared Memory
Motivation
The SYCL 1.2.1 standard provides a buffer memory abstraction:
• Powerful, and elegantly expresses data dependences
However:
• Replacing all pointers and arrays with buffers in a C++ program can be a burden to programmers

USM provides a pointer-based alternative in DPC++:
• Simplifies porting to an accelerator
• Gives programmers the desired level of control
• Complementary to buffers
Developer View of USM
▪ With Unified Shared Memory, developers can reference the same memory object in host and device code
Unified Shared Memory: Show Me the Code!
A complete USM program fits in a handful of lines; a reconstruction follows.
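A minimal sketch of the kind of program this slide shows (the size, values, and kernel body are assumptions):

#include <CL/sycl.hpp>
#include <iostream>
using namespace sycl;

int main() {
  constexpr int N = 16;
  queue q;
  // One shared allocation, visible to both host and device.
  int* data = malloc_shared<int>(N, q);
  for (int i = 0; i < N; i++) data[i] = i;     // host writes
  q.parallel_for(range<1>(N), [=](id<1> i) {   // device updates in place
    data[i] *= 2;
  }).wait();
  for (int i = 0; i < N; i++)                  // host reads, no copies
    std::cout << data[i] << " ";
  std::cout << std::endl;
  free(data, q);
  return 0;
}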
Unified Shared Memory provides both explicit and implicit models for managing memory.
Allocation Type | Description | Accessible on HOST | Accessible on DEVICE
device | Allocations in device memory (explicit) | NO | YES
host | Allocations in host memory (implicit) | YES | YES
shared | Allocations can migrate between host and device memory (implicit) | YES | YES
Both automatic data accessibility and explicit data movement are supported; the allocation calls are sketched below.
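The three allocation kinds, side by side (a fragment-style sketch in the deck's idiom; N and the queue are assumed context):

queue q;
int* d = malloc_device<int>(N, q);  // device: explicit copies needed (q.memcpy)
int* h = malloc_host<int>(N, q);    // host: device accesses go over the bus
int* s = malloc_shared<int>(N, q);  // shared: pages migrate on demand
// ... use, then release with sycl::free:
free(d, q); free(h, q); free(s, q);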
USM: Explicit Data Movement
queue q;
int hostArray[42];
int* deviceArray = (int*) malloc_device(42 * sizeof(int), q);

for (int i = 0; i < 42; i++) hostArray[i] = 42;
// copy hostArray to deviceArray
q.memcpy(deviceArray, &hostArray[0], 42 * sizeof(int));
q.wait();
q.submit([&](handler& h) {
  h.parallel_for(42, [=](auto ID) {
    deviceArray[ID]++;
  });
});
q.wait();
// copy deviceArray back to hostArray
q.memcpy(&hostArray[0], deviceArray, 42 * sizeof(int));
q.wait();
free(deviceArray, q);
USM: Implicit Data Movement
queue q;
int* hostArray = (int*) malloc_host(42 * sizeof(int), q);
int* sharedArray = (int*) malloc_shared(42 * sizeof(int), q);

for (int i = 0; i < 42; i++) hostArray[i] = 1234;
q.submit([&](handler& h) {
  h.parallel_for(42, [=](auto ID) {
    // access sharedArray and hostArray on device
    sharedArray[ID] = hostArray[ID] + 1;
  });
});
q.wait();
for (int i = 0; i < 42; i++) hostArray[i] = sharedArray[i];
free(sharedArray, q);
free(hostArray, q);
USM: Data Dependency in Queues
There are no accessors in USM, so dependences must be specified explicitly using events:
• queue.wait()
• wait on event objects
• use the depends_on method inside a command group
USM: Data Dependency in Queues

Explicit wait() can be used to enforce data dependences between queue submissions (the listings were truncated in the original; kernel bodies are reconstructed):

  queue q;
  int* data = malloc_shared<int>(N, q);
  q.parallel_for(range<1>(N), [=](id<1> i) { data[i] = 10; });
  q.wait();                                  // kernel 1 must finish...
  q.parallel_for(range<1>(N), [=](id<1> i) { data[i] += 5; });
  q.wait();                                  // ...before the host (or kernel 2) proceeds

Use the depends_on() method to let the command-group handler know that the specified events must complete before the task can execute:

  auto e1 = q.parallel_for(range<1>(N), [=](id<1> i) { data[i] = 10; });
  auto e2 = q.parallel_for(range<1>(N), [=](id<1> i) { data[i] += 5; });
  q.submit([&](handler& h) {
    h.depends_on(e2);
    h.parallel_for(range<1>(N), [=](id<1> i) { data[i] += 1; });
  });

The same pattern works with several allocations (e.g. data1 and data2, each produced by its own kernel and consumed by a third). A more simplified way of specifying the dependence is to pass the events directly to parallel_for:

  auto e1 = q.parallel_for(range<1>(N), [=](id<1> i) { data1[i] = 10; });
  auto e2 = q.parallel_for(range<1>(N), [=](id<1> i) { data2[i] = 10; });
  q.parallel_for(range<1>(N), {e1, e2}, [=](id<1> i) { data1[i] += data2[i]; });

Alternatively, use the in_order property for the queue, which serializes all submissions:

  queue q{property::queue::in_order()};
  int* data = malloc_shared<int>(N, q);
  // successive submissions now execute in order; no events needed

USM: What Else You Need to Know

▪ Remember the tradeoff between convenience and performance: host? device? shared?
▪ Remember about the context an allocation belongs to
▪ Use sycl::free instead of plain C++ free for USM allocations
▪ There are multiple styles of allocation:
  • C style: void* malloc_device(size_t size, const device& dev, const context& ctxt);
  • C++ style: template <typename T> T* malloc_device(size_t count, const device& dev, const context& ctxt);

NDRange

DPC++ Thread Hierarchy and Mapping

▪ All work-items in a work-group are scheduled on one compute unit, which has its own local memory
▪ All work-items in a sub-group are mapped to vector hardware

Shared Local Memory

SLM has limited size, e.g. 64 KB of SLM per work-group:

  std::cout << "Local Memory Size: "
            << q.get_device().get_info<sycl::info::device::local_mem_size>() << std::endl;

Use the local_accessor class, or the accessor specialization with target::local (the template arguments were truncated in the original; the form below is one valid spelling):

  q.submit([&](auto& h) {
    sycl::accessor<int, 1, sycl::access::mode::read_write, sycl::access::target::local>
        slm(sycl::range<1>(64), h);
    // ... kernel using slm ...
  });

tinyurl.com/SLM-in-SYCL

Logical Memory Hierarchy

(The slide shows a diagram.) A device contains work-groups, and each work-group contains work-items. Each work-item has private memory, each work-group has local memory, and all share the device's global and constant memory.

ND-range Kernels

▪ Basic parallel kernels are an easy way to parallelize a for-loop, but they do not allow performance optimization at the hardware level.
▪ An ND-range kernel is another way to express parallelism that enables low-level performance tuning by providing access to local memory and mapping execution to compute units on the hardware.
  • The entire iteration space is divided into smaller groups called work-groups; work-items within a work-group are scheduled on a single compute unit.
  • Grouping kernel executions into work-groups allows control of resource usage and load-balanced work distribution.

The functionality of ND-range kernels is exposed via the nd_range and nd_item classes:

  h.parallel_for(nd_range<1>(range<1>(1024),   // global size
                             range<1>(64)),    // work-group size
    [=](nd_item<1> item) {
      auto idx = item.get_global_id();
      auto local_id = item.get_local_id();
      // CODE THAT RUNS ON DEVICE
  });

The nd_range class represents a grouped execution range using a global execution range and the local execution range of each work-group. The nd_item class represents an individual instance of a kernel function and allows queries for the work-group range and index.

Sub-groups

▪ A subset of work-items within a work-group that may map to vector hardware.
▪ Why use sub-groups? (see the shuffle sketch below)
  • Work-items in a sub-group can communicate directly using shuffle operations, without explicit memory operations.
  • Work-items in a sub-group can synchronize using sub-group barriers and guarantee memory consistency using sub-group memory fences.
  • Work-items in a sub-group have access to sub-group collectives, providing fast implementations of common parallel patterns.
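A hedged sketch of a sub-group shuffle (the ONEAPI::sub_group API matches the 2021-era DPC++ namespace used on the next slide; data is an assumed USM pointer, and N, B are assumed sizes):

q.submit([&](handler& h) {
  h.parallel_for(nd_range<1>(N, B), [=](nd_item<1> item) {
    ONEAPI::sub_group sg = item.get_sub_group();
    size_t i = item.get_global_id(0);
    // Read the value held by the work-item one lane to the right in the
    // same sub-group, without going through memory.
    int neighbor = sg.shuffle_down(data[i], 1);
    // ... use neighbor (the last lane of each sub-group gets an undefined value) ...
  });
});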
Sub-groups: the sub_group Class

▪ The sub-group handle can be obtained from the nd_item using get_sub_group():

  h.parallel_for(nd_range<1>(N, B), [=](nd_item<1> item) {
    ONEAPI::sub_group sg = item.get_sub_group();
    // KERNEL CODE
  });

▪ Once you have the sub-group handle, you can query it for more information about the sub-group, do shuffle operations, or use collective functions.
▪ Use the explicit kernel attribute [[intel::reqd_sub_group_size(N)]] to control the sub-group size.

The sub-group handle can be queried for other information:
• get_local_id() returns the index of the work-item within its sub-group
• get_local_range() returns the size of the sub-group
• get_group_id() returns the index of the sub-group
• get_group_range() returns the number of sub-groups within the parent work-group

  h.parallel_for(nd_range<1>(N, B), [=](nd_item<1> item) {
    ONEAPI::sub_group sg = item.get_sub_group();
    if (sg.get_local_id() == 0) {
      out << "sub_group id: " << sg.get_group_id()[0]
          << " of " << sg.get_group_range()
          << ", size=" << sg.get_local_range()[0]
          << endl;
    }
  });

nd_range & nd_item

Example: process every pixel in a 1920x1080 image.
• Each pixel needs processing, so the kernel is executed on each pixel (work-item)
• 1920 x 1080 = 2M pixels = global size
• Not all 2M can run in parallel on the device; there are hardware resource limits
• We have to split the work into smaller blocks of pixels = local size (work-group)

Let the compiler determine the work-group size:

  h.parallel_for(range<2>(1920,1080), [=](id<2> item) {
    // CODE THAT RUNS ON DEVICE
  });

Or the programmer specifies the work-group size:

  h.parallel_for(nd_range<2>(range<2>(1920,1080),  // global size
                             range<2>(8,8)),       // local (work-group) size
    [=](nd_item<2> item) {
      // CODE THAT RUNS ON DEVICE
  });

How do we choose the work-group size?
• 8x8 divides 1920x1080 equally: GOOD
• 9x9 does not divide 1920x1080 equally; an invalid work-group-size error is thrown
• 10x10 divides equally and works, but it is always better to use a multiple of 8 for better resource utilization
• 24x24 divides equally, but 24x24 = 576 work-items will fail, assuming the GPU maximum work-group size is 256

Demo/Lab
DPC++ Unified Shared Memory
DPC++ Sub-groups

VectorAdd USM

▪ oneapi_demo/dpcpp_compiler/src/vector-add-usm.cpp

$ dpcpp -O2 -g -std=c++17 -o vector-add-usm src/vector-add-usm.cpp
$ SYCL_DEVICE_TYPE=GPU ./vector-add-usm
Running on device: Intel(R) Graphics Gen9 [0x3ea5]
Vector size: 10000
[0]: 0 + 0 = 0
[1]: 1 + 1 = 2
[2]: 2 + 2 = 4
...
[9999]: 9999 + 9999 = 19998
Vector add successfully completed on device.

Lab USM

▪ oneapi_demo/dpcpp_compiler/src/usm-lab.cpp
▪ Task: the code uses USM and submits three kernels to the device. The first two kernels modify two different memory objects, and the third depends on the first two. No data dependency between the three submissions is specified, so fix the code to get the desired output of 25. (One possible shape of the fix is sketched below.)
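A hedged sketch of how such dependences can be declared (the kernels and values here are placeholders, not the actual lab code):

queue q;
int* a = malloc_shared<int>(N, q);
int* b = malloc_shared<int>(N, q);
auto e1 = q.parallel_for(range<1>(N), [=](id<1> i) { a[i] = 10; });  // kernel 1: writes a
auto e2 = q.parallel_for(range<1>(N), [=](id<1> i) { b[i] = 15; });  // kernel 2: writes b
// kernel 3 must see both results: pass e1 and e2 as dependences
q.parallel_for(range<1>(N), {e1, e2}, [=](id<1> i) { a[i] += b[i]; }).wait();
// a[i] is now 25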
Sub-groups Lab

▪ oneapi_demo/dpcpp_compiler/src/sugroup.cpp

Useful Links

Open source projects:
▪ oneAPI Data Parallel C++ compiler: github.com/intel/llvm
▪ Graphics Compute Runtime: github.com/intel/compute-runtime
▪ Graphics Compiler: github.com/intel/intel-graphics-compiler

▪ SYCL 2020: tinyurl.com/sycl2020-spec
▪ DPC++ Extensions: tinyurl.com/dpcpp-ext
▪ Environment Variables: tinyurl.com/dpcpp-env-vars
▪ DPC++ book: tinyurl.com/dpcpp-book
▪ oneAPI training: colfax-intl.com/training/intel-oneapi-training
▪ oneAPI Base Training Modules: devcloud.intel.com/oneapi/get_started/baseTrainingModules/
▪ Code samples: tinyurl.com/dpcpp-tests, tinyurl.com/oneapi-samples

QUESTIONS?

Notices & Disclaimers

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.
Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.
Your costs and results may vary.
Intel technologies may require enabled hardware, software or service activation.
Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.
© Intel Corporation. Intel, the Intel logo, Xeon, Core, VTune, OpenVINO, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.