Cross-Architecture Programming for Accelerated Compute, Freedom of Choice for Hardware
Introduction to heterogeneous programming with Data Parallel C++

One Intel Software & Architecture group Intel Architecture, Graphics & Software March 2021

All information provided in this deck is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.

Agenda

▪ What is DPC++?
▪ Intel Compilers
▪ “Hello World” Example
▪ Basic Concepts: buffer, accessor, queue, kernel, etc.
▪ Error Handling
▪ Device Selection
▪ Compilation and Execution Flow
▪ Demo/Lab – part I

▪ Unified Shared Memory
▪ Sub-groups
▪ Demo/Lab – part II

2 What is DPC++

3 Data Parallel C++ Standards-based, Cross-architecture Language

DPC++ = ISO C++ and Khronos SYCL and community extensions

Freedom of Choice: Future-Ready Programming Model

Direct Programming with Data Parallel C++:
▪ Allows code reuse across hardware targets
▪ Permits custom tuning for a specific accelerator
▪ Open, cross-industry alternative to proprietary languages

DPC++ = ISO C++ and Khronos SYCL and community extensions
▪ Delivers C++ productivity benefits, using common, familiar C and C++ constructs
▪ Adds SYCL from the Khronos Group for data parallelism and heterogeneous programming

Community Project Drives Language Enhancements ▪ Provides extensions to simplify data parallel programming ▪ Continues evolution through open and cooperative development

The open source and Intel DPC++/C++ compilers support Intel CPUs, GPUs, and FPGAs. Codeplay has announced a DPC++ compiler that targets Nvidia GPUs.

Intel® oneAPI DPC++/C++ Compiler: Parallel Programming Productivity & Performance

Compiler to deliver uncompromised parallel programming productivity and performance across CPUs and accelerators

▪ Allows code reuse across hardware targets, while permitting custom tuning for a specific accelerator
▪ Open, cross-industry alternative to single-architecture proprietary languages
▪ Builds upon Intel’s decades of experience in architecture and high-performance compilers

oneAPI DPC++/C++ Compiler and Runtime (DPC++ source code → Clang/LLVM → DPC++ runtime → CPU / GPU / FPGA):
▪ Compiler: https://github.com/intel/llvm
▪ Runtime: https://github.com/intel/compute-runtime

Code samples: tinyurl.com/dpcpp-tests tinyurl.com/oneapi-samples

There will still be a need to tune for each architecture.

DPC++ Extensions
tinyurl.com/dpcpp-ext tinyurl.com/sycl2020-spec

Extension | Purpose | In SYCL 2020
USM (Unified Shared Memory) | Pointer-based programming | ✓
Sub-groups | Cross-lane operations | ✓
Reductions | Efficient parallel primitives | ✓
Work-group collectives | Efficient parallel primitives | ✓
Pipes | Spatial data flow support |
Argument restrict | Optimization |
Optional lambda name for kernels | Simplification | ✓
In-order queues | Simplification | ✓
Class template argument deduction and simplification | Simplification | ✓

tinyurl.com/openmp-and-dpcpp

Intel® Compilers

Intel Compiler | Target | OpenMP Support | OpenMP Offload Support | Included in oneAPI Toolkit
Intel® C++ Compiler, IL0 (icc) | CPU | Yes | No | HPC
Intel® oneAPI DPC++/C++ Compiler (dpcpp) | CPU, GPU, FPGA* | Yes | Yes | Base
Intel® oneAPI DPC++/C++ Compiler (icx) | CPU, GPU* | Yes | Yes | Base
Intel® Fortran Compiler, IL0 (ifort) | CPU | Yes | No | HPC
Intel® Fortran Compiler Beta (ifx) | CPU, GPU* | Yes | Yes | HPC

Cross-compiler binary compatible and linkable!

tinyurl.com/oneAPI-Base-download tinyurl.com/oneAPI-HPC-download

8 Codeplay Launched Data Parallel C++ Compiler for Nvidia GPUs

▪ Developers can retarget and reuse code between NVIDIA and Intel compute accelerators from a single source base
▪ Codeplay is the first oneAPI industry contributor to implement a developer tool based on oneAPI specifications
▪ They leveraged the DPC++ LLVM-based open source project that Intel established
▪ Codeplay is a key driver of the Khronos SYCL standard, upon which DPC++ is based
▪ More details in the Codeplay blog post
▪ Build the DPC++ toolchain with support for NVIDIA CUDA: tinyurl.com/dpcpp-cuda-be tinyurl.com/dpcpp-cuda-be-webinar

Other News:
▪ Codeplay Brings NVIDIA GPU Support to Industry-Standard Math Library
▪ Intel Open Sources the oneAPI Math Kernel Library Interface

Other names and brands may be claimed as the property of others. 9 “Hello World” Example

11 DPC++ Basics

[Diagram] DPC++ execution model: a DPC++ application contains host code, executed on the host (CPU) with its host memory, and device code. The host submits command groups through command queues to a device; device code executes on the device's compute units (CUs), where each work-item has private memory, each work-group shares local memory, and global memory is visible to the whole device.

12 Anatomy of a DPC++ Application

#include <CL/sycl.hpp>
using namespace sycl;

int main() {
  std::vector<int> A(1024), B(1024), C(1024);
  // some data initialization
  {                                                   // host code
    buffer bufA {A}, bufB {B}, bufC {C};
    queue q;
    q.submit([&](handler &h) {
      auto A = bufA.get_access(h, read_only);
      auto B = bufB.get_access(h, read_only);
      auto C = bufC.get_access(h, write_only);
      h.parallel_for(1024, [=](auto i){               // accelerator device code
        C[i] = A[i] + B[i];
      });
    });
  }
  for (int i = 0; i < 1024; i++)                      // host code
    std::cout << "C[" << i << "] = " << C[i] << std::endl;
}

13 Asynchronous Execution

#include <CL/sycl.hpp>
using namespace sycl;

int main() {
  std::vector<int> A(1024), B(1024), C(1024);
  // some data initialization
  {
    buffer bufA {A}, bufB {B}, bufC {C};
    queue q(gpu_selector{});
    q.submit([&](handler &cgh) {
      auto A = bufA.get_access(cgh, access::mode::read );
      auto B = bufB.get_access(cgh, access::mode::read );
      auto C = bufC.get_access(cgh, access::mode::write);
      cgh.parallel_for(1024, [=](auto i){   // enqueues the kernel to the graph; the host keeps going
        C[i] = A[i] + B[i];
      });
    });
  }
  for (int i = 0; i < 1024; i++)
    std::cout << C[i] << std::endl;
}

The graph of submitted work executes asynchronously to the host program: host code only enqueues kernels and continues.

1414 Anatomy of a DPC++ Application

#include <CL/sycl.hpp>
using namespace sycl;

int main() {                                          // application scope
  std::vector<int> A(1024), B(1024), C(1024);
  // some data initialization
  {
    buffer bufA {A}, bufB {B}, bufC {C};
    queue q;
    q.submit([&](handler &h) {                        // command group scope
      auto A = bufA.get_access(h, read_only);
      auto B = bufB.get_access(h, read_only);
      auto C = bufC.get_access(h, write_only);
      h.parallel_for(1024, [=](auto i){               // device scope
        C[i] = A[i] + B[i];
      });
    });
  }
  for (int i = 0; i < 1024; i++)                      // application scope
    std::cout << "C[" << i << "] = " << C[i] << std::endl;
}

DPC++ Basics: Buffers (same vector-add example as above)

▪ Buffers are created from host vectors/pointers: buffer bufA {A}, bufB {B}, bufC {C};
▪ Buffers encapsulate data in a SYCL application, across both devices and the host!

DPC++ Basics: Queues (same vector-add example as above)

▪ A queue submits command groups to be executed by the SYCL runtime: queue q; ... q.submit([&](handler &h) {...});
▪ The queue is the mechanism through which work is submitted to a device.

DPC++ Basics: Accessors (same vector-add example as above)

▪ Accessors are created from buffers: auto A = bufA.get_access(h, read_only);
▪ They are the mechanism by which kernels access buffer data
▪ They create the data dependences in the SYCL graph that order kernel executions

DPC++ Basics: Kernels (same vector-add example as above)

▪ The vector-add kernel is enqueued as a parallel_for task; the 1024 argument is interpreted as range<1>{1024} and the lambda parameter as an id<1>
▪ Pass a function object/lambda to be executed by each work-item

$ dpcpp vector_add.cpp
$ icx -fsycl -fsycl-unnamed-lambda vector_add.cpp

19 Graph of Kernel Executions

Automatic data and control dependence resolution!

int main() {
  auto R = range<1>{ num };
  buffer<int> A{ R }, B{ R };
  queue Q;

  Q.submit([&](handler& h) {      // Kernel 1: writes A
    accessor out(A, h, write_only);
    h.parallel_for(R, [=](id<1> idx) { out[idx] = idx[0]; });
  });

  Q.submit([&](handler& h) {      // Kernel 2: writes A
    accessor out(A, h, write_only);
    h.parallel_for(R, [=](id<1> idx) { out[idx] = idx[0]; });
  });

  Q.submit([&](handler& h) {      // Kernel 3: writes B
    accessor out(B, h, write_only);
    h.parallel_for(R, [=](id<1> idx) { out[idx] = idx[0]; });
  });

  Q.submit([&](handler& h) {      // Kernel 4: reads A, read-writes B
    accessor in(A, h, read_only);
    accessor inout(B, h);
    h.parallel_for(R, [=](id<1> idx) { inout[idx] *= in[idx]; });
  });
}

Resulting graph: Kernel 1 → Kernel 2 (through A), Kernel 2 → Kernel 4 (A), Kernel 3 → Kernel 4 (B), then program completion; the arrows are data dependences.

2020 DPC++ Functions for Invoking Kernels

// Single task: kernel function is executed exactly once, on a single work-item
h.single_task([=]() {
  // ...
});

// Basic data parallel: kernel function is executed over an n-dimensional range (ND-range)
h.parallel_for(
  range<3>(1024,1024,1024),       // using 3D in this example
  [=](id<3> myID) {
    // ...
  });

// Explicit ND-range: global range {1024,1024,1024}, work-group size {16,16,16}
h.parallel_for(
  nd_range<3>({1024,1024,1024},{16,16,16}),   // using 3D in this example
  [=](nd_item<3> myID) {
    // ...
  });

// Hierarchical parallelism
h.parallel_for_work_group(
  range<2>(1024,1024),            // using 2D in this example
  [=](group<2> grp) {
    // kernel function is executed once per work-group
    grp.parallel_for_work_item(
      range<1>(1024),             // using 1D in this example
      [=](h_item<1> myItem) {
        // kernel function is executed once per work-item
      });
  });

21 Basic Parallel Kernels

The functionality of basic parallel kernels is exposed via range, id and item classes

• The range class describes the iteration space of parallel execution
• The id class indexes an individual instance of a kernel in a parallel execution
• The item class represents an individual instance of a kernel function and exposes additional functions to query properties of the execution range

h.parallel_for(range<1>(1024), [=](id<1> idx){
  // CODE THAT RUNS ON DEVICE
});

h.parallel_for(range<1>(1024), [=](item<1> item){
  auto idx = item.get_id();
  auto R = item.get_range();
  // CODE THAT RUNS ON DEVICE
});

2222 Synchronization

23 Synchronization

▪ Synchronization within a kernel function
  • Barriers for synchronizing work-items within a work-group
  • No synchronization primitives across work-groups
▪ Synchronization between host and device
  • Call to the wait() member function of the device queue
  • Buffer destruction synchronizes the data back to host memory
  • A host accessor constructor is a blocking call: it returns only after all enqueued kernels operating on that buffer finish execution
  • DAG construction from command group function objects enqueued into the device queue
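A minimal sketch of the two host/device synchronization points listed above (queue::wait() and buffer destruction), assuming a trivial increment kernel:

#include <CL/sycl.hpp>
using namespace sycl;

int main() {
  std::vector<int> data(1024, 1);
  queue q;
  {
    buffer buf{data};                  // buffer adopts the host vector
    q.submit([&](handler &h) {
      accessor a(buf, h);
      h.parallel_for(1024, [=](auto i) { a[i] += 1; });
    });
    q.wait();                          // option 1: block the host until submitted work completes
  }                                    // option 2: buffer destruction synchronizes results back into 'data'
  std::cout << data[0] << std::endl;   // safe to read on the host here
}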

Host Accessors

▪ An accessor which uses host buffer access target

▪ Created outside of command group scope

▪ The data that this gives access to will be available on the host

▪ Used to synchronize the data back to the host by constructing the host accessor objects

25 Host Accessor

int main() {
  constexpr int N = 100;
  auto R = range<1>(N);
  std::vector<int> v(N, 10);
  queue q;
  buffer buf(v);
  q.submit([&](handler& h) {
    accessor a(buf, h);
    h.parallel_for(R, [=](auto i) {
      a[i] -= 2;
    });
  });

  host_accessor b(buf, read_only);
  for (int i = 0; i < N; i++)
    std::cout << b[i] << "\n";
  return 0;
}

▪ The buffer takes ownership of the data stored in the vector.
▪ Creating a host accessor is a blocking call: it returns only after all enqueued DPC++ kernels that modify the same buffer, in any queue, complete execution, and the data is then available to the host through the host accessor.

26 Error Handling

Error Handling

DPC++ is based on C++:
▪ Errors in C++ are handled through exceptions
▪ SYCL uses exceptions, not return codes!

Synchronous exceptions
▪ Detected immediately
▪ Use a try...catch block

Asynchronous exceptions
▪ Detected later, after an API call has returned
▪ E.g. faults inside a command group or a kernel
▪ Handled by a user-provided error handler function

29 Synchronous exceptions

▪ Thrown immediately when an API call fails • failure to construct an object, e.g. can’t create buffer ▪ Normal C++ exceptions

try {
  device_queue.reset(new queue(device_selector));
} catch (exception const& e) {
  std::cout << "Caught a synchronous SYCL exception: " << e.what();
  return;
}

30 Asynchronous exceptions

▪ Caused by a future failure
  • E.g. an error occurring during execution of a kernel on a device
  • The host program has already moved on to new things!
▪ The programmer provides a processing function, and says when to process exceptions

static auto exception_handler = [](sycl::exception_list e_list) {
  for (std::exception_ptr const &e : e_list) {
    try {
      std::rethrow_exception(e);
    } catch (std::exception const &e) {
#if _DEBUG
      std::cout << "Failure" << std::endl;
#endif
      std::terminate();
    }
  }
};

▪ queue::wait_and_throw(), queue::throw_asynchronous(), event::wait_and_throw()
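A minimal sketch of how such a handler is typically wired up (assuming the exception_handler lambda defined above): pass it to the queue constructor, then call wait_and_throw() so pending asynchronous exceptions are delivered to it.

queue q(default_selector{}, exception_handler);   // attach the asynchronous handler to the queue

q.submit([&](handler &h) {
  h.parallel_for(1024, [=](auto) { /* kernel body that may fault */ });
});

q.wait_and_throw();   // any asynchronous exceptions are passed to exception_handler here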

31 Device Selection

32 Where is my “Hello World” code executed? Device Selector

Get a device (any device):
  queue q;                                // default_selector{}

Create a queue targeting a pre-configured class of devices:
  queue q(cpu_selector{});
  queue q(gpu_selector{});
  queue q(intel::fpga_selector{});
  queue q(accelerator_selector{});
  queue q(host_selector{});

Create a queue targeting a specific device (custom criteria), SYCL 1.2.1 style:
  class custom_selector : public device_selector {
    int operator()(……                     // Any logic you want!
    …
  };
  queue q(custom_selector{});

default_selector:
• The DPC++ runtime scores all devices and picks the one with the highest compute power
• Environment variables:
  export SYCL_DEVICE_TYPE=GPU | CPU | HOST
  export SYCL_DEVICE_FILTER={backend:device_type:device_num}   (*will be available in oneAPI Update 1)

See tinyurl.com/dpcpp-env-vars for more details on environment variables.

DPC++ Device Selection
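The custom_selector above is truncated on the slide. A minimal sketch of what the scoring logic can look like (an illustration, not the original slide's code): prefer a GPU, never pick the host device, and accept anything else with a lower score.

class custom_selector : public device_selector {
public:
  int operator()(const device &dev) const override {
    if (dev.is_gpu())  return 100;   // highest score: choose a GPU when one is available
    if (dev.is_host()) return -1;    // negative score: never select the host device
    return 10;                       // any other device is acceptable
  }
};

queue q(custom_selector{});
std::cout << q.get_device().get_info<info::device::name>() << std::endl;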

35 Compilation and Execution Flow

36 Just-in-Time Compilation (JIT)

$ dpcpp vector_add.cpp

▪ Default mode: the device binary is generated at runtime (source code → compiler → host object + SPIR-V → linker → FAT binary containing the host binary plus SPIR-V)
  + flexibility
  + smaller binary size
  - runtime performance (JIT overhead)
  - errors surface at runtime

37 Ahead-of-Time Compilation (AOT)

$ dpcpp -fsycl-targets=spir64_gen-unknown-unknown-sycldevice vector_add.cpp

▪ The device binary is generated at compile time (source code → compiler → host object + device object code → linker → FAT binary)
  - flexibility
  + runtime performance
  + compile-time checks
  + easier debugging

tinyurl.com/fsycl-targets 38 Runtime Architecture

[Diagram] A DPC++ application calls into the DPC++ runtime library (SYCL API). The runtime sits on top of the DPC++ Runtime Plugin Interface (PI), with PI plugins for Level Zero, OpenCL, and (in the open source compiler) CUDA, each backed by its native runtime and driver. Devices: Intel GPU via Level Zero; Intel GPU, FPGA, and CPU via OpenCL. The backend is selected via the SYCL_BE environment variable (PI_OPENCL, PI_LEVEL0, PI_CUDA), which is being replaced by SYCL_DEVICE_FILTER.

More info: tinyurl.com/dpcpp-pi, tinyurl.com/levelzero-spec 40 Demo/Lab

Data Parallel C++ Essentials

41 Practice time!

▪ Go to oneapi.teams ▪ Git clone https://oneapi.team/ashadrin/oneapi_demo.git

▪ Or try in your own environment: • source /opt/intel/oneapi/setvars.sh • dpcpp --help

43 Device information

▪ oneapi_demo/dpcpp_compiler/src/get-device-info.cpp
▪ Task 1: run with the OpenCL backend
  • Hint: you can ask the PI interface to show debug info using SYCL_PI_TRACE=1
▪ Task 2: run on the CPU only
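The contents of get-device-info.cpp are not reproduced here; the sketch below is a hypothetical, minimal version of that kind of device enumeration, useful for checking which platforms and devices the runtime sees.

#include <CL/sycl.hpp>
using namespace sycl;

int main() {
  // list every platform and device the DPC++ runtime can see
  for (auto &p : platform::get_platforms()) {
    std::cout << "Platform: " << p.get_info<info::platform::name>() << std::endl;
    for (auto &d : p.get_devices()) {
      std::cout << "  Device: " << d.get_info<info::device::name>()
                << (d.is_gpu() ? " [GPU]" : d.is_cpu() ? " [CPU]" : "") << std::endl;
    }
  }
}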

44 VectorAdd Buffers

▪ oneapi_demo/dpcpp_compiler/src/vector-add-buffers.cpp

$ dpcpp -O2 -g -std=c++17 -o vector-add-buffers src/vector-add-buffers.cpp
$ SYCL_DEVICE_TYPE=GPU ./vector-add-buffers
Running on device: Intel(R) Graphics Gen9 [0x3ea5]
Vector size: 10000
[0]: 0 + 0 = 0
[1]: 1 + 1 = 2
[2]: 2 + 2 = 4
...
[9999]: 9999 + 9999 = 19998
Vector add successfully completed on device.

45 JIT \ AOT

▪ oneapi_demo/dpcpp_compiler/src/vector-add-buffers.cpp
▪ Work with -fsycl-targets to build kernels for CPU and GPU
  • -fsycl-targets=spir64_x86_64-unknown-unknown-sycldevice
  • -fsycl-targets=spir64_gen-unknown-unknown-sycldevice

More info: tinyurl.com/aot-compilation 46 Unified Shared Memory

47 Motivation

The SYCL 1.2.1 standard provides a buffer memory abstraction
• Powerful, and elegantly expresses data dependences
However…
• Replacing all pointers and arrays with buffers in a C++ program can be a burden to programmers

USM provides a pointer-based alternative in DPC++
• Simplifies porting to an accelerator
• Gives programmers the desired level of control
• Complementary to buffers

4848 Developer View Of USM

▪ Developers can reference same memory object in host and device code with Unified Shared Memory

4949 Unified Shared Memory Show me the code!

Buffers version:

#include <CL/sycl.hpp>
using namespace sycl;

int main() {
  std::vector<int> A(1024), B(1024), C(1024);
  // some data initialization
  {
    buffer bufA {A}, bufB {B}, bufC {C};
    queue q(gpu_selector{});
    q.submit([&](handler &cgh) {
      auto A = bufA.get_access(cgh, access::mode::read );
      auto B = bufB.get_access(cgh, access::mode::read );
      auto C = bufC.get_access(cgh, access::mode::write);
      cgh.parallel_for(1024, [=](auto i){
        C[i] = A[i] + B[i];
      });
    });
  }
  for (int i = 0; i < 1024; i++){
    std::cout << C[i] << std::endl;
  }
}

USM version:

#include <CL/sycl.hpp>
using namespace sycl;

int main() {
  queue q(gpu_selector{});
  int *A = malloc_shared<int>(1024, q);
  int *B = malloc_shared<int>(1024, q);
  int *C = malloc_shared<int>(1024, q);
  // some data initialization
  q.submit([&](handler &cgh) {
    cgh.parallel_for(1024, [=](auto i){
      C[i] = A[i] + B[i];
    });
  });
  q.wait();
  for (int i = 0; i < 1024; i++){
    std::cout << C[i] << std::endl;
  }
  free(A, q);
  free(B, q);
  free(C, q);
}

DPC++ Unified Shared Memory

Unified Shared Memory provides both explicit and implicit models for managing memory.

Allocation Type | Description | Accessible on Host | Accessible on Device
device | Allocations in device memory (explicit) | NO | YES
host | Allocations in host memory (implicit) | YES | YES
shared | Allocations can migrate between host and device memory (implicit) | YES | YES

Automatic data accessibility and explicit data movement supported
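A minimal sketch of the three allocation kinds from the table above, assuming the templated, queue-based allocation functions:

queue q;

int *dev = malloc_device<int>(1024, q);  // device allocation: dereference only inside kernels
int *hst = malloc_host<int>(1024, q);    // host allocation: accessible on host and device
int *shr = malloc_shared<int>(1024, q);  // shared allocation: migrates between host and device

hst[0] = 1;                              // OK on the host
shr[0] = 2;                              // OK on the host
// dev[0] = 3;                           // not valid on the host; use a kernel or q.memcpy()

free(dev, q);
free(hst, q);
free(shr, q);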

5151 USM - Explicit Data Movement

queue q;
int hostArray[42];
int *deviceArray = (int*) malloc_device(42 * sizeof(int), q);

for (int i = 0; i < 42; i++) hostArray[i] = 42;

// copy hostArray to deviceArray
q.memcpy(deviceArray, &hostArray[0], 42 * sizeof(int));
q.wait();

q.submit([&](handler& h){
  h.parallel_for(42, [=](auto ID) {
    deviceArray[ID]++;
  });
});
q.wait();

// copy deviceArray back to hostArray
q.memcpy(&hostArray[0], deviceArray, 42 * sizeof(int));
q.wait();

free(deviceArray, q);

52 USM - Implicit Data Movement

queue q;
int *hostArray = (int*) malloc_host(42 * sizeof(int), q);
int *sharedArray = (int*) malloc_shared(42 * sizeof(int), q);

for (int i = 0; i < 42; i++) hostArray[i] = 1234;

q.submit([&](handler& h){
  h.parallel_for(42, [=](auto ID) {
    // access sharedArray and hostArray on device
    sharedArray[ID] = hostArray[ID] + 1;
  });
});
q.wait();

for (int i = 0; i < 42; i++) hostArray[i] = sharedArray[i];

free(sharedArray, q);
free(hostArray, q);

53 USM - Data Dependency in Queues

There are no accessors in USM, so dependences must be specified explicitly, using events:
• queue.wait()
• wait on event objects
• use the depends_on method inside a command group

54 USM - Data Dependency in Queues

▪ Explicit wait() is used to ensure the data dependency between the three queue submissions; wait() blocks execution on the host.

queue q;
int* data = malloc_shared<int>(N, q);
for (int i = 0; i < N; i++) data[i] = 10;

q.submit([&] (handler &h){
  h.parallel_for(range<1>(N), [=](id<1> i){
    data[i] += 2;
  });
}).wait();

q.submit([&] (handler &h){
  h.parallel_for(range<1>(N), [=](id<1> i){
    data[i] += 3;
  });
}).wait();

q.submit([&] (handler &h){
  h.parallel_for(range<1>(N), [=](id<1> i){
    data[i] += 5;
  });
}).wait();

for (int i = 0; i < N; i++) std::cout << data[i] << " ";

5555 USM - Data Dependency in Queues

▪ The depends_on() method lets the command group handler know that the specified event must complete before the specified task can execute. (SYCL 2020)

queue q;
int* data = malloc_shared<int>(N, q);
for (int i = 0; i < N; i++) data[i] = 10;

auto e1 = q.submit([&] (handler &h){
  h.parallel_for(range<1>(N), [=](id<1> i){
    data[i] += 2;
  });
});

auto e2 = q.submit([&] (handler &h){
  h.depends_on(e1);
  h.parallel_for(range<1>(N), [=](id<1> i){
    data[i] += 3;
  });
});
// non-blocking; execution of host code is possible

q.submit([&] (handler &h){
  h.depends_on(e2);
  h.parallel_for(range<1>(N), [=](id<1> i){
    data[i] += 5;
  });
}).wait();

for (int i = 0; i < N; i++) std::cout << data[i] << " ";

USM - Data Dependency in Queues

▪ The depends_on() method can also take a list of events that must complete before the specified task can execute.

queue q;
int* data1 = malloc_shared<int>(N, q);
int* data2 = malloc_shared<int>(N, q);
for (int i = 0; i < N; i++) { data1[i] = 10; data2[i] = 10; }

auto e1 = q.submit([&] (handler &h){
  h.parallel_for(range<1>(N), [=](id<1> i){
    data1[i] += 2;
  });
});

auto e2 = q.submit([&] (handler &h){
  h.parallel_for(range<1>(N), [=](id<1> i){
    data2[i] += 3;
  });
});

q.submit([&] (handler &h){
  h.depends_on({e1, e2});
  h.parallel_for(range<1>(N), [=](id<1> i){
    data1[i] += data2[i];
  });
}).wait();

for (int i = 0; i < N; i++) std::cout << data1[i] << " ";

5757 USM - Data Dependency in Queues

A more simplified way of specifying dependencies, as parameters of parallel_for:

queue q;
int* data1 = malloc_shared<int>(N, q);
int* data2 = malloc_shared<int>(N, q);
for (int i = 0; i < N; i++) { data1[i] = 10; data2[i] = 10; }

auto e1 = q.parallel_for(range<1>(N), [=](id<1> i){
  data1[i] += 2;
});

auto e2 = q.parallel_for(range<1>(N), [=](id<1> i){
  data2[i] += 3;
});

q.parallel_for(range<1>(N), {e1, e2}, [=](id<1> i){
  data1[i] += data2[i];
}).wait();

for (int i = 0; i < N; i++) std::cout << data1[i] << " ";

5858 USM - Data Dependency in Queues

▪ Use the in_order queue property for serial execution: kernel executions will not overlap even if the submissions have no data dependency.

queue q{property::queue::in_order()};
int *data = malloc_shared<int>(N, q);
for (int i = 0; i < N; i++) data[i] = 10;

q.submit([&] (handler &h){
  h.parallel_for(range<1>(N), [=](id<1> i){
    data[i] += 2;
  });
});
// non-blocking; execution of host code is possible

q.submit([&] (handler &h){
  h.parallel_for(range<1>(N), [=](id<1> i){
    data[i] += 3;
  });
});
// non-blocking; execution of host code is possible

q.submit([&] (handler &h){
  h.parallel_for(range<1>(N), [=](id<1> i){
    data[i] += 5;
  });
}).wait();

for (int i = 0; i < N; i++) std::cout << data[i] << " ";

5959 USM – What else you need to know?

▪ Remember the tradeoff between convenience and performance
  • Host? Device? Shared?
▪ Remember about the context
▪ Use sycl::free instead of C++ free for USM allocations
▪ There are multiple styles of allocation:
  • C style: void *malloc_device(size_t size, const device &dev, const context &ctxt);
  • C++ style: template <typename T> T *malloc_device(size_t Count, const device &Dev, const context &Ctxt);
  • C++ allocators: template <typename T, usm::alloc AllocKind, size_t Alignment = 0> class usm_allocator;
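A minimal sketch of the C++ allocator style: a std::vector whose storage is USM shared memory, so its elements can be used directly inside a kernel (only the raw pointer is captured by the kernel lambda).

queue q;

usm_allocator<int, usm::alloc::shared> alloc(q);
std::vector<int, decltype(alloc)> data(1024, 1, alloc);

int *ptr = data.data();                   // raw USM pointer, usable on the device
q.parallel_for(range<1>(1024), [=](id<1> i) {
  ptr[i] += 1;
}).wait();

std::cout << data[0] << std::endl;        // prints 2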

60 60 NDRange

61 DPC++ Hierarchy and Mapping

62 DPC++ Thread Hierarchy and Mapping

All work-items in a work-group are scheduled on one Compute Unit, which has its own local memory

All work-items in a sub-group are mapped to vector hardware

63 Shared Local Memory

SLM has limited size, e.g. 64KB SLM for each work-group

std::cout << "Local Memory Size: " << q.get_device().get_info() << std::endl;

Use local_accessor class or target::local accessor specialization

q.submit([&](auto &h) {
  sycl::accessor<int, 1, sycl::access::mode::read_write, sycl::access::target::local>
      slm(sycl::range(32 * 64), h);
  h.parallel_for(sycl::nd_range(sycl::range{N}, sycl::range{32}),
                 [=](sycl::nd_item<1> it) {
    //...
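The snippet above is truncated on the slide. Below is a self-contained sketch under illustrative assumptions (N, B, and the staging pattern are chosen for the example): each work-group stages values in SLM, synchronizes with a work-group barrier, then reads a neighbour's staged value.

constexpr size_t N = 1024, B = 64;
queue q;
int *data = malloc_shared<int>(N, q);
for (size_t i = 0; i < N; i++) data[i] = i;

q.submit([&](handler &h) {
  // one SLM element per work-item in the work-group
  accessor<int, 1, access::mode::read_write, access::target::local> slm(range<1>(B), h);
  h.parallel_for(nd_range<1>(range<1>(N), range<1>(B)), [=](nd_item<1> it) {
    size_t lid = it.get_local_id(0);
    size_t gid = it.get_global_id(0);
    slm[lid] = data[gid];                            // stage into shared local memory
    it.barrier(access::fence_space::local_space);    // make SLM writes visible to the work-group
    data[gid] = slm[(lid + 1) % B];                  // read a neighbour's staged value
  });
}).wait();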

tinyurl.com/SLM-in-SYCL 64 Logical Memory Hierarchy

[Diagram] Logical memory hierarchy: a device contains work-groups; each work-group contains work-items, each with its own private memory; work-items in a work-group share local memory; all work-groups share the device's global memory and constant memory.

6565 ND-range Kernels

▪ Basic parallel kernels are an easy way to parallelize a for-loop, but they do not allow performance optimization at the hardware level.
▪ ND-range kernels are another way to express parallelism; they enable low-level performance tuning by providing access to local memory and by mapping executions to compute units on the hardware.

  • The entire iteration space is divided into smaller groups called work-groups; work-items within a work-group are scheduled on a single compute unit on the hardware.
  • Grouping kernel executions into work-groups allows control of resource usage and load-balanced work distribution.

6666 ND-range Kernels The functionality of nd_range kernels is exposed via nd_range and nd_item classes

h.parallel_for(nd_range<1>(range<1>(1024),   // global size
                           range<1>(64)),    // work-group size
               [=](nd_item<1> item){
  auto idx = item.get_global_id();
  auto local_id = item.get_local_id();
  // CODE THAT RUNS ON DEVICE
});

The nd_range class represents a grouped execution range, using the global execution range and the local execution range of each work-group. The nd_item class represents an individual instance of a kernel function and allows querying the work-group range and index.

6767 Sub-groups

68 Sub-groups

▪ A subset of work-items within a work-group that may map to vector hardware. ▪ Why use Sub-groups? • Work-items in a sub-group can communicate directly using shuffle operations, without explicit memory operations. • Work-items in a sub-group can synchronize using sub-group barriers and guarantee memory consistency using sub-group memory fences. • Work-items in a sub-group have access to sub-group collectives, providing fast implementations of common parallel patterns.

6969 Sub-groups

sub_group class

h.parallel_for(nd_range<1>(N,B), [=](nd_item<1> item) {
  ONEAPI::sub_group sg = item.get_sub_group();
  // KERNEL CODE
});

▪ The sub-group handle can be obtained from the nd_item using get_sub_group()
▪ Once you have the sub-group handle, you can query for more information about the sub-group, do shuffle operations, or use collective functions
▪ Use the explicit kernel attribute [[intel::reqd_sub_group_size(N)]] to control the sub-group size

7070 Sub-groups

The sub-group handle can be queried to get other information:
• get_local_id() returns the index of the work-item within its sub-group
• get_local_range() returns the size of the sub-group
• get_group_id() returns the index of the sub-group
• get_group_range() returns the number of sub-groups within the parent work-group

h.parallel_for(nd_range<1>(N,B), [=](nd_item<1> item){
  ONEAPI::sub_group sg = item.get_sub_group();
  if (sg.get_local_id() == 0) {
    out << "sub_group id: " << sg.get_group_id()[0]
        << " of " << sg.get_group_range()
        << ", size=" << sg.get_local_range()[0]
        << endl;
  }
});
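Besides queries, the sub-group handle exposes shuffle operations. The sketch below follows the ONEAPI::sub_group member-function style used on these slides; exact spellings changed between DPC++ releases, so treat the shuffle_down call as an assumption of that era's API.

constexpr size_t N = 1024, B = 64;
queue q;
int *data = malloc_shared<int>(N, q);
for (size_t i = 0; i < N; i++) data[i] = i;

q.submit([&](handler &h) {
  h.parallel_for(nd_range<1>(range<1>(N), range<1>(B)), [=](nd_item<1> item) {
    ONEAPI::sub_group sg = item.get_sub_group();
    size_t i = item.get_global_id(0);
    // read the next work-item's value directly across the sub-group, with no
    // local/global memory traffic (the last lane of each sub-group is unspecified)
    data[i] = sg.shuffle_down(data[i], 1);
  });
}).wait();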

7171 nd_range & nd_item

▪ Example: process every pixel in a 1920x1080 image
  • Each pixel needs processing; the kernel is executed on each pixel (work-item)
  • 1920 x 1080 = ~2M pixels = global size
  • Not all 2M can run in parallel on the device; there are hardware resource limits
  • We have to split the work into smaller groups of pixel blocks = local size (work-group)

nd_range & nd_item

▪ Example: process every pixel in a 1920x1080 image

• Let the compiler determine the work-group size:

h.parallel_for(range<2>(1920,1080),               // global size
               [=](id<2> item){
  // CODE THAT RUNS ON DEVICE
});

• Programmer specifies the work-group size:

h.parallel_for(nd_range<2>(range<2>(1920,1080),   // global size
                           range<2>(8,8)),        // local size (work-group size)
               [=](nd_item<2> item){
  // CODE THAT RUNS ON DEVICE
});

nd_range & nd_item

▪ Example: process every pixel in a 1920x1080 image
• How do we choose the work-group size?
  • Work-group size of 8x8 divides 1920x1080 equally: GOOD
  • Work-group size of 9x9 does not divide 1920x1080 equally: the compiler will throw an error (invalid work-group size)
  • Work-group size of 10x10 divides 1920x1080 equally: works, but it is always better to use a multiple of 8 for better resource utilization
  • Work-group size of 24x24 divides 1920x1080 equally, but 24x24 = 576 work-items per group will fail, assuming the GPU's maximum work-group size is 256
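One way to check that limit up front is to query the device before choosing the local range; a minimal sketch:

queue q;

// upper bound on work-items per work-group for this device
// (the 24x24 = 576 case above would exceed a limit of 256)
size_t max_wg = q.get_device().get_info<info::device::max_work_group_size>();
std::cout << "Max work-group size: " << max_wg << std::endl;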

75 ▪ Demo/Lab

DPC++ Unified Shared Memory DPC++ Subgroups

77 VectorAdd USM

▪ oneapi_demo/dpcpp_compiler/src/vector-add-usm.cpp

$ dpcpp -O2 -g -std=c++17 -o vector-add-usm src/vector-add-usm.cpp
$ SYCL_DEVICE_TYPE=GPU ./vector-add-usm
Running on device: Intel(R) Graphics Gen9 [0x3ea5]
Vector size: 10000
[0]: 0 + 0 = 0
[1]: 1 + 1 = 2
[2]: 2 + 2 = 4
...
[9999]: 9999 + 9999 = 19998
Vector add successfully completed on device.

78 Lab USM

▪ oneapi_demo/dpcpp_compiler/src/usm-lab.cpp
▪ Task: The code uses USM and has three kernels that are submitted to the device. The first two kernels modify two different memory objects, and the third one depends on the first two. As written, no data dependency is specified between the three queue submissions; fix the code so it produces the desired output of 25.

79 Subgroups

▪ oneapi_demo/dpcpp_compiler/src/sugroup.cpp

Useful Links

Open source projects:
▪ oneAPI Data Parallel C++ compiler: github.com/intel/llvm
▪ Graphics Compute Runtime: github.com/intel/compute-runtime
▪ Graphics Compiler: github.com/intel/intel-graphics-compiler

▪ SYCL 2020: tinyurl.com/sycl2020-spec
▪ DPC++ Extensions: tinyurl.com/dpcpp-ext
▪ Environment Variables: tinyurl.com/dpcpp-env-vars
▪ DPC++ book: tinyurl.com/dpcpp-book
▪ oneAPI training: colfax-intl.com/training/intel-oneapi-training
▪ oneAPI Base Training Modules: devcloud.intel.com/oneapi/get_started/baseTrainingModules/
▪ Code samples: tinyurl.com/dpcpp-tests tinyurl.com/oneapi-samples

81 QUESTIONS?

82 Notices & Disclaimers

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.

Your costs and results may vary.

Intel technologies may require enabled hardware, software or service activation.

Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.

© Intel Corporation. Intel, the Intel logo, Xeon, Core, VTune, OpenVINO, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
