GPU programming in C++ with SYCL

Gordon Brown, Principal Software Engineer, SYCL & C++

C++ Europe 2020 – June 2020

Agenda

● Why use the GPU?
● Brief introduction to SYCL
● SYCL programming model
● Optimising GPU programs
● SYCL for GPUs
● SYCL 2020 preview

Why use the GPU?

“The end of Moore’s Law”

“The free lunch is over”

“The future is parallel”

Take a typical Intel chip: an Intel Core i7 (7th Gen)

○ 4x CPU cores
  ■ Each with hyperthreading
  ■ Each with support for 256-bit AVX2 instructions
○ Intel Gen 9 GPU
  ■ With 1280 processing elements

Regular sequential C++ code (non-vectorised) running on a single core only takes advantage of a very small amount of the available resources of the chip.

Vectorisation allows you to fully utilise a single CPU core.

Multi-threading allows you to fully utilise all CPU cores.

Heterogeneous dispatch allows you to fully utilise the entire chip.

GPGPU programming was once a niche technology:

● Limited to specific domains
● Separate source solutions
● Verbose, low-level APIs
● Very steep learning curve

This is not the case anymore:

● Almost everything has a GPU now
● Single source solutions
● Modern C++ programming models (C++AMP, SYCL, CUDA, Kokkos, HPX, Raja, Agency)
● More accessible to the average C++ developer

Brief introduction to SYCL

SYCL is a single-source, high-level, standard C++ programming model that can target a range of heterogeneous platforms.

SYCL allows you to write both host CPU and device code in the same C++ source file. This requires two compilation passes: one for the host code and one for the device code.

[Diagram: an application sits on the SYCL template library and SYCL runtime, which call down into a backend (e.g. OpenCL). The host compiler produces a CPU executable with an embedded device binary; the device compiler produces device IR / ISA (e.g. SPIR).]

SYCL provides high-level abstractions over common boilerplate code:

● Platform/device selection
● Buffer creation
● Kernel compilation
● Dependency management and scheduling

[Side by side: a typical SYCL hello world application and a typical OpenCL hello world application.]

SYCL allows you to write standard C++:

● No language extensions
● No pragmas
● No attributes

Compare the same vector add in several models:

C++AMP:

  array_view<float, 2> a, b, c;
  extent<2> e(64, 64);
  parallel_for_each(e, [=](index<2> idx) restrict(amp) {
    c[idx] = a[idx] + b[idx];
  });

Pragma-based:

  std::vector<float> a, b, c;
  #pragma parallel_for
  for (int i = 0; i < a.size(); i++) {
    c[i] = a[i] + b[i];
  }

CUDA:

  __global__ void vec_add(float *a, float *b, float *c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    c[i] = a[i] + b[i];
  }

  float *a, *b, *c;
  vec_add<<<…>>>(a, b, c);

SYCL:

  cgh.parallel_for(range, [=](cl::sycl::id<2> idx) {
    c[idx] = a[idx] + b[idx];
  });

● SYCL can target any device supported by its backend (CPU, GPU, APU, accelerator, FPGA, DSP)
● SYCL can target a number of different backends
● Currently the specification is limited to OpenCL
● Some implementations support other non-standard backends

SYCL implementations

[Logos of available SYCL implementations.]

SYCL interface

[Diagram: the SYCL interface sits on top of the SYCL runtime, which contains a kernel loader, a runtime scheduler, a data dependency tracker and the host device, alongside the SYCL device compiler; the runtime calls down into a backend interface (e.g. the OpenCL API).]


● The SYCL interface is a C++ template library that users and library developers program to
● The same interface is used for both the host and device code


● The SYCL runtime is a library that schedules and executes work
● It loads kernels, tracks data dependencies and schedules commands


● The host device is an emulated backend that executes as native C++ code and emulates the SYCL execution and memory model
● The host device can be used without backend drivers and for debugging purposes


● The backend interface is where the SYCL runtime calls down into a backend in order to execute on a particular device
● The standard backend is OpenCL, but some implementations have supported others


● The SYCL device compiler is a C++ compiler which can identify SYCL kernels and compile them down to an IR or ISA
● This can be SPIR, SPIR-V, GCN, PTX or any proprietary vendor ISA

Example SYCL application

We start with an empty main function:

  int main(int argc, char *argv[]) {
  }

The whole SYCL API is included in the CL/sycl.hpp header file:

  #include <CL/sycl.hpp>

  using namespace cl::sycl;

  int main(int argc, char *argv[]) {
  }

Next we create a queue, using a device selector:

  queue gpuQueue{gpu_selector{}};

A queue is used to enqueue work to a device such as a GPU. A device selector is a function object which provides a heuristic for selecting a suitable device.

Then we submit a command group to the queue:

  gpuQueue.submit([&](handler &cgh){

  });

A command group describes a unit of work to be executed by a device. A command group is created by a function object passed to the submit function of the queue.

We initialize three vectors, two inputs and an output:

  std::vector<float> dA{ … }, dB{ … }, dO{ … };

Next we create a buffer for each vector:

  buffer bufA(dA.data(), range<1>(dA.size()));
  buffer bufB(dB.data(), range<1>(dB.size()));
  buffer bufO(dO.data(), range<1>(dO.size()));

Buffers take ownership of data and manage it across the host and any number of devices.

The buffers and the submit call are placed inside a block scope:

  {
    buffer bufA(dA.data(), range<1>(dA.size()));
    buffer bufB(dB.data(), range<1>(dB.size()));
    buffer bufO(dO.data(), range<1>(dO.size()));

    gpuQueue.submit([&](handler &cgh){

    });
  }

Buffers synchronize on destruction via RAII, waiting for any command groups that need to write back to them.
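As an aside (an illustration, not on the original slide), you can also synchronise without destroying the buffer: requesting a host accessor blocks until any outstanding device work on that buffer has completed.

  // Sketch: a host accessor waits for pending writes to bufO before
  // making the data available on the host.
  auto hostOut = bufO.get_access<access::mode::read>();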

Inside the command group we create accessors:

  gpuQueue.submit([&](handler &cgh){
    auto inA = bufA.get_access<access::mode::read>(cgh);
    auto inB = bufB.get_access<access::mode::read>(cgh);
    auto out = bufO.get_access<access::mode::write>(cgh);
  });

Accessors describe the way in which you would like to access a buffer. They are also used to access the data from within a kernel function.

Finally we define the kernel itself:

  class add;
  …
  cgh.parallel_for<add>(range<1>(dA.size()),
    [=](id<1> i){ out[i] = inA[i] + inB[i]; });

Commands such as parallel_for can be used to define kernel functions. The first argument here is a range, specifying the iteration space. The second argument is a function object that represents the entry point for the SYCL kernel. The function object must take an id parameter that describes the current iteration being executed.

Kernel functions defined using lambdas have to have a typename to provide them with a name (the class add declaration above). The reason for this is that C++ does not have a standard ABI for lambdas, so they are represented differently across the host and device compiler.

Putting it all together:

#include <CL/sycl.hpp>

using namespace cl::sycl;

class add;

int main(int argc, char *argv[]) {
  std::vector<float> dA{ … }, dB{ … }, dO{ … };

  queue gpuQueue{gpu_selector{}};
  {
    buffer bufA(dA.data(), range<1>(dA.size()));
    buffer bufB(dB.data(), range<1>(dB.size()));
    buffer bufO(dO.data(), range<1>(dO.size()));

    gpuQueue.submit([&](handler &cgh){
      auto inA = bufA.get_access<access::mode::read>(cgh);
      auto inB = bufB.get_access<access::mode::read>(cgh);
      auto out = bufO.get_access<access::mode::write>(cgh);

      cgh.parallel_for<add>(range<1>(dA.size()),
        [=](id<1> i){ out[i] = inA[i] + inB[i]; });  // executed on the GPU
    });
  }
}

SYCL programming model

A processing element executes a single work-item.

Each work-item can access private memory, a dedicated memory region for each processing element.

A compute unit is composed of a number of processing elements and executes one or more work-groups, each composed of a number of work-items.

Each work-item can access the local memory of its work-group, a dedicated memory region for each compute unit.

A device can execute multiple work-groups.

Each work-item can access global memory, a single memory region available to all processing elements.

Data must be copied or mapped between the host CPU memory and the GPU's global memory. This can be very expensive, depending on the architecture.

Private memory < Local memory < Global memory (in increasing order of access cost).

GPUs execute a large number of work-items.

They are not all guaranteed to execute concurrently; however, most GPUs do execute a number of work-items uniformly, in lock-step.

The number that are executed concurrently varies between different GPUs.

There is no guarantee as to the order in which they execute.

What are GPUs good at?

➢ Highly parallel
  ○ GPUs can run a very large number of processing elements in parallel
➢ Efficient at floating point operations
  ○ GPUs can achieve very high FLOPS (floating-point operations per second)
➢ Large bandwidth
  ○ GPUs are optimised for throughput and can handle a very large bandwidth of data

Optimising GPU programs

There are different levels of optimisation you can apply:

➢ Choosing the right algorithm
  ○ This means choosing an algorithm that is well suited to parallelism
➢ Basic GPU programming principles
  ○ Such as coalescing global memory access or using local memory
➢ Architecture-specific optimisations
  ○ Such as optimising for register usage or avoiding bank conflicts
➢ Micro-optimisations
  ○ Such as floating-point denormal hacks

This talk will focus on the first two.


Choosing the right algorithm

What to parallelise on a GPU

➢ Find hotspots in your code base
  ○ Look for areas of your codebase that are hit often and are well suited to parallelism on the GPU
➢ Follow an adaptive optimisation approach such as APOD
  ○ Analyse -> Parallelise -> Optimise -> Deploy
➢ Avoid over-optimisation
  ○ You may reach a point where optimisations provide diminishing returns

What to look for in an algorithm

➢ Naturally data parallel
  ○ Performing the same operation on multiple items in the computation
➢ Large problem
  ○ Enough work to utilise the GPU's processing elements
➢ Independent progress
  ○ Little or no dependencies between items in the computation
➢ Non-divergent control flow
  ○ Little or no branch or loop divergence

Basic GPU programming principles

Optimising GPU programs means maximising throughput:

● Maximise compute operations
● Reduce time spent on memory

➢ Maximise compute operations per cycle
  ○ Make effective utilisation of the GPU's hardware
➢ Reduce time spent on memory operations
  ○ Reduce the latency of memory access

Avoid divergent control flow

➢ Divergent branches and loops can cause inefficient utilisation
➢ If consecutive work-items execute different branches, they must execute separate instructions
➢ If some work-items execute more iterations of a loop than neighbouring work-items, this leaves them doing nothing

For example:

  a[globalId] = 0;
  if (globalId < 4) {
    a[globalId] = x();
  } else {
    a[globalId] = y();
  }


65 © 2019 Codeplay Software Ltd. … for (int i = 0; i < globalId; i++) { do_something(); }

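One common mitigation (an illustration, not from the slides) is to branch on a value that is uniform across neighbouring work-items, so that a whole lock-step unit takes the same path; groupId here is a hypothetical work-group index.

  // Diverges: consecutive work-items take different branches.
  if (globalId % 2 == 0) { x(); } else { y(); }

  // Uniform within a work-group: every work-item in the group takes the
  // same branch, so the lock-step execution units stay fully utilised.
  if (groupId % 2 == 0) { x(); } else { y(); }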

Coalesced global memory access

➢ Reading and writing from global memory is very expensive
  ○ It often means copying across an off-chip bus
➢ Reading and writing from global memory is done in chunks
  ○ This means accessing data that is physically close together in memory is more efficient

71 © 2019 Codeplay Software Ltd. float data[size];

72 © 2019 Codeplay Software Ltd. float data[size];

... f(a[globalId]);

73 © 2019 Codeplay Software Ltd. float data[size];

... f(a[globalId]);

74 © 2019 Codeplay Software Ltd. 100% global access utilisation float data[size];

... f(a[globalId]);

  float a[size];
  …
  f(a[globalId * 2]);

Each work-item strides over every other element: 50% global access utilisation.

This becomes very important when dealing with multiple dimensions:

  auto id0 = get_global_id(0);
  auto id1 = get_global_id(1);
  auto linearId = (id1 * 4) + id0;
  a[linearId] = f();

It's important to ensure that the order in which work-items are executed aligns with the order in which data elements are accessed. This maintains coalesced global memory access.

Here data elements are accessed in row-major order and work-items are executed in row-major order: global memory access is coalesced.

[Diagram: a 4x4 grid of work-items in row-major order mapping contiguously onto a linear array.]

If the work-items were instead executed in column-major order, global memory access would no longer be coalesced.

However, if you were then to switch the data access pattern to column-major as well:

  auto id0 = get_global_id(0);
  auto id1 = get_global_id(1);
  auto linearId = (id0 * 4) + id1;
  a[linearId] = f();

Global memory access is coalesced again.
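In SYCL the linearisation can also be delegated to the runtime; a sketch (not from the slides), assuming a one-dimensional accessor a and some device function f:

  // Sketch: item::get_linear_id() produces a linear index in the
  // runtime's own iteration order, keeping the order work-items execute
  // aligned with the order data elements are accessed.
  cgh.parallel_for<class coalesced>(range<2>(4, 4), [=](item<2> it){
    a[it.get_linear_id()] = f();
  });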

Make use of local memory

➢ Local memory is much lower latency to access than global memory
  ○ Cache commonly accessed data and temporary results in local memory rather than reading and writing to global memory
➢ Using local memory is not necessarily always more efficient
  ○ If data is not accessed frequently enough to warrant the copy to local memory, you may not see a performance gain

If each work-item needs to access a number of neighbouring elements, and each of these operations loads directly from global memory, this can be very expensive.

A common technique to avoid this is to use local memory to break your data up into tiles. Each tile can then be moved to local memory while a work-group is working on it.

[Diagram: a large 2D grid of values with one tile highlighted and staged in local memory.]

Synchronise work-groups when necessary

➢ Synchronising with a work-group barrier waits for all work-items to reach the same point
➢ Use a work-group barrier if you are copying data to local memory that neighbouring work-items will need to access
➢ Use a work-group barrier if you have temporary results that will be shared with other work-items

A sketch of this tiling-plus-barrier pattern follows below.
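The sketch is illustrative, not from the slides: the kernel name, the tile size, the stencil-style use of the data and the assumption that the global size is a multiple of the tile size are all mine; bufA and bufO are the 1D buffers from the earlier example.

  constexpr size_t tileSize = 64;

  gpuQueue.submit([&](handler &cgh){
    auto in  = bufA.get_access<access::mode::read>(cgh);
    auto out = bufO.get_access<access::mode::write>(cgh);

    // Local memory allocation: one tile per work-group.
    accessor<float, 1, access::mode::read_write, access::target::local>
        tile(range<1>(tileSize), cgh);

    cgh.parallel_for<class stencil>(
      nd_range<1>(range<1>(in.get_count()), range<1>(tileSize)),
      [=](nd_item<1> it){
        auto lid = it.get_local_id(0);
        auto gid = it.get_global_id(0);

        tile[lid] = in[gid];  // one coalesced global load per work-item

        // Wait until the whole work-group has populated the tile.
        it.barrier(access::fence_space::local_space);

        // Neighbouring accesses now hit local memory, not global memory.
        auto left  = (lid > 0) ? tile[lid - 1] : tile[lid];
        auto right = (lid < tileSize - 1) ? tile[lid + 1] : tile[lid];
        out[gid] = left + tile[lid] + right;
      });
  });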

Remember that work-items are not all guaranteed to execute concurrently.

A work-item can share results with other work-items via local and global memory.

This means it's possible for a work-item to read a result that hasn't been written yet: a data race.

This problem can be solved by a synchronisation primitive called a work-group barrier.

Work-items will block until all work-items in the work-group have reached that point.

So now you can be sure that all of the results you want to read have been written.

However, this does not apply across work-group boundaries; between work-groups you have a data race again.

Choosing a good work-group size

➢ The occupancy of a kernel can be limited by a number of factors of the GPU
  ○ Total number of processing elements
  ○ Total number of compute units
  ○ Total registers available to the kernel
  ○ Total local memory available to the kernel
➢ You can query the preferred work-group size once the kernel is compiled (see the sketch below)
  ○ However, this is not guaranteed to give you the best performance
➢ It's good practice to benchmark various work-group sizes and choose the best
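A sketch of that query under the SYCL 1.2.1 interface (illustrative; add is the kernel name from the earlier example):

  program prog(gpuQueue.get_context());
  prog.build_with_kernel_type<add>();
  kernel k = prog.get_kernel<add>();

  // The preferred work-group size multiple for this kernel on this
  // device: a good starting point, but still worth benchmarking against
  // other sizes.
  auto preferredMultiple = k.get_work_group_info<
      info::kernel_work_group::preferred_work_group_size_multiple>(
      gpuQueue.get_device());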

Conclusions

Takeaways

➢ Identify which parts of your code to offload and which algorithms to use
  ○ Look for hotspots in your code that are bottlenecks
  ○ Identify opportunities for parallelism
➢ Optimising GPU programs means maximising throughput
  ○ Maximise compute operations
  ○ Minimise time spent on memory operations
➢ Use profilers to analyse your GPU programs and consult optimisation guides

Further tips

➢ Use profiling tools to gather more accurate information about your programs
  ○ SYCL provides kernel profiling
  ○ Most OpenCL implementations provide proprietary profiler tools
➢ Follow vendor optimisation guides
  ○ Most OpenCL vendors provide optimisation guides that detail recommendations on how to optimise programs for their respective GPUs
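As an illustration of SYCL's kernel profiling (a sketch under the SYCL 1.2.1 interface, reusing the command group from the earlier example):

  // Enable profiling when constructing the queue.
  property_list props{property::queue::enable_profiling{}};
  queue gpuQueue{gpu_selector{}, props};

  auto ev = gpuQueue.submit([&](handler &cgh){
    /* … command group as before … */
  });
  ev.wait();

  // Timestamps are reported in nanoseconds.
  auto start = ev.get_profiling_info<info::event_profiling::command_start>();
  auto end   = ev.get_profiling_info<info::event_profiling::command_end>();
  auto kernelTimeNs = end - start;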

SYCL for Nvidia GPUs

SYCL on non-OpenCL backends?

● SYCL 1.2/1.2.1 was designed for OpenCL 1.2
● Some implementations are supporting non-OpenCL backends (ROCm, OpenMP)
● So what other backends could SYCL be a high-level model for?

What about CUDA?

● Support for Nvidia GPUs is probably one of the most requested features from SYCL application developers
● There is an existing OpenCL + PTX path for Nvidia GPUs in ComputeCpp (still experimental)
● Native CUDA support is better for expanding the SYCL ecosystem

DPC++ is an open-source SYCL implementation. It has various extensions to the SYCL 1.2.1 API, and it provides a plugin interface (PI) to extend it to other backends; the PI/CUDA plugin targets Nvidia GPUs through the CUDA Driver API.

[Diagram: DPC++ -> PI/CUDA plugin -> CUDA Driver API -> Nvidia GPU.]

Preliminary performance results

[Chart: BabelStream FP32 bandwidth (MB/s, 0–200000) on the Copy, Mul, Add, Triad and Dot kernels, comparing SYCL for CUDA, native CUDA and OpenCL for CUDA (the latter from an experimental branch). Platform: CUDA 10.1 on a GeForce GTX 980. http://uob-hpc.github.io/BabelStream]


How to use it?

● First build or download a binary package of DPC++
● Nvidia support is now available in DPC++
● There are daily and more stable monthly releases
● Release packages: https://github.com/intel/llvm/releases
● Detailed instructions: https://github.com/intel/llvm/blob/sycl/sycl/doc/GetStartedGuide.md

● Then compile your SYCL application with the DPC++ compiler using the CUDA triple:

  clang++ -fsycl -fsycl-targets=nvptx64-nvidia-cuda-sycldevice sample.cpp -o sample

● Then enable the CUDA backend in the SYCL runtime by setting an environment variable:

  SYCL_BE=PI_CUDA ./sample

● And that's it…
● Make sure to use a device selector in your application that will choose an Nvidia device
● Using both the OpenCL backend and the CUDA backend at the same time is currently not supported
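A sketch of such a selector (illustrative, not from the slides; matching on the platform name is an assumption, so check what your DPC++ build actually reports):

  class cuda_selector : public cl::sycl::device_selector {
   public:
    int operator()(const cl::sycl::device &dev) const override {
      using namespace cl::sycl;
      auto platformName =
          dev.get_platform().get_info<info::platform::name>();
      // Prefer GPUs whose platform identifies as CUDA; reject everything
      // else so the OpenCL backend is never chosen by accident.
      if (dev.is_gpu() && platformName.find("CUDA") != std::string::npos)
        return 100;
      return -1;
    }
  };

  // Usage: queue q{cuda_selector{}};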

SYCL 2020 preview

The following features are indicative only and still subject to change:

● Backend generalization
● Modules
● Specialization constants
● Unified shared memory
● In-order queues
● Sub-groups
● Group algorithms
● Host tasks
● Improved address space inference

Unified Shared Memory

[Diagram: the host CPU and its memory alongside the GPU's global memory, compute units, local memory and private memory, as before.]

Unified shared memory (USM) allows the host CPU and the GPU to access a shared address space. This means a pointer allocated on the host CPU can be dereferenced on the GPU.

USM capability levels:

                                   Explicit USM  Restricted USM  Concurrent USM  System USM
                                   (minimum)     (optional)      (optional)      (optional)
  Consistent pointers                  ✓              ✓               ✓              ✓
  Pointer-based structures             ✓              ✓               ✓              ✓
  Explicit data movement               ✓              ✓               ✓              ✓
  Shared access                        ✗              ✓               ✓              ✓
  Concurrent access                    ✗              ✗               ✓              ✓
  System allocations (malloc/new)      ✗              ✗               ✗              ✓
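As an illustration of the shared-access levels (a sketch; malloc_shared is the SYCL 2020 USM allocation whose pointers are dereferenceable on both host and device, where the device supports it):

  queue gpuQueue{gpu_selector_v};

  auto data = malloc_shared<float>(1024, gpuQueue);

  for (size_t i = 0; i < 1024; ++i)  // initialise directly on the host
    data[i] = static_cast<float>(i);

  gpuQueue.parallel_for(range(1024), [=](id<1> i){
    data[i] *= 2.0f;                 // same pointer, used on the device
  }).wait();

  free(data, gpuQueue);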

If we take our example from earlier:

#include <sycl/sycl.hpp>

using namespace sycl;

int main(int argc, char *argv[]) {
  std::vector<float> dA{ … }, dB{ … }, dO{ … };

  queue gpuQueue{gpu_selector_v};
  auto context = gpuQueue.get_context();
}

With the USM explicit data movement model we can allocate memory on the device by calling malloc_device:

  auto inA = malloc_device<float>(dA.size(), gpuQueue);
  auto inB = malloc_device<float>(dA.size(), gpuQueue);
  auto out = malloc_device<float>(dA.size(), gpuQueue);

These pointers are consistent across host and device, but only dereferenceable on the device.

Now, using the queue, we can copy from the input std::vector objects initialized on the host to the device memory allocations by calling memcpy:

  auto bytes = dA.size() * sizeof(float);

  gpuQueue.memcpy(inA, dA.data(), bytes).wait();
  gpuQueue.memcpy(inB, dB.data(), bytes).wait();

Since these are asynchronous operations they return events, which can be used to synchronise with the completion of the copies. In this case we just wait immediately by calling wait.

We can invoke a SYCL kernel function in the same way as before using command groups. However, here we are using one of the new shortcut member functions of the queue:

  gpuQueue.parallel_for(range(dA.size()), [=](id<1> i){
    out[i] = inA[i] + inB[i];
  }).wait();

Again this operation is asynchronous, so we wait on the returned event.

Finally we can copy from the device memory allocation to the output std::vector, again by calling memcpy. Just as we did for the copies to the device, we call wait on the returned event:

  gpuQueue.memcpy(dO.data(), out, bytes).wait();

Once we are finished with the device memory allocations we can free them (there is also a usm_allocator available):

  free(inA, context);
  free(inB, context);
  free(out, context);

The complete USM version:

#include <sycl/sycl.hpp>

using namespace sycl;

int main(int argc, char *argv[]) {
  std::vector<float> dA{ … }, dB{ … }, dO{ … };

  queue gpuQueue{gpu_selector_v};
  auto context = gpuQueue.get_context();

  auto inA = malloc_device<float>(dA.size(), gpuQueue);
  auto inB = malloc_device<float>(dA.size(), gpuQueue);
  auto out = malloc_device<float>(dA.size(), gpuQueue);

  auto bytes = dA.size() * sizeof(float);

  gpuQueue.memcpy(inA, dA.data(), bytes).wait();
  gpuQueue.memcpy(inB, dB.data(), bytes).wait();

  gpuQueue.parallel_for(range(dA.size()), [=](id<1> i){
    out[i] = inA[i] + inB[i];
  }).wait();

  gpuQueue.memcpy(dO.data(), out, bytes).wait();

  free(inA, context);
  free(inB, context);
  free(out, context);
}
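A sketch of usm_allocator (illustrative): it lets standard containers place their storage in USM, so the data can be used in kernels without explicit copies:

  usm_allocator<float, usm::alloc::shared> alloc{gpuQueue};
  std::vector<float, decltype(alloc)> vec(dA.size(), alloc);

  auto ptr = vec.data();  // capture the raw pointer, not the vector
  gpuQueue.parallel_for(range(vec.size()), [=](id<1> i){
    ptr[i] = ptr[i] * 2.0f;
  }).wait();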

Getting started with SYCL

SYCL specification: khronos.org/registry/SYCL

SYCL news: sycl.tech

SYCL Academy: github.com/codeplaysoftware/syclacademy

ComputeCpp: computecpp.com

DPC++: github.com/intel/llvm/releases

hipSYCL: github.com/illuhad/hipSYCL

Thank you

@codeplaysoft /codeplaysoft codeplay.com