
GPU programming in C++ with SYCL
Gordon Brown, Principal Software Engineer, SYCL & C++
C++ Europe 2020 – June 2020

Agenda
● Why use the GPU?
● Brief introduction to SYCL
● SYCL programming model
● Optimising GPU programs
● SYCL for Nvidia GPUs
● SYCL 2020 preview

Why use the GPU?

“The end of Moore’s Law”
“The free lunch is over”
“The future is parallel”

Take a typical Intel chip, an Intel Core i7 7th Gen:
○ 4x CPU cores
  ■ Each with hyperthreading
  ■ Each with support for 256-bit AVX2 instructions
○ Intel Gen 9 GPU
  ■ With 1280 processing elements

Regular sequential C++ code (non-vectorised) running on a single thread only takes advantage of a very small amount of the available resources of the chip. Vectorisation allows you to fully utilise a single CPU core. Multi-threading allows you to fully utilise all CPU cores. Heterogeneous dispatch allows you to fully utilise the entire chip.

GPGPU programming was once a niche technology:
● Limited to specific domains
● Separate-source solutions
● Verbose low-level APIs
● Very steep learning curve

This is not the case anymore:
● Almost everything has a GPU now
● Single-source solutions
● Modern C++ programming models: C++AMP, SYCL, CUDA, Agency, Kokkos, HPX, Raja
● More accessible to the average C++ developer

Brief introduction to SYCL

SYCL is a single-source, high-level, standard C++ programming model that can target a range of heterogeneous platforms.

Single-source:
• SYCL allows you to write both host CPU code and device code in the same C++ source file.
• This requires two compilation passes: one for the host code and one for the device code.
• The host compiler builds the application, the SYCL template library and the SYCL runtime into a CPU executable with the device binary embedded; the device compiler lowers kernels to a device IR or ISA (e.g. SPIR) for the backend (e.g. OpenCL).

High-level:
• SYCL provides high-level abstractions over common boilerplate code:
  • Platform/device selection (a minimal sketch appears after the comparison below)
  • Buffer creation
  • Kernel compilation
  • Dependency management and scheduling
• The difference shows when comparing a typical SYCL hello world application with a typical OpenCL hello world application: the SYCL version is much shorter.

Standard C++:
• SYCL allows you to write standard C++: no language extensions, no pragmas, no attributes.
• Compare the same vector addition written in several models.

C++AMP:

    array_view<float> a, b, c;
    extent<2> e(64, 64);
    parallel_for_each(e, [=](index<2> idx) restrict(amp) {
      c[idx] = a[idx] + b[idx];
    });

Pragma-based (OpenMP):

    std::vector<float> a, b, c;
    #pragma omp parallel for
    for (int i = 0; i < a.size(); i++) {
      c[i] = a[i] + b[i];
    }

CUDA:

    __global__ void vec_add(float *a, float *b, float *c) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      c[i] = a[i] + b[i];
    }

    float *a, *b, *c;
    vec_add<<<range>>>(a, b, c);

SYCL:

    cgh.parallel_for<class vec_add>(range, [=](cl::sycl::id<2> idx) {
      c[idx] = a[idx] + b[idx];
    });
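The platform/device selection boilerplate mentioned above reduces to constructing a queue with a device selector. A minimal sketch, assuming a GPU and its driver are available (the predefined gpu_selector throws if no GPU can be found):

    #include <CL/sycl.hpp>
    #include <iostream>

    int main() {
      // Construct a queue on a GPU chosen by the predefined gpu_selector.
      cl::sycl::queue q{cl::sycl::gpu_selector{}};

      // Query and print the name of the device that was selected.
      std::cout << "Selected device: "
                << q.get_device().get_info<cl::sycl::info::device::name>()
                << std::endl;
    }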
Target platforms:
• SYCL can target any device supported by its backend.
• SYCL can target a number of different backends; currently the specification is limited to OpenCL, though some implementations support other, non-standard backends.
• Target devices include CPUs, GPUs, APUs, accelerators, FPGAs and DSPs.

SYCL implementations
(diagram of the available SYCL implementations)

Anatomy of a SYCL implementation, from top to bottom: the SYCL interface; the SYCL runtime, containing a kernel loader, a runtime scheduler, a data dependency tracker and the host device; the SYCL device compiler; and the backend interface (e.g. the OpenCL API).

• The SYCL interface is a C++ template library that users and library developers program to. The same interface is used for both the host and device code.
• The SYCL runtime is a library that schedules and executes work. It loads kernels, tracks data dependencies and schedules commands.
• The host device is an emulated backend that executes as native C++ code and emulates the SYCL execution and memory model. It can be used without backend drivers and for debugging purposes.
• The backend interface is where the SYCL runtime calls down into a backend in order to execute on a particular device. The standard backend is OpenCL, but some implementations support others.
• The SYCL device compiler is a C++ compiler which can identify SYCL kernels and compile them down to an IR or ISA. This can be SPIR, SPIR-V, GCN, PTX or any proprietary vendor ISA.

Example SYCL application

The example is built up incrementally, starting from an empty main:

    int main(int argc, char *argv[]) {
    }

The whole SYCL API is included in the CL/sycl.hpp header file:

    #include <CL/sycl.hpp>
    using namespace cl::sycl;

A queue is used to enqueue work to a device such as a GPU. A device selector is a function object which provides a heuristic for selecting a suitable device (a custom selector sketch follows the command-group step below):

    queue gpuQueue{gpu_selector{}};

A command group describes a unit of work to be executed by a device. A command group is created by a function object passed to the submit function of the queue:

    gpuQueue.submit([&](handler &cgh){
    });
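The gpu_selector used above is one of the predefined selectors; a custom heuristic can be supplied by deriving from device_selector and scoring each device. A sketch only: the class name intel_gpu_selector and the score values are illustrative, not part of the API.

    #include <CL/sycl.hpp>
    #include <string>
    using namespace cl::sycl;

    // Each candidate device is passed to operator(); the highest score wins
    // and a negative score rejects the device.
    class intel_gpu_selector : public device_selector {
     public:
      int operator()(const device &dev) const override {
        if (dev.is_gpu() &&
            dev.get_info<info::device::vendor>().find("Intel") != std::string::npos)
          return 100;  // prefer Intel GPUs (illustrative heuristic)
        return -1;     // reject everything else
      }
    };

    // Usage: queue q{intel_gpu_selector{}};
    // If no device gets a non-negative score, queue construction fails.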
Continuing the example, we initialize three vectors, two inputs and an output:

    std::vector<float> dA{ … }, dB{ … }, dO{ … };

Buffers take ownership of data and manage it across the host and any number of devices:

    buffer<float, 1> bufA(dA.data(), range<1>(dA.size()));
    buffer<float, 1> bufB(dB.data(), range<1>(dB.size()));
    buffer<float, 1> bufO(dO.data(), range<1>(dO.size()));

Buffers synchronize on destruction via RAII, waiting for any command groups that need to write back to them, so the buffers and the submit call are enclosed in their own scope.

Accessors describe the way in which you would like to access a buffer; they are also used to access the data from within a kernel function:

    gpuQueue.submit([&](handler &cgh){
      auto inA = bufA.get_access<access::mode::read>(cgh);
      auto inB = bufB.get_access<access::mode::read>(cgh);
      auto out = bufO.get_access<access::mode::write>(cgh);
    });

Commands such as parallel_for can be used to define kernel functions; the kernel is named by a forward-declared type (class add;). The first argument is a range specifying the iteration space, and the second argument is a function object that represents the kernel function.
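Putting the walkthrough together, a complete vector addition along these lines might look as follows. This is a sketch: the element count and initial values are illustrative rather than taken from the slides.

    #include <CL/sycl.hpp>
    #include <vector>
    using namespace cl::sycl;

    class add;  // forward-declared type used to name the kernel

    int main() {
      // Illustrative size and values; the slides elide the actual initialisers.
      std::vector<float> dA(1024, 1.0f), dB(1024, 2.0f), dO(1024, 0.0f);

      queue gpuQueue{gpu_selector{}};

      {
        // Buffers take ownership of the vectors' data for the duration of this scope.
        buffer<float, 1> bufA(dA.data(), range<1>(dA.size()));
        buffer<float, 1> bufB(dB.data(), range<1>(dB.size()));
        buffer<float, 1> bufO(dO.data(), range<1>(dO.size()));

        gpuQueue.submit([&](handler &cgh) {
          // Accessors declare how each buffer is used and provide access
          // to the data inside the kernel.
          auto inA = bufA.get_access<access::mode::read>(cgh);
          auto inB = bufB.get_access<access::mode::read>(cgh);
          auto out = bufO.get_access<access::mode::write>(cgh);

          // One work-item per element; the lambda is the kernel function.
          cgh.parallel_for<add>(range<1>(dA.size()), [=](id<1> i) {
            out[i] = inA[i] + inB[i];
          });
        });
      }  // Buffer destructors wait for the kernel and copy the result back into dO.

      return 0;
    }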