Modern C++ Heterogeneous Programming with SYCL

Michael Wong
Distinguished Engineer
SYCL WG Chair; ISO C++ Directions Group; Chair of C++ SGs for Machine Learning, Low Latency, Games, Embedded, Finance

Distinguished Engineer Michael Wong
● Chair of SYCL Heterogeneous Programming Language
● C++ Directions Group
● Past CEO OpenMP
● ISOCPP.org Director, VP http://isocpp.org/wiki/faq/wg21#michael-wong
● [email protected] | [email protected]
● Head of Delegation for C++ Standard for Canada
● Chair of Programming Languages for Standards Council of Canada
● Chair of WG21 SG19 Machine Learning
● Chair of WG21 SG14 Games Dev/Low Latency/Financial Trading/Embedded
● Editor: C++ SG5 Transactional Memory Technical Specification
● Editor: C++ SG1 Concurrency Technical Specification
● MISRA C++ and AUTOSAR
● Chair of Standards Council Canada TC22/SC32 Electrical and electronic components (SOTIF)
● Chair of UL4600 Object Tracking
● http://wongmichael.com/about
● C++11 book in Chinese: https://www.amazon.cn/dp/B00ETOV2OQ

Codeplay:
● Ported TensorFlow to open standards using SYCL
● Build LLVM-based compilers for accelerators
● Implement OpenCL and SYCL for accelerator processors
● Releasing open-source, open-standards based AI acceleration tools: SYCL-BLAS, SYCL-ML, VisionCpp
● We build GPU compilers for semiconductor companies
● Now working to make AI/ML heterogeneous acceleration safe for autonomous vehicles

Acknowledgement and Disclaimer

Numerous people, internal and external to the original C++/OpenMP work, in industry and academia, have made contributions, influenced ideas, written parts of this presentation, and offered feedback that forms part of this talk. These include Bjarne Stroustrup, Joe Hummel, Botond Ballo, Simon McIntosh-Smith, as well as many others.

But I claim all credit for errors, and stupid mistakes. These are mine, all mine! You can’t have them.

Legal Disclaimer

THIS WORK REPRESENTS THE VIEW OF THE AUTHOR AND DOES NOT NECESSARILY REPRESENT THE VIEW OF CODEPLAY. OTHER COMPANY, PRODUCT, AND SERVICE NAMES MAY BE TRADEMARKS OR SERVICE MARKS OF OTHERS.

Disclaimers

NVIDIA, the NVIDIA logo and CUDA are trademarks and/or registered trademarks of NVIDIA Corporation in the U.S. and/or other countries.

Codeplay is not associated with NVIDIA for this work; it purely uses public documentation and widely available code.

SYCL 2020 is here!
Open Standard for Single Source C++ Parallel Heterogeneous Programming
● SYCL 2020 was released after 3 years of intense work
● Significant adoption in Embedded, Desktop and HPC markets
● Improved programmability, smaller code size, faster performance
● Based on C++17, backwards compatible with SYCL 1.2.1
● Simplifies porting of standard C++ applications to SYCL
● Closer alignment and integration with ISO C++
● Multiple backend acceleration, API independent

SYCL 2020 increases expressiveness and simplicity for modern C++ heterogeneous programming

SYCL 2020 Industry Momentum

SYCL support is growing from embedded systems through desktops to supercomputers:
● https://www.alcf.anl.gov/support-center/aurora/sycl-and-dpc-aurora
● https://www.embeddedcomputing.com/technology/open-source/risc-v-open-source-ip/nsitexe-kyoto-microcomputer-and-codeplay-software-are-bringing-open-standards-programming-to-risc-v-vector--for-hpc-and-ai-systems
● https://www.nextplatform.com/2021/02/03/can-sycl-slice-into-broader-supercomputing/
● https://www.phoronix.com/scan.php?page=news_item&px=hipSYCL-New-Lite-Runtime
● https://software.intel.com/content/www/us/en/develop/articles/interoperability-dpcpp-sycl-.html
● https://www.renesas.com/br/en/about/press-room/renesas-electronics-and-codeplay-collaborate-opencl-and-sycl-adas-solutions
● https://www.nersc.gov/news-publications/nersc-news/nersc-center-news/2021/nersc-alcf-codeplay-partner-on-sycl-for-next-generation-supercomputers/
● https://research-portal.uws.ac.uk/en/publications/trisycl-for-xilinx-fpga
● https://www.imaginationtech.com/news/press-release/tensorflow-gets-native-support-for--gpus-via-optimised-open-source-sycl-libraries/

Agenda

Challenges of an Accelerator Programming Model

SYCL 2020

SYCL in HPC, Embedded, Safety, and Autonomous Vehicles

Understanding the Challenges of the Heterogeneous Era

So what are the biggest challenges for heterogeneous computing?

➢ Single Source vs Multiple Source
➢ Heterogeneous offloading
➢ Expressing parallelism
➢ Four Horsemen of Heterogeneous Computing: data locality, movement, layout, affinity
➢ SPMD programming model

Heterogeneous Offloading

How do we offload code to a heterogeneous device?

➢ This can be answered by looking at the C++ compilation model

How can we compile source code for sub-architectures?

➢ Separate source

➢ Single source

Separate Source Compilation Model

(Diagram: C++ source file → CPU compiler → CPU object file → Linker → x86 ISA → CPU)

(Diagram: device source → online compiler → GPU)

Here we're using OpenCL as an example.

Host code:
float *a, *b, *c;
…
kernel k = clCreateKernel(…, "my_kernel", …);
clEnqueueWriteBuffer(…, size, a, …);
clEnqueueWriteBuffer(…, size, b, …);
clEnqueueNDRange(…, k, 1, {size, 1, 1}, …);
clEnqueueReadBuffer(…, size, c, …);

Device code (OpenCL C):
__kernel void my_kernel(__global float *a, __global float *b, __global float *c) {
  int id = get_global_id(0);
  c[id] = a[id] + b[id];
}

Std C++ compilation model

(Diagram: C++ source file → CPU compiler → CPU object → Linker → CPU ISA → CPU)

But what about the GPU?

SYCL single source compilation model

(Diagram: the same flow, C++ source file → CPU compiler → CPU object → Linker → CPU ISA → CPU, with a GPU that still needs to be targeted)

auto aAcc = aBuf.get_access(cgh);
auto bAcc = bBuf.get_access(cgh);
auto oAcc = oBuf.get_access(cgh);
cgh.parallel_for(range<1>(a.size()), [=](id<1> idx) {
  oAcc[idx] = aAcc[idx] + bAcc[idx];
});

(Diagram, built up across several slides: the single C++ source file contains both host and device code. Host path: C++ source file → CPU compiler → CPU object. Device path: C++ device source → SYCL compiler → SPIR (SYCL doesn't mandate SPIR). The Linker produces a CPU ISA binary with the device code embedded, e.g. as SPIR, targeting both CPU and GPU. A fuller example follows the list of benefits below.)

Benefits of Single Source

•Device code is written in C++ in the same source file as the host CPU code

•Allows compile-time evaluation of device code

•Supports type safety across host CPU and device

•Supports generic programming

•Removes the need to distribute source code
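To make the fragment above concrete, here is a minimal single-source SYCL vector add sketch, assuming a SYCL 2020 implementation and the <sycl/sycl.hpp> header; the sizes and names are illustrative rather than taken from the slides.

#include <sycl/sycl.hpp>
#include <vector>

int main() {
  constexpr std::size_t N = 1024;
  std::vector<float> a(N, 1.0f), b(N, 2.0f), o(N);

  sycl::queue q; // default device selection

  {
    // Buffers manage host/device data movement implicitly.
    sycl::buffer<float, 1> aBuf(a.data(), sycl::range<1>(N));
    sycl::buffer<float, 1> bBuf(b.data(), sycl::range<1>(N));
    sycl::buffer<float, 1> oBuf(o.data(), sycl::range<1>(N));

    q.submit([&](sycl::handler &cgh) {
      sycl::accessor aAcc(aBuf, cgh, sycl::read_only);
      sycl::accessor bAcc(bBuf, cgh, sycl::read_only);
      sycl::accessor oAcc(oBuf, cgh, sycl::write_only, sycl::no_init);
      // Device code, written inline with the host code in the same file.
      cgh.parallel_for(sycl::range<1>(N), [=](sycl::id<1> idx) {
        oAcc[idx] = aAcc[idx] + bAcc[idx];
      });
    });
  } // Buffer destructors wait for the kernel and copy results back into o.

  return o[0] == 3.0f ? 0 : 1;
}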

Describing Parallelism

How do you represent the different forms of parallelism?

➢ Directive vs explicit parallelism

➢ Task vs data parallelism

➢ Queue vs stream execution

Directive vs Explicit Parallelism

Directive parallelism:
• Examples: OpenMP, OpenACC
• Implementation: the compiler transforms code to be parallel based on pragmas

Explicit parallelism:
• Examples: SYCL, CUDA, TBB, Fibers, C++11 Threads
• Implementation: an API is used to explicitly enqueue one or more threads

Here we're using OpenMP as an example (directive):

vector<float> a, b, c;
#pragma omp parallel for
for (int i = 0; i < a.size(); i++) {
  c[i] = a[i] + b[i];
}

Here we're using C++ AMP as an example (explicit):

array_view<float, 2> a, b, c;
extent<2> e(64, 64);
parallel_for_each(e, [=](index<2> idx) restrict(amp) {
  c[idx] = a[idx] + b[idx];
});
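The explicit column above also lists C++11 threads; as a point of comparison, here is a small hedged sketch of the same vector add using std::thread, where the program itself creates and joins the threads (the chunking scheme is illustrative only).

#include <thread>
#include <vector>

// Explicit parallelism: the program itself creates and joins threads.
void vec_add(const std::vector<float> &a, const std::vector<float> &b,
             std::vector<float> &c, unsigned numThreads = 4) {
  std::vector<std::thread> workers;
  const std::size_t chunk = a.size() / numThreads;
  for (unsigned t = 0; t < numThreads; ++t) {
    std::size_t begin = t * chunk;
    std::size_t end = (t == numThreads - 1) ? a.size() : begin + chunk;
    workers.emplace_back([&, begin, end] {
      for (std::size_t i = begin; i < end; ++i)
        c[i] = a[i] + b[i];
    });
  }
  for (auto &w : workers) w.join(); // explicit synchronization
}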

Task vs Data Parallelism

Task parallelism:
• Examples: OpenMP, C++11 Threads, TBB
• Implementation: multiple (potentially different) tasks are performed in parallel

Data parallelism:
• Examples: C++ AMP, CUDA, SYCL, C++17 ParallelSTL
• Implementation: the same task is performed across a large data set

Here we're using TBB as an example (task parallelism):

vector<task> tasks = { … };
tbb::parallel_for_each(tasks.begin(), tasks.end(), [=](task &v) {
  v();
});

Here we're using CUDA as an example (data parallelism):

float *a, *b, *c;
cudaMalloc((void **)&a, size);
cudaMalloc((void **)&b, size);
cudaMalloc((void **)&c, size);
vec_add<<<64, 64>>>(a, b, c);

Queue vs Stream Execution

Queue execution:
• Examples: C++ AMP, CUDA, SYCL, C++17 ParallelSTL
• Implementation: functions are placed in a queue and executed once per enqueue

Stream execution:
• Examples: BOINC, BrookGPU
• Implementation: a function is executed in a continuous loop on a stream of data

Here we're using CUDA as an example (queue execution):

float *a, *b, *c;
cudaMalloc((void **)&a, size);
cudaMalloc((void **)&b, size);
cudaMalloc((void **)&c, size);
vec_add<<<64, 64>>>(a, b, c);

Here we're using BrookGPU as an example (stream execution):

reduce void sum(float a<>, reduce float r<>) {
  r += a;
}
float a<100>;
float r;
sum(a, r);


What are the four Horsemen of Heterogeneous Computing?

It's all about the data: locality, movement, layout, affinity

The Four Horsemen

(Diagram: a two-socket system, Socket 0 and Socket 1, each with Core 0 and Core 1; each core has hardware threads 0-3, numbered 0-7 across the system.)

One of the biggest limiting factors in heterogeneous computing

➢ Cost of data movement in time and power consumption

Cost of Data Movement
• It can take considerable time to move data to a device
  • This varies greatly depending on the architecture
• The bandwidth of a device can impose bottlenecks
  • This reduces the amount of throughput you have on the device
• Performance gain from computation > cost of moving data
  • If the gain is less than the cost of moving the data it's not worth doing
• Many devices have a hierarchy of memory regions
  • Global, read-only, group, private
  • Each region has different size, affinity and access latency
  • Having the data as close to the computation as possible reduces the cost
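To illustrate keeping data close to the computation, here is a hedged SYCL sketch that stages a tile of global data in work-group local memory before operating on it; the tile size, the scaling operation, and the assumption that the buffer size divides evenly by the tile size are all illustrative, not from the slides.

#include <sycl/sycl.hpp>

// Hedged sketch: copy a tile into local (group) memory, operate on it there,
// then write results back to global memory.
void tile_scale(sycl::queue &q, sycl::buffer<float, 1> &data, float factor) {
  constexpr std::size_t tile = 64; // illustrative; assumes size % tile == 0
  q.submit([&](sycl::handler &cgh) {
    sycl::accessor global(data, cgh, sycl::read_write);
    // Local accessor: one tile per work-group, close to the processing elements.
    sycl::local_accessor<float, 1> local(sycl::range<1>(tile), cgh);
    cgh.parallel_for(
        sycl::nd_range<1>(data.get_range(), sycl::range<1>(tile)),
        [=](sycl::nd_item<1> it) {
          std::size_t gid = it.get_global_id(0);
          std::size_t lid = it.get_local_id(0);
          local[lid] = global[gid];                 // global -> local
          sycl::group_barrier(it.get_group());      // make the tile visible
          global[gid] = local[lid] * factor;        // compute from local memory
        });
  });
}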

Cost of Data Movement

• 64bit DP Op: 20pJ
• 4x64bit register read: 50pJ
• 4x64bit move 1mm: 26pJ
• 4x64bit move 40mm: 1nJ
• 4x64bit move DRAM: 16nJ

Credit: Bill Dally, Nvidia, 2010

How do you move data from the host CPU to a device?

➢ Implicit vs explicit data movement

Implicit vs Explicit Data Movement

Implicit data movement:
• Examples: SYCL, C++ AMP
• Implementation: data is moved to the device implicitly via cross host CPU / device data structures

Explicit data movement:
• Examples: OpenCL, CUDA, OpenMP, SYCL
• Implementation: data is moved to the device via explicit copy

Here we're using C++ AMP as an example (implicit):

array_view<float, 2> ptr;
extent<2> e(64, 64);
parallel_for_each(e, [=](index<2> idx) restrict(amp) {
  ptr[idx] *= 2.0f;
});

Here we're using CUDA as an example (explicit):

float *h_a = { … }, *d_a;
cudaMalloc((void **)&d_a, size);
cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
vec_add<<<64, 64>>>(a, b, c);
cudaMemcpy(h_a, d_a, size, cudaMemcpyDeviceToHost);
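Since SYCL appears in both columns, here is a hedged sketch showing both styles in SYCL 2020: implicit movement through buffers and accessors, and explicit movement through USM device allocations with queue::memcpy. The function names and sizes are illustrative.

#include <sycl/sycl.hpp>
#include <vector>

void implicit_move(sycl::queue &q, std::vector<float> &v) {
  // Implicit: the buffer/accessor model moves data for us.
  sycl::buffer<float, 1> buf(v.data(), sycl::range<1>(v.size()));
  q.submit([&](sycl::handler &cgh) {
    sycl::accessor acc(buf, cgh, sycl::read_write);
    cgh.parallel_for(sycl::range<1>(v.size()),
                     [=](sycl::id<1> i) { acc[i] *= 2.0f; });
  });
} // buffer destructor copies results back into v

void explicit_move(sycl::queue &q, std::vector<float> &v) {
  // Explicit: SYCL 2020 USM device allocation plus explicit copies.
  float *d = sycl::malloc_device<float>(v.size(), q);
  q.memcpy(d, v.data(), v.size() * sizeof(float)).wait();
  q.parallel_for(sycl::range<1>(v.size()),
                 [=](sycl::id<1> i) { d[i] *= 2.0f; }).wait();
  q.memcpy(v.data(), d, v.size() * sizeof(float)).wait();
  sycl::free(d, q);
}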

Row-major vs column-major

int x = globalId[0];
int y = globalId[1];
int stride = 4;
out[(x * stride) + y] = in[(y * stride) + x];
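The snippet above reads one array in row-major order and writes the other in column-major order. Below is a hedged sketch of the same transpose as a complete SYCL kernel over USM pointers (an assumption; the slide does not say how in and out are allocated). Work-items whose x differs by one read consecutive addresses from in but write addresses stride apart into out.

#include <sycl/sycl.hpp>

// Hedged sketch: 'in' and 'out' are assumed to be device-accessible USM pointers.
void transpose(sycl::queue &q, const float *in, float *out, int stride) {
  q.parallel_for(sycl::range<2>(stride, stride), [=](sycl::id<2> id) {
    int x = id[0];
    int y = id[1];
    // Consecutive x: contiguous read from in, strided write into out.
    out[(x * stride) + y] = in[(y * stride) + x];
  }).wait();
}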

AoS vs SoA

(Diagram: memory layout f f f f f f f f i i i i i i i i; 2x load operations)

struct str {
  float f[N];
  int i[N];
};
str s;

… = s.f[globalId];

… = s.i[globalId];
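The struct above is the SoA (structure of arrays) form; for contrast, here is a hedged sketch of the equivalent AoS (array of structures) layout, with comments on how each is indexed from a kernel. N and the member types are illustrative.

constexpr int N = 1024;

// SoA: all floats contiguous, then all ints contiguous.
// Work-items reading s.f[globalId] touch adjacent addresses.
struct soa {
  float f[N];
  int   i[N];
};

// AoS: each element interleaves a float and an int.
// Work-items reading a[globalId].f touch addresses sizeof(elem) apart (strided).
struct elem { float f; int i; };
using aos = elem[N];

// Inside a kernel (sketch):
//   float x = soaData.f[globalId];   // unit-stride across work-items
//   float y = aosData[globalId].f;   // strided across work-items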

(Diagram: the two-socket topology again, Socket 0 and Socket 1, each with Core 0 and Core 1, hardware threads 0-3 per core, numbered 0-7 across the system.)

Affinity-aware executor example (from a proposed C++ executor affinity extension, pre-standard):

{
  auto exec = execution::execution_context{execRes}.executor();

  auto affExec = execution::require(exec, execution::bulk,
                                    execution::bulk_execution_affinity.compact);

  affExec.bulk_execute([](std::size_t i, shared s) {
    func(i);
  }, 8, sharedFactory);
}

this_system::get_resources()

(Diagram: this_system::get_resources() exposes system-level resources, the place where std:: executes, as Package → Numa 0 / Numa 1 → Core 0-3; on a GPU these correspond to work-groups and processing elements.)

relativeLatency = affinity_query(core2, numa0) > affinity_query(core3, numa0)

Task vs data parallelism


Task parallelism:
● Few large tasks with different operations / control flow
● Optimized for latency

Data parallelism:
● Many small tasks with the same operations on multiple data
● Optimized for throughput

Flynn’s Taxonomy

• Distinguishes multi-processor computer architectures along two independent dimensions: Instruction and Data
• Each dimension can have one state: Single or Multiple
  • SISD: Single Instruction, Single Data (serial, non-parallel machine)
  • SIMD: Single Instruction, Multiple Data (processor arrays and vector machines)
  • MISD: Multiple Instruction, Single Data (weird)
  • MIMD: Multiple Instruction, Multiple Data (most common parallel computer systems)

What kind of processors are we building? (Assuming power is the constraint)
• CPU
  • Complex control hardware
  • Increasing flexibility + performance
  • Expensive in power
• GPU
  • Simpler control structure
  • More HW per computation
  • Potentially more efficient in ops/watt
  • More restrictive programming model

Multicore CPU vs Manycore GPU

Multicore CPU:
• Each core optimized for a single thread
• Fast serial processing
• Must be good at everything
• Minimize latency of 1 thread
  – Lots of big on-chip caches
  – Sophisticated controls

Manycore GPU:
• Cores optimized for aggregate throughput, deemphasizing individual performance
• Scalable parallel processing
• Assumes workload is highly parallel
• Maximize throughput of all threads
  – Lots of big ALUs
  – Multithreading can hide latency, no big caches
  – Simpler control, cost amortized over ALUs via SIMD

The SPMD programming model

Many heterogeneous languages and models like SYCL use an SPMD programming model

This model can be applied to:
● SIMD CPUs
● GPUs
● Many other heterogeneous devices

What is SPMD good at?

SPMD execution is very efficient at launching a large number of work-items
● Unlike a CPU, where launching threads is expensive, SPMD launches thousands of work-items

SPMD is optimised for throughput
● You're not getting the full benefit of a GPU or SIMD CPU unless you are using as many work-items as possible

What is SPMD bad at?

SPMD execution is bad at divergent control flow
● Due to lock-step execution, divergent control flow can be very inefficient

GPUs also have some restrictions in what you can do within a kernel
● You cannot use dynamic allocation (i.e. non-placement new)
● You cannot use recursion
● You cannot use function pointers or virtual functions

How do you write an SPMD program?

Single instruction single data (SISD):

void calc(int *in, int *out) {
  for (int i = 0; i < 1024; i++) {
    out[i] = in[i] * in[i];
  }
}

calc(in, out);

Single program multiple data (SPMD):

void calc(int *in, int *out, int id) {
  out[id] = in[id] * in[id];
}

/* specify */ parallel_for(calc, in, out, 1024);
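For comparison, here is a hedged sketch of the same SPMD kernel written in SYCL 2020, assuming in and out are device-accessible USM pointers (the queue parameter is an illustrative addition, not part of the slide).

#include <sycl/sycl.hpp>

// SPMD in SYCL: one instance of the kernel body runs per work-item,
// each identified by its id, like the 'id' parameter in the SPMD calc above.
void calc(sycl::queue &q, const int *in, int *out, std::size_t n) {
  q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> id) {
    out[id] = in[id] * in[id];
  }).wait();
}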

Agenda

Challenges of an Accelerator Programming Model

SYCL 2020

SYCL in HPC, Embedded, Safety, and Autonomous Vehicles

SYCL 2020 Major Features
• Unified Shared Memory (USM)
  • Code with pointers can work naturally without buffers or accessors
  • Simplifies porting from most code (e.g. CUDA, C++)
• Parallel Reductions
  • Added built-in reduction operation to avoid boilerplate code and achieve maximum performance on hardware with built-in reduction operation acceleration
• Work-group and sub-group algorithms
  • Efficient parallel operations between work-items
• Class template argument deduction (CTAD) and template deduction guides
  • Simplified class template instantiation
• Simplified use of Accessors with a built-in reduction operation
  • Reduces boilerplate code and streamlines the use of C++ software design patterns
• Expanded interoperability
  • Efficient acceleration by diverse backend acceleration APIs
• SYCL atomic operations are now more closely aligned to standard C++ atomics
  • Enhances parallel programming freedom
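As a concrete illustration of the first two features, here is a hedged SYCL 2020 sketch that allocates with USM and sums the data with the built-in reduction; the size and the use of shared allocations are illustrative choices.

#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  sycl::queue q;
  constexpr std::size_t n = 1024;

  // USM: plain pointers, no buffers or accessors required.
  float *data = sycl::malloc_shared<float>(n, q);
  float *sum  = sycl::malloc_shared<float>(1, q);
  for (std::size_t i = 0; i < n; ++i) data[i] = 1.0f;
  *sum = 0.0f;

  // Built-in parallel reduction: the runtime chooses an efficient strategy,
  // including hardware reduction acceleration where available.
  q.submit([&](sycl::handler &cgh) {
    cgh.parallel_for(sycl::range<1>(n),
                     sycl::reduction(sum, sycl::plus<float>()),
                     [=](sycl::id<1> i, auto &acc) { acc += data[i]; });
  }).wait();

  std::cout << *sum << "\n"; // expected: 1024
  sycl::free(data, q);
  sycl::free(sum, q);
}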

Parallel Industry Initiatives

Timeline (2011 → 2015 → 2017 → 2020 → 202X):
● ISO C++: C++11 → C++14 → C++17 → C++20 → C++23
● SYCL: SYCL 1.2 (C++11 single source programming) → SYCL 1.2.1 (C++11 single source programming) → SYCL 2020 (C++17 single source programming, many backend options) → SYCL 202X (C++20 single source programming, many backend options)
● OpenCL: OpenCL 1.2 (OpenCL C kernel language) → OpenCL 2.1 (SPIR-V in core) → OpenCL 2.2 → OpenCL 3.0

SYCLCon 2021 Talks and Events
● SYCL, DPC++, SPUs, oneAPI - a View from Intel by James Reinders
● oneAPI Developer Summit Monday Apr 26, Biagio Cosenza, Peter Zuzek, Steffen Larsen
● Hands on SYCL Tutorial Tuesday Apr 27 by Rod Burns and SYCL team
● Sylkan: Towards a Vulkan Compute Target Platform for SYCL by Peter Thoman
● Performance-Portable Distributed K-Nearest Neighbours using Locality-Sensitive Hashing and SYCL by Marcel Breyer
● Toward Performance Portability of Highly Parametrizable TRSM Algorithm Using SYCL by Thales Sabino
● On Measuring the Maturity of SYCL Implementations by Tracking Historical Performance Improvements by Wei-Chen Lin
● Experiences Supporting DPC++ in AMReX by Sravani Konda
● Developing Medical Imaging Applications Across GPU, FPGA, and CPU using oneAPI
● hipSYCL in 2021: Peculiarities, Unique Features and SYCL 2020 by Aksel Alpay
● Experiences with Adding SYCL Support to GROMACS by Andrey Alekseenko
● Extending DPC++ with Support for Huawei Ascend AI Chipset
● Toward a Better SYCL Memory … by Ben Ashb…
● Bringing SYCL to A100 Ampere Architecture on Perlmutter by Steffen Larsen and LBNL
● SYCL and OpenCL Meet Challenges of Functional Safety by Illya Rudkin
● Enabling OpenCL and SYCL for RISC-V processors by Colin Davidson, Aidan Dodds
● SYCL Panel Thursday Apr 29

SYCL Evolution: SYCL 2020 compared with SYCL 1.2.1
● Easier to integrate with C++17 (CTAD, deduction guides...)
● Less verbose, smaller code size, simplify patterns
● Backend independent
● Multiple object archives aka modules simplify interoperability
● Ease porting C++ applications to SYCL
● Enable capabilities to improve programmability
● Backwards compatible but minor API break based on user feedback

SYCL Future Roadmap (MAY CHANGE)

SYCL 2020: over 40 selected features, including
● Unified Shared Memory (USM)
● Parallel Reductions (adds a built-in reduction operation)
● Work-group and sub-group algorithms
● Improvements to atomic operations
● Class template argument deduction (CTAD) and deduction guides
● Simplification of accessors
● Expanded interoperability with different backends
https://www.khronos.org/registry/SYCL/

Expanding Implementations: DPC++, ComputeCpp, triSYCL, hipSYCL, neoSYCL working on more devices (CPU, GPU, FPGA, AI processors, custom processors)

Improving Software Ecosystem: books, tutorials, tools, libraries, GitHub, conformance tests, implementations; regular maintenance updates (spec clarifications, formatting and bug fixes)

SYCL NEXT: integration of successful extensions plus new core functionality; converge SYCL with ISO C++ and continue to support OpenCL to deploy on more devices; future proposals include an extension mechanism, address spaces, vector rework, specialization constants, ...

Repeat the cycle every 1.5-3 years

SYCL Implementations in Development

SYCL, OpenCL and SPIR-V, as open industry standards, enable flexible integration and deployment of multiple acceleration technologies from a single source code base. SYCL enables Khronos to influence ISO C++ to (eventually) support heterogeneous compute.

● DPC++: uses LLVM/Clang, part of oneAPI. Targets: any CPU, Intel CPUs, Intel GPUs and Intel FPGAs (OpenCL and Level Zero), NVIDIA GPUs (experimental)
● ComputeCpp: multiple backends. Targets: any CPU, Intel CPUs and GPUs, Arm Mali GPUs, IMG PowerVR, Renesas R-Car and more (depends on driver), NVIDIA GPUs (experimental)
● triSYCL: open source test bed. Targets: any CPU (TBB), Xilinx FPGAs (experimental), POCL (open-source OpenCL stack supporting CPUs, NVIDIA GPUs and more)
● hipSYCL: CUDA and HIP/ROCm. Targets: any CPU, NVIDIA GPUs, AMD GPUs, Intel GPUs (Level Zero, experimental)
● neoSYCL: SX-AURORA TSUBASA. Targets: any CPU, NEC VEs (VEO), Intel CPUs

Multiple backends are in development: there is development on supporting SYCL on even more low-level frameworks. For more information: http://sycl.tech

SYCL User and Developer Growth

10X growth over 6 years

SYCL Ecosystem, Research and Benchmarks

(Ecosystem overview: categories include Machine Learning Libraries and Frameworks, Implementations, Research, Linear Algebra and Parallel Libraries, Direct Programming, Acceleration, and Benchmarks/Books. Named projects include oneAPI, neoSYCL (SX-AURORA TSUBASA), SYCL-BLAS, oneMKL (BLAS, FFT, Math, RAND, SOLVER, SPARSE, TENSOR), SYCL-DNN, SYCL Eigen, Parallel STL, oneDNN, oneDPL and SYCL-Bench.)

Working Group Members

Agenda

Challenges of an Accelerator Programming Model

SYCL 2020

SYCL in HPC, Embedded, Safety, and Autonomous Vehicles

SYCL in Embedded Systems, Automotive, and AI

(Diagram: open industry standards enable flexible integration and deployment of multiple acceleration technologies. Neural networks are trained from training data on high-end desktop and cloud systems; trained networks are compiled or ingested; C++ applications link to the compiled vision/inferencing code or call a vision/inferencing API through an inferencing engine; hardware acceleration APIs consume sensor data and target diverse embedded hardware: multi-core CPUs, GPUs, DSPs, FPGAs, Tensor Cores and other dedicated hardware. * Vulkan only runs on GPUs.)

Safety Critical API Evolution

New Generation Safety Critical APIs for Graphics, Compute and Display

OpenCL and SYCL SC work will minimize API surface area, reduce ambiguity and undefined behaviour, and increase determinism across rendering, compute and display.

The industry need for GPU acceleration APIs designed to ease system safety certification (ISO 26262 / ASIL-D) is increasing.

Embedded/Automotive/AI/Safety

“Xilinx is excited about the progress achieved with SYCL 2020,” said Ralph Wittig, fellow, Xilinx.

“For Renesas, SYCL is a key enabler for automotive ADAS/AD software developers ….,” said Cyril Cordoba, Director of ADAS Segment Marketing Department, Renesas.

“Imagination recognises the benefit of SYCL across multiple markets. Our software stacks have been designed to improve SYCL performance, enabling a straightforward path to exploit the teraflops of compute performance in our latest IP,” said Mark Butler, Vice President of Software Engineering, Imagination.

“NSITEXE supports the SYCL 2020 technology, which is gaining attention in embedded applications,” said Hideki Sugimoto, CTO, NSITEXE, Inc.

SYCL support from embedded systems, through desktops to supercomputers

SYCL in HPC/Supercomputers

Three Pillars of HPC:
● Simulation: productivity languages, science problem solver libraries, parallel runtimes
● Data: productivity languages, big data stack, stats libraries, databases
● Learning: productivity languages, deep learning, linear algebra, ML

Today's supercomputing development workflow needs knowledge of system architecture and control of data issues:
● Languages: OpenMP for C and Fortran; CUDA/pthreads/OpenACC/OpenCL; C++ applications use SYCL, Kokkos, Raja
● Need languages and tools that control data: set data affinity, data layout, data movement, data locality; highly parameterized code; dynamically compose the algorithms (C++ templates, parallel STL, inlining and fusion, abstractions)
● Choose the algorithm for the target; libraries (Math, ML, Data; C++ Std, C, Python) augment compiler optimizations for performance-portable programs
● Implement and test the algorithm, then optimize it
● Use open standards to run performance-portable code on new-generation, or different vendors', hardware with compiler optimization, explicit parametrization and dynamically composed algorithms

Based on the IWOCL/SYCLCon 2020 keynote by Hal Finkel: https://www.iwocl.org/wp-content/uploads/iwocl-syclcon-2020-finkel-keynote-slides.pdf

Exascale computing

“Our users will benefit from features in the SYCL 2020 specification. New features, such as support for unified memory (USM) and reductions, are important capabilities for programming high-performance-computing hardware. ...” said Nevin Liber, computer scientist, Argonne National Laboratory’s Leadership Computing Facility.

“At Cineca, based on our experience, we confirm the value that SYCL is bringing to the development of high-performance computing in a hybrid environment. ...” said Sanzio Bassini, director of supercomputing, Application Innovation Dept, Cineca.

SYCL support from embedded systems, through desktops to supercomputers

HPC Computing

“... we see modern C++ language-based approaches to accelerator programming, such as SYCL, as an important component of our programming environment offering for users of Perlmutter,” said Brandon Cook, application performance specialist at NERSC.

“...As co-developers of the Celerity project, together with the University of Salerno, we are welcoming these changes and look forward to applying them within distributed-memory research and industry applications, for example as part of the recently launched EuroHPC LIGATE project,” said Thomas Fahringer, head of the Distributed and Parallel Systems Group at the University of Innsbruck.

“The SYCL 2020 final specification brings significant features to the industry that enable C++ developers to more productively build high-performance heterogeneous applications with unified programming across XPU architectures,” said Jeff McVeigh, Intel vice president, Datacenter XPU Products and Solutions.

SYCL support from embedded systems, through desktops to supercomputers

What now?

Deep Dive into HPC future

When I was OpenMP CEO, I learned

• HPC and exascale computing require programming models that endure for future workloads and last 20 years
• But hardware changes frequently, with constant improvement
• Mandated sharing of diverse hardware across a consortium
• Programming models have to be stable but also support the latest hardware, be open, and cover multiple architectures with multiple implementations

OpenMP is great for C and Fortran; SYCL is great for modern C++, AI and automotive.

(Diagram: ISO base languages → OpenMP for C and Fortran / open acceleration languages → dedicated hardware: DSP, FPGA, GPU.)

Here are some opportunities for HPC growth across Europe and Asia.

What about Europe? EPI, ARM and RISC-V RVV

SYCL as a universal programming model for HPC
Starting with the US National Labs; across Europe and Asia there are many petascale and pre-exascale systems
• With a wide variety of CPUs, GPUs, FPGAs and custom devices
• Often with interconnected usage agreements
• Let's look at Europe first: 3 pre-exascale systems

HPCAsia 2021: neoSYCL, thanks to Hiroyuki Takizawa

Open standard for offload programming = SYCL
Programming with SYCL leads to lower code complexity (BFS using the Rodinia benchmark at HPC Asia 2021)
No loss in performance between using SYCL and VEO

Final words

• SYCL can be a part of a standard programming model for all HPC, including Europe/Asia/NA
• HPC is now used in embedded and automotive
• SYCL is home grown in the EU; a UK company has led its development since 2012; now an open standard with multiple company contributions and lots of European/Asian projects
  • Celerity from the Universities of Innsbruck and Salerno, CINECA Bologna, neoSYCL
• Moves with ISO C++, updates every 1.5-3 years
• Part of oneAPI
• Adapts to HPC hardware changes, moving towards safety critical
• Adopted by ECP for the first exascale computer, Aurora; now also on Perlmutter, and we hope in European and Asian HPC

Enabling Industry Engagement
• SYCL working group values industry feedback
  - https://community.khronos.org/c/sycl
  - https://sycl.tech
• SYCL FAQ
  - https://www.khronos.org/blog/sycl-2020-what-do-you-need-to-know
• What features would you like in future SYCL versions?

Open to all!
● https://community.khronos.org/
● www.khr.io/slack
● https://app.slack.com/client/TDMDFS87M/CE9UX4CHG
● https://community.khronos.org/c/sycl/
● https://stackoverflow.com/questions/tagged/sycl
● https://www.reddit.com/r/sycl
● https://github.com/codeplaysoftware/syclacademy
● https://sycl.tech/

• SYCL Advisory Panel, chaired by Tom Deakin of the University of Bristol
  - SYCL Advisory Panel meeting here at IWOCL/SYCLCon
  - Regular meetings with the SYCL Working Group to give feedback on the roadmap and draft specifications
  - Invited Experts: https://www.khronos.org/advisors/
  - Khronos members: https://www.khronos.org/members/
• Public contributions to the Specification, Conformance Tests and software via the Khronos SYCL Forums, Slack Channels, Stackoverflow, reddit, and SYCL.tech
• Khronos GitHub: contribute to SYCL open source specs, CTS, tools and ecosystem
  - https://github.com/KhronosGroup/SYCL-CTS
  - https://github.com/KhronosGroup/SYCL-Docs
  - https://github.com/KhronosGroup/SYCL-Shared
  - https://github.com/KhronosGroup/SYCL-Registry
  - https://github.com/KhronosGroup/SyclParallelSTL
  - https://www.khronos.org/registry/SYCL/


We’re Hiring!

codeplay.com/careers/

Thanks

[email protected] @codeplaysoft codeplay.com