Modern C++ Heterogeneous Programming with SYCL

Michael Wong
Distinguished Engineer
SYCL WG Chair; ISO C++ Directions Group; Chair of C++ SGs for Machine Learning, Low Latency, Games, Embedded, Finance

Distinguished Engineer Michael Wong
● Chair of SYCL Heterogeneous Programming Language
● C++ Directions Group
● Past CEO OpenMP
● ISOCPP.org Director, VP http://isocpp.org/wiki/faq/wg21#michael-wong
● [email protected] | [email protected]
● Head of Delegation for C++ Standard for Canada
● Chair of Programming Languages for Standards Council of Canada
● Chair of WG21 SG19 Machine Learning
● Chair of WG21 SG14 Games Dev/Low Latency/Financial Trading/Embedded
● Editor: C++ SG5 Transactional Memory Technical Specification
● Editor: C++ SG1 Concurrency Technical Specification
● MISRA C++ and AUTOSAR
● Chair of Standards Council Canada TC22/SC32 Electrical and electronic components (SOTIF)
● Chair of UL4600 Object Tracking
● http://wongmichael.com/about
● C++11 book in Chinese: https://www.amazon.cn/dp/B00ETOV2OQ

Codeplay:
● Ported TensorFlow to open standards using SYCL
● Build LLVM-based compilers for accelerators
● Implement OpenCL and SYCL for accelerator processors
● Releasing open-source, open-standards based AI acceleration tools: SYCL-BLAS, SYCL-ML, VisionCpp
● We build GPU compilers for semiconductor companies
● Now working to make AI/ML heterogeneous acceleration safe for autonomous vehicles

Acknowledgement and Disclaimer

Numerous people, internal and external to the original C++/OpenMP work, in industry and academia, have made contributions, influenced ideas, written parts of this presentation, and offered feedback that forms part of this talk. These include Bjarne Stroustrup, Joe Hummel, Botond Ballo, Simon McIntosh-Smith, as well as many others.

But I claim all credit for errors, and stupid mistakes. These are mine, all mine! You can’t have them.

Legal Disclaimer

THIS WORK REPRESENTS THE VIEW OF THE AUTHOR AND DOES NOT NECESSARILY REPRESENT THE VIEW OF CODEPLAY. OTHER COMPANY, PRODUCT, AND SERVICE NAMES MAY BE TRADEMARKS OR SERVICE MARKS OF OTHERS.

Disclaimers

NVIDIA, the NVIDIA logo and CUDA are trademarks and/or registered trademarks of NVIDIA Corporation in the U.S. and/or other countries.

Codeplay is not associated with NVIDIA for this work; it purely uses public documentation and widely available code.

SYCL 2020 is here!
Open Standard for Single Source C++ Parallel Heterogeneous Programming
● SYCL 2020 was released after 3 years of intense work
● Significant adoption in Embedded, Desktop and HPC markets
● Improved programmability, smaller code size, faster performance
● Based on C++17, backwards compatible with SYCL 1.2.1
● Simplifies porting of standard C++ applications to SYCL
● Closer alignment and integration with ISO C++
● Multiple backend acceleration, API independent

SYCL 2020 increases expressiveness and simplicity for modern C++ heterogeneous programming

SYCL 2020 Industry Momentum

SYCL support is growing from embedded systems through desktops to supercomputers:
● https://www.alcf.anl.gov/support-center/aurora/sycl-and-dpc-aurora
● https://www.embeddedcomputing.com/technology/open-source/risc-v-open-source-ip/nsitexe-kyoto-microcomputer-and-codeplay-software-are-bringing-open-standards-programming-to-risc-v-vector--for-hpc-and-ai-systems
● https://www.nextplatform.com/2021/02/03/can-sycl-slice-into-broader-supercomputing/
● https://www.phoronix.com/scan.php?page=news_item&px=hipSYCL-New-Lite-Runtime
● https://software.intel.com/content/www/us/en/develop/articles/interoperability-dpcpp-sycl-.html
● https://www.renesas.com/br/en/about/press-room/renesas-electronics-and-codeplay-collaborate-opencl-and-sycl-adas-solutions
● https://www.nersc.gov/news-publications/nersc-news/nersc-center-news/2021/nersc-alcf-codeplay-partner-on-sycl-for-next-generation-supercomputers/
● https://research-portal.uws.ac.uk/en/publications/trisycl-for-xilinx-fpga
● https://www.imaginationtech.com/news/press-release/tensorflow-gets-native-support-for--gpus-via-optimised-open-source-sycl-libraries/

Agenda

Challenges of an Accelerator Programming Model

SYCL 2020

SYCL in HPC, Embedded, Safety, and Autonomous Vehicles

Understanding the Challenges of the Heterogeneous Era

So what are the biggest challenges for heterogeneous computing?

➢ Single Source vs Multiple Source
➢ Heterogeneous offloading
➢ Expressing parallelism
➢ Four Horsemen of Heterogeneous Computing: data locality, movement, layout, affinity
➢ SPMD programming model

Heterogeneous Offloading

How do we offload code to a heterogeneous device?

➢ This can be answered by looking at the C++ compilation model

How can we compile source code for sub-architectures?

➢ Separate source

➢ Single source

Separate Source Compilation Model

(Diagram: C++ source file → CPU compiler → CPU object file → Linker → x86 ISA → CPU)

(Diagram: device source → online compiler → GPU)

Here we're using OpenCL as an example.

Host code:
float *a, *b, *c;
…
kernel k = clCreateKernel(…, "my_kernel", …);
clEnqueueWriteBuffer(…, size, a, …);
clEnqueueWriteBuffer(…, size, b, …);
clEnqueueNDRange(…, k, 1, {size, 1, 1}, …);
clEnqueueReadBuffer(…, size, c, …);

Device code (OpenCL C):
__kernel void my_kernel(__global float *a, __global float *b, __global float *c) {
  int id = get_global_id(0);
  c[id] = a[id] + b[id];
}

Std C++ compilation model

(Diagram: C++ source file → CPU compiler → CPU object → Linker → CPU ISA → CPU)

But what about the GPU?

SYCL single source compilation model

(Diagram: the same flow, C++ source file → CPU compiler → CPU object → Linker → CPU ISA → CPU, with a GPU that still needs to be targeted)

auto aAcc = aBuf.get_access(cgh);
auto bAcc = bBuf.get_access(cgh);
auto oAcc = oBuf.get_access(cgh);
cgh.parallel_for(range<1>(a.size()), [=](id<1> idx) {
  oAcc[idx] = aAcc[idx] + bAcc[idx];
});

(Diagram, built up across several slides: the single C++ source file contains both host and device code. Host path: C++ source file → CPU compiler → CPU object. Device path: C++ device source → SYCL compiler → SPIR (SYCL doesn't mandate SPIR). The Linker produces a CPU ISA binary with the device code embedded, e.g. as SPIR, targeting both CPU and GPU. A fuller example follows the list of benefits below.)

Benefits of Single Source

•Device code is written in C++ in the same source file as the host CPU code

•Allows compile-time evaluation of device code

•Supports type safety across host CPU and device

•Supports generic programming

•Removes the need to distribute source code
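To make the fragment above concrete, here is a minimal single-source SYCL vector add sketch, assuming a SYCL 2020 implementation and the <sycl/sycl.hpp> header; the sizes and names are illustrative rather than taken from the slides.

#include <sycl/sycl.hpp>
#include <vector>

int main() {
  constexpr std::size_t N = 1024;
  std::vector<float> a(N, 1.0f), b(N, 2.0f), o(N);

  sycl::queue q; // default device selection

  {
    // Buffers manage host/device data movement implicitly.
    sycl::buffer<float, 1> aBuf(a.data(), sycl::range<1>(N));
    sycl::buffer<float, 1> bBuf(b.data(), sycl::range<1>(N));
    sycl::buffer<float, 1> oBuf(o.data(), sycl::range<1>(N));

    q.submit([&](sycl::handler &cgh) {
      sycl::accessor aAcc(aBuf, cgh, sycl::read_only);
      sycl::accessor bAcc(bBuf, cgh, sycl::read_only);
      sycl::accessor oAcc(oBuf, cgh, sycl::write_only, sycl::no_init);
      // Device code, written inline with the host code in the same file.
      cgh.parallel_for(sycl::range<1>(N), [=](sycl::id<1> idx) {
        oAcc[idx] = aAcc[idx] + bAcc[idx];
      });
    });
  } // Buffer destructors wait for the kernel and copy results back into o.

  return o[0] == 3.0f ? 0 : 1;
}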

Describing Parallelism

How do you represent the different forms of parallelism?

➢ Directive vs explicit parallelism

➢ Task vs data parallelism

➢ Queue vs stream execution

Directive vs Explicit Parallelism

Directive parallelism:
• Examples: OpenMP, OpenACC
• Implementation: the compiler transforms code to be parallel based on pragmas

Explicit parallelism:
• Examples: SYCL, CUDA, TBB, Fibers, C++11 Threads
• Implementation: an API is used to explicitly enqueue one or more threads

Here we're using OpenMP as an example (directive):

vector<float> a, b, c;
#pragma omp parallel for
for (int i = 0; i < a.size(); i++) {
  c[i] = a[i] + b[i];
}

Here we're using C++ AMP as an example (explicit):

array_view<float, 2> a, b, c;
extent<2> e(64, 64);
parallel_for_each(e, [=](index<2> idx) restrict(amp) {
  c[idx] = a[idx] + b[idx];
});
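The explicit column above also lists C++11 threads; as a point of comparison, here is a small hedged sketch of the same vector add using std::thread, where the program itself creates and joins the threads (the chunking scheme is illustrative only).

#include <thread>
#include <vector>

// Explicit parallelism: the program itself creates and joins threads.
void vec_add(const std::vector<float> &a, const std::vector<float> &b,
             std::vector<float> &c, unsigned numThreads = 4) {
  std::vector<std::thread> workers;
  const std::size_t chunk = a.size() / numThreads;
  for (unsigned t = 0; t < numThreads; ++t) {
    std::size_t begin = t * chunk;
    std::size_t end = (t == numThreads - 1) ? a.size() : begin + chunk;
    workers.emplace_back([&, begin, end] {
      for (std::size_t i = begin; i < end; ++i)
        c[i] = a[i] + b[i];
    });
  }
  for (auto &w : workers) w.join(); // explicit synchronization
}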

Task vs Data Parallelism

Task parallelism:
• Examples: OpenMP, C++11 Threads, TBB
• Implementation: multiple (potentially different) tasks are performed in parallel

Data parallelism:
• Examples: C++ AMP, CUDA, SYCL, C++17 ParallelSTL
• Implementation: the same task is performed across a large data set

Here we're using TBB as an example (task parallelism):

vector<task> tasks = { … };
tbb::parallel_for_each(tasks.begin(), tasks.end(), [=](task &v) {
  v();
});

Here we're using CUDA as an example (data parallelism):

float *a, *b, *c;
cudaMalloc((void **)&a, size);
cudaMalloc((void **)&b, size);
cudaMalloc((void **)&c, size);
vec_add<<<64, 64>>>(a, b, c);

Queue vs Stream Execution

Queue execution:
• Examples: C++ AMP, CUDA, SYCL, C++17 ParallelSTL
• Implementation: functions are placed in a queue and executed once per enqueue

Stream execution:
• Examples: BOINC, BrookGPU
• Implementation: a function is executed in a continuous loop on a stream of data

Here we're using CUDA as an example (queue execution):

float *a, *b, *c;
cudaMalloc((void **)&a, size);
cudaMalloc((void **)&b, size);
cudaMalloc((void **)&c, size);
vec_add<<<64, 64>>>(a, b, c);

Here we're using BrookGPU as an example (stream execution):

reduce void sum(float a<>, reduce float r<>) {
  r += a;
}
float a<100>;
float r;
sum(a, r);


What are the four Horsemen of Heterogeneous Computing?

It's all about the data: locality, movement, layout, affinity

The Four Horsemen

(Diagram: a two-socket system, Socket 0 and Socket 1, each with Core 0 and Core 1; each core has hardware threads 0-3, numbered 0-7 across the system.)

One of the biggest limiting factors in heterogeneous computing

➢ Cost of data movement in time and power consumption

Cost of Data Movement
• It can take considerable time to move data to a device
  • This varies greatly depending on the architecture
• The bandwidth of a device can impose bottlenecks
  • This reduces the amount of throughput you have on the device
• Performance gain from computation > cost of moving data
  • If the gain is less than the cost of moving the data it's not worth doing
• Many devices have a hierarchy of memory regions
  • Global, read-only, group, private
  • Each region has different size, affinity and access latency
  • Having the data as close to the computation as possible reduces the cost
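To illustrate keeping data close to the computation, here is a hedged SYCL sketch that stages a tile of global data in work-group local memory before operating on it; the tile size, the scaling operation, and the assumption that the buffer size divides evenly by the tile size are all illustrative, not from the slides.

#include <sycl/sycl.hpp>

// Hedged sketch: copy a tile into local (group) memory, operate on it there,
// then write results back to global memory.
void tile_scale(sycl::queue &q, sycl::buffer<float, 1> &data, float factor) {
  constexpr std::size_t tile = 64; // illustrative; assumes size % tile == 0
  q.submit([&](sycl::handler &cgh) {
    sycl::accessor global(data, cgh, sycl::read_write);
    // Local accessor: one tile per work-group, close to the processing elements.
    sycl::local_accessor<float, 1> local(sycl::range<1>(tile), cgh);
    cgh.parallel_for(
        sycl::nd_range<1>(data.get_range(), sycl::range<1>(tile)),
        [=](sycl::nd_item<1> it) {
          std::size_t gid = it.get_global_id(0);
          std::size_t lid = it.get_local_id(0);
          local[lid] = global[gid];                 // global -> local
          sycl::group_barrier(it.get_group());      // make the tile visible
          global[gid] = local[lid] * factor;        // compute from local memory
        });
  });
}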

Cost of Data Movement

• 64bit DP Op: 20pJ
• 4x64bit register read: 50pJ
• 4x64bit move 1mm: 26pJ
• 4x64bit move 40mm: 1nJ
• 4x64bit move DRAM: 16nJ

Credit: Bill Dally, Nvidia, 2010

How do you move data from the host CPU to a device?

➢ Implicit vs explicit data movement

Implicit vs Explicit Data Movement

Implicit data movement:
• Examples: SYCL, C++ AMP
• Implementation: data is moved to the device implicitly via cross host CPU / device data structures

Explicit data movement:
• Examples: OpenCL, CUDA, OpenMP, SYCL
• Implementation: data is moved to the device via explicit copy

Here we're using C++ AMP as an example (implicit):

array_view<float, 2> ptr;
extent<2> e(64, 64);
parallel_for_each(e, [=](index<2> idx) restrict(amp) {
  ptr[idx] *= 2.0f;
});

Here we're using CUDA as an example (explicit):

float *h_a = { … }, *d_a;
cudaMalloc((void **)&d_a, size);
cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
vec_add<<<64, 64>>>(a, b, c);
cudaMemcpy(h_a, d_a, size, cudaMemcpyDeviceToHost);
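Since SYCL appears in both columns, here is a hedged sketch showing both styles in SYCL 2020: implicit movement through buffers and accessors, and explicit movement through USM device allocations with queue::memcpy. The function names and sizes are illustrative.

#include <sycl/sycl.hpp>
#include <vector>

void implicit_move(sycl::queue &q, std::vector<float> &v) {
  // Implicit: the buffer/accessor model moves data for us.
  sycl::buffer<float, 1> buf(v.data(), sycl::range<1>(v.size()));
  q.submit([&](sycl::handler &cgh) {
    sycl::accessor acc(buf, cgh, sycl::read_write);
    cgh.parallel_for(sycl::range<1>(v.size()),
                     [=](sycl::id<1> i) { acc[i] *= 2.0f; });
  });
} // buffer destructor copies results back into v

void explicit_move(sycl::queue &q, std::vector<float> &v) {
  // Explicit: SYCL 2020 USM device allocation plus explicit copies.
  float *d = sycl::malloc_device<float>(v.size(), q);
  q.memcpy(d, v.data(), v.size() * sizeof(float)).wait();
  q.parallel_for(sycl::range<1>(v.size()),
                 [=](sycl::id<1> i) { d[i] *= 2.0f; }).wait();
  q.memcpy(v.data(), d, v.size() * sizeof(float)).wait();
  sycl::free(d, q);
}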

Row-major vs column-major

int x = globalId[0];
int y = globalId[1];
int stride = 4;
out[(x * stride) + y] = in[(y * stride) + x];
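The snippet above reads one array in row-major order and writes the other in column-major order. Below is a hedged sketch of the same transpose as a complete SYCL kernel over USM pointers (an assumption; the slide does not say how in and out are allocated). Work-items whose x differs by one read consecutive addresses from in but write addresses stride apart into out.

#include <sycl/sycl.hpp>

// Hedged sketch: 'in' and 'out' are assumed to be device-accessible USM pointers.
void transpose(sycl::queue &q, const float *in, float *out, int stride) {
  q.parallel_for(sycl::range<2>(stride, stride), [=](sycl::id<2> id) {
    int x = id[0];
    int y = id[1];
    // Consecutive x: contiguous read from in, strided write into out.
    out[(x * stride) + y] = in[(y * stride) + x];
  }).wait();
}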

AoS vs SoA

(Diagram: memory layout f f f f f f f f i i i i i i i i; 2x load operations)

struct str {
  float f[N];
  int i[N];
};
str s;

… = s.f[globalId];

… = s.i[globalId];
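The struct above is the SoA (structure of arrays) form; for contrast, here is a hedged sketch of the equivalent AoS (array of structures) layout, with comments on how each is indexed from a kernel. N and the member types are illustrative.

constexpr int N = 1024;

// SoA: all floats contiguous, then all ints contiguous.
// Work-items reading s.f[globalId] touch adjacent addresses.
struct soa {
  float f[N];
  int   i[N];
};

// AoS: each element interleaves a float and an int.
// Work-items reading a[globalId].f touch addresses sizeof(elem) apart (strided).
struct elem { float f; int i; };
using aos = elem[N];

// Inside a kernel (sketch):
//   float x = soaData.f[globalId];   // unit-stride across work-items
//   float y = aosData[globalId].f;   // strided across work-items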

(Diagram: the two-socket topology again, Socket 0 and Socket 1, each with Core 0 and Core 1, hardware threads 0-3 per core, numbered 0-7 across the system.)

Affinity-aware executor example (from a proposed C++ executor affinity extension, pre-standard):

{
  auto exec = execution::execution_context{execRes}.executor();

  auto affExec = execution::require(exec, execution::bulk,
                                    execution::bulk_execution_affinity.compact);

  affExec.bulk_execute([](std::size_t i, shared s) {
    func(i);
  }, 8, sharedFactory);
}

this_system::get_resources()

(Diagram: this_system::get_resources() exposes system-level resources, the place where std:: executes, as Package → Numa 0 / Numa 1 → Core 0-3; on a GPU these correspond to work-groups and processing elements.)

relativeLatency = affinity_query(core2, numa0) > affinity_query(core3, numa0)

Task vs data parallelism


Task parallelism:
● Few large tasks with different operations / control flow
● Optimized for latency

Data parallelism:
● Many small tasks with the same operations on multiple data
● Optimized for throughput

Flynn’s Taxonomy

• Distinguishes multi-processor computer architectures along two independent dimensions: Instruction and Data
• Each dimension can have one state: Single or Multiple
  • SISD: Single Instruction, Single Data (serial, non-parallel machine)
  • SIMD: Single Instruction, Multiple Data (processor arrays and vector machines)
  • MISD: Multiple Instruction, Single Data (weird)
  • MIMD: Multiple Instruction, Multiple Data (most common parallel computer systems)

What kind of processors are we building? (Assuming power is the constraint)
• CPU
  • Complex control hardware
  • Increasing flexibility + performance
  • Expensive in power
• GPU
  • Simpler control structure
  • More HW per computation
  • Potentially more efficient in ops/watt
  • More restrictive programming model

Multicore CPU vs Manycore GPU

Multicore CPU:
• Each core optimized for a single thread
• Fast serial processing
• Must be good at everything
• Minimize latency of 1 thread
  – Lots of big on-chip caches
  – Sophisticated controls

Manycore GPU:
• Cores optimized for aggregate throughput, deemphasizing individual performance
• Scalable parallel processing
• Assumes workload is highly parallel
• Maximize throughput of all threads
  – Lots of big ALUs
  – Multithreading can hide latency, no big caches
  – Simpler control, cost amortized over ALUs via SIMD

The SPMD programming model

Many heterogeneous languages and models like SYCL use an SPMD programming model

This model can be applied to:
● SIMD CPUs
● GPUs
● Many other heterogeneous devices

What is SPMD good at?

SPMD execution is very efficient at launching a large number of work-items
● Unlike a CPU, where launching threads is expensive, SPMD launches thousands of work-items

SPMD is optimised for throughput
● You're not getting the full benefit of a GPU or SIMD CPU unless you are using as many work-items as possible

What is SPMD bad at?

SPMD execution is bad at divergent control flow
● Due to lock-step execution, divergent control flow can be very inefficient

GPUs also have some restrictions in what you can do within a kernel
● You cannot use dynamic allocation (i.e. non-placement new)
● You cannot use recursion
● You cannot use function pointers or virtual functions

How do you write an SPMD program?

Single instruction single data (SISD):

void calc(int *in, int *out) {
  for (int i = 0; i < 1024; i++) {
    out[i] = in[i] * in[i];
  }
}

calc(in, out);

Single program multiple data (SPMD):

void calc(int *in, int *out, int id) {
  out[id] = in[id] * in[id];
}

/* specify */ parallel_for(calc, in, out, 1024);
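For comparison, here is a hedged sketch of the same SPMD kernel written in SYCL 2020, assuming in and out are device-accessible USM pointers (the queue parameter is an illustrative addition, not part of the slide).

#include <sycl/sycl.hpp>

// SPMD in SYCL: one instance of the kernel body runs per work-item,
// each identified by its id, like the 'id' parameter in the SPMD calc above.
void calc(sycl::queue &q, const int *in, int *out, std::size_t n) {
  q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> id) {
    out[id] = in[id] * in[id];
  }).wait();
}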

Agenda

Challenges of an Accelerator Programming Model

SYCL 2020

SYCL in HPC, Embedded, Safety, and Autonomous Vehicles

SYCL 2020 Major Features
• Unified Shared Memory (USM)
  • Code with pointers can work naturally without buffers or accessors
  • Simplifies porting from most code (e.g. CUDA, C++)
• Parallel Reductions
  • Added built-in reduction operation to avoid boilerplate code and achieve maximum performance on hardware with built-in reduction operation acceleration
• Work-group and sub-group algorithms
  • Efficient parallel operations between work-items
• Class template argument deduction (CTAD) and template deduction guides
  • Simplified class template instantiation
• Simplified use of Accessors with a built-in reduction operation
  • Reduces boilerplate code and streamlines the use of C++ software design patterns
• Expanded interoperability
  • Efficient acceleration by diverse backend acceleration APIs
• SYCL atomic operations are now more closely aligned to standard C++ atomics
  • Enhances parallel programming freedom
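As a concrete illustration of the first two features, here is a hedged SYCL 2020 sketch that allocates with USM and sums the data with the built-in reduction; the size and the use of shared allocations are illustrative choices.

#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  sycl::queue q;
  constexpr std::size_t n = 1024;

  // USM: plain pointers, no buffers or accessors required.
  float *data = sycl::malloc_shared<float>(n, q);
  float *sum  = sycl::malloc_shared<float>(1, q);
  for (std::size_t i = 0; i < n; ++i) data[i] = 1.0f;
  *sum = 0.0f;

  // Built-in parallel reduction: the runtime chooses an efficient strategy,
  // including hardware reduction acceleration where available.
  q.submit([&](sycl::handler &cgh) {
    cgh.parallel_for(sycl::range<1>(n),
                     sycl::reduction(sum, sycl::plus<float>()),
                     [=](sycl::id<1> i, auto &acc) { acc += data[i]; });
  }).wait();

  std::cout << *sum << "\n"; // expected: 1024
  sycl::free(data, q);
  sycl::free(sum, q);
}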

Parallel Industry Initiatives

Timeline (2011 → 2015 → 2017 → 2020 → 202X):
● ISO C++: C++11 → C++14 → C++17 → C++20 → C++23
● SYCL: SYCL 1.2 (C++11 single source programming) → SYCL 1.2.1 (C++11 single source programming) → SYCL 2020 (C++17 single source programming, many backend options) → SYCL 202X (C++20 single source programming, many backend options)
● OpenCL: OpenCL 1.2 (OpenCL C kernel language) → OpenCL 2.1 (SPIR-V in core) → OpenCL 2.2 → OpenCL 3.0

SYCLCon 2021 Talks and Events
● SYCL, DPC++, SPUs, oneAPI - a View from Intel by James Reinders
● oneAPI Developer Summit Monday Apr 26, Biagio Cosenza, Peter Zuzek, Steffen Larsen
● Hands on SYCL Tutorial Tuesday Apr 27 by Rod Burns and SYCL team
● Sylkan: Towards a Vulkan Compute Target Platform for SYCL by Peter Thoman
● Performance-Portable Distributed K-Nearest Neighbours using Locality-Sensitive Hashing and SYCL by Marcel Breyer
● Toward Performance Portability of Highly Parametrizable TRSM Algorithm Using SYCL by Thales Sabino
● On Measuring the Maturity of SYCL Implementations by Tracking Historical Performance Improvements by Wei-Chen Lin
● Experiences Supporting DPC++ in AMReX by Sravani Konda
● Developing Medical Imaging Applications Across GPU, FPGA, and CPU using oneAPI
● hipSYCL in 2021: Peculiarities, Unique Features and SYCL 2020 by Aksel Alpay
● Experiences with Adding SYCL Support to GROMACS by Andrey Alekseenko
● Extending DPC++ with Support for Huawei Ascend AI Chipset
● Toward a Better SYCL Memory … by Ben Ashb…
● Bringing SYCL to A100 Ampere Architecture on Perlmutter by Steffen Larsen and LBNL
● SYCL and OpenCL Meet Challenges of Functional Safety by Illya Rudkin
● Enabling OpenCL and SYCL for RISC-V processors by Colin Davidson, Aidan Dodds
● SYCL Panel Thursday Apr 29

SYCL Evolution: SYCL 2020 compared with SYCL 1.2.1
● Easier to integrate with C++17 (CTAD, deduction guides...)
● Less verbose, smaller code size, simplify patterns
● Backend independent
● Multiple object archives aka modules simplify interoperability
● Ease porting C++ applications to SYCL
● Enable capabilities to improve programmability
● Backwards compatible but minor API break based on user feedback

SYCL Future Roadmap (MAY CHANGE)

SYCL 2020: over 40 selected features, including
● Unified Shared Memory (USM)
● Parallel Reductions (adds a built-in reduction operation)
● Work-group and sub-group algorithms
● Improvements to atomic operations
● Class template argument deduction (CTAD) and deduction guides
● Simplification of accessors
● Expanded interoperability with different backends
https://www.khronos.org/registry/SYCL/

Expanding Implementations: DPC++, ComputeCpp, triSYCL, hipSYCL, neoSYCL working on more devices (CPU, GPU, FPGA, AI processors, custom processors)

Improving Software Ecosystem: books, tutorials, tools, libraries, GitHub, conformance tests, implementations; regular maintenance updates (spec clarifications, formatting and bug fixes)

SYCL NEXT: integration of successful extensions plus new core functionality; converge SYCL with ISO C++ and continue to support OpenCL to deploy on more devices; future proposals include an extension mechanism, address spaces, vector rework, specialization constants, ...

Repeat the cycle every 1.5-3 years

SYCL Implementations in Development

SYCL, OpenCL and SPIR-V, as open industry standards, enable flexible integration and deployment of multiple acceleration technologies from a single source code base. SYCL enables Khronos to influence ISO C++ to (eventually) support heterogeneous compute.

● DPC++: uses LLVM/Clang, part of oneAPI. Targets: any CPU, Intel CPUs, Intel GPUs and Intel FPGAs (OpenCL and Level Zero), NVIDIA GPUs (experimental)
● ComputeCpp: multiple backends. Targets: any CPU, Intel CPUs and GPUs, Arm Mali GPUs, IMG PowerVR, Renesas R-Car and more (depends on driver), NVIDIA GPUs (experimental)
● triSYCL: open source test bed. Targets: any CPU (TBB), Xilinx FPGAs (experimental), POCL (open-source OpenCL stack supporting CPUs, NVIDIA GPUs and more)
● hipSYCL: CUDA and HIP/ROCm. Targets: any CPU, NVIDIA GPUs, AMD GPUs, Intel GPUs (Level Zero, experimental)
● neoSYCL: SX-AURORA TSUBASA. Targets: any CPU, NEC VEs (VEO), Intel CPUs

Multiple backends are in development: there is development on supporting SYCL on even more low-level frameworks. For more information: http://sycl.tech

SYCL User and Developer Growth

10X growth over 6 years

SYCL Ecosystem, Research and Benchmarks

(Ecosystem overview: categories include Machine Learning Libraries and Frameworks, Implementations, Research, Linear Algebra and Parallel Libraries, Direct Programming, Acceleration, and Benchmarks/Books. Named projects include oneAPI, neoSYCL (SX-AURORA TSUBASA), SYCL-BLAS, oneMKL (BLAS, FFT, Math, RAND, SOLVER, SPARSE, TENSOR), SYCL-DNN, SYCL Eigen, Parallel STL, oneDNN, oneDPL and SYCL-Bench.)

Working Group Members

Agenda

Challenges of an Accelerator Programming Model

SYCL 2020

SYCL in HPC, Embedded, Safety, and Autonomous Vehicles

SYCL in Embedded Systems, Automotive, and AI

(Diagram: open industry standards enable flexible integration and deployment of multiple acceleration technologies. Neural networks are trained from training data on high-end desktop and cloud systems; trained networks are compiled or ingested; C++ applications link to the compiled vision/inferencing code or call a vision/inferencing API through an inferencing engine; hardware acceleration APIs consume sensor data and target diverse embedded hardware: multi-core CPUs, GPUs, DSPs, FPGAs, Tensor Cores and other dedicated hardware. * Vulkan only runs on GPUs.)

Safety Critical API Evolution

New Generation Safety Critical APIs for Graphics, Compute and Display

OpenCL and SYCL SC work will minimize API surface area, reduce ambiguity and undefined behaviour, and increase determinism across rendering, compute and display.

The industry need for GPU acceleration APIs designed to ease system safety certification (ISO 26262 / ASIL-D) is increasing.

Embedded/Automotive/AI/Safety

“Xilinx is excited about the progress achieved with SYCL 2020,” said Ralph Wittig, fellow, Xilinx.

“For Renesas, SYCL is a key enabler for automotive ADAS/AD software developers ….,” said Cyril Cordoba, Director of ADAS Segment Marketing Department, Renesas.

“Imagination recognises the benefit of SYCL across multiple markets. Our software stacks have been designed to improve SYCL performance, enabling a straightforward path to exploit the teraflops of compute performance in our latest IP,” said Mark Butler, Vice President of Software Engineering, Imagination.

“NSITEXE supports the SYCL 2020 technology, which is gaining attention in embedded applications,” said Hideki Sugimoto, CTO, NSITEXE, Inc.

SYCL support from embedded systems, through desktops to supercomputers

SYCL in HPC/Supercomputers

Three Pillars of HPC:
● Simulation: productivity languages, science problem solver libraries, parallel runtimes
● Data: productivity languages, big data stack, stats libraries, databases
● Learning: productivity languages, deep learning, linear algebra, ML

Today's supercomputing development workflow needs knowledge of system architecture and control of data issues:
● Languages: OpenMP for C and Fortran; CUDA/pthreads/OpenACC/OpenCL; C++ applications use SYCL, Kokkos, Raja
● Need languages and tools that control data: set data affinity, data layout, data movement, data locality; highly parameterized code; dynamically compose the algorithms (C++ templates, parallel STL, inlining and fusion, abstractions)
● Choose the algorithm for the target; libraries (Math, ML, Data; C++ Std, C, Python) augment compiler optimizations for performance-portable programs
● Implement and test the algorithm, then optimize it
● Use open standards to run performance-portable code on new-generation, or different vendors', hardware with compiler optimization, explicit parametrization and dynamically composed algorithms

Based on the IWOCL/SYCLCon 2020 keynote by Hal Finkel: https://www.iwocl.org/wp-content/uploads/iwocl-syclcon-2020-finkel-keynote-slides.pdf

Exascale computing

“Our users will benefit from features in the SYCL 2020 specification. New features, such as support for unified memory (USM) and reductions, are important capabilities for programming high-performance-computing hardware. ...” said Nevin Liber, computer scientist, Argonne National Laboratory’s Leadership Computing Facility.

“At Cineca, based on our experience, we confirm the value that SYCL is bringing to the development of high-performance computing in a hybrid environment. ...” said Sanzio Bassini, director of supercomputing, Application Innovation Dept, Cineca.

SYCL support from embedded systems, through desktops to supercomputers

HPC Computing

“... we see modern C++ language-based approaches to accelerator programming, such as SYCL, as an important component of our programming environment offering for users of Perlmutter,” said Brandon Cook, application performance specialist at NERSC.

“...As co-developers of the Celerity project, together with the University of Salerno, we are welcoming these changes and look forward to applying them within distributed-memory research and industry applications, for example as part of the recently launched EuroHPC LIGATE project,” said Thomas Fahringer, head of the Distributed and Parallel Systems Group at the University of Innsbruck.

“The SYCL 2020 final specification brings significant features to the industry that enable C++ developers to more productively build high-performance heterogeneous applications with unified programming across XPU architectures,” said Jeff McVeigh, Intel vice president, Datacenter XPU Products and Solutions.

SYCL support from embedded systems, through desktops to supercomputers

What now?

Deep Dive into HPC future

When I was OpenMP CEO, I learned

• HPC and exascale computing require programming models that endure for future workloads and last 20 years
• But hardware changes frequently, with constant improvement
• Mandated sharing of diverse hardware across a consortium
• Programming models have to be stable but also support the latest hardware, be open, and cover multiple architectures with multiple implementations

OpenMP is great for C and Fortran; SYCL is great for modern C++, AI and automotive.

(Diagram: ISO base languages → OpenMP for C and Fortran / open acceleration languages → dedicated hardware: DSP, FPGA, GPU.)

Here are some opportunities for HPC growth across Europe and Asia.

What about Europe? EPI, ARM and RISC-V RVV

SYCL as a universal programming model for HPC
Starting with the US National Labs; across Europe and Asia there are many petascale and pre-exascale systems
• With a wide variety of CPUs, GPUs, FPGAs and custom devices
• Often with interconnected usage agreements
• Let's look at Europe first: 3 pre-exascale systems

HPCAsia 2021: neoSYCL, thanks to Hiroyuki Takizawa

Open standard for offload programming = SYCL
Programming with SYCL leads to lower code complexity (BFS using the Rodinia benchmark at HPC Asia 2021)
No loss in performance between using SYCL and VEO

Final words

• SYCL can be a part of a standard programming model for all HPC, including Europe/Asia/NA
• HPC is now used in embedded and automotive
• SYCL is home grown in the EU; a UK company has led its development since 2012; now an open standard with multiple company contributions and lots of European/Asian projects
  • Celerity from the Universities of Innsbruck and Salerno, CINECA Bologna, neoSYCL
• Moves with ISO C++, updates every 1.5-3 years
• Part of oneAPI
• Adapts to HPC hardware changes, moving towards safety critical
• Adopted by ECP for the first exascale computer, Aurora; now also on Perlmutter, and we hope in European and Asian HPC

Enabling Industry Engagement
• SYCL working group values industry feedback
  - https://community.khronos.org/c/sycl
  - https://sycl.tech
• SYCL FAQ
  - https://www.khronos.org/blog/sycl-2020-what-do-you-need-to-know
• What features would you like in future SYCL versions?

Open to all!
● https://community.khronos.org/
● www.khr.io/slack
● https://app.slack.com/client/TDMDFS87M/CE9UX4CHG
● https://community.khronos.org/c/sycl/
● https://stackoverflow.com/questions/tagged/sycl
● https://www.reddit.com/r/sycl
● https://github.com/codeplaysoftware/syclacademy
● https://sycl.tech/

• SYCL Advisory Panel, chaired by Tom Deakin of the University of Bristol
  - SYCL Advisory Panel meeting here at IWOCL/SYCLCon
  - Regular meetings with the SYCL Working Group to give feedback on the roadmap and draft specifications
  - Invited Experts: https://www.khronos.org/advisors/
  - Khronos members: https://www.khronos.org/members/
• Public contributions to the Specification, Conformance Tests and software via the Khronos SYCL Forums, Slack Channels, Stackoverflow, reddit, and SYCL.tech
• Khronos GitHub: contribute to SYCL open source specs, CTS, tools and ecosystem
  - https://github.com/KhronosGroup/SYCL-CTS
  - https://github.com/KhronosGroup/SYCL-Docs
  - https://github.com/KhronosGroup/SYCL-Shared
  - https://github.com/KhronosGroup/SYCL-Registry
  - https://github.com/KhronosGroup/SyclParallelSTL
  - https://www.khronos.org/registry/SYCL/


We’re Hiring!

codeplay.com/careers/

Thanks

[email protected] @codeplaysoft codeplay.com