The landscape of Parallel Programming Models Part 1: Performance, Portability & Productivity

Michael Wong Codeplay Software Ltd. Distinguished Engineer

IXPUG 2020 2 © 2020 Codeplay Software Ltd. Distinguished Engineer Michael Wong

● Chair of SYCL Heterogeneous Programming Language ● C++ Directions Group ● ISOCPP.org Director, VP http://isocpp.org/wiki/faq/wg21#michael-wong ● [email protected][email protected] Ported ● Head of Delegation for C++ Standard for Canada Build LLVM- TensorFlow to based ● Chair of Programming Languages for Standards open compilers for Council of Canada standards accelerators Chair of WG21 SG19 Machine Learning using SYCL Chair of WG21 SG14 Games Dev/Low Latency/Financial Trading/Embedded Implement Releasing open- ● Editor: C++ SG5 Transactional Memory Technical source, open- OpenCL and Specification standards based AI SYCL for acceleration tools: ● Editor: C++ SG1 Concurrency Technical Specification SYCL-BLAS, SYCL-ML, accelerator ● MISRA C++ and AUTOSAR VisionCpp processors ● Chair of Standards Council Canada TC22/SC32 Electrical and electronic components (SOTIF) ● Chair of UL4600 Object Tracking ● http://wongmichael.com/about We build GPU compilers for semiconductor companies ● C++11 book in Chinese: Now working to make AI/ML heterogeneous acceleration safe for https://www.amazon.cn/dp/B00ETOV2OQ autonomous vehicle

3 © 2020 Codeplay Software Ltd. Acknowledgement and Disclaimer Numerous people internal and external to the original C++//OpenMP, in industry and academia, have made contributions, influenced ideas, written part of this presentations, and offered feedbacks to form part of this talk. These in clude Bjarne Stroustrup, Joe Hummel, Botond Ballo, Simon Mcintosh-Smith, as well as many others.

But I claim all credit for errors, and stupid mistakes. These are mine, all mine! You can’t have them.

4 © 2020 Codeplay Software Ltd. Legal Disclaimer

THIS WORK REPRESENTS THE VIEW OF THE OTHER COMPANY, PRODUCT, AND SERVICE AUTHOR AND DOES NOT NECESSARILY NAMES MAY BE TRADEMARKS OR SERVICE REPRESENT THE VIEW OF CODEPLAY. MARKS OF OTHERS.

5 © 2020 Codeplay Software Ltd. Disclaimers

NVIDIA, the NVIDIA logo and CUDA are trademarks and/or registered trademarks of NVIDIA Corporation in the U.S. and/or other countries

Codeplay is not associated with NVIDIA for this work and it is purely using public documentation and widely available code

6 © 2020 Codeplay Software Ltd. 3 Act Play

1. Performance, Portability, Productivity 2. The four horsemen of heterogeneous programming 3. C++, OpenCL, OpenMP, SYCL

7 © 2020 Codeplay Software Ltd. Act 1

Performance, Portability, Productivity

8 © 2020 Codeplay Software Ltd. Isn’t Parallel/Concurrent/Heterogeneous Programming Hard?

Houston, we have a software crisis!

9 © 2020 Codeplay Software Ltd. So What are the Goals? Goals of Parallel Programming over and above sequential programming 1. Performance

2. Productivity Portability 3. Portability Performance

The Iron Triangle of Parallel Programming language nirvana Productivity

10 © 2020 Codeplay Software Ltd. Why does productivity make the list?

11 © 2020 Codeplay Software Ltd. Performance • Broadly includes scalability and efficiency • If not for performance why not just write sequential program? • parallel programming is primarily a performance optimization, and, as such, it is one potential optimization of many.

© 2020 Codeplay Software Ltd. Diagram thanks to Paul McKenney12 Are there no cases where parallel programming is about something other than performance?

13 © 2020 Codeplay Software Ltd. Productivity Perhaps at one time, the sole purpose of parallel software was performance. Now, however, productivity is gaining the spotlight.

© 2020 Codeplay Software Ltd. 14 Diagram thanks to Paul McKenney Given how cheap parallel systems have become, how can anyone afford to pay people to program them?

15 © 2020 Codeplay Software Ltd. Iron Triangle of Parallel Programming Language Nirvana

Portability

Performance

Productivity

16 © 2020 Codeplay Software Ltd. Performance Portability Productivity

OpenCL

OpenMP Portability CUDA SYCL

Iron Triangle of Parallel Programming Nirvana is about making engineering tradeoffs

17 © 2020 Codeplay Software Ltd. Concurrency vs Parallelism

What makes parallel or concurrent programming harder than serial programming? What’s the difference? How much of this is simply a new mindset one has to adopt?

18 © 2020 Codeplay Software Ltd. Which one to choose?

19 © 2020 Codeplay Software Ltd. 20 © 2020 Codeplay Software Ltd. Heterogeneous Devices

Accelerator CPU GPU

DSP APU FPGA

21 © 2020 Codeplay Software Ltd. Hot Chips

22 © 2020 Codeplay Software Ltd. Fundamental Parallel Architecture Types • Uniprocessor • Shared Memory • Scalar Multiprocessor (SMP) • Shared memory address processor space • Bus-based memory system memory • Single Instruction processor … processor Multiple Data (SIMD) bus

memory processor … • Interconnection network memory

processor … processor

• Single Instruction network Multiple Thread (SIMT) … memory

2323 © 2020 Codeplay Software Ltd. SIMD vs SIMT

24 © 2020 Codeplay Software Ltd. Distributed and network Parallel Architecture Types • Distributed Memory • Cluster of SMPs Multiprocessor • Shared memory • Message passing addressing within SMP between nodes node • Message passing between memory memory SMP nodes … M M

processor processor …

… P … P P P interconnection network network interface interconnection network processor processor … P … P P … P memory memory … • Massively Parallel Processor M M (MPP) • Can also be regarded as Introduction to Parallel • Many, many processors MPP if processor number Computing, University of is large Oregon, IPCC 2525 © 2020 Codeplay Software Ltd. Modern Parallel Architecture  Multicore Manycore • Heterogeneous: CPU+Manycore CPU M M

 Manycore vs Multicore CPU …

… P … P P P C C C C C C C C C C C C cores can be m m m m m m m m m m m m hardware network multithreaded interface C C C C C C C C (hyperthread) interconnection network m m m m m m m m memory P … P P … P  Heterogeneous: CPU + GPU memory … M M processor Heterogeneous: Multicore SMP+GPU Cluster PCI • M M …

memory

… P … P P P  Heterogeneous: “Fused” CPU + GPU

processor interconnection network

… … memory P P P P

… 2626 M M © 2020 Codeplay Software Ltd. Modern Parallel Programming model • Heterogeneous: CPU+Manycore CPU: OpenCL, OpenMP, SYCL,  Multicore Manycore C++11/14/17/20, TBB, Cilk, pthread M M  Manycore vs Multicore CPU: OpenCL, OpenMP, SYCL, …

C++11/14/17/20, TBB, Cilk, pthread

… P … P P P C C C C C C C C C C C C cores can be m m m m m m m m m m m m hardware network multithreaded interface C C C C C C C C (hyperthread) interconnection network m m m m m m m m memory P … P P … P  Heterogeneous: CPU + GPU: OpenCL, OpenMP, SYCL, C++17/20, memory OpenACC, CUDA, hip, RocM, C++ AMP, Intrinsics, OpenGL, Vulkan, … M CUDA, DirectX M processor PCI • Heterogeneous: Multicore SMP+GPU Cluster: OpenCL, OpenMP, SYCL, C++17/20 M M … memory

 “ ” Heterogeneous: Fused CPU + GPU: OpenCL, OpenMP, SYCL, C++17/20, hip, … P … P P P RocM, Intrinsics, OpenGL, Vulkan, DirectX

processor interconnection network

… … memory P P P P

… 2727 M M © 2020 Codeplay Software Ltd. Which Programming model works on all the Architectures?Is there a pattern?

• Heterogeneous: CPU+Manycore CPU: OpenCL, OpenMP, SYCL,  Multicore Manycore C++11/14/17/20, TBB, Cilk, pthread M M  Manycore vs Multicore CPU: OpenCL, OpenMP, SYCL, …

C++11/14/17/20, TBB, Cilk, pthread

… P … P P P C C C C C C C C C C C C cores can be m m m m m m m m m m m m hardware network multithreaded interface C C C C C C C C (hyperthread) interconnection network m m m m m m m m memory P … P P … P  Heterogeneous: CPU + GPU: OpenCL, OpenMP, SYCL, C++17/20, memory OpenACC, CUDA, hip, RocM, C++ AMP, Intrinsics, OpenGL, Vulkan, … M CUDA, DirectX M processor PCI • Heterogeneous: Multicore SMP+GPU Cluster: OpenCL, OpenMP, SYCL, C++17/20 M M … memory

 “ ” Heterogeneous: Fused CPU + GPU: OpenCL, OpenMP, SYCL, C++17/20, hip, … P … P P P RocM, Intrinsics, OpenGL, Vulkan, DirectX

processor interconnection network

… … memory P P P P

… 2828 M M © 2020 Codeplay Software Ltd. To support all the different parallel architectures

• With a single source • You really only have a code base few choices • And if you also want it to be an International Open Specification • And if you want it to be growing with the architectures

29 © 2020 Codeplay Software Ltd. Act 2

The four horsemen of heterogeneous programming

30 © 2020 Codeplay Software Ltd. Simon Mcintosh-Smith annual language citations

31 © 2020 Codeplay Software Ltd. The Reality

Performance

Productivity Portability

32 © 2020 Codeplay Software Ltd. Long Answer Yes

➢ The right programming model can allow you to express a problem in a way which adapts to different architectures

33 © 2020 Codeplay Software Ltd. Use the right abstraction now Abstraction How is it supported Cores C++11/14/17 threads, async HW threads C++11/14/17 threads, async Vectors Parallelism TS2 Atomic, Fences, lockfree, futures, counters, C++11/14/17 atomics, Concurrency TS1, transactions Transactional Memory TS1 Parallel Loops Async, TBB:parallel_invoke, C++17 parallel algorithms, for_each Heterogeneous offload, fpga OpenCL, SYCL, HSA, OpenMP/ACC, Kokkos, Raja, CUDA Distributed HPX, MPI, UPC++ Caches C++17 false sharing support Numa OpenMP/ACC? TLS ?

Exception handling in concurrent environment 34? © 2020 Codeplay Software Ltd. The Four Horsemen

Socket 0 Socket 1 Core 0 Core 1 Core 0 Core 1 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 4 1 5 2 6 3 7

35 © 2020 Codeplay Software Ltd. Cost of Data Movement • 64bit DP Op: • 20pJ • 4x64bit register read: • 50pJ • 4x64bit move 1mm: • 26pJ • 4x64bit move 40mm: • 1nJ • 4x64bit move DRAM: Credit: Bill Dally, Nvidia, • 16nJ 2010

36 © 2020 Codeplay Software Ltd. Implicit vs Explicit Data Movement Examples: Examples: • SYCL, C++ AMP • OpenCL, CUDA, OpenMP Implementation: Implementation: • Data is moved to the device implicitly via cross host CPU / • Data is moved to the device via device data structures explicit copy APIs

Here we’re using C++ AMP as an Here we’re using CUDA as an example example array_view ptr; float *h_aHere = we’re{ … }, using d_a; OpenMP as an cudaMalloc((void **)&d_a, size); extent<2> e(64, 64); example parallel_for_each(e, [=](index<2> idx) cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice); restrict(amp) { vec_add<<<64, 64>>>(a, b, c); ptr[idx] *= 2.0f; cudaMemcpy(d_a, h_a, size, }); cudaMemcpyDeviceToHost);

37 © 2020 Codeplay Software Ltd. Row-major vs column-major

int x = globalId[0]; int y = globalId[1]; int stride = 4; out[(x * stride) + y] = in[(y * stride) + x];

38 © 2020 Codeplay Software Ltd. AoS vs SoA 2x load operations struct str { float f[N]; f f f f f f f f i i i i i i i i int i[N]; }; str s;

… = s.f[globalId];

… = s.i[globalId];

39 © 2020 Codeplay Software Ltd. Socket 0 Socket 1

Core 0 Core 1 Core 0 Core 1

0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3

0 1 2 3 4 5 6 7

{ auto exec = execution::execution_context{execRes}.executor();

auto affExec = execution::require(exec, execution::bulk, execution::bulk_execution_affinity.compact);

affExec.bulk_execute([](std::size_t i, shared s) { func(i); }, 8, sharedFactory); }

40 © 2020 Codeplay Software Ltd. this_system::get_resources()

System-level Place where std::thread executes GPU resources

Package

Numa 0 Numa 1 Work Groups

Core 0 Core 1 Core 2 Core 3 Processing Elements

relativeLatency = affinity_query(core2, numa0) > affinity_query(core3, numa0)

41 © 2020 Codeplay Software Ltd. Act 3

C++, OpenCL, OpenMP, SYCL

42 © 2020 Codeplay Software Ltd. 43 © 2020 Codeplay Software Ltd. Example: SAXPY single-precision a times x plus y vector addition y = αx + y

44 © 2020 Codeplay Software Ltd. 45Introduction to Parallel Computing, University of Oregon, IPCC Serial SAXPY Implementation

45 © 2020 Codeplay Software Ltd. Kokkos OpenCL HPX OpenMP Raja SYCL Boost.Compute CUDA The Quiet Revolution

46 © 2020 Codeplay Software Ltd. Parallel/concurrency before C++11 (C++98) Asynchronus Agents Concurrent collections Mutable shared state Heterogeneous (GPUs, accelerators, FPGA, embedded AI processors) summary tasks that run operations on groups of avoid races and Dispatch/offload to other independently and things, exploit parallelism synchronizing objects in nodes (including communicate via in data and algorithm shared memory distributed) messages structures examples GUI,background printing, trees, quicksorts, locked data(99%), lock- Pipelines, reactive disk/net access compilation free libraries (wizards), programming, offload,, atomics (experts) target, dispatch key metrics responsiveness throughput, many core race free, lock free Independent forward scalability progress,, load-shared requirement isolation, messages low overhead composability Distributed, heterogeneous today's abstractions POSIX threads, win32 openmp, TBB, PPL, locks, lock hierarchies, OpenCL, CUDA threads, OpenCL, vendor OpenCL, vendor intrinsic vendor atomic instructions, intrinsic vendor intrinsic

47 © 2020 Codeplay Software Ltd. Parallel/concurrency after C++11 Asynchronus Agents Concurrent collections Mutable shared state Heterogeneous (GPUs, accelerators, FPGA, embedded AI processors) summary tasks that run independently and operations on groups of avoid races and Dispatch/offload to other communicate via messages things, exploit parallelism synchronizing objects in nodes (including distributed) in data and algorithm shared memory structures examples GUI,background printing, trees, quicksorts, locked data(99%), lock-free Pipelines, reactive disk/net access compilation libraries (wizards), atomics programming, offload,, (experts) target, dispatch key metrics responsiveness throughput, many core race free, lock free Independent forward scalability progress,, load-shared requirement isolation, messages low overhead composability Distributed, heterogeneous

today's abstractions C++11: thread,lambda function, C++11: Async, packaged C++11: locks, memory C++11: lambda TLS, Async tasks, promises, futures, model, mutex, condition atomics variable, atomics, static init/term

48 © 2020 Codeplay Software Ltd. Parallel/concurrency after C++14 Asynchronus Agents Concurrent collections Mutable shared state Heterogeneous summary tasks that run independently operations on groups of avoid races and Dispatch/offload to other and communicate via things, exploit parallelism in synchronizing objects in nodes (including distributed) messages data and algorithm shared memory structures examples GUI,background printing, trees, quicksorts, compilation locked data(99%), lock-free Pipelines, reactive disk/net access libraries (wizards), atomics programming, offload,, (experts) target, dispatch key metrics responsiveness throughput, many core race free, lock free Independent forward scalability progress,, load-shared requirement isolation, messages low overhead composability Distributed, heterogeneous today's abstractions C++11: thread,lambda C++11: Async, packaged C++11: locks, memory C++11: lambda function, TLS, async tasks, promises, futures, model, mutex, condition atomics, variable, atomics, static C++14: none C++14: generic lambda init/term,

C++ 14: shared_lock/shared_timed_ mutex, OOTA, 49 atomic_signal_fence, © 2020 Codeplay Software Ltd. Parallel/concurrency after C++17 Asynchronus Agents Concurrent collections Mutable shared state Heterogeneous (GPUs, accelerators, FPGA, embedded AI processors)

summary tasks that run independently and operations on groups avoid races and Dispatch/offload to other nodes (including communicate via messages of things, exploit synchronizing objects distributed) parallelism in data and in shared memory algorithm structures today's abstractions C++11: thread,lambda function, TLS, C++11: Async, C++11: locks, memory C++11: lambda async packaged tasks, model, mutex, promises, futures, condition variable, C++14: generic lambda C++14: generic lambda atomics, atomics, static init/term, C++17: progress guarantees, TOE, C++ 17: ParallelSTL, C++ 14: execution policies control false sharing shared_lock/shared_ti med_mutex, OOTA, atomic_signal_fence, C++ 17: scoped _lock, shared_mutex, ordering of memory models, progress guarantees, TOE, execution policies

50 © 2020 Codeplay Software Ltd. x Example:y saxpy z • Saxpy == Scalar Alpha X Plus Y – Scalar multiplication and vector addition for (int i=0; i

int start = …; Parallel int end = …; for (int t=0; t void { for (int i = start; i < end; i++) z[i] = a * x[i] + y[i]; } );

start += …; end += …; } 51 © 2020 Codeplay Software Ltd. 5 1 Demo #3: a complete example

•Matrix multiply…

Going Parallel with C++11 53 © 2020 Codeplay Software Ltd. Sequential version…

// // Naïve, triply-nested sequential solution: // for (int i = 0; i < N; i++) { for (int j = 0; j < N; j++) { C[i][j] = 0.0;

for (int k = 0; k < N; k++) C[i][j] += (A[i][k] * B[k][j]); } }

54 © 2020 Codeplay Software Ltd. 5 4 Structured ("fork-join") parallelism • A common pattern when creating multiple threads Sequentia #include l std::vector threads; fork int cores = std::thread::hardware_concurrency(); for (int i=0; i

join for (std::thread& t : threads) // new range-based for: Sequentia t.join(); l

55 © 2020 Codeplay Software Ltd. 5 5 Parallel solution

// 1 thread per core: numthreads = thread::hardware_concurrency(); int rows = N / numthreads; int extra = N % numthreads; int start = 0; // each thread does [start..end) int end = rows; vector workers; for (int t = 1; t <= numthreads; t++) { if (t == numthreads) // last thread does extra rows: end += extra;

workers.push_back( thread([start, end, N, &C, &A, &B]() { for (int i = start; i < end; i++) for (int j = 0; j < N; j++) { C[i][j] = 0.0; for (int k = 0; k < N; k++) C[i][j] += (A[i][k] * B[k][j]); } }));

start = end; end = start + rows; } for (thread& t : workers) t.join(); 56 © 2020 Codeplay Software Ltd. 5 6 C++ Directions Group: P2000

57 © 2020 Codeplay Software Ltd. P2000:Modern hardware

We need better support for modern hardware, such as executors/execution context, affinity support in C++ leading to heterogeneous/distributed computing support, ...

58 © 2020 Codeplay Software Ltd. Parallel/concurrency aiming for C++20 Asynchronus Agents Concurrent collections Mutable shared state Heterogeneous/Distributed today's C++11: thread,lambda C++11: Async, packaged tasks, C++11: locks, memory model, mutex, C++11: lambda abstractions function, TLS, async promises, futures, atomics, condition variable, atomics, static init/term, C++14: generic lambda C++ 20: Jthreads C++ 17: ParallelSTL, control false +interrupt _token, sharing C++ 14: C++17: , progress coroutines shared_lock/shared_timed_mutex, guarantees, TOE, execution C++ 20: Vec execution policy, OOTA, atomic_signal_fence, policies Algorithm un-sequenced policy, C++ 17: scoped _lock, shared_mutex, span ordering of memory models, progress C++20: atomic_ref,, span guarantees, TOE, execution policies C++20: atomic_ref, Latches and barriers, atomic Atomics & padding bits Simplified atomic init Atomic C/C++ compatibility Semaphores and waiting Fixed gaps in memory model , Improved atomic flags, Repair memory model

59 © 2020 Codeplay Software Ltd. Parallel/Concurrency beyond C++20: C++23

Asynchronus Agents Concurrent collections Mutable shared state Heterogeneous/DIstributed today's C++11: thread,lambda C++11: Async, packaged tasks, C++11: … C++17: , progress abstractions function, TLS, async promises, futures, atomics, C++ 14: … guarantees, TOE, execution C++ 17: … policies C++14: generic lambda C++ 17: ParallelSTL, control false sharing C++20: atomic_ref, Latches and C++20: atomic_ref, mdspan, C++ 20: Jthreads +interrupt barriers _token C++ 20: Vec execution policy, atomic C++23: SIMD,affinity, Algorithm un-sequenced policy Atomics & padding bits pipelines, EALS, C++23: networking, span Simplified atomic init freestanding/embedded asynchronous algorithm, Atomic C/C++ compatibility support well specified, reactive programming, Semaphores and waiting mapreduce, ML/AI, reactive EALS, async2, executors C++23: SMD,new futures, Fixed gaps in memory model , programming executors, concurrent vector,task blocks, Improved atomic flags , Repair mdspan unordered associative containers, two- memory model way executors with lazy sender- receiver models, concurrent exception C++23: hazard_pointers, handling, executors, mdspan rcu/snapshot, concurrent queues, counters, upgrade lock, TM lite, more lock-free data structures, asymmetric fences

60 © 2020 Codeplay Software Ltd. C++23: continue C++20 • Library support for coroutines • Further Conceptifying Standard Library • Further Range improvements (e.g., application of ranges to parallel algorithms and operations on containers and integration with coroutines) • A modular standard library

61 © 2020 Codeplay Software Ltd. After C++20

• Much more libraries • Machine learning support • Audio • Executors • Linear Algebra Networking • Graph data structures • • Tree Data structures • Pattern Matching • Task Graphs • Better support for C++Tooling • Differentiation ecosystem • Reflection • Light-weight transactional • Further support for locks heterogeneous programming • A new future and/or a new • Graphics async Better definition of freestanding • Statistics Library • • Array style programming • Education dependency through mdspan curriculum

62 © 2020 Codeplay Software Ltd. After C++23 • Reflection • Pattern matching • C++ ecosystem • What about Contracts?

63 © 2020 Codeplay Software Ltd. What have we achieved so far for C++20?

64 © 2020 Codeplay Software Ltd. SYCL Ecosystem ● ComputeCpp - https://codeplay.com/products/computesuite/computecpp ● triSYCL - https://github.com/triSYCL/triSYCL ● SYCL - http://sycl.tech ● SYCL ParallelSTL - https://github.com/KhronosGroup/SyclParallelSTL ● VisionCpp - https://github.com/codeplaysoftware/visioncpp ● SYCL-BLAS - https://github.com/codeplaysoftware/sycl-blas ● TensorFlow-SYCL - https://github.com/codeplaysoftware/tensorflow ● Eigen http://eigen.tuxfamily.org

65 © 2020 Codeplay Software Ltd. Eigen Linear Algebra Library SYCL backend in mainline Focused on Tensor support, providing support for machine learning/CNNs Equivalent coverage to CUDA Working on optimization for various hardware architectures (CPU, desktop and mobile GPUs) https://bitbucket.org/eigen/eigen/

66 © 2020 Codeplay Software Ltd. TensorFlow SYCL backend support for all major CNN operations Complete coverage for major image recognition networks GoogLeNet, Inception-v2, Inception-v3, ResNet, …. Ongoing work to reach 100% operator coverage and optimization for various hardware architectures (CPU, desktop and mobile GPUs) https://github.com/tensorflow/tensorflow

TensorFlow, the TensorFlow logo and any related marks are trademarks of Google Inc.

67 © 2020 Codeplay Software Ltd. SYCL Ecosystem • Single-source heterogeneous programming using STANDARD C++ - Use C++ templates and lambda functions for host & device code - Layered over OpenCL • Fast and powerful path for bring C++ apps and libraries to OpenCL - C++ Kernel Fusion - better performance on complex software than hand-coding - Halide, Eigen, Boost.Compute, SYCLBLAS, SYCL Eigen, SYCL TensorFlow, SYCL GTX - Clang, triSYCL, ComputeCpp, VisionCpp, ComputeCpp SDK … • More information at http://sycl.tech

Developer Choice The development of the two specifications are aligned so code can be easily shared between the two approaches

C++ Kernel Language Single-source C++ Low Level Control Programmer Familiarity ‘GPGPU’-style separation of Approach also taken by device-side kernel source C++ AMP and OpenMP code and host code

68 © 2020 Codeplay Software Ltd. Codeplay Standards Open Presentati bodies Research source ons Company

• HSA Foundation: Chair of • Members of EU research • HSA LLDB Debugger • Building an LLVM back-end • Based in Edinburgh, Scotland software group, spec editor of consortiums: PEPPHER, • SPIR-V tools • Creating an SPMD Vectorizer for • 57 staff, mostly engineering runtime and debugging LPGPU, LPGPU2, CARP • RenderScript debugger in AOSP OpenCL with LLVM • License and customize • Khronos: chair & spec editor of • Sponsorship of PhDs and EngDs • LLDB for Qualcomm Hexagon • Challenges of Mixed-Width technologies for semiconductor SYCL. Contributors to OpenCL, for heterogeneous programming: • TensorFlow for OpenCL Vector Code Gen & Scheduling companies Safety Critical, Vulkan HSA, FPGAs, ray-tracing in LLVM • C++ 17 Parallel STL for SYCL • ComputeAorta and • ISO C++: Chair of Low Latency, • Collaborations with academics • C++ on Accelerators: Supporting ComputeCpp: implementations • VisionCpp: C++ performance- Embedded WG; Editor of SG1 • Members of HiPEAC Single-Source SYCL and HSA of OpenCL, Vulkan and SYCL Concurrency TS portable programming model for vision • LLDB Tutorial: Adding debugger • 15+ years of experience in • EEMBC: members support for your target heterogeneous systems tools

Codeplay build the software platforms that deliver massive performance

69 © 2020 Codeplay Software Ltd. What our ComputeCpp users say about us

Benoit Steiner – Google WIGNER Research Centre ONERA Hartmut Kaiser - HPX TensorFlow engineer for Physics

“We at Google have been working “We work with royalty-free SYCL “My team and I are working with It was a great pleasure this week for closely with Luke and his Codeplay because it is hardware vendor Codeplay's ComputeCpp for almost a us, that Codeplay released the colleagues on this project for almost agnostic, single-source C++ year now and they have resolved ComputeCpp project for the wider 12 months now. Codeplay's programming model without platform every issue in a timely manner, while audience. We've been waiting for this contribution to this effort has been specific keywords. This will allow us to demonstrating that this technology can moment and keeping our colleagues tremendous, so we felt that we should easily work with any heterogeneous work with the most complex C++ and students in constant rally and let them take the lead when it comes processor solutions using OpenCL to template code. I am happy to say that excitement. We'd like to build on this down to communicating updates develop our complex algorithms and the combination of Codeplay's SYCL opportunity to increase the awareness related to OpenCL. … we are ensure future compatibility” implementation with our HPX runtime of this technology by providing sample planning to merge the work that has system has turned out to be a very codes and talks to potential users. been done so far… we want to put capable basis for Building a We're going to give a lecture series on together a comprehensive test Heterogeneous Computing Model for modern scientific programming infrastructure” the C++ Standard using high-level providing field specific examples.“ abstractions.”

70 © 2020 Codeplay Software Ltd. Further information

• OpenCL https://www.khronos.org/opencl/ • OpenVX https://www.khronos.org/openvx/ • HSA http://www.hsafoundation.com/ • SYCL http://sycl.tech • OpenCV http://opencv.org/ • Halide http://halide-lang.org/ • VisionCpp https://github.com/codeplaysoftware/visioncpp

71 © 2020 Codeplay Software Ltd. Community Edition Available now for free!

Visit: computecpp.codeplay.com

72 © 2020 Codeplay Software Ltd. • Open source SYCL projects: • ComputeCpp SDK - Collection of sample code and integration tools • SYCL ParallelSTL – SYCL based implementation of the parallel algorithms • VisionCpp – Compile-time embedded DSL for image processing • Eigen C++ Template Library – Compile-time library for machine learning All of this and more at: http://sycl.tech

73 © 2020 Codeplay Software Ltd. Thanks

[email protected] @codeplaysoft codeplay.com m