The Landscape of Modern Parallel Programming Using Open

The landscape of Parallel Programming Models Part 1: Performance, Portability & Productivity Michael Wong Codeplay Software Ltd. Distinguished Engineer IXPUG 2020 2 © 2020 Codeplay Software Ltd. Distinguished Engineer Michael Wong ● Chair of SYCL Heterogeneous Programming Language ● C++ Directions Group ● ISOCPP.org Director, VP http://isocpp.org/wiki/faq/wg21#michael-wong ● [email protected] ● [email protected] Ported ● Head of Delegation for C++ Standard for Canada Build LLVM- TensorFlow to based ● Chair of Programming Languages for Standards open compilers for Council of Canada standards accelerators Chair of WG21 SG19 Machine Learning using SYCL Chair of WG21 SG14 Games Dev/Low Latency/Financial Trading/Embedded Implement Releasing open- ● Editor: C++ SG5 Transactional Memory Technical source, open- OpenCL and Specification standards based AI SYCL for acceleration tools: ● Editor: C++ SG1 Concurrency Technical Specification SYCL-BLAS, SYCL-ML, accelerator ● MISRA C++ and AUTOSAR VisionCpp processors ● Chair of Standards Council Canada TC22/SC32 Electrical and electronic components (SOTIF) ● Chair of UL4600 Object Tracking ● http://wongmichael.com/about We build GPU compilers for semiconductor companies ● C++11 book in Chinese: Now working to make AI/ML heterogeneous acceleration safe for https://www.amazon.cn/dp/B00ETOV2OQ autonomous vehicle 3 © 2020 Codeplay Software Ltd. Acknowledgement and Disclaimer Numerous people internal and external to the original C++/Khronos group/OpenMP, in industry and academia, have made contributions, influenced ideas, written part of this presentations, and offered feedbacks to form part of this talk. These in clude Bjarne Stroustrup, Joe Hummel, Botond Ballo, Simon Mcintosh-Smith, as well as many others. But I claim all credit for errors, and stupid mistakes. These are mine, all mine! You can’t have them. 4 © 2020 Codeplay Software Ltd. Legal Disclaimer THIS WORK REPRESENTS THE VIEW OF THE OTHER COMPANY, PRODUCT, AND SERVICE AUTHOR AND DOES NOT NECESSARILY NAMES MAY BE TRADEMARKS OR SERVICE REPRESENT THE VIEW OF CODEPLAY. MARKS OF OTHERS. 5 © 2020 Codeplay Software Ltd. Disclaimers NVIDIA, the NVIDIA logo and CUDA are trademarks and/or registered trademarks of NVIDIA Corporation in the U.S. and/or other countries Codeplay is not associated with NVIDIA for this work and it is purely using public documentation and widely available code 6 © 2020 Codeplay Software Ltd. 3 Act Play 1. Performance, Portability, Productivity 2. The four horsemen of heterogeneous programming 3. C++, OpenCL, OpenMP, SYCL 7 © 2020 Codeplay Software Ltd. Act 1 Performance, Portability, Productivity 8 © 2020 Codeplay Software Ltd. Isn’t Parallel/Concurrent/Heterogeneous Programming Hard? Houston, we have a software crisis! 9 © 2020 Codeplay Software Ltd. So What are the Goals? Goals of Parallel Programming over and above sequential programming 1. Performance 2. Productivity Portability 3. Portability Performance The Iron Triangle of Parallel Programming language nirvana Productivity 10 © 2020 Codeplay Software Ltd. Why does productivity make the list? 11 © 2020 Codeplay Software Ltd. Performance • Broadly includes scalability and efficiency • If not for performance why not just write sequential program? • parallel programming is primarily a performance optimization, and, as such, it is one potential optimization of many. © 2020 Codeplay Software Ltd. Diagram thanks to Paul McKenney12 Are there no cases where parallel programming is about something other than performance? 13 © 2020 Codeplay Software Ltd. Productivity Perhaps at one time, the sole purpose of parallel software was performance. Now, however, productivity is gaining the spotlight. © 2020 Codeplay Software Ltd. 14 Diagram thanks to Paul McKenney Given how cheap parallel systems have become, how can anyone afford to pay people to program them? 15 © 2020 Codeplay Software Ltd. Iron Triangle of Parallel Programming Language Nirvana Portability Performance Productivity 16 © 2020 Codeplay Software Ltd. Performance Portability Productivity OpenCL OpenMP Portability CUDA SYCL Iron Triangle of Parallel Programming Nirvana is about making engineering tradeoffs 17 © 2020 Codeplay Software Ltd. Concurrency vs Parallelism What makes parallel or concurrent programming harder than serial programming? What’s the difference? How much of this is simply a new mindset one has to adopt? 18 © 2020 Codeplay Software Ltd. Which one to choose? 19 © 2020 Codeplay Software Ltd. 20 © 2020 Codeplay Software Ltd. Heterogeneous Devices Accelerator CPU GPU DSP APU FPGA 21 © 2020 Codeplay Software Ltd. Hot Chips 22 © 2020 Codeplay Software Ltd. Fundamental Parallel Architecture Types • Uniprocessor • Shared Memory • Scalar processor Multiprocessor (SMP) • Shared memory address processor space • Bus-based memory system memory • Single Instruction processor … processor Multiple Data (SIMD) bus memory processor … • Interconnection network memory processor … processor • Single Instruction network Multiple Thread (SIMT) … memory 2323 © 2020 Codeplay Software Ltd. SIMD vs SIMT 24 © 2020 Codeplay Software Ltd. Distributed and network Parallel Architecture Types • Distributed Memory • Cluster of SMPs Multiprocessor • Shared memory • Message passing addressing within SMP between nodes node • Message passing between memory memory SMP nodes … M M processor processor … … P … P P P interconnection network network interface interconnection network processor processor … P … P P … P memory memory … • Massively Parallel Processor M M (MPP) • Can also be regarded as Introduction to Parallel • Many, many processors MPP if processor number Computing, University of is large Oregon, IPCC 2525 © 2020 Codeplay Software Ltd. Modern Parallel Architecture Multicore Manycore • Heterogeneous: CPU+Manycore CPU M M Manycore vs Multicore CPU … … P … P P P C C C C C C C C C C C C cores can be m m m m m m m m m m m m hardware network multithreaded interface C C C C C C C C (hyperthread) interconnection network m m m m m m m m memory P … P P … P Heterogeneous: CPU + GPU memory … M M processor Heterogeneous: Multicore SMP+GPU Cluster PCI • M M … memory … P … P P P Heterogeneous: “Fused” CPU + GPU processor interconnection network … … memory P P P P … 2626 M M © 2020 Codeplay Software Ltd. Modern Parallel Programming model • Heterogeneous: CPU+Manycore CPU: OpenCL, OpenMP, SYCL, Multicore Manycore C++11/14/17/20, TBB, Cilk, pthread M M Manycore vs Multicore CPU: OpenCL, OpenMP, SYCL, … C++11/14/17/20, TBB, Cilk, pthread … P … P P P C C C C C C C C C C C C cores can be m m m m m m m m m m m m hardware network multithreaded interface C C C C C C C C (hyperthread) interconnection network m m m m m m m m memory P … P P … P Heterogeneous: CPU + GPU: OpenCL, OpenMP, SYCL, C++17/20, memory OpenACC, CUDA, hip, RocM, C++ AMP, Intrinsics, OpenGL, Vulkan, … M CUDA, DirectX M processor PCI • Heterogeneous: Multicore SMP+GPU Cluster: OpenCL, OpenMP, SYCL, C++17/20 M M … memory “ ” Heterogeneous: Fused CPU + GPU: OpenCL, OpenMP, SYCL, C++17/20, hip, … P … P P P RocM, Intrinsics, OpenGL, Vulkan, DirectX processor interconnection network … … memory P P P P … 2727 M M © 2020 Codeplay Software Ltd. Which Programming model works on all the Architectures?Is there a pattern? • Heterogeneous: CPU+Manycore CPU: OpenCL, OpenMP, SYCL, Multicore Manycore C++11/14/17/20, TBB, Cilk, pthread M M Manycore vs Multicore CPU: OpenCL, OpenMP, SYCL, … C++11/14/17/20, TBB, Cilk, pthread … P … P P P C C C C C C C C C C C C cores can be m m m m m m m m m m m m hardware network multithreaded interface C C C C C C C C (hyperthread) interconnection network m m m m m m m m memory P … P P … P Heterogeneous: CPU + GPU: OpenCL, OpenMP, SYCL, C++17/20, memory OpenACC, CUDA, hip, RocM, C++ AMP, Intrinsics, OpenGL, Vulkan, … M CUDA, DirectX M processor PCI • Heterogeneous: Multicore SMP+GPU Cluster: OpenCL, OpenMP, SYCL, C++17/20 M M … memory “ ” Heterogeneous: Fused CPU + GPU: OpenCL, OpenMP, SYCL, C++17/20, hip, … P … P P P RocM, Intrinsics, OpenGL, Vulkan, DirectX processor interconnection network … … memory P P P P … 2828 M M © 2020 Codeplay Software Ltd. To support all the different parallel architectures • With a single source • You really only have a code base few choices • And if you also want it to be an International Open Specification • And if you want it to be growing with the architectures 29 © 2020 Codeplay Software Ltd. Act 2 The four horsemen of heterogeneous programming 30 © 2020 Codeplay Software Ltd. Simon Mcintosh-Smith annual language citations 31 © 2020 Codeplay Software Ltd. The Reality Performance Productivity Portability 32 © 2020 Codeplay Software Ltd. Long Answer Yes ➢ The right programming model can allow you to express a problem in a way which adapts to different architectures 33 © 2020 Codeplay Software Ltd. Use the right abstraction now Abstraction How is it supported Cores C++11/14/17 threads, async HW threads C++11/14/17 threads, async Vectors Parallelism TS2 Atomic, Fences, lockfree, futures, counters, C++11/14/17 atomics, Concurrency TS1, transactions Transactional Memory TS1 Parallel Loops Async, TBB:parallel_invoke, C++17 parallel algorithms, for_each Heterogeneous offload, fpga OpenCL, SYCL, HSA, OpenMP/ACC, Kokkos, Raja, CUDA Distributed HPX, MPI, UPC++ Caches C++17 false sharing support Numa OpenMP/ACC? TLS ? Exception handling in concurrent environment 34? © 2020 Codeplay Software Ltd. The Four Horsemen Socket 0 Socket 1 Core 0 Core 1 Core 0 Core 1 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 4 1 5 2 6 3 7 35 © 2020

Load more