Cross-Platform Performance Portability Using Highly Parametrized SYCL Kernels

John Lawson, Mehdi Goli, Duncan McBain, Daniel Soutar and Louis Sugy
Codeplay Software Ltd.

ABSTRACT
Over recent years heterogeneous systems have become more prevalent across HPC systems, with over 100 systems in the TOP500 incorporating GPUs or other accelerators. These hardware platforms have different performance characteristics and optimization requirements. In order to make the most of multiple accelerators a developer has to provide implementations of their algorithms tuned for each device. Hardware vendors provide libraries targeting their devices specifically, which provide good performance but frequently have different API designs, hampering portability.
The SYCL programming model allows users to write heterogeneous programs using completely standard C++, and so developers have access to the power of C++ templates when developing compute kernels. In this paper we show that by writing highly parameterized kernels for matrix multiplies and convolutions we achieve performance competitive with vendor implementations across different architectures. Furthermore, tuning for new devices amounts to choosing the combinations of kernel parameters that perform best on the hardware.

CCS CONCEPTS
• Mathematics of computing → Mathematical software performance; • Computing methodologies → Parallel algorithms; Parallel programming languages; Neural networks; Machine learning algorithms.

KEYWORDS
SYCL, OpenCL, portability, GPGPU, performance

1 INTRODUCTION
The power and speed of modern GPUs coupled with their increased programmability through programming models such as SYCL [12], OpenCL [34] and CUDA [10] have enabled many applications to achieve previously insurmountable performance. This has been one of the main driving factors behind the rise in machine learning since 2012. In the TOP500 [15], 3 of the top 5 supercomputers in the world in terms of pure performance utilize GPUs, as do the top 4 supercomputers in the Green500 [4], where flops per watt are considered.
Many hardware vendors provide highly specialized libraries for their platforms to extract optimal performance from their hardware. As more companies and computing centers move to support larger heterogeneous systems with more varied hardware, users have to write applications which handle multiple different libraries for each available accelerator. This adds complexity both for developers and system maintainers.

2 RELATED WORKS
Highly specialized libraries—such as ARM Compute [2], cuBLAS [8], cuDNN [19], CUTLASS [27], MIOpen [5], Intel MKL [35] and MKL-DNN [6]—provide optimal performance on the targeted hardware; however, each of these libraries is typically restricted to that vendor's hardware.
A common way of achieving performance across various platforms is to specialize certain routines in a library or framework targeting a particular architecture. There are various high performance vendor-specific libraries tuned for particular architectures. For ARM devices, the ARM Compute Library [2] provides a set of low-level, optimized primitives for machine learning and linear algebra.

For NVIDIA devices, cuBLAS [8] and cuDNN [19] provide highly optimized implementations of standard linear algebra subroutines and deep neural network routines, respectively. NVIDIA also provides the higher level CUTLASS [27], which takes advantage of CUDA C++ template metaprogramming to deliver high-performance GEMM computations, allowing developers to specialize the matrix multiplies inside their applications.
Similarly AMD provides MIOpen [5], a deep learning acceleration framework developed to deliver highly optimized DNN and GEMM routines for AMD GPUs. For their multi-core and many-core systems, Intel provides MKL [35] for linear algebra routines, and MKL-DNN [6] as an open source, high performance library for accelerating deep learning frameworks.

Each of these libraries is optimized specifically for its target devices, but does not provide performance portability. The limitations of these libraries highlight two key factors in providing it: the parallel programming framework used to develop the library, and the performance metrics of the devices.

2.1 Parallel programming frameworks
There are many different frameworks available to provide access to hardware parallelism, with different approaches and programming styles. Most are based on C or C++, with variations on the language to allow a developer to specify how tasks can be parallelized.
The Intel Threading Building Blocks (TBB) library [31] is a C++ library for parallel programming on multi-core processors. TBB only provides parallelism on CPUs, so applications developed on top of TBB are not portable to GPUs or other accelerated platforms.
On the other hand, CUDA [10] is used by NVIDIA as a low level framework to deliver performance on NVIDIA GPUs and accelerators, but not CPUs. Although CUDA C++ supports the template metaprogramming features which are necessary for performance portability, the lack of CUDA support on non-NVIDIA architectures prevents more widespread adoption.
A cross-platform framework like OpenCL [34] can be supported by various vendors and platforms, however the low level interface of OpenCL hampers development of easily tunable kernels that enable performance portability. Although OpenCL 2.1 supports C++ templates within kernels, the kernel interface itself cannot be templated, which limits the use of this feature.
SYCL [12, 32] is an open standard maintained by the Khronos Group, a collaboration of companies, universities and individuals dedicated to developing open standards, known for maintaining OpenCL, Vulkan, OpenGL and many others alongside SYCL. The SYCL standard is a royalty-free, cross-platform abstraction layer that enables developers to utilize heterogeneous systems with varied hardware while writing completely standard C++.
SYCL provides a highly parallel programming model designed to abstract away much of the complexity of traditional parallel programming models like OpenCL. Using SYCL instead of a lower level interface like OpenCL allows the developer to write code using modern C++ language features, even within their accelerated kernels. This includes the use of templates and specializations, as well as the metaprogramming that these enable, instead of the heavy use of the preprocessor typical in OpenCL applications to provide specializations within kernels for data types, tile sizes and other constants. By allowing SYCL kernels to fully utilize C++, a developer can instead use templates to provide these compile time constants, as well as easily create many specialized kernels with very little additional code.
Combining the portability of OpenCL with modern C++ template metaprogramming makes SYCL a strong parallel programming framework for performance portability across different platforms.
ComputeCpp [3], developed by Codeplay Software, is currently the only fully conformant implementation of the SYCL standard, supporting devices from AMD, Intel, NVIDIA and others. Several open-source libraries are compatible with ComputeCpp, including SYCL-DNN [13], VisionCpp [16, 20], SYCL-PRNG [14], as well as contributed parts of Eigen [21, 24] and TensorFlow [17, 22, 23].

2.2 Performance Metrics
Tuning the performance of accelerated compute kernels is not a trivial task, requiring expertise in both parallel algorithms and hardware architectures. Having said that, many hardware accelerators share common factors affecting performance, allowing common algorithmic techniques to improve performance across devices.
We have studied common hardware-related optimization improvements on the following devices: Intel CPUs and GPUs, ARM Mali GPUs, AMD GPUs and the Renesas V3H and V3M accelerators. These metrics are used to develop parametric convolution and matrix multiply algorithms, two of the most computationally intense operations used in neural networks.

2.2.1 Thread reusability. The number of individual compute units available in hardware varies significantly, from 2 in embedded systems to 84 in desktop GPUs such as the NVIDIA Tesla series. Depending on the number of compute units available, the size of the workload per thread can be parametrized in order to achieve maximal device occupancy. However, there is a trade-off between increasing the number of work-groups and ensuring a sufficient workload per thread: for a given problem, creating more work-groups or more threads will naturally decrease the amount of computation that any one thread has to do.

2.2.2 Memory transactions. In most accelerators, when a set of threads within a work-group loads data, an entire block of data called a cache line is fetched from memory. This is based on the assumption that threads within a work-group access contiguous elements of the data in the fetched block. When the threads within a work-group access the same address, permutation, masking or predication has no effect on the loaded block of data, which can lead to more optimizations in the compiler. Moreover, loading a block of data in this way reduces the number of memory transactions, improving performance [9].

2.2.3 Data reusability. Memory accesses tend to be the bottleneck in GPU programming, despite graphics memory typically being much faster than standard DDR memory. As a result, many optimizations to GPU kernels involve maximizing data reuse to minimize the amount of data that needs to be fetched from memory. Hardware devices provide a memory hierarchy in order to reduce the memory access bottleneck, which in OpenCL terms is categorized into global, local and private memory regions, each of which can be controlled by the programmer.
Common techniques to maximize data reusability include utilizing local memory as a programmable cache to allow a work-group to reuse data loaded from global memory, and using blocks of private memory to allow single threads to reuse data. Private memory usually maps to hardware registers and so can be accessed incredibly quickly. Provided there are sufficient registers, the more data we can reuse in registers, the better the performance.
Local memory access speeds are usually equivalent to caches. Nowadays many hardware vendors (e.g. Intel and ARM) provide large caches in front of their global memory and so can achieve similar performance to programmer-controlled local memory when accessing data in a coalesced manner. Therefore, some embedded devices like ARM's Mali G-71 GPU do not dedicate any memory to local memory, instead relying on the cache and global memory. For such devices using local memory can be costly and can negatively affect performance.
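As a minimal illustration of this trade-off, the sketch below (not taken from SYCL-BLAS or SYCL-DNN; the kernel and function names are purely illustrative) uses a compile-time template parameter to switch a simple three-point stencil between staging data in local memory and relying on the hardware cache, the choice discussed above for devices such as the Mali G-71. It assumes the problem size is a multiple of the work-group size.

#include <CL/sycl.hpp>
namespace sycl = cl::sycl;

template <bool UseLocal, int WgSize>
class StencilKernel;

template <bool UseLocal, int WgSize = 64>
void run_stencil(sycl::queue& q, sycl::buffer<float, 1>& in,
                 sycl::buffer<float, 1>& out, size_t n) {
  // Assumes n is a multiple of WgSize.
  q.submit([&](sycl::handler& cgh) {
    auto src = in.get_access<sycl::access::mode::read>(cgh);
    auto dst = out.get_access<sycl::access::mode::write>(cgh);
    // Local tile holds the work-group's elements plus a halo of one on each side.
    sycl::accessor<float, 1, sycl::access::mode::read_write,
                   sycl::access::target::local>
        tile(sycl::range<1>(WgSize + 2), cgh);
    cgh.parallel_for<StencilKernel<UseLocal, WgSize>>(
        sycl::nd_range<1>(sycl::range<1>(n), sycl::range<1>(WgSize)),
        [=](sycl::nd_item<1> item) {
          const size_t g = item.get_global_id(0);
          const size_t l = item.get_local_id(0);
          auto load = [&](size_t i) { return (i < n) ? src[i] : 0.f; };
          if (UseLocal) {
            // Stage the tile (and halo) in local memory so neighbouring
            // work-items reuse each element instead of re-reading global memory.
            tile[l + 1] = load(g);
            if (l == 0) tile[0] = (g > 0) ? src[g - 1] : 0.f;
            if (l == WgSize - 1) tile[WgSize + 1] = load(g + 1);
            item.barrier(sycl::access::fence_space::local_space);
            dst[g] = tile[l] + tile[l + 1] + tile[l + 2];
          } else {
            // Rely on the hardware cache, as on devices without dedicated
            // local memory such as the Mali G-71.
            dst[g] = ((g > 0) ? src[g - 1] : 0.f) + load(g) + load(g + 1);
          }
        });
  });
}

Because UseLocal is a template parameter, both variants are separate compile-time specializations, so the unused path costs nothing on the device where it is not selected.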

2.2.4 Vectorization. Many GPU architectures have memory controllers which are designed to load and store multiple elements at once, as typical graphics workloads involve 4-element vectors. To get the best performance from these load-store units, computations should make use of vector loads and stores. Another benefit of using vectors in a GPU kernel, even if the hardware does not support vector computations, is that each vector element can be computed separately, and so there is increased instruction level parallelism.
Table 1 summarizes the hardware features required for abstracting performance metrics for the convolution and matrix multiply algorithms.

Table 1: Performance metrics of various compute devices, as discussed in Section 2.2.

Device name              Cache line  Local memory  Compute units
Intel Core i7-6700K CPU  64 bytes    None          8
Intel Core i7-6700K GPU  64 bytes    64 KiB        24
ARM Mali G71 GPU         64 bytes    None          8
Renesas V3M              128 bytes   447 KiB       2
Renesas V3H              128 bytes   409 KiB       5
AMD R9 Nano              128 bytes   32 KiB        64

2.3 Research Niche
In this paper, we have designed and developed parametric algorithms for matrix multiplication and convolution—two of the most computationally intense operations in neural networks—which can be tuned according to the performance metrics of various devices. Written on top of SYCL, these implementations abstract common performance metrics, enabling performance portability across different platforms.

3 SYCL-BLAS
SYCL-BLAS [11] is an open source implementation of netlib BLAS [7] which enables performance portability across a range of accelerators.
SYCL-BLAS uses an expression tree design to deliver the BLAS routines; most of the BLAS Level 1 and BLAS Level 2 routines are memory-bound operations, so such an expression tree based approach allows multiple operations to be fused into a single kernel with a higher computational intensity. Increasing the computational intensity of memory-bound applications can significantly increase performance by reducing the number of accesses to the device's global memory [18].
The general matrix multiply operation (GEMM) is at the heart of the BLAS routines, used in a wide range of scientific domains from physics to machine learning. It is one of the most computationally expensive operations in neural networks, and so optimizing GEMM can improve a DNN's performance.

3.1 GEMM
The General Matrix-Matrix product is a BLAS operation defined as:
C = α × OPa(A) × OPb(B) + β × C,
where A, B and C are column-major matrices, and OPa and OPb are either identity or transpose operators (or conjugate transpose for complex data types).
We refer to M and N as the number of rows and columns of C, respectively, and K as the contracting dimension of A and B. A naive parallelization approach on massively parallel architectures is to assign one value of the output C per thread: each thread accumulates the dot product of the i-th row of OPa(A) with the j-th column of OPb(B) in a local register r, and updates the result as C(i, j) = αr + βC(i, j). However, this approach is highly memory-bound, as each thread is required to load 2 × K data elements while performing roughly 2 × K floating point operations.
A standard approach used to improve the performance of dense linear algebra operations on modern hardware is to reduce the memory latency by increasing data reusability through a tiling technique [25, 29].

3.1.1 Blocked GEMM. Given two matrices A of size M × K and B of size K × N, and partitions of [0, M], [0, K] and [0, N]

0 = M0 < ··· < Mm = M
0 = K0 < ··· < Kk = K        (1)
0 = N0 < ··· < Nn = N

then the following holds:

(A × B)(Mi−1 : Mi, Nj−1 : Nj) = A(Mi−1 : Mi, K0 : K1) × B(K0 : K1, Nj−1 : Nj) + ··· + A(Mi−1 : Mi, Kk−1 : Kk) × B(Kk−1 : Kk, Nj−1 : Nj).

Thus, block matrices can be multiplied in the same way as regular matrices, by replacing each scalar multiply with a matrix multiply, and each scalar addition with a matrix addition.
A special case of the above property allows us to partition matrices A′ and B′ into panels, and matrix C into blocks, where the partitioning of C dictates the partitioning of A′ and B′:

A′ = (A1, …, AM)ᵀ,   B′ = (B1, …, BN),   C = (Cij) for i = 1, …, M and j = 1, …, N.        (2)

Then each thread can compute a panel in the matrix multiply:
Cij = α × Ai × Bj + β × Cij.
Each panel can then itself be broken down further into tiles, and we recursively build tiles of smaller matrices in the different levels of the memory hierarchy.
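The sketch below shows one level of this blocking in SYCL. It is a minimal illustration rather than the SYCL-BLAS kernel: row-major storage and C = A × B (α = 1, β = 0) are assumed for brevity, each work-item accumulates a single element of C in a register, and the additional register-level tiling used by SYCL-BLAS is omitted. M, N and K are assumed to be multiples of the tile size.

#include <CL/sycl.hpp>
namespace sycl = cl::sycl;

template <int Tile>
class TiledGemm;

template <int Tile>
void gemm(sycl::queue& q, sycl::buffer<float, 2>& A, sycl::buffer<float, 2>& B,
          sycl::buffer<float, 2>& C, size_t M, size_t N, size_t K) {
  q.submit([&](sycl::handler& cgh) {
    auto a = A.get_access<sycl::access::mode::read>(cgh);
    auto b = B.get_access<sycl::access::mode::read>(cgh);
    auto c = C.get_access<sycl::access::mode::write>(cgh);
    // Work-group staging tiles of A and B in local memory.
    sycl::accessor<float, 2, sycl::access::mode::read_write,
                   sycl::access::target::local>
        tileA(sycl::range<2>(Tile, Tile), cgh);
    sycl::accessor<float, 2, sycl::access::mode::read_write,
                   sycl::access::target::local>
        tileB(sycl::range<2>(Tile, Tile), cgh);
    cgh.parallel_for<TiledGemm<Tile>>(
        sycl::nd_range<2>(sycl::range<2>(M, N), sycl::range<2>(Tile, Tile)),
        [=](sycl::nd_item<2> item) {
          const size_t row = item.get_global_id(0);
          const size_t col = item.get_global_id(1);
          const size_t lr = item.get_local_id(0);
          const size_t lc = item.get_local_id(1);
          float acc = 0.f;  // the C(row, col) element stays in a register
          for (size_t kb = 0; kb < K; kb += Tile) {
            // Cooperatively load one tile of A and one tile of B.
            tileA[lr][lc] = a[row][kb + lc];
            tileB[lr][lc] = b[kb + lr][col];
            item.barrier(sycl::access::fence_space::local_space);
            for (int k = 0; k < Tile; ++k) {
              acc += tileA[lr][k] * tileB[k][lc];
            }
            item.barrier(sycl::access::fence_space::local_space);
          }
          c[row][col] = acc;
        });
  });
}

Because the tile size is a template parameter, tuning for a new device amounts to instantiating the kernel with a different value rather than rewriting it.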

To illustrate this tile-based approach, suppose we are performing an 8 × 8 floating point matrix multiply on a device with 128 bytes of local memory and a maximum of 12 float registers available without spilling. Further suppose that 4 work-groups with 4 threads in each work-group are created. Figure 1a represents the A′ and B′ panels and the C blocks in global memory. The tile sizes are different in the different levels of the memory hierarchy due to memory size limitations: the faster the memory, the smaller its size, and therefore the smaller the tile size. Thus multiple iterations are required to load all the data needed for the computation.
Since the total number of threads is 16, each thread computes 4 output elements. At first, all threads of a work-group load the first tile of the A′ and B′ panels (the yellow squares in Figure 1a) into local memory in a coalesced way. Given the size of the local memory, each input matrix must be broken into two tiles (represented by the yellow and orange colors in Figure 1a), where one tile per matrix can be loaded into local memory at once (Figure 1b). Similarly, each tile of the A′ and B′ panels in local memory is broken into 7 tiles of A′′ and B′′ to compute the matrix multiply without register spill (Figure 1c and Figure 1d).
Moreover, there are different ways of loading tiles from local memory into registers. Figure 1c loads the local memory tiles into 7 separate register tiles, which enables coalesced writes to global memory. In Figure 1d, although the number of tiles is reduced by using more registers, each thread writes a block of 2 consecutive elements, so the memory writes are not coalesced.

Figure 1: Tile-based matrix multiply on blocks: (a) global memory, (b) local memory, (c) private memory, (d) private memory.

Choosing between the two above tile configurations at the private memory level is platform specific. While on a SIMD platform coalesced reads and writes have a significant effect on performance, on non-SIMD devices (e.g. CPUs) block data accesses and a reduced tile count provide better performance. Providing a parametric GEMM enables a programmer to tune the operation for different platforms by choosing different configurations without re-implementing the kernel.

3.1.2 Data reuse in a GEMM operation. Assume an m′ × n′ sub-matrix Cij is already in registers. The computation of a single block across k′ elements requires reading m′k′ + k′n′ data entries from global memory, and requires 2m′n′k′ floating point operations. Overall, 2Km′n′ floating point operations are needed to compute Cij, and m′n′ + m′K + n′K data elements are required to be loaded from global memory.
The panels Ai and Bj are read in blocks of size k′, due to memory and register limitations. Each block of Ai is then multiplied with the appropriate block of Bj and accumulated into Cij. In this way Cij is stored in registers during the entire operation and only written to global memory at the end.
The amount of data reuse obtained through this tiled approach is therefore given by:

2m′n′k′ / (m′k′ + k′n′) = 2m′n′ / (m′ + n′).        (3)

Since the goal of the blocking is to improve the amount of data reuse, the block sizes should be chosen to maximize this quantity. Equation 3 shows that increasing k′ does not increase the amount of data reuse, but does increase the amount of register/local memory/cache space required to store the matrices; for the tiles in private memory, k′ = 1 seems to be the best choice. Increasing m′ and n′ increases data reusability, but also increases the amount of local memory/register/cache space that is required. It can easily be shown that, with an upper limit on the available amount of memory, the best reuse is obtained if m′ = n′, i.e. if the blocks Cij are chosen to be square.
When supported, using local memory will significantly improve the performance when A is transposed or B is not transposed. The other two possibilities may not require explicit copies from global to local memory, as assigning items to elements of the block in the correct order yields a perfectly coalesced read of these matrices from memory. Such temporal locality ensures that the data is already cached when a different item of the same work-group requests the same data entity. However, moving the data explicitly to local memory can improve performance when the hardware cache unit is slower than the local memory unit.
Software pre-fetching, or double buffering, is a well-known technique to overcome the latency of off-chip memory accesses. By doubling the size of the local memory, it is possible to load the next tile while computing the current tile, hiding the latency of the loads behind computation.

4 SYCL-DNN
SYCL-DNN [13] is an open-source library which provides optimized routines to accelerate neural network operations. It is built using the SYCL programming model to enable cross platform support, and so works on any accelerator supported by the user's chosen SYCL implementation.
Modern image recognition networks mainly involve many convolutions and matrix multiplies, which make up the majority of the runtime of these networks. The matrix multiplies can be supplied by a BLAS implementation like SYCL-BLAS. The purpose of SYCL-DNN is to provide the convolutions and other similar routines which are required for these networks.

4.1 Convolutions
Convolutions are typically the most compute intensive operation in these networks, and so they are the most optimized operations in the library. To optimize this operation, SYCL-DNN provides implementations of a number of algorithms which compute a 2D convolution. These different algorithms have different performance characteristics and memory requirements, and so behave differently for different tensor sizes and on different devices.
In neural networks, a 2D convolution is an operation taking a batch of 3D input tensors and a 4D filter tensor. Each output element is the sum of products of an input value and a weight, given by sliding a small window over the input tensor.
More formally, let H be the number of rows in the input, W the number of columns, C the number of input channels, K the number of output channels and R × S the filter size. Then the input tensor I ∈ R^(H×W×C) and the filter tensor F ∈ R^(R×S×C×K) give rise to an output tensor O ∈ R^(H×W×K).
Using these layouts, the convolution can be computed naively as described in Algorithm 1. For example, with a filter size of 3 × 3, an output element at row h, column w and channel k would be computed as:
o(h,w,k) = Σ_{c=1..C} Σ_{x=1..3} Σ_{y=1..3} i^(h,w)(x,y,c) × f(x,y,c,k),
where i^(h,w) is a 3 × 3 × C slice of the input tensor centered at row h and column w.

Algorithm 1: Naive algorithm to compute a 2D convolution.
Data: input[H][W][C], filter[R][S][C][K], output[H][W][K]
for h = 1 to H do
  for w = 1 to W do
    for k = 1 to K do
      out = 0
      for x = 1 to R do
        for y = 1 to S do
          for c = 1 to C do
            out += input[h + x][w + y][c] * filter[x][y][c][k]
          end
        end
      end
      output[h][w][k] = out
    end
  end
end

4.1.1 Tiling 2D convolutions. For convolutions with window sizes greater than 1, there are overlaps in the regions of the input tensor which affect two spatially adjacent output elements. The naive method of computing a convolution using Algorithm 1 on a GPU would use a single thread per output element, which means every thread has to load all of the input slice that it requires from memory. Adjacent elements in the convolution output require overlapping slices of the input tensor, so using a single thread to compute a number of these elements will reduce the number of times a single input element is loaded from memory. We refer to this as a tiled algorithm, as each thread computes a tile of the output. As the tile size increases, the number of times each input element is reused also increases, and so the total number of bytes read from memory in the whole operation decreases. Memory bandwidth is typically the largest bottleneck in these GPU computations, as discussed in Section 2.2, and so reducing the number of loads typically improves performance accordingly. Similarly the use of vector loads and stores, as well as vector math operations, can improve performance on some devices.
The downside of using large tiles and vectors in GPU kernels is that each significantly increases the number of registers required to hold the data needed by a single thread. As an example, Figure 2 shows the number of registers required for varying tile and vector sizes, as reported by AMD's CodeXL tools [1].
As GPUs have a limited number of registers available in their register files, the number of registers required by a single thread determines how many threads can be executed concurrently. Overhead and memory fetches are largely hidden in a GPU by quickly switching threads when one becomes stalled. If each thread requires more registers then the number of concurrent threads decreases, and so the chance that one thread is ready to execute as soon as another is stalled also decreases.
The best performance therefore does not necessarily come from using the largest tile sizes to get the most data reuse and large vector sizes to get more efficient loads and instruction level parallelism, but neither is the best performance achieved by using the fewest registers and so allowing the GPU to run the most threads concurrently.
Optimal performance is typically achieved by balancing these three factors, as shown in Figure 3, where the peak performance on the AMD R9 Nano is achieved using a 4 × 5 tile, 4-element vectors for input channels and 2-element vectors for output channels (at 2.57 teraflops), which gives a 10× speed up over the naive implementation (at 0.29 teraflops). When a tile size is used which requires more registers than are available on the device—so that values which would be held in registers are instead stored in memory—the performance is worse still, going down to as little as 50 gigaflops.
On different devices, the optimal tile sizes and vector widths for the best performance will differ depending on the device's load-store units, the number of available registers and whether vector math units are available.
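To make the tiling parameters concrete, the sketch below shows a tiled direct convolution for a 3 × 3, stride-1 window in which each work-item computes a TileH × TileW patch of the output for one output channel, so every loaded input value is reused by several outputs in the patch. This is an illustrative sketch rather than a SYCL-DNN kernel: vector loads are omitted, the function and kernel names are made up, the tensors use the [H][W][C] and [R][S][C][K] layouts of Algorithm 1, and the input is assumed to be zero-padded to (H + 2) × (W + 2) so that every access is in range.

#include <CL/sycl.hpp>
namespace sycl = cl::sycl;

template <int TileH, int TileW>
class Conv3x3Tiled;

template <int TileH, int TileW>
void conv3x3(sycl::queue& q, sycl::buffer<float, 1>& in,
             sycl::buffer<float, 1>& fil, sycl::buffer<float, 1>& out,
             size_t H, size_t W, size_t C, size_t K) {
  // One work-item per (output tile, output channel); H and W are assumed to
  // be multiples of TileH and TileW.
  sycl::range<3> threads{H / TileH, W / TileW, K};
  q.submit([&](sycl::handler& cgh) {
    auto input = in.get_access<sycl::access::mode::read>(cgh);
    auto filter = fil.get_access<sycl::access::mode::read>(cgh);
    auto output = out.get_access<sycl::access::mode::write>(cgh);
    cgh.parallel_for<Conv3x3Tiled<TileH, TileW>>(threads, [=](sycl::item<3> it) {
      const size_t h0 = it[0] * TileH;
      const size_t w0 = it[1] * TileW;
      const size_t k = it[2];
      float acc[TileH][TileW] = {};  // output tile held in registers
      for (size_t c = 0; c < C; ++c) {
        // Load the (TileH + 2) x (TileW + 2) input patch for this channel
        // once; each of its elements is reused by up to 9 outputs in the tile.
        float patch[TileH + 2][TileW + 2];
        for (int ph = 0; ph < TileH + 2; ++ph)
          for (int pw = 0; pw < TileW + 2; ++pw)
            patch[ph][pw] = input[((h0 + ph) * (W + 2) + (w0 + pw)) * C + c];
        for (int x = 0; x < 3; ++x)
          for (int y = 0; y < 3; ++y) {
            const float f = filter[((x * 3 + y) * C + c) * K + k];
            for (int th = 0; th < TileH; ++th)
              for (int tw = 0; tw < TileW; ++tw)
                acc[th][tw] += patch[th + x][tw + y] * f;
          }
      }
      for (int th = 0; th < TileH; ++th)
        for (int tw = 0; tw < TileW; ++tw)
          output[((h0 + th) * W + (w0 + tw)) * K + k] = acc[th][tw];
    });
  });
}

The patch and accumulator arrays grow with the tile size, which is exactly the register pressure shown in Figure 2, so the best TileH and TileW values are device dependent.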

Figure 2: Number of registers used for different tile sizes and vector sizes for SYCL-DNN's tiled 3×3 convolution kernel. Each subplot gives the number of registers used for a given tile size as the vector sizes vary. Register usage data generated by AMD CodeXL [1].

Figure 3: Number of teraflops achieved for different tile sizes and vector sizes for a given 3×3 convolution. Each subplot gives the performance for a given tile size as the vector sizes vary. Performance data obtained using an AMD R9 Nano GPU with the amdgpu-pro drivers.

4.1.2 Winograd technique for fast convolutions. Lavin and Gray suggested a technique to accelerate convolutions in [28] by using a transform studied by Winograd [37] as well as Cook and Toom; in many DNN libraries this is referred to as the Winograd technique. The technique involves transforming the input and filter tensors into a number of small matrices, then the convolution is computed using a batched matrix multiply between these matrices, followed by another transform of the output values. In this way the total number of floating point operations required to compute the convolution can be decreased to as little as 30% of that required by a naive computation.
For a convolution with an R × S filter size, the input transform converts an M × N tile of the input tensor into an (M + R − 1) × (N + S − 1) tile, which is then scattered across (M + R − 1)(N + S − 1) matrices with one tile element equal to a corresponding element in one of the matrices.
Increasing the tile sizes used in the transforms increases the amount of data reuse, so overall fewer memory fetches are required, as discussed in Section 4.1.1. However, larger tile sizes also require more registers as above, which can lead to fewer concurrent threads and possibly worse performance.
Another effect of using larger tile sizes is that the number of intermediate matrices increases, but the size of each individual matrix decreases. The majority of the computation is the batched matrix multiply, and for smaller matrices it can be harder to fully utilize a GPU even with specialized batched kernels.
Given these considerations, there is again the problem of balancing the tile size with register usage, optimal matrix multiply sizes and vector widths in order to get the best performance. As with the tiled convolution computations, this will differ depending on the device as well as the optimizations available in the matrix multiplies. The performance portability provided by the SYCL-BLAS matrix multiplies discussed in Section 3 therefore significantly affects the achievable performance when using this technique to compute convolutions.

5 PERFORMANCE
The performance of our SYCL kernels is measured against hand-tuned, vendor provided libraries. The primary target devices of our libraries have been embedded devices, so the main comparison has been made to ARM's Compute Library [2, v18.11]. This provides accelerated routines through both NEON instructions for the CPU and OpenCL kernels for GPUs.
We also provide comparisons for Intel hardware, utilizing both the CPU and GPU, with comparisons to Intel MKL-DNN [6, v0.18.1] and to clBLAST [30].

5.1 Benchmark device setup
The performance benchmarks are provided on a range of hardware to show the portability of our SYCL libraries.

5.1.1 HiSilicon Kirin 960. The HiSilicon HiKey 960 is a development board featuring the HiSilicon Kirin 960 system-on-chip (SoC) designed for use in mobile phones. This comprises an ARM big.LITTLE CPU with 4 ARM A73 cores and 4 ARM A53 cores, along with an ARM Mali G-71 GPU.
When benchmarking using the GPU, the big (A73) CPUs are disabled to help keep device temperatures low. For benchmarking on the CPU, the big A73 cores are enabled.

5.1.2 Intel i7-6700K. The i7-6700K is a Skylake desktop processor which contains 4 CPU cores with a base frequency of 4 GHz, boosting to 4.2 GHz, and an Intel HD Graphics 530 integrated GPU with a base frequency of 350 MHz, boosting to 1.15 GHz.
For the neural network benchmarks in Section 5.3, we benchmarked SYCL-DNN using both the CPU and GPU in the i7-6700K processor, as Intel provide an OpenCL implementation for Xeon and Core CPUs and an open source Graphics Compute Runtime for their integrated graphics processors. When benchmarking using the CPU the scaling governor was set to 'performance'.

5.1.3 Intel i7-9700K. The i7-9700K is a desktop processor which contains 8 CPU cores with a base frequency of 3.6 GHz, boosting to 4.9 GHz, and an Intel UHD Graphics 630 integrated GPU with a base frequency of 350 MHz, boosting to 1.2 GHz.
For the BLAS benchmarks in Section 5.2, we benchmarked only the GPU in the i7-9700K processor.

5.2 SYCL-BLAS GEMM benchmarks
A comparison similar to the roofline model [36] is carried out to plot the performance of a kernel (Gflop/s) as a function of its operational intensity, i.e. the number of floating point operations per byte of data read or written (flop/byte).
The SYCL-BLAS GEMM implementation has been compared to clBLAST's hand-tuned GEMM OpenCL kernels [30] on an Intel UHD Graphics 630 GPU and to ARM Compute Library's GEMM kernels on an ARM Mali G-71 GPU. For each platform, different configurations of SYCL-BLAS have been used, with various register tile sizes, work-group sizes and local memory sizes, as shown in Table 2. Each configuration is labelled h×w_r×c, where h×w represents a tile of h by w registers per thread used to compute the block matrix Cij, and r×c represents a work-group consisting of r × c threads.

Table 2: SYCL-BLAS configurations for local and register tile sizes. 'Registers' shows the number of registers in the tile, 'Work group' shows the work-group size used and 'Local mem' refers to the size of local memory used.

Configuration    Registers  Work group  Local mem
4×4_8×8_loc      16         64          8 KiB
4×4_16×16_loc    16         256         16 KiB
8×4_8×16_loc     32         128         16 KiB
8×2_4×16_loc     16         64          8 KiB
8×4_8×16_noloc   32         128         N/A
8×4_4×8_noloc    32         32          N/A
4×4_8×8_noloc    16         64          N/A

The number of data elements stored in local memory for the computation of a matrix multiply between matrices of size M × K and K × N, for a configuration h×w_r×c, is given by h × r × X + X × w × c, where X is the number of data elements that fit within a cache line. The local memory size is doubled when double buffering is used.
In Figures 4 and 5 each point represents the performance of a single matrix multiply between matrices of size M × K and K × N, where M ∈ [64, 1024], N ∈ [64, 1024], K ∈ [64, 1024] and M, N and K are all powers of 2.
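As a worked example of the local memory formula above, assume X = 16, i.e. the 64-byte cache lines of the Intel and ARM GPUs in Table 1 holding single-precision values. For the 4×4_8×8_loc configuration the tiles occupy h × r × X + X × w × c = 4 × 8 × 16 + 16 × 4 × 8 = 1024 elements, or 4 KiB, which becomes 8 KiB once doubled for double buffering; for 8×4_8×16_loc the total is 8 × 8 × 16 + 16 × 4 × 16 = 2048 elements, or 8 KiB, becoming 16 KiB with double buffering. These figures are consistent with the 'Local mem' column of Table 2 if those sizes include the doubling for double buffering.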

5.2.1 Performance on Intel i7-9700K. Figure 4a shows a roofline model comparing different configurations of the SYCL-BLAS GEMM implementation against the GEMM implementation from clBLAST [30]. Comparing these results shows that the 8×4_8×16_loc configuration for SYCL-BLAS achieves close to the performance of the hand-tuned clBLAST OpenCL kernels on the Intel GPU. In addition, the results show that increasing the number of registers per thread from 4 × 4 to 8 × 4 significantly improves performance, as data reuse is increased.
Figure 4b shows a roofline model comparing two different SYCL-BLAS GEMM configurations which use 16 registers, but in different tile layouts. The result shows that the square register tile 4×4_8×8 yields better performance than the non-square register tile 8×2_4×16, as discussed in Section 3.1.2. Similarly, Figure 4c shows the performance improvement from using double buffering to hide memory latency.

Figure 4: Comparison of different performance metrics of SYCL-BLAS on an Intel UHD Graphics 630 GPU. (a) Overall performance comparison of SYCL-BLAS for a number of tile configurations against hand-tuned clBLAST. (b) Performance comparison of a square register tile vs a non-square register tile, when the total number of registers is constant. (c) Performance comparison of using double buffering; figures shown with configuration 8×4_8×16_loc.

5.2.2 Performance on ARM HiKey 960. Figure 5a shows a roofline model comparing the number of gigaflops achieved by different configurations of SYCL-BLAS GEMM with that achieved by the hand-tuned OpenCL kernels in ARM Compute Library. The figure is divided into 3 dotted regions where different GEMM configurations give better performance. These regions are shown in more detail in Figures 5b to 5d.
In the region marked A, shown in Figure 5b, the matrices have a small size and are typically square, so the matrix multiply has a low computational intensity and comparatively few floating point operations. The 4×4_8×8 configuration shows the best performance for SYCL-BLAS, and is competitive with ARM Compute Library.
However, in the region marked B, shown in Figure 5c, the 8×4_4×8 configuration is better. Here the matrices are small to medium sized and typically rectangular, so the operation has a higher computational intensity but still comparatively few operations, so launching fewer threads with a higher workload per thread works well.
Figure 5d shows the region marked as C in more detail, where the matrices are large and so the operation has high computational intensity and many floating point operations. In this region the 8×4_8×16 configuration is best, as the larger matrices can be split up much better across a larger number of threads.

5.3 Neural network benchmarks
To determine the performance of SYCL-DNN compared to other neural network acceleration libraries, we extracted the convolution layers from well known standard image recognition networks. The networks chosen are VGG-16 [33] and ResNet-50 [26], introduced in 2014 and 2015 respectively, which are fairly large convolutional networks which were state-of-the-art when introduced, and are still frequently used as they provide good results and are widely available.
The VGG model uses 16 layers of 3×3 convolutions interspersed with max pooling layers, while the ResNet model is made up of a number of blocks each consisting of a 1×1 convolution, a 3×3 convolution and then a 1×1 convolution.

Table 3: Convolution sizes and parameters for each of the distinct layers in the VGG neural network model. The W column refers to the window size, S to the stride.

Layer     W  S  Input            Output
Conv 1 1  3  1  224 × 224 × 3    224 × 224 × 64
Conv 1 2  3  1  224 × 224 × 64   224 × 224 × 64
Conv 2 1  3  1  112 × 112 × 64   112 × 112 × 128
Conv 2 2  3  1  112 × 112 × 128  112 × 112 × 128
Conv 3 1  3  1  56 × 56 × 128    56 × 56 × 256
Conv 3 2  3  1  56 × 56 × 256    56 × 56 × 256
Conv 4 1  3  1  28 × 28 × 256    28 × 28 × 512
Conv 4 2  3  1  28 × 28 × 512    28 × 28 × 512
Conv 5 1  3  1  14 × 14 × 512    14 × 14 × 512

Table 4: Convolution sizes and parameters for each of the distinct layers in the ResNet neural network model. The W column refers to the window size, S to the stride.

Layer     W  S  Input            Output
Conv 1 1  7  2  230 × 230 × 3    112 × 112 × 64
Conv 2 1  1  1  56 × 56 × 64     56 × 56 × 256
Conv 2 2  1  1  56 × 56 × 64     56 × 56 × 64
Conv 2 3  3  1  56 × 56 × 64     56 × 56 × 64
Conv 2 4  1  1  56 × 56 × 256    56 × 56 × 64
Conv 2 5  3  2  56 × 56 × 64     28 × 28 × 64
Conv 3 1  1  1  28 × 28 × 64     28 × 28 × 256
Conv 3 2  1  1  28 × 28 × 256    28 × 28 × 512
Conv 3 3  1  1  28 × 28 × 256    28 × 28 × 128
Conv 3 4  3  1  28 × 28 × 128    28 × 28 × 128
Conv 3 5  1  1  28 × 28 × 128    28 × 28 × 512
Conv 3 6  1  1  28 × 28 × 512    28 × 28 × 128
Conv 3 7  3  2  28 × 28 × 128    14 × 14 × 128
Conv 4 1  1  1  14 × 14 × 128    14 × 14 × 512
Conv 4 2  1  1  14 × 14 × 512    14 × 14 × 1024
Conv 4 3  1  1  14 × 14 × 512    14 × 14 × 256
Conv 4 4  3  1  14 × 14 × 256    14 × 14 × 256
Conv 4 5  1  1  14 × 14 × 256    14 × 14 × 1024
Conv 4 6  1  1  14 × 14 × 1024   14 × 14 × 256
Conv 4 7  3  2  14 × 14 × 256    7 × 7 × 256
Conv 5 1  1  1  7 × 7 × 256      7 × 7 × 1024
Conv 5 2  1  1  7 × 7 × 1024     7 × 7 × 2048
Conv 5 3  1  1  7 × 7 × 1024     7 × 7 × 512
Conv 5 4  3  1  7 × 7 × 512      7 × 7 × 512
Conv 5 5  1  1  7 × 7 × 512      7 × 7 × 2048
Conv 5 6  1  1  7 × 7 × 2048     7 × 7 × 512

Figure 5: Roofline performance comparison of SYCL-BLAS on an ARM Mali G-71 GPU, compared to ARM's Compute Library. (a) Overall comparison of SYCL-BLAS against the hand-tuned OpenCL kernels in ARM's Compute Library across a variety of matrix sizes; the highlighted areas are shown in more detail in (b)–(d). (b) Block A, where 4×4_8×8 achieves the best performance. (c) Block B, where 8×4_4×8 achieves the best performance. (d) Block C, where 8×4_8×16 achieves the best performance.

Exact sizes for these layers as used in the benchmarks are shown in Tables 3 and 4.
The performance results for the HiKey 960 SoC are shown in Figures 8 and 6 for the VGG and ResNet benchmarks respectively. SYCL-DNN is competitive with ARM's Compute Library, and typically outperforms both the OpenCL and NEON implementations in the ResNet benchmarks. The VGG model makes use of more 3×3 convolutions, which have very optimized OpenCL implementations in the ARM library and in most cases outperform SYCL-DNN. These 3×3 convolutions are also the three cases that stand out in the ResNet benchmark where the ARM OpenCL performance is significantly higher.

Figure 6: Gigaflop comparison for ResNet layers on ARM HiKey 960 with SYCL-DNN and ARM Compute Library on both CPU using NEON and GPU using OpenCL. Benchmark run with a batch size of 1.

Figure 7: Gigaflop comparison for ResNet layers on Intel i7-6700K with SYCL-DNN on both CPU and GPU compared to MKL-DNN on the CPU. Benchmark run with a batch size of 4.

Figures 9 and 7 show the performance of the VGG and ResNet benchmarks on the Intel i7-6700K platform, comparing the performance of SYCL-DNN on both the CPU and the GPU against MKL-DNN. For the convolutions in the ResNet model MKL-DNN is consistently faster than SYCL-DNN, achieving up to 366 gigaflops while SYCL-DNN achieves a maximum of 244. However, in the VGG benchmark SYCL-DNN running on the GPU consistently outperforms MKL-DNN.

Figure 8: Gigaflop comparison for VGG layers on ARM HiKey 960 with SYCL-DNN and ARM Compute Library on both CPU using NEON and GPU using OpenCL. Benchmark run with a batch size of 1.

Figure 9: Gigaflop comparison for VGG layers on Intel i7-6700K with SYCL-DNN on CPU and GPU compared to MKL-DNN on the CPU. Benchmark run with a batch size of 4.

6 CONCLUSION
The SYCL programming model allows us to write highly parametrized compute kernels which can be instantiated to best suit the targeted hardware, and so obtain good performance across different hardware.
Overall, we have shown that our general purpose, parameterized SYCL kernels give competitive performance against hand-tuned, optimized libraries such as clBLAST, ARM Compute Library and MKL-DNN. Moreover these kernels can easily be tuned to perform well on other devices, even if those devices have significantly different performance characteristics.
There are many improvements still to make to these libraries, including adding vector operations to SYCL-BLAS and implementations of convolution kernels in SYCL-DNN which support data prefetching and local memory, as well as providing currently unsupported operations required for neural networks. There are also plans to develop a machine learning system to tune these libraries for new devices.

REFERENCES
[1] [n. d.]. AMD CodeXL GPU profiler and tool suite. https://gpuopen.com/compute-product/codexl/. Accessed: 2019-04-01.
[2] [n. d.]. The ARM Computer Vision and Machine Learning library. https://github.com/ARM-software/ComputeLibrary/. Accessed: 2019-03-19.
[3] [n. d.]. ComputeCpp. https://www.codeplay.com/products/computesuite/computecpp. Accessed: 2019-03-19.
[4] [n. d.]. Green500 list of most energy efficient supercomputers in the world. https://www.top500.org/green500/lists/2018/11/. Accessed: 2019-03-19.
[5] [n. d.]. MIOpen: AMD's Machine Intelligence Library. https://github.com/ROCmSoftwarePlatform/MIOpen. Accessed: 2019-04-01.
[6] [n. d.]. MKL-DNN: Intel Math Kernel Library for Deep Neural Networks. https://github.com/intel/mkl-dnn. Accessed: 2019-04-09.
[7] [n. d.]. Netlib BLAS: Basic Linear Algebra Subprograms. http://www.netlib.org/blas/. Accessed: 2019-03-19.
[8] [n. d.]. NVIDIA cuBLAS: Dense Linear Algebra on GPUs. https://developer.nvidia.com/cublas. Accessed: 2019-04-09.
[9] [n. d.]. NVIDIA CUDA C Programming Guide v10.1.105. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html. Accessed: 2019-04-01.
[10] [n. d.]. NVIDIA CUDA programming model. http://www.nvidia.com/CUDA. Accessed: 2019-03-19.
[11] [n. d.]. SYCL-BLAS: An implementation of BLAS using the SYCL open standard. https://github.com/CodeplaySoftware/SYCL-BLAS. Accessed: 2019-04-09.
[12] [n. d.]. SYCL: C++ Single-source Heterogeneous Programming for OpenCL. https://www.khronos.org/sycl/. Accessed: 2019-03-11.
[13] [n. d.]. The SYCL-DNN neural network acceleration library. https://github.com/CodeplaySoftware/SYCL-DNN. Accessed: 2019-04-09.
[14] [n. d.]. SYCL-PRNG: A pseudo random number generator library written against the SYCL API. https://github.com/Wigner-GPU-Lab/SYCL-PRNG. Accessed: 2019-04-01.
[15] [n. d.]. TOP500 list of fastest supercomputers in the world. https://www.top500.org/lists/2018/11/. Accessed: 2019-03-19.
[16] [n. d.]. VisionCpp: A machine vision library written in SYCL and C++. https://github.com/codeplaysoftware/VisionCpp. Accessed: 2019-04-01.
[17] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. https://www.tensorflow.org/. Software available from tensorflow.org.
[18] José I. Aliaga, Ruymán Reyes, and Mehdi Goli. 2017. SYCL-BLAS: Leveraging Expression Trees for Linear Algebra. In Proceedings of the 5th International Workshop on OpenCL (IWOCL 2017). ACM, New York, NY, USA, Article 32, 5 pages. https://doi.org/10.1145/3078155.3078189
[19] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. cuDNN: Efficient Primitives for Deep Learning. CoRR abs/1410.0759 (2014). arXiv:1410.0759
[20] Mehdi Goli. 2016. VisionCPP: A SYCL-based Computer Vision Framework. In Proceedings of the 4th International Workshop on OpenCL (IWOCL '16). ACM, New York, NY, USA, Article 6, 4 pages. https://doi.org/10.1145/2909437.2909444
[21] Mehdi Goli. 2017. SYCL backend for Eigen. Technical presentation at the 1st Workshop on Distributed and Heterogeneous Programming in C and C++ (DHPCC++ 17).
[22] Mehdi Goli, Luke Iwanski, John Lawson, Uwe Dolinsky, and Andrew Richards. 2018. TensorFlow Acceleration on ARM Hikey Board. In Proceedings of the International Workshop on OpenCL (IWOCL '18). ACM, New York, NY, USA, Article 7, 4 pages. https://doi.org/10.1145/3204919.3204926
[23] Mehdi Goli, Luke Iwanski, and Andrew Richards. 2017. Accelerated Machine Learning Using TensorFlow and SYCL on OpenCL Devices. In Proceedings of the 5th International Workshop on OpenCL (IWOCL 2017). ACM, New York, NY, USA, Article 8, 4 pages. https://doi.org/10.1145/3078155.3078160
[24] Gaël Guennebaud, Benoît Jacob, et al. 2010. Eigen v3. http://eigen.tuxfamily.org.
[25] Fred G. Gustavson, André Henriksson, Isak Jonsson, Bo Kågström, and Per Ling. 1998. Recursive Blocked Data Formats and BLAS's for Dense Linear Algebra Algorithms. In Proceedings of the 4th International Workshop on Applied Parallel Computing, Large Scale Scientific and Industrial Problems (PARA '98). Springer-Verlag, London, UK, 195–206. http://dl.acm.org/citation.cfm?id=645781.666659
[26] K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 770–778. https://doi.org/10.1109/CVPR.2016.90
[27] Andrew Kerr, Duane Merrill, Julien Demouth, and John Tran. 2017. CUTLASS: Fast Linear Algebra in CUDA C++. NVIDIA Developer Blog, Dec 2017. https://devblogs.nvidia.com/cutlass-linear-algebra-cuda/.
[28] A. Lavin and S. Gray. 2016. Fast Algorithms for Convolutional Neural Networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 4013–4021. https://doi.org/10.1109/CVPR.2016.435
[29] Rajib Nath, Stanimire Tomov, and Jack Dongarra. 2010. An Improved Magma Gemm For Fermi Graphics Processing Units. The International Journal of High Performance Computing Applications 24, 4 (2010), 511–515. https://doi.org/10.1177/1094342010385729
[30] Cedric Nugteren. 2018. CLBlast: A Tuned OpenCL BLAS Library. In Proceedings of the International Workshop on OpenCL (IWOCL '18). ACM, New York, NY, USA, Article 5, 10 pages. https://doi.org/10.1145/3204919.3204924
[31] Chuck Pheatt. 2008. Intel Threading Building Blocks. Journal of Computing Sciences in Colleges 23, 4 (2008), 298–298.
[32] Ruyman Reyes and Victor Lomüller. 2015. SYCL: Single-source C++ accelerator programming. In PARCO. 673–682. https://doi.org/10.3233/978-1-61499-621-7-673
[33] K. Simonyan and A. Zisserman. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. CoRR abs/1409.1556 (2014). arXiv:1409.1556
[34] J. E. Stone, D. Gohara, and G. Shi. 2010. OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems. Computing in Science & Engineering 12, 3 (May 2010), 66–73. https://doi.org/10.1109/MCSE.2010.69
[35] Endong Wang, Qing Zhang, Bo Shen, Guangyong Zhang, Xiaowei Lu, Qing Wu, and Yajuan Wang. 2014. Intel Math Kernel Library. In High-Performance Computing on the Intel Xeon Phi. Springer, 167–188.
[36] Samuel Williams, Andrew Waterman, and David Patterson. 2009. Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures. Technical Report. Lawrence Berkeley National Lab (LBNL), Berkeley, CA, United States.
[37] Shmuel Winograd. 1980. Arithmetic Complexity of Computations. Society for Industrial and Applied Mathematics. 93 pages. https://doi.org/10.1137/1.9781611970364.fm