

From CUDA to OpenCL: Towards a Performance-portable Solution for Multi-platform GPU Programming✩,✩✩

Peng Du (a), Rick Weber (a), Piotr Luszczek (a), Stanimire Tomov (a), Gregory Peterson (a), Jack Dongarra (a,b)

(a) University of Tennessee, Knoxville    (b) University of Manchester

Abstract

In this work, we evaluate OpenCL as a programming tool for developing performance-portable applications for GPGPU. While OpenCL was developed with programming portability in mind, performance is not necessarily portable. OpenCL requires performance-impacting initializations that do not exist in other languages such as CUDA. Understanding these implications allows us to provide a single library with decent performance on a variety of platforms. We choose the triangular solver (TRSM) and matrix multiplication (GEMM) as representative Level 3 BLAS routines to implement in OpenCL. We profile TRSM to get the time distribution of the OpenCL runtime system. We then provide tuned GEMM kernels for both the Tesla C2050 and the ATI Radeon 5870, the latest GPUs offered by both companies. We explore the benefits of using the texture cache, the performance ramifications of copying data into images, discrepancies in the OpenCL and CUDA compilers' optimizations, and other issues that affect the performance. Experimental results show that nearly 50% of peak performance can be obtained in GEMM on both GPUs in OpenCL. We also show that the performance of these kernels is not highly portable. Finally, we propose using auto-tuning to better explore these kernels' parameter space using search heuristics.

Keywords: hardware accelerators, portability, auto-tuning

1. Introduction

People associated Graphics Processing Units (GPUs) with fast image rendering until the turn of the century. This is when the science community turned their attention to hardware predominantly discussed in computer gaming circles. One of the first attempts at non-graphical computation on a GPU was a matrix-matrix multiply [1]. In 2001, low-end graphics cards had no floating-point support; floating-point color buffers did not arrive until 2003 [2]. For the gaming industry, support for floating point meant more realistic game-play; rich, advanced lighting effects no longer suffered from the banding effects common in older generations of hardware that only allowed a single byte per color channel. For the scientific community, the addition of floating point meant that the overflow associated with fixed-point arithmetic was no longer a problem. Many research publications thoroughly analyzed the new breed of GPU hardware using languages borrowed from the graphics community [3, 4, 5]. It goes without saying that these computational advances would not have been possible if it weren't for the programmable shaders that broke the rigidity of the fixed graphics pipeline. LU factorization with partial pivoting on a GPU was one of the first common computational kernels that ran faster than an optimized CPU implementation [6]. The introduction of NVIDIA's CUDA [7, 8] (Compute Unified Device Architecture) ushered in a new era of improved performance for many applications as programming GPUs became simpler: archaic terms such as texels, fragments, and pixels were superseded by threads, vector processing, data caches and shared memory.

Further changes occurring in ATI's and NVIDIA's offerings made GPU acceleration even more pertinent to the scientific community. ATI's FireStream and NVIDIA's Fermi architecture added support for Fused Multiply-Add (FMA): a more accurate version of the former MAD (Multiply-Add) instruction. With only a single rounding step, this new instruction brings GPUs even closer to compliance with the IEEE 754 standard for floating-point arithmetic. Additionally, reworking of the cache hierarchy helped with some of the performance issues of the past. Finally, Error Correction Codes (ECC) are now used to protect the GPU device's memory as its capacity grows to the point of being vulnerable to errors induced by nature, such as cosmic radiation.

In our project called Matrix Algebra on GPU and Multicore Architectures [9] (MAGMA), we mainly focus on dense matrix routines for numerical linear algebra, similar to those available in LAPACK [10]. While CUDA is only available for NVIDIA GPUs, there are other existing frameworks that allow platform-independent programming for GPUs:

1. DirectCompute from Microsoft,
2. OpenGL Shading Language (GLSL), and
3. OpenCL.

✩ This work was supported by the SCALE-IT fellowship through grant number OR11907-001.
✩✩ This work was supported by NVIDIA, Microsoft, the U.S. National Science Foundation, and the U.S. Department of Energy.

DirectCompute allows access to graphics cards from multiple vendors. However, it is specific to Microsoft Windows, and therefore it is not portable between host Operating Systems (OS).

The OpenGL Shading Language [11] is portable across both GPU hardware and the host OS. However, it is specifically geared towards programming new graphics effects – GLSL does not have a scientific focus.

OpenCL [12] has been designed for general purpose computing on GPUs (GPGPU). It is an open standard maintained by the Khronos group with the backing of major graphics hardware vendors as well as large computer industry vendors interested in off-loading computations to GPUs. As such, there exist working OpenCL implementations for graphics cards and multi-core processors; OpenCL offers portability across GPU hardware, OS software, and multicore processors. In this work, BLAS routines are used to evaluate OpenCL's usefulness in creating a portable, high performance GPU numerical linear algebra library.

The rest of the paper is organized as follows: Section 2 discusses related work. Section 3 details OpenCL's programming considerations; this involves a comparison between CUDA and OpenCL as programming models for GPUs, and a profiling analysis of the run time of each component of an OpenCL program. Sections 4 and 5 give an evaluation of both the NVIDIA and ATI GPU platforms by implementing a fast GEMM and analyzing its performance. Section 6 discusses cross-platform issues and section 7 lays out the basic structure of an auto-tuning system for cross-platform GPU math library design. Section 8 shows the performance results and section 9 concludes the paper. In the text we use the terms "ATI card" and Radeon 5870, and "Fermi card" and Tesla C2050, interchangeably.

2. Related Work

Synthetic benchmarking of GPUs was used extensively to gain an understanding of the graphics accelerators when the technical details of the hardware remained an industrial secret [13]. In the context of scientific applications, such benchmarking efforts led to algorithms that provide significant performance improvements [14].

A performance-oriented study of NVIDIA's PTX (Parallel Thread eXecution) architecture [15] was done in the context of the Ocelot project [16]. Code transformations of CUDA kernels were done by the MCUDA [17] project. The transformations enabled CUDA code to run efficiently on multicore CPUs. The inverse operation – porting multicore code to GPUs – was difficult [16].

Work on optimizing CUDA implementations of basic linear algebra kernels has demonstrated the importance of "how sensitive the performance of a GPU is to the formulation of your kernel" [18] and that an enormous amount of well thought out experimentation and benchmarking [14, 18] is needed in order to optimize performance. Optimizing OpenCL applications for a particular architecture faces the same challenges. Further, optimizing a fixed OpenCL code for several architectures is harder, even impossible, and naturally, many authors claim that OpenCL does not provide performance portability. This, along with the fact that GPUs are quickly evolving in complexity, has made tuning numerical libraries for them challenging. One approach (that we explore) to systematically resolve these issues is the use of auto-tuning, which in the context of OpenCL would involve collecting and generating multiple kernel versions, implementing the same algorithm optimized for different architectures, and heuristically selecting the best performing one. Auto-tuning has been used intensively on CPUs in the past to address these challenges and to automatically generate near optimal numerical libraries; e.g., ATLAS [19, 20] and PHiPAC [21] used it to generate highly optimized BLAS. Work on auto-tuning CUDA kernels for NVIDIA GPUs [22, 23] has shown that the technique is a very practical solution for easily porting existing algorithmic solutions onto quickly evolving GPU architectures and for substantially speeding up even highly tuned hand-written kernels.

In [24], the authors¹ examined performance portability in OpenCL. In their study, they compared CUDA and OpenCL implementations of a Monte Carlo chemistry application running on an NVIDIA GTX285. They also compared the same application written in ATI's now defunct Brook+ to an OpenCL version on a FireStream 9170 and a Radeon 4870, respectively. Finally, they compared OpenCL to a C++ implementation running on multi-core processors. The paper showed that while OpenCL does provide code portability, it doesn't necessarily provide performance portability. Furthermore, they showed that platform-specific languages often, but not always, outperformed OpenCL.

¹ Rick Weber and Gregory Peterson are also authors of this paper.

3. OpenCL as a Programming Tool

To evaluate OpenCL as a programming tool for implementing high performance linear algebra routines, we pick the triangular solver (TRSM) routine from BLAS and profile each component of the program: environment setup, program compilation, kernel extraction, and execution.

TRSM solves the linear equation AX = B, where A is an upper or lower triangular matrix and B is a known matrix of right-hand sides. Its implementation involves a blocking algorithm in which the diagonal triangular blocks are inverted (TRTRI) in parallel, followed by a series of matrix multiplications (GEMM). Porting these routines from CUDA to OpenCL requires some translation.

CUDA and OpenCL have many conceptual similarities but they diverge on terminology. Table 1 shows the corresponding terms in both frameworks, while Figure 1 highlights differences in the CUDA and OpenCL software stacks. Similarly, ATI and NVIDIA GPUs have analogous platform definitions, as shown in Table 2. Table 3 shows the platform details of two different NVIDIA GPUs and one GPU from ATI/AMD. We show what these differences mean to application developers.
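As a schematic sketch of the blocked TRSM scheme described above (not the actual MAGMA code; the block size nb, the helper gemm(), and the block-indexing macros L(), invL(), B(), X() are hypothetical), a lower triangular solve L*X = B could be organized as follows:

    /* Diagonal nb-by-nb blocks of L are inverted up front (TRTRI); these
       inversions are independent and can run in parallel on the GPU.
       After that, all remaining work is expressed as GEMM calls. */
    for (int j = 0; j < n; j += nb) {
        /* X_j = inv(L_jj) * B_j, using the precomputed inverted diagonal block */
        gemm(nb, nrhs, nb,  1.0, invL(j, j), B(j), 0.0, X(j));

        /* Trailing update: B_i -= L_ij * X_j for every block row below j */
        for (int i = j + nb; i < n; i += nb)
            gemm(nb, nrhs, nb, -1.0, L(i, j), X(j), 1.0, B(i));
    }

Casting the bulk of the work as GEMM is what makes the GEMM results of Sections 4 and 5 directly relevant to TRSM performance.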


Figure 2: Comparison of Device Kernel Code Between OpenCL and CUDA.

CUDA term                        OpenCL term
host                             host
streaming multiprocessor (SM)    compute unit (CU)
scalar core                      processing element (PE)
host thread                      host program
thread                           work-item
thread block                     work-group
grid                             NDRange
shared memory                    local memory
constant memory space            constant memory
texture memory space             constant memory

Table 1: Comparison of terms used by CUDA and OpenCL to describe very similar concepts.

Figure 1: Comparison of software stacks used with CUDA and OpenCL on various hardware platforms.

3.1. Relation to CUDA

Figure 2 shows the side-by-side differences of the kernel code for the triangular inversion routine (TRTRI) in OpenCL and CUDA. The changes are in the lines annotated in red. They belong to the following categories (a minimal example follows the list):

• Obtaining the ID of the thread/work-item and block/work-group.
• The definition of shared memory in CUDA is replaced in OpenCL by local memory: the __shared__ qualifier is replaced with __local.
• OpenCL makes an explicit differentiation between global memory addresses (device memory address space) and local memory addresses (register variables or pointers to shared memory), whereas CUDA makes no such distinction.
• Syntax for synchronization primitives.
• Differences in the CUDA and OpenCL front-ends lead to different timing profiles.
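The toy kernel below (our own example, not the TRTRI code of Figure 2) illustrates the first four categories; it assumes a thread-block/work-group size of 256.

    /* CUDA version */
    __global__ void scale(float *x, float alpha, int n)
    {
        __shared__ float tmp[256];                      /* shared memory   */
        int i = blockIdx.x * blockDim.x + threadIdx.x;  /* thread/block ID */
        if (i < n) tmp[threadIdx.x] = x[i];
        __syncthreads();                                /* synchronization */
        if (i < n) x[i] = alpha * tmp[threadIdx.x];
    }

    /* OpenCL version */
    __kernel void scale(__global float *x, float alpha, int n)
    {
        __local float tmp[256];                         /* local memory    */
        int i = get_global_id(0);                       /* work-item ID    */
        if (i < n) tmp[get_local_id(0)] = x[i];
        barrier(CLK_LOCAL_MEM_FENCE);                   /* synchronization */
        if (i < n) x[i] = alpha * tmp[get_local_id(0)];
    }

Note the explicit __global qualifier on the pointer argument in the OpenCL version, which reflects the address-space distinction described above.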

NVIDIA term                      ATI term
streaming multiprocessor (SM)    compute unit (CU)
scalar core                      processing element (PE)
shared memory                    shared memory
texture cache                    texture cache
PTX                              IL
binary                           binary

Table 2: Comparison of terms used by ATI and NVIDIA to describe very similar concepts.

3.2. Profiling

Unlike CUDA, OpenCL requires environmental setup on the host (CPU) before launching kernels to run on the GPU. This also includes compiling the kernels. The process for setting up kernels to execute is similar for all GPU kernels, and an analysis of TRSM gives a typical breakdown of initialization. Figure 3 shows the breakdown of the initialization time.

We run oclDtrsm (our OpenCL double precision triangular solver) with M=10240 and NRHS=128 on an Intel Q9300 running at 2.5 GHz and a Tesla C2050. Setting up the kernel takes longer than the kernel execution itself.

Figure 3: Runtime break down (values are times in ms).

Compiling OpenCL source code into an intermediate representation takes the most time in initialization. We observed similar results on older NVIDIA cards (e.g. the GTX 280) and on ATI cards (e.g. the Radeon 5870) using the ATI Stream SDK [25]. On the Tesla C2050, the compilation of 300+ lines of OpenCL C code into 7000+ lines of PTX takes just over 2 seconds, while the computation on a fairly large problem takes less than 0.2 seconds. This overhead can lead to a severe performance impact if not accounted for when many OpenCL routines call each other in a software library. One solution to reduce this overhead is to separate compilation and execution.

Since OpenCL includes separate compile and build functions in its API, source code compilation can be performed once, during the deployment/installation stage of the math library. As of this writing, there is no off-line kernel compiler in NVIDIA's OpenCL platform; the documentation suggests [26] that this be implemented by the developers. We can do this by fetching the Intermediate Representation (IR) resulting from compilation using clGetProgramInfo and saving it to disk. During the initialization phase, the IR can be read from disk and processed with a call to clBuildProgram. This method reduced the time of getting the binary code ready to run from 2+ seconds to 0.2 seconds. While creating a kernel from a pre-built program still takes more time than a single TRSM, initialization is 10x faster when the raw source code isn't compiled every time the user runs the application.
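A minimal host-side sketch of this separate-compilation scheme follows; it assumes a single target device, omits error handling, and the file name and helper names are ours.

    #include <stdio.h>
    #include <stdlib.h>
    #include <CL/cl.h>

    /* Save the device binary of an already-built program to disk
       (done once, e.g. at library installation time). */
    static void save_program_binary(cl_program prog, const char *path)
    {
        size_t size;
        clGetProgramInfo(prog, CL_PROGRAM_BINARY_SIZES, sizeof(size), &size, NULL);
        unsigned char *bin = (unsigned char *)malloc(size);
        clGetProgramInfo(prog, CL_PROGRAM_BINARIES, sizeof(bin), &bin, NULL);
        FILE *f = fopen(path, "wb");
        fwrite(bin, 1, size, f);
        fclose(f);
        free(bin);
    }

    /* At run time, recreate the program from the cached binary
       instead of recompiling the OpenCL C source. */
    static cl_program load_program_binary(cl_context ctx, cl_device_id dev,
                                          const char *path)
    {
        FILE *f = fopen(path, "rb");
        fseek(f, 0, SEEK_END);
        size_t size = (size_t)ftell(f);
        rewind(f);
        unsigned char *bin = (unsigned char *)malloc(size);
        fread(bin, 1, size, f);
        fclose(f);

        cl_int status, err;
        cl_program prog = clCreateProgramWithBinary(ctx, 1, &dev, &size,
                                                    (const unsigned char **)&bin,
                                                    &status, &err);
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);  /* finalize for the device */
        free(bin);
        return prog;
    }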

Having sped up initialization, the time profile of the TRSM kernel itself is the next item to optimize. The performance of TRSM is dominated by GEMM [27]. Since GEMM is one of the most important kernels in linear algebra, we will focus on implementing and analyzing a fast OpenCL GEMM in the coming sections.

4. NVIDIA Platform

4.1. Fermi Architecture

Fermi is NVIDIA's latest GPU product line and includes the GTX4xx series and the Tesla C2050. It introduces several changes over previous GPU offerings, including more plentiful and more capable compute units and a revamped memory hierarchy. Under Fermi, more compute units are available, warp sizes have increased from 16 to 32 threads, and each compute unit issues instructions from 2 warps concurrently [28]. Multiple kernels can run simultaneously on Fermi hardware, as opposed to previous generations, which only support a single kernel executing at any given time [7]. This feature increases device utilization for matrix operations with small problem sizes by increasing the number of thread blocks beyond that which a single kernel allows. On the Tesla C2050 GPU, the ratio of double precision ALUs to single precision ALUs has increased to 1:2; for every double precision ALU, there are 2 single precision ALUs [29]. In previous generations, this ratio was 1:8. This implies the double precision peak performance (515 GFlops/s [30]) is half that of single precision (1.015 TFlops/s [30]).

Figure 4: NVIDIA C2050 (Fermi) architecture.

GPU                                NVIDIA GTX 280    NVIDIA C2050 (Fermi)    ATI Radeon 5870
Compute units                      30                14                      20
Processing elements (per unit)     8                 32                      16

Table 3: Comparison of computational resources available on NVIDIA's GTX 280 and C2050 (Fermi) and ATI's Radeon 5870.

In addition to extra compute units, Fermi revamps the memory architecture of previous generation GPUs. Fermi's retooled memory system mainly features changes in caching. Global memory loads and stores are now fetched through the L1 cache, which shares hardware with shared memory. Shared memory and L1 can be configured as 48kB/16kB or 16kB/48kB respectively. The former configuration can increase occupancy in applications that use a large amount of shared memory, while the latter configuration can decrease global memory access latencies within a compute unit. Fermi also increases the number of registers (the other limiting factor in occupancy) available to applications to 128kB per compute unit [29]. The Tesla C2050 has 144GB/s of memory bandwidth.

4.2. Fast GEMM

We have previously published a fast GEMM algorithm for the Fermi architecture [31]. This work expanded on the work of Volkov and Demmel [14], who provided a high performance matrix multiply for NVIDIA's previous generation of GPUs. The new algorithm is available in the MAGMA library and will be included in CUBLAS 3.2. Both MAGMA's and Volkov's GEMM algorithms are written in CUDA. In this work, we rewrite the MAGMA algorithm in OpenCL, tune the implementation for the Fermi architecture, and compare the performance of the OpenCL implementation to that of CUDA. Additionally, we run the MAGMA algorithm on both NVIDIA and ATI GPUs, illustrating OpenCL's cross-platform design and examining the portability of the algorithm's performance.

4.3. From CUDA to OpenCL

Given the performance of MAGMA GEMM on Fermi, the intuitive fast OpenCL implementation is one translated from the CUDA code. While CUDA offers the ability to bind a texture to global memory for direct access, which is used in MAGMA GEMM, OpenCL does not. To access matrices through the texture cache, data in global memory must be explicitly copied into an image before it can be read by GPU kernels. Figure 5 shows the performance of 2 SGEMM kernels - one using images (requiring copies) and one that directly accesses global memory.

Figure 5: MAGMA SGEMM with/without texture (copying) in OpenCL.

Figure 6: MAGMA SGEMM with CUDA and OpenCL.

Figure 7: Bandwidth when copying from a global buffer to an image for float and double types, with and without transposition.

Our kernels for copying global buffers into images don't run efficiently on Fermi. We show the memory bandwidth utilization when copying an NxN matrix into an image in Figure 7. These kernels use less than 7% of the Tesla C2050's peak bandwidth. We found that using clEnqueueCopyBufferToImage improves performance fourfold. For our SGEMM using the texture cache, we need to copy both A and B, without transposition, into images with a single floating point alpha channel. For DGEMM we copy A and B into images with red and green single precision channels, the concatenation of which represents one double precision number in the matrix. High performance data copies are important for small matrix multiplications, where there is less work to amortize them.

Figure 6 shows the CUDA performance of MAGMA's Fermi GEMM with and without textures. Not using the texture cache leads to a 5% performance drop, which means that streaming data through textures on Fermi contributes little to performance. This leads to the idea of removing the image copies to save time, which counteracts the performance drop from not using textures. Figure 5 demonstrates this by also showing the performance of the texture-free version of the same code. The crossover point shows that the copying overhead is expensive and only pays off for larger problem sizes.
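A host-side sketch of staging a matrix into an image for texture access follows. It is our own illustrative fragment: the CL_A single-float channel format mirrors the description above but may have to be CL_R or CL_RGBA depending on device support, and ctx, queue, M, N and the source buffer d_A are assumed to exist.

    /* Create a 2D image large enough for an M x N float matrix and fill it
       from an existing global buffer so kernels can read it through the
       texture cache. */
    cl_image_format fmt = { CL_A, CL_FLOAT };        /* one float per texel */
    cl_int err;
    cl_mem imgA = clCreateImage2D(ctx, CL_MEM_READ_ONLY, &fmt,
                                  (size_t)M, (size_t)N, 0, NULL, &err);

    size_t origin[3] = { 0, 0, 0 };
    size_t region[3] = { (size_t)M, (size_t)N, 1 };  /* width, height, depth */
    /* The built-in copy path was roughly 4x faster than our hand-written
       copy kernels on Fermi. */
    err = clEnqueueCopyBufferToImage(queue, d_A, imgA,
                                     0,              /* byte offset into d_A */
                                     origin, region, 0, NULL, NULL);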

The way the OpenCL code is converted from CUDA is straightforward. Firstly, texture binding and fetching are removed from MAGMA's Fermi GEMM in CUDA. Then the code is translated into OpenCL using the keyword changes previously described. Most of the data I/O and computation instructions remain the same. The green line in figure 6 shows that the performance of this direct translation ramps up very slowly and drops by slightly more than 100 Gflop/s at large matrix sizes. We examine the PTX output from OpenCL and CUDA to explain the performance difference.

The kernel structure of MAGMA's Fermi GEMM is shown in figure 8. On the NVIDIA platform, both CUDA and OpenCL produce PTX, NVIDIA's intermediate representation between high-level code and device-specific assembly code. In the three yellow parts, ld.global.f32 is issued to load data from global memory, and in the blue parts, ld.shared.f32 is issued to load data from shared memory. When both operands A and B are in registers, FMA is issued to calculate C += A*B.

Figure 8: MAGMA GEMM structure.

By comparing the PTX generated by the CUDA and OpenCL compilers, the first major difference is that the CUDA compiler generates code that reads data from shared memory right before the computation that requires it. For example, the first blue and green boxes contain the following code:

    #pragma unroll
    for( int y=0; y<6; y++ )
        Bxp[y] = Bb[j1][res+y*16];
    #pragma unroll
    for( int y=0; y<6; y++ )
        Axs[y] = Abs[qot+y*16][j1];
    #pragma unroll
    for( int x=0; x<6; x++ ) {
        #pragma unroll
        for( int y=0; y<6; y++ )
            Cb[x*6+y] += Axs[x]*Bxp[y];
    }

The CUDA compiler's resultant output is:

    ld.shared.f32 %f508, [%rd53+3276];
    mov.f32       %f509, %f391;
    fma.rn.f32    %f510, %f508, %f446, %f509;
    mov.f32       %f511, %f510;
    mov.f32       %f512, %f394;
    fma.rn.f32    %f513, %f508, %f450, %f512;
    mov.f32       %f514, %f513;
    mov.f32       %f515, %f397;
    ......
    fma.rn.f32    %f525, %f508, %f466, %f524;
    mov.f32       %f526, %f525;

Here the shared memory data at address %rd53+3276 is fetched into register %f508 and 6 FMAs are issued to consume this value. This pattern allows better overlap of I/O and computation, and therefore the overhead of slow I/O can be better hidden through pipelining. The OpenCL compiler emits different PTX. The following is the equivalent PTX for the above C code:

    ld.shared.f32 %f315, [%r120+1088];
    ld.shared.f32 %f316, [%r120+2176];
    ......
    ld.shared.f32 %f326, [%r184];
    add.s32       %r184, %r184, 388;
    fma.rn.f32    %f375, %f319, %f325, %f375;
    fma.rn.f32    %f376, %f319, %f324, %f376;
    ......
    fma.rn.f32    %f386, %f318, %f326, %f386;

This PTX follows the C code directly: all the loads from shared memory are followed by all the FMAs. While executing in a loop, this pattern can stall GPU cores waiting for all the data to load from shared memory.

Another PTX discrepancy is that the OpenCL compiler does not unroll the outermost loop despite the unroll pragma. In order to make a fair comparison, we manually unrolled all the inner loops so that the unrolling pragma works on the outermost loop. The performance of this code is shown in figure 6 as 'fully unrolled'. After seeing drastic drops in performance, we found that the OpenCL compiler unrolled the outermost loop and grouped different iterations' ld.shared.f32 instructions, followed by many FMA operations. Similar results have been observed with ATI's OpenCL compiler. This grouping action further exacerbates stalls and explains the low performance. To demonstrate this claim, different combinations of manual unrolling and pragma locations were tested to counteract the OpenCL compiler's tendency of grouping similar operations. The best solution we found is putting the two blue and green parts together, forming one 16-iteration loop rather than two 8-iteration loops, which is then unrolled using a pragma with a factor of 15 (hence the name '16-1 unrolling'). This incomplete unrolling tricks the compiler into laying out the code in small groups of ld.shared.f32 and fma.rn.f32 instructions. This lifts the performance to within 50 Gflop/s of the original CUDA code, as shown in figure 6. If the compiler collected the ld.shared.f32 and fma.rn.f32 instructions into small groups interleaved with execution, the performance of the OpenCL code for this and similar operations should closely match the CUDA performance. Our ultimate objective is to elevate the performance of the sans-texture kernel to that of the CUDA non-texture kernel, and to use the texture-based kernel for large problems where the copy costs can be amortized.

5. ATI Platform

5.1. Radeon 5870 Architecture

ATI's Evergreen architecture has many analogues with NVIDIA GPUs. Like Fermi-based cards, the Radeon 5xxx GPUs have copious memory bandwidth and many parallel computation units to exploit parallelism. Our work focuses on the Radeon 5870.

The Radeon 5870 GPU has 1600 ALUs organized in groups (fig. 9). This graphics processor has 20 compute units, each of which contains 16 stream cores. Each stream core within a compute unit executes an instance of a kernel in lockstep. This SIMD hierarchy is analogous to NVIDIA's SIMT engines. However, unlike NVIDIA's Fermi architecture, each stream core is a 5-ALU Very Long Instruction Word (VLIW) processor.

Figure 9: Radeon 5870 architecture.

Threads are grouped and interleaved on compute units. Threads are interleaved to hide memory access latency. They are grouped into sets of 64 called a wavefront. A wavefront is analogous to a warp on NVIDIA hardware. Of these 64 threads, 16 execute on a given clock cycle on the 16 stream cores within a compute unit. Over the course of 4 clock cycles, all threads are interleaved, each executing an instruction. This hides memory latency; while one thread is loading data, another thread can be performing computation [32].

Each ALU in a stream core can independently perform basic floating point operations. In single precision, each ALU can issue one basic floating point instruction, such as subtract, multiply, Multiply-Add (MAD), etc., per clock cycle with a pipeline latency of 8 cycles. In addition to the basic floating point instructions (such as multiply, add, subtract, divide, and MAD), the fifth ALU can perform transcendental functions including log, square root, etc. To perform double precision floating point operations, some number of the ALUs are fused together. In the case of Fused Multiply-Add (FMA), four of the five ALUs are used per operation [32]. The fifth unit is free to do integer or single precision calculations during this time. Since four ALUs are fused per operation and the fifth does not perform double precision operations, the peak double precision throughput is 1/5 that of single precision. The Radeon 5870's 1600 ALUs run at 850MHz, yielding a peak throughput of 2.72TFlops/s in single precision and 544GFlops/s in double. Like NVIDIA's offerings, the Radeon 5870 has high off-chip memory bandwidth and even higher on-chip bandwidth.

To balance the performance of the floating point units, the Radeon 5870 features high bandwidth to global memory augmented with a texture cache. Global memory has a peak data throughput of 154GB/s divided over 8 memory controllers. Unlike the Fermi architecture, reads and writes to global memory are generally not cached. However, reads and writes to textures are cached. Each compute unit has its own 8KB L1 cache, yielding an aggregate bandwidth of 1TB/s, and can produce 4 bytes of data (1 float or half a double) per cycle per stream core. Multiple compute units share a 512kB L2 cache with 435GB/s of bandwidth between L1 and L2. In addition to the automatically controlled texture caches, data reuse can also be facilitated using shared memory [32].

The Radeon 5870 features 32KB of shared memory per compute unit. This shared memory provides 2TB/s of aggregate bandwidth. As with the Fermi architecture, local memory usage dictates how many concurrent wavefronts can run on a compute unit. Each compute unit can produce two 4-byte (2 floats or 1 double) shared memory requests per cycle. As with NVIDIA cards, bank conflicts can hurt shared memory performance and need to be minimized. Since each stream core can perform 5 MADs per cycle in single precision, more bandwidth is needed for operands than shared memory can provide. This is also true for double precision. As such, register blocking becomes crucial in obtaining a high performance GEMM kernel [32].

Registers provide the highest memory bandwidth on-chip. The Radeon 5870 has 256KB of register space per compute unit (5.1MB for the whole GPU) and can produce 48 bytes/cycle of data per stream core. Results generated on the previous cycle and used as operands don't count towards this limit, as they can be forwarded using the Previous Vector or Previous Scalar register [32]. Each of the 5 MADs per cycle takes 4 operands, yielding 20 total operands of 4 bytes a piece. 4 of these operands can be covered by the previous vector register, and one operand can be shared among the 5 ALUs (saving another 4 reads), leaving 12 operands, or exactly the 48 bytes/cycle needed to fully utilize all 5 ALUs in a perfectly scheduled SGEMM. If the scheduling is not perfect, registers can actually serve as a bottleneck in the computation. For DGEMM, registers provide sufficient bandwidth for the single FMA instruction per stream core regardless of operand reuse. Like shared memory, register usage dictates the number of concurrent wavefronts executing on a compute unit. Unlike NVIDIA GPUs, registers are 128-bit and support swizzling at no performance cost.

5.2. Fast GEMM

We present fast GEMM algorithms for single and double precision optimized for ATI hardware. These kernels are based on Nakasato's matrix multiply written in ATI's Intermediate Language (IL) [33].

Figure 10: ATI memory bandwidth utilization during image copies.

Figure 11: ATI DGEMM algorithm.

That work focused on computing the row major matrix multiplication C = A^T B. We expanded it to perform the column major C = αAB^T + βC. Our algorithm makes extensive use of the texture cache, which requires copies in OpenCL. We use custom copy kernels to move data into images configured with RGBA color mode and float values. We pad the leading dimension of the image with zeros if needed. For single and double precision, we pad the leading dimension of the image to a multiple of 4 floats and 2 doubles, respectively. In double precision, the concatenation of the red and green channels represents the first double, while the blue and alpha channels represent the second double. A is copied without transposition into its corresponding padded image while B is transposed and copied into its image. The time to perform these copies is O(N^2), which is amortized by the O(N^3) operations in GEMM for large problems. Copying A into an image uses the Radeon 5870's memory bandwidth efficiently (fig. 10), while copying B doesn't.

The kernels copying data from global memory into images achieve a high fraction of the Radeon 5870's available bandwidth. For non-transposed copies, our kernels used over 100GB/s of the 154GB/s of available bandwidth with little variance. Our transpose-and-copy kernels achieved half that amount and were far more sensitive to problem size. Poor memory coalescing in the transpose kernels is to blame for the low memory throughput. These kernels, and the ones needed for texture use in the NVIDIA texture-based kernel, ran an order of magnitude more quickly on the Radeon 5870 than on the Tesla C2050. Oddly enough, our float and double copy kernels were significantly faster than clEnqueueCopyBufferToImage on the Radeon 5870.

Once A and B have been copied, the matrix multiplication kernel executes. Our DGEMM algorithm is shown in fig. 11. Each thread computes a single 4x4 block of C as double2s. Since B is transposed, the columns of B reside in its leading dimension. The rows of A reside in its leading dimension. We scale the double2 row vectors of A by the corresponding double of B, using swizzling to extract and duplicate the required data element. This scaled vector is then accumulated into the corresponding double2 of C. All of this is done with a single MAD and swizzling.

To maximize the number of outstanding loads from the texture cache, we use 8 samplers. 2 samplers load 2 double2s from A into registers and 2 samplers fetch 2 double2s of B. We unroll the k loop twice to use the other 4 samplers. A useful feature of images is that when a sampler is declared using CL_ADDRESS_CLAMP, fetching outside the bounds of the 2D image yields 0.0, which has no effect on the result when accumulated. This allows k to be indivisible by 2 yet still function correctly; in this case, our DGEMM kernel samples beyond the bounds of A and B and accumulates 0.0 into the result. Handling cases where m is not a multiple of 2 is non-trivial and our algorithm currently doesn't handle this case. In fact, our algorithm requires both m and n to be multiples of 4. Padding C can overcome this limitation. SGEMM is analogous to our DGEMM kernel, where each thread computes an 8x8 block of C in float4 registers, and it has analogous limitations.
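Inside a kernel, the clamped sampler and the two-floats-per-double packing described above can be expressed roughly as follows. This is an illustrative kernel of ours, not the actual GEMM, and it assumes the device supports the cl_khr_fp64 extension.

    #pragma OPENCL EXTENSION cl_khr_fp64 : enable

    /* Out-of-bounds reads return 0.0 with this addressing mode, which is
       harmless once the value is accumulated into C. */
    const sampler_t samp = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP |
                           CLK_FILTER_NEAREST;

    __kernel void read_packed_doubles(__read_only image2d_t A,
                                      __global double2 *out, int col)
    {
        int row = get_global_id(0);
        /* One RGBA float4 texel carries two doubles: (r,g) and (b,a). */
        float4 t = read_imagef(A, samp, (int2)(col, row));
        out[row] = as_double2(t);   /* reinterpret the 16 bytes as a double2 */
    }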


Figure 12: ATI SGEMM performance.

Figure 13: ATI DGEMM performance.

We compare our OpenCL results to Nakasato's IL performance in figures 12 and 13. Nakasato's timings do not include copy times, which exaggerates performance for small problems. Our timings do include this copy. Furthermore, Nakasato's algorithm performs only the matrix multiplication while we perform the alpha and beta scaling (a negligible amount of time for large N). Neither timing includes PCIe data transfer times, which would be amortized in large multiplications or when data can be heavily reused on the GPU.

Our SGEMM kernel exceeds 1.3TFlops/s for N=3840, 4352, and 4864. Nakasato's IL implementation just exceeds 2TFlops/s for these same N, implying our OpenCL MAGMA implementation achieves 65% of the performance of a fast matrix multiply algorithm written in high-level assembly code. Nakasato achieves 74% of the Radeon 5870's peak performance while we achieve 49%.

Our OpenCL ATI DGEMM algorithm achieves 308GFlops/s on the Radeon 5870. In comparison, Nakasato's matrix multiply algorithm achieves 472GFlops/s. This means that our OpenCL implementation has 69% of the performance of Nakasato's IL matrix multiply. The ATI OpenCL kernel computes at 57% of the hardware's peak performance while Nakasato's kernel operates at 87% of maximum throughput. From this performance comparison, we illustrate that OpenCL provides fair performance with a high degree of programmability on ATI hardware. Furthermore, we found that the relevant data copy and pack kernels effectively used the Radeon 5870's memory bandwidth (fig. 10).

6. Performance Portability

In this section, we run our device-specific kernels on hardware for which they aren't tuned, to evaluate performance portability. OpenCL is designed with program portability in mind. Despite different vendors having added extra functionality to their OpenCL implementations, our work only uses features in the OpenCL standard [12]. This theoretically allows the front-end and GPU kernels to run on any platform without changes.

Figure 14: Fast ATI SGEMM running on the Fermi card in OpenCL.

Figure 15: Fast Fermi SGEMM running on the ATI Radeon 5870 card in OpenCL.

Figure 14 shows our ATI SGEMM kernel running on a Tesla C2050. While achieving 1+ Teraflop/s (50+% of peak) on a Radeon 5870, it only manages to execute at 40 Gflop/s (4% of peak) on the Tesla C2050. We reverse this experiment in figure 15 and run the OpenCL version of MAGMA's Fermi GEMM on the Radeon 5870. While achieving 400+ Gflop/s (40+% of peak) on the Tesla C2050, it has very low performance when run on the Radeon 5870. Through reading the generated IL, we suspect that despite hints that the arrays holding the operands should live in registers, they actually reside in global memory, leading to orders of magnitude more data fetch latency. We suspect this is because the compiler fails to fully unroll the loops accessing these arrays and as such generates code that performs runtime indexing (which can't occur on registers). To resolve this, the arrays are replaced with standalone variables and performance shoots up to 600 Gflop/s (22% of peak).
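The kind of rewrite involved is illustrated by the pair of toy kernels below (our own example, far simpler than the real GEMM): the first keeps an accumulator in a private array indexed inside a loop, the second uses standalone variables.

    __kernel void axpy4_array(__global float4 *c, float4 a, float b)
    {
        size_t i = get_global_id(0);
        float acc[4] = { c[i].x, c[i].y, c[i].z, c[i].w };
        float av[4]  = { a.x, a.y, a.z, a.w };
        for (int k = 0; k < 4; k++)   /* runtime indexing may spill acc[] to memory */
            acc[k] += av[k] * b;
        c[i] = (float4)(acc[0], acc[1], acc[2], acc[3]);
    }

    __kernel void axpy4_scalars(__global float4 *c, float4 a, float b)
    {
        size_t i = get_global_id(0);
        float c0 = c[i].x, c1 = c[i].y, c2 = c[i].z, c3 = c[i].w;
        c0 += a.x * b;                /* standalone variables stay in registers */
        c1 += a.y * b;
        c2 += a.z * b;
        c3 += a.w * b;
        c[i] = (float4)(c0, c1, c2, c3);
    }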
From these two experiments we show that performance is not portable by simply using OpenCL. Without paying attention to the underlying architecture and designing algorithms accordingly, an algorithm's performance suffers.

7. Performance Portability with Auto-tuning

The goal behind the OpenCL standard is to provide functional portability, or in other words, to enable a single OpenCL application to run across a variety of hardware platforms. Although extremely important, functional portability by itself, i.e., without performance portability, would be insufficient to establish OpenCL in the area of high-performance scientific computing. In this section we address this issue by discussing auto-tuning – the vehicle that we recognize as the potential driver of OpenCL applications towards performance portability.

Automatic performance tuning (optimization), or auto-tuning for short, is a technique that has been used intensively on CPUs to automatically generate near-optimal numerical libraries. For example, ATLAS [19, 20] and PHiPAC [21] are used to generate highly optimized BLAS. The main approach for doing auto-tuning is based on empirical optimization techniques.

Namely, these are techniques that generate a large number of parametrized code variants for a given algorithm and run these variants on a given platform to discover the one that gives the best performance. The effectiveness of empirical optimization depends on the chosen parameters to optimize and on the search heuristic used. A disadvantage of empirical optimization is the time cost of searching for the best code variant, which is usually proportional to the number of variants generated and evaluated. Therefore, a natural idea is to combine it with a "model-driven" approach in a first stage that limits the search space for the second, empirical, stage of the search.

Work on auto-tuning CUDA kernels for NVIDIA GPUs [22, 23] has already shown that the technique is a very practical solution to easily port existing algorithmic solutions onto quickly evolving GPU architectures and to substantially speed up even hand-tuned kernels. We expand this early work, as described below, in the context of today's high-end GPGPUs from NVIDIA and ATI, using both CUDA and OpenCL.

7.1. Auto-tuning Infrastructure

The performance of CUDA GEMM implementations relies on a number of very well selected parameters and optimizations [18]. Previous work in the area has managed to auto-tune the selection of these parameters and optimizations in order to quickly find the best performing implementations for particular cases of GEMM [22, 23]. However, with the introduction of the Fermi architecture, these auto-tuning frameworks were not able to find the new "optimal" implementations for Fermi, simply because their search space did not consider the newly introduced features of the architecture [31]. Performance portability problems are aggravated even further when porting kernels across hardware vendors: kernels optimized for one vendor's architecture perform poorly on another vendor's architecture, as illustrated throughout the paper with the GEMM kernels optimized correspondingly for NVIDIA and ATI GPUs. Therefore, our work on providing performance portability has concentrated on building an auto-tuning infrastructure with the following two components (characteristic of a complete auto-tuning system):

Code generator: The code generator produces code variants according to a set of pre-defined, parametrized templates and/or algorithms. Currently we have identified and collected the best GEMM candidates for both NVIDIA and ATI GPUs, and we have identified several key parameters that affect performance. The code generator automatically creates kernels using these parameters and applying certain state-of-the-art optimization techniques.

Heuristic search engine: The heuristic search engine runs the variants produced by the code generator and discovers the best one using a feedback loop, e.g., the performance results of previously evaluated variants are used as guidance for the search over currently unevaluated variants.

7.2. Tuning GEMM

To illustrate the effect of auto-tuning we present numerical results from tuning SGEMM. In particular, we parametrize the new Fermi GEMM kernels [31]. Figure 16 gives the single precision performance of several versions derived from the original and run on a Fermi C2050 GPU. The parametrization is based on the size of the submatrix computed by a thread block.

Figure 16: Performance of various automatically generated SGEMM kernels for the Fermi C2050 GPU.
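A condensed sketch of the empirical search stage is shown below. It is exhaustive over a small grid rather than using the heuristic feedback loop described above, and it assumes ctx, dev, a profiling-enabled queue, the parametrized kernel source string src, the argument setup, and the launch geometry gws/lws already exist; the BLK_M/BLK_N macro names and the kernel name sgemm are ours.

    double best_time = 1e30;
    int best_m = 0, best_n = 0;
    static const int blk[] = { 16, 32, 48, 64, 80, 96 };

    for (int i = 0; i < 6; i++)
      for (int j = 0; j < 6; j++) {
        char opts[128];
        snprintf(opts, sizeof(opts), "-DBLK_M=%d -DBLK_N=%d", blk[i], blk[j]);

        cl_program p = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        if (clBuildProgram(p, 1, &dev, opts, NULL, NULL) != CL_SUCCESS) {
            clReleaseProgram(p);            /* skip variants that fail to build */
            continue;
        }
        cl_kernel k = clCreateKernel(p, "sgemm", NULL);
        /* ... clSetKernelArg calls for this variant ... */
        cl_event ev;
        clEnqueueNDRangeKernel(queue, k, 2, NULL, gws, lws, 0, NULL, &ev);
        clWaitForEvents(1, &ev);

        cl_ulong t0, t1;                    /* device timestamps in nanoseconds */
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,   sizeof(t1), &t1, NULL);
        double t = (t1 - t0) * 1e-9;
        if (t < best_time) { best_time = t; best_m = blk[i]; best_n = blk[j]; }

        clReleaseEvent(ev); clReleaseKernel(k); clReleaseProgram(p);
      }

A model-driven first stage, as mentioned above, can prune most of this grid before any kernel is ever run.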

8. Performance Results on Multicore

So far we have presented performance results for the GPU platforms from ATI and NVIDIA that we were able to obtain using the CUDA and OpenCL software stacks. To put these numbers in a proper perspective, it is informative to compare them to the results obtained from optimized BLAS routines coming from ATLAS and Intel's MKL (Math Kernel Library) running on multicore hardware – a well researched experimental setting. We use ATLAS as a representative of the performance levels achievable from C with extensive use of auto-tuning (the tuning of the ATLAS library may take many hours). Intel's MKL represents one of the best performing libraries for Intel processors. When such libraries are distributed by the hardware vendor, most performance sensitive kernels are written in optimized assembly code. The results from the Intel Tigerton multicore system are shown in Figure 17. The tested system had 4 processors, each one a quad-core clocked at 2.4 GHz, for a total peak performance of 307.2 Gflop/s in single precision and half of that in double precision.

Figure 17: Performance of xGEMM routines on a multicore processor with 16 cores (Intel Tigerton, 4x quad-core).

The summary of the results is presented in Table 4 in relative terms (as the percentage of peak performance) to allow easy comparison of the tested platforms regardless of their absolute achieved performance. A common theme may be observed from this summary: computational kernels written in a low-level language (such as ATI's IL and assembly) achieve around 80% of peak performance while high-level languages achieve about 50% of peak.

                         Single precision    Double precision
ATI OpenCL               52%                 57%
ATI IL                   74%                 87%
NVIDIA CUDA              63%                 58%
NVIDIA OpenCL            57%                 49%
Multicore (vendor)       79%                 79%
Multicore (auto-tuned)   46%                 40%

Table 4: Comparison of efficiency achieved on the tested hardware.

9. Conclusions

In this paper, we evaluated various aspects of using OpenCL as a performance-portable method for GPGPU application development. Profiling results show that the environment setup overhead is large and should be minimized. Performance results for both the Tesla C2050 and the Radeon 5870 show that OpenCL has good potential to be used to implement high performance kernels, so long as architectural specifics are taken into account in the algorithm design. Even though good performance should not be expected from blindly running algorithms on a new platform, auto-tuning heuristics can help improve performance on a single platform. Putting these factors together, we conclude that OpenCL is a good choice for delivering a performance-portable application for multiple GPGPU platforms.

References

[1] E. S. Larsen, D. McAllister, Fast matrix multiplies using graphics hardware, in: Proceedings of Supercomputing 2001, Denver, CO, 2001.
[2] Á. Moravánszky, Dense matrix algebra on the GPU, available at: http://www.shaderx2.com/shaderx.PDF (2003).
[3] J. Krüger, R. Westermann, Linear algebra operators for GPU implementation of numerical algorithms, ACM Transactions on Graphics 22 (2003) 908–916.
[4] J. D. Hall, N. A. Carr, J. C. Hart, Cache and bandwidth aware matrix multiplication on the GPU, Tech. Rep. UIUCDCS-R-2003-2328, UIUC (2003).
[5] K. Fatahalian, J. Sugerman, P. Hanrahan, Understanding the efficiency of GPU algorithms for matrix-matrix multiplication, in: Graphics Hardware 2004, 2004, pp. 133–137.
[6] N. Galoppo, N. K. Govindaraju, M. Henson, D. Manocha, LU-GPU: Efficient algorithms for solving dense linear systems on graphics hardware, in: Proceedings of Supercomputing 2005, Seattle, WA, 2005.
[7] NVIDIA, NVIDIA CUDA Programming Guide, Version 3.0, NVIDIA Corporation, 2010.
[8] NVIDIA, NVIDIA CUDA Best Practices Guide, Version 3.0, NVIDIA Corporation, 2010.
[9] S. Tomov, R. Nath, P. Du, J. Dongarra, MAGMA version 0.2 User Guide, http://icl.cs.utk.edu/magma (11/2009).
[10] E. Anderson, Z. Bai, C. Bischof, S. L. Blackford, J. W. Demmel, J. J. Dongarra, J. D. Croz, A. Greenbaum, S. Hammarling, A. McKenney, D. C. Sorensen, LAPACK User's Guide, Third Edition, Society for Industrial and Applied Mathematics, Philadelphia, 1999.
[11] R. J. Rost, D. Ginsburg, B. Licea-Kane, J. M. Kessenich, B. Lichtenbelt, H. Malan, M. Weiblen, OpenGL Shading Language, 3rd Edition, Addison-Wesley Professional, 2009.
[12] A. Munshi (Ed.), The OpenCL Specification, Khronos OpenCL Working Group, 2009, version 1.0, Document Revision 48.
[13] M. Papadopoulou, M. Sadooghi-Alvandi, H. Wong, Micro-benchmarking the GT200 GPU, Tech. rep., Computer Group, ECE, University of Toronto (2009).
[14] V. Volkov, J. Demmel, Benchmarking GPUs to tune dense linear algebra, in: Supercomputing 08, IEEE, 2008.
[15] NVIDIA, NVIDIA Compute PTX: Parallel Thread Execution, 1st Edition, NVIDIA Corporation, Santa Clara, California, 2008.
[16] A. Kerr, G. Diamos, S. Yalamanchili, A characterization and analysis of PTX kernels, in: Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC), 2009, pp. 3–12.
[17] J. Stratton, S. Stone, W.-m. Hwu, MCUDA: An efficient implementation of CUDA kernels on multi-cores, Tech. Rep. IMPACT-08-01, University of Illinois at Urbana-Champaign, http://www.gigascale.org/pubs/1278.html (Mar. 2008).
[18] M. Wolfe, Special-purpose hardware and algorithms for accelerating dense linear algebra, HPC Wire, http://www.hpcwire.com/features/33607434.html.
[19] R. C. Whaley, A. Petitet, J. J. Dongarra, Automated empirical optimization of software and the ATLAS project, Parallel Computing 27 (1–2) (2001) 3–35.
[20] J. Demmel, J. Dongarra, V. Eijkhout, E. Fuentes, A. Petitet, R. Vuduc, R. C. Whaley, K. Yelick, Self-adapting linear algebra algorithms and software, Proc. IEEE 93 (2) (2005) 293–312. doi:10.1109/JPROC.2004.840848.
[21] J. Bilmes, K. Asanovic, C.-W. Chin, J. Demmel, Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology, in: ICS '97: Proceedings of the 11th International Conference on Supercomputing, ACM, New York, NY, USA, 1997, pp. 340–347. doi:10.1145/263580.263662.
[22] Y. Li, J. Dongarra, S. Tomov, A note on auto-tuning GEMM for GPUs, in: ICCS 2009, Vol. 5544 of Lecture Notes in Computer Science, Springer Berlin / Heidelberg, 2009, pp. 884–892.
[23] R. Nath, S. Tomov, J. J. Dongarra, Accelerating GPU kernels for dense linear algebra, in: Proceedings of VECPAR'10, Berkeley, CA, 2010.
[24] R. Weber, A. Gothandaraman, R. J. Hinde, G. D. Peterson, Comparing hardware accelerators in scientific applications: A case study, IEEE Transactions on Parallel and Distributed Systems 99 (RapidPosts). doi:10.1109/TPDS.2010.125.
[25] ATI, ATI Stream SDK v2.1, available at: http://developer.amd.com/gpu/ATIStreamSDK/Pages/default.aspx (2010).
[26] NVIDIA, NVIDIA OpenCL JumpStart Guide: Technical Brief, version 0.9 (April 2009).
[27] P. Du, P. Luszczek, S. Tomov, J. Dongarra, Mixed-tool performance analysis on hybrid multicore architectures, in: Proceedings of the First International Workshop on Parallel Software Tools and Tool Infrastructures (PSTI 2010), San Diego, CA, 2010.
[28] Tuning CUDA Applications for Fermi, Version 1.0, available at: http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/NVIDIA_FermiTuningGuide.pdf (Feb. 2010).
[29] NVIDIA, NVIDIA's next generation compute architecture: Fermi, v1.1, Tech. rep., NVIDIA Corporation (2009).
[30] NVIDIA, Tesla C2050/C2070 GPU Computing Processor: Supercomputing at 1/10 the Cost, available at: http://www.microway.com/pdfs/Tesla_C2050_C2070_Final_lores.pdf (2010).
[31] R. Nath, S. Tomov, J. Dongarra, An improved MAGMA GEMM for Fermi GPUs, Tech. Rep. 227, LAPACK Working Note (July 2010). URL: http://www.netlib.org/lapack/lawnspdf/lawn227.pdf
[32] ATI, ATI Stream Computing OpenCL Programming Guide, available at: http://developer.amd.com/gpu/ATIStreamSDK/assets/ATI_Stream_SDK_OpenCL_Programming_Guide.pdf (June 2010).
[33] N. Nakasato, Matrix Multiply on GPU, http://galaxy.u-aizu.ac.jp/trac/note/wiki/MatrixMultiply.
