Matrix Multiplication Beyond Auto-Tuning: Rewrite-based GPU Code Generation

Michel Steuwer, Toomas Remmelg, Christophe Dubach
University of Edinburgh
{michel.steuwer, toomas.remmelg, christophe.dubach}@ed.ac.uk

ABSTRACT
Graphics Processing Units (GPUs) are used as general purpose parallel accelerators in a wide range of applications. They are found in most computing systems, and mobile devices are no exception. The recent availability of programming APIs such as OpenCL for mobile GPUs promises to open up new types of applications on these devices.
However, producing high performance GPU code is extremely difficult. Subtle differences in device characteristics can lead to large performance variations when different optimizations are applied. As we will see, this is especially true for a mobile GPU such as the ARM Mali GPU, which has a very different architecture than desktop-class GPUs. Code optimized and tuned for one type of GPU is unlikely to achieve the full performance potential on another type of GPU.
Auto-tuners have traditionally been an answer to this performance portability challenge. For instance, they have been successful on CPUs for matrix operations, which are used as building blocks in many high-performance applications. However, they are much harder to design for different classes of GPUs, given the wide variety of hardware characteristics.
In this paper, we take a different perspective and show how performance portability for matrix multiplication is achieved using a compiler approach. This approach is based on a recently developed generic technique that combines a high-level programming model with a system of rewrite rules. Programs are automatically rewritten in successive steps, where optimization decisions are made. This approach is truly performance portable, resulting in high-performance code for very different types of architectures such as desktop and mobile GPUs. In particular, we achieve a speedup of 1.7x over a state-of-the-art auto-tuner on the ARM Mali GPU.

CASES '16, October 01-07, 2016, Pittsburgh, PA, USA. © 2016 Copyright held by the owner/author(s). Publication rights licensed to ACM. ISBN 978-1-4503-4482-1/16/10. DOI: http://dx.doi.org/10.1145/2968455.2968521

1. INTRODUCTION
Graphics Processing Units (GPUs) have emerged as powerful general-purpose parallel accelerators. They have revolutionized the high-performance computing landscape and are about to bring big changes to mobile devices. Programming APIs such as OpenCL or RenderScript are now supported on most mobile GPUs, and new types of mobile applications are emerging, such as real-time 3D scene reconstruction [17].
However, producing high-performance GPU code is notoriously hard. Low-level hardware features are directly exposed to programmers, requiring expert knowledge to achieve high performance. In addition, each type of device comes with its own performance characteristics, requiring different optimizations. This problem is further exacerbated on mobile GPUs since optimizations beneficial for desktop GPUs (e.g., AMD or Nvidia GPUs) can negatively impact performance on mobile GPUs, as we will see later in this paper.
Auto-tuners have been proposed to address performance portability issues on GPUs. They are generally based on a specialized parametric implementation of a computational kernel, such as matrix multiplication, and the tuning process explores the performance space on the targeted hardware. However, auto-tuners have two major drawbacks. First, writing the parametric implementation for a given kernel requires non-negligible effort from the programmer. Secondly, and more importantly, the implementation is limited by a finite set of parameters which might not be good at expressing complex compositions of optimizations. As we will see, this can result in far from optimal performance when the parametric implementation is run on a device it was not originally designed for. In other words, auto-tuning alone is not sufficient to solve the performance portability challenge.
We argue that achieving true performance portability requires a more generic mechanism that expresses combinations of optimizations beyond a fixed parametric space. We advocate the use of a recently developed high-performance code generation technique based on rewrite rules [23]. Programs are expressed in a high-level functional programming model which shields the programmer from hardware peculiarities. The compiler is then free to automatically explore the optimization space using a system of rewrite rules. These rules encode algorithmic transformations as well as hardware-specific low-level optimizations. Recent work [22] has shown that this generic compiler approach leads to high performance for desktop-class GPUs from AMD and Nvidia.
In this paper, we demonstrate that this compiler-based technique is able to succeed where auto-tuners fail to deliver, using matrix multiplication as a use-case. Matrix multiplication is a well studied and useful primitive found at the heart of many numerical codes and algorithms in areas such as machine learning. In addition, there exist high-performance reference implementations and specialized auto-tuners, which allow for a meaningful comparison.
Using the ARM Mali GPU as an example, we show that an auto-tuner designed primarily for desktop-class GPUs is unable to achieve the full performance potential, resulting in a 40% performance loss. In contrast, our compiler-based approach delivers performance on par with the best hand-tuned version on each of the three platforms tested. This is possible due to the generic nature of the rewrite-based code generation technique, which allows us to encode generic optimizations that are combined during the exploration process. This includes vectorization and the use of built-in functions, which are highly beneficial for the Mali GPU.

[Figure 1 plot: three panels, Desktop GPU (Nvidia), Desktop GPU (AMD), Mobile GPU (ARM); y-axis GFLOPS; per platform one bar for CLBlast+CLTune and one for the reference implementation (MAGMA, clBLAS, hand-optimized, respectively).]
Figure 1: Performance comparison between auto-tuned (left bar) and hand-optimized (right bar) code. Higher is better.

To summarize, this paper makes the following contributions:
• We demonstrate the limitations of auto-tuning when applied on a different class of GPUs;
• We present how generic optimizations beneficial for the Mali GPU are expressed in a rewrite-based generator;
• Our experimental results show that a rewrite-based approach is performance portable and even outperforms hand-tuned code on Mali.

The rest of the paper is organized as follows. The next section shows that performance is far from portable between different classes of GPUs. Section 3 presents characteristics of the Mali GPU. Section 4 introduces the high-level functional language and the rewrite-based code generator we adopted. Section 5 discusses optimizations for matrix multiplication, how they are represented functionally and how they are encoded as rewrite rules. Sections 6 and 7 present our experimental setup and show results on how we automatically achieve high performance from a portable, high-level representation of matrix multiplication. Finally, sections 8–10 discuss our work and related work, and conclude the paper.

2. MOTIVATION
Matrix multiplication is probably one of the most studied kernels in the high-performance community. Automatic tuning techniques have been applied quite successfully to this benchmark for over 20 years, starting with PHiPAC [4] and ATLAS [25]. However, auto-tuners rely on a parametric implementation (or a parametric code generator) that is highly specialized to the target machine. This approach is well-suited in cases where little variation exists between different processing units but falls short when the target processing units exhibit significant variations. We illustrate this problem using the CLBlast library auto-tuned using CLTune, a state-of-the-art auto-tuner which has been shown [18] to achieve competitive performance on several GPUs.
Figure 1 shows the performance achieved by CLBlast on three different platforms: two desktop-class GPUs (Nvidia and AMD) and one mobile GPU (ARM Mali). For each platform, we compare the performance of the auto-tuner with the best open source reference implementation available: MAGMA [5] on Nvidia, clBLAS on AMD and code written and optimized by ARM's engineers [8] on the Mali GPU. The auto-tuner is able to achieve significant performance on both desktop GPUs, clearly beating the hand-written MAGMA and slightly outperforming AMD's clBLAS.
However, the auto-tuner is unable to achieve the full performance potential on the mobile GPU, resulting in a 40% performance loss. This shortfall is explained by the fact that CLBlast has been primarily designed for desktop-class GPUs and includes optimizations that are beneficial on these machines but detrimental on the Mali GPU. While it is conceptually not difficult to realise what needs to be done to reach a higher level of performance for some specific machine, it is extremely hard to write a parametric kernel which exposes these choices as a finite set of parameters. Especially given that a library enabled for auto-tuning, such as CLBlast, is already quite complex, with more than 1500 lines of parametric OpenCL code just for matrix multiplication.
What is needed is an approach that easily combines optimizations and produces a search space that includes the best performing implementations for different types of hardware. In this paper we propose to use a generic rewrite-based approach [23], which is not specific to matrix multiplication, and we show that it succeeds where the auto-tuner fails. This approach is also simpler to use since the compiler input is a high-level functional program. For instance, matrix multiplication is expressed in just five lines of code.

3. MALI GPU CHARACTERISTICS

ARM Mali-T628 GPU.
The Mali-T628 GPU is a mobile GPU implementing ARM's second generation Midgard micro-architecture. Each core has two arithmetic pipelines, each of which processes 128 bits of data at a time using SIMD operations. A single core can simultaneously manage up to 256 threads in hardware, depending on the amount of registers required by each thread. This large number of threads is used to hide memory latencies, as stalled threads waiting for memory can be overtaken by other threads.

Mali GPU Performance Characteristics.
Due to its hardware design, certain optimizations are crucial for achieving high performance on the Mali GPU. Optimization techniques are discussed in the ARM documentation [2] as well as in [7] and [8]. These techniques are not aligned and quite often contradictory with advice given by AMD or Nvidia for their GPUs. This leads to a large performance portability gap when executing kernels optimized for a desktop GPU on the Mali GPU, or the other way around, as we will see in the evaluation (section 7).
Vectorization is one of the most important optimizations given the SIMD architecture of the Mali GPU. OpenCL supports vectorization through the use of vector data types (e.g., float4). Arithmetic operations performed on values of vector data types are executed by the hardware SIMD units. Special load operations exist in OpenCL for loading vector values from arrays of scalar values. Performing vectorized memory operations reduces the number of instructions issued and helps to better utilize the memory bandwidth. While vectorization is a crucial optimization for the Mali GPU, it is not recommended on Nvidia GPUs, as they do not have hardware vector units. Therefore, OpenCL code optimized for Nvidia GPUs will most likely make no use of vector data types and perform poorly on Mali GPUs.
Register pressure is an extremely important topic on the Mali GPU, as the number of registers influences the number of threads managed by the hardware. Therefore, reducing the number of registers used increases the number of active threads, which helps to hide memory latencies and keeps the cores busy. If a kernel uses more than 4 128-bit registers, the number of threads drops from 256 to 128. If the kernel uses more than 8 registers, the number of threads halves again. While register pressure is also important on desktop GPUs, there are more registers available and the degree of thread level parallelism degrades more gracefully than on Mali.
There also exist optimizations for AMD and Nvidia GPUs which are not beneficial on the Mali GPU. The local memory, which is crucial for good performance on desktop GPUs, is mapped to the same physical memory as the global memory on Mali. Its usage has, therefore, no performance benefits. Memory accesses to the global memory are coalesced on AMD and Nvidia if all work items in the same execution batch access consecutive memory locations. Optimizing code for coalesced accesses is hugely beneficial on these architectures, whereas on Mali this might increase cache misses; memory accesses should be vectorized instead.
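To make the vectorization advice above concrete, the following OpenCL sketch (ours, not code from the paper or from ARM) contrasts a scalar and a float4 version of a simple multiply-accumulate kernel. Kernel names are illustrative, and the vectorized variant assumes the array length is a multiple of 4 and is launched with a quarter of the work items.

  // Scalar version: one float per work item; the 128-bit SIMD lanes stay idle.
  kernel void mad_scalar(const global float* a, const global float* b,
                         global float* c) {
    int i = get_global_id(0);
    c[i] = c[i] + a[i] * b[i];
  }

  // Vectorized version: each work item processes 4 consecutive floats.
  // vload4/vstore4 issue a single wide memory operation and the
  // multiply-add executes on the SIMD unit.
  kernel void mad_vec4(const global float* a, const global float* b,
                       global float* c) {
    int i = get_global_id(0);
    float4 av = vload4(i, a);
    float4 bv = vload4(i, b);
    float4 cv = vload4(i, c);
    vstore4(cv + av * bv, i, c);
  }

The vectorized form also issues a quarter of the load instructions, which matches the bandwidth and register-pressure considerations discussed above.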

4. REWRITE-BASED CODE GENERATION
Traditionally, optimization decisions are encoded explicitly, and almost always manually, using a low-level programming model like OpenCL. This approach leads to code which is inherently not performance portable, because optimization choices for one architecture are often disadvantageous on other architectures, as hinted in the motivation.
To address the performance portability challenge we advocate the use of a high-level programming model coupled with automatic code generation. To this end we adopt a recently developed novel code generation technique based on rewrite rules [23]. It starts from a high-level functional program where implementation and optimization decisions are not explicitly specified. This empowers the compiler to automatically explore the implementation and optimization space and generate high performance OpenCL code. This has two main advantages: first, programming is simplified, as time consuming low-level programming is avoided; secondly, performance portability is achieved as code is optimized and specialized automatically by the compiler.
The following sections discuss how programs are expressed in a high-level functional language for data parallelism, how rewrite rules are used to transform high-level programs into functionally equivalent low-level programs, and how OpenCL code is generated from them.

4.1 A High-Level Functional Language for Data Parallelism
We express programs in a functional high-level language specifically designed for data parallel applications. A set of primitives, shown in Table 1, is used to express the applications. The primitives do not determine how the computation is performed but only specify what has to be computed. This gives the compiler the freedom to choose different implementations for different GPUs, enabling performance portability.

High-level data-parallel primitives
  map(f, xs)           Apply f to every element of array xs.
  reduce(z, ⊕, xs)     Perform a reduction of the array xs using the binary operator ⊕ and its identity z.
  zip(xs, ys)          Combine the arrays xs and ys pairwise.
  split(n, xs)         Split the array xs into chunks of size n.
  join(xs)             Opposite of split: merge 2D array xs.

Low-level OpenCL-specific primitives
  mapGlb(f, xs)        Parallel map using global work items.
  mapWrg(f, xs)        Parallel map using work groups.
  mapLcl(f, xs)        Parallel map using local work items.
  mapSeq(f, xs)        Sequential map.
  reduceSeq(z, ⊕, xs)  Sequential reduction.
  asVector(n, xs)      Vectorize array xs with a width of n.
  asScalar(xs)         Scalarize a vectorized array xs.
  vectorize(f)         Vectorize the function f.
  toGlobal(f)          Store the result of f in global memory.
  toLocal(f)           Store the result of f in local memory.
  toPrivate(f)         Store the result of f in private memory.

Table 1: High-level and low-level primitives.

The primitives are customized with application specific functions and can be freely composed, as they are functions themselves. For example, to sum up the absolute values of an array we write: reduce(0, plus, map(abs, xs)). In this paper we use a notation similar to dataflow programming where expressions are read from left to right. This example is written as: xs >> map(abs) >> reduce(0, plus). We will call programs written in this language (functional) expressions to distinguish them from the generated OpenCL code.
Listing 1 shows the functional matrix multiplication expression. The first map in line 2 performs a computation for every row of A. In line 3 we map over the transposed matrix B to obtain a single column of B. In lines 4 and 5 the dot product for every pair of row and column is computed.

  1 λ (A : [[float]M]K, B : [[float]K]N) ↦
  2   A >> map(λ rowOfA ↦
  3     B >> transpose >> map(λ colOfB ↦
  4       zip(rowOfA, colOfB) >>
  5       map(mult) >> reduce(0.0f, add) ) )

Listing 1: Matrix multiplication expressed functionally. This is the input from which our compiler generates efficient OpenCL code targeted for the Mali GPU.
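To connect the notation with the generated code before turning to the rewrite rules, the kernel below is a hand-written sketch (not output of the presented compiler) of what a fully sequential lowering of the running example xs >> map(abs) >> reduce(0, plus) could look like in OpenCL, with map lowered to mapSeq and reduce to reduceSeq and both fused into a single loop. The kernel name and signature are ours.

  // Hypothetical sequential lowering of: xs >> map(abs) >> reduce(0, plus)
  // A single work item performs the whole computation.
  kernel void sum_abs(const global float* xs, global float* out, uint n) {
    float acc = 0.0f;                 // identity of plus
    for (uint i = 0; i < n; ++i) {    // sequential reduction loop
      acc = acc + fabs(xs[i]);        // abs applied to each element
    }
    out[0] = acc;
  }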
quired to prove that transformations are valid. The rewrite rules are coupled with a set of low-level prim- 1 kernel void mm(global float4* const A, itives which resemble the OpenCL programming model (Ta- 2 global float4* const B, ble 1). The rewrite rules bridge the gap between the high- 3 global float2*C, uint n) { level algorithmic primitives and the low-level OpenCL spe- 4 uint i = get_global_id(0); 5 uint j = get_global_id(1); cific primitives. Implementation choices are made by choos- 6 uint nv4 = n >> 2; ing between different rewrite rules. For example, we might 7 float4 ab = (float4)(0.0f); decide to perform a map operation sequentially by rewrit- 8 for (uint k = 0; k < nv4; ++k) { ing it into the mapSeq primitive. Alternatively the high-level 9 float4 a0 = A[ 2*i *nv4+k]; map can be rewritten into a different low-level map to exploit 10 float4 a1 = A[(2*i+1)*nv4+k]; parallelism using global threads (mapGlb) or local threads 11 float4 b0 = B[ 2*j *nv4+k]; float4 b1 = B[(2*j+1)*nv4+k]; mapLcl mapWrg 12 ( ) grouped in work-groups ( ). Once a high- 13 ab += (float4)(dot(a0, b0), dot(a0, b1), level program is rewritten into a low-level program, all de- 14 dot(a1, b0), dot(a1, b1)); } cisions about how the computation should be executed in 15 uint ix = 2*i*(n>>1) + j; OpenCL have been made and explicitly encoded in the pro- 16 C[ix] =ab.s01; gram. We will explain the rewrite process and how it is 17 C[ix + (n>>1)] = ab.s23; } guided to achieve high performance in section 5.4. Listing 2: Optimized OpenCL matrix multiplication 4.3 OpenCL Code Generation kernel. This listing shows the blockedNT version from [8]. The last stage consists of generating OpenCL code from a low-level program. No implementation or optimization deci- sions are made at this point, as these have already been made independent of the use of vector data types for the elements using rewrite rules. Every low-level primitive from Table 1 of matrix A and B. Instead, the blocking of 2 values from A directly corresponds to a piece of OpenCL code. Therefore, and 2 values from B leads to 4 intermediate results which are the code generation process is straightforward and consists added to the accumulation variable using a vector addition. of traversing the low-level program and emitting the corre- After the loop, the results are written to global memory in sponding OpenCL code fragment. The code generation pro- two instructions (lines 16 and 17) using a vector width of 2. cess relies on information about the length of arrays which are stored in their types. This information is used for the alloca- Optimized matrix multiplication expressed functionally. tion of memory as well as for emitting indices when accessing data, as we will see in more detail in the next section. Listing 3 shows a functional expression resembling the op- timized implementation shown in Listing 2. Starting from 5. OPTIMIZING MATRIX the top, the blocking optimization is expressed by splitting MULTIPLICATION FOR MALI matrices A (line 2) and B (line 3) by a factor of 2. This groups 2 rows of A and 2 columns of B together. The mapGlb primitives This section discusses how to optimize matrix multiplica- used in lines 2 and 3 express the mapping of parallelism to tion for Mali. It first investigates a hand-optimized OpenCL global threads in OpenCL: every global thread processes a kernel and how it is expressible in the functional language. pair of 2 rows of A and 2 columns of B. 
Then it shows that the functional representation is suitable To complete the blocking of A, we first transpose a block of for expressing optimizations structurally as rewrite rules. 2 rows of A (line 4), split each row into chunks of 4 elements and then transpose back to obtain tiles with 2 4 float values. 5.1 Manually Optimized OpenCL Kernel The same process is applied to B in lines 6× and 7. The zip ARM recently published a paper where they discuss op- (line 4) combines the tiles of A and B together. These pairs timization techniques for their Mali GPU [8]. One of the of tiles are then processed by the reduceSeq in line 8 which applications investigated is the general matrix multiplication corresponds to the for loop in the OpenCL kernel. for which multiple optimized OpenCL kernels are presented. When processing a single pair of a tile of A and a tile of B Listing 2 shows the best performing version developed by inside of the reduction, the pairs are copied into the private ARM’s engineers [8]. To keep the discussion simple we show memory in lines 10–13. The asVector(4) primitive (used in a slightly simpler version, which concentrates on the actual lines 11 and 13) vectorizes the data by turning 4 individual matrix multiplication and omits the scalar values α and β float values of a tile into a single float4 value. This section used in the BLAS formulation of GEMM. corresponds to the lines 9–12 in Listing 2 where values from matrices A and B are loaded into private variables. OpenCL kernel analysis. For each combination of a row of a tile of A and a column of a The OpenCL kernel shown in Listing 2 applies vectoriza- tile of B, each represented by a float4 value, we perform the tion and blocking as its two main optimizations. The for loop dot product computation in lines 17–19. The dot product is in line 8 iterates over blocks (or tiles) comprising of 2 float4 expressed as a combination of the zip, mapSeq and reduceSeq elements from matrix A and B. These elements are loaded into primitives. The zip (line 17) combines the two float4 values private variables in lines 9–12. The dot products of all four from the tiles of A and B, before the mapSeq(mult4) (line 18) combinations of float4 elements from matrix A and B are performs the vectorized multiplication of the two values. To computed using the OpenCL built-in dot function (lines 13 finish the dot product computation, reduceSeq(0.0f, add) and 14) resulting in four separate intermediate results. These (line 19) adds up the multiplied values after they have been are combined into a single float4 value (line 13) which is turned back into scalar values using the asScalar primitive added to the accumulation variable ab (declared in line 7). (line 18). This section corresponds to the four occurrences of The vectorization of the addition operation in line 13 is the dot function in lines 13 and 14 in Listing 2. 1 λ (A,B) A >> mapGlb0 (λ rowOfA 2 A >> split(2)7→ >> mapGlb (λ 2RowsOfA B >> mapGlb (λ colOfB7→ ... )) 0 7→ 1 7→ 3 B >> split(2) >> mapGlb1 (λ 2ColsOfB 4 zip( 2RowsOfA >> transpose >> 7→ (a) Functional expression using the mapGlb primitive. 5 split(4) >> transpose, kernel void KERNEL(...){ 6 2ColsOfB >> transpose >> for (int g_id_0 = get_global_id(0); g_id_0> transpose ) >> g_id_0 += get_global_size(0)) 8 reduceSeq(init = (float4)0.0f, for (int g_id_1 = get_global_id(1); g_id_1> mapSeq( 7→ 10 ... } 11 h asVector(4) >> toPrivate(id4)), 12 tileOfB >> mapSeq( (b) Generated OpenCL code for an arbitrary global size. 
13 asVector(4) >> toPrivate(id4)) >> i kernel void KERNEL(...){ 14 (λ (tileOfAp , tileOfBp ) 7→ int g_id_0 = get_global_id(0); 15 tileOfAp >> mapSeq(λ rowOfTileOfA 7→ int g_id_1 = get_global_id(1); 16 tileOfBp >> mapSeq(λ colOfTileOfB 17 zip(rowOfTileOfA, colOfTileOfB) >>7→ ... } mapSeq(mult4) >> asScalar >> 18 (c) Generated OpenCL code for fixed global size. 19 reduceSeq(0.0f, add) ) ) ) >> 20 (λ 2x2DotProducts 21 2x2DotProducts >>7→ join >> asVector(4) >> Figure 2: Exploiting parallelism using global work items. 22 mapSeq(add4(acc)) ) ) >> 23 asScalar >> asVector(2) >> 24 toGlobal(mapSeq(id2)) ) ) OpenCL implementations, where the computation might be Listing 3: Low-level functional expression resembling the performed sequentially (mapSeq), or in parallel, distributing OpenCL kernel presented in Listing 2. the workload across work groups (mapWrg), local work items (mapLcl) or global work items (mapGlb). Figure 2 shows one possible mapping of parallelism for matrix multiplication. In Figure 2a, the mapGlb0 primitive is used to perform a computation for every row of A. Nested To complete the reduction over multiple tiles, we must add inside is the mapGlb primitive which maps over the columns the computed intermediate result to an accumulation vari- 1 of B. As we use the mapGlb primitives we indicate, that a able. To achieve this, we flatten the computed 2 2 dot prod- work item with the global ids g_id_0 and g_id_1 will process ucts into a one dimensional array using the join× primitive a combination of a row of A and a column of B. (line 21). The resulting array of 4 float values is vectorized, Figure 2b shows the corresponding OpenCL code gener- using the asVector(4) primitve and added to the accumula- ated for this expression. The two for loops correspond to the tion variable acc in line 22. This section corresponds to the map primitives. In the generic case it is unclear how many vectorized += operation in Listing 2 (line 13). global work items will be launched at execution time, there- Finally, to write the computed results back to the global fore, for loops are emitted and a single work item might memory the vector width is changed using asScalar and process multiple data elements. For matrix multiplication asVector(2) before the actual copy operation in line 24. This (and many other applications) it is common to specialize the last section corresponds to the lines 16 and 17 from Listing 2. OpenCL kernel so that it only works if a matching global size This example should give some intuition on how optimized is selected at execution time. To support this, we use array programs are expressed functionally. This representation en- length information to statically prove that each work item ables the automatic transformation of the high-level program executes the loop exactly once and avoid generating the loop in Listing 1 into low-level expressions such as Listing 3 using altogether. The resulting OpenCL code is shown in Figure 2c. rewrite rules, as the rest of the paper shows. Vectorized memory operations. 5.2 Optimizations Expressed Structurally Vectorizing load and store instructions helps to better uti- This section investigates individual optimizations, shows lize the memory bandwidth by issuing larger memory trans- how they are expressed functionally and presents the corre- fers with a single instruction. OpenCL provides specific sponding generated OpenCL code. vload and vstore instructions for loading or storing vector values from arrays of scalar values. Mapping of parallelism. 
We decompose vectorized memory operations into two In OpenCL, programmers have different choices on how parts, as shown in Figure 3a: first, interpreting the initially to map the computation to the hardware, which directly af- scalar array as a vectorized array using asVector; secondly, fects performance. The programmer might decide to group copy the data by applying the vectorized identity function threads (work items) into work groups and use their associ- id4 to every element of the vectorized array. In the example ated local ids together with their work group ids to distribute toPrivate indicates a copy into the private memory. We keep the work. Sometimes it is possible to use the global ids of work track of the length of arrays in their types. Let us assume that items independently of their work group ids. A in the example is an array of N float values. Therefore, In our approach, using the different layers of this hierarchy we write its type as: [float]N. After applying asVector(4) is expressed by using different low-level variations of the map to it we obtain an array with type: [float4]N/4. We use this pattern. All variations share the same high level semantics: length information when generating indices in OpenCL. applying a function to each element of the input array to pro- The generated OpenCL code is shown in Figure 3b. The duce the output array. The low-level variations differ in their id4 function is declared in the first line and models a copy ... A >> asVector(4) The generated OpenCL code is shown in Figure 4b. The >> toPrivate(mapSeq(id4)) >> ... vectorized function mult4 performs the multiplication oper- ation on two float4 values. The add function in line 2 is (a) Functional expression using the asVector primitive. not vectorized and operates on scalar float values. This ex- 1 float4 id4(float4 x) { return x; } ample OpenCL code assumes that only two float4 values 2 kernel void KERNEL(const global float* A){ are combined and multiplied producing a temporary tmp in 3 ... line 6. The following two lines reduce the vector by accessing 4 float4 elemsOfA_0 = id4(vload4(index , A)); 0 its individual components to produce the final result. 5 float4 elemsOfA_1 = id4(vload4(index1 , A)); 6 ... } 5.3 Optimizations Expressed as Rewrite Rules (b) Generated OpenCL code using vload instructions. The optimizations discussed above are easily expressed as Figure 3: Vectorized memory operations. rewrite rules. Let us take the vectorization of the dot product (Figure 4) as an example. The following rewrite rule describes ... >> zip(elemsOfA, elemsOfB) this transformation independently of the concrete operation >> mapSeq(vectorize(4, mult)) performed by the function f or the concrete vector width n: >> asScalar >> reduceSeq(0.0f, add) >> ... zip(a, b) >> map( f ) (a) Functional expression performing a vectorized dot product. = ⇒zip(asVector(n, a), asVector(n, b)) >> 1 float4 mult4(float4 l,float4 r){ return l*r;} map(vectorize(n, f )) >> asScalar 2 float add(float l,float r){ return l+r;} 3 kernel void KERNEL(const global float* A, It is easy to see that this rule is correct, since the result of 4 const global float* B){ both expressions is an array of scalar values computed by 5 ... applying the function f to pairs of elements from a and b. 6 float4 tmp = mult4(elemsOfA, elemsOfB); Similarly we define a rule for vectorizing a reduction: 7 float acc = 0.0f; 8 acc= add(acc,tmp.s0); acc= add(acc,tmp.s1); a >> reduce(z, ) ⊕ 9 acc= add(acc,tmp.s2); acc= add(acc,tmp.s3); = ⇒ 10 ... 
} a >> asVector(n) >> reduce(asVector(n, z), vectorize(n, )) >> toScalar >> reduce(z, ) ⊕ (b) Generated OpenCL code using vector instructions. ⊕ The rewritten expression performs a reduction on the vec- Figure 4: Vectorized arithmetic operations. torized data using the vectorized operator before the final result is computed by a scalar reduction of⊕ the components of the vectorized result of the first reduction. For this rewrite to be correct, we require the reduction operator to be com- operation in the functional expression. It will be inlined mutative, as we change the order in which elements⊕ are pro- and, therefore, optimized away by the OpenCL compiler. cessed. After vectorizing the array its float4 values are loaded us- In Listing 2 the OpenCL build-in function dot is used to ing vload4 instructions. As arrays in private memory are not perform a dot product of two float4 values. This function stored in registers we unroll the array into private variables. can be implemented more efficiently by the compiler, e.g., by We show the first two variables in lines 4 and 5. To unroll the using specialized hardware instructions. As we will see, this array, its size has to be statically known, which is the case for is highly beneficial on Mali. We can easily define a rule to arrays obtained through fixed size tiling. We use symbolic detect a sequence of patterns computing a dot product and computations to compute indices like index using the length 0 rewrite it into a function call of the dot built-in function: information stored in the array’s type. zip(x, y) >> mapSeq(mult4) >> asScalar Vectorized arithmetic operations. >> reduceSeq(z, add) = Vectorizing arithmetic operations is one of the most impor- ⇒dot(x, y) >> reduceSeq(z, add) tant optimizations on Mali GPUs due to its SIMD architecture. We discuss the vectorization of the dot product computation For this rule to fire x and y must be of type float4. The addi- as an example, which is used as a building block in matrix tional reduceSeq after applying the dot adds the computed multiplication as seen in Listing 3 lines 17–19. result to the accumulation variable which is initialized with The dot product is represented functionally by combin- z. This shows how a very specialized optimization can be ing two arrays using the zip primitive. It is followed by implemented as a simple generic rewrite rule. map(mult) which performs a pairwise multiplication before reduce(0, add) adds up all the intermediate results. Fig- 5.4 Automatic Exploration ure 4a shows a vectorized version of the dot product. The Having defined optimizations as rewrite rules, it is now vectorize(4, mult) primitive is used to vectorize the mul- possible to explore the space automatically by applying a tiplication with a vector width of 4. Currently we support the combination of rules to the input program. However, the re- vectorization of simple functions but we intend to incorporate sulting space is extremely large, even potentially unbounded, existing research on vectorizing more complex functions [9] which opens up a new research challenge. This is in stark in the future. After performing the vectorized pairwise multi- contrast to classical auto-tuners which have a much smaller plication, all values are added up to compute the scalar result space to explore due to their parametric nature. However, by first interpreting the vectorized data as scalar, and then by this is also the main reason why auto-tuners sometimes fail performing a reduction using scalar addition. 
to achieve high-performance as seen in the motivation; they are bound by the fixed set of parameter chosen by the im- 10 plementer and cannot search beyond these. In contrast, our rewrite-based approach is able to combine the various opti- mizations expressed as rules in any way and can, therefore, explore a far larger amount of implementations unreachable with classic auto-tuning. We present here a first, simple and heuristic-based prun- ing strategy to tackle the space complexity problem. Future research will investigate more advanced techniques to fully Performance [GFLOPS]
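To illustrate what the dot rule just shown changes at the OpenCL level, the two kernels below are a hypothetical before/after sketch written for this discussion; they are not generated code, and the kernel names and signatures are ours. Both compute the dot product of two arrays of length 4*nv4: the first uses the explicit scalar reduction of the vector lanes as in Figure 4b, the second uses the dot built-in as in Listing 2.

  // Before the rule: vectorized multiply, then a scalar reduction of the lanes.
  kernel void dot_scalar_reduce(const global float* a, const global float* b,
                                global float* out, uint nv4) {
    float acc = 0.0f;
    for (uint k = 0; k < nv4; ++k) {
      float4 t = vload4(k, a) * vload4(k, b);
      acc = acc + t.s0; acc = acc + t.s1;
      acc = acc + t.s2; acc = acc + t.s3;
    }
    out[0] = acc;
  }

  // After the rule: the zip/mapSeq(mult4)/asScalar/reduceSeq sequence
  // collapses into a call to the dot built-in.
  kernel void dot_builtin(const global float* a, const global float* b,
                          global float* out, uint nv4) {
    float acc = 0.0f;
    for (uint k = 0; k < nv4; ++k)
      acc += dot(vload4(k, a), vload4(k, b));
    out[0] = acc;
  }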

5.4 Automatic Exploration
Having defined optimizations as rewrite rules, it is now possible to explore the space automatically by applying a combination of rules to the input program. However, the resulting space is extremely large, even potentially unbounded, which opens up a new research challenge. This is in stark contrast to classical auto-tuners which have a much smaller space to explore due to their parametric nature. However, this is also the main reason why auto-tuners sometimes fail to achieve high performance, as seen in the motivation; they are bound by the fixed set of parameters chosen by the implementer and cannot search beyond these. In contrast, our rewrite-based approach is able to combine the various optimizations expressed as rules in any way and can, therefore, explore a far larger set of implementations unreachable with classic auto-tuning.
We present here a first, simple and heuristic-based pruning strategy to tackle the space complexity problem. Future research will investigate more advanced techniques to fully automate the pruning process, e.g., using combinations of micro-benchmarking and machine learning.
For matrix multiplication, we start exploring from the high-level expression shown in Listing 1. Rewrite rules are automatically applied until a low-level expression is produced, such as Listing 3, from which OpenCL code is generated.
To explore different algorithmic optimization choices, we encode the optimizations discussed in section 5.3 plus 1D and 2D register blocking, and tiling presented by others [22]. Starting from the high-level expression in Listing 1, we apply these rewrite rules at all valid locations in an arbitrary order. This leads to a large number of expressions which we filter based on simple heuristics to limit the depth of nesting of maps and to keep the distance between multiplication and addition in the dot product small. Expressions with deep nesting or with temporaries between the two crucial arithmetic operations are unlikely to deliver great performance. To further rewrite into a low-level expression containing only OpenCL-specific primitives, we restrict ourselves to a couple of fixed parallelism mappings. For instance, the two outermost maps are mapped to the global work items using mapGlb when the local memory is not used in the expression, since using work-groups is usually not beneficial otherwise. Using these heuristics, our system produces 31 differently optimized low-level expressions automatically derived from the single, high-level expression of matrix multiplication.
Once a low-level expression has been produced, the next step consists of selecting kernel parameters, which is similar to classical auto-tuning techniques. Every low-level expression explicitly encodes which optimizations are applied but not which numerical parameters are picked for them. We automatically explore these parameters, for example the vector width or the size of a tile, for all 31 low-level expressions, resulting in 677 specialized OpenCL kernels. Because we focus on Mali, we fixed the vector width to 4 to prune the space. We explored the tile sizes in a reasonable range, ensuring that the generated kernel does not require too much memory.
Once an OpenCL kernel has been generated, we still need to decide on the work group size runtime parameter, i.e., the number of threads in a work-group. For matrix multiplication, work groups are two-dimensional and we have to select the number of threads in both dimensions. We conducted a search of the work-group size within the range allowed by OpenCL. This resulted in 11,628 unique combinations of runtime parameters and optimizations on Mali.

5.5 Summary
In this section we have looked at optimizations for the Mali GPU, how they are implemented in OpenCL and in a generic rewrite-based code generator. These rewrites enable our compiler to automatically combine various optimizations and transform high-level programs into optimized functional low-level expressions from which OpenCL code is generated.

6. EXPERIMENTAL SETUP
We evaluated our code generation technique using matrix multiplication with differently sized square and rectangular matrices (512×512 ∗ 512×512, 1024×1024 ∗ 1024×1024, 2048×512 ∗ 512×2048, 512×2048 ∗ 2048×512) of single precision floating point values. We report the median performance in GFLOPS of at least five runs.
We use an ODROID XU3 board with a Samsung Exynos 5422 SoC. The mobile Mali-T628 MP6 GPU is separated into two OpenCL devices. We used the first device with 4 cores and the Mali SDK 1.1 as OpenCL implementation. We disabled DVFS and locked the clock frequency at 600 MHz.
To investigate performance portability we also performed experiments on two desktop GPUs: an Nvidia GTX Titan Black using CUDA 6.0 and driver 331.79, and an AMD Radeon HD 7970 using AMD APP SDK 2.9.214.1 and driver 1526.3.

7. EVALUATION
This section evaluates our approach using matrix multiplication as a case study. We start by analyzing the performance distribution of the entire exploration space and compare it against the performance of the OpenCL kernels generated from the best expression in the space. We then analyze the performance impact of particular optimizations and compare performance against manually optimized implementations as well as the auto-tuned CLBlast library. Unless specified otherwise, we use 1024×1024 ∗ 1024×1024 as the default input size.

7.1 Space Exploration
Following the strategy described in section 5.4, the generation of the 677 OpenCL kernels from 31 functional expressions took less than half an hour, while performing the 11,628 executions took about a day. Figure 5 shows the performance distribution of executions of the kernels generated. The violin plot represents a smooth histogram (turned by 90 degrees) of the performance measured in GFLOPS. The left-most violin corresponds to the entire space (i.e., all the expressions plus parameter tuning) while the right-most violin represents the distribution of tuning the parameters of the kernel resulting from the single best expression (Listing 4).

[Figure 5 plot: two violin plots; y-axis Performance [GFLOPS], 0–10; left: tuning space for all expressions; right: tuning space for the best expression.]
Figure 5: Performance distribution of matrix multiplication kernels generated from functional expressions. The white dot is the median. The black box delimits the 25%–75% quantiles.

As can be seen on the left, most executions in the entire space result in very low performance, with a median of less than 1 GFLOPS. When focusing on the executions of kernels generated from the best expression (right side), performance increases significantly, with a median of ~5 GFLOPS.
The results show that picking the right combination of optimizations, explicitly encoded in the low-level expression after rewriting, is crucial for avoiding badly performing code. Nevertheless, it is still important to tune the kernel and runtime parameters for achieving high performance.
Listing 4 shows the best performing expression found automatically for the 1024×1024 ∗ 1024×1024 input size. By investigating the expression we can see that vectorization has been applied (lines 12–13), the dot built-in function is used (line 12), and tiling is performed with the split and transpose in lines 2–5. The vector width (k) and tile sizes (n×k and m×k) are still parameters that are picked prior to OpenCL code generation. After conducting the parameter exploration, the values leading to the fastest kernel are k = 4 and n = m = 2.

  1  λ (A, B) ↦
  2  A >> split(n) >> mapGlb0(λ nRowsOfA ↦
  3  B >> split(m) >> mapGlb1(λ mColsOfB ↦
  4    zip( transpose(nRowsOfA) >> split(k),
  5         transpose(mColsOfB) >> split(k) ) >>
  6    reduceSeq(init = make2DArray(n, m, 0.0f),
  7      λ (accTile, (tileOfA, tileOfB)) ↦
  8        zip(accTile, transpose(tileOfA)) >>
  9        mapSeq(λ (accRow, rowOfTileOfA) ↦
  10         zip(accRow, transpose(tileOfB)) >>
  11         mapSeq(λ (acc, colOfTileOfB) ↦
  12           dot(rowOfTileOfA >> asVector(k),
  13               colOfTileOfB >> asVector(k)) >>
  14           reduceSeq(acc, add)
  15         ) >> join ) ) >>
  16    toGlobal(mapSeq(mapSeq(mapSeq(id)))) >>
  17    transpose() >> map(transpose) >> transpose
  18  ) >> join >> transpose ) >> join

Listing 4: The best performing low-level expression automatically derived from the high-level expression in Listing 1 using rewrite rules.

The expression is similar to the one resembling the manually optimized OpenCL kernel (Listing 3). Nevertheless, there are a few notable differences. For example, the tiles are not explicitly copied to private memory and, therefore, more loads to the global memory are issued. However, as we will see in the next subsection, this does not affect performance negatively, as these accesses are probably cached. In fact, the generated OpenCL kernel is faster than the kernel which explicitly copies the data into private memory.

7.2 Performance Comparison Against Manually Optimized Kernel
Figure 6 shows a performance comparison of three matrix multiplication implementations. The first bar shows the performance of the generated OpenCL kernel from the expression resembling the manually optimized kernel (Listing 3), whose performance is shown as the last bar. The second bar shows the best OpenCL kernel generated by automatically deriving the expression shown in Listing 4 from the five-line long high-level expression of matrix multiplication (Listing 1). For the two automatically generated OpenCL kernels we indicate the performance benefit measured when including the rewrite rule for introducing the dot built-in function.

[Figure 6 plot: performance in GFLOPS; generated from Listing 3 with parameter tuning: 5.61 (12.68 with the dot built-in); generated from Listing 4 with automatic exploration: 10.46 (13.58 with the dot built-in); Mali hand-optimized: 13.19.]
Figure 6: Performance of different matrix multiplication kernels. Our fully automated exploration technique produces a kernel that outperforms the manually optimized kernel.

We can see that the performance of the OpenCL kernel generated via our automatic exploration even slightly outperforms the manually optimized kernel. It is important to note that this is achieved completely automatically by systematically applying rewrite rules starting from a high-level representation of matrix multiplication. The rewrite rule that introduces the dot built-in turns out to be crucial, giving an extra 30% of performance as can be seen.
The kernel generated from the expression resembling the manually optimized implementation (the first bar) achieves 96% of the performance of the manually written kernel. Again, the usage of the dot built-in is crucial for achieving high performance for matrix multiplication on Mali.

7.3 Performance Portability and Performance Comparison Against Auto-tuning
To investigate portability of performance across different classes of GPUs we compare our approach against the CLBlast library (1) auto-tuned with the state-of-the-art CLTune (2) [18] on three GPUs from AMD, ARM, and Nvidia. As reference points, we use the clBLAS library (3) developed by AMD using OpenCL, as well as an implementation particularly tuned for each architecture: the hand-tuned version shown in Listing 2 on Mali, cuBLAS on Nvidia, and clBLAS on AMD.
  (1) CLBlast, commit d190bec, https://github.com/CNugteren/CLBlast
  (2) CLTune, commit ad94a3d, https://github.com/CNugteren/CLTune
  (3) clBLAS, commit d16f7b3, https://github.com/clMathLibraries/clBLAS
Figure 7 shows the performance comparison of all implementations on four different input sizes and shapes. The auto-tuned CLBlast library delivers high performance on the two desktop GPUs, achieving performance higher than clBLAS on the AMD GPU. On Nvidia, CLBlast achieves about 80% of the performance of cuBLAS for three input sizes. That is a very good number, as the proprietary cuBLAS relies on advanced assembly-level optimizations which cannot be implemented using CUDA or OpenCL [11]. However, on the mobile Mali GPU the auto-tuning approach is less successful, achieving only about 60% of the performance of the hand-optimized implementation on three inputs and 25% slower than our own results on the other input.
This shows that performance portability is not achieved purely using auto-tuning. By investigating the tuned OpenCL kernel used by CLBlast, we could see that the built-in dot function or vectorized operations are not used which – as we have seen – is crucial for achieving high performance on Mali. On the desktop GPUs these optimizations are not required as there is no hardware support for vectorization. Furthermore, the overall structure of the kernel is similar to the one used for the desktop GPUs, clearly showing that CLBlast was developed for these GPUs and applied to Mali as an afterthought.

[Figure 7 plot: three panels, Desktop GPU (Nvidia GeForce GTX Titan Black), Desktop GPU (AMD Radeon HD 7970), Mobile GPU (ARM Mali-T628 MP6); input sizes 512x512 ∗ 512x512, 1024x1024 ∗ 1024x1024, 2048x512 ∗ 512x2048, 512x2048 ∗ 2048x512; y-axis GFLOPS; bars: Rewrite-based, CLBlast+CLTune, and cuBLAS / clBLAS / hand-optimized.]
Figure 7: Performance of matrix multiplication on two desktop GPUs and one mobile GPU for different input sizes. The rewrite-based approach is the only one that achieves performance portability across desktop-class and mobile GPUs.

Our rewrite-based approach delivers high performance on the desktop GPUs and on the mobile GPU. Performance on the desktop GPUs is very close (Nvidia) or even slightly better (AMD) compared to CLBlast on all input sizes.


Crucially, the rewrite-based approach consistently achieves a large performance improvement on the Mali GPU compared to CLBlast (up to 1.7× better). It is able to outperform any other implementation on Mali, especially for the third input size, where choosing a larger tile size increases the amount of work per thread, which is beneficial for this type of matrix shape.
The key for achieving high performance is the support for architecture-specific optimizations expressed as generic rewrite rules and the ability to generate structurally different OpenCL kernels. In fact, when running the best OpenCL kernel generated for Mali on the Nvidia GPU we obtain only 4% of the performance compared to running the kernel optimized for this GPU (i.e., 25x slower), as seen in Table 2. Conversely, running the kernel optimized for the desktop-class AMD GPU on Mali results in only 11% of the performance achieved with the best kernel we generate for the embedded GPU (i.e., 9x slowdown). The Nvidia kernel does not even run on Mali due to insufficient hardware resources.

                         Run on
                 Nvidia     AMD       Mali
  Tuned  Nvidia  100.0 %    27.5 %    N/A
  for    AMD     20.5 %     100.0 %   11.6 %
         Mali    4.2 %      14.4 %    100.0 %

Table 2: Performance portability of kernels (1024×1024 ∗ 1024×1024).

On the desktop GPUs our approach generates kernels exploiting the hierarchical organization of threads, local memory, tiling, and the fused multiply-add instruction, whereas on the mobile GPU, a flat organization of threads, vectorization, and the dot built-in are crucial. These very different OpenCL kernels are derived from a single high-level expression of matrix multiplication using rewrites.

7.4 Summary
This section has shown that a rewrite-based approach achieves high performance on two desktop GPUs and the mobile Mali GPU starting from a single portable high-level expression. The comparison against the state-of-the-art auto-tuner, CLBlast, shows that tuning a fixed parameter space does not achieve performance portability across different classes of GPUs.

8. DISCUSSION
While this paper has focused on matrix multiplication, the proposed approach is in fact more generic. The high-level functional language introduced in section 4 has been deliberately designed to be more restrictive than general purpose languages to enable efficient parallel code generation. However, the language and the rewrite rules are fully extensible and can be used for expressing a larger class of data-parallel applications. For instance, we are currently working on extensions to support sparse linear algebra and stencil applications using the exact same methodology presented.

9. RELATED WORK

Auto-tuning approaches.
There exist a large number of auto-tuning projects in the literature. We highlight two recent examples. OpenTuner [1] is a recent generic framework for creating domain-specific multi-objective auto-tuners. It supports a variety of search techniques, as well as user-defined ones providing domain specific knowledge. CLTune [18] is an auto-tuner for optimizing OpenCL kernels. Using CLTune requires the kernel to be written in an auto-tuning friendly style and might require providing alternative implementations to achieve good performance on a variety of devices. It supports a broad range of strategies to efficiently search the space of parameters.
Auto-tuning has been successfully applied to matrix multiplication. Previous work includes templates for different implementations with auto-tuning targeting Nvidia GPUs [10]. Other work [13] has auto-tuned pre-written kernels.
As we have demonstrated in this paper, our rewrite-based approach is more fundamental than classical parameter-based auto-tuning, as the rewrite rules allow to drastically change the structure of generated OpenCL kernels, achieving true performance portability across desktop and mobile GPUs.

Mali GPU optimizations.
There is extensive literature on optimizations for desktop-class GPUs but significantly less work on mobile GPUs. For instance, prior work [7] has shown how manual optimizations for Mali GPUs can be used for HPC-style workloads. The paper [15] discusses code generation for a mobile GPU from a domain specific language for image processing. In contrast, this paper presents a novel technique based on rewrite rules to automatically generate optimized code for data parallel applications targeting the Mali GPU.
Polyhedral compilation [3] has been applied to optimize OpenCL code for multiple GPUs, including Mali. Unfortunately, the Mali GPU was excluded from the matrix multiplication benchmark. Polyhedral compilation requires complex static analysis for applying transformations on loops with affine memory accesses. In contrast, our rewrite-based approach avoids any static analysis and instead starts out from a simpler functional language without any explicit loops.

High-level approaches for GPU programming.
There exists a rich literature on high-level approaches for GPU programming inspired by functional programming. Delite [12], Halide [21] and Lime [6] all address programmability issues, showing that it is possible to simplify GPGPU programming. They all exploit functional concepts such as composability, immutability and absence of side-effects to increase programmability. However, these approaches still rely on hard-coded optimizations which are not performance portable across different types of GPUs.
SkelCL [24] and Accelerate [14] are examples of domain specific languages (DSLs) using data parallel patterns to simplify GPU programming. Bones [19] is a pattern based GPU compiler automatically detecting algorithm species and mapping them to patterns. All three projects rely on pre-written implementations of patterns to generate GPU code. HiDP [16] is a hierarchical data parallel language for GPU programming generating CUDA code. Petabricks [20] allows the user to specify algorithmic choices of implementations which are automatically selected by the compiler and runtime.
All of these high-level projects rely on manually optimized GPU code or device-specific compiler optimizations, making the generated code not performance portable. Re-targeting any of these projects to generate efficient code for Mali requires considerable effort, while the approach advocated in this paper based on rewrite rules is capable of generating efficient code for mobile as well as for desktop-class GPUs.

10. CONCLUSION
This paper has shown that the classical auto-tuning technique for matrix multiplication is not performance portable across different types of GPUs. Auto-tuners are built around complex parametric implementations which ultimately end up being specialized for a certain class of devices.
In this paper, we have taken a different approach based on a generic rewrite-based GPU code generator that encodes optimization choices as rewrite rules. This approach can easily apply compositions of optimizations, which leads to very high-performance code, even on a mobile GPU. The rewrite-based technique makes it easy to add optimizations such as expressing the OpenCL dot-product built-in function, which makes a large performance difference on the Mali GPU. Performing the same optimizations in an auto-tuner parametric implementation would require a significant effort and might result in a highly specialized device-specific version.
Overall, we have shown that the rewrite-based code generator offers true performance portability across desktop GPUs and the Mali mobile GPU. It achieves a 1.7x speedup over a state-of-the-art auto-tuner on the Mali GPU and even outperforms the human-expert hand-written version on this device.

Acknowledgements
This work was partially funded by Google.

11. REFERENCES
[1] J. Ansel, S. Kamil, K. Veeramachaneni, J. Ragan-Kelley, J. Bosboom, U.-M. O'Reilly, and S. Amarasinghe. OpenTuner: An extensible framework for program autotuning. PACT. ACM, 2014.
[2] ARM. Mali-T600 Series GPU OpenCL – Developer Guide.
[3] R. Baghdadi, U. Beaugnon, A. Cohen, T. Grosser, M. Kruse, C. Reddy, S. Verdoolaege, J. Absar, S. van Haastregt, A. Kravets, et al. PENCIL: A platform-neutral compute intermediate language for accelerator programming. PACT. ACM, 2015.
[4] J. Bilmes, K. Asanovic, C.-W. Chin, and J. Demmel. Optimizing matrix multiply using PHiPAC: A portable, high-performance, ANSI C coding methodology. ICS. ACM, 1997.
[5] J. Dongarra, M. Gates, A. Haidar, J. Kurzak, P. Luszczek, S. Tomov, and I. Yamazaki. Numerical Computations with GPUs, chapter Accelerating Numerical Dense Linear Algebra Calculations with GPUs. Springer Intl. Publishing, 2014.
[6] C. Dubach, P. Cheng, R. Rabbah, D. F. Bacon, and S. J. Fink. Compiling a high-level language for GPUs (via language support for architectures and compilers). PLDI. ACM, 2012.
[7] I. Grasso, P. Radojkovic, N. Rajovic, I. Gelado, and A. Ramírez. Energy efficient HPC on embedded SoCs: Optimization techniques for Mali GPU. IPDPS. IEEE, 2014.
[8] J. Gronqvist and A. Lokhmotov. Optimising OpenCL kernels for the ARM Mali-T600 GPUs. In GPU Pro 5: Advanced Rendering Techniques. 2014.
[9] R. Karrenberg and S. Hack. Whole-function vectorization. CGO. IEEE, 2011.
[10] J. Kurzak, S. Tomov, and J. Dongarra. Autotuning GEMM kernels for the Fermi GPU. IEEE TPDS, 23(11), 2012.
[11] J. Lai and A. Seznec. Performance upper bound analysis and optimization of SGEMM on Fermi and Kepler GPUs. CGO. IEEE, 2013.
[12] H. Lee, K. J. Brown, A. K. Sujeeth, T. Rompf, and K. Olukotun. Locality-aware mapping of nested parallel patterns on GPUs. MICRO. IEEE, 2014.
[13] K. Matsumoto, N. Nakasato, and S. G. Sedukhin. Performance tuning of matrix multiplication in OpenCL on different GPUs and CPUs. SCC. IEEE, 2012.
[14] T. L. McDonell, M. M. Chakravarty, G. Keller, and B. Lippmeier. Optimising purely functional GPU programs. ICFP. ACM, 2013.
[15] R. Membarth, O. Reiche, F. Hannig, and J. Teich. Code generation for embedded heterogeneous architectures on Android. DATE. European Design and Automation Association, 2014.
[16] F. Mueller and Y. Zhang. HiDP: A hierarchical data parallel language. CGO. IEEE, 2013.
[17] L. Nardi, B. Bodin, M. Z. Zia, J. Mawer, A. Nisbet, P. H. J. Kelly, A. J. Davison, M. Luján, M. F. P. O'Boyle, G. Riley, N. Topham, and S. Furber. Introducing SLAMBench, a performance and accuracy benchmarking methodology for SLAM. ICRA, 2015.
[18] C. Nugteren and V. Codreanu. CLTune: A generic auto-tuner for OpenCL kernels. MCSoC. IEEE, 2015.
[19] C. Nugteren and H. Corporaal. Bones: An automatic skeleton-based C-to-CUDA compiler for GPUs. ACM TACO, 11(4), 2014.
[20] P. M. Phothilimthana, J. Ansel, J. Ragan-Kelley, and S. Amarasinghe. Portable performance on heterogeneous architectures. ASPLOS. ACM, 2013.
[21] J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, and S. Amarasinghe. Halide: A language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. PLDI. ACM, 2013.
[22] T. Remmelg, T. Lutz, M. Steuwer, and C. Dubach. Performance portable GPU code generation for matrix multiplication. GPGPU. ACM, 2016.
[23] M. Steuwer, C. Fensch, S. Lindley, and C. Dubach. Generating performance portable code using rewrite rules: From high-level functional expressions to high-performance OpenCL code. ICFP. ACM, 2015.
[24] M. Steuwer, P. Kegel, and S. Gorlatch. SkelCL – A portable skeleton library for high-level GPU programming. IPDPSW. IEEE, 2011.
[25] R. C. Whaley and J. J. Dongarra. Automatically tuned linear algebra software. SC. IEEE, 1998.