A Pattern Language for Writing Efficient Kernels on GPU Architectures
UNIVERSITÀ DEGLI STUDI DI ROMA “TOR VERGATA”

This dissertation is submitted for the Doctorate degree in Computer Science and Automation Engineering, XXVI Cycle.

SIMPL: A Pattern Language for Writing Efficient Kernels on GPU Architectures

Davide Barbieri
A.A. 2013 / 2014

Advisors: Valeria Cardellini, Salvatore Filippone
Coordinator: Giovanni Schiavon

© Copyright 2015 by Davide Barbieri. All rights reserved. Reproduction in whole or in part is prohibited without the written consent of the copyright owner.

ACKNOWLEDGEMENTS

I would like to thank my advisors, Valeria Cardellini and Salvatore Filippone. My graduate studies would not have started at all without their support and persuading skills. I would like to thank them for teaching me about pipelines, instruction-level parallelism, memory hierarchies and computational science during my undergraduate studies, and about doing research with passion and curiosity during my doctorate. I would like to thank Daniel Pierre Bovet, Marco Cesati and Emiliano Betti for letting me teach CUDA three times during these years, and the students of the CUDA course for following my lectures with interest. I gratefully acknowledge the support we have received from CASPUR for the PSBLAS-GPU project, under the HPC Grant 2011 on GPU cluster; from CINECA for project IsC14 HyPSBLAS, under the ISCRA grant programme for 2014; and from Amazon with the AWS in Education Grant programme 2014. Thanks are also due to Nvidia Corporation for making a platform which provides a great tool to learn massively parallel programming and to play good-looking videogames. I thank my parents, my grandparents, my brother Domenico, and Eleonora for giving me a reason to improve my skills and achieve goals in life, for supporting me during stressful periods (especially near deadlines), and for all their love.

ABSTRACT

From embedded systems to desktop computers, and of course HPC (High Performance Computing) solutions, computing resources today are mostly based on multi-core / many-core architectures. While the presence of parallel hardware is ubiquitous, applications that exploit its full potential are still difficult to write. Graphics Processing Unit (GPU) programming deserves particular mention. Thanks to its data-parallel oriented architecture, a GPU can achieve a higher throughput, in terms of floating-point operations per unit time and memory bandwidth, than an off-the-shelf CPU with similar power consumption and cost. Nevertheless, a naïve GPU implementation can be so inefficient as to lose orders of magnitude of performance compared to its optimized counterpart. For this reason, it is fundamental to have enough experience with the reference architecture to provide an optimal solution and make the switch from CPU to GPU advantageous. A pattern language defines a structured collection of design practices within a field of expertise. In the past, pattern languages have proven to be an effective way to communicate experience and help researchers and developers reduce the learning curve in a particular field of expertise. In the field of parallel programming, much work has been done to provide a composable set of patterns that can be used to design an algorithm in a way that makes it completely hardware-agnostic and flawlessly integrable inside algorithmic skeleton frameworks, which take care of producing optimized code for a target architecture or a heterogeneous platform.
While algorithmic skeleton frameworks are in many cases portable and efficient, a number of common applications had to be retrofitted to provide good performance on GPUs; this shows the need for the novice developer to get well acquainted with the details of the platform. In this dissertation we present a new pattern language, SIMPL (SIMt Pattern Language), that is solely dedicated to the development of optimized code on a SIMT (single-instruction multiple-thread) architecture, which models a modern GPU. To the best of our knowledge, this is the first pattern language exclusively dedicated to General Purpose computing on GPUs (GPGPU). The language currently comprises 16 patterns, structured into 5 categories, and gathers the experience we have gained on this platform so far, presenting it in a reusable form. Among those patterns, we place particular emphasis on the original approaches that constitute our main contribution to the research field. We discuss in detail a set of case studies which involve the application of our pattern language. Specifically, we describe the implementation of the sparse matrix-vector multiply routine, reviewing the available literature and discussing our own approach to the problem, together with pointers to available software. As our main contribution, we propose three novel matrix storage formats: ELL-G and HLL, which were derived from ELL, and HDIA for matrices having a mostly diagonal sparsity pattern. We compare the performance of the proposed formats to the results provided by state-of-the-art formats, with experiments realized on different GPU platforms and test matrices coming from various application domains. Furthermore, we implement the reversal of the MD5 and SHA1 hash functions on a cluster of Nvidia GPUs. Our CUDA implementation achieves comparable or even better average performance when compared to other popular password cracking software, reaching near-maximal throughput on different GPU architectures. Finally, we present the GPU implementation of a broad-phase collision detection algorithm for particle simulation, which uses a uniform grid as its spatial partitioning scheme. In some tests our original approach achieves a speedup of 2 compared to the fastest known method supporting a fixed maximum number of elements per cell, and a speedup of 7 compared with the fastest method without such a constraint.

TABLE OF CONTENTS

Table of contents  ix
List of figures  xv
List of tables  xix

1 Introduction  1
  1.1 Introduction and motivation  1
  1.2 Contributions  4
  1.3 Organization  6

2 Background  9
  2.1 Basic parallel laws  11
    2.1.1 Amdahl’s law  11
    2.1.2 Gustafson’s law  12
    2.1.3 Little’s law  17
  2.2 The work-time paradigm  18
  2.3 The PRAM model  19
    2.3.1 Brent’s theorem  19
  2.4 Pattern-based design  21
  2.5 Skeleton-based parallel programming  22

3 General-purpose computing on GPU  25
  3.1 Evolution of the GPU  28
    3.1.1 The first GPUs and the fixed pipeline  28
    3.1.2 Shader cores and shader model  29
    3.1.3 Unified shader model  30
    3.1.4 From unified shader model to CUDA  31
  3.2 Compute Unified Device Architecture  31

4 Pattern language  37
  4.1 Overview  37
  4.2 Related work  37
  4.3 Pattern template  39
  4.4 Language context  40
  4.5 Language forces  40
  4.6 Taxonomy  41
  4.7 Underlying architecture model  42
    4.7.1 Device utilization  48
    4.7.2 Memory model  50
    4.7.3 Relationship with PRAM model  51

5 Mapping patterns  53
  5.1 Vectorize pattern  53
  5.2 Enumerate pattern  55
  5.3 Load Remap pattern  60
6 Consistency patterns  63
  6.1 Double Buffering pattern  63
  6.2 Ghost Cell pattern  65
  6.3 Wave pattern  68

7 Transformation patterns  73
  7.1 Cascading pattern  73
  7.2 Reduce pattern  75
  7.3 Scan pattern  84

8 Construction patterns  93
  8.1 Count And Allocate pattern  93
  8.2 Atomic Add Insertion pattern  95
  8.3 Sort And Pack pattern  98
  8.4 Atomic Concatenate pattern  101
  8.5 Atomic Traversal pattern  103

9 Tuning patterns  107
  9.1 Scale pattern  107
  9.2 Anti Camping pattern  111

10 Case study: sparse matrix vector multiply  115
  10.1 Overview  115
  10.2 Storage formats for sparse matrices  118
    10.2.1 COOrdinate  120
    10.2.2 Compressed Sparse Rows  121
    10.2.3 Compressed Sparse Columns  123
    10.2.4 Storage formats for vector computers  123
  10.3 Formats for sparse matrices on GPU  126
  10.4 Related work  127
    10.4.1 COO variants  129
    10.4.2 CSR variants  131
    10.4.3 CSC variants  134
    10.4.4 ELLPACK variants  135
    10.4.5 DIA variants  140
    10.4.6 Hybrid variants  140
    10.4.7 New GPU-specific storage formats  143
    10.4.8 Automated tuning and performance optimization  144
  10.5 Formats for sparse matrices on SIMT architectures  146
    10.5.1 GPU ELLPACK  147
    10.5.2 Hacked ELLPACK  150
    10.5.3 DIA and Hacked DIA  153
  10.6 Experimental results  155

11 Case study: exhaustive key search  175
  11.1 Related work  178
  11.2 Password cracking on GPU  178
    11.2.1 GPU kernel  181
  11.3 GPU optimizations  182
    11.3.1 CUDA multiprocessor throughput  183
    11.3.2 The main bottleneck  185
  11.4 Experimental results  190
    11.4.1 Reference hardware  190
    11.4.2 Performance results  191

12 Case study: interacting particles simulation  195
  12.1 Uniform grids on GPU  196
  12.2 Atomic Concatenate implementation  200
  12.3 Experimental results  200
    12.3.1 Performance analysis  202

13 Conclusions  207
  13.1 Future directions  209

References  211

LIST OF FIGURES

2.1 Amdahl’s Law for different parallel fractions  13
2.2 Speedup in data-parallel programs  16
2.3 PRAM machine diagram  19
3.1 Floating-Point Operations per Second for CPU and GPU  26
3.2 Memory Bandwidth for CPU and GPU  27
3.3 A 2D grid of threads  33
4.1 Single-instruction multiple-threads Model: host and devices  43
4.2 Single-instruction multiple-threads Model: a multi-processor  44
6.1 Ghost Cell pattern: simulate and copy  66
6.2 Ghost Cell pattern: double buffering