Array abstractions for GPU programming

Martin Dybdal

July 13, 2017

Abstract

The shift towards massively parallel hardware platforms for high-performance computing tasks has introduced a need for improved programming models that facilitate ease of reasoning for both users and compiler optimization. A promising direction is the field of functional data-parallel programming, in which functional invariants can be utilized by optimizing compilers to perform large program transformations automatically. However, previous work in this area allows users only a limited ability to reason about the performance of algorithms. For this reason, such languages have yet to see wide industrial adoption.

We present two programming languages that attempt both to support industrial applications and to provide reasoning tools for hierarchical data-parallel architectures, such as GPUs.

First, we present TAIL, an array-based intermediate language and compiler framework for compiling a large subset of APL, a language which has been used in the financial industry for decades. The TAIL language is a typed functional intermediate language that allows compilation to data-parallel platforms, thereby providing high performance at the fingertips of APL programmers.

Second, we present FCL, a purely functional data-parallel language, which allows for expressing data-parallel algorithms in a fashion where users can, at a low level, reason about data movement through the memory hierarchy and control when fusion will and will not happen. We demonstrate through a number of micro-benchmarks that FCL compiles to efficient GPU code.

Resumé

The shift to massively parallel hardware as the platform for tasks requiring high-performance computing has introduced a need for improved programming models that better facilitate reasoning for both users and optimising compilers. A promising direction is the research area concerned with functional data-parallel programming languages. For such languages, functional invariants can be used by compilers to automatically perform optimising program transformations. Previous work in this area has, however, often allowed users only limited ability to reason about the efficiency of their algorithms, precisely because of these automatic optimisations. Such languages have therefore not yet seen wide adoption in industrial settings. We present two programming languages that attempt both to support industrial applications and to enable reasoning about efficiency when the algorithms are executed on hierarchical data-parallel hardware architectures, such as GPUs.

We first present TAIL, an array-based intermediate language and compiler framework that makes it possible to compile a large subset of APL, a language that has been used in the financial industry for several decades. The TAIL language is a strongly typed functional intermediate language that allows compilation to data-parallel platforms, thereby giving APL programmers access to generating efficient code for tasks requiring high-performance computing.

We then present FCL, a functional data-parallel language that lets the user express data-parallel algorithms in a form where users can reason about when data is transferred between the different memory layers. This is achieved, among other things, by giving the user control over when array combinators will or will not be fused. Finally, we demonstrate that a number of smaller benchmarks written in FCL can be compiled to efficient GPU programs.

Contents

1 Introduction
  1.1 Landscape of functional array languages for GPUs
  1.2 Thesis
  1.3 Compiling APL into a typed array language
  1.4 Hierarchical functional data-parallelism
  1.5 Contributions
  1.6 Structure of the dissertation

2 Functional array language design
  2.1 A core functional language with arrays
  2.2 Array operations
  2.3 Fusion
  2.4 Iteration and recursion
  2.5 Nested data-parallelism

3 TAIL: A Typed Array Intermediate Language for Compiling APL
  3.1 A Typed Array Intermediate Language
  3.2 Compiling the Inner and Outer Products
  3.3 APL Explicit Type Annotations
  3.4 TAIL Compilation
  3.5 Benchmarks
  3.6 Related Work
  3.7 Conclusion and Future Work

4 GPU architecture and programming
  4.1 GPU architecture
  4.2 GPU programming
  4.3 Optimisation of GPU programs

5 FCL: Hierarchical data-parallel GPU programming
  5.1 Obsidian
  5.2 Case Studies in FCL
  5.3 Formalisation
  5.4 Larger examples
  5.5 Performance


  5.6 Related work
  5.7 Discussion and future work
  5.8 Conclusion

6 Conclusion

Bibliography

Chapter 1

Introduction

Society is increasingly relying on large-scale computing for infrastructure as well as for industrial and scientific developments. In recent years, the number and scale of data-processing tasks have increased dramatically. Simultaneously, hardware for high-performance computing has become cheaper than ever, which has made previously intractable problems tractable. However, while hardware for high-performance computing has become a commodity, taking efficient advantage of the new hardware architectures requires specialists with intimate knowledge of the particular hardware platforms.

Decreasing development time and cost for high-performance computing systems will enable organisations to move into new areas faster, cheaper, and with a lower barrier for entry. In some domains quick responses are often necessary, and shortening development time thus becomes even more important. The financial sector is an example, where sudden changes in markets necessitate timely reactions embodied as updated financial models and software.

Graphics processing units (GPUs) are a cost-effective technology for many problems in high-performance computing. Traditional CPUs are equipped with automatically managed caches and control logic, such as branch predictors or speculative execution, providing low-latency processors that can quickly switch between tasks. GPUs, in contrast, are designed for tasks where high throughput is more important than low latency, and they trade the transistors used for control logic and caches for additional compute units (ALUs). GPUs are thus equipped with thousands of individual compute cores and a complex hierarchical memory system managed by the user. To take advantage of a GPU, a programmer needs thousands of simultaneous threads and data-parallel algorithms with enough similar but independent tasks, where control divergence and synchronisation between threads are limited, as well as careful utilisation of the deep memory hierarchies.

This dissertation explores how the use of purely functional languages can enable improvements in programmer productivity and in the performance of data-parallel programs written for GPUs. Purely functional programming languages deliver a programming model where programs are constructed


from operators acting on arrays in bulk. By being pure, the languages provide referential transparency of operations, which allows equational reasoning by users as well as compilers. This in turn enables a wealth of optimisations to be performed.

1.1 Landscape of functional array languages for GPUs

The HIPERFIT research center was established at the University of Copenhagen in the aftermath of the 2008 financial crisis, with the following research goals: "HIPERFIT seeks to contribute to effective high-performance modelling by domain specialists, and to functional programming on highly parallel computer architectures in particular, by pursuing a research trajectory informed by the application domain of finance, but without limiting its research scope, generality, or applicability, to finance" [15].

To initiate efforts in the direction stated by the HIPERFIT research goals, the first project I did as part of my Ph.D. was to evaluate the state of the art in functional data-parallel languages for GPUs, by implementing a few algorithms from financial engineering. We surveyed the languages Accelerate [28], Nikola [77], NESL/GPU [11], Copperhead [26], and Thrust [83]. Our conclusion from the study was that current functional languages for data-parallel GPU programming were limited in several ways.¹

We found that the existing data-parallel languages for GPU programming provided limited control when implementing high-performance algorithms, control which is necessary for certain algorithms to be implemented efficiently. For instance, data layout and data-access patterns are important considerations for obtaining good performance on GPU systems. Being unable to reason about the data layout or data access after applying a built-in operation, for example a transposition, made it hard to optimise algorithms.

We also found that certain loop structures could not be expressed in many of the languages. For instance, some algorithms involve construction of small arrays, or parts of arrays, small enough to be constructed in a single thread. The languages did not provide the sequential loop constructs necessary for expressing such algorithms. Languages also often tend to provide a single version of each operation: for example, transposing an array, performing multiplication, et cetera. However, when arrays are small it may be faster to let all threads perform identical operations, instead of, say, parallelising a 3-by-3 matrix inversion. The surveyed languages did not provide the flexibility of choosing between these different algorithmic strategies.

Performance reasoning and control is important to high-performance computing programmers. Brad Chamberlain from Cray Inc., developer of Chapel [30], presented his vision for an ideal language for high-performance computing in a recent talk at DIKU [30]. He argued that computational scientists and

¹ A preliminary version of this study is available in my Master's thesis [25]. The full comparison was unfortunately never published.

financial engineers want to express their problems in architecture-independent terms, but an ideal language should still allow full control to the HPC programmer, whose job is to ensure the performance of the final application.

Finally, in several cases we found it necessary to manually flatten algorithms, as the only language supporting the flattening transformation was NESL/GPU [11]. We could also confirm, however, that the flattening transformation caused a huge memory overhead for NESL/GPU: it used 128 times the amount of memory necessary for one of our case studies, where, in comparison, we were able to entirely fuse away the array in the Thrust [83] implementation of the same program.

The above conclusions motivated our work. With these lessons in mind, we went on to design an experiment that would allow us to make progress in the design of functional array languages for GPUs.

1.2 Thesis

To respond to the identified problems and gaps in existing research on functional data-parallel languages for GPUs, we decided to work on a compiler for APL, with the hope of eventually targeting GPUs. APL is an array programming language, with a functional core and a large collection of built-in operations acting on arrays.

The choice of APL as our language of study was first of all motivated by the wide usage of APL and its descendants in the financial industry for more than forty years. There is thus evidence that the selection of operations provided by APL is suitable for a large range of problems, especially in the domain of finance. The financial institutions backing HIPERFIT only provided a few financial benchmarks we could study. By compiling APL, we were able to separate out the concern of whether our language would be adequate for implementing financial algorithms, as this has already been proven for APL. The second reason for selecting APL is that the built-in array operations in APL are generic and can be implemented by highly parallel algorithms, or, quoting Robert Bernecky: "Unlike other languages, the problem in APL is not determining where parallelism exists. Rather, it is to decide what to do with all of it." [12].

We further speculated that APL could not only be used by programmers for declarative specification of data-parallel algorithms, but also to instrument the compilation process of the underlying lower-level language, through annotations at the APL level. In this way the APL programmer would not only implement his algorithms, but also be able to reason about their performance, and optimise through annotations at the APL level or by inlining expressions written in one of the lower-level languages.

The main thesis of this dissertation was thus born: efficient compilation of the high-level data-parallel language APL can be achieved by compiling through a stack of several typed functional array languages. This idea is illustrated in Figure 1.1. In the following sections we will briefly sketch our work towards this goal.

APL → Intermediate language 0 → Intermediate language 1 → ... → Intermediate language N → GPU code

Figure 1.1: Stack of intermediate languages

1.3 Compiling APL into a typed array language

APL is a notation and programming language designed for communicating mathematical ideas and describing computational processes [63]. The language is dynamically typed, with a functional core supporting first- and second-order functions, and with multi-dimensional arrays as the main data structure. The language makes extensive use of special characters for denoting the many different built-in functions and operators, which include prefix sums (\), array transposition (⍉), and a generalized multi-dimensional inner product.

Traditionally, APL is an interpreted language, although there have been many attempts at compiling APL into low-level code, in both online and offline settings [50, 13, 47, 22]. However, only limited attention has been given to APL from a programming-language semantics point of view [92].

In Chapter 3 we present an APL compiler that elaborates a large subset of APL into a typed functional intermediate language, called TAIL, for which we provide a rigorous dynamic semantics and a static type system. The front-end of the APL compiler deals with many of the gory details of the APL language, including infix resolution, scalar extension resolution, resolution of functions and operators, and resolution of identity items for reduce operations.

TAIL provides a shape-polymorphic type system, conceptually close to that of the language Repa [68], with a wide set of array operations supported on both shapes and arrays. We also make use of subtyping to allow shapes to be used in array contexts. These features are necessary in the translation of APL, as APL does not make a distinction between shapes and arrays, and having access to shape information at the type level makes several optimisations possible.

Chapter 3 is an extension of previously published work [43]. The extensions include sequential loop constructs and mutable arrays with store semantics, which allow many programs requiring sequential loops and irregular array-iteration patterns to be implemented. Chapter 3 further extends our previous work by documenting that larger benchmarks, such as an option pricer from the partner company LexiFi, are implementable in our subset of APL. It should also be mentioned that TAIL has been compiled to both Accelerate (by a Master's student) and Futhark (by colleagues at the HIPERFIT research center), which documents that TAIL is a suitable intermediate language for compilation to GPUs.

1.4 Hierarchical functional data-parallelism

After the TAIL project, we turned our attention towards compiling TAIL to GPU code. Modern GPUs provide thousands of cores; however, they provide very limited amounts of fast memory close to the processing units. When data is moved from slow memory into fast memory, it is therefore important to be able to perform enough computation to keep the processors active. Otherwise, compute units will spend most of their time waiting for data transfers to complete, and we do not take full advantage of the GPU.

Fusion optimisations are techniques for reducing the number of memory transactions in a program, by combining separate consecutive traversals over the same memory area into a single pass. Fusion allows programmers to write programs as logically separate functions, and later compose them into larger programs without the overhead of writing intermediate data structures to memory. However, as stated previously, data-parallel functional languages often perform these fusion optimisations automatically, with limited algorithmic control for the user. In practice we want the ability to reason about the performance aspects of function composition.

Instead of directly translating the built-in functions of TAIL into efficient GPU kernels, we thus started from the other end, constructing a functional language suitable for implementing the necessary GPU programs, while providing both the ability to achieve fusion and the ability to reason about performance. The resulting language, named FCL, is inspired by projects such as Obsidian [96], Sequoia [44], and CUB [84], and provides control over the level of the hierarchical GPU system at which a particular function is executed. FCL originated as a reimplementation of Obsidian as a standalone language, and has since been extended with additional looping constructs and other primitives, allowing further programs to be implemented. Fusion is achieved through the use of delayed array representations, which are only written to memory at the request of the user. The language is polymorphic in the level of the GPU hierarchy, allowing users to write level-agnostic programs. FCL also provides various standard optimisations and support for multi-kernel programs.

We identify several problems with the use of delayed array representations. Future work includes I/O cost models for the hierarchical language, as well as type-directed translation, such that algorithmic choices can depend on the level of the hierarchy at which a function is invoked.

1.5 Contributions

While we deviated from our original plan of providing a complete APL compiler for GPU programming, we present two separate projects working in the direction of the stated thesis. Below we list the individual contributions from the two projects.

The contributions on compiling APL to TAIL are as follows:

• We present a statically typed array intermediate language, TAIL, with support for multi-dimensional arrays and operations on such arrays. The type system supports several kinds of types for arrays, with a gradual degree of refinement using subtyping. The most general array type keeps track of array ranks, using so-called shape types. One-dimensional arrays may be given a more refined type, which keeps track of vector lengths (also using shape types). To allow inference of such ranks and vector lengths, the type system also supports special refined variants of singleton scalar values and singleton vector values of statically known value. These are necessary for typing and inferring the results of operations acting on shapes themselves.

• We demonstrate that TAIL is suitable as the target for an inference algorithm for an APL compiler. The type system of the intermediate language allows the compiler to treat complex operations, such as matrix multiplication, and generalized versions thereof (inner products of higher-ranked arrays), as operations derived from the composition of other more basic operations.

• We further extend TAIL to support array updates. We present a type system and store-based semantics with support for mutable arrays, array indexing, and array updates.

• We demonstrate that TAIL is useful as a language for further array compilation, by demonstrating that the array language can be compiled effectively into a low-level array intermediate language, called Laila, which lends itself to a straightforward translation into sequential code, OpenMP-annotated C code, and GPU kernel code. The compilation is based on the concept of hybrid pull-arrays [78], which combines several variations of functional delayed arrays with traditional materialized arrays for supporting array updates.

• We demonstrate that the typed array language can be used to compile to GPU code by compiling TAIL to the existing functional GPU language Accelerate [28, 23]. As part of a student project, it has also been demonstrated that TAIL can, with good results, be compiled to the GPU language Futhark [55].

The contributions on hierarchical data-parallelism and FCL are as follows:

• We present FCL, a statically typed hierarchical data-parallel language for GPU programming. FCL extends the work on Obsidian, a data-parallel language for GPUs. A main consideration for both FCL and Obsidian is to allow users to reason about performance and to experiment with various algorithms for solving a problem. We thus allow a programming style that enables fusion, in a way where users can reason about when fusion will happen. In contrast to Obsidian, FCL is implemented as a self-contained compiler, lifting us from some of the restrictions of prototype languages implemented as embedded DSLs.

• Data-parallel algorithms are often based on a divide-and-conquer approach, where subproblems are solved by individual parallel processors, and the results are assembled to form the final result. The Obsidian language only allows these results to be assembled through concatenation. To support a broader range of algorithms, we extend this method to allow permutations to be performed before the results are concatenated.

• Larger programs require many different GPU kernels. Obsidian users need to write and compile every kernel separately. FCL allows a programming style that supports multiple kernels, where kernels are automatically compiled. Obsidian relied on its host language Haskell for composing programs and for executing kernels; FCL generates standalone C programs and the necessary OpenCL kernel files.

• To support loops inside sequential code, Obsidian programmers had to rely on code generation in the meta-language and were thus limited to completely unrolled loops. This strategy is not always desirable, as it can lead to code explosion. We present a strategy that permits sequential loops to be generated and that integrates with the existing model.

• We also identify weaknesses of the current approach used in FCL and Obsidian, such as the limitations imposed by the use of push/pull arrays. Push arrays, following the current definition, do not allow us to lift FCL programs to operate on multiple GPU devices, as writing a so-called writer function for use at device level must be a bulk operation, whereas push arrays are based on writing elements individually. An alternative implementation strategy is thus needed if we want to support multiple devices.

1.6 Structure of the dissertation

The thesis is structured as follows. In Chapter 2 we introduce the necessary background on the design and implementation of functional data-parallel programming languages. Chapter 3 presents the TAIL language and compiler, and is an extended version of a previously published paper. Chapter 4 briefly covers GPU programming and hardware. Chapter 5 presents the FCL language. Chapter 6 concludes.

Chapter 2

Functional array language design

Arrays are fundamental data structures in most programming languages for high-performance computing, as many computational problems susceptible to parallel acceleration can be written in terms of data-parallel operations on arrays. Programming languages in this domain differ in terms of the set of provided array operations, how array operations combine into whole programs, and in the ability for users to reason about programs and their performance.

We will make a distinction between the concept of an array language and that of an array library. While both array libraries and array languages provide a great variety of built-in operations on arrays, they differ in the type of operations and in how operations combine. The main focus of array libraries is to provide independent high-performance software routines, often solving very specific tasks, such as solving systems of linear equations, least-squares fitting, or the underlying linear algebra operations such as matrix factorisation algorithms. For array languages, the main focus is not on the performance of the individual software routines, but on achieving high performance when several operations are composed into a single program.

Constructing an efficient array library comes down to selecting the routines necessary for the domain and implementing optimised algorithms for each routine individually, annotating their individual asymptotic performance behaviour. Different library implementations can be provided, with each implementation optimised for a specific hardware platform. Constructing an efficient array language, on the other hand, requires a careful selection of built-in operations on arrays that allows efficient parallel code to be generated for compositions, as well as providing the user with the ability to reason about the performance of such compositions. There is a trade-off involved when it comes to selecting the built-in operations of such languages, as adding additional operations also increases the complexity of the language, which makes reasoning harder [64].

This chapter will introduce the necessary concepts and background on the design and implementation of functional data-parallel array languages. We will introduce the concepts by defining a small functional language with arrays that exhibits the problems an array language implementor needs to address.


e ::= true | false | i | d | x | [~e] | op
    | fn x => e | e1 e2
    | let x = e1 in e2 | fix e
    | if e1 then e2 else e3
    | (e1, e2) | fst e | snd e

i ∈ Z        d ∈ D (doubles)        x ∈ Var        op ∈ ScalarOps ∪ ArrayOps

Figure 2.1: Abstract syntax of ALcore

Along the way, we will restrict our language with restrictions common in current typed functional array languages. By restricting the set of allowed programs, language implementors can provide better performance guarantees for the users, and additional optimisation opportunities might arise.

No new ideas are introduced in this chapter; we will instead use the opportunity to discuss related work in the field.

2.1 A core functional language with arrays

We will base our discussion of array languages on a small core language, ALcore, which extends the typed lambda calculus with integers, doubles, booleans, conditionals, tuples, let-expressions, a fixed-point combinator (allowing recursion), and arrays.

Notation 2.1 (Sequences) Whenever z is some object, we write ~z to range over sequences of such objects. When we want to be explicit about the size of a sequence ~z = z0, ..., z(n−1), we often write it on the form ~z(n), and we write z, ~z to denote the sequence z, z0, ..., z(n−1).

Let Var be a denumerably infinite set of program variables. Let Ops be a finite set of built-in operations. We use i and n to range over integers. Figure 2.1 shows the abstract syntax of ALcore. Expressions consist of scalar integers, scalar doubles, booleans, variables, array expressions, built-in operations (op), abstraction forms, application forms, let-expressions, the fixed-point combinator (fix), conditional expressions, tupling, and tuple projections.

We divide the set of built-in operations in two: a set of operations over scalar values, ScalarOps, and a set of operations on arrays, ArrayOps. We will use the same set of operations over scalars throughout the chapter, whereas we will extend and restrict the set of array operations as necessary, to illustrate issues and techniques of array language design.

We define the set of standard operations on scalar values, which consists of standard arithmetical operations overloaded to work on both integers and floating point values, as well as maximum and minimum functions, standard relational operations, and logical operations.

ScalarOps = { +, -, *, /, div, mod, max, min, ==, <, >, &&, || }

In array language design, the choice of built-in array operations is of utmost importance, as it is through these array operations that the structure of array computations is defined. Moreover, for the language implementor, the choice of operations determines the optimisations that users and compilers can or cannot perform. In this first section we introduce a very limited set of built-in array operations: array creation (generate), array indexing (index), and array length (length).

ArrayOps = { generate, index, length }

The operation generate en ef is an introduction form for arrays, constructing an array of length en by applying ef to each index of the array, thus returning [ef 0, ef 1, ..., ef (en − 1)]. Arrays can also be introduced as literal arrays with the notation [~e]. Elimination forms for arrays consist of index e ei, for obtaining the value at index ei in the array e, and length e, for determining the number of elements in the given array. Together, ScalarOps and ArrayOps constitute the built-in operations of ALcore.

The choice of having generate, index, and length as our only vehicles for expressing computations on arrays is made for presentation purposes. This limited set of array operations allows us to discuss the issues faced when implementing an array language. In later sections of the chapter, we will change the set of ArrayOps to allow additional programs to be written, and to make properties of algorithms, such as memory access patterns, more evident.

Notation 2.2 (Recursive definitions) To simplify the introduction of recursive functions, we introduce the common derived form:

letrec x = e1 in e2

which should be considered equivalent to writing the following expression:

let x = fix (fn x => e1) in e2
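As a brief illustration (a sketch of our own, using only the constructs of ALcore introduced so far), the factorial function can be written with this derived form:

letrec fact = fn n => if n == 0 then 1 else n * fact (n - 1)
in fact 10

After desugaring, fact is tied to itself through fix, so the recursive call in the body refers to the function being defined.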

Notation 2.3 (Top-level definitions) When we want to present program fragments, we will use the following notation for top-level definitions:

fun functionname arg0 arg1 ... argn = body

This notation should be considered equivalent to writing the following let-expression, scoping over the rest of the document:

letrec functionname = fn arg0 => fn arg1 => ... fn argn => body in ... rest of file ...

Example 2.1 To exemplify, we can build a few reusable combinators. As a first example, consider "iota", which creates a length-n array containing the integers 0 through n − 1:

fun iota n = generate n (fn x => x)

We can also define the well-known combinator "map", which applies the same function to all elements of an array:

fun map f arr = generate (length arr) (fn i => f (index arr i))

With these definitions in place, we can build simple programs such as:

map (fn x => x * x) (iota 1000)

In this example we first create an array [0, 1, ..., 999] and then apply a squaring function to each element, creating a new array with elements [0², 1², ..., 999²].

2.1.1 Typing of ALcore

Types (τ) and type schemes (σ) for ALcore are of the form:

τ ::= α | int | double | bool | [τ] | (τ1, τ2) | τ1 → τ2     (types)
σ ::= ∀~α. τ                                                 (type schemes)

We assign type schemes to built-in operations, op, by defining the relation TySc(op) in Figure 2.2 and Figure 2.3. We define ftv(τ) as the set of free type variables in τ. A type environment Γ is a set of type assumptions of the form x : σ, mapping program variables to type schemes:

Γ ::= ε | x : σ, Γ

A substitution S = [τ0/α0, ..., τ(n−1)/α(n−1)] is a mapping from type variables to types. Let S be such a substitution and let τ be a type; then S(τ) is the type obtained by simultaneously replacing every occurrence of αi in τ by τi, renaming type variables not bound in S if necessary. A type τ′ is an instance of a type scheme σ = ∀~α.τ, written σ ≥ τ′, if there exists a substitution S such that S(τ) = τ′.

In Figure 2.4 we formalise the typing rules for ALcore as the judgment form "Γ ⊢ e : τ", which can be read as "under the assumptions Γ, the expression e has type τ."

op ∈ ScalarOps        TySc(op)

+   : int → int → int
-   : int → int → int
*   : int → int → int
div : int → int → int
mod : int → int → int
+   : double → double → double
-   : double → double → double
*   : double → double → double
/   : double → double → double
max : int → int → int
min : int → int → int
==  : int → int → bool
<   : int → int → bool
>   : int → int → bool
&&  : bool → bool → bool
||  : bool → bool → bool

Figure 2.2: Types of scalar operations in ALcore

op ∈ ArrayOps        TySc(op)

index    : ∀α. [α] → int → α
length   : ∀α. [α] → int
generate : ∀α. int → (int → α) → [α]

Figure 2.3: Types of array operations in ALcore

Example 2.2 Following these definitions we can infer the following types for the iota and map combinators introduced in Example 2.1.

iota : int → [int]
map  : ∀αβ. (α → β) → [α] → [β]

Notation 2.4 (Signatures) When defining top-level functions in ALcore, we will annotate their types using the following notation:

sig functionname : type
fun functionname a0 a1 a2 = body

All type variables occurring in type signatures are implicitly quantified at the outermost position.

Example 2.3 As an example using this notation, consider the following function:

Static semantics                                              Γ ⊢ e : τ

  ──────────── (2.1)      ──────────── (2.2)      ───────────── (2.3)
  Γ ⊢ i : int             Γ ⊢ b : bool            Γ ⊢ d : double

  Γ ⊢ ei : τ    for each i ∈ [0; n)               Γ ⊢ e : τ → τ
  ───────────────────────────────── (2.4)         ────────────── (2.5)
  Γ ⊢ [~e(n)] : [τ]                               Γ ⊢ fix e : τ

  Γ(x) = σ    σ ≥ τ                               TySc(op) ≥ τ
  ───────────────── (2.6)                         ───────────── (2.7)
  Γ ⊢ x : τ                                       Γ ⊢ op : τ

  Γ ⊢ e1 : τ1    ~α = ftv(τ1)    (Γ, x : ∀~α.τ1) ⊢ e2 : τ2    ftv(Γ) ∩ {~α} = ∅
  ───────────────────────────────────────────────────────────────────── (2.8)
  Γ ⊢ let x = e1 in e2 : τ2

  (Γ, x : τ′) ⊢ e : τ                 Γ ⊢ e1 : τ′ → τ    Γ ⊢ e2 : τ′
  ─────────────────────── (2.9)       ───────────────────────────── (2.10)
  Γ ⊢ fn x => e : τ′ → τ              Γ ⊢ e1 e2 : τ

  Γ ⊢ e1 : τ1    Γ ⊢ e2 : τ2          Γ ⊢ e : (τ1, τ2)         Γ ⊢ e : (τ1, τ2)
  ────────────────────────── (2.11)   ─────────────── (2.12)   ─────────────── (2.13)
  Γ ⊢ (e1, e2) : (τ1, τ2)             Γ ⊢ fst e : τ1           Γ ⊢ snd e : τ2

Figure 2.4: Static semantics of ALcore

sig limit : [int] -> [int]
fun limit = map (fn x => max (-100) (min 100 x))

At this point it is worth reflecting on some of the issues that will surface if we try to implement a compiler for this language. Without having introduced the evaluation semantics, we can already observe some of the larger problems that will need to be handled. First, we can observe that the presented typing rules do not put restrictions on the types of values that can be stored in arrays. A compiler would thus need to handle arrays of arrays, and arrays of functions. Furthermore, parallel expressions (generate) can launch other parallel computations, and the only mechanism for iteration in the language is recursion. For many parallel architectures, it is difficult to generate efficient code if parallel operations are allowed to launch other parallel operations. The rest of the chapter will discuss these issues and common ways to address them, often by introducing restrictions on the language. We will end this section with an overview of the rest of the chapter.

2.1.2 Values in ALcore

Figure 2.5 defines the value forms of ALcore. Values are either scalars (integers, booleans, doubles), tuples, lambda abstractions, built-in operations, or arrays. Notice that arrays in this basic language can contain any type of value, including array values and functions. In Figure 2.6 we define the typing of values in ALcore, with the judgment form "Γ ⊢ v : τ".

Values

v ::= i | b | d
    | (v1, v2)
    | fn x => e
    | [v0, ..., v(n−1)]
    | op

Figure 2.5: Values of ALcore

Value typing                                                  Γ ⊢ v : τ

  ──────────── (2.14)     ──────────── (2.15)     ───────────── (2.16)
  Γ ⊢ i : int             Γ ⊢ b : bool            Γ ⊢ d : double

  TySc(op) ≥ τ                 (Γ, x : τ′) ⊢ e : τ
  ───────────── (2.17)         ─────────────────────── (2.18)
  Γ ⊢ op : τ                   Γ ⊢ fn x => e : τ′ → τ

  Γ ⊢ v1 : τ1    Γ ⊢ v2 : τ2           Γ ⊢ vi : τ    for each i ∈ [0; n)
  ────────────────────────── (2.19)    ────────────────────────────────── (2.20)
  Γ ⊢ (v1, v2) : (τ1, τ2)              Γ ⊢ [v0, ..., v(n−1)] : [τ]

Figure 2.6: Value typing of ALcore

Evaluation contexts

E ::= [·] | E e | v E | [~v, E, ~e]
    | let x = E in e | fix E
    | if E then e1 else e2
    | (E, e) | (v, E) | fst E | snd E

Figure 2.7: Evaluation contexts for ALcore

2.1.3 Dynamic semantics of ALcore

We will define the dynamic semantics of ALcore as a small-step relation on expressions. We define the concept of evaluation contexts in Figure 2.7; these are ranged over by E and denote expressions with one and only one unfilled hole, indicated by the symbol [·]. When E is an evaluation context and e is an expression, we write E[e] to denote the expression resulting from filling the hole in E with e. We write e[v/x] to denote the capture-avoiding substitution of v for x in the expression e, renaming bound variables where necessary. The small-step reduction rules for ALcore are given in Figure 2.8, with the judgment form e ⇒ e′/err. The small-step relation is explicit about out-of-bounds errors on array indexing.

Dynamic semantics                                             e ⇒ e′/err

  e ⇒ e′    E ≠ [·]             e ⇒ err    E ≠ [·]
  ───────────────── (2.21)      ────────────────── (2.22)
  E[e] ⇒ E[e′]                  E[e] ⇒ err

  let x = v in e ⇒ e[v/x]                      (2.23)
  (fn x => e) v ⇒ e[v/x]                       (2.24)
  fix (fn x => e) ⇒ e[(fix (fn x => e))/x]     (2.25)
  if true then e1 else e2 ⇒ e1                 (2.26)
  if false then e1 else e2 ⇒ e2                (2.27)
  fst (v1, v2) ⇒ v1                            (2.28)
  snd (v1, v2) ⇒ v2                            (2.29)
  length [v0, ..., v(n−1)] ⇒ n                 (2.30)

  i ∈ [0, n)                              i ∉ [0, n)
  ─────────────────────────────── (2.31)  ──────────────────────────────── (2.32)
  index [v0, ..., v(n−1)] i ⇒ vi          index [v0, ..., v(n−1)] i ⇒ err

  vf i ⇒* vi    for each i ∈ [0, n)    0 < n
  ─────────────────────────────────────────── (2.33)
  generate n vf ⇒ [v0, v1, ..., v(n−1)]

  n ≤ 0
  ─────────────────── (2.34)
  generate n vf ⇒ []

  i = i0 + i1               v = i0 × i1
  ──────────────── (2.35)   ──────────────── (2.36)   ...
  i0 + i1 ⇒ i               i0 * i1 ⇒ v

  i0 = i1                    i0 ≠ i1
  ─────────────────── (2.37) ──────────────────── (2.38)   ...
  i0 == i1 ⇒ true            i0 == i1 ⇒ false

Figure 2.8: Dynamic semantics of ALcore

A well-typed expression e is either a value, or it can be reduced into another expression e′, or into the special token err. We use the notation e ⇒* e′ for the transitive and reflexive closure of ⇒. The parallel nature of the ALcore language is introduced by the generate construct, where each array element can be computed in parallel. The available parallelism can be observed in the evaluation rule for generate, where the evaluation order is not given. It would be straightforward to prove a soundness theorem, by proving type preservation and progress for our language.

2.1.4 Compiling ALcore: Not that simple a task

Although ALcore is fairly limited, it allows for expressing a large number of array programs, through the use of generate, index, and recursion. For instance, as demonstrated in the following example, we can implement a reduction operation computing b ⊕ x0 ⊕ x1 ⊕ ... ⊕ x(n−1), for an associative binary operator ⊕ and neutral element b.

Example 2.4 (Parallel reduction in ALcore)

sig halve : [a] -> ([a], [a])
fun halve arr =
  let n = length arr in
  let half = n div 2 in
  (generate (n - half) (fn i => index arr (2*i)),
   generate half (fn i => index arr (2*i + 1)))

sig zipWith : (a -> b -> c) -> [a] -> [b] -> [c]
fun zipWith f a1 a2 =
  generate (min (length a1) (length a2))
           (fn i => f (index a1 i) (index a2 i))

sig reduce : (a -> a -> a) -> a -> [a] -> a
fun reduce f b arr =
  if length arr == 0
  then b
  else if length arr == 1
  then f b (index arr 0)
  else let (h1, h2) = halve arr
       in reduce f b (zipWith f h1 h2)
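With these definitions in place, summing the first hundred integers amounts to the following (an illustrative evaluation of our own):

reduce (+) 0 (iota 100)

which evaluates to 4950 by repeatedly halving the array and combining the two halves element-wise with zipWith (+).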

However, the generality of ALcore comes at a price. Being explicit about indices makes programming an error-prone process, and reasoning about programs is hard. Simple properties, such as the equality:

map f (map g x) ≡ map (f ◦ g) x

are less obvious when expressed in terms of generate and index. The above equality is essential in array languages, as it allows intermediate arrays to be removed. In Section 2.2 we will discuss an alternative language not based on generate and index, but on primitives for which such relations are easier to express. In Section 2.3 we will discuss various implementation approaches for fusion.

The use of recursion as the means for iteration is also a problem for a compiler writer, as recursion is too powerful compared with the facilities provided by current data-parallel architectures. Consider the alternative formulation of reduce:

sig reduce_alt : (a -> a -> a) -> a -> [a] -> a
fun reduce_alt f b arr =
  if length arr == 0
  then b
  else if length arr == 1
  then index arr 0
  else let (x, y) = halve arr
       in f (reduce_alt f b x) (reduce_alt f b y)

In this variant, the function is not tail-recursive, and in each iteration several smaller reductions are spawned. The possibilities for managing a recursion stack or launching new operations within a parallel program are very limited on architectures such as GPUs. It is thus common for current functional array languages to provide alternative looping constructs [59, 28, 48], or to restrict recursion to tail-recursion. As noted previously, ALcore also allows nested parallelism and nested arrays, another non-trivial problem that would need to be handled if anyone wanted to implement the language as it is.
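To make the problem concrete, the following term (our own illustration, not from the examples above) is well typed under the rules of Figure 2.4, yet it nests one parallel generate inside another and produces an irregular array of arrays, where row i has length i:

generate 10 (fn i => generate i (fn j => i * j))

Compiling such a term for a GPU requires deciding how to map the two levels of parallelism onto the hardware, and how to represent the non-rectangular result in memory.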

2.1.5 Chapter outline

The rest of the chapter is structured as follows. Section 2.2 will introduce additional constructs commonly found in array languages, such as reduction, prefix-sum, and various index-space transformations. In Section 2.3 we will present the issues related to fusion and discuss various proposed solutions. In Section 2.4 we will discuss how recursive procedures can be compiled, why that is infeasible on modern parallel architectures, and how we replace the general recursion of ALcore with more limited iteration constructs. In Section 2.5 we discuss the issue of nested irregular data-parallelism.

2.2 Array operations

We consider array languages to be languages with the characteristics of having arrays as a main and first-class data structure, with a set of built-in generic, composable data-parallel operations, and where operator composition does not lead to performance degradation. The selection of built-in operators of such languages thus serves a crucial role, as the choice of operators decides the programs that may be written and the optimisations that may be performed. In this section we will introduce a few widely used parallel operators found in functional array languages. The ALcore language introduced in the previous section provides a single operator for creating arrays, the generate operation, and one operator for accessing array elements, index. In this section we will introduce alternatives to generate that operate on elements in bulk, allowing us to reason more clearly about programs. We will postpone the discussion of fusion to Section 2.3.

2.2.1 Element-wise operations

Our first step towards abandoning generate as a data-parallel operation will be to introduce the most well-known operation on arrays from functional programming, the map, which, given an array [~x] and a function f, produces the array [~y], where yi = f xi for all i. As we have already shown in Section 2.1, map and the similar zipWith can be implemented directly in ALcore using generate. Similarly, element-wise operations for functions of higher arity can also be provided. Languages such as NESL and Futhark provide such a generalised map:

mapN : ∀~α β. (α0 → ... → α(n−1) → β) → [α0] → ... → [α(n−1)] → [β]

2.2.2 Array initialisation

In ALcore the generate operation was also the only array introduction form, except for literal arrays. Instead, our modified language of array combinators will use the operator iota from APL as its introduction form. Given an integer n, iota n produces the array [0, 1, ..., n − 1].

2.2.3 Parallel reduction

In addition to acting element-wise on a few values at a time, we often also need operations for summarising or computing aggregates of large amounts of data, such as averaging, summation, or finding the maximum value of an array. A reduction operation computes b ⊕ x0 ⊕ x1 ⊕ ... ⊕ x(n−1), for an associative binary operator ⊕ with neutral element b. In ALcore we could introduce a reduction with the following type:

reduce : ∀α. (α → α → α) → α → [α] → α

Associativity of the binary operator allows this operation to be implemented in parallel, through a tree-structured divide-and-conquer algorithm. An illustration of the algorithm is presented in Figure 2.9. The input array is at the top of the diagram, and data flows through the edges from the top towards the bottom. A vertex represented by a black dot corresponds to an application of the binary operation ⊕ on the two input edges. Most often, the requirement that the given operator is associative is made a responsibility of the programmer. However, some languages, such as OpenMP and NESL, only provide specialised versions of reduction with one of seven operations: addition, product, maximum, minimum, bitwise or, bitwise and, and bitwise exclusive-or [17, 11, 71, 85]. These are the most common associative (and commutative) operators.


Figure 2.9: Parallel tree reduction, with associative operator.
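As an illustration (sketches of our own, in terms of the operations introduced in this section), summation and maximum are direct instances of reduce:

sig sum : [int] -> int
fun sum arr = reduce (+) 0 arr

sig maximum : [int] -> int
fun maximum arr = reduce max 0 arr

Note that using 0 as the neutral element for maximum assumes non-negative elements; a proper neutral element would be the smallest representable integer.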

2.2.4 Scan

Another parallel pattern used in many data-parallel algorithms allows not only access to the final accumulated value of a reduction, but also computes the accumulation of every prefix of the argument array. Given an array [~x], an associative binary operator ⊕, and an initial value b, an exclusive left scan computes the array [~y] where y0 = b and y(i+1) = b ⊕ x0 ⊕ x1 ⊕ ... ⊕ xi, for 0 ≤ i < n − 1. It has the type:

scanl : ∀α. (α → α → α) → α → [α] → [α]

The scan operation is exclusive, as the last element of the array is not included in any accumulation.

Example 2.5 (Exclusive left scan)

scanl (+) 0 [5, 1, 4, 8, 0, 7, 1, 3]
=> [0, 5, 6, 10, 18, 18, 25, 26]

The first element of the result of an exclusive scan is always the neutral element, which in this case is the value 0. The alternative, an inclusive left scan, includes the last array element in the accumulation, but does not require a neutral element.

Example 2.6 (Inclusive scan)

scanl1 (+) [5, 1, 4, 8, 0, 7, 1, 3]
=> [5, 6, 10, 18, 18, 25, 26, 29]
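When a neutral element b for the operator is available, the inclusive scan can be derived from the exclusive one; a small sketch of our own (which, unlike the scanl1 above, takes the neutral element as an explicit argument):

sig scanl1' : (a -> a -> a) -> a -> [a] -> [a]
fun scanl1' f b arr = zipWith f (scanl f b arr) arr

Each element of the exclusive scan is the accumulation of the strictly preceding prefix, so combining it with the corresponding input element yields the inclusive result.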

Scan operations were first proposed as a programming-language primitive in the context of APL [63], where the name also originates. The operation is also known as the prefix-sum of an array, as it computes the same value as a reduction, but for each prefix of the array. We will use the name scan for the remainder of the dissertation. Several parallel scan algorithms exist [90, 91, 21, 70]. Two in-place inclusive left scans are visualised in Figure 2.10. Scans were popularised by Guy Blelloch in the early 1990s because of their wide applicability and their parallel implementation for associative and commutative binary operators [19].


Figure 2.10: (a) The Sklansky construction for computing parallel prefix-sums. (b) The Kogge-Stone network for computing parallel prefix-sums.

2.2.5 Index-space transformations

To perform operations such as matrix transposition, array replication, and other operations rearranging the order of array elements, many languages provide dedicated operations for index-space transformations, or permutations. In general, we can specify permutations in two directions. Either we specify how indices in the old array map to indices in the new array (forward permutation), or we specify how the new array is constructed by mapping the new index-space into the index-space of the old array (backward permutation).

Backward permutation

The backpermute function constructs a new array of a given size, and for each cell of the new array a function specifies where in the old array the element should be copied from. The function could have been implemented directly in ALcore:

sig backpermute : int -> (int -> int) -> [a] -> [a]
fun backpermute n f arr = generate n (fn i => index arr (f i))

with the assumption that 0 ≤ f(i) < length(arr), for all 0 ≤ i < n.
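A classic use of backpermute (a small example of our own) is reversing an array, where the index mapping is its own inverse:

sig reverse : [a] -> [a]
fun reverse arr =
  let n = length arr
  in backpermute n (fn i => n - i - 1) arr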

Example 2.7 (Replicate an array) Repeat the same array for p repetitions.

sig replicate : int -> [a] -> [a]
fun replicate p array =
  let n = length array
  in backpermute (p * n) (fn i => i mod n) array

Example 2.8 (Gather) Backpermute, but where the transformation is given as an array mapping new indexes to indexes in the original array, instead of a function.

sig gather : [int] -> [a] -> [a]
fun gather from input =
  let f = fn i => index from i
  in backpermute (length from) f input
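A quick illustration with hypothetical values: gather [2, 0, 1] [10, 20, 30] produces [30, 10, 20], since element i of the result is the element of the input found at the index stored at position i of the first argument.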

Forward permutation

The complementary transformation, permuting an array by mapping the old index-space into the new index-space, is called a forward permutation. The operation works as follows. First, a new array is allocated and initialised with default values; then, for each element of the input array, the index-mapping is applied to determine where in the new array the value should be stored. To be deterministic, it is assumed that the index-mapping will not map several input values to the same location of the output. Alternatively, as we will present it here, an associative and commutative combination function is provided, which combines values if several values map to the same output location. The type of permute that we will introduce to ALop is:

permute : ∀α. (α → α → α) → (int → (int, bool)) → [α] → [α] → [α]

Permute allows the implementation of several useful operations, for example the filter and scatter operations presented below.

Example 2.9 (Scatter) Forward permutation, where the transformation is given as an array instead of a function.

sig take : int -> [a] -> [a]
fun take n arr = backpermute n (fn i => i) arr

sig scatter : [int] -> [a] -> [a] -> [a]
fun scatter to defaults input =
  let f = fn i => (index to i, true)
      n = min (length to) (length input)
      input' = take n input
  in permute const f defaults input'
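A small worked example with hypothetical values: scatter [2, 0] [10, 20, 30] [7, 8] evaluates to [8, 20, 7]; the value 7 is written to index 2, the value 8 to index 0, and the default 20 is kept at index 1.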

Example 2.10 (Filtering) Removing values based on some predicate can be done using a scan to compute the positions of the values to retain, and permute to create the filtered array based on these destination values.

sig const : a -> (b -> a)
fun const i = fn x => i

sig filter : (a -> bool) -> [a] -> [a]
fun filter pred arr =
  let keep = map pred arr
      flags = map (fn b => if b then 1 else 0) keep
      dest = scanl (+) 0 flags
      n = reduce (+) 0 flags
      default = map (const 0) (iota n)
  in permute const
             (fn i => (index dest i, index keep i))
             default arr
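For instance (a hypothetical evaluation of our own), filter (fn x => x mod 2 == 0) [5, 2, 8, 3, 6] evaluates to [2, 8, 6]: the flags are [0, 1, 1, 0, 1], the exclusive scan gives the destinations [0, 0, 1, 2, 2], and permute writes the three retained elements to positions 0, 1, and 2.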

2.2.6 Built-in vs. derived forms

We have now presented a few of the most common array operators found in functional data-parallel array languages. Most (if not all) of these operations could have been implemented using generate, index, and recursion. There are, however, benefits to having many of these operators built in, as they provide information about the computation structure that is useful for automatic optimisers. As we shall see in the next section, this can help minimise expensive memory transactions.

Providing built-in versions of scan, reduce, and so on allows the language implementor to use highly optimised versions of these operations, written by developers with expertise in the target platform. Such templates of data-parallel operations are called algorithmic skeletons, and they are a common implementation technique, used by the Bohrium project [71], NESL for GPUs [11], Accelerate [28], Repa [68], and others.

2.2.7 Array language extended with operators: ALop

To summarise, we have replaced the set of array operations, to avoid the very general generate operation. Our new set of core array operators consists of:

ArrayOps = { length, index, map, zipWith, iota, reduce, scanl, backpermute, permute }

with functions such as filter, replicate, gather, and scatter as derived forms. These should not be seen as a comprehensive set of built-in operations; in our presentation of TAIL in Chapter 3 we will introduce further constructs and cover a more complete set of array operations.

op            TySc(op)

length      : ∀α. [α] → int
index       : ∀α. [α] → int → α
iota        : int → [int]
map         : ∀αβ. (α → β) → [α] → [β]
zipWith     : ∀αβγ. (α → β → γ) → [α] → [β] → [γ]
backpermute : ∀α. int → (int → int) → [α] → [α]
reduce      : ∀α. (α → α → α) → α → [α] → α
scanl       : ∀α. (α → α → α) → α → [α] → [α]
permute     : ∀α. (α → α → α) → (int → (int, bool)) → [α] → [α] → [α]

Figure 2.11: Built-in operations in ALop

Operation        Type scheme

filter  : ∀α. (α → bool) → [α] → [α]
gather  : ∀α. [int] → [α] → [α]
scatter : ∀α. [int] → [α] → [α] → [α]

Figure 2.12: Derived operations in ALop

2.3 Fusion

The overarching goal of array language research is to deliver languages where users do not need to rely on built-in problem-specific building blocks, but where high performance can be achieved from the composition of completely generic operations, such as those described in the previous section, while also making it possible to reason about the achieved performance of programs.

Input–output operations that move data between memory systems and processing units are time-consuming operations on modern microprocessor architectures. Deep memory hierarchies have been put in place to alleviate some of the potential bottlenecks. There is a small amount of fast memory close to the processing units (registers), and increasingly larger but slower memories (caches, shared memory, device memory, main memory, disk drives, tapes). On such systems, it is important to use data as much as possible once they have been brought all the way to the fast memory, before returning results to slower memory. During computation, memory systems might still be active; thus another benefit of increasing the ratio of computation per I/O operation is that the processors can be kept active with useful work while waiting for data transfers, thereby achieving latency hiding.

In array languages, we express algorithms in terms of combinators acting on entire arrays, not on individual data elements. Often, these transformations can be combined using fusion techniques, such that several computations over the same memory area are combined into one pass over that memory. The same concept exists in other programming models, such as imperative loop-based languages, where the loop fusion optimisation combines several loops over the same range into a single pass over the data [9]. Loop fusion originates from the ALPHA-language compiler [110]. In general, fusion can be beneficial at each level of a memory hierarchy: at each level there is the potential to keep the data for additional computations, before returning results back to slower memory.

Example 2.11 (Vertical fusion) The problem of superfluous I/O operations may occur when we chain several operation calls after each other. As a simple example, consider the following two invocations of map:

fun f input =
  let xs = map (fn x => x / 100.0) input in
  let ys = map (fn x => x + 1.0) xs in
  ys

Following the dynamic semantics of ALcore, the array xs will be written to memory before being read during the computation of ys. In this case the construction of the intermediate array is unnecessary, and f can be computed directly:

fun f_fused input = map (fn x => (x / 100.0) + 1.0) input

Vertical fusion is the process of combining array operations that are applied in sequence.

Example 2.12 (Horizontal fusion) Another potential for optimisation arises when two independent computations act on the same data in a similar pattern.

fun g input =
  let xs = map (fn x => x / 100.0) input in
  let ys = map (fn x => x + 1.0) input in
  (xs, ys)

If we were to follow the dynamic semantics of ALop, xs and ys would be computed independently. This independence would result in two traversals over the input array. Horizontal fusion is the process of combining two such unrelated traversals into one sweep over the same data. In this case we would be able to optimise the function in the following way:

fun g_fused input = map (fn x => (x / 100.0, x + 1.0)) input

We do not want to require programmers to perform such fusion optimisations by hand, as that would sacrifice the ability to construct reusable components.

The different operations allow various possibilities for fusion; not all fuse with the same ease as map. To characterise how well operators fuse, it is common to distinguish between producers, such as map, that produce a result, and consumers, such as reduce. In general, producers can fuse both on their inputs and their outputs, while consumers only fuse on their inputs. In ALop, the producers are map, zipWith, and backpermute, and the consumers are reduce, scanl, and permute. There are also reasons to limit fusion, even in cases where it is possible. Some intermediate values are needed many times, and recomputing them every time they are needed might be more expensive than the memory operation that stores them for later.

2.3.1 Implementation approaches

There are, in general, two approaches to fusion. Either rewriting is always a local transformation, with the rewrite rules applied in a systematic fashion, or the compiler takes a global view of the program and is able to perform larger restructurings. The localised approach is often called short-cut fusion, whereas the global approach most commonly views the program as a graph structure and fusion as graph rewriting.

2.3.2 Short-cut fusion

Deforestation

Deforestation is the general term for various techniques used in functional programming languages for removing list- and tree-based intermediate data structures. Philip Wadler presented the original Deforestation Algorithm [106], which uses seven rewrite rules to remove such intermediate data structures. The algorithm requires the programmer to write functions in a so-called treeless form, and guarantees that all intermediate data structures will be removed when combining functions written in this form. The algorithm is, however, limited by the restrictions that the treeless form imposes: higher-order functions are disallowed, variables must be used linearly, and nested data structures (e.g., lists of lists) are not handled. Restricting programs to treeless form is necessary for guaranteeing termination of the algorithm.

Another and more successful approach is called build/foldr fusion [45]. In this approach the construction of a list is delayed by parameterising list-producing functions over the list constructors ("cons" and "nil"). To construct the list in memory, a special function build is used to provide the list constructors:

build g = g cons nil

Alternatively, if the list is to be directly consumed by a fold operation, the reduction operand and neutral element that would be given to foldr can be passed instead of the list constructors. To hide these complexities from the user, the library writer can always insert build and use the following rewrite rule to remove the unnecessary build-operations again [45, 67].

∀k, z, g.  foldr k z (build g) ≡ g k z

The build/foldr technique also generalises to operations on tree structures [67], but it is limited: left-folds (foldl) cannot be expressed, and for a function such as zip it is not possible to deforest both input lists. Stream fusion solves these problems by introducing a special stream data type, represented as a state and a step-function. Calling the step function yields the values of the stream. Operations stream and unstream are provided for converting between streams and lists [38]. Library functions are written in terms of streams, but the details are hidden from the user by applying stream and unstream, and a single rewrite rule removes the unnecessary intermediate streams:

∀s.  stream (unstream s) ≡ s

Both build/foldr and stream fusion produce values one at a time. To take advantage of SIMD operations and of optimised operations working on data in bulk, such as memcpy, it is necessary to allow alternative stream representations. Generalised Stream Fusion [76] allows this by representing each stream as a bundle of streams of varying representation. The final consumer of the stream decides which representation to use, and the alternative representations can be optimised away.
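The following Haskell sketch shows the essence of the stream representation (a minimal version of our own; the actual libraries [38, 76] add size hints, skip-steps, and fusible variants of many more operations):

{-# LANGUAGE ExistentialQuantification #-}

-- A stream is a step function over an existentially quantified state.
data Step s a = Done | Yield a s
data Stream a = forall s. Stream (s -> Step s a) s

-- Convert a list into a stream; the remaining list is the state.
stream :: [a] -> Stream a
stream xs0 = Stream next xs0
  where
    next []       = Done
    next (x : xs) = Yield x xs

-- Convert back; only here is an actual list materialised.
unstream :: Stream a -> [a]
unstream (Stream next s0) = go s0
  where
    go s = case next s of
             Done       -> []
             Yield x s' -> x : go s'

-- map on streams composes with the step function; no intermediate list.
mapS :: (a -> b) -> Stream a -> Stream b
mapS f (Stream next s0) = Stream next' s0
  where
    next' s = case next s of
                Done       -> Done
                Yield x s' -> Yield (f x) s'

With this representation, mapS f (mapS g s) steps each element through both functions directly, and the rule stream (unstream s) ≡ s removes the leftover conversions between adjacent library functions.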

Delayed array representations

Deforestation, especially stream fusion, has been successful in generating high-performing sequential code, even outperforming hand-tuned C in some instances [76], and is a popular choice for current string-processing and vector libraries in the functional programming community, with implementations in use in Haskell, Scala, and OCaml. However, these approaches work on sequential data and allow only limited parallelism, and direct indexing requires the underlying data structure to be materialised.

Another branch of short-cut fusion approaches is based on various delayed representations of arrays. This approach traces back to Abrams' concept of "drag-along" in APL compilation [1], where arrays "drag along" the computation of individual array elements until they are needed. The idea was further explored by Guibas and Wyatt, also in the context of APL [50]. In functional programming languages, the use of delayed array representations has only recently been applied: first in the representation of images in the domain-specific language Pan [41], and since then in various array languages [68, 97, 78, 8].

Pull arrays

The approach taken by Pan and most other libraries is to represent array values as a length and a function from indices to array elements, sometimes known as pull arrays. In the case of AL, this representation corresponds exactly to the parameters of generate:

generate : ∀α. int → (int → α) → [α]

Value typing in ALpull   Γ ⊢ v : τ

   Γ ⊢ v_n : int    Γ ⊢ v_f : int → τ
  ------------------------------------- (2.39)
       Γ ⊢ generate v_n v_f : [τ]

Dynamic semantics in ALpull   e ⇒ e′/err

   v_idx ∈ [0, v_n)
  ----------------------------------------------- (2.40)
   index (generate v_n v_f) v_idx ⇒ v_f v_idx

   v_idx ∉ [0, v_n)
  ----------------------------------------------- (2.41)
   index (generate v_n v_f) v_idx ⇒ err

  length (generate v_n v_f) ⇒ v_n   (2.42)

Figure 2.13: Value typing and dynamic semantics of ALpull.

Implementing pull arrays in AL corresponds to having generate as an additional value form for arrays:

v ::= ... | generate v_n v_f

In this way the computation of the individual array elements can be delayed until they are needed. Figure 2.13 shows the required changes to value typing and dynamic semantics. With these rules, indexing into an array becomes as cheap as a bounds check and a function application. Rule (2.40) is what enables fusion.

The simplicity of the implementation comes at the price of potential duplication of work, as the values of an array created using generate are recomputed every time they are accessed. Duplicating work is undesirable for all but very cheap computations. It is thus necessary to limit fusion in one way or another, which is often done using special annotations on arrays, denoting when a delayed array should be computed and written to memory. A common approach is to use a function with identity type for these annotations:

force : [α] → [α]

The semantics of force corresponds to converting the generate-based array into a materialised array:

force (generate v_n v_f) ⇒ [v_f 0, v_f 1, ..., v_f (v_n − 1)]   (2.43)
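A minimal Haskell sketch of pull arrays (our own illustration, using Data.Vector as the stand-in for materialised memory) shows why rule (2.40) gives vertical fusion for free and what force does:

import qualified Data.Vector as V

-- A pull array: a length and an index function.
data Pull a = Pull Int (Int -> a)

generate :: Int -> (Int -> a) -> Pull a
generate = Pull

-- Vertical fusion for free: mapping merely composes index functions,
-- so no memory traffic occurs until the array is forced.
mapP :: (a -> b) -> Pull a -> Pull b
mapP f (Pull n ix) = Pull n (f . ix)

-- Materialise the delayed computation, stopping further recomputation.
force :: Pull a -> Pull a
force (Pull n ix) = let v = V.generate n ix
                    in Pull n (v V.!)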

Pull arrays, however, only automate vertical fusion. Horizontal fusion is still possible through the use of tuples, but it must be performed by the user as a manual transformation, and it is not possible in all cases. Encoding computations as arrays of tuples creates other problems, as some architectures or programming interfaces do not provide an efficient implementation of such arrays of tuples. This can, however, be resolved through a conversion to tuples of arrays.

Finally, pull arrays do not support efficient concatenation. It is possible to define concatenation in terms of pull arrays, but it requires introducing a conditional that is executed for every element of the new array:

fun concat a1 a2 =
  generate (length a1 + length a2)
           (fn i => if i < length a1
                    then index a1 i
                    else index a2 (i - length a1))

On single-instruction multiple-data (SIMD) hardware such as GPUs, a conditional like this can be detrimental to performance: groups of processing units execute the same instructions in lock-step and will thus have to execute both branches for every element of the concatenated array. This can lead to serious performance degradation if the concatenated arrays involve large computations that have been fused inside the branches of the conditional.

Push arrays

To remedy the problem of inefficient concatenation, the concept of push arrays was introduced by Svensson et al. [36]. A push array can be seen as a dual to the concept of pull arrays. Push arrays do not replace pull arrays; they are an additional data structure used to express data-parallel programs. For a pull array, two important aspects are left undecided: the order in which the elements are generated and the destination in memory where elements are to be written. This allows a consumer of a pull array to traverse it in any order and decide where in memory the array is to be written. A push array, on the other hand, locks the iteration pattern, and only leaves the consumer to decide where in memory the elements should be written.

A push array is most often represented by a function that can construct an array when given a so-called writer-function. A writer-function is a function that accepts an element and an index and produces an assignment statement writing the element to its corresponding index in memory (here using Haskell notation):

type Writer a = a -> Idx -> Program ()

Here Program () is a computation in a code-generation monad. Push arrays are represented by a length and a function accepting such a writer-function:

type Push a = (Idx, Writer a -> Program ())

Materialising a push array is done by applying the function to a writer-function, and the writer will then be invoked for each array element. Values will thus always be generated in the order given by that writer-function (or iteration scheme) when we materialise the array, and we cannot access any single element of a push array before it has been fully materialised. This is not the same as saying that all push arrays necessarily write their elements in the same order; push arrays are in this way more restricted.

However, the fact that the iteration pattern is locked makes efficient concatenation possible. Two push array computations can be combined in sequence or in parallel, as appending their results is only a matter of offsetting the write operations of the second array, such that they are written where the first array ends. This offsetting is accomplished by parameterising the push array computation with a writer-function that stores elements to memory.

Computations on push arrays can still be fused, but the locked iteration pattern makes direct indexing impossible. A program will thus often switch between using pull and push arrays at different points, depending on the context. Conversion from pull array to push array is basically free; converting in the other direction requires the push array to be materialised.
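The following Haskell sketch (again our own simplification, with IO standing in for the code-generation monad Program and Int for Idx) shows how offsetting the writer-function makes concatenation cheap, with no per-element conditional:

import Data.Array.IO

-- Simplified push arrays: a length and a computation over a writer.
type Writer a = a -> Int -> IO ()
data Push a = Push Int (Writer a -> IO ())

-- A push array from an index function; the iteration order is fixed here.
fromFunction :: Int -> (Int -> a) -> Push a
fromFunction n f = Push n (\w -> mapM_ (\i -> w (f i) i) [0 .. n - 1])

-- Concatenation: run both computations, offsetting the second's writes.
append :: Push a -> Push a -> Push a
append (Push m p1) (Push n p2) =
  Push (m + n) (\w -> p1 w >> p2 (\x i -> w x (i + m)))

-- Materialise by supplying a writer that stores elements into an array.
materialise :: Push a -> IO (IOArray Int a)
materialise (Push n p) = do
  arr <- newArray_ (0, n - 1)
  p (\x i -> writeArray arr i x)
  return arr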

Alternative local approaches

There are alternative approaches with a local view of fusion. In Nessie [11], the NESL GPU compiler, only map-map fusion is implemented, as a rewrite rule; fusing maps with reductions is not possible. In Single-assignment C [48], all array operations are defined in terms of a single loop-expression called a "with-loop", which supports array creation, mutation, and array reduction. The Single-assignment C compiler employs various rewrite rules to combine two or more with-loops. Bohrium [71] is based on a simple bytecode of array operators, without any control flow; the control flow is managed by a host language that issues the bytecode operations in sequence. This sequence of bytecode operations is then JIT-compiled into parallel operations, and the JIT compiler will do its best to fuse as many operations as possible into a single parallel operation.

2.3.3 Graph-based approaches to fusion

An alternative to the various short-cut fusion approaches is to take a more global approach to fusion, analysing whole programs as graph structures and performing larger restructurings. This can have huge benefits, as it makes additional fusion possible. However, large automatic restructurings also limit transparency, and can thus limit the users' ability to optimise.

In a graph representation of an array program, nodes represent operations and edges represent data dependencies. Fusion decisions in such representations correspond to partitioning nodes into fusable clusters, or to rewriting graphs by merging nodes.

Figure 2.14: Three fusion possibilities for the example program. (a) No fusion. (b) Horizontal fusion of map and reduce. (c) Vertical fusion of the two map-operations.

Consider the following example from Robinson et al. [88]:

fun normalizeInc input =
  let xs = map (fn a => a + 1.0) input in
  let sum = reduce (+) 0.0 input in
  let ys = map (fn a => a / sum) xs in
  ys

This program can be described as the dataflow graph in Figure 2.14. The crossed-out edge corresponds to a noncontractable edge, as the reduction operation hinders further fusion. Grey areas correspond to fusable clusters. In this case there are three possible graph partitionings: no fusion (Figure 2.14a), horizontal fusion of the two iterations across the input array (Figure 2.14b), or vertical fusion of the two map operations (Figure 2.14c). The larger the program, the more possible clusterings there are to consider.

A good clustering minimises the overall cost of the graph, where the cost is determined by the number of memory accesses. In the normalizeInc example the solution in Figure 2.14c is optimal for large inputs, as this choice removes both a memory write and a memory read per array element, whereas the horizontal fusion in Figure 2.14b only reduces the number of memory reads by one per element.
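To make the cost comparison concrete, assume an input of n elements and count one memory transaction per element read or written (a simplified accounting of our own; it ignores caching effects and the final result write-back common to all three plans):

(a) no fusion:          3n reads (input twice, xs once) + 2n writes (xs, ys) = 5n
(b) horizontal fusion:  2n reads (input once, xs once)  + 2n writes (xs, ys) = 4n
(c) vertical fusion:    2n reads (input twice)          + 1n writes (ys)     = 3n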

Integer programming

Determining good clusterings is in principle NP-hard. However, a solution using integer programming has been developed that is efficient in practice [79]. Edges are annotated with weights corresponding to the memory cost of not fusing that edge. The optimisation goal is to find a graph partitioning that minimises the weight of inter-cluster edges. Let w_ij be the weight associated

with the edge between nodes i and j, and let x_ij be a 0/1-variable determining whether nodes i and j are placed in the same cluster (0) or not (1). The integer programming objective function is thus:

min  Σ_(i,j) w_ij x_ij

To ensure that the clustering is valid, constraints on the x_ij variables are necessary. We will not describe these constraints in detail, but just mention that they ensure that noncontractable edges are never fused, that no cycles are introduced after contraction, and finally that all clusters are closed (i.e., if x_ij = 0 and x_jk = 0 then we also have x_ik = 0). In addition to these constraints, further constraints are necessary when size-changing operations such as filter are allowed [88].
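As an illustration, the closure condition can be encoded linearly; the following triangle constraint is a standard encoding of our own choosing, not necessarily the exact formulation used in [79]:

x_ik ≤ x_ij + x_jk    for all nodes i, j, k

If x_ij = 0 and x_jk = 0, the constraint forces x_ik = 0, so clusters are closed under the "same cluster" relation.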

Graph rewriting

An alternative to working out optimal clusterings on graphs is to rewrite the graph based on a collection of rewrite rules, applied systematically. However, to allow horizontal fusion, such as fusing map and reduce in Figure 2.14b, we cannot just rewrite the combined map and reduce without extending the language. In Futhark [59, 55], fusion is performed by rewriting the standard array combinators into various operators that are not exposed to the programmer. The primary (unexposed) operator is their so-called redomap, which allows multiple maps and reductions to be fused into the same operation. This approach allows gradual refinement of the graph, where nodes are merged into redomaps, instead of making all fusion decisions at once. In Futhark the above normalizeInc program could potentially be rewritten:

fun normalizeInc input =
  let (sum, xs) =
    redomap ((+),
             fn (acc, x) => let v = x + 1.0
                            in (acc + v, v),
             0.0) input in
  let ys = map (fn a => a / sum) xs in
  ys

The combined reduction and map operator thus both accumulates the result of the reduction and produces the array output of the map. However, as we saw previously, the strategy of merging the two map operations is better, as it removes an additional memory transaction, and it is probably also the choice Futhark will make in this case.

Figure 2.15: Fusion with Futhark's redomap construct.

Auto-tuning

Another graph-rewriting approach is to apply a search algorithm to determine which rewrite rules to apply, instead of applying the rules systematically. Various attempts have been made in this direction [94, 105, 74]. These approaches often rely on a large set of rewrite rules operating on their set of array combinators. During search, the approach evaluates potential rewrite candidates by generating code and measuring performance; the best candidate is selected as the starting point for the next round. These approaches also demonstrate the need for hardware awareness during compilation, as widely different optimisations are necessary for the various hardware architectures.

2.3.4 Local versus global view of fusion

Fusion is necessary to obtain performance in array languages if composability is desired. However, the scientific community has yet to determine an approach that is satisfactory on all parameters.

Short-cut fusion is implemented by only a few general rules. The limited number of rules is both positive and negative: fewer rules make it easier for users to predict what will happen, but they also limit the number of fusion opportunities. Horizontal fusion, for instance, is not possible in most work on short-cut fusion, and it is sometimes necessary to write multiple versions of the same program to accommodate various usage scenarios.

Approaches based on graph rewriting, on the other hand, make life easy for the user, though they provide very limited control over where fusion happens. These approaches also allow the same program to be used in various contexts, but optimised differently. However, graph-based approaches often suffer from a lack of predictability: alternative versions of the same program might perform completely differently. Without reasoning tools that can explain why one version performs better than another, a slight change may make a different clustering seem better to the compiler, and an attempt to optimise an algorithm might instead incur performance penalties because some compiler optimisation suddenly no longer applies. Finally, it is also unclear how well some of these approaches scale to very large programs. There is thus a need for further research on fusion systems, to obtain a balance between predictability and ease of use.

2.4 Iteration and recursion

Recursion is the most fundamental iteration construct in functional languages, and as such we have allowed it in ALcore and ALop. Allowing recursive definitions has consequences when compiling for parallel architectures. Recursion is not the problem per se; the problem is that hardware architectures such as GPUs support only limited recursion, with slow or limited memory usable for a recursion stack. The ability to launch new threads from within parallel code is also limited on these devices. Tail-recursive functions, on the other hand, are manageable: no stack is required, and they can be converted to loop structures through tail-call elimination. The most common approach in data-parallel languages for GPUs, however, is to replace recursion entirely by operations that simulate tail-recursion in one way or another.

In embedded domain-specific languages such as Accelerate, Nikola, and Obsidian, it is common to see recursion performed on the meta-level. Outside parallel code this corresponds to recursion in the host language, and allows users to coordinate tasks. However, such recursion outside parallel code limits the compiler's ability to optimise across individual kernel invocations, as these exist only on the host level. Inside parallel code the user is restricted to constructing large unrolled loops through the meta-language. This approach can lead to code explosion inside parallel kernels and may introduce problems on architectures with limited instruction memory. Recent versions of Accelerate also provide two looping constructs that correspond to tail recursion on a single array: one for iteration outside parallel code (coordination of tasks) and one for iteration inside parallel code.

Similarly, Bohrium allows data-parallelism in interpreted environments, such as Python, by issuing instructions to a virtual machine without support for any control flow [71]. The program flow is directed by the host language (Python), and the virtual machine only executes high-level data-parallel operations in bulk. These operations are combined (fused) and just-in-time compiled by the virtual machine. Thus no iteration exists in Bohrium, only in its host language.

In standalone languages, other approaches to iteration and recursion have been considered. Single-assignment C (SaC) provides a generalisation of the generate construct in ALcore, called a with-loop, that allows multiple generator functions to be specified for individual subranges of a multidimensional array [48]. SaC allows nested loops. The QUBE language provides a similar construct [100].

NESL allows recursion through a recursion stack. However, the GPU implementation of NESL [11] is limited to recursion outside parallel code, as control flow is managed on the CPU. Another possibility, which we will return to in later chapters, is the ability to recurse over the hierarchy of the data-parallel machine. Sequoia [44] allows this style of recursion, which gives the programmer control over how problems are partitioned across the tree-structured hierarchies of modern high-performance computing systems.

Instead of adding support for programming in tail-recursive style, we have decided to include a loop construct simulating such tail-recursive calls:

while : (α → bool) → (α → α) → α → α

The operation "while c f x" repeatedly applies the step function f to the initial value x, until the stop-condition c returns false. The while operation permits us to write operations such as reduce in the following style:

fun reduce0 f arr =
  while (fn a => length a > 1)
        (fn a => let (x, y) = split ((length a) div 2) a
                 in zipWith f x y)
        arr

fun reduce f b arr =
  if length arr == 0
  then b
  else index (reduce0 f arr) 0
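For instance, assuming split k a divides a into its first k elements and the remainder, reducing a four-element array proceeds by repeated halving:

reduce (+) 0 [1, 2, 3, 4]
  → zipWith (+) [1, 2] [3, 4] = [4, 6]
  → zipWith (+) [4] [6]       = [10]
  → index [10] 0              = 10

Note that this halving formulation implicitly assumes the array length is a power of two; for other lengths the two halves eventually differ in size, and zipWith would be applied to arrays of mismatching extents.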

2.5 Nested data-parallelism

ALcore puts no restrictions on the type of operations that can occur inside parallel computations, and thus parallel computations can launch other parallel operations. Taking advantage of both the inner and the outer parallelism is, however, not easy on SIMD architectures such as GPUs. A simple solution would be to allow parallelism only on either the innermost or the outermost operation. Following such an approach will, however, not allow us to extract all the possible parallelism in a program. To exemplify, consider the following problem of sparse matrix-vector multiplication.

Example 2.13 (Sparse Matrix-Vector Multiplication) We can represent a sparse vector as an array of index-value pairs of type [int * double], and represent a sparse matrix as an array of such sparse vectors, [[int * double]]. The scalar product of a sparse vector and a dense vector can be computed in parallel by first performing a map and then a reduction:

sig dotprod : [int * double] -> [double] -> double
fun dotprod vsp v =
  let ps = map (fn (c, x) => x * index v c) vsp
  in reduce (+) 0.0 ps

We use this function to compute the sparse matrix-vector product by performing a dot product for each row of the sparse matrix:

sig smvm : [[int * double]] -> [double] -> [double]
fun smvm mat vec = map (fn row => dotprod row vec) mat

This program is an example of nested parallelism, as each invocation of dotprod is itself a parallel operation consisting of a map and a reduction. If we only took advantage of the innermost parallelism of the reduction operations, we would perform poorly on a matrix with few columns and many rows. On the contrary, if we only took advantage of the outermost parallelism, each of the dot products would be executed sequentially, and we would perform poorly on wide matrices with few rows.

A second problem found in the above sparse matrix-vector multiplication example is that of irregular arrays. A regular array is a multidimensional array of rectangular or cubic shape; an irregular array is an array of arrays where the inner arrays are of variable size.

Blelloch presented an alternative to the strategy of parallelising only the outermost or only the innermost operations. His flattening transformation eliminates nested parallelism from programs. This is done through three insights: 1) a representation of irregular nested arrays that allows direct computations on the inner arrays (segmented arrays), 2) implementations of high-performance algorithms operating on segmented arrays, and 3) a conversion scheme from nested data-parallel programs to flat programs with non-nested operations acting on segmented arrays.

2.5.1 Representing irregular nested arrays

The representation used by Blelloch is based on separating the data values from a description of where the subarray segments begin and end. A nested array structure is represented by two parts: a flat array of all data elements and one or more segment descriptors denoting the index and length information necessary to reconstruct the nested array structure from the flat data array.

Example 2.14 Continuing our sparse-matrix vector example, consider the following sparse matrix representation:

A = [[(2, 0.17)], [(0, 2.0), (1, 0.4), (2, 1.0)], [(0, 0.8), (2, 9.0)]]

The flat array of data elements is unzipped and placed in two separate data arrays:

Aindices = [2, 0, 1, 2, 0, 2]
Avalues  = [0.17, 2.0, 0.4, 1.0, 0.8, 9.0]

The boundaries of the original subarrays are stored in a segment descriptor, recording the length of each segment:

Asegdesc = [1, 3, 2]

Various alternative representations of segment descriptors have been suggested.

2.5.2 Operations on segmented arrays

The second step of flattening nested parallelism is to lift array operations such as reductions and scans to operate on these segmented arrays. A segmented operation operates not on the outer array, but on each of the inner arrays. We will not go through the implementation of such operations, but just mention that they exist and illustrate segmented reduction through an example:

Example 2.15 Consider the nested array: [[0.17], [2.0, 0.4, 1.0], [0.8, 9.0]]. This array can be converted into the following segmented array:

values = [0.17, 2.0, 0.4, 1.0, 0.8, 9.0] segdesc = [1, 3, 2]

Applying a segmented sum operation on the segmented array computes the sum of each subarray.

segReduce (+) 0.0 (values, segdesc) = [0.17, 3.4, 9.8]
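A sequential list-based sketch of such a segmented reduction (our own illustration; real implementations instead use parallel segmented scans) makes the semantics precise:

-- Reduce each segment of `values`, with segment lengths given by `segdesc`.
segReduce :: (a -> a -> a) -> a -> ([a], [Int]) -> [a]
segReduce f z (values, segdesc) = go values segdesc
  where
    go _  []       = []
    go vs (n : ns) = let (seg, rest) = splitAt n vs
                     in foldl f z seg : go rest ns

Applied to the example above, segReduce (+) 0.0 ([0.17, 2.0, 0.4, 1.0, 0.8, 9.0], [1, 3, 2]) indeed yields [0.17, 3.4, 9.8].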

2.5.3 Flattening

Flattening can be performed either as a compiler transformation, or by introducing the necessary segmented operations and letting users manually flatten their programs. To illustrate, the following example shows a flattened version of the sparse matrix-vector multiplication program.

Example 2.16 (Flattened Sparse Matrix-Vector multiplication)

sig smvm_flat : ([int], ([int], [double])) -> [double]
fun smvm_flat m vec =
  let (segdesc, (ixs, vals)) = m in
  let ps = zipWith (fn (i, x) => x * (index vec i)) ixs vals in
  segReduce (+) 0.0 ps segdesc

Flattening makes it possible to take advantage of the additional parallelism in nested data-parallel programs, at the cost of the memory overhead associated with maintaining segment descriptors. This memory overhead is often stated as a main reason that languages based on flattening have not seen more widespread use. Since the development of the flattening transformation as part of the work on NESL, various attempts have been made to optimise the memory usage of flattened programs through various means, such as reusing memory, alternative descriptor representations, limiting the amount of flattening necessary [69, 10], and stream processing [75].

We will not describe the full flattening transformation, and instead refer the reader to previous presentations [18]. Flattening is part of the story on data-parallel functional languages, as it is important for high-performance code generation for irregular nested data-parallel programs, but it is outside the scope of this dissertation.

Chapter 3

TAIL: A Typed Array Intermediate Language for Compiling APL

Extended version of “Compiling a Subset of APL Into a Typed Intermediate Language” presented at ARRAY’14

In the previous chapter we saw that the choice of built-in operators determines how users and compilers can reason about programs, and which optimisations can be performed. A highly successful array language, in terms of industry adoption and use for mathematical reasoning about programs, is APL. The language has been widely successful in the financial industry, where large code bases are still operational and actively developed. Many APL dialects have been implemented over the last 50 years [73, 107, 24, 13], and some are still used extensively within certain domains. In this chapter we use APL to inform the choice of array operations that have proven useful in practice.

APL is a dynamically typed array language with support for both multi-dimensional arrays and nested arrays, with a functional core consisting of a large number of first-order and second-order array operations, such as array transposition, generalised multi-dimensional inner and outer products, prefix sums, and reductions.

Until recently, the programming language semantics community has paid only little attention to the APL language. Kenneth E. Iverson never developed a semantics for APL in terms of a formal model, and apart from recent work by Slepak et al. [92], there have been few attempts at developing formal models or type systems for APL. On the other hand, recent developments in data-parallel language implementations [36, 68, 28, 58] have resulted in promising and scalable techniques for high-level programming of highly parallel hardware, such as GPGPUs.

This chapter presents TAIL [43], a typed intermediate language suitable for implementing a large subset of APL. The chapter presents evidence that

TAIL may serve as a practical and well-defined format for generation of high-performing sequential code, and indicates the potential for translating TAIL to data-parallel GPU code. The intermediate language is conceptually close to the language Repa [68]. It treats all numeric data as multi-dimensional arrays, and the type system makes the ranks of arrays explicit. Primitive operators are polymorphic in rank and in the type of the underlying data operated on. The language is sufficiently expressive that some primitive operators, such as APL's inner product operator, which works on arrays of arbitrary rank, can be compiled using more primitive operations. Following other APL compilation approaches, the compiler is based on lexical scoping and has no support for dynamic compilation (APL execute) [13, 14].

A prime goal of the project is to study the requirements for implementing a high-level array language, such as APL, that has proven useful in practice. We have therefore followed a pragmatic rather than a purist approach. We are thus not attempting to encode all invariants regarding array shapes and sizes in our type system, as is done in systems based on dependent types [100, 98]. The purpose is rather to find a balance that allows for complete type inference without user involvement.

APL and its derivatives, such as J [24] and K [107], are still used extensively in certain domains, such as the financial industry, where large code bases are still operational and actively developed. TAIL covers only a subset of APL, however, and a translation of this subset thus provides only partial evidence for general applicability in computational finance. We have implemented medium-sized programs in this subset of APL, and believe many problems can be solved within it. The most notable missing feature is APL's nested irregular arrays. The compilation framework relies on heavy inlining, and it is thus unclear how well TAIL will perform on examples larger than the presented benchmarks.

Example 3.1 As a simple example, consider the following signal processing program, derived from the APEX benchmark suite [13]:

diff ← {1↓⍵-¯1⌽⍵}
signal ← {¯50⌈50⌊50×(diff 0,⍵)÷0.01+⍵}
+/ signal 9 8 6 8 7 4 4 3 2 2 1 2 4 5 6

This program declares two functions diff and signal, both taking a single parameter, ⍵. In the diff function, the expression ¯1⌽⍵ specifies that the parameter vector is rotated one entry to the right. This vector is then subtracted from the argument vector (point-wise), and the result of the subtraction is returned with the first element dropped. The last line of the program calls the signal function on an input vector and sums (i.e., sum-reduces) the result of the call.

In the compiled version of the program, which is presented in Figure 3.1, bulk operations are replaced with calls to the each function, which is equivalent to map in other functional languages. The name "each" is the standard pronunciation of the map-like operator ¨ in APL. Moreover, the compiler has inserted explicit integer-to-double coercions and identified the neutral element 0.0 for a reduction with addition. Array types in the target language are annotated with explicit ranks; the type [double]0, for instance, is the type of double-precision scalar values, which are treated in the type system as arrays of rank zero. Also notice the special one-dimensional vector types (e.g., ⟨int⟩16), which range over vectors of a specific length.

let v0:⟨int⟩15 = [9,8,6,8,7,4,4,3,2,2,1,2,4,5,6] in
let v3:⟨int⟩16 = consV(0,v0) in
reduce(addd, 0.00,
  each(fn v11:[double]0 => maxd(i2d(~50),v11),
    each(fn v10:[double]0 => mind(i2d(50),v10),
      each(fn v9:[double]0 => muld(i2d(50),v9),
        sum(divd,
            each(i2d, drop(1, zipWith(subi, v3, rotateV(~1,v3)))),
            each(fn v2:[double]0 => addd(0.01,v2), each(i2d,v0)))))))

Figure 3.1: The result of compiling the example APL program into an explicitly typed intermediate representation. Notice the presence of shape types with explicit length attributes.

3.0.1 Chapter outline

The rest of the chapter is structured as follows. In Section 3.1, we present a statically typed intermediate array language with support for multi-dimensional arrays and operations on such arrays. The type system supports several kinds of array types, with gradual degrees of refinement. The most general array type keeps track of array ranks, using so-called shape types. One-dimensional arrays (i.e., vectors) may be given a more refined type, which keeps track of vector lengths (also using shape types). The type system also supports special refined variants of singleton scalar values and singleton vector values.

We present a formal treatment of the language in two steps. First, in Sections 3.1.1–3.1.4, we present a type system and a dynamic semantics for the language without array updates. Then, in Section 3.1.5, we extend the treatment to a store-based semantics with support for mutable arrays, array indexing, and array updates.

We demonstrate that the typed array intermediate language TAIL is suitable as the target for an inference algorithm for an APL compiler. As we shall see in Section 3.2, the type system of the intermediate language allows the compiler to treat complex operations, such as matrix multiplication, and generalized versions thereof (inner products of higher-ranked arrays), as operations derived from the composition of other more basic operations.

With this work we aim at bridging the APL community and the functional programming community. The concise syntax of APL primitives and its array abstraction concepts provide a rich source of data-parallel programming techniques, in particular for the APL subset that we consider, which encourages a functional style of programming [89]. As an example of how the APL community can benefit from the programming language community (besides getting APL programs to run efficiently), we demonstrate in Section 3.3 how APL programmers may use explicit type annotations in their APL programs to express, and statically certify, properties about defined operators and functions.

We demonstrate that the typed array intermediate language is useful as a language for further array compilation, by showing that it can be compiled effectively into a low-level array intermediate language, called Laila, which lends itself to a straightforward translation into sequential C code. Finally, the effectiveness of the compilation approach is demonstrated, in Section 3.5, by performance evaluation on a number of real-world application benchmarks (as well as a number of micro benchmarks), comparing our compilation approach with a state-of-the-art APL interpreter. We examine the possibility of future automatic parallelization, summarize the current bottlenecks, and present work on compiling TAIL to the data-parallel languages Accelerate and Futhark.

3.1 A Typed Array Intermediate Language

In this section, we present the typed array intermediate language TAIL. For presentation purposes, we first present a subset of the language that does not have support for mutable arrays in terms of array updates. Then, in Section 3.1.5, we develop the formal setup for covering also mutable arrays.

We assume a denumerably infinite set of program variables (x). We use i and n to range over integers, d to range over doubles, and b to range over the boolean values tt and ff. Whenever z is some object, we write ~z to range over sequences of such objects. When we want to be explicit about the size of a sequence ~z = z_0, ..., z_(n−1), we often write it on the form ~z^(n), and we write z, ~z to denote the sequence z, z_0, ..., z_(n−1).

Shapes (δ), base values (bv), arrays (arr), primitive operations (op), values (v), and expressions (e) are defined as follows:

δ    ::= ⟨~n⟩                                  (shapes)
bv   ::= i | d | b                             (base values)
arr  ::= [~bv]^δ                               (arrays)
op   ::= addi | subi | muli | mini | maxi      (operations)
       | addd | subd | muld | mind | maxd
       | andb | orb | notb
       | lti | ltei | gti | gtei | eqi
       | ltd | lted | gtd | gted | eqd
       | iota | each | reduce | i2d | b2i
       | reshape0 | reshape | rotate
       | transp | transp2 | zipWith
       | shape | take | drop | first
       | cat | cons | snoc                     (derived ops)
       | shapeV | catV | consV | snocV         (shape ops)
       | iotaV | rotateV
       | takeV | dropV | firstV
v    ::= arr | λx.e                            (values)
e    ::= v | x | [~e] | e e′
       | let x = e_1 in e_2 | op(~e)           (expressions)

Notice that array expressions [~e] are always one-dimensional, whereas array values [~bv]^δ may be multi-dimensional, with their dimensionality specified by the shape δ. We often write i, d, and b to denote the scalar values [i]^⟨⟩, [d]^⟨⟩, and [b]^⟨⟩, respectively.

3.1.1 A Type System with Shape Polymorphism

We assume denumerably infinite sets of type variables (α) and shape variables (γ).

κ ::= int | double | bool | α                     (base types)
ρ ::= i | γ | ρ + ρ′                              (shape types)
τ ::= [κ]^ρ | ⟨κ⟩^ρ | S_κ(ρ) | SV_κ(ρ) | τ → τ′   (types)
σ ::= ∀~α~γ.τ                                     (type schemes)

Types are segmented into base types (κ), shape types (ρ), types (τ), and type schemes (σ). Shape types (ρ) are considered identical up to associativity and commutativity of + and up to evaluation of constant shape-type expressions involving +. Types (τ) are either multidimensional arrays, [κ]^ρ, with explicit rank ρ; one-dimensional vectors, ⟨κ⟩^ρ, with explicit length ρ; singleton integers and booleans, S_κ(ρ), or single-element integer and boolean vectors, SV_κ(ρ), both with value ρ; or function types. As special notation, we often write κ to denote the scalar array type [κ]^0.

A type substitution (S_t) maps type variables to base types. A shape substitution (S_s) maps shape variables to shape types. A substitution (S) is a pair (S_t, S_s) of a type substitution and a shape substitution. Applying a substitution S to some object B, written S(B), has the effect of simultaneously applying S_t and S_s to objects in B (being the identity outside their domains). A type τ′ is an instance of a type scheme σ = ∀~α~γ.τ, written σ ≥ τ′, if there exists a substitution S such that S(τ) = τ′.

A type τ is a subtype of another type τ′, written τ ⊆ τ′, if the relation can be derived according to the following rules:

Subtyping   τ ⊆ τ′

  τ ⊆ τ   (3.1)

   τ_1 ⊆ τ_2    τ_2 ⊆ τ_3
  ------------------------ (3.2)
         τ_1 ⊆ τ_3

  ⟨κ⟩^ρ ⊆ [κ]^1   (3.3)

  SV_κ(ρ) ⊆ ⟨κ⟩^1   (3.4)        S_κ(ρ) ⊆ [κ]^0   (3.5)

The subtyping relation allows known-sized vectors (one-dimensional arrays) to be treated as arrays for which only the number of dimensions is statically known. For instance, we shall see that the constant integer vector expression [1, 2, 3] is given the type ⟨int⟩^3, but that the expression can also be given the type [int]^1 using the subtyping relation. Similarly, when asking for the shape of an integer vector of type ⟨int⟩^γ, we obtain a one-element integer vector containing the value γ. For typing this value, we can use the singleton vector type SV_int(γ), which is a subtype of ⟨int⟩^1, the type of one-element integer vectors.

Each operator, op, is given a unique type scheme, σ, as specified by the relation TySc(op) = σ defined in Figure 3.2 and Figure 3.3. For all operators op such that TySc(op) = ∀~α~γ.τ_1 → ... → τ_n → τ, where τ is not a function type, we say that the arity of the operator op, written arity(op), is n.

Type assumptions Γ map variables to type schemes:

Γ ::= • | Γ, x : σ

The type system allows inferences among sentences of the form Γ ⊢ e : τ, which are read: "under the assumptions Γ, the expression e has type τ."

Shape typing   ⊢ δ : ρ

  ⊢ ⟨~n^(i)⟩ : i   (3.6)

Base value typing   ⊢ bv : κ

  ⊢ i : int   (3.7)        ⊢ d : double   (3.8)
  ⊢ tt : bool   (3.9)      ⊢ ff : bool   (3.10)

Array typing   ⊢ arr : τ

   ⊢ δ : ρ    ⊢ bv_i : κ                 ⊢ bv_i : κ
  ------------------------ (3.11)     ---------------------- (3.12)
    ⊢ [~bv]^δ : [κ]^ρ                  ⊢ [~bv]^⟨n⟩ : ⟨κ⟩^n

  ⊢ tt : S_bool(1)   (3.13)        ⊢ ff : S_bool(0)   (3.14)

APL op(s)   TySc(op)
     addi, ...  : int → int → int
     addd, ...  : double → double → double
⍳    iota       : int → [int]^1
¨    each       : ∀αβγ.(α → β) → [α]^γ → [β]^γ
/    reduce     : ∀αγ.(α → α → α) → α → [α]^(1+γ) → [α]^γ
/    compress   : ∀αγ.[bool]^γ → [α]^γ → [α]^γ
/    replicate  : ∀αγ.[int]^γ → [α]^γ → [α]^γ
\    scan       : ∀αγ.(α → α → α) → [α]^γ → [α]^γ
⍴    shape      : ∀αγ.[α]^γ → ⟨int⟩^γ
⍴    reshape0   : ∀αγγ′.⟨int⟩^γ′ → [α]^γ → [α]^γ′
⍴    reshape    : ∀αγγ′.⟨int⟩^γ′ → α → [α]^γ → [α]^γ′
⌽    reverse    : ∀αγ.[α]^γ → [α]^γ
⌽    rotate     : ∀αγ.int → [α]^γ → [α]^γ
⍉    transp     : ∀αγ.[α]^γ → [α]^γ
⍉    transp2    : ∀αγ.⟨int⟩^γ → [α]^γ → [α]^γ
↑    take       : ∀αγ.int → α → [α]^γ → [α]^γ
↓    drop       : ∀αγ.int → [α]^γ → [α]^γ
     first      : ∀αγ.α → [α]^γ → α
     zipWith    : ∀α_1α_2βγ.(α_1 → α_2 → β) → [α_1]^γ → [α_2]^γ → [β]^γ
,    cat        : ∀αγ.[α]^(γ+1) → [α]^(γ+1) → [α]^(γ+1)
,    cons       : ∀αγ.[α]^γ → [α]^(γ+1) → [α]^(γ+1)
,    snoc       : ∀αγ.[α]^(γ+1) → [α]^γ → [α]^(γ+1)

Figure 3.2: Operator type schemes for standard operations.

APL op(s)   TySc(op)
⍴    shapeV    : ∀αγ.⟨α⟩^γ → SV_int(γ)
↑    takeV     : ∀αγ.S_int(γ) → [α]^1 → ⟨α⟩^γ
↓    dropV     : ∀αγγ′.S_int(γ) → ⟨α⟩^(γ+γ′) → ⟨α⟩^γ′
,    consV     : ∀αγ.α → ⟨α⟩^γ → ⟨α⟩^(1+γ)
,    snocV     : ∀αγ.⟨α⟩^γ → α → ⟨α⟩^(1+γ)
     firstV    : ∀αγ.SV_α(γ) → S_α(γ)
⍳    iotaV     : ∀γ.S_int(γ) → ⟨int⟩^γ
⌽    rotateV   : ∀αγ.int → ⟨α⟩^γ → ⟨α⟩^γ
,    catV      : ∀αγγ′.⟨α⟩^γ → ⟨α⟩^γ′ → ⟨α⟩^(γ+γ′)

Figure 3.3: Operator type schemes for operations on shapes.

  ⊢ n : S_int(n)   (3.15)

    ⊢ v : S_κ(n)
  --------------------- (3.16)
   ⊢ [v]^⟨1⟩ : SV_κ(n)

Value typing   Γ ⊢ v : τ

    ⊢ arr : τ                       Γ, x : τ ⊢ e : τ′
  -------------- (3.17)          ---------------------- (3.18)
   Γ ⊢ arr : τ                    Γ ⊢ λx.e : τ → τ′

Expression typing   Γ ⊢ e : τ

   Γ ⊢ e : S_κ(n)                Γ(x) ≥ τ             τ′ ⊆ τ    Γ ⊢ e : τ′
  ------------------- (3.19)    ----------- (3.20)   ---------------------- (3.21)
   Γ ⊢ [e] : SV_κ(n)             Γ ⊢ x : τ                 Γ ⊢ e : τ

   Γ ⊢ e_i : κ    i = [0; n[             Γ ⊢ e_i : κ    i = [0; n[
  --------------------------- (3.22)    --------------------------- (3.23)
    Γ ⊢ [~e^(n)] : [κ]^1                  Γ ⊢ [~e^(n)] : ⟨κ⟩^n

   Γ ⊢ e_1 : τ′ → τ    Γ ⊢ e_2 : τ′
  ---------------------------------- (3.24)
          Γ ⊢ e_1 e_2 : τ

   fv(~α~γ) ∩ fv(Γ, τ′) = ∅    Γ ⊢ e_1 : τ    Γ, x : ∀~α~γ.τ ⊢ e_2 : τ′
  ----------------------------------------------------------------------- (3.25)
                   Γ ⊢ let x = e_1 in e_2 : τ′

   TySc(op) ≥ τ_0 → ... → τ_(n−1) → τ
   Arity(op) = n    Γ ⊢ e_i : τ_i    i = [0; n[
  ---------------------------------------------- (3.26)
            Γ ⊢ op(~e^(n)) : τ

Notice that operators are required to be fully applied. In examples, however, when an operator op is not applied to any arguments (e.g., when it is passed to a higher-order function), it is eta-expanded into the form λx_1. ... λx_n. op(x_1, ..., x_n), where n = Arity(op).

As indicated by the operator type schemes, certain limitations apply. For instance, in accordance with APL, the each operator operates on each base value of a multi-dimensional array. One may consider providing a map operator with the following type scheme:

map : ∀αβγ.([α]^γ → [β]^γ) → [α]^(1+γ) → [β]^(1+γ)

However, it is not possible, with the present type system, to express that the function returns arrays with the same extent for all arguments; the only guarantee the type system can give us is that the result arrays have the same rank (number of dimensions). More expressive type systems, based on dependent types, such as those found in Agda-Accelerate [98] and QUBE [101, 99], allow for expressing the assumptions of the higher-order operators more accurately. Similarly, one may consider providing a reduce' operator with the following type scheme:

reduce’ : αγ.([α]γ [α]γ [α]γ ) [α]γ ∀ [α]1+→γ [α]→γ → → → The idea here is that the operator operates on entire subparts of the argument array. In a system supporting only map and reduce’ (and not each and reduce), nested instances of maps can be used instead of each and nested CHAPTER 3. TAIL: A TYPED ARRAY INTERMEDIATE LANGUAGE FOR COMPILING APL 46 instances of maps with an inner reduce’ can be used instead of reduce. Additionally, one may consider providing a sequential fold operator that does not require associativity of the argument function:

fold : ∀αβγγ′.([α]^γ → [β]^γ′ → [β]^γ′) → [β]^γ′ → [α]^(1+γ) → [β]^γ′

As we shall see in the next section, the semantics of reduce is that it reduces the argument array along its last dimension, following the traditional APL semantics [63, 73].

The implementation of the APL compiler uses a hybrid approach of type inference and local context querying for resolving array ranks, scalar extensions, and identity items (neutral elements) during intermediate language program generation. The inference is based on a simple unification algorithm using conditional unification for the implementation of the subtyping inference.

3.1.2 Example Programs

We now present a few example programs that utilize the various operators. The dot product of two integer arrays can be defined in the language as follows:

dotpi : (∀γ.[int]^(1+γ) → [int]^(1+γ) → [int]^γ)
      = λx.λy.reduce(addi, 0, zipWith(muli, x, y))

Notice that this function also works with integer matrices and integer arrays of higher dimensions. In case the extents of the argument arrays do not match up, the zipWith expression—and therefore the dotpi call—will result in a runtime error, as further specified in the presentation of the dynamic semantics below. We can generalize the above function to be useful in a broader sense:

dotp : (∀γα.(α → α → α) → (α → α → α) → α → [α]^(1+γ) → [α]^(1+γ) → [α]^γ)
     = λadd.λmul.λn.λx.λy.reduce(add, n, zipWith(mul, x, y))
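For instance, instantiating the generalised version as dotp addd muld 0.0 yields a double-precision dot product of type [double]^(1+γ) → [double]^(1+γ) → [double]^γ (our own example instantiation, for illustration).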

3.1.3 Dynamic Semantics

Evaluation contexts, ranged over by E, take the following form:

E ::= [·] | [~v E ~e] | E e | v E | let x = E in e | op(~v E ~e)

When E is an evaluation context and e is an expression, we write E[e] to denote the expression resulting from filling the hole in E with e. The dynamic semantics is presented as a small-step reduction semantics, which is explicit about certain kinds of errors that are not easily checked statically. Intuitively, a well-typed expression e is either a value or it can be reduced into another expression or the special token err. Errors that are treated explicitly include negative values passed to iota and illegal axis specifications given as arguments to the generalized transposition operation, transp2 (see below).

We first define a few helper functions for computations on shapes and for converting between flat indexing and multi-dimensional indexing. We assume a reverse operation (rev) on shapes and an operation, named product, that takes the shape of an array and returns the number of elements in the flattened version of the array. An expression fromIdx_δ δ′ takes a shape δ of an array and a multi-dimensional index δ′ into the array and returns the corresponding index into the flattened version of the array.

fromIdx_⟨⟩ ⟨⟩ = 0
fromIdx_⟨n,~n⟩ ⟨i,~i⟩ = i ∗ p + fromIdx_⟨~n⟩ ⟨~i⟩
  where p = product(~n)
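For example, for shape ⟨2, 3⟩ (a two-by-three array in row-major order), the multi-dimensional index ⟨1, 2⟩ flattens as fromIdx_⟨2,3⟩ ⟨1, 2⟩ = 1 ∗ 3 + fromIdx_⟨3⟩ ⟨2⟩ = 3 + 2 = 5 (a worked instance of the definition above).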

An expression toIdx_δ i takes a shape δ and an index i into the flattened version of the array and returns the corresponding multi-dimensional index into the array.

toIdx_⟨⟩ 0 = ⟨⟩
toIdx_⟨n,~n⟩ i = ⟨i div p, ~i⟩
  where p = product(~n)
        ⟨~i⟩ = toIdx_⟨~n⟩ (i mod p)

The expression exchange_δ δ′ exchanges the elements in the shape δ′ according to δ:

exchange_⟨~p^(n)⟩ ⟨~q^(n)⟩ = ⟨q_(p_0), ..., q_(p_(n−1))⟩
  where ∀ij. i ≠ j ⇒ p_i ≠ p_j

Notice the partiality of the exchange function; if δ′ = exchange_⟨~i^(k)⟩ δ then ~i^(k) is known to be a permutation of 0, ..., (k − 1).

A majority of the dynamic semantics rules are given below. We have left out the rules for rotate, cat, cons, snoc, drop, compress, replicate, and scan. We have also left out the rules for the shape-versions of the operations (e.g., takeV), which are all easily defined in terms of the non-shape versions.

Small Step Reductions   e ↪ e′ or err

   e ↪ e′    E ≠ [·]             e ↪ err    E ≠ [·]
  -------------------- (3.27)   -------------------- (3.28)
    E[e] ↪ E[e′]                    E[e] ↪ err

  let x = v in e ↪ e[v/x]   (3.29)        (λx.e) v ↪ e[v/x]   (3.30)

  [~a^(n)] ↪ [~a^(n)]^⟨n⟩   (3.31)

   i = i_1 + i_2                     d = d_1 + d_2
  ---------------------- (3.32)     ---------------------- (3.33)
   addi(i_1, i_2) ↪ i                addd(d_1, d_2) ↪ d

     n ≥ 0                               n < 0
  ---------------------------- (3.34)   ---------------- (3.35)
   iota(n) ↪ [1, ..., n]^⟨n⟩             iota(n) ↪ err

   e = [v_f a_0, ..., v_f a_(n−1)]
  ------------------------------------------ (3.36)
   each(v_f, [~a^(n)]^δ) ↪ reshape0(δ, e)

   δ = ⟨~n, m⟩    k = product(~n)    i = [0; k[
   e_i = v_f a_(i∗m) ( ... (v_f a_(i∗m+m−1) v) ... )
  ---------------------------------------------------------- (3.37)
   reduce(v_f, v, [~a^(n)]^δ) ↪ reshape0(⟨~n⟩, [~e^(k)])

   m = product(δ′)    f(i) = i mod n    n > 0
  ------------------------------------------------------------- (3.38)
   reshape(δ′, a, [~a^(n)]^δ) ↪ [a_(f(0)), ..., a_(f(m−1))]^δ′

   m = product(δ′)    a_i = a    i = [0; m[
  ------------------------------------------------- (3.39)
   reshape(δ′, a, []^δ) ↪ [a_0, ..., a_(m−1)]^δ′

   δ′ = rev(δ)    f = fromIdx_δ′ ∘ rev ∘ toIdx_δ
  ------------------------------------------------------ (3.40)
   transp([~a^(n)]^δ) ↪ [a_(f(0)), ..., a_(f(n−1))]^δ′

   δ′ = exchange_δ0(δ)    f = fromIdx_δ′ ∘ exchange_δ0 ∘ toIdx_δ
  ----------------------------------------------------------------- (3.41)
   transp2(δ_0, [~a^(n)]^δ) ↪ [a_(f(0)), ..., a_(f(n−1))]^δ′

   ¬∃δ′. δ′ = exchange_δ0(δ)
  ------------------------------- (3.42)
   transp2(δ_0, [~a]^δ) ↪ err

   m ≥ 0    δ′ = ⟨m, ~n⟩    j = product(δ′)
   f(i) = if i < k then a_i else a
  ------------------------------------------------------------ (3.43)
   take(m, a, [~a^(k)]^⟨n,~n⟩) ↪ [f(0), ..., f(j − 1)]^δ′

   m < 0    δ′ = ⟨−m, ~n⟩    j = product(δ′)
   f(i) = if i < k then a_(k−1−i) else a
  ------------------------------------------------------------ (3.44)
   take(m, a, [~a^(k)]^δ) ↪ [f(0), ..., f(j − 1)]^δ′

     k > 0
  ------------------------------ (3.45)      first(a, []^δ) ↪ a   (3.46)
   first(a, [~a^(k)]^δ) ↪ a_0

The transitive, reflexive closure of ↪, written ↪*, is defined by the following two rules:

   e ↪ e′    e′ ↪* e″
  -------------------- (3.47)        e ↪* e   (3.48)
       e ↪* e″

We further define e↑ to mean that there exists an infinite sequence e ↪ e_1 ↪ e_2 ↪ ⋯. The presented language does not support general recursion or uncontrolled looping; thus all programs represented in the intermediate language are guaranteed to terminate. The semantic machinery does, however, support the addition of recursion (e.g., for implementing APL's recursion operator ∇).

3.1.4 Properties of the Language

In the following, we give a few definitions before we present a unique decomposition proposition. This proposition is used for the proofs of type preservation and progress, which allow us to establish a type soundness result for the language, following standard techniques [80]. A redex, ranged over by r, is an expression of the form

r ::= (λx.e) v | let x = v in e | op(~v) | [~v]

The following unique decomposition proposition states that any well-typed term is either a value or the term can be decomposed into a unique context and a unique well-typed redex. Moreover, filling the context with an expression of the same type as the redex results in a well-typed term:

Proposition 1 (Unique Decomposition) If ⊢ e : τ then either e is a value v or there exist a unique E, a unique redex e′, and some τ′ such that e = E[e′] and ⊢ e′ : τ′. Furthermore, for all e″ such that ⊢ e″ : τ′, it follows that ⊢ E[e″] : τ.

PROOF By induction over the derivation ⊢ e : τ. □

The proofs of the following type preservation and progress propositions are then straightforward and standard [80].

Proposition 2 (Type Preservation) If ⊢ e : τ and e ↪ e′ and e′ ≠ err then ⊢ e′ : τ.

PROOF By induction over the structure of the typing derivation ⊢ e : τ, using Proposition 1. □

Proposition 3 (Progress) If Γ ⊢ e : τ then either

1. e is a value; or
2. e ↪ err; or
3. there exists an expression e′ such that e ↪ e′.

PROOF By induction over the structure of the typing derivation. □

Proposition 4 (Type Soundness) If ⊢ e : τ then e↑, or e ↪* err, or there exists v such that e ↪* v.

PROOF By induction on the number of machine steps, using Proposition 2 and Proposition 3. □

APL op(s)   TySc(op)
     index   : int → int → [α]^(γ+1) → [α]^γ
     update  : [α]^(γ+1) → ⟨int⟩^(γ+1) → α → bool
⍣    power   : ([α]^γ → [α]^γ) → int → [α]^γ → [α]^γ

Figure 3.4: Operator type schemes for store-based operations.

3.1.5 Mutable Arrays

For the formalization to support mutable arrays, we extend the evaluation semantics with a store and the typing rules with a notion of store typing. We assume a denumerably infinite set of locations (l). Primitive operations (op) and values (v) are redefined as follows:

op ::= ... | index | update | power      (operations)
v  ::= l | λx.e                          (values)

We now define a store (s) to be a finite map from locations to array values. When arr is an array value and s is a store with l in its domain, we write s[l ↦ arr] to mean the store s with the location l updated to contain the array value arr. Further, a store s may be juxtaposed with another store s′, written s, s′, assuming that the domains of s and s′ do not overlap.

Types and type schemes remain unchanged. Type schemes for the store-based operations are presented in Figure 3.4.

The index operation takes two integers, d and i, and an array a as arguments and returns a new array (of rank one less than the rank of a) taken from a with dimension d dissected at index i. It is an error if d or i does not match the shape of a. The motivation for this particular form of index function is the often-used APL index construct, which, for instance, supports extracting a plane from a cube, as follows:

cube ← 10 20 30⍴⍳6000   ⍝ Dim x=10,y=20,z=30
plane ← cube[;4;]       ⍝ Dissect cube at y=4

where the index into cube would in TAIL notation be represented as index(2,4,cube) (i.e., d = 2 and i = 4). The index function can easily be implemented using dyadic ⍉ (i.e., transp2), take, drop, and assert; thus, we will not give the dynamic semantics for the operation here.

The update operation we consider supports only individual array element updates. The operation takes an array and an index vector as arguments, together with a scalar value of the proper type. The operation has the side-effecting behavior of updating the array at the given index position with the scalar value. Again, it is an error if the index vector does not match the shape of the array. If the function returns a proper value, it returns tt (to keep the presentation simple, we do not support unit types).

The power operation results from compiling APL's power operator ⍣. The operation takes as arguments an iterator function f, a number n indicating the number of times the function should be applied, and, finally, an initial array argument a. The power operation (f⍣n) a resembles the mathematical notation f^n(a): compose the function f with itself n times and apply the resulting function to the argument a.

The careful reader may now be worried that, unless care is taken, combining subtyping with mutability can lead to problems. Indeed, unless the subtyping relation is modified slightly, we will run into problems showing type preservation and progress. The problem is subtyping rule (3.4) and the support for singleton vector types, which allows the TAIL type system to reason statically about the content of a vector returned by a call to the shapeV operation. Now, if this vector is mutated in memory using the update operation, a central invariant is violated. Fortunately, it turns out that if the subtyping relation is strengthened to not include subtyping rule (3.4), we can establish a type soundness result for the language. In what follows, we use ⊆_s to denote the strengthened subtyping relation, which consists of all the previous subtyping rules from Section 3.1.1, except (3.4). Leaving this rule out requires the APL frontend to infer that a copying coercion is inserted whenever a value of type SV_κ(ρ) needs to be treated as a value of type ⟨κ⟩^1. This restriction is only a small price to pay for type soundness!
A store type (Σ) is now defined as a finite map from locations to types. We say that a store type Σ′ extends another store type Σ, written Σ′ ⊒ Σ, if Dom(Σ′) ⊇ Dom(Σ) and, for all l ∈ Dom(Σ), we have that Σ(l) ⊆ₛ Σ′(l). Notice again the use of the strengthened subtyping relation.

The extended type system allows inferences among sentences of the form Γ; Σ ⊢ e : τ, which are read: "under the variable assumptions Γ and the store typing Σ, the expression e has type τ." All expression typing rules (rules (3.19)–(3.26)) remain unchanged, except that a Σ needs to be added to all expression typing judgments and that rule (3.21) is modified to use the strengthened subtyping relation.

Expression typing   Γ; Σ ⊢ e : τ

    Γ; Σ ⊢ e : τ     τ ⊆ₛ τ′
    -------------------------- (3.49)
    Γ; Σ ⊢ e : τ′

The typing rules for values and stores are now given as follows.

Value typing   Σ, Γ ⊢ v : τ

    Σ(l) = τ
    ----------------- (3.50)
    Σ, Γ ⊢ l : τ

Store typing   Σ ⊢ s

    Σ ⊢ s     Σ ⊢ arr : τ
    --------------------------------- (3.51)
    Σ, {l ↦ τ} ⊢ s, {l ↦ arr}

We now define the notion of a configuration (k) to be a pair, written e/s, of an expression e and a store s. Intuitively, the small-step semantics turn a well-typed configuration k into another well-typed configuration k′ or into the error token err. The relation is formalized as a set of reduction rules of the form k ↪ k′ or k ↪ err. The reduction rules from Section 3.1.3 may all easily be adapted to the refined setting. We only show a few of the modified rules below, together with rules for the new operations.

Small-step reductions   k ↪ k′ or err

(3.52)  E[e]/s ↪ E[e′]/s′,  if e/s ↪ e′/s′ and E ≠ [·]
(3.53)  E[e]/s ↪ err,  if e/s ↪ err and E ≠ [·]
(3.54)  let x = v in e/s ↪ e[v/x]/s
(3.55)  (λx.e) v/s ↪ e[v/x]/s
(3.56)  [bv_0, …, bv_(n−1)]/s ↪ l/s′,  where l is fresh and
        s′ = s, {l ↦ [bv_0, …, bv_(n−1)]}
(3.57)  update(l_arr, l_idx, l_val)/s ↪ l′/(s′, {l′ ↦ tt}),  where
        s(l_arr) = [bv_0, …, bv_(m−1)] with shape sh, s(l_idx) = [i_0, …, i_(n−1)],
        s(l_val) = bv_v, k = map(λx. x+1, [i_0, …, i_(n−1)]), j = fromIdx_sh(k),
        j < m, s′ = s[l_arr ↦ [bv_0, …, bv_(j−1), bv_v, bv_(j+1), …, bv_(m−1)]]
        (with the shape unchanged), and l′ is fresh
(3.58)  update(l_arr, l_idx, l_val)/s ↪ err,  where s(l_arr), s(l_idx), k, and j
        are as in rule (3.57), but j ≥ m
(3.59)  power(λx.e, l_n, l_arr)/s ↪ power(λx.e, subi(l_n, 1), e[l_arr/x])/s,
        if s(l_n) = n and n > 0
(3.60)  power(λx.e, l_n, l_arr)/s ↪ l_arr/s,  if s(l_n) ≤ 0

To extend the formal results of Section 3.1.4 to the store semantics, we need to be explicit about how the stores evolve through the reduction semantics. The notions of evaluation contexts and redexes remain unchanged.

Proposition 5 (Unique Decomposition) If ∅; Σ ⊢ e : τ then either e is a value v or there exist a unique E, a unique redex e′, and some τ′ such that e = E[e′] and ∅; Σ ⊢ e′ : τ′. Furthermore, for all e″ such that ∅; Σ ⊢ e″ : τ′, it follows that ∅; Σ ⊢ E[e″] : τ.

Proof. By induction over the derivation ∅; Σ ⊢ e : τ. □

The proofs of the following type preservation and progress propositions are then straightforward and standard (see, e.g., the chapter on references in [86]).

Proposition 6 (Type Preservation) If ∅; Σ ⊢ e : τ and Σ ⊢ s and e/s ↪ e′/s′ then there exists a store type Σ′ ⊒ Σ such that ∅; Σ′ ⊢ e′ : τ and Σ′ ⊢ s′.

Proof. By induction over the structure of the typing derivation ∅; Σ ⊢ e : τ, using Proposition 1. □

Proposition 7 (Progress) If Γ; Σ ⊢ e : τ and Σ ⊢ s then either

1. e is a value; or
2. e/s ↪ err; or
3. there exist an expression e′, a store s′, and a store type Σ′ ⊒ Σ such that e/s ↪ e′/s′ and Σ′ ⊢ s′.

Proof. By induction over the structure of the typing derivation. □

Proposition 8 (Type Soundness) If ∅; Σ ⊢ e : τ and Σ ⊢ s then either the evaluation of (e/s) diverges (written (e/s)↑) or there exist a value v, a store s′, and a store type Σ′ ⊒ Σ such that e/s ↪* v/s′ or e/s ↪* err.

Proof. By induction on the number of machine steps, using Proposition 6 and Proposition 7. □

3.2 Compiling the Inner and Outer Products

One of the main reasons for the complexity of the type systems presented in the previous sections is for the intermediate language to be a suitable target for an APL compiler. In particular, the intermediate language must be rich enough that complex array operations can be mapped to constructs in the intermediate language. We now demonstrate the technique used by Guibas and Wyatt [50] for compiling away the "dot" operator in APL by representing it as a sequence of simpler APL operations. The APL source code is given in Figure 3.5, which consists of the definition and use of an APL dyadic operator dot. We shall not go into discussing and explaining the details of the APL code for the dot-operator, except for mentioning that the left and right function arguments to the operator are referenced within the definition of the dot-operator using ¤¤ and ¥¥, respectively.

The intermediate language code resulting from compiling the dot-operator example is shown in Figure 3.6. The intermediate language code generated for the inner product of two matrices may seem a bit extensive. However, once traditional optimizations are applied, the code is simplified drastically, as can be seen in Figure 3.7, which contains the result after an extensive set of type- and semantics-preserving optimizations have been applied to the TAIL code. After TAIL program optimization, the example program can be compiled into C code, as shown in Figure 3.8. We shall return to the topic of compiling TAIL in Section 3.4.

dot _ {
  ":(κκκ)(κκκ) [κ]( 1+1) [κ]( 2+1)[κ]( 1+ 2)
  WA _ (1 ¥), ¤
  KA _ ( ¤)-1
  VA _  WA
  ZA _ (KA|1 VA),1^VA
  TA _ ZA\WA ¤          " Replicate, transpose
  WB _ (1 ¤), ¥
  KB _  ¤
  VB _  WB
  ZB0 _ (-KB) KB | ( VB)
  ZB _ (1 ( KB)),ZB0,KB
  TB _ ZB\WB ¥          " Replicate, transpose
  ¤¤ / TA ¥¥ TB         " Compute final array
}

A _ 3 2 5              " Example input A
B _ \ A                " Example input B
R _ A + dot # B
R2 _ #/ +/ R           " Reduce on the result

" 1 3 5 " 2 4 1 " " 1 2 5 11 7 -+-> 23 | " 3 4 11 25 19 -+-> 55 # " 5 1 7 19 26 -+-> 52 | " 65780 v

Figure 3.5: The definition of a general dot-operator in APL together with an application of the operator to functions + and # and two matrices A and B.

3.3 APL Explicit Type Annotations

Consider again the signal function from Example 3.1 in the beginning of the chapter. Often a programmer is interested in specifying static properties of a program and in indicating the intention of some functionality. For instance, the diff function takes a vector as an argument and returns a vector of length one less than the length of the argument (unless the argument is the empty vector, in which case the length of the result vector will also be zero). Using a typed intermediate language as a target for APL compilation, we can allow programmers to specify types for functions and operators and have the specifications checked, statically, by the compiler. Here is a new definition of the diff function, which can be applied only to vectors of non-zero length and which returns a vector of length one less than the length of the argument.

let v1:[int]2 = reshape([3,2],iotaV(5)) in
let v2:[int]2 = transp(v1) in
let v3:[int]2 = v1 in
let v4:[int]2 = v2 in
let v5:3 = catV(dropV(b2iV(tt),shape(v4)),shape(v3)) in
let v6:[int]0 = subi(firstV(shapeV(shape(v3))),b2iV(tt)) in
let v7:3 = iotaV(firstV(shapeV(v5))) in
let v8:3 = catV(transp(vrotateV(v6,transp(dropV(~1,v7)))),takeV(~1,v7)) in
let v9:[int]3 = transp2(v8,reshape(v5,v3)) in
let v10:3 = catV(dropV(~1,shape(v3)),shape(v4)) in
let v11:S(int,2) = firstV(shapeV(shape(v3))) in
let v12:3 = iotaV(firstV(shapeV(v10))) in
let v13:1 = dropV(negi(v11),transp(vrotateV(v11,transp(iotaV(firstV(shapeV(v12))))))) in
let v14:3 = catV(dropV(~1,iotaV(v11)),snocV(v13,v11)) in
let v15:[int]3 = transp2(v14,reshape(v10,v4)) in
let v20:[int]2 = reduce(addi,0,zipWith(muli,v9,v15)) in
let v25:[int]0 = reduce(muli,1,reduce(addi,0,v20)) in
i2d(v25)

Figure 3.6: Intermediate language code (before optimization) for the inner product of a 3×2 matrix and a 2×3 matrix with operations + and #.

let v1:[int]2 = reshape([3,2],iotaV(5)) in
let v2:[int]2 = transp(v1) in
let v9:[int]3 = transp2([2,1,3],reshape([3,3,2],v1)) in
let v15:[int]3 = transp2([1,3,2],reshape([3,2,3],v2)) in
let v20:[int]2 = reduce(addi,0,zipWith(muli,v9,v15)) in
let v25:[int]0 = reduce(muli,1,reduce(addi,0,v20)) in
i2d(v25)

Figure 3.7: Intermediate language code for the inner product example after TAIL optimizations.

#include
double kernel(int n33) {
  int n9 = 1;
  for (int n10 = 0; n10 < 3; n10++) {
    int n12 = 0;
    for (int n13 = 0; n13 < 3; n13++) {
      int n15 = 0;
      for (int n16 = 0; n16 < 2; n16++) {
        int n23 = 1+((((n13*6)+(n10*2)+n16)%6)%5);
        int n27 = ((n10*6)+(n16*3)+n13)%6;
        int n32 = 1+(((n27%3)*2)+(n27/3)%5);
        n15 = n15+(n23*n32);
      }
      n12 = n12+n15;
    }
    n9 = n9*n12;
  }
  return i2d(n9);
}

Figure 3.8: Target C-like code for computing the inner product of a 3×2 matrix and a 2×3 matrix with operations + and #.

diff _ { ": <κ>( +1)  <κ> 1 ¥-1|¥ }

Similarly, a type specification for the signal function can specify that the function takes a vector of doubles as argument and returns a vector of the same length as the argument.

signal _ { ":  505050#(diff 0,¥)%0.01+¥ }

Another example of using APL type annotations is for the dot-operator defined in Figure 3.5. This operator takes two dyadic scalar functions as arguments together with two arrays, of ranks ρ1 + 1 and ρ2 + 1, respectively, and returns a new array of rank ρ1 + ρ2. We retain compatibility with the Dyalog interpreter and compiler by placing annotations inside comments, which are written using the symbol ".

3.4 TAIL Compilation

In this section, we demonstrate the feasibility of compiling TAIL programs into a low-level intermediate language, through a low-level array intermediate language API called Laila. The low-level intermediate language is separated into an expression language part and a statement part, and it resembles in this sense a standard low-level imperative language, such as C. Indeed, it is a straightforward task to generate C code from the low-level intermediate code. On top of the low-level intermediate language, we construct a number of array abstractions, which form the language Laila. As we shall see, the Laila language is a suitable setup for hosting a notion of hybrid pull-arrays, which support both multidimensional arrays with shapes and ordinary pull arrays, and which form the basis for our compilation to low-level code. This representation is inspired by the work on optimizing Accelerate programs [78]. We use ty to range over types of the low-level intermediate language:

ty ::= int | double | bool | [ty]

We further assume a denumerably infinite set of program variables x^ty, annotated with a type ty. We sometimes leave out types from program variables when they are of no importance. Expression terms t and statements s for the low-level intermediate language are defined as follows:

bop ::= addi | addd | andb | ltei | ···        (binary ops)
uop ::= negi | expd | ···                      (unary ops)
t   ::= x^ty | i | d | b | c
      | uop(t) | bop(t, t) | malloc(ty, t) | subs(x, t)
s   ::= For(t, λx.s, s) | If(t, t, t, s)
      | Decl(x, t, s) | Assign(x, e, s) | Update(x, i, e, s)

The language contains looping constructs and conditional constructs at the statement level, as well as support for declarations of variables. At the expression term level, the malloc construct gives support for allocating memory for a number of elements of the specified type. The subs construct gives support for indexing into allocated memory. For any expression term t, we write typeof(t) to determine the result type of the expression. This function is easily defined, as all expressions have an immediate result type.

In the following, we define a number of data types for specifying hybrid pull-arrays and code generation. We use an ML syntax for the presentation, but the definitions should easily carry over to other settings. First we assume that the meta-language types for expression terms t, statements s, and variables var are available. We use the syntax I i and B b for injecting integer values i and boolean values b into the term expression language.

At the low-level array intermediate language level, we want to hide the details of name bindings and, in general, hide the fact that we are working with a language that distinguishes between expressions and statements. To this end, we follow the standard technique [36] of using a monad for encapsulating program construction:

infix >>=
structure M : MONAD = struct
  type 'a M = 'a * (s -> s)
  fun (v, sT) >>= g =
        let val (v', sT') = g v
        in (v', sT o sT') end
  fun ret x = (x, fn s => s)
end

We can now define a set of helper functions for building intermediate language constructs:

fun alloc ty t =
      let val x = newVar ty
      in (x, fn s => Decl(x, malloc(ty, t), s)) end

fun update a i v = ((), fn s => Update(a, i, v, s))

fun index a i = (subs(a, i), fn s => s)

fun assign x e = ((), fn s => Assign(x, e, s))

fun lprod nil = I 1
  | lprod (x::xs) = muli(x, lprod xs)

fun lett e =
      let val v = newVar (typeof e)
      in (v, fn s => Decl(v, e, s)) end

fun for n f = ((), fn s => For(n, f, s))

The lett combinator constructs a binding of an expression; the assign combinator emits an assignment statement and is used by the materialize combinator below. We now make use of the assumption that the ranks of all arrays in a program are statically determined at compile time. This static property makes it possible to define shapes as meta-level lists of residual expression terms:

type INT = t
type sh = INT list

Our notion of a hybrid pull-array is defined by an idx data type and an arr data type:

datatype idx = F of INT -> t M     (* flat representation *)
             | N of sh -> t M      (* shaped representation *)
type arr = ty * sh * idx

The arr type denotes a multi-dimensional pull-array with underlying base type ty, shape sh, and index function idx. The idx type supports two different encodings of indices. The first form specifies that the pull-array is encoded using a flat representation, requiring calculation of the proper index into a row-major representation if encoding a multi-dimensional array. The second form specifies that the user need only supply indices for each dimension to "pull out" an underlying value of the pull-array; we call the second form a shaped pull-array. As we shall see below, some array operations are indifferent to the choice of the form of the index function, whereas other array operations work best with a particular form of index function. In particular, reshape operations are free to perform on flat pull-arrays, whereas transpositions are free to perform on shaped pull-arrays. We assume meta-level definitions of the toIdx and fromIdx functions (from Section 3.1.3) with the following type signatures:

val fromIdx : sh -> sh -> INT M
val toIdx   : sh -> INT -> sh M

We can now define a couple of helper functions for converting between shaped pull-arrays and flat pull-arrays:

fun toF (ty, sh, F f) = (ty, sh, f)
  | toF (ty, sh, N f) = (ty, sh, fn i => toIdx sh i >>= f)

fun toN (ty, sh, N f) = (ty, sh, f)
  | toN (ty, sh, F f) = (ty, sh, fn is => fromIdx sh is >>= f)

Figure 3.9 shows an implementation of a selection of array operations on hybrid pull-arrays. Notice in particular how the implementation of the each combinator is indifferent to the form of the pull-array; the supplied function composes with both forms of index functions. Notice also the implementation of the materialize combinator, which uses different materialization strategies depending on the form of pull-array it is given.

We have left out many of the important array operations necessary for a complete implementation of hybrid pull-arrays for the TAIL language. Missing operations include the reduce and power operations, for which details of the semantics can be found in Section 3.1.3. The actual implementation of the power primitive may recognize cases where the size of the iterated array values is known not to change over the iterations. In such cases, a double-buffering approach can be used and memory can be allocated outside of the generated loop, as sketched below. Finally, we mention that arrays that are mutated by the programmer always use a materialized representation.
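To illustrate the double-buffering strategy, the following C sketch shows what the residual code for such a power application might look like; the names power_db and f_body, and the element type double, are our own illustrative choices, not output of the actual compiler.

#include <stdlib.h>
#include <string.h>

/* Hypothetical residualized body of the iterated function f. */
static double f_body(const double *cur, int i) { return 0.5 * cur[i]; }

/* power(f, n, a) when the iterated array's size sz is known to be
   invariant: both buffers are allocated once, outside the generated
   loop, and swapped after each iteration (double buffering). */
static double *power_db(const double *a, int sz, int n) {
  double *cur = malloc(sz * sizeof *cur);
  double *nxt = malloc(sz * sizeof *nxt);
  memcpy(cur, a, sz * sizeof *cur);        /* initial array argument */
  for (int k = 0; k < n; k++) {
    for (int i = 0; i < sz; i++)           /* one application of f   */
      nxt[i] = f_body(cur, i);
    double *tmp = cur; cur = nxt; nxt = tmp;
  }
  free(nxt);
  return cur;                              /* holds f^n(a) */
}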

3.5 Benchmarks

We have evaluated our compiler by comparison with a state-of-the-art interpreter, namely Dyalog, and with handwritten C code. The chosen benchmarks exhibit different computation patterns, and range from a few lines of APL (e.g., game of life) to 200 lines (a Monte-Carlo option pricer translated from code provided by LexiFi). In addition to comparing pure sequential performance, we investigated the possibility of future automatic parallelization by adding a few OpenMP pragmas [85] by hand and comparing the parallel speedup with hand-optimized OpenMP code (Section 3.5.5). We conclude the section by mentioning a few identified bottlenecks in our approach, and possible solutions (Section 3.5.7).

3.5.1 Benchmark setup

All benchmarks have been executed on a 32-core AMD Opteron 6274 system running at 2.2 GHz. All benchmarks have been executed 30 times each, and we report averages of wall-clock timings together with standard deviations. Time spent on file I/O while reading datasets into memory is not included in the measurements.

fun transp a =
      case toN a of
        (ty, sh, f) => (ty, rev sh, N (f o rev))

fun rotate (n, a) =
      case toN a of
        (ty, x::xs, f) =>
          (ty, x::xs, N (fn i::is => f (modi(addi(i, n), x) :: is)))
      | (ty, nil, f) => (ty, nil, N f)

fun drop (n, a) =
      case toN a of
        (ty, x::sh, f) =>
          (ty, maxi(I 0, subi(x, absi n)) :: sh,
           N (fn i::ix => f (addi(i, n) :: ix)))
      | (ty, nil, f) => (ty, nil, N f)

fun each f (ty, sh, F g) = (ty, sh, F (fn i => g i >>= f))
  | each f (ty, sh, N g) = (ty, sh, N (fn ix => g ix >>= f))

fun extend sh f = lett (lprod sh) >>= (fn sz => ret (fn i => f (modi(i,sz))))

fun reshape sh' a =
      case toF a of
        (ty, sh, f) =>
          extend sh f >>= (fn g => ret (ty, sh', F g))

fun fornest nil k = k nil
  | fornest (s::sh) k =
      for s (fn i => fornest sh (fn ix => k (i::ix)))

fun materialize (ty, sh, idx) =
      lett (lprod sh) >>= (fn sz =>
      alloc ty sz >>= (fn a =>
      (case idx of
          F f => for sz (fn i => f i >>= update a i)
        | N f => lett (I 0) >>= (fn n =>
                 fornest sh (fn ix =>
                   f ix >>= (fn v =>
                   update a n v >>= (fn () =>
                   assign n (addi(n, I 1))))))
      ) >>= (fn () =>
      ret (ty, sh, F (index a)))))

Figure 3.9: Array operations based on our hybrid pull-array representation.

For the micro-benchmarks we compare ourselves against the commercial 32-bit Dyalog APL interpreter, version 14.0.22502, which has been made available to us by Dyalog Ltd. With this version, Dyalog provides its own experimental APL compiler, but it is still at an experimental stage and does not cover all of APL. For example, functions using indexed assignments cannot be compiled, and the use of global names (including calls to other user-defined functions) is not allowed. Only one of our benchmarks (Easter) could benefit from turning on automatic compilation, giving it a speed-up of around 1.5, from approximately 8.8 s to 6 s. We have not looked further into the possibility of getting the remaining benchmarks to compile with Dyalog's built-in compilation support, especially because of the limitation with respect to user-defined functions.

3.5.2 Micro-benchmarks

These five benchmarks are used to compare ourselves against the Dyalog interpreter. The exact same source code has been used, with a different prelude library for certain operations. In particular, file I/O and bitwise operations are implemented differently in Dyalog and TAIL. In ordinary APL, such as Dyalog APL, bitwise operations are written as manipulations of bit arrays, together with encode/decode operations between the integer representation and the bit-array representation. This coding requires the language implementer to handle bit vectors carefully, to be able to make use of ordinary shift instructions when a user issues a take or a drop on a bit vector. Instead of going into that, which is not part of our research agenda, we decided to provide direct access to the bitwise operations in TAIL, through special symbols ¡AND, ¡XOR, and so on.

The benchmarks are introduced below, with problem sizes given in Table 3.1 and the performance measurements displayed in Table 3.2.

Game of life Simulates N iterations of Conway's Game of Life. This simulation is a standard APL benchmark. However, for compilation with TAIL, it was necessary to rewrite it to avoid uses of nested arrays.

Easter This example is taken from a presentation by Dyalog, and computes the date of Easter Sunday in all years from year 1 to year 3000. We do this 300 times, just to be sure!

Primes This benchmark computes the first N primes, and is short enough to present in its entirety:

A_1 10000
primes _ (1=+/-0=A.|A)/A

The main ingredient is a big outer join using the remainder (modulus) operation. With a vertical reduction, the number of divisors for each input is counted, and a compress operation selects the primes. A C rendering of this computation is sketched after Table 3.1 below.

Benchmark       Problem size                     Source
Primes          N = 100,000                      -
Easter          N = 300 × 3,000                  Dyalog
Black-Scholes   N = 100,000                      -
Sobol MC π      N = 10,000                       Our own
Game of life    N = 20,000 (board: 40×40)        Our own
Option pricer   N = 1,048,576, medium dataset    FINPAR
HotSpot         N = 300 (grid: 512×512)          Rodinia

Table 3.1: Benchmarks used. The first five are relatively small micro benchmarks. The bottom two are larger real-world applications.
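For readers unfamiliar with APL, the following C program is our own illustrative transcription of the Primes benchmark (not compiler output): a dense N-by-N "outer join" with the modulus operation, counting the divisors of each number and compressing out the primes.

#include <stdio.h>

#define N 10000

int main(void) {
  for (int a = 1; a <= N; a++) {
    int divisors = 0;
    for (int d = 1; d <= N; d++)   /* the N-by-N outer product with mod */
      if (a % d == 0)
        divisors++;
    if (divisors == 2)             /* the compress step: keep the primes */
      printf("%d\n", a);
  }
  return 0;
}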

Benchmark       Dyalog (ms)      TAIL (ms)       Speed-up
Primes          1423 ± 8.0       3190 ± 6.4      0.45
Easter          18259 ± 157      46 ± 5.2        401
Black-Scholes   3086 ± 20.2      96 ± 5.8        32
Sobol MC π      12692 ± 202      14.5 ± 1.1      871
Game of life    1490 ± 6.5       969.77 ± 6.8    1.5

Table 3.2: Timings of the five micro benchmarks on Dyalog and TAIL. Each benchmark was executed 30 times; the standard deviations are annotated.

Benchmark       Dyalog            TAIL (ms)     C (ms)
Option pricer   57 min.           4587 ± 70     2267 ± 142
HotSpot         14832 ms ± 353    3679 ± 5      2044 ± 69

Table 3.3: Application benchmarks in Dyalog, TAIL, and handwritten C code, respectively. We have executed all benchmarks 30 times, except the Option pricer running on Dyalog.

Black Scholes This is a standard APL benchmark, pricing European (call) options. We price the exact same option N times.

Monte-Carlo π with Sobol-sequences For Monte-Carlo problems where sample correlation is not problematic, Sobol sequences provide a good source of quasi-random numbers. Instead of sampling pseudo-randomly, they fill out the sample space systematically. This is an advantage in many situations, giving faster convergence. In this benchmark we use Sobol numbers to compute a Monte-Carlo approximation of π.

3.5.3 Application benchmarks

Option Pricing This benchmark is a Black-Scholes Monte-Carlo model for pricing a wide class of option contracts (the model does not handle American options). The benchmark follows a nested Map/Reduce pattern. The components involved are: Sobol-sequence generation [20], conversion from the uniform to the normal distribution, and generation of price evolution paths using Brownian-bridge discretization. To simulate a market with dependencies among the underlyings, a Black-Scholes model is used to correlate the generated paths. Finally, the payoff for each option is calculated, and in the reduction phase all the obtained prices are averaged.

The original C code for this benchmark came from production code by LexiFi, which was extracted to a stand-alone presentation in the FINPAR benchmark suite [7]. Implementation details and mathematical foundations can be found in the FINPAR paper. We compare ourselves with the medium-sized dataset, which requires 3 underlyings, 5 time steps, and repeating the pricing for N = 1,048,576 independent iterations. The large dataset was not manageable for the Dyalog interpreter. The Dyalog implementation was initially written by us, and optimized with suggestions from Dyalog software engineers.

HotSpot HotSpot is a widely used tool in the VLSI design process, used to estimate processor temperature by simulation. The inputs are two M×M matrices of, respectively, power and initial temperatures. The benchmark proceeds by iteratively solving a series of differential equations. The output is an M×M grid where each cell represents the average temperature value for the corresponding area of the die. This benchmark is taken from the Rodinia benchmark suite for heterogeneous computing [33]. We have ported the implementation from code in the APL-like language ELI, originally presented by W.-M. Ching et al. in connection with work on ELI-to-C compilation and parallelization [34]. We have benchmarked using the same problem size as in the original Rodinia paper, a 512×512 grid and 360 iterations, though they also provide a dataset for a 1024×1024 grid simulation. The structure of one iteration is sketched below.
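Schematically, one iteration of the simulation is a stencil computation of the following shape (our own C illustration with made-up coefficient names c, cn, cs, ce, and cw; the actual Rodinia/ELI code differs in details such as boundary handling):

/* One HotSpot-style iteration: each interior cell of the M-by-M
   temperature grid t is updated from its four neighbours and the
   corresponding cell of the power grid p. */
void hotspot_step(int M, const double *t, const double *p, double *tnew,
                  double c, double cn, double cs, double ce, double cw) {
  for (int i = 1; i < M - 1; i++)
    for (int j = 1; j < M - 1; j++) {
      double x = t[i * M + j];
      tnew[i * M + j] =
          x + c * p[i * M + j]
            + cn * (t[(i - 1) * M + j] - x) + cs * (t[(i + 1) * M + j] - x)
            + cw * (t[i * M + (j - 1)] - x) + ce * (t[i * M + (j + 1)] - x);
    }
}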

3.5.4 Hybrid pull-array performance

The hybrid pull-arrays introduced in Section 3.4 were motivated by a desire to eliminate index-space calculations, for example when doing transpositions, as they are free as long as the array is in the form of a shaped pull-array. To demonstrate the gain of this hybrid approach, we have compared our TAIL performance using hybrid pull-arrays with ordinary flat pull-arrays in the style of Guibas and Wyatt [50]. The timings are presented in Table 3.4.

Benchmark       Flat        Hybrid      Speed-up
Primes          4243 ms     3190 ms     1.01
Easter          129 ms      46 ms       2.84
Black-Scholes   97 ms       96 ms       1.22
Sobol MC π      21 ms       14.5 ms     1.33
Game of life    1183 ms     970 ms      1.43
Option pricer   4947 ms     4587 ms     1.08
HotSpot         5238 ms     3679 ms     1.42

Table 3.4: Performance comparison between flat pull-arrays and our hybrid pull-arrays on the seven benchmarks. Each benchmark was executed 30 times.

3.5.5 Parallelization potential

To test whether there is a potential for automatic parallelization of TAIL programs, we have attempted adding OpenMP pragmas by hand to the generated code. We have only attempted it on the two application benchmarks, and we only added OpenMP annotations to a single loop in each case. In the option pricing benchmark, where the outer loop is a large Map/Reduce, we added a single OpenMP reduction pragma (sketched below), hoisted all the memory allocations out of the loop, letting the threads share the same allocation, and padded the memory allocations to avoid memory conflicts. With these few modifications we got the speedup shown in Figure 3.10. In the case of HotSpot, the outermost loop has a dependence between loop iterations, making it inherently sequential, but each iteration of that loop contains a large map, which can be parallelized by adding a single OpenMP annotation. The speed-up graph for this attempt is also shown in Figure 3.10. In both cases we observe linear speed-up until 8 threads, after which the curve flattens a bit; at 32 threads we get around 21 times speedup compared to the sequential versions. These measurements demonstrate that even a simple parallelization strategy might work, by adding parallelization pragmas to the outermost maps, reductions, or scans, in a breadth-first search through the AST. To obtain good performance, it will also be necessary to be able to hoist memory allocations out of loops when there is no forward dependency to the next iteration.
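The annotation added to the option pricer was of the following form, here shown on a schematic version of the outer Map/Reduce loop (price_path and npaths are illustrative names; the actual generated C code is considerably more verbose):

#include <omp.h>

/* Stand-in for the Brownian-bridge / Black-Scholes pricing of one path. */
static double price_path(int i) {
  return (double)(i % 100) / 100.0;
}

double price_all(int npaths) {
  double sum = 0.0;
  /* the single pragma added by hand: a parallel map with a + reduction */
  #pragma omp parallel for reduction(+:sum)
  for (int i = 0; i < npaths; i++)
    sum += price_path(i);
  return sum / npaths;
}

In the real experiment, the per-path workspace allocations were additionally hoisted out of the loop and padded, as described above.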

3.5.6 TAIL on GPUs

A backend based on pull-arrays is of course only one out of many possibilities for compiling the intermediate language. Together with students and colleagues, we have also compiled TAIL to the existing data-parallel GPU languages Accelerate and Futhark.

Compiling TAIL to Accelerate

The language Accelerate is an array language for GPU programming embedded in Haskell, supporting a wide range of data-parallel primitives and multidimensional, rank-polymorphic programming. Accelerate is limited by not supporting nested data-parallelism. We have previously presented an Accelerate-based backend for TAIL [23]. The backend was implemented as part of a student project. To support the primitives of APL, a library of APL-style functions was implemented using core Accelerate constructs. Various difficulties arose; especially the encoding of shapes as so-called snoc-lists made implementing operations on the outermost dimension (the innermost in the snoc-list) hard to write in a rank-polymorphic way without turning to potentially inefficient array transpositions.

[Figure: speed-up as a function of the number of threads (0-32) for four configurations: HotSpot OpenMP, HotSpot TAIL w. OpenMP, Pricer OpenMP, and Pricer TAIL w. OpenMP.]

Figure 3.10: Speed-up compared with the sequential versions.

Table 3.5 lists our timings on a few of the same benchmarks mentioned previously. All benchmarks were executed on a system with an Intel Xeon E5-2650 CPU and an NVIDIA GeForce GTX 780 Ti GPU. Our TAIL-to-Accelerate compiler failed to generate code for the Easter and Black-Scholes benchmarks, as the current formulations of those benchmarks use nested reduce operations, which are not supported by Accelerate. An alternative would be to build a rewriting engine into the TAIL compiler, which could probably eliminate the need for nested parallelism in these smaller benchmarks.

Benchmark       Problem size            TAIL C (ms)    TAIL Accelerate (ms)
Signal          N = 50,000,000          209.03         16.1
Game-of-Life    40×40, N = 2,000        28.70          2.30
Easter          N = 3,000               33.96          -
Black-Scholes   N = 10,000              54.0           -
Sobol MC π      N = 10,000,000          4881.30        2430.30
HotSpot         1024×1024, N = 360      6072.93        2.03

Table 3.5: Benchmark timings in milliseconds. The timings are averages over 30 executions. TAIL C is using our sequential C-code backend; TAIL Accelerate is using the Accelerate backend.

Benchmark       Problem size            TAIL C (ms)    TAIL Futhark (ms)
Signal          N = 50,000,000          712.20         1.49
Game-of-Life    1200×1200, N = 100      1274.03        16.11
Easter          N = 10,000,000          321.80         1.80
Black-Scholes   N = 10,000,000          5396.10        9.13
Sobol MC π      N = 10,000,000          368.67         23.68
HotSpot         512×512, N = 360        1210.57        16.84

Table 3.6: Benchmark timings in milliseconds. The timings are averages over 30 executions. TAIL C is using our sequential C-code backend; TAIL Futhark is compiled using the Futhark backend.

Compiling TAIL to Futhark

As part of another student project, TAIL was also compiled into the language Futhark [55]. Futhark is a purely functional programming language targeting GPUs, developed by colleagues at the HIPERFIT research center. Futhark is based on a calculus of array operators inspired by the Theory of Lists by Bird [16], allowing fusion and several other optimizations necessary to generate efficient GPU code.

As Futhark is not a polymorphic language, TAIL programs involving polymorphism were monomorphized before translation to Futhark. Monomorphization includes some of the built-in operations. The Easter and Black-Scholes benchmarks, which proved to be troubling for Accelerate, were manageable for Futhark to compile to efficient GPU code. Translating TAIL to Futhark did, however, still not allow us to translate the larger Option pricing benchmark from the FINPAR benchmark suite presented previously, because of the use of in-place updates, which are necessary to express the so-called Brownian-bridge path generation procedure.

Table 3.6 lists timings as previously reported [55]. The benchmarks were executed on the same system as the benchmarks of our TAIL-to-Accelerate compiler above, and are thus comparable, although (for some reason unknown to this Ph.D. student) the problem sizes of some of the benchmarks have been changed. A comparison between Futhark and Accelerate is now also possible. In most cases Futhark drastically outperforms Accelerate, but in the case of HotSpot, Futhark is 8 times slower on a much smaller dataset (512×512, compared with 1024×1024 in the case of Accelerate).

The results presented here, for compilation of TAIL to both Accelerate and Futhark, are orders of magnitude better than what is attainable from the industry-standard interpreter Dyalog. These results give us further confidence that TAIL is suitable as an intermediate language for compilation of APL programs to GPUs. The benchmark suite is, however, limited to only a few small programs, and we will have to obtain more realistic benchmarking programs from our industry partners to prove the practicality of the approach in larger settings.

3.5.7 Identified bottlenecks

While benchmarking, we have discovered several shortcomings that need to be addressed to obtain better performance, and which are especially important if we want to move into parallel computation.

Hoisting memory allocations out of loops when possible has, unsurprisingly, been shown to give some additional performance, and is especially important for obtaining parallelization speedup. We should therefore also work on general expression hoisting and strength reduction, as memory allocations may depend on other expressions.

We have also observed a few loops where a loop interchange could improve performance by making memory access patterns sequential. Memory access patterns are even more important for architectures such as GPUs, where coalesced memory accesses are of key importance for performance.

3.6 Related Work

This chapter extends our previous work on compiling APL [43] with a much more complete treatment of the source language, including support for boolean operations (e.g., compress), new data-parallel operations (e.g., scan), iterative computations (i.e., the power operator §*), and mutable array updates. The enriched treatment of the language also includes a refined type system for TAIL and a more complete coverage of primitives, both with respect to translation and target-language operational semantics. Further, this work extends the previous work by using the idea of hybrid array representation from McDonell et al. [78] to reduce the amount of index computations. Also, the previous work did not report any performance numbers for the compilation approach, and it did not consider the possibility of allowing end-user APL programmers to use type annotations to assert static properties of defined functions and operators.

Other related work can be divided into a number of areas. Although, traditionally, APL is an interpreted language, there have been many attempts at compiling the language. For instance, Guibas and Wyatt have demonstrated how a subset of APL can be compiled using a delayed representation of arrays [50]. Other attempts include Timothy Budd's APL compiler [22] and the ELI compiler [34]. One of the most serious attempts at compiling APL is the work on APEX [13], which also contains a backend targeting SAC [47]. APEX has been reported to provide both good sequential and parallel performance on multi-processor machines. Our main contribution compared to these APL compilers is that we have identified a typed array intermediate language, which can be understood in isolation from the gory details of APL, including APL operators and function overloading, identity item resolution, scalar extension resolution, and more. Moreover, we have demonstrated that compiling APL through such a typed intermediate language can lead to competitive performance compared to hand-written C code. We also pave the road for generating parallel code automatically by demonstrating the possibility of achieving good parallel performance by adding, although manually, simple OpenMP annotations to the generated C code.

A different area of related work relates to type systems for, and formal semantics of, array-oriented languages. This area includes the work by Slepak et al. [92] on a type system with static rank polymorphism for expressing APL-like operations. A related pool of work includes the work on SAC [48], and in particular QUBE [101], on type systems for array languages. The SAC project seeks to provide a common ground between the functional and imperative domains for targeting parallel architectures, including both multi-core architectures [46] and massively data-parallel architectures [51]. SAC uses with and for loops to express map-reduce-style parallelism and sequential computation, respectively. More complex array constructs can be compiled into with and for loops, as demonstrated, for instance, by the compilation of APL into SAC [47]. The work on QUBE supports dependent types for verifying array-related invariants [101].

Another pool of related work on array languages is the work on Futhark [58, 56, 57], which, like TAIL, and in contrast to SAC, holds on to the concept of second-order array combinators (named SOACs) in the intermediate representations.
The benefit of this approach is the ability to perform critical optimizations, such as fusion, even in cases involving filtering and scans; such operations are not straightforward constructs for SAC to cope with. Similar to Futhark, we seek to allow programmers to express parallel patterns in programs as high-level functional constructs, with the aim of systematically (and automatically) generating efficient (and even data-dependent) parallel code. Also related to the present work is the work on capturing the essential mathematical algebraic aspects of array programming [52] and list programming [16] for functional parallelization. Another piece of related work is the work on the Fish programming language [65], which uses partial evaluation and program specialization for resolving shape information at compile time.

A scalable technique for targeting parallel architectures in the presence of nested parallelism is to apply Blelloch's flattening transformation [17]. Blelloch's technique has also been applied in the context of compiling NESL [11] to GPU code, but it sometimes incurs a drastic memory overhead. In an attempt at coping with this issue, and for processing large data streams while still making use of all available parallelism, a streaming version of NESL, called SNESL, has been developed [75], which supports a stream datatype for which data can be processed in chunks and for which the cost model is explicit. For utilizing parallel architectures efficiently, future compilation approaches for TAIL may also involve a variant of the flattening technique.

Finally, a large body of related work includes the work on embedded domain-specific languages (DSLs) for programming massively parallel architectures, such as GPUs. This body of work includes the work on Obsidian [36], from which the idea of push-arrays originates, and the Accelerate library [28], both of which target GPUs.

3.7 Conclusion and Future Work

We have presented a statically typed intermediate language, used as a target for an APL compiler, and demonstrated the ability to compile our selected subset of APL into efficient low-level code. Our goal with this work has not been a desire to construct a minimal and pure language. We have instead been striving to find a practical and pragmatic middle way between systems supporting dependent types, such as QUBE, and languages without shape information in types. In contrast to systems based on dependent types, TAIL allows full type inference without user involvement.

There are several directions for future work. In particular, our intermediate language does not support nested arrays, although it does have support for expressing nested parallelism. Adding segmented operations and flattening transformations would be a possible future path. Obtaining larger benchmark programs written by APL experts has been harder than anticipated, and to demonstrate our goals more clearly it would be interesting to obtain such realistic benchmark programs written by financial-industry specialists. Finally, it would be interesting to investigate the possibility of compiling the intermediate language directly into efficient code for multi-core CPUs and many-core GPUs. We have initiated efforts in this direction by attempting to construct a language for building GPU kernels that allows fusion. This work is presented in Chapter 5.

We ended the TAIL project in 2015 to pursue an alternative strategy for generating GPU code. The HIPERFIT research center already had another ongoing project on GPU compilation, the before-mentioned Futhark project, which is in many aspects similar to the TAIL language. To avoid too much internal competition at the research center, we therefore changed course to investigating a GPU programming approach with more control over GPU resources and the memory hierarchy than what can be achieved by automatic optimizers. The work on such a resource-aware language is presented in Chapter 5, but before we get to that, Chapter 4 will introduce the necessary background on GPU architectures and programming.

Chapter 4

GPU architecture and programming

Historically, hardware manufacturers have been able to increase clock rates and instruction-level parallelism with each new generation of processors, allowing single-threaded code to run faster without modifications. Heat dissipation, power consumption, and other limiting factors have put an end to that trend. Clock rates have stalled around 4 GHz, even though the number of transistors manufacturers can put on a single chip is increasing at the same rate as previously, following Moore's law. Instead of increasing clock rates, hardware vendors are now applying the growing number of transistors to increase the number of cores on a single chip.

Hardware vendors are thus exploring other options for increasing performance. At the extreme, thousands of individual processing units are placed on the same die, on devices such as graphics processing units (GPUs). The design of a GPU is vastly different from that of traditional CPUs. Caches and control logic such as branch predictors and data-forwarding paths are exchanged for additional compute units (see Figure 4.1). This massively parallel design has become a cost-effective alternative to CPUs in areas such as image processing, machine learning, simulations in the natural sciences, bioinformatics, and computational finance. Many tasks in these domains can be formulated as data-parallel programs, which can take advantage of the massive degree of parallelism, and are not bound by the requirements of low-latency context switching that control logic and caches provide for traditional CPUs.

Taking advantage of the highly parallel GPU hardware requires careful attention to a major underlying issue, namely the performance gap between the computational power of GPUs and the limited bandwidths of their memory systems. The GPU we have used in our experiments, an NVIDIA GTX 780 Ti, has the ability to perform 5,046 GFlop/s (single precision). However, its internal memory system is limited by a bandwidth of 336 GB/s. Data transfer rates with memory external to the GPU are further limited by the PCI Express bus. As GPUs are still in their infancy, their memory systems are complex, and intimate knowledge of their different facets is required for writing efficient GPU algorithms. Programmers need to limit the number of main-memory transactions through the use of memory-efficient algorithms and the limited amount of manually controlled local memory and caches.

Figure 4.1: Transistor usage, CPUs vs. GPUs. Figure from the NVIDIA CUDA C Programming Guide [82].

In addition, one should make sure to schedule enough simultaneous computations to obtain latency hiding for otherwise I/O-bound computations.

This chapter will introduce the central terminology of GPU programming, give an overview of current GPU hardware and GPU programming, and discuss the optimisation strategies necessary for obtaining efficiency on GPU devices.

4.1 GPU architecture

A typical modern GPU device consists of a few hundred to a couple of thousand individual computational cores, all operating in parallel. These cores are clustered into larger units called streaming multiprocessors (SMs). As mentioned, the GPU we have used in our experiments is an NVIDIA GeForce GTX 780 Ti, which is based on the Kepler GK110 architecture. The layout of a streaming multiprocessor from the GK110 is shown in Figure 4.2. This particular streaming multiprocessor consists of 192 cores, and the complete GK110 GPU consists of fifteen such streaming multiprocessors, for a total of 2880 cores.

GPUs are typically programmed by launching thousands of identical cooperating threads operating on different data elements of the input. The programmer partitions the threads into cooperative thread arrays (CTAs), also known as thread blocks. Each CTA is assigned to a single SM, and the threads in a CTA can communicate and synchronise through the use of the SM's shared memory. Several CTAs can be executed concurrently on the same SM. The fact that all threads of a CTA execute the same program allows the individual cores of an SM to share components such as instruction schedulers, caches, and register files. The 192 cores of the GK110 streaming multiprocessor share four instruction schedulers. The instruction scheduler groups adjacent threads of a CTA into warps of 32 threads, and dispatches instructions to groups of 32 cores at a time. All 32 cores of such a subgroup always execute exactly the same instructions in lock-step on different threads: single instruction, multiple threads (SIMT). Each thread has its own register set and can branch independently of the other threads.

Figure 4.2: Streaming multiprocessor of the NVIDIA Kepler GK110 GPU. Illustration from the NVIDIA Kepler white paper.

However, as all threads of a warp execute in lock-step, control divergence between threads will be realised by disabling the threads that are not taking a given branch. This means that full efficiency is only realised when all threads of a warp follow the same execution path.

Accessing memory on a GPU is typically slow, and to obtain latency hiding it is necessary to be aware of the ratio of arithmetic operations to memory operations. Increasing arithmetic intensity can either come directly from compute-intensive problems with a high degree of arithmetic operations per thread, or be achieved by launching orders of magnitude more threads than the total number of cores, such that some threads can perform useful work while others wait for memory transactions to complete. A useful measure is the occupancy of a streaming multiprocessor. Occupancy is defined as the ratio of active warps to the maximum number of active warps that the SM supports. The maximum number of supported warps is limited by the number of registers and the amount of shared memory the program requires per group of threads, and thus depends on the individual program as well as on hardware parameters.
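As a worked example (our own illustrative numbers): the 256 KB register file of a GK110 SM corresponds to 65,536 32-bit registers, and Kepler SMs support at most 64 resident warps. A kernel using 40 registers per thread, launched with 256-thread CTAs, needs 40 × 256 = 10,240 registers per CTA, so at most ⌊65,536 / 10,240⌋ = 6 CTAs can be resident. That is 6 × 256/32 = 48 active warps, giving an occupancy of 48/64 = 75%.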

4.1.1 Memory hierarchy

The memory system of a GPU is hierarchically structured, with a large main memory (global memory) of several gigabytes, a relatively small L2 cache shared between all streaming multiprocessors (512 KB to 2 MB), as well as various even smaller memory modules in each streaming multiprocessor. A streaming multiprocessor provides its cores with an L1 cache, a read-only texture buffer, a register file, and an area of shared memory. The shared memory module makes it possible for the cores inside an SM to communicate, and it is also useful as a manually controlled read/write cache. The register file and the shared memory bank constitute the working memory of the threads. Each SM of the GTX 780 Ti provides 256 KB of registers and 64 KB of shared memory. However, the register file and shared memory have to be shared by all active threads. The number of registers and the amount of shared memory used per thread thus limit the number of active threads on the SM, reducing occupancy. To obtain good occupancy, it is therefore beneficial to limit register and shared-memory usage.

Reading and writing global memory is performed in transactions involving entire blocks of successive memory locations. In GPU terms, this is called memory coalescing, as many memory accesses are coalesced into a single transaction. Accessing memory through coalesced transactions is essential for performance, and we will discuss it further in Section 4.3 on optimisation of GPU programs.

The bandwidth of global-memory accesses is bounded by the width of the memory bus and the memory clock rate. For our NVIDIA GeForce GTX 780 Ti, the bus width is 384 bits and the memory is clocked at 1750 MHz; with GDDR5 memory, it has a double data rate and double data lines. This gives a theoretical bandwidth of:

1750 MHz × 384 bit × 2 × 2 = 336 GB/s

The effective bandwidth is, however, much lower and must be measured empirically. With NVIDIA's bandwidth tester (bandwidthTest --cputiming), we measure an effective bandwidth of 254.9 GB/s.

An even greater bottleneck is present in the connection between GPU and CPU memories. Currently, the CPU and the GPU are most often connected through the PCI Express bus, which in current versions is limited to a bandwidth of around 8-16 GB/s. Reducing data movement over the PCI Express bus is thus even more crucial for performance. Alternative designs that allow CPUs and GPUs to share the same memory system have been developed to mitigate this.

4.2 GPU programming

General-purpose computing on graphics processing units (GPGPU) is the technique of using GPUs for applications not necessarily involving computer graphics¹. There are two widely used GPU programming platforms, namely OpenCL and CUDA. Each platform uses its own terminology. We will not try to cover the differences, but stick mostly to the terminology of OpenCL.

The execution of a GPU program has to be conducted by an accompanying program executing on a CPU. The CPU is referred to as the host and the GPU is called the device. Other concepts, such as pointers or arrays, are often prefixed with the location (GPU or CPU) to specify where they reside. A host pointer is thus a pointer pointing to host memory, and a device-side array is an array located in device memory. Observe that OpenCL is not limited to interfacing with GPUs, so the device and the host might be the same hardware unit.

The smallest unit of work on a GPU is a single thread of execution on a core. However, to limit the number of scheduling units on the SMs, these cores are grouped into warps (or wavefronts) of 16 or 32 cores executing the same instructions in lock-step. All the threads are further arranged into equal-sized groups called work-groups (or thread blocks), which in turn are arranged into a grid of work-groups. Each work-group is executed on a single SM, which provides shared memory accessible from all cores of that SM.

Programming a GPU is done by writing kernel programs, or simply kernels. A kernel specifies the work done by a single thread of the complete problem; a minimal example is sketched below. Synchronisation across threads is not necessary within warps, as they always execute in lock-step. Synchronisation across threads in individual work-groups is accomplished by barrier functions. These calls must not occur inside conditionals and shall thus always synchronise all threads inside the work-group. Synchronisation between work-groups is possible only by splitting work into several kernels.

Because synchronisation is not possible across work-groups, the common structure of all GPU algorithms is to a) partition the input data between work-groups, b) let each work-group process its designated part of the input, and c) let each work-group distribute its output to distinct sections of global memory. In addition to writing such kernel programs, care has to be taken when scheduling kernel calls and data movement between host and device.

The host program, which conducts which GPU kernels to launch, is also responsible for managing data movement between host and device memory in between kernel invocations. However, newer generations of GPUs (which we have not had available for this project) allow a unified virtual address space, such that data movement can be performed on an as-needed basis, when a kernel requests specific portions of an array, instead of being explicitly managed by the host program.
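The following is a minimal OpenCL C kernel sketch of this idea (our own example, with illustrative names): each work-item computes one element of a vector addition, and the host launches the kernel over a grid of at least n work-items.

__kernel void vadd(__global const float *a,
                   __global const float *b,
                   __global float *c,
                   int n) {
  int i = get_global_id(0);   /* global index across all work-groups */
  if (i < n)                  /* guard: the grid may be padded beyond n */
    c[i] = a[i] + b[i];
}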

¹ Some GPUs are built only for computation, and confusingly, sometimes the term "GPGPU" also refers to these headless GPUs.

4.3 Optimisation of GPU programs

Efficient use of a GPU first of all requires attention to the memory bottlenecks described earlier, and various optimisations can be used to address them. Optimising memory usage either reduces the number of transfers necessary, or performs latency hiding, allowing useful computations to run while waiting for memory transactions to complete. In addition, another class of optimisations addresses the SIMT nature of GPUs, which introduces the problem of control divergence, where multiple threads sharing the same instruction scheduler are allowed to diverge in their control flow. Finally, standard optimisations such as strength reduction, loop unrolling, constant folding, and so on also have an impact when trying to obtain optimal performance of GPU programs.

4.3.1 Memory throughput

Most significant are the performance issues related to global-memory accesses. Accessing global memory on current generations of NVIDIA GPUs incurs a 200-400 clock-cycle stall [82], and can thus quickly overshadow any other algorithmic improvement. The number of global-memory operations can be decreased by various means. We have already discussed how fusion techniques can lead to fewer memory transactions (see Section 2.3). Other possibilities include the use of shared memory for intermediate storage when the same data elements need processing more than once, or optimising data access patterns to allow coalescing of many memory accesses into a single transaction.

Blocking Blocking [72], or tiling, is a technique for improving locality of reference by subdividing a problem into memory blocks, or tiles, that fit into the cache. On GPUs, shared memory can be used as a cache, and blocking can be achieved by loading data into shared memory before performing the actual computation.

Memory coalescing Several memory transactions can be coalesced into a single transaction if threads in the same warp access the same segment simultaneously. On NVIDIA GPUs, if 16 threads (a half-warp) access 16 words in the same aligned segment of 64 or 128 bytes, this access will result in only a single memory transaction, regardless of the access pattern followed by the threads within that memory segment [82]. It is thus important to organise computations such that threads sharing the same warp take the same branches and issue memory operations inside the same memory segments.

If input data is needed in a different order than what can be achieved by a coalesced traversal of memory, it can often be wise to load the data into shared memory in a coalesced fashion before traversing it in the needed order. This pattern is illustrated in Section 5.2, in an implementation of a transpose operation, and sketched in kernel form below.
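A sketch of the pattern in OpenCL C (our own illustration; FCL's version appears in Section 5.2): each work-group stages a tile in shared (local) memory, so that both the global read and the global write traverse memory in coalesced, row-major order, while the actual reordering happens inside the tile.

#define TILE 16

__kernel void transpose(__global const float *in,   /* height x width */
                        __global float *out,        /* width x height */
                        int width, int height) {
  /* +1 padding in the second dimension avoids shared-memory bank conflicts */
  __local float tile[TILE][TILE + 1];
  int lx = get_local_id(0), ly = get_local_id(1);
  int x = get_group_id(0) * TILE + lx;
  int y = get_group_id(1) * TILE + ly;
  if (x < width && y < height)
    tile[ly][lx] = in[y * width + x];     /* coalesced read  */
  barrier(CLK_LOCAL_MEM_FENCE);
  x = get_group_id(1) * TILE + lx;        /* swap tile coordinates */
  y = get_group_id(0) * TILE + ly;
  if (x < height && y < width)
    out[y * height + x] = tile[lx][ly];   /* coalesced write */
}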

Shared memory bank conflicts Although shared memory can be used to optimise global-memory accesses, shared memory itself presents us with additional problems. The shared-memory system used in NVIDIA GPUs is partitioned into memory banks, each accessible with a bandwidth of 32 bits per clock cycle [82]. Two simultaneous loads from the same bank result in a conflict, stalling the computation. Special care is thus also necessary when using shared memory.

4.3.2 Utilisation

Utilising a GPU efficiently does not only require optimising memory transactions; it also requires optimised utilisation of the GPU's compute units (streaming multiprocessors). The remedies for a GPU not being fully utilised range from optimising algorithms for less inter-thread synchronisation, over reducing resource requirements (such as the number of registers used), to increasing the degree of parallelism to be able to perform latency hiding.

Avoiding synchronisation Threads are able to communicate at all levels of the GPU hierarchy, but at vastly different cost. Synchronisation between work-groups is only possible by separating programs into individual kernels, which in effect may also require costly memory transactions to global memory, and should thus be avoided if possible. Mapping parallel algorithms such that synchronisation is mostly required between threads in the same work-group is thus important to maximise utilisation. Similarly, limiting the cost of synchronisation within a work-group can also be beneficial; this is for instance possible if threads that need to communicate are allocated to the same warp.

Latency hiding Latency hiding is achieved when enough arithmetic work is available to keep the processor active while waiting for memory transactions to complete. Increasing the amount of arithmetic work per thread is not always possible, and is very dependent on the algorithm being implemented. When threads in a warp are waiting for I/O operations to complete, the streaming multiprocessor can execute warps from the same or other work-groups.

The number of work-groups and warps residing on a streaming multiprocessor depends on the resource requirements of the individual kernel, for instance the amount of registers and shared memory it requires. To increase the number of resident work-groups and warps it is thus necessary to decrease pressure on registers and shared memory. Choosing a smaller work-group size might, for instance, reduce the amount of shared memory required per group. However, smaller work-groups also leave fewer warps within each group available for latency hiding, so the choice of work-group size is a trade-off.
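The trade-off can be inspected programmatically. A small host-side CUDA sketch (our own illustration; someKernel is a placeholder) asks the runtime how many blocks of a kernel fit on one streaming multiprocessor for a range of work-group sizes:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void someKernel(float *data) { data[threadIdx.x] *= 2.0f; }

int main() {
    for (int blockSize = 64; blockSize <= 1024; blockSize *= 2) {
        int blocksPerSM = 0;
        // the runtime reports residency for this kernel, block size,
        // and dynamic shared-memory usage (here: none)
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, someKernel, blockSize, 0);
        printf("block size %4d: %2d resident blocks = %4d resident warps\n",
               blockSize, blocksPerSM, blocksPerSM * blockSize / 32);
    }
    return 0;
}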

4.3.3 Thread divergence Another challenge posed by the SIMT architecture of GPU streaming multiprocessors is the use of a shared instruction scheduler for warps of threads. When threads inside a warp deviate in their control-flow paths, each separate path must be executed in sequence, with the threads not following a given path disabled. If data-dependent control flow is necessary in a given application, it is thus necessary to optimise the grouping of threads into warps, such that thread divergence inside warps is minimised. This can for instance be achieved by bucketing similar values in a separate pass.
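The following CUDA sketch (our own illustration; the two device functions stand in for arbitrary expensive branches) shows the problem. With randomly mixed signs, every warp executes both branches in sequence; if the input is first bucketed so that each warp sees only one case, each warp runs a single path:

__device__ float heavyA(int v) { return v * 2.0f; }   // placeholder for expensive work
__device__ float heavyB(int v) { return v - 1.0f; }   // placeholder for expensive work

__global__ void classify(const int *x, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (x[i] > 0)                // data-dependent branch: a warp containing both
            out[i] = heavyA(x[i]);   // positive and non-positive values serialises
        else                         // the two paths, idling half its threads
            out[i] = heavyB(x[i]);
    }
}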

Chapter 5

FCL: Hierarchical data-parallel GPU programming

Extended version of “Low-Level Functional GPU Programming” presented at FHPC’16

In recent years, several languages for general-purpose, data-parallel computation on GPUs have been suggested [29, 59, 96, 27, 62, 11]. Most of these language developments have focused on providing users with high-level specifications of programs and performing a range of automatic optimizations. Often no cost model is specified, and the language is thus a black box for users who want to reason about the performance of their programs. Parallel-algorithms researchers are sidelined, as it is hard to reason about the actual efficiency and performance characteristics of algorithms. The user is decoupled from the hardware model, and cannot be sure whether an operation will result in a memory transaction or not, making unexpected performance hits hard to debug. Also, some algorithms require memory access patterns not supported by the prevalent set of primitives, or depend critically on hardware parameters that these languages do not expose [25]. A body of work exists on the topic of modelling I/O usage in sequential [66, 4] and parallel algorithms [3, 5, 103, 104], including modelling of hierarchical memory systems, such as GPUs. However, the work in this area is mostly concerned with machine models useful for analyzing algorithms, and does not specify a programming model for those machine models. In the GPU niche of data-parallel languages, Obsidian is an exception [96]: it allows experimentation at the low level, giving the user (almost) complete control over the GPU, while still allowing computations to be composed efficiently using so-called pull arrays and push arrays. These arrays are not directly stored in a region of memory, but are rather representations of array computations. This means that most array operations are cheap: they do not incur the overhead of writing a modified array to memory. Instead, they

modify the underlying symbolic array computation directly. Furthermore, the memory hierarchy of the GPU is exposed in the type system, allowing users to reason about how programs are mapped to hardware. Obsidian uses a multi-staged compilation approach that allows users to use Haskell as a meta-language generating Obsidian expressions, which can for instance be used to generate all the statements of an unrolled loop, or to precompute certain values already at code-generation time. However, this approach also has a few drawbacks. Obsidian does not provide any general loop constructs, kernels are compiled as individual programs, and no cross-kernel optimizations can thus be performed. We present FCL, a reimplementation of Obsidian with an external syntax, implemented in Haskell2010 as a self-contained compiler¹. With FCL, we extend the work on Obsidian, eliminating the need for meta-programming techniques in program development, and introducing new operators and language constructs to maintain the same expressive power. The embedded nature of Obsidian also had its drawbacks, especially when used as an intermediate language, which is another reason this project came to be. In both Obsidian and FCL, computations are polymorphic in their mapping to executions on the GPU hardware, by the use of level annotations in array types. We have developed a dynamic operational semantics for FCL that details the computational model and makes it clear how the different levels map to various iteration schemes on the GPU. The rest of the chapter is structured as follows. Section 5.1 explains pull and push arrays. In Section 5.2, we introduce FCL through three example programs: array reversal, matrix transpose, and parallel reduction. In Section 5.3, we give a rigorous introduction to FCL, defining its type system and dynamic semantics. With the language fully presented, we present some further example programs in Section 5.4, and in Section 5.5 we demonstrate that FCL is able to generate efficient OpenCL code. Finally, we conclude with a discussion of future work in Sections 5.7 and 5.8.

5.1 Obsidian

FCL inherits pull and push arrays from Obsidian [35]. As mentioned previously, these are not actual arrays manifested in memory, but are instead delayed array computations that describe how to produce an array. When the result of a pull or push array computation is written to memory, we say that the array has been materialized. The two types of arrays complement each other: pull arrays allow array indexing, but array concatenation is inefficient; push arrays on the other hand allow for efficient concatenation, but disallow array indexing. In Obsidian, iteration schemes on push arrays are annotated in the array types, by a level parameter. The level parameter can be either Grid, Block, Warp, or Thread, corresponding to the hierarchy of organization for GPU threads, and

¹FCL is available at http://github.com/dybber/fcl

annotates the sequential/parallel structure of the underlying iteration scheme. How levels are used will be explained in the context of FCL in the next section. The main advantage of push arrays compared with pull arrays is that they allow for efficient implementation of functions that combine arrays, for example array append and various interleavings. Combining pull arrays typically leads to conditionals evaluated at each index of the array. The performance hit of these conditionals can be severe, in particular when they lead to threads diverging within a warp. Interleaving two pull arrays is particularly bad, as it means that each pair of consecutive threads takes different paths through the conditional, wasting half of the resources of each warp used by the interleaving (illustrated in the sketch below). Append and interleave of two push arrays can instead be achieved by generating two separate loop structures and offsetting the writer function. In Obsidian, the programmer can access the underlying array representation, based on index functions and writer functions, of both pull and push arrays. In FCL, we keep the concepts of pull and push arrays, but abstract away from their actual representation, as will be illustrated in the rest of the chapter.
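To see why interleaving pull arrays is costly, consider a CUDA rendition of the two strategies (our own hedged analogue of the pull/push distinction; this is not code generated by Obsidian or FCL):

// Pull-style interleave: one thread per output element decides, per index,
// which input to read; even and odd threads of a warp take different cases.
__global__ void interleavePull(const float *a, const float *b, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < 2 * n)
        out[i] = (i % 2 == 0) ? a[i / 2] : b[i / 2];
}

// Push-style interleave: two uniform writes per thread and no per-element
// conditional; the offsets are baked into the writer instead.
__global__ void interleavePush(const float *a, const float *b, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[2 * i]     = a[i];
        out[2 * i + 1] = b[i];
    }
}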

5.2 Case Studies in FCL

In this section we will demonstrate the use of FCL by implementing three different GPU algorithms: array reversal, array transposition using shared memory, and parallel reduction.

5.2.1 Array Reversal

Consider a program that reverses an array:

sig reverse : forall 'a. ['a] -> ['a]
fun reverse arr =
  let n = length arr
  in generate n (fn i => index arr (n - i - 1))

This program is implemented using the function generate, a language primitive that creates a new array by mapping the given function over the index space [0; n−1]. The program here cannot be compiled directly to GPU code, as it does not mention how it should be mapped to sequential or parallel loops. The arrays in this example are pull arrays, which are identified by types of the form ['a], where 'a is a type variable, representing an arbitrary non-function type. To compile an FCL program into a kernel, we require the user to add an iteration scheme, detailing how this kernel should be mapped to the threads of the GPU. Such iteration schemes are annotated by a level, which can be either thread (sequential execution), block, or grid. The iteration scheme is added using a function called push. Let us demonstrate the basic functionality by creating a block-level version of reverse, that is executed in parallel by a work-group of threads.

sig revBlock : forall 'a. ['a] -> Program <block> ['a]<block>
fun revBlock arr = return (push <block> (reverse arr))

Notice how the iteration scheme is reflected in the array type, ['a]<block>. This is a push array (using Obsidian terms), and is constructed using the push function, which, given a level parameter, associates an iteration scheme with the given pull array:

push : forall <level> 'a. ['a] -> ['a]<level>

The push array is further encapsulated in a Program-monad, using the return function. The Program monad allows us to restrict where values can be used, such that values computed in one kernel cannot slip out and be reused in another, where the values are no longer available. We will return to this aspect later. To distribute the computation across several blocks, the input array has to be partitioned, and the resulting reversed array chunks need to be concatenated back together again in the right order. In this case, the order of the chunks also needs to be reversed before concatenation.

sig revDistribute : forall 'a. int -> ['a] -> Program <grid> ['a]<grid>
fun revDistribute chunkSize arr =
  splitUp chunkSize arr
    |> map revBlock
    |> reverse
    |> concat chunkSize

The operator |> is reversed function application (the syntax is taken from F# and Elm), also known as forward pipe. Notice that the same reverse function can be used both to reverse the order of elements and the order of the blocks. The operation concat distributes the computation across a grid of blocks, thereby raising the level from block to grid. This is also evident from the type of concat, where 1+level unifies with levels one step higher in the hierarchy (details are given in Section 5.3).

concat : forall <level> 'a. int -> [Program<level> ['a]<level>] -> Program<1+level> ['a]<1+level>

This means that each subarray is executed in a separate block, and concat makes sure that each block writes its result to adjacent subsections of the array it returns. Alternatively, we could have applied push directly to the primitive reverse function, to add a grid-level iteration scheme to the array, but that is only possible in simple cases, where there are no dependencies between threads and where we do not need to manipulate the amount of data processed by each block or how results are combined. Neither splitUp nor concat is a primitive of FCL, and more complicated tiling and interleaving can thus be implemented, as we will see in the following example. splitUp can be implemented using two applications of the previously introduced generate:

sig splitUp : forall 'a. int -> ['a] -> [['a]]
fun splitUp n arr =
  generate ((length arr) / n)
    (fn i => generate n (fn j => index arr ((i * n) + j)))

5.2.2 Transpose in Shared Memory

The simple revDistribute function can also be implemented in Obsidian. Now consider the problem of matrix transposition. In FCL we only have one-dimensional arrays, which means that a two-dimensional matrix must be represented as its flat representation together with information about the number of columns and rows. If we follow a naive approach, we can transpose a two-dimensional matrix using the following transpose function:

sig transpose : forall 'a. int -> int -> ['a] -> ['a]
fun transpose rows cols arr =
  generate (rows * cols)
    (fn n =>
       let i = n / rows in
       let j = n % rows
       in index arr (j * rows + i))

If this version of transpose were to be executed in parallel on the GPU, using the same type of partitioning and concatenation as in the reverse example, it would lead to uncoalesced reads. When a group of GPU threads collectively read or write a section of memory, the memory transactions can be coalesced if they all fall into the same segment of memory. In this case, when adding an iteration scheme to the final array, the final writes will always coalesce into a single transaction, but the indexing into the input array will not follow this pattern; the reads from the input array will thus not coalesce, and we will incur a huge performance penalty. A more efficient approach is to chunk up the matrix in smaller two-dimensional tiles, transpose each tile in shared memory, and finally stitch the tiles back together again (in transposed order). By using shared memory as intermediate storage, we make sure that both reads and writes to global memory are coalesced, as the threads can first collaborate on moving data to shared memory, and afterwards collaborate on copying data from shared memory to the output array. The transpose algorithm is presented in Figure 5.2. Notice the use of "do"-notation, for operating in the Program-monad. The transpose algorithm follows roughly the same structure as the reverse example. However, instead of splitting the linear input array into chunks (one following the other), we split and concatenate two-dimensional tiles with the functions splitGrid and concatGrid. The important thing to notice is that this reading/writing

Figure 5.1: Transpose in Shared memory. Figure by NVIDIA.

sig transposeBlock : forall 'a. int -> int ->
                     Program <block> ['a] -> Program <block> ['a]<block>
fun transposeBlock rows cols arr =
  do { arr' <- arr
     ; return (push <block> (transpose rows cols arr'))
     }

sig transposeTiled : forall 'a. int -> int -> int ->
                     ['a] -> Program <grid> ['a]<grid>
fun transposeTiled tileDim rows cols elems =
  splitGrid tileDim cols rows elems
    |> map (push <block>)
    |> map force
    |> map (transposeBlock tileDim tileDim)
    |> transpose (rows / tileDim) (cols / tileDim)
    |> concatGrid tileDim cols

Figure 5.2: Matrix transposition in FCL using shared memory to obtain memory coalescing.

order is encapsulated in splitGrid and concatGrid, and a library of such operations can be provided to users. Also, we apply the function force, which executes an iteration scheme, writing the array to shared memory; this converts the push array into a pull array, allowing the array to be indexed arbitrarily:

force : forall <level> 'a. ['a]<level> -> Program<level> ['a]

The result is a single kernel performing the transposition with all steps fused, performing just as well as the standard OpenCL implementation. For the sake of simplicity, the kernel in the form presented here works only for matrices whose dimensions divide evenly by tileDim. To use this method for other matrices, a reshape operation increasing the matrix size can be performed, and the surplus columns and rows can afterwards be removed using drop. We expect that these operations will be able to fuse, such that no additional reads/writes are necessary.
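For comparison, the kind of OpenCL/CUDA kernel this FCL program is intended to fuse into looks roughly as follows (a hand-written CUDA sketch under the same evenly-divisible assumption; this is not the FCL compiler's actual output):

#define TILE 16   // corresponds to tileDim; blockDim is assumed to be (TILE, TILE)

__global__ void transposeShared(const float *in, float *out, int rows, int cols) {
    __shared__ float tile[TILE][TILE + 1];        // +1 padding avoids bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;      // column in the input
    int y = blockIdx.y * TILE + threadIdx.y;      // row in the input
    tile[threadIdx.y][threadIdx.x] = in[y * cols + x];       // coalesced read

    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;     // column in the output
    int ty = blockIdx.x * TILE + threadIdx.y;     // row in the output
    out[ty * rows + tx] = tile[threadIdx.x][threadIdx.y];    // coalesced write
}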

5.2.3 Parallel Reduction

To implement a parallel reduction in FCL, we will perform a tree-reduction inside each work-group; this subcomputation is implemented by splitting the subarray in two, and performing an element-wise sum of the two halves. The implementation is similar to what has previously been shown in Obsidian [96]. The FCL prelude provides the following functions for splitting arrays in two parts and joining arrays element-wise.

sig zipWith : forall 'a 'b 'c. ('a -> 'b -> 'c) -> ['a] -> ['b] -> ['c]
fun zipWith f a1 a2 =
  generate (min (length a1) (length a2))
    (fn ix => f (index a1 ix) (index a2 ix))

sig halve : forall 'a. ['a] -> (['a], ['a])
fun halve arr =
  let half = (length arr) / 2
  in splitAt half arr

The tuple returned by halve is merely a syntactic construction; only the tuple elements will be present in the OpenCL kernel code. Using these we can now write a function for taking one reduction step:

sig step : forall <lvl> 'a. ('a -> 'a -> 'a) -> ['a] -> Program <lvl> ['a]<lvl>
fun step f arr =
  let x = halve arr
  in return (push <lvl> (zipWith f (fst x) (snd x)))

Notice that the function is polymorphic in the level-variable lvl. This strategy makes it possible to postpone the decision of whether step will run sequentially or at one of the parallel levels of the hierarchy. In Obsidian, we would have implemented this functionality as a recursive function on the meta-level. Recursion on the meta-level is possible in Obsidian, as the function is working on just a chunk of the array and we would statically know the chunk size. The meta-level recursion in Obsidian would generate an unrolled loop. In FCL we instead provide a built-in looping construct, while, which accepts a stop condition and a stepping function as arguments, as well as the initial array.

sig reduce : forall 'a. ('a -> 'a -> 'a) -> ['a] -> Program <block> ['a]<block>
fun reduce f arr =
  do { a <- while (fn a => 1 != length a)
                  (step f)
                  (step f arr)
     ; return (push <block> a)
     }

The above program will generate a while-loop in the OpenCL kernel, and automatically force values to shared memory between operations, as well as performing synchronization between threads in the work-group. In cases where the chunk size is known at compile time, we can use loop-unrolling techniques to achieve the same code as if we had used Obsidian. The while-construct assumes that arrays never need to grow during evaluation and thus reuses the same area of shared memory on each iteration. Also, while will always materialize the input array to shared memory before starting the iteration. To avoid doing a direct copy from global memory to shared memory in the reduction kernel, we take one initial step before starting the while-loop, an optimization called "first add during load" by Mark Harris [53]; a CUDA rendition of this strategy is sketched after the following definition. To perform the reduction using multiple work-groups, we need to split a larger array and concatenate the intermediate sums:

sig reducePart : forall 'a. ('a -> 'a -> 'a) -> ['a] -> Program <grid> ['a]<grid>
fun reducePart f arr =
  let chunkSize = 2 * #BlockSize
  in splitUp chunkSize arr
       |> map (reduce f)
       |> concat 1
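For reference, a hand-written CUDA kernel implementing the same strategy (a sketch assuming a power-of-two block size; not FCL compiler output) performs the first combining step while loading, then runs the tree reduction in shared memory:

__global__ void reducePartial(const float *in, float *out, int n) {
    extern __shared__ float sh[];                  // one float per thread
    unsigned t   = threadIdx.x;
    unsigned gid = blockIdx.x * blockDim.x * 2 + t;

    // "first add during load": combine two global elements per thread,
    // so the shared-memory tree starts at half the size
    sh[t] = (gid < n ? in[gid] : 0.0f) +
            (gid + blockDim.x < n ? in[gid + blockDim.x] : 0.0f);
    __syncthreads();

    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {   // tree reduction
        if (t < s)
            sh[t] += sh[t + s];
        __syncthreads();
    }
    if (t == 0)
        out[blockIdx.x] = sh[0];                   // one partial sum per work-group
}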

In reducePart, #BlockSize refers to either OpenCL's get_local_size(0) or a constant specified by the user as a configuration option at compilation time. Another difference from Obsidian also comes to light here; as we no longer distinguish between statically known values and dynamically known values, we are not able to infer that reduce f always returns a single scalar. We solve this by requiring an extra argument to concat, an expression computing the size of each chunk to concatenate. The size is necessary as a parameter, as memory allocations must be performed before executing a GPU kernel. In addition, static sizes are required by Obsidian to determine the sizes of shared memory allocations, and allow memory reuse when two arrays are not live simultaneously. In FCL, two uses of shared memory will never overlap, even though they are not live at the same time. An alternative allocation algorithm is left as future work. In many cases we should be able to determine these sizes, as long as block sizes are declared, not computed, and in these cases we should be able to improve the memory footprint. The above reducePart function will result in a kernel that reduces chunks of the array, each in its own work-group, to a single value. These values are concatenated, but to do the full reduction the reduce kernel has to be invoked again. We perform a while loop similar to the one in the original reduce function; here we instead iterate the grid-level function reducePart:

sig reduceFull : forall 'a. ('a -> 'a -> 'a) -> ['a] -> Program <grid> 'a
fun reduceFull f arr =
  do { a <- while (fn arr => 1 != length arr)
                  (reducePart f)
                  (reducePart f arr)
     ; return (index a 0)
     }

In the next section, we will introduce FCL more formally, before we present some larger examples in Section 5.4.

5.3 Formalisation

To better understand the limitations and performance of programs written in FCL, and to validate correctness, we will now turn to a more formal treatment of the language. We use i, d, and b to range over integers, doubles, and booleans, respectively. Let α range over an infinite set of type variables, let δ range over an infinite set of level variables, and let x range over program variables. Whenever z is some object, we write ~z to range over sequences of similar objects. When we want to be explicit about the size of a sequence ~z = z_0, ..., z_{n−1}, we often write it on the form ~z(n). The core syntax of FCL is defined as follows:

bv ::= i | d | b                          (scalars)
l  ::= δ | Z | 1 + l                      (levels)
e  ::= bv | x <~l> | (e1, e2) | ()        (expressions)
     | fn x => e | e1 e2
     | let x <~δ> = e1 in e2
     | if e1 then e2 else e3

The notation () is a unit value. The set of built-in operations is not visible from this syntax presentation, and will be presented in the next sections. It is through the built-in operations that the specifics of the language really become clear. Notice that a variable x is parameterised over a number of level parameters, ~δ, with the notation x <~δ>, where δ is a level variable. Each use of a variable requires specifying concrete values for all level arguments, for instance x <thread>. For variables that do not require any level arguments we will use the shorthand x for x <>. We often use the following shorthands for the first three levels:

thread = Z
block  = 1 + Z
grid   = 1 + (1 + Z)
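As a small worked instance of this level arithmetic (our own illustration): instantiating the 1+level in the type of concat from Section 5.2 with level = thread gives 1 + Z = block, and instantiating level = block gives 1 + (1 + Z) = grid, so each application of concat lifts a computation exactly one step up the hierarchy.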

Unlike Obsidian, we do not support warp-level computations, which reduces the complexity of the implementation. It is possible to reintroduce the warp level into the model, and doing so will likely be beneficial, performance-wise, for several applications.

5.3.1 Type System The syntax of FCL types, kinds and type-schemes is defined as follows:

bt ::= α | int | double | bool                     (base types)
τ  ::= α | bt | (τ1, τ2) | τ1 → τ2 | unit          (types)
     | [τ]                                         (pull arrays)
     | [bt]<l>                                     (push arrays)
     | Program<l> τ                                (program monad)
κ  ::= BT | GT | TYP                               (kinds)
σ  ::= <~δ> ∀ ~(α : κ). τ                          (type-schemes)

As previously mentioned, the type of pull arrays with elements of type τ is written [τ]. Push arrays can only contain values of base types (bt). The type of push arrays containing values of type bt is written [bt]<l>, where l denotes the level in the GPU hierarchy where it can be written if materialized. To encapsulate materialized arrays, the type system includes a hierarchy of monads Program<l> τ, with l denoting where on the GPU hierarchy the monadic actions are executed. To define the set of valid types (under assumptions for free variables), we define a relation ∆ ⊢ τ below, where ∆ are kind environments, mapping type variables to kinds:

∆ ::= ε | α : κ, ∆

The kind system divides types into three categories: base types (BT), ground types (GT), and general types (TYP). Base types are types of scalar values, which are the only types of values allowed in push arrays. Ground types are all types except function types, and are the types allowed in pull arrays. In practice, we often leave out kinds when they are obvious from the context.

Kind system   ∆ ⊢ τ : κ

∆ ⊢ int : BT (5.1)      ∆ ⊢ double : BT (5.2)      ∆ ⊢ bool : BT (5.3)

∆ ⊢ unit : GT (5.4)

  ∆(α) = κ
  ----------- (5.5)
  ∆ ⊢ α : κ

  ∆ ⊢ τ1 : κ    ∆ ⊢ τ2 : κ
  -------------------------- (5.6)
  ∆ ⊢ (τ1, τ2) : κ

  ∆ ⊢ τ : BT               ∆ ⊢ τ : GT
  ----------- (5.7)        ------------ (5.8)
  ∆ ⊢ τ : GT               ∆ ⊢ τ : TYP

  ∆ ⊢ τ : TYP    ∆ ⊢ τ′ : TYP
  ----------------------------- (5.9)
  ∆ ⊢ τ → τ′ : TYP

  ∆ ⊢ τ : GT               ∆ ⊢ τ : BT                ∆ ⊢ τ : GT
  ------------ (5.10)      --------------- (5.11)    ---------------------- (5.12)
  ∆ ⊢ [τ] : GT             ∆ ⊢ [τ]<l> : GT           ∆ ⊢ Program<l> τ : GT

A type environment Γ is a set of type assumptions of the form x : σ, mapping program variables to type-schemes:

Γ ::= ε | x : σ, Γ

The types of built-in array combinators are shown in Figure 5.4. In addition, a set of built-in scalar operations is defined in Figure 5.5. An initial typing environment Γ_init is formed by adding a binding op : TySc(op) for each op. The operation interleave is used to combine array operations at one level into programs at the level above. It is through interleave that operations such as concat and concatGrid, used in the previous examples, can be implemented. The details of interleave will be explained when the semantics is described in Section 5.3.3. The operations power, poweri, and while allow iteration of array programs, in the same style as the power operator in APL. To support sequential loops on thread level, the operation seqfor is provided. The function seqfor constructs an array through repeatedly executing a function that returns index-value pairs, corresponding to the assignment that should be done in that iteration. In Figure 5.3 we show how seqfor might be used to implement standard sequential algorithms such as folds and scans on arrays. A substitution S maps type variables to types and level variables to levels. The result of applying a substitution S to an object X, written S(X), is first to extend the substitution to be the identity outside its domain and then simultaneously substitute each type variable α in X with S(α) and each level variable l in X with S(l), after appropriately renaming bound variables in X. The support of a substitution S, written Supp S, is the set of elements v for which S(v) ≠ v.

sig foldl : forall 'a 'b. ('a -> 'b -> 'a) -> 'a -> ['b] -> ['a]
fun foldl f b array =
  seqfor 1 (1 + (length array))
    (fn read => fn i =>
       (0, if i == 0
           then b
           else f (read 0) (index array (i - 1))))

sig scanl : forall 'a 'b. ('a -> 'b -> 'a) -> 'a -> ['b] -> ['a]
fun scanl op b array =
  let n = length array
  in seqfor (1 + n) (1 + n)
       (fn read => fn i =>
          (i, if i == 0
              then b
              else op (read (i - 1)) (index array (i - 1))))

Figure 5.3: Sequential left fold and left scan, using seqfor.

A type scheme σ = <~δ> ∀~α.τ generalises a type τ′ via a substitution S, written "σ ≥ τ′ via S", iff Supp S ⊆ {~α, ~δ} and S(τ) = τ′. The type system allows inferences among sentences of the form ∆, Γ ⊢ e : τ, which are read: "under the assumptions ∆ and Γ, the expression e has type τ". The typing rules are shown in Figure 5.6. The typing rules are mostly standard; the only interesting rules are the rule for let-bindings and the rule for variable instantiation, which are extended to support level parameters.

op              TySc(op)
fst           : ∀α, β. (α, β) → α
snd           : ∀α, β. (α, β) → β
length        : ∀α. [α] → int
lengthPush    : ∀<δ> α. [α]<δ> → int
map           : ∀α, β. (α → β) → [α] → [β]
mapPush       : ∀<δ> α, β. (α → β) → [α]<δ> → [β]<δ>
generate      : ∀α. int → (int → α) → [α]
index         : ∀α. [α] → int → α
push          : ∀<δ> α. [α] → [α]<δ>
force         : ∀<δ> α. [α]<δ> → Program<δ> [α]
return        : ∀<δ> α. α → Program<δ> α
bind          : ∀<δ> α, β. Program<δ> α → (α → Program<δ> β) → Program<δ> β
interleave    : ∀<δ> α. int → ((int, int) → int)
                → [Program<δ> [α]<δ>] → Program<1+δ> [α]<1+δ>
interleaveSeq : ∀<δ> α. int → ((int, int) → int)
                → [Program<δ> [α]<δ>] → Program<δ> [α]<δ>
power         : ∀<δ> α. int → ([α] → Program<δ> [α]<δ>)
                → Program<δ> [α]<δ> → Program<δ> [α]
poweri        : ∀<δ> α. int → (int → [α] → Program<δ> [α]<δ>)
                → Program<δ> [α]<δ> → Program<δ> [α]
while         : ∀<δ> α. ([α] → bool) → ([α] → Program<δ> [α]<δ>)
                → Program<δ> [α]<δ> → Program<δ> [α]
seqfor        : ∀α. [α] → int → ((int → α) → int → (int, α)) → [α]

Figure 5.4: Built-in operators with type schemes.

+ : int → int → int
- : int → int → int
* : int → int → int
/ : int → int → int
% : int → int → int
...

Figure 5.5: Built-in scalar operators.

Expression typing   ∆, Γ ⊢ e : τ

∆, Γ ⊢ i : int (5.13)    ∆, Γ ⊢ d : double (5.14)    ∆, Γ ⊢ b : bool (5.15)

∆, Γ ⊢ () : unit (5.16)

  ∆, Γ ⊢ e1 : τ1    ∆, Γ ⊢ e2 : τ2
  ---------------------------------- (5.17)
  ∆, Γ ⊢ (e1, e2) : (τ1, τ2)

  ∆, (Γ, x : τ′) ⊢ e : τ
  ----------------------------- (5.18)
  ∆, Γ ⊢ fn x => e : τ′ → τ

  ∆, Γ ⊢ e1 : τ′ → τ    ∆, Γ ⊢ e2 : τ′
  -------------------------------------- (5.19)
  ∆, Γ ⊢ e1 e2 : τ

  ∆, Γ ⊢ e1 : bool    ∆, Γ ⊢ e2 : τ    ∆, Γ ⊢ e3 : τ    ∆ ⊢ τ : BT
  ------------------------------------------------------------------- (5.20)
  ∆, Γ ⊢ if e1 then e2 else e3 : τ

  Γ(x) ≥ τ via S    S(~δ(n)) = ~l(n)    ∆ ⊢ τ : TYP
  ---------------------------------------------------- (5.21)
  ∆, Γ ⊢ x <~l(n)> : τ

  (∆, ~α(k) : ~κ), Γ ⊢ e1 : τ′    ftv(Γ) ∩ {~α, ~δ} = ∅
  σ = <~δ(n)> ∀~α(k). τ′    ∆, (Γ, x : σ) ⊢ e2 : τ
  ------------------------------------------------------- (5.22)
  ∆, Γ ⊢ let x <~δ(n)> = e1 in e2 : τ

Figure 5.6: Static semantics for FCL

5.3.2 Monomorphization

Before compilation we perform monomorphization, following the rules in Figure 5.7. The only thing different from traditional monomorphization is the use of level variables, which are explicitly given by the user at variable usage. A monomorphization environment E maps program variables to objects of the form (σ, e):

E ::= ε | x ↦ (σ, e), E

For each predefined operator op with type scheme σ = Γ_init(op), we add a binding op ↦ (σ, op) to create the initial monomorphization environment E_init. Following these rules, all let-defined variables are inlined. However, our implementation maintains these let-bindings, creating a new binding for each different usage of the variable, and avoids introducing more copies than there are type instances.

5.3.3 Dynamic Semantics We now present the semantics of the language, which will aid understand how FCL terms can be compiled and, in particular, how level-types guide the compilation. The evaluation relation, we define below, is annotated with a location. Locations emulate the hierarchical tree structure of a parallel machine, and are CHAPTER 5. FCL: HIERARCHICAL DATA-PARALLEL GPU PROGRAMMING 92

Monomorphization   E ⊢ e ⇒ e′

E ⊢ i ⇒ i (5.23)      E ⊢ d ⇒ d (5.24)      E ⊢ b ⇒ b (5.25)

  E ⊢ e1 ⇒ e1′    E ⊢ e2 ⇒ e2′
  ------------------------------- (5.26)
  E ⊢ (e1, e2) ⇒ (e1′, e2′)

  E, x ↦ (τ, x) ⊢ e ⇒ e′
  --------------------------------- (5.27)
  E ⊢ fn x : τ. e ⇒ fn x : τ. e′

  E ⊢ e1 ⇒ e1′    E ⊢ e2 ⇒ e2′
  ------------------------------- (5.28)
  E ⊢ e1 e2 ⇒ e1′ e2′

  E ⊢ e1 ⇒ e1′    E ⊢ e2 ⇒ e2′    E ⊢ e3 ⇒ e3′
  -------------------------------------------------------- (5.29)
  E ⊢ if e1 then e2 else e3 ⇒ if e1′ then e2′ else e3′

  E(x) = (σ, e)    σ = <~δ(n)> ∀~α(k). τ
  σ ≥ τ via S    S(~δ(n)) = ~l(n)
  ----------------------------------------- (5.30)
  E ⊢ x <~l(n)> ⇒ S(e)

  E ⊢ e1 ⇒ e1′    fv(E) ∩ {~α, ~δ} = ∅
  E, x ↦ (σ, e1′) ⊢ e2 ⇒ e2′
  ---------------------------------------------- (5.31)
  E ⊢ let x : σ <~δ(n)> = e1 in e2 ⇒ e2′

Figure 5.7: Monomorphization of FCL expressions.

of the form:

loc ::= Thread(thread_id)        where thread_id ∈ ℕ
      | Group(~loc(n))

Locations relate to levels, and we introduce similar shorthands for blocks and grids.

Block(~loc(n)) = Group(loc_0, . . . , loc_{n−1})

    where loc_i = Thread(thread_id_i), for some thread_id_i

Grid(~loc(n)) = Group(Block(loc_0), . . . , Block(loc_{n−1}))

We also introduce a relation, loc ▷ l, which defines whether the location loc respects the level l:

loc ▷ l

Thread(thread_id) ▷ Z (5.32)

  loc_i ▷ l for all i
  ---------------------- (5.33)
  Group(~loc) ▷ 1 + l

To allow fusion, certain rules of the semantics are allowed to fire during compile time, such as the rules for creating and indexing into pull arrays. To distinguish between these two sets of rules, we annotate the judgment with a location and a cost of 0 or 1, depending on whether the rule may fire at compile time. In the semantics we allow all non-array-accessing operations at arbitrary

loc ⊩ τ

loc ⊩ bt (5.34)      loc ⊩ unit (5.35)

  loc ⊩ τ1    loc ⊩ τ2
  ---------------------- (5.36)
  loc ⊩ (τ1, τ2)

  loc ⊩ τ1    loc ⊩ τ2
  ---------------------- (5.37)
  loc ⊩ τ1 → τ2

  loc ⊩ τ
  ------------ (5.38)
  loc ⊩ [τ]

  loc ▷ l
  --------------- (5.39)
  loc ⊩ [bt]<l>

  loc ▷ l    loc ⊩ τ
  --------------------- (5.40)
  loc ⊩ Program<l> τ

Figure 5.8: Relation loc ⊩ τ

levels, which, in particular, is used to model that such operations may occur at the host level, when calculating array sizes. We define the relation loc ⊩ τ in Figure 5.8 to denote types τ which can be evaluated at loc. Values in FCL are either base values (bv), lambda abstractions, pull arrays, push arrays, delayed interleavings of push arrays, delayed sequential loops, or a value in a Program context.

v ::= bv                                        (base values)
    | fn x => e
    | [e_0, . . . , e_{n−1}]                    (pull array)
    | Push <δ> v                                (push array)
    | Interleave v_n v_f v                      (delayed interleave)
    | InterleaveSeq v_n v_f v                   (delayed sequential interleave)
    | SeqFor [e_0, . . . , e_{n−1}] n i f g     (sequential for)
    | Program v

While pull arrays are traditionally presented as function values (see Chapter 2), we use a representation of arrays of expressions, to be able to use them as a vehicle for modelling various evaluation orders in the small-step semantics. An array value of type Program<l> [bt]<l> is essentially modelled as a tree structure, with "pushed" pull arrays at the leaves (Push [~e]), interleavings forming the body of the tree, and a Program v at the root. The SeqFor value models a sequential loop constructing an array. The user provides an initial array, which is materialized before repeatedly executing f to update the array. In this way in-place updates are allowed. In practice, we also provide an unsafe version of seqfor, where the initial array is not initialized, and it is the user's responsibility to make sure all values are written and not to index into locations that do not yet have a value.

Value typing   ∆, Γ ⊢ v : τ

  ∆, (Γ, x : τ′) ⊢ e : τ
  --------------------------- (5.41)
  ∆, Γ ⊢ λx.e : τ′ → τ

  ∆, Γ ⊢ e_i : τ    ∆ ⊢ τ : GT
  -------------------------------- (5.42)
  ∆, Γ ⊢ [e_1, . . . , e_n] : [τ]

  ∆, Γ ⊢ v : [τ]    ∆ ⊢ τ : BT
  -------------------------------- (5.43)
  ∆, Γ ⊢ Push <l> v : [τ]<l>

  ∆, Γ ⊢ v : τ
  --------------------------------------- (5.44)
  ∆, Γ ⊢ Program v : Program<l> τ

  ∆, Γ ⊢ v_n : int    ∆, Γ ⊢ v_f : int → int → int
  ∆, Γ ⊢ v : [Program<l> [τ]<l>]
  --------------------------------------------------- (5.45)
  ∆, Γ ⊢ Interleave v_n v_f v : [τ]<1 + l>

  ∆, Γ ⊢ v_n : int    ∆, Γ ⊢ v_f : int → int → int
  ∆, Γ ⊢ v : [Program<1 + l> [τ]<1 + l>]
  --------------------------------------------------- (5.46)
  ∆, Γ ⊢ InterleaveSeq v_n v_f v : [τ]<1 + l>

  ∆, Γ ⊢ v_i : τ    ∆, Γ ⊢ n : int    ∆, Γ ⊢ i : int
  ∆, Γ ⊢ g : (τ → τ′)    ∆, Γ ⊢ f : (int → τ) → int → (int, τ)
  --------------------------------------------------------------- (5.47)
  ∆, Γ ⊢ SeqFor [~v(n)] n i f g : [τ′]<l>

Figure 5.9: Typing of values

To support map operations on an array generated with seqfor, an extra function (g) is added, which initially is just an identity-function. We extend the typing relation to include typing of values in Figure 5.9, and extend our notion of expressions (e) to include values (v):

e ::= . . . | v

We further add two functions, ifP and permute, to the language, which are necessary for formalising the language as a small-step semantics; these constructs, however, are not part of the implemented language. Their types are given below. The ifP function allows conditionals on programs, which is a necessary construct for unrolling a single step of the iteration constructs. The permute function performs a scatter operation similar to the one described in Chapter 2 and which is also found in NESL and Accelerate. We introduce permute mainly to simplify our formalisation of interleave below; in practice, these two operations are implemented as a single step, which avoids the overhead of storing the permutation vector to memory.

op        TySc(op)
ifP     : ∀<δ> α. bool → Program<δ> α → Program<δ> α → Program<δ> α
permute : ∀<δ> α. [int] → [α]<δ> → Program<δ> [α]

Evaluation contexts

E ::= [·] | E e | v E
    | if E then e1 else e2
    | (E, e) | (v, E) | fst E | snd E

Figure 5.10: Evaluation contexts for FCL

Evaluation contexts, ranged over by E, are defined in Figure 5.10. When E is an evaluation context and e is an expression, we write E[e] to denote the expression resulting from filling the hole in E with e. The dynamic semantics is defined as a small-step reduction semantics, with the judgment form e →^c_loc e′/err, where c ranges over the costs 0 and 1. Intuitively, a large number of the rules for the operational semantics, namely those that are annotated with a cost of 0, denote "administrative reductions", which are applied at compile time and therefore are not directly associated with a runtime cost. We shall later present a property stating that a well-typed expression e is either a value, or it evaluates in one step, with cost 0 or 1, to another expression e′ or the special token err, signaling an error. Such errors cover index-out-of-bounds errors when reading from and writing into arrays. Other errors, not considering non-termination, are ruled out by the static type system. We shall not formally demonstrate a phase-distinction property for the language, specifying that the 0-cost rules can be implemented through a compilation into a lower-level imperative language suitable for execution on parallel architectures. However, the implementation of FCL that targets OpenCL, for which benchmarks are given in Section 5.5, provides evidence for such a phase distinction. The evaluation rules are separated into a number of groups and appear in Figures 5.11–5.14. The first group of rules appears in Figure 5.11. These rules include contextual rules and rules that implement compile-time function application, compile-time elimination of tuples, and the introduction of pull arrays and associated operations. Only the operation of indexing into a pull array is considered to have a cost different from 0. The reason is that the operation is considered to perform bounds-checking, which in the worst case will happen at runtime.

Small-step semantics   e →^c_loc e′ / err

  e →^c_loc e′    E ≠ [·]
  ------------------------- (5.48)
  E[e] →^c_loc E[e′]

  e →^c_loc err    E ≠ [·]
  -------------------------- (5.49)
  E[e] →^c_loc err

(fn x => e1) v2 →^0_loc e1[v2/x] (5.50)

fst (v1, v2) →^0_loc v1 (5.51)      snd (v1, v2) →^0_loc v2 (5.52)

map f [e_0, e_1, . . . , e_{n−1}] →^0_loc [f e_0, f e_1, . . . , f e_{n−1}] (5.53)

interleave v1 v2 v3 →^0_loc Interleave v1 v2 v3 (5.54)

interleaveSeq v1 v2 v3 →^0_loc InterleaveSeq v1 v2 v3 (5.55)

seqfor (Push [~e(n)]) m f →^0_loc SeqFor [~e(n)] m 0 f (fn x => x) (5.56)

push <l> v →^0_loc Push <l> v (5.57)

generate n f →^0_loc [f 0, . . . , f (n−1)] (5.58)

length [e_0, . . . , e_{n−1}] →^0_loc n (5.59)

  i ∈ [0, n)
  ------------------------------------------- (5.60)
  index [e_0, . . . , e_{n−1}] i →^1_loc e_i

  i ∉ [0, n)
  ------------------------------------------- (5.61)
  index [e_0, . . . , e_{n−1}] i →^1_loc err

Figure 5.11: Small-step semantics

The second group of rules is presented in Figure 5.12, and includes rules for sequencing monadic operations, rules for determining the length of a push array, and rules for map-map fusion on push arrays. Each of these rules contributes with 0 runtime cost; in particular, map-map fusion is guaranteed to be performed at compile time.

Small-step semantics (continued)   e →^0_loc e′ / err

  loc ▷ l
  ------------------------------------- (5.62)
  return v →^0_loc Program v

  loc ▷ l
  ------------------------------------- (5.63)
  bind (Program v) f →^0_loc f v

lengthPush (Push [e_0, . . . , e_{n−1}]) →^0_loc n (5.64)

lengthPush (Interleave m f [p_0, . . . , p_{n−1}]) →^0_loc muli m n (5.65)

lengthPush (InterleaveSeq m f [p_0, . . . , p_{n−1}]) →^0_loc muli m n (5.66)

lengthPush (SeqFor [~v(n)] m i f g) →^0_loc n (5.67)

mapPush g (Push v) →^0_loc Push (map g v) (5.68)

  ∀ i ∈ [0; n):  p_i = Program [e_0, . . . , e_{m−1}]
                 p_i′ = Program [g e_0, . . . , g e_{m−1}]
  --------------------------------------------------------------------- (5.69)
  mapPush g (Interleave m f [p_0, . . . , p_{n−1}])
      →^0_loc Interleave m f [p_0′, . . . , p_{n−1}′]

  ∀ i ∈ [0; n):  p_i = Program [e_0, . . . , e_{m−1}]
                 p_i′ = Program [g e_0, . . . , g e_{m−1}]
  --------------------------------------------------------------------- (5.70)
  mapPush g (InterleaveSeq m f [p_0, . . . , p_{n−1}])
      →^0_loc InterleaveSeq m f [p_0′, . . . , p_{n−1}′]

mapPush h (SeqFor [~e(n)] m i f g) →^0_loc SeqFor [~e(n)] m i f (h ∘ g) (5.71)

Figure 5.12: Small-step semantics (continued)

Figure 5.13 covers the third group of rules, detailing the evaluation of sequential computations in a single GPU thread t. Rules (5.72)–(5.75) perform the actual thread-level computations, including scalar operations and conditionals. Rules (5.76)–(5.80) specify how force expressions are evaluated when invoked on sequential computations (thread level). The first two rules specify how an array is computed sequentially by a single thread with thread ID t; the cost of a step is not fixed, as a step may have cost 0 or 1. The rules for seqfor are described using monadic expressions. Rule (5.78) computes the input array through invoking the just-explained thread-level force. Rule (5.79) computes a single iteration of the loop, through invoking the f function and writing the value u to index p in the array. Rule (5.80) is invoked after the final iteration and applies the function g to all elements.

Small-step semantics (continued)   e →^c_Thread(t) e′ / err

addi v1 v2 →^1_Thread(t) v1 + v2 (5.72)      muli v1 v2 →^1_Thread(t) v1 × v2 (5.73)      . . .

if true then e1 else e2 →^1_Thread(t) e1 (5.74)

if false then e1 else e2 →^1_Thread(t) e2 (5.75)

  e_i →^c_Thread(t) e_i′
  ---------------------------------------------------------------------- (5.76)
  force (Push [bv_0, . . . , bv_{i−1}, e_i, . . . , e_{n−1}])
      →^c_Thread(t) force (Push [bv_0, . . . , bv_{i−1}, e_i′, . . . , e_{n−1}])

  e_i →^c_Thread(t) err
  ---------------------------------------------------------------------- (5.77)
  force (Push [bv_0, . . . , bv_{i−1}, e_i, . . . , e_{n−1}]) →^c_Thread(t) err

force (SeqFor [~e(n)] m i f g)
    →^1_Thread(t) bind (force (Push [~e(n)])) (fn v => force (SeqFor v m i f g)) (5.78)

  i < m
  ---------------------------------------------------------------------- (5.79)
  force (SeqFor [~v(n)] m i f g)
      →^1_Thread(t) bind (f (fn k => v_k) i)
          (fn (p, u) => force (SeqFor [v_0, . . . , v_{p−1}, u, v_{p+1}, . . . , v_{n−1}]
                                      m (i + 1) f g))

  i = m
  ---------------------------------------------------------------------- (5.80)
  force (SeqFor [~v(n)] m i f g) →^1_Thread(t) force (Push [g v_0, . . . , g v_{n−1}])

Figure 5.13: Small-step semantics (continued)

The rules in Figure 5.14 specify how parallel expressions in the Program-monad are evaluated (above the thread level). Through the loc ▷ l relation, it is specified how a computation is mapped to the hierarchy of the GPU. It should be noted that the Program-monad is a means to restrict how values flow through the program, and that a top-level program must be of type Program <grid> τ to be evaluated. Rule (5.81) describes how expressions such as force (push e) are evaluated, where a pull array e is converted to a parallel expression on the given level l. The input array e is partitioned into n subarrays [~e_0] through [~e_{n−1}]; the rule does not, however, specify exactly how the array is partitioned. Through the use of force at the level below, each subarray is allowed to take a step of cost 0 or 1. We can understand this as follows: at the outermost level, a single step corresponds to a single step being executed on all locations at lower levels.

Small-step semantics with locations   e →^c_loc e′ / err

  ∀ j ∈ J,  force (Push <l> [~e_j]) →^{c_j}_{loc_j} Program [~e_j′]    loc_j ▷ l
  J ⊆ {0, . . . , n − 1}    |J| ≤ m    c = max_{j∈J}(c_j)
  ~e_i″ = ~e_i   if i ∉ J
  ~e_i″ = ~e_i′  if i ∈ J
  -------------------------------------------------------------------------- (5.81)
  force (Push <1 + l> [~e_0, . . . , ~e_{n−1}])
      →^c_{Group(~loc(m))} force (Push <1 + l> [~e_0″, . . . , ~e_{n−1}″])

  loc ▷ l
  ------------------------------------------------------------------------- (5.82)
  force (Push <l> [bv_0, . . . , bv_{n−1}]) →^0_loc Program [bv_0, . . . , bv_{n−1}]

  ∀ j ∈ J,  p_j →^{c_j}_{loc_j} p_j′    loc_j ▷ l
  J ⊆ {0, . . . , n − 1}    |J| ≤ m    c = max_{j∈J}(c_j)
  p_i″ = p_i   if i ∉ J
  p_i″ = p_i′  if i ∈ J
  ------------------------------------------------------------------ (5.83)
  force (Interleave k f [p_0, . . . , p_{n−1}])
      →^c_{Group(~loc(m))} force (Interleave k f [p_0″, . . . , p_{n−1}″])

  loc_j ▷ l    p_i = Program (Push <l> [bv_{i,0}, . . . , bv_{i,k−1}])
  ------------------------------------------------------------------ (5.84)
  force (Interleave k f [p_0, . . . , p_{n−1}])
      →^1_{Group(~loc(m))}
        bind (force (Push <l> [f(0, 0), f(0, 1), . . . ,
                               f(1, 0), f(1, 1), . . . ,
                               . . . ,
                               f(n − 1, 0), . . . , f(n − 1, k − 1)]))
             (fn v => permute <1 + l> v (Push <1 + l> [~bv_0, ~bv_1, . . . , ~bv_{n−1}]))

  ∃ k ∉ [0; n)
  ---------------------------------------------- (5.85)
  permute [~k(n)] (Push [~bv(n)]) →^1_loc err

  ∀ k ∈ [0; n)
  --------------------------------------------------------------------------- (5.86)
  permute [~k(n)] (Push [~bv(n)]) →^1_loc Program [bv_{k_0}, bv_{k_1}, . . . , bv_{k_{n−1}}]

Figure 5.14: Small-step semantics (continued). Rules for force.

Rule (5.82) corresponds to the final synchronisation after all subarrays have been fully evaluated to base values. Enclosing values in the Program-monad allows us to limit at which points values can flow upwards in the hierarchy of computation. From the types, we know that escaping up to the level above is possible only through the interleave operation. The interleave operation is thus central to the language, as it specifies how values are passed up through the hierarchy, as well as how these values are written in the combined array. Recall the type of interleave:

interleave : ∀<δ> α. int
           → ((int, int) → int)
           → [Program<δ> [α]<δ>]
           → Program<1+δ> [α]<1+δ>

The third argument is a pull array of Program arrays that need to be combined; each Program specifies a push-array computation executed on a lower-level processor (a streaming multiprocessor or a single compute unit). The first argument specifies the size of the subarrays of the third argument, which allows us to precompute the size of the combined array during allocation. Where Obsidian only allows concatenation of array results, interleave allows specification of a function that determines where in the final array the individual values should be stored. For each computed value, the function is invoked with the corresponding index in the outer pull array and the inner push array; the returned index determines where the value should be placed in the final array. Concatenating the subarrays in row-major order can thus be performed as follows:

sig concat : forall <l> 'a. int -> [Program <l> ['a]<l>] -> Program <1+l> ['a]<1+l>
fun concat n arr =
  interleave n (fn sh => ((fst sh) * n) + (snd sh)) arr

The interleave operation is not intended for direct usage by FCL users, but can be used to write operations such as concat and concatGrid, which can be provided in a standard library for users. In the first rule for interleave, rule (5.83), it is specified that at most m subprograms in the given pull array can take a cost-1 step, where m is the number of locations in the group. The set J determines which of the subprograms in the array [~p(n)] take a step; all subprograms p_j where j ∉ J are left unchanged. The second rule for interleave, rule (5.84), fires when all subprograms are fully evaluated to vectors of base values, and constructs a permutation vector from the permutation function f, specifying the order in which the computed values should be written to memory. In this formalisation, the actual

power 0 f v →^1_loc v (5.87)

  n > 0
  ------------------------------------------------------------------------- (5.88)
  power n f (Program v) →^1_loc bind (force v) (fn v′ => power (n − 1) f (f v′))

while f g (Program v) →^1_loc
    bind (force v) (fn v′ =>
      bind (f v′) (fn b =>
        ifP b (while f g (g v′)) (return v′))) (5.89)

ifP true x1 x2 →^1_loc x1 (5.90)      ifP false x1 x2 →^1_loc x2 (5.91)

Figure 5.15: Small-step semantics (continued). Rules for power and while.

permutation is performed by the previously mentioned permute function, which we introduced to simplify the presentation of interleave. We have left out the rules for interleaveSeq. These rules are similar to the rules for interleave, however without moving data up the hierarchy; that is, the <1 + l> level argument to permute and Push should be changed to <l>. It should be noted that these rules for interleave do not go into detail with how values flow up and down through the memory system. In fact, following this formalisation, subprograms might be reassigned to various locations of the machine on every step. In practice, however, our implementation only passes values back up through the memory hierarchy when a subprogram is completely evaluated, not on every step. There is an implicit synchronization after all interleave operations. While the outset of the project aimed at providing a memory cost model, the developed semantics only covers the use of computing resources; the effects of the GPU memory system on performance, such as memory coalescing, are not covered. Rules (5.87)–(5.91) in Figure 5.15 evaluate the loop constructs power and while. The operations while and power are similar to the power operation in APL, modified to work on push and pull arrays. The initial value is given as a push array, and before the loop begins, this array is forced. The resulting pull array will be given to the iteration function, which again returns a push array to be consumed in the next iteration. Note the use of ifP in rule (5.89); this is the only place it is used. In practice, the power and while constructs reuse the same memory area for each force operation. This scheme limits these iteration constructs to non-enlarging functions, as the result after an iteration step must never be larger than the array given as initial value.

5.3.4 Properties of the language

In the following we present propositions stating type soundness of FCL, however only when excluding the while and seqfor constructs. We use the notation e →→^c_loc e′ for the transitive and reflexive closure of →^c_loc, as defined below. In such judgments, c ranges over the natural numbers ℕ₀.

e →→^c_loc e′ / err

e →→^0_loc e (5.92)

  e →^{c1}_loc e′    e′ →→^{c2}_loc e″    c = c1 + c2
  ------------------------------------------------------ (5.93)
  e →→^c_loc e″

Proposition 9 (Type Preservation) If ∆, Γ ⊢ e : τ, then either

(1) e is a value; or
(2) there exists c such that e →^c_loc err and loc ⊩ τ; or
(3) there exists c such that e →^c_loc e′ and loc ⊩ τ and ∆, Γ ⊢ e′ : τ.

Proposition 10 (Progress) Given a location loc ⊩ τ, if ∆, Γ ⊢ e : τ, then either

(1) e is a value; or
(2) there exists c such that e →^c_loc err; or
(3) there exist e′ and c such that e →^c_loc e′.

The following soundness proposition states that, given a machine specification (location) loc and an expression with a type respecting that location, an execution of said expression will either error, fail to terminate, or compute a value of type τ.

Proposition 11 (Type Soundness) Given a location loc ⊩ τ, if Γ_init ⊢ e : τ then either

(1) there exists c such that e →→^c_loc err; or
(2) there exists an infinite sequence e →^c_loc e′ →^{c′}_loc e″ →^{c″}_loc e‴ . . .; or
(3) there exist c and a value v such that e →→^c_loc v and ⊢ v : τ.

In practice, top-level expressions always have the type Program <grid> int; we state this special case of the soundness proposition in the following corollary.

Corollary 1 Given a location loc ▷ grid, if Γ_init ⊢ e : Program <grid> int, then either

(1) there exists c such that e →→^c_loc err; or
(2) there exists an infinite sequence e →^c_loc e′ →^{c′}_loc e″ →^{c″}_loc e‴ . . .; or
(3) there exist c and a value v such that e →→^c_loc Program v and ⊢ v : int.

5.4 Larger examples

We have now introduced FCL in its entirety, and we will now demonstrate its capabilities more fully through some larger example programs. First, let us return to the transpose example presented in Section 5.2.

5.4.1 Transpose revisited: splitGrid and concatGrid

The transpose operation presented in Figure 5.2 used the operations splitGrid and concatGrid to partition an array into two-dimensional tiles. The splitGrid operation can be implemented as two nested calls to generate, and boils down to a number of index calculations.

sig splitGrid : forall 'a. int -> int -> int -> ['a] -> [['a]]
fun splitGrid splitSize width height elems =
  let tileSize = splitSize * splitSize in
  let groupsWidth = width / splitSize in
  let groupsHeight = height / splitSize
  in generate (groupsWidth * groupsHeight)
       (fn gid =>
          let groupIDy = gid / groupsWidth in
          let groupIDx = gid % groupsWidth
          in generate tileSize
               (fn tid =>
                  let localIDx = tid % splitSize in
                  let localIDy = tid / splitSize in
                  let xIndex = groupIDx * splitSize + localIDx in
                  let yIndex = groupIDy * splitSize + localIDy in
                  let ix = yIndex * width + xIndex
                  in index elems ix))

The concatGrid operation is written using the interleave operation, again requiring a number of index calculations:

sig concatGrid : forall <l> 'a. int -> int ->
                 [Program <l> ['a]<l>] -> Program <1+l> ['a]<1+l>
fun concatGrid splitSize width arr =
  let tileSize = splitSize * splitSize in
  let groupsWidth = width / splitSize
  in interleave tileSize
       (fn sh =>
          let gid = fst sh in
          let tid = snd sh in
          let groupIDx = gid % groupsWidth in
          let groupIDy = gid / groupsWidth in
          let localIDx = tid % splitSize in
          let localIDy = tid / splitSize in
          let xIndex = groupIDx * splitSize + localIDx in
          let yIndex = groupIDy * splitSize + localIDy in
          let ix = yIndex * width + xIndex
          in ix)
       arr

It is the responsibility of the author of functions such as concatGrid to guarantee memory coalescing. The interleave n f operation allows an arbitrary permutation to be performed before the array is written, using the function f. The programmer should make sure that the invocations f(i, j), f(i, j + 1), f(i, j + 2), ... return values in the same neighbourhood (corresponding to writes into the same memory segment). The author of functions such as splitGrid can, however, not provide any such guarantees, as he is merely constructing a pull array: the reading order of the actual array will be determined by how the array ends up being read. That such guarantees cannot be made is a major drawback of pull arrays, and it often makes it necessary to reason backwards if you want to understand whether a given program accesses an array coalesced. It should be noted that it is not the intention that users need to be aware of the details of these lengthy implementations; they can be provided in a standard library.
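As a small worked instance of this requirement (our own illustration): the index function used by concat in Section 5.3.3 is f(i, j) = i · n + j, so f(i, j + 1) − f(i, j) = 1. Consecutive inner indices thus map to consecutive output addresses, and each work-group writes one contiguous segment, which is precisely the coalescing-friendly behaviour demanded above.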

5.4.2 Tiled matrix multiplication

Another operation that can make use of splitGrid and concatGrid is matrix multiplication. Through tiling, we can limit the number of global-memory accesses necessary, by moving complete tiles of the matrices into local memory before computing. Such an optimization is possible for matrix multiplication, as the problem can be subdivided into a large number of smaller matrix multiplications.


Figure 5.16: Tiled matrix multiplication. The matrix product AB = C can be computed by subdivision into smaller matrices. Computing the greyed-out tile of matrix C corresponds to multiplying the greyed-out rectangles of matrix A and matrix B.

When computing the matrix product A × B = C, the two matrices A and B can be subdivided into horizontal and vertical rectangles, respectively, as shown in Figure 5.16. Each rectangle consists of a number of smaller square matrices. Computation of the sub-matrix C_{i,j} of matrix C can then be implemented as a sum of matrix products:

    C_{i,j} = Σ_k A_{i,k} B_{k,j}
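As a concrete instance of this sum (our own illustration): splitting each matrix into a 2 × 2 grid of tiles gives

    C_{0,0} = A_{0,0} B_{0,0} + A_{0,1} B_{1,0}

where each term is itself a small square matrix product that fits in shared memory.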

In FCL, the above algorithm can be implemented as follows. First, we will need a program for computing a simple matrix multiplication in a single work-group. The standard matrix multiplication algorithm can be implemented by performing m × n dot products. We therefore need a sequential dot-product operation, which we can implement using foldl:

sig dotp : forall 'a. ('a -> 'a -> 'a) -> ('a -> 'a -> 'a) -> 'a -> ['a] -> ['a] -> ['a]
fun dotp f g neutral vec1 vec2 =
  foldl f neutral (zipWith g vec1 vec2)

The block-wide matrix multiplication can then be implemented by concatenating m × n such dot products:

-- Block-wide matrix multiplication:
-- multiply an m*p matrix by a p*n matrix
sig matmulBlock : forall 'a. ('a -> 'a -> 'a) -> ('a -> 'a -> 'a) -> 'a
               -> int -> int -> int -> ['a] -> ['a] -> Program <block> ['a]<block>
fun matmulBlock f g neutral m p n A B =
  generate (m * n)
    (fn i =>
       let col = i % n in
       let row = i / n in
       let rowVec = generate p (fn j => index A (row * p + j)) in
       let colVec = generate p (fn j => index B (j * p + col))
       in return (dotp f g neutral rowVec colVec))
  |> concat 1

To be able to implement the full matrix multiplication, we will also need a variant of zipWith that zips a pull array and a push array into a push array. Such a function can be implemented as follows, using a variant of mapPush that passes index information to the mapping function:

    sig zipWithPush : forall 'a 'b 'c. ('a -> 'b -> 'c) -> ['a] -> ['b] -> ['c]
    fun zipWithPush f a1 a2 =
      mapPushIx (fn i => fn x => f (index a1 i) x) a2
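Since zipWithPush is the first place where pull and push arrays meet, it may help to see both representations modelled concretely. The following Python sketch is a simplification of the delayed-array idea, not FCL's actual implementation: a pull array is a length paired with an index function, a push array is a producer handed a writer callback, and zip_with_push wraps the writer so that each pushed element is combined with the pull array's element at the same index:

    # A minimal model of pull and push arrays.
    def pull(xs):
        return (len(xs), lambda i: xs[i])      # (length, index function)

    def push(pull_arr):
        n, ix = pull_arr
        def producer(writer):
            for i in range(n):                 # the loop a push array owns
                writer(i, ix(i))
        return producer

    def zip_with_push(f, pull_arr, push_arr):
        _, ix = pull_arr
        return lambda writer: push_arr(
            lambda i, x: writer(i, f(ix(i), x)))

    def force(push_arr, n):                    # materialize to memory
        out = [None] * n
        push_arr(lambda i, x: out.__setitem__(i, x))
        return out

    a = pull([1, 2, 3])
    b = push(pull([10, 20, 30]))
    print(force(zip_with_push(lambda x, y: x + y, a, b), 3))  # [11, 22, 33]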

With these functions we can now multiply two rectangular arrays that have already been tiled. We first create an array containing the neutral element in all positions. We then use the power operation to iterate over each tile of the two rectangles, initially forcing each tile to shared memory. Finally, we compute the matrix multiplication using matmulBlock, adding the result to the accumulated array.

    sig matmulRect : forall 'a. ('a -> 'a -> 'a) -> ('a -> 'a -> 'a)
                  -> 'a -> [['a]] -> [['a]] -> int -> int -> Program ['a]
    fun matmulRect f g neutral rectA rectB s q =
      let init = generate (s*s) (fn y => neutral) in
      let body =
        (fn i => fn accumulator =>
           do { let subA = index rectA i
              ; let subB = index rectB i
              ; subA' <- force (push subA)
              ; subB' <- force (push subB)
              ; res <- matmulBlock f g neutral s s s subA' subB'
              ; return (zipWithPush f accumulator res)
              })
      in power q body (return (push init))

The final step is to partition the original matrices A and B using splitGrid, apply matmulRect to all combinations of rectangles from the two arrays, and concatenate the results using concatGrid.

    sig matmulTiled : forall 'a. ('a -> 'a -> 'a) -> ('a -> 'a -> 'a)
                   -> 'a -> int -> int -> int -> int -> ['a] -> ['a] -> Program ['a]
    fun matmulTiled f g neutral split hA wA wB A B =
      let tileSize = split * split in
      let p = wA / split in
      let q = wB / split in
      let tiledA = splitGrid split wA hA A in
      let tiledB = splitGrid split wB wA B in
      iota (hA*wB / tileSize)
      |> map (fn tile =>
           let row = tile / q in
           let col = tile % q in
           let rectA = generate p (fn i => index tiledA (p * row + i)) in
           let rectB = generate p (fn i => index tiledB (i * q + col)) in
           matmulRect f g neutral rectA rectB split p)
      |> concatGrid split wB

5.4.3 Parallel prefix sum

Like Obsidian, we can implement a scan operation in FCL. In the scanChunked function below, sklansky is a block-level scan, as previously presented in the context of Obsidian. We leave out its definition, as it is almost identical to their presentation. The main function scan performs three grid-level computations: the first performs a block-level scan; the second computes the cumulative sums after each block, through a second scan operation; and finally, a map adjusts the results by adding the cumulative sum of all previous blocks to each value of the current block. The final adjusting operation returns a grid-level push array, which is the result of the entire computation. When using scan in a larger program, it would thus be possible to fuse the final grid-level computation with any map operations immediately following it.

    sig scanChunked : forall 'a. int -> ('a -> 'a -> 'a) -> ['a] -> Program ['a]
    fun scanChunked lgChunkSize op arr =
      let chunkSize = 1 << lgChunkSize
      in concatMap chunkSize
                   (sklansky lgChunkSize op)
                   (splitUp chunkSize arr)

    sig intermediateSums : forall 'a. int -> ['a] -> ['a]
    fun intermediateSums chunkSize arr =
      map last (splitUp chunkSize arr)

    fun adjust n f neutral scans sums =
      (generate (length scans / n)
         (fn i =>
            generate n
              (fn j =>
                 f (index scans (i*n + j))
                   (if i == 0 then neutral else index sums (i - 1)))))
      |> map (fn a => return (push a))
      |> concat n

    sig scan : forall 'a. ('a -> 'a -> 'a) -> 'a -> ['a] -> Program ['a]
    fun scan f neutral input =
      let lgChunkSize = 8 in
      let chunkSize = 1 << lgChunkSize in
      do { scansPush <- scanChunked lgChunkSize f input
         ; scans <- force scansPush
         ; let sums = intermediateSums chunkSize scans
         ; scanSumsPush <- scanChunked lgChunkSize f sums
         ; scanSums <- force scanSumsPush
         ; adjust chunkSize f neutral scans scanSums
         }
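The three phases of scan are easiest to follow in a sequential setting. The Python sketch below is an illustration with invented names (an inclusive scan stands in for sklansky); it mirrors scanChunked, intermediateSums, and adjust, and can be checked against a direct scan:

    from itertools import accumulate

    def chunked_scan(f, neutral, xs, chunk=4):
        # phase 1: scan each chunk independently (the block-level scans)
        scans = [list(accumulate(xs[i:i+chunk], f))
                 for i in range(0, len(xs), chunk)]
        # phase 2: scan the per-chunk totals (the intermediate sums)
        sums = list(accumulate((c[-1] for c in scans), f))
        # phase 3: add the cumulative total of all preceding chunks
        out = []
        for i, c in enumerate(scans):
            offset = neutral if i == 0 else sums[i - 1]
            out.extend(f(v, offset) for v in c)
        return out

    xs = list(range(1, 17))
    assert chunked_scan(lambda a, b: a + b, 0, xs) == list(accumulate(xs))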
5.5 Performance
FCL is work in progress; thus certain optimizations are not yet implemented. However, the performance on the previously shown examples is promising, and we have identified the bottlenecks that currently limit performance. In this section, we compare the performance of each benchmark with hand-written OpenCL kernels from NVIDIA's OpenCL SDK. When an FCL program is compiled, the result is a file containing one or more OpenCL kernels and an executable that instruments the execution of those kernels.

To benchmark the generated code, we have used an NVIDIA GeForce GTX 780 Ti, which is built on the Kepler architecture. It has 2880 cores (875 MHz) and 3 GB of GDDR5 RAM (7 GHz effective, 384-bit bus). Calculating the theoretical peak bandwidth on global memory accesses, we get 7 GHz × 384 bit = 336 GB/s. In practice we can expect a maximum bandwidth of 254.90 GB/s when accessing global memory, which we have measured using NVIDIA's benchmarking tool (bandwidthTest); a worked check of these figures is given below.

Each benchmark has been executed on an array of 2^24 32-bit integers (64 MiB). Timing was measured using OpenCL profiling operations, on 1000 executions of the same kernel, preceded by a single warm-up run. The achieved global-memory bandwidths are shown in Figure 5.17 and are measurements of a single kernel call (e.g. not a full reduction). The maximum bandwidth measured with NVIDIA's bandwidthTest program is plotted as a dashed horizontal line. The actual achieved performance for each kernel is presented in Table 5.1.

In the reverse and transpose examples, we hit the measured maximum bandwidth, as we hoped. The generated code is similar to the hand-written code from NVIDIA, except for block-virtualization, which is not used in NVIDIA's version.

The reduction example is interesting; here we generate a completely unrolled loop, which performs reasonably well, but does not quite hit the performance target set by NVIDIA's heavily tuned kernel. To identify how we can improve our solution, we have inspected the differences between the two kernels. To get on par with NVIDIA's kernel, we will need to make each thread do an initial sequential reduction of a few elements before the parallel tree-reduction we have already implemented. Furthermore, we synchronize across the complete work group between each reduction step, even when the number of elements is below warp size. Adding a warp-level between thread-level and block-level would make it possible to avoid such unnecessary synchronizations. We had such a warp-level in our earlier implementation, but removed it for simplicity; it should not be a problem to reintroduce it.

For the matrix multiplication example, we are not quite on par with the handwritten code either. Investigation led us to believe that the extensive use of division and modulo operations is to blame. See Section 5.7 below for suggestions on how we might avoid these expensive operations through the introduction of multidimensional arrays.
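As a back-of-the-envelope check of the figures quoted above (using only the device specification and the timings in Table 5.1), the peak and achieved bandwidths can be derived as follows; the reverse benchmark reads and writes each of the 2^24 4-byte integers once:

    \begin{align*}
    \mathrm{peak} &= 7\,\mathrm{GHz} \times 384\,\mathrm{bit}
                   = 7 \times 10^{9}\,\mathrm{s}^{-1} \times 48\,\mathrm{B}
                   = 336\,\mathrm{GB/s}\\
    \mathrm{reverse} &= \frac{2 \times 2^{24} \times 4\,\mathrm{B}}{0.533\,\mathrm{ms}}
                      \approx 252\,\mathrm{GB/s}
    \end{align*}

The second figure lands just below the 254.90 GB/s measured with bandwidthTest, consistent with the claim that reverse runs at the practical bandwidth limit.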

[Bar chart: measured bandwidth (GB/s), OpenCL vs. FCL, for Reverse, Transpose, Partial Reduce, and Matrix Mult.]

Figure 5.17: Measured bandwidths on our example programs. OpenCL bars are code from NVIDIA’s OpenCL SDK, and we compare it to OpenCL ker- nels generated by FCL. The dashed line indicates the maximum achievable bandwidth as measured by NVIDIA’s benchmarking tool.

Benchmark               Input size             OpenCL     FCL
Reverse                 2^24 32-bit integers   0.526 ms   0.533 ms
Transpose               2^24 32-bit integers   0.695 ms   0.648 ms
Partial reduction       2^24 32-bit integers   0.331 ms   0.454 ms
Matrix multiplication   2^24 doubles           8.692 ms   10.076 ms

Table 5.1: Timings of single-kernel executions of each micro benchmark. Average over 1000 invocations.

Table 5.2 illustrates that FCL also allows slightly bigger programs, consisting of several kernels, to be compiled. We compare the performance of our reduce and scan programs with the kernels provided with the CUDA-based framework Thrust [83]. Timings are measured as wall-clock time. For the reduction example we are on par with Thrust, which is curious, as the optimised NVIDIA kernel was beating us in our previous single-kernel benchmark. For scan we are 50% slower than Thrust, which might be due to a difference in the algorithm used.

Benchmark        Input size             Thrust      FCL
Full reduction   2^24 32-bit integers   0.5503 ms   0.533 ms
Full scan        2^24 32-bit integers   1.4543 ms   2.84 ms

Table 5.2: Wall-clock timings of full reduction (2 kernels) and scan (3 kernels).

5.6 Related work

Obsidian and FCL are not the first languages for hierarchical parallel machines. Sequoia is an imperative hierarchical language [44], inspired by previous work on Parallel Memory Hierarchies (PMH) [5], supporting both cluster computing through MPI and programming of multiple GPUs. Both Sequoia and PMH model a parallel machine as a tree of distinct memory modules. Programs are written to be machine independent, where function calls correspond either to executing a subtask on a child in the hierarchy (copying data to this memory module) or to staying in the same memory module. Thus, the call/return of a subtask implies that data movement through the machine hierarchy might occur. The stopping criteria for recursive functions are left out and instead specified separately in a mapping specification, which details how an algorithm maps to a concrete machine. Programs can also involve tunable parameters and several variants of the same algorithm; the mapping specification also controls these choices. Mapping specifications can potentially be generated automatically. The ideas from Sequoia are further generalized in the work on Hierarchical Place Trees [109].

Another language for hierarchical GPU programming is HiDP by Zhang and Mueller [81]. In addition to the hierarchies of Obsidian and FCL, they add two sub-warp levels of size 4 and 8, respectively. Parallelism is embodied as nested parallel-for loops (called map-blocks) together with a set of built-in parallel array operators (partition, reduce, scan, sort, and reverse). Arrays are multi-dimensional, and nested irregular segmented arrays are built-in. For optimization purposes, it is however also possible to use regular arrays. Fusion decisions and the use of shared memory are completely controlled by the compiler.

The language discussed by Steuwer et al. [94] is also related to FCL, operating at a similarly low level. The main idea is to build a language that can be automatically tuned to hardware, by applying search strategies to a provided set of rewrite rules. It might be interesting to build a similar search-based rewrite-engine on top of FCL and allow the user to express rewrite rules. Another interesting aspect of this work is its support for programming with vector instructions (such as adding two int4 values in OpenCL), which would correspond to a layer between warp-level and thread-level in Obsidian and FCL. Similar efforts have previously been made in the context of Obsidian [105] and Qilin [74].

The hierarchy in FCL and Obsidian might also be compared to the concept of locales and sublocales in the Chapel language [31]. However, Chapel does not currently target GPUs.

Functional approaches to GPU computing have typically concentrated on optimizing compilers that are intended to shield the user from the need to understand (or control) details of the GPU. Examples include Futhark [59], Accelerate [29], Delite [27], Harlan [62], and Nessie [11]. These projects might be considered to be at roughly the same level as NVIDIA's Thrust library [83]. FCL and Obsidian are rather at the level of NVIDIA's CUB library, which provides reusable software components for every level of the GPU hierarchy [84].

5.7 Discussion and future work

The FCL language is still at an early prototype stage. We have already mentioned that reintroducing the warp-level would be beneficial for algorithms where extraneous work-group-wide synchronisations slow the computation down. Doing so would not be difficult, but we chose to remove that feature for ease of implementation of our prototype compiler.

Another interesting and similar effort would be support for an extra level above grid-level. Currently, all arrays always reside on the device. This restriction is, however, not feasible for large problems, because of the limited memory of GPU devices. Having a level above the grid-level would allow programmers to reason about data movement between the host and the device. However, in our efforts towards implementing such a level, we found that the push/pull array representation was not appropriate, as handles to the underlying arrays are not accessible across GPUs. Moving data to and from GPUs must be performed in large transactions, not elementwise, and a different array representation is thus necessary. On newer NVIDIA devices supporting unified memory, it may be less of a problem to use push/pull arrays, as data movement will then be handled automatically by the OpenCL/CUDA runtime.

Adding extra levels to the language, such as a warp or a device level, would however also require users to write even lengthier programs to specify how arrays are chunked and subproblems combined. Often the same split/interleave operations would be necessary for each level of the hierarchy until we arrive at the thread-level. To avoid such cumbersome programming, it would be relevant to investigate how recursion over the machine hierarchy could be achieved in a way similar to the programming language Sequoia [44], albeit in functional style. Adding recursion and ad-hoc level-polymorphism, in the form of a case-statement on levels, should indeed be possible, as both could be unfolded and removed after monomorphization.

Programs might then be written in a style similar to the following pseudocode, and users of such functions would obtain different versions depending on the level of the hierarchy they require:

    sig f : forall 'a 'b. ['a] -> Program <l> ['b]
    fun f xs =
      case <l> of
        | <Z>    => ... sequential solution ...
        | <1+l'> => partition into subproblems
                    |> map f
                    |> combine subproblems

Furthermore, as can be seen in the transpose program as well as in the splitGrid and concatGrid functions, a large amount of index calculation is sometimes necessary in the current version of FCL. Supporting multidimensional arrays in the style of Repa [68] or Accelerate [28] would allow us to reduce the amount of index calculations. We could, for instance, implement splitGrid in a more straightforward manner, avoiding costly divisions and giving it a type that more precisely describes its functionality:

    sig splitGrid : forall 'a. int -> ['a]2 -> [['a]2]2
    fun splitGrid chunkSize arr =
      generate2D (width arr / chunkSize, height arr / chunkSize)
        (fn (outerX, outerY) =>
           generate2D (chunkSize, chunkSize)
             (fn (innerX, innerY) =>
                let x = outerX * chunkSize + innerX in
                let y = outerY * chunkSize + innerY in
                let ix = y * (width arr) + x in
                index arr ix))
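The claim that this 2D formulation computes the same partitioning can be checked by evaluating the nested generate2D expressions on a small input. The Python sketch below is an illustration only (generate2d is a made-up helper, and the input is kept as a flat row-major list so the inner index expression matches the code above):

    def generate2d(shape, f):
        w, h = shape
        return [[f(x, y) for x in range(w)] for y in range(h)]

    def split_grid(c, width, height, arr):
        return generate2d((width // c, height // c),
            lambda ox, oy: generate2d((c, c),
                lambda ix, iy: arr[(oy * c + iy) * width + (ox * c + ix)]))

    flat = list(range(16))    # a 4x4 row-major array
    tiles = split_grid(2, 4, 4, flat)
    print(tiles[0][0])        # top-left tile:            [[0, 1], [4, 5]]
    print(tiles[0][1])        # its right-hand neighbour: [[2, 3], [6, 7]]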

Here we use the notation [α]_n for the type of pull arrays of rank n; after the splitGrid operation we would thus obtain a two-dimensional array of two-dimensional arrays. Making such a change would of course also require most other operations to be changed to work with higher-dimensional arrays. For example, a modified force function operating on multidimensional arrays might be given the following type:

    force : ∀α. [α]_n → Program [α]_n

Another area of future work relates to fusion. FCL currently allows producer/producer fusion and consumer/producer fusion. However, horizontal fusion is not possible in general, which can be a major drawback for many applications. Allowing push arrays to be fused horizontally requires that two push arrays can share the same loop structure. Currently we only allow push arrays of base types. If we also allowed tuples, an array-of-tuples to tuple-of-arrays conversion would allow us to write two or more arrays simultaneously (sketched below). Such an approach would, however, still require manual composition of programs by the user, which is not desirable. Alternatively, it might be interesting to look at how the force operation could be modified to accept several computations and execute them in tandem if they follow the same structure.

As stated previously, the use of pull and push arrays has several drawbacks. Improving these representations to support larger memory transactions and better support for horizontal fusion is one path to explore. Alternatively, it would be interesting to avoid the distinction altogether and instead provide a limited set of rewriting rules on array combinators, simple enough that users can reason about when they will be applied. Furthermore, with the current version of pull and push arrays, we found it difficult to give a precise cost-model, as the cost of accessing a pull array depends on the order in which it is eventually accessed. It would be interesting to remodel the language such that data movement from global memory to local memory is always required to be in blocks corresponding to the transaction size of the GPU, which would expose the cost of uncoalesced accesses directly to the user. It should then only be possible to access individual array elements from within sequential code, once data has already been moved to the streaming multiprocessor. In practice, the compiler might remove unnecessary uses of shared memory, but forcing the user to always move data between all layers of the GPU would probably make the language more uniform and perhaps make it easier to provide a meaningful cost-model.

Currently, memory is allocated implicitly during force, and it is thus not possible to reuse the same memory, except through the power/while operations. We would want the possibility of writing an in-place version of, for example, reverse, writing the reversed array back to the same global array. Similar problems occur in sequential loops, as we always have to initialise the array, even though these initial values will never be used. Futhark [59] solves these issues using uniqueness types, which allow safe in-place updates to arrays by guaranteeing that the reference to a given array is unique and that no other execution path may use it.
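The array-of-tuples to tuple-of-arrays conversion mentioned above can be illustrated in the same Python model of push arrays used earlier (again a sketch, not FCL): a single producer loop emits pairs, and the writer is split into one writer per component, so two result arrays are filled by one traversal, which is exactly what horizontal fusion asks for:

    # One push producer writing pairs; unzipping fills two arrays
    # from a single loop.
    def push_pair(n, f, g):
        def producer(writer):
            for i in range(n):
                writer(i, (f(i), g(i)))    # one loop, two logical outputs
        return producer

    def unzip_push(push_arr, n):
        xs, ys = [None] * n, [None] * n
        def write(i, pair):
            xs[i], ys[i] = pair            # one traversal fills both
        push_arr(write)
        return xs, ys

    doubles, squares = unzip_push(push_pair(5, lambda i: 2 * i,
                                               lambda i: i * i), 5)
    print(doubles, squares)   # [0, 2, 4, 6, 8] [0, 1, 4, 9, 16]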

5.8 Conclusion

We have presented FCL, a functional language for GPU algorithms. FCL is work in progress, but the prototype that we have presented shows promising results.

FCL builds on previous work on Obsidian [96], from which both the concept of push arrays and the concept of level-variables originate. FCL distinguishes itself from Obsidian by adding support for more involved interleaving patterns when push arrays are forced, by being a self-contained language, by supporting a wider range of sequential loops, and by using a similar programming style for all layers of the hierarchy.

In the previous section, we identified a number of paths for future work. Other possibilities for future work would be to implement some larger example programs in FCL and to attempt to use FCL as an intermediate language for our APL compiler [43]. Chapter 6

Conclusion

We have presented two functional data-parallel programming languages for programming massively parallel architectures.

We first presented a statically typed array intermediate language, TAIL [43], for compiling the language APL, which has seen wide use in the financial industry. APL is originally a dynamically typed and interpreted language providing support for multidimensional arrays and a wide range of built-in shape-polymorphic primitives. Our TAIL compiler supports a large subset of APL. The TAIL type system supports several kinds of types for arrays, with gradual degrees of refinement through subtyping. The type system keeps track of array ranks and allows one-dimensional arrays to be given a more refined type that keeps track of vector lengths. Keeping track of vector lengths is necessary in situations where vectors are used as array shapes (e.g. in the reshape operation). We further demonstrate that TAIL is suitable as the target of an inference algorithm for an APL compiler. After an APL program has been translated into TAIL, we have shown that the resulting TAIL program can be further translated into fused high-performance sequential code, or, through various existing data-parallel functional programming languages, into efficient GPU code [23, 55].

Following our TAIL project, we determined that it would be interesting to pursue a compilation strategy that allows users to reason about data movement and data access patterns, while still supporting efficient composition through fusion. It has been mentioned that a reason why functional data-parallel languages lack industrial adoption is the limited ability of users to optimise their algorithms and ensure performance. Previous work on compilers for functional data-parallel languages often relies on heavy program restructuring to achieve fusion, without providing control over when fusion will or will not happen, or whether memory accesses are coalesced.

The language resulting from our work in this direction, FCL [40], is a purely functional data-parallel language based on previous work on Obsidian [96]. FCL allows expressing data-parallel algorithms in a style following the

hierarchical structure of the GPU, and it allows reasoning about the location and movement of data through the memory system. The implementation of FCL is based on delayed array representations and on user annotations of when data should be written. We demonstrate through a number of micro benchmarks that FCL compiles to efficient GPU code. However, the delayed array representations make horizontal fusion hard and do not allow bulk memory operations, which are necessary for efficient data movement between the GPU and the host. It is future work to determine whether another compilation scheme can solve some of these issues. While FCL is still at a prototype stage, it might be useful as an explorative tool for trying various algorithmic approaches when implementing parallel algorithms for the GPU.

Pointing forward, future work might include strengthening the FCL language with a cost-model, providing support for ad-hoc level-polymorphism, and work on compiling TAIL to FCL. A target worth pursuing might be to allow APL users the ability to annotate their APL programs with level annotations, instructing the underlying compiler framework in how APL programs are mapped to hierarchical machines. Bibliography

[1] Philip Samuel Abrams. An APL machine. PhD thesis, Stanford University, 1970.
[2] Advanced Micro Devices, Inc. AMD Accelerated Parallel Processing, OpenCL Programming Guide, 2013.
[3] Alok Aggarwal, Bowen Alpern, Ashok Chandra, and Marc Snir. A model for hierarchical memory. In Proceedings of the nineteenth annual ACM symposium on Theory of computing, pages 305–314. ACM, 1987.
[4] Alok Aggarwal, Jeffrey Vitter, et al. The input/output complexity of sorting and related problems. Communications of the ACM, 31(9):1116–1127, 1988.
[5] Bowen Alpern, Larry Carter, and Jeanne Ferrante. Modeling parallel computers as memory hierarchies. In Programming Models for Massively Parallel Computers, 1993. Proceedings, pages 116–123. IEEE, 1993.
[6] E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK Users' Guide. Society for Industrial and Applied Mathematics, Philadelphia, PA, third edition, 1999.
[7] Christian Andreetta, Vivien Begot, Jost Berthold, Martin Elsman, Troels Henriksen, Maj-Britt Nordfang, and Cosmin E. Oancea. A financial benchmark for GPGPU compilation. In 18th International Workshop on Compilers for Parallel Computing (CPC'15), CPC '15, 2015.
[8] Emil Axelsson, Koen Claessen, Gergely Dévai, Zoltán Horváth, Karin Keijzer, Bo Lyckegård, Anders Persson, Mary Sheeran, Josef Svenningsson, and András Vajda. Feldspar: A domain specific language for digital signal processing algorithms. In Formal Methods and Models for Codesign (MEMOCODE), 2010 8th IEEE/ACM International Conference on, pages 169–178. IEEE, 2010.
[9] David F. Bacon, Susan L. Graham, and Oliver J. Sharp. Compiler transformations for high-performance computing. ACM Computing Surveys, 26(4):345–420, December 1994.


[10] Lars Bergstrom, Matthew Fluet, Mike Rainey, John Reppy, Stephen Rosen, and Adam Shaw. Data-only flattening for nested data parallelism. SIGPLAN Not., 48(8):81–92, February 2013.
[11] Lars Bergstrom and John Reppy. Nested data-parallelism on the GPU. In Proceedings of the 17th ACM SIGPLAN International Conference on Functional Programming, ICFP '12. ACM, 2012.
[12] Robert Bernecky. The role of APL and J in high-performance computation. SIGAPL APL Quote Quad, 24(1):17–32, September 1993.
[13] Robert Bernecky. APEX: The APL parallel executor. Master's thesis, University of Toronto, 1997.
[14] Robert Bernecky and Stephen B. Jaffe. ACORN: APL to C on real numbers. In ACM SIGAPL Quote Quad, pages 40–49, 1990.
[15] Jost Berthold, Andrzej Filinski, Fritz Henglein, Ken Friis Larsen, Mogens Steffensen, and Brian Vinter. Functional high performance financial IT. In International Symposium on Trends in Functional Programming, pages 98–113. Springer, 2011.
[16] R. S. Bird. An introduction to the theory of lists. In NATO Inst. on Logic of Progr. and Calculi of Discrete Design, pages 5–42, 1987.
[17] Guy Blelloch. Programming parallel algorithms. Communications of the ACM (CACM), 39(3):85–97, 1996.
[18] Guy Blelloch and Gary W. Sabot. Compiling collection-oriented languages onto massively parallel computers. Journal of Parallel and Distributed Computing, 8:119–134, 1990.
[19] Guy E. Blelloch. Prefix sums and their applications. Technical report, Synthesis of Parallel Algorithms, 1990.
[20] Paul Bratley and Bennett L. Fox. Algorithm 659: Implementing Sobol's quasirandom sequence generator. ACM Transactions on Mathematical Software (TOMS), 14(1):88–100, 1988.
[21] Richard P. Brent and H. T. Kung. A regular layout for parallel adders. IEEE Trans. Computers, 31(3):260–264, 1982.
[22] Timothy Budd. An APL Compiler. Springer-Verlag, 1988.
[23] Michael Budde. Compiling APL to Accelerate through a typed IL. MSc project, Department of Computer Science, University of Copenhagen (DIKU), November 2014.
[24] Chris Burke and Roger Hui. J for the APL programmer. SIGAPL APL Quote Quad, 27(1):11–17, September 1996.

[25] Philip Carlsen and Martin Dybdal. Option pricing using data-parallel languages. Master's thesis, DIKU, University of Copenhagen, Department of Computer Science, 2013.
[26] Bryan Catanzaro, Michael Garland, and Kurt Keutzer. Copperhead: compiling an embedded data parallel language. ACM SIGPLAN Notices, 46(8):47–56, 2011.
[27] Hassan Chafi, Arvind K. Sujeeth, Kevin J. Brown, HyoukJoong Lee, Anand R. Atreya, and Kunle Olukotun. A domain-specific approach to heterogeneous parallelism. In ACM SIGPLAN Notices, volume 46. ACM, 2011.
[28] Manuel M. T. Chakravarty, Gabriele Keller, Sean Lee, Trevor L. McDonell, and Vinod Grover. Accelerating Haskell array codes with multicore GPUs. In Sixth Workshop on Declarative Aspects of Multicore Programming, DAMP'11, pages 3–14. ACM, 2011.
[29] Manuel M. T. Chakravarty, Gabriele Keller, Sean Lee, Trevor L. McDonell, and Vinod Grover. Accelerating Haskell array codes with multicore GPUs. In Proceedings of the sixth workshop on Declarative aspects of multicore programming, pages 3–14. ACM, 2011.
[30] Bradford L. Chamberlain. Chapel: Parallel programmability from desktops to supercomputers, 2016. URL: http://chapel.cray.com/presentations/ChapelForCopenhagen-presented.pdf.
[31] Bradford L. Chamberlain, David Callahan, and Hans P. Zima. Parallel programmability and the Chapel language. International Journal of High Performance Computing Applications, 21(3):291–312, 2007.
[32] Philippe Charles, Christian Grothoff, Vijay Saraswat, Christopher Donawa, Allan Kielstra, Kemal Ebcioglu, Christoph von Praun, and Vivek Sarkar. X10: An object-oriented approach to non-uniform cluster computing. In Proceedings of the 20th Annual ACM SIGPLAN Conference on Object-oriented Programming, Systems, Languages, and Applications, OOPSLA '05, pages 519–538, New York, NY, USA, 2005. ACM.
[33] Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Sang-Ha Lee, and Kevin Skadron. Rodinia: A benchmark suite for heterogeneous computing. In IEEE International Symposium on Workload Characterization (IISWC), IISWC '09, 2009.
[34] Hanfeng Chen and Wai-Mee Ching. An ELI-to-C compiler: Production and performance, 2013.
[35] Koen Claessen, Mary Sheeran, and Bo Joel Svensson. Expressive array constructs in an embedded GPU kernel programming language. In Proceedings of the 7th workshop on Declarative aspects and applications of multicore programming, pages 21–30. ACM, 2012.

[36] Koen Claessen, Mary Sheeran, and Joel Svensson. Expressive array constructs in an embedded GPU kernel programming language. In 7th workshop on Declarative aspects and applications of multicore programming, DAMP'12. ACM, 2012.
[37] Robert Clifton-Everest, Trevor L. McDonell, Manuel M. T. Chakravarty, and Gabriele Keller. Embedding foreign code. In Practical Aspects of Declarative Languages, pages 136–151. Springer, 2014.
[38] Duncan Coutts, Roman Leshchinskiy, and Don Stewart. Stream fusion: From lists to streams to nothing at all. In Proceedings of the 12th ACM SIGPLAN international conference on Functional programming, ICFP '07. ACM, 2007.
[39] Dave Cunningham, Rajesh Bordawekar, and Vijay Saraswat. GPU programming in a high level language: compiling X10 to CUDA. In Proceedings of the 2011 ACM SIGPLAN X10 Workshop, page 8. ACM, 2011.
[40] Martin Dybdal, Martin Elsman, Bo Joel Svensson, and Mary Sheeran. Low-level functional GPU programming for parallel algorithms. In Proceedings of the 5th International Workshop on Functional High-Performance Computing, pages 31–37. ACM, 2016.
[41] Conal Elliott. Functional images. In The Fun of Programming, "Cornerstones of Computing" series. Palgrave, 2003.
[42] Conal Elliott, Sigbjørn Finne, and Oege De Moor. Compiling embedded languages. J. Funct. Program., 13(3):455–481, 2003.
[43] Martin Elsman and Martin Dybdal. Compiling a subset of APL into a typed intermediate language. In 1st ACM SIGPLAN International Workshop on Libraries, Languages and Compilers for Array Programming, ARRAY'14. ACM, 2014.
[44] Kayvon Fatahalian, Daniel Reiter Horn, Timothy J. Knight, Larkhoon Leem, Mike Houston, Ji Young Park, Mattan Erez, Manman Ren, Alex Aiken, William J. Dally, et al. Sequoia: Programming the memory hierarchy. In Proceedings of the 2006 ACM/IEEE conference on Supercomputing, page 83. ACM, 2006.
[45] Andrew Gill, John Launchbury, and Simon L. Peyton Jones. A short cut to deforestation. In Proceedings of the conference on Functional programming languages and computer architecture, pages 223–232. ACM, 1993.
[46] Clemens Grelck. Shared memory multiprocessor support for functional array processing in SAC. Journal of Functional Programming (JFP), 15(3):353–401, 2005.

[47] Clemens Grelck and Sven-Bodo Scholz. Accelerating APL programs with SAC. In Conference on APL '99: On Track to the 21st Century, APL'99, pages 50–57. ACM, 1999.
[48] Clemens Grelck and Sven-Bodo Scholz. SAC: A functional array language for efficient multithreaded execution. Int. Journal of Parallel Programming, 34(4):383–427, 2006.
[49] Clemens Grelck and Fangyong Tang. Towards hybrid array types in SAC. In 7th Workshop on Prg. Lang., (Soft. Eng. Conf.), pages 129–145, 2014.
[50] Leo J. Guibas and Douglas K. Wyatt. Compilation and delayed evaluation in APL. In 5th ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages, POPL'78, pages 1–8. ACM, 1978.
[51] Jing Guo, Jeyarajan Thiyagalingam, and Sven-Bodo Scholz. Breaking the GPU programming barrier with the auto-parallelising SAC compiler. In Procs. Workshop Decl. Aspects of Multicore Prog. (DAMP), pages 15–24. ACM, 2011.
[52] G. Hains and L. M. R. Mullin. Parallel functional programming with arrays. The Computer Journal, 36(3):238–245, 1993.
[53] Mark Harris et al. Optimizing parallel reduction in CUDA. NVIDIA Developer Technology, 2(4), 2007.
[54] Troels Henriksen. Exploiting functional invariants to optimise parallelism: a dataflow approach. Master's thesis, DIKU, Denmark, 2014.
[55] Troels Henriksen, Martin Dybdal, Henrik Urms, Anna Sofie Kiehn, Daniel Gavin, Hjalte Abelskov, Martin Elsman, and Cosmin Oancea. APL on GPUs: A TAIL from the past, scribbled in Futhark. In Proceedings of the 5th International Workshop on Functional High-Performance Computing, pages 38–43. ACM, 2016.
[56] Troels Henriksen, Martin Elsman, and Cosmin Eugen Oancea. Size slicing – a hybrid approach to size inference in Futhark. In Procs. Funct. High-Perf. Comp. (FHPC), FHPC '14. ACM, 2014.
[57] Troels Henriksen and Cosmin E. Oancea. A T2 graph-reduction approach to fusion. In 2nd ACM SIGPLAN Workshop on Functional High-Performance Computing, September 2013.
[58] Troels Henriksen and Cosmin E. Oancea. Bounds checking: An instance of hybrid analysis. In 1st ACM SIGPLAN International Workshop on Libraries, Languages and Compilers for Array Programming, ARRAY'14. ACM, 2014.
[59] Troels Henriksen and Cosmin Eugen Oancea. A T2 graph-reduction approach to fusion. In Proceedings of the 2nd ACM SIGPLAN workshop on Functional high-performance computing, pages 47–58. ACM, 2013.

[60] Stephan Herhut, Sven-Bodo Scholz, Robert Bernecky, Clemens Grelck, and Kai Trojahner. From contracts towards dependent types: Proofs by partial evaluation. In Olaf Chitil, Zoltán Horváth, and Viktória Zsók, editors, Implementation and Application of Functional Languages, volume 5083 of Lecture Notes in Computer Science, pages 254–273. Springer Berlin Heidelberg, 2008.
[61] K. Hillesland and A. Lastra. GPU floating-point paranoia. Proceedings of GP2, 2004.
[62] Eric Holk, William E. Byrd, Nilesh Mahajan, Jeremiah Willcock, Arun Chauhan, and Andrew Lumsdaine. Declarative parallel programming for GPUs. In International Conference on Parallel Computing (ParCo 2011), 2011.
[63] Kenneth E. Iverson. A Programming Language. John Wiley and Sons, Inc., May 1962.
[64] Kenneth E. Iverson. Notation as a tool of thought. Commun. ACM, 23(8):444–465, August 1980.
[65] C. Barry Jay. Programming in FISh. International Journal on Software Tools for Technology Transfer, 2(3):307–315, 1999.
[66] Hong Jia-Wei and Hsiang-Tsung Kung. I/O complexity: The red-blue pebble game. In Proceedings of the thirteenth annual ACM symposium on Theory of computing, pages 326–333. ACM, 1981.
[67] Simon Peyton Jones, Andrew Tolmach, and Tony Hoare. Playing by the rules: rewriting as a practical optimisation technique in GHC. In Haskell workshop, volume 1, pages 203–233, 2001.
[68] Gabriele Keller, Manuel M. T. Chakravarty, Roman Leshchinskiy, and Simon Peyton Jones. Regular, shape-polymorphic, parallel arrays in Haskell. In ACM SIGPLAN International Conference on Functional Programming, ICFP'2010, September 2010.
[69] Gabriele Keller, Manuel M. T. Chakravarty, Roman Leshchinskiy, Ben Lippmeier, and Simon Peyton Jones. Vectorisation avoidance. In ACM SIGPLAN Notices, pages 37–48. ACM, 2012.
[70] Peter M. Kogge and Harold S. Stone. A parallel algorithm for the efficient solution of a general class of recurrence equations. IEEE Transactions on Computers, 100(8):786–793, 1973.
[71] Mads R. B. Kristensen, Simon A. F. Lund, Troels Blum, Kenneth Skovhede, and Brian Vinter. Bohrium: a virtual machine approach to portable parallelism. In Parallel & Distributed Processing Symposium Workshops (IPDPSW), 2014 IEEE International, pages 312–321. IEEE, 2014.

[72] Monica D. Lam, Edward E. Rothberg, and Michael E. Wolf. The cache performance and optimizations of blocked algorithms. In ACM SIGARCH Computer Architecture News, volume 19, pages 63–74. ACM, 1991.
[73] Bernard Legrand. Mastering Dyalog APL. Dyalog Limited, November 2009.
[74] Chi-Keung Luk, Sunpyo Hong, and Hyesoon Kim. Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, pages 45–55. ACM, 2009.
[75] Frederik M. Madsen and Andrzej Filinski. Towards a streaming model for nested data parallelism. In 2nd ACM SIGPLAN Workshop on Functional High-Performance Computing, September 2013.
[76] Geoffrey Mainland, Roman Leshchinskiy, and Simon Peyton Jones. Exploiting vector instructions with generalized stream fusion. In Proceedings of the 18th ACM SIGPLAN international conference on Functional programming, ICFP '13. ACM, 2013.
[77] Geoffrey Mainland and Greg Morrisett. Nikola: Embedding compiled GPU functions in Haskell. In Third ACM Haskell Symposium on Haskell, Haskell'10, pages 67–78. ACM, 2010.
[78] Trevor L. McDonell, Manuel M. T. Chakravarty, Gabriele Keller, and Ben Lippmeier. Optimising purely functional GPU programs. ACM SIGPLAN Notices, 48(9):49–60, 2013.
[79] Nimrod Megiddo and Vivek Sarkar. Optimal weighted loop fusion for parallel programs. In Proceedings of the ninth annual ACM symposium on Parallel algorithms and architectures, pages 282–291. ACM, 1997.
[80] Greg Morrisett. Compiling with Types. PhD thesis, School of Computer Science, Carnegie Mellon University, USA, December 1995.
[81] Frank Mueller and Yongpeng Zhang. HiDP: A hierarchical data parallel language. In Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), CGO '13, pages 1–11, Washington, DC, USA, 2013. IEEE Computer Society.
[82] NVIDIA. CUDA C Programming Guide, 2015.
[83] NVIDIA. NVIDIA Thrust Library, 2015. URL: https://developer.nvidia.com/thrust.
[84] NVIDIA Research. NVIDIA CUB Library, 2015. URL: https://nvlabs.github.io/cub/.

[85] OpenMP Architecture Review Board. OpenMP application program interface version 3.1, 2011. URL: http://www.openmp.org/mp-documents/OpenMP3.1.pdf.
[86] Benjamin C. Pierce, Chris Casinghino, Marco Gaboardi, Michael Greenberg, Cătălin Hriţcu, Vilhelm Sjöberg, and Brent Yorgey. Software Foundations. Electronic book, January 2015. Version 3.2.
[87] Robert Harper, David MacQueen, and Robin Milner. Standard ML. Technical Report ECS-LFCS-86-2, Department of Computer Science, University of Edinburgh, March 1986.
[88] Amos Robinson, Ben Lippmeier, and Gabriele Keller. Fusing filters with integer linear programming. In Proceedings of the 3rd ACM SIGPLAN workshop on Functional high-performance computing, pages 53–62. ACM, 2014.
[89] John Scholes. D: A functional subset of Dyalog APL. The Journal of the British APL Association, 17(4):93–100, April 2001.
[90] Mary Sheeran. Functional and dynamic programming in the design of parallel prefix networks. Journal of Functional Programming, 21(01):59–114, 2011.
[91] Jack Sklansky. Conditional-sum addition logic. IRE Transactions on Electronic Computers, pages 226–231, 1960.
[92] Justin Slepak, Olin Shivers, and Panagiotis Manolios. An array-oriented language with static rank polymorphism. In Programming Languages and Systems, volume 8410 of Lecture Notes in Computer Science, pages 27–46. Springer-Verlag, 2014.
[93] Justin Slepak, Olin Shivers, and Panagiotis Manolios. An array-oriented language with static rank polymorphism. In Zhong Shao, editor, Programming Languages and Systems, volume 8410 of Lecture Notes in Computer Science, pages 27–46. Springer Berlin Heidelberg, 2014.
[94] Michel Steuwer, Christian Fensch, Sam Lindley, and Christophe Dubach. Generating performance portable code using rewrite rules: From high-level functional expressions to high-performance OpenCL code. In Proceedings of the 20th ACM SIGPLAN International Conference on Functional Programming, ICFP 2015, 2015.
[95] Kevin Stock, Martin Kong, Tobias Grosser, Louis-Noël Pouchet, Fabrice Rastello, J. Ramanujam, and P. Sadayappan. A framework for enhancing data reuse via associative reordering. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '14, pages 65–76, New York, NY, USA, 2014. ACM. URL: http://doi.acm.org/10.1145/2594291.2594342, doi:10.1145/2594291.2594342.

[96] Bo Joel Svensson, Ryan R. Newton, and Mary Sheeran. A language for hierarchical data parallel design-space exploration on GPUs. Journal of Functional Programming, 26, 2016.
[97] Joel Svensson, Mary Sheeran, and Koen Claessen. Obsidian: A domain specific embedded language for parallel programming of graphics processors. Implementation and Application of Functional Languages, pages 156–173, 2011.
[98] Peter Thiemann and Manuel M. T. Chakravarty. Agda meets Accelerate. In 24th Symposium on Implementation and Application of Functional Languages, IFL'2012, 2013. Revised Papers, Springer-Verlag, LNCS 8241.
[99] Kai Trojahner and Clemens Grelck. Dependently typed array programs don't go wrong. The Journal of Logic and Algebraic Programming, 78(7):643–664, 2009. The 19th Nordic Workshop on Programming Theory (NWPT'2007).
[100] Kai Trojahner and Clemens Grelck. Descriptor-free representation of arrays with dependent types. In Procs. Int. Conf. on Implem. and Appl. of Funct. Lang. (IFL), pages 100–117, 2011.
[101] Kai Trojahner and Clemens Grelck. Descriptor-free representation of arrays with dependent types. In 20th International Conference on Implementation and Application of Functional Languages, IFL'08, pages 100–117. Springer-Verlag, 2011.
[102] Sven Verdoolaege, Juan Carlos Juega, Albert Cohen, José Ignacio Gómez, Christian Tenllado, and Francky Catthoor. Polyhedral parallel code generation for CUDA. ACM Trans. Archit. Code Optim., 9(4):54:1–54:23, January 2013.
[103] Jeffrey Scott Vitter and E. A. M. Shriver. Algorithms for parallel memory, I: Two-level memories. Algorithmica, 12(2):110–147, 1994.
[104] Jeffrey Scott Vitter and E. A. M. Shriver. Algorithms for parallel memory, II: Hierarchical multilevel memories. Algorithmica, 12(2):148–169, 1994.
[105] Michael Vollmer, Bo Joel Svensson, Eric Holk, and Ryan R. Newton. Meta-programming and auto-tuning in the search for high performance GPU code. In Proceedings of the 4th ACM SIGPLAN Workshop on Functional High-Performance Computing, pages 1–11. ACM, 2015.
[106] Philip Wadler. Deforestation: Transforming programs to eliminate trees. Theoretical Computer Science, 73(2):231–248, 1990.
[107] Arthur Whitney. K. The Journal of the British APL Association, 10(1):74–79, July 1993.

[108] Hongwei Xi and Frank Pfenning. Dependent types in practical programming. In Proceedings of the 26th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL '99, pages 214–227, New York, NY, USA, 1999. ACM.
[109] Yonghong Yan, Jisheng Zhao, Yi Guo, and Vivek Sarkar. Hierarchical place trees: A portable abstraction for task parallelism and data movement. In International Workshop on Languages and Compilers for Parallel Computing, pages 172–187. Springer, 2009.
[110] A. P. Yershov. ALPHA – an automatic programming system of high efficiency. Journal of the ACM, 13(1):17–24, January 1966.