DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020

Automatic GPU optimization through higher-order functions in functional languages

JOHN WIKMAN

KTH ROYAL INSTITUTE OF TECHNOLOGY SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Master in Computer Science
Date: September 25, 2020
Supervisor: David Broman
Examiner: Cyrille Artho
School of Electrical Engineering and Computer Science
Swedish title: Automatisk GPU-optimering genom funktioner av högre ordning i funktionella språk


Abstract

Over recent years, graphics processing units (GPUs) have become popular devices to use in procedures that exhibit data-parallelism. Due to their high parallel capability, running procedures on a GPU can result in an execution time speedup ranging from a couple of times faster to several orders of magnitude faster, compared to executing serially on a central processing unit (CPU). Interfaces such as CUDA and OpenCL flexibly expose the parallel capabilities of the GPU to the programmer, while at the same time putting a lot of responsibility on the programmer to handle aspects such as thread synchronization and memory management.

A different approach to GPU optimization is to enable it through higher-order functions with known data-parallelism, using the semantics of the higher-order function to determine the parallel execution. In practice, this approach has been integrated into existing languages through libraries or built directly into the languages themselves. However, higher-order functions do not address when it is beneficial to execute on a GPU. Because the GPU is a separate device, effects such as latency and memory transfer can cause a slowdown for small input values.

In this thesis, a set of commonly used higher-order functions are GPU enabled as compiler intrinsics in a small functional language. These higher-order functions are also equipped with the option of automatically deciding at runtime whether to execute on GPU or CPU.

Results show that running higher-order functions on GPU yields a speedup for larger computations. However, the performance does not match that of existing solutions which provide additional higher-order functions for optimizing the parallelization. The selected approach for automatically deciding whether to run a higher-order function on GPU or on CPU results in the faster option in a majority of cases. The most notable benefit of automatic decisions was for procedures that use multiple higher-order function invocations, which ran faster than when executing only on GPU or only on CPU.

Sammanfattning

In recent years, graphics processing units have become popular devices for running programs with high data-parallelism. The high parallel capability of GPUs means that the execution time of a program can become considerably shorter than if the program is executed serially on an ordinary processor. Interfaces such as CUDA and OpenCL enable flexible parallel programming for GPUs, while at the same time placing a lot of responsibility on the programmer to handle aspects such as thread synchronization and memory management.

A different approach is to optimize programs for GPUs through higher-order functions with data-parallel properties, where the semantics of those functions govern how the parallel execution is carried out. In practice, this method has been integrated into existing programming languages as libraries or built directly into the languages themselves. However, higher-order functions do not address when it is profitable to execute on a GPU. Because a GPU is a separate device, effects such as latency and memory transfers can lead to longer execution times for small inputs.

In this thesis, a known set of higher-order functions is provided as built-in functions in a compiler for a small functional language, with support for executing them on a GPU. These higher-order functions also have support for deciding automatically at runtime whether to execute on a GPU or on an ordinary processor.

The results show that higher-order functions executed on a GPU have shorter execution times for larger computations. However, the performance does not match that of existing solutions which provide a larger set of higher-order functions with support for parallel optimization. The selected approach for automatically deciding whether a higher-order function should execute on a GPU or on an ordinary processor chooses the faster alternative in a majority of cases. The most notable benefit of the automatic decision making, though, was for programs with several uses of higher-order functions, where the execution time was shorter than if the functions had been executed only on a GPU or only on an ordinary processor.

Acknowledgments

I would like to thank my supervisor David Broman for his feedback, guidance, and the interesting discussions. I would also like to thank the other members of the Miking group for their useful insights and comments. Finally, I would like to thank my friends and family for their unending support and understanding during what has been a long period of writing this thesis.

Contents

1 Introduction
  1.1 Parallelization Through Higher-Order Functions
  1.2 Decision Points for Parallelization
  1.3 Problem Statement
  1.4 Delimitations
  1.5 Contributions
  1.6 Research Method
  1.7 Ethics and Sustainability
  1.8 Outline

2 Background
  2.1 Miking
    2.1.1 MCore
  2.2 Algorithmic Skeletons
  2.3 Lambda lifting
  2.4 Microanalysis
  2.5 CUDA
    2.5.1 cuBLAS

3 Related work
  3.1 Lift
  3.2 Other Algorithmic Skeleton Implementations
    3.2.1 Non-GPU Based Implementations
    3.2.2 GPU Based Implementations
    3.2.3 MapReduce
    3.2.4 AnyDSL

4 Compiler Design
  4.1 Intrinsic Functions
    4.1.1 Data-parallel Intrinsics
    4.1.2 Other Intrinsics
  4.2 Code Generation
    4.2.1 Lambda Lifting
    4.2.2 OCaml Code Generation
    4.2.3 CUDA Code Generation
  4.3 Automatic Parallelization Decision Points
    4.3.1 The Heuristic
    4.3.2 Cost Profiles
    4.3.3 Calculating Execution Cost

5 Evaluation
  5.1 Method
    5.1.1 Tuning the Cost Profiles
    5.1.2 Benchmarks
    5.1.3 Number of elements per thread
    5.1.4 Measurements
    5.1.5 Threats of validity
  5.2 Tuning Results
  5.3 Benchmark Results
    5.3.1 Elements per Thread
    5.3.2 Lift Benchmarks
    5.3.3 SMC Airplane Benchmark
    5.3.4 cuBLAS Benchmarks

6 Discussion
  6.1 Design Choices for Code Generation
  6.2 Automatic Parallelization Method
  6.3 Benchmark Results

7 Concluding Remarks
  7.1 Conclusions
  7.2 Future work

Bibliography

A Training Program Implementations
  A.1 vecpo
  A.2 gcdsum
  A.3 matmaxsum

B Benchmark Implementations
  B.1 ATAX
  B.2 Convolution
  B.3 Matrix-Vector Multiplication (GEMV)
  B.4 Matrix Addition (GEAM)
  B.5 Matrix Multiplication (GEMM)
  B.6 Nearest Neighbour (NN)
  B.7 Rank-1 Update (GER)
  B.8 Saxpy
  B.9 Saxpy Single
  B.10 SMC Airplane
  B.11 Vector Addition (VecAdd)
  B.12 Vector Scaling (VecScal)

C Benchmark Results
  C.1 Lift Benchmarks
    C.1.1 ATAX
    C.1.2 Convolution
    C.1.3 Matrix-Vector Multiplication (GEMV)
    C.1.4 Matrix Multiplication (GEMM)
    C.1.5 Nearest Neighbour (NN)
  C.2 cuBLAS Benchmarks
    C.2.1 Matrix Addition (GEAM)
    C.2.2 Rank-1 Update (GER)
    C.2.3 Saxpy
    C.2.4 Saxpy Single
    C.2.5 Vector Addition (VecAdd)
    C.2.6 Vector Scaling (VecScal)

Glossary

Device  A name commonly used to refer to the GPU. Code executed on this platform is also referred to as device code.

Higher-order function  A function that can take a function as an argument or that can return a function. This thesis only focuses on higher-order functions that take other functions as arguments.

Host  The platform that manages the GPU. Code executed on this platform is referred to as host code. This is usually the same as CPU code.

Kernel  In the context of parallel programming, the name kernel is used to denote the entry point routine of a parallel computation, from which all threads start executing.

Macroanalysis  Asymptotic measurement of program behavior.

Microanalysis  Fine-grained measurement of program behavior, not discarding constants or non-dominant terms in the analysis.

Acronyms

API  application programming interface

AST  abstract syntax tree

CPU  central processing unit

DSL  domain-specific language

GPU  graphics processing unit

IR  intermediary representation

MIMD  multiple instruction multiple data

PDG  program dependence graph

SIMD  single instruction multiple data

SMC  simulated Monte Carlo

Chapter 1

Introduction

The notion of using parallel hardware to run multiple instructions at the same time goes back to the late 1950s, when the first digital computers were conceived. In a letter to the editor, Gorn [6] predicted a limit to the speed of serially performed arithmetic operations, concluding that parallelism would be required in order to keep increasing the speed of computations. In that same letter, Gorn identified problems with programming for parallel hardware, such as the need for code that synchronizes parallel threads. Looking forward 60 years at libraries such as MPI [7] and POSIX threads [8], this is mostly true for general parallelization. For more data-parallel frameworks such as OpenMP [9], however, you do not have to specify the thread synchronization. Instead, OpenMP may require other parallelization information, such as which values are shared between threads rather than unique to each thread.

void saxpy(size_t n, float a, const float *x,
           const float *y, float *res) {
    size_t i;
#pragma omp parallel for
    for (i = 0; i < n; i++)
        res[i] = a*x[i] + y[i];
}

Listing 1.1: A parallel saxpy implemented using OpenMP. A saxpy function corresponds to the linear algebra operation $\vec{s} = a\vec{x} + \vec{y}$.

The methods described above are examples of explicit parallelism. Another approach that has been explored is to automatically detect parallelism in existing languages through methods such as the program dependence graph (PDG) [10] and SIMD vectorization techniques [11, 12].

These methods analyze existing code to find statements that are data-parallel and rewrite them such that the code is executed in parallel. An example of data-parallel code is a standard for-loop without any dependencies on previous iterations.

Hardware that supports data-parallel operations goes back to at least the late 1960s in the form of processor array architectures [13]. An example of this architecture is the Massively Parallel Processor that was designed by NASA for high-rate satellite image processing [14]. Support for data-parallelism has in later years been integrated into mainstream central processing units (CPUs) in the form of single instruction multiple data (SIMD) extensions.

In the early 2000s, it was realized that graphics processing units (GPUs) could be used to run user-programmable operations with high data-parallel performance [15]. Following this realization, GPUs have continued to increase in popularity for performing data-parallel tasks [16]. Today, GPUs are used in areas such as machine learning [17], multibody dynamics simulation [18], and biological systems modeling [19] due to the data-parallel nature of the computations in those areas.

While the latest SIMD extensions available in mainstream CPUs can execute instructions on tens of values at the same time¹, today's mainstream GPUs can execute instructions on thousands of values at the same time [21, 22]. Despite the data-parallel speedup of GPUs, it is not always beneficial to run a procedure on a GPU if the data it operates on is too small. Effects such as latencies through drivers, kernel launching, and data transfers can constitute enough overhead to outweigh the gained speedup [23].

Today there exist multiple solutions that provide frameworks for general-purpose data-parallel computation on GPUs. Solutions such as OpenACC [24] and OpenCL [25] provide a multi-target framework where data-parallel code is formulated such that it can target GPUs from multiple manufacturers. Other solutions such as CUDA [5] provide a more bare-metal approach where the more detailed aspects of the GPU are exposed to the developer. However, this bare-metal approach limits the use of CUDA to GPUs from a single manufacturer.

Something that all the GPU frameworks listed above have in common is

1Intel’s AVX-512 architecture can execute instructions on 32 registers at the same time [20]. CHAPTER 1. INTRODUCTION 3

__kernel void saxpy(float a,
                    __global const float *x,
                    __global const float *y,
                    __global float *res)
{
    int i = get_global_id(0);
    res[i] = a*x[i] + y[i];
}

Listing 1.2: Saxpy implemented using OpenCL.

that the programmer is required to write code with the intention that it will be parallelized. While these methods of explicit parallelism alone present a learning curve for the programmer, extracting the full potential of GPUs often requires implementing complex optimization techniques that usually have nothing to do with the problem that the programmer is trying to solve. Having a method of implicit parallelism could not only reduce the learning curve for programmers but also automatically provide many of the optimizations used to extract the most out of the GPU.

1.1 Parallelization Through Higher-Order Functions

A different approach to parallelism that was presented around the same time as the PDG was the concept of algorithmic skeletons [1]. Instead of analyzing existing code or adding parallelism directives as with OpenMP, algorithmic skeletons implement parallelism through known higher-order functions with data-parallel properties. These higher-order functions range from simpler examples such as the Farm skeleton shown in Listing 1.3 to more complex examples such as the Divide and Conquer skeleton described by Equations 2.1 and 2.2. These higher-order functions would then at a lower level be optimized to run on parallel hardware. Using algorithmic skeletons, programs could be written as data-parallel procedures without having to specify the type of parallel hardware to run on.

The first implementations of algorithmic skeletons started appearing in the early '90s, targeting the MIMD architectures of that time as the parallel backend.

-- Function prototype for the Farm skeleton
skelFarm : (a -> b -> c) -> a -> [b] -> [c]

let f = lam env. lam l. mem l env.labels in
let memlist = {labels = ["apple", "potato"]} in
let labels = ["pear", "apple", "coconut"] in

utest skelFarm f memlist labels with [false, true, false] in ()

Listing 1.3: A simple example usage of a Farm skeleton that is hypothetically implemented in the MCore language [26]. The Farm skeleton applies a function to all elements in a list together with a common environment argument. The types a, b, and c represent polymorphic types.

These implementations provided parallelized algorithmic skeletons either as libraries for existing languages [27, 28, 29, 30, 31] or integrated into the languages directly [32, 33, 34, 35], mainly using message-passing APIs such as MPI to parallelize the skeletons.

The rise of general-purpose computations on GPUs saw the focus shifting towards implementations that optimized data-parallel skeletons for a GPU backend, using APIs such as CUDA and OpenCL for parallelization. Notable early implementations of GPU-optimized skeletons are SkePU [36] and SkelCL [37], both of which are libraries for C++. Another implementation, by Sato and Iwasaki [38], integrated algorithmic skeletons into the C language, using a source-to-source compiler to convert higher-order function invocations into CUDA code.

Using a library for an existing language allows GPU-optimized algorithmic skeletons to be integrated into existing programs, but is also limited by the compiler used for the host language. The drawbacks of this are the extent to which skeleton invocations can be optimized and that some memory management has to be performed by the programmer. In the case of SkePU and SkelCL, this means that data has to be converted to use their Vector wrappers. Integrating algorithmic skeletons directly into the compiler can mitigate much of this, but at the risk of making the code less portable.

Another approach to algorithmic skeletons, adopted by AnyDSL [39] and Lift [2], is to directly include the notion of algorithmic skeletons in a functional intermediary representation (IR). The IR is then the target language for a domain-specific language (DSL) or a higher-level IR that through high-level semantic information can translate code into lower-level algorithmic skeletons. AnyDSL and Lift then provide the intrinsics needed to optimize an algorithmic skeleton for GPU.

1.2 Decision Points for Parallelization

While algorithmic skeletons take away many of the intricacies of GPU optimization, they do not on their own address the issue that some inputs can cause a GPU-optimized program to run slower than if it had run on a CPU. For which inputs it is more efficient to execute a procedure on a GPU rather than on a CPU is not always obvious and varies from procedure to procedure. For demanding procedures such as matrix multiplication, it might be beneficial to execute on a GPU even for small inputs. But for more lightweight procedures such as scaling a vector, it might not be beneficial to execute on a GPU until the input size is very large, if it ever becomes beneficial at all. The input size itself might not be known at compile-time, in which case neither the GPU nor the CPU approach would be guaranteed to be the fastest option.

A programmer could manually insert decision points in the code to avert this, where a condition checks the input properties and decides which of the two approaches should be used for the procedure. This condition could, for example, attempt to predict the execution time of both approaches, selecting the approach with the lowest predicted execution time. If the time spent making the decision is negligible, a conditional expression based on input properties could choose the fastest option at runtime.

Determining execution time based on input properties is widely used today in the form of the O-notation [40, 41], which expresses the execution time in terms of the dominant terms, discarding any constants that do not change the asymptotic behavior. This method of asymptotically determining execution behavior has been collectively referred to as macroanalysis [42, 43, 44]. While discarding non-dominant terms and constants makes for a simpler analysis, macroanalysis only describes how the procedure will behave for large inputs. For smaller inputs, a non-dominant term could contribute more to the overall behavior if that term has a large enough constant. Thus it is important not to discard the constants in the analysis when trying to determine the execution behavior over a range of input values. This method of fine-grained analysis is referred to as microanalysis [42, 43, 44, 4]. An example of how macroanalysis and microanalysis differ is shown in Listing 1.4.

for (int i = 0; i < n; i++)
    for (int j = 0; j < 10000; j++)
        a[i] += (i+1) * (j+1);

for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        a[i] += j;

// Macroanalysis = O(n^2)

// Microanalysis =
//   C_ivar + n *
//     (C_icmp + C_iinc + C_ivar + 10000 *
//       (C_icmp + C_iinc + C_iload +
//        3*C_iadd + C_imul + C_istore)) +
//   C_ivar + n *
//     (C_icmp + C_iinc + C_ivar + n *
//       (C_icmp + C_iinc + C_iload + C_iadd + C_istore))

Listing 1.4: Example of a macroanalysis and a microanalysis of the same C procedure to determine execution time. The values with the C_ mnemonic in the microanalysis denote predetermined costs for individual operations.

Microanalysis and macroanalysis are not necessarily constrained to being performed mechanically. It is, for example, commonplace today to perform macroanalysis by hand to determine the efficiency of algorithms. However, if the goal is to provide the programmer with a higher-order function as a simple interface to GPU optimization with the optimal backend for certain inputs, then it defeats the purpose of simplicity to demand that the programmer first perform a microanalysis. Because of this, mechanical microanalysis has been suggested as a useful tool in automatic parallelization [4].

While mechanically performing microanalysis of execution time for an arbitrary procedure would be useful, it is also impossible to construct a program that correctly determines the execution time, since that would provide information about whether the procedure is going to terminate or not. Despite this, there have been several attempts at mechanical microanalysis in order to estimate execution time [45], as well as attempts at using the resulting estimation to optimize the backend options [4]. Experimental results show that there can be benefits to using microanalysis

for automatic parallelization [4], though the microanalyses in these cases have not focused on GPU optimization and have been limited to very simple programs.

In the context of parallel optimization, estimating the execution time can also be applied to tuning certain parameters in parallel programs. An example of this is SkePU, which was able to tune the number of threads for OpenMP or tune the grid size in CUDA [46]. But instead of using a microanalysis, SkePU used a heuristic based on a genetic algorithm with machine learning to estimate execution cost.
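To make the idea of a runtime decision point concrete, the sketch below shows roughly what such a check could look like if written by hand in MCore. It is a minimal sketch and not part of the compiler described later in this thesis: the cost functions use made-up placeholder constants instead of a real microanalysis, and both backend functions simply call the ordinary map higher-order function, standing in for generated host and device code.

-- A minimal sketch of a manual parallelization decision point.
-- The cost constants and both backend functions are placeholders only.
let predictCpuCost = lam n. muli n 5 in        -- placeholder: 5 cost units per element
let predictGpuCost = lam n. addi 1000 n in     -- placeholder: fixed overhead plus 1 unit per element
let runOnCpu = lam f. lam s. map f s in        -- stands in for generated host code
let runOnGpu = lam f. lam s. map f s in        -- stands in for generated device code

let decidedMap = lam f. lam s.
  if lti (predictGpuCost (length s)) (predictCpuCost (length s))
  then runOnGpu f s
  else runOnCpu f s
in
utest decidedMap (lam x. muli x x) [1, 2, 3] with [1, 4, 9] in ()

For this small input the predicted GPU cost exceeds the predicted CPU cost, so the CPU branch is taken; for a sufficiently large sequence the comparison would flip and the GPU branch would be chosen instead.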

1.3 Problem Statement

Whilst the attempts at integrating parallelizable algorithmic skeletons into software have been numerous, recent years have seen the focus shifting towards integrating support for GPU-optimized algorithmic skeletons directly into the programming languages [2, 39]. These recent approaches focus more on giving the programmer the freedom to optimize the parallelism, instead of focusing on providing a set of higher-order functions that are directly applicable to solving the problem at hand, such as GPU-optimized versions of map and divide & conquer that automatically attempt to optimize the invocation to extract the most performance out of the GPU. While these solutions abstract away the need to explicitly write GPU code, whether or not the skeletons will be run in parallel is decided at compile-time. Previous efforts at mechanical microanalysis have discussed and presented the potential to determine at runtime whether or not to run code in parallel, though no example is known to us where this has been implemented and tested.

The problem statement of this thesis is therefore two-fold:

1. Design an extension of language primitives for a small functional language that provides commonly used data-parallel higher-order functions such that they can be compiled for and run on a GPU, utilizing the parallelism of GPUs to provide a speedup in execution time, or be compiled for and run on a CPU.

2. Design a method based on mechanical microanalysis that at compile-time is able to generate decision points for each of the parallelizable higher-order functions, which at runtime can decide whether they should be executed on a GPU or on the CPU.

These functionally equivalent higher-order functions should be able to replace the non-parallel implementations and still produce the same results. Moreover, they should also be able to generate code for multiple backends: the non-parallel code should be generated for a CPU backend, while the parallel higher-order functions should be generated for a GPU backend. It is also of interest to determine what code can be compiled to the GPU. The microanalysis should be able to be integrated into a compiler such that it automatically generates an expression that at runtime can determine whether or not to run a higher-order function on the GPU.

1.4 Delimitations

The problem statement opens up for investigation of many different aspects that are out of the scope of this thesis. As such, this thesis investigates the stated problem for a single language, a small set of higher-order functions, a single method of mechanical microanalysis to generate decision points, a single backend for GPU code, and a single backend for CPU code.

The programming language MCore [26, 3] is used to investigate the problem statement. MCore is a small functional language that closely resembles the typed lambda calculus. This language is extended with a select few intrinsic higher-order functions that are able to perform automatic GPU optimization. More specifically, the compiler only focuses on the desugared aspect of MCore, known as MExpr. An example of MExpr code is shown in Listing 1.5.

let sqri = lam x. muli x x in
let sqraddi = lam x. addi (sqri x) in

utest sqraddi 6 8 with 44 in ()

Listing 1.5: A simple example of MExpr with a function sqraddi that squares the integer on the left-hand side and adds it to the integer on the right-hand side.

The set of higher-order functions selected to be implemented as intrinsics for MCore are map, mapi, and init. These higher-order functions are selected because they exhibit some form of data-parallelism and do not rely on inter-thread communication. These criteria excluded

other commonly used higher-order functions such as reduce that exhibit data-parallelism but rely on inter-thread communication to merge the results. The details of the selected set of higher-order functions are described in Section 4.1.1.

For every selected higher-order function, the compiler provides the three intrinsics seq, cuda, and predictive, which respectively execute on host, execute on device, or determine at runtime which backend to execute the higher-order function on. In practice, these intrinsics could be unified under a single intrinsic whose behavior is determined through a compiler option. To simplify the implementation, no unified intrinsic is provided and it is up to the user to specify the desired behavior in the code. The different intrinsic variants of the selected higher-order functions and how they are used are described in Section 4.1.1.

To further simplify the implementation, the function passed as an argument to the intrinsic higher-order function must be statically provided and cannot be passed through a variable. This makes it possible to know at compile-time exactly which functions should have device code generated, as well as which device code should be invoked. It follows from this that if the function is to be partially applied, it must be done at the invocation of the intrinsic higher-order function.

The method used for performing the mechanical microanalysis is based on a heuristic assumption about how the recursion looks for input functions to the intrinsics. This heuristic assumes a certain type of polynomial time complexity for input functions, and will thus by design give incorrect estimates for some inputs. To calculate the execution time, a profiling method similar to what is found in related work [45, 4] is used, except that two profiles are used in this thesis: one for when code is executed on host and one for when code is executed on device.

The compiler constructed in this thesis focuses mainly on code generation, ignoring stages such as parsing and type checking. This means that all input programs to the compiler must be pre-parsed programs, and expressions must be statically typed if the type is at any point needed. The reason for bypassing these steps is that they are generally well understood and their implementation is of little interest to this thesis.

The chosen code generation backends are OCaml [47] for host code and CUDA [5] for device code.

Using OCaml as the backend for host code allows for simple code generation due to its similar syntax to MCore, and it also handles host-side garbage collection for natively compiled binaries. The choice of CUDA as the backend for device code was partly motivated by existing access to supported hardware, but also because CUDA is available through a C++ interface that compiles to native binaries, making it possible to link generated CUDA code with the generated OCaml code through the OCaml C interface [48]. The generated OCaml code will be compiled by ocamlopt and the generated CUDA code will be compiled by nvcc. The output from both compilers will be linked together to form a single binary executable.

When compiling higher-order functions to device code, there has to be a decision on how the parallelism should look, i.e., how individual elements in the input array get assigned to threads in CUDA. In the case of the selected higher-order functions, where each input element has the same operations applied in parallel, the naive approach would be to assign one element per thread in CUDA. The risk of giving each element its own thread is that a large input array would also launch a large number of threads, possibly resulting in a slowdown from thread scheduling. Optimizing the number of threads to use in host-side parallelism has been the subject of previous work [4, 46]. For CUDA, however, it has been claimed that there is no time penalty for scheduling threads [21, 36]. As this claim is viewed with a bit of skepticism, the effect that the number of CUDA threads used by the generated device code has on execution time is also subject to investigation in this thesis. How this claim is tested is described in Section 5.1.3.
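To illustrate the delimitations above, the sketch below shows how the intrinsic variants of map might be invoked from MCore. The names seqmap, cudamap, and predictivemap are hypothetical placeholders used only for this sketch; the actual intrinsic names and their exact signatures are the ones described in Section 4.1.1.

-- Hypothetical invocations of the three intrinsic variants of map.
-- seqmap, cudamap, and predictivemap are placeholder names, not the
-- actual intrinsics from Section 4.1.1. The function argument is
-- given statically, as required by the delimitations above.
let addfive = lam x. addi x 5 in
let s = [1, 2, 3, 4] in

let r1 = seqmap addfive s in         -- always executes on host
let r2 = cudamap addfive s in        -- always executes on device
let r3 = predictivemap addfive s in  -- chooses host or device at runtime
utest r1 with [6, 7, 8, 9] in ()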

1.5 Contributions

The main contributions of this thesis include:

• A method of compiling intrinsic higher-order functions in a functional language into a mixture of host and device code. The higher-order functions implemented as intrinsics in this thesis are chosen to be direct analogs of commonly used higher-order functions, unlike previous work which either focuses on flexibility in the choice of backends [39] or on using higher-order functions for optimizing the parallelism [2]. The compiler introduced in Chapter 4 is able to compile to two different target languages and link them together.

• A method of determining at runtime whether an intrinsic higher-order function should execute on host or on device. While previous work has looked at using microanalysis for runtime parallelization decision points, it has focused on the analysis of simpler languages with non-recursive expressions [4]. The method from this thesis implements microanalysis for expressions that include recursion and generates parallelization decision points that at runtime decide which backend the code should run on. This aspect of the compiler is described in Section 4.3.

1.6 Research Method

The problem statement is addressed experimentally by constructing a compiler for MCore and evaluating the performance over a set of benchmarks that measure execution time. These benchmarks compare existing programs from related work against programs that have been written in MCore and compiled with the compiler constructed in this thesis.

The compiler used to compile and perform GPU optimization on MCore is constructed in a manner where it mainly performs the code generation and any abstract syntax tree (AST) transformations required to generate the code. Besides the code generation and AST transformations, the compiler also performs the microanalysis to generate runtime parallelization decision points.

The speedup in execution time is measured in wall-clock time on a set of problems that have been selected to have some real-world application and existing GPU implementations. These problems are implemented in MCore and compared against the existing GPU-optimized implementations. The MCore implementations have their speedup measured when the intrinsic higher-order functions run on host only, run on device only, and when the program at runtime determines whether to run an intrinsic higher-order function on host or device.

All measurements are made on a single hardware platform that consists of commercial components. The exact platform specifications are presented in Section 5.1.4. To get more context on the performance of the GPU-optimized programs, the nvprof profiler [49] is used for each benchmark. This profiler is used to collect information on different aspects of the GPU-optimized programs, such as how much time is spent on memory transfers and how much time is spent performing the actual computations. While other profilers such as Coz [50] could be used to get better readings of the execution time, the effects of running both profilers at the same time are unknown and are, due to a shortage

of time, not investigated in this thesis. As the context on the execution time of GPU-optimized programs is considered essential for analyzing the results, no profiler other than nvprof is used in the benchmarks.

To determine how many elements per thread should be used for GPU-optimized higher-order functions, as described in Section 1.4, the effect of the number of elements per thread is investigated on a benchmark where the performed operation is very simple, analogous to a single processor instruction. If the results show that multiple elements per thread can yield a performance increase, then that will be considered for the benchmarks that compare against existing solutions. If no significant performance increase is observed when assigning multiple elements to each thread, then one element per thread will be used for the benchmarks that compare against existing solutions.

1.7 Ethics and Sustainability

While this work has no direct significant environmental impact, optimization of code can reduce the load on computers, which in the end leads to lower power consumption. This argument is not as easily applicable to this thesis, though, since the optimization involves executing on different hardware.

A more relevant correlation to positive environmental impact is the ease of access to GPU optimization in programming languages. As many workstations today contain a graphics card, providing GPU optimizations in more and more programs can lead to significant speedup without the need to manufacture new hardware. In the case of workstations that do not contain a graphics card, GPU optimization could still provide a positive impact, since only a graphics card would have to be purchased in order to gain speedup, not necessarily replacing any of the existing components.

From a positive ethical standpoint, this work opens up the field of GPU optimization to a much larger demographic. Lowering the level of background knowledge needed to write effective programs can generally be considered a positive outcome of this work. While it could be argued that it can have a negative effect by eliminating existing jobs within GPU optimization, there likely are applications that benefit from fine-grained GPU optimization not available from intrinsic higher-order functions. GPU optimization through higher-order functions limits the use of parallelism to levels that are not as suitable for applications that have much

performance to gain from customizable inter-thread communication on the GPU or by exploiting hardware support for certain operations.

1.8 Outline

The thesis is organized into chapters as follows:

• Chapter 2 explains key concepts and introduces background information used in this thesis.

• Chapter 3 describes previous work on parallel implementations of algorithmic skeletons, as well as previous work with execution models similar to algorithmic skeletons.

• Chapter 4 describes the design of the compiler used in this thesis to perform automatic GPU optimization.

• Chapter 5 outlines the method used to address the problem statement and presents experimental results. The contents of this chapter are complemented by Appendices A, B, and C.

• Chapter 6 discusses the experimental results, the compiler design, and the premises of this thesis.

• Chapter 7 summarizes the conclusions from this thesis and outlines future work.

Chapter 2

Background

This chapter introduces background information to the work performed in this thesis.

2.1 Miking

The Miking project aims to create a meta-language framework for developing domain-specific and general-purpose languages [3]. The three main aspects focused on by the Miking framework are interactive programmatic modeling, sound language composition, and self-learning compilation. At the center of the Miking framework is the small functional language MCore.

2.1.1 MCore

MCore [26] is a small functional language designed as a target language for high-level compilers. The design of MCore aims to enable further compilation into native binaries, execution through an interpreter, or compilation up into higher-level languages such as JavaScript and OCaml.

MCore itself is divided into two parts, MExpr and MLang. MExpr is a language where the entire program is defined as a single expression that closely resembles the typed lambda calculus [51, 52]. MLang is a form of syntactic sugar for MExpr that includes features such as a defined top-level with multiple expressions and a system for combining semantics and syntax from individual language fragments. During the MCore compilation process, the parsed MLang-specific AST nodes are transformed into MExpr-compatible AST nodes.


MExpr

MExpr is a small functional language based on the typed lambda calculus.¹ An MExpr program is a single expression, which internally can contain additional expressions to allow for more complex behavior. There is no distinction between which expressions can be contained in another expression and which expressions can make up the program itself. It is therefore impossible to determine the nesting depth of a parsed expression without any context. The design of MExpr is intended to be sufficiently low-level to make compilation to native targets easy, but still provide some high-level functionality that is likely to exist in a high-level target.

Expressions can be bound to identifiers in MExpr using the let-expression. An expression bound with a let-expression has access to the same context as its parent let-expression, but not the other way around. The identifier of the bound expression is not available in the body of the bound expression itself. For a bound expression to access the identifier that it is bound to, the binding of the expression must occur in a recursive-expression. Multiple expressions can be bound in the same recursive-expression, in which case they are allowed to perform mutually recursive function calls. This is similar to the letrec-statement found in ML languages. An example of a recursive-expression can be found in Listing 2.1.

While functions do not formally exist in MExpr, the name function is used to refer to a let-expression where the first expression is a lam-expression, such as with the let-expression that binds the fib identifier in Listing 2.1. A lam-expression behaves identically to a λ-term in the typed lambda calculus [53, p. 103].

To represent a sequence of elements, MExpr introduces the sequence construct. While languages such as OCaml have different types of sequences, such as lists and arrays, to optimize for different scenarios, MExpr only has a single type of sequence. A sequence in MExpr allows for prepending elements, as is common with lists, but also for indexed accesses, as is common with arrays. It is then left to the compiler to optimize the sequence implementation for its purpose.

¹ At the time of writing this thesis, the MCore type checker is work-in-progress. Thus, most of the work carried out was done in an untyped lambda calculus environment.

1  mexpr
2  let fib = lam n.
3    recursive
4    let fib_helper = lam x. lam y. lam c.
5      if eqi c n
6      then x
7      else fib_helper y (addi x y) (addi c 1)
8    in
9    fib_helper 0 1 0
10 in
11 utest fib 6 with 8 in ()

Listing 2.1: A Fibonacci function written in MExpr. The utest-expression on line 11 is an inline unit test that checks equality between the left-hand side and the right-hand side of the with keyword.

To assist with primitive or otherwise very fundamental operations, MExpr implements a set of builtin intrinsic expressions that are known to the compiler. An example of a primitive builtin intrinsic is addi, which performs addition between two integers. Another example of a non-primitive builtin intrinsic expression is readFile, which reads data from an input file.
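As a small, self-contained illustration of the sequence construct and a few builtin intrinsics, the sketch below uses only operations that also appear in the listings of this thesis (cons, head, length, addi, and utest); it is not intended as a description of the full set of builtins.

-- Prepending to a sequence and inspecting it with builtin intrinsics.
let s = [10, 20, 30] in
let s2 = cons 5 s in                       -- prepend as with a list: [5, 10, 20, 30]
utest length s2 with 4 in                  -- array-like length query
utest head s2 with 5 in
utest addi (head s2) (head s) with 15 in   -- 5 + 10
()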

MLang

MLang is syntactic sugar for MExpr, and most programs in MCore are written in MLang. Features of MLang include a defined top-level for binding expressions to identifiers, a system for defining composable language fragments with syntax and semantics, and a system for including other MCore source files. An example of how to define and compose language fragments is shown in Listing 2.2.

The top-level in MLang is a system for providing global definitions in a file. Unlike MExpr, which only has locally chained let-expressions for bindings, the global definitions in the MLang top-level can be included into other files, allowing for modularization of function declarations and other definitions. Other types of definitions that can be defined at the MLang top-level include data types, constants, and language fragments.

A language fragment in MLang is a set of abstract syntax definitions and semantics for interpreting that syntax.

lang Integer
  syn Expr = | Num Int | Succ Expr

  sem eval =
  | Num n -> Num n
  | Succ (Num n) -> Num (addi n 1)
  | Succ (Succ x) -> eval (Succ (eval (Succ x)))
end

lang IntegerCheck = Integer + Boolean
  syn Expr = | IsZero Expr

  sem eval =
  | IsZero (Num n) -> if eqi n 0 then True () else False ()
  | IsZero (Succ x) -> eval (IsZero (eval (Succ x)))
end

mexpr
use IntegerCheck in
utest eval (IsZero (Succ (Num 0))) with False () in
utest eval (Succ (Succ (Num 0))) with Num 2 in
()

Listing 2.2: Smaller language fragments in MLang that are combined into a larger language. This assumes the existence of a language fragment Boolean that defines True and False as part of the Expr syntax.

In the example shown in Listing 2.2, there are three language fragments, Boolean², Integer, and IntegerCheck, which all define Expr syntax and eval semantics. The IntegerCheck fragment combines syntax and semantics from Boolean and Integer, which allows the IntegerCheck fragment to rely on the Integer fragment to handle the eval semantics for Succ, without having to worry about the behavior of Succ beyond that it should not return itself upon evaluation.

All MLang-specific syntax is desugared and converted into MExpr. This allows the code generator to only focus on generating code for MExpr syntax.

² Assumed to be defined at another location, possibly included from another file.

2.2 Algorithmic Skeletons

Algorithmic skeletons, also known as parallel skeletons, were first introduced by Cole [1] and are the concept of expressing algorithms in terms of higher-order functions instead of primitive operations. For example, instead of solving a problem with a for-loop that applies a function to each element in a sequence, an algorithmic skeleton would solve this problem with an invocation of the map function. While this method requires the implementation of several higher-order functions, it allows for easier parallelization, as assumptions about data-level independence can be made in the context of a higher-order function. One of the simplest examples of an algorithmic skeleton is the Farm skeleton, previously presented in Listing 1.3 in the introduction chapter.

A more complex example of an algorithmic skeleton is divide & conquer (D_C), which is described in Equations 2.1 and 2.2 based on the original description by Cole [1]. The divide & conquer skeleton operates as a normal divide & conquer algorithm in that it splits the data, performs the operations on the individual elements, and joins the results. The divide & conquer skeleton generalizes this class of algorithms by providing an underlying higher-order function that allows the user to specify how to split the data, how to join the data, and the operation to perform on individual data elements.

$$\text{let } F = D\_C\ \mathit{indivisible}\ \mathit{split}\ \mathit{join}\ f\ \text{in} \tag{2.1}$$

$$F\,P = \begin{cases} f\,P & \text{if } \mathit{indivisible}\ P \\ \mathit{join}\ (\mathit{map}\ F\ (\mathit{split}\ P)) & \text{otherwise} \end{cases} \tag{2.2}$$

Using the divide & conquer skeleton described in Equations 2.1 and 2.2, it is possible to implement mergesort by defining indivisible, split, join, and f as shown in Listing 2.3. While these definitions do not provide any form of parallelization on their own, the map function in Equation 2.2 is known to exhibit data-parallelism. If the implementation of map provides some form of parallel optimization, then the mergesort implementation from Listing 2.3 also becomes parallel-optimized.

let indivisible = lam l. lti (length l) 2 in
let split = lam l.
  let len = length l in
  let mid = divi len 2 in
  [slice l 0 mid, slice l mid (subi len mid)]
in
recursive let join = lam l1. lam l2.
  if null l1 then l2
  else if null l2 then l1
  else if lti (head l1) (head l2) then
    cons (head l1) (join (tail l1) l2)
  else
    cons (head l2) (join l1 (tail l2))
in
let f = lam x. x in -- identity function

let mergesort = d_c indivisible split join f in
utest mergesort [4,0,3,3,7,5] with [0,3,3,4,5,7] in ()

Listing 2.3: A mergesort implementation for integers in MCore using the divide & conquer skeleton (d_c) from Equations 2.1 and 2.2.

2.3 Lambda lifting

It is commonplace in many functional languages that one can define functions within other functions such that they are not available outside of their parent function body. An example of this is shown in Listing 2.4. This can be convenient to avoid code duplication of some simple operation that is used frequently within a single function. However, sometimes it can be favorable to have all functions defined at the top-level, for example when compiling a functional language like MCore to a target language like C that requires all functions to be defined at a single top-level. Instead of rewriting the code to satisfy this requirement, the AST can be transformed by the compiler such that all functions are defined at the top-level. This process is known as lambda lifting and was introduced by Johnsson [54] in 1985.

While it might seem straightforward to lift a function out to the top-level, some special cases must be accounted for when the lifted function is placed in a new scope. Such a case is demonstrated in Listing 2.5, where lifting the inner function out to the top-level caused an additional argument to be generated.

let foo = lam x. lam y.
  let bar = lam z. addi (muli x z) z in
  bar (addi (subi x y) 2)
in
utest foo 11 10 with 36 in ()

Listing 2.4: A function defined inside of another function in MCore.

The cases that can occur and how to handle them are described in the algorithm by Johnsson [54], which applies to functional languages that use the let and letrec styles of function declaration, such as OCaml and MCore.

let lifted_bar = lam x. lam z. addi (muli x z) z in
let foo = lam x. lam y.
  lifted_bar x (addi (subi x y) 2)
in
utest foo 11 10 with 36 in ()

Listing 2.5: Same program as shown in Listing 2.4, but with the inner function bar lifted out to the top-level. Note the extra argument that was generated for bar.

2.4 Microanalysis

When analyzing the execution time of a program or an algorithm, a common method used today is to express the execution time asymptotically in terms of the dominating terms using the O-notation [41, 40]. More generally, this method of expressing the execution behavior of programs in terms of the dominating terms is referred to as macroanalysis [42, 43, 44]. While the resulting expression from macroanalysis is usually simple, certain situations require that execution behavior is expressed in terms of all individual terms, not just the dominating ones. An example of such a situation is when it is necessary to determine how the execution time of a program is going to behave over a range of input sizes, not just what happens when the input size tends to infinity. This fine-grained method of determining execution behavior is referred to as microanalysis [42, 43, 44, 4].

The ability to mechanically perform microanalysis to estimate execution time has been stated to be of great importance in applications such

as compilers, which can use this information for optimization and automatic parallelization [4]. While it would be useful if microanalysis could precisely determine the execution time of a program, it is an undecidable problem to determine the execution time, as that would provide information about whether a program is going to terminate or not [43]. Despite this, there have been multiple attempts at mechanically estimating the execution time with microanalysis.

One of the earliest attempts to estimate execution time using microanalysis was the Metric system by Wegbreit [45], which analyzes simple Lisp programs. Metric uses a local cost table for assigning costs to primitive operations and language overhead activities. If a function is recursive, then a recursion analysis is performed, resulting in a difference equation. That difference equation is then passed into a solver which produces a closed-form expression for the performance of the analyzed function.

A more recent approach to microanalysis by Reistad and Gifford [4] is similar to that of Metric, but applies to any statically typed programming language with polymorphism. Another difference from Metric is that the cost reconstruction in the method by Reistad and Gifford becomes a constraint problem, whereas Metric uses a difference equation in this scenario. The method by Reistad and Gifford also has the limitation that it does not consider user-defined recursion. Instead, their approach focuses more on pre-existing higher-order functions such as map and reduce to express operations on a variable number of elements.
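As an illustration of the kind of result such a recursion analysis can produce, consider a recursive function that performs one constant-cost step per element of a list of length n. With hypothetical cost constants $C_{\mathrm{rec}}$ for each recursive step and $C_{\mathrm{base}}$ for the base case (not taken from Metric or from Reistad and Gifford), the analysis yields a difference equation and its closed-form solution:

$$T(n) = C_{\mathrm{rec}} + T(n-1), \qquad T(0) = C_{\mathrm{base}}$$

$$\Rightarrow \quad T(n) = n \cdot C_{\mathrm{rec}} + C_{\mathrm{base}}$$

The closed form retains the constants, so it can be evaluated for any concrete input size rather than only describing the asymptotic behavior.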

2.5 CUDA

Following the rise in popularity of executing user-defined procedures on GPUs [15], Nvidia introduced the CUDA programming model in 2006, which exposes the parallelism of Nvidia GPUs such that they can be used for general-purpose computation [21, 5, 55]. While CUDA can be used for most types of procedures, the popular uses of CUDA are usually procedures that exhibit high data-parallelism, such as matrix multiplication. An example of a data-parallel procedure implemented in CUDA is shown in Listing 2.6.

In the context of CUDA it is common to use the names device and host to refer to the different platforms involved in the execution process.

1  #define TPB (128) // threads per block
2  __global__ void gpu_addc(int n, int c, int *v)
3  {
4      // blockDim, blockIdx, threadIdx are automatically
5      // available in the kernel
6      int idx = (blockDim.x * blockIdx.x) + threadIdx.x;
7      if (idx < n)
8          v[idx] = v[idx] + c;
9  }
10
11 void addc(int n, int c, int *v)
12 {
13     int *gpu_v;
14     cudaMalloc(&gpu_v, n * sizeof(int));
15     cudaMemcpy(gpu_v, v, n * sizeof(int),
16                cudaMemcpyHostToDevice);
17
18     gpu_addc<<<(n + TPB - 1) / TPB, TPB>>>(n, c, gpu_v);
19
20     cudaMemcpy(v, gpu_v, n * sizeof(int),
21                cudaMemcpyDeviceToHost);
22     cudaFree(gpu_v);
23 }

Listing 2.6: Part of a program written in CUDA C++ that uses the GPU to add a constant c to all elements in a vector v.

The name device is used to refer to the GPU executing CUDA code, and the name host is used to refer to the platform that launches the CUDA application. Similarly, code designated to execute on the device is referred to as device code, and code designated to execute on the host is referred to as host code, which is usually some form of CPU code.

A procedure in CUDA is executed by multiple threads through a kernel [21]. A kernel is the entry point of a CUDA procedure, from which all threads start executing the same code. To make the CUDA procedure able to operate on different data, each thread is assigned one or more indices that can be used to determine which data any given thread should execute on. This process of determining which data a thread should execute on is demonstrated on Line 6 in Listing 2.6.

To achieve parallelism, these threads are distributed across thread blocks, where each thread block in turn executes on a single streaming multiprocessor on the GPU [21]. A single GPU contains multiple streaming multiprocessors [22], which in turn allows thread blocks themselves

to execute in parallel.

CUDA is officially available through the CUDA C++ language, but frameworks are available for other languages as well [56, 57]. Programs written in CUDA C++ can be compiled with the nvcc compiler from Nvidia [58].

2.5.1 cuBLAS

cuBLAS (CUDA Basic Linear Algebra Subroutines) is an API developed by Nvidia for performing various linear algebra operations in CUDA [59]. With cuBLAS the programmer is still required to handle device memory allocation as when using a regular CUDA kernel, but the linear algebra operations implemented by cuBLAS are callable from host code, eliminating the need to write any device code. An example implementation of saxpy using cuBLAS operations is shown in Listing 2.7.

void saxpy(int n, double a, const double *x,
           const double *y, double *out)
{
    double *d_x, *d_y;
    cublasHandle_t handle;

    cudaMalloc(&d_x, n * sizeof(double));
    cudaMalloc(&d_y, n * sizeof(double));
    cublasCreate(&handle);
    cublasSetVector(n, sizeof(double), x, 1, d_x, 1);
    cublasSetVector(n, sizeof(double), y, 1, d_y, 1);

    cublasDaxpy(handle, n, &a, d_x, 1, d_y, 1);

    cublasGetVector(n, sizeof(double), d_y, 1, out, 1);
    cublasDestroy(handle);
    cudaFree(d_x);
    cudaFree(d_y);
}

Listing 2.7: Performing a saxpy of doubles using cuBLAS.

Chapter 3

Related work

This chapter describes existing frameworks that provide some form of GPU optimization through higher-order functions.

3.1 Lift

Lift is a functional IR that performs GPU code generation based on known higher-order functions [2]. The higher-order functions in Lift build on top of the notion of algorithmic skeletons introduced by Cole [1], where the functions are designed to be composed with each other in order to solve a large variety of problems. These functions include analogs to common higher-order functions such as zip, reduce, and map.

partialDot(x: [float]N, y: [float]N) =
  (join ◦ mapWrg0(
    join ◦ toGlobal(mapLcl0(mapSeq(id))) ◦ split1 ◦
    iterate6(join ◦
      mapLcl0(toLocal(mapSeq(id)) ◦
        reduceSeq(add, 0)) ◦
      split2) ◦
    join ◦ mapLcl0(toLocal(mapSeq(id)) ◦
      reduceSeq(multAndSumUp, 0)) ◦ split2
  ) ◦ split128)(zip(x, y))

Listing 3.1: A partial dot product implemented in Lift. Based on a code example from Steuwer, Remmelg, and Dubach [2].

Lift targets OpenCL [25] as its backend and uses it to compile the generated code for the GPU. While Lift uses higher-order functions to express


parallelism, it still exposes a lot of low-level flexibility in functions such as mapWrg, toLocal, and asVector. This is because Lift is not intended to be used directly by a programmer, or even to be the target language of a DSL, but instead to serve as a target language for a more generic functional IR where higher-order functions are expressed in terms of lower-level Lift functions. The role of Lift in the code generation landscape is visualized in Figure 3.1. This method of code generation differs from the one used for MCore in that MCore is designed to be the primary target language of a DSL or general-purpose language compiler. An example of Lift implementing a higher-order function using a combination of intrinsic functions with low-level information is shown in Listing 3.1.

[Figure 3.1 diagram: DSLs / skeletons (matrixmul, stencil, ...) are lowered to a generic functional IL/IR (map, reduce, zip, ...), which maps parallelism through fixed implementations, optimization, and auto-tuning; at the lowest level the OpenCL-based Lift IL/IR (mapGlobal, toLocal, ...) performs optimization and code generation, producing GPU code.]

Figure 3.1: The GPU code generation landscape of Lift, where higher-level compilers can perform intermediary optimizations into lower-level representations. In this landscape, Lift sits at the bottom, generating GPU code from a function representation after the necessary optimizations have been performed. This figure is based on Figure 1 in the paper by Steuwer, Remmelg, and Dubach [2].

3.2 Other Algorithmic Skeleton Implementations

This section covers other frameworks that enable some form of parallelization of algorithmic skeletons but are not subject to further investigation

in this thesis.

3.2.1 Non-GPU Based Implementations

Following the original book on algorithmic skeletons [1], the early ’90s saw many attempts at gaining parallel speedup through an algorithmic skeleton framework. As most of these frameworks predate GPUs as well as multi-core processors, the backend for parallelism mostly relied on message-passing APIs such as the then up-and-coming MPI standard [7], targeting processor clusters such as the AP1000 supercomputer [60] to gain parallel speedup. This is further reflected in the fact that early attempts focused on MIMD-suitable skeletons such as Pipe [32]. One of the first implementations of algorithmic skeletons was the Hope+ implementation by Darlington et al. [32], a functional language that targets a C backend. It provided parallelism through a small set of pre-defined skeletons such as Farm and Divide & Conquer. Many of the following implementations were based on the same principle: the algorithmic skeletons were constructed in a designated language, which was then compiled to another language that had support or pre-existing APIs for parallelism. Examples of such implementations are:

• SPP(X), which is based on a skeleton coordination language and targets a Fortran-77 and MPI backend [33],

• Skil, which is a functional language that targets a message-passing dialect of C [34], and

• P3L, which is based on an instance-based approach and targets a C and MPI backend [35, 61].

Following these early algorithmic skeleton languages, solutions started appearing that integrated algorithmic skeletons as libraries in existing languages. An early example of a library that provided algorithmic skeletons to an existing language is Muesli [27], a C++ library that uses MPI as the backend to parallelize algorithmic skeletons. The Muesli library provides the distributed data structures DistributedArray and DistributedMatrix, which through their member methods can provide parallel optimization. One such method is mapIndexInPlace, which provides a parallel map implementation for DistributedArray, shown as an example in Listing 3.3.

double monte_carlo(double x0, double y0, double h, int n)
{
    double res;
    a = array_create(1, {n,0}, {0,0}, {-1,-1},
                     rand_traj(x0,y0,h,n/netSize),
                     DISTR_DEFAULT);
    res = array_fold(ident, (+), a) / n;
    array_destroy(a);
    return (res);
}

Listing 3.2: A Skil implementation of a Monte Carlo algorithm for solving partial differential equations. Based on a code example from Botorog and Kuchen [34].

inline int init(const DistributedArray &A, int i)
{ return A.get(i); }

inline int foo(int c, int i, int Ai)
{ return c + Ai; }

DistributedArray addc(const DistributedArray &A, int c)
{
    DistributedArray R(MSL_ARG1, curry(init)(A));
    R.mapIndexInPlace(curry(foo)(c));
    return R;
}

Listing 3.3: A C++ program that uses Muesli to add a constant c to all values in a distributed array.

A different approach to algorithmic skeletons as libraries for existing languages, presented around the same time as Muesli, is the Java-based library Lithium [28]. The design of Lithium differs from other related work in that it uses Java remote method invocation to distribute the parallel workload over a network. Besides the parallelism, Lithium provided similar levels of abstraction through algorithmic skeletons such as Farm and Divide & Conquer. Several libraries based on C/C++ and MPI have been developed since the introduction of Muesli and Lithium. Examples of such libraries include eSkel [29], SkeTo [30], and Quaff [31]. Each of these has certain optimization or flexibility features that set them apart, though the target backend is still MPI and they all originally targeted

MIMD hardware architectures. This limited the parallel capability of some of the algorithmic skeletons to the number of cores in the available CPUs, or to those who had access to larger computation clusters.

3.2.2 GPU Based Implementations

While initial attempts at algorithmic skeletons focused on gaining parallel speedup from message-passing APIs, the emergence of highly parallel SIMD programming on GPUs saw a shift towards utilizing GPUs to gain even more speedup. One of the earliest examples of this was the fusion optimization framework by Sato and Iwasaki [38]. This framework utilized a source-to-source compiler for C that transforms the skeletons into CUDA code that can be executed on a GPU. The framework did not introduce any new syntax, so the skeletons could also be run sequentially on a CPU through a macro expansion API. While this framework does supply a dedicated compilation stage for the algorithmic skeletons, the hybrid approach of being available both as a library and as a language feature limits the expressiveness of the skeletons to that of the C language. Solutions with GPU-accelerated algorithmic skeletons as libraries were introduced shortly after the framework by Sato and Iwasaki. Two prominent libraries that were introduced around the same time were SkePU [36] and SkelCL [37], which were developed independently of each other. Both act as libraries for C++ and operate on the same data structure principle as Muesli. What primarily sets these two libraries apart is that SkelCL only targets an OpenCL backend for GPU parallelization, while SkePU targets both CUDA and OpenCL as GPU backends with the option of targeting OpenMP for CPU parallelization. Unlike the framework by Sato and Iwasaki, SkePU and SkelCL were originally only available as libraries for C++. Today SkePU provides a source-to-source compiler that transforms SkePU-integrated C++ programs to different backends [62].

3.2.3 MapReduce

A popular model for cluster computing with an interface to parallelism similar to that of algorithmic skeletons is the MapReduce model, which uses a map interface to dispatch workers and a reduce interface to merge the worker results [63]. Hadoop [64] and Spark [65] are well-known frameworks that

void main(int argc, char *argv[])
{
    SkelCL::init();
    /* create skeletons */
    SkelCL::Reduce sum(
        "float sum(float x, float y) { return x + y; }");
    SkelCL::Zip mult(
        "float mult(float x, float y) { return x * y; }");
    float *a_ptr = new float[ARRAY_SIZE];
    float *b_ptr = new float[ARRAY_SIZE];
    fillArray(a_ptr, ARRAY_SIZE);
    fillArray(b_ptr, ARRAY_SIZE);

    SkelCL::Vector A(a_ptr, ARRAY_SIZE);
    SkelCL::Vector B(b_ptr, ARRAY_SIZE);

    /* execute skeletons */
    SkelCL::Scalar C = sum(mult(A, B));
    float c = C.getValue(); /* fetch result */
    delete[] a_ptr; delete[] b_ptr;
}

Listing 3.4: A dot-product implemented in SkelCL. Based on a code example from Steuwer, Kegel, and Gorlatch [37].

implement this model, with Spark also having support for offloading work to GPUs [66]. While the MapReduce principle of using the higher-order functions map and reduce to solve problems coincides with the concept of algorithmic skeletons, these frameworks also expose a lot of control in the code over how the parallelism is carried out. This differs from the approach used in this thesis, which aims to abstract as much of the parallelism as possible through the semantics of higher-order functions.
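To make the pattern concrete, the following is a minimal OCaml sketch (not taken from any of the frameworks above) of a dot product expressed as a map step followed by a reduce step:

(* Dot product as "map" (element-wise multiply) followed by "reduce" (sum). *)
let dot xs ys =
  Array.map2 ( *. ) xs ys              (* map step    *)
  |> Array.fold_left ( +. ) 0.0        (* reduce step *)

let () =
  Printf.printf "%.1f\n" (dot [| 1.; 2.; 3. |] [| 4.; 5.; 6. |])  (* prints 32.0 *)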

3.2.4 AnyDSL

AnyDSL is a framework that uses partial evaluation for writing high-performance code [39]. It differs from previous frameworks in that it does not implement any algorithmic skeletons directly as higher-order functions, but instead enables the implementation of parallelized algorithmic skeletons through intrinsic higher-order functions that can target different parallel backends. The AnyDSL framework consists of two main

languages, Thorin and Impala. Thorin is an IR with partial evaluation, while Impala is described as syntactic sugar for Thorin. Examples of the intrinsic higher-order functions in Thorin that enable execution on parallel backends include parallel and cuda, which both accept functions as arguments, specifying the code to be executed in parallel. For these examples, a function passed to parallel will be executed in parallel on C++ threads, and a function passed to cuda will instead be executed in parallel on a GPU. However, these parallel backends require different sets of input arguments to specify the parallel execution properties for that backend. As such, relying only on passing input functions to these intrinsics for parallelization requires that the parallel backend is specified at the same point in the code where the function is used. To separate the code from how the parallelism is carried out, Thorin uses the concept of a generator, which is an abstract form of a loop that can be passed as an argument to a function. Using a generator, the backend can be passed as an argument to the function, instead of providing the function as an argument to the backend as was described for parallel and cuda above. This allows the programmer to write parallel procedures that assume the existence of some data-parallel loop, and that loop can later be passed as an argument to the procedure. An example of how this is applied to create a platform-independent map skeleton is shown in Listing 3.5.

let my_par_map = abs_mymap(par(8)); // user code

type Loop1D = fn(i32, i32, fn(i32) -> ()) -> ();
type MyMap = fn(fn(i32) -> i32, &mut[i32], i32) -> ();

fn abs_mymap(loop: Loop1D) -> MyMap {
    |f: fn(i32) -> i32, a: &mut[i32], n: i32|
    for i in loop(0, n) {
        let e = a(i);
        a(i) = f(e)
    }
}

fn @par(n_t: i32) -> Loop1D { |a,b,f| parallel(n_t,a,b,f) }

Listing 3.5: A platform-independent map implemented in Impala, using the parallel backend as a use case example. The par function creates a 1D generator that is used to specify how the abs_mymap function should execute. This code example is based on the paper by Leißa et al. [39] but has not been verified to execute.

Chapter 4

Compiler Design

This chapter covers the design of the compiler used to translate MCore programs into other high-level languages, which can be further compiled into native binaries. As stated in Section 1.6, this compiler only performs code generation, any AST transformations that might be needed, and the microanalysis to generate parallelization decision points. This means that the input to the compiler is in the form of an AST. As stated in Section 1.4, the compiler only focuses on the MExpr aspect of MCore. The translation of MLang into MExpr has been implemented as a proof of concept in the official MCore interpreter [26]. The compiler used in this thesis to compile MCore programs is itself implemented in MCore. Since this compiler is constructed specifically for this thesis and thus not necessarily representative of the future official MCore compiler, it is referred to as FLAGON1 to avoid any confusion with the future official MCore compiler. The compiler generates code for two target languages, OCaml and CUDA. OCaml is the target language for code intended to run on host. CUDA is the target for code intended to run on device, where the CUDA target code is the official CUDA C++ language [5]. From an input AST, the compiler generates two files, one OCaml file and one CUDA file. These files are then intended to be compiled and linked using the existing ocamlopt and nvcc tools. How this process is performed is further described in Section 5.1.4. All standard MCore code is generated only for OCaml. Code generation to CUDA is enabled by adding a set of data-parallel intrinsics which, when encountered, will generate CUDA code for that intrinsic that

1 Functional Language compiler with Automatic Gpu OptimizatioN


can be invoked from OCaml.

4.1 Intrinsic Functions

Some intrinsics are included in the FLAGON compiler that are not part of standard MCore, mainly for two reasons. Either an intrinsic is implemented as one of the higher-order functions referred to in the problem statement, or an intrinsic is introduced to ensure that the compiler contains the functionality necessary or convenient to implement some of the benchmarks, such as a random number generator.

4.1.1 Data-parallel Intrinsics

Three semantically different higher-order functions are added as intrinsics to the compiler: map, mapi, and init. Each of these higher-order functions is implemented in the compiler as three different intrinsic functions: a variant that executes on host, a variant that executes on device, and a predictive variant that at runtime determines whether it should execute on host or device. The names of the variants for each of these higher-order functions are listed in Table 4.1.

Table 4.1: Overview of the data-parallel intrinsics implemented in the compiler. The left-most column lists the name of the abstract function that describes the functional behavior.

Function   Host Variant   Device Variant   Predictive Variant
map        seqMap         cudaMap          predictiveMap
mapi       seqMapi        cudaMapi         predictiveMapi
init       seqInit        cudaInit         predictiveInit

These three higher-order functions have been selected on the basis that they are commonly used functions with implementations in popular functional languages, that they exhibit perfect data-parallelism without the need for inter-thread communication, and that they have a predictable output size given that the type in the output sequence also has a predictable size. The map function takes a function as an argument and applies that function to all elements in a sequence. The mapi function works just like map, except that it also provides the index of the element being processed as an argument to the function. The init function, unlike

map and mapi, does not take a sequence as input and instead initializes a sequence where each element is initialized by a function that takes the index of the element as an input argument. For the functions seqMap, seqMapi, and seqInit, this behavior translates directly into MCore code. The function prototypes for these functions and use case examples are shown in Listing 4.1.

seqMap  : (a -> b) -> [a] -> [b]
seqMapi : (Int -> a -> b) -> [a] -> [b]
seqInit : Int -> (Int -> a) -> [a]

utest seqMap (addi 2) [0, 3] with [2, 5] in
utest seqMapi (addi) [0, 3] with [0, 4] in
utest seqInit 5 (addi 2) with [2, 3, 4, 5, 6] in

Listing 4.1: Signatures for the seqMap, seqMapi, and seqInit intrinsics and examples of how they operate. The types a and b represent polymorphic types.
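For comparison, the same behavior can be sketched with OCaml's standard Array module, which is also how the host variants are later generated (see Section 4.2.2); this snippet is only an illustration and not compiler output:

let () =
  let mapped  = Array.map  (fun x -> x + 2) [| 0; 3 |] in    (* [| 2; 5 |]          *)
  let mappedi = Array.mapi (fun i x -> i + x) [| 0; 3 |] in  (* [| 0; 4 |]          *)
  let inited  = Array.init 5 (fun i -> i + 2) in             (* [| 2; 3; 4; 5; 6 |] *)
  ignore (mapped, mappedi, inited)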

cudaMap, cudaMapi, cudaInit

The device variants cudaMap, cudaMapi, and cudaInit have the same functional effect as their host counterparts, except for an extra argument that specifies how many elements to run per CUDA thread. This extra "elements per thread" argument is evaluated at runtime and is not necessarily static when compiled to a native binary. The implications that this might have on runtime performance are discussed in Section 4.2.3. While it has been claimed that more elements per thread yield no advantage in CUDA [21, 36], the effect that the number of elements per thread has on execution time is still of interest to this thesis, as discussed in Section 1.4. How this effect is evaluated is described in Section 5.1.3. This extra argument also means that the function signatures for cudaMap, cudaMapi, and cudaInit look slightly different from their corresponding host variants, as shown in Listing 4.2. However, the arguments that follow the initial elements per thread argument are the same as for the host variants. As mentioned in Section 1.4, the device variants rely on the function argument being statically applied. This is not required for the

cudaMap  : Int -> (a -> b) -> [a] -> [b]
cudaMapi : Int -> (Int -> a -> b) -> [a] -> [b]
cudaInit : Int -> Int -> (Int -> a) -> [a]

Listing 4.2: Signatures for the cudaMap, cudaMapi, and cudaInit intrinsics. The first argument to each of these functions specifies the number of elements per thread. The types a and b represent polymorphic types. These functions are functionally equivalent to their host counterparts from Listing 4.1, given that the elements per thread argument is ≥ 1.

host variants, where the function is allowed to be passed through a variable. Statically applying the function allows the compiler to know exactly which functions to generate device code for, simplifying the implementation and avoiding generating device code for functions that will never be passed as arguments to one of these higher-order intrinsics. How support for non-static function applications could be implemented for the device variants is discussed in Section 6.1.

predictiveMap, predictiveMapi, predictiveInit

Each of the predictive variants predictiveMap, predictiveMapi, and predictiveInit retains the same signature and functional effect as its corresponding host variant, but instead attempts to predict at runtime which backend is the fastest and executes the higher-order function there. The signatures for the predictive variants are shown in Listing 4.3.

predictiveMap  : (a -> b) -> [a] -> [b]
predictiveMapi : (Int -> a -> b) -> [a] -> [b]
predictiveInit : Int -> (Int -> a) -> [a]

Listing 4.3: Signatures for the predictiveMap, predictiveMapi, and predictiveInit intrinsics. The types a and b represent polymorphic types. These functions are functionally equivalent to their host counterparts that are shown in Listing 4.1.

Unlike the non-predictive device variants, these are hardcoded to run with one element per CUDA thread. This also means that they have

identical function signatures to the higher-order functions they are based on. Each of the predictive variants generates code both for the corresponding host-only variant and for the corresponding device-only variant. Thus the requirement of statically applying the function carries over from the device-only variant. To select which of these variants should be run, an if-expression is generated by the compiler where the condition is an expression that attempts at runtime to predict which of the two variants will take the least amount of time to run. The expression is based on the applied function and any of its partially applied arguments. How this expression is constructed by the compiler and how it attempts to predict which of the two variants will take the least amount of time to run is described in Section 4.3.
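The following OCaml sketch illustrates the shape of this runtime selection; the cost functions, the cuda_map stand-in, and all numeric values are hypothetical and not taken from the FLAGON compiler:

(* A minimal, self-contained sketch of the selection a predictive intrinsic
   performs at runtime. *)
let f x = x + 1                          (* function applied by the map    *)

(* Placeholder cost expressions in terms of the input length n. *)
let cost_host n = 5 * n                  (* per-element host cost          *)
let cost_device n = 100_000 + 2 * n      (* fixed launch overhead + copies *)

(* Stand-in for the external CUDA-backed map; here it just falls back to
   Array.map so that the sketch runs without a GPU. *)
let cuda_map f xs = Array.map f xs

let predictive_map f xs =
  let n = Array.length xs in
  if cost_host n < cost_device n
  then Array.map f xs                    (* run on host   *)
  else cuda_map f xs                     (* run on device *)

let () =
  let result = predictive_map f (Array.init 10 (fun i -> i)) in
  Printf.printf "%d\n" result.(9)        (* prints 10 *)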

4.1.2 Other Intrinsics

Besides the data-parallel intrinsics, other intrinsics that are not part of standard MCore are added to the compiler to provide functionality that is required or convenient to implement some of the benchmarks in Section 5.1.2.

Float Intrinsics

The intrinsics sqrtf, expf, and logf compute √x, e^x, and ln(x), respectively, for a Float argument x. They are added as intrinsics since both OCaml and CUDA already provide efficient implementations of all three, as well as to save time on the compiler implementation.

sqrtf: Float -> Float
expf:  Float -> Float
logf:  Float -> Float

utest sqrtf 16.0 with 4.0 in
utest expf (logf 2.0) with 2.0 in ()

Listing 4.4: Signatures for the float intrinsics and examples of how they operate.

Randomness Intrinsics

To generate random numbers, two intrinsics are added: randUniformf and randNormalf.

randUniformf: Float -> Float -> Float
randNormalf:  Float -> Float -> Float

let urand = randUniformf 0.0 1.0 in
utest and (geqf urand 0.0) (leqf urand 1.0) with true in

let mu = 0.0 in
let sigma = 1.0 in
let nrand = randNormalf mu sigma in ()

Listing 4.5: Signatures for the randomness intrinsics.

randUniformf takes two Float arguments a and b and generates a uniformly distributed random Float in the range [a, b]. randNormalf takes two Float arguments mu and sigma and generates a Gaussian-distributed Float with mean value mu and standard deviation sigma. As neither randUniformf nor randNormalf accepts a seed argument, a program containing either of these intrinsics is not guaranteed to produce the same output over multiple runs. While a seed function could be implemented to eliminate parts of this uncertainty, it was not considered to be of interest to this thesis.
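A rough OCaml sketch of the semantics of these two intrinsics is shown below; the Box-Muller transform is one possible way to obtain a normally distributed value and is not necessarily what the compiler backends use:

(* Illustrative definitions only, not the generated code. *)
let rand_uniformf a b = a +. Random.float (b -. a)

let rand_normalf mu sigma =
  let pi = 4.0 *. atan 1.0 in
  let u1 = 1.0 -. Random.float 1.0 in      (* in (0, 1], avoids log 0 *)
  let u2 = Random.float 1.0 in
  let z = sqrt (-2.0 *. log u1) *. cos (2.0 *. pi *. u2) in
  mu +. sigma *. z

let () =
  Printf.printf "uniform: %f  normal: %f\n"
    (rand_uniformf 0.0 1.0) (rand_normalf 0.0 1.0)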

Iteration Intrinsics

As is later explained in Section 4.2.2, the FLAGON compiler generates MCore sequences as OCaml arrays. Due to this design decision, operations such as cons x xs take O(n) time as the entire xs array has to be copied to a different location. Therefore, to perform iterative operations on sequences, such as accumulative sums, a traverse intrinsic is added to the compiler. In the official MCore interpreter [26], sequences are not implemented as arrays and do not suffer from the same issue as the compiler in this thesis does. The traverse intrinsic takes three arguments: a function f, an accumulator value acc, and a sequence xs, where f is applied to all elements in xs in order. The first argument to f is the result of the application

traverse: (b -> a -> b) -> b -> [a] -> [b]

let accsumf = traverse addf 0.0 in
utest accsumf [2.0, 1.0, 3.0] with [2.0, 3.0, 6.0] in

Listing 4.6: Signature for the traverse intrinsic as well as an example accumulative sum implementation using traverse. The types a and b represent polymorphic types.

of f to the previous element in the sequence. When f is applied to the first value in xs, the first argument to f is the acc argument. Example usage of the traverse intrinsic is shown in Listing 4.6.
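The behavior of traverse corresponds to a prefix scan over the sequence. A small OCaml sketch of this semantics over arrays (an illustration only, not the generated code) is:

let traverse (f : 'b -> 'a -> 'b) (acc : 'b) (xs : 'a array) : 'b array =
  let n = Array.length xs in
  if n = 0 then [||]
  else begin
    let out = Array.make n (f acc xs.(0)) in
    for i = 1 to n - 1 do
      out.(i) <- f out.(i - 1) xs.(i)     (* previous result feeds the next *)
    done;
    out
  end

let () =
  let accsumf = traverse ( +. ) 0.0 in
  Array.iter (Printf.printf "%.1f ") (accsumf [| 2.0; 1.0; 3.0 |])  (* 2.0 3.0 6.0 *)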

4.2 Code Generation

The code generation for a provided AST is split into two parts. First, the provided AST is lambda lifted and then the compiler starts generating OCaml code for the entire AST. If during the OCaml code generation process a device or predictive data-parallel intrinsic is encountered, then CUDA code will also be generated when necessary. All generated OCaml code is written to a single file, as is all generated CUDA code. It is then up to external tools to further translate these two generated files to binary executables. How this can be performed is demonstrated in Listing 5.1.

4.2.1 Lambda Lifting

In MExpr there is no defined top-level where functions must be located. A function in MExpr is a lambda expression that can be nested arbitrarily deep. This poses no problem when compiling to OCaml code, as OCaml allows functions to be defined inside other functions, where the inner function can access its parent function's scope. But as CUDA is based on C++ syntax, which requires all function definitions to be on a common top-level, it is required that no function is defined inside another function when generating CUDA code. This issue is solved by applying lambda lifting to the input AST. This transforms the AST in such a way that a lambda expression can only appear after another lambda expression or as the first expression in a let expression that is not part of the body of another let expression. This creates an artificial top-level in the MExpr AST, allowing for a more straightforward code generation to CUDA.

The algorithm used by the compiler to perform lambda lifting is based on the one presented by Johnsson [54]. As the syntax for functions is slightly different between MCore and the language used by Johnsson, some additional metadata had to be added to the AST to determine if a lam-expression is part of a function definition or if it is an anonymous lambda.
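The effect of lambda lifting can be sketched in OCaml syntax rather than MExpr; the example and names below are illustrative only. Before lifting, the inner function captures a free variable from its enclosing function; after lifting, that free variable becomes an explicit parameter and the function can be placed at the artificial top-level:

(* Before lifting: `scale` is nested and captures the free variable `factor`. *)
let scale_all_nested factor xs =
  let scale x = factor * x in
  Array.map scale xs

(* After lifting: the free variable is an explicit parameter, so `scale`
   can live at the top level, which is what the CUDA backend requires. *)
let scale factor x = factor * x
let scale_all factor xs = Array.map (scale factor) xs

let () =
  assert (scale_all_nested 3 [| 1; 2 |] = scale_all 3 [| 1; 2 |])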

4.2.2 OCaml Code Generation

The primary target language of the compiler is OCaml, which is used for all host-designated operations. As the syntax of MExpr is very similar to that of OCaml, the code generation becomes in many aspects a pretty-printing of the AST with some keywords being given different names. Exceptions to this include function definitions, which require a bit of pre-processing since the arguments to a function in OCaml appear on the left side of the equality sign, while function arguments in MCore appear to the right of the equality sign. This difference is exemplified in Listing 4.7.

(* Function definition in OCaml *)
let foo x y = (x * x) + y

-- Function definition in MCore
let foo = lam x. lam y. addi (muli x x) y

Listing 4.7: An OCaml function definition at the top and the same function implemented in MCore at the bottom.

A key decision that had to be made for the code generation was how to represent MExpr sequences in OCaml. There are two ways to natively represent a sequence in OCaml, either as a list or as an array, as opposed to MExpr which only has one way to represent sequences. Out of these options, it was decided to generate MExpr sequences as OCaml arrays. The main reason behind this decision is to reduce the overhead when invoking the CUDA functions, as the sequences would at some point have to be converted into an array before being used by the device functions. A downside of representing MExpr sequences as arrays in OCaml is that common operations such as adding an element to the start of a sequence become O(n). To not make the host code slower than it needs to be,

the host variants of the intrinsic data-parallel higher-order functions are implemented using pre-existing OCaml functions as shown in Table 4.2.

Table 4.2: How the host variants of the intrinsic higher-order functions are generated in OCaml.

Host Variant   Generated in OCaml as
seqMap         Array.map
seqMapi        Array.mapi
seqInit        Array.init
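The O(n) cost of prepending an element, mentioned above, follows directly from the array representation; a small illustration (not compiler output):

(* Prepending to an array requires allocating a new array and copying all
   existing elements, hence O(n). *)
let cons (x : 'a) (xs : 'a array) : 'a array =
  Array.append [| x |] xs

let () =
  Array.iter (Printf.printf "%d ") (cons 1 [| 2; 3; 4 |])   (* prints: 1 2 3 4 *)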

4.2.3 CUDA Code Generation

Code generation to CUDA occurs only when a device or predictive variant of the intrinsic data-parallel higher-order functions is encountered at the OCaml code generation stage. At this point the compiler looks at the function that is statically applied to the intrinsic higher-order function and generates CUDA device code for it, as well as for any subsequent function used within. If the compiler cannot generate CUDA code for some encountered function, the compiler will return an error and terminate. There has not been any detailed analysis as part of this thesis on what can and what cannot be compiled to CUDA code. Any limitation on what the CUDA code generation cannot do has been a conscious decision to simplify the implementation. Some of the key limitations that the CUDA code generation puts on encountered functions are:

• The functions cannot themselves contain invocations to a data-parallel intrinsic. This limitation on nested intrinsics is in place to simplify the CUDA code generation.

• The types of values used in the functions can only be Int, Float, or a sequence of those types. A sequence containing other sequences is not allowed. This limitation on types is in place to simplify the memory interface between OCaml and CUDA.

• The value returned from an expression within the functions can only be of type Int, Float, or Bool. It is not allowed to return a sequence containing any of these. This limitation on sequences is in place to avoid managing dynamic memory allocation on the device.

In the case that an expression returns a value of type Bool, it can be used in conjunction with other expressions, but it cannot be assigned to a variable. This is not a limitation of how the device code is generated, but a design decision to simplify the implementation of the interface between OCaml and CUDA.

As it is known that the AST has been lambda lifted before this stage, the compiler can safely assume that all values used within a function are passed as arguments to that function. All functions encountered by the CUDA code generator are therefore generated as top-level device functions that can invoke each other and be called by the kernel. Any let-expression within a function can therefore also be safely assumed to be a variable and is generated as such in CUDA.

recursive let fibwork = lam p. lam c. lam i. lam n.
  if eqi i n then c
  else fibwork c (addi p c) (addi i 1) n
in
let fib = lam n. fibwork 1 0 0 n in
let g = lam x. x in
let f = lam x. addi (g x) (fib x) in
let _ = cudaMap 1 f arr in ()

Listing 4.8: An MExpr program that invokes a data-parallel intrinsic.

For any encountered standard MCore intrinsic, such as addi, an inline device function is generated with a predetermined definition. As many intrinsic primitive operations in CUDA use infix syntax, this method bypasses the need for performing prefix to infix conversion on the AST. The compiler will reuse the previously generated inline function if the same intrinsic is encountered again. To invoke the generated device function that was applied to the intrinsic higher-order function, a CUDA kernel is generated that iterates over the specified number of elements per thread in a for-loop, applying the device function on each index. As the number of elements per thread is determined at runtime, this for-loop is also present when the number of elements per thread is 1. This also includes the predictive variants of the intrinsic higher-order functions, which always run with 1 element per thread. The generated host code that launches the kernel handles the interface between CUDA and OCaml. To interface the binary formats from

__device__ inline int gpu_addi(int x, int y)
{ return x + y; }
__device__ inline bool gpu_eqi(int x, int y)
{ return x == y; }

__device__ int fibwork(int p, int c, int i, int n) {
    return gpu_eqi(i, n) ? c :
        fibwork(c, gpu_addi(p, c), gpu_addi(i, 1), n);
}
__device__ int fib(int n) { return fibwork(1, 0, 0, n); }
__device__ int g(int x) { return x; }
__device__ int f(int x) { return gpu_addi(g(x), fib(x)); }

__global__
void kern_f(value *in, value *out, int n, int ept) {
    int bgn = ((blockIdx.x * blockDim.x) + threadIdx.x) * ept;
    int end = min(bgn + ept, n);
    for (int i = bgn; i < end; i++) {
        int v = Int_val(in[i]);
        out[i] = Val_int(f(v));
    }
}

Listing 4.9: How the generated CUDA code looks for the example from Listing 4.8. The type value and the calls to Int_val and Val_int are part of the OCaml interface towards C. The generated code that launches the kernel has been omitted for simplicity.

CUDA and OCaml, the OCaml C interface library is used [48]. This includes converting the received types from OCaml format to CUDA format, initializing parameters, and initializing the return value. The only exception to this interface is the binary conversion of elements in sequences, which is performed in parallel on device. Using this interface without dynamic memory allocation in the device code means that no garbage collector has to be implemented, as OCaml's garbage collector handles the input and output arrays. The only memory management that needs to be generated as part of the CUDA host code is for the device memory that contains the input and output sequences. This is because the details of how this memory is to be allocated, transferred, and freed are known at compile-time and do not depend on the internal behavior of the function passed as an argument to the intrinsic higher-order function.

[Figure 4.1 diagram: the host code and kernel generated for cudaMap 1 f lead to the device function f = lam x. addi (g x) (fib x), which in turn leads to the inline device function addi, the device function g = lam x. x, and the device function fib = lam n. fibwork 1 0 0 n; fib leads to the device function fibwork = lam p. lam c. lam i. lam n. if eqi i n then c else fibwork c (addi p c) (addi i 1) n and the inline device function eqi.]

Figure 4.1: Example of when the CUDA functions in Listing 4.9 are generated for the program in Listing 4.8. The non-dotted arrows indicate subsequent function discoveries by the compiler. Intrinsic functions are marked in bold.

When the CUDA host code launches a kernel, it first queries the device to extract information on how many threads per block it can handle. The kernel is then launched with the maximum number of threads per block and as many blocks as necessary to perform all the computations. This process is shown in Listing 4.10.

4.3 Automatic Parallelization Decision Points

To determine at runtime which backend to run the predictive intrinsics predictiveMap, predictiveMapi, or predictiveInit on, the execution cost of the applied function is estimated for both OCaml and CUDA using a mechanical microanalysis that is implemented in the compiler. Like the methods used by Wegbreit [45] and Reistad and Gifford [4], the model for estimating execution cost for the predictive intrinsics is based on the costs of individual operations, such as an addi- or an if-expression, that are specified in a cost profile. But whereas those models result in a difference equation or a constraint problem, the predictive model used here is based on a heuristic assumption, which is introduced in Section 4.3.1. The

1  int elemPerBlock;
2  int blocks;
3  int threads;
4  cudaDeviceGetAttribute(&threads,
5                         cudaDevAttrMaxThreadsPerBlock,
6                         0);
7  elemPerBlock = threads * elemPerThread;
8  blocks = (n + elemPerBlock - 1) / elemPerBlock;
9  genkernel<<<blocks, threads>>>(...);

Listing 4.10: How the number of threads and blocks are calculated in the generated code. The genkernel invocation on Line 9 launches the CUDA kernel with the specified number of blocks and threads per block.

output of the model used in this thesis is an expression in terms of the input variables to the applied function, which will be evaluated at runtime by the predictive intrinsics. The use of cost profiles is further explained in Section 4.3.2. The execution cost is modeled after execution time but does not conform to any specific unit of time. The goal of the model is that the backend with the lowest execution cost should be the same as the backend with the lowest execution time. To achieve this goal, the model aims to make the proportion in execution cost between OCaml and CUDA the same as their proportion in execution time. Equation 4.1 formalizes this aim, where C_host and C_device denote the estimated execution cost for OCaml and CUDA respectively. The variables t_host and t_device in Equation 4.1 denote the execution time in seconds for OCaml and CUDA respectively.

\[
\frac{C_{\mathrm{host}}}{C_{\mathrm{device}}} = \frac{t_{\mathrm{host}}}{t_{\mathrm{device}}} \tag{4.1}
\]

4.3.1 The Heuristic

When performing a parallel procedure on a GPU, it is common to iterate over elements in a sequence, such as when performing the dot product aspect of a matrix multiplication. The length of this iteration is often determined by a size parameter, and not by the result of any of the performed computations. In the case of MCore and other functional languages that do not natively support for-loops, this iteration has to be

implemented by tail recursion. The heuristic assumption used for estimating execution cost is therefore that a function which is called recursively will be invoked a number of times corresponding to the greatest value in the condition of any of the if-expressions in the function body. In other words, the heuristic assumption can be formulated as follows: an if-expression in a recursively invoked function only checks the number of iterations in relation to a counter and a size value. An example where this assumption holds is shown in Listing 4.11. An example where this assumption fails is shown in Listing 4.12.

let dot = lam a. lam b. lam n.
  recursive let dotwork = lam acc. lam i.
    if eqi i n
    then acc
    else dotwork (addf acc (mulf (get a i) (get b i)))
                 (addi i 1)
  in
  dotwork 0.0 0

Listing 4.11: A dot-product between two Float sequences a and b in MCore. Here the heuristic correctly determines that the time complexity of the function scales as O(n).

recursive let ack = lam m. lam n.
  if eqi m 0 then addi n 1
  else if eqi n 0 then ack (subi m 1) 1
  else ack (subi m 1) (ack m (subi n 1))
end

Listing 4.12: The Ackermann–Péter function implemented in MCore, an example where the heuristic fails. The time complexity of this function is known to scale much faster than linearly, though the heuristic will conclude that it scales as O(m + n).

The reason for making this heuristic assumption is to obtain an expression that describes the execution cost of a function over a range of inputs, since the execution cost of some functions grows faster than that of others but might be relatively inexpensive for small inputs. As explained in Section

2.4, no algorithm exists2 that can precisely express the execution cost of an arbitrary function, or even express what happens to the cost for large values. Since this model must be able to estimate execution cost for functions with user-defined recursion, it was decided to base the model on a heuristic assumption. Using a more sophisticated model, such as the one used by Metric [45] or the probabilistic model described by Knuth [41], would be interesting to look further into, but they were considered to be outside the scope of this thesis. How this heuristic assumption is used to construct the execution cost expression is described in Section 4.3.3.

4.3.2 Cost Profiles

Another factor considered by the predictive model is that the time to perform an operation in CUDA is not necessarily the same as the time it takes to perform the operation in OCaml. To address this issue, the model uses a different set of costs for intrinsic operations for each backend, known as cost profiles. A cost profile in this compiler is implemented as a lookup table with predetermined costs for intrinsic functions and expressions in MCore, similar to the methods used by Wegbreit [45] and Reistad and Gifford [4]. Almost all intrinsic functions and expressions in MCore have a matching value in the cost profiles, with the exceptions being the parts of MCore that the FLAGON compiler cannot compile to GPU. Thus the predictive model will fail to construct a cost expression if it encounters an intrinsic function or expression that does not have a designated value in the cost profile. However, this is not considered an issue that needs to be addressed, since the predictive model will only be used when compiling to GPU. The values used in the cost profiles are tuned manually. While it could be possible to construct a program that performs this automatically, it was decided to perform this process manually in the interest of saving time, and because automatic tuning of individual operation costs is considered outside the scope of this thesis. How these cost profiles are manually tuned is expanded upon in Section 5.1.1.
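As an illustration, a cost profile can be thought of as a simple lookup table from operation names to costs, one per backend. The following OCaml sketch uses made-up operation names and cost values; it does not reflect the actual tables in the compiler:

type cost_profile = (string * float) list

let host_profile : cost_profile =
  [ ("addi", 1.0); ("muli", 1.0); ("eqi", 1.0); ("if", 1.5); ("recursion", 2.0) ]

let device_profile : cost_profile =
  [ ("addi", 0.8); ("muli", 0.8); ("eqi", 0.8); ("if", 2.5); ("recursion", 3.0) ]

(* Looking up the cost of an operation for a given backend. *)
let op_cost (profile : cost_profile) (op : string) : float =
  match List.assoc_opt op profile with
  | Some c -> c
  | None -> failwith ("no cost assigned to " ^ op)

let () =
  Printf.printf "addi on host: %.1f, on device: %.1f\n"
    (op_cost host_profile "addi") (op_cost device_profile "addi")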

2 The existence of such an algorithm would provide an algorithm for the halting problem.

4.3.3 Calculating Execution Cost

The expression for the execution cost of a predictive intrinsic is constructed in two passes. The first pass applies the heuristic assumption from Section 4.3.1 to find all recursively invoked functions and the variables that describe how many times a recursive function will be invoked. The second pass uses the recursion information from the previous pass, along with a cost profile, to traverse the AST and determine the cost of individual expressions and their children. The result is an execution cost that is expressed in terms of the input variables that are directly applied to the function argument of the predictive intrinsic. The function that is the argument to the predictive intrinsic is referred to as the entry function. In the case of CUDA, some additional terms and factors are added to the resulting expression. These are put in place to represent invocation overhead and parallelism advantages when running code on CUDA. Exactly what these terms and factors are and what they represent is presented towards the end of this section.

Applying the Heuristic

As the predictive model outputs an expression in terms of the input arguments, the recursions of any internally invoked function must have their arguments converted to be expressed in terms of the arguments to the entry function. As it is known that lambda lifting has been applied to the AST, it can safely be assumed that any value encountered in a function either relates to an argument of the entry function or is a constant. If an encountered value originates from a sequence access, then that value is set to 1. This simplification is put in place to avoid scanning the input array when deciding which backend to execute on, as this could be time-consuming for large input arrays. If a value encountered in an if-condition is a Float, then that value is rounded down to the nearest integer. Anything that is not an Int or a Float will be ignored by this pass. As demonstrated in Figure 4.2, the heuristic substitutes the arguments to the functions based on the inputs from their first invocation relative to the entry function. If a function is invoked with different arguments in separate places, then the heuristic chooses the arguments of the invocation that was encountered first. The order in which the function invocations are traversed depends on the implementation, but

[Figure 4.2 diagram: the entry point predictiveMap (f pVar) pArr leads to f = lam a. lam b. g a b with the substitutions [a ↦ pVar] and [b ↦ 1]; f leads to g = lam x. lam acc. if eqi x 0 then acc else g (subi x 1) (h x acc) with the substitutions [x ↦ pVar] and [acc ↦ 1], where recursion is detected in terms of pVar and 0, giving cost(g) = max(pVar, 0) ∗ cost(g.body); g leads to h = lam n. lam v. if eqi n 0 then v else if eqi v 1 then n else h (subi n 1) (addi v 1) with the substitutions [n ↦ pVar] and [v ↦ 1], where recursion is detected in terms of pVar, 0, 1, and 1, giving cost(h) = max(pVar, 0, 1, 1) ∗ cost(h.body).]

Figure 4.2: Example of how the heuristic determines how functions are recursive with respect to the input variables. The expressions on the left informally illustrate the effect of the heuristic when this information is applied in the predictive model. The substitution [b ↦ 1] at the first step is due to the argument to b being extracted as part of an array access. Underlined symbols in the expressions represent the values identified by the heuristic as the values that the function is recursive against.

for the implementation used here, it comes down to the order in which child nodes are defined in the language syntax, choosing children that are defined higher up in the file first. If a function is discovered to have a recursive call when traversing the AST, that function is labeled as being invoked a number of times corresponding to the maximum value encountered in an if-condition anywhere in the function body. In the example shown in Figure 4.2, the heuristic would classify the function as having complexity O(pVar²). The output of this pass is referred to as the recursion variable relation.

Creating the Cost Expression

Using the information from the heuristic pass, the AST is traversed from the entry function, assigning a cost to each AST node based on a cost profile. For a given AST node a, a cost profile P, a recursion variable relation R, and a visited set V, the cost function is computed as shown in Equation 4.2:

\[
\mathrm{cost}(a, P, R, V) =
\begin{cases}
P_{\mathrm{recursion}}, & \text{if } a \in V \\[6pt]
\max(R_a) \cdot \Bigl( P_a + \sum_{c \,\in\, a.\mathrm{children}} \mathrm{cost}(c, P, R, V \cup \{a\}) \Bigr), & \text{if } R_a \neq \emptyset \\[6pt]
P_a + \sum_{c \,\in\, a.\mathrm{children}} \mathrm{cost}(c, P, R, V \cup \{a\}), & \text{otherwise}
\end{cases}
\tag{4.2}
\]

The expression P_a in Equation 4.2 is the cost of the AST node a as predetermined by the cost profile P. The expression R_a represents the set of all variables that the heuristic determined to represent the number of recursive calls to a. The expression P_recursion is the cost of a single recursive call, as predetermined by the cost profile P. The first case in Equation 4.2, where the P_recursion cost is used, represents the base case for recursion, in the case that a reference to a previously visited node is encountered. The second case in Equation 4.2 represents when a function definition is encountered which was discovered to be recursive during the heuristic pass. This case is also portrayed by the expressions to the left in Figure 4.2. The third case in Equation 4.2 represents the regular case of a non-recursive expression, which is the same as the second case except for the recursion factor max(R_a).
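A toy OCaml sketch of the cost function in Equation 4.2 is shown below. The AST type, the cost profile, and the recursion variable relation are simplified stand-ins for the compiler's internal structures, node names are used as identities, and all numbers are made up:

type ast = { name : string; children : ast list }

module S = Set.Make (String)

(* Simplified cost profile: per-node cost plus the cost of one recursive call. *)
let profile_cost (a : ast) : float =
  match a.name with
  | "addi" | "eqi" -> 1.0
  | "if" -> 1.5
  | _ -> 0.5
let p_recursion = 2.0

(* Simplified recursion variable relation: the runtime values of the variables
   that bound the number of recursive calls to a (empty if non-recursive). *)
let recursion_vars (a : ast) : float list =
  if a.name = "dotwork" then [ 1000.0 ] else []

let rec cost (a : ast) (visited : S.t) : float =
  if S.mem a.name visited then p_recursion        (* base case: already visited *)
  else
    let v' = S.add a.name visited in
    let children =
      List.fold_left (fun s c -> s +. cost c v') 0.0 a.children in
    match recursion_vars a with
    | [] -> profile_cost a +. children            (* non-recursive case *)
    | rs -> List.fold_left max 0.0 rs *. (profile_cost a +. children)

let () =
  (* A fake "dotwork"-like node whose body contains a back-reference to itself. *)
  let body = { name = "if"; children = [ { name = "eqi"; children = [] };
                                         { name = "addi"; children = [] };
                                         { name = "dotwork"; children = [] } ] } in
  let entry = { name = "dotwork"; children = [ body ] } in
  Printf.printf "estimated cost: %.1f\n" (cost entry S.empty)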

Additional Terms and Factors for CUDA

Besides the possibility of individual operations having different execution times between CUDA and OCaml, since CUDA code uses a different execution model to regular OCaml code, there are some extra terms and factors that can affect the execution time for CUDA [23]. Three such values, α, β, and γ, which represent the effects a GPU has on execution time, have been taken into account when constructing the cost expression for CUDA. These three values, like the cost profiles, are also tuned

manually for a specific backend. The first value that has been taken into account is α, which denotes a fixed latency overhead that is independent of the function that is being executed or the properties of its input data. This term is taken into account because executing code on a GPU can incur a non-negligible overhead due to drivers and API calls [23]. The second value that has been taken into account is β, which represents the time spent transferring data to and allocating memory on the GPU. The major difference between α and β is that, while the effect of α is independent of the properties of the input data, the amount that β contributes to the execution cost is proportional to the size of the input and output arrays. This behavior for β is motivated by the limited bandwidth GPUs have for transferring data between the GPU and the CPU [23, 22]. The third and final value that has been taken into account is the parallelism denominator γ. This value is used to express how parallel the execution is on the GPU by dividing the resulting entry function cost from Equation 4.2. These three values are then used to construct the cudacost function as shown in Equation 4.3, where n_total represents the total number of elements from sequences that have to be copied between the host and device. The values for α, β, and γ are tuned manually in the same way as the cost profiles. This tuning process is further explained in Section 5.1.1.

\[
\mathrm{cudacost}(a, P, R, V, n_{\mathrm{total}}, \alpha, \beta, \gamma) = \alpha + \beta \cdot n_{\mathrm{total}} + \frac{\mathrm{cost}(a, P, R, V)}{\gamma} \tag{4.3}
\]

Chapter 5

Evaluation

This chapter introduces the method for evaluating the problem statement in Section 1.3 and presents the experimental results from applying that method. The method covers both how the cost profiles are tuned and how the execution times for the different variants of the data-parallel intrinsics in the FLAGON compiler compare against each other and against related work. The experimental results from these two aspects are presented in Sections 5.2 and 5.3 respectively.

5.1 Method

The evaluation of the problem statement takes place in three parts:

1. First, the cost profiles from Section 4.3.2 are tuned based on a dedicated set of training programs. Using these training programs, the cost profiles are tweaked manually based on the resulting execution time until the cost profiles yield the desired properties. This process is further explained in Section 4.3.2.

2. Following the tuning of cost profiles, the effects that multiple elements per thread have in CUDA are investigated. This is evaluated by taking one of the simplest benchmarks from Section 5.1.2 and running the device variant for a varying number of elements per thread. The details of this investigation are presented in Section 5.1.3. None of the training programs from the tuning of the cost profiles are used in the benchmarks.

3. Lastly, the execution time for the different variants of the data-parallel intrinsics from Section 4.1.1 is compared against each other


and related work through a set of 12 benchmarks. How these benchmarks are implemented and how execution time is measured for them is described in Section 5.1.2.

The platform that performs the cost profile tunings and runs the benchmarks is presented in Section 5.1.4. Potential threats to validity are presented in Section 5.1.5.

5.1.1 Tuning the Cost Profiles

To determine the values within the cost profiles and the CUDA-specific values α, β, and γ, a set of three programs is implemented that measure the execution time when running a data-parallel intrinsic on host (t_host) and when running the same intrinsic on device (t_device). These programs also record the evaluated cost expressions, both for the host variant as C_host and for the device variant as C_device. When tuning the cost profiles and CUDA-specific values, these measurements are used in the form of a cost fraction C_host / C_device and a time fraction t_host / t_device to simplify the comparison and analysis. The programs that perform these measurements are described later in this section. The tuning process is performed manually by iteratively running the 3 training programs, tweaking the cost profiles and the CUDA-specific values until a satisfying assignment has been found. While this tuning process aims to satisfy the expression from Equation 4.1, many aspects such as memory locality are not accounted for, which could make it impossible to fully satisfy the expression in Equation 4.1. Instead, the tuning process attempts to satisfy three criteria that are constructed to make the predictive model choose the fastest backend at runtime, even if it cannot precisely estimate the execution time. These criteria are defined with an order of priority, such that if all criteria cannot be satisfied, the focus is shifted towards satisfying the criterion with the highest priority. Listed in order of priority, these three criteria are:

1. t_host / t_device < 1  ⇒  C_host / C_device < 1

2. t_host / t_device > 1  ⇒  C_host / C_device > 1

3. t_host / t_device ≈ C_host / C_device

While Criteria 1 and 2 might seem equivalent, there is an important distinction between them. If the right-hand side of Criterion 2 is satisfied, the predictive model will execute the data-parallel intrinsic on device. If instead the right-hand side of Criterion 1 is satisfied, the predictive model will execute the data-parallel intrinsic on host. If the model fails to satisfy Criterion 2, the effect is that the available speedup from executing the data-parallel intrinsic on device is wasted. However, if the model fails to satisfy Criterion 1, then the data-parallel intrinsic will be executed on device, effectively slowing down the program. As the goal of the predictive model is to automatically run a data-parallel intrinsic on device when there is performance to be gained, there should be as little risk as possible of slowing down the program when opting to use a predictive variant over a host-only variant. Because of this, it has been decided to prioritize the expression in Criterion 1 over the expression in Criterion 2. As the decision on whether a predictive intrinsic should be run on host or device comes down to which of C_host and C_device is the smallest, how representative the cost fraction is of actual execution time, as expressed by Criterion 3, is given a lower priority. Giving Criterion 3 a lower priority is also motivated by the previous observation that it might be impossible to tune the cost profiles and CUDA-specific values such that the expression from Equation 4.1 is satisfied. The main purpose of Criterion 3 is therefore not to determine the fastest backend, but instead to avoid overfitting the cost profiles and CUDA-specific values to the training programs.
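The three criteria can be checked mechanically for a single measurement. The OCaml sketch below is only an illustration; in particular, the tolerance used for Criterion 3 is an assumption and not a value taken from the thesis:

let satisfies_criteria ~t_host ~t_device ~c_host ~c_device =
  let tf = t_host /. t_device and cf = c_host /. c_device in
  let crit1 = tf >= 1.0 || cf < 1.0 in              (* faster on host  => predict host   *)
  let crit2 = tf <= 1.0 || cf > 1.0 in              (* faster on device => predict device *)
  let crit3 = abs_float (tf -. cf) <= 0.25 *. tf in (* fractions roughly agree (assumed tolerance) *)
  (crit1, crit2, crit3)

let () =
  (* host measured slower than device, and the cost model agrees *)
  let c1, c2, c3 = satisfies_criteria ~t_host:4.0 ~t_device:1.0
                                      ~c_host:3.0 ~c_device:1.0 in
  Printf.printf "criterion 1: %b, 2: %b, 3: %b\n" c1 c2 c3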

The Training Programs

Three training programs are constructed to aid in the manual tuning of the cost profiles and the CUDA-specific values. These programs each implement one function that is fed as input to one of the predictive intrinsics, where the functions are designed to have different time complexities in relation to the size of the input arrays. The three training programs are:

• vecpo (vector plus one): Implements a function that adds 1 to an Int value. This function is passed as an argument to the predictiveMap intrinsic. The time complexity of a single call to this function is O(1).

• gcdsum: Implements a function that pairwise computes the greatest common divisor between an Int argument and all elements in a sequence of Ints, after which the pairwise results are summed up into a single value. This function is passed as an argument to the predictiveMap intrinsic. The time complexity of a single call to this function is O(n), where n is the length of the Int sequence that is partially applied to the mapped function.1

• matmaxsum: Implements a function that performs an operation similar to a dot product, but instead of pairwise multiplication between the vector elements, the maximum of the two is selected. When this function is applied to the predictiveInit intrinsic, the effect is matrix multiplication with the multiplications in the dot product replaced by the max operator. The time complexity of a single call to this function is O(√n), where n is the length of the input sequences that are used to represent the matrices.

The execution time for vecpo, gcdsum, and matmaxsum is measured over multiple iterations where these functions are applied to their respective data-parallel intrinsics mentioned above. The measured time is recorded as the elapsed wall-clock time over all iterations, which is then divided by the number of iterations to obtain t_host or t_device, depending on which backend was being measured. The vecpo and gcdsum programs perform 50 iterations in a single run, while the matmaxsum program only performs 20 iterations in a single run.2 These programs also use the predictive model to construct the cost expression and record its output for each run. The hardware used to collect the measurements is presented in Section 5.1.4, which is the same hardware that is used to run the benchmarks from Section 5.1.2. The input sizes used for the training programs vary from program to program. The sizes are chosen such that it should be faster for host to run the program on smaller inputs, but faster for device to run the program on the larger inputs. A range of input sizes is also added in between to try and make sure that the cost profiles satisfy the 3 previously specified

1 Assuming that the time to compute the greatest common divisor is constant. This will at least be the conclusion made by the predictive model since it sets all elements from array accesses to a constant value in the cost expression.

2 The reason for only performing 20 iterations of the matmaxsum program was due to a limitation in how the MCore interpreter [26] could evaluate certain aspects of the compiler code at the time of tuning the cost profiles. These issues have now been resolved.

criteria even when the execution times for a program are very similar between host and device. The input sizes used for each program and the recorded measurements are presented in Section 5.2. See Appendix A for the implementations of the functions that are passed as input to the predictive intrinsics.
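The actual MCore implementations of the mapped functions are given in Appendix A. As a hedged illustration, an OCaml analogue of the gcdsum mapped function described above could look as follows; it is not the code used in the evaluation.

(* Illustrative OCaml analogue of the gcdsum mapped function; the real
   MCore implementation is given in Appendix A. *)
let rec gcd a b = if b = 0 then a else gcd b (a mod b)

(* For one element x, compute gcd x y for every y in the partially
   applied sequence ys and sum the results; a single call is O(n) in
   the length of ys. *)
let gcdsum ys x =
  List.fold_left (fun acc y -> acc + gcd x y) 0 ys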

5.1.2 Benchmarks

To compare the performance of the data-parallel intrinsics in FLAGON against each other and against implementations from related work, a set of 12 benchmarks is constructed where each benchmark measures the execution time for a specific procedure over a range of different inputs. These procedures have been selected based on the criteria that they have some GPU-optimized implementation in related work, that they are considered to have some form of real-world application, and that it is possible to provide GPU optimization for them using the higher-order functions listed in Table 4.1. In MCore, these procedures are all implemented in 3 versions, using the host, device, and predictive variants of the data-parallel intrinsics respectively. The selected procedures and how the benchmarks measure their execution time are presented later in this section.

Only one set of variants of the data-parallel intrinsics is used in any MCore implementation. All MCore implementations within a single benchmark are identical except for which variant of the data-parallel intrinsics is used. For all invocations of any of the device variants in the benchmarks, the number of elements per thread is set to 1. How this decision might affect the execution time of the device variants is covered in Section 5.1.3.

The solutions from related work that are used for comparison against FLAGON are cuBLAS [59], Lift [2], and PPL-CUDA-SMC [67].

cuBLAS, as introduced in Section 2.5.1, is used to test how FLAGON compares against pre-implemented linear algebra routines. No implementation using cuBLAS is provided from elsewhere. Instead, the cuBLAS implementations used in the benchmarks are written specifically for this thesis, though using the cuBLAS routines to perform the actual computations. Where it is suspected that a cuBLAS procedure might not be as optimized as expected, an extra implementation written with a custom CUDA kernel is added to the comparison against FLAGON.

Lift, as introduced in Section 3.1, is used to test how FLAGON compares against a framework that focuses on highly-optimized algorithmic skeletons.

All Lift baseline implementations used in the benchmarks are provided by the Lift CGO2017 artifact [68]. However, one issue with using these implementations is that they are directly integrated with Lift's own benchmark suite. As such, the comparison between the results from Lift and the results from FLAGON is very different from the comparisons with other implementations from related work. How these comparisons differ is explained later in this section.

PPL-CUDA-SMC [67] is a framework that provides GPU-optimized sequential Monte Carlo (SMC) algorithms, using CUDA C++ as the backend for parallelism. This framework also provides some example implementations of SMC algorithms in PPL-CUDA-SMC, which are used in the benchmarks for comparison against the MCore implementations. The only modification made to these programs is to attach the benchmarking utilities to the main function.

The 12 benchmarks that are used to evaluate the performance of the data-parallel intrinsics in FLAGON, as well as the solutions from related work used as baselines in those benchmarks, are listed in Table 5.1.

Table 5.1: Overview of the benchmark procedures and which related work has an existing baseline implementation of the benchmark procedure.

Benchmark                             Baseline
ATAX                                  Lift
Convolution                           Lift
Matrix-Vector Multiplication (GEMV)   Lift
Matrix Addition (GEAM)                cuBLAS
Matrix Multiplication (GEMM)          Lift
Nearest Neighbour (NN)                Lift
Rank-1 Update (GER)                   cuBLAS
Saxpy                                 cuBLAS
Saxpy Single                          cuBLAS
SMC Airplane                          PPL-CUDA-SMC
Vector Addition (VecAdd)              cuBLAS
Vector Scaling (VecScal)              cuBLAS

Each implementation of a benchmark procedure, with the exception of those based on Lift, is first run for 4 warmup iterations and then for 15 measurement iterations over a range of different inputs. All 4 warmup iterations and 15 measurement iterations are performed together in a single binary executable.

The 4 warmup iterations are performed to catch any one-time fixed overhead such that it does not interfere with the measurements. The 15 measurement iterations record the elapsed wall-clock time for each iteration, which are then averaged to construct the resulting execution time for the specific implementation and input size. The input sizes used for each benchmark are presented in Table 5.2.
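As a minimal sketch of this measurement scheme (not the actual benchmark harness), the following OCaml function performs the warmup iterations, times the measurement iterations with Unix.gettimeofday from the unix library, and returns the average; run_once is a placeholder for one iteration of the benchmarked procedure.

(* Warmup iterations are run but not timed; measurement iterations are
   timed individually and averaged into one execution time. *)
let measure ?(warmup = 4) ?(iters = 15) (run_once : unit -> unit) : float =
  for _ = 1 to warmup do run_once () done;
  let total = ref 0.0 in
  for _ = 1 to iters do
    let t0 = Unix.gettimeofday () in
    run_once ();
    total := !total +. (Unix.gettimeofday () -. t0)
  done;
  !total /. float_of_int iters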

Table 5.2: The input sizes used for the different benchmarks. How the input sizes relate to the input data is described for each individual benchmark later on in Section 5.1.2.

Benchmark      Input sizes
ATAX           32, 64, 128, 256, 512, 1024, 2048, 4096, 8192
Convolution    32, 64, 128, 256, 512, 1024, 2048, 4096, 8192
GEMV           32, 64, 128, 256, 512, 1024, 2048, 4096, 8192
GEAM           32, 64, 128, 256, 512, 1024, 2048, 4096, 8192
GEMM           16, 32, 64, 128, 256, 512, 1024, 4096
NN             64, 256, 1024, 4096, 16384, 65536, 262144, 8388608, 33554432
GER            32, 64, 128, 256, 512, 1024, 2048, 4096, 8192
Saxpy          64, 256, 1024, 4096, 16384, 65536, 262144, 500000, 1000000, 10000000, 50000000
Saxpy Single   64, 256, 1024, 4096, 16384, 65536, 262144, 500000, 1000000, 10000000, 50000000
SMC Airplane   200, 2000, 10000, 100000, 500000, 1000000, 10000000
VecAdd         64, 256, 1024, 4096, 16384, 65536, 262144, 500000, 1000000, 10000000, 50000000
VecScal        64, 256, 1024, 4096, 16384, 65536, 262144, 500000, 1000000, 10000000, 50000000

Each benchmark is executed together with the nvprof profiler [49] to provide context on the execution of device functions. This context is primarily used to explain any irregularities in the presented results, not for comparison between different procedure implementations. However, the comparison between FLAGON and Lift is an exception to this. The benchmark suite used by Lift is designed for comparing execution time between GPU-optimized applications. As such, it only measures the median kernel execution time for individual procedures [68], without accounting for significant effects on execution time such as memory transfers.

Thus, to compare FLAGON against Lift, the average kernel execution time of the device intrinsics is collected from nvprof and compared against Lift separately from the wall-clock measurements. This separation is further motivated by the fact that Lift only provides measurements for the two largest input sizes listed in Table 5.2 for the benchmarks it is used as a baseline for. Also, since Lift measures median execution time and nvprof measures average execution time, the comparison between FLAGON and Lift is not as direct as the comparison against other related work. Despite these issues, the comparison between Lift and FLAGON is kept as part of the results, since it can provide an indication of how GPU optimization in FLAGON compares against GPU optimization in Lift. The implementation and input details of each benchmark are described in Appendix B.

5.1.3 Number of elements per thread

To test the claims that there is no overhead on thread switching in CUDA [36, 21], the Saxpy Single benchmark is run with 1, 2, 4, 8, and 16 elements per thread using the device-only intrinsic. The Saxpy Single benchmark is selected to evaluate the effect of various configurations of the number of elements per thread since the computation performed on each element in this benchmark is very small; the idea is that this increases the proportion of the total elapsed time spent on thread switching, if that overhead is non-negligible.

Execution time measurements for the different numbers of elements per thread are collected just as for the regular benchmarks in Section 5.1.2, with the execution time being recorded for the same range of input values as for the Saxpy Single benchmark. The results from this investigation are presented in Section 5.3.1. If the results show that more elements per thread can provide a speedup in CUDA, then this effect will be included in the analysis of the benchmark results in Chapter 6. If the results show no sign of speedup from multiple elements per thread, then the claims that there is no overhead on thread switching in CUDA will not be questioned or reflected upon further in this thesis.
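As a small point of reference for these configurations, the sketch below shows how the number of launched threads relates to the elements-per-thread setting; the rounding-up division is an assumption about the launch configuration rather than a detail taken from the compiler.

(* Number of threads needed to cover n elements when each thread
   handles e elements, rounding up so the last thread covers any
   remainder. *)
let threads_needed ~n ~elems_per_thread =
  (n + elems_per_thread - 1) / elems_per_thread

(* For example, 50000000 elements need 50000000 threads with 1 element
   per thread, but only 3125000 threads with 16 elements per thread. *)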

5.1.4 Measurements

All measurements from the benchmarks, the tuning of cost profiles, and the investigation into the effect of the number of elements per thread are performed on a single platform with only one user logged on. Only a single procedure from any of these evaluations is running at any given time, with no overlap between the runs on different input sizes and backends. The technical specifications of the platform used to collect the measurements are listed in Table 5.3.

Table 5.3: Technical specifications of the platform that runs the benchmarks.

Operating System   Ubuntu 18.04.4 LTS
CPU                Intel Xeon Gold 6148 CPU @ 2.40GHz
RAM                64 GiB
GPU                NVIDIA Quadro P2000

All MCore implementations are compiled using ocamlopt for the generated OCaml code and nvcc [58] for the generated CUDA code, linked together using ocamlopt into a single binary. An example of how this process can be performed is shown in Listing 5.1. The MCore implementations used in the benchmarks are available in the degree project artifact [69]. The cuBLAS and PPL-CUDA-SMC implementations are compiled using only the nvcc compiler. The Lift benchmarks are compiled and run through the scripts present in the Lift CGO2017 artifact [68].

5.1.5 Threats to validity

Several aspects that could threaten the validity of the results are not accounted for in the evaluation method. The following aspects have been identified as possible threats to validity:

• Different compiler optimization flags are used when compiling the MCore implementations and when compiling the baseline implementations in the benchmarks. No investigation is made into the effects that the compiler flags have on the presented results.

• The MCore implementations in the benchmarks are constructed with knowledge of how decision points are generated for automatic GPU optimization. This can lead to a bias in the performance of the predictive intrinsics.

nvcc -x cu -arch=sm_50 -I`ocamlc -where` -dc gpucode.cpp \
    -o gpuhost.o

nvcc -arch=sm_50 -dlink gpuhost.o \
    -o gpudevice.o

ocamlopt gpudevice.o gpuhost.o \
    unix.cmxa cpucode.ml \
    -cclib -lcudart \
    -cclib -lstdc++ \
    -O3 \
    -o bin.out

Listing 5.1: An example of how to compile and link the generated OCaml target cpucode.ml and the generated CUDA target gpucode.cpp using nvcc and ocamlopt.


• All floating-point computations in the MCore implementations are performed using double precision, while all floating-point computations in the Lift implementations are performed using single precision. This can lead to unrepresentative comparisons between Lift and FLAGON, as it might not be known whether a performance difference is due to the method of GPU optimization or to floating-point precision.

5.2 Tuning Results

As stated in Section 5.1.1, the goal of the heuristic is to ensure that the time fraction and the cost fraction are either both greater than 1 or both less than 1. If the time fraction is equal to 1, then the value of the cost fraction does not matter, as both backends are equally time-consuming. This goal enables the program to make the correct decision on which backend to run the higher-order function on in order to gain the largest speedup in execution time.

The process of manually tuning the cost profiles was based on iteratively evaluating the cost expressions for all training programs, analyzing how the cost fractions compare to the time fractions, and then tweaking any parameters that appeared to be partly the cause of an inaccurate cost fraction. This iterative process stopped when no tweak made to the cost profiles appeared to yield any better results. While some attention was paid to making sure that the costs for individual operations were representative, most of the focus was put on tuning the fixed CUDA overhead α and the relative CUDA memory overhead β. The reason for putting extra focus on these values was that tdevice is close to constant for small inputs, but relates more to the input size once the inputs become large. In the case of vecpo, the value of tdevice also appeared to be proportional to the value of thost for larger inputs, hinting that memory transfer could be the main contributor to the execution time when the computation performed on each element is small. While the α and β values could be tuned such that the predictive model would always choose the fastest backend, every such tested combination of α and β led either to the cost fraction being strongly unrepresentative of the time fraction for the smallest inputs, or to the cost fraction being unrepresentative of the time fraction for the rest of the inputs. As such, it was decided to choose values for α and β such that the cost fraction was representative over a larger range of inputs, making the tuning results better conform to Criterion 3 from Section 5.1.1. The consequence of unrepresentative cost fractions for small inputs can be seen in the vecpo and gcdsum results in Figures 5.1 and 5.2.

The tuning results that follow are based on values collected using the resulting cost profiles and the CUDA specific values α, β, and γ. Tables 5.4, 5.5, and 5.6 present the raw measurements for each training program and input size. The plots in Figures 5.1, 5.2, and 5.3 compare the cost fractions against the time fractions for each training program, where a black line is used to visualize the point at which the predictive model will choose to run a data-parallel intrinsic on device instead of on host if the cost fraction exceeds that line.

If both the time fraction and the cost fraction are on the same side of the black line, then the predictive model will make the right choice and select the fastest backend.
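The check performed during tuning can be summarized by the small OCaml sketch below, which tests whether the time fraction and the cost fraction fall on the same side of 1 for one measurement; it is an illustration of the criterion, not code from the tuning process.

(* The tuning goal stated above: for a given measurement, the time
   fraction thost/tdevice and the cost fraction Chost/Cdevice should
   fall on the same side of 1 for the model to pick the faster backend;
   if the time fraction equals 1, either choice is acceptable. *)
let same_side_of_one ~t_host ~t_device ~c_host ~c_device =
  let time_fraction = t_host /. t_device in
  let cost_fraction = c_host /. c_device in
  (time_fraction > 1.0 && cost_fraction > 1.0)
  || (time_fraction < 1.0 && cost_fraction < 1.0)
  || time_fraction = 1.0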

Table 5.4: Tuning results from vecpo.

Input Size   thost (s)   tdevice (s)   Chost         Cdevice
50k          0.001       0.003         2750000       41000528
500k         0.004       0.005         27500000      50005216
1M           0.007       0.007         55000000      60010432
10M          0.074       0.036         550000000     240104176
25M          0.186       0.087         1375000000    540260432

Figure 5.1: Time to cost comparison for vecpo, plotting the time fraction thost/tdevice and the cost fraction Chost/Cdevice (ratio) against input size (50k to 25M).

Looking at Figure 5.1, which visualizes the results from vecpo, it can be seen that Criteria 1 and 2 from Section 5.1.1 are fulfilled for all input sizes. However, as mentioned previously, Criterion 3 (thost/tdevice ≈ Chost/Cdevice) is not fulfilled at the smallest input size of 50k. For the other input sizes, the cost fraction approximates the time fraction more closely.

Another key observation is that the cost fraction is smaller than the time fraction for the smaller inputs, but steadily surpasses the time fraction for the larger inputs. In the case of vecpo, this implies a too large value of Cdevice on smaller inputs and a too small value of Cdevice on larger inputs. However, as can be seen in the results from the other training programs, this behavior of the cost fraction being smaller than the time fraction for smaller inputs and steadily surpassing it for larger inputs does not always apply.

Table 5.5: Tuning results from gcdsum.

Input Size   thost (s)   tdevice (s)   Chost          Cdevice
128          0.001       0.003         4481280        40264720
256          0.002       0.003         17908224       40529424
512          0.009       0.003         71599104       41058832
1024         0.036       0.005         286328832      42117648
2048         0.139       0.006         1145180160     48409120
4096         0.551       0.008         4580450304     65165872
8192         2.182       0.012         18321260544    140417632

Figure 5.2: Time to cost comparison for gcdsum, plotting the time fraction thost/tdevice and the cost fraction Chost/Cdevice (ratio) against input size (128 to 8192).

Looking at Figure 5.2, which visualizes the results from gcdsum, it can again be seen that Criteria 1 and 2 from Section 5.1.1 are fulfilled for all input sizes. The previous issue from vecpo of Criterion 3

(thost/tdevice ≈ Chost/Cdevice) not being fulfilled for the smallest input size is also present for gcdsum, but the relative difference between the time fraction and the cost fraction is smaller for gcdsum than it is for vecpo. Another difference between the results from gcdsum and vecpo is how the cost fraction relates to the time fraction as the input size grows: the cost fraction for vecpo steadily got closer to the time fraction as the input size grew, eventually surpassing it at an input size of 10M. In the case of gcdsum, the cost fraction also steadily gets closer to the time fraction as the input size grows, but instead of surpassing it, it forms a plateau at input sizes 1024, 2048, and 4096, and then starts getting smaller relative to the time fraction at an input size of 8192.

Table 5.6: Tuning results from matmaxsum.

Input Size   thost (s)   tdevice (s)   Chost           Cdevice
32           0.001       0.007         3377152         40064379
64           0.003       0.007         26615808        40324273
128          0.021       0.007         211320832       41966345
256          0.169       0.008         1684144128      53488489
512          2.104       0.012         13447462912     139475689
1024         16.296      0.062         107476942848    803149033

Figure 5.3: Time to cost comparison for matmaxsum, plotting the time fraction thost/tdevice and the cost fraction Chost/Cdevice (ratio) against input size (32 to 1024).

The results for matmaxsum, shown in Figure 5.3, also fulfill Criteria 1 and 2 from Section 5.1.1. One difference compared to vecpo and gcdsum is that the relative difference between the cost fraction and the time fraction for matmaxsum is not as great as for the previous two, better satisfying Criterion 3 (thost/tdevice ≈ Chost/Cdevice) for the smallest input value. For the rest of the input sizes in matmaxsum, the cost fraction does not appear to approximate the time fraction to the same extent as it does in vecpo and gcdsum.

The relation between the cost fraction and the time fraction as the input size grows is also different in matmaxsum compared to how it behaved for vecpo and gcdsum. Except for the smallest input size, the cost fraction in matmaxsum starts out greater than the time fraction for smaller input sizes and is then surpassed by the time fraction as the input size grows. This relation between the cost and time fractions is the opposite of the relation observed for vecpo, hinting that either the cost profiles and CUDA specific values are not optimally tuned, or the model used for estimating the execution cost is not expressive enough.

5.3 Benchmark Results

This section contains the results from the investigation into the effects of multiple elements per thread and the results from the benchmarks described in Section 5.1.2. First, the results from the elements per thread investigation are presented in Section 5.3.1, followed by summarized benchmark results separated into sections based on the baseline used. The full results from all benchmarks are presented in Appendix C. The summarized results from the benchmarks with Lift as the baseline are presented in Section 5.3.2, with PPL-CUDA-SMC as the baseline in Section 5.3.3, and with cuBLAS as the baseline in Section 5.3.4.

5.3.1 Elements per Thread

The effect that the number of elements per CUDA thread has on the execution time can be seen in Figure 5.4 for the Saxpy Single benchmark. As the procedure used on each element in Saxpy Single is just the computation a·x+y, it was assumed that any significant overhead on thread scheduling in CUDA would make itself visible by making the runs with larger inputs faster the more elements are used per thread. However, the opposite can be observed when

looking at the results in Figure 5.4: the runs with the smallest number of elements per thread are the fastest and the runs with the largest number of elements per thread are the slowest. This effect was observed for all input sizes, except for some anomalies among the runs on smaller input sizes.

Figure 5.4: Average measured wall-clock time (ms) from the Saxpy Single benchmark when running cudaMap with 1, 2, 4, 8, and 16 elements per thread, plotted against input size. The error bars in the plot represent the maximum and minimum observed execution time.

The anomalies for the runs on smaller inputs can be seen in Figure 5.4, where it appears that some of the runs that use a small number of elements per thread are consistently slower by a large margin. However, by looking at the outputs from nvprof for these runs, it can be seen that the measured wall-clock time from the benchmark does correlate with the profiler output of kernel execution time. The profiler output shows that the average kernel execution time is lower when the number of elements per thread is smaller, as is also shown by the benchmark outputs on the input sizes without any anomalies. When further looking at the nvprof output, it can be observed that CUDA API calls such as cudaFree and cudaMalloc take more time when the anomalies occur than they do

for other cases with the same input size. Given that these anomalies only exist for some input sizes, there does not appear to be any correlation between the extra time spent on API calls and the number of elements used per CUDA thread. As these anomalies are likely to be an artifact of how the benchmark is run, the claim that there is no scheduling overhead on CUDA threads [21, 36] appears to be correct. At the very least, there do not appear to be any direct performance gains from simply using more elements per thread in CUDA. As the device-only intrinsics in the following benchmarks all use 1 element per thread, they are considered to be using the optimal number of elements per thread based on these results.

5.3.2 Lift Benchmarks

The results from the Lift benchmarks are summarized in this section using the Matrix Multiplication (GEMM) and Nearest Neighbour (NN) benchmarks as examples. The full results from all Lift benchmarks are presented in Appendix C.1, which also contains error bars on the FLAGON measurements.

As mentioned in Section 5.1.2, the method of measuring execution time in the Lift CGO2017 artifact [68] is not the same as the method used by the other benchmarks in this thesis. As such, the comparison between Lift and the FLAGON device variant is presented separately and analyzed with this difference in mind. However, despite the difference in the method of collecting measurements, all measurements made in Lift and FLAGON are performed on the same platform described in Section 5.1.4.

Comparing the FLAGON host and device variants shows several differences. One key difference is that the elapsed time for the device variants is, for small input values, close to 0.2 seconds across all Lift benchmarks. This hints at a fixed time delay for all device variants, modeled by the α term in the cudacost function introduced in Section 4.3.3.

Another key difference between the host and device variants can be seen for the larger input values in benchmarks where memory is shared differently between parallel computations, such as the GEMM and NN benchmarks in Figures 5.5 and 5.6 respectively. In the case of GEMM, the running time is O(n³) while the amount of memory scales as O(n²). For NN, the running time and the amount of memory both scale with O(n). The speedup on the largest input value is around 200 for GEMM and around 1.2 for NN.

Figure 5.5: Comparison between benchmark results in average execution time (ms) from the GEMM benchmark for the FLAGON host, device, and predictive variants, plotted against input size. Only results from the MCore implementations are shown. The comparison against the Lift baseline implementation is shown in Figure 5.7.

However, this connection between the number of computations and the amount of memory transferred does not appear to hold when looking at the ATAX benchmark, where both the running time and the amount of memory scale by O(n²), but the speedup on the largest input value is 10. The ATAX results are shown in Figure C.1 in Appendix C.

Looking at the performance of the predictive variant in GEMM and NN, it can be seen from Figures 5.5 and 5.6 that the optimal backend was chosen for large and small input values. For input values in between, where the performance of the host and device variants is closer, the non-optimal backend is sometimes chosen. Though the performance penalties in these cases are mostly small, the predictive variant can sometimes be more than 5 times as slow as the optimal variant on certain input sizes. Such a worst case can be seen for input size 128 in the Convolution benchmark, which is located in Appendix C.1.2.

The anomalies of slower execution time for smaller input sizes, seen in the elements per thread results in Section 5.3.1, can also be seen in the results for the Lift benchmarks.

Figure 5.6: Comparison between benchmark results in average execution time (ms) from the NN benchmark for the FLAGON host, device, and predictive variants, plotted against input size. Only results from the MCore implementations are shown. The comparison against the Lift baseline implementation is shown in Figure 5.8.

An example of this anomaly can be seen in the GEMM benchmark, where the execution time for the device variant is slower for an input size of 32 than for 64.

Comparing the Lift and FLAGON implementations shows a longer execution time for FLAGON than for Lift in both cases. This was expected for benchmarks such as GEMM, where the implementations differ in terms of manual optimization. However, it was unexpected for FLAGON to be 2 times slower than Lift in the NN benchmark. Unlike GEMM, the Lift and FLAGON implementations are, on a high level, nearly identical for NN. The generated CUDA code from FLAGON also looks very similar to the Lift OpenCL code.

Figure 5.7: Comparison between the Lift implementation and the MCore implementation (FLAGON device variant) of the GEMM benchmark (execution time in ms; input sizes 1024 and 4096). Note that the Lift results are median execution times while the FLAGON results are average execution times.

Figure 5.8: Comparison between the Lift implementation and the MCore implementation (FLAGON device variant) of the NN benchmark (execution time in ms; input sizes 8388608 and 33554432). Note that the Lift results are median execution times while the FLAGON results are average execution times.

5.3.3 SMC Airplane Benchmark

Looking at the execution time of the FLAGON device variant in Figure 5.9, it can first be seen that the PPL-CUDA-SMC implementation is faster for smaller input sizes. For input sizes above 500000, the FLAGON device variant runs faster than the PPL-CUDA-SMC variant. When looking at the output from nvprof, this effect appears to be caused by the PPL-CUDA-SMC implementation performing a lot of synchronization between host and device, a total of 139 seconds over both the warmup and the measurement iterations. This can be compared to the FLAGON device implementation, where the total time spent synchronizing host and device is 19 seconds over both the warmup and measurement iterations.

Another observation from the FLAGON device measurements in Figure 5.9 is that the device variant is, as expected, slower than the host variant for small inputs, but becomes faster than the host variant as the input size reaches 10000, and remains faster for all larger input sizes. This can be compared against the PPL-CUDA-SMC baseline implementation, which becomes faster than the FLAGON host variant at an input size of 2000.

Figure 5.9: Comparison in average execution time (ms) from the SMC Airplane benchmark for the FLAGON host, device, and predictive variants and the PPL-CUDA-SMC baseline, plotted against input size.

One of the more notable results from this benchmark can be seen for the predictive variant at input sizes 2000 and 10000, where it runs faster than both the device and host variants as well as the PPL-CUDA-SMC baseline implementation. This effect can be explained by the fact that the FLAGON implementation performs the SMC Airplane computation through multiple invocations of data-parallel intrinsics that, unlike in the ATAX benchmark, perform different types of computations. As such, the predictive variant can choose to execute certain parts of the SMC Airplane computation on device and the rest on host, depending on what the predictive model determines is the faster backend for each data-parallel intrinsic invocation. Looking at the output from nvprof for input sizes 2000 and 10000, it can be seen that the predictive variant only chose to execute the resampling and propagation on device, while other steps such as weight normalization were performed on host.
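The following OCaml sketch illustrates this behaviour: each data-parallel step makes its own backend decision, so a single program can mix host and device execution. The step names and the numeric cost estimates are illustrative placeholders and are not taken from the FLAGON cost model.

(* Each step carries its own estimated host and device cost; every
   predictive invocation decides independently, so some steps can run
   on device while others stay on host. *)
let () =
  let steps =
    (* (step, estimated host cost, estimated device cost) *)
    [ ("propagate", 4.0e6, 2.0e6)
    ; ("normalize-weights", 5.0e5, 3.0e6)
    ; ("resample", 8.0e6, 2.5e6) ]
  in
  List.iter
    (fun (name, c_host, c_device) ->
      let backend = if c_device < c_host then "device" else "host" in
      Printf.printf "%-18s -> %s\n" name backend)
    steps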

Figure 5.10: Comparison in average execution time (ms) from the SMC Airplane benchmark presented as a bar graph. The error bars show the measured maximum and minimum execution time for each input size.

5.3.4 cuBLAS Benchmarks

This section summarizes the results from the cuBLAS benchmarks using the Rank-1 Update (GER) and Saxpy benchmarks as examples. The full results from the cuBLAS benchmarks are presented in Appendix C.2. The number of baseline implementations used varies between the cuBLAS benchmarks. Why some benchmarks use more than one baseline implementation is explained in Section 5.1.2.

Comparing the FLAGON host and device variants in all benchmarks except GER shows very similar performance at the higher input values. Taking the Saxpy results shown in Figure 5.11 as an example, the host variant performs better for smaller input sizes, yet manages to perform on a similar level to the device variant for larger input sizes.

Figure 5.11: Comparison in average execution time (ms) from the Saxpy benchmark presented as a bar graph, for the FLAGON host, device, and predictive variants, cuBLAS, and the CUDA (Custom) and CUDA (Example) implementations, plotted against input size. The error bars show the measured maximum and minimum execution time for each input size.

In contrast, looking at the GER results in Figure 5.12, the device variant is close to 10 times faster than the host variant on the largest input sizes. This difference between the benchmarks is unexpected, since the computation performed on each element in GER is very small and does not increase with input size.

Figure 5.12: Comparison in average execution time (ms) from the GER benchmark for the FLAGON host, device, and predictive variants and cuBLAS, plotted against input size.

The predictive variant, as with the previous baselines, chooses the optimal backend for the smaller and larger input sizes. On input sizes in between, the predictive variant often chooses the non-optimal backend. While Figure 5.12 shows that this has a big impact on GER, the other cuBLAS benchmarks do not suffer as much of a deficit when the predictive variant chooses the non-optimal backend.

The anomalies seen in previous benchmarks, where the execution time is greater for a smaller input size than for a larger input size, are also present in the cuBLAS benchmarks. This effect can clearly be seen for the device variant in Figure 5.12 at input sizes 32 and 128.

Comparing FLAGON against the cuBLAS baseline shows similar performance between the FLAGON device variant and the cuBLAS implementations. However, for larger input values the FLAGON device variant becomes noticeably faster than the cuBLAS implementations. The output from nvprof shows that the CUDA API calls for the cuBLAS implementations take more time on larger input values compared to the FLAGON device variant. As the CUDA API calls generated by FLAGON are very similar in structure to the API calls used in the baseline implementations, it is not immediately apparent what might cause this difference.

These differences are further discussed on a case-by-case basis in Appendix C.2.

Chapter 6

Discussion

This chapter discusses and reflects on the work carried out in this thesis, divided into three separate discussions about code generation design choices, the method for automatic parallelization decision points, and the benchmark results.

6.1 Design Choices for Code Generation

Several weaknesses in the code generation have already been touched upon in previous chapters, such as that representing MCore sequences as arrays in OCaml causes certain intrinsic MCore expressions to run in O(n) instead of the expected O(1). One such intrinsic expression is cons, which requires that the array argument is copied to a new location together with the prepended element in order to preserve the functional semantics of MCore sequences. The decision to represent MCore sequences as arrays in OCaml was made to simplify and optimize the interface towards device memory, only requiring a single invocation of cudaMemcpy to transfer an MCore sequence such that it can be accessed by device code. Had an MCore sequence been represented as a list, this transfer to device would either have to be performed using a number of invocations of cudaMemcpy proportional to the length of the list, or by first traversing the list and copying each element into a C++ array which could then be copied to device using a single invocation of cudaMemcpy. A possible compromise would be to represent MCore sequences in OCaml as a tree structure with arrays as nodes, where the size of the node arrays would balance between staying as large as possible while not slowing down host code too much. This tree structure could enable standard MCore intrinsics such as cons to run


in cumulative O(1) while only requiring a number of invocations of cudaMemcpy proportional to the number of nodes when transferring memory to device. The performance of representing MCore sequences with such a tree structure would have been interesting to compare against the array representation, but this was out of scope for this thesis due to time constraints.
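As a rough illustration of the idea (not a representation used by the compiler in this thesis), such a structure could be sketched in OCaml as a list of fixed-capacity array chunks, where cons copies at most one bounded-size chunk and a device transfer needs one copy per chunk rather than one per element.

(* Simplified tree-of-arrays sketch: a sequence is a list of array
   chunks of bounded size. The chunk size 1024 is an arbitrary choice
   for illustration. *)
type 'a chunked_seq = 'a array list

let max_chunk = 1024

let cons (x : 'a) (s : 'a chunked_seq) : 'a chunked_seq =
  match s with
  | chunk :: rest when Array.length chunk < max_chunk ->
      (* Copy only the first chunk; the remaining chunks are shared,
         so the cost of cons does not grow with the sequence length. *)
      Array.append [| x |] chunk :: rest
  | _ ->
      (* First chunk is full (or the sequence is empty): start a new one. *)
      [| x |] :: s

let length (s : 'a chunked_seq) : int =
  List.fold_left (fun acc chunk -> acc + Array.length chunk) 0 s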

let num = getArgToInt 0 in
let count = getArgToInt 1 in -- sufficiently large number
let arr = getArgToIntSeq 2 in

recursive let collatz = lam i. lam n.
  if eqi i 1 then true
  else if leqi n 0 then false
  else if eqi (modi i 2) 0 then
    collatz (divi i 2) (subi n 1)
  else
    collatz (addi (muli i 3) 1) (subi n 1)
in

let foo = lam x. lam y. addi (muli x x) y in
let bar = lam x. lam y. lam z. sum (cons x (cons y [z])) in

let f = if collatz num count then foo 5 else bar 10 10000 in

-- This applies the non-static function argument f, which
-- is not allowed by the implementation in this thesis.
let res = predictiveMap f arr in
()

Listing 6.1: An example of a program that does not statically apply the function argument to a data-parallel intrinsic.

Another design decision, stated as a delimitation in Section 1.4, was that any function provided as an argument to the data-parallel intrinsics had to be statically applied. An example where a function is not statically applied can be seen in Listing 6.1, where it is not known at compile time which of the two functions will be provided as an argument to the data-parallel intrinsic. One method to support this functionality would be to apply defunctionalization on a conservative range of functions that are eligible to be passed as arguments to data-parallel intrinsics. An issue with this approach is that some functions that might never be passed as an argument to a data-parallel intrinsic

would be attempted to be generated as CUDA code anyway. If one of these functions does not satisfy the limitations on CUDA code generation stated in Section 4.2.3, then the compiler would throw an error even though the program itself satisfies all constraints for the code to be run on device. While this could be considered an issue that should be resolved to preserve the functional aspects of the MCore language, the question could be raised of when someone actually needs to provide the argument non-statically to a GPU-optimized function. In the case that the function to be executed on device is not known at compile time, the programmer could statically enter each of these functions into separate invocations of data-parallel intrinsics that are wrapped in if-expressions, choosing at runtime which of the cases should be run on device.

The choice of higher-order functions to be implemented as data-parallel intrinsics, and how they could be used by the compiler, was found to be limiting for many of the implemented programs. Although the implemented higher-order functions map, mapi, and init provide perfect data-level parallelism that is useful in many applications, other applications that perform operations with non-perfect data-level parallelism, such as taking the sum of a sequence of floats, could benefit from having access to intrinsic data-parallel higher-order functions such as reduce. The ability to combine these data-parallel intrinsics, which is already present in existing frameworks such as Lift, would also allow for further parallel optimization in applications such as the GEMV benchmark, where the dot product for each initialized element could be performed using map followed by reduce, as in the MapReduce model (see the sketch at the end of this section). These additional data-parallel intrinsics would also eliminate the need for recursion in many cases, reducing the dependence on the heuristic assumption made in Section 4.3.1. The effects of this are further discussed in Section 6.2.

Looking at the data-parallel intrinsics implemented in this thesis from a usability perspective, providing three different intrinsics for each higher-order function can appear inconvenient, as it might not be known at the time of writing the code which platform a higher-order function should be executed on. As procedures such as matrix multiplication are often provided through library implementations, they could be used on platforms without access to a GPU, while at the same time benefiting significantly from GPU optimization on the platforms that have access to one. This could be addressed by, instead of specifying the variant of data-parallel intrinsic at the time of writing code, providing a single higher-order function per

intrinsic that uses compiler option arguments to determine the variant to use. This could also be useful in the case that backends other than CUDA should be available for parallel optimization of higher-order functions.
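As an illustration of the map-followed-by-reduce combination mentioned above for the GEMV dot product, the OCaml sketch below shows the intended shape using ordinary host-side functions in place of the hypothetical data-parallel intrinsics.

(* One dot product expressed as a map (pairwise products) followed by a
   reduce (summation); Array.mapi and Array.fold_left stand in for the
   combined data-parallel intrinsics that this discussion suggests. *)
let dot (xs : float array) (ys : float array) : float =
  let products = Array.mapi (fun i x -> x *. ys.(i)) xs in
  Array.fold_left ( +. ) 0.0 products

(* Example: dot [| 1.0; 2.0; 3.0 |] [| 4.0; 5.0; 6.0 |] evaluates to 32.0 *)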

6.2 Automatic Parallelization Method

The heuristic introduced in Section 4.3 showed potential in many benchmarks, as the predictive variants were either optimal or close to optimal in most cases. However, an issue with the benchmarks in this thesis is that their FLAGON implementations were constructed with knowledge of how the underlying predictive model operates. Without knowing which variables should be entered into the if-conditions and how recursion should look for the best predictive performance, the benchmark results in Section 5.3 might not have been as positive. This prompts the question of possible alterations or extensions to the predictive model.

let arr = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14] in

-- The value "count" determines the number of iterations for f.
recursive let f = lam count: Int. lam largeval: Int. lam elem: Int.
  if eqi count 0 then
    elem
  else if eqi largeval 887543982 then
    f (subi count 1) largeval (addi elem 2)
  else
    f (subi count 1) largeval (subi elem 3)
in

-- The predictive model will conclude that f is invoked recursively
-- 887543982 times for each element, when it actually is only
-- recursively called 10 times for each element.
let res = predictiveMap (f 10 887543982) arr in
()

Listing 6.2: An example of a program where the predictive model makes an incorrect assumption about which value represents the number of invocations of the recursive function f.

Taking all values from the if-conditions in the body of a recursive function and claiming that the maximum of those represents the number

of invocations of a recursive function, as the heuristic does, has many scenarios where it can make the wrong prediction. As exemplified in Listing 6.2, if a recursive function performs a constant-time action based on the input value of a presumably large constant, then the predictive model might conclude that the large constant represents the number of iterations. Although this misunderstanding could be avoided by moving the comparison out to a separate helper function, that is not a convenient solution for the programmer and would probably not come to mind for someone who does not know about the inner workings of the predictive model. An alternative, and more complex, approach to taking the maximum of all values in an if-condition would be to try to perform a recursive analysis of the function and how its input arguments change over time. But as determining the complexity of an arbitrary function is fundamentally impossible, this analysis would in the end be some form of heuristic.

Another approach to determining the recursive properties would be to expose the inner workings of the predictive model to the programmer, letting them annotate the complexity of a function directly in the code itself. While this might not be as user-friendly as an automatic approach, for libraries that provide recursive functions expected to be used in GPU-optimized procedures, this could prove a viable option. If not a full replacement, this method could at least complement the heuristic as a tool for more accurate predictions.

As mentioned towards the end of Section 6.1 on code generation, the inclusion of additional intrinsic higher-order functions, along with the ability to combine them, would allow many recursive functions to be replaced by data-parallel intrinsics. This would further reduce the dependence on the heuristic assumption, as many higher-order functions carry complexity information in regards to their input values. While some higher-order functions such as reduce could also be used for parallelism, higher-order functions without parallelism properties could be useful simply for the higher-order information they provide. An example from the benchmarks is the SMC Airplane benchmark, in which the FLAGON implementation uses a binary search to perform the resampling, which runs in O(log n) time, while the heuristic concludes that it instead runs in O(n) time. Therefore, providing a binary search through a binsearch higher-order function would, similarly to the annotation approach described previously, be able to help the heuristic determine the complexity of a scanned function, though in this case without requiring the programmer to provide the complexity information themselves.
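A hedged sketch of such a binsearch higher-order function is shown below; it is an illustration of the suggestion, not an intrinsic provided by the compiler in this thesis. The comparison function cmp x is assumed to be negative when x lies below the target, zero on a match, and positive when x lies above it.

(* Binary search over a sorted array driven by a user-supplied
   comparison; because the search pattern is fixed, a predictive model
   could assign it O(log n) cost directly instead of falling back on
   the recursion heuristic. *)
let binsearch (cmp : 'a -> int) (arr : 'a array) : int option =
  let rec go lo hi =
    if lo >= hi then None
    else
      let mid = lo + (hi - lo) / 2 in
      let c = cmp arr.(mid) in
      if c = 0 then Some mid
      else if c < 0 then go (mid + 1) hi
      else go lo mid
  in
  go 0 (Array.length arr)

(* Example: binsearch (fun x -> compare x 7) [| 1; 3; 7; 9; 12 |]
   evaluates to Some 2. *)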

A fundamentally different method from the microanalysis used in the predictive model would be to opt for a machine learning approach similar to the one used by SkePU [46], applied to the selection of execution backend instead of the selection of execution parameters. Unlike the method used in this thesis, which attempts to predict the execution time before running the program, the machine learning algorithm from SkePU would have to be applied to each program and platform combination. While a machine learning approach could estimate execution times better than a microanalysis, it might not be as suitable for programs that are frequently rewritten or programs that are seldom executed.

Where a machine learning approach could instead be applied is the tuning of cost profiles. The manual tuning of the cost profiles that was performed in Section 5.2 is arguably infeasible to demand of those who use the predictive model. Instead, a machine learning model could be applied that first runs a suite of tuning programs to measure execution time, similar to what was done in Section 5.2, and then automatically fits the cost profiles using a machine learning algorithm to match the measured execution times.

The cost profiles themselves might also be missing a couple of parameters, as shown by the GER benchmark results. While the GER computation is a very simple linear operation, the device variant showed a significant speedup over the predictive variant for small input sizes when the predictive model determined that host would be the faster backend. This hints that there could be some effect that is not captured by the cost profiles, as the host variant is significantly slower than for other benchmarks with similar computations when comparing at similar input sizes. Comparing GER and Saxpy, the host variant for GER at input size 512 (matrix dimension, 262144 total elements) measures an average execution time of 6 ms, while the host variant for Saxpy at input size 262144 measures an average execution time of 1.5 ms. Although some aspects that can affect execution time, such as cache effects, are known not to be accounted for in the cost profiles, it is not clear whether they are the cause of the difference shown between host and device in the GER benchmark.

6.3 Benchmark Results

The claim that there is no thread scheduling overhead in CUDA [21, 36] is supported by the results from Section 5.3.1, where the trend appears to be that a CUDA kernel runs slower the more elements per thread it has. Following these results, it was assumed that one element per thread would be the fastest configuration for the device intrinsics, hence that was the value used for the benchmarks where the performance of the device variant was compared against the host variant and implementations from related work.

Comparing the host and device variants of the data-parallel intrinsics showed a speedup for the device variant on larger input sizes, mainly where the procedure performed on individual elements did not execute in O(1) time. In the case of the GEMM benchmark for 4096 × 4096 matrices, the device variant measured an average execution time of 4.3 seconds compared to 950.8 seconds for the host variant, resulting in a speedup factor of over 200. For other benchmarks such as Saxpy, where the procedure performed on each element is very small, the device variant measured an average execution time of 200.9 ms for an input size of 50 million elements, compared to 230.1 ms for the host variant, resulting in a speedup of just 14%. This reduced speedup compared to GEMM can be explained by most of the time for Saxpy being spent transferring memory between host and device. Had other benchmarks that copy more memory between host and device, such as addition of 6 vectors, been included in the evaluation, then the device variants in those cases would have been expected to always run slower than the host variants due to the large amounts of memory being copied.

This also exposes a flaw in the compiler implementation from this thesis in that it transfers memory back and forth between host and device even when a directly successive invocation of a device variant uses a previous output sequence from a device variant as input. As it is not always possible to combine operations from two separate data-parallel intrinsics into the same invocation by hand, such as when using the output from Saxpy as one of the input vectors to GER, having the compiler analyze the code and find points where memory can be kept on device between invocations of data-parallel intrinsics would most likely yield a noticeable speedup.

The ability of the predictive data-parallel intrinsics to determine which of the platforms is faster can be seen from the benchmark results in Section 5.3 to be best for the smallest and largest inputs.

For input sizes in the middle, where the performance of the host and device variants is similar, the predictive variant often chooses the wrong backend. There are, however, exceptions to these statements, such as in the Convolution benchmark where the predictive variant chooses the wrong backend for the three smallest input sizes. In the SMC Airplane benchmark in Section 5.3.3, the predictive variant can be seen to measure the fastest average execution time at input sizes 2000 and 10000, compared to the host variant, the device variant, and the baseline implementation. This can be explained by the SMC Airplane benchmark consisting of multiple invocations of different data-parallel intrinsics, where at input sizes 2000 and 10000 the predictive variant chooses to run the resampling on device, while running the other data-parallel intrinsics on host.

The SMC Airplane benchmark likely demonstrates one of the most practical applications of the predictive model, which is that it relieves programmers of having to worry about when it is faster to run on device and when it is faster to run on host in the case that there are multiple invocations of data-parallel intrinsics. Combining this property with the previously mentioned approach of analyzing how memory should be transferred between host and device could provide speedup by splitting data-parallel intrinsics into as small invocations as possible, giving the predictive model more flexibility in what to run on device and what to run on host.

When comparing against existing solutions, it can be seen in Section 5.3.4 that the FLAGON device variants perform slightly better than or on par with the cuBLAS baseline implementations, apart from some fluctuations in the measurements. However, as could be seen from the Saxpy, Saxpy Single, VecAdd, and VecScal benchmarks, comparing the cuBLAS implementations against custom-written CUDA kernels showed a significant performance disadvantage for cuBLAS on small and medium input sizes. This prompts the question of how optimized the cuBLAS functions are. When used in the benchmarks, the functions were used as exemplified in their baseline documentation [59].

The SMC Airplane benchmark, however, showed a significant performance benefit for the PPL-CUDA-SMC baseline implementation over the FLAGON device variant for all input sizes, except for the larger input sizes where memory effects appeared to cause a slowdown in the baseline implementation. Unlike cuBLAS, this result was closer to what was expected, as PPL-CUDA-SMC is a framework that focuses on optimization for SMC algorithms, many of which are more complicated than the procedure from the SMC Airplane benchmark.

The comparison between Lift and the FLAGON device variant is not as direct as for other related work, due to Lift using a different method of measurement as well as a different backend than FLAGON. Despite this, it can be seen from the kernel time comparisons in Section 5.3.2 that the Lift implementation is faster than the FLAGON device variants in all cases, and at least twice as fast in all cases except for the GEMV benchmark at an input size of 8192. Most of these results were expected, since Lift provides a large set of higher-order functions that can be combined to optimize the procedures, whereas FLAGON only provides three higher-order functions that cannot be combined and must be invoked separately. The only case where the results were initially surprising was the NN benchmark, where the OpenCL kernel generated by Lift and the CUDA kernel generated by FLAGON essentially perform the same operation. However, the possible explanation for this, which can likely explain other results as well, is that Lift generates floats with single precision while the FLAGON compiler generates floats with double precision. This is due to OCaml natively representing floats as double precision, potentially leading to more expensive computations on device than if single-precision floats had been used. Although the FLAGON compiler could generate CUDA code such that all float operations are performed with single precision, this would likely only affect kernel time, since memory transfers between host and device still use the native OCaml representation.

Looking back at the wall-clock time results together with the full results from Appendix C, several bumps can be seen for the FLAGON device variants in the plots in Figures C.1, C.7, C.10, C.16, C.18, C.20, and C.22. Looking at the output from nvprof, these appear to be caused by an anomaly in the CUDA API calls, as some implementations consistently spend more time invoking the CUDA API for smaller input sizes than they do for larger input sizes. An example of this can be seen for Saxpy Single, where the FLAGON device variant takes around 4 times longer in functions such as cudaMemcpy and cudaFree for input size 64 than for input size 256. Whether these anomalies actually are anomalies or whether they are implementation effects could likely have been revealed by running another profiler such as Coz [50] together with the benchmarks. However, due to limitations in the scope of this thesis, further investigation into these effects using the Coz profiler was not performed.

Chapter 7

Concluding Remarks

This chapter summarizes the conclusions drawn from the discussion and outlines future work.

7.1 Conclusions

This thesis presented a compiler for the small functional language MCore that, using intrinsic variants of common higher-order functions, is able to compile and run MCore code in parallel on a GPU. In addition to this compilation method, a predictive model was implemented that, using a microanalysis, is able to determine at runtime, based on input properties, whether an intrinsic higher-order function should run in parallel on a GPU or on the CPU, whichever is considered to be faster by the predictive model.

Experimental results were promising when comparing the intrinsic higher-order functions running on GPU and CPU, where the GPU-optimized variant showed a speedup of up to 200 times for large input sizes. Using the predictive model to determine the backend at runtime showed that the fastest backend was chosen in a majority of cases, and in benchmarks with multiple invocations of intrinsic higher-order functions the predictive model was able to run faster than using only the GPU or only the CPU as the backend for all invocations.

The experimental results were also promising for the GPU-optimized higher-order functions against some related work implementations, but did not reach the performance of related work implementations that allow for user-defined optimization of the parallelization.


7.2 Future work

As the subject of the work carried out in this thesis is very broad, many interesting aspects and approaches were out of scope and thus left as future work:

• Investigate tuning the cost profiles using machine learning, replacing the manual parameter tuning.

• Improve the heuristic for determining the complexity of recursive functions, analyzing how input variables change over time instead of taking the largest value in an if-condition as complexity parameter.

• Add support for annotating the complexity of recursive functions and the variables that their complexity is parameterized against in order to assist the predictive model. This support could also be provided by intrinsic higher-order functions with preannotated complexity.

• Include more higher-order functions like reduce as data-parallel intrinsics, with the possibility of also invoking them, and the currently available data-parallel intrinsics, from within the functions passed as arguments to data-parallel intrinsics.

• Investigate a method of optimizing the number of times that memory is transferred between host and device, instead of transferring memory to and from the device on each invocation of an intrinsic higher-order function.

• Investigate the methods and limitations for non-static application of function arguments to intrinsic higher-order functions.

• Implement a unified set of the intrinsic higher-order functions map, mapi, and init where the decision of using the host, device, or the predictive variant is specified as a compiler option.

• Add support for other parallel backends such as OpenCL.

Bibliography

[1] Murray I. Cole. Algorithmic Skeletons: Structured Management of Parallel Computation. MIT Press, 1989. isbn: 0-273-08807-6.
[2] M. Steuwer, T. Remmelg, and C. Dubach. “LIFT: A functional data-parallel IR for high-performance GPU code generation”. In: 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). Feb. 2017, pp. 74–85.
[3] David Broman. “A Vision of Miking: Interactive Programmatic Modeling, Sound Language Composition, and Self-Learning Compilation”. In: Proceedings of the 12th ACM SIGPLAN International Conference on Software Language Engineering. SLE’19. Athens, Greece: ACM, Oct. 2019. isbn: 978-1-4503-6981-7.
[4] Brian Reistad and David K. Gifford. “Static Dependent Costs for Estimating Execution Time”. In: SIGPLAN Lisp Pointers VII.3 (July 1994), pp. 65–78. issn: 1045-3563.
[5] NVIDIA Corporation. CUDA Website. https://developer.nvidia.com/about-cuda. [Online; accessed 2020-01-08].
[6] Saul Gorn. “Letter to the Editor”. In: Commun. ACM 1.1 (Jan. 1958), pp. 1–. issn: 0001-0782.
[7] MPI Forum. The MPI message passing interface standard. https://www.mcs.anl.gov/research/projects/mpi/standard.html. [Online; accessed 2020-04-28]. 1995.
[8] “International Standard - Information technology Portable Operating System Interface (POSIX) Base Specifications, Issue 7”. In: ISO/IEC/IEEE 9945:2009(E) (2009), pp. 1–3880.
[9] L. Dagum and R. Menon. “OpenMP: an industry standard API for shared-memory programming”. In: IEEE Computational Science and Engineering 5.1 (1998), pp. 46–55.


[10] Jeanne Ferrante, Karl J. Ottenstein, and Joe D. Warren. “The Program Dependence Graph and Its Use in Optimization”. In: ACM Trans. Program. Lang. Syst. 9.3 (July 1987), pp. 319–349. issn: 0164-0925.
[11] Randy Allen and Ken Kennedy. “Automatic Translation of FORTRAN Programs to Vector Form”. In: ACM Trans. Program. Lang. Syst. 9.4 (Oct. 1987), pp. 491–542. issn: 0164-0925.
[12] Aart J. C. Bik et al. “Automatic Intra-Register Vectorization for the Intel® Architecture”. In: International Journal of Parallel Programming 30.2 (Apr. 2002), pp. 65–98. issn: 1573-7640.
[13] Ralph Duncan. “A survey of parallel computer architectures”. In: Computer 23.2 (Feb. 1990), pp. 5–16. issn: 1558-0814.
[14] Kenneth E. Batcher. “Design of a Massively Parallel Processor”. In: IEEE Transactions on Computers C-29.9 (Sept. 1980), pp. 836–840. issn: 2326-3814.
[15] Erik Lindholm, Mark J. Kilgard, and Henry Moreton. “A User-Programmable Vertex Engine”. In: Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques. SIGGRAPH ’01. New York, NY, USA: Association for Computing Machinery, 2001, pp. 149–158. isbn: 158113374X.
[16] Thomas Scott Crow. “Evolution of the graphical processing unit”. In: A professional paper submitted in partial fulfillment of the requirements for the degree of Master of Science with a major in Computer Science, University of Nevada, Reno (2004).
[17] Google Inc. TensorFlow Website. https://www.tensorflow.org/. [Online; accessed 2019-10-05].
[18] A Tasora, D Negrut, and M Anitescu. “Large-scale parallel multibody dynamics with frictional contact on the graphical processing unit”. In: Proceedings of the Institution of Mechanical Engineers, Part K: Journal of Multi-body Dynamics 222.4 (2008), pp. 315–326.
[19] Lorenzo Dematté and Davide Prandi. “GPU computing for systems biology”. In: Briefings in Bioinformatics 11.3 (Mar. 2010), pp. 323–333. issn: 1467-5463.

[20] Intel Corporation. Intel® Architecture Instruction Set Extensions and Future Features Programming Reference. https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf. [Online; accessed 2020-01-14]. May 2019.
[21] David Kirk. “NVIDIA Cuda Software and Gpu Parallel Computing Architecture”. In: Proceedings of the 6th International Symposium on Memory Management. ISMM ’07. Montreal, Quebec, Canada: Association for Computing Machinery, 2007, pp. 103–104. isbn: 9781595938930.
[22] NVIDIA Corporation. Nvidia RTX 2060 product page. https://www.nvidia.com/en-gb/geforce/graphics-cards/rtx-2060/. [Online; accessed 2020-01-14].
[23] D. Lustig and M. Martonosi. “Reducing GPU offload latency via fine-grained CPU-GPU synchronization”. In: 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA). Feb. 2013, pp. 354–365.
[24] OpenACC. OpenACC Website. https://www.openacc.org/. [Online; accessed 2019-10-05].
[25] The Khronos Group. OpenCL Website. https://www.khronos.org/opencl/. [Online; accessed 2019-10-05].
[26] David Broman. Miking GitHub page. https://github.com/miking-lang/miking. [Online; accessed 2019-09-03].
[27] Herbert Kuchen. “A Skeleton Library”. In: Euro-Par 2002 Parallel Processing. Ed. by Burkhard Monien and Rainer Feldmann. Berlin, Heidelberg: Springer Berlin Heidelberg, 2002, pp. 620–629. isbn: 978-3-540-45706-0.
[28] M. Danelutto and P. Teti. “Lithium: A Structured Parallel Programming Environment in Java”. In: Computational Science — ICCS 2002. Ed. by Peter M. A. Sloot et al. Berlin, Heidelberg: Springer Berlin Heidelberg, 2002, pp. 844–853. isbn: 978-3-540-46080-0.
[29] Anne Benoit et al. “Flexible Skeletal Programming with eSkel”. In: Euro-Par 2005 Parallel Processing. Ed. by José C. Cunha and Pedro D. Medeiros. Berlin, Heidelberg: Springer Berlin Heidelberg, 2005, pp. 761–770. isbn: 978-3-540-31925-2.

[30] Kiminori Matsuzaki et al. “A Library of Constructive Skeletons for Sequential Style of Parallel Programming”. In: Proceedings of the 1st International Conference on Scalable Information Systems. InfoScale ’06. Hong Kong: Association for Computing Machinery, 2006, 13–es. isbn: 1595934286.
[31] J. Falcou et al. “Quaff: efficient C++ design for parallel skeletons”. In: Parallel Computing 32.7 (2006). Algorithmic Skeletons, pp. 604–615. issn: 0167-8191.
[32] J. Darlington et al. “Parallel programming using skeleton functions”. In: PARLE ’93 Parallel Architectures and Languages Europe. Ed. by Arndt Bode, Mike Reeve, and Gottfried Wolf. Berlin, Heidelberg: Springer Berlin Heidelberg, 1993, pp. 146–160. isbn: 978-3-540-47779-2.
[33] John Darlington et al. “Functional skeletons for parallel coordination”. In: EURO-PAR ’95 Parallel Processing. Ed. by Seif Haridi, Khayri Ali, and Peter Magnusson. Berlin, Heidelberg: Springer Berlin Heidelberg, 1995, pp. 55–66. isbn: 978-3-540-44769-6.
[34] George Horatiu Botorog and Herbert Kuchen. “Efficient parallel programming with algorithmic skeletons”. In: Euro-Par’96 Parallel Processing. Ed. by Luc Bougé et al. Berlin, Heidelberg: Springer Berlin Heidelberg, 1996, pp. 718–731. isbn: 978-3-540-70633-5.
[35] Marco Danelutto, Fabrizio Pasqualetti, and Susanna Pelagatti. “Skeletons for data parallelism in p3l”. In: Euro-Par’97 Parallel Processing. Ed. by Christian Lengauer, Martin Griebl, and Sergei Gorlatch. Berlin, Heidelberg: Springer Berlin Heidelberg, 1997, pp. 619–628. isbn: 978-3-540-69549-3.
[36] Johan Enmyren and Christoph W. Kessler. “SkePU: A Multi-Backend Skeleton Programming Library for Multi-GPU Systems”. In: Proceedings of the Fourth International Workshop on High-Level Parallel Programming and Applications. HLPP ’10. Baltimore, Maryland, USA: Association for Computing Machinery, 2010, pp. 5–14. isbn: 9781450302548.
[37] M. Steuwer, P. Kegel, and S. Gorlatch. “SkelCL - A Portable Skeleton Library for High-Level GPU Programming”. In: 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum. May 2011, pp. 1176–1182.

[38] Shigeyuki Sato and Hideya Iwasaki. “A Skeletal Parallel Framework with Fusion Optimizer for GPGPU Programming”. In: Programming Languages and Systems. Ed. by Zhenjiang Hu. Berlin, Heidelberg: Springer Berlin Heidelberg, 2009, pp. 79–94. isbn: 978-3-642-10672-9.
[39] Roland Leißa et al. “AnyDSL: A Partial Evaluation Framework for Programming High-performance Libraries”. In: Proc. ACM Program. Lang. 2.OOPSLA (Oct. 2018), 119:1–119:30. issn: 2475-1421.
[40] Paul Bachmann. Analytische Zahlentheorie. Vol. 2. Teubner, 1894.
[41] Donald E. Knuth. The Art of Computer Programming. 3rd ed. Vol. 1. Addison-Wesley, 1997. isbn: 0-201-89683-4.
[42] Jacques Cohen. “Computer-assisted microanalysis of programs”. In: Communications of the ACM 25.10 (1982), pp. 724–733.
[43] Daniel Le Métayer. “ACE: An Automatic Complexity Evaluator”. In: ACM Trans. Program. Lang. Syst. 10.2 (Apr. 1988), pp. 248–266. issn: 0164-0925.
[44] Timothy J. Hickey et al. “Computer-Assisted Microanalysis of Parallel Programs”. In: ACM Trans. Program. Lang. Syst. 14.1 (Jan. 1992), pp. 54–106. issn: 0164-0925.
[45] Ben Wegbreit. “Mechanical Program Analysis”. In: Commun. ACM 18.9 (Sept. 1975), pp. 528–539. issn: 0001-0782.
[46] Usman Dastgeer, Johan Enmyren, and Christoph W. Kessler. “Auto-Tuning SkePU: A Multi-Backend Skeleton Programming Framework for Multi-GPU Systems”. In: Proceedings of the 4th International Workshop on Multicore Software Engineering. IWMSE ’11. Waikiki, Honolulu, HI, USA: Association for Computing Machinery, 2011, pp. 25–32. isbn: 9781450305778.
[47] Ashish Agarwal et al. OCaml Website. https://ocaml.org/. [Online; accessed 2019-10-04].
[48] Xavier Leroy et al. Interfacing C with OCaml. https://caml.inria.fr/pub/docs/manual-/intfc.html. [Online; accessed 2020-03-02]. Feb. 2020.
[49] NVIDIA Corporation. NVIDIA Profiler User’s Guide. https://docs.nvidia.com/cuda/profiler-users-guide/index.html. [Online; accessed 2020-05-04]. Nov. 2019.

[50] Charlie Curtsinger and Emery D. Berger. “Coz: Finding Code That Counts with Causal Profiling”. In: Proceedings of the 25th Symposium on Operating Systems Principles. SOSP ’15. Monterey, California: Association for Computing Machinery, 2015, pp. 184–197. isbn: 9781450338349.
[51] Alonzo Church. The calculi of lambda-conversion. Princeton University Press, 1941.
[52] Haskell B. Curry, Robert Feys, and William Craig. Combinatory Logic: Volume I. North-Holland Pub. Co, 1958. isbn: 0720422086.
[53] Benjamin C. Pierce. Types and Programming Languages. The MIT Press, 2002. isbn: 0-262-16209-1.
[54] Thomas Johnsson. “Lambda lifting: Transforming programs to recursive equations”. In: Functional Programming Languages and Computer Architecture. Ed. by Jean-Pierre Jouannaud. Berlin, Heidelberg: Springer Berlin Heidelberg, 1985, pp. 190–203. isbn: 978-3-540-39677-2.
[55] NVIDIA Corporation. CUDA Zone Website. https://developer.nvidia.com/cuda-zone. [Online; accessed 2020-01-08].
[56] Andreas Kloeckner. PyCUDA Website. https://mathema.tician.de/software/pycuda. [Online; accessed 2020-01-08].
[57] NVIDIA Corporation. PGI CUDA Fortran Website. https://www.pgroup.com/resources/cudafortran.htm. [Online; accessed 2020-01-08].
[58] NVIDIA Corporation. NVCC - CUDA Toolkit Documentation. https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html. [Online; accessed 2020-06-10]. June 2020.
[59] NVIDIA Corporation. cuBLAS API Reference. https://docs.nvidia.com/cuda/cublas/index.html. [Online; accessed 2020-04-28]. Nov. 2019.
[60] Fujitsu Ltd. Fujitsu AP1000 - History of Fujitsu. https://www.fujitsu.com/global/about/corporate/history/products/computer/supercomputer/ap1000.html. [Online; accessed 2020-05-26]. 2020.
[61] Marco Danelutto et al. “A methodology for the development and the support of massively parallel programs”. In: Future Generation Computer Systems 8.1 (1992), pp. 205–220. issn: 0167-739X.

[62] August Ernstsson, Johan Ahlqvist, and Christoph Kessler. SkePU 3 User Guide. https://skepu.github.io/docs/userguide_skepu.pdf. [Online; accessed 2020-06-11]. May 2020.
[63] Jeffrey Dean and Sanjay Ghemawat. “MapReduce: Simplified Data Processing on Large Clusters”. In: Commun. ACM 51.1 (Jan. 2008), pp. 107–113. issn: 0001-0782.
[64] The Apache Software Foundation. Apache Hadoop Website. https://hadoop.apache.org/. [Online; accessed 2020-06-28]. 2020.
[65] Matei Zaharia et al. “Spark: Cluster computing with working sets.” In: HotCloud 10.10-10 (2010), p. 95.
[66] Y. Yuan et al. “Spark-GPU: An accelerated in-memory data processing engine on clusters”. In: 2016 IEEE International Conference on Big Data (Big Data). 2016, pp. 273–283.
[67] Joey Öhman. PPL-CUDA-SMC GitHub page. https://github.com/JoeyOhman/PPL-CUDA-SMC. [Online; accessed 2020-04-10].
[68] Michel Steuwer. Lift: Artifact for CGO 2017. https://gitlab.com/michel-steuwer/cgo_2017_artifact. [Online; accessed 2019-10-11].
[69] John Wikman. Degree Project Artifact: FLAGON Compiler and MCore Benchmark Implementations. https://github.com/johnwikman/miking/tree/cg-artifact/artifacts/degproj20-johnwikman. [Online; accessed 2020-06-28]. 2020.
[70] Mark Harris. Six Ways to SAXPY. https://devblogs.nvidia.com/six-ways-saxpy. [Online; accessed 2020-05-09]. Feb. 2012.
[71] Daniel Lundén, Johannes Borgström, and David Broman. Correctness of Sequential Monte Carlo Inference for Probabilistic Programming Languages. 2020. arXiv: 2003.05191 [cs.PL].

Appendix A

Training Program Implementations

This appendix contains the implementations of the mapped functions that are part of the training programs from Section 5.1.1. Note that these programs are based on an earlier version of MCore and that the names of some intrinsic functions such as nth are subject to change.

A.1 vecpo

let plusone: (Int -> Int) = lam x: Int. addi x 1

let vecpo = lam seq. predictiveMap plusone seq

Listing A.1: The implementation whose elapsed time is measured in the vecpo training program.


A.2 gcdsum

recursive let gcd: Int -> Int -> Int =
  lam a: Int. lam b: Int.
  if eqi b 0
  then a
  else gcd b (modi a b)
end

let gcdsum: [Int] -> Int -> Int -> Int =
  lam yvec: [Int]. lam n: Int. lam x: Int.
  recursive let recloop: Int -> Int -> Int =
    lam acc: Int. lam i: Int.
    if eqi i n
    then acc
    else recloop (addi acc (gcd x (nth yvec i))) (addi i 1)
  in
  recloop 0 0

let mapped_gcdsum = lam yvec. lam xvec.
  predictiveMap (gcdsum yvec (length yvec)) xvec

Listing A.2: The implementation whose elapsed time is measured in the gcdsum training program.

A.3 matmaxsum

let maxi: Int -> Int -> Int =
  lam x: Int. lam y: Int.
  if lti x y then y else x

-- Function used to "initialize" a single element in the target matrix
let matmaxsumiWrk: Int -> Int -> Int -> [Int] -> [Int] -> Int -> Int =
  lam innerdim: Int.
  lam a_rows: Int.
  lam b_cols: Int.
  lam a: [Int].
  lam b: [Int].
  lam idx: Int.

  let row: Int = divi idx b_cols in
  let col: Int = modi idx b_cols in
  let a_start_offset: Int = muli innerdim row in
  let b_start_offset: Int = col in
  recursive let vecmsum: Int -> Int -> Int -> Int -> Int =
    lam acc: Int.
    lam p: Int.
    lam a_offset: Int.
    lam b_offset: Int.
    if eqi p innerdim then
      acc
    else
      vecmsum (addi acc
                (maxi (nth a a_offset) (nth b b_offset)))
              (addi p 1)
              (addi a_offset 1)
              (addi b_offset b_cols)
  in
  vecmsum 0 0 a_start_offset b_start_offset

let matmaxsum = lam innerdim. lam a_rows. lam b_cols. lam a. lam b.
  predictiveInit (muli a_rows b_cols)
                 (matmaxsumiWrk innerdim a_rows b_cols a b)

Listing A.3: The implementation whose elapsed time is measured in the matmaxsum training program.

Appendix B

Benchmark Implementations

This appendix provides implementation and input details for each of the 12 benchmarks from Section 5.1.2. If any of these benchmarks accepts a matrix as an input argument, then it is represented as a square matrix that is flattened into a sequence, using index transformations for row and column access.
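To make the flattened representation concrete, the following sketch (written in the style of the MCore listings in Appendix A; the names rowOf, colOf, and matGet are chosen here purely for illustration and do not appear in the benchmark code) shows how a flat index is translated into a row and a column of an n × n matrix, and how an element is accessed:

-- Recover the row and column of a flat index in an n x n matrix
let rowOf: Int -> Int -> Int = lam n: Int. lam idx: Int. divi idx n
let colOf: Int -> Int -> Int = lam n: Int. lam idx: Int. modi idx n

-- Access element (row, col) of an n x n matrix stored as a flat sequence
let matGet: Int -> [Int] -> Int -> Int -> Int =
  lam n: Int. lam a: [Int]. lam row: Int. lam col: Int.
  nth a (addi (muli row n) col)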

B.1 ATAX

Performs the linear algebra operation $A^T A \vec{x}$ where A is a matrix and $\vec{x}$ is a vector. This is an operation commonly used when performing least squares approximation. The input sizes listed for ATAX in Table 5.2 correspond to the dimension of the input matrix A. This benchmark uses Lift as the baseline implementation, which only provides a comparison against FLAGON for the two largest input values.
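As a rough illustration of how the MCore implementation is structured (it performs each matrix-vector product with a tail-recursion loop per output element, as discussed in Appendix C), the sketch below expresses ATAX as two invocations of predictiveInit. The names matVecWrk, matTVecWrk, and atax are chosen here for illustration only, and integer intrinsics are used for brevity even though the benchmark operates on floating-point values.

-- One element of A*x: the dot product of row `row` of a with the vector v
let matVecWrk: Int -> [Int] -> [Int] -> Int -> Int =
  lam n: Int. lam a: [Int]. lam v: [Int]. lam row: Int.
  recursive let dot: Int -> Int -> Int =
    lam acc: Int. lam j: Int.
    if eqi j n
    then acc
    else dot (addi acc (muli (nth a (addi (muli row n) j)) (nth v j)))
             (addi j 1)
  in
  dot 0 0

-- One element of A^T*t: the dot product of column `col` of a with the vector t
let matTVecWrk: Int -> [Int] -> [Int] -> Int -> Int =
  lam n: Int. lam a: [Int]. lam t: [Int]. lam col: Int.
  recursive let dot: Int -> Int -> Int =
    lam acc: Int. lam i: Int.
    if eqi i n
    then acc
    else dot (addi acc (muli (nth a (addi (muli i n) col)) (nth t i)))
             (addi i 1)
  in
  dot 0 0

let atax = lam n. lam a. lam x.
  let t = predictiveInit n (matVecWrk n a x) in
  predictiveInit n (matTVecWrk n a t)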

B.2 Convolution

Applies a 1 × 17 convolution filter to an n × n input matrix A, producing an n × (n − 16) output matrix M where each possible 1 × 17 slice of A has been multiplied elementwise by the values in the convolution filter and summed up. This technique is commonly used in machine learning and in image processing. The input sizes listed for Convolution in Table 5.2 correspond to the dimension of the input matrix A. This benchmark uses Lift as the baseline implementation, which only provides a comparison against FLAGON for the two largest input values.
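A minimal MCore-style sketch of this computation is given below, using the flattened row-major layout described above. The names convWrk and convolution are illustrative assumptions, and integer intrinsics (including subi) stand in for the floating-point operations used in the actual benchmark.

-- One element of the n x (n-16) output: a 17-wide weighted sum along a row of a
let convWrk: Int -> [Int] -> [Int] -> Int -> Int =
  lam n: Int. lam filter: [Int]. lam a: [Int]. lam idx: Int.
  let outcols: Int = subi n 16 in
  let row: Int = divi idx outcols in
  let col: Int = modi idx outcols in
  recursive let go: Int -> Int -> Int =
    lam acc: Int. lam k: Int.
    if eqi k 17
    then acc
    else go (addi acc (muli (nth filter k)
                            (nth a (addi (muli row n) (addi col k)))))
            (addi k 1)
  in
  go 0 0

let convolution = lam n. lam filter. lam a.
  predictiveInit (muli n (subi n 16)) (convWrk n filter a)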


B.3 Matrix-Vector Multiplication (GEMV)

Multiplication between a matrix A and a vector $\vec{x}$. The input sizes listed for GEMV in Table 5.2 correspond to the dimension of the input matrix A. This benchmark uses Lift as the baseline implementation, which only provides a comparison against FLAGON for the two largest input values.

B.4 Matrix Addition (GEAM)

Addition between two matrices A and B. The input sizes listed for GEAM in Table 5.2 correspond to the dimension of the input matrix A. This benchmark uses the cuBLAS function cublasDgeam [59] when comparing against FLAGON.
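Expressed with the data-parallel intrinsics from Appendix A, matrix addition is a single elementwise init over the flattened representation. The sketch below is illustrative only; the name geam is assumed, and integer intrinsics stand in for the floating-point operations used in the benchmark.

-- Elementwise addition of two flattened n x n matrices
let geam = lam n. lam a. lam b.
  predictiveInit (muli n n) (lam idx. addi (nth a idx) (nth b idx))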

B.5 Matrix Multiplication (GEMM)

Multiplication between two matrices A and B. The input sizes listed for GEMM in Table 5.2 correspond to the dimension of the input matrix A. This benchmark uses Lift as the baseline implementation, which only provides a comparison against FLAGON for the two largest input values.
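The MCore implementation is described in Appendix C as a mappable version of the textbook O(n^3) algorithm. A sketch of that structure, following the style of Listing A.3, is given below; the names gemmWrk and gemm are chosen for illustration, and integer intrinsics are again used for brevity.

-- One element of the n x n product: the dot product of row `row` of a
-- with column `col` of b (textbook O(n^3) formulation)
let gemmWrk: Int -> [Int] -> [Int] -> Int -> Int =
  lam n: Int. lam a: [Int]. lam b: [Int]. lam idx: Int.
  let row: Int = divi idx n in
  let col: Int = modi idx n in
  recursive let dot: Int -> Int -> Int =
    lam acc: Int. lam k: Int.
    if eqi k n
    then acc
    else dot (addi acc (muli (nth a (addi (muli row n) k))
                             (nth b (addi (muli k n) col))))
             (addi k 1)
  in
  dot 0 0

let gemm = lam n. lam a. lam b.
  predictiveInit (muli n n) (gemmWrk n a b)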

B.6 Nearest Neighbour (NN)

Computes the distances used for determining the nearest neighbour in a 2D plane. The distance is the standard Euclidean distance between two points, $\sqrt{(x_0 - y_0)^2 + (x_1 - y_1)^2}$. The input consists of two sequences that contain the $x_0$ and $x_1$ parts, respectively. The same $(y_0, y_1)$ coordinate is applied to all input coordinates. The input sizes listed for NN in Table 5.2 correspond to the number of input coordinates that have their distance to $(y_0, y_1)$ calculated. This benchmark uses Lift as the baseline implementation, which only provides a comparison against FLAGON for the two largest input values.
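As noted in Appendix C, the calculation is mapped over the input coordinates. The sketch below illustrates that structure; it assumes MCore floating-point intrinsics named subf, mulf, and addf as well as a square-root primitive called sqrtf here, and the names nnWrk and nn are chosen for illustration. The exact intrinsic names may differ from the actual implementation.

-- Distance from input point i to the fixed point (y0, y1).
-- subf, mulf, addf and sqrtf are assumed float primitives (names may differ).
let nnWrk: Float -> Float -> [Float] -> [Float] -> Int -> Float =
  lam y0: Float. lam y1: Float. lam xs0: [Float]. lam xs1: [Float]. lam i: Int.
  let d0: Float = subf (nth xs0 i) y0 in
  let d1: Float = subf (nth xs1 i) y1 in
  sqrtf (addf (mulf d0 d0) (mulf d1 d1))

let nn = lam y0. lam y1. lam xs0. lam xs1.
  predictiveInit (length xs0) (nnWrk y0 y1 xs0 xs1)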

B.7 Rank-1 Update (GER)

Performs a rank-1 update on a matrix as $\vec{x}\vec{y}^T + A$, where the inputs $\vec{x}, \vec{y}$ are vectors and the input A is a matrix. The input sizes listed for GER

in Table 5.2 correspond to the dimension of the input matrix A. This benchmark uses the cuBLAS function cublasDger [59] in the comparison against FLAGON.
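The structure of this benchmark, where each output element depends on its own row and column index (as discussed in Appendix C), can be sketched as a single init over the flattened matrix. The names gerWrk and ger are illustrative, and integer intrinsics stand in for the floating-point operations of the actual benchmark.

-- One element of the rank-1 update x y^T + A on a flattened n x n matrix
let gerWrk: Int -> [Int] -> [Int] -> [Int] -> Int -> Int =
  lam n: Int. lam x: [Int]. lam y: [Int]. lam a: [Int]. lam idx: Int.
  let row: Int = divi idx n in
  let col: Int = modi idx n in
  addi (muli (nth x row) (nth y col)) (nth a idx)

let ger = lam n. lam x. lam y. lam a.
  predictiveInit (muli n n) (gerWrk n x y a)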

B.8 Saxpy

Performs the operation $\vec{s}_i = a \cdot \vec{x}_i + \vec{y}_i$, where the inputs $\vec{x}, \vec{y}$ are vectors and the input a is a scalar. The input sizes listed for Saxpy in Table 5.2 correspond to the number of elements in the input vector $\vec{x}$. This benchmark uses the cuBLAS function cublasDaxpy [59] as well as a custom CUDA kernel implementation and a preexisting example implementation from Nvidia [70] in the comparison against FLAGON. The details of the custom CUDA kernel are shown in Listing B.1.

__global__
void my_saxpy(int n, double a, const double *x,
              const double *y, double *s)
{
    int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (i < n) s[i] = (a * x[i]) + y[i];
}

Listing B.1: The custom Saxpy kernel created to avoid potential overhead introduced by cublasDaxpy.

B.9 Saxpy Single

Same as Saxpy, but with only one vector involved, such that the computation becomes $\vec{s}_i = a \cdot \vec{x}_i + y$, where the input $\vec{x}$ is a vector and the inputs a, y are scalars. The input sizes listed for Saxpy Single in Table 5.2 correspond to the number of elements in the input vector $\vec{x}$. This benchmark uses the cuBLAS function cublasDaxpy [59] as well as a custom CUDA kernel implementation in the comparison against FLAGON. The details of the custom CUDA kernel are shown in Listing B.2.

__global__
void my_saxpysingle(int n, double a, double *x, double y)
{
    int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (i < n) x[i] = (a * x[i]) + y;
}

Listing B.2: The custom Saxpy Single kernel created to avoid potential overhead introduced by cublasDaxpy.

B.10 SMC Airplane

Performs an SMC inference procedure corresponding to the Airplane example described by Lundén, Borgström, and Broman [71]. This benchmark is unique in that it is the only benchmark used in this thesis whose procedure is computed using multiple invocations of data-parallel intrinsics in a single iteration, except for ATAX, whose procedure performs two invocations of data-parallel intrinsics that both perform an operation similar to a matrix-vector multiplication. The input sizes listed for SMC Airplane in Table 5.2 correspond to the number of particles used in the SMC algorithm. This benchmark uses the PPL-CUDA-SMC framework [67] as a baseline in the comparison against FLAGON.

B.11 Vector Addition (VecAdd)

Performs the vector addition $\vec{x} + \vec{y}$ between the two input vectors $\vec{x}, \vec{y}$. The input sizes listed for VecAdd in Table 5.2 correspond to the number of elements in the input vector $\vec{x}$. This benchmark uses the cuBLAS function cublasDaxpy [59] as well as a custom CUDA kernel implementation in the comparison against FLAGON. The details of the custom CUDA kernel are shown in Listing B.3.

B.12 Vector Scaling (VecScal)

Scales an input vector $\vec{x}$ by an input scalar a as $a\vec{x}$. The input sizes listed for VecScal in Table 5.2 correspond to the number of elements in the input vector $\vec{x}$. This benchmark uses the cuBLAS function cublasDscal

__global__
void my_vecadd(int n, const double *x, double *y)
{
    int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (i < n) y[i] = x[i] + y[i];
}

Listing B.3: The custom Vector Addition kernel created to avoid potential overhead introduced by cublasDaxpy.

[59] as well as a custom CUDA kernel implementation in the comparison against FLAGON. The details of the custom CUDA kernel are shown in Listing B.4.

__global__ void my_vecscalar(int n, double a, double *x)
{
    int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    if (i < n) x[i] = a * x[i];
}

Listing B.4: The custom Vector Scalar kernel created to avoid potential overhead introduced by cublasDscal.

Appendix C

Benchmark Results

This appendix contains the complete benchmark results for the cases with Lift and cuBLAS as baseline, complementing the results presented in Section 5.3. This does not include the results from the investigation into the effect that the number of elements per thread has on execution time as that is fully presented in Section 5.3.1.

C.1 Lift Benchmarks

The results from each Lift benchmark are presented in three figures. Two figures compare the execution time measured by the benchmark between the three FLAGON variants as a curve plot and as a bar plot. The third figure only compares the execution time for the two largest input sizes between the Lift implementation and the FLAGON device variant.

C.1.1 ATAX

Looking at the FLAGON results in Figure C.1, it can be observed that the predictive model always chooses a backend that is close to optimal for the used input sizes. The device variant is, as expected, faster than the host variant for larger inputs, and slower than the host variant for smaller inputs. The small bump in average execution time for the device variant between input sizes 32, 64, and 128 can be explained by looking at the output from nvprof. Some CUDA invocations such as cudaMalloc and cudaFree take longer for input size 64 than they do for input sizes 32 and 128, similar to the anomalies encountered in Section 5.3.1. This bump also appears to be some form of anomaly since all GPU activities for


input size 64 are smaller than they are for input size 128.

Figure C.1: Comparison between benchmark results in average execution time from the ATAX benchmark. Only shows results from the MCore implementation. The comparison against the Lift baseline implementation is shown in Figure C.3.

The comparison between Lift and FLAGON in Figure C.3 shows that the Lift implementation performs significantly better than the MCore implementation. Even though the methods of measurement differ, the performance difference is close to a factor of 10 in both cases, hinting that the Lift implementation is more optimized than the MCore implementation. This result is not that surprising considering the number of higher-order functions used to optimize the Lift implementation, whereas the MCore implementation just performs $A^T A \vec{x}$ using a tail-recursion loop for each result in the output vector.


Figure C.2: The FLAGON results from ATAX presented as a bar graph including error bars that show the maximum and minimum measured execution time for each input size.

Figure C.3: Comparison between the Lift implementation and the MCore implementation of the ATAX benchmark. Note that the Lift results are in median execution time while the FLAGON results are in average execution time.

C.1.2 Convolution

Looking at the FLAGON results from the Convolution benchmark in Figure C.4, it can be observed that the device variant is faster than the host variant for all input sizes. This is an unexpected result, as it was assumed that the fixed GPU overhead would be greater than the execution times for small input sizes. This unexpected result is also reflected in the results for the predictive variant, which does not run the GPU backend until input size 256, hinting that the predictive model is either insufficiently tuned or not expressive enough.

Figure C.4: Comparison between benchmark results in average execution time from the Convolution benchmark. Only shows results from the MCore implementation. The comparison against the Lift baseline implementation is shown in Figure C.6.

The comparison between Lift and FLAGON in Figure C.6 shows that the Lift implementation yet again outperforms the MCore implementation. The Lift implementation uses a combination of many higher-order functions to perform the convolution, while MCore as before relies on a tail-recursion loop to perform these operations. Using these higher-order functions, Lift can apply parallelism and other optimizations to the

convolution operation itself. However, since the convolution filter is a fixed 1 × 17 grid, the difference in performance was not expected to be as extensive as what is shown in Figure C.6.

Figure C.5: The FLAGON results from the Convolution benchmark presented as a bar graph including error bars that show the maximum and minimum measured execution time for each input size.

Figure C.6: Comparison between the Lift implementation and the MCore implementation of the Convolution benchmark. Note that the Lift results are in median execution time while the FLAGON results are in average execution time.

C.1.3 Matrix-Vector Multiplication (GEMV)

The FLAGON results from the GEMV benchmark in Figure C.7 show, as expected, that the device variant is slower than the host variant for smaller inputs, but becomes faster than the host variant for larger input sizes. The predictive variant also chooses the fastest backend in most cases, with the wrong backend being selected for a few values around the point where the device variant becomes faster than the host variant. However, two interesting bumps are also visible for the device variant, one at input size 128 and the other at input size 512. The bump at input size 128 is similar to the one observed for ATAX, while the bump at input size 512 is more subtle. At input size 256, the device variant is faster than the host variant, whereas at the larger input size of 512 the device variant measures approximately the same execution time as the host variant. Both of these bumps can be explained by anomalies in the API calls, the same as for ATAX.

Figure C.7: Comparison between benchmark results in average execution time from the GEMV benchmark. Only shows results from the MCore implementation. The comparison against the Lift baseline implementation is shown in Figure C.9.

Looking at the comparison between Lift and FLAGON in Figure C.9,

it can be seen that the Lift implementation outperforms the FLAGON device variant for both input sizes 4096 and 8192. However, the execution times for Lift and FLAGON are much closer for GEMV than in the previous comparisons, especially for input size 8192. This result is surprising due to the expectation that Lift would be able to parallelize the dot products inside the matrix-vector multiplication, while FLAGON is bound to perform them iteratively in a tail-recursion loop.

Figure C.8: The FLAGON results from the GEMV benchmark presented as a bar graph including error bars that show the maximum and minimum measured execution time for each input size.

Figure C.9: Comparison between the Lift implementation and the MCore implementation of the GEMV benchmark. Note that the Lift results are in median execution time while the FLAGON results are in average execution time.

C.1.4 Matrix Multiplication (GEMM)

Looking at the GEMM results in Figure C.10, the device variant is slightly slower than the host variant for the smallest recorded inputs, but already at an input size of 64 the device variant becomes faster than the host variant. At an input size of 4096, the device variant is over 100 times faster than the host variant. The same kind of bump in the device variant measurements that was encountered in previous benchmarks can also be observed in Figure C.10 for an input size of 32. In the same way as in previous benchmarks, the bump can be explained by the CUDA API calls for some unknown reason taking more time when the input size is 32 compared to when the input size is 64. The predictive variant also manages to choose the faster backend for the majority of input sizes, with the exception being at input size 64, where the predictive variant chooses the host backend even though the device backend is faster.

Figure C.10: Comparison between benchmark results in average execution time from the GEMM benchmark. Only shows results from the MCore implementation. The comparison against the Lift baseline implementation is shown in Figure C.12.

From the comparison between Lift and FLAGON in Figure C.12,

it can be seen that the relative difference in execution time between Lift and FLAGON is larger than for previous benchmarks. A possible explanation for this increased difference is that the Lift implementation applies GPU optimization techniques, while the MCore implementation uses a mappable version of the textbook $O(n^3)$ algorithm. As such, this increased difference was not unexpected.

Figure C.11: The FLAGON results from the GEMM benchmark presented as a bar graph including error bars that show the maximum and minimum measured execution time for each input size.

Figure C.12: Comparison between the Lift implementation and the MCore implementation of the GEMM benchmark. Note that the Lift results are in median execution time while the FLAGON results are in average execution time.

C.1.5 Nearest Neighbour (NN)

One observation from the FLAGON comparison in Figure C.13 that has not been made in previous benchmarks is that, while the device variant is slower than the host variant for smaller inputs, the device variant never becomes significantly faster than the host variant for larger inputs. Looking at the bar plot in Figure C.14, it can also be seen that the error margins for the host and device variants overlap for all input sizes from 65536 and up. This was unexpected, as the operation $\sqrt{(x_0 - y_0)^2 + (x_1 - y_1)^2}$ that is performed on each pair of elements should be significant enough to benefit more from the GPU parallelism. This can be compared against the VecScal benchmark results shown in Figure C.26, which performs a smaller operation than NN but has a larger difference between the device and host variants.

Figure C.13: Comparison between benchmark results in average execution time from the NN benchmark. Only shows results from the MCore implementation. The comparison against the Lift baseline implementation is shown in Figure C.15.

Looking at the performance of the predictive variant in Figures C.13 and C.14, it can be seen that the fastest backend is chosen for most input sizes. As the measured execution time is very close between the variants

at the input sizes 16384, 65536, and 262144, it is not obvious which of the backends would have been the right choice for the predictive variant. Looking at the profiler output, it can be seen that the predictive variant chose to execute on host for the input sizes 16384, 65536, and 262144. When comparing performance between Lift and the FLAGON device variant in Figure C.15, it can be seen that Lift is faster than FLAGON by at least a factor of 2 for both input sizes. While the other Lift benchmarks could explain this through implementation differences, the approach used to implement NN is the same for both Lift and FLAGON, where the calculation $\sqrt{(x_0 - y_0)^2 + (x_1 - y_1)^2}$ is mapped onto a sequence of pairs. As such, if the performance difference between OpenCL, which is the backend for Lift, and CUDA were negligible, then the measured execution times for Lift and FLAGON should be approximately equal.

Figure C.14: The FLAGON results from the NN benchmark presented as a bar graph including error bars that show the maximum and minimum measured execution time for each input size.

Figure C.15: Comparison between the Lift implementation and the MCore implementation of the NN benchmark. Note that the Lift results are in median execution time while the FLAGON results are in average execution time.

C.2 cuBLAS Benchmarks

This section contains the results from the cuBLAS benchmarks. The number of baseline implementations used varies for each benchmark. Why some cuBLAS benchmarks use additional baseline implementations is explained in Section 5.1.2.

C.2.1 Matrix Addition (GEAM)

Looking at the GEAM results in Figure C.16, it can be seen that the FLAGON device variant is slower than the host variant for smaller inputs, but does not seem to gain a substantial speedup compared to the host variant as the input size grows.


Figure C.16: Comparison in average execution time from the GEAM benchmark.

From the error bars in Figure C.17, it can be seen that the differences in execution time for the host and device variants are within each other's uncertainty for all input sizes above 128. Looking at the nvprof output for the device variant, it can be seen for input sizes larger than 128 that over 90% of the execution time is spent copying memory between device

and host. Besides the performance difference, a bump in execution time for the device variant can be observed between input sizes 32 and 64, similar to those found in previous benchmarks. This bump can also be explained by anomalies in the execution time taken by the CUDA API calls. The predictive variant appears to choose the optimal backend for all input sizes, judging by the results in Figure C.16. However, considering the nearly negligible performance difference between the device and host variants, this result is not as indicative of how well the predictive model performs as it is for other benchmarks. Comparing against the baseline implementation, it can be seen that the FLAGON device variant is faster than the cuBLAS baseline implementation for all input sizes apart from the small bump at input size 32. This hints that the baseline implementation might not be implemented as optimally as possible. In this case, it would thus have been useful to also include a custom CUDA kernel as an extra baseline.

Figure C.17: Comparison in average execution time from the GEAM benchmark presented as a bar graph. The error bars show the measured maximum and minimum execution time for each input size.

C.2.2 Rank-1 Update (GER)

Looking at the results in Figure C.18, it can be seen that the FLAGON device variant performs better than the host variant for larger input sizes, by close to a factor of 10. This is slightly unexpected considering that the operation on each individual element is very small, where previous benchmarks that performed small operations on each individual element showed much closer performance between the host and device variants. The device variant also shows bumps in execution time at input sizes 32 and 128, which, when looking at the output from nvprof, can be explained by the CUDA API calls taking longer than for input sizes 64 and 256. The kernel execution time for the device variant is strictly increasing as the input size gets larger.

Figure C.18: Comparison in average execution time from the GER benchmark.

The performance of the FLAGON predictive variant can be seen in Figure C.18 to not be close to optimal until the input size reaches 1024, where it starts to execute the GER procedure on device instead of on host. As the measured difference in performance between the host and device variants was unexpected, this hints that there is something wrong in the predictive model, either in the tuned cost profiles or in the

model itself. A key distinction between GER and other benchmarks with simple computations on each element, such as NN, is that each value used in the computation on an element in GER depends on the index of the element, whereas NN instead uses the same $(y_0, y_1)$ values for all elements. Comparing the performance of the device variant to the cuBLAS baseline implementation, it can be seen that the device variant performs similarly to the cuBLAS implementation up until input size 1024, after which the cuBLAS implementation becomes slower and deviates from the FLAGON device curve. Looking at the nvprof output for cuBLAS, it can be seen that, while the memory copy time usually increases by about a factor of 4 as expected when the input size doubles, between input sizes 1024 and 2048 the memory copy time instead increases by a factor of 16. This only affects the memory copy time, as the kernel execution time consistently increases by about a factor of 4 between all input sizes. This anomaly could have been investigated further using a custom CUDA kernel, but is not of much interest to this thesis and is therefore left unanswered here.

Figure C.19: Comparison in average execution time from the GER benchmark presented as a bar graph. The error bars show the measured maximum and minimum execution time for each input size.

C.2.3 Saxpy

Comparing the performance between the FLAGON host and device variants in Figure C.20, it can be seen that while the device variant is slower than the host variant for smaller input sizes, it does not significantly outperform the host variant for larger input sizes, where the speedup is never greater than 25%. This form of low speedup is also observed for other benchmarks with simple computations on each element, with the exception of the GER benchmark. The FLAGON predictive variant can also be seen in Figure C.20 to choose the faster backend for small input sizes, not opting to run on device until an input size of 1000000. Although the FLAGON device variant runs faster than the host variant for some inputs smaller than 1000000, as can be seen by the error bars in Figure C.20, the possible performance gains would be within the uncertainty of the measurements.


Figure C.20: Comparison in average execution time from the Saxpy benchmark.

Looking at the baseline measurements for Saxpy in Figures C.20 and C.21, it can be seen that the FLAGON variants, including the host variant, outperform the baseline implementations for larger values. From the

nvprof outputs, it can be seen that the kernel execution time for the baseline implementations is approximately equal to the kernel execution time for the FLAGON device variant. The difference is in the memory copy time, which at input size 50000000 takes between 200 and 250 ms for the baseline implementations and around 130 ms for the FLAGON device variant. The reason why this difference in execution time exists is not apparent from the results, profiler outputs, or the implementation details, as memory is copied to and from device in the same way for all implementations.

Figure C.21: Comparison in average execution time from the Saxpy benchmark presented as a bar graph. The error bars show the measured maximum and minimum execution time for each input size.

C.2.4 Saxpy Single

As can be seen in Figure C.22, the FLAGON device variant is slower than the host variant for smaller inputs and becomes faster than the host variant as the input size becomes larger. However, unlike the previous Saxpy benchmark, where the difference in average execution time between the FLAGON host and device variants was less than 25% for the largest input size, the difference in average execution time between the FLAGON host and device variants is now closer to 50% for the largest input size, with the device variant measuring an average execution time of 130 ms and the host variant measuring an average execution time of 213 ms. This is also not an unexpected result given that the Saxpy Single procedure only has to copy a single sequence from host to device, whereas the Saxpy procedure must copy two sequences from host to device. This explanation is further supported by comparing the elapsed time of the cudaMemcpy API call from the nvprof output for Saxpy and Saxpy Single, where the total elapsed time of cudaMemcpy is 2.4 seconds for Saxpy Single and 3.4 seconds for Saxpy.

Figure C.22: Comparison in average execution time from the Saxpy Single benchmark.

Looking at the performance of the FLAGON predictive variant in

Figure C.22, it can be seen that the optimal backend is chosen for the three smallest input sizes and the three largest input sizes. Although the device variant is faster than the host variant from input size 16384 and onwards, the predictive model decides not to run the Saxpy Single procedure on device until input size 1000000.

Figure C.23: Comparison in average execution time from the Saxpy Single benchmark presented as a bar graph. The error bars show the measured maximum and minimum execution time for each input size.

Comparing the performance between the FLAGON device variant and the baseline implementations using the nvprof output again shows very similar kernel execution times, with the main difference being either time spent in CUDA API calls or copying memory. For input size 262144, where the custom CUDA kernel measures the fastest average execution time, the nvprof output shows a maximum difference of 5% in average kernel execution time compared to the FLAGON device variant. The FLAGON device variant is instead shown to spend more time in the cudaMemcpy API call for input size 262144, which could explain the performance difference in Figure C.23. Looking at the nvprof output for input size 50000000, this scenario is reversed, with the custom CUDA kernel and the FLAGON device variant measuring average kernel execution times within 5% of each other, but with the custom CUDA kernel implementation spending more than twice as long in the cudaMemcpy API call.

C.2.5 Vector Addition (VecAdd)

The results for VecAdd in Figure C.24 show a comparison between the implementations similar to the one observed for Saxpy in Figure C.20. For the smaller inputs, the FLAGON device variant is slower than the host variant, and it then shows a slight advantage over the host variant for the largest inputs. As has been mentioned in the discussion of the previous benchmarks, how the device variant compares against the host variant appears to be based on how much memory is being transferred to and from the device. Comparing the measurements for the FLAGON device variant in both Saxpy and VecAdd at input size 50000000, where both benchmarks perform the same memory copy, the average execution time for VecAdd is measured at 199 ms whereas Saxpy measures an average execution time of 200 ms.


Figure C.24: Comparison in average execution time from the VecAdd benchmark.

The behavior of the FLAGON predictive variant is also very similar

to what was observed in the Saxpy benchmark. Looking at the VecAdd results in Figure C.24 shows that the predictive variant runs on host for the smaller input sizes and on device after the input size reaches 1000000. While the device variant is faster than the host variant for some input sizes smaller than 1000000, the error bars in Figure C.25 show that the measurements between input sizes 16384 and 500000 for the device and the predictive variants are within each other's range of uncertainty.

Figure C.25: Comparison in average execution time from the VecAdd benchmark presented as a bar graph. The error bars show the measured maximum and minimum execution time for each input size.

The results for the baseline implementations in Figure C.24 show a similar trend to that for Saxpy, where the custom CUDA kernel is faster than the FLAGON device variant for smaller input sizes but slower than the FLAGON device variant for the largest inputs. Looking at the nvprof output, this can also be attributed to the custom CUDA kernel implementation spending more time in the cudaMemcpy API calls for larger inputs. As the issue of the CUDA baseline implementation taking longer to copy memory for larger input sizes is also observed in the other benchmarks, this does not appear to be an isolated incident related to a single execution. As the device code generated by the FLAGON compiler

is, just like the baseline implementations, CUDA C++ code that is then compiled with nvcc, it is possible that this issue is due to how the CUDA code is linked with ocamlopt or the compiler flags used when compiling the baseline implementations.

C.2.6 Vector Scaling (VecScal)

The results from the VecScal benchmark can be seen in Figure C.26, where the measurements for the FLAGON device variant show a behavior very similar to the one found for Saxpy Single in Figure C.22, in that the relative difference between the host and device variants for an input size of 50000000 is approximately 50%. A possible explanation of this similarity is that the same amount of memory is copied to and from device for Saxpy Single and VecScal at the same input size.


Figure C.26: Comparison in average execution time from the VecScal benchmark.

Looking at the measurements from the baseline implementations in Figure C.26, the behavior here is also very similar to the one observed for Saxpy Single in Figure C.22, where the custom CUDA kernel is faster than the FLAGON device variant for input sizes up to 500000. The cuBLAS implementation is also consistently slower than the custom CUDA kernel,

except for input size 10000000, where the cuBLAS implementation has an average execution time of 52.1 ms compared to 52.6 ms for the custom CUDA kernel implementation. However, as can be seen in Figure C.27, for input sizes lower than 500000 the custom CUDA kernel is faster than the cuBLAS implementation by a proportionally larger margin than the cuBLAS advantage at input size 10000000.

Figure C.27: Comparison in average execution time from the VecScal benchmark presented as a bar graph. The error bars show the measured maximum and minimum execution time for each input size.

TRITA-EECS-EX-2020:731

www.kth.se