This thesis has been submitted in fulfilment of the requirements for a postgraduate degree (e.g. PhD, MPhil, DClinPsychol) at the University of Edinburgh. Please note the following terms and conditions of use: This work is protected by copyright and other intellectual property rights, which are retained by the thesis author, unless otherwise stated. A copy can be downloaded for personal non-commercial research or study, without prior permission or charge. This thesis cannot be reproduced or quoted extensively from without first obtaining permission in writing from the author. The content must not be changed in any way or sold commercially in any format or medium without the formal permission of the author. When referring to this work, full bibliographic details including the author, title, awarding institution and date of the thesis must be given.

Accelerating Interpreted Programming Languages on GPUs with Just-In-Time Compilation and Runtime Optimisations

Juan José Fumero Alfonso

Doctor of Philosophy
Institute of Computing Systems Architecture
School of Informatics
University of Edinburgh
2017

Abstract

Nowadays, most computer systems are equipped with powerful parallel devices such as Graphics Processing Units (GPUs). They are present in almost every computer system, including mobile devices, tablets, desktop computers and servers. These parallel systems have made it possible for many scientists and companies to process large amounts of data in less time. But using these parallel systems is very challenging due to their programming complexity. The most common programming languages for GPUs, such as OpenCL and CUDA, are designed for expert programmers: developers are required to know hardware details in order to use GPUs. However, many users of heterogeneous and parallel hardware, such as economists, biologists, physicists or psychologists, are not necessarily expert GPU programmers.
They need to speed up their applications, which are often written in high-level and dynamic programming languages such as Java, R or Python. Little work has been done to generate GPU code automatically from these high-level interpreted and dynamic programming languages. This thesis presents a combination of a programming interface and a set of compiler techniques which enable the automatic translation of a subset of Java and R programs into OpenCL for execution on a GPU. The goal is to reduce the programmability and usability gaps between interpreted programming languages and GPUs.

The first contribution is an Application Programming Interface (API) for programming heterogeneous and multi-core systems. This API combines ideas from functional programming and algorithmic skeletons to compose and reuse parallel operations.

The second contribution is a new OpenCL Just-In-Time (JIT) compiler that automatically translates a subset of Java bytecode to GPU code. This is combined with a new runtime system that optimises data management and avoids data transformations between Java and OpenCL. This OpenCL framework and runtime system achieve speedups of up to 645x compared to Java, within a 23% slowdown compared to handwritten native OpenCL code.

The third contribution is a new OpenCL JIT compiler for dynamic and interpreted programming languages. While the R language is used in this thesis, the techniques developed are generic to dynamic languages. This JIT compiler uniquely combines a set of existing compiler techniques, such as specialisation and partial evaluation, for OpenCL compilation, together with an optimising runtime that compiles and executes R code on GPUs. This JIT compiler for the R language achieves speedups of up to 1300x compared to GNU-R, with a 1.8x slowdown compared to native OpenCL.

Lay Summary of Thesis

Nowadays, most computer systems are equipped with powerful parallel devices such as Graphics Processing Units (GPUs).
They are present in almost every computer system, including mobile devices, tablets, desktop computers and servers. These parallel systems have made it possible for many scientists and companies to process large amounts of data in less time. But using these parallel systems is very challenging due to their programming complexity. The most common programming languages for GPUs, such as OpenCL and CUDA, are designed for expert programmers: developers are required to know hardware details in order to use GPUs. However, many users of heterogeneous and parallel hardware, such as economists, biologists, physicists or psychologists, are not necessarily expert GPU programmers. They need to speed up their applications, which are often written in high-level and dynamic programming languages such as Java, R or Python. Little work has been done to generate GPU code automatically from these high-level interpreted and dynamic programming languages.

This thesis presents a combination of a novel programming interface and a set of compiler techniques which enable the automatic translation of a subset of Java and R programs into GPU code. At the same time, it offers performance comparable to native implementations.

Acknowledgements

Many people have contributed in one way or another during my PhD. Firstly, I would like to thank my supervisors, Christophe Dubach and Michel Steuwer, for their great supervision and help during my PhD. Secondly, I would like to thank the Oracle Labs compiler group in Linz, Austria, for their constant technical support; in particular Thomas Würthinger, Lukas Stadler, Roland Schatz, Andreas Wöß, Gilles Duboscq, Christian Wirth, Stefan Marr, Josef Eisl, Manuel Rigger, David Leopoldseder, Matthias Grimmer and the Italian guys, Daniele Bonetta and Danilo Ansaloni. It was great to work with them during my internship, where one of the main ideas of this thesis was born.
I would like to thank Sebastian Ertel and Andres Goens from the University of Dresden for their feedback and for testing one of the contributions presented in this thesis. I want to thank Murray Cole and Michael O'Boyle for the great feedback and discussions during the review panel meetings, as well as my colleagues Thibaut Lutz, Erik Tomusk and Yuan Wen, and my friends Rosa Filgueira, Michael Rovatsos, José Cano, Alberto Magni, Chris Margiolas, Chris Vasiladiotis and Dimitrios Papadopoulos. I am also grateful to have discussed much of the material presented in this thesis with David Kaeli and Christos Kotselidis.

I would also like to thank my parents, Eva and Armando, and my sisters, Carmen and Anabel, my source of inspiration. I do not have enough words to thank Ruymán and Yurena for their constant presence and for hosting me in Edinburgh like their own child. Finally, I am grateful to Anna for her unconditional support and understanding.

Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified. Some of the material used in this thesis has been published in the following papers:

• Juan J. Fumero, Michel Steuwer, Christophe Dubach. A Composable Array Function Interface for Heterogeneous Computing in Java. In Proceedings of the ACM SIGPLAN International Workshop on Libraries, Languages, and Compilers for Array Programming, June 2014.

• Juan J. Fumero, Toomas Remmelg, Michel Steuwer, Christophe Dubach. Runtime Code Generation and Data Management for Heterogeneous Computing in Java. In Proceedings of the ACM SIGPLAN International Conference on Principles and Practices of Programming on the Java Platform, September 2015.

• Juan Fumero, Michel Steuwer, Lukas Stadler, Christophe Dubach. OpenCL JIT Compilation for Dynamic Programming Languages.
In the MoreVMs Workshop on Modern Language Runtimes, Ecosystems, and VMs, April 2017.

• Juan Fumero, Michel Steuwer, Lukas Stadler, Christophe Dubach. Just-In-Time GPU Compilation for Interpreted Languages with Partial Evaluation. In Proceedings of the ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE), April 2017.

(Juan José Fumero Alfonso)

Contents

1 Introduction 3
  1.1 The Problem 4
  1.2 The Solution 5
  1.3 Contributions 6
  1.4 Thesis Outline 8

I STATE OF THE ART 11

2 Background 13
  2.1 Heterogeneous Architectures 13
  2.2 OpenCL 16
    2.2.1 Platform Model 16
    2.2.2 Execution Model 17
    2.2.3 Programming Model 19
    2.2.4 Memory Model 20
  2.3 Structured Parallel Patterns 21
  2.4 Managed Languages and Runtimes 23
    2.4.1 Java 23
    2.4.2 Java Virtual Machines 24
    2.4.3 Graal 27
    2.4.4 Truffle 30
  2.5 Summary 33

3 Related Work 35
  3.1 Languages for Heterogeneous Programming 35
    3.1.1 Directive-based Heterogeneous Programming 36
    3.1.2 Explicit Parallel Programming Languages 39
    3.1.3 Intermediate Representations for Heterogeneous Computing 41
  3.2 Algorithmic Skeletons 43
  3.3 GPGPU libraries 45
  3.4 Java-Specific GPU JIT Compilers 46
    3.4.1 Sumatra 46
    3.4.2 Aparapi 47
    3.4.3 RootBeer 49
    3.4.4 JaBEE 50
    3.4.5 JaCC 51
    3.4.6 IBM J9 GPU 52
    3.4.7 Summary 53
  3.5 Language Implementations for Truffle 54
  3.6 GPU Programming for Dynamic Programming Languages 55
    3.6.1 Library-based R GPU Support 55
    3.6.2 Parallelism Exploitation in R 56
    3.6.3 GPU Acceleration for other JIT Programming Languages 57
  3.7 Summary 58

II CONTRIBUTIONS 59

4 A Composable and Reusable Array API for Heterogeneous Computing 61
  4.1 Motivation 62
  4.2 JPAI: Array Programming in Java 64
    4.2.1 Example: Dot product 65
    4.2.2 JPAI Design 67
    4.2.3 Map Operations 69
    4.2.4 Reduce Operations 72
    4.2.5 Copy Operations 73
    4.2.6 Identity Operation 74
  4.3 Array Function Composition 74
  4.4 Array Function Execution