Effective Extensible Programming: Unleashing Julia on GPUs
Tim Besard, Christophe Foket and Bjorn De Sutter, Member, IEEE

T. Besard, C. Foket and B. De Sutter are with the Department of Electronics and Information Systems, Ghent University, Belgium. {tim.besard,christophe.foket}@ugent.be. Corresponding author: [email protected].

Abstract—GPUs and other accelerators are popular devices for accelerating compute-intensive, parallelizable applications. However, programming these devices is a difficult task. Writing efficient device code is challenging, and is typically done in a low-level programming language. High-level languages are rarely supported, or do not integrate with the rest of the high-level language ecosystem. To overcome this, we propose compiler infrastructure to efficiently add support for new hardware or environments to an existing programming language. We evaluate our approach by adding support for NVIDIA GPUs to the Julia programming language. By integrating with the existing compiler, we significantly lower the cost to implement and maintain the new compiler, and facilitate reuse of existing application code. Moreover, use of the high-level Julia programming language enables new and dynamic approaches for GPU programming. This greatly improves programmer productivity, while maintaining application performance similar to that of the official NVIDIA CUDA toolkit.

Index Terms—Graphics processors, very high-level languages, code generation

1 INTRODUCTION

To satisfy ever higher computational demands, hardware vendors and software developers look at accelerators, specialized processors that are optimized for specific, typically parallel workloads, and perform much better at them than general-purpose processors [1], [2], [3], [4], [5]. Multiple hardware vendors are working on such accelerators and release many new devices every year. These rapid developments make it difficult for developers to keep up and gain sufficient experience programming the devices. This is exacerbated by the fact that many vendors only provide low-level toolchains, such as CUDA or OpenCL, which offer full control to reach peak performance at the cost of developer productivity [6].

To improve developer productivity, programmers commonly use high-level programming languages. However, these languages often rely on techniques and functionality that are hard to implement or even incompatible with execution on typical accelerators, such as interpretation, tracing Just-in-Time (JIT) compilation, or reliance on a managed runtime library.

To remedy this, implementations of high-level languages for accelerators generally target a derived version of the language, such as a restricted subset or an embedded Domain Specific Language (DSL), in which incompatible features have been redefined or adjusted.

Modern extensible languages offer the means to realize such programming language derivatives [7]. For example, Lisp-like languages feature powerful macros for processing syntax, Python's decorators make it possible to change the behavior of functions and methods, and the Julia programming language supports introspection of each of its Intermediate Representations (IRs). However, these facilities do not encourage reuse of existing compiler functionality. Most derived languages use a custom compiler, which simplifies the implementation but hinders long-term maintainability when the host language gains features or changes semantics. It also forces users to learn and deal with the inevitable divergence between individual language implementations.
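For instance, using only Julia's standard reflection macros (an illustrative snippet, not a listing from this paper), one can display each IR of a single method:

    using InteractiveUtils  # exports the @code_* reflection macros (Julia 1.x)

    f(x) = 2x + 1

    @code_lowered f(1)   # lowered (desugared, untyped) IR
    @code_typed f(1)     # type-inferred Julia IR
    @code_llvm f(1)      # LLVM IR for this specialization
    @code_native f(1)    # generated machine code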
This paper presents a vision in which the high-level language compiler exposes interfaces to alter the compilation process (Section 2). Implementations of the language for other platforms can use these interfaces together with other extensible programming patterns to ensure that source code is compiled to compatible and efficient accelerator machine code. To demonstrate the power of this approach, we have added such interfaces to the reference compiler of the Julia language (Section 4), and used it to add support for NVIDIA GPUs (Section 5). We show that the resulting toolchain makes it possible to write generic and high-level GPU code, while performing similarly to low-level CUDA C (Section 6). All code implementing this framework is available as open-source software on GitHub, and can be easily installed using the Julia package manager. Our contributions are as follows:

• We introduce interfaces for altering the compilation process, and implement them in the Julia compiler.
• We present an implementation of the Julia language for NVIDIA GPUs, using the introduced interfaces.
• We analyze the performance of CUDA benchmarks from the Rodinia benchmark suite, ported to Julia. We show that code generated by our GPU compiler performs similarly to CUDA C code compiled with NVIDIA's reference compiler.
• We demonstrate high-level programming with this toolchain, and show that Julia GPU code can be highly generic and flexible, without sacrificing performance (see the sketch below).
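As a concrete sketch of such high-level GPU code, the following vector addition is written against a CUDAnative.jl-style interface; the package names (CUDAdrv, CUDAnative), the CUDA-style indexing intrinsics, and the tuple-based @cuda launch syntax reflect the 2017-era API and are assumptions here, not a listing from the paper:

    using CUDAdrv, CUDAnative

    # Device code: one thread per element, with CUDA-style intrinsics.
    # Note the 1-based indexing, as usual in Julia.
    function vadd(a, b, c)
        i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
        if i <= length(c)
            @inbounds c[i] = a[i] + b[i]
        end
        return nothing  # kernels do not return a value
    end

    # Host code: upload inputs, launch the kernel, download the result.
    len = 1024
    a = rand(Float32, len); b = rand(Float32, len)
    d_a = CuArray(a); d_b = CuArray(b)
    d_c = CuArray(zeros(Float32, len))   # result buffer on the device
    @cuda (4, 256) vadd(d_a, d_b, d_c)   # 4 blocks of 256 threads
    @assert Array(d_c) ≈ a .+ b

Because vadd is an ordinary generic Julia function, the same definition specializes automatically for other element types, which is the flexibility the last contribution refers to.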
[Figure 1: Abstract overview of the proposed toolchain.]
[Figure 2: CPython and Numba compilation processes for host and device.]

2 VISION

Our proposed solution to the difficulty in integrating high-level languages and accelerators is a set of interfaces to the high-level language's general-purpose compiler that provide fine-grained access to the different IRs and to the processes that generate and optimize those IRs. With these interfaces, developers can influence the existing language implementation and, e.g., improve compatibility with new hardware or run-time environments without the need for a custom compiler or an embedded language subset.

Figure 1 shows an overview of the proposed toolchain. An external device package uses the introduced interfaces to add support for new hardware, without modifying the existing language implementation. For example, it could refuse to generate code for certain language features, such as exceptions or dynamic memory allocations, or replace their code with compatible or optimized alternatives.

Such a setup has multiple advantages. For one, it keeps the existing language implementation stable, while new implementations can be developed independently as external packages. This makes it easier to experiment, as these packages do not need to meet the support, quality, or licensing requirements of the existing implementation. It also makes it easier to cope with the rapid development pace of accelerator hardware, providing the means for vendors to contribute more effectively to the language ecosystem.

Another important advantage is the ability to reuse the existing language implementation, whereas current implementations of high-level languages for accelerators often reimplement large parts of the compiler. For example, Numba is a JIT compiler for Python, building on the CPython reference language implementation. As Figure 2 shows, the Numba compiler takes Python bytecode and

Our proposed compiler interfaces make it possible to share functionality between an existing language implementation and external derivatives, avoiding needless reimplementation of compiler functionality by reconfiguring the existing compiler to generate code that is compatible with the platform at hand. This not only facilitates external language implementations, but also improves compatibility with existing code as it avoids the inevitable differences between individual compiler implementations.

In some cases, even more reuse of existing infrastructure than suggested in Figure 1 is possible. When the back-end compiler used in the general-purpose tool flow can also target accelerators, there is no need to reimplement a device back end in the device package. Instead, that existing back-end compiler can then be used for host and device code. Even if this is not the case, it might not be necessary to reimplement a full device back end in the device package: if third-party device code generators can be reused, the device back end only has to translate the low-level IR code to an IR accepted by that third-party code generator.

Conceptually, the compiler interfaces shown in Figure 1 are generally applicable. Their actual instantiation, however, will be specific to the host language and accelerator at hand. We expect further research into such interfaces to generalize the design and improve reusability across languages and accelerators. For now, we will design the interfaces around a single language and accelerator platform.

For this work, we chose to target Graphics Processing Units (GPUs), massively parallel accelerators with a distinctive enough architecture to make optimization worthwhile, yet broadly usable for many kinds of applications (as explained in Section 3.1). Specifically, we focus on NVIDIA GPUs with the Compute