Modular Array-Based GPU Computing in a Dynamically-Typed Language

Matthias Springer Peter Wauligmann Hidehiko Masuhara Department of Mathematical and Computing Sciences, Tokyo Institute of Technology, Japan [email protected] [email protected] [email protected]

Abstract

Nowadays, GPU accelerators are widely used in areas with large data-parallel computations such as scientific computations or neural networks. Programmers can either write low-level CUDA/OpenCL code or use a GPU extension for a high-level programming language for better productivity. Most extensions focus on statically-typed languages, but many programmers prefer dynamically-typed languages due to their simplicity and flexibility.

This paper shows how programmers can write high-level modular code in Ikra, a Ruby extension for array-based GPU computing. Programmers can compose GPU programs of multiple reusable parallel sections, which are subsequently fused into a small number of GPU kernels. We propose a seamless syntax for separating code regions that extensively use dynamic language features from those that are compiled for efficient execution. Moreover, we propose symbolic execution and a program analysis for kernel fusion to achieve performance that is close to hand-written CUDA code.

CCS Concepts  • Software and its engineering → Source code generation; • Computing methodologies → Parallel programming languages; • Theory of computation → Type theory

Keywords  GPGPU, CUDA, Ruby, kernel fusion

Figure 1: High-level Overview of Compilation Process

1. Introduction

In recent years, one area of research in GPU computing focuses on high-level languages, steadily narrowing the performance gap between highly optimized low-level programs and high-level programs. A variety of tools emerged that let programmers write parallel programs for execution on GPUs in a high-level language. Most tools are extensions or libraries for existing high-level languages [2, 9, 19]. Their goal is not to reach peak performance. With sufficient expert knowledge about CUDA/OpenCL and the underlying hardware platform, it is possible to write highly optimized low-level programs that perform better. However, writing code in a high-level language is easier and more productive [15].

Ikra¹ is a language extension for data-parallel and scientific computations in Ruby on Nvidia GPUs. It uses arrays as an abstraction for expressing parallelism. Ikra provides parallel versions of map, reduce and a construct for stencil computations. When using Ikra, we encourage a dynamic programming style that is governed by the following two concepts.

Integration of Dynamic Language Features: Code in parallel sections is limited to a restricted set of types and operations (dynamic typing and object-oriented programming are allowed). All Ruby features (incl. metaprogramming) may still be used in other parts of a program. Therefore, programmers can still use external libraries (e.g., I/O or GUI libraries).

Modularity [11]: While optimized low-level programs typically consist of a small number of kernels performing a variety of operations, Ikra allows programmers to compose a program from multiple reusable, smaller kernels.

Due to dynamic language features, whole-program static (ahead-of-time) analysis is difficult. Therefore, Ikra generates CUDA programs at runtime (just in time) when type information is known. Moreover, Ikra optimizes GPU programs using two techniques that are well-known in statically-typed languages but not in dynamically-typed languages such as Ruby. First, it fuses multiple kernels [16, 3, 12] into a single kernel. Such code can be faster because data can remain in registers and does not have to be transferred from/into slow global memory. Second, loops surrounding parallel code are compiled to C++ code and not executed in the Ruby interpreter. Microbenchmarks show that both techniques together achieve performance that is comparable to a single hand-written kernel.

¹ https://prg-titech.github.io/ikra-ruby/

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyright is held by the owner/author(s). Publication rights licensed to ACM. ARRAY'17, June 18, 2017, Barcelona, Spain. ACM 978-1-4503-5069-3/17/06. http://dx.doi.org/10.1145/3091966.3091974

Compilation Process: Ikra is a RubyGem (Ruby library) that provides parallel versions of commonly used array operations. The notation and API for these operations is similar to Ruby's counterparts, but method names are prefixed with p for parallel.

Figure 1 gives a high-level overview of Ikra's compilation process. When one of Ikra's parallel operations is invoked in the Ruby interpreter, Ikra executes that operation symbolically.
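As a semantic reference, the p-prefixed API can be mimicked in plain Ruby. The following is a sequential stand-in (hypothetical module name), not Ikra's implementation; real Ikra executes the call symbolically and generates CUDA code instead:

```ruby
# A sequential stand-in for Ikra's ParallelOperations mixin (hypothetical
# module name; real Ikra runs the block on the GPU).
module ParallelOperationsStub
  def pmap(&block)
    map(&block)   # sequential semantics of the parallel map
  end
end

class Array
  include ParallelOperationsStub
end

[1, 2, 3].pmap { |x| x * 10 }  # => [10, 20, 30]
```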

The result is an array command object. Such an object contains all information required for CUDA code generation and execution. An array command can be used like a normal Ruby array. However, only when its contents are accessed for the first time, Ikra generates CUDA/C++ source code, compiles it using the Nvidia compiler (nvcc) and runs the generated C++ program. The generated program copies data to the GPU, launches the parallel sections, copies the result back to the host memory and returns the result².

Instead of defining single parallel sections, programmers can also define host sections. A host section is a block of Ruby code that contains a more complex program with multiple parallel sections. In such a case, the entire block is translated to C++ code, avoiding switching between the Ruby interpreter and external C++ programs multiple times. The former case can be seen as a host section which directly returns the result of a single parallel section. Thus, we only mention the general case "Run Host Section" in Figure 1.

Symbolic Execution: During symbolic execution, Ikra retrieves the source code of parallel sections (e.g., the body of a pmap operation), generates abstract syntax trees and infers types. The result of symbolic execution is an array command. Ikra currently supports various primitive types (Fixnum, Float, booleans, NilClass), user-defined classes and polymorphic types. A polymorphic type is represented by a pair of type/class ID and the actual value (union type). Programmers can invoke other methods inside parallel sections, including method calls on objects. After Ikra has determined the receiver type(s) of a method call during type inference, the target method(s) are added to a work list of methods to be processed next.

Array Commands: A command object in the Command Design Pattern [5] is an object that contains all information that is necessary to perform an action at a later point of time. An array command in Ikra is an object that contains all information required for code generation and for launching a parallel section/host section. For example, a stencil array command contains the typed AST of the computation (block), a reference to the input array command, an array of neighborhood indices, and an out-of-bounds value.

An array command can be seen as a special Ikra array. It has all methods that an ordinary Ruby array has, but its content is computed once it is accessed. Figure 2 gives an overview of Ikra's integration in Ruby and the design of array commands. There are a variety of subclasses of ArrayCommand corresponding to the parallel operations that are supported in Ikra (Section 2). The standard Ruby Array and Ikra's ArrayCommand both include the mixins Enumerable and ParallelOperations. The first mixin provides standard collection API functionality. The second mixin provides parallel operations which are executed on the GPU.

Figure 2: Integration of Ikra in Ruby (class diagram: ArrayCommand, with fields input : ArrayCommand[] and result : Object[], has subclasses ArrayIdentityCommand, ArrayIndexCommand, ArrayCombineCommand, ArrayStencilCommand, ArrayZipCommand and ArrayReduceCommand; the mixin ParallelOperations provides pcombine, pmap, pstencil, preduce and pzip; Array additionally provides to_command and pnew)

2. Parallel Operations

This section gives an overview of the operations that are provided by Ikra. All operations can handle multidimensional Ikra arrays, making code more readable if data is inherently multidimensional (e.g., images), but we use only one dimension for most operations in this section for presentation reasons. If an operation performs a computation, then the size of the first argument determines the number of CUDA threads that are allocated.

Array Identity: This operation creates an Ikra array (command) from an ordinary Ruby array A (denoted by id(A)). It can be used to load an external Ruby array A (not computed on the GPU) and make it available in Ikra. Array identity is applied implicitly where required. For example, when a map operation is applied to a Ruby array, Ikra applies this operation automatically. However, it is useful if programmers want to convert a one-dimensional Ruby array to a multidimensional Ikra array. It is exposed to Ruby programmers as to_command, taking an optional parameter for dimensions.

A.to_command()
A.to_command(dimensions: [15, 20])

Ikra arrays can be converted back to Ruby arrays with to_a, which is recommended for performance reasons if large parts of the Ikra array are read randomly.

Combine: This operation is used to map over one or more arrays Ai of same size m and dimensions. It takes as input n arrays and a block (anonymous function) f taking n scalar values. It applies f to every element of the input and retains the original shape of the input, regardless of dimensions.

combine(A1, ..., An, f) = [f(A1[0], ..., An[0]), ..., f(A1[m − 1], ..., An[m − 1])]

Ikra allocates m CUDA threads, i.e., every thread processes one tuple. This will likely change in future versions of Ikra. This operation is exposed to Ruby programmers as pcombine:

A1.pcombine(A2, ..., An, &f)

Map: This operation is a special case of combine with one input array. It corresponds to an ordinary map operation but is executed in parallel.

map(A1, f) = combine(A1, f) = [f(A1[0]), ..., f(A1[m − 1])]

This operation is exposed to Ruby programmers as pmap:

A1.pmap(&f)

Index: This operation generates an array of size m of consecutive indices starting from 0 and ending with m − 1.

index(m) = [0, 1, 2, ..., m − 1]

In a multidimensional case, index takes d arguments mi (d is the number of dimensions), where mi is the size of the i-th dimension. Every value in the resulting array is then an array of size d containing the indices for every dimension.
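The deferred-execution behavior of array commands can be sketched in plain Ruby. The following is a simplified, hypothetical MapCommand (no GPU involved) standing in for Ikra's command classes:

```ruby
# Sketch only (plain Ruby, hypothetical class): a command records its input
# and block, and computes and caches its result on first access.
class MapCommand
  def initialize(input, &block)
    @input = input     # dependent command, or a plain Ruby array
    @block = block
    @result = nil      # cached result, filled on first access
  end

  def pmap(&block)
    MapCommand.new(self, &block)   # grows the command tree; no execution
  end

  def to_a
    @result ||= @input.to_a.map(&@block)
  end
end

cmd = MapCommand.new([1, 2, 3]) { |x| x + 1 }.pmap { |x| x * 2 }
cmd.to_a  # => [4, 6, 8]; nothing is computed before this call
```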

index(m1, m2) = [[0, 0], ..., [0, m1 − 1], ..., [m2 − 1, 0], ..., [m2 − 1, m1 − 1]]

² If object-oriented programming is used inside a kernel, data is first converted to a memory coalescing-friendly structure-of-arrays layout [14].
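A sequential sketch of the two-dimensional index operation (hypothetical helper name; Ikra computes this on the GPU):

```ruby
# A flat array whose elements are per-dimension index pairs, matching the
# definition of index(m1, m2) above.
def index_op(m1, m2)
  (0...m2).flat_map do |j|
    (0...m1).map { |i| [j, i] }
  end
end

index_op(3, 2)  # => [[0, 0], [0, 1], [0, 2], [1, 0], [1, 1], [1, 2]]
```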

This operation is not directly exposed to programmers. Similar to Ruby notation, programmers must invoke the method with_index after a parallel operation (the parameter m is provided implicitly) or use Array.pnew (see below).

A1.pmap.with_index(&f)

Reduce: This operation is exposed to Ruby programmers as preduce, which accepts either a block or a symbol as a shortcut:

# Passing a block (function)
A.preduce(&f)
# Symbols are also possible as shortcuts
# E.g.: A.preduce do |a, b| a + b; end
A.preduce(:+)
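The result of preduce is an array containing the single reduced value (cf. arr.preduce(:+)[0] later in the paper). A sequential sketch of a pairwise, tree-shaped reduction (hypothetical helper name; a GPU reduction combines elements across threads in the same shape):

```ruby
# Sketch of a pairwise (tree-shaped) reduction; like preduce, it returns
# an array holding the single reduced value.
def preduce_sketch(arr)
  vals = arr.dup
  until vals.size <= 1
    # one reduction step: neighboring pairs are combined (in parallel on a GPU)
    vals = vals.each_slice(2).map { |a, b| b.nil? ? a : yield(a, b) }
  end
  vals
end

preduce_sketch([1, 2, 3, 4, 5]) { |a, b| a + b }  # => [15]
```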

New: This operation is a combination of index and map. It creates a new array of size m and initializes it using the block (anonymous function) f. It is a parallel version of Ruby's Array.new.

new(m, f) = map(index(m), f) = [f(0), ..., f(m − 1)]

Similar to index, this operation takes multiple arguments in a multidimensional case. This operation is exposed to Ruby programmers as pnew:

Array.pnew(m, &f)

Stencil (Convolution): This operation takes as arguments an input array A, an array of relative indices I (the neighborhood) of size k, a block f, and an out-of-bounds value o. It creates an array of same size and dimensionality where every value i is initialized using f, passing the values in the neighborhood of A[i] as arguments to f. Simple examples of stencil computations are image filtering kernels. The following formula is used to calculate the value in the resulting array at position i. If all indices are within bounds (case 1), i.e., 0 ≤ i + I[j] < m for all j, then f is applied to the neighborhood values; otherwise (case 2), the out-of-bounds value o is used.

stencil(A, I, f, o)[i] = f(A[i + I[0]], ..., A[i + I[k − 1]])   if 0 ≤ i + I[j] < m for all 0 ≤ j < k
stencil(A, I, f, o)[i] = o                                      otherwise

3. Kernel Fusion

All array commands except for index and array identity have at least one input array command (input in Figure 2). E.g., the input for a map operation is the array that is being mapped over. During code generation, Ikra traverses the tree of such dependent (input) commands. Depending on the access pattern of the dependent input, Ikra may merge multiple parallel operations into a single kernel. Figure 3 lists the access patterns for dependent computations for all array commands.

Command      | Input Access Pattern
combine      | same location (for all inputs)
stencil      | multiple (fixed pattern)
reduce       | multiple
zip          | same location (for all inputs)
(with_index) | same location

Figure 3: Input Access Patterns for Array Commands
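Returning to the stencil operation defined above: its semantics can be sketched sequentially in plain Ruby (hypothetical helper name; Ikra's pstencil evaluates this per element in a CUDA kernel, and the real block receives relatively-indexed values rather than a flat argument list):

```ruby
# Sequential sketch of the stencil semantics: case 1 applies the block to
# the neighborhood values, case 2 yields the out-of-bounds value.
def stencil_sketch(arr, neighborhood, out_of_bounds_value)
  arr.each_index.map do |i|
    indices = neighborhood.map { |offset| i + offset }
    if indices.all? { |j| j >= 0 && j < arr.size }
      yield indices.map { |j| arr[j] }   # case 1: all indices in bounds
    else
      out_of_bounds_value                # case 2: out-of-bounds value
    end
  end
end

# 1D moving sum with neighborhood [-1, 0, 1] and out-of-bounds value 0
stencil_sketch([1, 2, 3, 4], [-1, 0, 1], 0) { |v| v.sum }  # => [0, 6, 9, 0]
```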

A1 = [1, 2, 3]; A2 = [10, 20, 30]  # Ruby arrays
a = A1.pmap.with_index do |e, idx| ... end   # combine
b = a.pcombine(A2) do |e1, e2| ... end       # combine
c = b.pstencil([-1, 0, 1], 0).               # stencil
    with_index do |values, idx| ... end
d = c.preduce do |r1, r2| ... end            # reduce

(a) Source Code

(b) Resulting Kernel Fusion (gray boxes are kernels): id[A1] and index feed combine [a]; [a] and id[A2] feed combine [b]; stencil [c] (with index) reads [b]; reduce [d] is the root of the tree. The two combine operations are fused with their identity/index inputs into one kernel; the stencil and the reduce each form their own kernel.

require "image_library"

tt = ImgLib.load_png("tokyo_tower.png")
for i in 0...3
  tt = tt.apply_filter(ImgLib::Filters.blur)
end

sun = ImgLib.load_png("sunset.png")
combined = tt.apply_filter(
  ImgLib::Filters.blend(sun, 0.3))

forest = ImgLib.load_png("forest.png")
forest = forest.apply_filter(
  ImgLib::Filters.invert)
combined = combined.apply_filter(
  ImgLib::Filters.overlay(
    forest, ImgLib::Masks.circle(tt.height / 4)))

ImgLib::Output.render(combined)  # Draw pixels

(a) Ruby Source Code

Figure 4: Example: Kernel Fusion
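Kernel fusion of same-location operations amounts to composing the blocks into a single traversal. A plain-Ruby sketch of fusing the first two operations of Figure 4, with placeholder blocks since the figure elides them:

```ruby
# Placeholder blocks, assumed for illustration only:
f = ->(e, idx) { e + idx }   # block of a = A1.pmap.with_index
g = ->(e1, e2) { e1 * e2 }   # block of b = a.pcombine(A2)

a1 = [1, 2, 3]
a2 = [10, 20, 30]

# Unfused: two traversals with a materialized intermediate array.
intermediate = a1.each_with_index.map { |e, i| f.call(e, i) }
unfused = intermediate.zip(a2).map { |e1, e2| g.call(e1, e2) }

# Fused: a single traversal, as in one fused kernel; the intermediate
# value stays in a register instead of global memory.
fused = a1.each_index.map { |i| g.call(f.call(a1[i], i), a2[i]) }

fused == unfused  # => true
```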

(b) Image Rendering (gray boxes indicate generated GPU kernels): load (id) → blur (stencil) → blur (stencil) → blur (stencil) → blend (combine); load (id) → invert (map) → overlay (combine) → blend (combine).

Figure 6: Example: Image Manipulation Library Usage

The filters are provided by the image manipulation library and can be arbitrarily combined. However, only when the final result is accessed (rendered) in the last line does Ikra generate a single GPU program with four kernels and execute it. The filters (Figure 5) are implemented using parallel map or stencil operations. The image manipulation library defines an extension method apply_filter for applying image filters using double dispatch.

(a) Architecture of Image Manipulation Library: the module ImgLib provides load_png(filename), blur(), blend(other, ratio) and apply_filter(filter), which is implemented as return filter.apply_to(self). The class Filter stores a block (Proc) and has subclasses CombineFilter (args : ArrayCommand[]) and StencilFilter (neighborhood : int[], out_of_bounds_value), whose apply_to(cmd) returns cmd.pstencil(neighborhood, out_of_bounds_value, &block).

module ImageLibrary::Filters
  def self.load_png(filename)
    image = read_png(filename)
    return image.pixels.to_command(
      dimensions: [image.height, image.width])
  end

  def self.blend(other, ratio)
    return CombineFilter.new(other) do |p1, p2|
      pixel_add(  # Helper functions
        pixel_scale(p1, 1.0 - ratio),
        pixel_scale(p2, ratio))
    end
  end

  def self.blur
    return StencilFilter.new(
      neighborhood: STENCIL_2,
      out_of_bounds_value: 0) do |v|
      r = v[-1][-1][0] + ... + v[1][1][0]
      g = v[-1][-1][1] + ... + v[1][1][1]
      b = v[-1][-1][2] + ... + v[1][1][2]
      [r / 9, g / 9, b / 9]
    end
  end
end

(b) Example: Definition of Filters

Figure 5: Implementation of Image Manipulation Library

4.2 Iterative Computation

Many scientific computations (e.g., numerical partial differential equations) exhibit an iterative structure where an array or matrix is updated for a fixed number of times or until convergence. The source code snippet in Figure 7 does not compute any meaningful result but illustrates how to write such computations in Ikra.

1  input = [10, 20, 30, 40, 50, 60]
2  result = Ikra.host_section do
3    arr = input.to_command(dimensions: [2, 3])
4    for i in 0...10
5      if arr.preduce(:+)[0] % 2 == 0
6        arr = arr.pmap do |i| i + 1; end  # map A
7      else
8        arr = arr.pmap do |i| i + 2; end  # map B
9      end
10     arr = arr.pmap do |i| i + 3; end    # map C
11   end
12   arr
13 end

Figure 7: Example: Iterative Computation in Host Section
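Returning to the image manipulation library: the double-dispatch idiom behind apply_filter can be sketched in plain Ruby (hypothetical, simplified classes; the real library builds pmap/pstencil commands instead of calling Array#map):

```ruby
# apply_filter forwards to the filter object, which knows how to apply itself.
class MapFilter
  def initialize(&block)
    @block = block
  end

  def apply_to(pixels)
    pixels.map(&@block)   # stand-in for building a parallel map command
  end
end

module ApplyFilter
  def apply_filter(filter)
    filter.apply_to(self)   # double dispatch back to the filter
  end
end

class Array
  include ApplyFilter
end

invert = MapFilter.new { |p| 255 - p }
[0, 100, 255].apply_filter(invert)  # => [255, 155, 0]
```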

The update loop is contained in a host section, a code section that is compiled to C++ and executed on the host, as opposed to parallel sections, which are executed on the device. The value of the last statement of a host section is the return value of the host section. Inside host sections, only simple Ruby code may be written: everything that is allowed inside a parallel section, plus parallel sections themselves. More advanced Ruby features (such as metaprogramming) are forbidden.

For the code in Figure 7, one C++/CUDA program is generated. That program contains a C++ function for the host section and multiple CUDA kernels. Control flow statements inside host sections are executed symbolically, as opposed to control flow statements outside of host sections, which are executed by the Ruby interpreter. Host sections are translated with a conservative kind of ahead-of-time compilation: in the above example, it is not clear if the parallel section in Line 10 will be executed together with the one in Line 6 (mapA + mapC) or the one in Line 8 (mapB + mapC), or both in different iterations. Therefore, Ikra generates both fused kernel variants and launches the appropriate one at runtime. The detailed code generation process for this example is described in Section 5.2.

The compilation process of host sections is identical to the one of parallel sections, but there are additional steps to handle parallel sections within them. Since no code generation can be done once a host section (C++ code) is executing, fused kernels must be generated up front. Ikra statically analyzes all code paths with parallel sections through the host section and generates a number of fused kernels, even some that might never be used at runtime.

union union_v_t
{
    int int_;
    float float_;
    bool bool_;
    void *pointer;
    array_t array;
};

struct union_t
{
    int class_id;
    union_v_t value;
};

Figure 9: Union Type Struct Definition
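The "generate both variants ahead of time, choose at runtime" strategy can be illustrated in plain Ruby (a sketch of the Figure 7 loop body with the blocks inlined; the real decision is made over generated CUDA kernels, and sum stands in for the preduce(:+)[0] reduction):

```ruby
# Plain-Ruby stand-ins for the two fused kernel variants that Ikra
# generates ahead of time for the Figure 7 loop body.
FUSED_MAP_A_C = ->(arr) { arr.map { |x| (x + 1) + 3 } }  # map A + map C
FUSED_MAP_B_C = ->(arr) { arr.map { |x| (x + 2) + 3 } }  # map B + map C

def loop_step(arr)
  # The runtime condition selects which pre-generated variant to launch.
  if arr.sum % 2 == 0
    FUSED_MAP_A_C.call(arr)
  else
    FUSED_MAP_B_C.call(arr)
  end
end

loop_step([2, 4])  # => [6, 8] (sum is even, so map A + map C is chosen)
```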
At runtime, Ikra keeps track of which parallel sections were executed symbolically and eventually launches a fused kernel, which may contain multiple parallel sections, when the result is accessed.

5. Code Generation

5.1 Mapping Ruby Types to C++ Types

During type inference, Ikra analyzes which types an expression can have. If an expression is monomorphic (has a single type), a Ruby type is directly mapped to a corresponding type in the C++/CUDA source code. Figure 8 shows the mapping of Ruby data types to CUDA/C++ data types. Numeric values are currently represented by int and float; therefore, the range/precision of values is limited in Ikra. nil is represented by the int value 0. Arrays are either represented by array_t (a pointer-size pair) or by a generated struct type for arrays that appear in zip types (to allow efficient zipping of arrays of different type). Array commands will be discussed in the next section. All other objects are represented by object IDs generated by Ikra's object tracer [14].

Ruby Type        | C++/CUDA Type
Fixnum           | int
Float            | float
TrueClass        | bool
FalseClass       | bool
NilClass         | int
Array            | array_t, generated struct type
ArrayCommand     | array_command_t * (only allowed in host sections)
(other)          | object_id_t (int)
(multiple types) | union_t

Figure 8: Mapping Ruby Types to C++ Types

For polymorphic expressions (e.g., if Fixnums and nil are assigned to the same variable), union type structs (Figure 9) are used [1]. Values are stored in a union_v_t, which can hold values (or pointers to values) for all C++ types in Figure 8. The class ID field contains a number which identifies the runtime type unambiguously. If a method is called on a polymorphic expression, Ikra generates a switch-case statement with all types that the expression can have at runtime. If a monomorphic value is assigned to a polymorphic lvalue, Ikra wraps the value in a union type struct. Arrays of union type structs are used to represent polymorphic arrays.

5.2 Symbolic Execution in Host Sections

Host sections are pieces of Ruby code that are entirely translated to C++ code. They may contain one or more parallel sections but no advanced language features like metaprogramming.

Kernel Fusion via Type Inference: Within host sections, there are additional type inference rules to handle parallel sections. Ikra performs kernel fusion through type inference: the type of a parallel section (i.e., a method call AST node invoking a method defined in ParallelOperations) is the array command that it evaluates to when evaluating it symbolically. To that end, Ikra interprets such method call nodes in the Ruby interpreter. For example:

a = 10
# type(a) = int

b = Array.pnew(a) do ... end
# Eval Ruby: Array.pnew(CodeRef.new(:a)) do ... end
# type(b) := ArrayCombineCommand instance

c = b.pmap do ... end
# Eval Ruby: type(b).pmap do ... end
# type(c) = (different) ArrayCombineCommand instance

For performance reasons, the values of arguments of parallel operations (except for input or size arguments) must be known during symbolic execution (symbolic execution-time constants), so that they are constant in the GPU program [10]. For example, the neighborhood of a stencil computation inside a host section must be constant and cannot be generated inside the host section. However, all arguments of a combine operation may be variable.

Compilation Process: The result of a host section in the Ruby interpreter is an array command (HostSectionArrayCommand). The following list gives an overview of the single steps in symbolic execution and code generation.

1. Retrieve Ruby source code of host section
2. Generate AST (abstract syntax tree) from source code
3. Convert AST to SSA (static single assignment) form
4. Insert to_a method call on last expression
5. Perform type inference
6. Generate C++ source code for the host section and CUDA source code for all array commands on which to_a or [] is called

The first two steps are identical to symbolic execution of parallel sections. The SSA form simplifies type inference: If a variable is

written a second time with a value of a different type, a new C++ variable is allocated and a union type can sometimes be avoided. The return value (last expression) of a host section must be an array command or an array. In the former case, to ensure that the array command is executed, Ikra wraps the last expression in a method call node with method name to_a. That method does not have any effect for arrays but triggers array command execution. Type inference is currently implemented with multiple passes; i.e., Ikra extends the types of all expressions during every pass until a fixpoint is reached⁴.

After type inference, Ikra generates C++ source code for the host section. If an expression has an array command type, Ikra uses a pointer to an array_command_t struct, which has (among other things) a pointer to the cached result of the array command (if it was accessed before). Essentially, an array_command_t instance holds all information that an ArrayCommand instance in Ruby holds, except for information that is known statically (e.g., the out-of-bounds value of stencil computations is a numeric literal in the kernel source code). Whenever an array command access is detected (e.g., calling to_a on an array command), Ikra generates kernel source code for the array command (receiver type!) and a kernel invocation snippet, which checks if a cached result is available and otherwise transfers data (if necessary) and launches the kernel. Since array commands may have dependent input commands, generated kernels may consist of multiple fused parallel sections.

Example: As an example, consider the following Ikra source code, which is already in SSA form and has a to_a method call at the end.

1  result = Ikra.host_section do
2    arr1 = input.to_command(dimensions: [2, 3])
3    for i in 0...10
4      arr2 = φ(arr1, arr6)
5      if arr2.preduce(:+)[0] % 2 == 0
6        arr3 = arr2.pmap do |i| i+1; end  # mapA
7      else
8        arr4 = arr2.pmap do |i| i+2; end  # mapB
9      end
10     arr5 = φ(arr3, arr4)
11     arr6 = arr5.pmap do |i| i+3; end    # mapC
12   end
13   arr7 = φ(arr1, arr6)
14   arr7.to_a
15 end
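Before walking through Ikra's inference, the multi-pass fixpoint idea can be sketched with a toy analysis over the SSA variables above, with types flattened to sets of tags (hypothetical tags; Ikra's real types are array command instances) so that the iteration terminates:

```ruby
require "set"

# Toy transfer functions for the SSA variables above:
#   arr2 = phi(arr1, arr6); arr5 = map A/B over arr2; arr6 = map C over arr5
types = { arr1: Set[:id], arr2: Set[], arr5: Set[], arr6: Set[] }

loop do
  before = types.transform_values(&:dup)
  types[:arr2] = types[:arr1] | types[:arr6]
  types[:arr5] = types[:arr2].empty? ? Set[] : Set[:mapA, :mapB]
  types[:arr6] = types[:arr5].empty? ? Set[] : Set[:mapC]
  break if types == before   # fixpoint reached; no pass changed anything
end

types[:arr2]  # union of :id (first iteration) and :mapC (later iterations)
```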

In the following, we take a look at the inferred types of all arri variables. arr1 is an identity command for the Ruby array input, and arr2 cannot be fully inferred yet because arr6 is still unknown.

arr1 = id[input]
arr2 = {arr1, arr6} = {id[input], arr6}

Next, we infer the types for the first two map operations by evaluating the pmap method calls in the Ruby interpreter with each type in the union type of arr2 as the receiver. Different subscripts of map operations indicate that the operations are different array command objects after symbolic evaluation (because they have different parallel sections) and, therefore, represent different types.

arr3 = mapA(arr2) = {mapA(id[input]), mapA(arr6)}
arr4 = mapB(arr2) = {mapB(id[input]), mapB(arr6)}
arr5 = {arr3, arr4} = {mapA(id[input]), mapA(arr6), mapB(id[input]), mapB(arr6)}

Next, we infer the type of the last map operation.

arr6 = mapC(arr5) = {mapC(mapA(id[input])), mapC(mapA(arr6)), mapC(mapB(id[input])), mapC(mapB(arr6))}

As can be seen from the definitions above, the type of arr6 is circular. If we try to fully expand its definition, it will have an infinite number of elements. This is because our type inference mechanism effectively analyzes all control flow code paths through the host section (but cannot "count"): the union type of arr7 will have one type for every code path. Host sections with loops have an infinite number of paths, because the type inference engine is not aware of the number of iterations. Moreover, if while loops are used, the number of iterations cannot be determined statically in general.

Once a cycle like this one is detected, Ikra breaks it by inserting a to_a method call, which will launch the kernel and return its result as an array. Consequently, such a method call will stop the kernel fusion process. Where exactly the cycle is broken is an implementation detail. Ikra currently inserts the method call in Line 11, but it could also be inserted in Lines 4 or 10.

11     arr6 = arr5.to_a.pmap do |i| i+3; end    # mapC

We can now complete the type inference process and fill in the arr6 placeholders in the other definitions.

arr6 = mapC(id[arr5])
arr2 = {id[input], mapC(id[arr5])}
arr5 = {arr3, arr4} = {mapA(id[input]), mapA(mapC(id[arr5])), mapB(id[input]), mapB(mapC(id[arr5]))}
arr7 = {arr1, arr6} = {id[input], mapC(id[arr5])}

For code generation, only arr2, arr5 and arr7 are of interest, because their result is accessed. Ikra generates kernels and invocations for them: the type of the variable (or the class ID field if polymorphic) determines the kernel to be launched. In total, Ikra generates the following kernels in this example (some might never be launched).

(5.1) mapA(id[input])         (7.1) id[input]
(5.2) mapA(mapC(id[arr5]))    (7.2) mapC(id[arr5])
(5.3) mapB(id[input])         (r.1) reduce(id[input])
(5.4) mapB(mapC(id[arr5]))    (r.2) reduce(mapC(id[arr5]))

All polymorphic variables (all variables except for arr1) will have type union_t in the generated C++ code. The class ID field is used to determine which kernel should be launched. For example, one of the last two kernels should be launched in Line 14, depending on whether there was at least one loop iteration or not. This information is implicitly encoded in the class ID field. In the generated code, Lines 2, 4, 6, 8, 10 and 13 do not launch a kernel but merely update their respective variable with a new array_command_t object, possibly wrapped inside a union type struct containing the class ID for the command⁵.

⁴ There are better techniques for type inference using constraint solving.
⁵ Recall that polymorphic method calls are translated to switch-case statements. Every case updates the respective variable with a different class ID.
⁶ https://github.com/prg-titech/ikra-ruby, branch array17

6. Benchmarks

This section shows a number of microbenchmarks⁶ (small parallel sections, only for loops) of the current Ikra implementation in

Benchmarks were run on a computer with an Intel Core i7-6820HQ CPU (2.70 GHz), 32 GB RAM, an Nvidia GeForce 940MX GPU, Ubuntu 16.04.1 (kernel version 4.4.0-43-generic), Ruby 2.3.1, and the CUDA Toolkit V8.0.44. We compare how generated Ikra code performs in comparison with hand-written CUDA code for 7 different programs in various configurations and with various compilation strategies. For every program, we show the kernel structure (a control flow graph in which square boxes indicate kernels and arrows indicate control flow; the letter M indicates a map operation, the letter N a stencil operation, and the star a new operation), the operation style, whether a host section and kernel fusion are used, the number of kernels, the number of kernels in the loop, the number of loop iterations, and the number of kernel invocations after fusion. The programs comprise map and stencil microbenchmarks, simple and complex loops, the Himeno benchmark, and programs with variable loop bounds. Program 5 is a 3D stencil computation (Himeno) on a matrix of size 129 × 65 × 65. All other programs operate on matrices of 228 MB with a one-dimensional CUDA block size of 1024.

We benchmark five configurations: Ikra (parallel sections invoked directly from the Ruby interpreter, without kernel fusion), Ikra-M (with host sections but without kernel fusion), Ikra-F (Ikra with all code in a single host section and kernel fusion), and CUDA and CUDA-F, hand-written baseline implementations without and with (manual) kernel fusion. CUDA-F is a lower bound where all code is in a single kernel (even among loop iterations). Compilation time (not shown here) is around 2 seconds for Ikra-generated code.

Figure 10: Microbenchmark Runtime in Seconds. (Runtimes are broken down into kernel time, host section time, allocation, transfer, free, and interpreter time.)

The benchmarks show the performance speedup due to kernel fusion. Programs 1, 6 and 7 show the benefit of kernel fusion. In Program 1, a single kernel is launched, giving a 10x speedup. In Program 6, all 11 parallel sections are fused together: 101 kernels are launched instead of 501, giving a 5x speedup compared to the version without kernel fusion. Program 7 shows the benefit of kernel fusion in a program with more complex control flow, giving a speedup of 2x. The reason that Program 7 does not achieve the same speedup as Program 6 is that the parallel section inside the loop is smaller and the number of loop iterations is larger (200 vs. 100). Moreover, Ikra generates source code with union types to keep track of which kernel to launch at the end of an iteration. This leads to additional runtime overhead; however, it is much smaller than the kernel runtime and could not even be measured with confidence in our experiments.
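The effect of fusing map operations can be modeled in plain Ruby. This is a sketch only: sequential `Array#map` stands in for Ikra's parallel `pmap`, and the input array and blocks are made up for illustration.

```ruby
# Sequential model of kernel fusion (illustration only; plain
# Array#map stands in for Ikra's parallel pmap).
a0 = (0...8).to_a

# Unfused: two parallel sections, i.e., two kernel launches and an
# intermediate array in GPU memory.
a1 = a0.map { |x| x * 2 }        # first "kernel"
a2 = a1.map { |x| x + 1 }        # second "kernel"

# Fused: one pass applies both blocks per element, avoiding the
# intermediate array and the second launch.
fused = a0.map { |x| (x * 2) + 1 }

raise "fusion changed the result" unless fused == a2
puts fused.inspect               # => [1, 3, 5, 7, 9, 11, 13, 15]
```

The fused version performs the same element-wise work but touches memory only once per element, which is where the measured speedup comes from.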
Programs 3–5 consist of a loop with a single map or stencil operation. In such cases, Ikra cannot perform kernel fusion inside the loop, which is why we only report the performance for Ikra-M. If all code runs in the Ruby interpreter (no host section), the map operations are fused together by executing the loop in the interpreter: the tree of array commands is analyzed and fused, giving another 2x speedup. However, in the general case the number of loop iterations is unknown (e.g., while loops), so this is not always possible. Moreover, the resulting kernel code becomes large; increasing the number of iterations to 501 even results in a compilation error, and Ikra spends a very long time in the Ruby interpreter analyzing and fusing the tree of array commands. This shows that there is still potential for optimization.

All benchmarks have a low transfer time, i.e., time spent for transferring data between the device and the host. This is because data is transferred to the host only when the result of a parallel section is accessed on the host. The loops in all benchmarks have a fixed number of iterations, and the result is accessed only after the last iteration, when it is transferred back to the host. In a more realistic case, more data transfer will occur, e.g., where the control flow (number of iterations) depends on the result of a parallel operation.

When comparing Ikra's performance with hand-written CUDA code, we can see that the kernel runtimes are almost identical for fully optimized (fused) code. Himeno is a benchmark with a memory-bound stencil computation. Ikra spends additional time in the Ruby interpreter for performing type inference and generating C++/CUDA source code, and very little time in host sections (allocating/comparing union types/array commands, loop overhead, etc.). All benchmarks have a high allocation time because Ikra (and the CUDA baseline) do not reuse memory: a new piece of memory is allocated every time an array is updated, whereas CUDA programmers would keep writing to the same memory location. When memory is reused within the loop, performance can be increased due to lower allocation time and more efficient memory access (caching).
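The memory-reuse strategy of hand-written CUDA code can be sketched as double buffering. The following plain-Ruby model is illustrative only (the input values and the averaging stencil are made up, and no Ikra API is used); it swaps two preallocated buffers instead of allocating a fresh result array per iteration:

```ruby
# Double buffering: preallocate two buffers and swap them each
# iteration, as hand-written CUDA stencil code typically does,
# instead of allocating a new array per iteration.
n = 4
src = [0.0, 1.0, 2.0, 3.0]
dst = Array.new(n, 0.0)

2.times do
  (0...n).each do |i|
    left = i.zero? ? 10.0 : src[i - 1]  # border value for i == 0
    dst[i] = 0.5 * (left + src[i])
  end
  src, dst = dst, src                   # reuse buffers by swapping
end

puts src.inspect  # => [7.5, 2.75, 1.0, 2.0]
```

Only two allocations occur regardless of the number of iterations, which removes the per-iteration allocation time measured in the benchmarks.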

Number of Generated Kernels As described in Section 5.2, Ikra generates one kernel per control flow path (excluding loops). This can lead to a combinatorial explosion of the number of generated kernels, and Ikra's programming style could increase the number of parallel sections. Based on an analysis of a large number of parallel programs by J. Shen et al. [13], we believe that the number of kernels remains manageable: their work showed that the kernel structure of real applications is similar to the structure of the programs in our benchmarks and never more complex than the one of Program 7.

7. Future Work

Kernel Fusion of Stencil Operations Kernel fusion of stencil operations is not yet supported: Ikra can currently only fuse operations whose input pattern is "same location". There are plans to extend kernel fusion to certain stencil computations with a simple neighborhood. Kernel fusion can then be done either with redundant computation or with synchronization. Consider, for example, the following two stencil operations.

A1 = stencil(A0, [−1, 0], f, 10)
A2 = stencil(A1, [−1, 0], g, 10)

The resulting arrays after each iteration are defined as follows.

A1 = [10, f(A0[0], A0[1]), f(A0[1], A0[2]), ...]

A2 = [10, g(10, f(A0[0], A0[1])), g(f(A0[0], A0[1]), f(A0[1], A0[2])), ...]

The definition of A2 represents the computation for the fused stencil operation. Most terms using the function f are computed twice (redundantly). Alternatively, Ikra could split A0 in multiple arrays, assign every subarray to a CUDA block, and store intermediate results in shared memory. Inter-block synchronization or redundant computation is then only necessary at the subarray borders (ghost region [8]). This technique works well only if the neighborhood is simple (i.e., the ghost region is small).
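The redundancy in the fused definition of A2 can be made concrete with a small sequential Ruby model. This is a sketch only: `f`, `g`, and the `stencil` helper are hypothetical stand-ins for Ikra's parallel stencil operation.

```ruby
# Sequential model of fusing two stencil operations with
# neighborhood [-1, 0] and out-of-bounds value 10 (not Ikra's API).
f = ->(a, b) { a + b }   # hypothetical stencil functions
g = ->(a, b) { a * b }
a0 = [1, 2, 3, 4]

# One stencil pass; index 0 has no left neighbor, falls back to 10.
stencil = lambda do |arr, fn|
  arr.each_index.map { |i| i.zero? ? 10 : fn.call(arr[i - 1], arr[i]) }
end

# Two separate passes (two kernels, intermediate array A1).
a1 = stencil.call(a0, f)
a2 = stencil.call(a1, g)

# Fused single pass: f is recomputed for both neighbors
# (redundantly), so the intermediate array A1 never materializes.
fused = a0.each_index.map do |i|
  next 10 if i.zero?
  left = i == 1 ? 10 : f.call(a0[i - 2], a0[i - 1])
  g.call(left, f.call(a0[i - 1], a0[i]))
end

raise "fusion changed the result" unless fused == a2
puts fused.inspect  # => [10, 30, 15, 35]
```

Each element of the fused pass applies f twice, matching the duplicated f-terms in the definition of A2.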
Reusing Memory Many programs that update a vector or matrix iteratively only need access to the data from the previous computation. For performance reasons (e.g., caching), CUDA programmers allocate only one (for combine operations) or two (for stencil operations) arrays and keep writing to these arrays. However, all operations in Ikra (and Ruby) create new arrays. Moreover, Ikra does not have a garbage collector, so memory is released only after the execution of the host section or when the programmer frees memory explicitly in the Ruby code. To decide whether it is safe to reuse a previously allocated array, Ikra must perform an escape analysis to ensure that overwritten data is not read at a later point in time. Such advanced memory management issues are subject to future work.

8. Related Work

Kernel fusion is an optimization that is supported by many other GPGPU frameworks and languages, but the focus is on different aspects. For example, Harlan [7] supports nested kernels, Futhark [6] has support for nested parallelism and a powerful fusion engine for map/reduce combinations, and Kernel Weaver focuses on database queries [17]. Furthermore, all of these tools focus on statically-typed programming languages. This makes translation within a kernel easier compared to Ikra, because no union types are required, and it makes kernel fusion itself easier, because it is known ahead of execution time which kernels are executed together.

Fumero et al. designed a GPU extension for the R programming language, a dynamically-typed language [4]. Their implementation is built on top of the Truffle AST interpreter framework [18] and the Graal JIT compiler. They use partial evaluation to generate optimized OpenCL code for hot code sections. In contrast to Ikra, the resulting OpenCL code can handle only monomorphic types, whereas Ikra generates a single CUDA program with union types that can handle all types which could theoretically show up at runtime. Consequently, their generated OpenCL code is more efficient, but requires recompilation if the runtime types change.

9. Summary

We presented the design and implementation of Ikra, a Ruby library for data-parallel computations. Ikra allows programmers to write modular code with respect to reusability and composability of parallel sections. Parallel sections that are used together are fused into a single kernel if they use only the values generated by the previous section at the same location. Host sections separate code regions that extensively use dynamic language features from computations and allow Ikra to compile entire loops to C++ code. Inside a host section, parallel operations are fused during a static analysis using type inference. Future work will extend kernel fusion to stencil computations and focus on memory management features.

References

[1] M. Abadi, L. Cardelli, B. Pierce, and G. Plotkin. Dynamic typing in a statically typed language. ACM Trans. Program. Lang. Syst., 13(2):237–268, April 1991.
[2] M. M. T. Chakravarty, G. Keller, S. Lee, T. L. McDonell, and V. Grover. Accelerating Haskell array codes with multicore GPUs. DAMP '11, pages 3–14. ACM, 2011.
[3] J. Filipovič, M. Madzin, J. Fousek, and L. Matyska. Optimizing CUDA code by kernel fusion: application on BLAS. The Journal of Supercomputing, 71(10):3934–3957, 2015.
[4] J. Fumero, M. Steuwer, L. Stadler, and C. Dubach. Just-in-time GPU compilation for interpreted languages with partial evaluation. VEE '17, pages 60–73. ACM, 2017.
[5] E. Gamma, R. Helm, R. Johnson, and J. Vlissides. Design Patterns: Elements of Reusable Object-oriented Software. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1995.
[6] T. Henriksen, K. F. Larsen, and C. E. Oancea. Design and GPGPU performance of Futhark's redomap construct. ARRAY 2016, pages 17–24. ACM, 2016.
[7] E. Holk, R. Newton, J. Siek, and A. Lumsdaine. Region-based memory management for GPU programming languages: Enabling rich data structures on a spartan host. OOPSLA '14, pages 141–155. ACM.
[8] F. B. Kjolstad and M. Snir. Ghost cell pattern. ParaPLoP '10. ACM.
[9] A. Klöckner, N. Pinto, Y. Lee, B. Catanzaro, P. Ivanov, and A. Fasih. PyCUDA and PyOpenCL: A scripting-based approach to GPU run-time code generation. Parallel Comput., 38(3):157–174, March 2012.
[10] A. S. D. Lee and T. S. Abdelrahman. Launch-time optimization of OpenCL GPU kernels. GPGPU-10, pages 32–41. ACM, 2017.
[11] B. Meyer. Object-Oriented Software Construction. Prentice-Hall, Inc., 1st edition, 1988.
[12] S. Sato and H. Iwasaki. A skeletal parallel framework with fusion optimizer for GPGPU programming, pages 79–94. Springer Berlin Heidelberg, 2009.
[13] J. Shen, A. L. Varbanescu, X. Martorell, and H. Sips. A study of application kernel structure for data parallel applications. Technical report, Delft University of Technology, 2015.
[14] M. Springer and H. Masuhara. Object support in an array-based GPGPU extension for Ruby. ARRAY 2016, pages 25–31. ACM, 2016.
[15] M. Viñas, Z. Bozkus, and B. B. Fraguela. Exploiting heterogeneous parallelism with the heterogeneous programming library. J. Parallel Distrib. Comput., 73(12):1627–1638, December 2013.
[16] M. Wahib and N. Maruyama. Scalable kernel fusion for memory-bound GPU applications. SC '14, pages 191–202. IEEE Press, 2014.
[17] H. Wu, G. Diamos, S. Cadambi, and S. Yalamanchili. Kernel Weaver: Automatically fusing database primitives for efficient GPU computation. MICRO-45, pages 107–118. IEEE Computer Society, 2012.
[18] T. Würthinger, C. Wimmer, A. Wöß, L. Stadler, G. Duboscq, C. Humer, G. Richards, D. Simon, and M. Wolczko. One VM to rule them all. Onward! 2013, pages 187–204. ACM, 2013.
[19] Y. Yan, M. Grossman, and V. Sarkar. JCUDA: A programmer-friendly interface for accelerating Java programs with CUDA. Euro-Par '09, pages 887–899. Springer-Verlag, 2009.
