Modular Array-Based GPU Computing in a Dynamically-Typed Language

Matthias Springer Peter Wauligmann Hidehiko Masuhara Department of Mathematical and Computing Sciences, Tokyo Institute of Technology, Japan [email protected] [email protected] [email protected]

Abstract

Nowadays, GPU accelerators are widely used in areas with large data-parallel computations such as scientific computations or neural networks. Programmers can either write low-level CUDA/OpenCL code or use a GPU extension for a high-level programming language for better productivity. Most extensions focus on statically-typed languages, but many programmers prefer dynamically-typed languages due to their simplicity and flexibility.

This paper shows how programmers can write high-level modular code in Ikra, a Ruby extension for array-based GPU computing. Programmers can compose GPU programs of multiple reusable parallel sections, which are subsequently fused into a small number of GPU kernels. We propose a seamless syntax for separating code regions that extensively use dynamic language features from those that are compiled for efficient execution. Moreover, we propose symbolic execution and a program analysis for kernel fusion to achieve performance that is close to hand-written CUDA code.

CCS Concepts  • Software and its engineering → Source code generation; • Computing methodologies → Parallel programming languages; • Theory of computation → Type theory

Keywords  GPGPU, CUDA, Ruby, kernel fusion

Figure 1: High-level Overview of Compilation Process

1. Introduction

In recent years, one area of research in GPU computing focuses on high-level languages, steadily narrowing the performance gap between highly optimized low-level programs and high-level programs. A variety of tools emerged that let programmers write parallel programs for execution on GPUs in a high-level language. Most tools are extensions or libraries for existing high-level languages [2, 9, 19]. Their goal is not to reach peak performance. With sufficient expert knowledge about CUDA/OpenCL and the underlying hardware platform, it is possible to write highly optimized low-level programs that perform better. However, writing code in a high-level language is easier and more productive [15].

Ikra¹ is a language extension for data-parallel and scientific computations in Ruby on Nvidia GPUs. It uses arrays as an abstraction for expressing parallelism. Ikra provides parallel versions of map, reduce and a construct for stencil computations. When using Ikra, we encourage a dynamic programming style that is governed by the following two concepts.

Integration of Dynamic Language Features: Code in parallel sections is limited to a restricted set of types and operations (dynamic typing and object-oriented programming are allowed). All Ruby features (incl. metaprogramming) may still be used in other parts of a program. Therefore, programmers can still use external libraries (e.g., I/O or GUI libraries).

Modularity [11]: While optimized low-level programs typically consist of a small number of kernels performing a variety of operations, Ikra allows programmers to compose a program from multiple reusable, smaller kernels.

Due to dynamic language features, whole-program static (ahead-of-time) analysis is difficult. Therefore, Ikra generates CUDA programs at runtime (just in time) when type information is known. Moreover, Ikra optimizes GPU programs using two techniques that are well-known in statically-typed languages but not in dynamically-typed languages such as Ruby. First, it fuses multiple kernels [16, 3, 12] into a single kernel. Such code can be faster because data can remain in registers and does not have to be transferred from/into slow global memory. Second, loops surrounding parallel code are compiled to C++ code and not executed in the Ruby interpreter. Microbenchmarks show that both techniques together achieve performance that is comparable to a single hand-written kernel.

¹ https://prg-titech.github.io/ikra-ruby/

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyright is held by the owner/author(s). Publication rights licensed to ACM. ARRAY'17, June 18, 2017, Barcelona, Spain. ACM 978-1-4503-5069-3/17/06. http://dx.doi.org/10.1145/3091966.3091974

Compilation Process: Ikra is a RubyGem (Ruby library) that provides parallel versions of commonly used array operations. The notation and API for these operations is similar to Ruby's counterparts, but method names are prefixed with p for parallel.

Figure 1 gives a high-level overview of Ikra's compilation process. When one of Ikra's parallel operations is invoked in the Ruby interpreter, Ikra executes that operation symbolically.
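As a semantic reference, the p-prefixed API can be mimicked in plain Ruby. The following is a sequential stand-in (hypothetical module name), not Ikra's implementation; real Ikra executes the call symbolically and generates CUDA code instead:

```ruby
# A sequential stand-in for Ikra's ParallelOperations mixin (hypothetical
# module name; real Ikra runs the block on the GPU).
module ParallelOperationsStub
  def pmap(&block)
    map(&block)   # sequential semantics of the parallel map
  end
end

class Array
  include ParallelOperationsStub
end

[1, 2, 3].pmap { |x| x * 10 }  # => [10, 20, 30]
```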

The result is an array command object. Such an object contains all information required for CUDA code generation and execution. An array command can be used like a normal Ruby array. However, only when its contents are accessed for the first time, Ikra generates CUDA/C++ source code, compiles it using the Nvidia compiler (nvcc) and runs the generated C++ program. The generated program copies data to the GPU, launches the parallel sections, copies the result back to the host memory and returns the result².

Instead of defining single parallel sections, programmers can also define host sections. A host section is a block of Ruby code that contains a more complex program with multiple parallel sections. In such a case, the entire block is translated to C++ code, avoiding switching between the Ruby interpreter and external C++ programs multiple times. The former case can be seen as a host section which directly returns the result of a single parallel section. Thus, we only mention the general case "Run Host Section" in Figure 1.

Symbolic Execution: During symbolic execution, Ikra retrieves the source code of parallel sections (e.g., the body of a pmap operation), generates abstract syntax trees and infers types. The result of symbolic execution is an array command. Ikra currently supports various primitive types (Fixnum, Float, booleans, NilClass), user-defined classes and polymorphic types. A polymorphic type is represented by a pair of type/class ID and the actual value (union type). Programmers can invoke other methods inside parallel sections, including method calls on objects. After Ikra has determined the receiver type(s) of a method call during type inference, the target method(s) are added to a work list of methods to be processed next.

Array Commands: A command object in the Command Design Pattern [5] is an object that contains all information that is necessary to perform an action at a later point of time. An array command in Ikra is an object that contains all information required for code generation and for launching a parallel section/host section. For example, a stencil array command contains the typed AST of the computation (block), a reference to the input array command, an array of neighborhood indices, and an out-of-bounds value.

An array command can be seen as a special Ikra array. It has all methods that an ordinary Ruby array has, but its content is computed once it is accessed. Figure 2 gives an overview of Ikra's integration in Ruby and the design of array commands. There are a variety of subclasses of ArrayCommand corresponding to the parallel operations that are supported in Ikra (Section 2). The standard Ruby Array and Ikra's ArrayCommand both include the mixins Enumerable and ParallelOperations. The first mixin provides standard collection API functionality. The second mixin provides parallel operations which are executed on the GPU.

Figure 2: Integration of Ikra in Ruby (class diagram: ArrayCommand, with fields input : ArrayCommand[] and result : Object[], has subclasses ArrayIdentityCommand, ArrayIndexCommand, ArrayCombineCommand, ArrayStencilCommand, ArrayZipCommand and ArrayReduceCommand; the mixin ParallelOperations provides pcombine, pmap, pstencil, preduce and pzip; Array additionally provides to_command and pnew)

2. Parallel Operations

This section gives an overview of the operations that are provided by Ikra. All operations can handle multidimensional Ikra arrays, making code more readable if data is inherently multidimensional (e.g., images), but we use only one dimension for most operations in this section for presentation reasons. If an operation performs a computation, then the size of the first argument determines the number of CUDA threads that are allocated.

Array Identity: This operation creates an Ikra array (command) from an ordinary Ruby array A (denoted by id(A)). It can be used to load an external Ruby array A (not computed on the GPU) and make it available in Ikra. Array identity is applied implicitly where required. For example, when a map operation is applied to a Ruby array, Ikra applies this operation automatically. However, it is useful if programmers want to convert a one-dimensional Ruby array to a multidimensional Ikra array. It is exposed to Ruby programmers as to_command, taking an optional parameter for dimensions.

A.to_command()
A.to_command(dimensions: [15, 20])

Ikra arrays can be converted back to Ruby arrays with to_a, which is recommended for performance reasons if large parts of the Ikra array are read randomly.

Combine: This operation is used to map over one or more arrays Ai of same size m and dimensions. It takes as input n arrays and a block (anonymous function) f taking n scalar values. It applies f to every element of the input and retains the original shape of the input, regardless of dimensions.

combine(A1, ..., An, f) = [f(A1[0], ..., An[0]), ..., f(A1[m − 1], ..., An[m − 1])]

Ikra allocates m CUDA threads, i.e., every thread processes one tuple. This will likely change in future versions of Ikra. This operation is exposed to Ruby programmers as pcombine:

A1.pcombine(A2, ..., An, &f)

Map: This operation is a special case of combine with one input array. It corresponds to an ordinary map operation but is executed in parallel.

map(A1, f) = combine(A1, f) = [f(A1[0]), ..., f(A1[m − 1])]

This operation is exposed to Ruby programmers as pmap:

A1.pmap(&f)

Index: This operation generates an array of size m of consecutive indices starting from 0 and ending with m − 1.

index(m) = [0, 1, 2, ..., m − 1]

In a multidimensional case, index takes d arguments mi (d is the number of dimensions), where mi is the size of the i-th dimension. Every value in the resulting array is then an array of size d containing the indices for every dimension.
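The deferred-execution behavior of array commands can be sketched in plain Ruby. The following is a simplified, hypothetical MapCommand (no GPU involved) standing in for Ikra's command classes:

```ruby
# Sketch only (plain Ruby, hypothetical class): a command records its input
# and block, and computes and caches its result on first access.
class MapCommand
  def initialize(input, &block)
    @input = input     # dependent command, or a plain Ruby array
    @block = block
    @result = nil      # cached result, filled on first access
  end

  def pmap(&block)
    MapCommand.new(self, &block)   # grows the command tree; no execution
  end

  def to_a
    @result ||= @input.to_a.map(&@block)
  end
end

cmd = MapCommand.new([1, 2, 3]) { |x| x + 1 }.pmap { |x| x * 2 }
cmd.to_a  # => [4, 6, 8]; nothing is computed before this call
```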

index(m1, m2) = [[0, 0], ..., [0, m1 − 1], ..., [m2 − 1, 0], ..., [m2 − 1, m1 − 1]]

² If object-oriented programming is used inside a kernel, data is first converted to a memory coalescing-friendly structure-of-arrays layout [14].
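A sequential sketch of the two-dimensional index operation (hypothetical helper name; Ikra computes this on the GPU):

```ruby
# A flat array whose elements are per-dimension index pairs, matching the
# definition of index(m1, m2) above.
def index_op(m1, m2)
  (0...m2).flat_map do |j|
    (0...m1).map { |i| [j, i] }
  end
end

index_op(3, 2)  # => [[0, 0], [0, 1], [0, 2], [1, 0], [1, 1], [1, 2]]
```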

This operation is not directly exposed to programmers. Similar to Ruby notation, programmers must invoke the method with_index after a parallel operation (the parameter m is provided implicitly) or use Array.pnew (see below).

A1.pmap.with_index(&f)

Reduce: This operation is exposed to Ruby programmers as preduce, which accepts either a block or a symbol as a shortcut:

# Passing a block (function)
A.preduce(&f)
# Symbols are also possible as shortcuts
# E.g.: A.preduce do |a, b| a + b; end
A.preduce(:+)
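The result of preduce is an array containing the single reduced value (cf. arr.preduce(:+)[0] later in the paper). A sequential sketch of a pairwise, tree-shaped reduction (hypothetical helper name; a GPU reduction combines elements across threads in the same shape):

```ruby
# Sketch of a pairwise (tree-shaped) reduction; like preduce, it returns
# an array holding the single reduced value.
def preduce_sketch(arr)
  vals = arr.dup
  until vals.size <= 1
    # one reduction step: neighboring pairs are combined (in parallel on a GPU)
    vals = vals.each_slice(2).map { |a, b| b.nil? ? a : yield(a, b) }
  end
  vals
end

preduce_sketch([1, 2, 3, 4, 5]) { |a, b| a + b }  # => [15]
```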

New: This operation is a combination of index and map. It creates a new array of size m and initializes it using the block (anonymous function) f. It is a parallel version of Ruby's Array.new.

new(m, f) = map(index(m), f) = [f(0), ..., f(m − 1)]

Similar to index, this operation takes multiple arguments in a multidimensional case. This operation is exposed to Ruby programmers as pnew:

Array.pnew(m, &f)

Stencil (Convolution): This operation takes as arguments an input array A, an array of relative indices I (the neighborhood) of size k, a block f, and an out-of-bounds value o. It creates an array of same size and dimensionality where every value i is initialized using f, passing the values in the neighborhood of A[i] as arguments to f. Simple examples of stencil computations are image filtering kernels. The following formula is used to calculate the value in the resulting array at position i. If all indices are within bounds (case 1), i.e., 0 ≤ i + I[j] < m for all j, then f is applied to the neighborhood values; otherwise (case 2), the out-of-bounds value o is used.

stencil(A, I, f, o)[i] = f(A[i + I[0]], ..., A[i + I[k − 1]])   if 0 ≤ i + I[j] < m for all 0 ≤ j < k
stencil(A, I, f, o)[i] = o                                      otherwise

3. Kernel Fusion

All array commands except for index and array identity have at least one input array command (input in Figure 2). E.g., the input for a map operation is the array that is being mapped over. During code generation, Ikra traverses the tree of such dependent (input) commands. Depending on the access pattern of the dependent input, Ikra may merge multiple parallel operations into a single kernel. Figure 3 lists the access patterns for dependent computations for all array commands.

Command      | Input Access Pattern
combine      | same location (for all inputs)
stencil      | multiple (fixed pattern)
reduce       | multiple
zip          | same location (for all inputs)
(with_index) | same location

Figure 3: Input Access Patterns for Array Commands
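Returning to the stencil operation defined above: its semantics can be sketched sequentially in plain Ruby (hypothetical helper name; Ikra's pstencil evaluates this per element in a CUDA kernel, and the real block receives relatively-indexed values rather than a flat argument list):

```ruby
# Sequential sketch of the stencil semantics: case 1 applies the block to
# the neighborhood values, case 2 yields the out-of-bounds value.
def stencil_sketch(arr, neighborhood, out_of_bounds_value)
  arr.each_index.map do |i|
    indices = neighborhood.map { |offset| i + offset }
    if indices.all? { |j| j >= 0 && j < arr.size }
      yield indices.map { |j| arr[j] }   # case 1: all indices in bounds
    else
      out_of_bounds_value                # case 2: out-of-bounds value
    end
  end
end

# 1D moving sum with neighborhood [-1, 0, 1] and out-of-bounds value 0
stencil_sketch([1, 2, 3, 4], [-1, 0, 1], 0) { |v| v.sum }  # => [0, 6, 9, 0]
```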

A1 = [1, 2, 3]; A2 = [10, 20, 30]  # Ruby arrays
a = A1.pmap.with_index do |e, idx| ... end   # combine
b = a.pcombine(A2) do |e1, e2| ... end       # combine
c = b.pstencil([-1, 0, 1], 0).               # stencil
    with_index do |values, idx| ... end
d = c.preduce do |r1, r2| ... end            # reduce

(a) Source Code

(b) Resulting Kernel Fusion (gray boxes are kernels): id[A1] and index feed combine [a]; [a] and id[A2] feed combine [b]; stencil [c] (with index) reads [b]; reduce [d] is the root of the tree. The two combine operations are fused with their identity/index inputs into one kernel; the stencil and the reduce each form their own kernel.

require "image_library"

tt = ImgLib.load_png("tokyo_tower.png")
for i in 0...3
  tt = tt.apply_filter(ImgLib::Filters.blur)
end

sun = ImgLib.load_png("sunset.png")
combined = tt.apply_filter(
  ImgLib::Filters.blend(sun, 0.3))

forest = ImgLib.load_png("forest.png")
forest = forest.apply_filter(
  ImgLib::Filters.invert)
combined = combined.apply_filter(
  ImgLib::Filters.overlay(
    forest, ImgLib::Masks.circle(tt.height / 4)))

ImgLib::Output.render(combined)  # Draw pixels

(a) Ruby Source Code

Figure 4: Example: Kernel Fusion
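Kernel fusion of same-location operations amounts to composing the blocks into a single traversal. A plain-Ruby sketch of fusing the first two operations of Figure 4, with placeholder blocks since the figure elides them:

```ruby
# Placeholder blocks, assumed for illustration only:
f = ->(e, idx) { e + idx }   # block of a = A1.pmap.with_index
g = ->(e1, e2) { e1 * e2 }   # block of b = a.pcombine(A2)

a1 = [1, 2, 3]
a2 = [10, 20, 30]

# Unfused: two traversals with a materialized intermediate array.
intermediate = a1.each_with_index.map { |e, i| f.call(e, i) }
unfused = intermediate.zip(a2).map { |e1, e2| g.call(e1, e2) }

# Fused: a single traversal, as in one fused kernel; the intermediate
# value stays in a register instead of global memory.
fused = a1.each_index.map { |i| g.call(f.call(a1[i], i), a2[i]) }

fused == unfused  # => true
```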

(b) Image Rendering (gray boxes indicate generated GPU kernels): load (id) → blur (stencil) → blur (stencil) → blur (stencil) → blend (combine); load (id) → invert (map) → overlay (combine) → blend (combine).

Figure 6: Example: Image Manipulation Library Usage

The filters are provided by the image manipulation library and can be arbitrarily combined. However, only when the final result is accessed (rendered) in the last line does Ikra generate a single GPU program with four kernels and execute it. The filters (Figure 5) are implemented using parallel map or stencil operations. The image manipulation library defines an extension method apply_filter for applying image filters using double dispatch.

(a) Architecture of Image Manipulation Library: the module ImgLib provides load_png(filename), blur(), blend(other, ratio) and apply_filter(filter), which is implemented as return filter.apply_to(self). The class Filter stores a block (Proc) and has subclasses CombineFilter (args : ArrayCommand[]) and StencilFilter (neighborhood : int[], out_of_bounds_value), whose apply_to(cmd) returns cmd.pstencil(neighborhood, out_of_bounds_value, &block).

module ImageLibrary::Filters
  def self.load_png(filename)
    image = read_png(filename)
    return image.pixels.to_command(
      dimensions: [image.height, image.width])
  end

  def self.blend(other, ratio)
    return CombineFilter.new(other) do |p1, p2|
      pixel_add(  # Helper functions
        pixel_scale(p1, 1.0 - ratio),
        pixel_scale(p2, ratio))
    end
  end

  def self.blur
    return StencilFilter.new(
      neighborhood: STENCIL_2,
      out_of_bounds_value: 0) do |v|
      r = v[-1][-1][0] + ... + v[1][1][0]
      g = v[-1][-1][1] + ... + v[1][1][1]
      b = v[-1][-1][2] + ... + v[1][1][2]
      [r / 9, g / 9, b / 9]
    end
  end
end

(b) Example: Definition of Filters

Figure 5: Implementation of Image Manipulation Library

4.2 Iterative Computation

Many scientific computations (e.g., numerical partial differential equations) exhibit an iterative structure where an array or matrix is updated for a fixed number of times or until convergence. The source code snippet in Figure 7 does not compute any meaningful result but illustrates how to write such computations in Ikra.

1  input = [10, 20, 30, 40, 50, 60]
2  result = Ikra.host_section do
3    arr = input.to_command(dimensions: [2, 3])
4    for i in 0...10
5      if arr.preduce(:+)[0] % 2 == 0
6        arr = arr.pmap do |i| i + 1; end  # map A
7      else
8        arr = arr.pmap do |i| i + 2; end  # map B
9      end
10     arr = arr.pmap do |i| i + 3; end    # map C
11   end
12   arr
13 end

Figure 7: Example: Iterative Computation in Host Section
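Returning to the image manipulation library: the double-dispatch idiom behind apply_filter can be sketched in plain Ruby (hypothetical, simplified classes; the real library builds pmap/pstencil commands instead of calling Array#map):

```ruby
# apply_filter forwards to the filter object, which knows how to apply itself.
class MapFilter
  def initialize(&block)
    @block = block
  end

  def apply_to(pixels)
    pixels.map(&@block)   # stand-in for building a parallel map command
  end
end

module ApplyFilter
  def apply_filter(filter)
    filter.apply_to(self)   # double dispatch back to the filter
  end
end

class Array
  include ApplyFilter
end

invert = MapFilter.new { |p| 255 - p }
[0, 100, 255].apply_filter(invert)  # => [255, 155, 0]
```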

The update loop is contained in a host section, a code section that is compiled to C++ and executed on the host, as opposed to parallel sections, which are executed on the device. The value of the last statement of a host section is the return value of the host section. Inside host sections, only simple Ruby code may be written: everything that is allowed inside a parallel section, plus parallel sections themselves. More advanced Ruby features (such as metaprogramming) are forbidden.

For the code in Figure 7, one C++/CUDA program is generated. That program contains a C++ function for the host section and multiple CUDA kernels. Control flow statements inside host sections are executed symbolically, as opposed to control flow statements outside of host sections, which are executed by the Ruby interpreter. Host sections are translated with a conservative kind of ahead-of-time compilation: in the above example, it is not clear if the parallel section in Line 10 will be executed together with the one in Line 6 (mapA + mapC) or the one in Line 8 (mapB + mapC), or both in different iterations. Therefore, Ikra generates both fused kernel variants and launches the appropriate one at runtime. The detailed code generation process for this example is described in Section 5.2.

The compilation process of host sections is identical to the one of parallel sections, but there are additional steps to handle parallel sections within them. Since no code generation can be done once a host section (C++ code) is executing, fused kernels must be generated up front. Ikra statically analyzes all code paths with parallel sections through the host section and generates a number of fused kernels, even some that might never be used at runtime.

union union_v_t
{
    int int_;
    float float_;
    bool bool_;
    void *pointer;
    array_t array;
};

struct union_t
{
    int class_id;
    union_v_t value;
};

Figure 9: Union Type Struct Definition
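The "generate both variants ahead of time, choose at runtime" strategy can be illustrated in plain Ruby (a sketch of the Figure 7 loop body with the blocks inlined; the real decision is made over generated CUDA kernels, and sum stands in for the preduce(:+)[0] reduction):

```ruby
# Plain-Ruby stand-ins for the two fused kernel variants that Ikra
# generates ahead of time for the Figure 7 loop body.
FUSED_MAP_A_C = ->(arr) { arr.map { |x| (x + 1) + 3 } }  # map A + map C
FUSED_MAP_B_C = ->(arr) { arr.map { |x| (x + 2) + 3 } }  # map B + map C

def loop_step(arr)
  # The runtime condition selects which pre-generated variant to launch.
  if arr.sum % 2 == 0
    FUSED_MAP_A_C.call(arr)
  else
    FUSED_MAP_B_C.call(arr)
  end
end

loop_step([2, 4])  # => [6, 8] (sum is even, so map A + map C is chosen)
```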
At runtime, Ikra keeps track of which parallel sections were executed symbolically and eventually launches a fused kernel, which may contain multiple parallel sections, when the result is accessed.

5. Code Generation

5.1 Mapping Ruby Types to C++ Types

During type inference, Ikra analyzes which types an expression can have. If an expression is monomorphic (has a single type), a Ruby type is directly mapped to a corresponding type in the C++/CUDA source code. Figure 8 shows the mapping of Ruby data types to CUDA/C++ data types. Numeric values are currently represented by int and float; therefore, the range/precision of values is limited in Ikra. nil is represented by the int value 0. Arrays are either represented by array_t (a pointer-size pair) or by a generated struct type for arrays that appear in zip types (to allow efficient zipping of arrays of different type). Array commands will be discussed in the next section. All other objects are represented by object IDs generated by Ikra's object tracer [14].

Ruby Type        | C++/CUDA Type
Fixnum           | int
Float            | float
TrueClass        | bool
FalseClass       | bool
NilClass         | int
Array            | array_t, generated struct type
ArrayCommand     | array_command_t * (only allowed in host sections)
(other)          | object_id_t (int)
(multiple types) | union_t

Figure 8: Mapping Ruby Types to C++ Types

For polymorphic expressions (e.g., if Fixnums and nil are assigned to the same variable), union type structs (Figure 9) are used [1]. Values are stored in a union_v_t, which can hold values (or pointers to values) for all C++ types in Figure 8. The class ID field contains a number which identifies the runtime type unambiguously. If a method is called on a polymorphic expression, Ikra generates a switch-case statement with all types that the expression can have at runtime. If a monomorphic value is assigned to a polymorphic lvalue, Ikra wraps the value in a union type struct. Arrays of union type structs are used to represent polymorphic arrays.

5.2 Symbolic Execution in Host Sections

Host sections are pieces of Ruby code that are entirely translated to C++ code. They may contain one or more parallel sections but no advanced language features like metaprogramming.

Kernel Fusion via Type Inference: Within host sections, there are additional type inference rules to handle parallel sections. Ikra performs kernel fusion through type inference: the type of a parallel section (i.e., a method call AST node invoking a method defined in ParallelOperations) is the array command that it evaluates to when evaluating it symbolically. To that end, Ikra interprets such method call nodes in the Ruby interpreter. For example:

a = 10
# type(a) = int

b = Array.pnew(a) do ... end
# Eval Ruby: Array.pnew(CodeRef.new(:a)) do ... end
# type(b) := ArrayCombineCommand instance

c = b.pmap do ... end
# Eval Ruby: type(b).pmap do ... end
# type(c) = (different) ArrayCombineCommand instance

For performance reasons, the values of arguments of parallel operations (except for input or size arguments) must be known during symbolic execution (symbolic execution-time constants), so that they are constant in the GPU program [10]. For example, the neighborhood of a stencil computation inside a host section must be constant and cannot be generated inside the host section. However, all arguments of a combine operation may be variable.

Compilation Process: The result of a host section in the Ruby interpreter is an array command (HostSectionArrayCommand). The following list gives an overview of the single steps in symbolic execution and code generation.

1. Retrieve Ruby source code of host section
2. Generate AST (abstract syntax tree) from source code
3. Convert AST to SSA (static single assignment) form
4. Insert to_a method call on last expression
5. Perform type inference
6. Generate C++ source code for the host section and CUDA source code for all array commands on which to_a or [] is called

The first two steps are identical to symbolic execution of parallel sections. The SSA form simplifies type inference: If a variable is

written a second time with a value of a different type, a new C++ variable is allocated and a union type can sometimes be avoided. The return value (last expression) of a host section must be an array command or an array. In the former case, to ensure that the array command is executed, Ikra wraps the last expression in a method call node with method name to_a. That method does not have any effect for arrays but triggers array command execution. Type inference is currently implemented with multiple passes; i.e., Ikra extends the types of all expressions during every pass until a fixpoint is reached⁴.

After type inference, Ikra generates C++ source code for the host section. If an expression has an array command type, Ikra uses a pointer to an array_command_t struct, which has (among other things) a pointer to the cached result of the array command (if it was accessed before). Essentially, an array_command_t instance holds all information that an ArrayCommand instance in Ruby holds, except for information that is known statically (e.g., the out-of-bounds value of stencil computations is a numeric literal in the kernel source code). Whenever an array command access is detected (e.g., calling to_a on an array command), Ikra generates kernel source code for the array command (receiver type!) and a kernel invocation snippet, which checks if a cached result is available and otherwise transfers data (if necessary) and launches the kernel. Since array commands may have dependent input commands, generated kernels may consist of multiple fused parallel sections.

Example: As an example, consider the following Ikra source code, which is already in SSA form and has a to_a method call at the end.

1  result = Ikra.host_section do
2    arr1 = input.to_command(dimensions: [2, 3])
3    for i in 0...10
4      arr2 = φ(arr1, arr6)
5      if arr2.preduce(:+)[0] % 2 == 0
6        arr3 = arr2.pmap do |i| i+1; end  # mapA
7      else
8        arr4 = arr2.pmap do |i| i+2; end  # mapB
9      end
10     arr5 = φ(arr3, arr4)
11     arr6 = arr5.pmap do |i| i+3; end    # mapC
12   end
13   arr7 = φ(arr1, arr6)
14   arr7.to_a
15 end
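Before walking through Ikra's inference, the multi-pass fixpoint idea can be sketched with a toy analysis over the SSA variables above, with types flattened to sets of tags (hypothetical tags; Ikra's real types are array command instances) so that the iteration terminates:

```ruby
require "set"

# Toy transfer functions for the SSA variables above:
#   arr2 = phi(arr1, arr6); arr5 = map A/B over arr2; arr6 = map C over arr5
types = { arr1: Set[:id], arr2: Set[], arr5: Set[], arr6: Set[] }

loop do
  before = types.transform_values(&:dup)
  types[:arr2] = types[:arr1] | types[:arr6]
  types[:arr5] = types[:arr2].empty? ? Set[] : Set[:mapA, :mapB]
  types[:arr6] = types[:arr5].empty? ? Set[] : Set[:mapC]
  break if types == before   # fixpoint reached; no pass changed anything
end

types[:arr2]  # union of :id (first iteration) and :mapC (later iterations)
```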

In the following, we take a look at the inferred types of all arri variables. arr1 is an identity command for the Ruby array input, and arr2 cannot be fully inferred yet because arr6 is still unknown.

arr1 = id[input]
arr2 = {arr1, arr6} = {id[input], arr6}

Next, we infer the types for the first two map operations by evaluating the pmap method calls in the Ruby interpreter with each type in the union type of arr2 as the receiver. Different subscripts of map operations indicate that the operations are different array command objects after symbolic evaluation (because they have different parallel sections) and, therefore, represent different types.

arr3 = mapA(arr2) = {mapA(id[input]), mapA(arr6)}
arr4 = mapB(arr2) = {mapB(id[input]), mapB(arr6)}
arr5 = {arr3, arr4} = {mapA(id[input]), mapA(arr6), mapB(id[input]), mapB(arr6)}

Next, we infer the type of the last map operation.

arr6 = mapC(arr5) = {mapC(mapA(id[input])), mapC(mapA(arr6)), mapC(mapB(id[input])), mapC(mapB(arr6))}

As can be seen from the definitions above, the type of arr6 is circular. If we try to fully expand its definition, it will have an infinite number of elements. This is because our type inference mechanism effectively analyzes all control flow code paths through the host section (but cannot "count"): the union type of arr7 will have one type for every code path. Host sections with loops have an infinite number of paths, because the type inference engine is not aware of the number of iterations. Moreover, if while loops are used, the number of iterations cannot be determined statically in general.

Once a cycle like this one is detected, Ikra breaks it by inserting a to_a method call, which will launch the kernel and return its result as an array. Consequently, such a method call will stop the kernel fusion process. Where exactly the cycle is broken is an implementation detail. Ikra currently inserts the method call in Line 11, but it could also be inserted in Lines 4 or 10.

11     arr6 = arr5.to_a.pmap do |i| i+3; end    # mapC

We can now complete the type inference process and fill in the arr6 placeholders in the other definitions.

arr6 = mapC(id[arr5])
arr2 = {id[input], mapC(id[arr5])}
arr5 = {arr3, arr4} = {mapA(id[input]), mapA(mapC(id[arr5])), mapB(id[input]), mapB(mapC(id[arr5]))}
arr7 = {arr1, arr6} = {id[input], mapC(id[arr5])}

For code generation, only arr2, arr5 and arr7 are of interest, because their result is accessed. Ikra generates kernels and invocations for them: the type of the variable (or the class ID field if polymorphic) determines the kernel to be launched. In total, Ikra generates the following kernels in this example (some might never be launched).

(5.1) mapA(id[input])         (7.1) id[input]
(5.2) mapA(mapC(id[arr5]))    (7.2) mapC(id[arr5])
(5.3) mapB(id[input])         (r.1) reduce(id[input])
(5.4) mapB(mapC(id[arr5]))    (r.2) reduce(mapC(id[arr5]))

All polymorphic variables (all variables except for arr1) will have type union_t in the generated C++ code. The class ID field is used to determine which kernel should be launched. For example, one of the last two kernels should be launched in Line 14, depending on whether there was at least one loop iteration or not. This information is implicitly encoded in the class ID field. In the generated code, Lines 2, 4, 6, 8, 10 and 13 do not launch a kernel but merely update their respective variable with a new array_command_t object, possibly wrapped inside a union type struct containing the class ID for the command⁵.

⁴ There are better techniques for type inference using constraint solving.
⁵ Recall that polymorphic method calls are translated to switch-case statements. Every case updates the respective variable with a different class ID.
⁶ https://github.com/prg-titech/ikra-ruby, branch array17

6. Benchmarks

This section shows a number of microbenchmarks⁶ (small parallel sections, only for loops) of the current Ikra implementation in

Benchmarks were run on a computer with an Intel Core i7-6820HQ CPU (2.70 GHz), 32 GB RAM, an Nvidia GeForce 940MX GPU, Ubuntu 16.04.1 (kernel version 4.4.0-43-generic), Ruby 2.3.1, and the CUDA Toolkit V8.0.44. We compare how generated Ikra code performs in comparison with hand-written CUDA code for 7 different programs in various configurations and with various compilation strategies. For every program, we show the kernel structure (a control flow graph in which square boxes indicate kernels and arrows indicate control flow; the letter M indicates a map operation, the letter N a stencil operation, and the star a new operation), the operation style, whether a host section and kernel fusion are used, the number of kernels, the number of kernels in the loop, the number of loop iterations, and the number of kernel invocations after fusion. The programs comprise map and stencil microbenchmarks, simple and complex loops, the Himeno benchmark, and programs with variable loop bounds. Program 5 is a 3D stencil computation (Himeno) on a matrix of size 129 × 65 × 65. All other programs operate on matrices of 228 MB with a one-dimensional CUDA block size of 1024.

We benchmark five configurations: Ikra (parallel sections invoked directly from the Ruby interpreter, without kernel fusion), Ikra-M (with host sections but without kernel fusion), Ikra-F (Ikra with all code in a single host section and kernel fusion), and CUDA and CUDA-F, hand-written baseline implementations without and with (manual) kernel fusion. CUDA-F is a lower bound where all code is in a single kernel (even among loop iterations). Compilation time (not shown here) is around 2 seconds for Ikra-generated code.

Figure 10: Microbenchmark Runtime in Seconds. (Runtimes are broken down into kernel time, host section time, allocation, transfer, free, and interpreter time.)

The benchmarks show the performance speedup due to kernel fusion. Programs 1, 6 and 7 show the benefit of kernel fusion. In Program 1, a single kernel is launched, giving a 10x speedup. In Program 6, all 11 parallel sections are fused together: 101 kernels are launched instead of 501, giving a 5x speedup compared to the version without kernel fusion. Program 7 shows the benefit of kernel fusion in a program with more complex control flow, giving a speedup of 2x. The reason that Program 7 does not achieve the same speedup as Program 6 is that the parallel section inside the loop is smaller and the number of loop iterations is larger (200 vs. 100). Moreover, Ikra generates source code with union types to keep track of which kernel to launch at the end of an iteration. This leads to additional runtime overhead; however, it is much smaller than the kernel runtime and could not even be measured with confidence in our experiments.
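The effect of fusing map operations can be modeled in plain Ruby. This is a sketch only: sequential `Array#map` stands in for Ikra's parallel `pmap`, and the input array and blocks are made up for illustration.

```ruby
# Sequential model of kernel fusion (illustration only; plain
# Array#map stands in for Ikra's parallel pmap).
a0 = (0...8).to_a

# Unfused: two parallel sections, i.e., two kernel launches and an
# intermediate array in GPU memory.
a1 = a0.map { |x| x * 2 }        # first "kernel"
a2 = a1.map { |x| x + 1 }        # second "kernel"

# Fused: one pass applies both blocks per element, avoiding the
# intermediate array and the second launch.
fused = a0.map { |x| (x * 2) + 1 }

raise "fusion changed the result" unless fused == a2
puts fused.inspect               # => [1, 3, 5, 7, 9, 11, 13, 15]
```

The fused version performs the same element-wise work but touches memory only once per element, which is where the measured speedup comes from.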
Programs 3–5 consist of a loop with a single map or stencil operation. In such cases, Ikra cannot perform kernel fusion inside the loop, which is why we only report the performance for Ikra-M. If all code runs in the Ruby interpreter (no host section), the map operations are fused together by executing the loop in the interpreter: the tree of array commands is analyzed and fused, giving another 2x speedup. However, in the general case the number of loop iterations is unknown (e.g., while loops), so this is not always possible. Moreover, the resulting kernel code becomes large; increasing the number of iterations to 501 even results in a compilation error, and Ikra spends a very long time in the Ruby interpreter analyzing and fusing the tree of array commands. This shows that there is still potential for optimization.

All benchmarks have a low transfer time, i.e., time spent for transferring data between the device and the host. This is because data is transferred to the host only when the result of a parallel section is accessed on the host. The loops in all benchmarks have a fixed number of iterations, and the result is accessed only after the last iteration, when it is transferred back to the host. In a more realistic case, more data transfer will occur, e.g., where the control flow (number of iterations) depends on the result of a parallel operation.

When comparing Ikra's performance with hand-written CUDA code, we can see that the kernel runtimes are almost identical for fully optimized (fused) code. Himeno is a benchmark with a memory-bound stencil computation. Ikra spends additional time in the Ruby interpreter for performing type inference and generating C++/CUDA source code, and very little time in host sections (allocating/comparing union types/array commands, loop overhead, etc.). All benchmarks have a high allocation time because Ikra (and the CUDA baseline) do not reuse memory: a new piece of memory is allocated every time an array is updated, whereas CUDA programmers would keep writing to the same memory location. When memory is reused within the loop, performance can be increased due to lower allocation time and more efficient memory access (caching).
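The memory-reuse strategy of hand-written CUDA code can be sketched as double buffering. The following plain-Ruby model is illustrative only (the input values and the averaging stencil are made up, and no Ikra API is used); it swaps two preallocated buffers instead of allocating a fresh result array per iteration:

```ruby
# Double buffering: preallocate two buffers and swap them each
# iteration, as hand-written CUDA stencil code typically does,
# instead of allocating a new array per iteration.
n = 4
src = [0.0, 1.0, 2.0, 3.0]
dst = Array.new(n, 0.0)

2.times do
  (0...n).each do |i|
    left = i.zero? ? 10.0 : src[i - 1]  # border value for i == 0
    dst[i] = 0.5 * (left + src[i])
  end
  src, dst = dst, src                   # reuse buffers by swapping
end

puts src.inspect  # => [7.5, 2.75, 1.0, 2.0]
```

Only two allocations occur regardless of the number of iterations, which removes the per-iteration allocation time measured in the benchmarks.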

Number of Generated Kernels As described in Section 5.2, Ikra generates one kernel per control flow path (excluding loops). This can lead to a combinatorial explosion of the number of generated kernels, and Ikra's programming style could increase the number of parallel sections. Based on an analysis of a large number of parallel programs by J. Shen et al. [13], we believe that the number of kernels remains manageable: their work showed that the kernel structure of real applications is similar to the structure of the programs in our benchmarks and never more complex than the one of Program 7.

7. Future Work

Kernel Fusion of Stencil Operations Kernel fusion of stencil operations is not yet supported: Ikra can currently only fuse operations whose input pattern is "same location". There are plans to extend kernel fusion to certain stencil computations with a simple neighborhood. Kernel fusion can then be done either with redundant computation or with synchronization. Consider, for example, the following two stencil operations.

A1 = stencil(A0, [−1, 0], f, 10)
A2 = stencil(A1, [−1, 0], g, 10)

The resulting arrays after each iteration are defined as follows.

A1 = [10, f(A0[0], A0[1]), f(A0[1], A0[2]), ...]

A2 = [10, g(10, f(A0[0], A0[1])), g(f(A0[0], A0[1]), f(A0[1], A0[2])), ...]

The definition of A2 represents the computation for the fused stencil operation. Most terms using the function f are computed twice (redundantly). Alternatively, Ikra could split A0 in multiple arrays, assign every subarray to a CUDA block, and store intermediate results in shared memory. Inter-block synchronization or redundant computation is then only necessary at the subarray borders (ghost region [8]). This technique works well only if the neighborhood is simple (i.e., the ghost region is small).
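The redundancy in the fused definition of A2 can be made concrete with a small sequential Ruby model. This is a sketch only: `f`, `g`, and the `stencil` helper are hypothetical stand-ins for Ikra's parallel stencil operation.

```ruby
# Sequential model of fusing two stencil operations with
# neighborhood [-1, 0] and out-of-bounds value 10 (not Ikra's API).
f = ->(a, b) { a + b }   # hypothetical stencil functions
g = ->(a, b) { a * b }
a0 = [1, 2, 3, 4]

# One stencil pass; index 0 has no left neighbor, falls back to 10.
stencil = lambda do |arr, fn|
  arr.each_index.map { |i| i.zero? ? 10 : fn.call(arr[i - 1], arr[i]) }
end

# Two separate passes (two kernels, intermediate array A1).
a1 = stencil.call(a0, f)
a2 = stencil.call(a1, g)

# Fused single pass: f is recomputed for both neighbors
# (redundantly), so the intermediate array A1 never materializes.
fused = a0.each_index.map do |i|
  next 10 if i.zero?
  left = i == 1 ? 10 : f.call(a0[i - 2], a0[i - 1])
  g.call(left, f.call(a0[i - 1], a0[i]))
end

raise "fusion changed the result" unless fused == a2
puts fused.inspect  # => [10, 30, 15, 35]
```

Each element of the fused pass applies f twice, matching the duplicated f-terms in the definition of A2.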
Reusing Memory Many programs that update a vector or matrix iteratively only need access to the data from the previous computation. For performance reasons (e.g., caching), CUDA programmers allocate only one (for combine operations) or two (for stencil operations) arrays and keep writing to these arrays. However, all operations in Ikra (and Ruby) create new arrays. Moreover, Ikra does not have a garbage collector, so memory is released only after the execution of the host section or when the programmer frees memory explicitly in the Ruby code. To decide whether it is safe to reuse a previously allocated array, Ikra must perform an escape analysis to ensure that overwritten data is not read at a later point in time. Such advanced memory management issues are subject to future work.

8. Related Work

Kernel fusion is an optimization that is supported by many other GPGPU frameworks and languages, but the focus is on different aspects. For example, Harlan [7] supports nested kernels, Futhark [6] has support for nested parallelism and a powerful fusion engine for map/reduce combinations, and Kernel Weaver focuses on database queries [17]. Furthermore, all of these tools focus on statically-typed programming languages. This makes translation within a kernel easier compared to Ikra, because no union types are required, and it makes kernel fusion itself easier, because it is known ahead of execution time which kernels are executed together.

Fumero et al. designed a GPU extension for the R programming language, a dynamically-typed language [4]. Their implementation is built on top of the Truffle AST interpreter framework [18] and the Graal JIT compiler. They use partial evaluation to generate optimized OpenCL code for hot code sections. In contrast to Ikra, the resulting OpenCL code can handle only monomorphic types, whereas Ikra generates a single CUDA program with union types that can handle all types which could theoretically show up at runtime. Consequently, their generated OpenCL code is more efficient, but requires recompilation if the runtime types change.

9. Summary

We presented the design and implementation of Ikra, a Ruby library for data-parallel computations. Ikra allows programmers to write modular code with respect to reusability and composability of parallel sections. Parallel sections that are used together are fused into a single kernel if they use only the values generated by the previous section at the same location. Host sections separate code regions that extensively use dynamic language features from computations and allow Ikra to compile entire loops to C++ code. Inside a host section, parallel operations are fused during a static analysis using type inference. Future work will extend kernel fusion to stencil computations and focus on memory management features.

References

[1] M. Abadi, L. Cardelli, B. Pierce, and G. Plotkin. Dynamic typing in a statically typed language. ACM Trans. Program. Lang. Syst., 13(2):237–268, April 1991.
[2] M. M. T. Chakravarty, G. Keller, S. Lee, T. L. McDonell, and V. Grover. Accelerating Haskell array codes with multicore GPUs. DAMP '11, pages 3–14. ACM, 2011.
[3] J. Filipovič, M. Madzin, J. Fousek, and L. Matyska. Optimizing CUDA code by kernel fusion: application on BLAS. The Journal of Supercomputing, 71(10):3934–3957, 2015.
[4] J. Fumero, M. Steuwer, L. Stadler, and C. Dubach. Just-in-time GPU compilation for interpreted languages with partial evaluation. VEE '17, pages 60–73. ACM, 2017.
[5] E. Gamma, R. Helm, R. Johnson, and J. Vlissides. Design Patterns: Elements of Reusable Object-oriented Software. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1995.
[6] T. Henriksen, K. F. Larsen, and C. E. Oancea. Design and GPGPU performance of Futhark's redomap construct. ARRAY 2016, pages 17–24. ACM, 2016.
[7] E. Holk, R. Newton, J. Siek, and A. Lumsdaine. Region-based memory management for GPU programming languages: Enabling rich data structures on a spartan host. OOPSLA '14, pages 141–155. ACM.
[8] F. B. Kjolstad and M. Snir. Ghost cell pattern. ParaPLoP '10. ACM.
[9] A. Klöckner, N. Pinto, Y. Lee, B. Catanzaro, P. Ivanov, and A. Fasih. PyCUDA and PyOpenCL: A scripting-based approach to GPU run-time code generation. Parallel Comput., 38(3):157–174, March 2012.
[10] A. S. D. Lee and T. S. Abdelrahman. Launch-time optimization of OpenCL GPU kernels. GPGPU-10, pages 32–41. ACM, 2017.
[11] B. Meyer. Object-Oriented Software Construction. Prentice-Hall, Inc., 1st edition, 1988.
[12] S. Sato and H. Iwasaki. A skeletal parallel framework with fusion optimizer for GPGPU programming, pages 79–94. Springer Berlin Heidelberg, 2009.
[13] J. Shen, A. L. Varbanescu, X. Martorell, and H. Sips. A study of application kernel structure for data parallel applications. Technical report, Delft University of Technology, 2015.
[14] M. Springer and H. Masuhara. Object support in an array-based GPGPU extension for Ruby. ARRAY 2016, pages 25–31. ACM, 2016.
[15] M. Viñas, Z. Bozkus, and B. B. Fraguela. Exploiting heterogeneous parallelism with the heterogeneous programming library. J. Parallel Distrib. Comput., 73(12):1627–1638, December 2013.
[16] M. Wahib and N. Maruyama. Scalable kernel fusion for memory-bound GPU applications. SC '14, pages 191–202. IEEE Press, 2014.
[17] H. Wu, G. Diamos, S. Cadambi, and S. Yalamanchili. Kernel Weaver: Automatically fusing database primitives for efficient GPU computation. MICRO-45, pages 107–118. IEEE Computer Society, 2012.
[18] T. Würthinger, C. Wimmer, A. Wöß, L. Stadler, G. Duboscq, C. Humer, G. Richards, D. Simon, and M. Wolczko. One VM to rule them all. Onward! 2013, pages 187–204. ACM, 2013.
[19] Y. Yan, M. Grossman, and V. Sarkar. JCUDA: A programmer-friendly interface for accelerating Java programs with CUDA. Euro-Par '09, pages 887–899. Springer-Verlag, 2009.
