Exploiting Remote Memory Access for Automatic Multi-GPU Parallelization
Javier Cabezas∗§   Lluís Vilanova∗§   Isaac Geladoφ   Thomas B. Jablin‡   Nacho Navarro∗§   Wen-mei Hwu‡
Barcelona Supercomputing Center∗   UPC§   NVIDIA Researchφ   UIUC‡
{name.lastname}@bsc.es   {jcabezas,vilanova,nacho}@upc.edu   [email protected]   {jablin,w-hwu}@illinois.edu

Abstract data structures. Consequently, prior work must conservatively replicate portions of the arrays that are never accessed. For In this paper we present AMGE, a programming framework example, consider a kernel that performs n-dimensional tiling and runtime system that transparently decomposes GPU ker- (a pattern often found in dense GPU computations [33, 37, 8]) nels and executes them on multiple GPUs in parallel. AMGE where computation partitions access different non-contiguous exploits the remote memory access capability in modern GPUs array regions. In such a case, Kim et al. transfer the whole to ensure that data can be accessed regardless of its physical memory ranges accessed by each computation partition, which location, thus allowing our runtime to safely decompose and may include large portions of the array that are never used, distribute arrays across GPU memories. It also implements a while Lee et al. replicate the whole data structure in all GPUs. compiler analysis that detects array access patterns in GPU This increases the memory usage, limiting the size of the prob- kernels. Using this information, the runtime chooses the best lems that can be handled, and imposes performance overheads computation and data distribution configuration. Results show due to larger data transfers. (2) Data coherence overhead: 1.98× and 3.89× execution speedups for 2 and 4 GPUs for a replicated output memory regions need to be merged in the wide range of dense computations compared to the original host memory after each kernel call. In many cases, this merge versions on a single GPU. The GPU execution model allows step leads to large performance overheads. (3) Lack of support AMGE to hide the cost of remote memory accesses when they for atomic and global memory instructions. are kept below 3%. We further demonstrate that a thread block In this paper we present AMGE (Automatic Multi-GPU scheduling policy that distributes remote accesses thorough Execution), a programming interface, compiler support and the whole kernel execution helps reducing their overhead. runtime system for that automatically executes computations 1. Introduction that are programmed for a single GPU across all the GPUs in the system. The programming interface provides a data type Current HPC cluster systems commonly install 2 or 4 CPUs for multidimensional arrays that allows for robust, transparent in each node [1]. Some also install discrete GPUs to further distribution of arrays across all GPU memories. This new type accelerate computations rich in data parallelism. As CPU provides dimensionality information that enables the compiler and GPU are integrated into the same chip (e.g., Intel Ivy to determine how the arrays are accessed in GPU kernels. The Bridge [21], AMD APU [4], NVIDIA K1 [5]), multi-GPU runtime system uses the compiler-provided information to nodes are expected to be common in future HPC systems. automatically choose the best computation and data distribu- Current GPU programming models, such as CUDA [31] and tion configuration to minimize inter-GPU communication and OpenCL [24], make multi-GPU programming a tedious and memory footprint. error-prone task. These models present GPUs as external AMGE assumes non-coherent non-uniform shared memory devices with their own private memory, and programmers are accesses (NCC-NUMA) between GPUs through a relatively in charge of splitting data and computation across GPUs and low-bandwidth interconnect, such that all GPUs can access taking care of data movement. 
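For illustration, the following sketch shows the kind of per-GPU bookkeeping that current CUDA forces on the programmer and that AMGE removes: the data must be split by hand, a piece allocated and copied per device, and one kernel launched per GPU. The kernel, names and sizes are hypothetical and not taken from the paper.

  #include <algorithm>
  #include <vector>
  #include <cuda_runtime.h>

  // Illustrative kernel (not from the paper).
  __global__ void vec_scale(float *x, size_t n, float a) {
      size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
      if (i < n) x[i] *= a;
  }

  // Hand-written decomposition across ngpus devices: the bookkeeping AMGE automates.
  void scale_on_all_gpus(float *host_x, size_t n, float a, int ngpus) {
      size_t chunk = (n + ngpus - 1) / ngpus;
      std::vector<float *> dev_x(ngpus, nullptr);
      for (int g = 0; g < ngpus; ++g) {
          size_t beg = g * chunk;
          if (beg >= n) break;
          size_t cnt = std::min(chunk, n - beg);
          cudaSetDevice(g);                                    // select the GPU
          cudaMalloc(&dev_x[g], cnt * sizeof(float));          // per-GPU allocation
          cudaMemcpy(dev_x[g], host_x + beg, cnt * sizeof(float),
                     cudaMemcpyHostToDevice);                  // explicit host-to-GPU copy
          vec_scale<<<(unsigned)((cnt + 255) / 256), 256>>>(dev_x[g], cnt, a);
      }
      for (int g = 0; g < ngpus; ++g) {                        // copy results back and free
          if (!dev_x[g]) continue;
          size_t beg = g * chunk, cnt = std::min(chunk, n - beg);
          cudaSetDevice(g);
          cudaMemcpy(host_x + beg, dev_x[g], cnt * sizeof(float),
                     cudaMemcpyDeviceToHost);
          cudaFree(dev_x[g]);
      }
  }

Even this embarrassingly parallel case needs explicit device selection, per-device buffers and transfers; kernels with halos or shared inputs require considerably more code.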
and cache any partition of the arrays. Thus, we ensure that Some solutions have already been proposed to transparently arrays can be arbitrarily decomposed, distributed and safely exploit multiple GPUs. Kim et al. [25] build a single virtual accessed from any GPU in the system. In current systems compute device for all GPUs in the system. They decompose based on discrete GPUs, we utilize Peer-to-Peer [3] and Uni- computations and execute the partitions on different GPUs. fied Virtual Address Space [31] technologies that enable a Data is also decomposed across GPUs as long as their compiler GPU to transparently access the memory of any other GPU and run-time analyses are able to unequivocally determine the connected to the same PCIe domain. While remote GPU mem- regions of the arrays accessed by each computation partition. ory accesses have been used in the past [35], this is the first Otherwise data is replicated on all GPUs. Lee et al. [27] ex- work to use them as an enabling mechanism for automatic tend this idea to heterogeneous systems with different types of multi-GPU execution. compute devices (CPUs or GPUs) or computation capabilities We also present and discuss different implementation trade- (e.g., different GPU models). However, both solutions suffer offs for computation and data distribution across GPUs using a from fundamental limitations. (1) Memory footprint overhead: prototype implementation of AMGE for ++ and CUDA. The none of these solutions determine the dimensionality of the prototype includes a compiler pass that detects array access We evaluate AMGE on an existing commercial system that implements most of the features of an NCC-NUMA system based on NVIDIA discrete GPUs (see Figure 1). GPUs access their local memory (arc a) with full-bandwidth. Accesses to CPU memories from the GPU (arc b) are routed through the PCI Express (PCIe) interconnect and the CPU memory con- Figure 1: Multi-GPU architecture evaluated in this paper. troller. If the target address resides in a memory connected to a different CPU socket, the inter-CPU interconnect (Hyper- patterns in CUDA kernels and generates optimized versions Transport/QPI) must be traversed, too. GPUs can also access of the kernels for different array decompositions. The runtime the memory in another GPU through the PCIe interconnect system distributes data and computation, and selects the appro- (arc c). This is implemented on top of the peer-to-peer (P2P) priate kernel version. This prototype is evaluated using a set mechanism introduced in NVIDIA Fermi GPUs [18]. of GPU dense-computation benchmarks, originally developed While the execution model provided by GPUs can hide for single-GPU execution. Results on a real 4-GPU system large memory latencies, both CPU memory and the inter-GPU show 1.98× and 3.89× kernel execution speedups for 2 and interconnects (e.g., PCIe 2.0/3.0) deliver a memory bandwidth 4 GPUs, respectively, using the default distribution selection which is an order of magnitude lower than the local GPU mem- policy implemented in our prototype, compared to the original ory (GDDR5). New interconnects that provide much higher version of the kernels running on a single GPU. bandwidth have been announced (e.g., NVLink is projected The main contributions of this paper are: (1) A multi-GPU to deliver up to 100 GB/s), but the memory technology will parallelization system that provides space-efficient data de- also keep improving, thus maintaining this gap. Therefore, compositions to enable bigger problem sizes. 
(2) A novel minimizing remote accesses is key for performance. compiler analysis for GPU kernels that detects the array access patterns to systematically determine the array decomposition 2.1. GPU Programming Model and distribution configuration to be used in order to minimize GPUs are typically programmed using a Single Program Mul- remote memory accesses. (3) A simple programming interface tiple Data (SPMD) programming model, such as NVIDIA that can be easily introduced into languages such as CUDA CUDA or OpenCL. For simplicity, we use the CUDA nam- and OpenCL to robustly and transparently distribute compu- ing conventions in the rest of the paper. This model allows tation and data across several GPUs. (4) An evaluation that programmers to spawn a large number of threads that exe- shows the efficacy of the remote memory access mechanism to cute the same program, although each thread might take a enable multi-GPU parallelization even when built on a limited completely different control flow path. All these threads are bandwidth interconnect. organized into a computation grid of groups of threads (i.e., 2. Multi-GPU architecture thread blocks). Each thread block has an identifier and each thread has an identifier within the thread block, that can be Our target multi-GPU system architecture has one or several used by programmers to map the computation to the data struc- chips that contain both CPU and GPU cores. Since we focus tures. Both CUDA and OpenCL provide weak consistency on programmability within a node, we use the term system models: memory updates performed by a thread block might to refer to a node. Each chip is connected to one or more not be perceived by other thread blocks, except for atomic and memory modules, but all cores (both CPU and GPU) can ac- memory fence (GPU-wide and system-wide) instructions. We cess any memory module in the system. Accesses to remote refer the reader to the CUDA Programming Guide [31] and memories have longer access latency and lower bandwidth the OpenCL specification [24] for further details. than local accesses, thus forming a shared memory NUMA Multi-GPU Programming. In CUDA and OpenCL, GPUs system. CPU cores typically access memory through a co- are exposed as external devices with their own memories. Pro- herent cache hierarchy, while GPUs use weaker consistency grammers typically decompose computation and data so that models that do not require cache coherence between cores. each GPU only accesses its local memory. If there are regions Therefore, both coherent and non-coherent interconnects are of data that are accessed by several GPUs, programmers are supported in AMGE. This shared memory NCC-NUMA sys- responsible of replicating and keeping them coherent through tem architecture has been successfully implemented in the explicit memory transfers. CUDA exposes a Unified Virtual past (e.g., Cray T3E [32]). NVIDIA proposes a similar system Address Space (UVAS), which ensures that virtual memory architecture in the Echelon project [23]. Moreover, NVIDIA addresses are unique across all memories in the system, and will offer single-board multi-GPU configurations in which support for remote memory accesses. Using the UVAS pro- GPUs share the memory through a non-coherent interconnect grammers could map the pages of a memory allocation on named NVLink, in the Pascal family of GPUs. AMD also different GPU memories. 
implements coherent and non-coherent memory hierarchies in their APU processors [4]. However, CUDA does not provide any means to control how virtual addresses are mapped to physical memory, and allocations are bound to a single GPU.

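As a concrete illustration of the mechanism AMGE builds on, the sketch below (a minimal example of my own, not AMGE code) enables peer-to-peer access between two GPUs and lets a kernel running on GPU 0 dereference, through the unified virtual address space, a buffer that physically resides on GPU 1.

  #include <cstdio>
  #include <cuda_runtime.h>

  __global__ void read_remote(const float *remote, float *local, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) local[i] = remote[i];   // may be a remote (PCIe P2P) load
  }

  int main() {
      int can = 0;
      cudaDeviceCanAccessPeer(&can, /*device=*/0, /*peerDevice=*/1);
      if (!can) { printf("no P2P path between GPU 0 and GPU 1\n"); return 1; }

      float *on_gpu1, *on_gpu0;
      cudaSetDevice(1); cudaMalloc(&on_gpu1, 1024 * sizeof(float));
      cudaSetDevice(0); cudaMalloc(&on_gpu0, 1024 * sizeof(float));
      cudaDeviceEnablePeerAccess(1, 0);  // GPU 0 may now access GPU 1's memory

      // UVAS guarantees on_gpu1 is a valid, unique address from GPU 0 as well.
      read_remote<<<4, 256>>>(on_gpu1, on_gpu0, 1024);
      cudaDeviceSynchronize();
      return 0;
  }

This is exactly the hardware/driver capability that lets AMGE place array partitions on different GPUs while keeping every partition reachable from every kernel.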
2 Figure 2: Overview of AMGE components. The compiler extracts array access pattern information and stores it in the program binary. The runtime system uses this information to decompose and distribute computation and data across the GPUs in the system. In this example, the system is composed of a single CPU and 4 GPUs, connected through a PCIe interconnect. Often time, data structures need to be accessed by both CPU implementation alternatives for this data type. and GPU code and programmers are in charge of keeping sep- A key feature in AMGE is the utilization of remote memory arate copies of the data structures and keeping them coherent accesses between GPUs [3]. On each reference to the array, the through copying between the CPU and the GPU memories. underlying implementation determines whether the element This extra code incurs extra development time and harms being referenced is hosted in the memory local to the GPU maintainability. Some solutions have been proposed to trans- executing the code or on a different GPU. References from a parently manage CPU/GPU data coherence [17, 22]. Recently, GPU to parts of the array stored in different GPU memories CUDA introduced UVM (Unified Virtual Memory), which are handled using remote memory accesses. This approach allows to declare memory allocations that can be accessed by ensures that any computation can always be decomposed and CPU and GPU code, but not concurrently. UVM is based on executed across multiple GPUs regardless the chosen data the ADSM model [17], in which the memory that is shared by distribution configuration. This removes the requirement for CPU and GPUs is acquired/released by the GPU in kernel call the compiler analysis to unequivocally determine the bounds boundaries. OpenCL 2.0 also exposes a Shared Virtual Mem- of the memory range accessed by a computation partition. ory space [24], but it does not allow programmers to specify However, remote accesses can impose performance overheads in which memory data is actually allocated. AMGE builds on and they must be minimized. remote memory accesses, UVAS, and ADSM technologies. On each kernel call, the AMGE runtime transparently de- termines the best computation grid and array decompositions 3. AMGE overview using the access pattern information generated by the compiler, and distributes them across all GPUs in the system. AMGE is a programming framework that decomposes and dis- Memory model: Arrays are decomposed and/or replicated tributes GPU kernels and data to be collaboratively executed before each kernel call. Input arrays can be replicated at on all the GPUs in the system. We implement AMGE using the cost of additional space and data transfer bandwidth con- C++ and CUDA, but it can be extended to other languages. sumption, but replicated output present additional problems. Figure 2 shows the components in AMGE and how they inter- After a kernel call, partial modifications in each copy need to act with the hardware. AMGE aggregates the GPU resources be merged to provide a consistent view of the array, before in the system and presents them as a single virtual GPU. Thus, using it in another kernel or in the host code. Previous so- programmers are relieved from the burden of decomposing lutions [25, 27] transfer all copies back to the CPU memory the problem and explicitly managing several GPUs. for a merge step, which impose a large performance overhead The AMGE compiler is a source-to-source compiler, based in many workloads. 
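To make the data-type discussion concrete, here is a minimal sketch of what a dimensionality-aware array type can look like. It is not the actual AMGE implementation (the real ndarray also encodes the storage order and the distribution across GPUs, see Section 5); it only shows the interface the text relies on: per-dimension sizes plus an indexing operator whose per-dimension indexes remain visible to the compiler.

  #include <cstddef>
  #include <cuda_runtime.h>

  // Sketch only: a 2D, row-major, single-allocation stand-in for AMGE's ndarray.
  template <typename T, unsigned Dims>
  class ndarray {
  public:
      ndarray(size_t d0, size_t d1) {
          static_assert(Dims == 2, "this sketch only covers the 2D case");
          dims_[0] = d0; dims_[1] = d1;
          cudaMallocManaged(&data_, d0 * d1 * sizeof(T));  // placeholder storage
      }
      __host__ __device__ size_t get_dim(unsigned d) const { return dims_[d]; }
      // Indexing keeps (i, j) separate, unlike a hand-flattened a[i * width + j].
      __host__ __device__ T &operator()(size_t i, size_t j) const {
          return data_[i * dims_[1] + j];
      }
  private:
      T *data_;          // in AMGE this storage is split across GPU memories
      size_t dims_[Dims];
  };

Because references go through operator() with one argument per dimension, the compiler can classify the access pattern of each dimension independently, which is impossible once the programmer has flattened the array by hand.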
We avoid this problem by not allowing on the LLVM framework, that analyzes the CUDA kernels in output arrays to be replicated. Instead, output arrays are al- the program to detect their array access patterns and store this ways distributed across GPU memories, and accessed through information in the program executable. We argue that the uti- remote memory accesses if necessary. lization of the array dimensionality information is paramount AMGE implements the ADSM model [17] to allow arrays in order to efficiently exploit multi-GPU systems. However, to be used both by host and GPU code. ADSM assumes a AMGE targets CUDA which is an extension of the C/C++ lan- release consistency model in which allocations belong to the guages that do not provide data types with such information. In CPU code by default and are implicitly acquired/released by C/C++, programmers typically flatten the multi-dimensional the GPU on kernel call boundaries. The runtime transparently arrays into 1D arrays and linearize the dimension indices in transfers arrays between CPU and GPU memories as needed. each reference to the array. It is practically difficult if not infea- 3.1. An example: matrix multiplication sible for static analysis to reliably recover the dimensionality once the accesses have been flattened. AMGE provides a new Code programmed to run on a single GPU requires only mi- data type for multi-dimensional arrays that makes available nor modifications to use AMGE. Listing 1 shows the GPU this information to the compiler. Section 5 discusses different code of a single-precision floating point matrix-matrix mul-

tiplication (i.e., sgemm) computation using AMGE [16]. This code requires A and C matrices to be stored in column major order, and B in row major order. The highlighted text shows the modifications performed to the original code.

1  __global__ void sgemm(ndarray<float, 2, cmo> C, ndarray<float, 2, cmo> A,
2                        ndarray<float, 2> B)
3  {
4    float partial[SGEMM_TILE_N];
5    __shared__ float b_tile_sh[SGEMM_TILE_HEIGHT][SGEMM_TILE_N];
6    for (int i = 0; i < SGEMM_TILE_N; i++) partial[i] = 0.0f;
7
8    int mid = threadIdx.y * blockDim.x + threadIdx.x;
9    int row = blockIdx.x * (SGEMM_TILE_N * SGEMM_TILE_HEIGHT) + mid;
10   int col = blockIdx.y * SGEMM_TILE_N + threadIdx.x;
11
12   for (int i = 0; i < A.get_dim(1); i += SGEMM_TILE_HEIGHT) {
13     b_tile_sh[threadIdx.y][threadIdx.x] = B(i + threadIdx.y, col);
14     __syncthreads();
15     for (int j = 0; j < SGEMM_TILE_HEIGHT; ++j) {
16       float a = A(row, i + j);
17       for (int k = 0; k < SGEMM_TILE_N; ++k)
18         partial[k] += a * b_tile_sh[j][k];
19     }
20     __syncthreads();
21   }
22   for (int i = 0; i < SGEMM_TILE_N; i++)
23     C(row, i + blockIdx.y * SGEMM_TILE_N) = partial[i];
24 }

Listing 1: Multi-GPU matrix-matrix multiplication GPU code using AMGE for C++ and CUDA. cmo means column major order.

The only additional programming requirement for the kernel to be automatically decomposed is the utilization of the array data type (lines 1-2), and its associated indexing routines (lines 12, 13, 16 and 23). The data type is implemented by the ndarray<T, Dims, Storage> C++ class template, where T is the type of the elements, Dims is the number of dimensions of the array, and Storage is an optional parameter that defines the storage type (if not specified, row major order storage is used). The kernel uses a 2D computation grid, in which each thread block computes a 2D tile of C by traversing A and B on their X and Y dimensions, respectively. The compiler detects these patterns and stores them in the program executable.

Listing 2 shows the CPU code of the sgemm computation. First, float input matrices A and B are declared in lines 2 and 3. ndarray objects can be passed both to CPU and GPU routines. The AMGE runtime intercepts the kernel call and uses the information registered by the compiler to decompose the matrices, and distribute both computation and data across all the GPUs in the system. Note that there are no explicit data transfers between host and GPU memories. The runtime transparently detects when data needs to be transferred.

1  // Initialize A and B in the host code
2  ndarray<float, 2, cmo> A;
3  ndarray<float, 2> B;
4
5  read_array("A.dat", A);
6  read_array("B.dat", B);
7
8  ndarray<float, 2, cmo> C(A.get_dim(1), B.get_dim(0));
9  // Computation grid size
10 dim3 block(MATRIXMUL_TILE_N, SGEMM_TILE_HEIGHT);
11 dim3 grid(C.get_dim(1)/(SGEMM_TILE_N * SGEMM_TILE_HEIGHT),
12           C.get_dim(0)/SGEMM_TILE_N);
13 // Kernel launch. A, B and C are used in the GPU code
14 sgemm<<<grid, block>>>(C, A, B);
15 // Write results for C into a file
16 write_array("C.dat", C);

Listing 2: Multi-GPU matrix-matrix multiplication host code using AMGE for C++ and CUDA.

4. Computation and data distribution in AMGE

The AMGE runtime decomposes GPU kernels using thread block granularity. This is because threads within a thread block share resources (e.g., shared memory), and support barrier synchronization operations. This requires all threads within a thread block to be executed in the same compute core of the same GPU. However, the GPU programming model guarantees that there are no data dependences across thread blocks within a kernel and, therefore, they can execute independently.

In CUDA, programmers specify a computation grid that is a multidimensional space gridx × gridy × gridz of thread blocks, similar to the iteration space in loop nests. Each thread block has a unique identifier blocki,j,k, with 0 ≤ i,j,k < gridx,gridy,gridz, within the computation grid. The AMGE runtime decomposes the computation grid so that it can be executed on several GPUs. In the GPU programming model, the iteration space is canonical and rectangular. Thus, dimensions can be uniformly decomposed into partitions. Computation grid decompositions along any of its dimensions (or combinations of them) are supported.

The AMGE runtime tries to place on each GPU most of the data accessed by the computation partition assigned to it, in order to minimize remote memory accesses. A naive approach to minimize remote memory accesses is to replicate all the input arrays in all the GPU memories. This, however, imposes a large memory footprint overhead. AMGE uses compiler analysis to generate array access pattern information for all the GPU kernels in the program, that is used by the runtime component to decide the best computation and array distribution configuration. In the next subsections we describe how this information is generated by the compiler and used by the runtime system to decompose and distribute the arrays.

4.1. Compiler analysis

The AMGE compiler analyzes all array references in the kernel. ndarray references provide a separate index for each of the dimensions of the array, allowing the AMGE compiler to detect the individual access pattern on each dimension. This is in contrast to previous works [25, 27] that treat all arrays as one dimensional. Kim et al. [25] compute the upper and lower memory addresses of the tiles accessed by each computation partition to distribute the arrays. Since each multi-dimensional tile appears like a collection of strided bands when the arrays are viewed as one dimensional, previous single-address-range approaches falsely conclude that the tiles overlap, resulting in unnecessary replication of large portions of the arrays. For output arrays the scenario is even worse, because overlapping regions are merged after each kernel execution. By detecting the access pattern on each dimension, the AMGE runtime can identify multi-dimensional tiles as non-overlapping entities, avoiding the unnecessary replication in prior works.

We consider three access pattern types: (1) as a function of thread block indices, (2) within a thread block (e.g., loops),

4 tile or 1 if all threads access the same element of the array’s dimension (e.g., the same row in a matrix). • m is the index of the non-contiguous array tile accessed by a thread block (can be an induction variable or a constant). We claim that most array-based computations use this ex- pression to map thread block identifiers to the data they access. (a) BLOCK (b) BLOCK-CYCLIC Figure 3: Computation-to-data mapping examples. The most common and simple mapping is found when indi- vidual contiguous array tiles are assigned to contiguous thread and (3) data-dependent. The first type is the most common blocks (m = 0). This pattern classified as BLOCK. and is produced by programmers designating the data to be Another very common mapping is to assign non-contiguous referenced to each thread block. Typically, affine transforma- array tiles to each thread block (m > 0) using a grid-sized tions of the block and thread indices are used to compute the stride. This pattern is classified as BLOCK-CYCLIC. CYCLIC is indices used in the array references. As a result, threads that a special case of BLOCK-CYCLIC, in which the block size is 1 belong to blocks with contiguous identifiers in one dimension but, as we show in Section 5, it allows for a more efficient tend to access elements that are contiguous in the same or a implementation of array distribution. The AMGE runtime different dimension in the array. In the sgemm example (List- determines the size of the block by inspecting the thread block ing 1), blockx is used to access Ay (line 16) and Cy (line 23), size parameter provided by the programmer in the kernel call. while blocky is used to access Bx (line 13) and Cx (line 23). 4.1.2. Runtime information generation The compiler cre- We use the notation Ai to refer to the ith dimension of array A. ates a map that stores, for every dimension of each array This linear relationship allows us to relate thread blocks with reference, a set with the identifiers of the computation grid the portions of the arrays accessed by them. used to access it. It also stores the type of the access (i.e., The second access pattern type is produced when array Read or Write) and the distribution type (i.e., BLOCK, CYCLIC, dimensions are traversed through loop induction variables. BLOCK-CYCLIC) identified in the analysis. Since several array In the sgemm example (Listing 1), Ax is traversed using the reference statements can be found in a kernel, the results of all induction variables i + j of the nested loops (line 16). references to an array are combined into a single map. Merg- The third type of accesses cannot be determined at compile ing the results of two array references involves performing the time since the indices are computed with values that are only intersection of the sets of each dimension. Therefore, if an known at kernel execution time. array is read using different thread block identifiers in different 4.1.1. Distribution type Only array dimensions accessed us- parts of the kernel, the combined results will be an empty set, ing the first pattern type are eligible for decomposition. The which indicates that it must be replicated. compiler analyzes such patterns in order to determine the data 4.1.3. Analysis limitations The algorithm cannot decide how distribution type. AMGE assumes a distribution of the com- to distribute an array when an array dimension is traversed putation grid in which block identifiers are contiguous within using several dimensions of the computation grid. 
However, each partition and, typically, programmers assign contiguous this pattern is rarely used. One example is when a dimen- elements of data to contiguous thread blocks. This would sion is accessed as an n-dimensional space. Using a higher allow for a simple decomposition of arrays into contiguous dimensionality for the array would solve this problem. Data- tiles. However, often time programmers use other computa- dependent access patterns cannot be classified by the analysis, tion-to-data mappings. We propose a novel compiler analysis either. In these cases, the AMGE runtime replicates the arrays. that detects the thread block-to-data mappings used by pro- 4.1.4. Memory consistency model Decomposed kernels grammers and classifies them into the most common data dis- must honor the memory consistency of the GPU program- tribution types: BLOCK, CYCLIC and BLOCK-CYCLIC [19, 26, 10] ming model. Data propagation across thread blocks in the (Figure 3 shows two different mappings for a 2D array). The GPU model is only guaranteed for atomic memory operations computation distribution classification for each dimension of and memory fences. For atomic operations we exploit the the arrays is also communicated to the runtime. hardware support provided by modern system architectures The AMGE compiler attempts to map the index expression (e.g., atomic operations in PCIe 3.0). GPU-wide memory used for each array dimension to the canonical expression fences are translated to system-wide memory fences to ensure t + bs × B + m × bs × G, where: correctness. Finally, since a GPU may cache remote array • B is the thread block identifier (i.e., blockIdx). partitions, GPU caches are flushed at kernel exit boundary. • G gridDim is the number of thread blocks in the grid (i.e., ). 4.2. Runtime data decomposition and distribution • t is a thread identifier or a constant. This value does not determine the access patterns across thread blocks. On each kernel execution, the runtime reads the access pattern • bs indicates the bounds of the array tile being accessed by a information generated by the compiler. If no information thread block. Typically it is a multiple of blockDim if each is found, input arrays are replicated and output arrays are thread in the block accesses one or several elements of the decomposed on their highest-order dimension.

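The two most common thread-block-to-data mappings that the analysis in Section 4.1.1 classifies can be illustrated with the following toy kernels (my own examples, not taken from the paper); both index expressions are instances of the canonical form t + bs × B + m × bs × G.

  // BLOCK: contiguous array tiles go to contiguous thread blocks (m = 0).
  // idx = threadIdx.x + blockDim.x * blockIdx.x
  __global__ void scale_block(float *a, int n, float s) {
      int idx = threadIdx.x + blockDim.x * blockIdx.x;
      if (idx < n) a[idx] *= s;
  }

  // BLOCK-CYCLIC: each block walks the array with a grid-sized stride (m > 0).
  // idx = threadIdx.x + blockDim.x * blockIdx.x + m * blockDim.x * gridDim.x
  __global__ void scale_block_cyclic(float *a, int n, float s) {
      for (int idx = threadIdx.x + blockDim.x * blockIdx.x;
           idx < n;
           idx += blockDim.x * gridDim.x)   // m is the loop induction variable
          a[idx] *= s;
  }

In the first kernel t = threadIdx.x, bs = blockDim.x and m = 0, so the compiler classifies the dimension as BLOCK; in the second, the grid-stride loop makes m a loop induction variable, which is classified as BLOCK-CYCLIC (CYCLIC being the special case bs = 1).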
5 Ay is not decomposed either. In transpose, both blockx and blocky are used to index the two dimensions of A and B matri- ces and both can be decomposed. Dimensions of C in sgemm and B in transpose are accessed using identifiers of different dimensions of the computation grid (e.g., blockx is used to access dimension Y). Thus, in an XY computation distribu- tion configuration, neighboring tiles in the X dimension of the array are distributed across the GPUs in gpuy. Arrays that are not decomposed are replicated in all GPU memories. Moreover, tiles from arrays that are not indexed using the identifiers of all the decomposed dimensions of the Figure 4: Data accessed by each computation partition for dif- computation grid, are replicated in the memories of the GPUs ferent computation decompositions in a 4-GPU matrix-matrix that belong to the GPU grid dimensions on which the unused multiplication (sgemm) and matrix transposition (transpose). computation grid dimensions are distributed. Sticking to the sgemm example, the tiles in C can be directly mapped on the A, B, C are the arrays used in the kernels, Pi, j is a partition of GPU grid, but the array distribution configurations for A and the computation grid, Gi, j is a GPU in the GPU grid. B depend on the computation distribution configuration. For The runtime evaluates the characteristics of all possible 1D configurations (X and Y), either A or B are fully replicated distribution configurations and ranks them using a run-time in all GPUs. For XY, the tiles in A are distributed across the selectable policy. Since the computation grid is limited to 3 GPUs in gpux and each tile is replicated in all GPUs in gpuy. dimensions, the number of cases to be evaluated is small. The Conversely, the tiles in B are distributed across the GPUs in highest ranked configuration is used. gpuy and replicated in the GPUs in gpux. 4.2.1. Computation grid distribution The AMGE runtime system lays out the GPUs in the system on a grid with as many 5. Implementation details dimensions as decomposed dimensions in the computation 5.1. Array data type grid. Then, the computation grid is decomposed into as many partitions as the number of GPUs in each dimension of the We utilize the UVAS support provided by the hardware to GPU grid. Each partition of the computation grid is assigned place different parts of the array in different GPU memories to a GPU. The mapping of the computation grid to the GPU while having a continuous representation of the array in the grid is called computation distribution configuration in the virtual address space. Hence, decomposed arrays can be refer- paper. enced by using regular linearization operations on the indexes: n n (a1,··· ,an) → ∑ ai × ∏ D j where ai is the index and Di the number 4.2.2. Array distribution The runtime system uses the ar- i=1 j=i+1 ray access pattern information generated by the compiler to of elements in the ith dimension of the array. Indexes are determine how arrays must be decomposed for a specific com- ordered from the highest-order to the lowest-order dimension. putation distribution configuration, Arrays are decomposed Current versions of CUDA impose a 1 MB (instead of page- along those dimensions that are accessed using identifiers of size) granularity to allocate contiguous virtual memory ranges the computation grid whose dimensions are also decomposed. on different GPUs. 
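Written out (this is only the linearization expression from Section 5.1 restated for readability, not new material), the index computation for a non-decomposed, contiguously stored ndarray is

  \mathrm{offset}(a_1,\dots,a_n) \;=\; \sum_{i=1}^{n} \Bigl( a_i \times \prod_{j=i+1}^{n} D_j \Bigr)

where a_i is the index and D_i the number of elements in the i-th dimension. For example, for a 3D array of size D_1 × D_2 × D_3, element (a_1, a_2, a_3) lives at offset a_1 D_2 D_3 + a_2 D_3 + a_3.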
Decompositions of the highest-order di- Once we determine the dimensions of the array to be decom- mension of the array are implemented by allocating a single posed for a specific computation distribution configuration, contiguous region of memory (multiple of 1 MB) for each the array is decomposed into an n-dimensional grid of tiles, partition on each GPU. Decompositions of dimensions that where n is the number of decomposed dimensions. The size are contiguous in memory are implemented by alternatively and number of tiles depends on how the computation grid is allocating 1 MB chunks on different GPUs, as needed. We mapped to the physical GPUs according to the access patterns. refer to this scheme as VM implementation. If blocki is used to index L j, and gridi is mapped on the k Nevertheless, the coarse memory allocation granularity can dimension of the GPU grid, L j is decomposed into as many produce data distribution imbalance if partitions cannot be tiles as GPUs in gpuk and distributed among them. Figure 4 stored in balanced-sized groups of 1 MB chunks, which results shows the relationship between the computation grid, the GPU in an increased number of remote memory accesses. One grid and the array decomposition grid in different computation solution to reduce the imbalance is to add padding to the distribution configurations for sgemm and transpose. In the lower order dimensions that are not decomposed. However, sgemm example, Cx, Cy, Ay and Bx can be decomposed, as they achieving perfect balancing using 1 MB allocation chunks are accessed using thread block identifiers. For a configura- can impose a footprint overhead in the order of hundreds of tion that decomposes the computation grid on its Y dimension times. Hence, this implementation is nonviable to partition (gridy, second row of Figure 4), Cx and Bx are decomposed. arrays’ lowest-order dimensions, whose elements are stored Since blockx is used to index Ay but gridx is not decomposed, contiguously in memory. Another solution is to permute the

6 dimensions of the array so that decomposed dimensions are Generated code is compiled into the program executable. not stored contiguously in memory. However, this can break 5.2. Run-time distribution configuration selection memory coalescing [34]. Therefore, we also provide an alternate implementation pro- As explained in Section 4.2, on a kernel call, the runtime sys- posed in [7] that reshapes the arrays. In this implementation, tem selects the best distribution configuration. Our prototype each GPU contains a memory allocation that hold all the ele- implements a policy that (1) minimizes the number of remote ments in a partition, and it is padded to the next page boundary. accesses, and (2) favors the array implementation that imposes The array is reorganized by adding a new dimension for each the least overhead. On one hand, the reshaped array imple- decomposed one (i.e., strip-mining), that indicates the GPU in mentation performs additional operations on the indices that which each partition is stored. Thus, in each array reference, impose a performance overhead. On the other hand, the VM the original indexes are transformed into a new set of indexes implementation may introduce unnecessary remote accesses for all the dimensions of the reshaped arrays. This approach due to data distribution imbalance. Therefore, the VM imple- allows arrays to be decomposed on any dimension as they do mentation is preferred unless it introduces too many remote not have to be stored contiguously in the virtual address space. memory accesses. Thus, our policy analytically computes the However, this flexibility comes at the cost of extra computa- data distribution imbalance introduced by the coarse alloca- tion. For example, if a 3D volume is decomposed along its tion granularity in the VM implementation, and uses it to rank highest-order dimension using a BLOCK distribution, the index every array distribution configuration. If this imbalance ex- for this dimension a1 is transformed as follows: ceeds a threshold (in our case 5%), reshaped is chosen. When   several configurations have the same score, decompositions on a1 0 (a1,a2,a3) → ( 0 ,a1 mod D1,a2,a3) D1 the highest-order dimension are preferred because they allow l D m where D0 = 1 and P is the number of tiles in the first dimen- 1 P1 1 for more efficient CPU↔GPU data transfers. An analysis of sion of the array’s decomposition grid. Thus, the operations different policies is out of the scope of the paper. needed to compute the location of the element a1,a2,a3, are di- Since computation partitions are executed independently, vided into: the computations of the offset of the block in the CUDA assigns new thread block identifiers in each invocation. dimension being decomposed, and the linearization of the in- In order to retain the original identifiers, we store the offsets dex within the block. Therefore, an extra division and modulo of each computation partition in the memory of each GPU. A operations are performed in each access to the array, com- preprocessor macro overrides blockIdx and uses these offsets pared with the regular index linearization. For CYCLIC and to compute the original indexes. BLOCK-CYCLIC, similar transformations are performed. 5.2.1. CPU/GPU array coherence The ADSM model is im- Providing a generic indexing routine that supports all possi- plemented like in the GMAC library presented in [17]. 
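A device-side sketch of the BLOCK index transformation used by the reshape implementation is shown below. It assumes a 3D array decomposed only along its highest-order dimension, with D1p = ⌈D1/P1⌉ elements per partition; the function name and the per-partition base-pointer table are illustrative, not the actual AMGE indexing routine.

  // Reshaped BLOCK decomposition of dimension 0 across P1 partitions (one per GPU).
  // base[p] points to the padded allocation that holds partition p.
  __device__ __forceinline__
  float &reshaped_ref(float *const *base, size_t a1, size_t a2, size_t a3,
                      size_t D1p, size_t D2, size_t D3) {
      size_t part  = a1 / D1p;   // extra division: which partition holds the element
      size_t local = a1 % D1p;   // extra modulo: index inside that partition
      return base[part][(local * D2 + a2) * D3 + a3];
  }

The extra division and modulo per reference are exactly the indexing overhead discussed above, which is why the prototype generates specialized kernel versions instead of one generic routine.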
The ble decomposition can impose an unacceptable performance ndarray type keeps two copies of the array, one in host mem- overhead due to the extra operations needed to transform all ory and the second in the GPU memories. The host copy of the indexes. In order to ensure maximum performance, our the array is laid out as a regular C/C++ multidimensional array prototype provides different implementations of the indexing to ensure compatibility with third-party libraries. routines optimized for the different array decompositions and distribution types. However, using specialized indexing rou- 6. Experimental methodology tines for each decomposition requires changes in the kernels, 6.1. Hardware setup as (1) the kernel code must explicitly call the proper versions of the routines, and (2) the array decomposition to be used is All experiments were run on a system containing a quad-core not known at compile time. We implement a compiler pass Intel i7-3820 at 3.6 GHz with 64 GB of DDR3 RAM memory, that generates different kernel versions for different array de- and 4 NVIDIA Tesla K40 GPU cards with 12 GB of GDDR5 compositions of the arrays used in the kernel. In each version, each, connected through a PCIe 3.0 in x16 mode (containing array references use the indexing routines optimized for the two PCIe bridges like in Figure 1). The machine runs a GNU/ decomposition. On each kernel execution, the runtime system Linux system, with Linux kernel 3.12 and NVIDIA driver selects the data distribution to be used for all the arrays and 340.24. Benchmarks were compiled using GCC 4.8.3 and invokes the specialized kernel version for that distribution. LLVM 3.4 for CPU code and NVIDIA CUDA compiler 6.5 Our modified toolchain implements two new compilation for GPU code. Execution times were measured using the passes using the LLVM framework. The first pass performs CUPTI profiling library that provides support for sampling the array access pattern analysis introduced in Section 4.1 on and nanosecond timing accuracy. For runs with more than one the LLVM IR generated by the CUDA compiler, and gener- GPU, graphs show the time for the slowest GPU. ates the host code needed to communicate the information 6.2. Benchmarks to the runtime system. The second pass generates special- ized kernel versions for the different data decompositions and We evaluate AMGE using a number of dense scientific compu- the host code needed to select the proper version at run-time. tations that use different computation and array access patterns.

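The blockIdx-override trick described in Section 5.2 can be sketched as follows; the symbol names are hypothetical and the way the offsets reach the GPU may differ in the actual AMGE runtime.

  // Per-GPU constant with the origin of the computation-grid partition assigned
  // to this device; the runtime writes it (e.g., with cudaMemcpyToSymbol)
  // before each kernel launch.
  __constant__ uint3 amge_block_offset;

  // Defined before the macro below, so "blockIdx" here still names the built-in.
  __device__ __forceinline__ uint3 amge_block() {
      return make_uint3(blockIdx.x + amge_block_offset.x,
                        blockIdx.y + amge_block_offset.y,
                        blockIdx.z + amge_block_offset.z);
  }

  // From here on, unmodified kernel code transparently sees the thread block
  // identifiers it would have had in the original, undecomposed grid.
  #define blockIdx amge_block()

Before launching the partition assigned to GPU g, the runtime would copy that partition's origin into amge_block_offset on that device, so the same kernel source works unchanged for any computation decomposition.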
7 #Reg #Reg #Reg #Reg Array Array Kernel Suite Inputset Orig VM Resh B/C Resh B-C Decompositions Distribution

A,B: 16K×16K 19 22 X→A,B(*,BLOCK) C(*,*) A,B:{Dx} C:{Rx} convolution2D Parboil 2 17 19 19 22 Y→A,B(BLOCK,*) C(*,*) A,B:{Dx} C:{Rx} C: 9×9 19 22 XY→A,B(BLOCK,BLOCK) C(*,*) A,B:{Dx,y} C:{Rx,y} fft1D Parboil A,B: 256M 22 24 24 24 X→A(*) B(BLOCK) A:{Rx} B:{Dx} reduction SDK A,B: 256M 12 12 11 18 X→A,B(BLOCK) A,B:{Dx} saxpy - A,B: 256M 8 8 8 17 X→A,B(BLOCK) A,B:{Dx} 41 41 X→A,C(*,BLOCK) B(*,*) A,C:{Dx} B:{Rx} sgemm Parboil 2 A,B,C: 4K×4K 38 33 40 40 Y→A(*,*) B(*,BLOCK) C(BLOCK,*) A:{Rx} B:{Dx} C:{Dx} 42 42 XY→A,B(*,BLOCK) C(BLOCK,BLOCK) A:{Dx,Ry} B:{Dy,Rx} C:{Dx,y} sgemv - A: 8K×8K - B,C: 8K 11 11 11 16 X→A(BLOCK,*) B(*) C(BLOCK) A:{Dx} B:{Rx} C:{Dx} sort_merge_global 17 16 18 28 X→Ak,Av,Bk,Bv(BLOCK) Ak,Av,Bk,Bv:{Dx} sort_merge_shared SDK Ak,Av,Bk,Bv: 256M 17 16 18 27 X→Ak,Av,Bk,Bv(BLOCK) Ak,Av,Bk,Bv:{Dx} sort_shared 17 18 19 28 X→Ak,Av,Bk,Bv(BLOCK) Ak,Av,Bk,Bv:{Dx} A,B: 16K×16K 17 23 X→A,B(*,BLOCK) A,B:{Dx} stencil2D - + 17 18 17 19 Y→A,B(BLOCK,*) A,B:{Dx} halos 19 26 XY→A,B(BLOCK,BLOCK) A,B:{Dx,y} A,B: 1K×1K×512 30 34 X→A,B(*,*,BLOCK) A,B:{Dx} stencil3D Parboil 2 + 24 26 29 31 Y→A,B(*,BLOCK,*) A,B:{Dx} halos 33 41 XY→A,B(*,BLOCK,BLOCK) A,B:{Dx,y} 17 24 X→A(*,BLOCK) B(BLOCK,*) A,B:{Dx} transpose - A,B: 16K×16K 16 14 16 23 Y→A(BLOCK,*) B(*,BLOCK) A,B:{Dx} 18 26 XY→A,B(BLOCK,BLOCK) A:{Dx,y} B:{Dy,x} vecadd - A,B,C: 256M 10 10 10 18 X→A,B,C(BLOCK) A,B,C:{Dx} Table 1: Benchmark description.

The list of benchmarks is summarized in Table 1. Some of Slowdown (BLOCK and CYCLIC) #Inst. (BLOCK and CYCLIC) Slowdown (BLOCK-CYCLIC) #Inst. (BLOCK-CYCLIC) them are found in the Parboil benchmark suite [20], some in 4.5 14 the NVIDIA SDK, and the rest have been developed in-house. 4.0 12 These benchmarks are selected to provide a good variety of 3.5 10 access patterns and thus challenges. Both CPU and GPU codes 3.0 8 2.5 6 have been modified to use the ndarray data type instead of 2.0 4 Slowdown (x) 1.5 the flat 1D arrays which are commonly used. The benchmarks 2 1.0 0 have been compiled using our toolchain and linked with our Overhead in #instructions X Y XY X X X X Y XY X X X X X Y XY X Y XY X Y XY X runtime system. The ndarray implementation has an impact fft1D saxpy sgemv reduction sgemm stencil2D stencil3D vecadd sort_shared transpose on the register usage count of the kernels (columns 4-7). In convolution2D sort_merge_globalsort_merge_shared the column titles, “Orig” stands for original, “VM” for vir- Figure 5: Grey bars show the slowdown imposed by the index- tual memory and “Resh” for reshape, while “B” stands for ing routines for the reshape array implementation compared BLOCK, “C” for CYCLIC and “B-C” for BLOCK-CYCLIC. BLOCK to the baseline (left axis). Lines indicate the increase in num- and CYCLIC are in the same column (“#Reg Resh B/C”) as they ber of executed instructions (right axis). use the same number of registers. The array decompositions (column 8) and the array distri- plementation only performs the index linearization and, there- bution (column 9) for the different computation distribution fore, the register count is similar to, or even lower than in some configurations suggested by the compiler. A, B, C are the kernels like sort_merge_* and transpose, the original ver- names of the arrays used in the computation. In the case of sion of the benchmarks. reshape, on the other hand, uses more the kernels in merge sort a suffix has been added for keys registers in most of the kernels, especially in BLOCK-CYCLIC (Ak, Bk) and values (Av, Bv). In the last column “D” stands decompositions and those configurations in which arrays are for distribution and “R” for replication in the GPU grid. X decomposed along several dimensions. and Y make reference to the decomposed dimensions of the Figure 5 shows the overheads imposed by the indexing rou- computation grid. Array decompositions are shown using the tines for the reshape implementation on a single GPU. While notation in HPF [26], but dimensions are ordered (left to right) the compiler suggests the utilization of the BLOCK distribu- from highest to lowest order as they are stored in memory. For tion type in all kernels, we study the overhead of the routines example, the sgemv configuration says that for a computation for all the distribution types. Grey bars show the slowdown decomposition on X, the A matrix is decomposed on its Y imposed by the indexing overhead of the data distributions dimension and the C vector is decomposed on its X dimension. for each computation distribution configuration (left axis). Tiles in A and C are distributed across the GPUs in the X Lines represent the increase in number of executed instruc- dimension of the GPU grid while B vector is replicated. tions due to the extra operations performed on the indexes (right axis). BLOCK and CYCLIC are grouped (dark gray bar and 7. Performance evaluation solid line) as they perform very similar transformations on 7.1. 
Indexing overhead the indexes and the performance is virtually the same (±1%). BLOCK-CYCLIC (light Gray bar and dashed line) is consistently Table 1 shows that register utilization greatly varies depending the slowest implementation (up to 4.48×) and the one that on the used ndarray distribution implementation. The VM im- executes more instructions, too. This is caused both by the

Figure 6: Speedup over baseline for different computation decomposition configurations using reshape and VM implementations. Arrows point to the configuration chosen by the runtime system for each kernel. Results shown for 2/4 GPUs.

extra executed instructions and the lower achieved occupancy in the GPU due to the increased number of registers. Slowdowns are large for kernels in which thread blocks perform little work (convolution2D, reduction, saxpy, sort_merge_*, stencil2D, transpose and vecadd). Results for the VM implementation are not shown because performance is within ±5% of the baseline in all kernels.

7.2. Multi-GPU performance

Figure 6 shows the speedup achieved by AMGE on our multi-GPU system for all possible distribution configurations. Results are shown for the VM and reshape implementations of the ndarray data type, and for an ideal implementation with optimal data distribution (no remote accesses). Bars labeled with "Impl" show the geometric mean for the three implementations of the speedups achieved in each kernel by the best computation distribution configuration. The reshape implementation exhibits linear speedups (1.91× and 3.54× on average for 2 and 4 GPUs, respectively) for all kinds of distribution configurations in most kernels. The main exceptions are the X and XY configurations in stencil3D, due to remote memory accesses, and saxpy, vecadd, sort_merge_global and the XY configuration in stencil2D, due to the overhead of the indexing function. The VM implementation outperforms reshape in some configurations in which the array partitions are large enough not to suffer from imbalance due to the memory allocation granularity imposed by CUDA, but performs very poorly in the other configurations, producing lower speedups on average (1.02× and 1.27× for 2 and 4 GPUs). For example, in transpose at least one of the matrices must be decomposed along its X dimension, which is contiguous in memory, thus leading to an imbalanced data distribution. Therefore, performance is poor for VM in all configurations in transpose. Another example is stencil3D, for which the Y distribution configuration should provide reasonable performance since each plane of the volume can be distributed across GPUs. Nevertheless, the size of each plane still produces an imbalanced distribution. We study this example in more detail in Section 7.3.

AMGE's runtime system, on each kernel call, tries to choose the best computation and data distribution configuration, as discussed in Section 5.2. Figure 6 highlights with an arrow the configurations chosen by the implemented selection policy. The policy correctly selects the best performing configuration for most of the kernels. The average performance across all the benchmarks (i.e., bars labeled with "AMGE") is 1.98× and 3.89× for 2 and 4 GPUs, respectively, very close to ideal.

Figure 7: Memory requests served by remote GPUs.

7.3. Impact of remote accesses on performance

Figure 7 shows the percentage of accesses to RAM memory that are served by remote GPUs in all the kernels when they are distributed across 4 GPUs. reshape eliminates the need for remote accesses in most of the configurations. Only kernels in which computation partitions share some data (e.g., convolution2D, stencil{2,3}D) use them. The worst cases are the X and XY decompositions for stencil3D, in which 17.43% and 7.61% of accesses to memory are remote, respectively. This is the reason why these configurations show poor speedups in Figure 6. VM introduces a lot of remote accesses in many configurations due to the memory allocation granularity in CUDA.

Fighting data distribution imbalance in VM: The dimensions of the arrays in the stencil2D and stencil3D kernels make it difficult to evenly split them using 1 MB granularity, causing imbalance and, therefore, remote memory accesses. The computation distribution configuration that we study is Y. Using this configuration, the volumes of stencil3D are distributed by allocating partitions of each plane alternatively in different GPUs. The size of each plane (1K×1K + halos) produces an imbalanced distribution (2/1/1/1 MB for 4 GPUs). This results in excessive communication that limits the performance (0.87× and 0.91× for 2 and 4 GPUs). Adding padding to the X dimension of the volume to obtain a balanced distribution results in a 127.01× memory footprint increment. Having a 4 KB granularity would reduce this overhead to 1.49×. Using more friendly problem sizes that do not produce imbalance results in much improved performance, reaching linear speedups of 2.08× and 3.95× for 2 and 4 GPUs.

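A back-of-the-envelope check of the 2/1/1/1 MB split quoted above, assuming single-precision elements and a one-element halo on each side of the plane (the exact halo width is not spelled out in the paper):

  #include <cstdio>
  #include <cstddef>

  int main() {
      const size_t X = 1024 + 2, Y = 1024 + 2;           // 1K×1K plane + assumed halos
      const size_t plane_bytes = X * Y * sizeof(float);  // ~4.02 MB
      const size_t CHUNK = 1 << 20;                      // 1 MB CUDA mapping granularity
      const int GPUS = 4;

      size_t chunks = (plane_bytes + CHUNK - 1) / CHUNK; // 5 chunks of 1 MB
      // Round-robin chunk placement: GPU 0 gets 2 chunks, GPUs 1-3 get 1 each, so
      // GPU 0 also holds elements that other GPUs' thread blocks need remotely.
      for (int g = 0; g < GPUS; ++g)
          printf("GPU %d: %zu MB\n", g, chunks / GPUS + (g < (int)(chunks % GPUS)));
      return 0;
  }

With a 4 KB granularity the same plane would split into roughly a thousand chunks, making the per-GPU shares almost equal, which is the point made about finer mapping granularities.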
Benchmark       Conf   Kim [25]        Lee [27]        AMGE
convolution2D   X      38.7K×38.7K     38.7K×38.7K     77.4K×77.4K
                Y      38.7K×38.7K     38.7K×38.7K     77.4K×77.4K
                XY     38.7K×38.7K     38.7K×38.7K     77.4K×77.4K
fft1D           X      750M            750M            3G
reduction       X      2.99G           2.99G           11.99G
saxpy           X      6G              6G              6G
sgemm           X      31.6K×31.6K     31.6K×31.6K     44.7K×44.7K
                Y      44.7K×44.7K     31.6K×31.6K     44.7K×44.7K
                XY     38.7K×38.7K     31.6K×31.6K     48.9K×48.9K
sgemv           X      77.4K×77.4K     38.7K×38.7K     77.4K×77.4K
sort            X      750M            750M            3G
stencil2D       X      38.7K×38.7K     38.7K×38.7K     77.4K×77.4K
                Y      48.9K×48.9K     38.7K×38.7K     77.4K×77.4K
                XY     44.7K×44.7K     38.7K×38.7K     77.4K×77.4K
stencil3D       X      1.1K³           1.1K³           1.8K³
                Y      1.3K³           1.1K³           1.8K³
                XY     1.2K³           1.1K³           1.8K³
transpose       X      48.9K×48.9K     38.7K×38.7K     77.4K×77.4K
                Y      48.9K×48.9K     38.7K×38.7K     77.4K×77.4K
                XY     54.7K×54.7K     38.7K×38.7K     77.4K×77.4K
vecadd          X      4G              4G              4G

Table 2: Maximum problem size for a 4-GPU system in AMGE and in the related work.

Figure 8: Execution timeline of stencil2D for 4 GPUs: (a) imbalanced distribution; (b) balanced distribution; (c) balanced distribution + transposed thread block scheduling. Each panel plots remote read/write bandwidth (MB/s, left axis) and IPC (right axis) over the concatenated execution timelines of GPUs 0-3.
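The transposed thread block scheduling used for Figure 8c, described in Section 7.3 below, can be emulated with a few lines of kernel-side remapping (a sketch that assumes a 2D grid; the paper transposes the mapping of thread block identifiers, not the hardware scheduler itself):

  // Keep the hardware issue order, but swap the roles of the two grid axes when
  // computing the logical block identifier used for data indexing. The host
  // must launch with the grid dimensions swapped as well, e.g. dim3 grid(gy, gx).
  __device__ __forceinline__ uint2 logical_block_id() {
      return make_uint2(blockIdx.y, blockIdx.x);  // logical (x, y) = (hw.y, hw.x)
  }

Kernels then derive their tile coordinates from logical_block_id() instead of blockIdx, so blocks that are issued consecutively map to tiles that are consecutive in Y rather than X, spreading remote accesses over the whole execution.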
Figure 9: Overhead of the coherence mechanisms in AMGE and in the related work [25] (bars: Kim CPU-GPU transfer, Kim merge step, AMGE CPU-GPU transfer, AMGE remote access; time in milliseconds).

The larger size (16K×16K + halos) of the plane in stencil2D allows for a more balanced distribution of data (65/64/64/64 MB for 4 GPUs). However, there are still some effects on the performance. Figure 8a shows the memory bandwidth consumption due to remote loads and stores (left axis), and the number of instructions per cycle (right axis) during the execution of the stencil2D kernel for each GPU in the system. For the sake of clarity, we concatenate the execution timelines of the four GPUs, although they execute in parallel. GPU 0 does not perform any remote memory access and the IPC (instructions per cycle) remains stable during the kernel execution. GPUs 1, 2 and 3 perform remote memory accesses to the previous GPU memories (note that the imbalance increases with the GPU identifier). Remote accesses degrade the performance of the GPU as reflected in the lower IPC. Using padding to obtain a fully balanced array distribution increases the memory footprint by 7.99×. We reduce the imbalance by offsetting the beginning of the array so that GPUs 0 and 3, and GPUs 1 and 2 perform the same amount of remote memory accesses (Figure 8b). The improved balancing lowers the remote memory bandwidth consumption (2.5 GBps vs 4 GBps), and the period in which remote accesses are performed is shorter. This results in an 8.2% execution speedup on the slowest GPU.

Reducing instantaneous bandwidth demands: In stencil2D, memory accesses concentrate at the beginning/end of the kernel execution because the default thread block scheduler in the GPU issues thread blocks that are contiguous in the X dimension in order. Thus, thread blocks that access the boundaries of the matrix partitions tend to execute concurrently, increasing the instantaneous bandwidth demands and reducing the achieved IPC. We evaluate the performance of a thread block scheduler that issues thread blocks that are contiguous in the Y dimension instead. We emulate it by transposing the mapping of thread block identifiers on the matrices. Figure 8c shows that, using this scheduler, remote accesses are distributed throughout the whole kernel execution, thus reducing the instantaneous bandwidth demands (200 MBps). Now, the IPC is not affected since the cost of remote accesses is hidden with the execution of other thread blocks that only perform local accesses, reducing execution time by 5.8%.

7.4. Comparison with previous works

Memory footprint overhead. AMGE performs much more space-efficient data decompositions than previous works. We quantify the benefits of AMGE over Kim et al. [25] and Lee et al. [27] by comparing the maximum problem size that can be executed by the three solutions on our 4-GPU system. Table 2 shows that AMGE is able to run bigger problem sizes than previous works for most benchmarks, especially on those that work on multi-dimensional arrays, satisfying one of the major motivations for using multiple GPUs.

Coherence overhead: AMGE ensures coherence by not replicating output arrays and using remote memory accesses when needed. The related work relies on replication and a merge step after kernel execution. Besides, in both solutions, data needs to be copied from CPU to GPU memories before

10 smaller regions of the arrays to be resident in each GPU mem- Language/Library-based transparent multi-GPU exe- ory. The overhead of the merge step is even larger compared cution: GlobalArrays [29] (GA) is a library that allows exe- with the cost of remote accesses. The most extreme case is the cution of computations across a distributed system using an sort benchmark, since it executes kernels iteratively and the array-like interface, and GA-GPU [36] extends GA to GPUs. merge step needs to be performed after each kernel call. The memory model in GA-GPU allows memory accesses to be ordered and, hence, it does not fit into bulk synchronous SPMD 8. Related Work programming models such as CUDA or OpenCL. Therefore, Program auto-parallelization in shared-memory NUMA GA-GPU recommends the utilization of GA data-parallel prim- systems: High Performance Fortran[26] provides primitives to itives (at the cost of lower performance due to the overhead distribute data and implements the owner-computes rule [12], of launching one kernel for each of the primitive operations). that schedules loop iterations in such a way that communica- X10 [14] and Habanero [13] present the programmer with a tion is minimized. AMGE relieves programmers from speci- single partitioned global address space. The compiler and fying the distribution of data. Performance degradation due the runtime system transparently redirect remote memory ac- to remote access is much larger in GPUs than in CPU NUMA cesses to the proper memory. Sequoia [15] tries to address systems, and bad programmer choices might lead to large slow- the problem of programming systems with different memory downs. Therefore, providing a system that accomplishes this topologies/hierarchies. Programs are composed of two parts: without programmer intervention is key for GPUs. Other pro- (a) an algorithmic representation of the computation using a posals exploit architectural mechanisms to implement dynamic C-like programming language that decomposes data structures memory distribution policies (e.g., first touch placement) or and defines how to map the computation on them; and (b) a data migration [11, 30, 28]. Nevertheless, current GPUs do mapping of the algorithm to the specific system using a declar- not provide the necessary mechanisms (e.g., user-managed ative language. PGAS languages require a complete rewrite of memory protection or the appropriate performance counters) the program, while AMGE requires minor code modifications. to implement these proposals. We rely on compiler analysis to MAGMA [38] and some other libraries take advantage of minimize inter-GPU communication. multi-GPU execution but only for a limited set of functions. Compiler-based transparent multi-GPU execution: Arrays in GPUs: Thrust [9] is a C++ library that provides a Kim et al. [25] introduce an OpenCL framework that com- 1D array container (vector) and a number of pre-defined algo- bines multiple GPUs and treats them as a single compute rithms and map/reduce primitives. ArrayFire [6] is a C/C++/ device. In order to split data, they compute the array ranges Fortran library that provides abstractions for multidimensional accessed by computation partitions by performing sampling arrays and a number of libraries that use them (e.g., data anal- runs of the kernels on the CPU. The runtime system chooses ysis, linear algebra, image and signal processing). However, the computation decomposition that minimizes the size of the arrays cannot be used in custom functions. 
Compiler-based transparent multi-GPU execution: Kim et al. [25] introduce an OpenCL framework that combines multiple GPUs and treats them as a single compute device. In order to split data, they compute the array ranges accessed by each computation partition by performing sampling runs of the kernels on the CPU. The runtime system then chooses the computation decomposition that minimizes the size of the data transfers between CPU and GPU memories. However, this only works for kernels in which array references are affine functions of the thread and thread block identifiers; otherwise, they fall back to replication. Even in cases where data can be decomposed, any decomposition not performed on the array's highest-order dimension produces tiles whose memory address ranges overlap, thus replicating large portions of the array in all memories. Array regions that are potentially modified by different computation partitions need to be merged after kernel execution. Lee et al. [27] extend the same idea to heterogeneous systems with CPUs and GPUs. They do not use sampling runs on the CPU and generate a merge kernel that is more efficient than that of Kim et al., although both solutions require the merge step to be executed on the CPU, thus increasing the CPU↔GPU traffic. AMGE uses a similar approach, but its compiler analysis exploits the array dimensionality information provided by the ndarray type. Thanks to this information and the use of remote memory accesses, AMGE avoids replication in most cases, enabling bigger problem sizes and minimizing unnecessary CPU↔GPU communication. Moreover, AMGE adds support for cyclic data distributions and exploits the hardware support required by codes that use atomic operations and global memory fences.
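The replication caused by non-highest-order decompositions can be seen with a small host-side example (purely illustrative, not AMGE code): for a row-major 2-D array, a split along the highest-order (row) dimension yields disjoint linear address ranges, whereas a column-wise split yields ranges that cover most of the array and therefore force range-based schemes to replicate it.

#include <cstdio>

struct Range { long first, last; };   // inclusive element offsets, row-major layout

Range tile_range(long cols, long r0, long r1, long c0, long c1) {
    return { r0 * cols + c0, (r1 - 1) * cols + (c1 - 1) };
}

int main() {
    long rows = 4, cols = 8;
    // Split along the highest-order dimension (rows): disjoint ranges.
    Range top = tile_range(cols, 0, 2, 0, cols);
    Range bot = tile_range(cols, 2, 4, 0, cols);
    // Split along the lowest-order dimension (columns): overlapping ranges.
    Range left  = tile_range(cols, 0, rows, 0, 4);
    Range right = tile_range(cols, 0, rows, 4, cols);
    std::printf("row split:    [%ld,%ld] [%ld,%ld]\n", top.first, top.last, bot.first, bot.last);
    std::printf("column split: [%ld,%ld] [%ld,%ld]\n", left.first, left.last, right.first, right.last);
    return 0;
}

With rows = 4 and cols = 8, the row-wise tiles map to elements [0,15] and [16,31], while the column-wise tiles map to [0,27] and [4,31], overlapping on most of the array.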

Language/Library-based transparent multi-GPU execution: GlobalArrays [29] (GA) is a library that allows execution of computations across a distributed system using an array-like interface, and GA-GPU [36] extends GA to GPUs. The memory model in GA-GPU allows memory accesses to be ordered and, hence, does not fit into bulk-synchronous SPMD programming models such as CUDA or OpenCL. Therefore, GA-GPU recommends the use of GA data-parallel primitives, at the cost of lower performance due to the overhead of launching one kernel for each primitive operation. X10 [14] and Habanero [13] present the programmer with a single partitioned global address space; the compiler and the runtime system transparently redirect remote memory accesses to the proper memory. Sequoia [15] addresses the problem of programming systems with different memory topologies and hierarchies. Programs are composed of two parts: (a) an algorithmic representation of the computation, written in a C-like programming language, that decomposes data structures and defines how to map the computation onto them; and (b) a mapping of the algorithm to the specific system, written in a declarative language. PGAS languages require a complete rewrite of the program, while AMGE requires only minor code modifications. MAGMA [38] and some other libraries take advantage of multi-GPU execution, but only for a limited set of functions.

Arrays in GPUs: Thrust [9] is a C++ library that provides a 1D array container (vector) and a number of pre-defined algorithms and map/reduce primitives. ArrayFire [6] is a C/C++/Fortran library that provides abstractions for multidimensional arrays and a number of libraries that use them (e.g., data analysis, linear algebra, image and signal processing); however, these arrays cannot be used in custom functions. Microsoft offers multidimensional arrays in the C++ AMP [2] programming model; they can be freely used in all kinds of computations and are accessed using the regular subscript notation. However, none of these solutions supports automatic multi-GPU execution and, therefore, all of them could benefit from AMGE.
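For reference, a minimal example of the Thrust 1D container and map/reduce primitives mentioned above (standard Thrust usage compiled with nvcc; this snippet is illustrative and not part of AMGE):

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <cstdio>

int main() {
    thrust::device_vector<float> x(1 << 20, 1.0f);   // 1D array container on the GPU
    thrust::device_vector<float> y(1 << 20, 2.0f);
    thrust::device_vector<float> z(1 << 20);

    // Map: element-wise sum, executed on a single GPU.
    thrust::transform(x.begin(), x.end(), y.begin(), z.begin(),
                      thrust::plus<float>());
    // Reduce: sum of all elements.
    float total = thrust::reduce(z.begin(), z.end(), 0.0f,
                                 thrust::plus<float>());
    std::printf("total = %f\n", total);
    return 0;
}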
9. Conclusions and Future Work

Modern GPUs provide mechanisms that enable efficient auto-parallelization systems. In this paper we introduce AMGE, a programming interface, compiler support, and runtime system that enables multi-GPU execution of computations written for a single GPU. Thanks to remote memory accesses, AMGE imposes much lower memory footprint and coherence (i.e., remote access) overheads than previous works. We also demonstrate that transparent data distribution can be efficiently implemented on current GPUs using the UVAS and compiler/runtime-assisted code versioning. Using the array data type provided by AMGE also results in shorter and cleaner code. AMGE achieves almost linear speedups for most of the benchmarks on a real 4-GPU system with an interconnect of moderate bandwidth. Further performance improvements can be achieved by reducing the virtual memory mapping granularity exposed by CUDA and by allowing programmers to tune the thread block scheduling policy.

We believe that AMGE could be used in future systems, such as NVIDIA Pascal boards, to automatically scale the performance of GPU kernels to multiple GPUs. We plan to extend our evaluation to irregular computation patterns.

References

[1] TOP500 list - June 2014. http://top500.org/list/2014/06/.
[2] C++ AMP: C++ Accelerated Massive Parallelism, 2012.
[3] NVIDIA GPUDirect. https://developer.nvidia.com/gpudirect, 2012.
[4] APU 101: All about AMD Fusion Accelerated Processing Units, 2013.
[5] Tegra K1 Next-Gen Mobile Processor, 2014.
[6] AccelerEyes. ArrayFire, 2012.
[7] Jennifer M. Anderson, Saman P. Amarasinghe, and Monica S. Lam. Data and computation transformations for multiprocessors. PPoPP '95, 1995.
[8] M. Araya-Polo, J. Cabezas, M. Hanzich, M. Pericas, F. Rubio, I. Gelado, M. Shafiq, E. Morancho, N. Navarro, E. Ayguadé, J.M. Cela, and M. Valero. Assessing accelerator-based HPC reverse time migration. TPDS, 22(1), 2011.
[9] Nathan Bell. Thrust: A parallel template library for CUDA, 2009.
[10] Laura Susan Blackford, J. Choi, A. Cleary, A. Petitet, R. C. Whaley, J. Demmel, I. Dhillon, K. Stanley, J. Dongarra, S. Hammarling, G. Henry, and D. Walker. ScaLAPACK: A portable linear algebra library for distributed memory computers - design issues and performance. SC '96, 1996.
[11] William J. Bolosky, Michael L. Scott, Robert P. Fitzgerald, Robert J. Fowler, and Alan L. Cox. NUMA policies and their relation to memory architecture. ASPLOS IV, 1991.
[12] David Callahan and Ken Kennedy. Compiling programs for distributed-memory multiprocessors. The Journal of Supercomputing, 2(2), 1988.
[13] Vincent Cavé, Jisheng Zhao, Jun Shirako, and Vivek Sarkar. Habanero-Java: The new adventures of old X10. PPPJ '11, 2011.
[14] Philippe Charles, Christian Grothoff, Vijay Saraswat, Christopher Donawa, Allan Kielstra, Kemal Ebcioglu, Christoph von Praun, and Vivek Sarkar. X10: An object-oriented approach to non-uniform cluster computing. OOPSLA '05, 2005.
[15] Kayvon Fatahalian, Daniel Reiter Horn, Timothy J. Knight, Larkhoon Leem, Mike Houston, Ji Young Park, Mattan Erez, Manman Ren, Alex Aiken, William J. Dally, and Pat Hanrahan. Sequoia: Programming the memory hierarchy. SC '06, 2006.
[16] Michael Garland, Scott Le Grand, John Nickolls, Joshua Anderson, Jim Hardwick, Scott Morton, Everett Phillips, Yao Zhang, and Vasily Volkov. Parallel computing experiences with CUDA. IEEE Micro, 28(4), July 2008.
[17] Isaac Gelado, John E. Stone, Javier Cabezas, Sanjay Patel, Nacho Navarro, and Wen-mei W. Hwu. An asymmetric distributed shared memory model for heterogeneous parallel systems. ASPLOS XV, 2010.
[18] Peter N. Glaskowsky. NVIDIA's Fermi: The First Complete GPU Computing Architecture, 2009.
[19] M. Gupta and P. Banerjee. Demonstration of automatic data partitioning techniques for parallelizing compilers on multicomputers. TPDS, 3(2), March 1992.
[20] IMPACT Group. Parboil benchmark suite. http://impact.crhc.illinois.edu/parboil.php, 2012.
[21] Intel Corporation. Ivy Bridge Architecture, 2011.
[22] Thomas B. Jablin, Prakash Prabhu, James A. Jablin, Nick P. Johnson, Stephen R. Beard, and David I. August. Automatic CPU-GPU communication management and optimization. PLDI '11, 2011.
[23] S.W. Keckler, W.J. Dally, B. Khailany, M. Garland, and D. Glasco. GPUs and the future of parallel computing. IEEE Micro, 31(5), 2011.
[24] The Khronos Group Inc. The OpenCL Specification, 2013.
[25] Jungwon Kim, Honggyu Kim, Joo Hwan Lee, and Jaejin Lee. Achieving a single compute device image in OpenCL for multiple GPUs. PPoPP '11, 2011.
[26] Charles H. Koelbel, David B. Loveman, Robert S. Schreiber, Guy L. Steele, Jr., and Mary E. Zosel. The High Performance Fortran Handbook. 1994.
[27] Janghaeng Lee, Mehrzad Samadi, Yongjun Park, and Scott Mahlke. Transparent CPU-GPU collaboration for data-parallel kernels on heterogeneous systems. PACT '13, 2013.
[28] Jaydeep Marathe, Vivek Thakkar, and Frank Mueller. Feedback-directed page placement for ccNUMA via hardware-generated memory traces. Journal of Parallel and Distributed Computing, 70(12), 2010.
[29] Jaroslaw Nieplocha, Robert J. Harrison, and Richard J. Littlefield. Global Arrays: A portable "shared-memory" programming model for distributed memory computers. SC '94, 1994.
[30] D.S. Nikolopoulos, T.S. Papatheodorou, C.D. Polychronopoulos, J. Labarta, and E. Ayguadé. User-level dynamic page migration for multiprogrammed shared-memory multiprocessors. ICPP 2000, 2000.
[31] NVIDIA Corporation. CUDA C Programming Guide, 2013.
[32] Steven L. Scott. Synchronization and communication in the T3E multiprocessor. ASPLOS VII, 1996.
[33] John E. Stone, David J. Hardy, Ivan S. Ufimtsev, and Klaus Schulten. GPU-accelerated molecular modeling coming of age. Journal of Molecular Graphics and Modelling, 29(2), 2010.
[34] John A. Stratton, Christopher I. Rodrigues, I-Jui Sung, Li-Wen Chang, Nasser Anssari, Geng (Daniel) Liu, Wen-mei W. Hwu, and Nady Obeid. Algorithm and data optimization techniques for scaling to massively threaded systems. IEEE Computer, 45(8), 2012.
[35] Ivan Tanasic, Lluís Vilanova, Marc Jordà, Javier Cabezas, Isaac Gelado, Nacho Navarro, and Wen-mei W. Hwu. Comparison based sorting for systems with multiple GPUs. GPGPU-6, 2013.
[36] Vinod Tipparaju and Jeffrey S. Vetter. GA-GPU: Extending a library-based global address space programming model for scalable heterogeneous computing systems. CF '12, 2012.
[37] Jonas Tölke. Implementation of a Lattice Boltzmann kernel using the Compute Unified Device Architecture developed by NVIDIA. Computing and Visualization in Science, 13(1), 2010.
[38] S. Tomov, R. Nath, H. Ltaief, and J. Dongarra. Dense linear algebra solvers for multicore with GPU accelerators. IPDPS 2010, April 2010.
