Exploiting Remote Memory Access for Automatic Multi-GPU Parallelization
Javier Cabezas∗§   Lluís Vilanova∗§   Isaac Geladoφ   Thomas B. Jablin‡   Nacho Navarro∗§   Wen-mei Hwu‡
Barcelona Supercomputing Center∗   UPC§   NVIDIA Researchφ   UIUC‡
{name.lastname}@bsc.es   {jcabezas,vilanova,nacho}@upc.edu   [email protected]   {jablin,w-hwu}@illinois.edu

Abstract data structures. Consequently, prior work must conservatively replicate portions of the arrays that are never accessed. For In this paper we present AMGE, a programming framework example, consider a kernel that performs n-dimensional tiling and runtime system that transparently decomposes GPU ker- (a pattern often found in dense GPU computations [33, 37, 8]) nels and executes them on multiple GPUs in parallel. AMGE where computation partitions access different non-contiguous exploits the remote memory access capability in modern GPUs array regions. In such a case, Kim et al. transfer the whole to ensure that data can be accessed regardless of its physical memory ranges accessed by each computation partition, which location, thus allowing our runtime to safely decompose and may include large portions of the array that are never used, distribute arrays across GPU memories. It also implements a while Lee et al. replicate the whole data structure in all GPUs. compiler analysis that detects array access patterns in GPU This increases the memory usage, limiting the size of the prob- kernels. Using this information, the runtime chooses the best lems that can be handled, and imposes performance overheads computation and data distribution configuration. Results show due to larger data transfers. (2) Data coherence overhead: 1.98× and 3.89× execution speedups for 2 and 4 GPUs for a replicated output memory regions need to be merged in the wide range of dense computations compared to the original host memory after each kernel call. In many cases, this merge versions on a single GPU. The GPU execution model allows step leads to large performance overheads. (3) Lack of support AMGE to hide the cost of remote memory accesses when they for atomic and global memory instructions. are kept below 3%. We further demonstrate that a thread block In this paper we present AMGE (Automatic Multi-GPU scheduling policy that distributes remote accesses thorough Execution), a programming interface, compiler support and the whole kernel execution helps reducing their overhead. runtime system for that automatically executes computations 1. Introduction that are programmed for a single GPU across all the GPUs in the system. The programming interface provides a data type Current HPC cluster systems commonly install 2 or 4 CPUs for multidimensional arrays that allows for robust, transparent in each node [1]. Some also install discrete GPUs to further distribution of arrays across all GPU memories. This new type accelerate computations rich in data parallelism. As CPU provides dimensionality information that enables the compiler and GPU are integrated into the same chip (e.g., Intel Ivy to determine how the arrays are accessed in GPU kernels. The Bridge [21], AMD APU [4], NVIDIA K1 [5]), multi-GPU runtime system uses the compiler-provided information to nodes are expected to be common in future HPC systems. automatically choose the best computation and data distribu- Current GPU programming models, such as CUDA [31] and tion configuration to minimize inter-GPU communication and OpenCL [24], make multi-GPU programming a tedious and memory footprint. error-prone task. These models present GPUs as external AMGE assumes non-coherent non-uniform shared memory devices with their own private memory, and programmers are accesses (NCC-NUMA) between GPUs through a relatively in charge of splitting data and computation across GPUs and low-bandwidth interconnect, such that all GPUs can access taking care of data movement. 
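For illustration, the following sketch shows the kind of per-GPU bookkeeping that current CUDA forces on the programmer and that AMGE removes: the data must be split by hand, a piece allocated and copied per device, and one kernel launched per GPU. The kernel, names and sizes are hypothetical and not taken from the paper.

  #include <algorithm>
  #include <vector>
  #include <cuda_runtime.h>

  // Illustrative kernel (not from the paper).
  __global__ void vec_scale(float *x, size_t n, float a) {
      size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
      if (i < n) x[i] *= a;
  }

  // Hand-written decomposition across ngpus devices: the bookkeeping AMGE automates.
  void scale_on_all_gpus(float *host_x, size_t n, float a, int ngpus) {
      size_t chunk = (n + ngpus - 1) / ngpus;
      std::vector<float *> dev_x(ngpus, nullptr);
      for (int g = 0; g < ngpus; ++g) {
          size_t beg = g * chunk;
          if (beg >= n) break;
          size_t cnt = std::min(chunk, n - beg);
          cudaSetDevice(g);                                    // select the GPU
          cudaMalloc(&dev_x[g], cnt * sizeof(float));          // per-GPU allocation
          cudaMemcpy(dev_x[g], host_x + beg, cnt * sizeof(float),
                     cudaMemcpyHostToDevice);                  // explicit host-to-GPU copy
          vec_scale<<<(unsigned)((cnt + 255) / 256), 256>>>(dev_x[g], cnt, a);
      }
      for (int g = 0; g < ngpus; ++g) {                        // copy results back and free
          if (!dev_x[g]) continue;
          size_t beg = g * chunk, cnt = std::min(chunk, n - beg);
          cudaSetDevice(g);
          cudaMemcpy(host_x + beg, dev_x[g], cnt * sizeof(float),
                     cudaMemcpyDeviceToHost);
          cudaFree(dev_x[g]);
      }
  }

Even this embarrassingly parallel case needs explicit device selection, per-device buffers and transfers; kernels with halos or shared inputs require considerably more code.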
and cache any partition of the arrays. Thus, we ensure that Some solutions have already been proposed to transparently arrays can be arbitrarily decomposed, distributed and safely exploit multiple GPUs. Kim et al. [25] build a single virtual accessed from any GPU in the system. In current systems compute device for all GPUs in the system. They decompose based on discrete GPUs, we utilize Peer-to-Peer [3] and Uni- computations and execute the partitions on different GPUs. fied Virtual Address Space [31] technologies that enable a Data is also decomposed across GPUs as long as their compiler GPU to transparently access the memory of any other GPU and run-time analyses are able to unequivocally determine the connected to the same PCIe domain. While remote GPU mem- regions of the arrays accessed by each computation partition. ory accesses have been used in the past [35], this is the first Otherwise data is replicated on all GPUs. Lee et al. [27] ex- work to use them as an enabling mechanism for automatic tend this idea to heterogeneous systems with different types of multi-GPU execution. compute devices (CPUs or GPUs) or computation capabilities We also present and discuss different implementation trade- (e.g., different GPU models). However, both solutions suffer offs for computation and data distribution across GPUs using a from fundamental limitations. (1) Memory footprint overhead: prototype implementation of AMGE for ++ and CUDA. The none of these solutions determine the dimensionality of the prototype includes a compiler pass that detects array access We evaluate AMGE on an existing commercial system that implements most of the features of an NCC-NUMA system based on NVIDIA discrete GPUs (see Figure 1). GPUs access their local memory (arc a) with full-bandwidth. Accesses to CPU memories from the GPU (arc b) are routed through the PCI Express (PCIe) interconnect and the CPU memory con- Figure 1: Multi-GPU architecture evaluated in this paper. troller. If the target address resides in a memory connected to a different CPU socket, the inter-CPU interconnect (Hyper- patterns in CUDA kernels and generates optimized versions Transport/QPI) must be traversed, too. GPUs can also access of the kernels for different array decompositions. The runtime the memory in another GPU through the PCIe interconnect system distributes data and computation, and selects the appro- (arc c). This is implemented on top of the peer-to-peer (P2P) priate kernel version. This prototype is evaluated using a set mechanism introduced in NVIDIA Fermi GPUs [18]. of GPU dense-computation benchmarks, originally developed While the execution model provided by GPUs can hide for single-GPU execution. Results on a real 4-GPU system large memory latencies, both CPU memory and the inter-GPU show 1.98× and 3.89× kernel execution speedups for 2 and interconnects (e.g., PCIe 2.0/3.0) deliver a memory bandwidth 4 GPUs, respectively, using the default distribution selection which is an order of magnitude lower than the local GPU mem- policy implemented in our prototype, compared to the original ory (GDDR5). New interconnects that provide much higher version of the kernels running on a single GPU. bandwidth have been announced (e.g., NVLink is projected The main contributions of this paper are: (1) A multi-GPU to deliver up to 100 GB/s), but the memory technology will parallelization system that provides space-efficient data de- also keep improving, thus maintaining this gap. Therefore, compositions to enable bigger problem sizes. 
(2) A novel minimizing remote accesses is key for performance. compiler analysis for GPU kernels that detects the array access patterns to systematically determine the array decomposition 2.1. GPU Programming Model and distribution configuration to be used in order to minimize GPUs are typically programmed using a Single Program Mul- remote memory accesses. (3) A simple programming interface tiple Data (SPMD) programming model, such as NVIDIA that can be easily introduced into languages such as CUDA CUDA or OpenCL. For simplicity, we use the CUDA nam- and OpenCL to robustly and transparently distribute compu- ing conventions in the rest of the paper. This model allows tation and data across several GPUs. (4) An evaluation that programmers to spawn a large number of threads that exe- shows the efficacy of the remote memory access mechanism to cute the same program, although each thread might take a enable multi-GPU parallelization even when built on a limited completely different control flow path. All these threads are bandwidth interconnect. organized into a computation grid of groups of threads (i.e., 2. Multi-GPU architecture thread blocks). Each thread block has an identifier and each thread has an identifier within the thread block, that can be Our target multi-GPU system architecture has one or several used by programmers to map the computation to the data struc- chips that contain both CPU and GPU cores. Since we focus tures. Both CUDA and OpenCL provide weak consistency on programmability within a node, we use the term system models: memory updates performed by a thread block might to refer to a node. Each chip is connected to one or more not be perceived by other thread blocks, except for atomic and memory modules, but all cores (both CPU and GPU) can ac- memory fence (GPU-wide and system-wide) instructions. We cess any memory module in the system. Accesses to remote refer the reader to the CUDA Programming Guide [31] and memories have longer access latency and lower bandwidth the OpenCL specification [24] for further details. than local accesses, thus forming a shared memory NUMA Multi-GPU Programming. In CUDA and OpenCL, GPUs system. CPU cores typically access memory through a co- are exposed as external devices with their own memories. Pro- herent cache hierarchy, while GPUs use weaker consistency grammers typically decompose computation and data so that models that do not require cache coherence between cores. each GPU only accesses its local memory. If there are regions Therefore, both coherent and non-coherent interconnects are of data that are accessed by several GPUs, programmers are supported in AMGE. This shared memory NCC-NUMA sys- responsible of replicating and keeping them coherent through tem architecture has been successfully implemented in the explicit memory transfers. CUDA exposes a Unified Virtual past (e.g., Cray T3E [32]). NVIDIA proposes a similar system Address Space (UVAS), which ensures that virtual memory architecture in the Echelon project [23]. Moreover, NVIDIA addresses are unique across all memories in the system, and will offer single-board multi-GPU configurations in which support for remote memory accesses. Using the UVAS pro- GPUs share the memory through a non-coherent interconnect grammers could map the pages of a memory allocation on named NVLink, in the Pascal family of GPUs. AMD also different GPU memories. 
implements coherent and non-coherent memory hierarchies in their APU processors [4]. However, CUDA does not provide any means to control how virtual addresses are mapped to physical memory, and allocations are bound to a single GPU.

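As a concrete illustration of the mechanism AMGE builds on, the sketch below (a minimal example of my own, not AMGE code) enables peer-to-peer access between two GPUs and lets a kernel running on GPU 0 dereference, through the unified virtual address space, a buffer that physically resides on GPU 1.

  #include <cstdio>
  #include <cuda_runtime.h>

  __global__ void read_remote(const float *remote, float *local, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) local[i] = remote[i];   // may be a remote (PCIe P2P) load
  }

  int main() {
      int can = 0;
      cudaDeviceCanAccessPeer(&can, /*device=*/0, /*peerDevice=*/1);
      if (!can) { printf("no P2P path between GPU 0 and GPU 1\n"); return 1; }

      float *on_gpu1, *on_gpu0;
      cudaSetDevice(1); cudaMalloc(&on_gpu1, 1024 * sizeof(float));
      cudaSetDevice(0); cudaMalloc(&on_gpu0, 1024 * sizeof(float));
      cudaDeviceEnablePeerAccess(1, 0);  // GPU 0 may now access GPU 1's memory

      // UVAS guarantees on_gpu1 is a valid, unique address from GPU 0 as well.
      read_remote<<<4, 256>>>(on_gpu1, on_gpu0, 1024);
      cudaDeviceSynchronize();
      return 0;
  }

This is exactly the hardware/driver capability that lets AMGE place array partitions on different GPUs while keeping every partition reachable from every kernel.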
2 Figure 2: Overview of AMGE components. The compiler extracts array access pattern information and stores it in the program binary. The runtime system uses this information to decompose and distribute computation and data across the GPUs in the system. In this example, the system is composed of a single CPU and 4 GPUs, connected through a PCIe interconnect. Often time, data structures need to be accessed by both CPU implementation alternatives for this data type. and GPU code and programmers are in charge of keeping sep- A key feature in AMGE is the utilization of remote memory arate copies of the data structures and keeping them coherent accesses between GPUs [3]. On each reference to the array, the through copying between the CPU and the GPU memories. underlying implementation determines whether the element This extra code incurs extra development time and harms being referenced is hosted in the memory local to the GPU maintainability. Some solutions have been proposed to trans- executing the code or on a different GPU. References from a parently manage CPU/GPU data coherence [17, 22]. Recently, GPU to parts of the array stored in different GPU memories CUDA introduced UVM (Unified Virtual Memory), which are handled using remote memory accesses. This approach allows to declare memory allocations that can be accessed by ensures that any computation can always be decomposed and CPU and GPU code, but not concurrently. UVM is based on executed across multiple GPUs regardless the chosen data the ADSM model [17], in which the memory that is shared by distribution configuration. This removes the requirement for CPU and GPUs is acquired/released by the GPU in kernel call the compiler analysis to unequivocally determine the bounds boundaries. OpenCL 2.0 also exposes a Shared Virtual Mem- of the memory range accessed by a computation partition. ory space [24], but it does not allow programmers to specify However, remote accesses can impose performance overheads in which memory data is actually allocated. AMGE builds on and they must be minimized. remote memory accesses, UVAS, and ADSM technologies. On each kernel call, the AMGE runtime transparently de- termines the best computation grid and array decompositions 3. AMGE overview using the access pattern information generated by the compiler, and distributes them across all GPUs in the system. AMGE is a programming framework that decomposes and dis- Memory model: Arrays are decomposed and/or replicated tributes GPU kernels and data to be collaboratively executed before each kernel call. Input arrays can be replicated at on all the GPUs in the system. We implement AMGE using the cost of additional space and data transfer bandwidth con- C++ and CUDA, but it can be extended to other languages. sumption, but replicated output present additional problems. Figure 2 shows the components in AMGE and how they inter- After a kernel call, partial modifications in each copy need to act with the hardware. AMGE aggregates the GPU resources be merged to provide a consistent view of the array, before in the system and presents them as a single virtual GPU. Thus, using it in another kernel or in the host code. Previous so- programmers are relieved from the burden of decomposing lutions [25, 27] transfer all copies back to the CPU memory the problem and explicitly managing several GPUs. for a merge step, which impose a large performance overhead The AMGE compiler is a source-to-source compiler, based in many workloads. 
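To make the data-type discussion concrete, here is a minimal sketch of what a dimensionality-aware array type can look like. It is not the actual AMGE implementation (the real ndarray also encodes the storage order and the distribution across GPUs, see Section 5); it only shows the interface the text relies on: per-dimension sizes plus an indexing operator whose per-dimension indexes remain visible to the compiler.

  #include <cstddef>
  #include <cuda_runtime.h>

  // Sketch only: a 2D, row-major, single-allocation stand-in for AMGE's ndarray.
  template <typename T, unsigned Dims>
  class ndarray {
  public:
      ndarray(size_t d0, size_t d1) {
          static_assert(Dims == 2, "this sketch only covers the 2D case");
          dims_[0] = d0; dims_[1] = d1;
          cudaMallocManaged(&data_, d0 * d1 * sizeof(T));  // placeholder storage
      }
      __host__ __device__ size_t get_dim(unsigned d) const { return dims_[d]; }
      // Indexing keeps (i, j) separate, unlike a hand-flattened a[i * width + j].
      __host__ __device__ T &operator()(size_t i, size_t j) const {
          return data_[i * dims_[1] + j];
      }
  private:
      T *data_;          // in AMGE this storage is split across GPU memories
      size_t dims_[Dims];
  };

Because references go through operator() with one argument per dimension, the compiler can classify the access pattern of each dimension independently, which is impossible once the programmer has flattened the array by hand.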
We avoid this problem by not allowing on the LLVM framework, that analyzes the CUDA kernels in output arrays to be replicated. Instead, output arrays are al- the program to detect their array access patterns and store this ways distributed across GPU memories, and accessed through information in the program executable. We argue that the uti- remote memory accesses if necessary. lization of the array dimensionality information is paramount AMGE implements the ADSM model [17] to allow arrays in order to efficiently exploit multi-GPU systems. However, to be used both by host and GPU code. ADSM assumes a AMGE targets CUDA which is an extension of the C/C++ lan- release consistency model in which allocations belong to the guages that do not provide data types with such information. In CPU code by default and are implicitly acquired/released by C/C++, programmers typically flatten the multi-dimensional the GPU on kernel call boundaries. The runtime transparently arrays into 1D arrays and linearize the dimension indices in transfers arrays between CPU and GPU memories as needed. each reference to the array. It is practically difficult if not infea- 3.1. An example: matrix multiplication sible for static analysis to reliably recover the dimensionality once the accesses have been flattened. AMGE provides a new Code programmed to run on a single GPU requires only mi- data type for multi-dimensional arrays that makes available nor modifications to use AMGE. Listing 1 shows the GPU this information to the compiler. Section 5 discusses different code of a single-precision floating point matrix-matrix mul-

tiplication (i.e., sgemm) computation using AMGE [16]. This code requires A and C matrices to be stored in column major order, and B in row major order. The highlighted text shows the modifications performed to the original code.

1  __global__ void sgemm(ndarray<float, 2, cmo> C, ndarray<float, 2, cmo> A,
2                        ndarray<float, 2> B)
3  {
4    float partial[SGEMM_TILE_N];
5    __shared__ float b_tile_sh[SGEMM_TILE_HEIGHT][SGEMM_TILE_N];
6    for (int i = 0; i < SGEMM_TILE_N; i++) partial[i] = 0.0f;
7
8    int mid = threadIdx.y * blockDim.x + threadIdx.x;
9    int row = blockIdx.x * (SGEMM_TILE_N * SGEMM_TILE_HEIGHT) + mid;
10   int col = blockIdx.y * SGEMM_TILE_N + threadIdx.x;
11
12   for (int i = 0; i < A.get_dim(1); i += SGEMM_TILE_HEIGHT) {
13     b_tile_sh[threadIdx.y][threadIdx.x] = B(i + threadIdx.y, col);
14     __syncthreads();
15     for (int j = 0; j < SGEMM_TILE_HEIGHT; ++j) {
16       float a = A(row, i + j);
17       for (int k = 0; k < SGEMM_TILE_N; ++k)
18         partial[k] += a * b_tile_sh[j][k];
19     }
20     __syncthreads();
21   }
22   for (int i = 0; i < SGEMM_TILE_N; i++)
23     C(row, i + blockIdx.y * SGEMM_TILE_N) = partial[i];
24 }

Listing 1: Multi-GPU matrix-matrix multiplication GPU code using AMGE for C++ and CUDA. cmo means column major order.

The only additional programming requirement for the kernel to be automatically decomposed is the utilization of the array data type (lines 1-2), and its associated indexing routines (lines 12, 13, 16 and 23). The data type is implemented by the ndarray<T, Dims, Storage> C++ class template, where T is the type of the elements, Dims is the number of dimensions of the array, and Storage is an optional parameter that defines the storage type (if not specified, row major order storage is used). The kernel uses a 2D computation grid, in which each thread block computes a 2D tile of C by traversing A and B on their X and Y dimensions, respectively. The compiler detects these patterns and stores them in the program executable.

Listing 2 shows the CPU code of the sgemm computation. First, float input matrices A and B are declared in lines 2 and 3. ndarray objects can be passed both to CPU and GPU routines. The AMGE runtime intercepts the kernel call and uses the information registered by the compiler to decompose the matrices, and distribute both computation and data across all the GPUs in the system. Note that there are no explicit data transfers between host and GPU memories. The runtime transparently detects when data needs to be transferred.

1  // Initialize A and B in the host code
2  ndarray<float, 2, cmo> A;
3  ndarray<float, 2> B;
4
5  read_array("A.dat", A);
6  read_array("B.dat", B);
7
8  ndarray<float, 2, cmo> C(A.get_dim(1), B.get_dim(0));
9  // Computation grid size
10 dim3 block(MATRIXMUL_TILE_N, SGEMM_TILE_HEIGHT);
11 dim3 grid(C.get_dim(1)/(SGEMM_TILE_N * SGEMM_TILE_HEIGHT),
12           C.get_dim(0)/SGEMM_TILE_N);
13 // Kernel launch. A, B and C are used in the GPU code
14 sgemm<<<grid, block>>>(C, A, B);
15 // Write results for C into a file
16 write_array("C.dat", C);

Listing 2: Multi-GPU matrix-matrix multiplication host code using AMGE for C++ and CUDA.

4. Computation and data distribution in AMGE

The AMGE runtime decomposes GPU kernels using thread block granularity. This is because threads within a thread block share resources (e.g., shared memory), and support barrier synchronization operations. This requires all threads within a thread block to be executed in the same compute core of the same GPU. However, the GPU programming model guarantees that there are no data dependences across thread blocks within a kernel and, therefore, they can execute independently.

In CUDA, programmers specify a computation grid that is a multidimensional space gridx × gridy × gridz of thread blocks, similar to the iteration space in loop nests. Each thread block has a unique identifier blocki,j,k, with 0 ≤ i,j,k < gridx,gridy,gridz, within the computation grid. The AMGE runtime decomposes the computation grid so that it can be executed on several GPUs. In the GPU programming model, the iteration space is canonical and rectangular. Thus, dimensions can be uniformly decomposed into partitions. Computation grid decompositions along any of its dimensions (or combinations of them) are supported.

The AMGE runtime tries to place on each GPU most of the data accessed by the computation partition assigned to it, in order to minimize remote memory accesses. A naive approach to minimize remote memory accesses is to replicate all the input arrays in all the GPU memories. This, however, imposes a large memory footprint overhead. AMGE uses compiler analysis to generate array access pattern information for all the GPU kernels in the program, that is used by the runtime component to decide the best computation and array distribution configuration. In the next subsections we describe how this information is generated by the compiler and used by the runtime system to decompose and distribute the arrays.

4.1. Compiler analysis

The AMGE compiler analyzes all array references in the kernel. ndarray references provide a separate index for each of the dimensions of the array, allowing the AMGE compiler to detect the individual access pattern on each dimension. This is in contrast to previous works [25, 27] that treat all arrays as one dimensional. Kim et al. [25] compute the upper and lower memory addresses of the tiles accessed by each computation partition to distribute the arrays. Since each multi-dimensional tile appears like a collection of strided bands when the arrays are viewed as one dimensional, previous single-address-range approaches falsely conclude that the tiles overlap, resulting in unnecessary replication of large portions of the arrays. For output arrays the scenario is even worse, because overlapping regions are merged after each kernel execution. By detecting the access pattern on each dimension, the AMGE runtime can identify multi-dimensional tiles as non-overlapping entities, avoiding the unnecessary replication in prior works.

We consider three access pattern types: (1) as a function of thread block indices, (2) within a thread block (e.g., loops),

4 tile or 1 if all threads access the same element of the array’s dimension (e.g., the same row in a matrix). • m is the index of the non-contiguous array tile accessed by a thread block (can be an induction variable or a constant). We claim that most array-based computations use this ex- pression to map thread block identifiers to the data they access. (a) BLOCK (b) BLOCK-CYCLIC Figure 3: Computation-to-data mapping examples. The most common and simple mapping is found when indi- vidual contiguous array tiles are assigned to contiguous thread and (3) data-dependent. The first type is the most common blocks (m = 0). This pattern classified as BLOCK. and is produced by programmers designating the data to be Another very common mapping is to assign non-contiguous referenced to each thread block. Typically, affine transforma- array tiles to each thread block (m > 0) using a grid-sized tions of the block and thread indices are used to compute the stride. This pattern is classified as BLOCK-CYCLIC. CYCLIC is indices used in the array references. As a result, threads that a special case of BLOCK-CYCLIC, in which the block size is 1 belong to blocks with contiguous identifiers in one dimension but, as we show in Section 5, it allows for a more efficient tend to access elements that are contiguous in the same or a implementation of array distribution. The AMGE runtime different dimension in the array. In the sgemm example (List- determines the size of the block by inspecting the thread block ing 1), blockx is used to access Ay (line 16) and Cy (line 23), size parameter provided by the programmer in the kernel call. while blocky is used to access Bx (line 13) and Cx (line 23). 4.1.2. Runtime information generation The compiler cre- We use the notation Ai to refer to the ith dimension of array A. ates a map that stores, for every dimension of each array This linear relationship allows us to relate thread blocks with reference, a set with the identifiers of the computation grid the portions of the arrays accessed by them. used to access it. It also stores the type of the access (i.e., The second access pattern type is produced when array Read or Write) and the distribution type (i.e., BLOCK, CYCLIC, dimensions are traversed through loop induction variables. BLOCK-CYCLIC) identified in the analysis. Since several array In the sgemm example (Listing 1), Ax is traversed using the reference statements can be found in a kernel, the results of all induction variables i + j of the nested loops (line 16). references to an array are combined into a single map. Merg- The third type of accesses cannot be determined at compile ing the results of two array references involves performing the time since the indices are computed with values that are only intersection of the sets of each dimension. Therefore, if an known at kernel execution time. array is read using different thread block identifiers in different 4.1.1. Distribution type Only array dimensions accessed us- parts of the kernel, the combined results will be an empty set, ing the first pattern type are eligible for decomposition. The which indicates that it must be replicated. compiler analyzes such patterns in order to determine the data 4.1.3. Analysis limitations The algorithm cannot decide how distribution type. AMGE assumes a distribution of the com- to distribute an array when an array dimension is traversed putation grid in which block identifiers are contiguous within using several dimensions of the computation grid. 
However, each partition and, typically, programmers assign contiguous this pattern is rarely used. One example is when a dimen- elements of data to contiguous thread blocks. This would sion is accessed as an n-dimensional space. Using a higher allow for a simple decomposition of arrays into contiguous dimensionality for the array would solve this problem. Data- tiles. However, often time programmers use other computa- dependent access patterns cannot be classified by the analysis, tion-to-data mappings. We propose a novel compiler analysis either. In these cases, the AMGE runtime replicates the arrays. that detects the thread block-to-data mappings used by pro- 4.1.4. Memory consistency model Decomposed kernels grammers and classifies them into the most common data dis- must honor the memory consistency of the GPU program- tribution types: BLOCK, CYCLIC and BLOCK-CYCLIC [19, 26, 10] ming model. Data propagation across thread blocks in the (Figure 3 shows two different mappings for a 2D array). The GPU model is only guaranteed for atomic memory operations computation distribution classification for each dimension of and memory fences. For atomic operations we exploit the the arrays is also communicated to the runtime. hardware support provided by modern system architectures The AMGE compiler attempts to map the index expression (e.g., atomic operations in PCIe 3.0). GPU-wide memory used for each array dimension to the canonical expression fences are translated to system-wide memory fences to ensure t + bs × B + m × bs × G, where: correctness. Finally, since a GPU may cache remote array • B is the thread block identifier (i.e., blockIdx). partitions, GPU caches are flushed at kernel exit boundary. • G gridDim is the number of thread blocks in the grid (i.e., ). 4.2. Runtime data decomposition and distribution • t is a thread identifier or a constant. This value does not determine the access patterns across thread blocks. On each kernel execution, the runtime reads the access pattern • bs indicates the bounds of the array tile being accessed by a information generated by the compiler. If no information thread block. Typically it is a multiple of blockDim if each is found, input arrays are replicated and output arrays are thread in the block accesses one or several elements of the decomposed on their highest-order dimension.

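The two most common thread-block-to-data mappings that the analysis in Section 4.1.1 classifies can be illustrated with the following toy kernels (my own examples, not taken from the paper); both index expressions are instances of the canonical form t + bs × B + m × bs × G.

  // BLOCK: contiguous array tiles go to contiguous thread blocks (m = 0).
  // idx = threadIdx.x + blockDim.x * blockIdx.x
  __global__ void scale_block(float *a, int n, float s) {
      int idx = threadIdx.x + blockDim.x * blockIdx.x;
      if (idx < n) a[idx] *= s;
  }

  // BLOCK-CYCLIC: each block walks the array with a grid-sized stride (m > 0).
  // idx = threadIdx.x + blockDim.x * blockIdx.x + m * blockDim.x * gridDim.x
  __global__ void scale_block_cyclic(float *a, int n, float s) {
      for (int idx = threadIdx.x + blockDim.x * blockIdx.x;
           idx < n;
           idx += blockDim.x * gridDim.x)   // m is the loop induction variable
          a[idx] *= s;
  }

In the first kernel t = threadIdx.x, bs = blockDim.x and m = 0, so the compiler classifies the dimension as BLOCK; in the second, the grid-stride loop makes m a loop induction variable, which is classified as BLOCK-CYCLIC (CYCLIC being the special case bs = 1).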
5 Ay is not decomposed either. In transpose, both blockx and blocky are used to index the two dimensions of A and B matri- ces and both can be decomposed. Dimensions of C in sgemm and B in transpose are accessed using identifiers of different dimensions of the computation grid (e.g., blockx is used to access dimension Y). Thus, in an XY computation distribu- tion configuration, neighboring tiles in the X dimension of the array are distributed across the GPUs in gpuy. Arrays that are not decomposed are replicated in all GPU memories. Moreover, tiles from arrays that are not indexed using the identifiers of all the decomposed dimensions of the Figure 4: Data accessed by each computation partition for dif- computation grid, are replicated in the memories of the GPUs ferent computation decompositions in a 4-GPU matrix-matrix that belong to the GPU grid dimensions on which the unused multiplication (sgemm) and matrix transposition (transpose). computation grid dimensions are distributed. Sticking to the sgemm example, the tiles in C can be directly mapped on the A, B, C are the arrays used in the kernels, Pi, j is a partition of GPU grid, but the array distribution configurations for A and the computation grid, Gi, j is a GPU in the GPU grid. B depend on the computation distribution configuration. For The runtime evaluates the characteristics of all possible 1D configurations (X and Y), either A or B are fully replicated distribution configurations and ranks them using a run-time in all GPUs. For XY, the tiles in A are distributed across the selectable policy. Since the computation grid is limited to 3 GPUs in gpux and each tile is replicated in all GPUs in gpuy. dimensions, the number of cases to be evaluated is small. The Conversely, the tiles in B are distributed across the GPUs in highest ranked configuration is used. gpuy and replicated in the GPUs in gpux. 4.2.1. Computation grid distribution The AMGE runtime system lays out the GPUs in the system on a grid with as many 5. Implementation details dimensions as decomposed dimensions in the computation 5.1. Array data type grid. Then, the computation grid is decomposed into as many partitions as the number of GPUs in each dimension of the We utilize the UVAS support provided by the hardware to GPU grid. Each partition of the computation grid is assigned place different parts of the array in different GPU memories to a GPU. The mapping of the computation grid to the GPU while having a continuous representation of the array in the grid is called computation distribution configuration in the virtual address space. Hence, decomposed arrays can be refer- paper. enced by using regular linearization operations on the indexes: n n (a1,··· ,an) → ∑ ai × ∏ D j where ai is the index and Di the number 4.2.2. Array distribution The runtime system uses the ar- i=1 j=i+1 ray access pattern information generated by the compiler to of elements in the ith dimension of the array. Indexes are determine how arrays must be decomposed for a specific com- ordered from the highest-order to the lowest-order dimension. putation distribution configuration, Arrays are decomposed Current versions of CUDA impose a 1 MB (instead of page- along those dimensions that are accessed using identifiers of size) granularity to allocate contiguous virtual memory ranges the computation grid whose dimensions are also decomposed. on different GPUs. 
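Written out (this is only the linearization expression from Section 5.1 restated for readability, not new material), the index computation for a non-decomposed, contiguously stored ndarray is

  \mathrm{offset}(a_1,\dots,a_n) \;=\; \sum_{i=1}^{n} \Bigl( a_i \times \prod_{j=i+1}^{n} D_j \Bigr)

where a_i is the index and D_i the number of elements in the i-th dimension. For example, for a 3D array of size D_1 × D_2 × D_3, element (a_1, a_2, a_3) lives at offset a_1 D_2 D_3 + a_2 D_3 + a_3.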
Decompositions of the highest-order di- Once we determine the dimensions of the array to be decom- mension of the array are implemented by allocating a single posed for a specific computation distribution configuration, contiguous region of memory (multiple of 1 MB) for each the array is decomposed into an n-dimensional grid of tiles, partition on each GPU. Decompositions of dimensions that where n is the number of decomposed dimensions. The size are contiguous in memory are implemented by alternatively and number of tiles depends on how the computation grid is allocating 1 MB chunks on different GPUs, as needed. We mapped to the physical GPUs according to the access patterns. refer to this scheme as VM implementation. If blocki is used to index L j, and gridi is mapped on the k Nevertheless, the coarse memory allocation granularity can dimension of the GPU grid, L j is decomposed into as many produce data distribution imbalance if partitions cannot be tiles as GPUs in gpuk and distributed among them. Figure 4 stored in balanced-sized groups of 1 MB chunks, which results shows the relationship between the computation grid, the GPU in an increased number of remote memory accesses. One grid and the array decomposition grid in different computation solution to reduce the imbalance is to add padding to the distribution configurations for sgemm and transpose. In the lower order dimensions that are not decomposed. However, sgemm example, Cx, Cy, Ay and Bx can be decomposed, as they achieving perfect balancing using 1 MB allocation chunks are accessed using thread block identifiers. For a configura- can impose a footprint overhead in the order of hundreds of tion that decomposes the computation grid on its Y dimension times. Hence, this implementation is nonviable to partition (gridy, second row of Figure 4), Cx and Bx are decomposed. arrays’ lowest-order dimensions, whose elements are stored Since blockx is used to index Ay but gridx is not decomposed, contiguously in memory. Another solution is to permute the

6 dimensions of the array so that decomposed dimensions are Generated code is compiled into the program executable. not stored contiguously in memory. However, this can break 5.2. Run-time distribution configuration selection memory coalescing [34]. Therefore, we also provide an alternate implementation pro- As explained in Section 4.2, on a kernel call, the runtime sys- posed in [7] that reshapes the arrays. In this implementation, tem selects the best distribution configuration. Our prototype each GPU contains a memory allocation that hold all the ele- implements a policy that (1) minimizes the number of remote ments in a partition, and it is padded to the next page boundary. accesses, and (2) favors the array implementation that imposes The array is reorganized by adding a new dimension for each the least overhead. On one hand, the reshaped array imple- decomposed one (i.e., strip-mining), that indicates the GPU in mentation performs additional operations on the indices that which each partition is stored. Thus, in each array reference, impose a performance overhead. On the other hand, the VM the original indexes are transformed into a new set of indexes implementation may introduce unnecessary remote accesses for all the dimensions of the reshaped arrays. This approach due to data distribution imbalance. Therefore, the VM imple- allows arrays to be decomposed on any dimension as they do mentation is preferred unless it introduces too many remote not have to be stored contiguously in the virtual address space. memory accesses. Thus, our policy analytically computes the However, this flexibility comes at the cost of extra computa- data distribution imbalance introduced by the coarse alloca- tion. For example, if a 3D volume is decomposed along its tion granularity in the VM implementation, and uses it to rank highest-order dimension using a BLOCK distribution, the index every array distribution configuration. If this imbalance ex- for this dimension a1 is transformed as follows: ceeds a threshold (in our case 5%), reshaped is chosen. When   several configurations have the same score, decompositions on a1 0 (a1,a2,a3) → ( 0 ,a1 mod D1,a2,a3) D1 the highest-order dimension are preferred because they allow l D m where D0 = 1 and P is the number of tiles in the first dimen- 1 P1 1 for more efficient CPU↔GPU data transfers. An analysis of sion of the array’s decomposition grid. Thus, the operations different policies is out of the scope of the paper. needed to compute the location of the element a1,a2,a3, are di- Since computation partitions are executed independently, vided into: the computations of the offset of the block in the CUDA assigns new thread block identifiers in each invocation. dimension being decomposed, and the linearization of the in- In order to retain the original identifiers, we store the offsets dex within the block. Therefore, an extra division and modulo of each computation partition in the memory of each GPU. A operations are performed in each access to the array, com- preprocessor macro overrides blockIdx and uses these offsets pared with the regular index linearization. For CYCLIC and to compute the original indexes. BLOCK-CYCLIC, similar transformations are performed. 5.2.1. CPU/GPU array coherence The ADSM model is im- Providing a generic indexing routine that supports all possi- plemented like in the GMAC library presented in [17]. 
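A device-side sketch of the BLOCK index transformation used by the reshape implementation is shown below. It assumes a 3D array decomposed only along its highest-order dimension, with D1p = ⌈D1/P1⌉ elements per partition; the function name and the per-partition base-pointer table are illustrative, not the actual AMGE indexing routine.

  // Reshaped BLOCK decomposition of dimension 0 across P1 partitions (one per GPU).
  // base[p] points to the padded allocation that holds partition p.
  __device__ __forceinline__
  float &reshaped_ref(float *const *base, size_t a1, size_t a2, size_t a3,
                      size_t D1p, size_t D2, size_t D3) {
      size_t part  = a1 / D1p;   // extra division: which partition holds the element
      size_t local = a1 % D1p;   // extra modulo: index inside that partition
      return base[part][(local * D2 + a2) * D3 + a3];
  }

The extra division and modulo per reference are exactly the indexing overhead discussed above, which is why the prototype generates specialized kernel versions instead of one generic routine.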
The ble decomposition can impose an unacceptable performance ndarray type keeps two copies of the array, one in host mem- overhead due to the extra operations needed to transform all ory and the second in the GPU memories. The host copy of the indexes. In order to ensure maximum performance, our the array is laid out as a regular C/C++ multidimensional array prototype provides different implementations of the indexing to ensure compatibility with third-party libraries. routines optimized for the different array decompositions and distribution types. However, using specialized indexing rou- 6. Experimental methodology tines for each decomposition requires changes in the kernels, 6.1. Hardware setup as (1) the kernel code must explicitly call the proper versions of the routines, and (2) the array decomposition to be used is All experiments were run on a system containing a quad-core not known at compile time. We implement a compiler pass Intel i7-3820 at 3.6 GHz with 64 GB of DDR3 RAM memory, that generates different kernel versions for different array de- and 4 NVIDIA Tesla K40 GPU cards with 12 GB of GDDR5 compositions of the arrays used in the kernel. In each version, each, connected through a PCIe 3.0 in x16 mode (containing array references use the indexing routines optimized for the two PCIe bridges like in Figure 1). The machine runs a GNU/ decomposition. On each kernel execution, the runtime system Linux system, with Linux kernel 3.12 and NVIDIA driver selects the data distribution to be used for all the arrays and 340.24. Benchmarks were compiled using GCC 4.8.3 and invokes the specialized kernel version for that distribution. LLVM 3.4 for CPU code and NVIDIA CUDA compiler 6.5 Our modified toolchain implements two new compilation for GPU code. Execution times were measured using the passes using the LLVM framework. The first pass performs CUPTI profiling library that provides support for sampling the array access pattern analysis introduced in Section 4.1 on and nanosecond timing accuracy. For runs with more than one the LLVM IR generated by the CUDA compiler, and gener- GPU, graphs show the time for the slowest GPU. ates the host code needed to communicate the information 6.2. Benchmarks to the runtime system. The second pass generates special- ized kernel versions for the different data decompositions and We evaluate AMGE using a number of dense scientific compu- the host code needed to select the proper version at run-time. tations that use different computation and array access patterns.

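The blockIdx-override trick described in Section 5.2 can be sketched as follows; the symbol names are hypothetical and the way the offsets reach the GPU may differ in the actual AMGE runtime.

  // Per-GPU constant with the origin of the computation-grid partition assigned
  // to this device; the runtime writes it (e.g., with cudaMemcpyToSymbol)
  // before each kernel launch.
  __constant__ uint3 amge_block_offset;

  // Defined before the macro below, so "blockIdx" here still names the built-in.
  __device__ __forceinline__ uint3 amge_block() {
      return make_uint3(blockIdx.x + amge_block_offset.x,
                        blockIdx.y + amge_block_offset.y,
                        blockIdx.z + amge_block_offset.z);
  }

  // From here on, unmodified kernel code transparently sees the thread block
  // identifiers it would have had in the original, undecomposed grid.
  #define blockIdx amge_block()

Before launching the partition assigned to GPU g, the runtime would copy that partition's origin into amge_block_offset on that device, so the same kernel source works unchanged for any computation decomposition.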
7 #Reg #Reg #Reg #Reg Array Array Kernel Suite Inputset Orig VM Resh B/C Resh B-C Decompositions Distribution

A,B: 16K×16K 19 22 X→A,B(*,BLOCK) C(*,*) A,B:{Dx} C:{Rx} convolution2D Parboil 2 17 19 19 22 Y→A,B(BLOCK,*) C(*,*) A,B:{Dx} C:{Rx} C: 9×9 19 22 XY→A,B(BLOCK,BLOCK) C(*,*) A,B:{Dx,y} C:{Rx,y} fft1D Parboil A,B: 256M 22 24 24 24 X→A(*) B(BLOCK) A:{Rx} B:{Dx} reduction SDK A,B: 256M 12 12 11 18 X→A,B(BLOCK) A,B:{Dx} saxpy - A,B: 256M 8 8 8 17 X→A,B(BLOCK) A,B:{Dx} 41 41 X→A,C(*,BLOCK) B(*,*) A,C:{Dx} B:{Rx} sgemm Parboil 2 A,B,C: 4K×4K 38 33 40 40 Y→A(*,*) B(*,BLOCK) C(BLOCK,*) A:{Rx} B:{Dx} C:{Dx} 42 42 XY→A,B(*,BLOCK) C(BLOCK,BLOCK) A:{Dx,Ry} B:{Dy,Rx} C:{Dx,y} sgemv - A: 8K×8K - B,C: 8K 11 11 11 16 X→A(BLOCK,*) B(*) C(BLOCK) A:{Dx} B:{Rx} C:{Dx} sort_merge_global 17 16 18 28 X→Ak,Av,Bk,Bv(BLOCK) Ak,Av,Bk,Bv:{Dx} sort_merge_shared SDK Ak,Av,Bk,Bv: 256M 17 16 18 27 X→Ak,Av,Bk,Bv(BLOCK) Ak,Av,Bk,Bv:{Dx} sort_shared 17 18 19 28 X→Ak,Av,Bk,Bv(BLOCK) Ak,Av,Bk,Bv:{Dx} A,B: 16K×16K 17 23 X→A,B(*,BLOCK) A,B:{Dx} stencil2D - + 17 18 17 19 Y→A,B(BLOCK,*) A,B:{Dx} halos 19 26 XY→A,B(BLOCK,BLOCK) A,B:{Dx,y} A,B: 1K×1K×512 30 34 X→A,B(*,*,BLOCK) A,B:{Dx} stencil3D Parboil 2 + 24 26 29 31 Y→A,B(*,BLOCK,*) A,B:{Dx} halos 33 41 XY→A,B(*,BLOCK,BLOCK) A,B:{Dx,y} 17 24 X→A(*,BLOCK) B(BLOCK,*) A,B:{Dx} transpose - A,B: 16K×16K 16 14 16 23 Y→A(BLOCK,*) B(*,BLOCK) A,B:{Dx} 18 26 XY→A,B(BLOCK,BLOCK) A:{Dx,y} B:{Dy,x} vecadd - A,B,C: 256M 10 10 10 18 X→A,B,C(BLOCK) A,B,C:{Dx} Table 1: Benchmark description.

The list of benchmarks is summarized in Table 1. Some of Slowdown (BLOCK and CYCLIC) #Inst. (BLOCK and CYCLIC) Slowdown (BLOCK-CYCLIC) #Inst. (BLOCK-CYCLIC) them are found in the Parboil benchmark suite [20], some in 4.5 14 the NVIDIA SDK, and the rest have been developed in-house. 4.0 12 These benchmarks are selected to provide a good variety of 3.5 10 access patterns and thus challenges. Both CPU and GPU codes 3.0 8 2.5 6 have been modified to use the ndarray data type instead of 2.0 4 Slowdown (x) 1.5 the flat 1D arrays which are commonly used. The benchmarks 2 1.0 0 have been compiled using our toolchain and linked with our Overhead in #instructions X Y XY X X X X Y XY X X X X X Y XY X Y XY X Y XY X runtime system. The ndarray implementation has an impact fft1D saxpy sgemv reduction sgemm stencil2D stencil3D vecadd sort_shared transpose on the register usage count of the kernels (columns 4-7). In convolution2D sort_merge_globalsort_merge_shared the column titles, “Orig” stands for original, “VM” for vir- Figure 5: Grey bars show the slowdown imposed by the index- tual memory and “Resh” for reshape, while “B” stands for ing routines for the reshape array implementation compared BLOCK, “C” for CYCLIC and “B-C” for BLOCK-CYCLIC. BLOCK to the baseline (left axis). Lines indicate the increase in num- and CYCLIC are in the same column (“#Reg Resh B/C”) as they ber of executed instructions (right axis). use the same number of registers. The array decompositions (column 8) and the array distri- plementation only performs the index linearization and, there- bution (column 9) for the different computation distribution fore, the register count is similar to, or even lower than in some configurations suggested by the compiler. A, B, C are the kernels like sort_merge_* and transpose, the original ver- names of the arrays used in the computation. In the case of sion of the benchmarks. reshape, on the other hand, uses more the kernels in merge sort a suffix has been added for keys registers in most of the kernels, especially in BLOCK-CYCLIC (Ak, Bk) and values (Av, Bv). In the last column “D” stands decompositions and those configurations in which arrays are for distribution and “R” for replication in the GPU grid. X decomposed along several dimensions. and Y make reference to the decomposed dimensions of the Figure 5 shows the overheads imposed by the indexing rou- computation grid. Array decompositions are shown using the tines for the reshape implementation on a single GPU. While notation in HPF [26], but dimensions are ordered (left to right) the compiler suggests the utilization of the BLOCK distribu- from highest to lowest order as they are stored in memory. For tion type in all kernels, we study the overhead of the routines example, the sgemv configuration says that for a computation for all the distribution types. Grey bars show the slowdown decomposition on X, the A matrix is decomposed on its Y imposed by the indexing overhead of the data distributions dimension and the C vector is decomposed on its X dimension. for each computation distribution configuration (left axis). Tiles in A and C are distributed across the GPUs in the X Lines represent the increase in number of executed instruc- dimension of the GPU grid while B vector is replicated. tions due to the extra operations performed on the indexes (right axis). BLOCK and CYCLIC are grouped (dark gray bar and 7. Performance evaluation solid line) as they perform very similar transformations on 7.1. 
Indexing overhead the indexes and the performance is virtually the same (±1%). BLOCK-CYCLIC (light Gray bar and dashed line) is consistently Table 1 shows that register utilization greatly varies depending the slowest implementation (up to 4.48×) and the one that on the used ndarray distribution implementation. The VM im- executes more instructions, too. This is caused both by the

Figure 6: Speedup over baseline for different computation decomposition configurations using reshape and VM implementations. Arrows point to the configuration chosen by the runtime system for each kernel. Results shown for 2/4 GPUs.

extra executed instructions and the lower achieved occupancy in the GPU due to the increased number of registers. Slowdowns are large for kernels in which thread blocks perform little work (convolution2D, reduction, saxpy, sort_merge_*, stencil2D, transpose and vecadd). Results for the VM implementation are not shown because performance is within ±5% of the baseline in all kernels.

7.2. Multi-GPU performance

Figure 6 shows the speedup achieved by AMGE on our multi-GPU system for all possible distribution configurations. Results are shown for the VM and reshape implementations of the ndarray data type, and for an ideal implementation with optimal data distribution (no remote accesses). Bars labeled with "Impl" show the geometric mean for the three implementations of the speedups achieved in each kernel by the best computation distribution configuration. The reshape implementation exhibits linear speedups (1.91× and 3.54× on average for 2 and 4 GPUs, respectively) for all kinds of distribution configurations in most kernels. The main exceptions are the X and XY configurations in stencil3D, due to remote memory accesses, and saxpy, vecadd, sort_merge_global and the XY configuration in stencil2D, due to the overhead of the indexing function. The VM implementation outperforms reshape in some configurations in which the array partitions are large enough not to suffer from imbalance due to the memory allocation granularity imposed by CUDA, but performs very poorly in the other configurations, producing lower speedups on average (1.02× and 1.27× for 2 and 4 GPUs). For example, in transpose at least one of the matrices must be decomposed along its X dimension, which is contiguous in memory, thus leading to an imbalanced data distribution. Therefore, performance is poor for VM in all configurations in transpose. Another example is stencil3D, for which the Y distribution configuration should provide reasonable performance since each plane of the volume can be distributed across GPUs. Nevertheless, the size of each plane still produces an imbalanced distribution. We study this example in more detail in Section 7.3.

AMGE's runtime system, on each kernel call, tries to choose the best computation and data distribution configuration, as discussed in Section 5.2. Figure 6 highlights with an arrow the configurations chosen by the implemented selection policy. The policy correctly selects the best performing configuration for most of the kernels. The average performance across all the benchmarks (i.e., bars labeled with "AMGE") is 1.98× and 3.89× for 2 and 4 GPUs, respectively, very close to ideal.

Figure 7: Memory requests served by remote GPUs.

7.3. Impact of remote accesses on performance

Figure 7 shows the percentage of accesses to RAM memory that are served by remote GPUs in all the kernels when they are distributed across 4 GPUs. reshape eliminates the need for remote accesses in most of the configurations. Only kernels in which computation partitions share some data (e.g., convolution2D, stencil{2,3}D) use them. The worst cases are the X and XY decompositions for stencil3D, in which 17.43% and 7.61% of accesses to memory are remote, respectively. This is the reason why these configurations show poor speedups in Figure 6. VM introduces a lot of remote accesses in many configurations due to the memory allocation granularity in CUDA.

Fighting data distribution imbalance in VM: The dimensions of the arrays in the stencil2D and stencil3D kernels make it difficult to evenly split them using 1 MB granularity, causing imbalance and, therefore, remote memory accesses. The computation distribution configuration that we study is Y. Using this configuration, the volumes of stencil3D are distributed by allocating partitions of each plane alternatively in different GPUs. The size of each plane (1K×1K + halos) produces an imbalanced distribution (2/1/1/1 MB for 4 GPUs). This results in excessive communication that limits the performance (0.87× and 0.91× for 2 and 4 GPUs). Adding padding to the X dimension of the volume to obtain a balanced distribution results in a 127.01× memory footprint increment. Having a 4 KB granularity would reduce this overhead to 1.49×. Using more friendly problem sizes that do not produce imbalance results in much improved performance, reaching linear speedups of 2.08× and 3.95× for 2 and 4 GPUs.

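A back-of-the-envelope check of the 2/1/1/1 MB split quoted above, assuming single-precision elements and a one-element halo on each side of the plane (the exact halo width is not spelled out in the paper):

  #include <cstdio>
  #include <cstddef>

  int main() {
      const size_t X = 1024 + 2, Y = 1024 + 2;           // 1K×1K plane + assumed halos
      const size_t plane_bytes = X * Y * sizeof(float);  // ~4.02 MB
      const size_t CHUNK = 1 << 20;                      // 1 MB CUDA mapping granularity
      const int GPUS = 4;

      size_t chunks = (plane_bytes + CHUNK - 1) / CHUNK; // 5 chunks of 1 MB
      // Round-robin chunk placement: GPU 0 gets 2 chunks, GPUs 1-3 get 1 each, so
      // GPU 0 also holds elements that other GPUs' thread blocks need remotely.
      for (int g = 0; g < GPUS; ++g)
          printf("GPU %d: %zu MB\n", g, chunks / GPUS + (g < (int)(chunks % GPUS)));
      return 0;
  }

With a 4 KB granularity the same plane would split into roughly a thousand chunks, making the per-GPU shares almost equal, which is the point made about finer mapping granularities.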
Benchmark       Conf   Kim [25]        Lee [27]        AMGE
convolution2D   X      38.7K×38.7K     38.7K×38.7K     77.4K×77.4K
                Y      38.7K×38.7K     38.7K×38.7K     77.4K×77.4K
                XY     38.7K×38.7K     38.7K×38.7K     77.4K×77.4K
fft1D           X      750M            750M            3G
reduction       X      2.99G           2.99G           11.99G
saxpy           X      6G              6G              6G
sgemm           X      31.6K×31.6K     31.6K×31.6K     44.7K×44.7K
                Y      44.7K×44.7K     31.6K×31.6K     44.7K×44.7K
                XY     38.7K×38.7K     31.6K×31.6K     48.9K×48.9K
sgemv           X      77.4K×77.4K     38.7K×38.7K     77.4K×77.4K
sort            X      750M            750M            3G
stencil2D       X      38.7K×38.7K     38.7K×38.7K     77.4K×77.4K
                Y      48.9K×48.9K     38.7K×38.7K     77.4K×77.4K
                XY     44.7K×44.7K     38.7K×38.7K     77.4K×77.4K
stencil3D       X      1.1K³           1.1K³           1.8K³
                Y      1.3K³           1.1K³           1.8K³
                XY     1.2K³           1.1K³           1.8K³
transpose       X      48.9K×48.9K     38.7K×38.7K     77.4K×77.4K
                Y      48.9K×48.9K     38.7K×38.7K     77.4K×77.4K
                XY     54.7K×54.7K     38.7K×38.7K     77.4K×77.4K
vecadd          X      4G              4G              4G

Table 2: Maximum problem size for a 4-GPU system in AMGE and in the related work.

Figure 8: Execution timeline of stencil2D for 4 GPUs: (a) imbalanced distribution; (b) balanced distribution; (c) balanced distribution + transposed thread block scheduling. Each panel plots remote read/write bandwidth (MB/s, left axis) and IPC (right axis) over the concatenated execution timelines of GPUs 0-3.
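The transposed thread block scheduling used for Figure 8c, described in Section 7.3 below, can be emulated with a few lines of kernel-side remapping (a sketch that assumes a 2D grid; the paper transposes the mapping of thread block identifiers, not the hardware scheduler itself):

  // Keep the hardware issue order, but swap the roles of the two grid axes when
  // computing the logical block identifier used for data indexing. The host
  // must launch with the grid dimensions swapped as well, e.g. dim3 grid(gy, gx).
  __device__ __forceinline__ uint2 logical_block_id() {
      return make_uint2(blockIdx.y, blockIdx.x);  // logical (x, y) = (hw.y, hw.x)
  }

Kernels then derive their tile coordinates from logical_block_id() instead of blockIdx, so blocks that are issued consecutively map to tiles that are consecutive in Y rather than X, spreading remote accesses over the whole execution.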
Figure 9: Overhead of the coherence mechanisms in AMGE and in the related work [25] (bars: Kim CPU-GPU transfer, Kim merge step, AMGE CPU-GPU transfer, AMGE remote access; time in milliseconds).

The larger size (16K×16K + halos) of the plane in stencil2D allows for a more balanced distribution of data (65/64/64/64 MB for 4 GPUs). However, there are still some effects on the performance. Figure 8a shows the memory bandwidth consumption due to remote loads and stores (left axis), and the number of instructions per cycle (right axis) during the execution of the stencil2D kernel for each GPU in the system. For the sake of clarity, we concatenate the execution timelines of the four GPUs, although they execute in parallel. GPU 0 does not perform any remote memory access and the IPC (instructions per cycle) remains stable during the kernel execution. GPUs 1, 2 and 3 perform remote memory accesses to the previous GPU memories (note that the imbalance increases with the GPU identifier). Remote accesses degrade the performance of the GPU as reflected in the lower IPC. Using padding to obtain a fully balanced array distribution increases the memory footprint by 7.99×. We reduce the imbalance by offsetting the beginning of the array so that GPUs 0 and 3, and GPUs 1 and 2 perform the same amount of remote memory accesses (Figure 8b). The improved balancing lowers the remote memory bandwidth consumption (2.5 GBps vs 4 GBps), and the period in which remote accesses are performed is shorter. This results in an 8.2% execution speedup on the slowest GPU.

Reducing instantaneous bandwidth demands: In stencil2D, memory accesses concentrate at the beginning/end of the kernel execution because the default thread block scheduler in the GPU issues thread blocks that are contiguous in the X dimension in order. Thus, thread blocks that access the boundaries of the matrix partitions tend to execute concurrently, increasing the instantaneous bandwidth demands and reducing the achieved IPC. We evaluate the performance of a thread block scheduler that issues thread blocks that are contiguous in the Y dimension instead. We emulate it by transposing the mapping of thread block identifiers on the matrices. Figure 8c shows that, using this scheduler, remote accesses are distributed throughout the whole kernel execution, thus reducing the instantaneous bandwidth demands (200 MBps). Now, the IPC is not affected since the cost of remote accesses is hidden with the execution of other thread blocks that only perform local accesses, reducing execution time by 5.8%.

7.4. Comparison with previous works

Memory footprint overhead. AMGE performs much more space-efficient data decompositions than previous works. We quantify the benefits of AMGE over Kim et al. [25] and Lee et al. [27] by comparing the maximum problem size that can be executed by the three solutions on our 4-GPU system. Table 2 shows that AMGE is able to run bigger problem sizes than previous works for most benchmarks, especially on those that work on multi-dimensional arrays, satisfying one of the major motivations for using multiple GPUs.

Coherence overhead: AMGE ensures coherence by not replicating output arrays and using remote memory accesses when needed. The related work relies on replication and a merge step after kernel execution. Besides, in both solutions, data needs to be copied from CPU to GPU memories before

10 smaller regions of the arrays to be resident in each GPU mem- Language/Library-based transparent multi-GPU exe- ory. The overhead of the merge step is even larger compared cution: GlobalArrays [29] (GA) is a library that allows exe- with the cost of remote accesses. The most extreme case is the cution of computations across a distributed system using an sort benchmark, since it executes kernels iteratively and the array-like interface, and GA-GPU [36] extends GA to GPUs. merge step needs to be performed after each kernel call. The memory model in GA-GPU allows memory accesses to be ordered and, hence, it does not fit into bulk synchronous SPMD 8. Related Work programming models such as CUDA or OpenCL. Therefore, Program auto-parallelization in shared-memory NUMA GA-GPU recommends the utilization of GA data-parallel prim- systems: High Performance Fortran[26] provides primitives to itives (at the cost of lower performance due to the overhead distribute data and implements the owner-computes rule [12], of launching one kernel for each of the primitive operations). that schedules loop iterations in such a way that communica- X10 [14] and Habanero [13] present the programmer with a tion is minimized. AMGE relieves programmers from speci- single partitioned global address space. The compiler and fying the distribution of data. Performance degradation due the runtime system transparently redirect remote memory ac- to remote access is much larger in GPUs than in CPU NUMA cesses to the proper memory. Sequoia [15] tries to address systems, and bad programmer choices might lead to large slow- the problem of programming systems with different memory downs. Therefore, providing a system that accomplishes this topologies/hierarchies. Programs are composed of two parts: without programmer intervention is key for GPUs. Other pro- (a) an algorithmic representation of the computation using a posals exploit architectural mechanisms to implement dynamic C-like programming language that decomposes data structures memory distribution policies (e.g., first touch placement) or and defines how to map the computation on them; and (b) a data migration [11, 30, 28]. Nevertheless, current GPUs do mapping of the algorithm to the specific system using a declar- not provide the necessary mechanisms (e.g., user-managed ative language. PGAS languages require a complete rewrite of memory protection or the appropriate performance counters) the program, while AMGE requires minor code modifications. to implement these proposals. We rely on compiler analysis to MAGMA [38] and some other libraries take advantage of minimize inter-GPU communication. multi-GPU execution but only for a limited set of functions. Compiler-based transparent multi-GPU execution: Arrays in GPUs: Thrust [9] is a C++ library that provides a Kim et al. [25] introduce an OpenCL framework that com- 1D array container (vector) and a number of pre-defined algo- bines multiple GPUs and treats them as a single compute rithms and map/reduce primitives. ArrayFire [6] is a C/C++/ device. In order to split data, they compute the array ranges Fortran library that provides abstractions for multidimensional accessed by computation partitions by performing sampling arrays and a number of libraries that use them (e.g., data anal- runs of the kernels on the CPU. The runtime system chooses ysis, linear algebra, image and signal processing). However, the computation decomposition that minimizes the size of the arrays cannot be used in custom functions. 
Compiler-based transparent multi-GPU execution: Kim et al. [25] introduce an OpenCL framework that combines multiple GPUs and treats them as a single compute device. In order to split data, they compute the array ranges accessed by each computation partition by performing sampling runs of the kernels on the CPU. The runtime system then chooses the computation decomposition that minimizes the size of the data transfers between CPU and GPU memories. However, this only works for kernels in which array references are affine functions of the thread and thread block identifiers; otherwise, they fall back to replication. Even in cases where data can be decomposed, any decomposition not performed on the array's highest-order dimension produces tiles whose memory address ranges overlap, thus replicating large portions of the array in all memories. Array regions that are potentially modified by different computation partitions need to be merged after kernel execution. Lee et al. [27] extend the same idea to heterogeneous systems with CPUs and GPUs. They do not use sampling runs on the CPU and generate a merge kernel that is more efficient than that of Kim et al., although both solutions require the merge step to be executed on the CPU, thus increasing the CPU↔GPU traffic. AMGE uses a similar approach, but its compiler analysis exploits the array dimensionality information provided by the ndarray type. Thanks to this information and the use of remote memory accesses, AMGE avoids replication in most cases, enabling bigger problem sizes and minimizing unnecessary CPU↔GPU communication. Moreover, AMGE adds support for cyclic data distributions and exploits the hardware support required by codes that use atomic operations and global memory fences.
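The replication caused by non-highest-order decompositions can be seen with a small host-side example (purely illustrative, not AMGE code): for a row-major 2-D array, a split along the highest-order (row) dimension yields disjoint linear address ranges, whereas a column-wise split yields ranges that cover most of the array and therefore force range-based schemes to replicate it.

#include <cstdio>

struct Range { long first, last; };   // inclusive element offsets, row-major layout

Range tile_range(long cols, long r0, long r1, long c0, long c1) {
    return { r0 * cols + c0, (r1 - 1) * cols + (c1 - 1) };
}

int main() {
    long rows = 4, cols = 8;
    // Split along the highest-order dimension (rows): disjoint ranges.
    Range top = tile_range(cols, 0, 2, 0, cols);
    Range bot = tile_range(cols, 2, 4, 0, cols);
    // Split along the lowest-order dimension (columns): overlapping ranges.
    Range left  = tile_range(cols, 0, rows, 0, 4);
    Range right = tile_range(cols, 0, rows, 4, cols);
    std::printf("row split:    [%ld,%ld] [%ld,%ld]\n", top.first, top.last, bot.first, bot.last);
    std::printf("column split: [%ld,%ld] [%ld,%ld]\n", left.first, left.last, right.first, right.last);
    return 0;
}

With rows = 4 and cols = 8, the row-wise tiles map to elements [0,15] and [16,31], while the column-wise tiles map to [0,27] and [4,31], overlapping on most of the array.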

Language/Library-based transparent multi-GPU execution: GlobalArrays [29] (GA) is a library that allows execution of computations across a distributed system using an array-like interface, and GA-GPU [36] extends GA to GPUs. The memory model in GA-GPU allows memory accesses to be ordered and, hence, does not fit into bulk-synchronous SPMD programming models such as CUDA or OpenCL. Therefore, GA-GPU recommends the use of GA data-parallel primitives, at the cost of lower performance due to the overhead of launching one kernel for each primitive operation. X10 [14] and Habanero [13] present the programmer with a single partitioned global address space; the compiler and the runtime system transparently redirect remote memory accesses to the proper memory. Sequoia [15] addresses the problem of programming systems with different memory topologies and hierarchies. Programs are composed of two parts: (a) an algorithmic representation of the computation, written in a C-like programming language, that decomposes data structures and defines how to map the computation onto them; and (b) a mapping of the algorithm to the specific system, written in a declarative language. PGAS languages require a complete rewrite of the program, while AMGE requires only minor code modifications. MAGMA [38] and some other libraries take advantage of multi-GPU execution, but only for a limited set of functions.

Arrays in GPUs: Thrust [9] is a C++ library that provides a 1D array container (vector) and a number of pre-defined algorithms and map/reduce primitives. ArrayFire [6] is a C/C++/Fortran library that provides abstractions for multidimensional arrays and a number of libraries that use them (e.g., data analysis, linear algebra, image and signal processing); however, these arrays cannot be used in custom functions. Microsoft offers multidimensional arrays in the C++ AMP [2] programming model; they can be freely used in all kinds of computations and are accessed using the regular subscript notation. However, none of these solutions supports automatic multi-GPU execution and, therefore, all of them could benefit from AMGE.
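For reference, a minimal example of the Thrust 1D container and map/reduce primitives mentioned above (standard Thrust usage compiled with nvcc; this snippet is illustrative and not part of AMGE):

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <cstdio>

int main() {
    thrust::device_vector<float> x(1 << 20, 1.0f);   // 1D array container on the GPU
    thrust::device_vector<float> y(1 << 20, 2.0f);
    thrust::device_vector<float> z(1 << 20);

    // Map: element-wise sum, executed on a single GPU.
    thrust::transform(x.begin(), x.end(), y.begin(), z.begin(),
                      thrust::plus<float>());
    // Reduce: sum of all elements.
    float total = thrust::reduce(z.begin(), z.end(), 0.0f,
                                 thrust::plus<float>());
    std::printf("total = %f\n", total);
    return 0;
}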
9. Conclusions and Future Work

Modern GPUs provide mechanisms that enable efficient auto-parallelization systems. In this paper we introduce AMGE, a programming interface, compiler support, and runtime system that enables multi-GPU execution of computations written for a single GPU. Thanks to remote memory accesses, AMGE imposes much lower memory footprint and coherence (i.e., remote access) overheads than previous works. We also demonstrate that transparent data distribution can be efficiently implemented on current GPUs using the UVAS and compiler/runtime-assisted code versioning. Using the array data type provided by AMGE also results in shorter and cleaner code. AMGE achieves almost linear speedups for most of the benchmarks on a real 4-GPU system with an interconnect of moderate bandwidth. Further performance improvements can be achieved by reducing the virtual memory mapping granularity exposed by CUDA and by allowing programmers to tune the thread block scheduling policy.

We believe that AMGE could be used in future systems, such as NVIDIA Pascal boards, to automatically scale the performance of GPU kernels to multiple GPUs. We plan to extend our evaluation to irregular computation patterns.

References

[1] TOP500 list - June 2014. http://top500.org/list/2014/06/.
[2] C++ AMP: C++ Accelerated Massive Parallelism, 2012.
[3] NVIDIA GPUDirect. https://developer.nvidia.com/gpudirect, 2012.
[4] APU 101: All about AMD Fusion Accelerated Processing Units, 2013.
[5] Tegra K1 Next-Gen Mobile Processor, 2014.
[6] AccelerEyes. ArrayFire, 2012.
[7] Jennifer M. Anderson, Saman P. Amarasinghe, and Monica S. Lam. Data and computation transformations for multiprocessors. PPoPP '95, 1995.
[8] M. Araya-Polo, J. Cabezas, M. Hanzich, M. Pericas, F. Rubio, I. Gelado, M. Shafiq, E. Morancho, N. Navarro, E. Ayguadé, J.M. Cela, and M. Valero. Assessing accelerator-based HPC reverse time migration. TPDS, 22(1), 2011.
[9] Nathan Bell. Thrust: A parallel template library for CUDA, 2009.
[10] Laura Susan Blackford, J. Choi, A. Cleary, A. Petitet, R. C. Whaley, J. Demmel, I. Dhillon, K. Stanley, J. Dongarra, S. Hammarling, G. Henry, and D. Walker. ScaLAPACK: A portable linear algebra library for distributed memory computers - design issues and performance. SC '96, 1996.
[11] William J. Bolosky, Michael L. Scott, Robert P. Fitzgerald, Robert J. Fowler, and Alan L. Cox. NUMA policies and their relation to memory architecture. ASPLOS IV, 1991.
[12] David Callahan and Ken Kennedy. Compiling programs for distributed-memory multiprocessors. The Journal of Supercomputing, 2(2), 1988.
[13] Vincent Cavé, Jisheng Zhao, Jun Shirako, and Vivek Sarkar. Habanero-Java: The new adventures of old X10. PPPJ '11, 2011.
[14] Philippe Charles, Christian Grothoff, Vijay Saraswat, Christopher Donawa, Allan Kielstra, Kemal Ebcioglu, Christoph von Praun, and Vivek Sarkar. X10: An object-oriented approach to non-uniform cluster computing. OOPSLA '05, 2005.
[15] Kayvon Fatahalian, Daniel Reiter Horn, Timothy J. Knight, Larkhoon Leem, Mike Houston, Ji Young Park, Mattan Erez, Manman Ren, Alex Aiken, William J. Dally, and Pat Hanrahan. Sequoia: Programming the memory hierarchy. SC '06, 2006.
[16] Michael Garland, Scott Le Grand, John Nickolls, Joshua Anderson, Jim Hardwick, Scott Morton, Everett Phillips, Yao Zhang, and Vasily Volkov. Parallel computing experiences with CUDA. IEEE Micro, 28(4), July 2008.
[17] Isaac Gelado, John E. Stone, Javier Cabezas, Sanjay Patel, Nacho Navarro, and Wen-mei W. Hwu. An asymmetric distributed shared memory model for heterogeneous parallel systems. ASPLOS XV, 2010.
[18] Peter N. Glaskowsky. NVIDIA's Fermi: The First Complete GPU Computing Architecture, 2009.
[19] M. Gupta and P. Banerjee. Demonstration of automatic data partitioning techniques for parallelizing compilers on multicomputers. TPDS, 3(2), March 1992.
[20] IMPACT Group. Parboil benchmark suite. http://impact.crhc.illinois.edu/parboil.php, 2012.
[21] Intel Corporation. Ivy Bridge Architecture, 2011.
[22] Thomas B. Jablin, Prakash Prabhu, James A. Jablin, Nick P. Johnson, Stephen R. Beard, and David I. August. Automatic CPU-GPU communication management and optimization. PLDI '11, 2011.
[23] S.W. Keckler, W.J. Dally, B. Khailany, M. Garland, and D. Glasco. GPUs and the future of parallel computing. IEEE Micro, 31(5), 2011.
[24] The Khronos Group Inc. The OpenCL Specification, 2013.
[25] Jungwon Kim, Honggyu Kim, Joo Hwan Lee, and Jaejin Lee. Achieving a single compute device image in OpenCL for multiple GPUs. PPoPP '11, 2011.
[26] Charles H. Koelbel, David B. Loveman, Robert S. Schreiber, Guy L. Steele, Jr., and Mary E. Zosel. The High Performance Fortran Handbook. 1994.
[27] Janghaeng Lee, Mehrzad Samadi, Yongjun Park, and Scott Mahlke. Transparent CPU-GPU collaboration for data-parallel kernels on heterogeneous systems. PACT '13, 2013.
[28] Jaydeep Marathe, Vivek Thakkar, and Frank Mueller. Feedback-directed page placement for ccNUMA via hardware-generated memory traces. Journal of Parallel and Distributed Computing, 70(12), 2010.
[29] Jaroslaw Nieplocha, Robert J. Harrison, and Richard J. Littlefield. Global Arrays: A portable "shared-memory" programming model for distributed memory computers. SC '94, 1994.
[30] D.S. Nikolopoulos, T.S. Papatheodorou, C.D. Polychronopoulos, J. Labarta, and E. Ayguadé. User-level dynamic page migration for multiprogrammed shared-memory multiprocessors. ICPP 2000, 2000.
[31] NVIDIA Corporation. CUDA C Programming Guide, 2013.
[32] Steven L. Scott. Synchronization and communication in the T3E multiprocessor. ASPLOS VII, 1996.
[33] John E. Stone, David J. Hardy, Ivan S. Ufimtsev, and Klaus Schulten. GPU-accelerated molecular modeling coming of age. Journal of Molecular Graphics and Modelling, 29(2), 2010.
[34] John A. Stratton, Christopher I. Rodrigues, I-Jui Sung, Li-Wen Chang, Nasser Anssari, Geng (Daniel) Liu, Wen-mei W. Hwu, and Nady Obeid. Algorithm and data optimization techniques for scaling to massively threaded systems. IEEE Computer, 45(8), 2012.
[35] Ivan Tanasic, Lluís Vilanova, Marc Jordà, Javier Cabezas, Isaac Gelado, Nacho Navarro, and Wen-mei W. Hwu. Comparison based sorting for systems with multiple GPUs. GPGPU-6, 2013.
[36] Vinod Tipparaju and Jeffrey S. Vetter. GA-GPU: Extending a library-based global address space programming model for scalable heterogeneous computing systems. CF '12, 2012.
[37] Jonas Tölke. Implementation of a Lattice Boltzmann kernel using the Compute Unified Device Architecture developed by NVIDIA. Computing and Visualization in Science, 13(1), 2010.
[38] S. Tomov, R. Nath, H. Ltaief, and J. Dongarra. Dense linear algebra solvers for multicore with GPU accelerators. IPDPS 2010, April 2010.
