CUDArrays: transparent multi-GPU computation in NCC-NUMA GPU systems
Anonymous Review

Abstract—Many dense scientific computations can benefit from GPU execution. However, large applications might require the utilization of multiple GPUs to overcome physical memory size constraints. On the other hand, as integrated GPUs improve, multi-GPU nodes are expected to be ubiquitous in future HPC systems. Therefore, applications must be ported to exploit the computational power of all the GPUs on such nodes. Nevertheless, multi-GPU programming currently requires programmers to explicitly partition data and computation. Although some solutions have been proposed, they rely on specialized programming languages or compiler extensions, and do not exploit advanced features of modern GPUs like remote memory accesses.
In this paper we present CUDArrays, an interface for current and future multi-GPU systems that transparently distributes data and CUDA kernel computation across all the GPUs in a node. The interface is based on an N-dimensional array data type that also makes CUDA kernels more readable and concise. We propose four different implementation approaches and evaluate them on a 4-GPU system. We also perform a scalability analysis to project performance results for up to 16 GPUs. Our interface relies on peer-to-peer DMA technology to transparently access data located on remote GPUs. We modify a number of GPU workloads to use our data structure. Results show that linear speedups can be achieved for most of the computations if the correct partitioning strategy is chosen. Using an analytical model we project that most computations can scale up to 16 GPUs.

I. INTRODUCTION

Current HPC systems built using commodity microprocessors commonly install two or four CPU chips in each node [1]. Many systems also install GPUs to further accelerate computations rich in data parallelism. As CPU and GPU are integrated into the same chip, we can expect several GPUs per node to be common in future HPC systems. Hence, applications will have to exploit all the GPUs present in each node to achieve optimal performance.

Commercial GPU programming models, such as CUDA [2] and OpenCL [3], make multi-GPU programming a tedious and error-prone task. GPUs are typically exposed as external devices with their own private memory. Programmers are responsible for splitting both data and computation across GPUs by hand. While some solutions have been proposed [4], [5], [6], [7], they require programmers to rewrite their programs using specialized programming languages or compiler extensions. Moreover, most of them do not exploit advanced features of modern GPUs like remote memory accesses.

In this paper we present a data and computation partitioning interface for dense data structures, i.e., multi-dimensional arrays, that simplifies the programming of current and future multi-GPU systems (either discrete or integrated). Our proposal assumes non-coherent memory accesses between GPUs through a relatively low-bandwidth interconnect (Non-Cache-Coherent Non-Uniform Memory Access, or NCC-NUMA), such that all GPUs can access any partition of the logical data structure, providing the illusion of programming for a single-GPU node. For future systems with integrated GPUs, we envision system architecture support such as the Heterogeneous System Architecture (HSA) [8] that enables GPUs in the same node to access each other's physical memory. In systems with discrete GPUs we rely on the peer-to-peer (P2P) DMA technology [9] that enables one GPU to access the memory of any other GPU connected to the same PCIe bus. We present how a transparent multi-GPU data and computation distribution scheme can be encapsulated in an N-dimensional array (ND-array) abstraction, and introduce four different implementation techniques for this abstraction using C++ and CUDA. This interface not only makes it possible to distribute data and computation across GPUs but also makes the CUDA kernels more concise and readable. We modify a set of GPU dense-computation benchmarks, originally supporting only single-GPU systems, to use our ND-array data structure. After these minor modifications, we run the benchmarks on a multi-GPU system to evaluate the implementation alternatives and analyze the scalability of our approach. Experimental results show that linear speedups can be achieved for most of the computations if the correct partitioning strategy is chosen. Using an analytical model we project that most computations can scale up to 16 GPUs.

The main contributions of this paper are (1) a simple interface to transparently distribute computation and data across several GPUs, (2) an analysis of different implementation techniques to efficiently partition dense data structures across GPUs, and (3) an evaluation of the performance of the data distribution techniques using a wide range of GPU workloads.

The rest of this paper is organized as follows. Section II introduces our base architecture and discusses some challenges found in multi-GPU programming. Section III discusses trade-offs for automated data and computation distribution. Section IV presents our framework to ease the development of multi-GPU applications and examines different implementation techniques. The experimental methodology is presented in Section V, and Section VI evaluates our proposed framework. We discuss the related work in Section VII. Finally, we conclude the paper in Section VIII.

II. BACKGROUND AND MOTIVATION

This section presents the base multi-GPU architecture targeted in this paper and introduces the necessary terminology. Since we focus on the programmability within a node, we refer to it as the "system" in the rest of the paper. We also discuss some programmability issues present in multi-GPU systems.

Figure 1: Multi-socket integrated CPU/GPU system.

  Interconnect              Latency     Bandwidth
  DDR3 (CPU, 1 channel)     ~50 ns      ~30 GBps
  GDDR5 (GPU)               ~500 ns     ~200 GBps
  HyperTransport 3          ~40 ns      25+25 GBps
  QPI                       ~40 ns      25+25 GBps
  PCI Express 2.0           ~200 ns     8+8 GBps
  PCI Express 3.0           ~200 ns     16+16 GBps
Table I: Interconnection network characteristics.

Figure 2: Multi-GPU architecture used in this paper.

A. Non-Coherent NUMA architectures

Figure 1 shows the base multi-GPU system architecture assumed in this paper. The system has one or several chips that contain both CPU and GPU cores. CPU cores implement out-of-order pipelines that allow them to execute complex control code efficiently. GPU cores provide wide vector units and a highly multi-threaded execution model better suited for codes rich in data parallelism (e.g., dense scientific computations). Each chip is connected to one or more memory modules, but all cores can access any memory module in the system. Accesses to remote memories have longer access latency than local accesses, thus forming a NUMA system. CPU cores typically access memory through a coherent cache hierarchy, while GPUs use much weaker consistency models that do not require coherence among cores. Therefore, in order to support these two different memory organization schemes, both coherent and non-coherent interconnects are provided for inter-chip memory accesses. This NCC-NUMA system architecture has been successfully used in the past (e.g., Cray T3E [10]). Using this kind of organization allows computation to be easily distributed across the GPU cores, like in current cache-coherent CPU-only NUMA systems (e.g., SMP). However, as in any NUMA system, remote accesses must be minimized to avoid their longer access latency and lower memory bandwidth. This paper presents a software framework which allows programmers to easily exploit NCC-NUMA multi-GPU systems in dense scientific computations. We evaluate this framework on top of an existing commercial system that implements most of the features of an NCC-NUMA system (Figure 2). In this paper we focus on the GPUs and the inter-GPU communication.

B. Multi-GPU systems

Modern NVIDIA GPUs allow programs running on the GPU to access most memories in the system [2]. Memory bandwidths and latencies for the different interconnects that can be found in this kind of system are summarized in Table I. Accesses to host memories from the GPU (2) go through the PCIe interconnect and the CPU memory controller. If the target address resides in memory connected to a different CPU socket, the inter-CPU interconnect (HyperTransport/QPI) must be traversed, too. Support for peer-DMA memory accesses was added to NVIDIA Tesla devices in the Fermi family [11]. This feature allows GPUs to directly access the memory on other GPUs through the PCIe interconnect (3). It was coupled with the Unified Virtual Address Space (UVAS), which ensures that a virtual address range belongs to a single physical device and, therefore, makes it easy to route application-level data requests to the proper physical memory. Hence, code in CUDA programs can transparently access any memory in the system through regular pointers. Vendors that support less-featureful programming interfaces like OpenCL do not explicitly export the mechanisms to perform remote memory accesses, but such accesses could be used internally in the implementation of such interfaces. While the execution model provided by GPUs (discussed in detail in the next subsection) can hide large memory latencies, both host memory and the inter-GPU interconnects (e.g., PCI Express) deliver a memory bandwidth which is an order of magnitude lower than the local GPU memory (GDDR5). The main limitation found in current systems is the incompatibility between the protocols of the inter-CPU interconnect and the peer-memory access. Therefore, our test platform uses a single CPU and several discrete GPUs.

Current integrated CPU/GPU chips are less popular in HPC systems due to their lower memory bandwidth. However, many vendors have already implemented support for general purpose computations in integrated GPUs. For example, since the Sandy Bridge family, integrated GPUs from Intel support OpenCL programs [12]. AMD also introduced the Fusion architecture, which supports OpenCL and C++ AMP [13]. In Fusion devices, general purpose cores and GPU cores share the last level of the memory cache hierarchy, but a non-coherent interconnect is provided to achieve full memory bandwidth. Depending on the characteristics of the memory allocation (e.g., accessible by both CPU and GPU, or by the GPU only), GPU cores use the coherent or the non-coherent interconnect, respectively. As the integration technology improves, these chips are expected to be dominant in the future.
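As a reference, the following minimal sketch (our illustration, not code from the paper) shows the two CUDA features described above working together: peer access is enabled between two discrete GPUs, and a kernel running on GPU 0 dereferences a pointer that physically resides on GPU 1, relying on the unified virtual address space to route the accesses. Error checking is omitted for brevity.

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void read_remote(const float *remote, float *local, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) local[i] = remote[i];   // load travels over PCIe to the peer GPU
    }

    int main() {
        const int n = 1 << 20;
        float *buf0, *buf1;

        cudaSetDevice(1);
        cudaMalloc(&buf1, n * sizeof(float));   // physically resides on GPU 1

        cudaSetDevice(0);
        cudaMalloc(&buf0, n * sizeof(float));   // physically resides on GPU 0
        cudaDeviceEnablePeerAccess(1, 0);       // let GPU 0 access GPU 1's memory

        // Thanks to the UVAS, buf1 is a valid pointer when dereferenced on GPU 0.
        read_remote<<<(n + 255) / 256, 256>>>(buf1, buf0, n);
        cudaDeviceSynchronize();
        printf("done\n");
        return 0;
    }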

C. SPMD Programming Model

GPUs are typically programmed using a Single Program Multiple Data (SPMD) programming model, such as NVIDIA CUDA [2] or OpenCL [3]. This model allows programmers to spawn a large number of threads that execute the same program, although each thread might take a completely different control flow path. All these threads are organized into a computational grid of groups of threads (i.e., thread blocks in CUDA, work groups in OpenCL). Each thread block has a block identifier and each thread has an identifier within the thread block, which can be used by programmers to map the computation to the data structures (3D identifiers are provided to ease the mapping on N-dimensional data structures). Both CUDA and OpenCL provide very weak consistency models: memory updates performed by a thread block do not have to be visible to other thread blocks (except when using atomic instructions). Therefore, thread blocks can be independently executed.

Each thread block is scheduled to run on a GPU core, and threads within a thread block are issued in fixed-length groups (i.e., warps in CUDA). Each GPU core provides fine-grained multi-threading hardware to context switch among the different warps being executed on the same GPU core. This context switching happens on data dependencies with long-latency instructions (e.g., misses in the L1 cache). This scheme allows the GPU core to hide long-latency operations by concurrently executing many threads. Threads within the same thread block can communicate through a shared memory region or using synchronization instructions (e.g., __syncthreads). The number of thread blocks that can execute concurrently on the same GPU core depends on the number of threads and other resources needed by each block (e.g., shared memory and registers). Hence, the utilization of these resources must be carefully managed to achieve full utilization of the GPU.

D. Multi-GPU programming

The CUDA device language [2] is not aware of multi-GPU execution. Thread block identifiers are local to each GPU and memory allocations are bound to a single GPU. Moreover, although the CUDA programming model exposes to programmers a UVAS and provides support for remote accesses, there is no mechanism to control how virtual addresses are mapped to the physical memories in the system. Hence, programmers typically split data structures by allocating a chunk of the structure in each GPU (obtaining non-contiguous virtual memory regions) and must explicitly use the appropriate pointers to access allocations located on remote GPUs. This is analogous to the traditional solutions proposed for distributed-memory systems like message passing (e.g., MPI [14]). This extra code harms the programmability of the kernel code and is likely to impose some overhead. Therefore, the most common way to divide the computation in CUDA is to access the local GPU memory only, thus replicating those memory regions that are accessed by more than one GPU. Changes to these shared regions must be propagated to the rest of the GPUs. OpenCL [3] introduces the concept of a global offset to ease the partitioning of a computation across different devices. Programmers can use the global id of the thread blocks to access the same allocation from different processors on shared-memory systems. However, on devices with distributed memories (like GPUs), data structures must still be manually split, distributed, and managed by the programmer.

Listing 1 shows the CUDA kernel code of a 5-point 2D stencil computation.

 1 void stencil2D(float *out, const float *in) {
 2     int tx = threadIdx.x + 1, ty = threadIdx.y + 1;
 3
 4     int id_x = blockIdx.x * B_X + tx;
 5     int id_y = blockIdx.y * B_Y + ty;
 6
 7     ssize_t size_x = gridDim.x * B_X, size_y = gridDim.y * B_Y;
 8
 9     int idx = (id_y * size_x) + id_x;
10
11     __shared__ float sh[B_Y + 2][B_X + 2];
12     sh[ty][tx] = in[idx];                     // Load central point
13
14     if (tx == 1) {                            // Load left/right "halo" points
15         sh[ty][0      ] = in[idx - 1];
16         sh[ty][B_X + 1] = in[idx + B_X];
17     }
18     if (ty == 1) {                            // Load up/down "halo" points
19         sh[0      ][tx] = in[idx - 1 * size_x];
20         sh[B_Y + 1][tx] = in[idx + B_Y * size_x];
21     }
22     __syncthreads();
23
24     out[idx] = sh[ty][tx] + (sh[ty][tx-1] + sh[ty][tx+1]) +
25                             (sh[ty-1][tx] + sh[ty+1][tx]);
26 }

Listing 1: 2D stencil computation on multiple GPUs using boundary exchanges in the host.

void stencil2D_peer(float *out, const float *in,
                    const float *in_up, const float *in_down) {
    ...
    if (tx == 1) {                            // Load left/right "halo" points
        sh[ty][0      ] = in[idx - 1];
        sh[ty][B_X + 1] = in[idx + B_X];
    }
    if (ty == 1) {                            // Load up/down "halo" points
        if (id_y == 1 && in_up != NULL) {
            sh[0][tx] = in_up[id_x];
        } else {
            sh[0][tx] = in[idx - 1 * size_x];
        }
        if (id_y == size_y - 1 && in_down != NULL) {
            sh[B_Y + 1][tx] = in_down[id_x];
        } else {
            sh[B_Y + 1][tx] = in[idx + B_Y * size_x];
        }
    }
    ...
}

Listing 2: 2D stencil computation on multiple GPUs using peer memory accesses.
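To make the "boundary exchanges in the host" of Listing 1 concrete, the following is a sketch (our illustration, not the paper's host code) of the copies the host would issue after every time step when each GPU owns a (rows_per_gpu + 2) x size_x subdomain with one halo row above and below.

    #include <cuda_runtime.h>

    // d_out[g] points to the subdomain owned by GPU g.
    void exchange_halos(float **d_out, int num_gpus, size_t size_x, size_t rows_per_gpu) {
        const size_t row_bytes = size_x * sizeof(float);
        for (int g = 0; g < num_gpus - 1; ++g) {
            // last interior row of GPU g -> top halo row of GPU g+1
            cudaMemcpyPeer(d_out[g + 1], g + 1,
                           d_out[g] + rows_per_gpu * size_x, g, row_bytes);
            // first interior row of GPU g+1 -> bottom halo row of GPU g
            cudaMemcpyPeer(d_out[g] + (rows_per_gpu + 1) * size_x, g,
                           d_out[g + 1] + size_x, g + 1, row_bytes);
        }
    }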
Stencil computations are common in iterative FDTD (finite-difference time-domain) simulations. In these types of iterative simulations the output of a step becomes an input of the following steps. The code in Listing 1 is written for one GPU, but it can be used to compute a subdomain of the problem if domain decomposition is used for multi-GPU execution, provided that the programmer manually handles data consistency among the subdomains (e.g., by copying data among GPUs). To compute the output at position i, j the 2D stencil computation uses the values within a one-element radius of the i, j position of the input array. Thus, an extra row of the input for each neighboring domain (also known as "halo") must be replicated in each GPU. In the next iteration the output is used as input and, therefore, the values at the domain boundaries that are needed by the neighbors must be appropriately copied by the host code.

In order to avoid the extra code and copies, peer-memory accesses can be used to access those rows that belong to the neighboring subdomains. Listing 2 shows the changes to the kernel needed to use peer-memory accesses. The function signature of the kernel reflects that three pointers must be used for the input data instead of one: one for the allocation of the GPU on which the code is running, and pointers to the data of the neighbouring subdomains that are necessary to compute the output at the boundaries. Moreover, when the halo points (lines 14-21 in Listing 1) are located on a remote GPU, the appropriate pointer must be used. Nevertheless, the modifications shown in this example are relatively simple because a specific data partitioning layout is assumed (the 2D matrices are divided along the highest-order dimension and rows cannot be split). A more generic implementation should check where data is located on every access to the arrays.

III. DATA PARTITIONING IN DENSE SPMD CODES

In this section we discuss how to distribute computation and data across the GPUs in the system in an efficient way. We decouple computation and data distribution and discuss them independently. However, these two problems are deeply interrelated in terms of system performance. The perfect distribution scheme would place the dataset associated to any computation partition in the same physical GPU, thus completely removing the need for remote memory accesses. Such a simple rule presents many caveats because, in many cases, it is not possible to unequivocally determine which parts of the dataset are accessed by each computation unit (i.e., thread). Even if we assume this information is available, determining an optimal distribution approach is an NP problem whose solution space is unmanageable. Furthermore, parts of the dataset may be accessed by multiple threads, so data replication might be needed to provide good performance. There is a trade-off between the extra storage requirements of data replication and the performance losses due to remote memory accesses. As a general rule, replication should be avoided unless the performance is severely affected, i.e., the number of remote memory accesses is very high.

In this paper we assume no knowledge about the data access pattern performed by the GPU kernels. Hence, we aim to find a small set of partitioning schemes that deliver reasonable performance for the most common patterns found in dense scientific computations. Although compiler analysis might provide useful information, it is out of the scope of this paper.

A. SPMD Dense Scientific Codes

In this paper we target dense scientific computations where both the input and output datasets are represented as multi-dimensional arrays (i.e., nd-arrays). Some examples of these codes are Finite Difference Time Domain (FDTD) computations, Reverse Time Migration (RTM) seismic imaging, or Lattice Boltzmann Methods (LBM) for computational fluid dynamics. We first analyze the most common data access patterns in these codes to later define static partitioning schemes that will produce high multi-GPU performance in those codes exhibiting such data access patterns.

The SPMD paradigm gives complete freedom to the programmer regarding which data is accessed by each thread. However, quite often programmers write the code in such a way that accesses to nd-arrays are computed as an affine transformation (i.e., linear combination) of the thread indexes. As a result, threads with contiguous identifiers in one dimension (e.g., x) tend to access elements which are contiguous in the same or a different (e.g., y) dimension of the nd-array. There are two main reasons for SPMD dense scientific codes to use this kind of index transformation. First, affine transformations provide an easy and convenient way for programmers to map threads to parts of the arrays. Second, the GPU hardware requires threads with contiguous identifiers (i.e., a warp) to access contiguous memory locations to achieve high memory bandwidth.

Affine transformations used to produce array indexes from thread indexes typically increase monotonically: thread blocks with contiguous indexes in a given dimension (e.g., Bx,y,z and Bx,y,z+1) are likely to access parts of the arrays that are also contiguous in some dimension. This is the most commonly found pattern because the global thread index (i.e., the index of the thread within the computational grid) is usually computed using the block identifier. This pattern is also favored because the GPU hardware schedules contiguous thread blocks on the same compute core to maximize the reuse of data in the L1 cache.

Another pattern commonly found in dense scientific applications is the utilization of read-only data structures whose elements are accessed by many or all threads. Examples of these data structures are coefficient matrices in stencil and convolution computations, or input matrices for some linear algebra computations such as matrix-matrix multiplications. The access patterns to these data structures heavily depend on the specific computation and, thus, a common pattern cannot be easily inferred.

B. Computation Partitioning

Figure 3: Different computation partitioning schemes for a 2D grid. Each shade of gray represents a different partition; smaller squares represent thread blocks and threads.

The CUDA model imposes the thread block as the smallest granularity for computation partitioning. Individual threads cannot be independently executed because the programming model guarantees that threads within a warp are executed in lockstep. Moreover, threads within a thread block share resources (e.g., shared memory), and support barrier synchronization operations. This requires partitioning schemes to ensure that all threads within a thread block are executed on the same compute core.

Figure 3 illustrates our computation distribution approach. We group thread blocks in as many evenly-sized sets as GPUs are present in the system. Each set is composed of thread blocks whose indexes are contiguous in a given dimension. This scheme offers the programmer/runtime system as many degrees of freedom as dimensions in the computational grid of the SPMD model (i.e., three in CUDA and OpenCL) when deciding how the computation gets split across the GPUs in the system. These computation partitioning schemes are arbitrary, but we choose them for their implementation simplicity: global block identifiers can be easily calculated using a simple offset addition for each GPU and, thus, impose minimal overhead on the kernel code.
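The following sketch (our illustration, not the CUDArrays runtime) shows this kind of computation partitioning: the block grid is split along its y dimension into one contiguous set of blocks per GPU, and a per-GPU block offset stored in constant memory lets device code turn the local blockIdx into a global block identifier with a single addition.

    #include <cuda_runtime.h>

    __constant__ int3 block_offset;          // per-GPU block offset (one copy per device)

    __device__ int3 global_block_id() {      // the "simple offset addition" described above
        return make_int3(blockIdx.x + block_offset.x,
                         blockIdx.y + block_offset.y,
                         blockIdx.z + block_offset.z);
    }

    // Host side: launch one contiguous slice of the grid on each GPU.
    void launch_split_y(dim3 grid, dim3 block, int num_gpus,
                        void (*launch)(dim3 grid, dim3 block)) {
        unsigned per_gpu = (grid.y + num_gpus - 1) / num_gpus;
        for (int g = 0; g < num_gpus; ++g) {
            unsigned first = g * per_gpu;
            if (first >= grid.y) break;                      // fewer block rows than GPUs
            unsigned rows = (grid.y - first < per_gpu) ? (grid.y - first) : per_gpu;
            cudaSetDevice(g);
            int3 off = make_int3(0, static_cast<int>(first), 0);
            cudaMemcpyToSymbol(block_offset, &off, sizeof(off));
            launch(dim3(grid.x, rows, grid.z), block);       // wrapper doing kernel<<<...>>>
        }
    }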

C. Data Partitioning

The data partitioning problem is heavily influenced by the static computation partitioning approach previously described. Thread blocks commonly access elements whose indexes are the result of an affine transformation of the thread indexes. Hence, if arrays are partitioned using the same scheme as the computation, it is likely that thread blocks executing on a given GPU mostly access data hosted in its global memory.

However, oftentimes SPMD codes maximize memory bandwidth by either using a computational grid whose dimensionality does not match that of the memory structures (e.g., using a 2D computational grid on a 3D structure), or by having a different interrelation between the ordering of axes on the data structures and on the computational grid (e.g., the X axis of the computational grid corresponds to the Y axis of the structure). An illustrative example is the implementation of the 3D-stencil computation by Micikevicius [15], where all the threads traverse the dataset across the z dimension. To support these elaborate data access patterns, we decouple data and computation partitioning and let the programmer specify the dimensions on which to split each data structure.

Data partitioning can be implemented by evenly splitting arrays along any of their dimensions, and hosting each partition in one of the GPU memories in the system. On each memory access, our implementation determines whether the element being accessed is hosted in the memory local to the GPU executing the code or on a remote GPU, and sends the request to the corresponding memory. In Section IV we present different software and hardware approaches to efficiently implement this mechanism.

In some codes, the overhead of remote memory accesses can be so large that multi-GPU execution becomes slower than single-GPU execution. As previously discussed, this pattern mostly happens on read-only arrays (i.e., they are not modified by the GPU code), like in a matrix-matrix multiplication (where a GPU accesses its local partition for one of the matrices, but entirely accesses the other matrix) or in a convolution (where a small coefficient matrix is read by all threads). To achieve reasonable performance these arrays must be replicated on all GPU memories, so that all accesses to them will be local. Replication is only viable if there is enough available memory in all GPUs; otherwise we fall back to a distributed approach.

IV. CUDARRAYS IMPLEMENTATION

We introduce the interface of CUDArrays and discuss implementation techniques to efficiently support transparent computation and data distribution on multiple GPUs.

void stencil2D(dynarray<float, 2> out, const dynarray<float, 2> in) {
    int tx = threadIdx.x + 1, ty = threadIdx.y + 1;

    int id_x = get_global_block_id().x * B_X + tx;
    int id_y = get_global_block_id().y * B_Y + ty;

    __shared__ float sh[B_Y + 2][B_X + 2];
    sh[ty][tx] = in(id_y, id_x);              // Load central point

    if (tx == 1) {                            // Load left/right "halo" points
        sh[ty][0      ] = in(id_y, id_x - 1);
        sh[ty][B_X + 1] = in(id_y, id_x + B_X);
    }
    if (ty == 1) {                            // Load up/down "halo" points
        sh[0      ][tx] = in(id_y - 1, id_x);
        sh[B_Y + 1][tx] = in(id_y + B_Y, id_x);
    }
    __syncthreads();

    out(id_y, id_x) = sh[ty][tx] + (sh[ty][tx-1] + sh[ty][tx+1]) +
                                   (sh[ty-1][tx] + sh[ty+1][tx]);
}

Listing 3: 2D stencil computation on multiple GPUs using the cuda::dynarray abstraction.

A. API

We encapsulate the functionality of our interface in a data type used to represent ND-arrays. This data type is implemented by the cuda::dynarray<T, N> C++ template. The number of dimensions (N) and the type of the elements of the array (T) are statically defined by the programmer. The size of each dimension is specified in the class constructor. Programmers can also specify the per-dimension data partitioning and alignment schemes (although data partitioning is currently implemented for a single arbitrary dimension). Moreover, classes instantiated from this template provide an overloaded version of the function operator (operator()) to access the array using a notation similar to the array subscript operator (operator[]). For example, the position i, j, k of a 3D volume A is accessed as A(i, j, k). These accessors also hide the implementation details of the data distribution scheme from programmers.
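To illustrate what such an accessor hides, the following is a minimal, self-contained sketch of a dynarray-like 2D wrapper (our illustration, not the CUDArrays implementation): operator() performs the index linearization that flat-pointer CUDA code writes by hand, and the real accessor would additionally route the access to the GPU that hosts the element.

    #include <cstddef>

    template <typename T, unsigned Dims>
    class dynarray_sketch;                   // only the 2D case is sketched here

    template <typename T>
    class dynarray_sketch<T, 2> {
    public:
        __host__ __device__
        dynarray_sketch(T *data, size_t rows, size_t cols)
            : data_(data), rows_(rows), cols_(cols) {}

        // A(i, j) instead of A[i * cols + j], usable from both CPU and GPU code.
        __host__ __device__ T &operator()(size_t i, size_t j) {
            return data_[i * cols_ + j];
        }
        __host__ __device__ const T &operator()(size_t i, size_t j) const {
            return data_[i * cols_ + j];
        }

    private:
        T *data_;
        size_t rows_, cols_;
    };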
Furthermore, arrays instantiated from this class can be accessed by both CPU and GPU code. We also provide an option to replicate an array on all the GPUs.

For arrays whose dimensions are known at compile time the framework provides a similar C++ template which takes the extents of the data structure as template parameters (cuda::array<T, ...>). This implementation offers better performance when accessing the array, and a lower register usage count, since most of the index linearization computations can be resolved at compile time.

Using this API, the device code of the 2D stencil example used in Section II is shown in Listing 3. The only required changes are the use of the cuda::dynarray structure and the removal of the boundary checking. Since CUDA does not support specifying an offset for the block identifiers (OpenCL does support it), we provide the function get_global_block_id, which automatically computes the global block id using an offset in constant memory that is written by the runtime just before the kernel call. Note that index linearization is not required either, as it is automatically performed by the accessor function.

When the host code invokes a kernel, programmers can also specify the per-dimension computational partitioning scheme. If no computational partitioning scheme is provided by the programmer, the framework automatically uses all the available GPUs and partitions the computational grid across its highest-order dimension, although more advanced policies can be applied by inspecting the current partitioning schemes of the data structures used by the kernel.

B. Data partitioning

Data partitioning must offer efficient array indexing routines that impose as little overhead as possible on code execution. Our framework currently implements four different data partitioning techniques, each with increasing sophistication. For simplicity, we assume that arrays are partitioned along their highest-order dimension, but most of the presented techniques can be applied to partitionings on other dimensions (or combinations of them) with minor modifications of what is explained in this section.

1) Naive: This implementation splits the array into evenly-sized chunks which are allocated on each of the GPUs in the system (see Figure 4a). Using this approach, every time an array is accessed, both the allocation and the offset within the allocation must be determined for the given indexes. This operation starts with the linear index of the access (idx) and uses a division to compute the chunk index (chunk = idx / chunkSize) and a modulo to compute the offset within the chunk (off = idx % chunkSize). Then, the chunk is used to index the table of GPU allocations and the offset is used to index within the selected chunk (ptr_table[chunk][off]).

2) Precomputed pointer table: In order to reduce the potential overhead of the index and offset computation, we implement a second scheme that uses a precomputed pointer table (ptr_table), where each entry points to the beginning of each position of the dimension being partitioned on the array. Thus, the given index for that dimension (i) is used to index the pointer table, and the remaining indices are used to compute the linear index (idx) to be used within that dimension (ptr_table[i][idx], see Figure 4b). In order for this to work, all elements within an (N-1)-dimensional subarray must reside in the same GPU. Therefore, the allocation policy must be slightly tuned so that the arrays are partitioned using a granularity equal to the size of an (N-1)-dimensional subarray. For 1D arrays we use a ptr_table as big as the array (one pointer per element). As opposed to the naive approach, having partitions span a whole dimension can cause imbalance if the number of elements in the highest-order dimension is not big enough to be evenly distributed among the GPUs. Moreover, the extra memory indirection in each access to the array can hurt performance because of the extra memory access and the lower cache efficiency.

3) Virtual memory: To completely eliminate the extra overhead in the array indexing operation, data chunks located on the different GPUs must have contiguous virtual memory addresses to maintain the logical indexing semantics after index linearization. Current versions of CUDA do not support such control over the mappings within the UVAS. However, we have conducted some experiments with the CUDA memory allocator and results consistently show that the driver reserves contiguous 1 MB virtual memory ranges for consecutive memory allocation calls, regardless of the GPU on which memory is allocated. Using this observation we have implemented a new data distribution approach that partitions the array across different GPUs using as many 1 MB allocations as needed for each chunk (see Figure 4c). Thanks to the virtual memory contiguity, this implementation does not impose an overhead when accessing the array, since no operation besides index linearization (which already exists in the original code) must be performed.
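The following device-side sketch (our illustration, not the actual CUDArrays accessors) contrasts the index computations of the first two schemes; both keep a small table of per-GPU base pointers that is valid on every GPU thanks to the unified address space.

    #include <cstddef>

    // 1) Naive: one division and one modulo per access to locate the chunk.
    __device__ float &access_naive(float **chunk_table, size_t chunk_elems,
                                   size_t linear_idx) {
        size_t chunk = linear_idx / chunk_elems;   // which GPU allocation
        size_t off   = linear_idx % chunk_elems;   // offset inside it
        return chunk_table[chunk][off];
    }

    // 2) Precomputed pointer table: one pointer per position of the partitioned
    //    dimension, so the access only needs an extra indirection.
    __device__ float &access_ptr_table(float **row_table,
                                       size_t i,          /* partitioned dimension */
                                       size_t inner_idx   /* linearized remaining dims */) {
        return row_table[i][inner_idx];
    }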
However, the imposed 1 MB granularity can lead to load imbalance for small data structures. Thus, we provide an implementation that shifts the start of the array to balance the distribution of the data structure across two GPUs, mimicking what would be allocations at a granularity of 4 KB (as can be seen in Figure 4d). While this implementation cannot currently be used with an arbitrary number of GPUs, it allows us to showcase its potential benefits. This problem would be eliminated if memory allocations and their mappings to physical memories could be managed through the CUDA driver at a finer granularity (e.g., using an mmap-like function). We plan to work with the CUDA driver team to formalize such support in future CUDA versions.

C. Memory Consistency

Due to the memory consistency model in CUDA, threads in a thread block do not perceive regular memory updates performed by threads in a different thread block, and atomic instructions must be used in order to update the same memory locations. However, due to limitations in current NVIDIA GPUs, which do not implement atomic updates to remote memories (although they are supported in PCIe 3.0 [16]), our interface currently does not support codes that use them.

Input arrays are usually initialized by the host code and transferred to the GPU before kernel execution. On the other hand, output arrays are modified by the GPU code and copied back to the host to be written to an external device, or to be processed in later computations. These coherence operations are explicitly written by the programmer. Since our framework performs transformations on the arrays which are opaque to programmers, it has to provide the coherence operations to update the different copies of the arrays as needed. We use the ADSM memory consistency model presented in [17] to keep data coherent without requiring explicit memory transfers.

V. EXPERIMENTAL METHODOLOGY

A. Hardware setup

All experiments were run on a system containing a quad-core Intel i7-3820 at 3.6 GHz with 16 GB of DDR3 RAM, and 4 NVIDIA Tesla C2050 GPU cards with 3 GB of GDDR5 each, connected through PCIe 3.0 in x16 mode (containing two PCIe bridges, as in Figure 2). The machine runs a GNU/Linux system, with Linux kernel 3.5.0 and NVIDIA driver 304.64. Benchmarks were compiled using GCC 4.7.2 for CPU code and the NVIDIA CUDA compiler 5.0 for GPU code. Execution times were measured using the std::chrono high-resolution timers for the host code, and CUDA events for GPU code and memory transfers (microsecond accuracy). For runs with more than one GPU, the graphs show the time for the slowest execution of each kernel call.
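For reference, the following sketch shows the CUDA-event timing pattern mentioned above (a generic example, not the paper's measurement harness): events recorded around a kernel launch yield the elapsed GPU time with sub-millisecond resolution.

    #include <cuda_runtime.h>

    float time_kernel_ms(void (*launch)()) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        launch();                                // e.g. kernel<<<grid, block>>>(...)
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.f;
        cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }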

Figure 4: Data partitioning implementation approaches. (a) Naive; (b) Pointer table; (c) Virtual memory, 1 MB granularity; (d) Virtual memory, 4 KB granularity.
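The virtual-memory scheme of Figure 4c rests on an observed, undocumented allocator behaviour; the sketch below (our illustration) shows the allocation pattern it relies on: consecutive cudaMalloc calls issued to different GPUs are assumed to return consecutive 1 MB-aligned virtual ranges, so a logically contiguous array can be backed round-robin by all GPUs.

    #include <cuda_runtime.h>
    #include <vector>

    std::vector<void *> alloc_distributed(size_t bytes, int num_gpus) {
        const size_t kChunk = 1 << 20;                   // 1 MB driver granularity
        size_t chunks = (bytes + kChunk - 1) / kChunk;
        std::vector<void *> ptrs(chunks);
        for (size_t c = 0; c < chunks; ++c) {
            cudaSetDevice(static_cast<int>(c % num_gpus));
            cudaMalloc(&ptrs[c], kChunk);                // chunk c lives on GPU c % num_gpus
        }
        // The scheme assumes ptrs[c + 1] == (char *)ptrs[c] + kChunk, which the text
        // above reports as observed (not documented) CUDA allocator behaviour.
        return ptrs;
    }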

B. Benchmarks

We have selected a number of dense scientific computations found in the Parboil benchmark suite [18] and the NVIDIA SDK. The list of benchmarks used in this evaluation is summarized in Table II. Both host and kernel code have been modified to use the dynarray data structure instead of the flat 1D arrays which are commonly used (due to space constraints we do not evaluate the benefits of cuda::array for arrays whose dimensions are known at compile time). This resulted in cleaner and shorter code due to the elimination of the array offset computation. On the other hand, it has impacted the register usage count of the benchmarks. The effect on the usage count greatly varies depending on the indexing functions used by the array implementation. By default, computation and data are partitioned along the highest-order dimension. However, some computations require a different partition scheme: (1) in matrixmul (C = A × B), proposed by Volkov in [19], A is in column-major order while B is in row-major order. Thus, each thread block traverses both A and B on their y dimension and, therefore, crosses partition boundaries when the default partition policy is used. In this case the preferred dimension to be partitioned is x. (2) In stencil3D [15] a 2D computational grid is created and each thread block traverses the input and output 3D volumes on their z dimension, thus traversing partition boundaries. Therefore, volumes must be partitioned on a different dimension. In the evaluation we study the performance implications of the different partitionings for these cases.

C. Scalability Analysis

Current systems support up to four GPUs connected to the same PCIe root complex. This imposes a hard limit on the number of GPUs where we can experimentally measure the scalability of our proposed partitioning scheme. To overcome this limitation, we use an analytic model for weak and strong scaling for systems with up to 16 GPUs using different interconnection networks (i.e., different bandwidth and topology). Our modeling requires two input parameters: the GPU execution time when no remote memory accesses are performed (Tlocal), and the number of remote memory accesses for each GPU. Let Tremote be the time required to perform all remote accesses for a GPU. Because data and/or computation might not be evenly distributed across all the GPUs, we take the maximum execution time of all GPUs in the system. We assume that the total application execution time Texe can be calculated as Texe = max(Tlocal, Tremote).

This model is a reasonable approximation as long as the number of threads not requiring remote memory accesses is much larger than the number of threads performing remote memory accesses, which is the case in all the benchmarks we use. Notice that when analyzing strong scaling, this assumption breaks for large numbers of nodes.

We compute Tremote using queueing theory, using an M/D/1 model for each link in the system. The arrival rate (λ) is computed as

    λ = ( Σ_{i=1}^{N_GPU} N_remote^i ) / Tlocal

where N_GPU is the number of GPUs connected to the link, and N_remote^i is the number of remote memory accesses performed by the i-th GPU connected to the link. We assume a deterministic service time for each remote memory access in the network (i.e., μ), given by the fabric bandwidth assuming 32-byte accesses, which corresponds to the L2 cache line size in current GPUs.
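For reference, these are the standard M/D/1 quantities that such a per-link model can build on (our addition; the paper's text does not spell these formulas out, and the composition of Tremote below is an assumption):

    ρ = λ / μ                              % link utilization
    W_q = ρ / ( 2 μ (1 − ρ) )              % mean M/D/1 queueing delay per access
    Tremote ≈ N_remote · ( 1/μ + W_q )     % assumed per-GPU composition, not stated in the paper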

VI. PERFORMANCE EVALUATION

A. Indexing overhead

Table II shows that, although the indexing functions are inlined by the compiler, the naive (Regs 1) and pointer table (Regs 2) implementations of dynarray increase the register utilization count compared to virtual memory (Regs 3&4), which uses the same number of registers as the baseline. Moreover, naive adds extra computations which are likely to incur a greater performance penalty. Figure 5 shows the slowdown imposed by the indexing mechanism of each implementation on a single GPU using the biggest input dataset. Results confirm that the performance of naive is unacceptable for most benchmarks due to the lower occupancy of the GPU and the increased number of issued instructions.

  Benchmark           Suite      Array #dims  Inputset (small)  Inputset (medium)  Inputset (big)    Regs 1  Regs 2  Regs 3&4
  reduction           CUDA SDK   1D           1M                4M                 16M               18      12      10
  sort_merge_global   CUDA SDK   1D           1M                4M                 16M               22      18      16
  sort_merge_shared   CUDA SDK   1D           1M                4M                 16M               17      16      17
  sort_shared         CUDA SDK   1D           1M                4M                 16M               17      17      17
  saxpy               -          1D           1M                4M                 16M               14      10      7
  vecadd              -          1D           1M                4M                 16M               15      14      12
  fft1D               Parboil    1D           1M                4M                 16M               22      24      24
  GEMV (matrixvec)    -          2D / 1D      512×512 / 512     2K×2K / 2K         8K×8K / 8K        24      15      13
  GEMM (matrixmul)    Parboil 2  2D           256×256           1K×1K              4K×4K             35      32      35
  convolution2D       Parboil 2  2D           1K×1K / 3×3       4K×4K / 3×3        16K×16K / 3×3     23      21      20
  stencil2D           -          2D           1K×1K+halos       4K×4K+halos        16K×16K+halos     25      19      20
  stencil3D           Parboil 2  3D           1K×1K×32+halos    1K×1K×128+halos    1K×1K×512+halos   41      38      38
Table II: Benchmark description.

Figure 5: Slowdown of each dynarray implementation compared to the baseline version on a single GPU.
pointer table mostly shows performance degradation in 1D computations. This is because of the overhead of using an indirection table as big as the data structure itself. As expected, virtual memory does not impose any overhead, since its indexing function is the same as in the baseline implementation.

B. Multi-GPU performance

We execute the benchmarks using 2 and 4 GPUs on our test system. Figure 6 shows the speedup achieved for the different implementations of the dynarray data structure using the biggest input dataset. In these results both data structures and computations are partitioned using the default partitioning scheme (i.e., along the highest-order dimension). virtual memory (with 1 MB allocation granularity) is the implementation approach that delivers the best performance in all benchmarks but stencil2D. Scalability for most of the benchmarks with no communication is close to linear, but other kernels like stencil2D, reduction and sort_merge_global show speedups greater than 3.4×. matrixvec has a superlinear speedup for the virtual memory implementation. This benchmark has a memory access efficiency (actual versus requested memory bandwidth) which is much lower (below 5%) than the other benchmarks and hugely benefits from the larger aggregated memory bandwidth provided by multi-GPU execution. matrixmul and stencil3D have the worst results, as they are expected to perform badly with the default partitioning policy. convolution2D also exhibits poor scalability, especially when running on 4 GPUs.

One aspect that is not captured in the previous figure, since it only shows results using the biggest input dataset, is how the allocation granularity can impact the performance due to load imbalance. Table III shows the speedup that is achieved when 4 KB granularity is used for stencil2D and stencil3D. We study these benchmarks because the halos in all their dimensions make it impossible to evenly split them using 1 MB granularity. For small input datasets the speedup obtained in stencil2D using the finer allocation granularity is 1.66×, and it decreases as the input dataset size increases. The performance improvement in stencil3D is smaller due to its bigger footprint (which reduces the imbalance). These numbers show that using fine granularity might be key when data cannot be evenly split. Although this scenario only appears in a couple of benchmarks in our evaluation, it becomes more important when data structures are partitioned on lower-order dimensions.

  Benchmark   small   medium  big
  stencil2D   1.66    1.06    1.008
  stencil3D   1.035   1.028   1.018
Table III: Speedup of 4 KB vs. 1 MB granularity for 2 GPUs.

1) Data replication: There are computations that use big parts of some arrays to compute each element of the result (e.g., convolution2D, matrixvec). Other computations have a data access pattern that makes it impossible to distribute input arrays in a way that minimizes remote accesses (e.g., matrixmul). Thus, we study the effect of replicating these data structures. Figure 7 shows the speedup over the baseline for the original version and the version with the replicated data structures (labels contain the rep suffix followed by the name of the data structure being replicated). convolution2D shows a dramatic improvement when replicating the Conv matrix. Our experiments suggest that, in the original implementation, this small 3×3 matrix is evicted from the caches of the local GPU and often has to be requested from the remote GPU that hosts it. On the contrary, in matrixvec the performance remains constant across versions using the virtual memory implementation. On the other hand, pointer table does benefit from replication because of the reduction of capacity conflicts between B and the indirection table in the GPU caches (the hardware performance counter for reexecuted instructions is reduced by 300%). Nevertheless, matrixmul shows the largest improvement when replicating both the A and B matrices (across all the dynarray implementations).

Figure 7: Effect of the replication of data structures on the performance of the benchmarks. Results shown for 4 GPUs.

Figure 6: Speedup vs. baseline for the different dynarray implementations. Results shown for 2 and 4 GPUs.


Figure 8: Effect of the partitioning of the data structures and computations along different dimensions on 2 GPUs.

2) Arbitrary data partitioning: In matrixmul, both matrices A (stored in column-major order) and B (row-major order) are traversed on their y dimension, thus exhibiting bad performance with the default partitioning policy (matrixmul in Figure 8). The default policy works better on an alternative kernel implementation that takes A in row-major order (matrixmul_transA), because it is able to remove all remote accesses to it. However, remote accesses to B still limit the performance of that version of the code, and B has to be replicated (matrixmul_transA_repB) to show a performance comparable to the ideal version (matrixmul_repAB). Nevertheless, the objective of CUDArrays is to be transparent (no modifications to the kernel code) and, therefore, we try to achieve the same effect by using a different partitioning scheme. Partitioning A along the x dimension and replicating B (matrixmul_partX_repB) gives a speedup of 1.52× for the medium input dataset on 2 GPUs. Again, we attribute the performance loss (compared to the alternative kernel implementation) to the overhead of partitioning along this dimension due to the 1 MB allocation granularity imposed by the driver. This restriction makes our implementation use big memory paddings between the rows of the matrices, thus harming data access locality. It also increases the footprint of the data structures by a huge factor (this is why we have not been able to obtain results for the biggest input dataset).

For the stencil3D case, if we partition volumes along their y dimension (stencil3D_partY), instead of the default z, we obtain a speedup close to the ideal. Since in this case we are partitioning the second-order dimension of the data structures, the allocation granularity problems do not seem to have a big impact on the performance, which allows us to simulate all input dataset sizes.

C. Scalability analysis

Figure 9: Contention indexes on different network configurations; the x axis is the interconnection network bandwidth (GB/sec). (a) Bus configuration. (b) Crossbar configuration.

We use the mathematical model introduced in Section V to project the scalability of CUDArrays for a larger number of GPUs (up to 16). Figure 9a shows the contention ratio for a bus interconnect. As expected, due to the wrong partitioning scheme, stencil3D and matrixmul require a very large number of remote memory accesses. As a consequence, if GPUs are connected using a bus topology, these benchmarks require about 140 TB/s to offer strong scalability. If a crossbar is used instead (Figure 9b), this bandwidth requirement decreases to 60 GB/s. These results show that, even when using a fully connected interconnection network, large bandwidths are required if data partitioning produces a very large number of remote memory accesses.

Figure 9a also shows that most of our benchmarks require extremely large bandwidths when using a bus topology. Even without considering stencil3D and matrixmul, a bandwidth of 4 TB/s is still required. However, if a crossbar is used, these bandwidth requirements drop to 40 GB/s to provide strong scalability. Although this bandwidth is not achievable today by most interconnects, it is likely to be achievable in the short term. Moreover, three benchmarks (stencil2D and sort-merge) are able to provide strong scalability up to 16 GPUs when using a crossbar of 10 GB/s, which is already provided by current fabrics.
VII. RELATED WORK

Solutions have been proposed in many areas to exploit multi-GPU execution.

a) Run-time task distribution: Ayguadé et al. presented the GPUSs programming model and runtime in [4]. GPUSs relies on annotations to host functions and CUDA kernels, processed by a source-to-source compiler to create a data dependency graph. A run-time is in charge of scheduling kernel execution, allocating memory and performing data copies among the GPUs in the system. Augonnet et al. presented StarPU in [7]. StarPU is a runtime dependency tracker and scheduler for heterogeneous systems. Programmers write tasks in the form of codelets and define the dependencies among tasks. Contrary to GPUSs, StarPU allows programmers to define arbitrary dependencies between tasks (using task identifiers or tags). The runtime dynamically chooses the version of the codelet to be executed depending on the load of the different processors in the system. Although both solutions free programmers from explicitly allocating GPU memory, performing memory transfers and scheduling kernel execution, programmers still have to manually decompose computations into tasks and define inputs and outputs for each task.

b) Compiler-based transparent multi-GPU execution: Kim et al. introduce an OpenCL framework that combines multiple GPUs and treats them as a single compute device in [20]. In order to split computations, they analyze the array ranges accessed by the thread blocks by performing a first run of the kernels on the CPU. A source-to-source compiler generates the CPU version of the code. Depending on how threads are mapped on the arrays, the run-time chooses the best data partitioning scheme. When there are array ranges that are modified from thread blocks that belong to different partitions, a diffing step is performed to update the elements with the correct values. The authors claim that their solution also allows partitioning and executing computations that do not fit in the memory of a single GPU. Their partitioning approach is similar to our proposed solution, although it is restricted to a subset of thread-to-data mappings, while our solution is more robust as it works seamlessly on any kind of mapping.

c) Language/Library-based transparent multi-GPU execution: Global Arrays [5] is a library that allows computations to be distributed across a distributed-memory system using an array-like interface, similar to CUDArrays. However, it does not support GPU execution and SPMD languages. X10 [21] and Habanero-Java [6] are two PGAS languages which present the programmer with a single global address space. The compiler and the runtime transparently redirect remote memory accesses to the proper memory. Sequoia [22] aims to address the problem of programming for the wide range of memory topologies/hierarchies found in modern systems. Programs are composed of two parts: (a) an algorithmic representation of the computation using a C-like programming language that partitions data structures and defines how to map the computation on them; (b) a mapping of the algorithm to the specific system using a declarative language.

d) ND-arrays for GPUs: Thrust [23] is a C++ library that provides a 1D array container (vector) that efficiently supports a number of pre-defined algorithms and map/reduce primitives. However, explicit transfers between host and GPU memories are required. ArrayFire [24] is a user-level C/C++/Fortran library that provides abstractions for ND-arrays and a number of functions for array indexing, manipulation, data analysis, linear algebra, image and signal processing and sparse matrices. Arrays can only be used in the advertised functions and, in order to use them in custom CUDA kernels, regular CUDA allocations must be used. Microsoft offers ND-arrays in the C++ AMP [13] programming model. They can be freely used in all kinds of computations and are accessed using the regular subscript notation. Moreover, C++ AMP automatically takes care of transferring data among CPU/GPU memories, transparently.

VIII. CONCLUSIONS

In this paper we show that CUDArrays, a simple multi-dimensional array interface, enables both simple multi-GPU programming and an effective compile-time and run-time implementation. We also show that a fine-granularity (4 KB) memory allocation system that guarantees contiguous placement of consecutive allocation requests in the virtual address space, regardless of the target physical GPU, has a major effect on the achievable performance. Using real hardware with four GPUs we show that our system achieves good speedups using multiple GPUs. Using an analytical model we show that the system can potentially make good use of up to 16 GPUs in future hardware. We are currently investigating compiler techniques for better control of partitioning computation and data for our system.

REFERENCES

[1] "TOP500 list - November 2012." [Online]. Available: http://top500.org/list/2012/11/100/
[2] CUDA C Programming Guide, NVIDIA, 2012.
[3] The OpenCL Specification, 2009.
[4] E. Ayguadé, R. M. Badia, F. D. Igual, J. Labarta, R. Mayo, and E. S. Quintana-Ortí, "An extension of the StarSs programming model for platforms with multiple GPUs," in Euro-Par '09, 2009.
[5] J. Nieplocha, R. J. Harrison, and R. J. Littlefield, "Global Arrays: A portable "shared-memory" programming model for...," ser. SC '94, 1994.
[6] V. Cavé, J. Zhao, J. Shirako, and V. Sarkar, "Habanero-Java: the new adventures of old X10," ser. PPPJ '11, 2011.
[7] C. Augonnet, S. Thibault, R. Namyst, and P.-A. Wacrenier, "StarPU: A unified platform for task scheduling on heterogeneous multicore architectures," in Euro-Par '09, 2009.
[8] "Heterogeneous System Architecture: a technical review." [Online]. Available: http://hsafoundation.com/publications/
[9] "NVIDIA GPUDirect," NVIDIA. [Online]. Available: https://developer.nvidia.com/gpudirect
[10] S. L. Scott, "Synchronization and communication in the T3E multiprocessor," ser. ASPLOS VII, New York, NY, USA, 1996.
[11] P. N. Glaskowsky, "NVIDIA's Fermi: The First Complete GPU Computing Architecture."
[12] Ivy Bridge Architecture, 2011.
[13] C++ AMP: C++ Accelerated Massive Parallelism, Microsoft, 2012.
[14] MPI-3: A Message-Passing Interface Standard, Message Passing Interface Forum, 2012.
[15] P. Micikevicius, "3D finite difference computation on GPUs using CUDA," ser. GPGPU-2, 2009.
[16] PCI Express Base 3.0 Specification, PCI-SIG, 2010.
[17] I. Gelado, J. E. Stone, J. Cabezas, S. Patel, N. Navarro, and W.-m. W. Hwu, "An asymmetric distributed shared memory model for heterogeneous parallel systems," ser. ASPLOS XV, 2010.
[18] IMPACT Group, "Parboil benchmark suite," http://impact.crhc.illinois.edu/parboil.php.
[19] M. Garland, S. Le Grand, J. Nickolls, J. Anderson, J. Hardwick, S. Morton, E. Phillips, Y. Zhang, and V. Volkov, "Parallel computing experiences with CUDA," IEEE Micro, 2008.
[20] J. Kim, H. Kim, J. H. Lee, and J. Lee, "Achieving a single compute device image in OpenCL for multiple GPUs," ser. PPoPP '11, 2011.
[21] P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von Praun, and V. Sarkar, "X10: an object-oriented approach to non-uniform cluster computing," ser. OOPSLA '05, 2005.
[22] K. Fatahalian, D. R. Horn, T. J. Knight, L. Leem, M. Houston, J. Y. Park, M. Erez, M. Ren, A. Aiken, W. J. Dally, and P. Hanrahan, "Sequoia: programming the memory hierarchy," 2006.
[23] N. Bell, "Thrust: A parallel template library for CUDA," 2009.
[24] "ArrayFire," AccelerEyes, 2012.