CUDArrays: transparent multi-GPU computation in NCC-NUMA GPU systems
Anonymous Review
Abstract—Many dense scientific computations can benefit from GPU execution. However, large applications might require the utilization of multiple GPUs to overcome the physical memory size constraints. On the other hand, as integrated GPUs improve, multi-GPU nodes are expected to be ubiquitous in future HPC systems. Therefore, applications must be ported to exploit the computational power of all the GPUs on such nodes. Nevertheless, multi-GPU programming currently requires programmers to explicitly partition data and computation. Although some solutions have been proposed, they rely on specialized programming languages or compiler extensions, and do not exploit advanced features of modern GPUs like remote memory accesses.

In this paper we present CUDArrays, an interface for current and future multi-GPU systems that transparently distributes data and CUDA kernel computation across all the GPUs in a node. The interface is based on an N-dimensional array data structure that also makes the CUDA kernels more readable and concise. We propose four different implementation approaches and evaluate them on a 4-GPU system. We also perform a scalability analysis to project performance results for up to 16 GPUs. Our interface relies on the peer-to-peer DMA technology to transparently access data located on remote GPUs. We modify a number of GPU workloads to use our data structure. Results show that linear speedups can be achieved for most of the computations if the correct partitioning strategy is chosen. Using an analytical model we project that most computations can scale up to 16 GPUs.

I. INTRODUCTION

Current HPC systems built using commodity microprocessors commonly install two or four CPU chips in each node [1]. Many systems also install GPUs to further accelerate computations rich in data parallelism. As CPU and GPU are integrated into the same chip, we can expect several GPUs per node to be common in future HPC systems. Hence, applications will have to exploit all the GPUs present in each node to achieve optimal performance.

Commercial GPU programming models, such as CUDA [2] and OpenCL [3], make multi-GPU programming a tedious and error-prone task. GPUs are typically exposed as external devices with their own private memory. Programmers are responsible for splitting both data and computation across GPUs by hand. While some solutions have been proposed [4], [5], [6], [7], they require programmers to rewrite their programs using specialized programming languages or compiler extensions. Moreover, most of them do not exploit advanced features of modern GPUs like remote memory accesses.

In this paper we present a data and computation partitioning interface for dense data structures, i.e., multi-dimensional arrays, that simplifies the programming of current and future multi-GPU systems (either discrete or integrated). Our proposal assumes non-coherent memory accesses between GPUs through a relatively low-bandwidth interconnect (Non-Cache-Coherent Non-Uniform Memory Access, or NCC-NUMA), such that all GPUs can access any partition of the logical data structures, providing the illusion of programming for a single-GPU node. For future systems with integrated GPUs, we envision system architecture support such as Heterogeneous System Architecture (HSA) [8] that enables GPUs in the same node to access each other's physical memory. In systems with discrete GPUs we rely on the Peer-to-peer (P2P) DMA technology [9] that enables one GPU to access the memory of any other GPU connected to the same PCIe bus. We present how a transparent multi-GPU data and computation distribution scheme can be encapsulated using an N-dimensional array (ND-array) data structure, and introduce four different implementation techniques for this data type using C++ and CUDA. This interface not only makes it possible to distribute data and computation across GPUs but also makes the CUDA kernels more concise and readable. We modify a set of GPU dense-computation benchmarks, originally supporting only single-GPU systems, to use our ND-array data structure. After these minor modifications, we run these benchmarks on a multi-GPU system to evaluate the implementation alternatives, and analyze the scalability of our approach. Experimental results show that linear speedups can be achieved for most of the computations if the correct partitioning strategy is chosen. Using an analytical model we project that most computations can scale up to 16 GPUs.

The main contributions of this paper are (1) a simple interface to transparently distribute computation and data across several GPUs, (2) an analysis of different implementation techniques to efficiently partition dense data structures across GPUs, and (3) an evaluation of the performance of the data distribution techniques using a wide range of GPU workloads.

The rest of this paper is organized as follows. Section II introduces our base architecture and discusses some challenges found in multi-GPU programming. Section III discusses trade-offs for automated data and computation distribution. Section IV presents our framework to ease the development of multi-GPU applications and examines different implementation techniques. The experimental methodology is presented in Section V, and Section VI evaluates our proposed framework. We discuss the related work in Section VII. Finally, we conclude our paper in Section VIII.

II. BACKGROUND AND MOTIVATION

This section presents the base multi-GPU architecture targeted in this paper and introduces the necessary terminology. Since we focus on the programmability within a node, we refer
to it as system in the rest of the paper. We also discuss some programmability issues present in multi-GPU systems.

A. Non-Coherent NUMA architectures

Figure 1 shows the base multi-GPU system architecture assumed in this paper. The system has one or several chips that contain both CPU and GPU cores. CPU cores implement out-of-order pipelines that execute complex control code efficiently. GPU cores provide wide vector units and a highly multi-threaded execution model better suited for codes rich in data parallelism (e.g., dense scientific computations). Each chip is connected to one or more memory modules, but all cores can access any memory module in the system. Accesses to remote memories have longer access latency than local accesses, thus forming a NUMA system. CPU cores typically access memory through a coherent cache hierarchy, while GPUs use much weaker consistency models that do not require cache coherence among cores. Therefore, in order to support these two different memory organization schemes, both coherent and non-coherent interconnects are provided for inter-chip memory accesses. This NCC-NUMA system architecture has been successfully used in the past (e.g., Cray T3E [10]). This kind of organization makes it easy to distribute computation across the GPU cores, as in current CPU-only shared memory NUMA systems (e.g., SMP). However, as in any NUMA system, remote accesses must be minimized to avoid their longer access latency and lower memory bandwidth. This paper presents a software framework which allows programmers to easily exploit NCC-NUMA multi-GPU systems in dense scientific computations. We evaluate this framework on top of an existing commercial system that implements most of the features of an NCC-NUMA system (Figure 2). In this paper we focus on the GPUs and the inter-GPU communication.

Figure 1: Multi-socket integrated CPU/GPU system.

Figure 2: Multi-GPU architecture used in this paper.

B. Multi-GPU systems

Modern NVIDIA GPUs allow programs running on the GPU to access most memories in the system [2]. Memory bandwidths and latencies for the different interconnects that can be found in this kind of system are summarized in Table I. Accesses to host memories from the GPU (2) go through the PCIe interconnect and the CPU memory controller. If the target address resides in memory connected to a different CPU socket, the inter-CPU interconnect (Hypertransport/QPI) must be traversed, too. Support for peer-DMA memory accesses was added to NVIDIA Tesla devices in the Fermi family [11]. This feature allows GPUs to directly access the memory on other GPUs through the PCIe interconnect (3). This feature was coupled with the Unified Virtual Address Space (UVAS), which ensures that a virtual memory address belongs to a single physical device and, therefore, it is easy to route the application-level data requests to the proper physical memory. Hence, code in CUDA programs can transparently access any memory in the system through regular pointers. Vendors that support less-featureful programming interfaces like OpenCL do not explicitly export the mechanisms to perform remote memory accesses, but such accesses could be used internally in the implementation of such interfaces.

Table I: Interconnection network characteristics.

    Interconnect             Latency    Bandwidth
    DDR3 (CPU, 1 channel)    ~50 ns     ~30 GBps
    GDDR5 (GPU)              ~500 ns    ~200 GBps
    Hypertransport 3         ~40 ns     25+25 GBps
    QPI                      ~40 ns     25+25 GBps
    PCI Express 2.0          ~200 ns    8+8 GBps
    PCI Express 3.0          ~200 ns    16+16 GBps

While the execution model provided by GPUs (discussed in detail in the next subsection) can hide large memory latencies, both host memory and the inter-GPU interconnects (e.g., PCI Express) deliver a memory bandwidth which is an order of magnitude lower than the local GPU memory (GDDR5). The main limitation found in current systems is the incompatibility between the protocols of the inter-CPU interconnect and the peer-memory access. Therefore, our test platform uses a single CPU and several discrete GPUs.

Current integrated CPU/GPU chips are less popular in HPC systems due to their lower memory bandwidth. However, many vendors have already implemented support for general purpose computations in integrated GPUs. For example, since the Sandy Bridge family, integrated GPUs from Intel support OpenCL programs [12]. AMD also introduced the Fusion architecture, which supports OpenCL and C++-AMP [13]. In Fusion devices, general purpose cores and GPU cores share the last level of the memory cache hierarchy, but a non-coherent interconnect is provided to achieve full memory bandwidth. Depending on the characteristics of the memory allocation (e.g., accessible by both CPU and GPU, or by the GPU only), GPU cores use the coherent or the non-coherent interconnect (respectively). As the integration technology improves, these chips are expected to be dominant in the future.

C. SPMD Programming Model

GPUs are typically programmed using a Single Program Multiple Data (SPMD) programming model, such as NVIDIA CUDA [2] or OpenCL [3]. This model allows programmers to spawn a large number of threads that execute the same
program, although each thread might take a completely different control flow path. All these threads are organized into a computational grid of groups of threads (i.e., thread blocks in CUDA, work groups in OpenCL). Each thread block has a block identifier and each thread has an identifier within the thread block, which can be used by programmers to map the computation to the data structures (3D identifiers are provided to ease the mapping on N-dimensional data structures). Both CUDA and OpenCL provide very weak consistency models: memory updates performed by a thread block do not have to be visible to other thread blocks (except when using atomic instructions). Therefore, thread blocks can be independently executed.

Each thread block is scheduled to run on a GPU core, and threads within a thread block are issued in fixed-length groups (i.e., warps in CUDA). Each GPU core provides fine-grained multi-threading hardware to context switch among the different warps being executed on the same GPU core. This context switching happens on data dependencies with long latency instructions (e.g., misses in the L1 cache). This scheme allows the GPU core to hide long latency operations by concurrently executing many threads. Threads within the same thread block can communicate through a shared memory region or using synchronization instructions (e.g., __syncthreads). The number of thread blocks that can execute concurrently on the same GPU core depends on the number of threads and other resources needed by each block (e.g., shared memory and registers). Hence, the utilization of these resources must be carefully managed to achieve full utilization of the GPU.

D. Multi-GPU programming

The CUDA device language [2] is not aware of multi-GPU execution. Thread block identifiers are local to each GPU and memory allocations are bound to a single GPU. Moreover, although the CUDA programming model exposes to programmers a UVAS and provides support for remote accesses, there is no mechanism to control how virtual addresses are mapped to the physical memories in the system. Hence, programmers typically split data structures by allocating a chunk of the structure in each GPU (obtaining non-contiguous virtual memory regions) and must explicitly use the appropriate pointers to access allocations located on remote GPUs. This is analogous to the traditional solutions proposed for distributed memory systems like message passing (e.g., MPI [14]). This extra code harms the programmability of the kernel code and is likely to impose some overhead. Therefore, the most common way to divide the computation in CUDA is to access the local GPU memory only, thus replicating those memory regions that are accessed by more than one GPU. Changes to these shared regions must be propagated to the rest of the GPUs. OpenCL [3] introduces the concept of global offset to ease the partitioning of a computation across different devices. Programmers can use the global id of the thread blocks to access the same allocation from different processors on shared memory systems. However, on devices with distributed memories (like GPUs), data structures must still be manually split, distributed, and managed by the programmer.

Listing 1 shows the CUDA kernel code of a 5-point 2D stencil computation. Stencil computations are common in iterative FDTD (finite-difference time-domain) simulations. In these types of iterative simulations the output of a step becomes an input of the following steps. The code in Listing 1 is written for one GPU, but it can be used to process a subdomain of the problem if domain decomposition is used for multi-GPU execution, provided that the programmer manually handles data consistency among the subdomains (e.g., by copying data among GPUs). To compute the output value at position i, j the 2D stencil computation uses the values within a radius of 1 of the i, j position of the input array. Thus, an extra row of the input for each neighboring domain (also known as "halo") must be replicated in each GPU. In the next iteration the output is used as input and, therefore, the values at the domain boundaries that are needed by the neighbors must be appropriately copied by the host code.

     1  __global__ void stencil2D(float *out, const float *in) {
     2      int tx = threadIdx.x + 1, ty = threadIdx.y + 1;
     3
     4      int id_x = blockIdx.x * B_X + tx;
     5      int id_y = blockIdx.y * B_Y + ty;
     6
     7      ssize_t size_x = gridDim.x * B_X, size_y = gridDim.y * B_Y;
     8
     9      int idx = (id_y * size_x) + id_x;
    10
    11      __shared__ float sh[B_Y + 2][B_X + 2];
    12      sh[ty][tx] = in[idx]; // Load central point
    13
    14      if (tx == 1) { // Load left/right "halo" points
    15          sh[ty][0      ] = in[idx - 1];
    16          sh[ty][B_X + 1] = in[idx + B_X];
    17      }
    18      if (ty == 1) { // Load up/down "halo" points
    19          sh[0      ][tx] = in[idx - 1 * size_x];
    20          sh[B_Y + 1][tx] = in[idx + B_Y * size_x];
    21      }
    22      __syncthreads();
    23
    24      out[idx] = sh[ty][tx] + (sh[ty][tx-1] + sh[ty][tx+1]) +
    25                 (sh[ty-1][tx] + sh[ty+1][tx]);
    26  }

Listing 1: 2D Stencil Computation on multiple GPUs using boundary exchanges in the host.

In order to avoid the extra code and copies, peer-memory accesses can be used to access those rows that belong to the neighboring subdomains. Listing 2 shows the changes to the kernel needed to use peer-memory accesses. The function signature of the kernel reflects that three pointers must be used for the input data instead of one: one for the allocation of the GPU on which the code is running, and pointers to the data of the neighbouring subdomains that are necessary to compute the output at the boundaries. Moreover, when the halo points (lines 14-21 in Listing 1) are located on a remote GPU, the appropriate pointer must be used. Nevertheless, the modifications shown in this example are relatively simple because a specific data partitioning layout is assumed (the 2D matrices are divided along the highest-order dimension and rows cannot be split). A more generic implementation should check where data is located on every access to the arrays.

    void stencil2D_peer(float *out, const float *in,
                        const float *in_up, const float *in_down) {
        ...
        if (tx == 1) { // Load left/right "halo" points
            sh[ty][0      ] = in[idx - 1];
            sh[ty][B_X + 1] = in[idx + B_X];
        }
        if (ty == 1) { // Load up/down "halo" points
            if (id_y == 1 && in_up != NULL) {
                sh[0][tx] = in_up[id_x];
            } else {
                sh[0][tx] = in[idx - 1 * size_x];
            }
            if (id_y == size_y - 1 && in_down != NULL) {
                sh[B_Y + 1][tx] = in_down[id_x];
            } else {
                sh[B_Y + 1][tx] = in[idx + B_Y * size_x];
            }
        }
        ...
    }

Listing 2: 2D Stencil Computation on multiple GPUs using peer memory accesses.

III. DATA PARTITIONING IN DENSE SPMD CODES

In this section we discuss how to distribute computation and data across the GPUs in the system in an efficient way. We decouple computation and data distribution and discuss them independently. However, these two problems are deeply interrelated in terms of system performance. The perfect distribution scheme would distribute the dataset associated to any computation partition to the same physical GPU, thus completely removing the need for remote memory accesses. Such a simple rule presents many caveats because, in many cases, it is not possible to unequivocally determine which parts of the dataset are accessed by each computation unit (i.e., thread). Even if we assume this information is available, determining an optimal distribution approach is an NP problem whose solution space is unmanageable. Furthermore, parts of the dataset may be accessed by multiple threads, so data replication might be needed to provide good performance. There is a trade-off between the extra storage requirements of data replication, and the performance losses due to remote memory accesses. As a general rule, replication should be avoided unless the performance is severely affected, i.e., the number of remote memory accesses is very high.

In this paper we assume no knowledge about the data access pattern performed by the GPU kernels. Hence, we aim to find a small set of partitioning schemes that deliver reasonable performance for the most common patterns found in dense scientific computations. Although compiler analysis might provide useful information, it is out of the scope of this paper.

A. SPMD Dense Scientific Codes

In this paper we target dense scientific computations where both the input and output datasets are represented as multi-dimensional arrays (i.e., nd-arrays). Some examples of these codes are Finite Difference Time Domain (FDTD) computations, Reverse Time Migration (RTM) seismic imaging, or Lattice Boltzmann Methods (LBM) for computational fluid dynamics. We first analyze the most common data access patterns in these codes to later define static partitioning schemes that will produce high multi-GPU performance in those codes exhibiting such data access patterns.

The SPMD paradigm gives complete freedom to the programmer regarding which data is accessed by each thread. However, quite often programmers write the code in such a way that accesses to n-dimensional arrays are computed as an affine transformation (i.e., linear combination) of the thread indexes. As a result, threads with contiguous identifiers in one dimension (e.g., x) tend to access elements which are contiguous in the same or a different (e.g., y) dimension in the nd-array. There are two main reasons for SPMD dense scientific codes to use this kind of index transformation. First, affine transformations provide an easy and convenient way for programmers to map threads to parts of the arrays. Second, the GPU hardware requires threads with contiguous identifiers (i.e., a warp) to access contiguous memory locations to achieve high memory bandwidth.

Affine transformations used to produce array indexes from thread indexes typically increase monotonically: thread blocks with contiguous indexes in a given dimension (e.g., B_{x,y,z} and B_{x,y,z+1}) are likely to access parts of the arrays that are also contiguous in some dimension. This is the most commonly found pattern because the global thread index (i.e., the index of the thread within the computational grid) is usually computed using the block identifier. This pattern is also favored because the GPU hardware schedules contiguous thread blocks in the same compute core to maximize the reuse of data in the L1 cache.

Another pattern commonly found in dense scientific applications is the utilization of read-only data structures whose elements are accessed by many or all threads. Examples of these data structures are coefficient matrices in stencil and convolution computations, or input matrices for some linear algebra computations such as matrix-matrix multiplications. The access patterns to these data structures heavily depend on the specific computation and, thus, a common pattern cannot be easily inferred.

B. Computation Partitioning

The CUDA model imposes the thread block as the smallest granularity for computation partitioning. Individual threads cannot be independently executed because the programming model guarantees that threads within a warp are executed in lockstep. Moreover, threads within a thread block share resources (e.g., shared memory), and support barrier synchronization operations. This requires partitioning schemes to ensure that all threads within a thread block are executed in the same compute core.

Figure 3: Different computation partitioning schemes for a 2D grid. Each shade of gray represents a different partition, smaller squares represent thread blocks and threads.

Figure 3 illustrates our computation distribution approach. We group thread blocks in as many evenly-sized sets as GPUs are present in the system. Each set is composed of thread blocks whose indexes are contiguous in a given dimension. This scheme offers the programmer/runtime system as many different degrees of freedom as dimensions in the computational grid of the SPMD model (i.e., three in CUDA
and OpenCL) when deciding how the computation gets split

    void stencil2D(dynarray
Figure 4: Data partitioning implementation approaches: (a) Naive, (b) Pointer table, (c) Virtual Memory 1MB, (d) Virtual Memory 4KB.
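The default scheme described for computation partitioning, one contiguous and evenly sized set of indexes per GPU along a single dimension, can be sketched on the host side as follows. This is a minimal illustration in plain C++ and not the CUDArrays implementation: the names Range, split_contiguous and owner_of are hypothetical, and handing the leftover indexes to the first partitions is an assumption made here for the case where the extent is not a multiple of the GPU count.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open range [begin, end) of indexes along the partitioned dimension.
// The indexes can be thread-block indexes of the grid or rows of an array.
struct Range {
    std::size_t begin;
    std::size_t end;
};

// Split `extent` indexes into `parts` contiguous sets (parts must be > 0).
// Set sizes differ by at most one; the first `extent % parts` sets get the
// extra index (an assumption of this sketch).
std::vector<Range> split_contiguous(std::size_t extent, std::size_t parts) {
    std::vector<Range> out;
    const std::size_t base = extent / parts;
    const std::size_t extra = extent % parts;
    std::size_t begin = 0;
    for (std::size_t p = 0; p < parts; ++p) {
        const std::size_t len = base + (p < extra ? 1 : 0);
        out.push_back({begin, begin + len});
        begin += len;
    }
    return out;
}

// Partition (GPU) that owns index `i` under the split above.
// Requires i < extent and parts > 0.
std::size_t owner_of(std::size_t i, std::size_t extent, std::size_t parts) {
    const std::size_t base = extent / parts;
    const std::size_t extra = extent % parts;
    const std::size_t cut = extra * (base + 1);  // indexes held by the larger sets
    if (i < cut) {
        return i / (base + 1);
    }
    return extra + (i - cut) / base;
}
```

For example, splitting 10 rows across 4 GPUs yields the ranges [0,3), [3,6), [6,8) and [8,10), and owner_of(6, 10, 4) returns 2. A thread block whose accesses stay inside the range owned by its own GPU performs no remote accesses under this scheme.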
B. Benchmarks GPU. Let Tremote be the time required to perform all remote We have selected a number of dense scientific computations accesses for a GPU. Because data and/or computation might found in the Parboil benchmarks suite [18], and the NVIDIA not be evenly distributed across all the GPUs, we take the SDK. The list of benchmarks used in this evaluation is summa- maximum execution time of all GPUs in the system. We rized in TableII. Both host and kernel code has been modified assume that the total application execution time Texe can be to use the dynarray data structure instead of the flat 1D arrays calculated as Texe = max(Tlocal,Tremote). which are commonly used (due to space constraints we do not This model is a reasonable approximation as long as the evaluate the benefits of array for arrays whose dimensions are number of threads not requiring remote memory accesses is known at compile time). This resulted in cleaner and shorter much larger than the number of threads performing remote code due to the elimination of the array offset computation. memory accesses, which is the case in all the benchmarks we On the other hand, this has impacted on the register usage use. Notice that when analyzing strong scaling, this assump- count of the benchmarks. The effect on the usage count tion breaks for large number of nodes. greatly varies depending on the indexing functions used by We compute Tremote using queueing theory, using a M/D/ the array implementation. By default, computation and data 1 model for each link in the system. The arrival rate (λ) is are partitioned along the highest-order dimension. However, computed as, NGP U some computations require a different partition scheme: (1) P i Nremote in matrixmul (C = A × B), proposed by Volkov in [19], λ = i=1 A is in column major order while B is in row major order. Tlocal
Thus, each thread block traverses both A and B on their y where NGP U is the number of GPUs connected to the link, dimension and, therefore, crosses partition boundaries when i and Nremote is the number of remote memory accesses the default partition policy is used. In this case the preferred performed by the ith GPU connected to the link. We assume dimension to be partitioned is x. (2) in stencil3D [15] a 2D a deterministic time of each remote memory access in the computational grid is created and each thread block traverses network (i.e., µ) given by the fabric bandwidth assuming 32 the input and output 3D volumes on their z dimension, thus byte accesses, which corresponds to the L2 cache line size in traversing partition boundaries. Therefore, volumes must be current GPUs. partitioned on a different dimension. In the evaluation we study the performance implications of the different partitioning for VI.PERFORMANCE EVALUATION these cases. A. Indexing overhead TableII shows that, although the indexing functions are C. Scalability Analysis inlined by the compiler, naive (Regs 1) and pointer table (Regs Current systems support up to four GPUs connected to the 2) implementations for dynarray increase the register utiliza- same PCIe root complex. This imposes a hard limit to the tion count compared to virtual memory (Regs 3&4, which number of GPUs where we can experimentally measure the uses the same amount of register as the baseline). Moreover, scalability of our proposed partitioning scheme. To overcome naive adds extra computations which are likely to incur a this limitation, we use an analytic model for weak and strong greater performance penalty. Figure 5 shows the slowdown scaling for systems with up to 16 GPUs using different inter- imposed by the indexing mechanism of each implementation connection networks (i.e., different bandwidth and topology). on a single GPU using the biggest input dataset. 
Results Our modeling requires two input parameters, the GPU exe- confirm that the performance of naive is unacceptable for most cution time when no remote memory accesses are performed benchmarks due to the lower occupancy of the GPU and the (Tlocal), and the number of remote memory accesses for each increased number of issued instructions. pointer table shows 8 Array Inputset Inputset Inputset Regs Regs Regs Benchmark Suite #dims (small) (medium) (big) 1 2 3&4 reduction CUDA SDK 1D 1M 4M 16M 18 12 10 sort_merge_global 22 18 16 sort_merge_shared CUDA SDK 1D 1M 4M 16M 17 16 17 sort_shared 17 17 17 saxpy - 1D 1M 4M 16M 14 10 7 vecadd - 1D 1M 4M 16M 15 14 12 fft1D Parboil 1D 1M 4M 16M 22 24 24 2D 512×512 2K×2K 8K×8K GEMV (matrixvec) - 24 15 13 1D 512 2K 8K GEMM (matrixmul) Parboil 2 2D 256×256 1K×1K 4K×4K 35 32 35 1K×1K 4K×4K 16K×16K convolution2D Parboil 2 2D 23 21 20 3×3 3×3 3×3 stencil2D - 2D 1K×1K+halos 4K×4K+halos 16K×16K+halos 25 19 20 stencil3D Parboil 2 3D 1K×1K×32+halos 1K×1K×128+halos 1K×1K×512+halos 41 38 38 Table II: Benchmark description. 14 Naive Pointer Table Virtual Memory 12 Naive Pointer Table Virtual Memory 6 10 8 5 6 4 4 Overhead (x) 3 2 2 0 Speedup (x) 1 fft1D saxpy vecadd matrixmulmatrixvecreduction sort_sharedstencil2Dstencil3D convolution2D 0 sort_merge_globalsort_merge_shared matrixvec matrixmul convolution2D Figure 5: Slowdown of each dynarray implementation com- matrixvec_repB matrixmul_repAB pared to the baseline version on a single GPU. convolution2D_repConv Benchmark small medium big Figure 7: Effect of the replication of data structures on the stencil2D 1.66 1.06 1.008 performance of the benchmarks. Results shown for 4 GPUs. stencil3D 1.035 1.028 1.018 Table III: Speedup of 4 KB vs 1 MB granularity for 2 GPUs using 1 MB granularity. For small input datasets the speedup obtained in stencil2D using finer allocation granularity is performance degradation in 1D computations, mostly. 
This is because of the overhead of using an indirection table as big as the data structure itself. As expected, virtual memory does not impose any overhead since its indexing function is the same as in the baseline implementation.

B. Multi-GPU performance

We execute the benchmarks using 2 and 4 GPUs on our test system. Figure 6 shows the speedup achieved for the different implementations of the dynarray data structure using the biggest input dataset. In these results both data structures and computations are partitioned using the default partitioning scheme (i.e., along the highest-order dimension). The virtual memory implementation (with 1 MB allocation granularity) is the approach that delivers the best performance in all benchmarks but stencil2D. Scalability for most of the benchmarks with no communication is close to linear, but other kernels like stencil2D, reduction and sort_merge_global show speedups greater than 3.4×. matrixvec has a superlinear speedup for the virtual memory implementation. This benchmark has a memory access efficiency (actual versus requested memory bandwidth) which is much lower (below 5%) than the other benchmarks, and it hugely benefits from the larger aggregated memory bandwidth provided by multi-GPU execution. matrixmul and stencil3D have the worst results, as they are expected to perform badly with the default partitioning policy. convolution2D also exhibits poor scalability, especially when running on 4 GPUs.

One aspect that is not captured in the previous figure, since it only shows results using the biggest input dataset, is how the allocation granularity can affect performance due to load imbalance. Table III shows the speedup that is achieved when 4 KB granularity is used for stencil2D and stencil3D. We study these benchmarks because the halo in all their dimensions makes it impossible to evenly split it

1.66×, and it decreases as the input dataset size increases. The performance improvement in stencil3D is smaller due to its bigger footprint (which reduces the imbalance). These numbers show that using fine granularity might be key when data cannot be evenly split. Although this scenario only appears in a couple of benchmarks in our evaluation, it becomes more important when data structures are partitioned on lower-order dimensions.

1) Data replication: There are computations that use big parts of some arrays to compute each element of the result (e.g., convolution2D, matrixvec). Other computations have a data access pattern that makes it impossible to distribute input arrays in a way that minimizes remote accesses (e.g., matrixmul). Thus, we study the effect of replicating these data structures. Figure 7 shows the speedup over the baseline for the original version and the version with the replicated data structures (labels contain the rep suffix followed by the name of the data structure being replicated). convolution2D shows a dramatic improvement when replicating the Conv matrix. Our experiments suggest that, in the original implementation, this small 3 × 3 matrix is evicted from the caches of the local GPU and often has to be requested from the remote GPU that hosts it. On the contrary, in matrixvec the performance remains constant across versions using the virtual memory implementation. On the other hand, pointer table does benefit from replication because of the reduction of capacity conflicts between B and the indirection table in the GPU caches (the hardware performance counter for reexecuted instructions is reduced by 300%). Nevertheless, matrixmul shows the largest improvement when replicating both A and B matrices (across all the dynarray implementations).

2) Arbitrary data partitioning: In matrixmul, both matrices A (stored in column major order) and B (row major order) are traversed on their y dimension, thus exhibiting bad performance with the default partitioning policy (matrixmul
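Figure 6: Speedup vs baseline for the different dynarray implementations. Results shown for 2 and 4 GPUs. (Bar chart not reproduced. Y-axis: speedup (x), 0 to 6; series: Naive, Pointer Table and Virtual Memory, each for 2 and 4 GPUs; benchmarks: fft1D, saxpy, vecadd, matrixmul, matrixvec, reduction, sort_shared, stencil2D, stencil3D, convolution2D, sort_merge_global, sort_merge_shared.)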
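As a rough illustration of the load imbalance that the Multi-GPU performance discussion attributes to allocation granularity, the following C++ sketch computes how many bytes each GPU receives when every partition boundary must fall on a multiple of the allocation granularity. The helper names (partition_bytes, imbalance) and the round-up placement policy are assumptions made for this example; they are not part of CUDArrays, and the actual runtime may place boundaries differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a CUDArrays API): bytes assigned to each GPU when a
// contiguous array of `total_bytes` is split across `num_gpus` and every
// partition boundary is rounded up to a multiple of `granularity` bytes.
std::vector<std::size_t> partition_bytes(std::size_t total_bytes,
                                         std::size_t num_gpus,
                                         std::size_t granularity)
{
    assert(num_gpus > 0 && granularity > 0);
    std::vector<std::size_t> sizes(num_gpus, 0);
    std::size_t start = 0;
    for (std::size_t g = 0; g < num_gpus; ++g) {
        // Ideal (unaligned) end of partition g for an even split.
        std::size_t ideal_end = total_bytes * (g + 1) / num_gpus;
        // Round the boundary up to the allocation granularity and clamp it to
        // the end of the array.
        std::size_t end = (ideal_end + granularity - 1) / granularity * granularity;
        end = std::min(end, total_bytes);
        sizes[g] = end - start;  // boundaries are non-decreasing, so end >= start
        start = end;
    }
    return sizes;
}

// Load imbalance: largest partition divided by the ideal even share.
// A value of 1.0 means a perfectly balanced split.
double imbalance(const std::vector<std::size_t>& sizes)
{
    std::size_t total = 0, largest = 0;
    for (std::size_t s : sizes) {
        total += s;
        largest = std::max(largest, s);
    }
    assert(total > 0);
    return static_cast<double>(largest) * sizes.size() / total;
}
```

Under these assumptions, a 6 MiB array split across 4 GPUs with 1 MiB granularity yields partitions of 2, 1, 2 and 1 MiB (an imbalance of about 1.33), whereas 4 KiB granularity yields four even 1.5 MiB partitions (an imbalance of 1.0).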
in Figure 8). The default policy works better on an alternative kernel implementation that takes A in row major order (matrixmul_transA) because it is able to remove all remote accesses to A. However, remote accesses to B still limit the performance of that version of the code, and B has to be replicated (matrixmul_transA_repB) to show a performance comparable to the ideal version (matrixmul_repAB). Nevertheless, the objective of CUDArrays is to be transparent (no modifications to the kernel code) and, therefore, we try to achieve the same effect by using a different partitioning scheme. Partitioning A along the x dimension and replicating B (matrixmul_partX_repB) gives a speedup of 1.52× for the medium input dataset on 2 GPUs. Again, we attribute the performance loss (compared to the alternative kernel implementation) to the overhead of partitioning along this dimension, due to the 1 MB allocation granularity imposed by the driver. This restriction forces our implementation to use big memory paddings between the rows of the matrices, thus harming data access locality. It also increases the footprint of the data structures by a huge factor (this is why we have not been able to obtain results for the biggest input dataset).

For the stencil3D case, if we partition volumes along their y dimension (stencil3D_partY), instead of the default z, we obtain a speedup close to the ideal. Since in this case we are partitioning the second-order dimension of the data structures, the allocation granularity problems do not seem to have a big impact on performance, which allows us to simulate all input dataset sizes.

Figure 8: Effect of the partitioning of the data structures and computations along different dimensions on 2 GPUs. (Bar charts not reproduced. Y-axis: speedup (x), 0.0 to 2.0; series: small, medium and big input datasets. Recovered bar labels: convolution2D_repConv, matrixvec_repB, matrixmul, matrixmul_transA, matrixmul_repAB, matrixmul_transA_repB, matrixmul_partX_repB, stencil3D, stencil3D_partY; the first two appear to belong to Figure 7, whose caption is missing from the extracted text.)

C. Scalability analysis

We use the mathematical model introduced in Section V to project the scalability of CUDArrays for a larger number of GPUs (up to 16). Figure 9a shows the contention ratio for a bus interconnect. As expected, due to the wrong partitioning scheme, stencil3D and matrixmul require a very large number of remote memory accesses. As a consequence, if GPUs are connected using a bus topology, these benchmarks require about 140 TB/s to offer strong scalability. If a crossbar is used instead (Figure 9b), this bandwidth requirement decreases to 60 GB/s. This result shows that, even when using a fully connected interconnection network, large bandwidths are required if data partitioning produces a very large number of memory accesses.

Figure 9a also shows that most of our benchmarks require extremely large bandwidths when using a bus topology. Even without considering stencil3D and matrixmul, a bandwidth of 4 TB/s is still required. However, if a crossbar is used, these bandwidth requirements drop to 40 GB/s to provide strong scalability. Although this bandwidth is not achievable today by most interconnects, it is likely to be achievable in the short term. Moreover, three benchmarks (stencil2D and sort-merge) are able to provide strong scalability up to 16 GPUs when using a crossbar of 10 GB/s, which is already provided by current fabrics.

Figure 9: Contention indexes on different network configurations. (a) Bus configuration. (b) Crossbar configuration. (Plots not reproduced. X-axis: interconnection network bandwidth (GB/sec), 0 to 22000 for the bus and 0 to 90 for the crossbar; y-axis: contention index, 0 to 8; series: convolution2D, matrixvec, reduction, stencil2D, sort_merge_global, stencil3D, matrixmul.)

VII. RELATED WORK

Solutions have been proposed in many areas to exploit multi-GPU execution.

a) Run-time task distribution: Ayguadé et al. presented the GPUSs programming model and runtime in [4]. GPUSs relies on annotations to host functions and CUDA kernels, processed by a source-to-source compiler to create a data dependency graph. A run-time is in charge of scheduling kernel execution, allocating memory and performing data copies among the GPUs in the system. Augonnet et al. presented StarPU in [7]. StarPU is a runtime dependency tracker and scheduler for heterogeneous systems. Programmers write tasks in the form of codelets and define the dependencies among tasks. Contrary to GPUSs, StarPU allows programmers to define arbitrary dependencies between tasks (using task identifiers or tags). The runtime dynamically chooses the version of the codelet to be executed depending on the load of the different processors in the system. Although both solutions free programmers from explicitly allocating GPU memory, performing memory transfers and scheduling kernel execution, programmers still have to manually decompose computations into tasks and define inputs and outputs for each task.

b) Compiler-based transparent multi-GPU execution: Kim et al. introduce an OpenCL framework that combines multiple GPUs and treats them as a single compute device in [20]. In order to split computations, they analyze the array ranges accessed by the thread blocks by performing a first run of the kernels on the CPU. A source-to-source compiler generates the CPU version of the code. Depending on how threads are mapped on the arrays, the run-time chooses the best data partitioning scheme. When there are array ranges that are modified by thread blocks that belong to different partitions, a diffing step is performed to update the elements with the correct values. The authors claim that their solution also allows partitioning and executing computations that do not fit in the memory of a single GPU. Their partitioning approach is similar to our proposed solution, although it is restricted to a subset of thread-to-data mappings, while our solution is more robust as it works seamlessly on any kind of mapping.

c) Language/Library-based transparent multi-GPU execution: GlobalArrays [5] is a library that allows distributing computations across a distributed memory system using an array-like interface, similar to CUDArrays. However, it does not support GPU execution or SPMD languages. X10 [21] and Habanero [6] are two PGAS languages which present the programmer with a single global address space. The compiler and the runtime transparently redirect remote memory accesses to the proper memory. Sequoia [22] tries to address the problem of programming for the wide range of memory topologies/hierarchies found in modern systems. Programs are composed of two parts: (a) an algorithmic representation of the computation using a C-like programming language that partitions data structures and defines how to map the computation on them, and (b) a mapping of the algorithm to the specific system using a declarative language.

d) ND-arrays for GPUs: Thrust [23] is a C++ library that provides a 1D array container (vector) that efficiently supports a number of pre-defined algorithms and map/reduce primitives. However, explicit transfers between host and GPU memories are required. ArrayFire [24] is a user-level C/C++/Fortran library that provides abstractions for ND-arrays and a number of functions for array indexing, manipulation, data analysis, linear algebra, image and signal processing and sparse matrices. Arrays can only be used in the advertised functions and, in order to use them in custom CUDA kernels, regular CUDA allocations must be used. Microsoft offers ND-arrays in the C++ AMP [13] programming model. They can be freely used in all kinds of computations and are accessed using the regular subscript notation. Moreover, C++ AMP automatically and transparently takes care of transferring data among CPU/GPU memories.

VIII. CONCLUSIONS

In this paper we show that CUDArrays, a simple multi-dimensional array interface, enables both simple and multi-GPU programming and effective compile-time and run-time implementation. We also show that a fine granularity (4 KB) memory allocation system that guarantees contiguous placement of consecutive allocation requests in the virtual space, regardless of the target physical GPU, has a major effect on the achievable performance. Using real hardware with GPUs we show that our system achieves good speedup using multiple GPUs. Using an analytical model we show that the system can potentially make good use of up to 16 GPUs in future hardware. We are currently investigating compiler techniques for better control of partitioning computation and data for our system.

REFERENCES

[1] “TOP500 list - November 2012.” [Online]. Available: http://top500.org/list/2012/11/100/
[2] CUDA C Programming Guide, NVIDIA, 2012.
[3] The OpenCL Specification, 2009.
[4] E. Ayguadé, R. M. Badia, F. D. Igual, J. Labarta, R. Mayo, and E. S. Quintana-Ortí, “An extension of the StarSs programming model for platforms with multiple GPUs,” in Euro-Par ’09, 2009.
[5] J. Nieplocha, R. J. Harrison, and R. J. Littlefield, “Global Arrays: A portable "shared-memory" programming model for. . .” ser. SC ’94, 1994.
[6] V. Cavé, J. Zhao, J. Shirako, and V. Sarkar, “Habanero-Java: the new adventures of old X10,” ser. PPPJ ’11, 2011.
[7] C. Augonnet, S. Thibault, R. Namyst, and P.-A. Wacrenier, “StarPU: A unified platform for task scheduling on heterogeneous multicore architectures,” in Euro-Par ’09, 2009.
[8] “Heterogeneous System Architecture: a technical review.” [Online]. Available: http://hsafoundation.com/publications/
[9] “NVIDIA GPUDirect,” NVIDIA. [Online]. Available: https://developer.nvidia.com/gpudirect
[10] S. L. Scott, “Synchronization and communication in the T3E multiprocessor,” ser. ASPLOS VII, New York, NY, USA, 1996.
[11] P. N. Glaskowsky, “NVIDIA’s Fermi: The First Complete GPU Computing Architecture.”
[12] Ivy Bridge Architecture, 2011.
[13] C++ AMP: C++ Accelerated Massive Parallelism, Microsoft, 2012.
[14] MPI-3: A Message-Passing Interface Standard, Message Passing Interface Forum, 2012.
[15] P. Micikevicius, “3D finite difference computation on GPUs using CUDA,” ser. GPGPU-2, 2009.
[16] PCI Express Base 3.0 Specification, PCI-SIG, 2010.
[17] I. Gelado, J. E. Stone, J. Cabezas, S. Patel, N. Navarro, and W.-m. W. Hwu, “An asymmetric distributed shared memory model for heterogeneous parallel systems,” ser. ASPLOS XV, 2010.
[18] IMPACT Group, “Parboil benchmark suite,” http://impact.crhc.illinois.edu/parboil.php.
[19] M. Garland, S. Le Grand, J. Nickolls, J. Anderson, J. Hardwick, S. Morton, E. Phillips, Y. Zhang, and V. Volkov, “Parallel computing experiences with CUDA,” IEEE Micro, 2008.
[20] J. Kim, H. Kim, J. H. Lee, and J. Lee, “Achieving a single compute device image in OpenCL for multiple GPUs,” ser. PPoPP ’11, 2011.
[21] P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von Praun, and V. Sarkar, “X10: an object-oriented approach to non-uniform cluster computing,” ser. OOPSLA ’05, 2005.
[22] K. Fatahalian, D. R. Horn, T. J. Knight, L. Leem, M. Houston, J. Y. Park, M. Erez, M. Ren, A. Aiken, W. J. Dally, and P. Hanrahan, “Sequoia: programming the memory hierarchy,” 2006.
[23] N. Bell, “Thrust: A parallel template library for CUDA,” 2009.
[24] “ArrayFire,” AccelerEyes, 2012.