CUDArrays: transparent multi-GPU computation in NCC-NUMA GPU systems
Anonymous Review

Abstract—Many dense scientific computations can benefit from GPU execution. However, large applications might require the utilization of multiple GPUs to overcome physical memory size constraints. On the other hand, as integrated GPUs improve, multi-GPU nodes are expected to be ubiquitous in future HPC systems. Therefore, applications must be ported to exploit the computational power of all the GPUs on such nodes. Nevertheless, multi-GPU programming currently requires programmers to explicitly partition data and computation. Although some solutions have been proposed, they rely on specialized programming languages or compiler extensions, and do not exploit advanced features of modern GPUs like remote memory accesses.
In this paper we present CUDArrays, an interface for current and future multi-GPU systems that transparently distributes data and CUDA kernel computation across all the GPUs in a node. The interface is based on an N-dimensional array data type that also makes CUDA kernels more readable and concise. We propose four different implementation approaches and evaluate them on a 4-GPU system. We also perform a scalability analysis to project performance results for up to 16 GPUs. Our interface relies on peer-to-peer DMA technology to transparently access data located on remote GPUs. We modify a number of GPU workloads to use our data structure. Results show that linear speedups can be achieved for most of the computations if the correct partitioning strategy is chosen. Using an analytical model we project that most computations can scale up to 16 GPUs.

I. INTRODUCTION

Current HPC systems built using commodity microprocessors commonly install two or four CPU chips in each node [1]. Many systems also install GPUs to further accelerate computations rich in data parallelism. As CPU and GPU are integrated into the same chip, we can expect several GPUs per node to be common in future HPC systems. Hence, applications will have to exploit all the GPUs present in each node to achieve optimal performance.

Commercial GPU programming models, such as CUDA [2] and OpenCL [3], make multi-GPU programming a tedious and error-prone task. GPUs are typically exposed as external devices with their own private memory. Programmers are responsible for splitting both data and computation across GPUs by hand. While some solutions have been proposed [4], [5], [6], [7], they require programmers to rewrite their programs using specialized programming languages or compiler extensions. Moreover, most of them do not exploit advanced features of modern GPUs like remote memory accesses.

In this paper we present a data and computation partitioning interface for dense data structures, i.e., multi-dimensional arrays, that simplifies the programming of current and future multi-GPU systems (either discrete or integrated). Our proposal assumes non-coherent memory accesses between GPUs through a relatively low-bandwidth interconnect (Non-Cache-Coherent Non-Uniform Memory Access, or NCC-NUMA), such that all GPUs can access any partition of the logical data structure, providing the illusion of programming for a single-GPU node. For future systems with integrated GPUs, we envision system architecture support such as the Heterogeneous System Architecture (HSA) [8] that enables GPUs in the same node to access each other's physical memory. In systems with discrete GPUs we rely on the peer-to-peer (P2P) DMA technology [9] that enables one GPU to access the memory of any other GPU connected to the same PCIe bus. We present how a transparent multi-GPU data and computation distribution scheme can be encapsulated in an N-dimensional array (ND-array) abstraction, and introduce four different implementation techniques for this abstraction using C++ and CUDA. This interface not only makes it possible to distribute data and computation across GPUs but also makes the CUDA kernels more concise and readable. We modify a set of GPU dense-computation benchmarks, originally supporting only single-GPU systems, to use our ND-array data structure. After these minor modifications, we run the benchmarks on a multi-GPU system to evaluate the implementation alternatives and analyze the scalability of our approach. Experimental results show that linear speedups can be achieved for most of the computations if the correct partitioning strategy is chosen. Using an analytical model we project that most computations can scale up to 16 GPUs.

The main contributions of this paper are (1) a simple interface to transparently distribute computation and data across several GPUs, (2) an analysis of different implementation techniques to efficiently partition dense data structures across GPUs, and (3) an evaluation of the performance of the data distribution techniques using a wide range of GPU workloads.

The rest of this paper is organized as follows. Section II introduces our base architecture and discusses some challenges found in multi-GPU programming. Section III discusses trade-offs for automated data and computation distribution. Section IV presents our framework to ease the development of multi-GPU applications and examines different implementation techniques. The experimental methodology is presented in Section V, and Section VI evaluates our proposed framework. We discuss the related work in Section VII. Finally, we conclude the paper in Section VIII.

II. BACKGROUND AND MOTIVATION

This section presents the base multi-GPU architecture targeted in this paper and introduces the necessary terminology. Since we focus on the programmability within a node, we refer to it as the "system" in the rest of the paper. We also discuss some programmability issues present in multi-GPU systems.

Figure 1: Multi-socket integrated CPU/GPU system.

  Interconnect              Latency     Bandwidth
  DDR3 (CPU, 1 channel)     ~50 ns      ~30 GBps
  GDDR5 (GPU)               ~500 ns     ~200 GBps
  HyperTransport 3          ~40 ns      25+25 GBps
  QPI                       ~40 ns      25+25 GBps
  PCI Express 2.0           ~200 ns     8+8 GBps
  PCI Express 3.0           ~200 ns     16+16 GBps
Table I: Interconnection network characteristics.

Figure 2: Multi-GPU architecture used in this paper.

A. Non-Coherent NUMA architectures

Figure 1 shows the base multi-GPU system architecture assumed in this paper. The system has one or several chips that contain both CPU and GPU cores. CPU cores implement out-of-order pipelines that allow them to execute complex control code efficiently. GPU cores provide wide vector units and a highly multi-threaded execution model better suited for codes rich in data parallelism (e.g., dense scientific computations). Each chip is connected to one or more memory modules, but all cores can access any memory module in the system. Accesses to remote memories have longer access latency than local accesses, thus forming a NUMA system. CPU cores typically access memory through a coherent cache hierarchy, while GPUs use much weaker consistency models that do not require coherence among cores. Therefore, in order to support these two different memory organization schemes, both coherent and non-coherent interconnects are provided for inter-chip memory accesses. This NCC-NUMA system architecture has been successfully used in the past (e.g., Cray T3E [10]). Using this kind of organization allows computation to be easily distributed across the GPU cores, like in current cache-coherent CPU-only NUMA systems (e.g., SMP). However, as in any NUMA system, remote accesses must be minimized to avoid their longer access latency and lower memory bandwidth. This paper presents a software framework which allows programmers to easily exploit NCC-NUMA multi-GPU systems in dense scientific computations. We evaluate this framework on top of an existing commercial system that implements most of the features of an NCC-NUMA system (Figure 2). In this paper we focus on the GPUs and the inter-GPU communication.

B. Multi-GPU systems

Modern NVIDIA GPUs allow programs running on the GPU to access most memories in the system [2]. Memory bandwidths and latencies for the different interconnects that can be found in this kind of system are summarized in Table I. Accesses to host memories from the GPU (2) go through the PCIe interconnect and the CPU memory controller. If the target address resides in memory connected to a different CPU socket, the inter-CPU interconnect (HyperTransport/QPI) must be traversed, too. Support for peer-DMA memory accesses was added to NVIDIA Tesla devices in the Fermi family [11]. This feature allows GPUs to directly access the memory on other GPUs through the PCIe interconnect (3). It was coupled with the Unified Virtual Address Space (UVAS), which ensures that a virtual address range belongs to a single physical device and, therefore, makes it easy to route application-level data requests to the proper physical memory. Hence, code in CUDA programs can transparently access any memory in the system through regular pointers. Vendors that support less-featureful programming interfaces like OpenCL do not explicitly export the mechanisms to perform remote memory accesses, but such accesses could be used internally in the implementation of such interfaces. While the execution model provided by GPUs (discussed in detail in the next subsection) can hide large memory latencies, both host memory and the inter-GPU interconnects (e.g., PCI Express) deliver a memory bandwidth which is an order of magnitude lower than the local GPU memory (GDDR5). The main limitation found in current systems is the incompatibility between the protocols of the inter-CPU interconnect and the peer-memory access. Therefore, our test platform uses a single CPU and several discrete GPUs.

Current integrated CPU/GPU chips are less popular in HPC systems due to their lower memory bandwidth. However, many vendors have already implemented support for general purpose computations in integrated GPUs. For example, since the Sandy Bridge family, integrated GPUs from Intel support OpenCL programs [12]. AMD also introduced the Fusion architecture, which supports OpenCL and C++ AMP [13]. In Fusion devices, general purpose cores and GPU cores share the last level of the memory cache hierarchy, but a non-coherent interconnect is provided to achieve full memory bandwidth. Depending on the characteristics of the memory allocation (e.g., accessible by both CPU and GPU, or by the GPU only), GPU cores use the coherent or the non-coherent interconnect, respectively. As the integration technology improves, these chips are expected to be dominant in the future.
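As a reference, the following minimal sketch (our illustration, not code from the paper) shows the two CUDA features described above working together: peer access is enabled between two discrete GPUs, and a kernel running on GPU 0 dereferences a pointer that physically resides on GPU 1, relying on the unified virtual address space to route the accesses. Error checking is omitted for brevity.

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void read_remote(const float *remote, float *local, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) local[i] = remote[i];   // load travels over PCIe to the peer GPU
    }

    int main() {
        const int n = 1 << 20;
        float *buf0, *buf1;

        cudaSetDevice(1);
        cudaMalloc(&buf1, n * sizeof(float));   // physically resides on GPU 1

        cudaSetDevice(0);
        cudaMalloc(&buf0, n * sizeof(float));   // physically resides on GPU 0
        cudaDeviceEnablePeerAccess(1, 0);       // let GPU 0 access GPU 1's memory

        // Thanks to the UVAS, buf1 is a valid pointer when dereferenced on GPU 0.
        read_remote<<<(n + 255) / 256, 256>>>(buf1, buf0, n);
        cudaDeviceSynchronize();
        printf("done\n");
        return 0;
    }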

C. SPMD Programming Model

GPUs are typically programmed using a Single Program Multiple Data (SPMD) programming model, such as NVIDIA CUDA [2] or OpenCL [3]. This model allows programmers to spawn a large number of threads that execute the same program, although each thread might take a completely different control flow path. All these threads are organized into a computational grid of groups of threads (i.e., thread blocks in CUDA, work groups in OpenCL). Each thread block has a block identifier and each thread has an identifier within the thread block, which can be used by programmers to map the computation to the data structures (3D identifiers are provided to ease the mapping on N-dimensional data structures). Both CUDA and OpenCL provide very weak consistency models: memory updates performed by a thread block do not have to be visible to other thread blocks (except when using atomic instructions). Therefore, thread blocks can be independently executed.

Each thread block is scheduled to run on a GPU core, and threads within a thread block are issued in fixed-length groups (i.e., warps in CUDA). Each GPU core provides fine-grained multi-threading hardware to context switch among the different warps being executed on the same GPU core. This context switching happens on data dependencies with long-latency instructions (e.g., misses in the L1 cache). This scheme allows the GPU core to hide long-latency operations by concurrently executing many threads. Threads within the same thread block can communicate through a shared memory region or using synchronization instructions (e.g., __syncthreads). The number of thread blocks that can execute concurrently on the same GPU core depends on the number of threads and other resources needed by each block (e.g., shared memory and registers). Hence, the utilization of these resources must be carefully managed to achieve full utilization of the GPU.

D. Multi-GPU programming

The CUDA device language [2] is not aware of multi-GPU execution. Thread block identifiers are local to each GPU and memory allocations are bound to a single GPU. Moreover, although the CUDA programming model exposes to programmers a UVAS and provides support for remote accesses, there is no mechanism to control how virtual addresses are mapped to the physical memories in the system. Hence, programmers typically split data structures by allocating a chunk of the structure in each GPU (obtaining non-contiguous virtual memory regions) and must explicitly use the appropriate pointers to access allocations located on remote GPUs. This is analogous to the traditional solutions proposed for distributed-memory systems like message passing (e.g., MPI [14]). This extra code harms the programmability of the kernel code and is likely to impose some overhead. Therefore, the most common way to divide the computation in CUDA is to access the local GPU memory only, thus replicating those memory regions that are accessed by more than one GPU. Changes to these shared regions must be propagated to the rest of the GPUs. OpenCL [3] introduces the concept of a global offset to ease the partitioning of a computation across different devices. Programmers can use the global id of the thread blocks to access the same allocation from different processors on shared-memory systems. However, on devices with distributed memories (like GPUs), data structures must still be manually split, distributed, and managed by the programmer.

Listing 1 shows the CUDA kernel code of a 5-point 2D stencil computation.

 1 void stencil2D(float *out, const float *in) {
 2     int tx = threadIdx.x + 1, ty = threadIdx.y + 1;
 3
 4     int id_x = blockIdx.x * B_X + tx;
 5     int id_y = blockIdx.y * B_Y + ty;
 6
 7     ssize_t size_x = gridDim.x * B_X, size_y = gridDim.y * B_Y;
 8
 9     int idx = (id_y * size_x) + id_x;
10
11     __shared__ float sh[B_Y + 2][B_X + 2];
12     sh[ty][tx] = in[idx];                     // Load central point
13
14     if (tx == 1) {                            // Load left/right "halo" points
15         sh[ty][0      ] = in[idx - 1];
16         sh[ty][B_X + 1] = in[idx + B_X];
17     }
18     if (ty == 1) {                            // Load up/down "halo" points
19         sh[0      ][tx] = in[idx - 1 * size_x];
20         sh[B_Y + 1][tx] = in[idx + B_Y * size_x];
21     }
22     __syncthreads();
23
24     out[idx] = sh[ty][tx] + (sh[ty][tx-1] + sh[ty][tx+1]) +
25                             (sh[ty-1][tx] + sh[ty+1][tx]);
26 }

Listing 1: 2D stencil computation on multiple GPUs using boundary exchanges in the host.

void stencil2D_peer(float *out, const float *in,
                    const float *in_up, const float *in_down) {
    ...
    if (tx == 1) {                            // Load left/right "halo" points
        sh[ty][0      ] = in[idx - 1];
        sh[ty][B_X + 1] = in[idx + B_X];
    }
    if (ty == 1) {                            // Load up/down "halo" points
        if (id_y == 1 && in_up != NULL) {
            sh[0][tx] = in_up[id_x];
        } else {
            sh[0][tx] = in[idx - 1 * size_x];
        }
        if (id_y == size_y - 1 && in_down != NULL) {
            sh[B_Y + 1][tx] = in_down[id_x];
        } else {
            sh[B_Y + 1][tx] = in[idx + B_Y * size_x];
        }
    }
    ...
}

Listing 2: 2D stencil computation on multiple GPUs using peer memory accesses.
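To make the "boundary exchanges in the host" of Listing 1 concrete, the following is a sketch (our illustration, not the paper's host code) of the copies the host would issue after every time step when each GPU owns a (rows_per_gpu + 2) x size_x subdomain with one halo row above and below.

    #include <cuda_runtime.h>

    // d_out[g] points to the subdomain owned by GPU g.
    void exchange_halos(float **d_out, int num_gpus, size_t size_x, size_t rows_per_gpu) {
        const size_t row_bytes = size_x * sizeof(float);
        for (int g = 0; g < num_gpus - 1; ++g) {
            // last interior row of GPU g -> top halo row of GPU g+1
            cudaMemcpyPeer(d_out[g + 1], g + 1,
                           d_out[g] + rows_per_gpu * size_x, g, row_bytes);
            // first interior row of GPU g+1 -> bottom halo row of GPU g
            cudaMemcpyPeer(d_out[g] + (rows_per_gpu + 1) * size_x, g,
                           d_out[g + 1] + size_x, g + 1, row_bytes);
        }
    }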
Stencil computations are common in iterative FDTD (finite-difference time-domain) simulations. In these types of iterative simulations the output of a step becomes an input of the following steps. The code in Listing 1 is written for one GPU, but it can be used to compute a subdomain of the problem if domain decomposition is used for multi-GPU execution, provided that the programmer manually handles data consistency among the subdomains (e.g., by copying data among GPUs). To compute the output at position i, j the 2D stencil computation uses the values within a one-element radius of the i, j position of the input array. Thus, an extra row of the input for each neighboring domain (also known as "halo") must be replicated in each GPU. In the next iteration the output is used as input and, therefore, the values at the domain boundaries that are needed by the neighbors must be appropriately copied by the host code.

In order to avoid the extra code and copies, peer-memory accesses can be used to access those rows that belong to the neighboring subdomains. Listing 2 shows the changes to the kernel needed to use peer-memory accesses. The function signature of the kernel reflects that three pointers must be used for the input data instead of one: one for the allocation of the GPU on which the code is running, and pointers to the data of the neighbouring subdomains that are necessary to compute the output at the boundaries. Moreover, when the halo points (lines 14-21 in Listing 1) are located on a remote GPU, the appropriate pointer must be used. Nevertheless, the modifications shown in this example are relatively simple because a specific data partitioning layout is assumed (the 2D matrices are divided along the highest-order dimension and rows cannot be split). A more generic implementation should check where data is located on every access to the arrays.

III. DATA PARTITIONING IN DENSE SPMD CODES

In this section we discuss how to distribute computation and data across the GPUs in the system in an efficient way. We decouple computation and data distribution and discuss them independently. However, these two problems are deeply interrelated in terms of system performance. The perfect distribution scheme would place the dataset associated to any computation partition in the same physical GPU, thus completely removing the need for remote memory accesses. Such a simple rule presents many caveats because, in many cases, it is not possible to unequivocally determine which parts of the dataset are accessed by each computation unit (i.e., thread). Even if we assume this information is available, determining an optimal distribution approach is an NP problem whose solution space is unmanageable. Furthermore, parts of the dataset may be accessed by multiple threads, so data replication might be needed to provide good performance. There is a trade-off between the extra storage requirements of data replication and the performance losses due to remote memory accesses. As a general rule, replication should be avoided unless the performance is severely affected, i.e., the number of remote memory accesses is very high.

In this paper we assume no knowledge about the data access pattern performed by the GPU kernels. Hence, we aim to find a small set of partitioning schemes that deliver reasonable performance for the most common patterns found in dense scientific computations. Although compiler analysis might provide useful information, it is out of the scope of this paper.

A. SPMD Dense Scientific Codes

In this paper we target dense scientific computations where both the input and output datasets are represented as multi-dimensional arrays (i.e., nd-arrays). Some examples of these codes are Finite Difference Time Domain (FDTD) computations, Reverse Time Migration (RTM) seismic imaging, or Lattice Boltzmann Methods (LBM) for computational fluid dynamics. We first analyze the most common data access patterns in these codes to later define static partitioning schemes that will produce high multi-GPU performance in those codes exhibiting such data access patterns.

The SPMD paradigm gives complete freedom to the programmer regarding which data is accessed by each thread. However, quite often programmers write the code in such a way that accesses to nd-arrays are computed as an affine transformation (i.e., linear combination) of the thread indexes. As a result, threads with contiguous identifiers in one dimension (e.g., x) tend to access elements which are contiguous in the same or a different (e.g., y) dimension of the nd-array. There are two main reasons for SPMD dense scientific codes to use this kind of index transformation. First, affine transformations provide an easy and convenient way for programmers to map threads to parts of the arrays. Second, the GPU hardware requires threads with contiguous identifiers (i.e., a warp) to access contiguous memory locations to achieve high memory bandwidth.

Affine transformations used to produce array indexes from thread indexes typically increase monotonically: thread blocks with contiguous indexes in a given dimension (e.g., Bx,y,z and Bx,y,z+1) are likely to access parts of the arrays that are also contiguous in some dimension. This is the most commonly found pattern because the global thread index (i.e., the index of the thread within the computational grid) is usually computed using the block identifier. This pattern is also favored because the GPU hardware schedules contiguous thread blocks on the same compute core to maximize the reuse of data in the L1 cache.

Another pattern commonly found in dense scientific applications is the utilization of read-only data structures whose elements are accessed by many or all threads. Examples of these data structures are coefficient matrices in stencil and convolution computations, or input matrices for some linear algebra computations such as matrix-matrix multiplications. The access patterns to these data structures heavily depend on the specific computation and, thus, a common pattern cannot be easily inferred.

B. Computation Partitioning

Figure 3: Different computation partitioning schemes for a 2D grid. Each shade of gray represents a different partition; smaller squares represent thread blocks and threads.

The CUDA model imposes the thread block as the smallest granularity for computation partitioning. Individual threads cannot be independently executed because the programming model guarantees that threads within a warp are executed in lockstep. Moreover, threads within a thread block share resources (e.g., shared memory), and support barrier synchronization operations. This requires partitioning schemes to ensure that all threads within a thread block are executed on the same compute core.

Figure 3 illustrates our computation distribution approach. We group thread blocks in as many evenly-sized sets as GPUs are present in the system. Each set is composed of thread blocks whose indexes are contiguous in a given dimension. This scheme offers the programmer/runtime system as many degrees of freedom as dimensions in the computational grid of the SPMD model (i.e., three in CUDA and OpenCL) when deciding how the computation gets split across the GPUs in the system. These computation partitioning schemes are arbitrary, but we choose them for their implementation simplicity: global block identifiers can be easily calculated using a simple offset addition for each GPU and, thus, impose minimal overhead on the kernel code.
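The following sketch (our illustration, not the CUDArrays runtime) shows this kind of computation partitioning: the block grid is split along its y dimension into one contiguous set of blocks per GPU, and a per-GPU block offset stored in constant memory lets device code turn the local blockIdx into a global block identifier with a single addition.

    #include <cuda_runtime.h>

    __constant__ int3 block_offset;          // per-GPU block offset (one copy per device)

    __device__ int3 global_block_id() {      // the "simple offset addition" described above
        return make_int3(blockIdx.x + block_offset.x,
                         blockIdx.y + block_offset.y,
                         blockIdx.z + block_offset.z);
    }

    // Host side: launch one contiguous slice of the grid on each GPU.
    void launch_split_y(dim3 grid, dim3 block, int num_gpus,
                        void (*launch)(dim3 grid, dim3 block)) {
        unsigned per_gpu = (grid.y + num_gpus - 1) / num_gpus;
        for (int g = 0; g < num_gpus; ++g) {
            unsigned first = g * per_gpu;
            if (first >= grid.y) break;                      // fewer block rows than GPUs
            unsigned rows = (grid.y - first < per_gpu) ? (grid.y - first) : per_gpu;
            cudaSetDevice(g);
            int3 off = make_int3(0, static_cast<int>(first), 0);
            cudaMemcpyToSymbol(block_offset, &off, sizeof(off));
            launch(dim3(grid.x, rows, grid.z), block);       // wrapper doing kernel<<<...>>>
        }
    }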

C. Data Partitioning

The data partitioning problem is heavily influenced by the static computation partitioning approach previously described. Thread blocks commonly access elements whose indexes are the result of an affine transformation of the thread indexes. Hence, if arrays are partitioned using the same scheme as the computation, it is likely that thread blocks executing on a given GPU mostly access data hosted in its global memory.

However, oftentimes SPMD codes maximize memory bandwidth by either using a computational grid whose dimensionality does not match that of the memory structures (e.g., using a 2D computational grid on a 3D structure), or by having a different interrelation between the ordering of axes on the data structures and on the computational grid (e.g., the X axis of the computational grid corresponds to the Y axis of the structure). An illustrative example is the implementation of the 3D-stencil computation by Micikevicius [15], where all the threads traverse the dataset across the z dimension. To support these elaborate data access patterns, we decouple data and computation partitioning and let the programmer specify the dimensions on which to split each data structure.

Data partitioning can be implemented by evenly splitting arrays along any of their dimensions, and hosting each partition in one of the GPU memories in the system. On each memory access, our implementation determines whether the element being accessed is hosted in the memory local to the GPU executing the code or on a remote GPU, and sends the request to the corresponding memory. In Section IV we present different software and hardware approaches to efficiently implement this mechanism.

In some codes, the overhead of remote memory accesses can be so large that multi-GPU execution becomes slower than single-GPU execution. As previously discussed, this pattern mostly happens on read-only arrays (i.e., they are not modified by the GPU code), like in a matrix-matrix multiplication (where a GPU accesses its local partition for one of the matrices, but entirely accesses the other matrix) or in a convolution (where a small coefficient matrix is read by all threads). To achieve reasonable performance these arrays must be replicated on all GPU memories, so that all accesses to them will be local. Replication is only viable if there is enough available memory in all GPUs; otherwise we fall back to a distributed approach.

IV. CUDARRAYS IMPLEMENTATION

We introduce the interface of CUDArrays and discuss implementation techniques to efficiently support transparent computation and data distribution on multiple GPUs.

void stencil2D(dynarray<float, 2> out, const dynarray<float, 2> in) {
    int tx = threadIdx.x + 1, ty = threadIdx.y + 1;

    int id_x = get_global_block_id().x * B_X + tx;
    int id_y = get_global_block_id().y * B_Y + ty;

    __shared__ float sh[B_Y + 2][B_X + 2];
    sh[ty][tx] = in(id_y, id_x);              // Load central point

    if (tx == 1) {                            // Load left/right "halo" points
        sh[ty][0      ] = in(id_y, id_x - 1);
        sh[ty][B_X + 1] = in(id_y, id_x + B_X);
    }
    if (ty == 1) {                            // Load up/down "halo" points
        sh[0      ][tx] = in(id_y - 1, id_x);
        sh[B_Y + 1][tx] = in(id_y + B_Y, id_x);
    }
    __syncthreads();

    out(id_y, id_x) = sh[ty][tx] + (sh[ty][tx-1] + sh[ty][tx+1]) +
                                   (sh[ty-1][tx] + sh[ty+1][tx]);
}

Listing 3: 2D stencil computation on multiple GPUs using the cuda::dynarray abstraction.

A. API

We encapsulate the functionality of our interface in a data type used to represent ND-arrays. This data type is implemented by the cuda::dynarray<T, N> C++ template. The number of dimensions (N) and the type of the elements of the array (T) are statically defined by the programmer. The size of each dimension is specified in the class constructor. Programmers can also specify the per-dimension data partitioning and alignment schemes (although data partitioning is currently implemented for a single arbitrary dimension). Moreover, classes instantiated from this template provide an overloaded version of the function operator (operator()) to access the array using a notation similar to the array subscript operator (operator[]). For example, the position i, j, k of a 3D volume A is accessed as A(i, j, k). These accessors also hide the implementation details of the data distribution scheme from programmers.
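To illustrate what such an accessor hides, the following is a minimal, self-contained sketch of a dynarray-like 2D wrapper (our illustration, not the CUDArrays implementation): operator() performs the index linearization that flat-pointer CUDA code writes by hand, and the real accessor would additionally route the access to the GPU that hosts the element.

    #include <cstddef>

    template <typename T, unsigned Dims>
    class dynarray_sketch;                   // only the 2D case is sketched here

    template <typename T>
    class dynarray_sketch<T, 2> {
    public:
        __host__ __device__
        dynarray_sketch(T *data, size_t rows, size_t cols)
            : data_(data), rows_(rows), cols_(cols) {}

        // A(i, j) instead of A[i * cols + j], usable from both CPU and GPU code.
        __host__ __device__ T &operator()(size_t i, size_t j) {
            return data_[i * cols_ + j];
        }
        __host__ __device__ const T &operator()(size_t i, size_t j) const {
            return data_[i * cols_ + j];
        }

    private:
        T *data_;
        size_t rows_, cols_;
    };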
Furthermore, arrays instantiated from this class can be accessed by both CPU and GPU code. We also provide an option to replicate an array on all the GPUs.

For arrays whose dimensions are known at compile time the framework provides a similar C++ template which takes the extents of the data structure as template parameters (cuda::array<T, ...>). This implementation offers better performance when accessing the array, and a lower register usage count, since most of the index linearization computations can be resolved at compile time.

Using this API, the device code of the 2D stencil example used in Section II is shown in Listing 3. The only required changes are the use of the cuda::dynarray structure and the removal of the boundary checking. Since CUDA does not support specifying an offset for the block identifiers (OpenCL does support it), we provide the function get_global_block_id, which automatically computes the global block id using an offset in constant memory that is written by the runtime just before the kernel call. Note that index linearization is not required either, as it is automatically performed by the accessor function.

When the host code invokes a kernel, programmers can also specify the per-dimension computational partitioning scheme. If no computational partitioning scheme is provided by the programmer, the framework automatically uses all the available GPUs and partitions the computational grid across its highest-order dimension, although more advanced policies can be applied by inspecting the current partitioning schemes of the data structures used by the kernel.

B. Data partitioning

Data partitioning must offer efficient array indexing routines that impose as little overhead as possible on code execution. Our framework currently implements four different data partitioning techniques, each with increasing sophistication. For simplicity, we assume that arrays are partitioned along their highest-order dimension, but most of the presented techniques can be applied to partitionings on other dimensions (or combinations of them) with minor modifications of what is explained in this section.

1) Naive: This implementation splits the array into evenly-sized chunks which are allocated on each of the GPUs in the system (see Figure 4a). Using this approach, every time an array is accessed, both the allocation and the offset within the allocation must be determined for the given indexes. This operation starts with the linear index of the access (idx) and uses a division to compute the chunk index (chunk = idx / chunkSize) and a modulo to compute the offset within the chunk (off = idx % chunkSize). Then, the chunk is used to index the table of GPU allocations and the offset is used to index within the selected chunk (ptr_table[chunk][off]).

2) Precomputed pointer table: In order to reduce the potential overhead of the index and offset computation, we implement a second scheme that uses a precomputed pointer table (ptr_table), where each entry points to the beginning of each position of the dimension being partitioned on the array. Thus, the given index for that dimension (i) is used to index the pointer table, and the remaining indices are used to compute the linear index (idx) to be used within that dimension (ptr_table[i][idx], see Figure 4b). In order for this to work, all elements within an (N-1)-dimensional subarray must reside in the same GPU. Therefore, the allocation policy must be slightly tuned so that the arrays are partitioned using a granularity equal to the size of an (N-1)-dimensional subarray. For 1D arrays we use a ptr_table as big as the array (one pointer per element). As opposed to the naive approach, having partitions span a whole dimension can cause imbalance if the number of elements in the highest-order dimension is not big enough to be evenly distributed among the GPUs. Moreover, the extra memory indirection in each access to the array can hurt performance because of the extra memory access and the lower cache efficiency.

3) Virtual memory: To completely eliminate the extra overhead in the array indexing operation, data chunks located on the different GPUs must have contiguous virtual memory addresses to maintain the logical indexing semantics after index linearization. Current versions of CUDA do not support such control over the mappings within the UVAS. However, we have conducted some experiments with the CUDA memory allocator and results consistently show that the driver reserves contiguous 1 MB virtual memory ranges for consecutive memory allocation calls, regardless of the GPU on which memory is allocated. Using this observation we have implemented a new data distribution approach that partitions the array across different GPUs using as many 1 MB allocations as needed for each chunk (see Figure 4c). Thanks to the virtual memory contiguity, this implementation does not impose an overhead when accessing the array, since no operation besides index linearization (which already exists in the original code) must be performed.
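The following device-side sketch (our illustration, not the actual CUDArrays accessors) contrasts the index computations of the first two schemes; both keep a small table of per-GPU base pointers that is valid on every GPU thanks to the unified address space.

    #include <cstddef>

    // 1) Naive: one division and one modulo per access to locate the chunk.
    __device__ float &access_naive(float **chunk_table, size_t chunk_elems,
                                   size_t linear_idx) {
        size_t chunk = linear_idx / chunk_elems;   // which GPU allocation
        size_t off   = linear_idx % chunk_elems;   // offset inside it
        return chunk_table[chunk][off];
    }

    // 2) Precomputed pointer table: one pointer per position of the partitioned
    //    dimension, so the access only needs an extra indirection.
    __device__ float &access_ptr_table(float **row_table,
                                       size_t i,          /* partitioned dimension */
                                       size_t inner_idx   /* linearized remaining dims */) {
        return row_table[i][inner_idx];
    }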
However, the imposed 1 MB granularity can lead to load imbalance for small data structures. Thus, we provide an implementation that shifts the start of the array to balance the distribution of the data structure across two GPUs, mimicking what would be allocations at a granularity of 4 KB (as can be seen in Figure 4d). While this implementation cannot currently be used with an arbitrary number of GPUs, it allows us to showcase its potential benefits. This problem would be eliminated if memory allocations and their mappings to physical memories could be managed through the CUDA driver at a finer granularity (e.g., using an mmap-like function). We plan to work with the CUDA driver team to formalize such support in future CUDA versions.

C. Memory Consistency

Due to the memory consistency model in CUDA, threads in a thread block do not perceive regular memory updates performed by threads in a different thread block, and atomic instructions must be used in order to update the same memory locations. However, due to limitations in current NVIDIA GPUs, which do not implement atomic updates to remote memories (although they are supported in PCIe 3.0 [16]), our interface currently does not support codes that use them.

Input arrays are usually initialized by the host code and transferred to the GPU before kernel execution. On the other hand, output arrays are modified by the GPU code and copied back to the host to be written to an external device, or to be processed in later computations. These coherence operations are explicitly written by the programmer. Since our framework performs transformations on the arrays which are opaque to programmers, it has to provide the coherence operations to update the different copies of the arrays as needed. We use the ADSM memory consistency model presented in [17] to keep data coherent without requiring explicit memory transfers.

V. EXPERIMENTAL METHODOLOGY

A. Hardware setup

All experiments were run on a system containing a quad-core Intel i7-3820 at 3.6 GHz with 16 GB of DDR3 RAM, and 4 NVIDIA Tesla C2050 GPU cards with 3 GB of GDDR5 each, connected through PCIe 3.0 in x16 mode (containing two PCIe bridges, as in Figure 2). The machine runs a GNU/Linux system, with Linux kernel 3.5.0 and NVIDIA driver 304.64. Benchmarks were compiled using GCC 4.7.2 for CPU code and the NVIDIA CUDA compiler 5.0 for GPU code. Execution times were measured using the std::chrono high-resolution timers for the host code, and CUDA events for GPU code and memory transfers (microsecond accuracy). For runs with more than one GPU, the graphs show the time for the slowest execution of each kernel call.
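For reference, the following sketch shows the CUDA-event timing pattern mentioned above (a generic example, not the paper's measurement harness): events recorded around a kernel launch yield the elapsed GPU time with sub-millisecond resolution.

    #include <cuda_runtime.h>

    float time_kernel_ms(void (*launch)()) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        launch();                                // e.g. kernel<<<grid, block>>>(...)
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.f;
        cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }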

Figure 4: Data partitioning implementation approaches. (a) Naive; (b) Pointer table; (c) Virtual memory, 1 MB granularity; (d) Virtual memory, 4 KB granularity.
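The virtual-memory scheme of Figure 4c rests on an observed, undocumented allocator behaviour; the sketch below (our illustration) shows the allocation pattern it relies on: consecutive cudaMalloc calls issued to different GPUs are assumed to return consecutive 1 MB-aligned virtual ranges, so a logically contiguous array can be backed round-robin by all GPUs.

    #include <cuda_runtime.h>
    #include <vector>

    std::vector<void *> alloc_distributed(size_t bytes, int num_gpus) {
        const size_t kChunk = 1 << 20;                   // 1 MB driver granularity
        size_t chunks = (bytes + kChunk - 1) / kChunk;
        std::vector<void *> ptrs(chunks);
        for (size_t c = 0; c < chunks; ++c) {
            cudaSetDevice(static_cast<int>(c % num_gpus));
            cudaMalloc(&ptrs[c], kChunk);                // chunk c lives on GPU c % num_gpus
        }
        // The scheme assumes ptrs[c + 1] == (char *)ptrs[c] + kChunk, which the text
        // above reports as observed (not documented) CUDA allocator behaviour.
        return ptrs;
    }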

B. Benchmarks

We have selected a number of dense scientific computations found in the Parboil benchmark suite [18] and the NVIDIA SDK. The list of benchmarks used in this evaluation is summarized in Table II. Both host and kernel code have been modified to use the dynarray data structure instead of the flat 1D arrays which are commonly used (due to space constraints we do not evaluate the benefits of cuda::array for arrays whose dimensions are known at compile time). This resulted in cleaner and shorter code due to the elimination of the array offset computation. On the other hand, it has impacted the register usage count of the benchmarks. The effect on the usage count greatly varies depending on the indexing functions used by the array implementation. By default, computation and data are partitioned along the highest-order dimension. However, some computations require a different partition scheme: (1) in matrixmul (C = A × B), proposed by Volkov in [19], A is in column-major order while B is in row-major order. Thus, each thread block traverses both A and B on their y dimension and, therefore, crosses partition boundaries when the default partition policy is used. In this case the preferred dimension to be partitioned is x. (2) In stencil3D [15] a 2D computational grid is created and each thread block traverses the input and output 3D volumes on their z dimension, thus traversing partition boundaries. Therefore, volumes must be partitioned on a different dimension. In the evaluation we study the performance implications of the different partitionings for these cases.

C. Scalability Analysis

Current systems support up to four GPUs connected to the same PCIe root complex. This imposes a hard limit on the number of GPUs where we can experimentally measure the scalability of our proposed partitioning scheme. To overcome this limitation, we use an analytic model for weak and strong scaling for systems with up to 16 GPUs using different interconnection networks (i.e., different bandwidth and topology). Our modeling requires two input parameters: the GPU execution time when no remote memory accesses are performed (Tlocal), and the number of remote memory accesses for each GPU. Let Tremote be the time required to perform all remote accesses for a GPU. Because data and/or computation might not be evenly distributed across all the GPUs, we take the maximum execution time of all GPUs in the system. We assume that the total application execution time Texe can be calculated as Texe = max(Tlocal, Tremote).

This model is a reasonable approximation as long as the number of threads not requiring remote memory accesses is much larger than the number of threads performing remote memory accesses, which is the case in all the benchmarks we use. Notice that when analyzing strong scaling, this assumption breaks for large numbers of nodes.

We compute Tremote using queueing theory, using an M/D/1 model for each link in the system. The arrival rate (λ) is computed as

    λ = ( Σ_{i=1}^{N_GPU} N_remote^i ) / Tlocal

where N_GPU is the number of GPUs connected to the link, and N_remote^i is the number of remote memory accesses performed by the i-th GPU connected to the link. We assume a deterministic service time for each remote memory access in the network (i.e., μ), given by the fabric bandwidth assuming 32-byte accesses, which corresponds to the L2 cache line size in current GPUs.
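For reference, these are the standard M/D/1 quantities that such a per-link model can build on (our addition; the paper's text does not spell these formulas out, and the composition of Tremote below is an assumption):

    ρ = λ / μ                              % link utilization
    W_q = ρ / ( 2 μ (1 − ρ) )              % mean M/D/1 queueing delay per access
    Tremote ≈ N_remote · ( 1/μ + W_q )     % assumed per-GPU composition, not stated in the paper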

VI. PERFORMANCE EVALUATION

A. Indexing overhead

Table II shows that, although the indexing functions are inlined by the compiler, the naive (Regs 1) and pointer table (Regs 2) implementations of dynarray increase the register utilization count compared to virtual memory (Regs 3&4), which uses the same number of registers as the baseline. Moreover, naive adds extra computations which are likely to incur a greater performance penalty. Figure 5 shows the slowdown imposed by the indexing mechanism of each implementation on a single GPU using the biggest input dataset. Results confirm that the performance of naive is unacceptable for most benchmarks due to the lower occupancy of the GPU and the increased number of issued instructions.

  Benchmark           Suite      Array #dims  Inputset (small)  Inputset (medium)  Inputset (big)    Regs 1  Regs 2  Regs 3&4
  reduction           CUDA SDK   1D           1M                4M                 16M               18      12      10
  sort_merge_global   CUDA SDK   1D           1M                4M                 16M               22      18      16
  sort_merge_shared   CUDA SDK   1D           1M                4M                 16M               17      16      17
  sort_shared         CUDA SDK   1D           1M                4M                 16M               17      17      17
  saxpy               -          1D           1M                4M                 16M               14      10      7
  vecadd              -          1D           1M                4M                 16M               15      14      12
  fft1D               Parboil    1D           1M                4M                 16M               22      24      24
  GEMV (matrixvec)    -          2D / 1D      512×512 / 512     2K×2K / 2K         8K×8K / 8K        24      15      13
  GEMM (matrixmul)    Parboil 2  2D           256×256           1K×1K              4K×4K             35      32      35
  convolution2D       Parboil 2  2D           1K×1K / 3×3       4K×4K / 3×3        16K×16K / 3×3     23      21      20
  stencil2D           -          2D           1K×1K+halos       4K×4K+halos        16K×16K+halos     25      19      20
  stencil3D           Parboil 2  3D           1K×1K×32+halos    1K×1K×128+halos    1K×1K×512+halos   41      38      38
Table II: Benchmark description.

Figure 5: Slowdown of each dynarray implementation compared to the baseline version on a single GPU.
pointer table mostly shows performance degradation in 1D computations. This is because of the overhead of using an indirection table as big as the data structure itself. As expected, virtual memory does not impose any overhead, since its indexing function is the same as in the baseline implementation.

B. Multi-GPU performance

We execute the benchmarks using 2 and 4 GPUs on our test system. Figure 6 shows the speedup achieved for the different implementations of the dynarray data structure using the biggest input dataset. In these results both data structures and computations are partitioned using the default partitioning scheme (i.e., along the highest-order dimension). virtual memory (with 1 MB allocation granularity) is the implementation approach that delivers the best performance in all benchmarks but stencil2D. Scalability for most of the benchmarks with no communication is close to linear, but other kernels like stencil2D, reduction and sort_merge_global show speedups greater than 3.4×. matrixvec has a superlinear speedup for the virtual memory implementation. This benchmark has a memory access efficiency (actual versus requested memory bandwidth) which is much lower (below 5%) than the other benchmarks and hugely benefits from the larger aggregated memory bandwidth provided by multi-GPU execution. matrixmul and stencil3D have the worst results, as they are expected to perform badly with the default partitioning policy. convolution2D also exhibits poor scalability, especially when running on 4 GPUs.

One aspect that is not captured in the previous figure, since it only shows results using the biggest input dataset, is how the allocation granularity can impact the performance due to load imbalance. Table III shows the speedup that is achieved when 4 KB granularity is used for stencil2D and stencil3D. We study these benchmarks because the halos in all their dimensions make it impossible to evenly split them using 1 MB granularity. For small input datasets the speedup obtained in stencil2D using the finer allocation granularity is 1.66×, and it decreases as the input dataset size increases. The performance improvement in stencil3D is smaller due to its bigger footprint (which reduces the imbalance). These numbers show that using fine granularity might be key when data cannot be evenly split. Although this scenario only appears in a couple of benchmarks in our evaluation, it becomes more important when data structures are partitioned on lower-order dimensions.

  Benchmark   small   medium  big
  stencil2D   1.66    1.06    1.008
  stencil3D   1.035   1.028   1.018
Table III: Speedup of 4 KB vs. 1 MB granularity for 2 GPUs.

1) Data replication: There are computations that use big parts of some arrays to compute each element of the result (e.g., convolution2D, matrixvec). Other computations have a data access pattern that makes it impossible to distribute input arrays in a way that minimizes remote accesses (e.g., matrixmul). Thus, we study the effect of replicating these data structures. Figure 7 shows the speedup over the baseline for the original version and the version with the replicated data structures (labels contain the rep suffix followed by the name of the data structure being replicated). convolution2D shows a dramatic improvement when replicating the Conv matrix. Our experiments suggest that, in the original implementation, this small 3×3 matrix is evicted from the caches of the local GPU and often has to be requested from the remote GPU that hosts it. On the contrary, in matrixvec the performance remains constant across versions using the virtual memory implementation. On the other hand, pointer table does benefit from replication because of the reduction of capacity conflicts between B and the indirection table in the GPU caches (the hardware performance counter for reexecuted instructions is reduced by 300%). Nevertheless, matrixmul shows the largest improvement when replicating both the A and B matrices (across all the dynarray implementations).

Figure 7: Effect of the replication of data structures on the performance of the benchmarks. Results shown for 4 GPUs.

Figure 6: Speedup vs. baseline for the different dynarray implementations. Results shown for 2 and 4 GPUs.


Figure 8: Effect of the partitioning of the data structures and computations along different dimensions on 2 GPUs.

2) Arbitrary data partitioning: In matrixmul, both matrices A (stored in column-major order) and B (row-major order) are traversed on their y dimension, thus exhibiting bad performance with the default partitioning policy (matrixmul in Figure 8). The default policy works better on an alternative kernel implementation that takes A in row-major order (matrixmul_transA), because it is able to remove all remote accesses to it. However, remote accesses to B still limit the performance of that version of the code, and B has to be replicated (matrixmul_transA_repB) to show a performance comparable to the ideal version (matrixmul_repAB). Nevertheless, the objective of CUDArrays is to be transparent (no modifications to the kernel code) and, therefore, we try to achieve the same effect by using a different partitioning scheme. Partitioning A along the x dimension and replicating B (matrixmul_partX_repB) gives a speedup of 1.52× for the medium input dataset on 2 GPUs. Again, we attribute the performance loss (compared to the alternative kernel implementation) to the overhead of partitioning along this dimension due to the 1 MB allocation granularity imposed by the driver. This restriction makes our implementation use big memory paddings between the rows of the matrices, thus harming data access locality. It also increases the footprint of the data structures by a huge factor (this is why we have not been able to obtain results for the biggest input dataset).

For the stencil3D case, if we partition volumes along their y dimension (stencil3D_partY), instead of the default z, we obtain a speedup close to the ideal. Since in this case we are partitioning the second-order dimension of the data structures, the allocation granularity problems do not seem to have a big impact on the performance, which allows us to simulate all input dataset sizes.

C. Scalability analysis

Figure 9: Contention indexes on different network configurations; the x axis is the interconnection network bandwidth (GB/sec). (a) Bus configuration. (b) Crossbar configuration.

We use the mathematical model introduced in Section V to project the scalability of CUDArrays for a larger number of GPUs (up to 16). Figure 9a shows the contention ratio for a bus interconnect. As expected, due to the wrong partitioning scheme, stencil3D and matrixmul require a very large number of remote memory accesses. As a consequence, if GPUs are connected using a bus topology, these benchmarks require about 140 TB/s to offer strong scalability. If a crossbar is used instead (Figure 9b), this bandwidth requirement decreases to 60 GB/s. These results show that, even when using a fully connected interconnection network, large bandwidths are required if data partitioning produces a very large number of remote memory accesses.

Figure 9a also shows that most of our benchmarks require extremely large bandwidths when using a bus topology. Even without considering stencil3D and matrixmul, a bandwidth of 4 TB/s is still required. However, if a crossbar is used, these bandwidth requirements drop to 40 GB/s to provide strong scalability. Although this bandwidth is not achievable today by most interconnects, it is likely to be achievable in the short term. Moreover, three benchmarks (stencil2D and sort-merge) are able to provide strong scalability up to 16 GPUs when using a crossbar of 10 GB/s, which is already provided by current fabrics.
VII. RELATED WORK

Solutions have been proposed in many areas to exploit multi-GPU execution.

a) Run-time task distribution: Ayguadé et al. presented the GPUSs programming model and runtime in [4]. GPUSs relies on annotations to host functions and CUDA kernels, processed by a source-to-source compiler to create a data dependency graph. A run-time is in charge of scheduling kernel execution, allocating memory and performing data copies among the GPUs in the system. Augonnet et al. presented StarPU in [7]. StarPU is a runtime dependency tracker and scheduler for heterogeneous systems. Programmers write tasks in the form of codelets and define the dependencies among tasks. Contrary to GPUSs, StarPU allows programmers to define arbitrary dependencies between tasks (using task identifiers or tags). The runtime dynamically chooses the version of the codelet to be executed depending on the load of the different processors in the system. Although both solutions free programmers from explicitly allocating GPU memory, performing memory transfers and scheduling kernel execution, programmers still have to manually decompose computations into tasks and define inputs and outputs for each task.

b) Compiler-based transparent multi-GPU execution: Kim et al. introduce an OpenCL framework that combines multiple GPUs and treats them as a single compute device in [20]. In order to split computations, they analyze the array ranges accessed by the thread blocks by performing a first run of the kernels on the CPU. A source-to-source compiler generates the CPU version of the code. Depending on how threads are mapped on the arrays, the run-time chooses the best data partitioning scheme. When there are array ranges that are modified from thread blocks that belong to different partitions, a diffing step is performed to update the elements with the correct values. The authors claim that their solution also allows partitioning and executing computations that do not fit in the memory of a single GPU. Their partitioning approach is similar to our proposed solution, although it is restricted to a subset of thread-to-data mappings, while our solution is more robust as it works seamlessly on any kind of mapping.

c) Language/Library-based transparent multi-GPU execution: Global Arrays [5] is a library that allows computations to be distributed across a distributed-memory system using an array-like interface, similar to CUDArrays. However, it does not support GPU execution and SPMD languages. X10 [21] and Habanero-Java [6] are two PGAS languages which present the programmer with a single global address space. The compiler and the runtime transparently redirect remote memory accesses to the proper memory. Sequoia [22] aims to address the problem of programming for the wide range of memory topologies/hierarchies found in modern systems. Programs are composed of two parts: (a) an algorithmic representation of the computation using a C-like programming language that partitions data structures and defines how to map the computation on them; (b) a mapping of the algorithm to the specific system using a declarative language.

d) ND-arrays for GPUs: Thrust [23] is a C++ library that provides a 1D array container (vector) that efficiently supports a number of pre-defined algorithms and map/reduce primitives. However, explicit transfers between host and GPU memories are required. ArrayFire [24] is a user-level C/C++/Fortran library that provides abstractions for ND-arrays and a number of functions for array indexing, manipulation, data analysis, linear algebra, image and signal processing and sparse matrices. Arrays can only be used in the advertised functions and, in order to use them in custom CUDA kernels, regular CUDA allocations must be used. Microsoft offers ND-arrays in the C++ AMP [13] programming model. They can be freely used in all kinds of computations and are accessed using the regular subscript notation. Moreover, C++ AMP automatically takes care of transferring data among CPU/GPU memories, transparently.

VIII. CONCLUSIONS

In this paper we show that CUDArrays, a simple multi-dimensional array interface, enables both simple multi-GPU programming and an effective compile-time and run-time implementation. We also show that a fine-granularity (4 KB) memory allocation system that guarantees contiguous placement of consecutive allocation requests in the virtual address space, regardless of the target physical GPU, has a major effect on the achievable performance. Using real hardware with four GPUs we show that our system achieves good speedups using multiple GPUs. Using an analytical model we show that the system can potentially make good use of up to 16 GPUs in future hardware. We are currently investigating compiler techniques for better control of partitioning computation and data for our system.

REFERENCES

[1] "TOP500 list - November 2012." [Online]. Available: http://top500.org/list/2012/11/100/
[2] CUDA C Programming Guide, NVIDIA, 2012.
[3] The OpenCL Specification, 2009.
[4] E. Ayguadé, R. M. Badia, F. D. Igual, J. Labarta, R. Mayo, and E. S. Quintana-Ortí, "An extension of the StarSs programming model for platforms with multiple GPUs," in Euro-Par '09, 2009.
[5] J. Nieplocha, R. J. Harrison, and R. J. Littlefield, "Global Arrays: A portable "shared-memory" programming model for...," ser. SC '94, 1994.
[6] V. Cavé, J. Zhao, J. Shirako, and V. Sarkar, "Habanero-Java: the new adventures of old X10," ser. PPPJ '11, 2011.
[7] C. Augonnet, S. Thibault, R. Namyst, and P.-A. Wacrenier, "StarPU: A unified platform for task scheduling on heterogeneous multicore architectures," in Euro-Par '09, 2009.
[8] "Heterogeneous System Architecture: a technical review." [Online]. Available: http://hsafoundation.com/publications/
[9] "NVIDIA GPUDirect," NVIDIA. [Online]. Available: https://developer.nvidia.com/gpudirect
[10] S. L. Scott, "Synchronization and communication in the T3E multiprocessor," ser. ASPLOS VII, New York, NY, USA, 1996.
[11] P. N. Glaskowsky, "NVIDIA's Fermi: The First Complete GPU Computing Architecture."
[12] Ivy Bridge Architecture, 2011.
[13] C++ AMP: C++ Accelerated Massive Parallelism, Microsoft, 2012.
[14] MPI-3: A Message-Passing Interface Standard, Message Passing Interface Forum, 2012.
[15] P. Micikevicius, "3D finite difference computation on GPUs using CUDA," ser. GPGPU-2, 2009.
[16] PCI Express Base 3.0 Specification, PCI-SIG, 2010.
[17] I. Gelado, J. E. Stone, J. Cabezas, S. Patel, N. Navarro, and W.-m. W. Hwu, "An asymmetric distributed shared memory model for heterogeneous parallel systems," ser. ASPLOS XV, 2010.
[18] IMPACT Group, "Parboil benchmark suite," http://impact.crhc.illinois.edu/parboil.php.
[19] M. Garland, S. Le Grand, J. Nickolls, J. Anderson, J. Hardwick, S. Morton, E. Phillips, Y. Zhang, and V. Volkov, "Parallel computing experiences with CUDA," IEEE Micro, 2008.
[20] J. Kim, H. Kim, J. H. Lee, and J. Lee, "Achieving a single compute device image in OpenCL for multiple GPUs," ser. PPoPP '11, 2011.
[21] P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von Praun, and V. Sarkar, "X10: an object-oriented approach to non-uniform cluster computing," ser. OOPSLA '05, 2005.
[22] K. Fatahalian, D. R. Horn, T. J. Knight, L. Leem, M. Houston, J. Y. Park, M. Erez, M. Ren, A. Aiken, W. J. Dally, and P. Hanrahan, "Sequoia: programming the memory hierarchy," 2006.
[23] N. Bell, "Thrust: A parallel template library for CUDA," 2009.
[24] "ArrayFire," AccelerEyes, 2012.