Exploring the Multiple-GPU Design Space

Dana Schaa and David Kaeli
Department of Electrical and Computer Engineering
Northeastern University
{dschaa, kaeli}@ece.neu.edu

Abstract

Graphics Processing Units (GPUs) have been growing in popularity due to their impressive processing capabilities, and with general purpose programming languages such as NVIDIA's CUDA interface, are becoming the platform of choice in the scientific computing community. Previous studies that used GPUs focused on obtaining significant performance gains from execution on a single GPU. These studies employed low-level, architecture-specific tuning in order to achieve sizeable benefits over multicore CPU execution.
In this paper, we consider the benefits of running on multiple (parallel) GPUs to provide further orders of performance speedup. Our methodology allows developers to accurately predict execution time for GPU applications while varying the number and configuration of the GPUs, and the size of the input data set. This is a natural next step in GPU computing because it allows researchers to determine the most appropriate GPU configuration for an application without having to purchase hardware, or write the code for a multiple-GPU implementation. When used to predict performance on six scientific applications, our framework produces accurate performance estimates (11% difference on average and 40% maximum difference in a single case) for a range of short and long running scientific programs.

1. Introduction

The benefits of using Graphics Processing Units (GPUs) for general purpose programming have been recognized for some time, and many general purpose APIs have been created to abstract the graphics hardware from the programmer [2], [10], [15]. General purpose GPU (GPGPU) programming has become the scientific computing platform of choice mainly due to the availability of standard C libraries using NVIDIA's CUDA programming interface running on NVIDIA GTX GPUs. Since its first release, a number of efforts have explored how to reap large performance gains on CUDA-enabled GPUs [3], [7], [16], [18], [20], [21], [23].
One reason for this rapid adoption is that current generation GPUs (theoretically approaching a TFLOP of computational power) have the potential to replace a large number of superscalar CPUs for certain classes of parallel applications. However, obtaining peak GPU performance requires significant programming effort. For example, when running the matrix multiplication function in CUDA's BLAS library on a GeForce 8800 GTX (theoretical 368 peak GFLOPS) we measured a peak execution of around 40 GFLOPS. To begin to obtain better performance, researchers would hand-tune their code to match the characteristics of the underlying GPU hardware. Although tuning may be required in some cases to achieve high performance (such as aligning data for efficient reads in memory-bound applications), these types of optimizations require hardware-specific knowledge and make the code less portable between generations of devices. Similarly, previous work has shown that the optimal data layout varies across different CUDA software releases [20].
Instead of focusing on fine tuning applications for hardware, our approach is to utilize multiple GPUs to exploit even larger degrees of parallelism. To show that execution on multiple GPUs is beneficial, we introduce models for the various components of GPU execution and provide a methodology for predicting execution of GPU applications. Our methodology is designed to accurately predict the execution time of a given CUDA application (based on a single-GPU implementation), while varying the number of GPUs, their configuration, and the data set size of the application. Using this technique, we can generate accurate performance estimates that allow users to determine the system and GPU configuration that best fits their cost-performance needs.
Execution on parallel GPUs is promising because applications that are best suited to run on GPUs inherently have large amounts of parallelism. Utilizing multiple GPUs avoids dealing with many of the limitations of GPUs (e.g., on-chip memory resources) by exploiting the combined resources of multiple boards. Of course, inter-GPU communication becomes a new problem that we need to address, and involves considering the efficiency of the current communication fabric provided on GPUs. Our resulting framework is both effective and accurate in capturing the dynamics present as we move from a single GPU to multiple GPUs. With our work, GPU developers can determine potential speedups gained from execution of their applications on any number of GPUs without having to purchase expensive hardware or even write code to simulate a parallel implementation.
The remainder of this paper is organized as follows. Section 2 presents previous GPGPU performance studies, including some works that utilize multiple GPUs. Section 3 is a brief overview of the CUDA programming model. Since the factors affecting multiple-GPU execution have not been discussed in previous work, in Section 4 we provide background on the multiple-GPU design space, and also describe our methodology. Section 5 discusses the applications used to illustrate the power of our methodology. Results are presented in Section 6. Finally, Section 7 concludes the paper.
2. Related Work

Using GPUs for general purpose scientific computing has allowed a range of challenging problems to be solved faster and has enabled researchers to study larger (e.g., finer-grained) data sets [3], [7], [16], [18], [23]. In [21], Ryoo et al. provide a nice overview of the GeForce 8800 GTX architecture, and also present strategies for optimizing performance. We build upon some of the ideas presented in these prior papers by exploiting additional degrees of available parallelism with the mapping of applications to multiple GPUs. Our work shows that there can be great benefits when pursuing this path.

2.1. Parallel GPUs

Even though most prior work has focused on algorithm mapping techniques specific to performance on a single GPU, there are a number of efforts studying how to exploit larger numbers of GPUs to accelerate specific problems.
The Visualization Lab at Stony Brook University has a 66-node cluster that contains GeForce FX5800 Ultra and FX4500 graphics cards that are used for both visualization and computation. Parallel algorithms implemented on the cluster include medical reconstruction, particle flow, and dispersion algorithms [6], [17]. This prior work targets effective usage of networked groups of distributed systems, each containing a single GPU.
Moerschell and Owens describe the implementation of a distributed shared memory system to simplify execution on multiple distributed GPUs [11]. In their work, they formulate a memory consistency model to handle inter-GPU memory requests. By their own admission, the shortcoming of their approach is that memory requests have a large impact on performance, and any abstraction where the programmer does not know where data is stored (i.e., on or off GPU) is impractical for current GPGPU implementations. Still, as GPUs begin to incorporate Scalable Link Interface (SLI) technology for boards connected to the same system, this technique may prove promising.
In a related paper, Fan et al. explore how to utilize distributed GPU memories using object oriented libraries [5]. For one particular application, they were able to decrease the code size for a Lattice-Boltzmann model from 2800 lines to 100 lines, while maintaining identical performance.
Caravela [24] is a stream-based computing model that incorporates GPUs into GRID computing and uses them to replace CPUs as computation devices. While the GRID is not an ideal environment for coordinated GPGPU execution (due to the communication overheads we will discuss in Section 4), this model may have a niche for long-executing applications with large memory requirements, as well as for researchers who wish to run many batch GPU-based jobs.
Despite the growing interest in exploiting multiple GPUs, and previous work that has demonstrated that performance gains are possible for specific applications, an in-depth analysis of the factors affecting multiple-GPU execution is lacking. The main contribution of our work is the creation of models representing the significant aspects of parallel-GPU execution (especially communication). These models can be used to determine the best system configuration to deliver the required amount of performance for any application, while taking into account factors such as hardware specifications and input data size. Our work should accelerate the move to utilizing multiple GPUs in GPGPU computing.
3. Programming with CUDA

Next, we provide a brief overview of the CUDA programming model as relevant to this work. We also discuss issues with CUDA related to distributed and shared-system GPU computing. Those seeking more details on CUDA programming should refer to the CUDA tutorials provided by Luebke et al. [9]. To facilitate a discussion of the issues involved in utilizing multiple GPUs, we begin by formalizing some of the terminology in this work:
• Distributed GPUs - A networked group of distributed systems, each containing a single GPU.
• Shared-system GPUs - A single system containing multiple GPUs that communicate through a shared CPU RAM (such as the S870 server [13]).
• GPU-Parallel Execution - Execution that takes place across multiple GPUs in parallel (as opposed to parallel execution on a single GPU). This term is inclusive of both distributed and shared-system GPU execution.

3.1. The CUDA Programming Model

CUDA terminology refers to a GPU as the device, and a CPU as the host. These terms are used in the same manner for the remainder of this paper. Next, we summarize CUDA's threading and memory models.

3.1.1. Highly Threaded Environment. CUDA supports a large number of active threads and uses single-cycle context switches to hide datapath and memory-access latencies. When running on NVIDIA's G80 Series GPUs, threads are managed across 16 multiprocessors, each consisting of 8 single-instruction-multiple-data (SIMD) cores. CUDA's method of managing execution is to divide groups of threads into blocks, where a single block is active on a multiprocessor at a time. All of the blocks combine to make up a grid. Threads can determine their location within a block and their block's location within the grid from intrinsic data elements initialized by CUDA. Threads within a block can synchronize with each other using a barrier function provided by CUDA, but it is not possible for threads in different blocks to directly communicate or synchronize. Applications that map well to this model have the potential for success with utilizing multiple GPUs because of their high degree of data-level parallelism.
The CUDA 1.1 architecture does support a number of atomic operations, but these operations are only available on a subset of GPUs (atomic operations are not available on the GeForce 8800 GTX and Ultra GPUs used in this work), and frequent use of atomic operations limits the parallelism afforded by a GPU. These atomic operations are the only mechanism for synchronization between threads in different blocks.

3.1.2. Memory Model. Main memory on the G80 Series GPUs is a large RAM (384MB-768MB) that is accessible from every multiprocessor. This memory is referred to as device memory or global memory. Additionally, each multiprocessor contains 16KB of cache that is shared between all threads in a block. This cache is referred to as shared memory or shared cache. Unlike most CPU memory models, there is no mechanism for automated caching between GPU RAM and shared memory.
GPUs cannot directly access CPU RAM during execution. Instead, data is explicitly transferred between GPU and CPU RAM prior to and following GPU execution. Since manual memory management is required for GPU execution (there is no paging mechanism), and because GPUs cannot exchange data with the CPU during execution, programmers need to modify and potentially segment their applications such that all relevant data is located in the GPU when needed. Data sets that are too large to fit in a single GPU require multiple transfers between CPU and GPU memories, and this can introduce stalls in execution.
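To make the threading and memory models above concrete, the following minimal CUDA sketch (our own illustration, not code from the applications studied in this paper) launches a grid of blocks over an array and shows the explicit host-to-device and device-to-host transfers that CUDA requires. The kernel name and sizes are arbitrary placeholders.

#include <cuda_runtime.h>
#include <stdio.h>

// Each thread derives its global index from the intrinsic block and
// thread IDs initialized by CUDA and processes one element.
__global__ void scale_kernel(float *data, float alpha, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] *= alpha;
}

int main(void)
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_data = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) h_data[i] = 1.0f;

    // Device (global) memory must be allocated and filled explicitly;
    // the GPU cannot page data in from CPU RAM during execution.
    float *d_data;
    cudaMalloc((void **)&d_data, bytes);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    // Threads are grouped into blocks, and all blocks make up the grid.
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale_kernel<<<blocksPerGrid, threadsPerBlock>>>(d_data, 2.0f, n);

    // Results must be copied back explicitly after the kernel completes.
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    printf("h_data[0] = %f\n", h_data[0]);

    cudaFree(d_data);
    free(h_data);
    return 0;
}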
3.2. GPU-Parallel Execution

3.2.1. Shared-System GPUs. In the CUDA environment, GPUs cannot yet interact with each other directly, but it is likely that SLI will soon be supported for inter-GPU communication on devices connected to the same system. Until then, shared-system GPU execution requires that different CPU threads invoke execution on each GPU. The rules for interaction between CPU threads and CUDA-supported GPUs are as follows:
1) A CPU thread can only execute programs on a single GPU (working with two GPUs requires two CPU threads, etc.).
2) Any CUDA resources created by one CPU thread cannot be accessed by another thread.
3) Multiple CPU threads can invoke execution on a single GPU, but may not be run simultaneously.
These rules help to ensure isolation between different GPU applications. (A minimal sketch of this one-thread-per-GPU structure is shown at the end of this section.)

3.2.2. Distributed GPUs. Distributed execution does not face the same program restructuring issues as found in shared-system GPU execution. In a distributed application, if each system contains only a single GPU, the threading rules described in the previous section do not apply, since each CPU thread interacts with the GPU in the same manner as a single-GPU application.
Just as in traditional parallel computing, distributed GPU processing scales better than processing on shared systems because it will not overwhelm shared system resources. However, unlike the forthcoming SLI support for multiple-GPU systems, distributed execution will continue to require programmers to utilize a communication middleware such as MPI.

3.2.3. GPU-Parallel Algorithms. An obvious disadvantage of a GPU being located across the PCI-e bus is that it does not have direct access to the CPU memory bus, nor does it have the ability to swap data to disk. Because of these limitations, when an application's data set is too large to fit entirely into GPU RAM, the algorithm needs to be modified so that the data can be exchanged with CPU RAM. The modifications required to split an algorithm's data set essentially create a GPU-parallel version of the algorithm already, and so the transition to multiple GPUs is natural and only involves coding the appropriate shared memory or network-based communication mechanism.
As with traditional parallel computing, using multiple GPUs provides additional resources, potentially requiring fewer GPU calls and allowing for the simplification of algorithms. However, compounding data transfers and execution breaks with traditional parallel computing communication costs may cancel out any benefits reaped from parallel execution. These issues (and others related to parallel GPU communication) are discussed in detail in Section 4.
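The rules above mean that a shared-system implementation dedicates one host thread to each device. The sketch below is our own illustration of that structure (using POSIX threads and the CUDA runtime; the kernel and the work division are placeholders, not the authors' code): each thread selects its own device with cudaSetDevice, allocates its own resources, and launches work independently.

#include <cuda_runtime.h>
#include <pthread.h>
#include <stdio.h>

#define NUM_GPUS 2

__global__ void work_kernel(float *chunk, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        chunk[idx] += 1.0f;   // placeholder computation
}

// One CPU thread drives one GPU (rule 1); the device memory it allocates
// belongs to this thread's CUDA context and is invisible to the other
// threads (rule 2).
void *gpu_worker(void *arg)
{
    int device = *(int *)arg;
    int n = 1 << 20;

    cudaSetDevice(device);

    float *d_chunk;
    cudaMalloc((void **)&d_chunk, n * sizeof(float));
    cudaMemset(d_chunk, 0, n * sizeof(float));

    work_kernel<<<(n + 255) / 256, 256>>>(d_chunk, n);
    cudaThreadSynchronize();   // CUDA 1.x-era synchronization call

    cudaFree(d_chunk);
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_GPUS];
    int ids[NUM_GPUS];

    for (int i = 0; i < NUM_GPUS; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, gpu_worker, &ids[i]);
    }
    for (int i = 0; i < NUM_GPUS; i++)
        pthread_join(threads[i], NULL);

    printf("All GPU worker threads finished.\n");
    return 0;
}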

4. Methodology for Predicting Multiple-GPU Performance

Traditional parallel computing obtains speedup by dividing up program execution across multiple processors while attempting to limit the overhead associated with communication. Distributed systems are limited by network throughput, but have the advantage that they otherwise scale easily. Shared memory systems have a much lower communication penalty, but do not scale as well because of the finite system resources that must be shared (RAM, buses, etc.).
The traditional parallel computing model can be adapted to GPU computing as expressed in Equation 1:

t_{total} = t_{cpu} + t_{cpu\_comm} + t_{gpu} + t_{gpu\_comm}   (1)

t_{cpu\_comm} =
  \begin{cases}
    t_{memcpy}  & \text{for shared systems} \\
    t_{network} & \text{for distributed systems}
  \end{cases}   (2)

In Equation 1, t_cpu and t_cpu_comm represent the factors of a traditional parallel computer: t_cpu is the amount of time spent executing on a CPU, and t_cpu_comm is the inter-CPU communication requirement. Both t_cpu and t_cpu_comm should only be considered for time that does not overlap with GPU execution or communication. Since one of our requirements is that deterministic applications must be used (Section 4.1), it should be trivial to factor out the overlap. Equation 2 acknowledges that the time for CPU communication varies based on the GPU configuration. Since GPUs can theoretically be managed as CPU co-processors, we can employ a traditional parallel computing communication model. In Equation 2, t_memcpy is the time spent transferring data within RAM for shared memory systems, and t_network is the time spent transferring data across a network for distributed systems.
In addition to the typical overhead costs associated with parallel computing, we now add t_gpu and t_gpu_comm, where t_gpu represents the execution time on the GPU and is discussed in Section 4.1, and t_gpu_comm represents additional communication overhead and is discussed in Sections 4.2 and 4.3.
Using these factors, we provide a methodology that can be used to extrapolate actual execution time across multiple GPUs and data sets. The ultimate goal is to allow developers to determine the benefits of multiple-GPU execution without needing to purchase hardware, using only the single-GPU application.
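As a concrete illustration of how Equations 1 and 2 are evaluated, the following host-side C sketch (our own illustration; the structure fields and example numbers are placeholders, not part of the authors' framework) assembles a predicted total time from the four components.

#include <stdio.h>

/* Inputs to the model, in seconds except where noted. */
typedef struct {
    double t_cpu;           /* non-overlapped CPU execution time        */
    double t_gpu;           /* predicted GPU execution time (Sec. 4.1)  */
    double t_gpu_comm;      /* PCI-e + disk overhead (Secs. 4.2, 4.3)   */
    double bytes_exchanged; /* data moved between CPU processes/threads */
    double memcpy_bw;       /* RAM-to-RAM bandwidth, bytes/s            */
    double network_bw;      /* network bandwidth, bytes/s               */
    int    shared_system;   /* 1 = shared-system GPUs, 0 = distributed  */
} model_inputs;

/* Equation 2: CPU-side communication cost depends on the configuration. */
static double t_cpu_comm(const model_inputs *m)
{
    double bw = m->shared_system ? m->memcpy_bw : m->network_bw;
    return m->bytes_exchanged / bw;
}

/* Equation 1: total predicted execution time. */
static double t_total(const model_inputs *m)
{
    return m->t_cpu + t_cpu_comm(m) + m->t_gpu + m->t_gpu_comm;
}

int main(void)
{
    /* Example values only. */
    model_inputs m = { 0.05, 0.80, 0.20, 64e6, 3.0e9, 125e6, 0 };
    printf("predicted t_total = %.3f s\n", t_total(&m));
    return 0;
}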
4.1. Modeling GPU Execution

Our methodology requires that a CUDA program exists that executes on a single GPU. This application is used as the basis for extrapolating the amount of time that multiple GPUs will spend on computation. In order to extrapolate GPU execution accurately, we introduce the requirement that the application running on the GPU must be deterministic. However, this requirement does not limit us severely since most applications that will benefit from GPUs are already highly parallel and possess a stable execution profile. Still, applications such as those that model particle interaction may require reworking if exchanging information with neighbors is non-deterministic (thus requiring inter-GPU communication). Lastly, since the execution time will change based on the GPU hardware, we assume in this paper that the multiple-GPU application will run on GPUs all of the same model (we allow for heterogeneous GPU modeling in our future work).
Using our approach, we first need to determine how GPU execution scales on M GPUs. The two metrics that we use to predict application scalability as a function of the number of GPUs are per-element averages and per-subset averages. Elements refer to the smallest unit of computation involved with the problem being considered, as measured on an element-by-element basis. Subsets refer to working with multiple elements, and are specific to the granularity and dimensions of the data sets involved in the application being parallelized.
To calculate the per-element average, we determine the time it takes to compute a single element of a problem by dividing the total execution time of the reference problem (t_ref_gpu) by the number of elements (N_elements) that are calculated. This is the average execution time of a single element and is shown in Equation 3. The total execution time across M GPUs can then be represented by Equation 4. As long as a large number of elements are present, this has proven to be a highly accurate method. However, the programmer should still maintain certain basic CUDA performance practices, such as ensuring that warps remain as filled as possible when dividing the application between processors to avoid performance degradation. Also, when finding the reference execution time, the PCI-Express transfer time should be factored out.

t_{element} = \frac{t_{ref\_gpu}}{N_{elements}}   (3)

t_{gpu} = t_{element} \cdot \left\lceil \frac{N_{elements}}{M_{gpus}} \right\rceil   (4)

An alternative to using per-element averages is to work at a coarser granularity. Applications sometimes lend themselves to splitting data into larger subsets (e.g., 2D slices of a 3D matrix). Using the reference GPU implementation, the execution time of a single subset (t_subset) is known, and the subsets are divided between the multiple GPUs. We assume that t_subset can be obtained empirically, because the execution is likely long enough to obtain an accurate reference time (as opposed to per-element execution times, which might suffer from imprecision because they are so short). Equation 5 is then the execution time of N subsets across M GPUs.

t_{gpu} = t_{subset} \cdot \left\lceil \frac{N_{subsets}}{M_{gpus}} \right\rceil   (5)

In either case, if the number of execution units cannot be divided evenly by the number of processing units, the GPU execution time is based on the longest running execution.
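The scaling rules in Equations 3-5 amount to a ceiling division of work across devices. A small C sketch of the per-element form follows (our own illustration; the function and variable names are ours).

#include <stdio.h>

/* Equation 3: average time to compute one element, taken from the
 * reference single-GPU run with PCI-e transfer time factored out.   */
static double t_element(double t_ref_gpu, long n_elements)
{
    return t_ref_gpu / (double)n_elements;
}

/* Equation 4: predicted GPU execution time on m_gpus devices. The
 * ceiling captures the fact that the most heavily loaded GPU
 * determines the overall execution time.                             */
static double t_gpu_per_element(double t_ref_gpu, long n_elements, int m_gpus)
{
    long per_gpu = (n_elements + m_gpus - 1) / m_gpus;  /* ceiling division */
    return t_element(t_ref_gpu, n_elements) * (double)per_gpu;
}

int main(void)
{
    /* Example: a 2.0 s reference run over 1M elements, split across 4 GPUs. */
    printf("t_gpu on 4 GPUs: %.3f s\n",
           t_gpu_per_element(2.0, 1000000L, 4));
    return 0;
}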
4.2. Modeling PCI-Express

In this section, we discuss the impact of shared-system GPUs on the PCI-e bus.

4.2.1. Pinned Memory. The CUDA driver supports allocation of memory that is pinned to RAM (i.e., non-pageable). Pinned memory increases the device bandwidth and helps reduce data transfer overhead, because transfers can occur without having to first move the data to known locations within RAM. However, because interaction with the CUDA driver is required, each request for pinned allocation (t_pinned_alloc in Equation 7) is much more expensive than the traditional method of requesting pageable memory from the kernel. Measured pinned requests take 0.1s on average (CUDA version 1.0).
In multiple-GPU systems, or in general when the data set size approaches the capacity of RAM, the programmer must be careful that the amount of pinned memory allocated does not exceed what is available, or else system performance will degrade significantly. The use of pinned memory also makes code less portable, because systems with less RAM will suffer when applications allocate large amounts of pinned data. As expected, our tests show that creating pinned buffers to serve as staging areas for GPU data transfers is not a good choice, because copying data from pageable to pinned RAM is the exact operation that the CUDA driver performs before normal, pageable transfers to the GPU occur. As such, allocating pinned memory for use with the GPU should only be done when the entire data set can fit in RAM.
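The pinned-versus-pageable trade-off described above can be observed directly with the CUDA runtime. The sketch below is our own measurement harness for illustration (not the one used in this paper): it times a host-to-device copy from a pageable buffer and from a pinned buffer allocated with cudaMallocHost. Buffer size is an arbitrary placeholder.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Time one host-to-device copy in milliseconds using CUDA events.
static float copy_time_ms(void *src, float *d_dst, size_t bytes)
{
    cudaEvent_t start, stop;
    float ms = 0.0f;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    cudaMemcpy(d_dst, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main(void)
{
    size_t bytes = 256 << 20;   /* 256 MB test buffer */
    float *d_buf, *pageable, *pinned;

    cudaMalloc((void **)&d_buf, bytes);
    pageable = (float *)malloc(bytes);
    cudaMallocHost((void **)&pinned, bytes);  /* pinned (non-pageable) host memory */

    printf("pageable: %.1f ms\n", copy_time_ms(pageable, d_buf, bytes));
    printf("pinned:   %.1f ms\n", copy_time_ms(pinned, d_buf, bytes));

    cudaFreeHost(pinned);
    free(pageable);
    cudaFree(d_buf);
    return 0;
}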

4.2.2. Data Transfers. In order to increase our design space, and because of the impact of multiple GPUs, our shared systems are equipped with GeForce 8800 GTX Ultras. The Ultra GPUs are clocked faster than the standard 8800 GTX GPUs, which gives them the ability to transfer and receive data at a faster rate and can potentially help alleviate some of the PCI-e bottlenecks.
The transfer rates from both pinned and pageable memory in CPU RAM to a GeForce 8800 GTX and a GeForce 8800 GTX Ultra across a 16x PCI-e bus are shown in Table 1. Pinned memory is faster because the CUDA driver knows the data's location in CPU RAM and does not have to locate it, nor potentially swap it in from disk, nor copy it to a non-pageable buffer before transferring it to the GPU.

Device      GPUs  Memory Type  Throughput
8800 GTX    1     pageable     1350MB/s
8800 GTX    1     pinned       1390MB/s
8800 Ultra  1     pageable     1638MB/s
8800 Ultra  2     pageable     695MB/s
8800 Ultra  4     pageable     *347MB/s
8800 Ultra  1     pinned       3182MB/s
8800 Ultra  2     pinned       1389MB/s
8800 Ultra  4     pinned       *695MB/s

Table 1. Transfer throughput between CPU and GPU. *The throughput of 4 shared-system GPUs is an upper bound.

As expected, as more GPUs are connected to the same shared PCI-e bus, the increased pressure impacts transfer latencies. Table 1 shows measured transfer rates for one and two GPUs, and extrapolates the per-GPU throughput to four Ultra GPUs in a shared-bus scenario.
This communication overhead must be carefully considered, especially for algorithms with large data sets that execute quickly on the GPU. Although transfer rates vary based on direction, they are similar enough that we use the CPU-to-GPU rate as a reasonable estimate for transfers in both directions.
Figure 1 shows how GPUs that are connected to the same system share the PCI-e bus. The bus switch allows either of the two GPUs to utilize all 16 PCI-e channels, or the switch can divide the channels between the two GPUs. Regardless of the algorithm used for switching, one or both GPUs will incur delays before they receive all of their data. As such, delays will occur before execution begins on the GPU (CUDA requires that all data is received before execution begins).

Figure 1. PCI-Express configuration for a two-GPU system.

4.3. Modeling RAM and Disk

4.3.1. Determining Disk Latency. The data that is transferred to the GPU must first be present in system RAM. Applications with working sets larger than RAM will therefore incur extra delays if data has been paged and must be retrieved from disk prior to transfer. Equation 6 shows that the time to transfer data from disk to memory varies based on the relationship between the size of RAM (B_RAM), the size of the input data (x), and the amount of data being transferred to the GPU (B_transfer). Here, T_disk is the disk transfer rate, which we measure empirically in Section 4.3.2.

t_{disk'} =
  \begin{cases}
    0                        & x < B_{RAM} \quad (a) \\
    (x - B_{RAM}) / T_{disk} & B_{RAM} < x < B_{RAM} + B_{transfer} \quad (b) \\
    B_{transfer} / T_{disk}  & B_{RAM} + B_{transfer} < x \quad (c)
  \end{cases}   (6)

Model for LRU paging, where B is bytes of data and x is the total amount of data allocated.
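A direct transcription of Equation 6 into host code (our own sketch; the doubling for paging old data out, described in the list that follows, is left to the caller):

#include <stdio.h>

/* Equation 6: one-way disk-to-RAM transfer time t_disk' for a single
 * GPU transfer, assuming LRU paging of streaming data.
 *   x          - total bytes allocated by the application
 *   b_ram      - bytes of RAM available to the application
 *   b_transfer - bytes moved to the GPU in this transfer
 *   t_disk_bw  - disk throughput in bytes/s (26.2 MB/s in this paper)  */
static double t_disk_prime(double x, double b_ram,
                           double b_transfer, double t_disk_bw)
{
    if (x < b_ram)                  /* (a) everything fits in RAM        */
        return 0.0;
    if (x < b_ram + b_transfer)     /* (b) part of the data is on disk   */
        return (x - b_ram) / t_disk_bw;
    return b_transfer / t_disk_bw;  /* (c) all input data paged from disk */
}

int main(void)
{
    double mb = 1024.0 * 1024.0;
    /* Example: 5 GB allocated on a machine with ~3.8 GB of usable RAM,
     * moving 720 MB to the GPU.                                         */
    printf("t_disk' = %.1f s\n",
           t_disk_prime(5120 * mb, 3891 * mb, 720 * mb, 26.2 * mb));
    return 0;
}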

Equation 6(a) shows that when data fits in RAM, no paging is necessary. In Equation 6(b), a fraction of the data resides on disk and must be paged in, and in Equation 6(c) all of the data from disk must be transferred in. These equations are used to represent one-way transfers between disk and RAM (t_disk'), which may occur multiple times during a single GPU execution. We assume that our GPU applications process streaming data, which implies that the data being accessed is always the least recently used (LRU). Even if this is not the case, our model still provides a good upper bound on transfers.
The following list details the model used in this work to accurately predict disk access times:
1) Whenever a data set is larger than RAM, all transfers to the GPU require input data that is not present in RAM. This requires both paging of old data out to disk, and paging desired data in from disk—which equates to twice t_disk' as accounted for in Equation 6.
2) Copying output data from the GPU to the CPU does not require paging any data to disk. This is because the input data in CPU main memory is unmodified and the OS can invalidate it in RAM without consequence (a copy will still exist in swap on the disk). Therefore there is no additional disk access time required when transferring data back to RAM from the GPU. This holds true as long as the size of the output data is less than or equal to the size of the input data.
3) If a GPU algorithm runs multiple iterations, a given call may require a combination of input data and output data from previous iterations. In this case, both input and output data would have to be paged in from disk. Prior input data living in RAM could be discarded as in (2), but prior output data will need to be paged to disk.
These three assumptions, while straightforward, generally prove to accurately estimate the impact of disk paging in conjunction with the Linux memory manager, even though they ignore any bookkeeping overhead. The combination of t_disk' terms makes up t_disk, which is used in Equation 7 and is algorithm-specific.

4.3.2. Empirical Disk Throughput. In order to predict execution time, disk throughput is measured for a given system. To determine throughput (and to verify Equation 6), we ran a set of tests that vary the amount of data allocated, while transferring a fixed amount of data from disk to GPU. Figure 2 presents the time to transfer 720MB of data from disk to a single GeForce 8800 GTX GPU. It shows that once the RAM limit is reached (∼3.8GB on our 4GB system, with about 200MB reserved for the kernel footprint), the transfer time increases until all data resides on disk. In this scenario, space must be cleared in RAM by paging out the same amount of data that needs to be paged in, so the throughput in the figure is really only half of the actual speed. Based on the results, 26.2MB/s is assumed to be the data transfer rate for our disk subsystem, and is used in calculations to estimate execution time in this work. Since we assume that memory accesses always need LRU data, we can model a system with N GPUs by dividing the disk bandwidth by N to obtain the disk bandwidth for each GPU transfer.

Figure 2. Time to transfer 720MB of paged data to a GeForce 8800 GTX GPU, based on the total data allocation on a system with 4GB of RAM.

t_{gpu\_comm} =
  \begin{cases}
    t_{pinned\_alloc} + t_{pci\text{-}e} & \text{pinned memory} \\
    t_{disk} + t_{pci\text{-}e}          & \text{pageable memory}
  \end{cases}   (7)

System Specific Inputs   Algorithm Specific Inputs    Variables          Output
Disk Throughput          Communication Requirements   Number of GPUs     Execution Times
Network Bandwidth        Reference Implementation     Data Set Sizes
PCI-e (GPU) Bandwidth                                 GPU Configuration
RAM Size

Table 2. Design space and requirements for predicting execution.

To summarize, in this section we discussed how to estimate GPU computation time based on a reference implementation from a single GPU, and also introduced techniques for

determining PCI-e and disk throughput. These factors are threads is required. Each thread computes a series of small combined in Equation 7—in which tpinned alloc is the time matrix multiplications and a 3x3 matrix inversion. required for memory allocation by the CUDA driver, tpci−e Image Reconstruction: This application is a medical is the time to transfer data across the PCI-e bus, and tdisk is imaging algorithm that uses tomography to reconstruct a the time to page in data from disk. These costs, represented three-dimensional volume from multiple two-dimensional as tgpu comm combine to make up the GPU communication X-ray views [18]. The X-ray views and volume slices requirements as presented in Equation 1. are independent and are divided between the GPUs. The It should be noted that we do not need to represent RAM algorithm requires a fixed number of iterations, between in the communication model, because it is already taken into which large amounts of data must be swapped between GPU account in the empirical throughputs of both PCI-e and disk and CPU. When using multiple GPUs, each must receive transfers. updated values from all other GPUs between iterations. The models that we have presented in this section provide : This application is a modified version of the all the information necessary to predict execution. Table 2 Ray Tracing program created by Rollins [19]. In his single summarizes the inputs described in this section, as well as GPU implementation, each pixel value is computed by a the factors that can be varied in order to obtain a complete independent thread that contributes to a global state that is picture of performance across the application design space. written to a frame buffer. In the GPU-parallel version, each pixel is still independent, but after each iteration the state 5. Applications and Environment of the objects on the screen must be synchronized before a frame buffer update can be performed. 2D FFT: For the two-dimensional FFT, we divided the Next, we discuss the six scientific applications used to input data into blocks whose size is based on the number evaluate our framework. For each application we predict the of available GPUs. The algorithm uses CUDA’s CUFFT execution time while varying the number and configuration libraries for the two FFTs, performs a local transpose of each of GPUs. We also predict the execution time while varying block, and requires intermediate communication. Since the the input data set sizes—all of which is done without transpose is done on the CPU, the execution time for each requiring a GPU-parallel implementation of the algorithm. transpose is obtained empirically for each test. We then compare the results to actual execution of GPU- Matrix Multiplication: Our matrix multiplication algo- parallel implementations on multiple GPUs to verify the rithm divides the input and output matrices into blocks, accuracy of our framework. where each thread is responsible for a single value in the output matrix. For the distributed implementation, Fox’s 5.1. Applications algorithm is used for the communication scheme [4]. For the matrix multiplication and 2D FFT, the choice of using Convolution: A 7x7 convolution kernel is applied to an a distributed communication scheme was arbitrary, as we image of variable size. All images used were specifically are trying to show that we can predict performance for any designed to be too large to fit entirely in a single 8800 GTX algorithm, and may not be the fastest choice. 
GPU memory, and are therefore divided into segments with overlapping pixels at the boundaries. Each pixel in the image 5.2. Hardware Setup is independent, and no communication is required between threads. For the GPU-parallel implementations discussed in Sec- Least-Squares Pseudo-Inverse: The least-squares tion 6, two different configurations are used. For experiments pseudo-inverse application is based on a medical where a cluster of nodes is used as a distributed system, each visualization algorithm where point-paths are compared system had a 1.86GHz Intel Core2 processor with 4GB of to a reference path. Each point-to-reference comparison RAM, along with a GeForce 8800 GTX GPU with 768MB is calculated by a thread, and no communication between of on-board memory connected via 16x PCI-e bus. The # ------Distributed Ray Tracing ------0 P = 16 # Number of GPUs 1 XDIM = 1024 # Original image width (pixels) 2 YDIM = 768 # Original image height (pixels) 3 SCALE = 4 # Scale the image to 4X the original size 4 GLOBAL_STATE_SIZE = 13548 # In Bytes 5 6 for i = 1 to SCALE # Loop over scale of image 7 for j = 1 to P # Loop over number of GPUs 8 ROWS = (i * XDIM / j) # Distribute the rows 9 COLS = i * YDIM 10 DATA_PER_GPU = ROWS * COLS * PIXEL_SIZE 11 12 GPU_TIME = (i2 * 0.314 / j) 13 CPU_TIME = 0 14 NETWORK_TIME = step(j - 1) * (DATA_PER_GPU * (j - 1) / NET_BAND + ... 15 GLOBAL_STATE_SIZE * (j - 1) / NET_BAND) 16 DISK_TIME = 0 17 PCIE_TIME = (DATA_PER_GPU / PCIE_BAND) + (GLOBAL_STATE_SIZE / PCIE_BAND) 18 19 FPS[j,i] = 1 / (GPU_TIME + CPU_TIME + NETWORK_TIME + DISK_TIME + PCIE_TIME) Figure 3. Predicting performance for distributed ray tracing. Results are shown in Figure 5(a). system used in multithreaded experiments has a 2.4GHz tracing for up to 16 GPUs (line 0). An initial frame of Intel Core2 processor with 4GB of RAM. This system is 1024x768 pixels is used, and scales up to 4 times the original equipped with a GeForce 8800 GTX Ultra with 768GB of size in each direction (lines 1-3). The single-GPU execution on-board RAM. All systems are running Fedora 7 and use time is determined from an actual reference implementation a separate graphics card to run the system display. (0.314 seconds of computation per frame), and since each pixel is independent, line 12 both accounts for scaling the 6. Performance Prediction and Verification frame size and computing the seconds of execution per GPU. The rows of pixels are divided evenly between the Next, we compare estimated and actual execution times P processors (line 8), which means that each GPU is only 1 for multiple GPUs for our six applications. responsible for P of the pixels in each frame. The size of the global state (line 4) is based on the number of objects 6.1. Predicting Execution Time we used in the ray tracing program, and is just over 13KB in our experiments. Since the data size is small enough To predict execution time, we begin with a single-GPU to fit entirely in RAM, no disk paging is required (line implementation of an algorithm, and create a high-level 16). Similarly, with no significant execution on the CPU, specification for a GPU-parallel version (which also in- the CPU execution time can be disregarded (line 13). Line cludes the communication scheme). 
Using the equations 14 is the time required to collect the output of each GPU and methodology described in Section 4, we can compute (which is displayed on the screen) and the global state that the GPU and CPU execution costs, the PCI-e transfer cost, must be transferred back across the network is accounted the disk paging cost, and the network communication cost. for in line 15. In addition to the network transfer, each Using these calculations we are able to predict the execution GPU must also transfer its part of the frame across the PCI- time for variable data set sizes and numbers of GPUs. e bus to CPU RAM, and then receive the updated global DISK_BAND PCIE_BAND NET_BAND While our methodology identifies methods for accurately state (line 17). , , and accounting for individual hardware element latencies, it is up are all empirical measurements that are constants on our to the developer to accurately account for algorithm specific test systems. The result is the predicted number of frames factors. For example, if data is broadcast using MPI, the per second (line 19) that this algorithm should be able to compute using 1 to 16 GPUs and for 4 different data sizes. communication is on the order of dlog2Ne. Factors such as this are algorithm specific and must be considered. Later in this section we discuss and plot the results for Figure 3 contains pseudo-code representing the equations distributed ray tracing (Figure 5(a)). we used to plot the predicted execution of distributed ray Figures 4 through 7 were all generated using the same (a) Convolution with Distributed GPUs. (b) Convolution with Shared-System GPUs

Figure 4. Convolution results plotted on a logarithmic scale. technique described here. These figures allow us to visualize data transfers. Data that is too large to fit in a single GPU the results from our predictions, and easily choose a system memory causes multiple transfers, incurring additional PCI- configuration that best matches the computational needs of e (and perhaps disk) overhead. a given application. Using multiple distributed GPUs when processing large data sets allows parallelization of memory transfers to and 6.2. Prediction Results from the GPU, and also requires fewer calls per GPU. This means that each GPU spends less time transferring data. To demonstrate the utility of our framework, we tested Shared-system GPUs have a common PCI-e bus, so the various configurations and data sets for each application us- benefits of parallelization are not as large for these types ing four distributed and two shared-system GPUs. Using our of algorithms because transfer overhead does not improve. methodology, the average difference between predicted and However, computation is still parallelized and therefore actual execution is 11%. Our largest prediction difference some speedup is seen as well. Figure 4 shows the predicted was 40% and occurred when the data set was just slightly and empirical results for the convolution application. larger than RAM. For this case, our models assumed that Figure 4(a) shows that when running on multiple systems, data must always be brought in from disk, when in reality the distributed GPUs prevent paging that is caused when a Linux memory manager implements a more efficient policy. single system runs out of RAM and must swap data in However, the piece-wise function presented in Equation 6 from disk before transferring it to the GPU. Alternatively, could easily be modified to accommodate this OS feature. Figure 4(b) shows that since multiple shared-system GPUs We feel that our framework provides guidance in the have a common RAM, adding more GPUs does not prevent design optimization space, even considering that some exe- paging to disk. cution times are incredibly short (errors may become more Applications that have completely independent elements significant), while other execution times (involving disk and do not require communication or synchronization can paging or large computations) are very long (errors may benefit greatly from GPU execution. However, Figure 4 have time to manifest themselves). shows that it is very important for application data sets to Next we present the results for all applications. While the fit entirely in RAM if we want to effectively exploit GPU main purpose of this work is to present our methodology resources (note the log scale). We want to ensure that when and verify our models, we also highlight some trends based these data sets grow, performance will scale. Distributed on communication and execution characteristics as well. GPUs are likely the best match for these types of algorithms if large data sets are involved. 6.2.1. Zero-Communication Applications. Both the least- squares and convolution applications require no communi- 6.2.2. Data Sync Each Iteration. The threads in the cation to take place during execution. Each pixel or point ray tracing and image reconstruction applications work on that is computed on the GPU is assigned to a thread and independent elements of a matrix across many consec- undergoes a series of operations that do not depend on any utive iterations of the algorithm. 
However, dissimilar to other threads. The operations on a single thread are usually zero-communication applications described previously, these fast, which means that large input matrices or images are threads must update the global state, and this update must be required to amortize the extra cost associated with GPU synchronized so that it is completed before the next iteration (a) Ray Tracing with Distributed GPUs (b) Ray Tracing with Shared-System GPUs

Figure 5. Results for Ray Tracing across four data sets.

data no longer fits in RAM, we would see similar results as shown in Figure 4, and the distributed GPUs would likely outperform the shared-system GPUs since network transfers are usually much faster than disk accesses. Also, observe that the step-like behavior illustrated by the distributed image reconstruction algorithm in Figure 6 is due to the dlog2Ne cost of MPI’s broadcasting. This means that we see an increase in the cost of communication at processor numbers that are one greater than a power of two. Execution time tends to decrease slightly as each processor is added before the next power of two, because of the availability of additional computational resources without any additional Figure 6. Results for Image Reconstruction for a single communication costs. data size. It should also be noted that these tests were performed on a 1Gb network. When we modeled execution on a 10Gb network, the results showed the distributed performance begins. was on par with shared-system performance for image For applications that possess this pattern, shared-system reconstruction, and that distributed ray tracing outperforms implementations have the advantage that as soon as data shared-system ray tracing due to the PCI-e contention as is transferred across the PCI-e bus back to main memory, more GPUs are added. The variation of performance due to the data is available to all other GPUs. In applications system parameters is yet another justification for using our such as ray tracing (Figure 5) and image reconstruction methodology. (Figure 6), where a large number of iterations need to be computed, the shared-system approach shows more potential 6.2.3. Multi-read Data. Applications such as the distributed for performance gains than distributed GPUs. 2D FFT and matrix multiplication involve transferring large For the shared-system execution of ray tracing, the ex- amounts of data between processors. The reason for this pected PCI-e bottleneck does not begin to occur until we get communication is because each data element is read multiple close to 8 GPUs even though the execution time is relatively times by different threads during execution. In distributed short, because the data simply needs to be placed in CPU environments, this almost always implies that the data will RAM to synchronize. On the other hand, the distributed be transferred between processors. Since these algorithms execution hits a network bottleneck much sooner because rely heavily on network speed, and because a single GPU each GPU must broadcast its updates to every other GPU. can compute as much as many CPUs, a fast network is Similar conclusions can be drawn from the results of the critical for these algorithms to be run effectively on multiple image reconstruction application. GPUs. Note that the data sets of both of these applications fit Since the block size of the matrix multiplication algorithm entirely in RAM, so disk paging is not a concern. However, varies with the number of GPUs, we used our single-GPU if these applications were scaled to the point where their implementation to measure the per-GPU execution time for impact of a wider range of workloads in our framework.

Acknowledgments

This work was supported in part by Gordon-CenSSIS, the Bernard M. Gordon Center for Subsurface Sensing and Imaging Systems, under the Engineering Research Cen- ters Program of the National Science Foundation (Award Number EEC-9986821). The GPUs used in this work were generously donated by NVIDIA. We would like to thank the following people for their Figure 7. Distributed Matrix Multiplication using Fox’s contributions to this work: Jennifer Mankin and Gene Coop- Algorithm. erman for comments and feedback that helped guide our direction, Micha Moffie and Diego Rivera for their nice implementation of the single-GPU image reconstruction each data-size-per-GPU combination. The communication application, and Burak Erem for his help with the least- time was predicted by modeling Fox’s algorithm. squares algorithm. In the distributed FFT algorithm, tCPU (from Equation 1) plays a large role due to the cost of computing matrix References transposes, which we chose to leave on the CPU. Since transposing algorithms is incredibly sensitive to hardware- [1] AMD. AMD FireStream 9170: Industry’s specific optimizations, and because it is easy to produce First GPU with Double-Precision Floating Point. tests for different data sizes on the CPU, tCPU was always http://ati.amd.com/products/streamprocessor/specs.html, obtained from actual measured data. 12 2007.

7. Conclusion

GPUs are already impacting scientific computing by utilizing massively parallel processing and providing the potential for orders of magnitude higher computation per second than multi-core CPUs. However, current research has focused on extracting the best performance from a single GPU platform by focusing on application-specific coding, which negatively affects the scalability and portability of algorithms. Specific applications that incorporate multiple GPUs have been presented in prior work, but an in-depth study that considers how best to construct multiple-GPU systems has not been pursued.
In this work we show that our modeling framework can produce accurate estimates when moving single-GPU applications to a multiple-GPU platform. Our approach develops a set of performance equations that capture many of the latencies and dependencies introduced in a multiple-GPU environment. Our methodology is generic enough to apply to any system or GPU model, as long as the application behaves in a deterministic manner. In the six applications we tested, we were able to estimate the execution time across multiple GPU applications with an average difference of 11% when compared to actual execution times. Our validation study included applications that have a wide range of execution durations. In future work we intend to refine our model and also consider the

[2] I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan. Brook for GPUs: Stream Computing on Graphics Hardware. In SIGGRAPH '04: ACM SIGGRAPH 2004 Papers, 2004.

[3] B. Deschizeaux and J. Blanc. Imaging Earth's Subsurface Using CUDA, GPU Gems 3, chapter 38. Addison-Wesley, 2007.

[4] M. Fagan. Fox Matrix Multiply Algorithm. http://www.caam.rice.edu/caam520/Topics/ParallelAlgorithms/LinearAlgebra/fox.html.

[5] Z. Fan, F. Qiu, and A. Kaufman. ZippyGPU: Programming Toolkit for General-Purpose Computation on GPU Clusters. In GPGPU Workshop at Supercomputing, 2006.

[6] Z. Fan, F. Qiu, A. Kaufman, and S. Yoakum-Stover. GPU Cluster for High Performance Computing. In Proceedings of the ACM/IEEE Supercomputing Conference, 2004.

[7] N. K. Govindaraju, S. Larsen, J. Gray, and D. Manocha. A memory model for scientific algorithms on graphics processors. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, 2006.

[8] D. Luebke. NVIDIA's CUDA: Democratizing Parallel Computing. In Workshop on General Purpose Processing on Graphics Processing Units, 2007.

[9] D. Luebke, M. Garland, J. Owens, and K. Skadron. GPGPU: ASPLOS 2008 CUDA Tutorial. http://www.gpgpu.org/asplos2008/.

[10] M. McCool, K. Wadleigh, B. Henderson, and H. Lin. Performance Evaluation of GPUs using the RapidMind development platform. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, 2006.

[25] J. Zhang, W. Meleis, D. Kaeli, and T. Wu. Acceleration of Maximum Likelihood Estimation for Tomosynthesis Mammography. In Proceedings of the 12th International Conference on Parallel and Distributed Computing, 2006.

[11] A. Moerschell and J. Owens. Distributed Texture Memory in a Multi-GPU Environment. In Graphics Hardware 2006, 2006.

[12] NVIDIA. NVIDIA CUDA Programming Guide v1.1, 11 2007.

[13] NVIDIA. NVIDIA Tesla C870 - GPU Computing Processor. http://www.nvidia.com/object/tesla_gpu_processor.html, 12 2007.

[14] J. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Kruger, A. Lefohn, and T. Purcell. A Survey of General-Purpose Computation on Graphics Hardware. Computer Graphics Forum, 26(1), 2007.

[15] M. Papakipos. The PeakStream Platform. In LACSI Workshop on Heterogeneous Computing, 2006.

[16] S. Popov, J. Gunther, H. Seidel, and P. Slusallek. Stackless KD-Tree Traversal for High Performance GPU Ray Tracing. Computer Graphics Forum, 26(3):415–424, September 2007.

[17] F. Qiu, Y. Zhao, Z. Fan, X. Wei, H. Lorenz, J. Wang, S. Yoakum-Stover, A. Kaufman, and K. Mueller. Dispersion Simulation and Visualization for Urban Security. In IEEE Visualization Proceedings, 2004.

[18] D. Rivera, D. Schaa, M. Moffie, and D. Kaeli. Exploring Novel Parallelization Techniques for 3-D Imaging Applications. Accepted to SBAC-PAD 2007: 19th International Symposium on Computer Architecture and High Performance Computing, 2007.

[19] E. Rollins. Real-Time Ray Tracing with NVIDIA CUDA GPGPU and Intel Quad-Core. http://eric_rollins.home.mindspring.com/ray/cuda.html.

[20] S. Ryoo, C. Rodrigues, S. Stone, S. Baghsorkhi, S.-Z. Ueng, and W. Hwu. Program Optimization Study on a 128-Core GPU. In Workshop on General Purpose Processing on Graphics Processing Units, 2007.

[21] S. Ryoo, C. I. Rodrigues, S. S. Baghsorkhi, S. S. Stone, D. B. Kirk, and W.-m. W. Hwu. Optimization Principles and Application Performance Evaluation of a Multithreaded GPU using CUDA. In Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2008.

[22] S. Sengupta, M. Harris, Y. Zhang, and J. Owens. Scan Primitives for GPU Computing. In Graphics Hardware 2007, 2007.

[23] S. S. Stone, H. Yi, W. W. Hwu, J. P. Haldar, B. P. Sutton, and Z.-P. Liang. How GPUs can improve the quality of magnetic resonance imaging. In In The First Workshop on General Purpose Processing on Graphics Processing Units, 2007.

[24] S. Yamagiwa and L. Sousa. Caravela: A Novel Stream-Based Distributed Computing Environment. In Computer, volume 40, pages 70–77, 2007.