IEEE Transactions on Parallel and Distributed Systems. DOI: 10.1109/TPDS.2015.2446479


MPI-ACC: Accelerator-Aware MPI for Scientific Applications

Ashwin M. Aji, Lokendra S. Panwar, Feng Ji, Karthik Murthy, Milind Chabbi, Pavan Balaji, Keith R. Bisset, James Dinan, Wu-chun Feng, John Mellor-Crummey, Xiaosong Ma, and Rajeev Thakur.

Abstract—Data movement in high-performance computing systems accelerated by graphics processing units (GPUs) remains a challenging problem. Data communication in popular parallel programming models, such as the Message Passing Interface (MPI), is currently limited to the data stored in the CPU memory space. Auxiliary memory systems, such as GPU memory, are not integrated into such data movement standards, thus providing applications with no direct mechanism to perform end-to-end data movement. We introduce MPI-ACC, an integrated and extensible framework that allows end-to-end data movement in accelerator-based systems. MPI-ACC provides productivity and performance benefits by integrating support for auxiliary memory spaces into MPI. MPI-ACC supports data transfer among CUDA, OpenCL, and CPU memory spaces and is extensible to other offload models as well. MPI-ACC's runtime system enables several key optimizations, including pipelining of data transfers, scalable memory management techniques, and balancing of communication based on accelerator and node architecture. MPI-ACC is designed to work concurrently with other GPU workloads with minimum contention. We describe how MPI-ACC can be used to design new communication-computation patterns in scientific applications from domains such as epidemiology simulation and seismology modeling, and we discuss the lessons learned. We present experimental results on a state-of-the-art cluster with hundreds of GPUs, and we compare the performance and productivity of MPI-ACC with MVAPICH, a popular CUDA-aware MPI solution. MPI-ACC encourages programmers to explore novel application-specific optimizations for improved overall cluster utilization.

Index Terms—Heterogeneous (hybrid) systems, Parallel systems, Distributed architectures, Concurrent programming

• A. M. Aji, L. S. Panwar, and W. Feng are with the Dept. of Comp. Sci., and K. Bisset is with Virginia Bioinformatics Inst., Virginia Tech. E-mail: {aaji, lokendra, feng}@cs.vt.edu, [email protected]
• P. Balaji, J. Dinan, and R. Thakur are with the Math. and Comp. Sci. Div., Argonne National Lab. E-mail: {balaji, dinan, thakur}@mcs.anl.gov
• K. Murthy, M. Chabbi, and J. Mellor-Crummey are with the Dept. of Comp. Sci., Rice Univ. E-mail: {ksm2, mc29, johnmc}@rice.edu
• F. Ji and X. Ma are with the Dept. of Comp. Sci., N. C. State Univ. E-mail: [email protected], [email protected]

1 INTRODUCTION

Graphics processing units (GPUs) have gained widespread use as general-purpose computational accelerators and have been studied extensively across a broad range of scientific applications [1], [2], [3]. The presence of general-purpose accelerators in high-performance computing (HPC) clusters has also steadily increased, and 15% of today's top 500 fastest supercomputers (as of November 2014) employ general-purpose accelerators [4].

Nevertheless, despite the growing prominence of accelerators in HPC, data movement on systems with GPU accelerators remains a significant problem. Hybrid programming with the Message Passing Interface (MPI) [5] and the Compute Unified Device Architecture (CUDA) [6] or Open Computing Language (OpenCL) [7] is the dominant means of utilizing GPU clusters; however, data movement between processes is currently limited to data residing in the host memory. The ability to interact with auxiliary memory systems, such as GPU memory, has not been integrated into such data movement standards, thus leaving applications with no direct mechanism to perform end-to-end data movement. Currently, transmission of data from accelerator memory must be done by explicitly copying data to host memory before performing any communication operations. This impacts productivity and can lead to a severe loss in performance. Significant programmer effort would be required to recover this performance through vendor- and system-specific optimizations, including GPUDirect [8] and node and I/O topology awareness.

We introduce MPI-ACC, an integrated and extensible framework that provides end-to-end data movement in accelerator-based clusters. MPI-ACC significantly improves productivity by providing a unified programming interface, compatible with both CUDA and OpenCL, that can allow end-to-end data movement irrespective of whether data resides in host or accelerator memory. In addition, MPI-ACC allows applications to easily and portably leverage vendor- and platform-specific capabilities in order to optimize data movement performance. Our specific contributions in this paper are as follows.

• An extensible interface for integrating auxiliary memory systems (e.g., GPU memory) with MPI
• An efficient runtime system, which is heavily optimized for a variety of vendors and platforms (CUDA and OpenCL) and carefully designed to minimize contention with existing workloads
• An in-depth study of high-performance simulation codes from two scientific application domains (computational epidemiology [9], [10] and seismology modeling [11])


We evaluate our findings on HokieSpeed, a state-of-the-art hybrid CPU-GPU cluster housed at Virginia Tech. Microbenchmark results indicate that MPI-ACC can provide up to 48% improvement in two-sided GPU-to-GPU communication latency. We show that MPI-ACC's design does not oversubscribe the GPU, thereby minimizing contention with other concurrent GPU workloads. We demonstrate how MPI-ACC can be used in epidemiology and seismology modeling applications.

Fig. 1(a): Basic hybrid MPI+GPU with synchronous execution – high productivity and low performance.

    computation_on_GPU(gpu_buf);
    cudaMemcpy(host_buf, gpu_buf, size, D2H ...);
    MPI_Send(host_buf, size, ...);

Fig. 1(b): Manual MPI+GPU pipelining across data chunks (listing truncated in the source):

    int processed[chunks] = {0};
    for (j = 0; j < chunks; j++) { ...
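For contrast with the listings above, the sketch below shows the GPU-integrated style that Figure 1c refers to: the device buffer is handed directly to MPI, and the staging copy disappears. This is an illustrative sketch rather than MPI-ACC's literal API; how the library recognizes the device pointer (UVA-based detection or a datatype attribute) is discussed in Section 4.1, and compute_on_gpu() is a placeholder for the application's kernel launch.

    #include <mpi.h>
    #include <cuda_runtime.h>

    void compute_on_gpu(void *gpu_buf, int size);   /* application kernel wrapper (placeholder) */

    /* GPU-integrated style: pass the device buffer straight to MPI. */
    void send_gpu_result(void *gpu_buf, int size, int dest, MPI_Comm comm)
    {
        compute_on_gpu(gpu_buf, size);
        cudaDeviceSynchronize();                          /* make sure the result is ready */
        MPI_Send(gpu_buf, size, MPI_CHAR, dest, 0, comm); /* no host staging copy */
    }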


this approach leads to tight coupling between the high-level application logic and low-level data movement optimizations; hence, the application developer has to maintain several code variants for different GPU architectures and vendors. In addition, construction of such a sophisticated data movement scheme above the MPI runtime system incurs repeated protocol overheads and eliminates opportunities for low-level optimizations. Moreover, users who need high performance are faced with the complexity of leveraging a multitude of platform-specific optimizations that continue to evolve with the underlying technology (e.g., GPUDirect [8]).

2.2 Application Design Using GPU-Integrated MPI Frameworks

To bridge the gap between the disjointed MPI and GPU programming models, researchers have recently developed GPU-integrated MPI solutions such as our MPI-ACC [14] framework and MVAPICH-GPU [15] by Wang et al. These frameworks provide a unified MPI data transmission interface for both host and GPU memories; in other words, the programmer can use either the CPU buffer or the GPU buffer directly as the communication parameter in MPI routines. The goal of such GPU-integrated MPI platforms is to decouple the complex, low-level, GPU-specific data movement optimizations from the application logic, thus providing the following benefits: (1) portability – the application can be more portable across multiple accelerator platforms; and (2) forward compatibility – with the same code, the application can automatically achieve performance improvements from new GPU technologies (e.g., GPUDirect RDMA) if applicable and supported by the MPI implementation. In addition to enhanced programmability, transparent architecture-specific and vendor-specific performance optimizations can be provided within the MPI layer. Using GPU-integrated MPI, programmers need only write GPU kernels and regular host CPU code for computation and invoke the standard MPI functions for CPU-GPU data communication, without worrying about the complex data movement optimizations of the diverse accelerator technologies (Figure 1c).
3 RELATED WORK

MVAPICH [16] is an implementation of MPI based on MPICH [17] and is optimized for RDMA networks such as InfiniBand. From v1.8 onward, MVAPICH has included support for transferring CUDA memory regions across the network (point-to-point, collective, and one-sided communications), but its design relies heavily on CUDA's Unified Virtual Addressing (UVA) feature. On the other hand, MPI-ACC takes a more portable approach: we support data transfers among CUDA, OpenCL, and CPU memory regions, and our design is independent of library version or device family. By including OpenCL support in MPI-ACC, we automatically enable data movement between a variety of devices, including GPUs from NVIDIA and AMD, IBM and Intel CPUs, Intel MICs, AMD Fusion, and IBM's Cell Broadband Engine. Also, we make no assumptions about the availability of key hardware features (e.g., UVA) in our interface design, thus making MPI-ACC a truly generic framework for heterogeneous CPU-GPU systems.

CudaMPI [18] and GMH [19] are new libraries designed to improve programmer productivity when managing data and compute among GPUs across the network. Both approaches are host-centric programming models and provide new programming abstractions on top of existing MPI and CUDA calls. CaravelaMPI [20] is another MPI-style solution for data management across GPUs, but it is limited to the custom Caravela API and is not applicable to general MPI programs. In contrast, MPI-ACC conforms completely to the MPI standard, and our implementation removes the overhead of communication setup time while maintaining productivity.

DCGN [21] is a device-centric programming model that moves away from the GPU-as-a-worker programming model. DCGN assigns ranks to GPU threads in the system and allows them to communicate with one another by using MPI-like library calls, while a CPU-based polling runtime handles GPU control and data transfers. MPI-ACC is orthogonal to DCGN in that we retain the host-centric MPI communication and execution model while hiding the details of third-party CPU-GPU communication libraries from the end user.

Partitioned Global Address Space (PGAS) models, such as Chapel and OpenSHMEM, provide a globally shared memory abstraction to distributed-memory systems. Researchers have explored extending PGAS models to include GPUs as part of the shared-memory abstraction [22], [23]. PGAS models can use MPI itself as the underlying runtime layer [24], [25] and can be considered complementary efforts to MPI.

4 MPI-ACC: DESIGN AND OPTIMIZATIONS

In this section, we describe the design, implementation, and optimizations of MPI-ACC, the first interface that integrates CUDA, OpenCL, and other models within an MPI-compliant interface.

4.1 API Design

We discuss MPI-ACC's API design considerations and compare the tradeoffs of our solution with the API design of MVAPICH. In an MPI communication call, the user passes a void pointer that indicates the location of the data on which the user wishes to operate. To the MPI library, a pointer to host memory is indistinguishable from a pointer to GPU memory. The MPI implementation needs a mechanism to determine whether the given pointer can be dereferenced directly or whether data must be explicitly copied from the device by invoking GPU library functions. Moreover, memory is referenced differently in different GPU programming models.


Fig. 2: Overhead of runtime checks incurred by intranode CPU-CPU communication operations. The slowdown due to automatic detection (via cuPointerGetAttribute) can be up to 205% (average: 127.8%), while the slowdown for the datatype attribute check is at most 6.3% (average: 2.6%).

TABLE 1: Comparison of MPI-ACC with MVAPICH.

                                               MPI-ACC                                       MVAPICH
GPU Buffer Identification Method               MPI Datatype attributes (Explicit)            void* (UVA-based) (Implicit)
Performance/Progress for MPI+GPU Synch. Ops.   Automatic (through the MPI implementation)    Manual (through the programmer)
Software Platforms                             CUDA and OpenCL (any version)                 CUDA v4.0 and newer only
Hardware Platforms                             AMD (CPU, GPU and APU), NVIDIA GPU,           NVIDIA GPU only
                                               Intel (CPU and MIC), FPGA, etc.

For example, CUDA memory buffers are void pointers, but they cannot be dereferenced by the host unless Unified Virtual Addressing (UVA) is enabled. On the other hand, OpenCL memory buffers are represented as opaque cl_mem handles that internally translate to the physical device memory location but cannot be dereferenced by the host unless the buffer is mapped into the host's address space or explicitly copied to the host.

MVAPICH's Automatic Detection Approach: MVAPICH allows MPI to deal with accelerator buffers by leveraging the UVA feature of CUDA to automatically detect device buffers. This method requires no modifications to the MPI interface. Also, we have shown previously that while their approach works well for standalone point-to-point communication, programmers have to explicitly synchronize between interleaved and dependent MPI and CUDA operations, thereby requiring significant programmer effort to achieve ideal performance [26]. Moreover, as shown in Figure 2, the penalty for runtime checking can be significant and is incurred by all operations, including those that require no GPU data movement at all. Furthermore, the automatic detection approach is not extensible to other accelerator models, such as OpenCL, that do not map GPU buffers into the host virtual address space.

MPI-ACC's Datatype Attributes Approach: MPI datatypes are used to specify the type and layout of buffers passed to the MPI library. The MPI standard defines an interface for attaching metadata to MPI datatypes through datatype attributes. In MPI-ACC, we use these MPI datatype attributes to indicate buffer type (e.g., CPU, CUDA, or OpenCL), buffer locality (e.g., which GPU), the stream or event to synchronize upon, or any other information to the MPI library. With the datatype attributes, there is no restriction on the amount of information that the user can pass to the MPI implementation. With our design, one can simply and implicitly denote ordering of MPI and GPU operations by associating GPU events or streams with MPI calls, and the MPI-ACC implementation applies different heuristics to synchronize and make efficient communication progress. We have shown in our prior work [26] that this approach improves productivity and performance while remaining compatible with the MPI standard. Moreover, our approach introduces a lightweight runtime attribute check to each MPI operation, but the overhead is much less than with automatic detection, as shown in Figure 2. Since MPI-ACC supports both CUDA and OpenCL, and since OpenCL is compatible with several platforms and vendors, we consider MPI-ACC to be a more portable solution than MVAPICH. Table 1 summarizes the above differences.
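As an illustration of how such attributes can be attached through the standard MPI attribute-caching interface, consider the following sketch. The keyval, the "CUDA" attribute value, and the helper name are hypothetical placeholders; MPI-ACC's actual attribute names and encodings may differ.

    #include <mpi.h>

    static int buftype_keyval = MPI_KEYVAL_INVALID;

    /* Build a datatype that carries a "this buffer lives in CUDA memory" tag. */
    void make_cuda_char_type(MPI_Datatype *gpu_type)
    {
        if (buftype_keyval == MPI_KEYVAL_INVALID)
            MPI_Type_create_keyval(MPI_TYPE_NULL_COPY_FN, MPI_TYPE_NULL_DELETE_FN,
                                   &buftype_keyval, NULL);
        MPI_Type_dup(MPI_CHAR, gpu_type);
        MPI_Type_set_attr(*gpu_type, buftype_keyval, (void *)"CUDA");
        MPI_Type_commit(gpu_type);
    }

    /* Usage: MPI_Send(dev_buf, n, gpu_type, dest, tag, comm);
     * the implementation reads the attribute instead of querying UVA. */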
4.2 Optimizations

Once MPI-ACC has identified a device buffer, it leverages PCIe and network link parallelism to optimize the data transfer via pipelining. Pipelined data transfer parameters are dynamically selected based on NUMA and PCIe affinity to further improve communication performance.

4.2.1 Data Pipelining

We hide the PCIe latency between the CPU and GPU by dividing the data into smaller chunks and performing pipelined data transfers between the GPU, the CPU, and the network. To orchestrate the pipelined data movement, we create a temporary pool of host-side buffers that are registered with the GPU driver (CUDA or OpenCL) for faster DMA transfers. The buffer pool is created at MPI_Init time and destroyed during MPI_Finalize. The system administrator can choose to enable CUDA and/or OpenCL when configuring the MPICH installation. Depending on the choice of the GPU library, the buffer pool is created by calling either cudaMallocHost for CUDA or clCreateBuffer (with the CL_MEM_ALLOC_HOST_PTR flag) for OpenCL.

To calculate the ideal packet size, we first individually measure the network and PCIe bandwidths at different data sizes (Figure 3) and then choose the packet size at the intersection point of the two channel rates, 64 KB for our experimental cluster (Section 6). If the performance at the intersection point is still latency bound for both data channels (network and PCIe), then we pick the pipeline packet size to be the size of the smallest packet at which the slower data channel reaches peak bandwidth. The end-to-end data transfer will then also work at the net peak bandwidth of the slower data channel. Also, only two packets are needed to do pipelining by double buffering: one channel receives the GPU packet to the host, while the other sends the previous GPU packet over the network. We therefore use two CUDA streams and two OpenCL command queues per device per MPI request to facilitate pipelining.

The basic pipeline loop for a "send" operation is as follows ("receive" works the same way, but with the direction of the operations reversed). Every time we prepare to send a packet over the network, we check for the completion of the previous GPU-to-CPU transfer by calling cudaStreamSynchronize or a loop of cudaStreamQuery for CUDA (or the corresponding OpenCL calls).
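The following sketch shows the shape of this double-buffered send loop at the application level, assuming pinned host buffers h[0] and h[1] drawn from a preallocated pool and the 256 KB packet size chosen below; inside MPI-ACC the equivalent loop runs within the library, with OpenCL handled analogously through command queues.

    #include <mpi.h>
    #include <cuda_runtime.h>

    #define PKT (256 * 1024)    /* pipeline packet size (see Section 4.2.1) */

    /* Double-buffered D2H + network pipeline for a "send"; error checks omitted. */
    void pipelined_send(const char *d_buf, size_t bytes, char *h[2],
                        int dest, int tag, MPI_Comm comm, cudaStream_t s[2])
    {
        size_t off = 0;
        int cur = 0, prev = -1;
        MPI_Request req = MPI_REQUEST_NULL;

        while (off < bytes) {
            size_t len = (bytes - off < PKT) ? bytes - off : PKT;
            /* Stage the current packet from device memory into a pinned buffer. */
            cudaMemcpyAsync(h[cur], d_buf + off, len, cudaMemcpyDeviceToHost, s[cur]);
            if (prev >= 0)
                MPI_Wait(&req, MPI_STATUS_IGNORE);   /* previous packet is on the wire */
            cudaStreamSynchronize(s[cur]);           /* current packet fully staged    */
            MPI_Isend(h[cur], (int)len, MPI_CHAR, dest, tag, comm, &req);
            prev = cur;
            cur ^= 1;                                /* swap the two pinned buffers    */
            off += len;
        }
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }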


Fig. 3: Choosing the pipeline parameters. Network: InfiniBand; transfer protocol: R3. The plot compares the network bandwidth, the GPU-CPU PCIe bandwidth, and the effective network bandwidth (with redundant GPU synchronization overhead) across packet sizes.

Fig. 4: Computational epidemiology simulation model (figure adapted from [9]). (a) Bipartite graph representation with people 1–4 and locations a–c. (b) Temporal and spatial people-to-people contacts. (c) Occupancy of location b overlaid with the health states (susceptible, latent, infectious) of its occupants; the dotted area shows the time of possible disease transmission.

However, we found that the GPU synchronization/query calls on already completed CPU-GPU copies caused a significant overhead in our experimental cluster, which hurt the effective network bandwidth and forced us to choose a different pipeline packet size. For example, we measured the cost of stream query/synchronization operations as approximately 20 µs, even though the data transfer had been completed. Moreover, this overhead occurs every time a packet is sent over the network, as shown in Figure 3 by the "Effective Network Bandwidth" line. We observed that the impact of the synchronization overhead is large for smaller packet sizes but becomes negligible for larger packet sizes (2 MB). Also, we found no overlap between the PCIe bandwidth and the effective network bandwidth rates, and the PCIe channel was always faster for all packet sizes. Thus, we picked the smallest packet size that could achieve the peak effective network bandwidth (in our case, 256 KB) as the pipeline transfer size for MPI-ACC. Smaller packet sizes (<256 KB) caused the effective network bandwidth to be latency bound and were thus not chosen as the pipeline parameters. In MPI-ACC, we use the pipelining approach to transfer large messages—namely, messages that are at least as large as the chosen packet size—and fall back to the nonpipelined approach when transferring smaller messages.

4.2.2 OpenCL Issues and Optimizations

In OpenCL, device data is encapsulated as a cl_mem object that is created by using a valid cl_context object. To transfer data to/from the host, the programmer needs valid cl_device_id and cl_command_queue objects, which are all created by using the same context as the device data. At a minimum, the MPI interface for OpenCL communication requires the target OpenCL memory object, context, and device ID objects as parameters. The command queue parameter is optional and can be created by using the above parameters. Within the MPICH implementation, we either use the user-provided command queue or create several internal command queues for device-host data transfers. Within MPICH, we also create a temporary OpenCL buffer pool of pinned host-side memory for pipelining. However, OpenCL requires that the internal command queues and the pipeline buffers also be created by using the same context as the device data. Moreover, in theory, the OpenCL context could change for every MPI communication call, so the internal OpenCL objects cannot be created at MPI_Init time. Instead, they must be created at the beginning of every MPI call and destroyed at the end of it.

The initialization of these temporary OpenCL objects is expensive, and their repeated creation severely hurts performance. We therefore cache the command queue and pipeline buffer objects after the first communication call and reuse them if the same OpenCL context and device ID are used for the subsequent calls, which is a plausible scenario. If any future call involves a different context or device ID, we clear and replace our cache with the most recently used OpenCL objects. In this way, we can amortize the high OpenCL initialization cost across multiple calls and significantly improve performance. We use a caching window of one, which we consider to be sufficient in practice.
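Conceptually, the cache is a single entry keyed by the (context, device) pair, as in the sketch below; the names are illustrative and are not MPI-ACC's internal symbols.

    #include <CL/cl.h>

    static cl_context       cached_ctx = NULL;
    static cl_device_id     cached_dev = NULL;
    static cl_command_queue cached_q   = NULL;

    /* Caching window of one: reuse the internal queue while the caller keeps
     * using the same OpenCL context and device; otherwise rebuild the entry. */
    cl_command_queue get_internal_queue(cl_context ctx, cl_device_id dev)
    {
        if (ctx == cached_ctx && dev == cached_dev)
            return cached_q;                         /* hit: amortize creation cost */

        if (cached_q != NULL)
            clReleaseCommandQueue(cached_q);         /* miss: drop the stale entry  */
        cached_q   = clCreateCommandQueue(ctx, dev, 0, NULL);
        cached_ctx = ctx;
        cached_dev = dev;
        return cached_q;
    }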
5 APPLICATION CASE STUDIES

In this section, we perform an in-depth analysis of the default MPI+GPU application design in scientific applications from computational epidemiology and seismology modeling. We identify the inherent data movement inefficiencies and show how MPI-ACC can be used to explore new design spaces and create novel application-specific optimizations.

5.1 EpiSimdemics

GPU-EpiSimdemics [9], [10] is a high-performance, agent-based simulation program for studying the spread of epidemics through large-scale social contact networks and the coevolution of disease, human behavior, and the social contact network. The participating entities in GPU-EpiSimdemics are persons and locations, which are represented as a bipartite graph (Figure 4a) and interact with one another iteratively over a predetermined number of iterations (or simulation days). The output of the simulation is the relevant disease statistics of the contagion diffusion, such as the total number of infected persons or an infection graph showing who infected whom and the time and location of the infection.

5.1.1 Phases

Each iteration of GPU-EpiSimdemics consists of two phases: computeVisits and computeInteractions.


During the computeVisits phase, all the person objects of every processing element (or PE) first determine the schedules for the current day, namely, the locations to be visited and the duration of each visit. These visit messages are sent to the destination location's host PE (Figure 4a). Computation of the schedules is overlapped with communication of the corresponding visit messages.

In the computeInteractions phase, each PE first groups the received visit messages by their target locations. Next, each PE computes the probability of infection transmission between every pair of spatially and temporally colocated people in its local location objects (Figure 4b), which determines the overall disease spread information of that location. The infection transmission function depends on the current health states (e.g., susceptible, infectious, latent) of the people involved in the interaction (Figure 4c) and the transmissibility factor of the disease. These infection messages are sent back to the "home" PEs of the infected persons. Each PE, upon receiving its infection messages, updates the health states of the infected individuals, which will influence their schedules for the following simulation day. Thus, the messages that are computed as the output of one phase are transferred to the appropriate PEs as inputs of the next phase of the simulation. The system is synchronized by barriers after each simulation phase.

Fig. 5: Exclusive GPU computation mode of GPU-EpiSimdemics. Left: manual MPI+CUDA design. Right: MPI-ACC–enabled design, where the visit messages are transferred and preprocessed on the device in a pipelined manner.

5.1.2 Computation-Communication Patterns and MPI-ACC–Driven Optimizations

In GPU-EpiSimdemics, each PE in the simulation is implemented as a separate MPI process. Also, the computeInteractions phase of GPU-EpiSimdemics is offloaded to and accelerated on the GPU while the rest of the computations are executed on the CPU [10]. The current implementation of GPU-EpiSimdemics assumes a one-to-one mapping of GPUs to MPI processes. In accordance with the GPU-EpiSimdemics algorithm, the output data elements from the computeVisits phase (i.e., visit messages) are first received over the network, then merged, grouped, and preprocessed before the GPU can begin the computeInteractions phase of GPU-EpiSimdemics. Moreover, there are two GPU computation modes, depending on how the visit messages are processed on the GPUs. In this paper, we discuss the exclusive GPU computation mode; a discussion of the cooperative CPU-GPU computation mode can be found in our prior work [27].

Preprocessing phase on the GPU: As a preprocessing step in the computeInteractions phase, we modify the data layout of the visit messages to be more amenable to the massively parallel architecture of the GPU [10]. Specifically, we unpack the visit message structures into a 2D time-bin matrix, where each row of the matrix represents a person-location pair and the cells in the row represent fixed time slots of the day; that is, each visit message corresponds to a single row in the person-timeline matrix. Depending on the start time and duration of a person's visit to a location, the corresponding row cells are marked as visited. The preprocessing logic of data unpacking is implemented as a separate GPU kernel at the beginning of the computeInteractions phase. The matrix data representation enables a much better SIMDization of the computeInteractions code execution, which significantly improves the GPU performance. However, we achieve the benefits at the cost of a larger memory footprint for the person-timeline matrix, as well as a computational overhead for the data unpacking.
before the GPU can begin the computeInteractions phase of GPU-EpiSimdemics. Moreover, there are two GPU com- Whenever a buffer fragment is received into the corre- putation modes depending on how the visit messages are sponding receive buffer, it is copied into a contiguous visit processed on the GPUs. In this paper, we discuss the vector in the CPU’s main memory. The computeInteractions exclusive GPU computation mode, but discussion of the phase of the simulation then copies the aggregated visit cooperative CPU-GPU computation mode can be found in vector to the GPU memory for preprocessing. While the our prior work [27]. CPU-CPU communication of visit messages is somewhat Preprocessing phase on the GPU: As a preprocessing overlapped with their computation on the source CPUs, step in the computeInteractions phase, we modify the data the GPU and the PCIe bus will remain idle until the visit layout of the visit messages to be more amenable to the messages are completely received, merged, and ready to be massive parallel architecture of the GPU [10]. Specifically, transferred to the GPU. we unpack the visit message structures to a 2D time-bin Advanced MPI+GPU communication-computation matrix, where each row of the matrix represents a person- pipeline: In this optimization, we still preallocate location pair and the cells in the row represents fixed time persistent receive buffers on the CPU in each PE and slots of the day: that is, each visit message corresponds to a register them with the MPI library as the communication single row in the person-timeline matrix. Depending on the endpoints by calling MPI_Recv_init. But, we create the start time and duration of a person’s visit to a location, contiguous visit vector in GPU memory, so that whenever the corresponding row cells are marked as visited. The a PE receives a visit buffer fragment on the CPU, we preprocessing logic of data unpacking is implemented as immediately enqueue an asynchronous CPU-GPU data a separate GPU kernel at the beginning of the computeIn- transfer to the contiguous visit vector and also enqueue the teractions phase. The matrix data representation enables associated GPU preprocessing kernels, thereby manually a much better SIMDization of the computeInteractions creating the communication-computation pipeline. code execution, which significantly improves the GPU In order to enable asynchronous CPU-GPU data trans- 1045-9219 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPDS.2015.2446479, IEEE Transactions on Parallel and Distributed Systems


In order to enable asynchronous CPU-GPU data transfers, however, the persistent receive buffers must be non-pageable (pinned) memory. Also, since the number of receive buffers increases with the number of PEs, the pinned memory footprint also increases with the number of PEs. This design reduces the available pageable CPU memory, a situation that could lead to poor CPU performance [6].

The pinned memory management logic can be implemented at the application level in a couple of ways. In the first approach, the pinned memory pool is created before the computeVisits phase begins and is destroyed once the phase finishes, but the memory management routines are invoked every simulation iteration. While this approach is relatively simple to implement, repeated memory management leads to significant performance overhead. In the second approach, the pinned memory pool is created once, before the main simulation loop, and is destroyed after the loop ends, thus avoiding the performance overhead of repeated memory management. However, this design reduces the available pageable CPU memory, not only for the computeVisits phase, but also for the other phases of the simulation, including computeInteractions. We discuss the performance tradeoffs of these two memory management techniques in Section 6.2.

MPI-ACC–enabled communication-computation pipeline: Since MPI-ACC handles both CPU and GPU buffers, in each PE we preallocate persistent buffer fragments on the GPU and register them with the MPI library by using MPI_Recv_init. Without MPI-ACC's GPU support, one cannot create persistent buffers on the GPU as MPI communication endpoints. In this approach, the receive buffers on the GPU increase with the number of PEs, but the approach does not require a growing pinned memory pool on the host. Furthermore, MPI-ACC internally creates a constant pool of pinned memory during MPI_Init and automatically reuses it for all pipelined data communication, thereby providing better programmability and application scalability. Internally, MPI-ACC may either pipeline the internode CPU-GPU data transfers via the host CPU's memory or use direct GPU transfer techniques (e.g., GPUDirect RDMA), if possible; these details are hidden from the programmer.

Along with the preallocated persistent GPU buffer fragments, the contiguous visit vector is created in the GPU memory itself. As soon as a PE receives the visit buffer fragments on the GPU, we enqueue kernels to copy data within the device to the contiguous visit vector and also enqueue the associated GPU preprocessing kernels, thereby creating the end-to-end communication-computation pipeline. Thus, we completely overlap the visit data generation on the CPU with internode CPU-GPU data transfers and GPU preprocessing. In this way, the preprocessing overhead is completely hidden and removed from the computeInteractions phase. Moreover, the CPU, the GPU, and the interconnection network are all kept busy, performing either data transfers or the preprocessing execution.
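With MPI-ACC, the same structure collapses to persistent receives whose endpoints are device buffers, as in the sketch below; gpu_char_type stands for a datatype carrying the CUDA buffer-type attribute (Section 4.1), and the remaining names are illustrative.

    #include <mpi.h>
    #include <cuda_runtime.h>

    /* Persistent receive endpoints that live directly in GPU memory. */
    void setup_gpu_receives(int npes, size_t frag_bytes, void **d_frag,
                            MPI_Datatype gpu_char_type, MPI_Comm comm,
                            MPI_Request *req)
    {
        for (int p = 0; p < npes; p++) {
            cudaMalloc(&d_frag[p], frag_bytes);           /* fragment lives on the GPU  */
            MPI_Recv_init(d_frag[p], (int)frag_bytes, gpu_char_type,
                          p, 0, comm, &req[p]);           /* GPU buffer as MPI endpoint */
        }
        MPI_Startall(npes, req);
        /* Completed fragments are gathered into the device-resident visit vector by
         * intra-GPU copy kernels; pinned staging for the internode transfer (or a
         * direct transfer, where supported) is handled inside MPI-ACC. */
    }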
Furthermore, MPI-ACC matched layers) absorbers [28] are being used as ABCs for internally creates a constant pool of pinned memory their superior efficiency and minimal reflection coefficient. during MPI_Init and automatically reuses it for all The use of a one-point integration scheme leads to an pipelined data communication, thereby providing better easy and efficient implementation of the PML absorbing programmability and application . Internally, boundaries and allows the use of irregular elements in the MPI-ACC may either pipeline the internode CPU-GPU PML region [11]. data transfers via the host CPU’s memory or use direct GPU transfer techniques (e.g., GPUDirect RDMA), if 5.2.1 Computation-Communication Patterns possible; but these details are hidden from the programmer. The simulation operates on the input finite-difference (FD) Along with the preallocated persistent GPU buffer model and generates a three-dimensional grid as a first step. fragments, the contiguous visit vector is created in the Our MPI-based parallel version of the application divides GPU memory itself. As soon as a PE receives the visit the input FD model into submodels along different axes buffer fragments on the GPU, we enqueue kernels to such that each submodel can be computed on different copy data within the device to the contiguous visit vec- CPUs (or nodes). This domain decomposition technique tor and also enqueue the associated GPU preprocessing helps the application scale to a large number of nodes. Each kernels, thereby creating the end-to-end communication- processor computes the velocity and stress wavefields in its computation pipeline. Thus, we completely overlap the own subdomain and then exchanges the wavefields with the visit data generation on the CPU with internode CPU- nodes operating on neighbor subdomains, after each set of GPU data transfers and GPU preprocessing. In this way, velocity or stress computation (Figure 6). Each processor the preprocessing overhead is completely hidden and re- updates its own wavefields after receiving all its neighbors’ moved from the computeInteractions phase. Moreover, the wavefields. CPU, GPU, and the interconnection networks are all kept These computations are run multiple times for better busy, performing either data transfers or the preprocessing accuracy and convergence of results. In every iteration, execution. each node computes the velocity components followed by MPI-ACC’s internal pipelined CPU-GPU data transfer the stress components of the seismic wave propagation. largely hides the PCIe transfer latency during the compute- The wavefield exchanges with neighbors take place after 1045-9219 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPDS.2015.2446479, IEEE Transactions on Parallel and Distributed Systems


Fig. 6: Communication-computation pattern in the FDM-Seismology application. Left: basic MPI+GPU execution mode with data marshaling on the CPU. Right: execution modes with data marshaling on the GPU. MPI-ACC automatically communicates the GPU data; the MPI+GPU Adv case explicitly stages the communication via the CPU.

The wavefield exchanges with neighbors take place after each set of velocity and stress computations. This MPI communication takes place in multiple stages, wherein each communication is followed by an update of local wavefields and a small postcommunication computation on local wavefields. At the end of each iteration, the updated local wavefields are written to a file.

The velocity and stress wavefields are stored as large multidimensional arrays on each node. In order to optimize the MPI communication between neighbors of the FD domain grid, only a few elements of the wavefields—those needed by the neighboring node for its own local update—are communicated to the neighbor, rather than whole arrays. Hence, each MPI communication is surrounded by data-marshaling steps, where the required elements are packed into a smaller array at the source, communicated, and then unpacked at the receiver in order to update its local data.

5.2.2 GPU Acceleration of FDM-Seismology

We describe a couple of GPU execution modes of FDM-Seismology.

MPI+GPU with data marshaling on the CPU (MPI+GPU): Our GPU-accelerated version of FDM-Seismology performs the velocity and stress computations as GPU kernels. In order to transfer the wavefields to other nodes, it first copies the bulk data from the GPU buffers to CPU memory over the PCIe bus and then transfers the individual wavefields over MPI to the neighboring nodes (Figure 6). All the data-marshaling operations and small postcommunication computations are performed on the CPU itself. The newly updated local wavefields that are received over MPI are then bulk transferred back to the GPU before the start of the next stress or velocity computation on the GPU.

MPI+GPU with data marshaling on the GPU (MPI+GPU Adv): In this execution mode, the data-marshaling operations are moved to the GPU to leverage the faster GDDR5 memory module and the GPU architecture. Consequently, the CPU-GPU bulk data transfers before and after each velocity-stress computation kernel are completely avoided. The need to explicitly bulk transfer data from the GPU to the CPU arises only at the end of the iteration, when the results are transferred to the CPU to be written to a file (Figure 6).

5.2.3 MPI-ACC–Enabled Optimizations

GPU-based data marshaling suffers from the following disadvantage in the absence of GPU-integrated MPI. All data-marshaling steps are separated by MPI communication, and each data-marshaling step depends on the previously marshaled data and the received MPI data from the neighbors. In other words, after each data-marshaling step, data has to be explicitly moved from the GPU to the CPU only for MPI communication. Similarly, the received MPI data has to be explicitly moved back to the GPU before the next marshaling step. In this scenario, the application uses the CPU only as a communication relay. If the GPU communication technology changes (e.g., GPUDirect RDMA), we would have to largely rewrite the FDM-Seismology communication code to achieve the expected performance.

With MPI-ACC as the communication library, we still perform data marshaling on the GPU, but we communicate the marshaled data directly to and from the GPU without explicitly using the CPU for data staging. Also, the bulk transfer of data still happens only once at the end of each iteration, in order to write the results to a file. However, the data-marshaling step happens multiple times during a single iteration, and consequently the application launches a series of GPU kernels. While consecutive kernels entail launch and synchronization overhead per kernel invocation, the benefits of faster data marshaling on the GPU and optimized MPI communication outweigh the kernel overheads.

Other than the benefits resulting from GPU-driven data marshaling, a GPU-integrated MPI library benefits the FDM-Seismology application in the following ways: (1) it significantly enhances the productivity of the programmer, who is no longer constrained by the fixed CPU-only MPI communication and can easily choose the appropriate device as the communication target end-point; (2) the pipelined data transfers within MPI-ACC further improve the communication performance over the network; and (3) regardless of the GPU communication technology that may become available in the future, our MPI-ACC–driven FDM-Seismology code will not change and will automatically benefit from the performance upgrades made available by subsequent GPU-integrated MPI implementations (e.g., support for GPUDirect RDMA).
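The resulting per-neighbor exchange can be sketched as follows. pack_halo_on_gpu(), unpack_halo_on_gpu(), and gpu_real_type (a datatype carrying the device-buffer attribute) are illustrative placeholders rather than the application's actual routines.

    #include <mpi.h>
    #include <cuda_runtime.h>

    void pack_halo_on_gpu(void *d_send_halo);     /* marshaling kernels (placeholders) */
    void unpack_halo_on_gpu(void *d_recv_halo);

    /* Marshal on the GPU, then communicate the device-resident halo directly. */
    void exchange_halo(void *d_send_halo, void *d_recv_halo, int count,
                       MPI_Datatype gpu_real_type, int neighbor, MPI_Comm comm)
    {
        MPI_Request req[2];

        pack_halo_on_gpu(d_send_halo);
        cudaDeviceSynchronize();                   /* packing must finish before the send */
        MPI_Irecv(d_recv_halo, count, gpu_real_type, neighbor, 0, comm, &req[0]);
        MPI_Isend(d_send_halo, count, gpu_real_type, neighbor, 0, comm, &req[1]);
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        unpack_halo_on_gpu(d_recv_halo);           /* update the local wavefields */
    }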
6 EVALUATION

In this section, we describe our experimental setup, followed by the performance evaluation of MPI-ACC via latency microbenchmarks. Next, we demonstrate the efficacy of the MPI-ACC–enabled optimizations in GPU-EpiSimdemics and FDM-Seismology. Using both microbenchmarks and GPU-EpiSimdemics, we discuss the impact of shared resource (hardware and software) contention on MPI-ACC's communication performance.

We conducted our experiments on HokieSpeed, a state-of-the-art, 212-teraflop hybrid CPU-GPU cluster housed at Virginia Tech. Each HokieSpeed node contains two hex-core Intel Xeon E5645 CPUs running at 2.40 GHz and two NVIDIA Tesla M2050 GPUs.


Fig. 7: Internode communication latency for GPU-to-GPU (CUDA) data transfers; InfiniBand transfer protocol: R3. Similar performance is observed for OpenCL data transfers. The chosen pipeline packet size for MPI-ACC is 256 KB. (Compared configurations: MPI-ACC, manual blocking, manual non-blocking (pipelined), and CPU-only MPI send/receive as a lower bound.)

Fig. 8: Execution profile of GPU-EpiSimdemics over various node configurations (nodes × GPUs per node), broken down into the computeVisits phase, GPU preprocessing, and the rest of the computeInteractions phase, for the MPI+GPU, MPI+GPU Adv, MVAPICH, and MPI-ACC implementations.
The host memory capacity is 24 GB, and each GPU has a 3 GB device memory. The internode interconnect is QDR InfiniBand. We used up to 128 HokieSpeed nodes and both GPUs per node for our experiments. We used the GCC v4.4.7 compiler and CUDA v5.0 with driver version 310.19.

6.1 Microbenchmark Analysis

Impact of Pipelined Data Transfer: In Figure 7, we compare the performance of MPI-ACC with the manual blocking and manual pipelined implementations. Our internode GPU-to-GPU latency tests show that MPI-ACC is better than the manual blocking approach by up to 48.3% and is up to 18.2% better than the manual pipelined implementation, especially for larger data transfers. The manual pipelined implementation repeatedly invokes MPI calls, causing multiple handshake messages to be sent back and forth across the network and thus hurting performance. On the other hand, we perform the handshake only once in MPI-ACC to establish the sender-receiver identities, followed by low-overhead pipelining. We perform pipelining in MPI-ACC only for messages that are larger than the pipeline packet size, and we fall back to the default blocking approach for smaller data sizes. Hence, the performance of MPI-ACC is comparable to the manual blocking approach for smaller message sizes.

Impact of OpenCL Object Caching: Our OpenCL caching optimization improves the internode GPU-to-GPU communication latency from 3% for larger data sizes (64 MB) to 88.7% for smaller data sizes (<256 KB). Even where the programmers provide their own command queue, the pipeline buffers still have to be created for every MPI communication call; hence, caching improves performance.

6.2 Case Study Analysis: EpiSimdemics

We compare the combined performance of all the phases of GPU-EpiSimdemics (computeVisits and computeInteractions), with and without the MPI-ACC–driven optimizations discussed in Section 5.1.2. We choose different-sized input datasets of synthetic populations from two U.S. states: Washington (WA), with a population of 5.7 million, and California (CA), with a population of 33.1 million. In this paper, we present results and a detailed analysis for WA; the analysis of CA is described in our prior work [27]. We also vary the number of compute nodes from 8 to 128 and the number of GPU devices between 1 and 2. We begin from the smallest node-GPU configuration that can fit the entire problem in the available GPU memory. We also compare the application performance when MVAPICH (v2.0a) is used as the GPU-aware MPI implementation of choice.

Our results in Figure 8 indicate that our MPI-ACC–driven optimizations perform better than the basic blocking MPI+GPU implementations by an average of 9.2% and by up to 13.3% for WA. The performance of the MPI-ACC–driven solution is similar to that of the MVAPICH-based and the manual MPI+GPU (advanced) implementations. Since the CPU-GPU transfer is not a bottleneck in GPU-EpiSimdemics, the specific data pipelining logic of either MPI-ACC or MVAPICH does not directly affect the performance gains. On the other hand, the preprocessing step (data unpacking) of the computeInteractions phase is completely overlapped with the asynchronous CPU-to-remote-GPU communication for all node configurations. Note that the advanced MPI+GPU implementation uses the manual pinned memory management techniques that we implemented at the application level, which achieves better performance but with much more complex code.

For larger node configurations, the local operating dataset in the computeInteractions phase becomes smaller, and hence the basic MPI+GPU solution takes less time to execute the preprocessing stage; in other words, the absolute gains achieved by hiding the preprocessing step are diminished for GPU-EpiSimdemics. However, we have shown that MPI-ACC can enable the developer to create newer optimizations for better latency hiding and resource utilization.

Data management complexity vs. performance tradeoffs: While the advanced MPI+GPU implementation achieved performance comparable to the MPI-ACC–based solution, it put the burden of explicit data management on the application programmer. We discussed in Section 5.1.2 that, on the other hand, the user can write simpler code and avoid explicit data management but has to repeatedly create and destroy the receive buffers for every simulation iteration, thereby losing performance. Figure 9 shows the performance tradeoffs of the two approaches.


Fig. 9: Analysis of the data management complexity vs. performance tradeoffs. Manual data management achieves better performance at the cost of high code complexity. The case with no explicit data management has simpler code but performs poorly.

Fig. 10: Analyzing the FDM-Seismology application with the larger input data (Dataset-2). Note that MPI communication refers to CPU-CPU data transfers for the MPI+GPU and MPI+GPU Adv cases and GPU-GPU (pipelined) data transfers for MPI-ACC and MVAPICH. The performance difference between MPI-ACC (CUDA) and MPI-ACC (OpenCL) is negligible, and only the MPI-ACC (CUDA) result is shown in this figure.

We observe that explicit data management is better for all node configurations and can achieve up to a 4.5× performance improvement. Without data management, the pinned memory footprint of the receive buffers increases with the number of MPI processes, thereby entailing bigger performance losses for larger node counts. To quantify the degree of performance loss, we measured the individual memory allocation costs using simple microbenchmarks and found that CUDA's pinned memory allocator (cudaMallocHost) was about 26% slower than the vanilla CPU memory allocator (malloc) for single CUDA contexts. We also observed that the pinned memory allocation cost increased linearly with the number of GPUs or CUDA contexts, whereas memory management in multiple processes and CUDA contexts should ideally be handled independently in parallel. Consequently, in Figure 9, we see that for the same number of MPI processes, the node configuration with two MPI processes (or GPUs) per node performs worse than the node with a single MPI process; for example, the 64×2 configuration is slower than the 128×1 one. Thus, efficient pinned memory management is essential for superior performance, and MPI-ACC provides that automatically to the programmers.

Discussion: The basic MPI+GPU solution has preprocessing overhead but does not have significant memory management issues. While the advanced MPI+GPU implementation gains from hiding the preprocessing overhead, it loses from either nonscalable pinned memory management or poor programmer productivity. On the other hand, MPI-ACC provides a more scalable solution by (1) automatically managing a fixed-size pinned buffer pool for pipelining and (2) creating buffer pools just once at MPI_Init and destroying them at MPI_Finalize. MPI-ACC thus gains from both hiding the preprocessing overhead and efficient pinned memory management. MPI-ACC decouples the lower-level memory management logic from the high-level simulation implementation, thereby enabling both performance and productivity.

6.3 Case Study Analysis: FDM-Seismology

In this section, we analyze the performance of the different phases of the FDM-Seismology application and evaluate the effect of MPI-ACC and MVAPICH on the application. FDM-Seismology is implemented with both CUDA and OpenCL, and while MPI-ACC is evaluated using both CUDA and OpenCL, MVAPICH can be evaluated only with the CUDA-based code. We vary the number of nodes from 2 to 128 with 1 GPU per node and use small and large datasets as input. Our scalability experiments begin from the smallest number of nodes required to fit the given data in the device memory. For the larger input data (i.e., Dataset-2), the size of the MPI transfers increases by 2×, while the size of the data to be marshaled increases by 4× when compared with the smaller Dataset-1.

Figure 10 shows the performance of the FDM-Seismology application, with and without GPU-based data marshaling. We report the average wall-clock time across all the processes because the computation-communication costs vary depending on the virtual location of the process in the application's structured grid representation. The application's running time is composed mainly of velocity and stress computations (>60%) and does not change for the three application designs.

In the basic MPI+GPU case, we perform both data-marshaling operations and MPI communication from the CPU. Thus, the application has to move large wavefield data between the CPU and the GPU for data marshaling and MPI communication after every stress and velocity computation phase in every iteration. In the MPI+GPU Adv, MVAPICH, and MPI-ACC–driven scenarios, we perform data marshaling on the GPU itself; hence, smaller-sized wavefield data is transferred from the GPU to the CPU only once per iteration for output generation. By performing data marshaling on the GPU, we avoid the large bulk CPU-GPU data transfers and improve the overall performance over the basic MPI+GPU design with data marshaling on the CPU. Data marshaling on the GPU provides performance gains, while MPI-ACC improves programmer productivity by directly communicating the GPU buffers (CUDA/OpenCL) in the application.

Scalability analysis: Figure 11 shows the performance improvement due to the MPI-ACC–enabled GPU data marshaling strategy over the basic MPI+GPU implementation with CPU data marshaling.


Fig. 11: Scalability analysis of the FDM-Seismology application with two datasets of different sizes. The baseline is the naive MPI+GPU programming model with CPU data marshaling. (Benefit due to MPI-ACC, in percent, vs. nodes x GPUs-per-node from 2x1 to 128x1, for Dataset-1 (Small) and Dataset-2 (Large).)

Fig. 12: Characterizing the contention impact of CUDA's stream-0 in GPU-EpiSimdemics. (Execution time in seconds of the ComputeInteractions and ComputeVisits phases with Stream0 vs. CustomStream, across node x GPU configurations from 8x2 to 128x2.)

Scalability analysis: Figure 11 shows the performance improvement due to the MPI-ACC-enabled GPU data marshaling strategy over the basic MPI+GPU implementation with CPU data marshaling. We see that the performance benefits of GPU data marshaling decrease with an increasing number of nodes, for the following reasons. For a given dataset, the per-node data size decreases with an increasing number of nodes; thus, the costly CPU-GPU bulk data transfers are reduced (Figure 10), and the overall benefit of GPU-based data marshaling itself is minimized. Also, for a larger number of nodes, the application's MPI communication cost becomes significant when compared with the computation and data marshaling costs. In such a scenario, the CPU-to-CPU MPI communication of the MPI+GPU and MPI+GPU Adv implementations has less overhead than the pipelined GPU-to-GPU MPI communication of the MPI-ACC-enabled design. If newer technologies such as GPUDirect RDMA are integrated into MPI, we can expect the GPU-to-GPU communication overhead to be reduced, but the overall benefit of GPU data marshaling itself will still be limited because of the reduced per-process working set.
7 ANALYSIS OF CONTENTION

In this section, we provide insights into the scalable design of MPI-ACC. Specifically, we show that MPI-ACC is designed to work concurrently with other existing GPU workloads with minimum contention; that is, one should be able to perform MPI-ACC GPU-GPU communication and other user-specified GPU tasks (kernels or data transfers) simultaneously with minimal performance degradation for both. We analyze the contention effects of MPI-ACC, MVAPICH, and manual MPI+GPU implementations on concurrent GPU and PCIe workloads.

Sources of contention: NVIDIA Fermi GPUs have one hardware queue each for enqueueing GPU kernels, D2H data transfers, and H2D data transfers. Operations in different hardware queues can potentially overlap; for example, a GPU kernel can overlap with H2D and D2H transfers simultaneously. However, operations enqueued to the same hardware queue are processed serially. If a GPU task oversubscribes a hardware queue by aggressively enqueueing multiple operations of the same type, it can severely slow other GPU tasks contending for the same queue.

GPU streams are software workflow abstractions for a sequence of operations that execute in issue order on the GPU. Stream operations are directed to the appropriate hardware queue depending on the operation type. Operations from different streams can execute concurrently and may be interleaved, whereas operations within the same stream are processed serially, which leads to software contention.

In summary, contention among GPU operations can be of two types: hardware contention, where one or more hardware queues of the GPU are oversubscribed by the same type of operation, and software contention, where different types of operations are issued to the same GPU stream. In MPI-ACC, we carefully minimize both types of contention by staggering the enqueueing of H2D and D2H operations across different GPU streams, thereby enabling maximum concurrency.

Microbenchmark design: We extended the SHOC benchmark suite's contention-mt application for the microbenchmark study (a simplified skeleton of the setup is sketched below). The benchmark creates two MPI processes, each on a separate node and controlling the two local GPUs. Each MPI process is also dual-threaded and concurrently runs one task per thread: task-0, on thread-0, performs point-to-point GPU-to-GPU MPI communication with the other process, and task-1, on thread-1, executes local non-MPI GPU tasks, such as compute kernels or H2D and D2H data-transfer loops. CUDA allows the same GPU context to be shared among all the threads (tasks) in a process, and we share the local GPU between both tasks. To measure the contention impact, we first execute tasks 0 and 1 independently, without contention, and then execute them concurrently to induce contention. Each task runs for 100 loop iterations in both the independent and the concurrent runs. We measure and report the performance difference between the tasks' independent and concurrent runs as the incurred contention.
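The following is a much-simplified skeleton of this two-task setup, not the actual SHOC contention-mt code: one POSIX thread drives GPU-to-GPU MPI transfers while the other runs a local D2H copy loop on its own stream, and the tasks are timed in isolation and then concurrently. All buffer sizes, tags, and iteration counts are assumptions, and the device-buffer MPI calls assume a GPU-integrated MPI library.

    /* Sketch of the dual-threaded contention microbenchmark (simplified). */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <pthread.h>
    #include <stdio.h>

    #define ITERS  100
    #define NBYTES (16u << 20)        /* 16 MB, illustrative */

    static float *d_mpi, *d_loop;     /* device buffers for task-0 and task-1 */
    static int rank;

    /* Task-0: point-to-point GPU-to-GPU MPI communication with the peer rank.
     * The device pointer is passed directly (assumes a GPU-aware MPI). */
    static void *task0_mpi(void *arg)
    {
        (void)arg;
        int peer = 1 - rank;
        for (int i = 0; i < ITERS; i++) {
            if (rank == 0)
                MPI_Send(d_mpi, NBYTES, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
            else
                MPI_Recv(d_mpi, NBYTES, MPI_BYTE, peer, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
        }
        return NULL;
    }

    /* Task-1: local, non-MPI D2H copy loop on its own (custom) stream. */
    static void *task1_d2h(void *arg)
    {
        (void)arg;
        float *h_buf;
        cudaStream_t s;
        cudaMallocHost((void **)&h_buf, NBYTES);
        cudaStreamCreate(&s);
        for (int i = 0; i < ITERS; i++)
            cudaMemcpyAsync(h_buf, d_loop, NBYTES, cudaMemcpyDeviceToHost, s);
        cudaStreamSynchronize(s);
        cudaStreamDestroy(s);
        cudaFreeHost(h_buf);
        return NULL;
    }

    int main(int argc, char **argv)
    {
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        cudaMalloc((void **)&d_mpi,  NBYTES);
        cudaMalloc((void **)&d_loop, NBYTES);

        /* Independent runs: time each task alone (contention-free baseline). */
        double t0 = MPI_Wtime(); task0_mpi(NULL); double t_mpi_alone = MPI_Wtime() - t0;
        t0 = MPI_Wtime();        task1_d2h(NULL); double t_d2h_alone = MPI_Wtime() - t0;

        /* Concurrent run: both tasks share the local GPU and contend. */
        pthread_t th0, th1;
        t0 = MPI_Wtime();
        pthread_create(&th0, NULL, task0_mpi, NULL);
        pthread_create(&th1, NULL, task1_d2h, NULL);
        pthread_join(th0, NULL);
        pthread_join(th1, NULL);
        double t_both = MPI_Wtime() - t0;

        if (rank == 0)
            printf("alone: mpi=%.3fs d2h=%.3fs  concurrent: %.3fs\n",
                   t_mpi_alone, t_d2h_alone, t_both);

        cudaFree(d_mpi); cudaFree(d_loop);
        MPI_Finalize();
        return 0;
    }

The reported contention is simply the difference between the concurrent and the isolated timings of each task, as described above.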
7.1 Discussion of Software Contention

CUDA's stream-0 (the default stream) is unique in that it is completely ordered with respect to all operations issued on any stream of the device; issuing operations on stream-0 is functionally equivalent to synchronizing the entire device before and after each operation. Although MPI-ACC privately creates and uses custom streams to minimize software contention with other streams, a concurrent user operation issued to stream-0 can inadvertently stall any MPI-ACC operation on that GPU until stream-0 has completed. OpenCL, on the other hand, has no such special queue and does not incur this software contention. The sketch below illustrates the default-stream behavior.
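To make the hazard concrete, the following sketch, assuming the legacy default-stream semantics of the Fermi-era CUDA versions discussed here, issues the same D2H copy once on stream-0 and once on a separately created stream; the kernel is a trivial placeholder, and h_data is assumed to be pinned for truly asynchronous copies.

    /* Sketch: legacy default-stream (stream-0) semantics vs. a custom stream. */
    #include <cuda_runtime.h>

    __global__ void busy_kernel(float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            for (int k = 0; k < 1000; k++) x[i] = x[i] * 0.5f + 1.0f;
    }

    void demo(float *d_data, float *h_data, int n)
    {
        cudaStream_t user_stream, copy_stream;
        cudaStreamCreate(&user_stream);
        cudaStreamCreate(&copy_stream);

        /* A long-running kernel on a custom stream (e.g., the application's
         * preprocessing kernel, or a library's internal pipeline stream). */
        busy_kernel<<<(n + 255) / 256, 256, 0, user_stream>>>(d_data, n);

        /* Case 1: copy issued on stream-0.  With legacy default-stream
         * semantics, stream-0 is ordered with ALL other blocking streams,
         * so this copy waits for busy_kernel, and any later work on the
         * custom streams waits for this copy in turn. */
        cudaMemcpyAsync(h_data, d_data, n * sizeof(float),
                        cudaMemcpyDeviceToHost, 0);

        /* Case 2: copy issued on another custom stream.  The two custom
         * streams have no implicit ordering between them, so this copy is
         * not forced to wait for the kernel (hardware queues permitting). */
        busy_kernel<<<(n + 255) / 256, 256, 0, user_stream>>>(d_data, n);
        cudaMemcpyAsync(h_data, d_data, n * sizeof(float),
                        cudaMemcpyDeviceToHost, copy_stream);

        cudaStreamSynchronize(copy_stream);
        cudaStreamSynchronize(user_stream);
        cudaStreamDestroy(copy_stream);
        cudaStreamDestroy(user_stream);
    }

Case 1 is the situation described in the text: a user operation on stream-0 effectively serializes the device and can stall unrelated, privately created streams.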


Contention due to stream-0 can be seen even in GPU-EpiSimdemics, and we analyze its effect as follows. In GPU-EpiSimdemics, the internode CPU-GPU communication of the visit messages is overlapped with a preprocessing kernel that performs data layout transformation (Section 5.1). While we use non-0 streams within MPI-ACC for the internode communication of the visit messages, the preprocessing kernel may be launched on the user's chosen CUDA stream. Figure 12 shows that the performance of GPU-EpiSimdemics is about 6.6% slower when the preprocessing kernels use stream-0 instead of a non-0 stream, and the slowdown can be up to 16.3% for some node configurations. While MPI-ACC's streams are designed to scale, a poor application design that uses stream-0 can cause an apparent slowdown in MPI-ACC's data transfer performance.

7.2 Minimizing the Hardware Contention

MPI-ACC uses the D2H and H2D hardware queues of the GPU for send and receive, respectively. In theory, MPI-ACC communication can overlap with kernel invocations or with data transfer operations in the opposite direction, that is, operations that use the other data transfer queue. However, MPI-ACC can cause contention with another data transfer operation in the same direction; for example, MPI_Send can contend with a concurrent D2H data transfer. MPI-ACC operations can also potentially contend with the on-device memory controller; for example, MPI_Send or MPI_Recv can slow a global-memory-intensive kernel that is accessing the same memory module. In this section, we quantify and evaluate the global memory and PCIe contention effects.

Global memory contention analysis: We study the impact of global memory contention by executing MPI-ACC operations in task-0 and global memory read/write benchmarks in task-1 with custom CUDA streams (a representative streaming kernel is sketched below). Our experiments indicate that the performance drop due to contention in the MPI-ACC communication is about 4%, whereas the global memory kernels slow by about 8%. An average MPI-ACC call runs longer than an average global memory access, so MPI-ACC suffers a smaller relative performance reduction. On the other hand, the performance impact of MPI-ACC on on-chip (local) memory accesses and on simple computational kernels using custom CUDA streams is smaller: the performance degradation in the MPI-ACC communication is about 3%, and the GPU workloads do not have any noticeable slowdown. Because of space constraints, we omit explicit performance graphs.
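As a stand-in for the global memory read/write benchmark used as task-1 (the exact benchmark kernel is not reproduced here), a simple streaming kernel of the following form, launched on a user-created stream, is enough to keep the on-device memory controller busy while MPI communication is in flight; sizes and launch parameters are illustrative.

    /* Sketch: a global-memory streaming kernel used as the contending task,
     * launched on a custom (non-0) stream. */
    #include <cuda_runtime.h>

    __global__ void gmem_rw(float *dst, const float *src, size_t n)
    {
        size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
        /* Grid-stride loop: every element is read from and written to
         * device global memory, stressing the memory controller. */
        for (; i < n; i += (size_t)gridDim.x * blockDim.x)
            dst[i] = src[i] + 1.0f;
    }

    void launch_gmem_task(float *d_dst, const float *d_src, size_t n,
                          cudaStream_t user_stream, int iters)
    {
        for (int it = 0; it < iters; it++)
            gmem_rw<<<256, 256, 0, user_stream>>>(d_dst, d_src, n);
        cudaStreamSynchronize(user_stream);
    }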
PCIe contention analysis with data transfers in the opposite direction: We study the impact of PCIe contention by having task-0 perform MPI_Send or MPI_Recv communication operations with GPU-0, while task-1 executes H2D or D2H calls. This approach gives four task combinations, of which two perform bidirectional data transfers and two transfer data in the same direction. In this paper, we report the results of running MPI_Send (task-0) concurrently with H2D and D2H transfers on the same GPU (task-1); the contention analysis of MPI_Recv with H2D and D2H transfers is identical, and we exclude it because of space constraints. If task-0 and task-1 perform bidirectional data transfers and use custom CUDA streams, we find that the average slowdown of task-0 is 6% and the H2D task (task-1) has a negligible slowdown. Ideally, if the bidirectional bandwidth were twice the unidirectional bandwidth, neither concurrent task would slow down. In our experimental platform, however, the bidirectional bandwidth is only about 19.3% higher than the unidirectional bandwidth, according to the simpleMultiCopy CUDA SDK benchmark. Thus, task-0's slowdown is due to the lower bidirectional bandwidth and not to any MPI-ACC-related contention effects.

PCIe contention analysis with data transfers in the same direction: For this study, we analyze contention effects when MPI_Send (task-0) is invoked concurrently with standalone D2H transfers on the same GPU (task-1). We analyze the contention impacts of three MPI_Send implementations: MPI-ACC, MVAPICH, and manual pipelining using asynchronous MPI and CUDA (approximated by the sketch at the end of this discussion). Since the Fermi GPUs have a single data-transfer hardware queue in each direction, a task that oversubscribes the GPU can significantly slow any other task that uses the same queue. In fact, we show that MPI-ACC induces less contention than MVAPICH and the manual asynchronous MPI+GPU approach to GPU data communication. We show that MPI-ACC enqueues commands to the GPU hardware queue in a balanced manner, thereby minimizing the apparent performance slowdown of the D2H task (task-1) while incurring only a modest slowdown in the MPI communication (task-0) itself.

Figure 13a shows the relative increase in the MPI latency due to contention from the D2H task. For this experiment, the D2H task consistently transfers 16 MB between the device and the host, whereas the data size for the MPI task is varied. MPI-ACC shows a maximum slowdown of about 5% for relatively small MPI data transfers, and the slowdown for larger MPI data transfers is negligible, on average. The other implementations show little to no slowdown for all data sizes.

Figure 13b shows the normalized bandwidth of the D2H task when run concurrently with MPI. For this experiment, the MPI task consistently performs 16 MB GPU-GPU data transfers across the network, whereas the data size for the local D2H task is varied. We see that MPI-ACC causes a maximum performance slowdown of 3.8% for the D2H task for relatively small data, and the slowdown for larger D2H data sizes is negligible. However, MVAPICH and the manual asynchronous MPI+GPU implementation cause a slowdown of about 18% for smaller D2H data transfers. This result indicates that MPI-ACC enqueues GPU tasks to the hardware queues in a more balanced manner, whereas MVAPICH may oversubscribe the hardware queues, thereby causing significant performance variations in the other GPU tasks.
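For reference, the manual pipelining baseline compared above can be approximated by the sketch below, which chunks the device buffer, enqueues all the D2H stages upfront on a custom stream, and sends each chunk as it arrives in pinned host memory. The chunk size, tag, and helper names are assumptions, error handling is omitted, and this is not the exact code used in the experiments.

    /* Sketch: manual MPI+CUDA pipelined send of a device buffer (the
     * "enqueue everything upfront" style compared in Figs. 13 and 14). */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    #define CHUNK (1 << 20)   /* 1 MB pipeline chunk, illustrative */

    void pipelined_send(const char *d_buf, size_t nbytes, int dest, MPI_Comm comm)
    {
        size_t nchunks = (nbytes + CHUNK - 1) / CHUNK;
        char *h_stage;
        cudaStream_t s;
        cudaEvent_t *done = (cudaEvent_t *)malloc(nchunks * sizeof(cudaEvent_t));
        MPI_Request *req  = (MPI_Request *)malloc(nchunks * sizeof(MPI_Request));

        cudaMallocHost((void **)&h_stage, nbytes);   /* pinned staging buffer */
        cudaStreamCreate(&s);

        /* Aggressively enqueue every D2H stage at once; this can
         * oversubscribe the single Fermi D2H hardware queue whenever
         * another D2H task is active on the same GPU. */
        for (size_t c = 0; c < nchunks; c++) {
            size_t off = c * CHUNK;
            size_t len = (off + CHUNK <= nbytes) ? CHUNK : nbytes - off;
            cudaMemcpyAsync(h_stage + off, d_buf + off, len,
                            cudaMemcpyDeviceToHost, s);
            cudaEventCreateWithFlags(&done[c], cudaEventDisableTiming);
            cudaEventRecord(done[c], s);
        }

        /* Send each chunk over the network as soon as its copy completes. */
        for (size_t c = 0; c < nchunks; c++) {
            size_t off = c * CHUNK;
            size_t len = (off + CHUNK <= nbytes) ? CHUNK : nbytes - off;
            cudaEventSynchronize(done[c]);
            MPI_Isend(h_stage + off, (int)len, MPI_BYTE, dest, 0, comm, &req[c]);
        }
        MPI_Waitall((int)nchunks, req, MPI_STATUSES_IGNORE);

        for (size_t c = 0; c < nchunks; c++) cudaEventDestroy(done[c]);
        cudaStreamDestroy(s);
        cudaFreeHost(h_stage);
        free(done); free(req);
    }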


Fig. 13: Impact of contention due to concurrent MPI_Send and local D2H GPU operations. MPI-ACC is evaluated against MVAPICH and the manual MPI+GPU implementations. (a) Impact on the MPI_Send communication latency; task-1 (D2H) transfers 16 MB in all cases (relative increase in MPI latency vs. MPI data size in KB, 256-16384). (b) Impact on the local D2H GPU operations; task-0 (MPI) transfers 16 MB in all cases (normalized GPU D2H bandwidth vs. D2H data size in KB, 256-16384). MPI-ACC causes the least contention for the D2H task.

Fig. 14: Using HPCTOOLKIT to understand the contention impacts of MPI-ACC and local GPU data transfer operations. (a) Impact of MPI-ACC's MPI_Send with concurrent D2H operations: task-0 issues MPI_Send from the GPU on MPI-ACC's internal streams, task-1 runs a D2H loop on a custom stream, and the GPU requests are balanced. (b) Impact of the manual MPI+GPU send task with concurrent D2H operations: task-0 issues MPI+CUDA sends from the GPU on custom streams, and the GPU requests are aggressive.

HPCToolkit analysis: HPCTOOLKIT [29], [30] is a sampling-based performance analysis toolkit capable of quantifying scalability bottlenecks in parallel programs. We use HPCTOOLKIT's hpctraceviewer interface to understand why MPI-ACC causes less contention than the manual MPI+GPU implementations do. Hpctraceviewer renders hierarchical, timeline-based visualizations of parallel hybrid CPU-GPU programs. Figure 14 presents screenshots of the detailed execution profile of our contention benchmark. The hpctraceviewer tool presents the timeline information of all CPU processes, threads, and their corresponding CUDA streams; however, we zoom in and show only the timelines of the relevant CUDA streams associated with both tasks of process 0. The other process exhibits identical behavior and is excluded from the figure.

Figure 14a shows the effect of MPI-ACC's send operation interacting with the D2H data transfers of task-1. Since both tasks issue D2H commands and there is only one D2H queue on Fermi, only one of the CUDA streams is active at any given point in time. Moreover, MPI-ACC's pipelining logic is designed to issue GPU commands only when the next pipeline stage is ready. This design does not oversubscribe the GPU and leads to balanced execution, which can be seen as interleaved bars in the MPI-related timeline. Figure 14b depicts the contention effect of the manual pipelined MPI+GPU implementation. In this example implementation, we enqueue all the pipeline stages upfront, which is an acceptable design for standalone point-to-point communication; however, it oversubscribes the GPU, which can be seen as clusters of bars in the MPI-related timeline. Of course, if one designs the manual MPI+GPU implementation similarly to our MPI-ACC design (an issue-when-ready variant is sketched below), the associated timeline will look like Figure 14a. Since the manual MPI+GPU implementation is more aggressive in enqueueing GPU operations, the D2H operations of task-1 tend to wait longer. That is why, on average, MPI-ACC causes the least performance perturbation to the D2H task (Figure 13b).
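In the spirit of the issue-when-ready policy described above (not MPI-ACC's actual implementation), the following sketch double-buffers the pinned staging area and issues the D2H copy for a chunk only when its slot is free, so at most a couple of GPU operations are outstanding at any time and the D2H hardware queue is never flooded. Chunk size, tag, and buffer names are assumptions.

    /* Sketch: issue-when-ready pipelining (double buffering) so that only a
     * bounded number of GPU operations are outstanding at any time. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    #define CHUNK (1 << 20)   /* 1 MB pipeline chunk, illustrative */

    void gated_pipelined_send(const char *d_buf, size_t nbytes,
                              int dest, MPI_Comm comm)
    {
        char *h_stage[2];
        cudaStream_t s;
        cudaStreamCreate(&s);
        cudaMallocHost((void **)&h_stage[0], CHUNK);
        cudaMallocHost((void **)&h_stage[1], CHUNK);

        size_t nchunks = (nbytes + CHUNK - 1) / CHUNK;
        MPI_Request req = MPI_REQUEST_NULL;

        for (size_t c = 0; c < nchunks; c++) {
            size_t off = c * CHUNK;
            size_t len = (off + CHUNK <= nbytes) ? CHUNK : nbytes - off;
            char  *h   = h_stage[c & 1];

            /* Issue the next D2H stage only now, when its staging slot is
             * free, instead of flooding the D2H hardware queue upfront.
             * (The slot's previous send finished two iterations ago.) */
            cudaMemcpyAsync(h, d_buf + off, len, cudaMemcpyDeviceToHost, s);
            cudaStreamSynchronize(s);          /* stage ready in host memory */

            /* Finish the send of the previous chunk before posting this one;
             * meanwhile, the copy above overlapped with that send. */
            if (req != MPI_REQUEST_NULL)
                MPI_Wait(&req, MPI_STATUS_IGNORE);
            MPI_Isend(h, (int)len, MPI_BYTE, dest, 0, comm, &req);
        }
        if (req != MPI_REQUEST_NULL)
            MPI_Wait(&req, MPI_STATUS_IGNORE);

        cudaFreeHost(h_stage[0]);
        cudaFreeHost(h_stage[1]);
        cudaStreamDestroy(s);
    }

Compared with the upfront-enqueue version sketched earlier, this variant trades a small amount of pipelining aggressiveness for far less pressure on the shared D2H hardware queue, which is the behavior the Figure 14a timeline reflects.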
7.3 Summary

In this section, we provided insights into the scalable design of MPI-ACC and compared its performance with MVAPICH and manual MPI+GPU implementations. Specifically, we showed that MPI-ACC delivers maximum concurrency by carefully ordering multiple GPU streams and efficiently balancing the H2D and D2H hardware queues for data pipelining, without oversubscribing the GPU resource.

8 CONCLUSION

In this paper, we introduced MPI-ACC, an integrated and extensible framework that allows end-to-end data movement in accelerator-connected systems. We discussed MPI-ACC's API design choices and a comprehensive set of optimizations, including data pipelining and buffer management. We studied the efficacy of MPI-ACC for scientific applications from the domains of epidemiology (GPU-EpiSimdemics) and seismology (FDM-Seismology), and we presented the lessons learned and tradeoffs. We found that MPI-ACC's internal pipeline optimization not only helps improve the end-to-end communication performance but also enables novel optimization opportunities at the application level, which significantly enhance the CPU-GPU and network utilization. With MPI-ACC, one can naturally express the communication target without explicitly treating the CPUs as communication relays. MPI-ACC decouples the application logic from the low-level GPU communication optimizations, thereby significantly improving scalability and application portability across multiple GPU platforms and generations.

ACKNOWLEDGMENT

This work was supported in part by the DOE grant DE-SC0001770; by contracts DE-AC02-06CH11357, DE-AC05-00OR22725, and DE-ACO6-76RL01830; by the DOE Office of Science under cooperative agreement number DE-FC02-07ER25800; by the DOE GTO via grant EE0002758 from Fugro Consultants; by the DTRA CNIMS contract HDTRA1-11-D-0016-0001; by the NSF PetaApps grant OCI-0904844; by the NSF Blue Waters grant OCI-0832603; by the NSF grants CNS-0960081, CNS-054630, and CNS-0958311; and by an NVIDIA Graduate Fellowship, NVIDIA Professor Partnership, and CUDA Research Center at Virginia Tech. This research used the HokieSpeed heterogeneous computing resource at Virginia Tech, which is supported by the National Science Foundation under contract CNS-0960081. We thank Darius Buntinas and Hao Wang for their valuable technical discussions and feedback.


REFERENCES

[1] L. Weiguo, B. Schmidt, G. Voss, and W. Muller-Wittig, "Streaming Algorithms for Biological Sequence Alignment on GPUs," IEEE Transactions on Parallel and Distributed Systems, vol. 18, no. 9, pp. 1270-1281, Sept. 2007.
[2] A. Nere, A. Hashmi, and M. Lipasti, "Profiling Heterogeneous Multi-GPU Systems to Accelerate Cortically Inspired Learning Algorithms," in IEEE Intl. Parallel and Distributed Processing Symposium (IPDPS), 2011.
[3] W.-C. Feng, Y. Cao, D. Patnaik, and N. Ramakrishnan, "Temporal Data Mining for Neuroscience," GPU Computing Gems, Emerald Edition, February 2011.
[4] "TOP500 Supercomputer Sites," http://www.top500.org.
[5] MPI: A Message-Passing Interface Standard Version 2.2. Message Passing Interface Forum, 2009.
[6] "NVIDIA CUDA C Programming Guide," 2014, http://docs.nvidia.com//pdf/CUDA C Programming Guide.pdf.
[7] L. Howes and A. Munshi, The OpenCL Specification. Khronos OpenCL Working Group, 2014, https://www.khronos.org/registry/cl/specs/-2.0.pdf.
[8] "NVIDIA GPUDirect," https://developer.nvidia.com/gpudirect.
[9] C. L. Barrett, K. R. Bisset, S. G. Eubank, X. Feng, and M. V. Marathe, "EpiSimdemics: An Efficient Algorithm for Simulating the Spread of Infectious Disease over Large Realistic Social Networks," in Intl. Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2008.
[10] K. R. Bisset, A. M. Aji, M. V. Marathe, and W.-c. Feng, "High-Performance Biocomputing for Simulating the Spread of Contagion over Large Contact Networks," BMC Genomics, vol. 13, no. S2, April 2012.
[11] S. Ma and P. Liu, "Modeling of the Perfectly Matched Layer Absorbing Boundaries and Intrinsic Attenuation in Explicit Finite Element Methods," Bulletin of the Seismological Society of America, vol. 96, no. 5, pp. 1779-1794, 2006.
[12] W. M. Brown, P. Wang, S. J. Plimpton, and A. N. Tharrington, "Implementing Molecular Dynamics on Hybrid High Performance Computers - Short Range Forces," Computer Physics Communications, vol. 182, no. 4, pp. 898-911, 2011.
[13] T. Endo, A. Nukada, S. Matsuoka, and N. Maruyama, "Linpack Evaluation on a Supercomputer with Heterogeneous Accelerators," in IEEE Intl. Parallel and Distributed Processing Symposium (IPDPS), 2010.
[14] A. M. Aji, J. Dinan, D. Buntinas, P. Balaji, W.-c. Feng, K. R. Bisset, and R. Thakur, "MPI-ACC: An Integrated and Extensible Approach to Data Movement in Accelerator-Based Systems," in 14th IEEE Intl. Conference on High Performance Computing and Communications (HPCC), 2012.
[15] H. Wang, S. Potluri, D. Bureddy, C. Rosales, and D. K. Panda, "GPU-aware MPI on RDMA-enabled Clusters: Design, Implementation and Evaluation," IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 25, no. 10, pp. 2595-2605, 2014.
[16] "MVAPICH: MPI over InfiniBand, 10GigE/iWARP and RoCE," http://mvapich.cse.ohio-state.edu/.
[17] "MPICH: High-Performance Portable MPI," http://www.mpich.org.
[18] O. S. Lawlor, "Message Passing for GPGPU Clusters: cudaMPI," in IEEE Intl. Conference on Cluster Computing and Workshops (CLUSTER), 2009.
[19] J. Chen, W. Watson III, and W. Mao, "GMH: A Message Passing Toolkit for GPU Clusters," in IEEE 16th Intl. Conference on Parallel and Distributed Systems (ICPADS), 2010.
[20] S. Yamagiwa and L. Sousa, "CaravelaMPI: Message Passing Interface for Parallel GPU-Based Applications," in 8th Intl. Symposium on Parallel and Distributed Computing (ISPDC), 2009.
[21] J. Stuart and J. Owens, "Message Passing on Data-Parallel Architectures," in IEEE Intl. Parallel and Distributed Processing Symposium (IPDPS), 2009.
[22] S. Potluri, D. Bureddy, H. Wang, H. Subramoni, and D. Panda, "Extending OpenSHMEM for GPU Computing," in IEEE Intl. Parallel and Distributed Processing Symposium (IPDPS), 2013.
[23] A. Sidelnik, B. L. Chamberlain, M. J. Garzaran, and D. Padua, "Using the High Productivity Language Chapel to Target GPGPU Architectures," UIUC Dept. of Comp. Sci. Technical Report, 2011.
[24] J. Hammond, S. Ghosh, and B. Chapman, "Implementing OpenSHMEM Using MPI-3 One-Sided Communication," in OpenSHMEM and Related Technologies. Experiences, Implementations, and Tools, 2014.
[25] J. Dinan, P. Balaji, J. R. Hammond, S. Krishnamoorthy, and V. Tipparaju, "Supporting the Global Arrays PGAS Model Using MPI One-sided Communication," in IEEE Intl. Parallel and Distributed Processing Symposium (IPDPS), 2012.
[26] A. M. Aji, P. Balaji, J. Dinan, W.-c. Feng, and R. Thakur, "Synchronization and Ordering Semantics in Hybrid MPI GPU Programming," in 3rd Intl. Workshop on Accelerators and Hybrid Exascale Systems (IPDPSW: AsHES). IEEE, 2013.
[27] A. M. Aji, L. S. Panwar, F. Ji, M. Chabbi, K. Murthy, P. Balaji, K. R. Bisset, J. Dinan, W.-c. Feng, J. Mellor-Crummey, X. Ma, and R. Thakur, "On the Efficacy of GPU-Integrated MPI for Scientific Applications," in ACM Intl. Symposium on High-Performance Parallel and Distributed Computing (HPDC), 2013.
[28] J. Berenger, "A Perfectly Matched Layer for the Absorption of Electromagnetic Waves," Journal of Computational Physics, vol. 114, no. 2, pp. 185-200, 1994.
[29] L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N. R. Tallent, "HPCToolkit: Tools for Performance Analysis of Optimized Parallel Programs," Concurrency Computation: Practice and Experience, vol. 22, pp. 685-701, April 2010.
[30] M. Chabbi, K. Murthy, M. Fagan, and J. Mellor-Crummey, "Effective Sampling-driven Performance Tools for GPU-accelerated Supercomputers," in Intl. Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2013.

Ashwin M. Aji is a Ph.D. candidate in the Dept. of Comp. Science, Virginia Tech, where he also received an M.S. degree.

Lokendra S. Panwar is an M.S. student in the Dept. of Comp. Science, Virginia Tech.

Feng Ji is a Ph.D. candidate in the Dept. of Comp. Science, North Carolina State University.

Karthik Murthy is a Ph.D. candidate in the Dept. of Comp. Science, Rice University. He received an M.A. in computer science from the University of Texas, Austin.

Milind Chabbi is a Ph.D. candidate in the Dept. of Comp. Science, Rice University. He received an M.S. in computer science from the University of Arizona, Tucson.

Pavan Balaji is a computer scientist at Argonne National Laboratory, a Fellow of the Northwestern-Argonne Inst. of Science and Engineering at Northwestern University, and a Research Fellow of the Computation Inst. at the Univ. of Chicago.

Keith R. Bisset is a simulation and system software development scientist at the Virginia Bioinformatics Institute, Virginia Tech.

James Dinan received his Ph.D. from the Ohio State University and completed a postdoctoral fellowship at Argonne National Laboratory.

Wu-chun Feng is a professor and Elizabeth & James E. Turner Fellow in the Dept. of Comp. Science, Dept. of Electrical and Computer Engineering, and Virginia Bioinformatics Institute at Virginia Tech.

John Mellor-Crummey is a professor in the Dept. of Comp. Science and the Dept. of Electrical and Computer Engineering, Rice University.

Xiaosong Ma is a senior scientist at Qatar Computing Research Institute and an associate professor in the Dept. of Comp. Science, North Carolina State University.

Rajeev Thakur is the Deputy Director of the Math. and Comp. Sci. Division at Argonne National Laboratory, a Senior Fellow in the Computation Institute at the Univ. of Chicago, and an adjunct professor in the Dept. of Elect. Eng. and Comp. Sci. at Northwestern University.