IEEE Transactions on Parallel and Distributed Systems. DOI: 10.1109/TPDS.2015.2446479


MPI-ACC: Accelerator-Aware MPI for Scientific Applications

Ashwin M. Aji, Lokendra S. Panwar, Feng Ji, Karthik Murthy, Milind Chabbi, Pavan Balaji, Keith R. Bisset, James Dinan, Wu-chun Feng, John Mellor-Crummey, Xiaosong Ma, and Rajeev Thakur.

Abstract—Data movement in high-performance computing systems accelerated by graphics processing units (GPUs) remains a challenging problem. Data communication in popular parallel programming models, such as the Message Passing Interface (MPI), is currently limited to the data stored in the CPU memory space. Auxiliary memory systems, such as GPU memory, are not integrated into such data movement standards, thus providing applications with no direct mechanism to perform end-to-end data movement. We introduce MPI-ACC, an integrated and extensible framework that allows end-to-end data movement in accelerator-based systems. MPI-ACC provides productivity and performance benefits by integrating support for auxiliary memory spaces into MPI. MPI-ACC supports data transfer among CUDA, OpenCL, and CPU memory spaces and is extensible to other offload models as well. MPI-ACC's runtime system enables several key optimizations, including pipelining of data transfers, scalable memory management techniques, and balancing of communication based on accelerator and node architecture. MPI-ACC is designed to work concurrently with other GPU workloads with minimum contention. We describe how MPI-ACC can be used to design new communication-computation patterns in scientific applications from domains such as epidemiology simulation and seismology modeling, and we discuss the lessons learned. We present experimental results on a state-of-the-art cluster with hundreds of GPUs, and we compare the performance and productivity of MPI-ACC with MVAPICH, a popular CUDA-aware MPI solution. MPI-ACC encourages programmers to explore novel application-specific optimizations for improved overall cluster utilization.

Index Terms—Heterogeneous (hybrid) systems, Parallel systems, Distributed architectures, Concurrent programming

• A. M. Aji, L. S. Panwar, and W. Feng are with the Dept. of Comp. Sci., and K. Bisset is with Virginia Bioinformatics Inst., Virginia Tech. E-mail: {aaji, lokendra, feng}@cs.vt.edu, [email protected]
• P. Balaji, J. Dinan, and R. Thakur are with the Math. and Comp. Sci. Div., Argonne National Lab. E-mail: {balaji, dinan, thakur}@mcs.anl.gov
• K. Murthy, M. Chabbi, and J. Mellor-Crummey are with the Dept. of Comp. Sci., Rice Univ. E-mail: {ksm2, mc29, johnmc}@rice.edu
• F. Ji and X. Ma are with the Dept. of Comp. Sci., N. C. State Univ. E-mail: [email protected], [email protected]

1 INTRODUCTION

Graphics processing units (GPUs) have gained widespread use as general-purpose computational accelerators and have been studied extensively across a broad range of scientific applications [1], [2], [3]. The presence of general-purpose accelerators in high-performance computing (HPC) clusters has also steadily increased, and 15% of today's top 500 fastest supercomputers (as of November 2014) employ general-purpose accelerators [4].

Nevertheless, despite the growing prominence of accelerators in HPC, data movement on systems with GPU accelerators remains a significant problem. Hybrid programming with the Message Passing Interface (MPI) [5] and the Compute Unified Device Architecture (CUDA) [6] or Open Computing Language (OpenCL) [7] is the dominant means of utilizing GPU clusters; however, data movement between processes is currently limited to data residing in the host memory. The ability to interact with auxiliary memory systems, such as GPU memory, has not been integrated into such data movement standards, thus leaving applications with no direct mechanism to perform end-to-end data movement. Currently, transmission of data from accelerator memory must be done by explicitly copying data to host memory before performing any communication operations. This impacts productivity and can lead to a severe loss in performance. Significant programmer effort would be required to recover this performance through vendor- and system-specific optimizations, including GPUDirect [8] and node and I/O topology awareness.

We introduce MPI-ACC, an integrated and extensible framework that provides end-to-end data movement in accelerator-based clusters. MPI-ACC significantly improves productivity by providing a unified programming interface, compatible with both CUDA and OpenCL, that can allow end-to-end data movement irrespective of whether data resides in host or accelerator memory. In addition, MPI-ACC allows applications to easily and portably leverage vendor- and platform-specific capabilities in order to optimize data movement performance. Our specific contributions in this paper are as follows.

• An extensible interface for integrating auxiliary memory systems (e.g., GPU memory) with MPI
• An efficient runtime system, which is heavily optimized for a variety of vendors and platforms (CUDA and OpenCL) and carefully designed to minimize contention with existing workloads
• An in-depth study of high-performance simulation codes from two scientific application domains (computational epidemiology [9], [10] and seismology modeling [11])


We evaluate our findings on HokieSpeed, a state-of-the-art hybrid CPU-GPU cluster housed at Virginia Tech. Microbenchmark results indicate that MPI-ACC can provide up to 48% improvement in two-sided GPU-to-GPU communication latency. We show that MPI-ACC's design does not oversubscribe the GPU, thereby minimizing contention with other concurrent GPU workloads. We demonstrate how MPI-ACC can be used in epidemiology and seismology modeling applications.

Fig. 1(a): Basic hybrid MPI+GPU with synchronous execution – high productivity and low performance.

    computation_on_GPU(gpu_buf);
    cudaMemcpy(host_buf, gpu_buf, size, D2H ...);
    MPI_Send(host_buf, size, ...);

Fig. 1(b): Manual MPI+GPU pipelining across data chunks (listing truncated in the source):

    int processed[chunks] = {0};
    for (j = 0; j < chunks; j++) { ...
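For contrast with the listings above, the sketch below shows the GPU-integrated style that Figure 1c refers to: the device buffer is handed directly to MPI, and the staging copy disappears. This is an illustrative sketch rather than MPI-ACC's literal API; how the library recognizes the device pointer (UVA-based detection or a datatype attribute) is discussed in Section 4.1, and compute_on_gpu() is a placeholder for the application's kernel launch.

    #include <mpi.h>
    #include <cuda_runtime.h>

    void compute_on_gpu(void *gpu_buf, int size);   /* application kernel wrapper (placeholder) */

    /* GPU-integrated style: pass the device buffer straight to MPI. */
    void send_gpu_result(void *gpu_buf, int size, int dest, MPI_Comm comm)
    {
        compute_on_gpu(gpu_buf, size);
        cudaDeviceSynchronize();                          /* make sure the result is ready */
        MPI_Send(gpu_buf, size, MPI_CHAR, dest, 0, comm); /* no host staging copy */
    }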


this approach leads to tight coupling between the high-level application logic and low-level data movement optimizations; hence, the application developer has to maintain several code variants for different GPU architectures and vendors. In addition, construction of such a sophisticated data movement scheme above the MPI runtime system incurs repeated protocol overheads and eliminates opportunities for low-level optimizations. Moreover, users who need high performance are faced with the complexity of leveraging a multitude of platform-specific optimizations that continue to evolve with the underlying technology (e.g., GPUDirect [8]).

2.2 Application Design Using GPU-Integrated MPI Frameworks

To bridge the gap between the disjointed MPI and GPU programming models, researchers have recently developed GPU-integrated MPI solutions such as our MPI-ACC [14] framework and MVAPICH-GPU [15] by Wang et al. These frameworks provide a unified MPI data transmission interface for both host and GPU memories; in other words, the programmer can use either the CPU buffer or the GPU buffer directly as the communication parameter in MPI routines. The goal of such GPU-integrated MPI platforms is to decouple the complex, low-level, GPU-specific data movement optimizations from the application logic, thus providing the following benefits: (1) portability – the application can be more portable across multiple accelerator platforms; and (2) forward compatibility – with the same code, the application can automatically achieve performance improvements from new GPU technologies (e.g., GPUDirect RDMA) if applicable and supported by the MPI implementation. In addition to enhanced programmability, transparent architecture-specific and vendor-specific performance optimizations can be provided within the MPI layer. Using GPU-integrated MPI, programmers need only write GPU kernels and regular host CPU code for computation and invoke the standard MPI functions for CPU-GPU data communication, without worrying about the complex data movement optimizations of the diverse accelerator technologies (Figure 1c).
3 RELATED WORK

MVAPICH [16] is an implementation of MPI based on MPICH [17] and is optimized for RDMA networks such as InfiniBand. From v1.8 onward, MVAPICH has included support for transferring CUDA memory regions across the network (point-to-point, collective, and one-sided communications), but its design relies heavily on CUDA's Unified Virtual Addressing (UVA) feature. On the other hand, MPI-ACC takes a more portable approach: we support data transfers among CUDA, OpenCL, and CPU memory regions, and our design is independent of library version or device family. By including OpenCL support in MPI-ACC, we automatically enable data movement between a variety of devices, including GPUs from NVIDIA and AMD, IBM and Intel CPUs, Intel MICs, AMD Fusion, and IBM's Cell Broadband Engine. Also, we make no assumptions about the availability of key hardware features (e.g., UVA) in our interface design, thus making MPI-ACC a truly generic framework for heterogeneous CPU-GPU systems.

CudaMPI [18] and GMH [19] are new libraries designed to improve programmer productivity when managing data and compute among GPUs across the network. Both approaches are host-centric programming models and provide new programming abstractions on top of existing MPI and CUDA calls. CaravelaMPI [20] is another MPI-style solution for data management across GPUs, but it is limited to the custom Caravela API and is not applicable to general MPI programs. In contrast, MPI-ACC conforms completely to the MPI standard, and our implementation removes the overhead of communication setup time while maintaining productivity.

DCGN [21] is a device-centric programming model that moves away from the GPU-as-a-worker programming model. DCGN assigns ranks to GPU threads in the system and allows them to communicate with one another by using MPI-like library calls, while a CPU-based polling runtime handles GPU control and data transfers. MPI-ACC is orthogonal to DCGN in that we retain the host-centric MPI communication and execution model while hiding the details of third-party CPU-GPU communication libraries from the end user.

Partitioned Global Address Space (PGAS) models, such as Chapel and OpenSHMEM, provide a globally shared memory abstraction to distributed-memory systems. Researchers have explored extending PGAS models to include GPUs as part of the shared-memory abstraction [22], [23]. PGAS models can use MPI itself as the underlying runtime layer [24], [25] and can be considered complementary efforts to MPI.

4 MPI-ACC: DESIGN AND OPTIMIZATIONS

In this section, we describe the design, implementation, and optimizations of MPI-ACC, the first interface that integrates CUDA, OpenCL, and other models within an MPI-compliant interface.

4.1 API Design

We discuss MPI-ACC's API design considerations and compare the tradeoffs of our solution with the API design of MVAPICH. In an MPI communication call, the user passes a void pointer that indicates the location of the data on which the user wishes to operate. To the MPI library, a pointer to host memory is indistinguishable from a pointer to GPU memory. The MPI implementation needs a mechanism to determine whether the given pointer can be dereferenced directly or whether data must be explicitly copied from the device by invoking GPU library functions. Moreover, memory is referenced differently in different GPU programming models.


Fig. 2: Overhead of runtime checks incurred by intranode CPU-CPU communication operations. The slowdown due to automatic detection (via cuPointerGetAttribute) can be up to 205% (average: 127.8%), while the slowdown for the datatype attribute check is at most 6.3% (average: 2.6%).

TABLE 1: Comparison of MPI-ACC with MVAPICH.

                                               MPI-ACC                                       MVAPICH
GPU Buffer Identification Method               MPI Datatype attributes (Explicit)            void* (UVA-based) (Implicit)
Performance/Progress for MPI+GPU Synch. Ops.   Automatic (through the MPI implementation)    Manual (through the programmer)
Software Platforms                             CUDA and OpenCL (any version)                 CUDA v4.0 and newer only
Hardware Platforms                             AMD (CPU, GPU and APU), NVIDIA GPU,           NVIDIA GPU only
                                               Intel (CPU and MIC), FPGA, etc.

For example, CUDA memory buffers are void pointers, but they cannot be dereferenced by the host unless Unified Virtual Addressing (UVA) is enabled. On the other hand, OpenCL memory buffers are represented as opaque cl_mem handles that internally translate to the physical device memory location but cannot be dereferenced by the host unless the buffer is mapped into the host's address space or explicitly copied to the host.

MVAPICH's Automatic Detection Approach: MVAPICH allows MPI to deal with accelerator buffers by leveraging the UVA feature of CUDA to automatically detect device buffers. This method requires no modifications to the MPI interface. Also, we have shown previously that while their approach works well for standalone point-to-point communication, programmers have to explicitly synchronize between interleaved and dependent MPI and CUDA operations, thereby requiring significant programmer effort to achieve ideal performance [26]. Moreover, as shown in Figure 2, the penalty for runtime checking can be significant and is incurred by all operations, including those that require no GPU data movement at all. Furthermore, the automatic detection approach is not extensible to other accelerator models, such as OpenCL, that do not map GPU buffers into the host virtual address space.

MPI-ACC's Datatype Attributes Approach: MPI datatypes are used to specify the type and layout of buffers passed to the MPI library. The MPI standard defines an interface for attaching metadata to MPI datatypes through datatype attributes. In MPI-ACC, we use these MPI datatype attributes to indicate buffer type (e.g., CPU, CUDA, or OpenCL), buffer locality (e.g., which GPU), the stream or event to synchronize upon, or any other information to the MPI library. With the datatype attributes, there is no restriction on the amount of information that the user can pass to the MPI implementation. With our design, one can simply and implicitly denote ordering of MPI and GPU operations by associating GPU events or streams with MPI calls, and the MPI-ACC implementation applies different heuristics to synchronize and make efficient communication progress. We have shown in our prior work [26] that this approach improves productivity and performance while remaining compatible with the MPI standard. Moreover, our approach introduces a lightweight runtime attribute check to each MPI operation, but the overhead is much less than with automatic detection, as shown in Figure 2. Since MPI-ACC supports both CUDA and OpenCL, and since OpenCL is compatible with several platforms and vendors, we consider MPI-ACC to be a more portable solution than MVAPICH. Table 1 summarizes the above differences.
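As an illustration of how such attributes can be attached through the standard MPI attribute-caching interface, consider the following sketch. The keyval, the "CUDA" attribute value, and the helper name are hypothetical placeholders; MPI-ACC's actual attribute names and encodings may differ.

    #include <mpi.h>

    static int buftype_keyval = MPI_KEYVAL_INVALID;

    /* Build a datatype that carries a "this buffer lives in CUDA memory" tag. */
    void make_cuda_char_type(MPI_Datatype *gpu_type)
    {
        if (buftype_keyval == MPI_KEYVAL_INVALID)
            MPI_Type_create_keyval(MPI_TYPE_NULL_COPY_FN, MPI_TYPE_NULL_DELETE_FN,
                                   &buftype_keyval, NULL);
        MPI_Type_dup(MPI_CHAR, gpu_type);
        MPI_Type_set_attr(*gpu_type, buftype_keyval, (void *)"CUDA");
        MPI_Type_commit(gpu_type);
    }

    /* Usage: MPI_Send(dev_buf, n, gpu_type, dest, tag, comm);
     * the implementation reads the attribute instead of querying UVA. */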
4.2 Optimizations

Once MPI-ACC has identified a device buffer, it leverages PCIe and network link parallelism to optimize the data transfer via pipelining. Pipelined data transfer parameters are dynamically selected based on NUMA and PCIe affinity to further improve communication performance.

4.2.1 Data Pipelining

We hide the PCIe latency between the CPU and GPU by dividing the data into smaller chunks and performing pipelined data transfers between the GPU, the CPU, and the network. To orchestrate the pipelined data movement, we create a temporary pool of host-side buffers that are registered with the GPU driver (CUDA or OpenCL) for faster DMA transfers. The buffer pool is created at MPI_Init time and destroyed during MPI_Finalize. The system administrator can choose to enable CUDA and/or OpenCL when configuring the MPICH installation. Depending on the choice of the GPU library, the buffer pool is created by calling either cudaMallocHost for CUDA or clCreateBuffer (with the CL_MEM_ALLOC_HOST_PTR flag) for OpenCL.

To calculate the ideal packet size, we first individually measure the network and PCIe bandwidths at different data sizes (Figure 3) and then choose the packet size at the intersection point of the two channel rates, 64 KB for our experimental cluster (Section 6). If the performance at the intersection point is still latency bound for both data channels (network and PCIe), then we pick the pipeline packet size to be the size of the smallest packet at which the slower data channel reaches peak bandwidth. The end-to-end data transfer will then also work at the net peak bandwidth of the slower data channel. Also, only two packets are needed to do pipelining by double buffering: one channel receives the GPU packet to the host, while the other sends the previous GPU packet over the network. We therefore use two CUDA streams and two OpenCL command queues per device per MPI request to facilitate pipelining.

The basic pipeline loop for a "send" operation is as follows ("receive" works the same way, but with the direction of the operations reversed). Every time we prepare to send a packet over the network, we check for the completion of the previous GPU-to-CPU transfer by calling cudaStreamSynchronize or a loop of cudaStreamQuery for CUDA (or the corresponding OpenCL calls).
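The following sketch shows the shape of this double-buffered send loop at the application level, assuming pinned host buffers h[0] and h[1] drawn from a preallocated pool and the 256 KB packet size chosen below; inside MPI-ACC the equivalent loop runs within the library, with OpenCL handled analogously through command queues.

    #include <mpi.h>
    #include <cuda_runtime.h>

    #define PKT (256 * 1024)    /* pipeline packet size (see Section 4.2.1) */

    /* Double-buffered D2H + network pipeline for a "send"; error checks omitted. */
    void pipelined_send(const char *d_buf, size_t bytes, char *h[2],
                        int dest, int tag, MPI_Comm comm, cudaStream_t s[2])
    {
        size_t off = 0;
        int cur = 0, prev = -1;
        MPI_Request req = MPI_REQUEST_NULL;

        while (off < bytes) {
            size_t len = (bytes - off < PKT) ? bytes - off : PKT;
            /* Stage the current packet from device memory into a pinned buffer. */
            cudaMemcpyAsync(h[cur], d_buf + off, len, cudaMemcpyDeviceToHost, s[cur]);
            if (prev >= 0)
                MPI_Wait(&req, MPI_STATUS_IGNORE);   /* previous packet is on the wire */
            cudaStreamSynchronize(s[cur]);           /* current packet fully staged    */
            MPI_Isend(h[cur], (int)len, MPI_CHAR, dest, tag, comm, &req);
            prev = cur;
            cur ^= 1;                                /* swap the two pinned buffers    */
            off += len;
        }
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }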


Fig. 3: Choosing the pipeline parameters. Network: InfiniBand; transfer protocol: R3. The plot compares the network bandwidth, the GPU-CPU PCIe bandwidth, and the effective network bandwidth (with redundant GPU synchronization overhead) across packet sizes.

Fig. 4: Computational epidemiology simulation model (figure adapted from [9]). (a) Bipartite graph representation with people 1–4 and locations a–c. (b) Temporal and spatial people-to-people contacts. (c) Occupancy of location b overlaid with the health states (susceptible, latent, infectious) of its occupants; the dotted area shows the time of possible disease transmission.

However, we found that the GPU synchronization/query calls on already completed CPU-GPU copies caused a significant overhead in our experimental cluster, which hurt the effective network bandwidth and forced us to choose a different pipeline packet size. For example, we measured the cost of stream query/synchronization operations as approximately 20 µs, even though the data transfer had been completed. Moreover, this overhead occurs every time a packet is sent over the network, as shown in Figure 3 by the "Effective Network Bandwidth" line. We observed that the impact of the synchronization overhead is large for smaller packet sizes but becomes negligible for larger packet sizes (2 MB). Also, we found no overlap between the PCIe bandwidth and the effective network bandwidth rates, and the PCIe channel was always faster for all packet sizes. Thus, we picked the smallest packet size that could achieve the peak effective network bandwidth (in our case, 256 KB) as the pipeline transfer size for MPI-ACC. Smaller packet sizes (<256 KB) caused the effective network bandwidth to be latency bound and were thus not chosen as the pipeline parameters. In MPI-ACC, we use the pipelining approach to transfer large messages—namely, messages that are at least as large as the chosen packet size—and fall back to the nonpipelined approach when transferring smaller messages.

4.2.2 OpenCL Issues and Optimizations

In OpenCL, device data is encapsulated as a cl_mem object that is created by using a valid cl_context object. To transfer data to/from the host, the programmer needs valid cl_device_id and cl_command_queue objects, which are all created by using the same context as the device data. At a minimum, the MPI interface for OpenCL communication requires the target OpenCL memory object, context, and device ID objects as parameters. The command queue parameter is optional and can be created by using the above parameters. Within the MPICH implementation, we either use the user-provided command queue or create several internal command queues for device-host data transfers. Within MPICH, we also create a temporary OpenCL buffer pool of pinned host-side memory for pipelining. However, OpenCL requires that the internal command queues and the pipeline buffers also be created by using the same context as the device data. Moreover, in theory, the OpenCL context could change for every MPI communication call, so the internal OpenCL objects cannot be created at MPI_Init time. Instead, they must be created at the beginning of every MPI call and destroyed at the end of it.

The initialization of these temporary OpenCL objects is expensive, and their repeated creation severely hurts performance. We therefore cache the command queue and pipeline buffer objects after the first communication call and reuse them if the same OpenCL context and device ID are used for the subsequent calls, which is a plausible scenario. If any future call involves a different context or device ID, we clear and replace our cache with the most recently used OpenCL objects. In this way, we can amortize the high OpenCL initialization cost across multiple calls and significantly improve performance. We use a caching window of one, which we consider to be sufficient in practice.
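Conceptually, the cache is a single entry keyed by the (context, device) pair, as in the sketch below; the names are illustrative and are not MPI-ACC's internal symbols.

    #include <CL/cl.h>

    static cl_context       cached_ctx = NULL;
    static cl_device_id     cached_dev = NULL;
    static cl_command_queue cached_q   = NULL;

    /* Caching window of one: reuse the internal queue while the caller keeps
     * using the same OpenCL context and device; otherwise rebuild the entry. */
    cl_command_queue get_internal_queue(cl_context ctx, cl_device_id dev)
    {
        if (ctx == cached_ctx && dev == cached_dev)
            return cached_q;                         /* hit: amortize creation cost */

        if (cached_q != NULL)
            clReleaseCommandQueue(cached_q);         /* miss: drop the stale entry  */
        cached_q   = clCreateCommandQueue(ctx, dev, 0, NULL);
        cached_ctx = ctx;
        cached_dev = dev;
        return cached_q;
    }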
5 APPLICATION CASE STUDIES

In this section, we perform an in-depth analysis of the default MPI+GPU application design in scientific applications from computational epidemiology and seismology modeling. We identify the inherent data movement inefficiencies and show how MPI-ACC can be used to explore new design spaces and create novel application-specific optimizations.

5.1 EpiSimdemics

GPU-EpiSimdemics [9], [10] is a high-performance, agent-based simulation program for studying the spread of epidemics through large-scale social contact networks and the coevolution of disease, human behavior, and the social contact network. The participating entities in GPU-EpiSimdemics are persons and locations, which are represented as a bipartite graph (Figure 4a) and interact with one another iteratively over a predetermined number of iterations (or simulation days). The output of the simulation is the relevant disease statistics of the contagion diffusion, such as the total number of infected persons or an infection graph showing who infected whom and the time and location of the infection.

5.1.1 Phases

Each iteration of GPU-EpiSimdemics consists of two phases: computeVisits and computeInteractions.


During the computeVisits phase, all the person objects of every processing element (or PE) first determine the schedules for the current day, namely, the locations to be visited and the duration of each visit. These visit messages are sent to the destination location's host PE (Figure 4a). Computation of the schedules is overlapped with communication of the corresponding visit messages.

In the computeInteractions phase, each PE first groups the received visit messages by their target locations. Next, each PE computes the probability of infection transmission between every pair of spatially and temporally colocated people in its local location objects (Figure 4b), which determines the overall disease spread information of that location. The infection transmission function depends on the current health states (e.g., susceptible, infectious, latent) of the people involved in the interaction (Figure 4c) and the transmissibility factor of the disease. These infection messages are sent back to the "home" PEs of the infected persons. Each PE, upon receiving its infection messages, updates the health states of the infected individuals, which will influence their schedules for the following simulation day. Thus, the messages that are computed as the output of one phase are transferred to the appropriate PEs as inputs of the next phase of the simulation. The system is synchronized by barriers after each simulation phase.

Fig. 5: Exclusive GPU computation mode of GPU-EpiSimdemics. Left: manual MPI+CUDA design. Right: MPI-ACC–enabled design, where the visit messages are transferred and preprocessed on the device in a pipelined manner.

5.1.2 Computation-Communication Patterns and MPI-ACC–Driven Optimizations

In GPU-EpiSimdemics, each PE in the simulation is implemented as a separate MPI process. Also, the computeInteractions phase of GPU-EpiSimdemics is offloaded to and accelerated on the GPU while the rest of the computations are executed on the CPU [10]. The current implementation of GPU-EpiSimdemics assumes a one-to-one mapping of GPUs to MPI processes. In accordance with the GPU-EpiSimdemics algorithm, the output data elements from the computeVisits phase (i.e., visit messages) are first received over the network, then merged, grouped, and preprocessed before the GPU can begin the computeInteractions phase of GPU-EpiSimdemics. Moreover, there are two GPU computation modes, depending on how the visit messages are processed on the GPUs. In this paper, we discuss the exclusive GPU computation mode; a discussion of the cooperative CPU-GPU computation mode can be found in our prior work [27].

Preprocessing phase on the GPU: As a preprocessing step in the computeInteractions phase, we modify the data layout of the visit messages to be more amenable to the massively parallel architecture of the GPU [10]. Specifically, we unpack the visit message structures into a 2D time-bin matrix, where each row of the matrix represents a person-location pair and the cells in the row represent fixed time slots of the day; that is, each visit message corresponds to a single row in the person-timeline matrix. Depending on the start time and duration of a person's visit to a location, the corresponding row cells are marked as visited. The preprocessing logic of data unpacking is implemented as a separate GPU kernel at the beginning of the computeInteractions phase. The matrix data representation enables a much better SIMDization of the computeInteractions code execution, which significantly improves the GPU performance. However, we achieve the benefits at the cost of a larger memory footprint for the person-timeline matrix, as well as a computational overhead for the data unpacking.
before the GPU can begin the computeInteractions phase of GPU-EpiSimdemics. Moreover, there are two GPU com- Whenever a buffer fragment is received into the corre- putation modes depending on how the visit messages are sponding receive buffer, it is copied into a contiguous visit processed on the GPUs. In this paper, we discuss the vector in the CPU’s main memory. The computeInteractions exclusive GPU computation mode, but discussion of the phase of the simulation then copies the aggregated visit cooperative CPU-GPU computation mode can be found in vector to the GPU memory for preprocessing. While the our prior work [27]. CPU-CPU communication of visit messages is somewhat Preprocessing phase on the GPU: As a preprocessing overlapped with their computation on the source CPUs, step in the computeInteractions phase, we modify the data the GPU and the PCIe bus will remain idle until the visit layout of the visit messages to be more amenable to the messages are completely received, merged, and ready to be massive parallel architecture of the GPU [10]. Specifically, transferred to the GPU. we unpack the visit message structures to a 2D time-bin Advanced MPI+GPU communication-computation matrix, where each row of the matrix represents a person- pipeline: In this optimization, we still preallocate location pair and the cells in the row represents fixed time persistent receive buffers on the CPU in each PE and slots of the day: that is, each visit message corresponds to a register them with the MPI library as the communication single row in the person-timeline matrix. Depending on the endpoints by calling MPI_Recv_init. But, we create the start time and duration of a person’s visit to a location, contiguous visit vector in GPU memory, so that whenever the corresponding row cells are marked as visited. The a PE receives a visit buffer fragment on the CPU, we preprocessing logic of data unpacking is implemented as immediately enqueue an asynchronous CPU-GPU data a separate GPU kernel at the beginning of the computeIn- transfer to the contiguous visit vector and also enqueue the teractions phase. The matrix data representation enables associated GPU preprocessing kernels, thereby manually a much better SIMDization of the computeInteractions creating the communication-computation pipeline. code execution, which significantly improves the GPU In order to enable asynchronous CPU-GPU data trans- 1045-9219 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPDS.2015.2446479, IEEE Transactions on Parallel and Distributed Systems


In order to enable asynchronous CPU-GPU data transfers, however, the persistent receive buffers must be non-pageable (pinned) memory. Also, since the number of receive buffers increases with the number of PEs, the pinned memory footprint also increases with the number of PEs. This design reduces the available pageable CPU memory, a situation that could lead to poor CPU performance [6].

The pinned memory management logic can be implemented at the application level in a couple of ways. In the first approach, the pinned memory pool is created before the computeVisits phase begins and is destroyed once the phase finishes, but the memory management routines are invoked every simulation iteration. While this approach is relatively simple to implement, repeated memory management leads to significant performance overhead. In the second approach, the pinned memory pool is created once, before the main simulation loop, and is destroyed after the loop ends, thus avoiding the performance overhead of repeated memory management. However, this design reduces the available pageable CPU memory, not only for the computeVisits phase, but also for the other phases of the simulation, including computeInteractions. We discuss the performance tradeoffs of these two memory management techniques in Section 6.2.

MPI-ACC–enabled communication-computation pipeline: Since MPI-ACC handles both CPU and GPU buffers, in each PE we preallocate persistent buffer fragments on the GPU and register them with the MPI library by using MPI_Recv_init. Without MPI-ACC's GPU support, one cannot create persistent buffers on the GPU as MPI communication endpoints. In this approach, the receive buffers on the GPU increase with the number of PEs, but the approach does not require a growing pinned memory pool on the host. Furthermore, MPI-ACC internally creates a constant pool of pinned memory during MPI_Init and automatically reuses it for all pipelined data communication, thereby providing better programmability and application scalability. Internally, MPI-ACC may either pipeline the internode CPU-GPU data transfers via the host CPU's memory or use direct GPU transfer techniques (e.g., GPUDirect RDMA), if possible; these details are hidden from the programmer.

Along with the preallocated persistent GPU buffer fragments, the contiguous visit vector is created in the GPU memory itself. As soon as a PE receives the visit buffer fragments on the GPU, we enqueue kernels to copy data within the device to the contiguous visit vector and also enqueue the associated GPU preprocessing kernels, thereby creating the end-to-end communication-computation pipeline. Thus, we completely overlap the visit data generation on the CPU with internode CPU-GPU data transfers and GPU preprocessing. In this way, the preprocessing overhead is completely hidden and removed from the computeInteractions phase. Moreover, the CPU, the GPU, and the interconnection network are all kept busy, performing either data transfers or the preprocessing execution.
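With MPI-ACC, the same structure collapses to persistent receives whose endpoints are device buffers, as in the sketch below; gpu_char_type stands for a datatype carrying the CUDA buffer-type attribute (Section 4.1), and the remaining names are illustrative.

    #include <mpi.h>
    #include <cuda_runtime.h>

    /* Persistent receive endpoints that live directly in GPU memory. */
    void setup_gpu_receives(int npes, size_t frag_bytes, void **d_frag,
                            MPI_Datatype gpu_char_type, MPI_Comm comm,
                            MPI_Request *req)
    {
        for (int p = 0; p < npes; p++) {
            cudaMalloc(&d_frag[p], frag_bytes);           /* fragment lives on the GPU  */
            MPI_Recv_init(d_frag[p], (int)frag_bytes, gpu_char_type,
                          p, 0, comm, &req[p]);           /* GPU buffer as MPI endpoint */
        }
        MPI_Startall(npes, req);
        /* Completed fragments are gathered into the device-resident visit vector by
         * intra-GPU copy kernels; pinned staging for the internode transfer (or a
         * direct transfer, where supported) is handled inside MPI-ACC. */
    }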
Furthermore, MPI-ACC matched layers) absorbers [28] are being used as ABCs for internally creates a constant pool of pinned memory their superior efficiency and minimal reflection coefficient. during MPI_Init and automatically reuses it for all The use of a one-point integration scheme leads to an pipelined data communication, thereby providing better easy and efficient implementation of the PML absorbing programmability and application . Internally, boundaries and allows the use of irregular elements in the MPI-ACC may either pipeline the internode CPU-GPU PML region [11]. data transfers via the host CPU’s memory or use direct GPU transfer techniques (e.g., GPUDirect RDMA), if 5.2.1 Computation-Communication Patterns possible; but these details are hidden from the programmer. The simulation operates on the input finite-difference (FD) Along with the preallocated persistent GPU buffer model and generates a three-dimensional grid as a first step. fragments, the contiguous visit vector is created in the Our MPI-based parallel version of the application divides GPU memory itself. As soon as a PE receives the visit the input FD model into submodels along different axes buffer fragments on the GPU, we enqueue kernels to such that each submodel can be computed on different copy data within the device to the contiguous visit vec- CPUs (or nodes). This domain decomposition technique tor and also enqueue the associated GPU preprocessing helps the application scale to a large number of nodes. Each kernels, thereby creating the end-to-end communication- processor computes the velocity and stress wavefields in its computation pipeline. Thus, we completely overlap the own subdomain and then exchanges the wavefields with the visit data generation on the CPU with internode CPU- nodes operating on neighbor subdomains, after each set of GPU data transfers and GPU preprocessing. In this way, velocity or stress computation (Figure 6). Each processor the preprocessing overhead is completely hidden and re- updates its own wavefields after receiving all its neighbors’ moved from the computeInteractions phase. Moreover, the wavefields. CPU, GPU, and the interconnection networks are all kept These computations are run multiple times for better busy, performing either data transfers or the preprocessing accuracy and convergence of results. In every iteration, execution. each node computes the velocity components followed by MPI-ACC’s internal pipelined CPU-GPU data transfer the stress components of the seismic wave propagation. largely hides the PCIe transfer latency during the compute- The wavefield exchanges with neighbors take place after 1045-9219 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPDS.2015.2446479, IEEE Transactions on Parallel and Distributed Systems


Fig. 6: Communication-computation pattern in the FDM-Seismology application. Left: basic MPI+GPU execution mode with data marshaling on the CPU. Right: execution modes with data marshaling on the GPU. MPI-ACC automatically communicates the GPU data; the MPI+GPU Adv case explicitly stages the communication via the CPU.

The wavefield exchanges with neighbors take place after each set of velocity and stress computations. This MPI communication takes place in multiple stages, wherein each communication is followed by an update of local wavefields and a small postcommunication computation on local wavefields. At the end of each iteration, the updated local wavefields are written to a file.

The velocity and stress wavefields are stored as large multidimensional arrays on each node. In order to optimize the MPI communication between neighbors of the FD domain grid, only a few elements of the wavefields—those needed by the neighboring node for its own local update—are communicated to the neighbor, rather than whole arrays. Hence, each MPI communication is surrounded by data-marshaling steps, where the required elements are packed into a smaller array at the source, communicated, and then unpacked at the receiver in order to update its local data.

5.2.2 GPU Acceleration of FDM-Seismology

We describe a couple of GPU execution modes of FDM-Seismology.

MPI+GPU with data marshaling on the CPU (MPI+GPU): Our GPU-accelerated version of FDM-Seismology performs the velocity and stress computations as GPU kernels. In order to transfer the wavefields to other nodes, it first copies the bulk data from the GPU buffers to CPU memory over the PCIe bus and then transfers the individual wavefields over MPI to the neighboring nodes (Figure 6). All the data-marshaling operations and small postcommunication computations are performed on the CPU itself. The newly updated local wavefields that are received over MPI are then bulk transferred back to the GPU before the start of the next stress or velocity computation on the GPU.

MPI+GPU with data marshaling on the GPU (MPI+GPU Adv): In this execution mode, the data-marshaling operations are moved to the GPU to leverage the faster GDDR5 memory module and the GPU architecture. Consequently, the CPU-GPU bulk data transfers before and after each velocity-stress computation kernel are completely avoided. The need to explicitly bulk transfer data from the GPU to the CPU arises only at the end of the iteration, when the results are transferred to the CPU to be written to a file (Figure 6).

5.2.3 MPI-ACC–Enabled Optimizations

GPU-based data marshaling suffers from the following disadvantage in the absence of GPU-integrated MPI. All data-marshaling steps are separated by MPI communication, and each data-marshaling step depends on the previously marshaled data and the received MPI data from the neighbors. In other words, after each data-marshaling step, data has to be explicitly moved from the GPU to the CPU only for MPI communication. Similarly, the received MPI data has to be explicitly moved back to the GPU before the next marshaling step. In this scenario, the application uses the CPU only as a communication relay. If the GPU communication technology changes (e.g., GPUDirect RDMA), we would have to largely rewrite the FDM-Seismology communication code to achieve the expected performance.

With MPI-ACC as the communication library, we still perform data marshaling on the GPU, but we communicate the marshaled data directly to and from the GPU without explicitly using the CPU for data staging. Also, the bulk transfer of data still happens only once at the end of each iteration, in order to write the results to a file. However, the data-marshaling step happens multiple times during a single iteration, and consequently the application launches a series of GPU kernels. While consecutive kernels entail launch and synchronization overhead per kernel invocation, the benefits of faster data marshaling on the GPU and optimized MPI communication outweigh the kernel overheads.

Other than the benefits resulting from GPU-driven data marshaling, a GPU-integrated MPI library benefits the FDM-Seismology application in the following ways: (1) it significantly enhances the productivity of the programmer, who is no longer constrained by the fixed CPU-only MPI communication and can easily choose the appropriate device as the communication target end-point; (2) the pipelined data transfers within MPI-ACC further improve the communication performance over the network; and (3) regardless of the GPU communication technology that may become available in the future, our MPI-ACC–driven FDM-Seismology code will not change and will automatically benefit from the performance upgrades made available by subsequent GPU-integrated MPI implementations (e.g., support for GPUDirect RDMA).
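The resulting per-neighbor exchange can be sketched as follows. pack_halo_on_gpu(), unpack_halo_on_gpu(), and gpu_real_type (a datatype carrying the device-buffer attribute) are illustrative placeholders rather than the application's actual routines.

    #include <mpi.h>
    #include <cuda_runtime.h>

    void pack_halo_on_gpu(void *d_send_halo);     /* marshaling kernels (placeholders) */
    void unpack_halo_on_gpu(void *d_recv_halo);

    /* Marshal on the GPU, then communicate the device-resident halo directly. */
    void exchange_halo(void *d_send_halo, void *d_recv_halo, int count,
                       MPI_Datatype gpu_real_type, int neighbor, MPI_Comm comm)
    {
        MPI_Request req[2];

        pack_halo_on_gpu(d_send_halo);
        cudaDeviceSynchronize();                   /* packing must finish before the send */
        MPI_Irecv(d_recv_halo, count, gpu_real_type, neighbor, 0, comm, &req[0]);
        MPI_Isend(d_send_halo, count, gpu_real_type, neighbor, 0, comm, &req[1]);
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        unpack_halo_on_gpu(d_recv_halo);           /* update the local wavefields */
    }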
6 EVALUATION

In this section, we describe our experimental setup, followed by the performance evaluation of MPI-ACC via latency microbenchmarks. Next, we demonstrate the efficacy of the MPI-ACC–enabled optimizations in GPU-EpiSimdemics and FDM-Seismology. Using both microbenchmarks and GPU-EpiSimdemics, we discuss the impact of shared resource (hardware and software) contention on MPI-ACC's communication performance.

We conducted our experiments on HokieSpeed, a state-of-the-art, 212-teraflop hybrid CPU-GPU cluster housed at Virginia Tech. Each HokieSpeed node contains two hex-core Intel Xeon E5645 CPUs running at 2.40 GHz and two NVIDIA Tesla M2050 GPUs.


Fig. 7: Internode communication latency for GPU-to-GPU (CUDA) data transfers; InfiniBand transfer protocol: R3. Similar performance is observed for OpenCL data transfers. The chosen pipeline packet size for MPI-ACC is 256 KB. (Compared configurations: MPI-ACC, manual blocking, manual non-blocking (pipelined), and CPU-only MPI send/receive as a lower bound.)

Fig. 8: Execution profile of GPU-EpiSimdemics over various node configurations (nodes × GPUs per node), broken down into the computeVisits phase, GPU preprocessing, and the rest of the computeInteractions phase, for the MPI+GPU, MPI+GPU Adv, MVAPICH, and MPI-ACC implementations.
The host memory capacity is 24 GB, and each GPU has a 3 GB device memory. The internode interconnect is QDR InfiniBand. We used up to 128 HokieSpeed nodes and both GPUs per node for our experiments. We used the GCC v4.4.7 compiler and CUDA v5.0 with driver version 310.19.

6.1 Microbenchmark Analysis

Impact of Pipelined Data Transfer: In Figure 7, we compare the performance of MPI-ACC with the manual blocking and manual pipelined implementations. Our internode GPU-to-GPU latency tests show that MPI-ACC is better than the manual blocking approach by up to 48.3% and is up to 18.2% better than the manual pipelined implementation, especially for larger data transfers. The manual pipelined implementation repeatedly invokes MPI calls, causing multiple handshake messages to be sent back and forth across the network and thus hurting performance. On the other hand, we perform the handshake only once in MPI-ACC to establish the sender-receiver identities, followed by low-overhead pipelining. We perform pipelining in MPI-ACC only for messages that are larger than the pipeline packet size, and we fall back to the default blocking approach for smaller data sizes. Hence, the performance of MPI-ACC is comparable to the manual blocking approach for smaller message sizes.

Impact of OpenCL Object Caching: Our OpenCL caching optimization improves the internode GPU-to-GPU communication latency from 3% for larger data sizes (64 MB) to 88.7% for smaller data sizes (<256 KB). Even where the programmers provide their own command queue, the pipeline buffers still have to be created for every MPI communication call; hence, caching improves performance.

6.2 Case Study Analysis: EpiSimdemics

We compare the combined performance of all the phases of GPU-EpiSimdemics (computeVisits and computeInteractions), with and without the MPI-ACC–driven optimizations discussed in Section 5.1.2. We choose different-sized input datasets of synthetic populations from two U.S. states: Washington (WA), with a population of 5.7 million, and California (CA), with a population of 33.1 million. In this paper, we present results and a detailed analysis for WA; the analysis of CA is described in our prior work [27]. We also vary the number of compute nodes from 8 to 128 and the number of GPU devices between 1 and 2. We begin from the smallest node-GPU configuration that can fit the entire problem in the available GPU memory. We also compare the application performance when MVAPICH (v2.0a) is used as the GPU-aware MPI implementation of choice.

Our results in Figure 8 indicate that our MPI-ACC–driven optimizations perform better than the basic blocking MPI+GPU implementations by an average of 9.2% and by up to 13.3% for WA. The performance of the MPI-ACC–driven solution is similar to that of the MVAPICH-based and the manual MPI+GPU (advanced) implementations. Since the CPU-GPU transfer is not a bottleneck in GPU-EpiSimdemics, the specific data pipelining logic of either MPI-ACC or MVAPICH does not directly affect the performance gains. On the other hand, the preprocessing step (data unpacking) of the computeInteractions phase is completely overlapped with the asynchronous CPU-to-remote-GPU communication for all node configurations. Note that the advanced MPI+GPU implementation uses the manual pinned memory management techniques that we implemented at the application level, which achieves better performance but with much more complex code.

For larger node configurations, the local operating dataset in the computeInteractions phase becomes smaller, and hence the basic MPI+GPU solution takes less time to execute the preprocessing stage; in other words, the absolute gains achieved by hiding the preprocessing step are diminished for GPU-EpiSimdemics. However, we have shown that MPI-ACC can enable the developer to create newer optimizations for better latency hiding and resource utilization.

Data management complexity vs. performance tradeoffs: While the advanced MPI+GPU implementation achieved performance comparable to the MPI-ACC–based solution, it put the burden of explicit data management on the application programmer. We discussed in Section 5.1.2 that, on the other hand, the user can write simpler code and avoid explicit data management but has to repeatedly create and destroy the receive buffers for every simulation iteration, thereby losing performance. Figure 9 shows the performance tradeoffs of the two approaches.


Fig. 9: Analysis of the data management complexity vs. performance tradeoffs. Manual data management achieves better performance at the cost of high code complexity. The case with no explicit data management has simpler code but performs poorly.

Fig. 10: Analyzing the FDM-Seismology application with the larger input data (Dataset-2). Note that MPI communication refers to CPU-CPU data transfers for the MPI+GPU and MPI+GPU Adv cases and GPU-GPU (pipelined) data transfers for MPI-ACC and MVAPICH. The performance difference between MPI-ACC (CUDA) and MPI-ACC (OpenCL) is negligible, and only the MPI-ACC (CUDA) result is shown in this figure.

We observe that explicit data management is better for all node configurations and can achieve up to a 4.5× performance improvement. Without data management, the pinned memory footprint of the receive buffers increases with the number of MPI processes, thereby entailing bigger performance losses for larger node counts. To quantify the degree of performance loss, we measured the individual memory allocation costs using simple microbenchmarks and found that CUDA's pinned memory allocator (cudaMallocHost) was about 26% slower than the vanilla CPU memory allocator (malloc) for single CUDA contexts. We also observed that the pinned memory allocation cost increased linearly with the number of GPUs or CUDA contexts, whereas memory management in multiple processes and CUDA contexts should ideally be handled independently in parallel. Consequently, in Figure 9, we see that for the same number of MPI processes, the node configuration with two MPI processes (or GPUs) per node performs worse than the node with a single MPI process; for example, the 64×2 configuration is slower than the 128×1 one. Thus, efficient pinned memory management is essential for superior performance, and MPI-ACC provides that automatically to the programmers.

Discussion: The basic MPI+GPU solution has preprocessing overhead but does not have significant memory management issues. While the advanced MPI+GPU implementation gains from hiding the preprocessing overhead, it loses from either nonscalable pinned memory management or poor programmer productivity. On the other hand, MPI-ACC provides a more scalable solution by (1) automatically managing a fixed-size pinned buffer pool for pipelining and (2) creating buffer pools just once at MPI_Init and destroying them at MPI_Finalize. MPI-ACC thus gains from both hiding the preprocessing overhead and efficient pinned memory management. MPI-ACC decouples the lower-level memory management logic from the high-level simulation implementation, thereby enabling both performance and productivity.

6.3 Case Study Analysis: FDM-Seismology

In this section, we analyze the performance of the different phases of the FDM-Seismology application and evaluate the effect of MPI-ACC and MVAPICH on the application. FDM-Seismology is implemented with both CUDA and OpenCL, and while MPI-ACC is evaluated using both CUDA and OpenCL, MVAPICH can be evaluated only with the CUDA-based code. We vary the number of nodes from 2 to 128 with 1 GPU per node and use small and large datasets as input. Our scalability experiments begin from the smallest number of nodes required to fit the given data in the device memory. For the larger input data (i.e., Dataset-2), the size of the MPI transfers increases by 2×, while the size of the data to be marshaled increases by 4× when compared with the smaller Dataset-1.

Figure 10 shows the performance of the FDM-Seismology application, with and without GPU-based data marshaling. We report the average wall-clock time across all the processes because the computation-communication costs vary depending on the virtual location of the process in the application's structured grid representation. The application's running time is composed mainly of velocity and stress computations (>60%) and does not change for the three application designs.

In the basic MPI+GPU case, we perform both data-marshaling operations and MPI communication from the CPU. Thus, the application has to move large wavefield data between the CPU and the GPU for data marshaling and MPI communication after every stress and velocity computation phase in every iteration. In the MPI+GPU Adv, MVAPICH, and MPI-ACC–driven scenarios, we perform data marshaling on the GPU itself; hence, smaller-sized wavefield data is transferred from the GPU to the CPU only once per iteration for output generation. By performing data marshaling on the GPU, we avoid the large bulk CPU-GPU data transfers and improve the overall performance over the basic MPI+GPU design with data marshaling on the CPU. Data marshaling on the GPU provides performance gains, while MPI-ACC improves programmer productivity by directly communicating the GPU buffers (CUDA/OpenCL) in the application.

Scalability analysis: Figure 11 shows the performance improvement due to the MPI-ACC–enabled GPU data marshaling strategy over the basic MPI+GPU implementation with CPU data marshaling.


Fig. 11: Scalability analysis of the FDM-Seismology application with two datasets of different sizes. The baseline is the naive MPI+GPU programming model with CPU data marshaling. (Benefit due to MPI-ACC, in percent, vs. nodes x GPUs-per-node from 2x1 to 128x1, for Dataset-1 (Small) and Dataset-2 (Large).)

Fig. 12: Characterizing the contention impact of CUDA's stream-0 in GPU-EpiSimdemics. (Execution time in seconds of the ComputeInteractions and ComputeVisits phases with Stream0 vs. CustomStream, across node x GPU configurations from 8x2 to 128x2.)

Scalability analysis: Figure 11 shows the performance improvement due to the MPI-ACC-enabled GPU data marshaling strategy over the basic MPI+GPU implementation with CPU data marshaling. We see that the performance benefits of GPU data marshaling decrease with an increasing number of nodes, for the following reasons. For a given dataset, the per-node data size decreases with an increasing number of nodes; thus, the costly CPU-GPU bulk data transfers are reduced (Figure 10), and the overall benefit of GPU-based data marshaling itself is minimized. Also, for a larger number of nodes, the application's MPI communication cost becomes significant when compared with the computation and data marshaling costs. In such a scenario, the CPU-to-CPU MPI communication of the MPI+GPU and MPI+GPU Adv implementations has less overhead than the pipelined GPU-to-GPU MPI communication of the MPI-ACC-enabled design. If newer technologies such as GPUDirect RDMA are integrated into MPI, we can expect the GPU-to-GPU communication overhead to be reduced, but the overall benefit of GPU data marshaling itself will still be limited because of the reduced per-process working set.
7 ANALYSIS OF CONTENTION

In this section, we provide insights into the scalable design of MPI-ACC. Specifically, we show that MPI-ACC is designed to work concurrently with other existing GPU workloads with minimum contention; that is, one should be able to perform MPI-ACC GPU-GPU communication and other user-specified GPU tasks (kernels or data transfers) simultaneously with minimal performance degradation for both. We analyze the contention effects of MPI-ACC, MVAPICH, and manual MPI+GPU implementations on concurrent GPU and PCIe workloads.

Sources of contention: NVIDIA Fermi GPUs have one hardware queue each for enqueueing GPU kernels, D2H data transfers, and H2D data transfers. Operations in different hardware queues can potentially overlap; for example, a GPU kernel can overlap with H2D and D2H transfers simultaneously. However, operations enqueued to the same hardware queue are processed serially. If a GPU task oversubscribes a hardware queue by aggressively enqueueing multiple operations of the same type, it can severely slow other GPU tasks contending for the same queue.

GPU streams are software workflow abstractions for a sequence of operations that execute in issue order on the GPU. Stream operations are directed to the appropriate hardware queue depending on the operation type. Operations from different streams can execute concurrently and may be interleaved, whereas operations within the same stream are processed serially, which leads to software contention.

In summary, contention among GPU operations can be of two types: hardware contention, where one or more hardware queues of the GPU are oversubscribed by the same type of operation, and software contention, where different types of operations are issued to the same GPU stream. In MPI-ACC, we carefully minimize both types of contention by staggering the enqueueing of H2D and D2H operations across different GPU streams, thereby enabling maximum concurrency.

Microbenchmark design: We extended the SHOC benchmark suite's contention-mt application for the microbenchmark study (a simplified skeleton of the setup is sketched below). The benchmark creates two MPI processes, each on a separate node and controlling the two local GPUs. Each MPI process is also dual-threaded and concurrently runs one task per thread: task-0, on thread-0, performs point-to-point GPU-to-GPU MPI communication with the other process, and task-1, on thread-1, executes local non-MPI GPU tasks, such as compute kernels or H2D and D2H data-transfer loops. CUDA allows the same GPU context to be shared among all the threads (tasks) in a process, and we share the local GPU between both tasks. To measure the contention impact, we first execute tasks 0 and 1 independently, without contention, and then execute them concurrently to induce contention. Each task runs for 100 loop iterations in both the independent and the concurrent runs. We measure and report the performance difference between the tasks' independent and concurrent runs as the incurred contention.
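The following is a much-simplified skeleton of this two-task setup, not the actual SHOC contention-mt code: one POSIX thread drives GPU-to-GPU MPI transfers while the other runs a local D2H copy loop on its own stream, and the tasks are timed in isolation and then concurrently. All buffer sizes, tags, and iteration counts are assumptions, and the device-buffer MPI calls assume a GPU-integrated MPI library.

    /* Sketch of the dual-threaded contention microbenchmark (simplified). */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <pthread.h>
    #include <stdio.h>

    #define ITERS  100
    #define NBYTES (16u << 20)        /* 16 MB, illustrative */

    static float *d_mpi, *d_loop;     /* device buffers for task-0 and task-1 */
    static int rank;

    /* Task-0: point-to-point GPU-to-GPU MPI communication with the peer rank.
     * The device pointer is passed directly (assumes a GPU-aware MPI). */
    static void *task0_mpi(void *arg)
    {
        (void)arg;
        int peer = 1 - rank;
        for (int i = 0; i < ITERS; i++) {
            if (rank == 0)
                MPI_Send(d_mpi, NBYTES, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
            else
                MPI_Recv(d_mpi, NBYTES, MPI_BYTE, peer, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
        }
        return NULL;
    }

    /* Task-1: local, non-MPI D2H copy loop on its own (custom) stream. */
    static void *task1_d2h(void *arg)
    {
        (void)arg;
        float *h_buf;
        cudaStream_t s;
        cudaMallocHost((void **)&h_buf, NBYTES);
        cudaStreamCreate(&s);
        for (int i = 0; i < ITERS; i++)
            cudaMemcpyAsync(h_buf, d_loop, NBYTES, cudaMemcpyDeviceToHost, s);
        cudaStreamSynchronize(s);
        cudaStreamDestroy(s);
        cudaFreeHost(h_buf);
        return NULL;
    }

    int main(int argc, char **argv)
    {
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        cudaMalloc((void **)&d_mpi,  NBYTES);
        cudaMalloc((void **)&d_loop, NBYTES);

        /* Independent runs: time each task alone (contention-free baseline). */
        double t0 = MPI_Wtime(); task0_mpi(NULL); double t_mpi_alone = MPI_Wtime() - t0;
        t0 = MPI_Wtime();        task1_d2h(NULL); double t_d2h_alone = MPI_Wtime() - t0;

        /* Concurrent run: both tasks share the local GPU and contend. */
        pthread_t th0, th1;
        t0 = MPI_Wtime();
        pthread_create(&th0, NULL, task0_mpi, NULL);
        pthread_create(&th1, NULL, task1_d2h, NULL);
        pthread_join(th0, NULL);
        pthread_join(th1, NULL);
        double t_both = MPI_Wtime() - t0;

        if (rank == 0)
            printf("alone: mpi=%.3fs d2h=%.3fs  concurrent: %.3fs\n",
                   t_mpi_alone, t_d2h_alone, t_both);

        cudaFree(d_mpi); cudaFree(d_loop);
        MPI_Finalize();
        return 0;
    }

The reported contention is simply the difference between the concurrent and the isolated timings of each task, as described above.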
7.1 Discussion of Software Contention

CUDA's stream-0 (the default stream) is unique in that it is completely ordered with respect to all operations issued on any stream of the device; issuing operations on stream-0 is functionally equivalent to synchronizing the entire device before and after each operation. Although MPI-ACC privately creates and uses custom streams to minimize software contention with other streams, a concurrent user operation issued to stream-0 can inadvertently stall any MPI-ACC operation on that GPU until stream-0 has completed. OpenCL, on the other hand, has no such special queue and does not incur this software contention. The sketch below illustrates the default-stream behavior.
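To make the hazard concrete, the following sketch, assuming the legacy default-stream semantics of the Fermi-era CUDA versions discussed here, issues the same D2H copy once on stream-0 and once on a separately created stream; the kernel is a trivial placeholder, and h_data is assumed to be pinned for truly asynchronous copies.

    /* Sketch: legacy default-stream (stream-0) semantics vs. a custom stream. */
    #include <cuda_runtime.h>

    __global__ void busy_kernel(float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            for (int k = 0; k < 1000; k++) x[i] = x[i] * 0.5f + 1.0f;
    }

    void demo(float *d_data, float *h_data, int n)
    {
        cudaStream_t user_stream, copy_stream;
        cudaStreamCreate(&user_stream);
        cudaStreamCreate(&copy_stream);

        /* A long-running kernel on a custom stream (e.g., the application's
         * preprocessing kernel, or a library's internal pipeline stream). */
        busy_kernel<<<(n + 255) / 256, 256, 0, user_stream>>>(d_data, n);

        /* Case 1: copy issued on stream-0.  With legacy default-stream
         * semantics, stream-0 is ordered with ALL other blocking streams,
         * so this copy waits for busy_kernel, and any later work on the
         * custom streams waits for this copy in turn. */
        cudaMemcpyAsync(h_data, d_data, n * sizeof(float),
                        cudaMemcpyDeviceToHost, 0);

        /* Case 2: copy issued on another custom stream.  The two custom
         * streams have no implicit ordering between them, so this copy is
         * not forced to wait for the kernel (hardware queues permitting). */
        busy_kernel<<<(n + 255) / 256, 256, 0, user_stream>>>(d_data, n);
        cudaMemcpyAsync(h_data, d_data, n * sizeof(float),
                        cudaMemcpyDeviceToHost, copy_stream);

        cudaStreamSynchronize(copy_stream);
        cudaStreamSynchronize(user_stream);
        cudaStreamDestroy(copy_stream);
        cudaStreamDestroy(user_stream);
    }

Case 1 is the situation described in the text: a user operation on stream-0 effectively serializes the device and can stall unrelated, privately created streams.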


Contention due to stream-0 can be seen even in GPU-EpiSimdemics, and we analyze its effect as follows. In GPU-EpiSimdemics, the internode CPU-GPU communication of the visit messages is overlapped with a preprocessing kernel that performs data layout transformation (Section 5.1). While we use non-0 streams within MPI-ACC for the internode communication of the visit messages, the preprocessing kernel may be launched on the user's chosen CUDA stream. Figure 12 shows that the performance of GPU-EpiSimdemics is about 6.6% slower when the preprocessing kernels use stream-0 instead of a non-0 stream, and the slowdown can be up to 16.3% for some node configurations. While MPI-ACC's streams are designed to scale, a poor application design that uses stream-0 can cause an apparent slowdown in MPI-ACC's data transfer performance.

7.2 Minimizing the Hardware Contention

MPI-ACC uses the D2H and H2D hardware queues of the GPU for send and receive, respectively. In theory, MPI-ACC communication can overlap with kernel invocations or with data transfer operations in the opposite direction, that is, operations that use the other data transfer queue. However, MPI-ACC can cause contention with another data transfer operation in the same direction; for example, MPI_Send can contend with a concurrent D2H data transfer. MPI-ACC operations can also potentially contend with the on-device memory controller; for example, MPI_Send or MPI_Recv can slow a global-memory-intensive kernel that is accessing the same memory module. In this section, we quantify and evaluate the global memory and PCIe contention effects.

Global memory contention analysis: We study the impact of global memory contention by executing MPI-ACC operations in task-0 and global memory read/write benchmarks in task-1 with custom CUDA streams (a representative streaming kernel is sketched below). Our experiments indicate that the performance drop due to contention in the MPI-ACC communication is about 4%, whereas the global memory kernels slow by about 8%. An average MPI-ACC call runs longer than an average global memory access, so MPI-ACC suffers a smaller relative performance reduction. On the other hand, the performance impact of MPI-ACC on on-chip (local) memory accesses and on simple computational kernels using custom CUDA streams is smaller: the performance degradation in the MPI-ACC communication is about 3%, and the GPU workloads do not have any noticeable slowdown. Because of space constraints, we omit explicit performance graphs.
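As a stand-in for the global memory read/write benchmark used as task-1 (the exact benchmark kernel is not reproduced here), a simple streaming kernel of the following form, launched on a user-created stream, is enough to keep the on-device memory controller busy while MPI communication is in flight; sizes and launch parameters are illustrative.

    /* Sketch: a global-memory streaming kernel used as the contending task,
     * launched on a custom (non-0) stream. */
    #include <cuda_runtime.h>

    __global__ void gmem_rw(float *dst, const float *src, size_t n)
    {
        size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
        /* Grid-stride loop: every element is read from and written to
         * device global memory, stressing the memory controller. */
        for (; i < n; i += (size_t)gridDim.x * blockDim.x)
            dst[i] = src[i] + 1.0f;
    }

    void launch_gmem_task(float *d_dst, const float *d_src, size_t n,
                          cudaStream_t user_stream, int iters)
    {
        for (int it = 0; it < iters; it++)
            gmem_rw<<<256, 256, 0, user_stream>>>(d_dst, d_src, n);
        cudaStreamSynchronize(user_stream);
    }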
PCIe contention analysis with data transfers in the opposite direction: We study the impact of PCIe contention by having task-0 perform MPI_Send or MPI_Recv communication operations with GPU-0, while task-1 executes H2D or D2H calls. This approach gives four task combinations, of which two perform bidirectional data transfers and two transfer data in the same direction. In this paper, we report the results of running MPI_Send (task-0) concurrently with H2D and D2H transfers on the same GPU (task-1); the contention analysis of MPI_Recv with H2D and D2H transfers is identical, and we exclude it because of space constraints. If task-0 and task-1 perform bidirectional data transfers and use custom CUDA streams, we find that the average slowdown of task-0 is 6% and the H2D task (task-1) has a negligible slowdown. Ideally, if the bidirectional bandwidth were twice the unidirectional bandwidth, neither concurrent task would slow down. In our experimental platform, however, the bidirectional bandwidth is only about 19.3% higher than the unidirectional bandwidth, according to the simpleMultiCopy CUDA SDK benchmark. Thus, task-0's slowdown is due to the lower bidirectional bandwidth and not to any MPI-ACC-related contention effects.

PCIe contention analysis with data transfers in the same direction: For this study, we analyze contention effects when MPI_Send (task-0) is invoked concurrently with standalone D2H transfers on the same GPU (task-1). We analyze the contention impacts of three MPI_Send implementations: MPI-ACC, MVAPICH, and manual pipelining using asynchronous MPI and CUDA (approximated by the sketch at the end of this discussion). Since the Fermi GPUs have a single data-transfer hardware queue in each direction, a task that oversubscribes the GPU can significantly slow any other task that uses the same queue. In fact, we show that MPI-ACC induces less contention than MVAPICH and the manual asynchronous MPI+GPU approach to GPU data communication. We show that MPI-ACC enqueues commands to the GPU hardware queue in a balanced manner, thereby minimizing the apparent performance slowdown of the D2H task (task-1) while incurring only a modest slowdown in the MPI communication (task-0) itself.

Figure 13a shows the relative increase in the MPI latency due to contention from the D2H task. For this experiment, the D2H task consistently transfers 16 MB between the device and the host, whereas the data size for the MPI task is varied. MPI-ACC shows a maximum slowdown of about 5% for relatively small MPI data transfers, and the slowdown for larger MPI data transfers is negligible, on average. The other implementations show little to no slowdown for all data sizes.

Figure 13b shows the normalized bandwidth of the D2H task when run concurrently with MPI. For this experiment, the MPI task consistently performs 16 MB GPU-GPU data transfers across the network, whereas the data size for the local D2H task is varied. We see that MPI-ACC causes a maximum performance slowdown of 3.8% for the D2H task for relatively small data, and the slowdown for larger D2H data sizes is negligible. However, MVAPICH and the manual asynchronous MPI+GPU implementation cause a slowdown of about 18% for smaller D2H data transfers. This result indicates that MPI-ACC enqueues GPU tasks to the hardware queues in a more balanced manner, whereas MVAPICH may oversubscribe the hardware queues, thereby causing significant performance variations in the other GPU tasks.
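For reference, the manual pipelining baseline compared above can be approximated by the sketch below, which chunks the device buffer, enqueues all the D2H stages upfront on a custom stream, and sends each chunk as it arrives in pinned host memory. The chunk size, tag, and helper names are assumptions, error handling is omitted, and this is not the exact code used in the experiments.

    /* Sketch: manual MPI+CUDA pipelined send of a device buffer (the
     * "enqueue everything upfront" style compared in Figs. 13 and 14). */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    #define CHUNK (1 << 20)   /* 1 MB pipeline chunk, illustrative */

    void pipelined_send(const char *d_buf, size_t nbytes, int dest, MPI_Comm comm)
    {
        size_t nchunks = (nbytes + CHUNK - 1) / CHUNK;
        char *h_stage;
        cudaStream_t s;
        cudaEvent_t *done = (cudaEvent_t *)malloc(nchunks * sizeof(cudaEvent_t));
        MPI_Request *req  = (MPI_Request *)malloc(nchunks * sizeof(MPI_Request));

        cudaMallocHost((void **)&h_stage, nbytes);   /* pinned staging buffer */
        cudaStreamCreate(&s);

        /* Aggressively enqueue every D2H stage at once; this can
         * oversubscribe the single Fermi D2H hardware queue whenever
         * another D2H task is active on the same GPU. */
        for (size_t c = 0; c < nchunks; c++) {
            size_t off = c * CHUNK;
            size_t len = (off + CHUNK <= nbytes) ? CHUNK : nbytes - off;
            cudaMemcpyAsync(h_stage + off, d_buf + off, len,
                            cudaMemcpyDeviceToHost, s);
            cudaEventCreateWithFlags(&done[c], cudaEventDisableTiming);
            cudaEventRecord(done[c], s);
        }

        /* Send each chunk over the network as soon as its copy completes. */
        for (size_t c = 0; c < nchunks; c++) {
            size_t off = c * CHUNK;
            size_t len = (off + CHUNK <= nbytes) ? CHUNK : nbytes - off;
            cudaEventSynchronize(done[c]);
            MPI_Isend(h_stage + off, (int)len, MPI_BYTE, dest, 0, comm, &req[c]);
        }
        MPI_Waitall((int)nchunks, req, MPI_STATUSES_IGNORE);

        for (size_t c = 0; c < nchunks; c++) cudaEventDestroy(done[c]);
        cudaStreamDestroy(s);
        cudaFreeHost(h_stage);
        free(done); free(req);
    }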


Fig. 13: Impact of contention due to concurrent MPI_Send and local D2H GPU operations. MPI-ACC is evaluated against MVAPICH and the manual MPI+GPU implementations. (a) Impact on the MPI_Send communication latency; task-1 (D2H) transfers 16 MB in all cases (relative increase in MPI latency vs. MPI data size in KB, 256-16384). (b) Impact on the local D2H GPU operations; task-0 (MPI) transfers 16 MB in all cases (normalized GPU D2H bandwidth vs. D2H data size in KB, 256-16384). MPI-ACC causes the least contention for the D2H task.

Fig. 14: Using HPCTOOLKIT to understand the contention impacts of MPI-ACC and local GPU data transfer operations. (a) Impact of MPI-ACC's MPI_Send with concurrent D2H operations: task-0 issues MPI_Send from the GPU on MPI-ACC's internal streams, task-1 runs a D2H loop on a custom stream, and the GPU requests are balanced. (b) Impact of the manual MPI+GPU send task with concurrent D2H operations: task-0 issues MPI+CUDA sends from the GPU on custom streams, and the GPU requests are aggressive.

HPCToolkit analysis: HPCTOOLKIT [29], [30] is a sampling-based performance analysis toolkit capable of quantifying scalability bottlenecks in parallel programs. We use HPCTOOLKIT's hpctraceviewer interface to understand why MPI-ACC causes less contention than the manual MPI+GPU implementations do. Hpctraceviewer renders hierarchical, timeline-based visualizations of parallel hybrid CPU-GPU programs. Figure 14 presents screenshots of the detailed execution profile of our contention benchmark. The hpctraceviewer tool presents the timeline information of all CPU processes, threads, and their corresponding CUDA streams; however, we zoom in and show only the timelines of the relevant CUDA streams associated with both tasks of process 0. The other process exhibits identical behavior and is excluded from the figure.

Figure 14a shows the effect of MPI-ACC's send operation interacting with the D2H data transfers of task-1. Since both tasks issue D2H commands and there is only one D2H queue on Fermi, only one of the CUDA streams is active at any given point in time. Moreover, MPI-ACC's pipelining logic is designed to issue GPU commands only when the next pipeline stage is ready. This design does not oversubscribe the GPU and leads to balanced execution, which can be seen as interleaved bars in the MPI-related timeline. Figure 14b depicts the contention effect of the manual pipelined MPI+GPU implementation. In this example implementation, we enqueue all the pipeline stages upfront, which is an acceptable design for standalone point-to-point communication; however, it oversubscribes the GPU, which can be seen as clusters of bars in the MPI-related timeline. Of course, if one designs the manual MPI+GPU implementation similarly to our MPI-ACC design (an issue-when-ready variant is sketched below), the associated timeline will look like Figure 14a. Since the manual MPI+GPU implementation is more aggressive in enqueueing GPU operations, the D2H operations of task-1 tend to wait longer. That is why, on average, MPI-ACC causes the least performance perturbation to the D2H task (Figure 13b).
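In the spirit of the issue-when-ready policy described above (not MPI-ACC's actual implementation), the following sketch double-buffers the pinned staging area and issues the D2H copy for a chunk only when its slot is free, so at most a couple of GPU operations are outstanding at any time and the D2H hardware queue is never flooded. Chunk size, tag, and buffer names are assumptions.

    /* Sketch: issue-when-ready pipelining (double buffering) so that only a
     * bounded number of GPU operations are outstanding at any time. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    #define CHUNK (1 << 20)   /* 1 MB pipeline chunk, illustrative */

    void gated_pipelined_send(const char *d_buf, size_t nbytes,
                              int dest, MPI_Comm comm)
    {
        char *h_stage[2];
        cudaStream_t s;
        cudaStreamCreate(&s);
        cudaMallocHost((void **)&h_stage[0], CHUNK);
        cudaMallocHost((void **)&h_stage[1], CHUNK);

        size_t nchunks = (nbytes + CHUNK - 1) / CHUNK;
        MPI_Request req = MPI_REQUEST_NULL;

        for (size_t c = 0; c < nchunks; c++) {
            size_t off = c * CHUNK;
            size_t len = (off + CHUNK <= nbytes) ? CHUNK : nbytes - off;
            char  *h   = h_stage[c & 1];

            /* Issue the next D2H stage only now, when its staging slot is
             * free, instead of flooding the D2H hardware queue upfront.
             * (The slot's previous send finished two iterations ago.) */
            cudaMemcpyAsync(h, d_buf + off, len, cudaMemcpyDeviceToHost, s);
            cudaStreamSynchronize(s);          /* stage ready in host memory */

            /* Finish the send of the previous chunk before posting this one;
             * meanwhile, the copy above overlapped with that send. */
            if (req != MPI_REQUEST_NULL)
                MPI_Wait(&req, MPI_STATUS_IGNORE);
            MPI_Isend(h, (int)len, MPI_BYTE, dest, 0, comm, &req);
        }
        if (req != MPI_REQUEST_NULL)
            MPI_Wait(&req, MPI_STATUS_IGNORE);

        cudaFreeHost(h_stage[0]);
        cudaFreeHost(h_stage[1]);
        cudaStreamDestroy(s);
    }

Compared with the upfront-enqueue version sketched earlier, this variant trades a small amount of pipelining aggressiveness for far less pressure on the shared D2H hardware queue, which is the behavior the Figure 14a timeline reflects.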
7.3 Summary

In this section, we provided insights into the scalable design of MPI-ACC and compared its performance with MVAPICH and manual MPI+GPU implementations. Specifically, we showed that MPI-ACC delivers maximum concurrency by carefully ordering multiple GPU streams and efficiently balancing the H2D and D2H hardware queues for data pipelining, without oversubscribing the GPU resource.

8 CONCLUSION

In this paper, we introduced MPI-ACC, an integrated and extensible framework that allows end-to-end data movement in accelerator-connected systems. We discussed MPI-ACC's API design choices and a comprehensive set of optimizations, including data pipelining and buffer management. We studied the efficacy of MPI-ACC for scientific applications from the domains of epidemiology (GPU-EpiSimdemics) and seismology (FDM-Seismology), and we presented the lessons learned and tradeoffs. We found that MPI-ACC's internal pipeline optimization not only helps improve the end-to-end communication performance but also enables novel optimization opportunities at the application level, which significantly enhance the CPU-GPU and network utilization. With MPI-ACC, one can naturally express the communication target without explicitly treating the CPUs as communication relays. MPI-ACC decouples the application logic from the low-level GPU communication optimizations, thereby significantly improving scalability and application portability across multiple GPU platforms and generations.

ACKNOWLEDGMENT

This work was supported in part by the DOE grant DE-SC0001770; by contracts DE-AC02-06CH11357, DE-AC05-00OR22725, and DE-ACO6-76RL01830; by the DOE Office of Science under cooperative agreement number DE-FC02-07ER25800; by the DOE GTO via grant EE0002758 from Fugro Consultants; by the DTRA CNIMS contract HDTRA1-11-D-0016-0001; by the NSF PetaApps grant OCI-0904844; by the NSF Blue Waters grant OCI-0832603; by the NSF grants CNS-0960081, CNS-054630, and CNS-0958311; and by an NVIDIA Graduate Fellowship, NVIDIA Professor Partnership, and CUDA Research Center at Virginia Tech. This research used the HokieSpeed heterogeneous computing resource at Virginia Tech, which is supported by the National Science Foundation under contract CNS-0960081. We thank Darius Buntinas and Hao Wang for their valuable technical discussions and feedback.


REFERENCES

[1] L. Weiguo, B. Schmidt, G. Voss, and W. Muller-Wittig, "Streaming Algorithms for Biological Sequence Alignment on GPUs," IEEE Transactions on Parallel and Distributed Systems, vol. 18, no. 9, pp. 1270-1281, Sept. 2007.
[2] A. Nere, A. Hashmi, and M. Lipasti, "Profiling Heterogeneous Multi-GPU Systems to Accelerate Cortically Inspired Learning Algorithms," in IEEE Intl. Parallel and Distributed Processing Symposium (IPDPS), 2011.
[3] W.-C. Feng, Y. Cao, D. Patnaik, and N. Ramakrishnan, "Temporal Data Mining for Neuroscience," GPU Computing Gems, Emerald Edition, February 2011.
[4] "TOP500 Supercomputer Sites," http://www.top500.org.
[5] MPI: A Message-Passing Interface Standard Version 2.2. Message Passing Interface Forum, 2009.
[6] "NVIDIA CUDA C Programming Guide," 2014, http://docs.nvidia.com//pdf/CUDA C Programming Guide.pdf.
[7] L. Howes and A. Munshi, The OpenCL Specification. Khronos OpenCL Working Group, 2014, https://www.khronos.org/registry/cl/specs/-2.0.pdf.
[8] "NVIDIA GPUDirect," https://developer.nvidia.com/gpudirect.
[9] C. L. Barrett, K. R. Bisset, S. G. Eubank, X. Feng, and M. V. Marathe, "EpiSimdemics: An Efficient Algorithm for Simulating the Spread of Infectious Disease over Large Realistic Social Networks," in Intl. Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2008.
[10] K. R. Bisset, A. M. Aji, M. V. Marathe, and W.-c. Feng, "High-Performance Biocomputing for Simulating the Spread of Contagion over Large Contact Networks," BMC Genomics, vol. 13, no. S2, April 2012.
[11] S. Ma and P. Liu, "Modeling of the Perfectly Matched Layer Absorbing Boundaries and Intrinsic Attenuation in Explicit Finite Element Methods," Bulletin of the Seismological Society of America, vol. 96, no. 5, pp. 1779-1794, 2006.
[12] W. M. Brown, P. Wang, S. J. Plimpton, and A. N. Tharrington, "Implementing Molecular Dynamics on Hybrid High Performance Computers - Short Range Forces," Computer Physics Communications, vol. 182, no. 4, pp. 898-911, 2011.
[13] T. Endo, A. Nukada, S. Matsuoka, and N. Maruyama, "Linpack Evaluation on a Supercomputer with Heterogeneous Accelerators," in IEEE Intl. Parallel and Distributed Processing Symposium (IPDPS), 2010.
[14] A. M. Aji, J. Dinan, D. Buntinas, P. Balaji, W.-c. Feng, K. R. Bisset, and R. Thakur, "MPI-ACC: An Integrated and Extensible Approach to Data Movement in Accelerator-Based Systems," in 14th IEEE Intl. Conference on High Performance Computing and Communications (HPCC), 2012.
[15] H. Wang, S. Potluri, D. Bureddy, C. Rosales, and D. K. Panda, "GPU-aware MPI on RDMA-enabled Clusters: Design, Implementation and Evaluation," IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 25, no. 10, pp. 2595-2605, 2014.
[16] "MVAPICH: MPI over InfiniBand, 10GigE/iWARP and RoCE," http://mvapich.cse.ohio-state.edu/.
[17] "MPICH: High-Performance Portable MPI," http://www.mpich.org.
[18] O. S. Lawlor, "Message Passing for GPGPU Clusters: cudaMPI," in IEEE Intl. Conference on Cluster Computing and Workshops (CLUSTER), 2009.
[19] J. Chen, W. Watson III, and W. Mao, "GMH: A Message Passing Toolkit for GPU Clusters," in IEEE 16th Intl. Conference on Parallel and Distributed Systems (ICPADS), 2010.
[20] S. Yamagiwa and L. Sousa, "CaravelaMPI: Message Passing Interface for Parallel GPU-Based Applications," in 8th Intl. Symposium on Parallel and Distributed Computing (ISPDC), 2009.
[21] J. Stuart and J. Owens, "Message Passing on Data-Parallel Architectures," in IEEE Intl. Parallel and Distributed Processing Symposium (IPDPS), 2009.
[22] S. Potluri, D. Bureddy, H. Wang, H. Subramoni, and D. Panda, "Extending OpenSHMEM for GPU Computing," in IEEE Intl. Parallel and Distributed Processing Symposium (IPDPS), 2013.
[23] A. Sidelnik, B. L. Chamberlain, M. J. Garzaran, and D. Padua, "Using the High Productivity Language Chapel to Target GPGPU Architectures," UIUC Dept. of Comp. Sci. Technical Report, 2011.
[24] J. Hammond, S. Ghosh, and B. Chapman, "Implementing OpenSHMEM Using MPI-3 One-Sided Communication," in OpenSHMEM and Related Technologies. Experiences, Implementations, and Tools, 2014.
[25] J. Dinan, P. Balaji, J. R. Hammond, S. Krishnamoorthy, and V. Tipparaju, "Supporting the Global Arrays PGAS Model Using MPI One-sided Communication," in IEEE Intl. Parallel and Distributed Processing Symposium (IPDPS), 2012.
[26] A. M. Aji, P. Balaji, J. Dinan, W.-c. Feng, and R. Thakur, "Synchronization and Ordering Semantics in Hybrid MPI GPU Programming," in 3rd Intl. Workshop on Accelerators and Hybrid Exascale Systems (IPDPSW: AsHES). IEEE, 2013.
[27] A. M. Aji, L. S. Panwar, F. Ji, M. Chabbi, K. Murthy, P. Balaji, K. R. Bisset, J. Dinan, W.-c. Feng, J. Mellor-Crummey, X. Ma, and R. Thakur, "On the Efficacy of GPU-Integrated MPI for Scientific Applications," in ACM Intl. Symposium on High-Performance Parallel and Distributed Computing (HPDC), 2013.
[28] J. Berenger, "A Perfectly Matched Layer for the Absorption of Electromagnetic Waves," Journal of Computational Physics, vol. 114, no. 2, pp. 185-200, 1994.
[29] L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N. R. Tallent, "HPCToolkit: Tools for Performance Analysis of Optimized Parallel Programs," Concurrency Computation: Practice and Experience, vol. 22, pp. 685-701, April 2010.
[30] M. Chabbi, K. Murthy, M. Fagan, and J. Mellor-Crummey, "Effective Sampling-driven Performance Tools for GPU-accelerated Supercomputers," in Intl. Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2013.

Ashwin M. Aji is a Ph.D. candidate in the Dept. of Comp. Science, Virginia Tech, where he also received an M.S. degree.

Lokendra S. Panwar is an M.S. student in the Dept. of Comp. Science, Virginia Tech.

Feng Ji is a Ph.D. candidate in the Dept. of Comp. Science, North Carolina State University.

Karthik Murthy is a Ph.D. candidate in the Dept. of Comp. Science, Rice University. He received an M.A. in computer science from the University of Texas, Austin.

Milind Chabbi is a Ph.D. candidate in the Dept. of Comp. Science, Rice University. He received an M.S. in computer science from the University of Arizona, Tucson.

Pavan Balaji is a computer scientist at Argonne National Laboratory, a Fellow of the Northwestern-Argonne Inst. of Science and Engineering at Northwestern University, and a Research Fellow of the Computation Inst. at the Univ. of Chicago.

Keith R. Bisset is a simulation and system software development scientist at the Virginia Bioinformatics Institute, Virginia Tech.

James Dinan received his Ph.D. from the Ohio State University and completed a postdoctoral fellowship at Argonne National Laboratory.

Wu-chun Feng is a professor and Elizabeth & James E. Turner Fellow in the Dept. of Comp. Science, Dept. of Electrical and Computer Engineering, and Virginia Bioinformatics Institute at Virginia Tech.

John Mellor-Crummey is a professor in the Dept. of Comp. Science and the Dept. of Electrical and Computer Engineering, Rice University.

Xiaosong Ma is a senior scientist at Qatar Computing Research Institute and an associate professor in the Dept. of Comp. Science, North Carolina State University.

Rajeev Thakur is the Deputy Director of the Math. and Comp. Sci. Division at Argonne National Laboratory, a Senior Fellow in the Computation Institute at the Univ. of Chicago, and an adjunct professor in the Dept. of Elect. Eng. and Comp. Sci. at Northwestern University.