
GPUIterator: Bridging the Gap between Chapel and GPU Platforms

Akihiro Hayashi, Department of Computer Science, Rice University, USA, [email protected]
Sri Raj Paul, College of Computing, Georgia Institute of Technology, USA, [email protected]
Vivek Sarkar, College of Computing, Georgia Institute of Technology, USA, [email protected]

Abstract
PGAS (Partitioned Global Address Space) programming models were originally designed to facilitate productive parallel programming at both the intra-node and inter-node levels in homogeneous parallel machines. However, there is a growing need to support accelerators, especially GPU accelerators, in heterogeneous nodes in a cluster. Among high-level PGAS programming languages, Chapel is well suited for this task due to its use of locales and domains to help abstract away low-level details of data and compute mappings for different compute nodes, as well as for different processing units (CPU vs. GPU) within a node.
In this paper, we address some of the key limitations of past approaches on mapping Chapel on to GPUs as follows. First, we introduce a Chapel module, GPUIterator, which is a portable programming interface that supports GPU execution of a Chapel forall loop. This module makes it possible for Chapel programmers to easily use hand-tuned native GPU programs/libraries, which is an important requirement in practice since there is still a big performance gap between compiler-generated GPU code and hand-tuned GPU code; hand-optimization of CPU-GPU data transfers is also an important contributor to this performance gap. Second, though Chapel programs are regularly executed on multi-node clusters, past work on GPU enablement of Chapel programs mainly focused on single-node execution. In contrast, our work supports execution across multiple CPU+GPU nodes by accepting Chapel's distributed domains. Third, our approach supports hybrid execution of a Chapel parallel (forall) loop across both a GPU and CPU cores, which is beneficial for specific platforms.
Our preliminary performance evaluations show that the use of the GPUIterator is a promising approach for Chapel programmers to easily utilize a single or multiple CPU+GPU node(s) while maintaining portability.

CCS Concepts • Software and its engineering → Distributed programming languages.

Keywords Chapel, GPU, Parallel Iterators

ACM Reference Format:
Akihiro Hayashi, Sri Raj Paul, and Vivek Sarkar. 2019. GPUIterator: Bridging the Gap between Chapel and GPU Platforms. In Proceedings of the ACM SIGPLAN 6th Chapel Implementers and Users Workshop (CHIUW '19), June 22, 2019, Phoenix, AZ, USA. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3329722.3330142

1 Introduction
Software productivity and portability is a profound issue for large-scale systems. While conventional message-passing programming models such as MPI [11] are widely used in distributed-memory programs, orchestrating their low-level APIs imposes significant burdens on programmers. One promising solution is the use of PGAS (Partitioned Global Address Space) programming languages such as Chapel, Co-array Fortran, Habanero-C, Unified Parallel C (UPC), UPC++, and X10 [2, 8, 9, 12, 13, 16], since they are designed to mitigate productivity burdens by introducing high-level parallel language constructs that support globally accessible data, data parallelism, task parallelism, synchronization, and mutual exclusion.
However, there is a growing need to support accelerators, especially GPU accelerators, in heterogeneous nodes since they are now a common source of performance improvement in HPC clusters. According to the Top500 lists [17], 138 of the top 500 systems currently include accelerators. Thus, to keep up with the enhancements of hardware resources, a key challenge in the future development of PGAS programming models is to improve the programmability of accelerators.

For enabling GPU programming in PGAS programming models, past approaches focus on compiling and optimizing high-level data-parallel constructs for GPU execution. For example, Sidelnik et al. [3] and Chu et al. [7] compile Chapel's forall construct to GPUs. Similarly, X10CUDA [10] compiles X10's forasync construct to GPUs. While such compiler-driven approaches significantly increase productivity and portability, they often fall behind in delivering the best possible performance on GPUs. Thus, it is important to provide an additional mechanism through which programmers can utilize low-level GPU kernels written in CUDA/OpenCL and highly tuned libraries like cuBLAS to achieve the highest possible performance.
Interestingly, Chapel inherently addresses this "performance" vs. "portability" issue through an approach that supports separation of concerns [1]. More specifically, Chapel's multi-resolution concept allows programmers not only to stick with a high-level specification but also to dive into low-level details so that they can incrementally evolve their implementations with small changes. For example, [1] discusses multi-resolution support for arrays.
As for GPU programming with Chapel, programmers typically first start with writing forall loops and run these loops on CPUs as a proof-of-concept (see the notional CPU version in Listing 1). If the resulting CPU performance is not sufficient for their needs, their next step could be to try the automatic compiler-based GPU code generation techniques discussed earlier. For portions that remain as performance bottlenecks even after automatic compilation approaches, the next step is to consider writing GPU kernels using CUDA/OpenCL and invoking these kernels from the Chapel program using Chapel's C interoperability (the GPU version in Listing 1). More details of the C interoperability feature are discussed in Section 2.2.

Listing 1. Problem: The user has to switch back and forth between the CPU version (forall) and the GPU version (an external C function call myGPUCode()) when exploring higher performance.

// CPU version
forall i in 1..n {...}
⇕ // The user has to manually switch between the CPU and GPU versions
// GPU version (invoking an external C function)
myGPUCode(...);

However, we believe the current GPU development flow with the C interoperability feature is not a good abstraction from the viewpoint of Chapel's multi-resolution concept for three reasons. First, when using one version (either the CPU or GPU version), the other version has to be somehow removed or commented out, which is not very portable. Second, it is non-trivial or tedious to run the GPU version on multiple CPU+GPU nodes. Third, the choice of device is either the CPU or the GPU version, while hybrid execution of these two versions can be more flexible and deliver higher performance. It is worth noting that the second and third points were not addressed in the past work on mapping Chapel on to GPUs [3, 7].
A primary goal of this paper is to provide an appropriate interface between Chapel and accelerator programs such that expert accelerator programmers can explore different variants in a portable way (i.e., CPU-only, GPU-only, or X% for the CPU + Y% for the GPU on a single or multiple CPU+GPU node(s)).
To address these challenges, we introduce a Chapel module, GPUIterator, which provides the capability of creating and distributing tasks across a single or multiple CPU+GPU node(s). As shown in Listing 2, our approach enables running the forall loop on CPU+GPU with minimal changes - i.e., just wrapping the original loop range in GPU() (Line 9) with some extra code including a callback function (GPUWrapper(), Lines 1-5) that eventually calls the GPU function with an automatically computed subrange (lo and hi) for the GPU.

Listing 2. Our Proposal: GPUIterator provides an appropriate interface between Chapel and accelerator programs.

1 var GPUWrapper = lambda (lo: int, hi: int, n: int) {
2   // The GPU portion (lo and hi) is automatically computed
3   // even in multi-locale settings.
4   myGPUCode(lo, hi, n, ...);
5 };
6 var CPUPercent = x; // X% goes to the CPU,
7                     // (100-X)% goes to the GPU
8 // D can be a distributed domain
9 forall i in GPU(D, GPUWrapper, CPUPercent) {...}

This paper makes the following contributions by addressing some of the key limitations of past work:
• Design and implementation of the GPUIterator module for Chapel, which
  1. provides a portable programming interface that supports CPU+GPU execution of a Chapel forall loop without requiring any modifications to the Chapel compiler.
  2. supports execution across multiple CPU+GPU nodes by accepting Chapel's distributed domains to support multi-node GPUs.
  3. supports hybrid execution of a forall across both CPU and GPU processors.
• Performance evaluations of different CPU+GPU execution strategies for Chapel on three CPU+GPU platforms.

2 Chapel
The Chapel [2] language is a productive parallel programming language developed by Cray Inc. Chapel was initially developed as part of the DARPA High Productivity Computing Systems (HPCS) program to create highly productive languages for next-generation supercomputers. Chapel provides high-level abstractions to express multithreaded execution via data parallelism, task parallelism, and concurrency.
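To make these constructs concrete, the following is a minimal illustrative sketch of our own (not from the paper) showing the forms of parallelism mentioned above; the variable names and sizes are arbitrary.

config const n = 8;
var A: [1..n] int;

// Data parallelism: one logical iteration per index, scheduled across cores.
forall i in 1..n do
  A[i] = i * i;

// Task parallelism plus synchronization: two concurrent tasks coordinated
// through a sync variable.
var done$: sync bool;
cobegin {
  {
    const total = + reduce A;   // data-parallel reduction
    done$.writeEF(total > 0);   // signal the other task
  }
  {
    const ok = done$.readFE();  // blocks until the signal arrives
    writeln("reduction done: ", ok);
  }
}

A coforall over Locales, used later in Section 3, similarly creates one task per locale for distributed execution.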

2.1 Iterators
An iterator [6] is a high-level abstraction that gives programmers control over the scheduling of loops in a very productive manner. Iterators help to isolate iteration from the computation: loop iteration and loop body can be specified orthogonally to each other so as to minimize the impact of changes to one on the other. In general, scientific applications contain many complex loop nests that are optimized to maximize performance. This large number of loop nests creates maintenance problems because each loop nest needs to be maintained separately as the system evolves through changes in parallelism, loop scheduling, and so on. Iterators can help to reduce this maintenance problem because a change in an iterator's definition affects all the loops that use it, rather than requiring each loop nest to be rewritten individually.
An iterator is like a normal procedure except that it can yield multiple times as it executes, whereas a procedure can return only once. Iterators can be used to drive loops because each yielded value can correspond to a single loop iteration. They can also represent complex iteration spaces, for example, yielding the vertices of a graph that have some property.

Listing 3. Using an iterator to transform a program to express parallelism and vary scheduling

1 // Serial version
2 for i in 1..n { body(i) }
3
4 // Parallel version
5 forall i in 1..n { body(i) }
6
7 // Parallel work-stealing version
8 forall i in work_steal(1..n) { body(i) }
9
10 iter work_steal(r: range(?)) {
11   // work-stealing implementation
12 }

Parallel Iterators: In Chapel, iterators can also help to perform parallel iteration. They can be used with both data-parallel (forall) and task-parallel (coforall) constructs. Listing 3 gives an example of a serial loop with an iterator and its parallel equivalents. In the example, while iterating over an integer range, the parallel version will evenly distribute the iterations onto the available hardware parallelism, i.e., if there are p cores (assuming one thread per core), then each core will get a chunk of n/p iterations. If we want to try a different scheduling strategy, say distributing work using work-stealing, we can create a new iterator work_steal (Chapel provides a work-stealing strategy through its adaptive iterator; we use the work_steal iterator here for demonstration purposes) and use it just by changing the loop header, as shown in the parallel work-stealing version in Listing 3. Another example of parallel iterators simplifying complex scheduling strategies can be found in [5].
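Since the body of work_steal is elided in Listing 3, the following is a small, self-contained sketch of our own (not from the paper) showing how a serial iterator yields values that drive a loop; the chunked iterator and its chunk size are illustrative only.

// A complete serial iterator: yields the elements of a range in
// fixed-size chunks; each yield produces one iteration in the caller.
iter chunked(r: range, chunkSize: int) {
  var lo = r.low;
  while lo <= r.high {
    const hi = min(lo + chunkSize - 1, r.high);
    yield lo..hi;
    lo = hi + 1;
  }
}

// Drives a serial for loop; c is bound to one yielded chunk per iteration.
for c in chunked(1..100, 32) do
  writeln("processing ", c);

Making such an iterator usable with forall additionally requires standalone and/or leader/follower overloads, which is exactly what the GPUIterator module provides (Section 4).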

2.2 C Interoperability
Chapel allows programmers to refer to external C functions, variables, and types, thereby enabling Chapel to interoperate with C. Listing 4 shows a code example where a C function (Line 2 in Callee.c) is invoked from a Chapel program (Caller.chpl). Note that an explicit extern declaration of the C function (Line 2 in Caller.chpl) is required so that it can be called from the Chapel side. In this work, we use this feature for invoking hand-coded CUDA/OpenCL programs from the Chapel side.

Listing 4. C interoperability in Chapel

1 // Callee.c (C file)
2 double func(int x) {...}

1 // Caller.chpl (Chapel file)
2 extern proc func(x: int): real;
3 b = func(a);

3 Design
Overall, the basic strategy for enhancing GPU interfaces for PGAS programming is to introduce a new parallel iterator module called GPUIterator. The GPUIterator is supposed to be invoked in a forall loop (e.g., forall i in GPU(...)) to create and distribute tasks across a single or multiple CPU+GPU node(s), which gives more flexibility to explore different implementations. Specifically, the user can easily enable/disable this feature by just wrapping/unwrapping the original loop range/domain in GPU(). In the following, we first discuss the problems of conventional GPU programming in Chapel to motivate the importance of the GPUIterator, and then discuss its detailed design.

3.1 Motivation for Introducing GPUIterator
As discussed in Section 1, as a multi-resolution language, Chapel allows expert GPU programmers to manually develop GPU programs that can be called from a Chapel program. This can be done by invoking CUDA/OpenCL programs using the C interoperability feature discussed in Section 2.2. To understand this, consider the baseline forall implementation that performs vector copy shown in Listing 5. The equivalent Chapel+GPU code is shown in Listing 6. It is worth noting that Chapel enables seamless and intuitive data exchanges between the Chapel side and the native side thanks to this feature. The key difference is that the original forall loop (Lines 4-6 in Listing 5) is replaced with a call (Line 6 in Listing 6) to the native function, which should include the typical host and device operations: device memory allocations, data transfers, and kernel invocations (the C file in Listing 6).

Listing 5. A baseline forall implementation

1 // Chapel file
2 var A: [1..n] real(32);
3 var B: [1..n] real(32);
4 forall i in 1..n {
5   A(i) = B(i);
6 }

Listing 6. A Chapel+GPU program equivalent to Listing 5

1 // Chapel file
2 extern proc GPUfunc(A: [] real(32), B: [] real(32),
                      lo: int, hi: int);
3
4 var A: [1..n] real(32);
5 var B: [1..n] real(32);
6 GPUfunc(A, B, 1, n);

// Separate C file
void GPUfunc(float *A, float *B, int start, int end) {
  // GPU implementation (CUDA/OpenCL)
  // Note: A[0] and B[0] here correspond to
  // A(1) and B(1) in the Chapel part respectively
}

Unfortunately, this source code is not very portable, particularly when the user wants to explore different variants to get higher performance. Since GPUs are not always faster than CPUs (and vice versa), the user has to juggle forall with GPUfunc() depending on the data size and the complexity of the computation (e.g., by commenting each version in/out). One intuitive workaround is to put in an if statement that decides which version (CPU or GPU) to use. However, this raises another problem: it is still not very portable when doing 1) multi-locale CPU+GPU execution, and 2) further advanced workload distributions such as hybrid execution of the CPU and GPU versions, the latter of which could give additional performance improvements for a certain class of applications and platforms.
One may argue that it is still technically possible to do so at the user level. For multi-locale GPU execution, we could write coforall loc in Locales { on loc { GPUfunc(...); } } with appropriate arguments to GPUfunc - i.e., a local portion of a distributed array and a subspace of the original iteration space. For hybrid CPU+GPU execution, one could create c tasks and g tasks per locale that take care of subspaces of the original iteration space, where c and g are the numbers of CPUs and GPUs. However, that is exactly what we want to let the GPUIterator do, to reduce the complexity of the user-level code.
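For concreteness, here is a sketch of our own (not from the paper) of what such hand-written user-level code might look like for the vector-copy example. The use of localSubdomain(), the integer split arithmetic, and the guard on the GPU call are our assumptions rather than part of the GPUIterator design; GPUfunc is assumed to be declared as in Listing 6, and A, B, D, and CPUPercent are assumed to be declared as in the surrounding listings, with D a Block-distributed domain.

// Hand-written multi-locale + hybrid CPU/GPU execution without the
// GPUIterator: one task per locale, each splitting its local indices
// into a CPU part and a GPU part.
coforall loc in Locales do on loc {
  const locDom = D.localSubdomain();            // this locale's indices
  const lo = locDom.low, hi = locDom.high;
  const cpuLen = ((hi - lo + 1) * CPUPercent) / 100;
  const cpuHi = lo + cpuLen - 1;
  cobegin {
    forall i in lo..cpuHi do A(i) = B(i);       // CPU portion
    if cpuHi < hi then                          // GPU portion (if non-empty)
      GPUfunc(A.localSlice(cpuHi+1..hi), B.localSlice(cpuHi+1..hi),
              0, hi - cpuHi - 1);
  }
}

Every detail of this splitting (and its interaction with zippered loops, described below) is exactly the boilerplate the GPUIterator is designed to hide.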

3.2 GPUIterator Specifications
To reiterate, we provide the GPUIterator, which provides the capability of creating and distributing tasks across a single or multiple CPU+GPU node(s). We assume that forall, Chapel's data-parallel loop construct, is used for expressing a parallel loop. Also, since we focus on scenarios where expert GPU programmers prepare highly tuned GPU programs rather than relying on automatic GPU code generators such as [3, 7], we assume the CUDA/OpenCL equivalent of the forall loop is already available.

3.2.1 Single Locale Execution (ranges)
Figure 1 shows an overview of the GPUIterator. Let us first describe the behavior of Chapel's default parallel iterator over a range of 1..n. When the range is invoked in a forall loop, the parallel iterator module divides the range (1 to n) into m chunks, where m is the number of CPUs, so the chunks can be executed by m CPUs in parallel.

Figure 1. Overview of GPUIterator (Single Locale). (The default parallel iterator splits 1..n into a CPU portion only, executed by CPU1..CPUm; the GPU parallel iterator, invoked as forall i in GPU(1..n, GPUWrapper, CPUPercent), splits 1..n into a CPU portion (CPUPercent) executed by CPU1..CPUm and a GPU portion (GPUPercent = 100 - CPUPercent) executed by GPU1..GPUk.)

Our GPUIterator module is essentially an extended version of the default iterator that is aware of GPUs. In summary, the module first divides the iteration space into two portions according to the CPUPercent value specified by the user: one for the CPUs and another for the GPUs. Then, the module further divides the CPU portion into m chunks and the GPU portion into k chunks. More implementation details can be found in Section 4.
Listing 7 illustrates a code example of the GPUIterator with Chapel's range. To use it, in the Chapel file the user 1) imports the GPUIterator module (Line 2), 2) creates a wrapper function (Lines 11-13), which is a callback invoked after the module has computed the GPU portion of the iteration space (lo, hi, n on Line 11) and which eventually invokes the GPU function (Line 12), and 3) then wraps the iteration space using GPU() (Line 15) with the wrapper function GPUWrapper. Note that the last argument (CPUPercent), the percentage of the iteration space that will be executed on the CPU, is optional. Its default value is zero, meaning the whole iteration space goes to the GPU side.

Listing 7. An example GPUIterator program (single locale)

1 // Chapel file
2 use GPUIterator;
3
4 extern proc GPUfunc(A: [] real(32), B: [] real(32),
                      lo: int, hi: int, N: int);
5
6 var A: [1..n] real(32);
7 var B: [1..n] real(32);
8
9 // Users need to prepare a callback function, which is
10 // invoked after the GPUIterator has computed the GPU portion
11 var GPUWrapper = lambda (lo: int, hi: int, n: int) {
12   GPUfunc(A, B, lo, hi, n);
13 };
14 var CPUPercent = 50; // CPUPercent is optional
15 forall i in GPU(1..n, GPUWrapper, CPUPercent) {
16   // CPU code
17   A(i) = B(i);
18 }

// Separate C file
void GPUfunc(float *A, float *B, int start, int end, int n) {
  // GPU implementation (CUDA/OpenCL)
  // Note: A[0] and B[0] here correspond to
  // A(1) and B(1) in the Chapel part respectively
}

3.2.2 Multiple Locales Execution (dmapped domains)
The GPUIterator also accepts distributed domains to enable distributed hybrid execution across CPUs and GPUs. It first divides the domain 1..n into l chunks, where l is the number of locales. Then, as with the single-locale case, it further divides each chunk into the CPU and GPU portions.
Listing 8 shows a code example of the GPUIterator with distributed domains. Compared to Listing 7, there are two key differences. First, like typical distributed parallel forall implementations, a distributed domain needs to be provided instead of a range (Line 6 and Line 16), which preserves portability very nicely. Second, the way distributed arrays are passed to the native function differs from the way local arrays are passed. That is, in the callback function (Line 12), local slices of the distributed arrays (A.localSlice(lo..hi) and B.localSlice(lo..hi)) are passed (localSlice() is a Chapel Arrays API function). Figure 3 illustrates how the local slice of the distributed array (A.localSlice(lo..hi)) corresponds to the native array (float *A).
For example, suppose n = 1024, l = 2 (two locales), and CPUPercent = 0. On Locale 0, lo and hi computed by the GPUIterator are 1 and 512 respectively. Similarly, on Locale 1, lo and hi are 513 and 1024 respectively. Note that the local arrays passed to the GPU function are 0-origin: on Locale 0, A[0] in the GPU function corresponds to A(1) in the Chapel program, and on Locale 1, A[0] in the GPU function corresponds to A(513) in the Chapel program. That is why the third and fourth arguments of GPUfunc are 0 and hi - lo respectively.

Listing 8. An example GPUIterator program (multiple locales)

1 // Chapel file
2 use GPUIterator;
3
4 extern proc GPUfunc(A: [] real(32), B: [] real(32),
                      lo: int, hi: int, N: int);
5
6 var D: domain(1) dmapped Block(boundingBox={1..n}) = {1..n};
7 var A: [D] real(32);
8 var B: [D] real(32);
9
10 // Users need to prepare a callback function, which is
11 // invoked after the GPUIterator has computed the GPU portion
12 var GPUWrapper = lambda (lo: int, hi: int, n: int) {
13   GPUfunc(A.localSlice(lo..hi), B.localSlice(lo..hi), 0, hi-lo, n);
14 };
15 var CPUPercent = 50; // CPUPercent is optional
16 forall i in GPU(D, GPUWrapper, CPUPercent) {
17   // CPU code
18   A(i) = B(i);
19 }

// Separate C file
void GPUfunc(float *A, float *B, int start, int end, int n) {
  // GPU implementation (CUDA/OpenCL)
  // Example Chapel-C array mapping in a distributed setting:
  // suppose n = 1024, the number of locales is 2, and CPUPercent = 0.
  // On Locale 0: A[0] and B[0] here are A(1) and B(1) in the Chapel part
  // On Locale 1: A[0] and B[0] here are A(513) and B(513) in the Chapel part
}

3.2.3 Supporting Zippered forall
The GPUIterator is also designed to support zippered forall loops [4] in addition to standalone (non-zippered) forall loops. A code example is shown in Listing 9. It is worth noting that the GPUIterator must be the first argument of zip so that it can be the leader iterator. Otherwise, GPUs are not used, because follower iterators are not supposed to create any tasks and just follow what the leader iterator generates.

Listing 9. An example GPUIterator program (zippered)

1 forall (i, a, b) in zip(GPU(D, GPUWrapper, CPUPercent), A, B) {
2   // CPU code
3   a = b;
4 }
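Because switching between CPU-only, GPU-only, and hybrid execution now amounts to changing a single argument, a user can experiment with different splits without touching the kernel. The following is a hedged sketch of our own (not part of the GPUIterator API or the paper's experiments) of such a manual sweep; GPUWrapper is assumed to be defined as in Listing 7, and the input size is arbitrary.

// Manually sweeping CPUPercent to find a good CPU/GPU split
// for a given kernel and input size.
use Time;
use GPUIterator;

config const n = 1024 * 1024;
var A, B: [1..n] real(32);

for CPUPercent in (0, 25, 50, 75, 100) {
  var t: Timer;
  t.start();
  forall i in GPU(1..n, GPUWrapper, CPUPercent) {
    A(i) = B(i);                 // CPU portion of the iteration space
  }
  t.stop();
  writeln("CPUPercent = ", CPUPercent, ": ", t.elapsed(), " sec");
}

Section 5 evaluates exactly these kinds of splits (C100%+G0% through C0%+G100%) across several platforms.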

Figure 3. Overview of GPUIterator (Multiple Locales). (On each of Locale 0 and Locale 1, the Chapel array A over 1..n is split into a CPU portion and a GPU portion; the local slice A.localSlice(lo..hi) of the GPU portion is passed to the native side, where it appears as the 0-origin array A[0] to A[hi-lo].)

4 Implementation
We implemented the GPUIterator as an external Chapel module. The module contains several variants of parallel iterators, including standalone (for non-zippered forall loops) and leader/follower (for zippered forall loops) iterators. The actual implementation can be found at [15]. In the following, we first discuss a new locale model that is aware of GPUs (Section 4.1), followed by a discussion of how we implement the GPUIterator.

4.1 GPU Locale Model
In Chapel, a locale is a high-level abstraction of a physical node and may contain architectural descriptions (e.g., the number of processor cores, possibly the number of accelerators, and so on) that are visible from Chapel programs. Also, it may be hierarchical (sublocales). A target system can be virtually regarded as a network of locales, which is described by a locale model. Because there are currently only two homogeneous locale models (flat and NUMA), we created a new locale model that is aware of accelerators (CHPL_LOCALE_MODEL=gpu).
The GPU locale model consists of two sublocales: one for the set of CPUs and another for the set of GPUs (Figure 2). This architectural information is used for creating and distributing tasks onto CPUs and GPUs.

Figure 2. GPU locale model (CHPL_LOCALE_MODEL=gpu). (Each locale, e.g., Locale 0 and Locale 1, contains sublocale 0 with CPU1..CPUm and sublocale 1 with GPU1..GPUk.)

4.2 Detailed GPUIterator Implementation
4.2.1 Single Locale Execution (ranges)
We begin with our standalone parallel iterator implementation, which is used to implement non-zippered forall loops (i.e., forall i in GPU(...)). Listing 10 illustrates the implementation of the standalone GPUIterator. It first computes the ranges for the CPU and GPU portions (Lines 7-14) and then creates two tasks (Line 16): one for sublocale 0 (CPUs) and another for sublocale 1 (GPUs). For sublocale 0, it further creates numTasks tasks to generate parallelism on the CPU side (Line 21), where numTasks is the maximum task concurrency on this locale as stored in the GPU locale model. For sublocale 1, it essentially invokes the callback function given by the user with the computed GPUrange. Note that the range is translated by -r.low to make it 0-origin.

Listing 10. GPUIterator implementation (single locale)

1 iter GPU(param tag: iterKind,
2          r: range(?),
3          GPUWrapper: func(int, int, int, void),
4          CPUPercent: int = 0)
5   where tag == iterKind.standalone {
6
7   const CPUnumElements = (r.length * (CPUPercent*1.0/100.0)): int;
8
9   // CPU portion
10  const CPUhi = (r.low + CPUnumElements - 1);
11  const CPUrange = r.low..CPUhi;
12  // GPU portion
13  const GPUlo = CPUhi + 1;
14  const GPUrange = GPUlo..r.high;
15
16  coforall subloc in 0..1 {
17    if (subloc == 0) {
18      // sublocale 0: CPUs
19      const numTasks = here.getChild(subloc).maxTaskPar;
20      // create "numTasks" tasks
21      coforall tid in 0..#numTasks {
22        // compute a subrange of CPUrange owned by each task
23        const myIters = computeChunk(CPUrange, tid, numTasks);
24        for i in myIters do
25          yield i;
26      }
27    } else if (subloc == 1) {
28      // sublocale 1: GPUs
29      // invoke the GPUWrapper
30      GPUWrapper(GPUrange.translate(-r.low).first,
31                 GPUrange.translate(-r.low).last,
32                 GPUrange.length);
33    }
34  }
35 }

Also, the current implementation only supports a single GPU within a node, but it is trivial to support multiple GPUs. The implementation of the leader iterator used for zippered forall loops is essentially the same as the standalone implementation.
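The helper computeChunk used on Line 23 of Listing 10 is not shown there. The following is one possible implementation of our own, assuming a simple blocked partitioning; the actual module's helper may differ.

// Block-partition range r into numTasks contiguous chunks and
// return the tid-th chunk (0-based task id).
proc computeChunk(r: range, tid: int, numTasks: int): range {
  const len = r.length;
  const base = len / numTasks;
  const rem = len % numTasks;
  // the first `rem` tasks get one extra element each
  const mylen = base + (if tid < rem then 1 else 0);
  const start = r.low + tid * base + min(tid, rem);
  return start .. start + mylen - 1;
}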
4.2.2 Supporting Distributed Domains
Supporting distributed domains can be done on top of the single-locale version above. Conceptually, as in Figure 3, the iterator first divides the domain 1..n into l chunks, where l is the number of locales. Then, as with the single-locale case, it further divides each chunk into the CPU and GPU portions. Listing 11 shows the detailed implementation. In the current implementation, we only support Block-distributed domains.

Listing 11. GPUIterator implementation (multiple locales)

1 iter GPU(param tag: iterKind,
2          D: domain,
3          GPUWrapper: func(int, int, int, void),
4          CPUPercent: int = 0)
5   where tag == iterKind.standalone
6     && isRectangularDom(D)
7     && D.dist.type <= Block {
8
9   var locDoms = D.locDoms;
10
11  // Per each locale, create one task
12  coforall locDom in locDoms {
13    on locDom {
14      // r is a range that needs
15      // to be processed on this locale
16      const r = locDom.myBlock;
17
18      // the same as the single locale implementation
19      ...
20    }
21  }
22 }

5 Performance Evaluation
Purpose: The goal of this performance evaluation is to validate our GPUIterator implementation on different CPU+GPU platforms and to conduct comparative performance evaluations across Chapel's existing forall version, the GPU-Only version that just invokes a CUDA/OpenCL program, and the GPUIterator version.
Machine: We present performance results on three platforms: two servers and one laptop. The first platform is the DAVinCI cluster at Rice University [14], consisting of multiple Intel Xeon CPU + NVIDIA Tesla M2050 (Fermi) nodes connected via QDR InfiniBand; each node has a 12-core Intel Xeon X5660 running at 2.82GHz with a total main memory size of 48GB. The second platform is the PowerOmics cluster at Rice University [14], consisting of multiple IBM POWER8 CPU (S822L) + NVIDIA Tesla K80 nodes connected via InfiniBand (this cluster used to have two CPU+GPU nodes; due to a permanent hardware failure, only a single CPU+GPU node is currently available); the platform has two 12-core IBM POWER8 CPUs (3.02GHz) with a total of 256GB of main memory. The third platform is a MacBook Pro with a 6-core Intel Core i7 running at 2.6GHz with 32GB of memory, a built-in Intel UHD Graphics 630 with 1.5GB of memory, and an AMD Radeon Pro 560X with 4GB of memory.
Benchmarks: Table 1 lists the five Chapel benchmarks used in the experiments: vector copy, matrix multiply, stream, blackscholes, and logistic regression. The implementations of these benchmarks can be found at [15]. "Data Size" shows the datasets used for each platform. For "LOC added/modified", we report the lines of code added/modified to implement the hybrid version with the GPUIterator compared to the original forall implementation.

Table 1. Benchmarks used in our evaluation (S1: the Intel Xeon + NVIDIA Tesla M2050 platform, S2: the IBM POWER8 + NVIDIA Tesla K80 platform, S3: the Intel Core i7 + Intel UHD Graphics 630 + AMD Radeon Pro 560X).

Benchmark             | Description                | Data Size                                        | LOC added/modified
Vector Copy           | A Simple Vector Kernel     | S1: n = 200 × 2^20, S2: n = 2^29, S3: n = 2^24   | 6
Stream                | A Simple Vector Kernel     | S1: n = 200 × 2^20, S2: n = 2^29, S3: n = 2^24   | 6
BlackScholes          | The Black-Scholes Equation | S1: n = 200 × 2^20, S2: n = 2^29, S3: n = 2^24   | 6
Logistic Regression   | A Classification Algorithm | S1, S2: f = 2^16, s = 32; S3: f = 2^13, s = 32   | 11
Matrix Multiplication | Matrix-Matrix Multiply     | S1, S2: n = 2^11; S3: n = 2^10                   | 6

Table 2. Breakdown for the GPU-Only execution of blackscholes with a size of n = 2^24.

                    | UHD 630      | Radeon Pro 560X
H2D transfer (sec)  | 1.4 × 10^-5  | 9.8 × 10^-3
Kernel (sec)        | 5.3 × 10^-2  | 4.3 × 10^-3
D2H transfer (sec)  | 2.4 × 10^-5  | 1.8 × 10^-2

Experimental variants: Each benchmark was evaluated by comparing the following variants:
• Original Forall (on CPUs): Implemented in Chapel using a forall with the default parallel iterator (1..n) that is executed on CPUs (CHPL_LOCALE_MODEL=flat).
• GPU-Only (on GPUs): Implemented using a manually written CUDA/OpenCL program that is invoked from Chapel (CHPL_LOCALE_MODEL=flat). Note that there is no forall loop in this variant.
• Hybrid Execution (on CPUs + GPUs): Implemented using a forall with the GPUIterator (GPU(1..n, ...)) and the same CUDA/OpenCL program as the GPU-Only variant, executed on CPU threads and/or a single GPU depending on the value of CPUPercent as a percentage (0 ≤ CPUPercent ≤ 100) (CHPL_LOCALE_MODEL=gpu). A smaller value of CPUPercent indicates that fewer iterations should be executed on the CPU and more iterations should be offloaded onto the GPU.
For all the variants, we used the latest Chapel compiler (1.20.0-pre) as of March 27, 2019 with the --fast option. For the Intel Xeon and Tesla M2050 platform (DAVinCI), we set CHPL_TASKS=qthreads and used the NVIDIA CUDA compiler (nvcc) 7.0.27 with the -O3 -arch sm_20 options for all CUDA variants. For the IBM POWER8 and NVIDIA Tesla K80 platform (PowerOmics), we also set CHPL_TASKS=qthreads and used nvcc 8.0.61 with the -O3 -arch sm_60 options. For the Intel Core i7 and Intel UHD Graphics 630 + AMD Radeon Pro 560X laptop, we set CHPL_TASKS=fifo (even without Chapel, an OpenCL API call gets SIGSEGV when it is invoked from a qthread task) and used Apple LLVM version 10.0.0 with the -O3 option for all OpenCL variants.
The performance was measured in terms of elapsed milliseconds from the start of the parallel computation(s) to their completion. We ran each variant ten times and report the mean value.
For the CPU variants (Original Forall and Hybrid Execution), the number of workers per node is the number of physical CPU cores (12, 24, and 6 for the Intel Xeon, the IBM POWER8, and the Intel Core i7 platforms respectively). For the GPU variants (GPU-Only and Hybrid Execution), input data for kernels are prepared on the Chapel side, then passed to the native side, and eventually passed back to the Chapel side after GPU execution has completed, meaning that the native code performs device memory allocation and data transfers (host-to-device and device-to-host) in addition to kernel invocation. Note that the performance numbers for the GPU variants include these overheads.

5.1 Lines of Code Added/Modified for Using the GPUIterator
Let us first discuss the source code additions and modifications required for using the GPUIterator. As shown in Table 1, additions and modifications at the Chapel source code level are very small (≤ 11 lines; our definition of source code "lines" is based on common usage, and in theory all 11 lines could be combined into a single line of source code). In general, the GPUIterator requires 5 × k + 1 = k + 4 × k + 1 lines, where k is the number of forall loops that need to be wrapped by GPU(): 4 × k lines are for an external GPU function declaration (1 line per forall) plus a wrapper function (typically 3 lines per forall, as shown in Listing 7 and Listing 8), and 1 line is for "use GPUIterator".
Overall, the GPUIterator provides a portable way to explore different CPU/GPU variants by tweaking the CPUPercent parameter with minimal source code changes.
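As a concrete check of this counting against Table 1 (our arithmetic, following the formula above): a benchmark with a single forall (k = 1) needs 1 line for use GPUIterator, 1 line for the extern GPU function declaration, 3 lines for the wrapper function, and 1 line to wrap the loop in GPU(), i.e., 5 × 1 + 1 = 6 lines, matching the LOC column for vector copy, stream, blackscholes, and matrix multiplication; logistic regression wraps k = 2 forall loops, giving 5 × 2 + 1 = 11 lines.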

5.2 Single Locale Performance Numbers
Figures 4, 5, 6, and 7 show speedup values relative to the original forall version on a log scale for the Intel Xeon CPUs with the NVIDIA Tesla M2050 GPU, the IBM POWER8 CPUs with the NVIDIA Tesla K80 GPU, the Intel Core i7 with the Intel UHD Graphics 630, and the Intel Core i7 with the AMD Radeon Pro 560X, respectively. In the figures, GPU Only refers to the GPU execution in which the native CUDA/OpenCL code is just invoked from the Chapel side, and Hybrid means the hybrid execution on CPU+GPU using the GPUIterator. Also, CX%+GY% means that X% of the whole iteration space is executed on the CPU and Y% is executed on the GPU (X+Y = 100%).
How significant is the overhead of the GPUIterator? As shown in Figures 4-7, for all the benchmarks there is no significant performance difference between the C100%+G0% and forall variants (nor between the C0%+G100% and GPU-Only variants), which indicates that the overhead of the GPUIterator is negligible.
How fast are GPUs? For blackscholes, logistic regression, and matrix multiplication, the kernels have enough workload and the GPU variants significantly outperform the original forall. Specifically, the results show speedups of up to 126.0× on the Intel Xeon + NVIDIA Tesla M2050 platform, 145.2× on the IBM POWER8 + NVIDIA Tesla K80 platform, 4.2× on the Intel Core i7 + Intel UHD Graphics 630, and 8.3× on the Intel Core i7 + AMD Radeon Pro 560X.
Why are CPUs faster in some cases? For vector copy and stream, the original forall is the fastest because the kernels are too small to benefit from GPUs. The main bottleneck is obviously the data transfers between the CPU and the GPU. Specifically, we further analyzed the stream case on the IBM POWER8 platform to understand why the forall is the fastest. In short, this is due to the overhead of host-to-device (H2D) and device-to-host (D2H) transfers. There are three Chapel arrays (A, B, C); suppose the array size is N. The CUDA variant allocates dA, dB, and dC arrays on device memory, each sized to hold the GPU portion of the iteration space (N × (100 - CPUPercent)/100 elements). It performs H2D transfers for the input arrays B and C, and a D2H transfer for the output array A after the kernel has completed. The GPUIterator variant with CPUPercent=0 includes all the data transfers and takes 1.7 sec, which happens to be larger than the CPU-Only time of 0.085 sec. However, if we exclude the data transfer time, the time for running the kernel on the GPU (with CPUPercent=0) is only 0.009 sec, which is about 9.4× faster than the CPU.
When is hybrid execution beneficial? As shown in Figure 6, for blackscholes, C50%+G50% exhibits the best performance on the Intel Core i7 + Intel UHD Graphics 630. We further analyzed why the half-and-half execution is faster than the forall and GPU-Only variants. The primary reason is that the communication costs between the Core i7 CPU and the UHD GPU are relatively small compared to those of the discrete GPU (Radeon Pro). As shown in Table 2, the H2D/D2H costs for the UHD are a few orders of magnitude smaller than those for the Radeon Pro. Secondly, the cost of the CPU portion and that of the GPU portion (the kernel + H2D/D2H transfers + device memory allocation) are very close, thereby exploiting the full capability of both the CPU and the GPU.

5.3 Multiple Locales Performance Numbers
Figure 8 shows strong-scaling speedup values for blackscholes relative to the original forall version on a single node of the Intel Xeon CPUs with the NVIDIA Tesla M2050 GPU. While the original forall variant has good scalability, the GPUIterator variants give further performance improvements due to GPU execution. The results show a speedup of up to 14.9× with 100% GPU execution on four nodes. Also, as with the single-locale cases, the overhead of the GPUIterator is negligible.
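To put the stream transfer overhead in perspective (our back-of-the-envelope estimate, assuming 4-byte real(32) elements as in the vector-copy listings): with n = 2^29 on the POWER8 platform (Table 1) and CPUPercent = 0, each array is about 2 GiB, so the two H2D transfers (B and C) plus the D2H transfer of A move roughly 6 GiB across the PCIe bus, which is consistent with the transfer-dominated 1.7 sec total versus the 0.009 sec kernel-only time reported above.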

Figure 4. Performance improvements (log scale) over the original forall (12 workers) on the Intel Xeon + NVIDIA Tesla M2050 platform. (Bar chart of speedups for the Hybrid variants C100%+G0%, C75%+G25%, C50%+G50%, C25%+G75%, C0%+G100% and the GPU-Only (CUDA) variant, for vector copy, stream, blackscholes, logistic regression, and matrix multiplication.)

Figure 5. Performance improvements (log scale) over the original forall (24 workers) on the IBM POWER8 + NVIDIA Tesla K80 platform. (Same variants and benchmarks as Figure 4.)

Figure 6. Performance improvements (log scale) over the original forall (6 workers) on the Intel Core i7 + Intel UHD Graphics 630 platform. (Same variants and benchmarks as Figure 4, with OpenCL as the GPU-Only variant.)

Figure 7. Performance improvements (log scale) over the original forall (6 workers) on the Intel Core i7 + AMD Radeon Pro 560X platform. (Same variants and benchmarks as Figure 6.)

Figure 8. Strong scaling speedups over the original forall on a single node of the Intel Xeon + NVIDIA Tesla M2050 platform (1, 2, and 4 nodes, 12 workers/node, n = 2^25, blackscholes, CHPL_COMM=gasnet). (Bar chart of speedups for the CPU-only forall and the Hybrid variants C100%+G0% through C0%+G100% on 1, 2, and 4 nodes.)

6 Related Work
Chapel is designed to express parallelism as part of the language rather than including it as libraries or language extensions such as compiler directives or annotations. Therefore it contains constructs to express parallelism and locality as first-class citizens of the language. This enables users to express parallelism for a wide range of platforms without the need for code specialization, which helps programmers create portable applications and thereby improves productivity.
Sidelnik et al. [3] explore the use of GPUs from Chapel applications using the forall work-sharing construct. They provide a new GPU distribution, based on Chapel's user-defined distributions, which when used with forall offloads the work onto the GPU. Based on the distribution provided, the Chapel compiler analyzes the forall loop and generates either C code if the target is a CPU or C + CUDA code if the target is a GPU. However, if the user needs any GPU-specific feature such as shared memory, constant cache memory, or thread-block barriers, they need to invoke it explicitly from the Chapel code, thereby exposing many low-level GPU constructs to the programmer. This work does not support multi-node GPUs or multiple GPUs on a single node.
Chu et al. [7] generate OpenCL code instead of CUDA code so that Chapel programs can use more than NVIDIA GPUs. They use Chapel's hierarchical locales to enable GPU computation by exposing a new GPU locale. They also propose extensions to expose various GPU features such as local memory, grid size, and so on. Their experiments do not show how well the approach scales on multi-node clusters.

7 Conclusions
In this paper, we implemented the GPUIterator, which is a portable programming interface that supports 1) GPU-only execution, 2) execution on multiple CPU+GPU nodes, and 3) hybrid execution of a Chapel forall loop, assuming hand-tuned GPU kernels are available. The performance evaluation was conducted on a wide range of CPU+GPU platforms (an Intel CPU + an NVIDIA GPU, an IBM POWER8 CPU + an NVIDIA GPU, an Intel CPU + an Intel GPU, and an Intel CPU + an AMD GPU), and the results show that the GPUIterator allows Chapel programmers to explore different CPU/GPU configurations for achieving high performance with minimal source code changes. In future work, we plan to further explore the possibility of hybrid execution on recent platforms with faster CPU-GPU interconnects (e.g., AMD's APUs and NVIDIA's NVLink), and also plan to add the capability of automatically deciding the best CPUPercent.

References
[1] Bradford L. Chamberlain. 2007. Multiresolution Languages for Portable yet Efficient Parallel Programming. https://chapel-lang.org/papers/DARPA-RFI-Chapel-web.pdf.
[2] Bradford L. Chamberlain. 2011. Chapel (Cray Inc. HPCS Language). In Encyclopedia of Parallel Computing. 249-256. https://doi.org/10.1007/978-0-387-09766-4_54
[3] Albert Sidelnik et al. 2012. Performance Portability with the Chapel Language (IPDPS '12). IEEE Computer Society, Washington, DC, USA, 582-594. https://doi.org/10.1109/IPDPS.2012.60
[4] Bradford L. Chamberlain et al. 2011. User-defined Parallel Zippered Iterators in Chapel (PGAS '11).
[5] Ian J. Bertolacci et al. 2015. Parameterized Diamond Tiling for Stencil Computations with Chapel Parallel Iterators (ICS '15). ACM, New York, NY, USA, 197-206. https://doi.org/10.1145/2751205.2751226
[6] M. Joyner et al. 2006. Iterators in Chapel (IPDPS '06). https://doi.org/10.1109/IPDPS.2006.1639499
[7] Michael L. Chu et al. 2017. GPGPU Support in Chapel with the Radeon Open Compute Platform (Extended Abstract) (CHIUW '17).
[8] Philippe Charles et al. 2005. X10: An Object-Oriented Approach to Non-Uniform Cluster Computing (OOPSLA '05). ACM, New York, NY, USA, 519-538.
[9] Sanjay Chatterjee et al. 2013. Integrating Asynchronous Task Parallelism with MPI (IPDPS '13). IEEE Computer Society, Washington, DC, USA, 712-725. https://doi.org/10.1109/IPDPS.2013.78
[10] Sreepathi Pai et al. 2012. Fast and Efficient Automatic Memory Management for GPUs Using Compiler-Assisted Runtime Coherence Scheme (PACT '12). 33-42.
[11] William Gropp et al. 1994. Using MPI: Portable Parallel Programming with the Message-Passing Interface. MIT Press, Cambridge, MA.
[12] William W. Carlson et al. 1999. Introduction to UPC and Language Specification. Technical Report.
[13] Yili Zheng et al. 2014. UPC++: A PGAS Extension for C++ (IPDPS '14). 1105-1114. https://doi.org/10.1109/IPDPS.2014.115
[14] The Center for Research Computing (CRC). 2019. Research Computing. https://docs.rice.edu/confluence/display/CD/Research+Computing.
[15] Akihiro Hayashi. 2019. GPUIterator Implementation. https://github.com/ahayashi/chapel-gpu.
[16] Robert W. Numrich and John Reid. 1998. Co-array Fortran for Parallel Programming. SIGPLAN Fortran Forum 17, 2 (Aug. 1998), 1-31. https://doi.org/10.1145/289918.289920
[17] The TOP500 Project. 2018. TOP500 Lists. https://www.top500.org/lists/top500/.