2012 DOE Exascale Research Conference - Working Session Position Paper: Track 2-2 (Productivity Tools and Programming Models)

Moving Heterogeneous GPU Computing into the Mainstream with Directive-Based, High-Level Programming Models (Position Paper)

DOE X-Stack Vancouver Project: http://ft.ornl.gov/trac/vancouver
Seyong Lee and Jeffrey S. Vetter
Oak Ridge National Laboratory, Oak Ridge, TN 37831 USA
Email: {lees2,vetter}@ornl.gov

I. INTRODUCTION

Graphics Processing Unit (GPU)-based heterogeneous systems have emerged as promising alternatives for high-performance computing. However, their programming complexity poses significant challenges for developers. Recently, several directive-based GPU programming models have been proposed by both academia and industry ([1], [2], [3], [4], [5], [6]) to provide better productivity than existing ones, such as CUDA and OpenCL. Directive-based models provide different levels of abstraction, and the programming effort required to conform to each model and to optimize performance also varies. Understanding the differences among these new models will give valuable insight into the general applicability and performance of directive-based GPU programming models. This position paper evaluates the existing high-level GPU programming models to identify current issues and future directions that stimulate further discussion and research to push the state of the art in productive GPU programming models.

II. DIRECTIVE-BASED, HIGH-LEVEL GPU PROGRAMMING MODELS

General directive-based programming systems consist of directives, library routines, and designated compilers. In directive-based GPU programming models, a set of directives is used to augment the information available to the designated compilers, such as guidance on the mapping of loops onto the GPU and data-sharing rules. The most important advantage of directive-based GPU programming models is that they provide a very high-level abstraction of GPU programming, since the designated compiler hides most of the complex details specific to the underlying GPU architectures. Another benefit is that directive approaches enable incremental parallelization of applications, like OpenMP: a user can specify, region by region, the parts of a host program to be offloaded to a GPU device, and the compiler automatically creates the corresponding host+device programs.

Several directive-based GPU programming models have been proposed, such as OpenMPC [1], hiCUDA [2], PGI Accelerator [3], HMPP [4], R-Stream [5], and OpenACC [6].

To understand the level of abstraction that each directive-based model offers, Table I summarizes the type of information that GPU directives can provide. The table implies that R-Stream offers the highest abstraction and hiCUDA the lowest among the compared models, since in the R-Stream model the compiler covers most features implicitly without the programmer's involvement, while programmers using hiCUDA must control most of the features explicitly. However, a lower level of abstraction is not always bad, since it may allow enough control over various optimizations and over features specific to the underlying GPUs to achieve optimal performance. Moreover, the actual programming effort required to use each model may not be directly related to the level of abstraction the model offers, and the high-level abstraction of a model sometimes limits its application coverage.

III. CURRENT ISSUES AND FUTURE DIRECTIONS

A. Functionality

The existing models target structured blocks with parallel loops for offloading to a GPU device, but there are several limits on their applicability. First, most of the existing implementations work only on loops, not on general structured blocks such as OpenMP's parallel regions. Second, most of them either do not provide reduction clauses or can handle only scalar reductions. Third, none of the existing models supports critical sections or atomic operations (OpenMPC accepts OpenMP critical sections, but only if they have reduction patterns). Fourth, all existing models support only coarse-grained synchronization. Fifth, most of them do not allow function calls in mappable regions, except for simple functions that compilers can inline automatically. Sixth, all work on array-based computations but support pointer operations in very limited ways. Last, some compilers place limits on the complexity of mappable regions, such as the shape and depth of nested loops and memory allocation patterns.

Because each model has different limits on its applicability, the programming effort required to conform to each model also differs, incurring portability problems. OpenACC is the first step toward a single standard for directive-based GPU programming, but many technical details are still left unspecified. Therefore, in-depth evaluation of, and research on, the existing models will be essential to address these issues.

The submitted manuscript has been authored by a contractor of the U.S. Government under Contract No. DE-AC05-00OR22725. Accordingly, the U.S. Government retains a non-exclusive, royalty-free license to publish or reproduce the published form of this contribution, or allow others to do so, for U.S. Government purposes.

TABLE I
FEATURE TABLE, WHICH SHOWS THE TYPE OF INFORMATION THAT GPU DIRECTIVES CAN PROVIDE. IN THE TABLE, explicit MEANS THAT DIRECTIVES EXIST TO CONTROL THE FEATURE EXPLICITLY, implicit INDICATES THAT A COMPILER IMPLICITLY HANDLES THE FEATURE, indirect MEANS THAT PROGRAMMERS CAN INDIRECTLY GUIDE THE WAY FOR THE COMPILER TO HANDLE THE FEATURE USING DIRECTIVES, AND imp-dep MEANS THAT THE FEATURE IS IMPLEMENTATION-DEPENDENT.

Features                              | OpenMPC            | hiCUDA   | PGI                | HMPP               | R-Stream           | OpenACC
--------------------------------------|--------------------|----------|--------------------|--------------------|--------------------|--------------------
Code regions to be offloaded          | structured blocks  | structured blocks | loops     | loops              | loops              | loops
Loop mapping                          | parallel           | parallel | parallel, vector   | parallel           | parallel           | parallel, vector
Data management:                      |                    |          |                    |                    |                    |
  GPU memory allocation and free      | explicit, implicit | explicit | explicit, implicit | explicit, implicit | implicit           | explicit, implicit
  Data movement between CPU and GPU   | explicit, implicit | explicit | explicit, implicit | explicit, implicit | implicit           | explicit, implicit
Compiler optimizations:               |                    |          |                    |                    |                    |
  Loop transformations                | explicit           |          | implicit           | explicit           | implicit           | imp-dep
  Data management optimizations       | explicit, implicit | implicit | explicit, implicit | explicit, implicit | implicit           | imp-dep
GPU-specific features:                |                    |          |                    |                    |                    |
  Thread batching                     | explicit, implicit | explicit | indirect, implicit | explicit, implicit | explicit, implicit | indirect, implicit
  Utilization of special memories     | explicit, implicit | explicit | indirect, implicit | explicit           | implicit           | indirect, imp-dep
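The table's explicit/implicit distinction for data management can be made concrete with a sketch of the two ends of the spectrum. This is a hypothetical illustration: the hiCUDA-style directives are shown only as comments (their exact syntax is defined in [2]), the `acc` directive follows OpenACC [6], and a plain C compiler ignores the pragma, so both functions run as host-only code.

```c
#define N 8

/* Explicit end of the spectrum (hiCUDA-style): the programmer spells
 * out allocation, copy-in, copy-out, and free around the kernel.
 * The directives below are a hypothetical sketch, shown as comments. */
void scale_explicit(float a[N], float s) {
    /* #pragma hicuda global alloc a[*] copyin    (sketch) */
    for (int i = 0; i < N; i++)
        a[i] *= s;
    /* #pragma hicuda global copyout a[*]         (sketch) */
    /* #pragma hicuda global free a               (sketch) */
}

/* Implicit/high-level end: one clause declares the intent ("copy the
 * array in and out"), and the designated compiler inserts allocation
 * and transfers itself. In a fully implicit model such as R-Stream,
 * even this clause would be inferred. */
void scale_directive(float a[N], float s) {
    #pragma acc parallel loop copy(a[0:N])
    for (int i = 0; i < N; i++)
        a[i] *= s;
}

/* Host-side check: both styles compute the same result. */
float demo(void) {
    float a[N];
    for (int i = 0; i < N; i++) a[i] = 2.0f;
    scale_directive(a, 3.0f);   /* a[i] == 6.0f */
    scale_explicit(a, 0.5f);    /* a[i] == 3.0f */
    return a[0];
}
```

The trade-off discussed above is visible here: the explicit style exposes every device action for tuning, while the directive style keeps the host loop untouched but leaves transfer decisions to the compiler.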

B. Scalability

All existing models assume host+accelerator systems in which one or a small number of GPUs are attached to a host CPU. However, these models will be applicable only at small scale. To program systems consisting of clusters of GPUs, hybrid approaches such as MPI + X will be needed. To enable seamless porting to large-scale systems, research on unified, directive-based programming models that integrate data distribution, parallelization, synchronization, and other features will be needed.

C. Tunability

Tuning GPU programs is known to be difficult, due to the complex relationship between programs and performance. Directive-based GPU programming models may enable an easy tuning environment that assists users in generating GPU programs in many optimization variants without detailed knowledge of the complex GPU programming and memory models. However, most of the existing models do not provide enough control over various compiler optimizations and GPU-specific features, posing a limit on their tunability. To achieve the best performance on some applications, research on alternative, but still high-level, interfaces to express the GPU-specific programming model and memory model will be necessary.

D. Debuggability

The high-level abstraction offered by directive models puts significant burdens on debugging. The existing models do not give users an idea of how the translation works, and some implementations either do not always observe directives inserted by programmers or translate them incorrectly if they conflict with internal compiler analyses, adding more dimensions to the complex debugging space. For debugging purposes, all existing models can generate CUDA code as output, but most existing compilers do so by unparsing a low-level intermediate representation (IR), which contains implementation-specific code structures and thus is very difficult to understand. More research on traceability mechanisms and high-level, IR-based compiler-transformation techniques will be needed for better debuggability.

IV. SUMMARY

This paper evaluates directive-based GPU programming models. The directive-based models provide a very high-level abstraction of GPU programming, since the designated compilers hide most of the complex details specific to the underlying GPU architectures. The evaluation of the existing directive-based GPU programming models shows that these models provide different levels of abstraction, and the programming effort required to conform to each model and to optimize performance also varies. The evaluation also reveals that the current high-level GPU programming models may need to be extended further to support an alternative way to express the GPU-specific programming model and memory model, in order to achieve the best performance in some applications.

ACKNOWLEDGMENT

This position paper was developed with funding from the Vancouver project, X-Stack program, U.S. Department of Energy Office of Advanced Scientific Computing Research.

REFERENCES

[1] S. Lee and R. Eigenmann, "OpenMPC: Extended OpenMP programming and tuning for GPUs," in SC'10: Proceedings of the 2010 ACM/IEEE Conference on Supercomputing. IEEE Press, 2010.
[2] T. D. Han and T. S. Abdelrahman, "hiCUDA: High-level GPGPU programming," IEEE Transactions on Parallel and Distributed Systems, vol. 22, pp. 78–90, 2011.
[3] The Portland Group, "PGI Fortran and C Accelerator Programming Model." [Online]. Available: http://www.pgroup.com/resources/accel.htm
[4] "HMPP Workbench, a directive-based compiler for hybrid computing." [Online]. Available: http://www.caps-entreprise.com/hmpp.html
[5] "R-Stream, a high-level compiler for embedded signal/knowledge processing and HPC algorithms." [Online]. Available: https://www.reservoir.com/rstream
[6] "OpenACC: Directives for Accelerators." [Online]. Available: http://www.openacc-standard.org