Directive-Based, High-Level GPU Programming Models

Total Page:16

File Type:pdf, Size:1020Kb

Directive-Based, High-Level GPU Programming Models 2012 DOE Exascale Research Conference - Working Session Position Paper: Track 2-2 (Productivity Tools and Programming Models) Moving Heterogeneous GPU Computing into the Mainstream with Directive-Based, High-Level Programming Models (Position Paper) DOE X-Stack Vancouver Project http://ft.ornl.gov/trac/vancouver Seyong Lee and Jeffrey S. Vetter Oak Ridge National Laboratory, Oak Ridge, TN 37831 USA Email: flees2,[email protected] I. INTRODUCTION To understand the level of the abstraction that each directive- Graphics Processing Units (GPUs)-based heterogeneous based model gives, TableI summarizes the type of information systems have emerged as promising alternatives for high- that GPU directives can provide. The table implies that R- performance computing. However, their programming com- Stream offers the highest abstraction, and hiCUDA provides plexity poses significant challenges for developers. Recently, the lowest abstraction among the compared models, since several directive-based, GPU programming models have been in the R-Stream model, the compiler covers most features proposed by both academia and industry ( [1], [2], [3], [4], [5], implicitly without a programmer’s involvement, while pro- [6]) to provide better productivity than existing ones, such as grammers using hiCUDA should control most of the features CUDA and OpenCL. Directive-based models provide different explicitly. However, lower level of abstraction is not always levels of abstraction, and programming efforts required to bad, since low level of abstraction may allow enough control conform to their models and optimize the performance also over various optimizations and the features specific to the vary. Understanding the differences in these new models underlying GPUs to achieve optimal performance. Moreover, will give valuable insights on the general applicability and actual programming efforts required to use each model may performance of the directive-based GPU programming models. not be directly related to the level of abstraction that the model This position paper evaluates the existing high-level GPU offers, and high-level abstraction of a model sometimes puts programming models to identify current issues and future limits on its application coverage. directions that stimulate further discussion and research to III. CURRENT ISSUES AND FUTURE DIRECTIONS push the state of the art in productive GPU programming A. Functionality models. The existing models target structured blocks with parallel II. DIRECTIVE-BASED,HIGH-LEVEL GPU loops for offloading to a GPU device, but there exist several PROGRAMMING MODELS limits on their applicability; first, most of the existing imple- General directive-based programming systems consist of mentations work only on loops, but not on general structured directives, library routines, and designated compilers. In the blocks, such as OpenMP’s parallel regions. Second, most of directive-based GPU programming models, a set of directives them either do not provide reduction clauses or can handle are used to augment information available to the designated only scalar reductions. Third, none of existing models sup- compilers, such as guidance on mapping of loops onto GPU ports critical sections or atomic operations (OpenMPC accepts and data sharing rules. The most important advantage of using OpenMP critical sections, but only if they have reduction directive-based GPU programming models is that they provide patterns.). Fourth, all existing models support only course- very high-level abstraction on GPU programming, since the grained synchronizations. Fifth, most of them do not allow designated compiler hides most of the complex details specific function calls in mappable regions, except for simple functions to the underlying GPU architectures. Another benefit is that that compilers can inline automatically. Sixth, all work on the directive approaches provide incremental parallelization array-based computations but support pointer operations in of applications, like OpenMP, so that a user can specify the very limited ways. Last, some compilers have limits on the regions of a host program to be offloaded to a GPU device complexity of mappable regions, such as the shape and depth in an incremental way, and then the compiler automatically of nested loops and memory allocation patterns. creates corresponding host+device programs. Because each model has different limits on its applicability, Several directive-based, GPU programming models have programming efforts required to conform to their models also been proposed, such as OpenMPC [1], hiCUDA [2], PGI differ in each model, incurring portability problems. OpenACC Accelerator [3], HMPP [4], R-Stream [5], OpenACC [6], etc. is the first step toward a single standard on directive-based The submitted manuscript has been authored by a contractor of the U.S. Government under Contract No. DE-AC05-00OR22725. Accordingly, the U.S. Government retains a non-exclusive, royalty-free license to publish or reproduce the published form of this contribution, or allow others to do so, for U.S. Government purposes. TABLE I FEATURE TABLE, WHICH SHOWS THE TYPE OF INFORMATION THAT GPU DIRECTIVES CAN PROVIDE.IN THE TABLE, explicit MEANS THAT DIRECTIVES EXIST TO CONTROL THE FEATURE EXPLICITLY, implicit INDICATES THAT A COMPILER IMPLICITLY HANDLES THE FEATURE, indirect IS THAT PROGRAMMERS CAN INDIRECTLY GUIDE THE WAY FOR THE COMPILER TO HANDLE THE FEATURE USING DIRECTIVES, AND imp-dep MEANS THAT THE FEATURE IS IMPLEMENTATION-DEPENDENT. Features OpenMPC hiCUDA PGI HMPP R-Stream OpenACC Code regions to be offloaded structured structured loops loops loops loops blocks blocks Loop mapping parallel parallel parallel parallel parallel parallel vector vector GPU memory allocation and free explicit explicit explicit explicit implicit explicit Data management implicit implicit implicit implicit Data movement between CPU and GPU explicit explicit explicit explicit implicit explicit implicit implicit implicit implicit Loop transformations explicit implicit explicit implicit imp-dep Compiler optimizations Data management optimizations explicit implicit explicit explicit implicit imp-dep implicit implicit implicit Thread batching explicit explicit indirect explicit explicit indirect GPU-specific features implicit implicit implicit implicit implicit Utilization of special memories explicit explicit indirect explicit implicit indirect implicit implicit imp-dep GPU programming, but many technical details are still left by unparsing low-level intermediate representation (IR), which unspecified. Therefore, in-depth evaluation and research on contain implementation-specific code structures and thus are existing models will be essential to address these issues. very difficult to understand. More research on traceability mechanisms and high-level IR-based compiler-transformation B. Scalability techniques will be needed for better debuggability. All existing models assume host+accelerator systems where one or small number of GPUs are attached to the host CPU. IV. SUMMARY However, these models will be applicable only to small scale. This paper evaluates directive-based GPU programming To program systems consisting of clusters of GPUs, hybrid ap- models. The directive-based models provide very high-level proaches such as MPI + X will be needed. To enable seamless abstraction on GPU programming, since the designated com- porting to large scale systems, research on unified, directive- pilers hide most of the complex details specific to the un- based programming models, which integrate data distribution, derlying GPU architectures. The evaluation of the existing parallelization, synchronization, and other additional features, directive-based GPU programming models shows that these will be needed. models provide different levels of abstraction, and program- C. Tunability ming efforts required to conform to their models and optimize the performance also vary. The evaluation also reveals that the Tuning GPU programs is known to be difficult, due to current high-level GPU programming models may need to be the complex relationship between programs and performance. further extended to support an alternative way to express GPU- Directive-based GPU programming models may enable an specific programming model and memory model to achieve the easy tuning environment that assists users in generating GPU best performance in some applications. programs in many optimization variants without detailed knowledge of the complex GPU programming and memory ACKNOWLEDGMENT models. However, most of the existing models do not provide This position paper was developed with funding from the enough control over various compiler-optimizations and GPU- Vancouver project, X-Stack program, U.S. Department of specific features, posing a limit on their tunability. To achieve Energy Office of Advanced Scientific Computing Research. the best performance on some applications, research on alter- native, but still high-level interface to express GPU-specific REFERENCES programming model and memory model will be necessary. [1] S. Lee and R. Eigenmann, “OpenMPC: Extended OpenMP programming and tuning for GPUs,” in SC’10: Proceedings of the 2010 ACM/IEEE D. Debuggability conference on Supercomputing. IEEE press, 2010. [2] T. D. Han and T. S. Abdelrahman, “hicuda: High-level gpgpu program- The high-level abstraction offered by directive models puts ming,” IEEE Transactions on Parallel and Distributed Systems, vol. 22, significant burdens on debugging. The existing models do pp. 78–90, 2011. not
Recommended publications
  • PRE PROCESSOR DIRECTIVES in –C LANGUAGE. 2401 – Course
    PRE PROCESSOR DIRECTIVES IN –C LANGUAGE. 2401 – Course Notes. 1 The C preprocessor is a macro processor that is used automatically by the C compiler to transform programmer defined programs before actual compilation takes place. It is called a macro processor because it allows the user to define macros, which are short abbreviations for longer constructs. This functionality allows for some very useful and practical tools to design our programs. To include header files(Header files are files with pre defined function definitions, user defined data types, declarations that can be included.) To include macro expansions. We can take random fragments of C code and abbreviate them into smaller definitions namely macros, and once they are defined and included via header files or directly onto the program itself it(the user defined definition) can be used everywhere in the program. In other words it works like a blank tile in the game scrabble, where if one player possesses a blank tile and uses it to state a particular letter for a given word he chose to play then that blank piece is used as that letter throughout the game. It is pretty similar to that rule. Special pre processing directives can be used to, include or exclude parts of the program according to various conditions. To sum up, preprocessing directives occurs before program compilation. So, it can be also be referred to as pre compiled fragments of code. Some possible actions are the inclusions of other files in the file being compiled, definitions of symbolic constants and macros and conditional compilation of program code and conditional execution of preprocessor directives.
    [Show full text]
  • C and C++ Preprocessor Directives #Include #Define Macros Inline
    MODULE 10 PREPROCESSOR DIRECTIVES My Training Period: hours Abilities ▪ Able to understand and use #include. ▪ Able to understand and use #define. ▪ Able to understand and use macros and inline functions. ▪ Able to understand and use the conditional compilation – #if, #endif, #ifdef, #else, #ifndef and #undef. ▪ Able to understand and use #error, #pragma, # and ## operators and #line. ▪ Able to display error messages during conditional compilation. ▪ Able to understand and use assertions. 10.1 Introduction - For C/C++ preprocessor, preprocessing occurs before a program is compiled. A complete process involved during the preprocessing, compiling and linking can be read in Module W. - Some possible actions are: ▪ Inclusion of other files in the file being compiled. ▪ Definition of symbolic constants and macros. ▪ Conditional compilation of program code or code segment. ▪ Conditional execution of preprocessor directives. - All preprocessor directives begin with #, and only white space characters may appear before a preprocessor directive on a line. 10.2 The #include Preprocessor Directive - The #include directive causes copy of a specified file to be included in place of the directive. The two forms of the #include directive are: //searches for header files and replaces this directive //with the entire contents of the header file here #include <header_file> - Or #include "header_file" e.g. #include <stdio.h> #include "myheader.h" - If the file name is enclosed in double quotes, the preprocessor searches in the same directory (local) as the source file being compiled for the file to be included, if not found then looks in the subdirectory associated with standard header files as specified using angle bracket. - This method is normally used to include user or programmer-defined header files.
    [Show full text]
  • Section “Common Predefined Macros” in the C Preprocessor
    The C Preprocessor For gcc version 12.0.0 (pre-release) (GCC) Richard M. Stallman, Zachary Weinberg Copyright c 1987-2021 Free Software Foundation, Inc. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.3 or any later version published by the Free Software Foundation. A copy of the license is included in the section entitled \GNU Free Documentation License". This manual contains no Invariant Sections. The Front-Cover Texts are (a) (see below), and the Back-Cover Texts are (b) (see below). (a) The FSF's Front-Cover Text is: A GNU Manual (b) The FSF's Back-Cover Text is: You have freedom to copy and modify this GNU Manual, like GNU software. Copies published by the Free Software Foundation raise funds for GNU development. i Table of Contents 1 Overview :::::::::::::::::::::::::::::::::::::::: 1 1.1 Character sets:::::::::::::::::::::::::::::::::::::::::::::::::: 1 1.2 Initial processing ::::::::::::::::::::::::::::::::::::::::::::::: 2 1.3 Tokenization ::::::::::::::::::::::::::::::::::::::::::::::::::: 4 1.4 The preprocessing language :::::::::::::::::::::::::::::::::::: 6 2 Header Files::::::::::::::::::::::::::::::::::::: 7 2.1 Include Syntax ::::::::::::::::::::::::::::::::::::::::::::::::: 7 2.2 Include Operation :::::::::::::::::::::::::::::::::::::::::::::: 8 2.3 Search Path :::::::::::::::::::::::::::::::::::::::::::::::::::: 9 2.4 Once-Only Headers::::::::::::::::::::::::::::::::::::::::::::: 9 2.5 Alternatives to Wrapper #ifndef ::::::::::::::::::::::::::::::
    [Show full text]
  • GPU Directives
    OPENACC® DIRECTIVES FOR ACCELERATORS NVIDIA GPUs Reaching Broader Set of Developers 1,000,000’s CAE CFD Finance Rendering Universities Data Analytics Supercomputing Centers Life Sciences 100,000’s Oil & Gas Defense Weather Research Climate Early Adopters Plasma Physics 2004 Present Time 2 3 Ways to Accelerate Applications with GPUs Applications Programming Libraries Directives Languages “Drop-in” Quickly Accelerate Maximum Acceleration Existing Applications Performance 3 Directives: Add A Few Lines of Code OpenMP OpenACC CPU CPU GPU main() { main() { double pi = 0.0; long i; double pi = 0.0; long i; #pragma acc parallel #pragma omp parallel for reduction(+:pi) for (i=0; i<N; i++) for (i=0; i<N; i++) { { double t = (double)((i+0.05)/N); double t = (double)((i+0.05)/N); pi += 4.0/(1.0+t*t); pi += 4.0/(1.0+t*t); } } printf(“pi = %f\n”, pi/N); printf(“pi = %f\n”, pi/N); } } OpenACC: Open Programming Standard for Parallel Computing “ OpenACC will enable programmers to easily develop portable applications that maximize the performance and power efficiency benefits of the hybrid CPU/GPU architecture of Titan. ” Buddy Bland Titan Project Director Easy, Fast, Portable Oak Ridge National Lab “ OpenACC is a technically impressive initiative brought together by members of the OpenMP Working Group on Accelerators, as well as many others. We look forward to releasing a version of this proposal in the next release of OpenMP. ” Michael Wong CEO, OpenMP http://www.openacc-standard.org/ Directives Board OpenACC Compiler directives to specify parallel regions
    [Show full text]
  • Advanced Perl
    Advanced Perl Boston University Information Services & Technology Course Coordinator: Timothy Kohl Last Modified: 05/12/15 Outline • more on functions • references and anonymous variables • local vs. global variables • packages • modules • objects 1 Functions Functions are defined as follows: sub f{ # do something } and invoked within a script by &f(parameter list) or f(parameter list) parameters passed to a function arrive in the array @_ sub print_sum{ my ($a,$b)=@_; my $sum; $sum=$a+$b; print "The sum is $sum\n"; } And so, in the invocation $a = $_[0] = 2 print_sum(2,3); $b = $_[1] = 3 2 The directive my indicates that the variable is lexically scoped. That is, it is defined only for the duration of the given code block between { and } which is usually the body of the function anyway. * When this is done, one is assured that the variables so defined are local to the given function. We'll discuss the difference between local and global variables later. * with some exceptions which we'll discuss To return a value from a function, you can either use an explicit return statement or simply put the value to be returned as the last line of the function. Typically though, it’s better to use the return statement. Ex: sub sum{ my ($a,$b)=@_; my $sum; $sum=$a+$b; return $sum; } $s=sum(2,3); One can also return an array or associative array from a function. 3 References and Anonymous Variables In languages like C one has the notion of a pointer to a given data type, which can be used later on to read and manipulate the contents of the variable pointed to by the pointer.
    [Show full text]
  • Perl
    -I<directory> directories specified by –I are Perl prepended to @INC -I<oct_name> enable automatic line-end processing Ming- Hwa Wang, Ph.D. -(m|M)[- executes use <module> before COEN 388 Principles of Computer- Aided Engineering Design ]<module>[=arg{,arg}] executing your script Department of Computer Engineering -n causes Perl to assume a loop around Santa Clara University -p your script -P causes your script through the C Scripting Languages preprocessor before compilation interpreter vs. compiler -s enable switch parsing efficiency concerns -S makes Perl use the PATH environment strong typed vs. type-less variable to search for the script Perl: Practical Extraction and Report Language -T forces taint checks to be turned on so pattern matching capability you can test them -u causes Perl to dump core after Resources compiling your script Unix Perl manpages -U allow Perl to do unsafe operations Usenet newsgroups: comp.lang.perl -v print version Perl homepage: http://www.perl/com/perl/, http://www.perl.org -V print Perl configuration and @INC Download: http://www.perl.com/CPAN/ values -V:<name> print the value of the named To Run Perl Script configuration value run as command: perl –e ‘print “Hello, world!\n”;’ -w print warning for unused variable, run as scripts (with chmod +x): perl <script> using variable without setting the #!/usr/local/bin/perl –w value, undefined functions, references use strict; to undefined filehandle, write to print “Hello, world!\n”; filehandle which is read-only, using a exit 0; non-number as it were a number, or switches subroutine recurse too deep, etc.
    [Show full text]
  • Python Guide Documentation 0.0.1
    Python Guide Documentation 0.0.1 Kenneth Reitz 2015 09 13 Contents 1 Getting Started 3 1.1 Picking an Interpreter..........................................3 1.2 Installing Python on Mac OS X.....................................5 1.3 Installing Python on Windows......................................6 1.4 Installing Python on Linux........................................7 2 Writing Great Code 9 2.1 Structuring Your Project.........................................9 2.2 Code Style................................................ 15 2.3 Reading Great Code........................................... 24 2.4 Documentation.............................................. 24 2.5 Testing Your Code............................................ 26 2.6 Common Gotchas............................................ 30 2.7 Choosing a License............................................ 33 3 Scenario Guide 35 3.1 Network Applications.......................................... 35 3.2 Web Applications............................................ 36 3.3 HTML Scraping............................................. 41 3.4 Command Line Applications....................................... 42 3.5 GUI Applications............................................. 43 3.6 Databases................................................. 45 3.7 Networking................................................ 45 3.8 Systems Administration......................................... 46 3.9 Continuous Integration.......................................... 49 3.10 Speed..................................................
    [Show full text]
  • C Programming Tutorial
    C Programming Tutorial C PROGRAMMING TUTORIAL Simply Easy Learning by tutorialspoint.com tutorialspoint.com i COPYRIGHT & DISCLAIMER NOTICE All the content and graphics on this tutorial are the property of tutorialspoint.com. Any content from tutorialspoint.com or this tutorial may not be redistributed or reproduced in any way, shape, or form without the written permission of tutorialspoint.com. Failure to do so is a violation of copyright laws. This tutorial may contain inaccuracies or errors and tutorialspoint provides no guarantee regarding the accuracy of the site or its contents including this tutorial. If you discover that the tutorialspoint.com site or this tutorial content contains some errors, please contact us at [email protected] ii Table of Contents C Language Overview .............................................................. 1 Facts about C ............................................................................................... 1 Why to use C ? ............................................................................................. 2 C Programs .................................................................................................. 2 C Environment Setup ............................................................... 3 Text Editor ................................................................................................... 3 The C Compiler ............................................................................................ 3 Installation on Unix/Linux ............................................................................
    [Show full text]
  • Assembly Language Programming
    Experiment 3 Assembly Language Programming Every computer, no matter how simple or complex, has a microprocessor that manages the computer's arithmetical, logical and control activities. A computer program is a collection of numbers stored in memory in the form of ones and zeros. CPU reads these numbers one at a time, decodes them and perform the required action. We term these numbers as machine language. Although machine language instructions make perfect sense to computer but humans cannot comprehend them. A long time ago, someone came up with the idea that computer programs could be written using words instead of numbers and a new language of mnemonics was de- veloped and named as assembly language. An assembly language is a low-level programming language and there is a very strong (generally one-to-one) correspondence between the language and the architecture's machine code instructions. We know that computer cannot interpret assembly language instructions. A program should be developed that performs the task of translating the assembly language instructions to ma- chine language. A computer program that translates the source code written with mnemonics into the executable machine language instructions is called an assembler. The input of an assembler is a source code and its output is an executable object code. Assembly language programs are not portable because assembly language instructions are specific to a particular computer architecture. Similarly, a different assembler is required to make an object code for each platform. The ARM Architecture The ARM is a Reduced Instruction Set Computer(RISC) with a relatively simple implementation of a load/store architecture, i.e.
    [Show full text]
  • Introduction to GPU Programming with CUDA and Openacc
    Introduction to GPU Programming with CUDA and OpenACC Alabama Supercomputer Center 1 Alabama Research and Education Network Contents Topics § Why GPU chips and CUDA? § GPU chip architecture overview § CUDA programming § Queue system commands § Other GPU programming options § OpenACC programming § Comparing GPUs to other processors 2 What is a GPU chip? GPU § A Graphic Processing Unit (GPU) chips is an adaptation of the technology in a video rendering chip to be used as a math coprocessor. § The earliest graphic cards simply mapped memory bytes to screen pixels – i.e. the Apple ][ in 1980. § The next generation of graphics cards (1990s) had 2D rendering capabilities for rendering lines and shaded areas. § Graphics cards started accelerating 3D rendering with standards like OpenGL and DirectX in the early 2000s. § The most recent graphics cards have programmable processors, so that game physics can be offloaded from the main processor to the GPU. § A series of GPU chips sometimes called GPGPU (General Purpose GPU) have double precision capability so that they can be used as math coprocessors. 3 Why GPUs? GPU Comparison of peak theoretical GFLOPs and memory bandwidth for NVIDIA GPUs and Intel CPUs over the past few years. Graphs from the NVIDIA CUDA C Programming Guide 4.0. 4 CUDA Programming Language CUDA The GPU chips are massive multithreaded, manycore SIMD processors. SIMD stands for Single Instruction Multiple Data. Previously chips were programmed using standard graphics APIs (DirectX, OpenGL). CUDA, an extension of C, is the most popular GPU programming language. CUDA can also be called from a C++ program. The CUDA standard has no FORTRAN support, but Portland Group sells a third party CUDA FORTRAN.
    [Show full text]
  • CUDA C++ Programming Guide
    CUDA C++ Programming Guide Design Guide PG-02829-001_v11.4 | September 2021 Changes from Version 11.3 ‣ Added Graph Memory Nodes. ‣ Formalized Asynchronous SIMT Programming Model. CUDA C++ Programming Guide PG-02829-001_v11.4 | ii Table of Contents Chapter 1. Introduction........................................................................................................ 1 1.1. The Benefits of Using GPUs.....................................................................................................1 1.2. CUDA®: A General-Purpose Parallel Computing Platform and Programming Model....... 2 1.3. A Scalable Programming Model.............................................................................................. 3 1.4. Document Structure................................................................................................................. 5 Chapter 2. Programming Model.......................................................................................... 7 2.1. Kernels.......................................................................................................................................7 2.2. Thread Hierarchy...................................................................................................................... 8 2.3. Memory Hierarchy...................................................................................................................10 2.4. Heterogeneous Programming................................................................................................11 2.5. Asynchronous
    [Show full text]
  • Directive-Based Programming with Openmp Shared Memory Programming
    Directive-Based Programming with OpenMP Shared Memory Programming • Explicit thread creation (pthreads): pthread t thread; pthread create(&thread, &attr, some_function, (void ∗) &result); … pthread join(thread, NULL); • Tasks (C++ 11): auto handle = std::async(std::launch::async, some_code, ...); auto result = handle.get(); • Kernels (CUDA): __global__ void someKernel(...) { int idx = blockIdx.x*blockDim.x + threadIdx.x // execute code for given index } 23 OpenMP • API for shared memory parallel programming targeting Fortran, C and C++ • specifications maintained by OpenMP Architecture Review Board (ARB) • members include AMD, ARM, Cray, Intel, Fujitsu, IBM, NVIDIA – versions 1.0 (Fortran ’97, C ’98) - 3.1 (2011) shared memory – 4.0 (2013) accelerators, NUMA – 4.5 (2015) improved memory mapping, SIMD – 5.0 (2018) improved accelerator support 25 OpenMP • comprises compiler directives, library routines and environment variables – C directives (case sensitive) #pragma omp directive-name [clause-list] – library calls begin with omp void omp_set_num_threads(int nthreads); – environment variables begin with OMP export OMP_NUM_THREADS=4 • requires compiler support – activated via -fopenmp (gcc/clang) or -openmp (icc) compiler flags 26 The parallel Directive • OpenMP uses a fork/join model: programs execute serially until they encounter a parallel directive: – this creates a group of threads – number of threads is dependent on the OMP_NUM_THREADS environment variable or set via function call, e.g. omp_set_num_threads(nthreads) – main thread becomes
    [Show full text]