Directives-Based Parallel Programming


DIRECTIVES-BASED PARALLEL PROGRAMMING
Jeff Larkin <[email protected]>, 2/12/2020

HANDS-ON
1. You should have received an email from Okta; please activate your account.
2. Install the SFT client on your laptop for your OS: https://help.okta.com/en/prod/Content/Topics/Adv_Server_Access/docs/client.htm
3. Open a terminal and run: sft enroll --team nv-demo
4. Approve in your browser.
5. Back in your terminal, log in to the hackathon head node: sft ssh -L 9090:localhost:9090 raplab-hackathon
6. Open http://localhost:9090

WHAT'S A COMPILER DIRECTIVE?
Compiler directives are instructions, hints, or other information given to the compiler beyond the base language source code. Examples:
• GCC, unroll this loop 4 times (#pragma GCC unroll 4)
• Ignore vector dependencies in this loop (#pragma ivdep)
Directives are different from preprocessor macros: they affect the actual code generation during compilation, so they can't always be pre-processed. They are different from attributes because they are not defined by the language.

DIRECTIVES FOR PARALLEL PROGRAMMING
OpenMP – Established 1997
OpenMP was founded to create a unified set of compiler directives for the shared-memory parallel (SMP) computers that were becoming commonplace in the late 90s. In 2008 this was expanded to include task-based parallelism, and in 2013 it was expanded again to include offloading to co-processors.
OpenACC – Established 2011
OpenACC was founded to create a unified set of compiler directives for the "accelerators" that began emerging around 2010 (primarily GPUs), with potentially disjoint memories, non-SMP architectures, and offloading.
Both sets of directives support C, C++, and Fortran and require support from the compiler.
A BRIEF HISTORY OF OPENMP
• OpenMP 1.0 (1997): Basic shared-memory parallelism; separate C/C++ and Fortran specs.
• OpenMP 2.0 (2000, 2002): Atomics, clarifications and improvements.
• OpenMP 2.5 (2005): Unified into a single spec.
• OpenMP 3.0 (2008): Tasking added.
• OpenMP 3.1 (2011): Tasking improvements.
• OpenMP 4.0 (2013): Target offloading, teams, SIMD, tasking improvements.
• OpenMP 4.5 (2015): Improved offloading, task priority.
• OpenMP 5.0 (2018): Loop construct, metadirective, base language updates, requires directive.

A BRIEF HISTORY OF OPENACC
• 2011: ORNL asks CAPS, Cray, and PGI to unify their efforts with the help of NVIDIA.
• OpenACC 1.0 (Nov. 2011): Basic parallelism, structured data, and async/wait semantics.
• OpenACC 2.0 (June 2013): Unstructured data lifetimes, routines, atomic, clarifications and improvements.
• OpenACC 2.5 (Oct. 2015): Reference counting, profiling interface, misc. user feedback.
• OpenACC 2.6 (Nov. 2016): Serial construct, attach/detach (manual deep copy), multi-device support, user feedback.
• OpenACC 2.7 (Nov. 2018): Compute on self, readonly, array reductions, lots of clarifications, additional improvements from user feedback.
• OpenACC 3.0 (Nov. 2019): Updated base languages, C++ lambdas incorporation, zero modifier, misc. user feedback.

3 WAYS TO ACCELERATE APPLICATIONS
• Libraries: easy to use, most performance.
• Compiler directives: easy to use, portable code.
• Programming languages: most performance, most flexibility.

THE FUTURE OF PARALLEL PROGRAMMING
Standard Languages | Directives | Specialized Languages

Standard languages (drive base languages to better support parallelism):

  std::for_each_n(POL, idx(0), n,
    [&](Index_t i){
      y[i] += a*x[i];
    });

  do concurrent (i = 1:n)
    y(i) = y(i) + a*x(i)
  enddo

Directives (augment base languages with directives):

  #pragma acc data copy(x,y)
  {
    ...
    std::for_each_n(POL, idx(0), n,
      [&](Index_t i){
        y[i] += a*x[i];
      });
    ...
  }

Specialized languages (maximize performance with specialized languages & intrinsics):

  __global__ void saxpy(int n, float a,
                        float *x, float *y) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] += a*x[i];
  }

  int main(void) {
    ...
    cudaMemcpy(d_x, x, ...);
    cudaMemcpy(d_y, y, ...);
    saxpy<<<(N+255)/256,256>>>(...);
    cudaMemcpy(y, d_y, ...);
  }
DIRECTIVE SYNTAX
Syntax for using compiler directives in code:

C/C++:
  #pragma sentinel directive clauses
  <code>

Fortran:
  !$sentinel directive clauses
  <code>

A pragma in C/C++ gives instructions to the compiler on how to compile the code. Compilers that do not understand a particular pragma can freely ignore it. A directive in Fortran is a specially formatted comment that likewise instructs the compiler in the compilation of the code and can be freely ignored.
The sentinel informs the compiler of the directive language that will follow (acc = OpenACC, omp = OpenMP). Directives are commands and information to the compiler for altering or interpreting the code. Clauses are specifiers or additions to directives, like function parameters.

INTRODUCTION TO OPENACC

OpenACC Directives

  #pragma acc data copyin(a,b) copyout(c)   // Manage data movement
  {
    ...
    #pragma acc parallel                    // Initiate parallel execution
    {
      #pragma acc loop gang vector          // Optimize loop mappings
      for (i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
        ...
      }
    }
    ...
  }

• Incremental
• Single source
• Interoperable
• Performance portable
• CPU, GPU, Manycore

OPENACC
Incremental:
• Maintain existing sequential code.
• Add annotations to expose parallelism.
• After verifying correctness, annotate more of the code.
Single source:
• Rebuild the same code on multiple architectures.
• Compiler determines how to parallelize for the desired machine.
• Sequential code is maintained.
Low learning curve:
• OpenACC is meant to be easy to use, and easy to learn.
• Programmer remains in familiar C, C++, or Fortran.
• No reason to learn low-level details of the hardware.
OPENACC DIRECTIVES
A directive-based parallel programming model designed for usability, performance, and portability.
• 3 of the top 5 HPC apps and 18% of INCITE apps at Summit use OpenACC.
• Platforms supported: NVIDIA GPU, x86 CPU, POWER CPU, Sunway, ARM CPU, AMD GPU.
• >200K downloads.
[Charts: OpenACC applications grew from 39 (SC15) to 236 (SC19); OpenACC Slack members grew from 150 (ISC17) to 1724 (SC19).]

EXAMPLE CODE

LAPLACE HEAT TRANSFER
Introduction to the lab code. [Visual: a metal plate, "very hot" along the top edge, room temperature elsewhere.]
We will observe a simple simulation of heat distributing across a metal plate. We will apply a consistent heat to the top of the plate, then simulate the heat distributing across the plate.

EXAMPLE: JACOBI ITERATION
• Iteratively converges to the correct value (e.g. temperature) by computing new values at each point from the average of neighboring points.
• Common, useful algorithm.
• Example: solve the Laplace equation in 2D: ∇²f(x,y) = 0

  A_k+1(i,j) = ( A_k(i-1,j) + A_k(i+1,j) + A_k(i,j-1) + A_k(i,j+1) ) / 4

JACOBI ITERATION: C CODE

  while ( err > tol && iter < iter_max ) {        // Iterate until converged
    err = 0.0;
    for ( int j = 1; j < n-1; j++ ) {             // Iterate across matrix elements
      for ( int i = 1; i < m-1; i++ ) {
        // Calculate new value from neighbors
        Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                            + A[j-1][i] + A[j+1][i] );
        // Compute max error for convergence
        err = max(err, abs(Anew[j][i] - A[j][i]));
      }
    }
    for ( int j = 1; j < n-1; j++ ) {             // Swap input/output arrays
      for ( int i = 1; i < m-1; i++ ) {
        A[j][i] = Anew[j][i];
      }
    }
    iter++;
  }

PROFILE-DRIVEN DEVELOPMENT

OPENACC DEVELOPMENT CYCLE
• Analyze your code to determine the most likely places needing parallelization or optimization.
• Parallelize your code, starting with the most time-consuming parts, and check for correctness.
• Optimize your code to improve the observed speed-up from parallelization.
PROFILING SEQUENTIAL CODE
Profile your code to obtain detailed information about how it ran. This can include information such as:
• Total runtime
• Runtime of individual routines
• Hardware counters
Identify the portions of code that took the longest to run. We want to focus on these "hotspots" when parallelizing.
Lab code: Laplace heat transfer. Total runtime: 39.43 seconds (calcNext 21.49s, swap 19.04s).

OPENACC PARALLEL DIRECTIVE
Expressing parallelism:

  #pragma acc parallel
  {
  }

When encountering the parallel directive, the compiler will generate one or more parallel gangs, which execute redundantly.

  #pragma acc parallel
  {
    for(int i = 0; i < N; i++)
    {
      // Do Something
    }
  }

Without further information, this loop will be executed redundantly on each gang; that is, each gang will execute the entire loop.

Parallelizing a single loop:
• Use a parallel directive to mark a region of code where you want parallel execution to occur.
• This parallel region is marked by curly braces in C/C++ or by a start and end directive in Fortran.
• The loop directive is used to instruct the compiler to parallelize the iterations of the next loop to run across the parallel gangs.

C/C++:

  #pragma acc parallel
  {
    #pragma acc loop
    for(int i = 0; i < N; i++)
      a[i] = 0;
  }

Fortran:

  !$acc parallel
  !$acc loop
  do i = 1, N
    a(i) = 0
  end do
  !$acc end parallel
This pattern is so common that you can do all of this in a single line of code. The parallel loop directive both marks the region for parallel execution and distributes the iterations of the loop; it applies to the next loop. When applied to a loop with a data dependency, parallel loop may produce incorrect results.

C/C++:

  #pragma acc parallel loop
  for(int i = 0; i < N; i++)
    a[i] = 0;

Fortran:

  !$acc parallel loop
  do i = 1, N
    a(i) = 0
  end do

OPENACC PARALLEL DIRECTIVE
Expressing parallelism:

  #pragma acc parallel
  {
    #pragma acc loop
    for(int i = 0; i < N; i++)
    {
      // Do Something
    }
  }

The loop directive informs the compiler which loops to parallelize.