DIRECTIVES-BASED PARALLEL PROGRAMMING (HANDS-ON) Jeff Larkin, 2/12/2020

1. You should have received an email from Okta, please activate your account.

2. Install the SFT client on your laptop for your OS: https://help.okta.com/en/prod/Content/Topics/Adv_Server_Access/docs/client.htm

3. Open a terminal and run: sft enroll --team nv-demo

4. Approve in browser

5. Back in your terminal, log in to the hackathon head node: sft ssh -L 9090:localhost:9090 raplab-hackathon

6. Open http://localhost:9090

2 WHAT’S A DIRECTIVE?

Compiler directives are instructions, hints, or other information given to the compiler beyond the base language source code. Examples:
• GCC, unroll this loop 4 times (#pragma GCC unroll 4)
• Ignore vector dependencies in this loop (#pragma ivdep)
Directives differ from pre-processor macros because they affect the actual code generation during compilation and cannot always be handled by pre-processing alone. They differ from attributes because they are not defined by the base language.
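As a rough sketch (not from the slides) of how such hints look in C source; the pragma spellings are the GCC and Intel ones, and a compiler that does not recognize a given pragma simply ignores it:

/* Sketch: loop-level hints that the base C language itself does not define. */
void scale(float *x, int n, float a)
{
    #pragma ivdep            /* Intel-style hint: ignore assumed vector dependencies */
    #pragma GCC unroll 4     /* GCC hint: unroll the following loop 4 times          */
    for (int i = 0; i < n; i++)
        x[i] *= a;
}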

3 DIRECTIVES FOR PARALLEL PROGRAMMING

OpenMP – Established 1997

OpenMP was founded to create a unified set of compiler directives for shared-memory-parallel computers (SMP) that were becoming commonplace in the late 90s. In 2008 this was expanded to include task-based parallelism. In 2013 this was expanded again to include offloading to co-processors.

OpenACC – Established 2011

OpenACC was founded to create a unified set of compiler directives for the “accelerators” that began emerging around 2010 (primarily GPUs), which feature potentially disjoint memories, non-SMP architectures, and offloading.

Both sets of directives support C, C++, and Fortran and require support from the compiler.

4 A BRIEF HISTORY OF OPENMP

OpenMP 1.0 (1997): Basic shared-memory parallelism; C/C++ and Fortran as two separate specs.
OpenMP 2.0 (2000, 2002): Clarifications and improvements.
OpenMP 2.5 (2005): Unified into a single spec.
OpenMP 3.0 (2008): Tasking.
OpenMP 3.1 (2011): Tasking improvements, atomics additions.
OpenMP 4.0 (2013): Target offloading, teams, SIMD.
OpenMP 4.5 (2015): Improved offloading, task priority.
OpenMP 5.0 (2018): Loop construct, metadirective, base language updates, requires directive, tasking improvements.

5 A BRIEF HISTORY OF OPENACC

June 2011: ORNL asks CAPS, Cray, & PGI to unify efforts with the help of NVIDIA.
OpenACC 1.0 (Nov. 2011): Basic parallelism, structured data, and async/wait semantics.
OpenACC 2.0 (Oct. 2013): Unstructured data lifetimes, routines, atomic, clarifications & improvements.
OpenACC 2.5 (Nov. 2015): Reference counting, profiling interface, additional improvements from user feedback.
OpenACC 2.6 (Nov. 2016): Serial construct, attach/detach (manual deep copy), misc. user feedback.
OpenACC 2.7 (Nov. 2018): Compute on self, readonly modifier, array reductions, lots of clarifications, misc. user feedback.
OpenACC 3.0 (Nov. 2019): Updated base languages, C++ lambdas, zero modifier, improved multi-device support.

6 3 WAYS TO ACCELERATE APPLICATIONS

Applications can be accelerated in three ways:

Libraries: Easy to use, Most Performance
Compiler Directives: Easy to use, Portable code
Programming Languages: Most Performance, Most Flexibility

7 THE FUTURE OF PARALLEL PROGRAMMING Standard Languages | Directives | Specialized Languages

Standard Languages:
  std::for_each_n(POL, idx(0), n,
    [&](Index_t i){
      y[i] += a*x[i];
    });

  do concurrent (i = 1:n)
    y(i) = y(i) + a*x(i)
  enddo

Directives:
  #pragma acc data copy(x,y)
  {
    ...
    std::for_each_n(POL, idx(0), n,
      [&](Index_t i){
        y[i] += a*x[i];
      });
    ...
  }

Specialized Languages:
  __global__ void saxpy(int n, float a,
                        float *x, float *y) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] += a*x[i];
  }

  int main(void) {
    ...
    cudaMemcpy(d_x, x, ...);
    cudaMemcpy(d_y, y, ...);
    saxpy<<<(N+255)/256,256>>>(...);
    cudaMemcpy(y, d_y, ...);
  }

Standard Languages: drive base languages to better support parallelism. Directives: augment base languages with directives. Specialized Languages: maximize performance with specialized languages & intrinsics.

8 DIRECTIVE SYNTAX

9 DIRECTIVE SYNTAX Syntax for using compiler directives in code

C/C++:   #pragma sentinel directive clauses
Fortran: !$sentinel directive clauses

A pragma in C/C++ gives instructions to the compiler on how to compile the code. Compilers that do not understand a particular pragma can freely ignore it.

A directive in Fortran is a specially formatted comment that likewise instructs the compiler in the compilation of the code and can be freely ignored.

The sentinel informs the compiler which directive language will follow (acc = OpenACC, omp = OpenMP).

Directives are commands and information to the compiler for altering or interpreting the code

Clauses are specifiers or additions to directives, like function parameters.
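As a minimal sketch of how these pieces fit together in C/C++ (the array names and sizes here are placeholders, not from the lab code):

/*         sentinel  directive      clauses                         */
/* #pragma acc       parallel loop  copyin(x[0:n]) copyout(y[0:n])  */
void scale(int n, const float *x, float *y)
{
    #pragma acc parallel loop copyin(x[0:n]) copyout(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = 2.0f * x[i];   /* the compiler parallelizes this loop */
}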

10 INTRODUCTION TO OPENACC

11 OpenACC Directives

#pragma acc data copyin(a,b) copyout(c)      // Manage data movement
{
  ...
  #pragma acc parallel                       // Initiate parallel execution
  {
    #pragma acc loop gang vector             // Optimize loop mappings
    for (i = 0; i < n; ++i) {
      c[i] = a[i] + b[i];
      ...
    }
  }
  ...
}

Key properties: Incremental, Single source, Interoperable, Performance portable, CPU/GPU/Manycore.

12 OPENACC

Incremental Single Source Low Learning Curve

Incremental:
▪ Maintain existing sequential code
▪ Add annotations to expose parallelism
▪ After verifying correctness, annotate more of the code

Single Source:
▪ Rebuild the same code on multiple architectures
▪ Compiler determines how to parallelize for the desired machine
▪ Sequential code is maintained

Low Learning Curve:
▪ OpenACC is meant to be easy to use and easy to learn
▪ Programmer remains in familiar C, C++, or Fortran
▪ No reason to learn low-level details of the hardware

13 OPENACC DIRECTIVES a directive-based parallel programming model designed for usability, performance and portability

3 of the top 5 HPC apps and 18% of INCITE at Summit use OpenACC. Platforms supported: NVIDIA GPU, x86 CPU, POWER CPU, Sunway, ARM CPU, AMD GPU.

>200K compiler downloads. OpenACC apps grew from 39 (SC15) to 236 (SC19); OpenACC Slack members grew from 150 (ISC17) to 1724 (SC19).

14 EXAMPLE CODE

15 LAPLACE HEAT TRANSFER Introduction to lab code (visual: plate colored from very hot to room temperature)

We will observe a simple simulation of heat distributing across a metal plate.

We will apply a consistent heat to the top of the plate.

Then, we will simulate the heat distributing across the plate.

16 EXAMPLE: JACOBI ITERATION

▪ Iteratively converges to correct value (e.g. Temperature), by computing new values at each point from the average of neighboring points. ▪ Common, useful algorithm

▪ Example: Solve the Laplace equation in 2D: ∇²f(x,y) = 0

Each point A(i,j) is updated from its four neighbors A(i-1,j), A(i+1,j), A(i,j-1), A(i,j+1):

A_{k+1}(i,j) = ( A_k(i-1,j) + A_k(i+1,j) + A_k(i,j-1) + A_k(i,j+1) ) / 4

17 JACOBI ITERATION: C CODE

while ( err > tol && iter < iter_max ) {      // Iterate until converged
  err = 0.0;

  // Iterate across matrix elements
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      // Calculate new value from neighbors
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                           A[j-1][i] + A[j+1][i]);
      // Compute max error for convergence
      err = max(err, abs(Anew[j][i] - A[j][i]));
    }
  }

  // Swap input/output arrays
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }

  iter++;
}

18 PROFILE-DRIVEN DEVELOPMENT

19 OPENACC DEVELOPMENT CYCLE (Analyze -> Parallelize -> Optimize)

▪ Analyze your code to determine the most likely places needing parallelization or optimization.
▪ Parallelize your code, starting with the most time-consuming parts, and check for correctness.
▪ Optimize your code to improve the observed speed-up from parallelization.

20 PROFILING SEQUENTIAL CODE

Profile Your Code: obtain detailed information about how the code ran.

Lab code: Laplace Heat Transfer. Total runtime: 39.43 seconds (calcNext 21.49 s, swap 19.04 s).

This can include information such as:
▪ Total runtime
▪ Runtime of individual routines
▪ Hardware counters

Identify the portions of code that took the longest to run. We want to focus on these “hotspots” when parallelizing.

21 OPENACC PARALLEL LOOP DIRECTIVE

22 OPENACC PARALLEL DIRECTIVE Expressing parallelism

#pragma acc parallel
{

}

When encountering the parallel directive, the compiler will generate one or more parallel gangs, which execute redundantly.

23 OPENACC PARALLEL DIRECTIVE

Expressing parallelism

#pragma acc parallel
{

  for(int i = 0; i < N; i++)
  {
    // Do Something
  }

}

This loop will be executed redundantly on each gang.

24 OPENACC PARALLEL DIRECTIVE Expressing parallelism

#pragma acc parallel
{

  for(int i = 0; i < N; i++)
  {
    // Do Something
  }

}

This means that each gang will execute the entire loop.

25 OPENACC PARALLEL DIRECTIVE Parallelizing a single loop

▪ Use a parallel directive to mark a region of code where you want parallel execution to occur
▪ This parallel region is marked by curly braces in C/C++ or a start and end directive in Fortran
▪ The loop directive is used to instruct the compiler to parallelize the iterations of the next loop to run across the parallel gangs

C/C++:
  #pragma acc parallel
  {
    #pragma acc loop
    for(int i = 0; i < N; i++)
      a[i] = 0;
  }

Fortran:
  !$acc parallel
  !$acc loop
  do i = 1, N
    a(i) = 0
  end do
  !$acc end parallel

26 OPENACC PARALLEL DIRECTIVE Parallelizing a single loop

▪ This pattern is so common that you can do all of this in a single line of code
▪ In this example, the parallel loop directive applies to the next loop
▪ This directive both marks the region for parallel execution and distributes the iterations of the loop
▪ When applied to a loop with a data dependency, parallel loop may produce incorrect results

C/C++:
  #pragma acc parallel loop
  for(int i = 0; i < N; i++)
    a[i] = 0;

Fortran:
  !$acc parallel loop
  do i = 1, N
    a(i) = 0
  end do

27 OPENACC PARALLEL DIRECTIVE Expressing parallelism

#pragma acc parallel
{
  #pragma acc loop
  for(int i = 0; i < N; i++)
  {
    // Do Something
  }
}

The loop directive informs the compiler which loops to parallelize.

28 OPENACC PARALLEL LOOP DIRECTIVE Parallelizing many loops

▪ To parallelize multiple loops, each loop should be accompanied by a parallel directive
▪ Each parallel loop can have different loop boundaries and loop optimizations
▪ Each parallel loop can be parallelized in a different way
▪ This is the recommended way to parallelize multiple loops. Attempting to parallelize multiple loops within the same parallel region may give performance issues or unexpected results

#pragma acc parallel loop
for(int i = 0; i < N; i++)
  a[i] = 0;

#pragma acc parallel loop
for(int j = 0; j < M; j++)
  b[j] = 0;

29 PARALLELIZE WITH OPENACC PARALLEL LOOP

while ( err > tol && iter < iter_max ) {
  err = 0.0;

  // Parallelize first loop nest, max reduction required.
  #pragma acc parallel loop reduction(max:err)
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                           A[j-1][i] + A[j+1][i]);
      err = max(err, abs(Anew[j][i] - A[j][i]));
    }
  }

  // Parallelize second loop.
  #pragma acc parallel loop
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }

  iter++;
}

We didn't detail how to parallelize the loops, just which loops to parallelize.

30 REDUCTION CLAUSE

▪ The reduction clause takes many values and "reduces" them to a single value, such as a sum or maximum
▪ Each thread calculates its part
▪ The compiler will perform a final reduction to produce a single global result using the specified operation

Original loop nest:
  for( i = 0; i < size; i++ )
    for( j = 0; j < size; j++ )
      for( k = 0; k < size; k++ )
        c[i][j] += a[i][k] * b[k][j];

With a scalar temporary and a reduction on the inner loop:
  for( i = 0; i < size; i++ )
    for( j = 0; j < size; j++ ) {
      double tmp = 0.0f;
      #pragma acc parallel loop \
        reduction(+:tmp)
      for( k = 0; k < size; k++ )
        tmp += a[i][k] * b[k][j];
      c[i][j] = tmp;
    }

31 REDUCTION CLAUSE OPERATORS

Operator Description Example

+ Addition/Summation reduction(+:sum)

* Multiplication/Product reduction(*:product)
max Maximum value reduction(max:maximum)
min Minimum value reduction(min:minimum)

& Bitwise and reduction(&:val)

| Bitwise or reduction(|:val)

&& Logical and reduction(&&:val)

|| Logical or reduction(||:val)
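For example, a hedged sketch (not from the lab code) combining two of these operators on one loop; the function and variable names are invented for illustration:

#include <float.h>
#include <math.h>

/* Returns the sum of x[0..n-1] and writes the maximum element to *maxval. */
float sum_and_max(int n, const float *x, float *maxval)
{
    float total = 0.0f;
    float biggest = -FLT_MAX;
    /* One loop, two reductions: '+' accumulates the sum, 'max' tracks the maximum. */
    #pragma acc parallel loop reduction(+:total) reduction(max:biggest)
    for (int i = 0; i < n; i++) {
        total  += x[i];
        biggest = fmaxf(biggest, x[i]);
    }
    *maxval = biggest;
    return total;
}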

32 BUILD AND RUN THE CODE

33 PGI COMPILER BASICS pgcc, pgc++ and pgfortran

▪ The command to compile C code is ‘pgcc’ ▪ The command to compile C++ code is ‘pgc++’ ▪ The command to compile Fortran code is ‘pgfortran’ ▪ The -fast flag instructs the compiler to optimize the code to the best of its abilities

$ pgcc -fast main.c
$ pgc++ -fast main.cpp
$ pgfortran -fast main.F90

34 PGI COMPILER BASICS -Minfo flag

▪ The -Minfo flag will instruct the compiler to print feedback about the compiled code ▪ -Minfo=accel will give us information about what parts of the code were accelerated via OpenACC ▪ -Minfo=opt will give information about all code optimizations ▪ -Minfo=all will give all code feedback, whether positive or negative

$ pgcc -fast -Minfo=all main.c
$ pgc++ -fast -Minfo=all main.cpp
$ pgfortran -fast -Minfo=all main.f90

35 PGI COMPILER BASICS -ta flag

▪ The -ta flag enables building OpenACC code for a “Target Accelerator” (TA) ▪ -ta=multicore – Build the code to run across threads on a multicore CPU ▪ -ta=tesla:managed – Build the code for an NVIDIA (Tesla) GPU and manage the data movement for me (more next week)

$ pgcc -fast -Minfo=accel -ta=tesla:managed main.c
$ pgc++ -fast -Minfo=accel -ta=tesla:managed main.cpp
$ pgfortran -fast -Minfo=accel -ta=tesla:managed main.f90

36 BUILDING THE CODE (MULTICORE)

$ pgcc -fast -ta=multicore -Minfo=accel laplace2d_uvm.c
main:
     63, Generating Multicore code
         64, #pragma acc loop gang
     64, Accelerator restriction: size of the GPU copy of Anew,A is unknown
         Generating reduction(max:error)
     66, Loop is parallelizable
     74, Generating Multicore code
         75, #pragma acc loop gang
     75, Accelerator restriction: size of the GPU copy of Anew,A is unknown
     77, Loop is parallelizable

37 OPENACC SPEED-UP

(Speed-up chart: SERIAL 1.00X, MULTICORE 3.05X.)

PGI 18.7, NVIDIA Tesla V100, Intel i9-7900X CPU @ 3.30GHz 38 BUILDING THE CODE (GPU)

$ pgcc -fast -ta=tesla:managed -Minfo=accel laplace2d_uvm.c
main:
     63, Accelerator kernel generated
         Generating Tesla code
         64, #pragma acc loop gang /* blockIdx.x */
             Generating reduction(max:error)
         66, #pragma acc loop vector(128) /* threadIdx.x */
     63, Generating implicit copyin(A[:])
         Generating implicit copyout(Anew[:])
         Generating implicit copy(error)
     66, Loop is parallelizable
     74, Accelerator kernel generated
         Generating Tesla code
         75, #pragma acc loop gang /* blockIdx.x */
         77, #pragma acc loop vector(128) /* threadIdx.x */
     74, Generating implicit copyin(Anew[:])
         Generating implicit copyout(A[:])
     77, Loop is parallelizable

39 OPENACC SPEED-UP

(Speed-up chart: SERIAL 1.00X, MULTICORE 3.05X, NVIDIA TESLA V100 37.14X.)

PGI 18.7, NVIDIA Tesla V100, Intel i9-7900X CPU @ 3.30GHz 40 CPU AND GPU MEMORIES

41 CPU + GPU Physical Diagram (CPU with large caches and high-capacity memory; GPU with high-bandwidth memory; connected by an I/O bus)

▪ CPU memory is larger; GPU memory has more bandwidth
▪ CPU and GPU memory are usually separate, connected by an I/O bus (traditionally PCI-e)
▪ Any data transferred between the CPU and GPU will be handled by the I/O bus
▪ The I/O bus is relatively slow compared to memory bandwidth
▪ The GPU cannot perform computation until the data is within its memory

42 CUDA UNIFIED MEMORY

43 CUDA UNIFIED MEMORY (commonly referred to as "managed memory") Simplified Developer Effort

Without managed memory, system memory and GPU memory are separate pools. With managed memory, CPU and GPU memories are combined into a single, shared pool.

44 CUDA MANAGED MEMORY Usefulness

▪ Handling explicit data transfers between the host and device (CPU and GPU) can be difficult ▪ The PGI compiler can utilize CUDA Managed Memory to defer data management ▪ This allows the developer to concentrate on parallelism and think about data movement as an optimization

$ pgcc -fast -acc -ta=tesla:managed -Minfo=accel main.c

$ pgfortran -fast -acc -ta=tesla:managed -Minfo=accel main.f90

45 MANAGED MEMORY Limitations

▪ The programmer will almost always be able to get better performance by manually handling data transfers
▪ Memory allocation/deallocation takes longer with managed memory
▪ Data cannot be transferred asynchronously
▪ Currently only available from PGI on NVIDIA GPUs

46 (Slide courtesy of PGI)

47 WE'RE USING UNIFIED MEMORY Now let's make our code run without it.

Why? ▪ Removes reliance on PGI and NVIDIA GPUs ▪ Currently the data always arrives “Just Too Late”, let’s do better

48 BASIC DATA MANAGEMENT

49 BASIC DATA MANAGEMENT Between the host and device

▪ The host is traditionally a CPU
▪ The device is some parallel accelerator
▪ When our target hardware is multicore, the host and device are the same, meaning that their memory is also the same
▪ There is no need to explicitly manage data when using a shared-memory accelerator, such as the multicore target

50 BASIC DATA MANAGEMENT Between the host and device CPU GPU

▪ When the target hardware is a GPU, data will usually need to migrate between CPU and GPU memory
▪ Each array used on the GPU must be allocated on the GPU
▪ When data changes on the CPU or GPU, the other copy must be updated

51 TRY TO BUILD WITHOUT "MANAGED" Change -ta=tesla:managed to remove "managed"

pgcc -ta=tesla -Minfo=accel laplace2d.c jacobi.c
laplace2d.c:
PGC-S-0155-Compiler failed to translate accelerator region (see -Minfo messages): Could not find allocated-variable index for symbol (laplace2d.c: 47)
calcNext:
     47, Accelerator kernel generated
         Generating Tesla code
         48, #pragma acc loop gang /* blockIdx.x */
             Generating reduction(max:error)
         50, #pragma acc loop vector(128) /* threadIdx.x */
     48, Accelerator restriction: size of the GPU copy of Anew,A is unknown
     50, Loop is parallelizable
PGC-F-0704-Compilation aborted due to previous errors. (laplace2d.c)
PGC/x86-64 Linux 18.7-0: compilation aborted
jacobi.c:

52 DATA SHAPING

53 DATA CLAUSES copy( list ) Allocates memory on GPU and copies data from host to GPU when entering region and copies data to the host when exiting region.

Principal use: For many important data structures in your code, this is a logical default to input, modify and return the data. copyin( list ) Allocates memory on GPU and copies data from host to GPU when entering region.

Principal use: Think of this like an array that you would use as just an input to a subroutine. copyout( list ) Allocates memory on GPU and copies data to the host when exiting region.

Principal use: A result that isn’t overwriting the input data structure. create( list ) Allocates memory on GPU but does not copy.

Principal use: Temporary arrays. 54 ARRAY SHAPING

▪ Sometimes the compiler needs help understanding the shape of an array
▪ The first number is the start index of the array
▪ In C/C++, the second number is how much data is to be transferred
▪ In Fortran, the second number is the ending index

C/C++:   copy(array[starting_index:length])
Fortran: copy(array(starting_index:ending_index))

55 ARRAY SHAPING (CONT.) Multi-dimensional Array shaping

C/C++:   copy(array[0:N][0:M])
Fortran: copy(array(1:N, 1:M))

Both of these examples copy a 2D array to the device.

56 ARRAY SHAPING (CONT.) Partial Arrays

C/C++:   copy(array[i*N/4:N/4])
Fortran: copy(array(i*N/4:i*N/4+N/4))

Both of these examples copy only one quarter of the full array.
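Putting data clauses and array shapes together, here is a hedged sketch (an assumed vector-add example, not the lab code) of shaping inputs and outputs explicitly:

/* a and b are inputs only (copyin), c is a result only (copyout);
   each shape is start:length in C/C++.                             */
void vec_add(int n, const float *a, const float *b, float *c)
{
    #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}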

57 OPTIMIZED DATA MOVEMENT

while ( err > tol && iter < iter_max ) {
  err = 0.0;

  // Data clauses provide the necessary "shape" to the arrays.
  #pragma acc parallel loop reduction(max:err) copyin(A[0:n*m]) copy(Anew[0:n*m])
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                           A[j-1][i] + A[j+1][i]);
      err = max(err, abs(Anew[j][i] - A[j][i]));
    }
  }

  #pragma acc parallel loop copyin(Anew[0:n*m]) copyout(A[0:n*m])
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }

  iter++;
}

58 TRY TO BUILD WITHOUT "MANAGED" Change -ta=tesla:managed to remove "managed"

pgcc -ta=tesla -Minfo=accel laplace2d.c jacobi.c
laplace2d.c:
calcNext:
     47, Generating copyin(A[:m*n])
         Accelerator kernel generated
         Generating Tesla code
         48, #pragma acc loop gang /* blockIdx.x */
             Generating reduction(max:error)
         50, #pragma acc loop vector(128) /* threadIdx.x */
     47, Generating implicit copy(error)
         Generating copy(Anew[:m*n])
     50, Loop is parallelizable
swap:
     62, Generating copyin(Anew[:m*n])
         Generating copyout(A[:m*n])
         Accelerator kernel generated
         Generating Tesla code
         63, #pragma acc loop gang /* blockIdx.x */
         65, #pragma acc loop vector(128) /* threadIdx.x */
     65, Loop is parallelizable
jacobi.c:

59 OPENACC SPEED-UP SLOWDOWN

(Speed-up chart: SERIAL 1.00X, MULTICORE 3.23X, V100 41.80X, V100 (DATA CLAUSES) 0.33X.)

60 WHAT WENT WRONG?

▪ The code now has all of the information necessary to build without managed memory, but it runs much slower. ▪ Profiling tools are here to help!

61 APPLICATION PROFILE (2 STEPS)

62 APPLICATION PROFILE (2 STEPS)

Data Copies

63 RUNTIME BREAKDOWN Nearly all of our time is spent moving data to/from the GPU

(Runtime breakdown chart: Data Copy H2D, Data Copy D2H, CalcNext, Swap.)

64 OPTIMIZED DATA MOVEMENT

while ( err > tol && iter < iter_max ) {
  err = 0.0;

  // Currently we're copying to/from the GPU for each loop; can we reuse the data?
  #pragma acc parallel loop reduction(max:err) copyin(A[0:n*m]) copy(Anew[0:n*m])
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                           A[j-1][i] + A[j+1][i]);
      err = max(err, abs(Anew[j][i] - A[j][i]));
    }
  }

  #pragma acc parallel loop copyin(Anew[0:n*m]) copyout(A[0:n*m])
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }

  iter++;
}

65 OPTIMIZE DATA MOVEMENT

66 OPENACC DATA DIRECTIVE Definition

▪ The data directive defines a lifetime for data on the device beyond individual loops
▪ During the region the data is essentially "owned by" the accelerator
▪ Data clauses express shape and data movement for the region

C/C++:
  #pragma acc data clauses
  {
    < Sequential and/or Parallel code >
  }

Fortran:
  !$acc data clauses
  < Sequential and/or Parallel code >
  !$acc end data

67 STRUCTURED DATA DIRECTIVE Example

#pragma acc data copyin(a[0:N],b[0:N]) copyout(c[0:N])
{
  #pragma acc parallel loop
  for(int i = 0; i < N; i++){
    c[i] = a[i] + b[i];
  }
}

Actions performed: allocate A, B, and C on the device; copy A and B from host memory to the device; execute the loop on the device; copy C from the device back to host memory; deallocate A, B, and C on the device.

68 OPTIMIZED DATA MOVEMENT

// Copy A to/from the accelerator only when needed.
// Copy the initial condition of Anew, but not its final value.
#pragma acc data copy(A[:n*m]) copyin(Anew[:n*m])
while ( err > tol && iter < iter_max ) {
  err = 0.0;

  #pragma acc parallel loop reduction(max:err) copyin(A[0:n*m]) copy(Anew[0:n*m])
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                           A[j-1][i] + A[j+1][i]);
      err = max(err, abs(Anew[j][i] - A[j][i]));
    }
  }

  #pragma acc parallel loop copyin(Anew[0:n*m]) copyout(A[0:n*m])
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }

  iter++;
}

69 REBUILD THE CODE

pgcc -fast -ta=tesla -Minfo=accel laplace2d_uvm.c
main:
     60, Generating copy(A[:m*n])
         Generating copyin(Anew[:m*n])
     64, Accelerator kernel generated
         Generating Tesla code
         64, Generating reduction(max:error)
         65, #pragma acc loop gang /* blockIdx.x */
         67, #pragma acc loop vector(128) /* threadIdx.x */
     67, Loop is parallelizable
     75, Accelerator kernel generated
         Generating Tesla code
         76, #pragma acc loop gang /* blockIdx.x */
         78, #pragma acc loop vector(128) /* threadIdx.x */
     78, Loop is parallelizable

Now data movement only happens at our data region.

70 OPENACC SPEED-UP

(Speed-up chart: SERIAL 1.00X, MULTICORE 3.23X, V100 41.80X, V100 (DATA) 42.99X.)

71 WHAT WE'VE LEARNED SO FAR

▪ CUDA Unified (Managed) Memory is a powerful porting tool ▪ GPU programming without managed memory often requires data shaping ▪ Moving data at each loop is often inefficient ▪ The OpenACC Data region can decouple data movement and computation

72 DATA SYNCHRONIZATION

73 OPENACC UPDATE DIRECTIVE

update: explicitly transfers data between the host and the device. Useful when you want to synchronize data in the middle of a data region.

Clauses:
self: makes host data agree with device data
device: makes device data agree with host data

C/C++:
  #pragma acc update self(x[0:count])
  #pragma acc update device(x[0:count])

Fortran:
  !$acc update self(x(1:end_index))
  !$acc update device(x(1:end_index))

74 OPENACC UPDATE DIRECTIVE

#pragma acc update device(A[0:N])    // copies A from CPU memory to device memory

#pragma acc update self(A[0:N])      // copies A from device memory back to CPU memory

The data must exist on both the CPU and the device for the update directive to work.

75 SYNCHRONIZE DATA WITH UPDATE

▪ Sometimes data changes on the host or device inside a data region
▪ Ending the data region and starting a new one is expensive
▪ Instead, update the data so that the host and device copies are the same
▪ Examples: file I/O, communication, etc.

int* A = (int*) malloc(N * sizeof(int));
#pragma acc data create(A[0:N])
while( timestep++ < numSteps )
{
  #pragma acc parallel loop
  for(int i = 0; i < N; i++){
    A[i] *= 2;
  }
  if (timestep % 100 == 0) {
    #pragma acc update self(A[0:N])
    checkpointAToFile(A, N);
  }
}

76 UNSTRUCTURED DATA DIRECTIVES

77 UNSTRUCTURED DATA DIRECTIVES Enter Data Directive

▪ Data lifetimes aren't always neatly structured
▪ The enter data directive handles device memory allocation
▪ You may use either the create or the copyin clause for memory allocation
▪ The enter data directive is not the start of a data region, because you may have multiple enter data directives

C/C++:
  #pragma acc enter data clauses
  < Sequential and/or Parallel code >
  #pragma acc exit data clauses

Fortran:
  !$acc enter data clauses
  < Sequential and/or Parallel code >
  !$acc exit data clauses

78 UNSTRUCTURED DATA DIRECTIVES Exit Data Directive

▪ The exit data directive handles device memory deallocation
▪ You may use either the delete or the copyout clause for memory deallocation
▪ You should have as many exit data directives for a given array as enter data directives
▪ These can exist in different functions

C/C++:
  #pragma acc enter data clauses
  < Sequential and/or Parallel code >
  #pragma acc exit data clauses

Fortran:
  !$acc enter data clauses
  < Sequential and/or Parallel code >
  !$acc exit data clauses

79 UNSTRUCTURED DATA CLAUSES

copyin ( list )  Allocates memory on the device and copies data from the host to the device on enter data.
copyout ( list ) Allocates memory on the device and copies data back to the host on exit data.
create ( list )  Allocates memory on the device without data transfer on enter data.
delete ( list )  Deallocates memory on the device without data transfer on exit data.

80 UNSTRUCTURED DATA DIRECTIVES Basic Example

#pragma acc parallel loop for(int i = 0; i < N; i++){ c[i] = a[i] + b[i]; }

81 UNSTRUCTURED DATA DIRECTIVES Basic Example

#pragma acc enter data copyin(a[0:N],b[0:N]) create(c[0:N])

#pragma acc parallel loop for(int i = 0; i < N; i++){ c[i] = a[i] + b[i]; }

#pragma acc exit data copyout(c[0:N]) delete(a,b)

82 UNSTRUCTURED VS STRUCTURED With a simple code

Unstructured:
▪ Can have multiple starting/ending points
▪ Can branch across multiple functions
▪ Memory exists until explicitly deallocated

  #pragma acc enter data copyin(a[0:N],b[0:N]) \
    create(c[0:N])
  #pragma acc parallel loop
  for(int i = 0; i < N; i++){
    c[i] = a[i] + b[i];
  }
  #pragma acc exit data copyout(c[0:N]) \
    delete(a,b)

Structured:
▪ Must have explicit start/end points
▪ Must be within a single function
▪ Memory only exists within the data region

  #pragma acc data copyin(a[0:N],b[0:N]) \
    copyout(c[0:N])
  {
    #pragma acc parallel loop
    for(int i = 0; i < N; i++){
      c[i] = a[i] + b[i];
    }
  }

83 C++ STRUCTS/CLASSES With dynamic data members

▪ C++ structs/classes work the same exact way as they do in C
▪ The main difference is that now we have to account for the implicit "this" pointer

class vector {
  private:
    float *arr;
    int n;
  public:
    vector(int size){
      n = size;
      arr = new float[n];
      #pragma acc enter data copyin(this)
      #pragma acc enter data create(arr[0:n])
    }
    ~vector(){
      #pragma acc exit data delete(arr)
      #pragma acc exit data delete(this)
      delete [] arr;
    }
};

84 C++ CLASS DATA SYNCHRONIZATION

▪ Since data is encapsulated, the class needs to be extended to include data synchronization methods
▪ Including explicit methods for host/device synchronization may ease C++ data management
▪ Allows the class to naturally handle synchronization, creating less code clutter

vector.accUpdateSelf();     // device memory -> CPU memory
vector.accUpdateDevice();   // CPU memory -> device memory

void accUpdateSelf() {
  #pragma acc update self(arr[0:n])
}
void accUpdateDevice() {
  #pragma acc update device(arr[0:n])
}

85 UNSTRUCTURED DATA DIRECTIVES Branching across multiple functions

▪ In this example enter data and exit data are in different functions
▪ This allows the programmer to put device allocation/deallocation with the matching host versions
▪ This pattern is particularly useful in C++, where structured scopes may not be possible

int* allocate_array(int N){
  int* ptr = (int *) malloc(N * sizeof(int));
  #pragma acc enter data create(ptr[0:N])
  return ptr;
}

void deallocate_array(int* ptr){
  #pragma acc exit data delete(ptr)
  free(ptr);
}

int main(){
  int* a = allocate_array(100);
  #pragma acc kernels
  {
    a[0] = 0;
  }
  deallocate_array(a);
}

87 GANGS, WORKERS, AND VECTORS DEMYSTIFIED

88 GANGS, WORKERS, AND VECTORS DEMYSTIFIED

89 GANGS, WORKERS, AND VECTORS DEMYSTIFIED

90 GANGS, WORKERS, AND VECTORS DEMYSTIFIED

How much work 1 worker can do is limited by his speed.

A single worker can only move so fast.

91 GANGS, WORKERS, AND VECTORS DEMYSTIFIED

Even if we increase the size of his roller, he can only paint so fast. We need more workers!

92 GANGS, WORKERS, AND VECTORS DEMYSTIFIED

Multiple workers can do more work and share resources, if organized properly.

93 GANGS, WORKERS, AND VECTORS DEMYSTIFIED

By organizing our workers into groups (gangs), they can effectively work together within a floor.

Groups (gangs) on different floors can operate independently.

Since gangs operate independently, we can use as many or few as we need.

94 GANGS, WORKERS, AND VECTORS DEMYSTIFIED

Even if there’s not enough gangs for each floor, they can move to another floor when ready.

95 GANGS, WORKERS, AND VECTORS DEMYSTIFIED

Our painter is like an OpenACC worker; he can only do so much.

His roller is like a vector; he can move faster by covering more wall at once.

Eventually we need more workers, which can be organized into gangs to get more done.

96 LOOP OPTIMIZATIONS

97 OPENACC LOOP DIRECTIVE Expressing parallelism

▪ Mark a single for loop for parallelization
▪ Allows the programmer to give additional information and/or optimizations about the loop
▪ Provides many different ways to describe the type of parallelism to apply to the loop
▪ Must be contained within an OpenACC compute region (either a kernels or a parallel region) to parallelize loops

C/C++:
  #pragma acc loop
  for(int i = 0; i < N; i++)
    // Do something

Fortran:
  !$acc loop
  do i = 1, N
    ! Do something

98 COLLAPSE CLAUSE

▪ collapse( N )
▪ Combine the next N tightly nested loops
▪ Can turn a multidimensional loop nest into a single-dimension loop
▪ This can be extremely useful for increasing memory locality, as well as creating larger loops to expose more parallelism

#pragma acc parallel loop collapse(2)
for( i = 0; i < size; i++ )
  for( j = 0; j < size; j++ ) {
    double tmp = 0.0f;
    #pragma acc loop reduction(+:tmp)
    for( k = 0; k < size; k++ )
      tmp += a[i][k] * b[k][j];
    c[i][j] = tmp;
  }

99 COLLAPSE CLAUSE

collapse( 2 ) turns the 4x4 iteration space (i,j) = (0,0) through (3,3) into a single 16-iteration loop:

#pragma acc parallel loop collapse(2)
for( i = 0; i < 4; i++ )
  for( j = 0; j < 4; j++ )
    array[i][j] = 0.0f;

100 COLLAPSE CLAUSE When/Why to use it

▪ A single loop might not have enough iterations to parallelize
▪ Collapsing outer loops gives more scalable (gang) parallelism
▪ Collapsing inner loops gives more tight (vector) parallelism (see the sketch below)
▪ Collapsing all loops gives the compiler total freedom, but may cost data locality
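As an illustration of those trade-offs, a hedged sketch (the array and bound names are invented) that collapses the two outer loops for gang parallelism while keeping the contiguous innermost loop as the vector loop:

/* Collapse i and j for more gang parallelism; keep k (contiguous) for vectors. */
void scale3d(int ni, int nj, int nk,
             float in[ni][nj][nk], float out[ni][nj][nk])
{
    #pragma acc parallel loop gang collapse(2)
    for (int i = 0; i < ni; i++)
        for (int j = 0; j < nj; j++) {
            #pragma acc loop vector
            for (int k = 0; k < nk; k++)
                out[i][j][k] = 2.0f * in[i][j][k];
        }
}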

101 COLLAPSE CLAUSE

#pragma acc data copy(A[:n*m]) copyin(Anew[:n*m])
while ( err > tol && iter < iter_max ) {
  err = 0.0;

  // Collapse the 2 loops into one for more flexibility in parallelizing.
  #pragma acc parallel loop reduction(max:err) collapse(2) \
    copyin(A[0:n*m]) copy(Anew[0:n*m])
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                           A[j-1][i] + A[j+1][i]);
      err = max(err, abs(Anew[j][i] - A[j][i]));
    }
  }

  #pragma acc parallel loop collapse(2) \
    copyin(Anew[0:n*m]) copyout(A[0:n*m])
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }

  iter++;
}

102 OPENACC SPEED-UP

(Speed-up chart: NVIDIA TESLA V100 (DATA) 1.00X, NVIDIA TESLA V100 (COLLAPSE) 1.03X.)

PGI 18.7, NVIDIA Tesla V100, Intel i9-7900X CPU @ 3.30GHz 103 TILE CLAUSE

▪ tile ( x, y, z, ... )
▪ Breaks multidimensional loops into "tiles" or "blocks"
▪ Can increase data locality in some codes
▪ Will be able to execute multiple "tiles" simultaneously

#pragma acc kernels loop tile(32, 32)
for( i = 0; i < size; i++ )
  for( j = 0; j < size; j++ )
    for( k = 0; k < size; k++ )
      c[i][j] += a[i][k] * b[k][j];

104 TILE CLAUSE

tile( 2, 2 ) breaks the 4x4 iteration space into four 2x2 tiles:

#pragma acc kernels loop tile(2,2)
for(int x = 0; x < 4; x++){
  for(int y = 0; y < 4; y++){
    array[x][y]++;
  }
}

105 OPTIMIZED DATA MOVEMENT

#pragma acc data copy(A[:n*m]) copyin(Anew[:n*m])
while ( err > tol && iter < iter_max ) {
  err = 0.0;

  // Create 32x32 tiles of the loops to better exploit data locality.
  #pragma acc parallel loop reduction(max:err) tile(32,32) \
    copyin(A[0:n*m]) copy(Anew[0:n*m])
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                           A[j-1][i] + A[j+1][i]);
      err = max(err, abs(Anew[j][i] - A[j][i]));
    }
  }

  #pragma acc parallel loop tile(32,32) \
    copyin(Anew[0:n*m]) copyout(A[0:n*m])
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }

  iter++;
}

106 TILING RESULTS (V100)

Tile size   CPU Improvement   GPU Improvement
Baseline    1.00X             1.00X
4x4         1.00X             0.77X
4x8         1.00X             0.92X
8x4         1.00X             0.95X
8x8         1.00X             1.02X
8x16        1.00X             0.99X
16x8        1.00X             1.02X
16x16       1.00X             1.01X
16x32       1.00X             1.06X
32x16       1.00X             1.07X
32x32       1.00X             1.10X

The tile clause often requires an exhaustive search of options. For our example code:
• The CPU saw no benefit from tiling
• The GPU saw anywhere from a 23% loss of performance to a 10% improvement

107 OPENACC SPEED-UP

(Speed-up chart: NVIDIA TESLA V100 (DATA) 1.00X, NVIDIA TESLA V100 (COLLAPSE) 1.03X, NVIDIA TESLA V100 (TILE) 1.13X.)

PGI 18.7, NVIDIA Tesla V100, Intel i9-7900X CPU @ 3.30GHz 108 GANG, WORKER, AND VECTOR CLAUSES

▪ The developer can instruct the compiler which levels of parallelism to use on given loops by adding clauses:
▪ gang: mark this loop for gang parallelism
▪ worker: mark this loop for worker parallelism
▪ vector: mark this loop for vector parallelism
These can be combined on the same loop.

#pragma acc parallel loop gang
for( i = 0; i < size; i++ )
  #pragma acc loop worker
  for( j = 0; j < size; j++ )
    #pragma acc loop vector
    for( k = 0; k < size; k++ )
      c[i][j] += a[i][k] * b[k][j];

#pragma acc parallel loop \
  collapse(3) gang vector
for( i = 0; i < size; i++ )
  for( j = 0; j < size; j++ )
    for( k = 0; k < size; k++ )
      c[i][j] += a[i][k] * b[k][j];

109 SEQ CLAUSE

▪ The seq clause (short for sequential) tells the compiler to run the loop sequentially
▪ In the sample code, the compiler will parallelize the outer loops across the parallel threads, but each thread will run the inner-most loop sequentially
▪ The compiler may automatically apply the seq clause to loops as well

#pragma acc parallel loop
for( i = 0; i < size; i++ )
  #pragma acc loop
  for( j = 0; j < size; j++ )
    #pragma acc loop seq
    for( k = 0; k < size; k++ )
      c[i][j] += a[i][k] * b[k][j];

110 ADJUSTING GANGS, WORKERS, AND VECTORS

The compiler will choose a number of gangs, workers, and a vector length for you, but you can change it with clauses:
▪ num_gangs(N): generate N gangs for this parallel region
▪ num_workers(M): generate M workers for this parallel region
▪ vector_length(Q): use a vector length of Q for this parallel region

#pragma acc parallel num_gangs(2) \
  num_workers(2) vector_length(32)
{
  #pragma acc loop gang worker
  for(int x = 0; x < 4; x++){
    #pragma acc loop vector
    for(int y = 0; y < 32; y++){
      array[x][y]++;
    }
  }
}

111 COLLAPSE CLAUSE WITH VECTOR LENGTH

#pragma acc data copy(A[:n*m]) copyin(Anew[:n*m])
while ( err > tol && iter < iter_max ) {
  err = 0.0;

  #pragma acc parallel loop reduction(max:err) collapse(2) vector_length(1024) \
    copyin(A[0:n*m]) copy(Anew[0:n*m])
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1] +
                           A[j-1][i] + A[j+1][i]);
      err = max(err, abs(Anew[j][i] - A[j][i]));
    }
  }

  #pragma acc parallel loop collapse(2) vector_length(1024) \
    copyin(Anew[0:n*m]) copyout(A[0:n*m])
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }

  iter++;
}

112 OPENACC SPEED-UP

(Speed-up chart: NVIDIA TESLA V100 (DATA) 1.00X, (COLLAPSE) 1.03X, (TILE) 1.13X, (COLLAPSE/VECTOR) 1.12X.)

PGI 18.7, NVIDIA Tesla V100, Intel i9-7900X CPU @ 3.30GHz 113 LOOP OPTIMIZATION RULES OF THUMB

▪ It is rarely a good idea to set the number of gangs in your code; let the compiler decide.
▪ Most of the time you can effectively tune a loop nest by adjusting only the vector length.
▪ It is rare to use a worker loop. When the vector length is very short, a worker loop can increase the parallelism in your gang.
▪ When possible, the vector loop should step through your arrays.
▪ Gangs should come from outer loops, vectors from inner loops (a sketch follows this list).
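A hedged sketch of those rules applied to a generic nest (the names are invented): gang parallelism on the outer loop, vector parallelism on the inner contiguous loop, with only the vector length adjusted by hand:

/* Gang parallelism across rows; vector parallelism along the contiguous
   column index; vector_length is the only knob being tuned here.        */
void smooth(int rows, int cols, float in[rows][cols], float out[rows][cols])
{
    #pragma acc parallel loop gang vector_length(128)
    for (int j = 0; j < rows; j++) {
        #pragma acc loop vector
        for (int i = 0; i < cols; i++)     /* i steps through memory contiguously */
            out[j][i] = 0.5f * in[j][i];
    }
}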

114 CLOSING REMARKS

115 KEY CONCEPTS In this lab we discussed… ▪ Some details that are available to use from a GPU profile ▪ Gangs, Workers, and Vectors Demystified ▪ Collapse clause ▪ Tile clause ▪ Gang/Worker/Vector clauses

116 OPENACC RESOURCES

Guides ● Talks ● Tutorials ● Videos ● Books ● Spec ● Code Samples ● Teaching Materials ● Events ● Success Stories ● Courses ● Slack ● Stack Overflow

Resources: https://www.openacc.org/resources
Success Stories: https://www.openacc.org/success-stories
Compilers and Tools (FREE compilers): https://www.openacc.org/tools
Events: https://www.openacc.org/events
Slack: https://www.openacc.org/community#slack

INTRODUCTION TO OPENMP

118 OpenMP Worksharing

PARALLEL Directive

Spawns a team of threads. Execution continues redundantly on all threads of the team. All threads join at the end and the master thread continues execution.

119 OpenMP Worksharing

FOR/DO (Loop) Directive

Divides ("workshares") the iterations of the next loop across the threads in the team. How the iterations are divided is determined by a schedule.
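A minimal sketch (not from the lab code) of the two directives together, with a schedule clause controlling how iterations are divided:

#include <omp.h>

void init(int n, double *x)
{
    /* PARALLEL spawns the team; FOR workshares the loop across it.
       schedule(static) gives each thread one contiguous chunk;
       schedule(dynamic) would hand out chunks on demand instead.   */
    #pragma omp parallel
    {
        #pragma omp for schedule(static)
        for (int i = 0; i < n; i++)
            x[i] = 0.0;
    }
}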

120 CPU-Parallelism

while ( error > tol && iter < iter_max ) {
  error = 0.0;

  // Create a team of threads and workshare this loop across those threads.
  #pragma omp parallel for reduction(max:error)
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                          + A[j-1][i] + A[j+1][i]);
      error = fmax( error, fabs(Anew[j][i] - A[j][i]));
    }
  }

  // Create a team of threads and workshare this loop across those threads.
  #pragma omp parallel for
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }

  if(iter++ % 100 == 0) printf("%5d, %0.6f\n", iter, error);
}

121 CPU-Parallelism, single region

while ( error > tol && iter < iter_max ) {
  error = 0.0;

  // Create a team of threads once for both loops.
  #pragma omp parallel
  {
    // Workshare this loop.
    #pragma omp for reduction(max:error)
    for( int j = 1; j < n-1; j++) {
      for( int i = 1; i < m-1; i++ ) {
        Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                            + A[j-1][i] + A[j+1][i]);
        error = fmax( error, fabs(Anew[j][i] - A[j][i]));
      }
    }

    // The implicit barrier from the "for" above ensures correctness here.
    #pragma omp for
    for( int j = 1; j < n-1; j++) {
      for( int i = 1; i < m-1; i++ ) {
        A[j][i] = Anew[j][i];
      }
    }
  }
  if(iter++ % 100 == 0) printf("%5d, %0.6f\n", iter, error);
}

122 CPU-Parallelism, with SIMD

while ( error > tol && iter < iter_max ) {
  error = 0.0;

  // Some compilers want a SIMD directive to simdize on CPUs.
  #pragma omp parallel for reduction(max:error)
  for( int j = 1; j < n-1; j++) {
    #pragma omp simd
    for( int i = 1; i < m-1; i++ ) {
      Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                          + A[j-1][i] + A[j+1][i]);
      error = fmax( error, fabs(Anew[j][i] - A[j][i]));
    }
  }

  #pragma omp parallel for
  for( int j = 1; j < n-1; j++) {
    #pragma omp simd
    for( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }

  if(iter++ % 100 == 0) printf("%5d, %0.6f\n", iter, error);
}

123 Targeting the GPU

124 OpenMP Offloading

The TARGET Directive offloads execution and associated data from the CPU to the GPU.
• The target device owns the data; accesses by the CPU during the execution of the target region are forbidden.
• Data used within the region may be implicitly or explicitly mapped to the device.
• All of OpenMP is allowed within target regions, but only a subset will run well on GPUs.

125 OpenMP Data Mapping

126 Target the GPU

// Moves this region of code to the GPU and implicitly maps data.
while ( error > tol && iter < iter_max ) {
  error = 0.0;
  #pragma omp target
  {
    #pragma omp parallel for reduction(max:error)
    for( int j = 1; j < n-1; j++) {
      for( int i = 1; i < m-1; i++ ) {
        Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                            + A[j-1][i] + A[j+1][i]);
        error = fmax( error, fabs(Anew[j][i] - A[j][i]));
      }
    }

    #pragma omp parallel for
    for( int j = 1; j < n-1; j++) {
      for( int i = 1; i < m-1; i++ ) {
        A[j][i] = Anew[j][i];
      }
    }
  }
  if(iter++ % 100 == 0) printf("%5d, %0.6f\n", iter, error);
}


128 Target the GPU

// Moves this region of code to the GPU and explicitly maps data.
while ( error > tol && iter < iter_max ) {
  error = 0.0;
  #pragma omp target map(to:Anew[:n+2][:m+2]) \
                     map(tofrom:A[:n+2][:m+2]) map(tofrom:error)
  {
    #pragma omp parallel for reduction(max:error)
    for( int j = 1; j < n-1; j++) {
      for( int i = 1; i < m-1; i++ ) {
        Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                            + A[j-1][i] + A[j+1][i]);
        error = fmax( error, fabs(Anew[j][i] - A[j][i]));
      }
    }

    #pragma omp parallel for
    for( int j = 1; j < n-1; j++) {
      for( int i = 1; i < m-1; i++ ) {
        A[j][i] = Anew[j][i];
      }
    }
  }
  if(iter++ % 100 == 0) printf("%5d, %0.6f\n", iter, error);
}

130 Excessive Data Movement

A and Anew arrays are copied for each iteration.

131 OpenMP Data Offloading

The TARGET DATA Directive offloads data from the CPU to the GPU, but not execution.
• The target device owns the data; accesses by the CPU during the execution of contained target regions are forbidden.
• Useful for sharing data between TARGET regions.
• NOTE: a TARGET region is a TARGET DATA region.

132 Improve Data Movement

// Explicitly map the data once for the whole simulation.
#pragma omp target data map(to:Anew) map(A)
while ( error > tol && iter < iter_max ) {
  error = 0.0;
  #pragma omp target map(error)
  {
    #pragma omp parallel for reduction(max:error)
    for( int j = 1; j < n-1; j++) {
      for( int i = 1; i < m-1; i++ ) {
        Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                            + A[j-1][i] + A[j+1][i]);
        error = fmax( error, fabs(Anew[j][i] - A[j][i]));
      }
    }
    #pragma omp parallel for
    for( int j = 1; j < n-1; j++) {
      for( int i = 1; i < m-1; i++ ) {
        A[j][i] = Anew[j][i];
      }
    }
  } // OMP TARGET
}

133 Improved Data Movement

A and Anew arrays are copied before and after the simulation, only “error” is copied per iteration.

134 Speed-Up

(Speed-up chart comparing XLC, CLANG, and GCC for the Serial, Threaded, Fused, SIMD, and Best variants.)

IBM Power9, NVIDIA Tesla V100

135 GPU Architecture Basics
GPUs are composed of one or more independent parts, known as Streaming Multiprocessors ("SMs"). Threads are organized into threadblocks. Threads within the same threadblock run on an SM and can synchronize. Threads in different threadblocks (even if they're on the same SM) cannot synchronize. CUDA 9 introduces cooperative groups, which may address this.

136 Teaming Up

137 OpenMP Teams

TEAMS Directive: to better utilize the GPU resources, use many thread teams via the TEAMS directive.
• Spawns one or more thread teams with the same number of threads
• Execution continues on the master threads of each team (redundantly)
• No synchronization between teams

138 OpenMP Teams

DISTRIBUTE Directive: distributes the iterations of the next loop to the master threads of the teams.
• Iterations are distributed statically
• There are no guarantees about the order in which teams will execute
• No guarantee that all teams will execute simultaneously
• Does not generate parallelism/worksharing within the thread teams

139 Teaming Up

// Explicitly maps arrays for the entire while loop.
#pragma omp target data map(to:Anew) map(A)
while ( error > tol && iter < iter_max ) {
  error = 0.0;

  // Spawns thread teams, distributes iterations to those teams,
  // and workshares within those teams.
  #pragma omp target teams distribute parallel for reduction(max:error) map(error)
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                          + A[j-1][i] + A[j+1][i]);
      error = fmax( error, fabs(Anew[j][i] - A[j][i]));
    }
  }

  #pragma omp target teams distribute parallel for
  for( int j = 1; j < n-1; j++) {
    for( int i = 1; i < m-1; i++ ) {
      A[j][i] = Anew[j][i];
    }
  }

  if(iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);

  iter++;
}

140 Speed-Up

(Speed-up chart comparing XLC, CLANG, and GCC with the teams distribute parallel for version.)

IBM Power9, NVIDIA Tesla V100 141 Increasing Parallelism

142 Increasing Parallelism

Currently both our distributed and workshared parallelism come from the same loop.
• We could collapse them together
• We could move the PARALLEL to the inner loop
The COLLAPSE(N) clause:
• Turns the next N loops into one, linearized loop
• This will give us more parallelism to distribute, if we so choose

143 Collapse

// Collapse the two loops into one and then parallelize the new loop
// across both teams and threads.
#pragma omp target teams distribute parallel for reduction(max:error) map(error) \
        collapse(2)
for( int j = 1; j < n-1; j++)
{
  for( int i = 1; i < m-1; i++ )
  {
    Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                        + A[j-1][i] + A[j+1][i]);
    error = fmax( error, fabs(Anew[j][i] - A[j][i]));
  }
}

#pragma omp target teams distribute parallel for collapse(2)
for( int j = 1; j < n-1; j++)
{
  for( int i = 1; i < m-1; i++ )
  {
    A[j][i] = Anew[j][i];
  }
}

144 Splitting Teams & Parallel

// Distribute the "j" loop over teams.
#pragma omp target teams distribute map(error)
for( int j = 1; j < n-1; j++)
{
  #pragma omp parallel for reduction(max:error)
  for( int i = 1; i < m-1; i++ )
  {
    Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                        + A[j-1][i] + A[j+1][i]);
    error = fmax( error, fabs(Anew[j][i] - A[j][i]));
  }
}

#pragma omp target teams distribute
for( int j = 1; j < n-1; j++)
{
  // Workshare the "i" loop over threads.
  #pragma omp parallel for
  for( int i = 1; i < m-1; i++ )
  {
    A[j][i] = Anew[j][i];
  }
}

145 Speed-Up

(Speed-up chart comparing XLC, CLANG, and GCC with the split teams distribute / parallel for version.)

IBM Power9, NVIDIA Tesla V100 146