Programming Your GPU with OpenMP*

Simon McIntosh-Smith, Matt Martineau, Andrei Poenaru, Patrick Atkinson University of Bristol [email protected]

* The name “OpenMP” is the property of the OpenMP Architecture Review Board.

Preliminaries: Part 2

• Our plan for the day … active learning!
  – We will mix short lectures with short exercises.
  – You will use your laptop to connect to a multiprocessor server.
• Please follow these simple rules:
  – Do the exercises that we assign, then change things around and experiment.
  – Embrace active learning!
  – Don’t cheat: do not look at the solutions before you complete an exercise … even if you get really frustrated.

Isambard

• Collaboration between the GW4 Alliance (the universities of Bristol, Bath, Cardiff and Exeter), the UK Met Office, Cray and Arm
• ~£3 million, funded by EPSRC
• Expected to be the first large-scale, Arm-based production supercomputer
• 10,000+ Armv8 cores
• Also contains some Intel Xeon Phi (KNL), Intel Xeon (Broadwell), and NVIDIA P100 GPUs
• Cray provide the OpenMP implementation for GPU offloading

Plan

Module | Concepts | Exercises
OpenMP overview | OpenMP recap; hardcore jargon of OpenMP | None … we’ll use demo mode so we can move fast
The device model in OpenMP | Intro to the target directive with default data movement | Vadd program
Understanding execution | Intro to nvprof | nvprof ./vadd
Working with the target directive | CPU and GPU execution models and the fundamental combined directive | Vadd with the combined directive, using nvprof to understand execution
Basic memory movement | The map clause | Jacobi solver
Optimizing memory movement | Target data regions | Jacobi with explicit data movement
Optimizing GPU code | Common GPU optimizations | Optimized Jacobi solver

Agenda
• OpenMP overview
• The device model in OpenMP
• Understanding execution on the GPU with nvprof
• Working with the target directive
• Controlling memory movement
• Optimizing GPU code
• CPU/GPU portability

OpenMP: An API for Writing Multithreaded Applications
• A set of directives and library routines for parallel application programmers
• Greatly simplifies writing multi-threaded (MT) programs in Fortran, C and C++
• Standardizes established SMP practice + vectorization and heterogeneous device programming

(Background: a collage of example OpenMP directives and API calls, e.g. #pragma omp critical, C$OMP PARALLEL DO ORDERED PRIVATE(A,B,C), C$OMP PARALLEL REDUCTION(+:A,B), call OMP_INIT_LOCK(ilok), omp_set_lock(lck), setenv OMP_SCHEDULE “dynamic”, #pragma omp parallel for private(A,B), !$OMP BARRIER.)

The OpenMP Common Core: Most OpenMP programs only use these 19 items

OpenMP pragma, function, or clause | Concepts
#pragma omp parallel | Parallel region, teams of threads, structured block, interleaved execution across threads
int omp_get_thread_num(), int omp_get_num_threads() | Create threads with a parallel region and split up the work using the number of threads and the thread ID
double omp_get_wtime() | Speedup and Amdahl’s law. False sharing and other performance issues
setenv OMP_NUM_THREADS N | Internal control variables. Setting the default number of threads with an environment variable
#pragma omp barrier, #pragma omp critical | Synchronization and race conditions. Revisit interleaved execution.
#pragma omp for | Worksharing, parallel loops, loop-carried dependencies
#pragma omp parallel for reduction(op:list) | Reductions of values across a team of threads
schedule(dynamic [,chunk]), schedule(static [,chunk]) | Loop schedules, loop overheads and load balance
private(list), firstprivate(list), shared(list) | Data environment
nowait | Disabling implied barriers on workshare constructs, the high cost of barriers. The flush concept (but not the flush directive)
#pragma omp single | Workshare with a single thread
#pragma omp task, #pragma omp taskwait | Tasks, including the data environment for tasks.

Agenda
• OpenMP overview
• The device model in OpenMP
• Understanding execution on the GPU with nvprof
• Working with the target directive
• Controlling memory movement
• Optimizing GPU code
• CPU/GPU portability

The OpenMP device programming model

• OpenMP uses a host/device model
  – The host is where the initial thread of the program begins execution
  – Zero or more devices are connected to the host

(Figure: a host connected to a device.)

#include <stdio.h>
#include <omp.h>

int main()
{
   printf("There are %d devices\n", omp_get_num_devices());
}

OpenMP with target devices

• The target construct offloads execution to a device.

#pragma omp target
{….} // a structured block of code

1. Program begins; an initial thread runs on the host device.
2. An implicit parallel region surrounds the entire program.
3. The initial task begins execution.
4. The initial thread encounters the target directive.
5. The initial task generates a target task, which is a mergeable, included task.
6. The target task launches the target region on the device.
7. A new initial thread runs on the device.
8. An implicit parallel region surrounds the device program.
9. The initial task executes the code in the target region.
10. The initial task on the host continues once the execution associated with the target region completes.

Commonly used clauses with target

#pragma omp target [clause[[,]clause]...]
   structured-block

• if(scalar-expression)
  – If the scalar-expression evaluates to false, the target region is executed by the host device in the host data environment.
• device(integer-expression)
  – The value of the integer-expression selects the device when a device other than the default device is desired.
• private(list), firstprivate(list)
  – Creates variables with the same name as those in the list on the device. In the case of firstprivate, the value of the original variable on the host is copied into the private variable created on the device.
• map(map-type: list[0:N])
  – map-type may be to, from, tofrom, or alloc. The clause defines how the variables in list are moved between the host and the device.
• nowait
  – The target task is deferred, which means the host can run code in parallel to the target region.
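The clauses combine naturally on one directive. Below is a minimal sketch (not taken from the tutorial exercises; the function, array size and variable names are illustrative) showing if, device, map and firstprivate used together:

#include <omp.h>
#define N 1024

void scale(float *a, float factor, int use_gpu)
{
   // if(): offload only when use_gpu is non-zero, otherwise run on the host.
   // device(0): explicitly pick the first device.
   // map(tofrom: a[0:N]): copy the array to the device and back.
   // firstprivate(factor): the device copy is initialized from the host value
   // (this is the default for scalars in OpenMP 4.5, shown here for clarity).
   #pragma omp target if(use_gpu) device(0) \
                      map(tofrom: a[0:N]) firstprivate(factor)
   #pragma omp teams distribute parallel for
   for (int i = 0; i < N; i++)
      a[i] *= factor;
}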

The target data environment

• Scalars and statically allocated arrays that are referenced in the target region are implicitly mapped onto the device before execution.
• Only the statically allocated arrays are moved back to the host after the target region completes.

float A[N], B[N];
#pragma omp target
{
   // the target region can use A, B and N
}

(Figure: the host thread’s generating task waits while the initial thread on the device runs the target task; A, B and N are mapped to the device at the start of the target region, and the arrays A and B are mapped back to the host at the end.)

Based on figure 6.4 in Using OpenMP – The Next Step by van der Pas, Stotzer and Terboven, MIT Press, 2017.
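A small sketch of this implicit behaviour (the scaling computation and sizes are illustrative, not from the exercises): because a is a statically sized array and N is a compile-time constant, no map clauses are needed; a is copied to the device before the region and copied back afterwards, while the scalar factor is passed in as firstprivate.

#include <stdio.h>
#define N 1024

int main(void)
{
   float a[N];
   float factor = 2.0f;
   for (int i = 0; i < N; i++) a[i] = (float)i;

   // a is mapped tofrom implicitly; factor and the loop index are private.
   #pragma omp target
   #pragma omp teams distribute parallel for
   for (int i = 0; i < N; i++)
      a[i] *= factor;

   printf("a[N-1] = %f\n", a[N-1]);   // reflects the value computed on the device
   return 0;
}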

Using Isambard (1/2)

# Log in to the Isambard bastion node
# This node acts as a gateway to the rest of the system
# You will be assigned an account ID (01, 02, …)
# Your password is openmpGPU18
ssh [email protected]

# Log in to Isambard Phase 1 ssh phase1

# Change to directory containing exercises cd tutorial

# List files
ls

[br-train01@login-01 tutorial]$ ls
jac_solv.c  make.def  Make_def_files  makefile  mm_utils.c  mm_utils.h  pi.c  Solutions  submit_jac_solv  submit_pi  submit_vadd  vadd.c

# jac_solv.c, pi.c and vadd.c are the starting code for the Jacobi, Pi and vadd exercises.
# The submit_* files are the job submission scripts.

Using Isambard (2/2)

# Build exercises
make

# Submit a job
qsub submit_vadd

# Check job status
qstat -u $USER

# Job status codes: Q = queued, R = running,
# C = complete (the job will disappear shortly after completion)

Job ID           Username  Queue    Jobname  SessID  NDS  TSK  Req'd Memory  Req'd Time  S  Elap Time
---------------  --------  -------  -------  ------  ---  ---  ------------  ----------  -  ---------
9376.master.gw4  br-trai   pascalq  vadd     154770  1    36   --            00:02       R  00:00

# Check output from job # Output file will have the job identifier in name # This will be different each time you run a job cat vadd.oXXXX

14 Exercise

• Use the provided serial vadd program which adds two vectors and tests the results. • Use the target construct to run the code on a GPU. • Experiment with the other OpenMP constructs if time.

#pragma omp target
#pragma omp parallel
#pragma omp for
#pragma omp for reduction(op:list)
#pragma omp for private(list)

Agenda
• OpenMP overview
• The device model in OpenMP
• Understanding execution on the GPU with nvprof
• Working with the target directive
• Controlling memory movement
• Optimizing GPU code
• CPU/GPU portability

How do we execute code on a GPU? The SIMT model (Single Instruction Multiple Thread)

1. Turn source code into a scalar work-item.
2. Map work-items onto an N-dimensional index space.
3. Map data structures onto the same index space.
4. Run on hardware designed around the same SIMT execution model.

// This is OpenCL kernel code … the sort of code the OpenMP compiler
// generates on your behalf
extern void reduce( __local float*, __global float*);

__kernel void pi( const int niters, float step_size,
                  __local float* l_sums, __global float* p_sums)
{
   int n_wrk_items = get_local_size(0);
   int loc_id      = get_local_id(0);
   int grp_id      = get_group_id(0);
   float x, accum = 0.0f;
   int i, istart, iend;

   istart = (grp_id * n_wrk_items + loc_id) * niters;
   iend   = istart + niters;

   for (i = istart; i < iend; i++) {
      x = (i + 0.5f) * step_size;
      accum += 4.0f / (1.0f + x * x);
   }

   l_sums[loc_id] = accum;
   barrier(CLK_LOCAL_MEM_FENCE);
   reduce(l_sums, p_sums);
}

Third Party names are the property of their owners.

How do we execute code on a GPU? OpenCL and CUDA nomenclature

The same pi kernel, annotated with OpenCL and CUDA terminology:
• Turn source code into a scalar work-item (a CUDA thread).
• Organize work-items into work-groups and map them onto an N-dimensional index space. CUDA calls a work-group a thread-block.
• Submit a kernel to an OpenCL command queue or a CUDA stream.
• This code defines a kernel.
• The OpenCL index space is called an NDRange. CUDA calls this a Grid.
• It’s called SIMT, but GPUs are really vector architectures, with a block of work-items called a warp executing together.

Third Party names are the property of their owners.
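For comparison, here is a rough sketch (not taken from the tutorial materials; the step count and names are illustrative) of the same numerical integration written with OpenMP target directives. An offloading compiler generates SIMT kernel code, similar in spirit to the pi kernel above, from a loop like this:

#include <stdio.h>

int main(void)
{
   const long num_steps = 1 << 24;              // illustrative problem size
   const double step = 1.0 / (double)num_steps;
   double sum = 0.0;

   // The loop body becomes the per-work-item code; the reduction is
   // handled by the runtime.
   #pragma omp target teams distribute parallel for \
               reduction(+:sum) map(tofrom: sum)
   for (long i = 0; i < num_steps; i++) {
      double x = (i + 0.5) * step;
      sum += 4.0 / (1.0 + x * x);
   }

   printf("pi ~= %f\n", step * sum);
   return 0;
}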

CUDA Toolkit

• The CUDA toolkit works with code written in OpenMP 4.5 without any special configuration.

We will demonstrate using an OpenMP 4.5 version of flow

• Hydrodynamics mini-app solving Euler’s compressible equations

CUDA Toolkit: NVProf
Simple profiling: nvprof ./exe

(nvprof output: the profiling data shows the time to copy data onto the GPU, the time to copy data back from the GPU, and the timings for the computational kernels.)

CUDA Toolkit: NVProf
Trace profiling: nvprof --print-gpu-trace ./exe

• Shows block sizes, grid dimensions and register counts for kernels
• Entries are ordered by time

Other CUDA Toolkit tools

• nvidia-smi – displays some information about NVIDIA devices on a node

• cuda-gdb – a command-line debugger in the style of GDB that allows kernel debugging

• cuda-memcheck – a tool to catch memory access errors in a CUDA application

• nvvp – a visual optimisation application that can use profiling data from nvprof

Exercise

• Use the vadd program from the previous exercise and explore how it executes on a GPU using nvprof. – nvprof ./vadd – nvprof --print-gpu-trace ./vadd • Change the problem size and get a feel for how the grid and block-size change. • You will need to edit the job submission files to submit the job to the queue with nvprof. For example: – Replace the line – ./vadd – with the line (for example) – nvprof --print-gpu-trace ./vadd

Agenda
• OpenMP overview
• The device model in OpenMP
• Understanding execution on the GPU with nvprof
• Working with the target directive
• Controlling memory movement
• Optimizing GPU code
• CPU/GPU portability

OpenMP device model: Examples

A couple of key devices we considered when designing the device model in OpenMP:
• Target device: Intel® Xeon Phi™ processor
• Target device: GPU
(Figure: a host connected to each of these target devices.)

A Generic Host/Device Platform Model

(Figure: a host connected to a device; the device contains compute units, each made up of processing elements.)

• One host and one or more devices
  – Each device is composed of one or more compute units
  – Each compute unit is divided into one or more processing elements
• Memory is divided into host memory and device memory

Third party names are the property of their owners.

Our host/device Platform Model and OpenMP

(Figure: the same host/device diagram, annotated with the OpenMP constructs that map onto each level.)

• target construct to get onto a device
• teams construct to create a league of teams, with one team of threads on each compute unit
• distribute construct to assign blocks of loop iterations to teams
• parallel for to run each block of loop iterations on the processing elements

Typical usage … let the compiler do what’s best for the device:
• #pragma omp target to get on the device
• #pragma omp teams distribute parallel for to assign work to the device processing elements

Consider the familiar VADD example

#include <stdio.h>
#include <stdlib.h>
#define N 1024
int main()
{
   float a[N], b[N], c[N];

   // initialize a, b and c ….

   for (int i = 0; i < N; i++)
      c[i] = a[i] + b[i];

   // Test results, report results …

}

Consider the familiar VADD example

#include <stdio.h>
#include <stdlib.h>
#define N 1024
int main()
{
   float a[N], b[N], c[N];

   // initialize a, b and c ….

   #pragma omp target
   #pragma omp teams distribute parallel for
   for (int i = 0; i < N; i++)
      c[i] = a[i] + b[i];

   // Test results, report results …

}

The original variables i, a, b and c are copied to the device at the beginning of the construct.

Clauses used with teams distribute parallel for

• reduction(reduction-identifier : list)
  – Behaves as in the rest of OpenMP … but the variable must appear in a map(tofrom) clause on the associated target construct in order to get the value back out at the end.
• collapse(n)
  – Combines loops before the distribute directive splits up the iterations between teams.
• schedule(kind[, chunk_size])
  – Only supports kind = static. Otherwise works the same as when applied to a for construct. Note: this applies to the operation of the distribute directive and controls the distribution of loop iterations onto teams (NOT the distribution of loop iterations inside a team).

Controlling data movement

Data movement can be explicitly controlled with the map clause:

int i, a[N], b[N], c[N];
#pragma omp target map(to: a, b) map(tofrom: c)

• The various forms of the map clause
  – map(to:list): read-only data on the device. Variables in the list are initialized on the device using the original values from the host.
  – map(from:list): write-only data on the device; the initial value of the variable is not initialized. At the end of the target region, the values from variables in the list are copied into the original variables.
  – map(tofrom:list): the effect of both a map-to and a map-from.
  – map(alloc:list): data is allocated and uninitialized on the device.
  – map(list): equivalent to map(tofrom:list).
• For pointers you must use array section notation:
  – map(to: a[0:N])

Default Data Mapping: Scalar variables (i.e. mapping scalars between host and target regions)

• Default mapping behavior for scalar variables
  – OpenMP 4.0 implicitly mapped all scalar variables as tofrom
  – OpenMP 4.5 implicitly maps scalar variables as firstprivate
• Key difference between the two options:
  – firstprivate: a new value is created per work-item and initialized with the original value. The variable is not copied back to the host.
  – map(tofrom): a new value is created in the target region and initialized with the original value, but it is shared between work-items on the device. Its value is copied back to the host at the end of the target region.
• Why is default firstprivate for scalars a better choice?
  – OpenMP target regions for GPUs execute with CUDA or OpenCL. A firstprivate scalar can be launched as a parameter to a kernel function without the overhead of setting up a variable in device memory.
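To make these rules concrete, here is a hedged sketch (array names and sizes are illustrative, not from the exercises): heap-allocated arrays must be mapped explicitly with array sections, and a scalar result must be mapped tofrom (and reduced) because scalars default to firstprivate in OpenMP 4.5.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
   const int N = 1 << 20;                  // illustrative size
   float *in  = malloc(sizeof(float) * N);
   float *out = malloc(sizeof(float) * N);
   float total = 0.0f;                     // scalar: firstprivate by default

   for (int i = 0; i < N; i++) in[i] = 1.0f;

   // in[] is read-only on the device (map(to)), out[] is write-only (map(from)).
   // total must be mapped tofrom, otherwise its final value never leaves the device.
   #pragma omp target map(to: in[0:N]) map(from: out[0:N]) map(tofrom: total)
   #pragma omp teams distribute parallel for reduction(+: total)
   for (int i = 0; i < N; i++) {
      out[i] = 2.0f * in[i];
      total += out[i];
   }

   printf("total = %f\n", total);
   free(in); free(out);
   return 0;
}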

Default Data Sharing

WARNING: Make sure not to confuse the implicit mapping of pointer variables with the data that they point to.

int main(void)
{
   int A = 0;
   int* B = (int*)malloc(sizeof(int)*N);
   #pragma omp target
   #pragma omp teams distribute parallel for
   for (int i = 0; i < N; ++i)
   {
      // A, B, N and i all exist here
      // B is a pointer – the data B points to DOES NOT exist here!
   }
}

If you want to access the data that is pointed to by B, you will need to perform an explicit mapping using the map clause.

Default Data Sharing

WARNING: Make sure not to confuse the implicit mapping of pointer variables with the data that they point to.

int main(void)
{
   int A = 0;
   int* B = (int*)malloc(sizeof(int)*N);
   #pragma omp target map(B[0:N])
   #pragma omp teams distribute parallel for
   for (int i = 0; i < N; ++i)
   {
      // A, B, N and i all exist here
      // The data that B points to DOES exist here!
   }
}

Exercise: Jacobi solver

• Start from the provided jacobi_solver program. Verify that you can run it serially.
• Parallelize for a CPU using the parallel for construct on the major loops.
• Use the target directive to run on a GPU.
  – #pragma omp target
  – #pragma omp target map(to: list[0:N]) map(from: list[0:N]) map(tofrom: list[0:N])
  – #pragma omp target teams distribute parallel for

Jacobi Solver (serial 1/2)

<<< allocate and initialize the matrix A and >>>
<<< vectors xold, xnew and b >>>

while ((conv > TOL) && (iters < MAX_ITERS))
{
   iters++;

   for (int i = 0; i < Ndim; i++) {
      xnew[i] = (TYPE) 0.0;
      for (int j = 0; j < Ndim; j++) {
         if (i != j)
            xnew[i] += (A[i*Ndim + j] * xold[j]);
      }
      xnew[i] = (b[i] - xnew[i]) / A[i*Ndim + i];
   }

   // test convergence
   conv = 0.0;
   for (int i = 0; i < Ndim; i++) {
      TYPE tmp = xnew[i] - xold[i];
      conv += tmp * tmp;
   }
   conv = sqrt((double)conv);

   TYPE* tmp = xold; xold = xnew; xnew = tmp;
} // end while loop

Jacobi Solver (Par Targ, 1/2)

while ((conv > TOL) && (iters < MAX_ITERS))
{
   iters++;

   #pragma omp target map(tofrom: xnew[0:Ndim], xold[0:Ndim]) \
                      map(to: A[0:Ndim*Ndim], b[0:Ndim])
   #pragma omp teams distribute parallel for
   for (int i = 0; i < Ndim; i++) {
      xnew[i] = (TYPE) 0.0;
      for (int j = 0; j < Ndim; j++) {
         if (i != j)
            xnew[i] += (A[i*Ndim + j] * xold[j]);
      }
      xnew[i] = (b[i] - xnew[i]) / A[i*Ndim + i];
   }

   //
   // test convergence
   //
   conv = 0.0;
   #pragma omp target map(to: xnew[0:Ndim], xold[0:Ndim]) \
                      map(tofrom: conv)
   #pragma omp teams distribute parallel for reduction(+:conv)
   for (int i = 0; i < Ndim; i++) {
      TYPE tmp = xnew[i] - xold[i];
      conv += tmp * tmp;
   }
   conv = sqrt((double)conv);

   TYPE* tmp = xold; xold = xnew; xnew = tmp;

} // end while loop

Jacobi Solver (Par Targ, 2/2)

conv = 0.0;
#pragma omp target map(to: xnew[0:Ndim], xold[0:Ndim]) \
                   map(tofrom: conv)
#pragma omp teams distribute parallel for reduction(+:conv)
for (int i = 0; i < Ndim; i++) {
   TYPE tmp = xnew[i] - xold[i];
   conv += tmp * tmp;
}
conv = sqrt((double)conv);

System            | Implementation      | Ndim = 4096
NVIDIA® K20X™ GPU | Target dir per loop | 131.94 secs

Cray® XC40™ Supercomputer running Cray® Compiling Environment 8.5.3. Intel® Xeon® CPU E5-2697 v2 @ 2.70GHz with 32 GB DDR3. NVIDIA® Tesla® K20X, 6GB.
Third party names are the property of their owners.

Agenda
• OpenMP overview
• The device model in OpenMP
• Working with the target directive
• Controlling memory movement
• Optimizing GPU code
• CPU/GPU portability

Data movement dominates!!!

while ((conv > TOLERANCE) && (iters < MAX_ITERS))
{
   iters++;

   #pragma omp target map(tofrom: xnew[0:Ndim], xold[0:Ndim]) \
                      map(to: A[0:Ndim*Ndim], b[0:Ndim])
   #pragma omp teams distribute parallel for
   for (int i = 0; i < Ndim; i++) {
      ...
   }
   ...
}

Each iteration copies (3*Ndim + Ndim^2)*sizeof(TYPE) bytes between the host and the device.

Target data regions

• Data can be allocated and/or copied to the device on a target enter data, or copied back and/or destroyed with a target exit data

#pragma omp target enter data map(to: A[0:N],B[0:N],C[0:N])

#pragma omp target {do lots of stuff with A, B and C} {do something on the host} #pragma omp target {do lots of stuff with A, B, and C}

#pragma omp target exit data map(from: C[0:N])

Target update directive

• You can update data between target regions with the target update directive.

#pragma omp target enter data map(to: A[0:N],B[0:N],C[0:N])

#pragma omp target
{ do lots of stuff with A, B and C }

#pragma omp target update from(A[0:N])   // copy A on the device to A on the host

{ do something on the host }

#pragma omp target update to(A[0:N])     // copy A on the host to A on the device

#pragma omp target
{ do lots of stuff with A, B, and C }

#pragma omp target exit data map(from: C[0:N])
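Putting the pieces together, a small hedged sketch of the enter data / update / exit data pattern described above (the arrays and the host-side edit are illustrative, not from the exercises):

#include <stdio.h>
#define N 1024

int main(void)
{
   static float A[N], B[N], C[N];
   for (int i = 0; i < N; i++) { A[i] = 1.0f; B[i] = 2.0f; }

   // Allocate and copy A, B and C to the device once, up front.
   #pragma omp target enter data map(to: A[0:N], B[0:N], C[0:N])

   // First target region works on the device copies; no per-region maps needed.
   #pragma omp target
   #pragma omp teams distribute parallel for
   for (int i = 0; i < N; i++) C[i] = A[i] + B[i];

   // Do something on the host: refresh A, then push it back to the device.
   for (int i = 0; i < N; i++) A[i] = 3.0f;
   #pragma omp target update to(A[0:N])

   // Second target region sees the updated A.
   #pragma omp target
   #pragma omp teams distribute parallel for
   for (int i = 0; i < N; i++) C[i] += A[i];

   // Copy the result back and release the device copies.
   #pragma omp target exit data map(from: C[0:N])

   printf("C[0] = %f\n", C[0]);   // expect 6.0
   return 0;
}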

Exercise

• Modify your parallel jacobi_solver from the last exercise.
• Use the target enter data construct to create a data region. Manage data movement with map clauses to minimize data movement.

– #pragma omp target enter data map(to: …) – #pragma omp target exit data map(from: …)

– Array section syntax: array-name[offset : length], e.g. A[0:N] Jacobi Solver (Par Targ Data, 1/2)

#pragma omp target enter data map(to: xold[0:Ndim], xnew[0:Ndim], \
                                     A[0:Ndim*Ndim], b[0:Ndim])

while ((conv > TOL) && (iters < MAX_ITERS))
{
   iters++;

   #pragma omp target
   #pragma omp teams distribute parallel for
   for (int i = 0; i < Ndim; i++) {
      xnew[i] = (TYPE) 0.0;
      for (int j = 0; j < Ndim; j++) {
         if (i != j)
            xnew[i] += (A[i*Ndim + j] * xold[j]);
      }
      xnew[i] = (b[i] - xnew[i]) / A[i*Ndim + i];
   }

   //
   // test convergence
   //
   conv = 0.0;
   #pragma omp target map(tofrom: conv)
   #pragma omp teams distribute parallel for reduction(+:conv)
   for (int i = 0; i < Ndim; i++) {
      TYPE tmp = xnew[i] - xold[i];
      conv += tmp * tmp;
   }
   conv = sqrt((double)conv);

   TYPE* tmp = xold; xold = xnew; xnew = tmp;

} // end while loop

#pragma omp target exit data map(from:xold[0:Ndim], xnew[0:Ndim])

System            | Implementation               | Ndim = 4096
NVIDIA® K20X™ GPU | Target dir per loop          | 131.94 secs
                  | Above plus target enter data | 18.37 secs

Third party names are the property of their owners.

Agenda
• OpenMP overview
• The device model in OpenMP
• Working with the target directive
• Controlling memory movement
• Optimizing GPU code
• CPU/GPU portability

Single Instruction Multiple Data

• Individual work-items of a warp start together at the same program address • Each work-item has its own instruction address counter and register state – Each work-item is free to branch and execute independently – Supports the SPMD pattern. • Branch behavior – Each branch will be executed serially – Work-items not following the current branch will be disabled

(Figure: a warp of work-items over time – they start together, diverge across Branch 1, Branch 2 and Branch 3, then converge again.)

Branching

• GPUs tend not to support speculative execution, which means that branch instructions have high latency • This latency can be hidden by switching to alternative work- items/work-groups, but avoiding branches where possible is still a good idea to improve performance • When different work-items executing within the same SIMD ALU array take different paths through conditional control flow, we have divergent branches (vs. uniform branches) • Divergent branches are bad news: some work-items will stall while waiting for the others to complete • We can use predication, selection and masking to convert conditional control flow into straight line code and significantly improve the performance of code that has lots of conditional branches Branching

Conditional execution:
// Only evaluate expression
// if condition is met
if (a > b)
{
   acc += (a - b*c);
}

Selection and masking:
// Always evaluate expression
// and mask result
temp = (a - b*c);
mask = (a > b ? 1.f : 0.f);
acc += (mask * temp);
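Applied to the Jacobi update loop, a sketch of this idea (an illustration, not necessarily the exact transformation used in the tutorial’s solution): the i != j test can be removed by always accumulating the full dot product and then correcting for the diagonal term.

// Branching version: every work-item evaluates the i != j test on each iteration.
for (int j = 0; j < Ndim; j++) {
   if (i != j)
      xnew[i] += (A[i*Ndim + j] * xold[j]);
}
xnew[i] = (b[i] - xnew[i]) / A[i*Ndim + i];

// Branchless version: accumulate all terms, then subtract the diagonal
// contribution; the two forms compute the same result.
for (int j = 0; j < Ndim; j++) {
   xnew[i] += (A[i*Ndim + j] * xold[j]);
}
xnew[i] = (b[i] - (xnew[i] - A[i*Ndim + i] * xold[i])) / A[i*Ndim + i];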

Coalesced Access

• Coalesced memory accesses are key for high-performance code
• In principle, it’s very simple, but it frequently requires transposing/transforming data on the host before sending it to the GPU
• Sometimes this is an issue of Array of Structures vs. Structure of Arrays (AoS vs. SoA)

Memory layout is critical to performance

• “Structure of Arrays vs. Array of Structures”
  – Array of Structures (AoS): more natural to code

struct Point{ float x, y, z, a; };
Point *Points;

Memory layout: x y z a | x y z a | x y z a | x y z a | …

  – Structure of Arrays (SoA): suits memory coalescence in vector units

struct { float *x, *y, *z, *a; } Points;

Memory layout: x x x x … | y y y y … | z z z z … | a a a a …
Adjacent work-items/vector-lanes like to access adjacent memory locations.

Coalescence

• Coalesce – to combine into one
• Coalesced memory accesses are key for high bandwidth
• Simply, it means: if thread i accesses memory location n, then thread i+1 accesses memory location n+1
• In practice, it’s not quite as strict …

for (int id = 0; id < size; id++) {
   // ideal
   float val1 = memA[id];

   // still pretty good
   const int c = 3;
   float val2 = memA[id + c];

   // stride size is not so good
   float val3 = memA[c*id];

   // terrible
   const int loc = some_strange_func(id);
   float val4 = memA[loc];
}

Memory access patterns

(Figures for the following examples: GPU threads 0–7 reading from GPU memory, with a 64-byte boundary marked.)

float val1 = memA[id];
Consecutive threads access consecutive addresses, so the accesses coalesce into a single transaction.

const int c = 3;
float val2 = memA[id + c];
Still consecutive addresses, just shifted by a constant offset – this still coalesces well.

float val3 = memA[3*id];
Strided access results in multiple memory transactions (and kills throughput).

const int loc = some_strange_func(id);
float val4 = memA[loc];
Scattered accesses are the worst case.

Exercise

• Modify your parallel jacobi_solver from the last exercise.

• Experiment with the optimizations we’ve discussed.

• Note: if you want to generate a transposed A matrix to try a different memory layout, you can use the following function from mm_utils – void init_diag_dom_near_identity_matrix_colmaj(int Ndim, TYPE *A) Jacobi Solver (Par Targ Data, 1/2)

#pragma omp target enter data map(to: xold[0:Ndim], xnew[0:Ndim], \
                                     A[0:Ndim*Ndim], b[0:Ndim])

while ((conv > TOL) && (iters < MAX_ITERS))
{
   iters++;

   #pragma omp target
   #pragma omp teams distribute parallel for
   for (int i = 0; i < Ndim; i++) {
      xnew[i] = (TYPE) 0.0;
      for (int j = 0; j < Ndim; j++) {
         if (i != j)
            xnew[i] += (A[i*Ndim + j] * xold[j]);
      }
      xnew[i] = (b[i] - xnew[i]) / A[i*Ndim + i];
   }

   //
   // test convergence
   //
   conv = 0.0;
   #pragma omp target map(tofrom: conv)
   #pragma omp teams distribute parallel for reduction(+:conv)
   for (int i = 0; i < Ndim; i++) {
      TYPE tmp = xnew[i] - xold[i];
      conv += tmp * tmp;
   }
   conv = sqrt((double)conv);

   TYPE* tmp = xold; xold = xnew; xnew = tmp;

} // end while loop

System            | Implementation             | Ndim = 4096
NVIDIA® K20X™ GPU | Target dir per loop        | 131.94 secs
                  | Above + target data region | 18.37 secs
                  | Above + reduce branching   | 13.74 secs

Cray® XC40™ Supercomputer running Cray® Compiling Environment 8.5.3. Intel® Xeon® CPU E5-2697 v2 @ 2.70GHz with 32 GB DDR3. NVIDIA® Tesla® K20X, 6GB.
Third party names are the property of their owners.

Jacobi Solver (Targ Data/branchless/coalesced mem, 1/2)

#pragma omp target enter data map(to: xold[0:Ndim], xnew[0:Ndim], \
                                     A[0:Ndim*Ndim], b[0:Ndim])

while ((conv > TOL) && (iters < MAX_ITERS))
{
   ...
}

We replaced the original code, which has a poor memory access pattern:
   xnew[i] += (A[i*Ndim + j] * xold[j]);
with the more efficient:
   xnew[i] += (A[j*Ndim + i] * xold[j]);

Jacobi Solver (Targ Data/branchless/coalesced mem, 2/2)

conv = 0.0;
#pragma omp target map(tofrom: conv)
#pragma omp teams distribute parallel for reduction(+:conv)
for (int i = 0; i < Ndim; i++) {
   TYPE tmp = xnew[i] - xold[i];
   conv += tmp * tmp;
}
conv = sqrt((double)conv);
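To see why the second form coalesces, consider which matrix element each work-item touches (a sketch; it assumes adjacent values of i are handled by adjacent work-items, which is how the combined construct typically distributes the loop):

// Row-major access: for a fixed j, work-items i and i+1 read A[i*Ndim + j]
// and A[(i+1)*Ndim + j], addresses Ndim elements apart – a strided,
// uncoalesced pattern.
xnew[i] += (A[i*Ndim + j] * xold[j]);

// Column-major access: for a fixed j, work-items i and i+1 read
// A[j*Ndim + i] and A[j*Ndim + i + 1], adjacent addresses – the loads
// coalesce. This is why the optimized version generates the matrix with
// init_diag_dom_near_identity_matrix_colmaj() from mm_utils.
xnew[i] += (A[j*Ndim + i] * xold[j]);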

System            | Implementation                 | Ndim = 4096
NVIDIA® K20X™ GPU | Target dir per loop            | 131.94 secs
                  | Above plus target data region  | 18.37 secs
                  | Above plus reduced branching   | 13.74 secs
                  | Above plus improved mem access | 7.64 secs

Cray® XC40™ Supercomputer running Cray® Compiling Environment 8.5.3. Intel® Xeon® CPU E5-2697 v2 @ 2.70GHz with 32 GB DDR3. NVIDIA® Tesla® K20X, 6GB.
Third party names are the property of their owners.

Agenda
• OpenMP overview
• The device model in OpenMP
• Working with the target directive
• Controlling memory movement
• Optimizing GPU code
• CPU/GPU portability

Compiler Support

• Intel began support for OpenMP 4.0 targeting their Intel Xeon Phi coprocessors in 2013 (compiler version 15.0). Compiler version 17.0 and later versions support OpenMP 4.5

• Cray provided the first vendor supported implementation targeting NVIDIA GPUs late 2015. The latest version of CCE now supports all of OpenMP 4.5.

• IBM has recently completed a compiler implementation using Clang/LLVM that fully supports OpenMP 4.5. This is being introduced into the Clang main trunk.

• GCC 6.1 introduced support for OpenMP 4.5, and can target Intel Xeon Phi, or HSA-enabled AMD GPUs. V7 added support for NVIDIA GPUs.

• PGI compilers don’t currently support OpenMP on GPUs (but they do for CPUs).

OpenMP compiler information: http://www.openmp.org/resources/openmp-compilers/
Third party names are the property of their owners.

Performance?
* CCE 8.4.3, ICC 15.0.3, PGI 15.01, CUDA 7.0 on an NVIDIA® K20X and an Intel® Xeon® Haswell 16-Core Processor (E5-2698 v3 @ 2.30GHz)

CloverLeaf
Immediately we see impressive performance compared to CUDA

Clearly the Cray compiler leverages the existing OpenACC backend

BUDE

Even with OpenMP 4.5 there is still no way of targeting GPU shared (local) memory directly.

This is set to come in with OpenMP 5.0, and Clang supports targeting address spaces directly

Martineau, M., McIntosh-Smith, S., Gaudin, W., Evaluating OpenMP 4.0’s Effectiveness as a Heterogeneous Parallel Programming Model, HIPS’16, 2016.
Third party names are the property of their owners.

Performance?
* CCE 8.4.3, ICC 15.0.3, PGI 15.01, CUDA 7.0 on an NVIDIA® K20X and an Intel® Xeon® Haswell 16-Core Processor (E5-2698 v3 @ 2.30GHz)

TeaLeaf We found that Cray’s OpenMP 4.0 implementation achieved great performance on a K20x

It’s likely that these figures have improved even more with maturity of the portable models

Martineau, M., McIntosh-Smith, S. Gaudin, W., Assessing the Performance Portability of Modern Parallel Programming Models using TeaLeaf, 2016, CC-PE Third party names are the property of their owners. 69 How do you get good performance?

• Our finding so far: – You can achieve good performance with OpenMP 4.5.

• We achieved this by:

• Keeping data resident on the device for the greatest possible time.

• Collapsing loops with the collapse clause, so there was a large enough iteration space to saturate the device (see the sketch after this list).

• Using nvprof to carefully optimise each kernel.
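A hedged sketch of the collapse point above (the grid sizes, array and function are illustrative, not from the tutorial codes): without collapse only the NI outer iterations are distributed; with collapse(2) the combined NI*NJ iteration space is distributed, which is usually enough to keep a GPU busy.

#define NI 256
#define NJ 256

void scale_grid(float *grid, float factor)   // grid holds NI*NJ elements
{
   // collapse(2) merges the i and j loops into a single iteration space of
   // NI*NJ work-items before they are distributed across teams and threads.
   #pragma omp target teams distribute parallel for collapse(2) \
                      map(tofrom: grid[0:NI*NJ])
   for (int i = 0; i < NI; i++)
      for (int j = 0; j < NJ; j++)
         grid[i*NJ + j] *= factor;
}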

OpenMP in The Matrix. System details on the following slide.

On those supported target architectures, OpenMP achieves good performance

Deakin, T., Price, J., Martineau, M., McIntosh-Smith, S., GPU-STREAM v2.0: Benchmarking the achievable memory bandwidth of many-core processors across diverse parallel programming models, ISC’16.
Third party names are the property of their owners.

System details

Abbreviation | System details
K20X | Cray® XC40, NVIDIA® K20X GPU, Cray compilers version 8.5, gnu 5.3, CUDA 7.5
K40 | Cray® CS cluster, NVIDIA® K40 GPU, Cray compilers version 8.4, gnu 4.9, CUDA 7.5
K80 | Cray® CS cluster, NVIDIA® K40 GPU, Cray compilers version 8.4, gnu 4.9, CUDA 7.5
S9150 | AMD® S9150 GPU. Codeplay® ComputeCpp compiler 2016.05 pre-release. AMD-APP OpenCL 1.2 (1912.5) drivers for SYCL. PGI® Accelerator™ 16.4 OpenACC
GTX 980 Ti | NVIDIA® GTX 980 Ti. Clang-ykt fork of Clang for OpenMP. PGI® Accelerator™ 16.4 OpenACC. CUDA 7.5
Fury X | AMD® Fury X GPU (based on the Fiji architecture)
Sandy Bridge | Intel® Xeon® E5-2670 CPU. Intel® compilers release 16.0. PGI® Accelerator™ 16.4 OpenACC and CUDA-x86. Intel® OpenCL runtime 15.1. Codeplay® ComputeCpp compiler 2016.05 pre-release
Ivy Bridge | Intel® Xeon® E5-2697 CPU. Gnu 4.8 for RAJA and Kokkos, Intel® compiler version 16.0 for stream, Intel® OpenCL runtime 15.1. Codeplay® ComputeCpp compiler 2016.05 pre-release
Haswell | Cray® XC40, Intel® Xeon® E5-2698 CPU. Intel® compilers release 16.0. PGI® Accelerator™ 16.3 OpenACC and CUDA-x86. Gnu 4.8 for RAJA and Kokkos
Broadwell | Cray® XC40, Intel® Xeon® E5-2699 CPU. Intel® compilers release 16.0. PGI® Accelerator™ 16.3 OpenACC and CUDA-x86. Gnu 4.8 for RAJA and Kokkos
KNL | Intel® Xeon Phi™ 7210 (Knights Landing). Intel® compilers release 16.0. PGI® Accelerator™ 16.4 OpenACC with target specified as AVX2
Power 8 | IBM® Power 8 processor with the XL 13.1 compiler

Deakin, T., Price, J., Martineau, M., McIntosh-Smith, S., GPU-STREAM v2.0: Benchmarking the achievable memory bandwidth of many-core processors across diverse parallel programming models, ISC’16.

Combined Construct

• For performance, prefer to use the full combined construct:

#pragma omp target teams distribute parallel for

• The construct makes a lot of guarantees to the compiler, enabling optimisations.

• Real applications will have algorithms that are structured such that they can’t immediately use the combined construct – you have to port your code for the GPU.
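As an illustration of the difference (a sketch; the loop and array are illustrative, not from the tutorial codes), the same computation written with the combined construct and with separate constructs. The combined form tells the compiler up front that the whole iteration space is one flat parallel loop:

// Combined construct: the entire i loop is flattened across teams and threads.
#pragma omp target teams distribute parallel for map(tofrom: a[0:N*M])
for (int i = 0; i < N*M; i++)
   a[i] *= 2.0f;

// Separate constructs: the outer loop is distributed over teams and the
// inner loop over the threads of each team. More flexible for irregular
// code, but it gives the compiler fewer guarantees to optimise around.
#pragma omp target teams distribute map(tofrom: a[0:N*M])
for (int i = 0; i < N; i++)
{
   #pragma omp parallel for
   for (int j = 0; j < M; j++)
      a[i*M + j] *= 2.0f;
}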

Conclusion

• You can program your GPU with OpenMP

• Implementations of OpenMP supporting target devices are evolving rapidly … expect to see great improvements in quality and diversity.