OPENMP USERS MONTHLY TELECON

LEVERAGING IMPLICIT CUDA STREAMS AND ASYNCHRONOUS OPENMP OFFLOAD FEATURES IN LLVM


YE LUO
Computational Science Division & Argonne Leadership Computing Facility
Argonne National Laboratory

March 26th

OUTLINE

▪ Using implicit CUDA streams
  – Maximize asynchronous operations on 1 stream
  – Using concurrent offload regions from multiple threads
  – Overlap computation with data transfer for free

▪ Asynchronous tasking
  – Using helper threads (LLVM 12)
  – Leverage additional OpenMP threads
  – Coordinate tasks with dependency

USING IMPLICIT CUDA STREAMS

EXPLICIT VS IMPLICIT
Pros and cons

EXPLICIT
▪ Close-to-the-metal programming
  – Calls the native runtime directly
  – Full control but lengthy code
▪ Asynchronous execution and synchronization managed by the developer: in-order queues/streams and events.

IMPLICIT
▪ Programming OpenMP
  – The OpenMP runtime handles calls to the native runtime
  – Less control but more portable
  – Performance depends on the runtime implementation
▪ Asynchronous tasking with dependency handled by the OpenMP runtime

OPENMP OFFLOAD RUNTIME MOTIONS
Poor performance if every step is synchronous

▪ Allocate array on device (very slow)
▪ Transfer H2D (slow)
▪ Launch the kernel
▪ Transfer D2H (slow)
▪ Deallocate array on device (very slow)

// simple case
#pragma omp target \
    map(array[:100])
for (int i ...)
{ // operations on array }
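For reference, a minimal compilable version of this simple case is sketched below; the array initialization and the doubling loop are illustrative placeholders, not from the original slide.

#include <cstdio>

int main()
{
  int array[100];
  for (int i = 0; i < 100; i++)
    array[i] = i;

  // Every step here is synchronous: device allocation, H2D transfer,
  // kernel launch, D2H transfer, device deallocation.
  #pragma omp target map(tofrom: array[:100])
  for (int i = 0; i < 100; i++)
    array[i] *= 2;

  printf("array[99] = %d\n", array[99]); // expect 198
  return 0;
}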

PRE-ARRANGE MEMORY ALLOCATION
Move beyond the textbook example

▪ Accelerator memory resource allocation/deallocation is orders of magnitude slower than on the host.
▪ These operations may also block asynchronous execution.
▪ Allocate the array in host pinned memory to make the transfer asynchronous.

// optimized case
// pre-arrange allocation
#pragma omp target enter data \
    map(alloc: array[:100])
…
// use always to enforce transfer
#pragma omp target map(always, tofrom: array[:100])
for (int i ...)
{ // operations on array }
// free memory
#pragma omp target exit data \
    map(delete: array[:100])

https://github.com/ye-luo/miniqmc/blob/OMP_offload/src/Platforms/OMPTarget/OMPallocator.hpp
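A compilable sketch of this pre-arranged pattern; the 10-step outer loop is an assumption standing in for the application's iteration structure.

#include <cstdio>

int main()
{
  int array[100];

  // Pre-arrange the device allocation once, outside the hot loop.
  #pragma omp target enter data map(alloc: array[:100])

  for (int step = 0; step < 10; step++)
  {
    for (int i = 0; i < 100; i++)
      array[i] = step + i;

    // 'always' forces the H2D/D2H transfers even though the
    // array is already present on the device.
    #pragma omp target map(always, tofrom: array[:100])
    for (int i = 0; i < 100; i++)
      array[i] *= 2;
  }

  // Free the device copy once at the end.
  #pragma omp target exit data map(delete: array[:100])

  printf("array[99] = %d\n", array[99]); // expect (9+99)*2 = 216
  return 0;
}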

IMPLICIT ASYNCHRONOUS DISPATCH
Using queues/streams

CUDA supports streams for asynchronous computing.
▪ The IBM XL and LLVM OpenMP runtimes enqueue non-blocking H2D, kernel, and D2H operations with only one synchronization at the end:
  – Async transfer H2D
  – Async enqueue the kernel
  – Async transfer D2H
  – Synchronization, if 'nowait' is not used

// optimized case
// pre-arrange allocation
#pragma omp target enter data \
    map(alloc: array[:100])
…
// use always to enforce transfer
#pragma omp target map(always, tofrom: array[:100])
for (int i ...)
{ // operations on array }
// free memory
#pragma omp target exit data \
    map(delete: array[:100])
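As a runnable illustration, the sketch below annotates in comments which conceptual stream operations a single target region triggers, assuming the LLVM/XL dispatch behavior described above; the increment body is a placeholder.

#include <cstdio>

int main()
{
  int array[100] = {0};

  // Device buffer created once, outside the offload region.
  #pragma omp target enter data map(alloc: array[:100])

  // For this one region, the runtime conceptually issues, on a single
  // in-order stream: async H2D copy -> async kernel enqueue ->
  // async D2H copy -> one blocking synchronization at the end
  // (the synchronization is skipped when 'nowait' is used).
  #pragma omp target map(always, tofrom: array[:100])
  for (int i = 0; i < 100; i++)
    array[i] += 1;

  #pragma omp target exit data map(delete: array[:100])

  printf("array[0] = %d\n", array[0]); // expect 1
  return 0;
}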

IMPLICIT ASYNCHRONOUS DISPATCH (CONT)
Maximize asynchronous calls within one target region

Host-side CUDA driver API calls from LLVM libomptarget are all asynchronous.

CONCURRENT EXECUTION AND TRANSFER
Overlapping computation and data transfer

▪ Need multiple concurrent target regions
▪ The IBM XL and LLVM Clang OpenMP runtimes select independent CUDA streams for each offload region.
▪ Kernel execution from one target region may overlap with kernel execution or data transfer from another target region.

#pragma omp parallel for
for (int iw …)
{
  int* array = all_arrays[iw].data();
  #pragma omp target \
      map(always, tofrom: array[:100])
  for (int i ...)
  { // operations on array }
}

offload 0:  H2D  compute  D2H
offload 1:       H2D  compute  D2H
offload 2:            H2D  compute  D2H

https://github.com/ye-luo/openmp-target/blob/master/hands-on/gemv/6-gemv-omp-target-many-matrices-no-hierachy/gemv-omp-target-many-matrices-no-hierachy.cpp
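A self-contained variant of this pattern; num_walkers, the vector-of-vectors layout, and the per-array update are assumptions for illustration.

#include <cstdio>
#include <vector>

int main()
{
  const int num_walkers = 8;
  std::vector<std::vector<int>> all_arrays(num_walkers, std::vector<int>(100, 1));

  // Each host thread issues its own target region; the LLVM/XL
  // runtimes place each region on an independent CUDA stream, so
  // H2D/compute/D2H of different regions may overlap.
  #pragma omp parallel for
  for (int iw = 0; iw < num_walkers; iw++)
  {
    int* array = all_arrays[iw].data();
    #pragma omp target map(always, tofrom: array[:100])
    for (int i = 0; i < 100; i++)
      array[i] += iw;
  }

  printf("all_arrays[7][0] = %d\n", all_arrays[7][0]); // expect 8
  return 0;
}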

CONCURRENT EXECUTION (CONT)
From multiple OpenMP threads, in miniQMC

[Profiler timelines: overlapping kernel-and-kernel execution; concurrent transfer and kernel execution]

ASYNCHRONOUS TASKING

ASYNCHRONOUS TASKING
Ideal case

▪ Ideal scenario: only the master thread is needed for fully asynchronous kernel execution.
▪ LLVM 12 uses helper threads:
  LIBOMP_USE_HIDDEN_HELPER_TASK=TRUE
  LIBOMP_NUM_HIDDEN_HELPER_THREADS=8
▪ Pros:
  – No need of a parallel region
  – Fast turnaround
▪ Cons:
  – Helper threads are actively waiting
  – They can be "noisy"

for (int iw …)
{
  int* array = all_arrays[iw].data();
  // target task
  #pragma omp target nowait \
      map(always, tofrom: array[:100])
  for (int i ...)
  { // operations on array }
}
#pragma omp taskwait
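A runnable sketch of the helper-thread scenario; all_arrays is modeled as a vector of vectors, and the invocation in the comment assumes LLVM 12 with the environment variables shown above.

#include <cstdio>
#include <vector>

// Run with, e.g. (LLVM 12):
//   LIBOMP_USE_HIDDEN_HELPER_TASK=TRUE \
//   LIBOMP_NUM_HIDDEN_HELPER_THREADS=8 ./a.out
int main()
{
  const int num_walkers = 8;
  std::vector<std::vector<int>> all_arrays(num_walkers, std::vector<int>(100, 1));

  // A single (master) thread dispatches all the target tasks;
  // hidden helper threads progress them asynchronously.
  for (int iw = 0; iw < num_walkers; iw++)
  {
    int* array = all_arrays[iw].data();
    #pragma omp target nowait map(always, tofrom: array[:100])
    for (int i = 0; i < 100; i++)
      array[i] += iw;
  }
  #pragma omp taskwait // wait for all the target tasks

  printf("all_arrays[7][0] = %d\n", all_arrays[7][0]); // expect 8
  return 0;
}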

ASYNCHRONOUS TASKING
When threads are available to process CPU tasks

▪ No need of helper threads.
▪ All the threads used for task dispatching are regular OpenMP threads, with all the usual affinity control applied.
▪ Works better when CPU tasks are ongoing as well.

#pragma omp parallel // start workers for tasks
{
  #pragma omp single
  for (int iw …)
  {
    int* array = all_arrays[iw].data();
    // target task
    #pragma omp target nowait \
        map(always, tofrom: array[:100])
    for (int i ...)
    { // operations on array }
  }
  // implicit barrier to wait for all tasks
}
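The same dispatch pattern as a self-contained program; the data layout and update body are illustrative.

#include <cstdio>
#include <vector>

int main()
{
  const int num_walkers = 8;
  std::vector<std::vector<int>> all_arrays(num_walkers, std::vector<int>(100, 1));

  #pragma omp parallel // regular OpenMP threads serve as task workers
  #pragma omp single   // one thread creates all the target tasks
  for (int iw = 0; iw < num_walkers; iw++)
  {
    int* array = all_arrays[iw].data();
    #pragma omp target nowait map(always, tofrom: array[:100])
    for (int i = 0; i < 100; i++)
      array[i] += iw;
  } // the implicit barrier waits for all the target tasks

  printf("all_arrays[7][0] = %d\n", all_arrays[7][0]); // expect 8
  return 0;
}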

ASYNCHRONOUS TASKING
When threads are available to process CPU tasks

▪ Reserve more threads than "for" loop iterations
▪ The target task goes to idle threads.
▪ Each iw iteration remains …

OMP_NUM_THREADS=16
n=8

#pragma omp parallel for
for (int iw = 0; iw < n; iw++)
{ … }
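A complete version of this over-subscription pattern, assuming the same per-iteration target task as on the previous slides.

#include <cstdio>
#include <vector>

// Run with more threads than loop iterations, e.g.
//   OMP_NUM_THREADS=16 ./a.out   (n = 8 below)
int main()
{
  const int n = 8;
  std::vector<std::vector<int>> all_arrays(n, std::vector<int>(100, 1));

  #pragma omp parallel for
  for (int iw = 0; iw < n; iw++)
  {
    int* array = all_arrays[iw].data();
    // The deferred target task can be picked up by one of the
    // threads left idle by the loop schedule.
    #pragma omp target nowait map(always, tofrom: array[:100])
    for (int i = 0; i < 100; i++)
      array[i] += iw;
  } // the implicit barrier waits for the target tasks as well

  printf("all_arrays[7][0] = %d\n", all_arrays[7][0]); // expect 8
  return 0;
}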

ASYNCHRONOUS TASKING DEPENDENCY
Coordinating host and offload computation

▪ Not all features are worth the effort of porting to accelerators.
▪ Use tasking to leverage idle host resources for non-blocking host computation, as Ncores >> Ngpus.
▪ * Performance heavily depends on the runtime implementation.

#pragma omp target map(from: a) \
    depend(out: a) nowait
{
  int sum = 0;
  for (int i = 0; i < 100000; i++)
    sum++;
  a = sum;
}

#pragma omp task
{ // some independent work on CPU }

#pragma omp task depend(in: a) shared(a)
{ // some postprocessing on CPU }

#pragma omp taskwait

https://github.com/ye-luo/openmp-target/blob/master/hands-on/tests/target_task/target_nowait_task.cpp
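A self-contained sketch built from the snippet above; the printf bodies stand in for the independent CPU work and the postprocessing.

#include <cstdio>

int main()
{
  int a = 0;

  // Offload computation; depend(out: a) publishes the result to
  // later tasks without blocking the host thread.
  #pragma omp target map(from: a) depend(out: a) nowait
  {
    int sum = 0;
    for (int i = 0; i < 100000; i++)
      sum++;
    a = sum;
  }

  // Independent host work can run while the device computes.
  #pragma omp task
  { printf("independent CPU work\n"); }

  // Postprocessing waits only on the target task, via the dependency.
  #pragma omp task depend(in: a) shared(a)
  { printf("a = %d\n", a); }

  #pragma omp taskwait
  return 0;
}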
