OPENMP USERS MONTHLY TELECON

LEVERAGING IMPLICIT CUDA STREAMS AND ASYNCHRONOUS OPENMP OFFLOAD FEATURES IN LLVM


YE LUO
Computational Science Division & Argonne Leadership Computing Facility
Argonne National Laboratory

March 26th

OUTLINE

▪ Using implicit CUDA streams
  – Maximize asynchronous operations on 1 stream
  – Using concurrent offload regions from multiple threads
  – Overlap computation with data transfer for free

▪ Asynchronous tasking
  – Using helper threads (LLVM 12)
  – Leverage additional OpenMP threads
  – Coordinate tasks with dependency

USING IMPLICIT CUDA STREAMS

EXPLICIT VS IMPLICIT
Pros and cons

EXPLICIT
▪ Close-to-the-metal programming
  – Calls the native runtime directly
  – Full control but lengthy code
▪ Asynchronous execution and synchronization managed by the developer: in-order queues/streams and events.

IMPLICIT
▪ Programming OpenMP
  – The OpenMP runtime handles calls to the native runtime
  – Less control but more portable
  – Performance depends on the runtime implementation
▪ Asynchronous tasking with dependency handled by the OpenMP runtime

OPENMP OFFLOAD RUNTIME MOTIONS
Poor performance if every step is synchronous

▪ Allocate array on device (very slow)
▪ Transfer H2D (slow)
▪ Launch the kernel
▪ Transfer D2H (slow)
▪ Deallocate array on device (very slow)

// simple case
#pragma omp target \
    map(array[:100])
for (int i ...)
{ // operations on array }
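For reference, a minimal compilable version of this simple case is sketched below; the array initialization and the doubling loop are illustrative placeholders, not from the original slide.

#include <cstdio>

int main()
{
  int array[100];
  for (int i = 0; i < 100; i++)
    array[i] = i;

  // Every step here is synchronous: device allocation, H2D transfer,
  // kernel launch, D2H transfer, device deallocation.
  #pragma omp target map(tofrom: array[:100])
  for (int i = 0; i < 100; i++)
    array[i] *= 2;

  printf("array[99] = %d\n", array[99]); // expect 198
  return 0;
}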

PRE-ARRANGE MEMORY ALLOCATION
Move beyond the textbook example

▪ Accelerator memory resource allocation/deallocation is orders of magnitude slower than on the host.
▪ These operations may also block asynchronous execution.
▪ Allocate the array in host pinned memory to make the transfer asynchronous.

// optimized case
// pre-arrange allocation
#pragma omp target enter data \
    map(alloc: array[:100])
…
// use always to enforce transfer
#pragma omp target map(always, tofrom: array[:100])
for (int i ...)
{ // operations on array }
// free memory
#pragma omp target exit data \
    map(delete: array[:100])

https://github.com/ye-luo/miniqmc/blob/OMP_offload/src/Platforms/OMPTarget/OMPallocator.hpp
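A compilable sketch of this pre-arranged pattern; the 10-step outer loop is an assumption standing in for the application's iteration structure.

#include <cstdio>

int main()
{
  int array[100];

  // Pre-arrange the device allocation once, outside the hot loop.
  #pragma omp target enter data map(alloc: array[:100])

  for (int step = 0; step < 10; step++)
  {
    for (int i = 0; i < 100; i++)
      array[i] = step + i;

    // 'always' forces the H2D/D2H transfers even though the
    // array is already present on the device.
    #pragma omp target map(always, tofrom: array[:100])
    for (int i = 0; i < 100; i++)
      array[i] *= 2;
  }

  // Free the device copy once at the end.
  #pragma omp target exit data map(delete: array[:100])

  printf("array[99] = %d\n", array[99]); // expect (9+99)*2 = 216
  return 0;
}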

IMPLICIT ASYNCHRONOUS DISPATCH
Using queues/streams

CUDA supports streams for asynchronous computing.
▪ The IBM XL and LLVM OpenMP runtimes enqueue non-blocking H2D, kernel, and D2H operations with only one synchronization at the end:
  – Async transfer H2D
  – Async enqueue the kernel
  – Async transfer D2H
  – Synchronization, if 'nowait' is not used

// optimized case
// pre-arrange allocation
#pragma omp target enter data \
    map(alloc: array[:100])
…
// use always to enforce transfer
#pragma omp target map(always, tofrom: array[:100])
for (int i ...)
{ // operations on array }
// free memory
#pragma omp target exit data \
    map(delete: array[:100])
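As a runnable illustration, the sketch below annotates in comments which conceptual stream operations a single target region triggers, assuming the LLVM/XL dispatch behavior described above; the increment body is a placeholder.

#include <cstdio>

int main()
{
  int array[100] = {0};

  // Device buffer created once, outside the offload region.
  #pragma omp target enter data map(alloc: array[:100])

  // For this one region, the runtime conceptually issues, on a single
  // in-order stream: async H2D copy -> async kernel enqueue ->
  // async D2H copy -> one blocking synchronization at the end
  // (the synchronization is skipped when 'nowait' is used).
  #pragma omp target map(always, tofrom: array[:100])
  for (int i = 0; i < 100; i++)
    array[i] += 1;

  #pragma omp target exit data map(delete: array[:100])

  printf("array[0] = %d\n", array[0]); // expect 1
  return 0;
}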

IMPLICIT ASYNCHRONOUS DISPATCH (CONT)
Maximize asynchronous calls within one target region

Host-side CUDA driver API calls from LLVM libomptarget are all asynchronous.

CONCURRENT EXECUTION AND TRANSFER
Overlapping computation and data transfer

▪ Need multiple concurrent target regions
▪ The IBM XL and LLVM Clang OpenMP runtimes select independent CUDA streams for each offload region.
▪ Kernel execution from one target region may overlap with kernel execution or data transfer from another target region.

#pragma omp parallel for
for (int iw …)
{
  int* array = all_arrays[iw].data();
  #pragma omp target \
      map(always, tofrom: array[:100])
  for (int i ...)
  { // operations on array }
}

offload 0:  H2D  compute  D2H
offload 1:       H2D  compute  D2H
offload 2:            H2D  compute  D2H

https://github.com/ye-luo/openmp-target/blob/master/hands-on/gemv/6-gemv-omp-target-many-matrices-no-hierachy/gemv-omp-target-many-matrices-no-hierachy.cpp
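A self-contained variant of this pattern; num_walkers, the vector-of-vectors layout, and the per-array update are assumptions for illustration.

#include <cstdio>
#include <vector>

int main()
{
  const int num_walkers = 8;
  std::vector<std::vector<int>> all_arrays(num_walkers, std::vector<int>(100, 1));

  // Each host thread issues its own target region; the LLVM/XL
  // runtimes place each region on an independent CUDA stream, so
  // H2D/compute/D2H of different regions may overlap.
  #pragma omp parallel for
  for (int iw = 0; iw < num_walkers; iw++)
  {
    int* array = all_arrays[iw].data();
    #pragma omp target map(always, tofrom: array[:100])
    for (int i = 0; i < 100; i++)
      array[i] += iw;
  }

  printf("all_arrays[7][0] = %d\n", all_arrays[7][0]); // expect 8
  return 0;
}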

CONCURRENT EXECUTION (CONT)
From multiple OpenMP threads, in miniQMC

[Profiler timelines: overlapping kernel-and-kernel execution; concurrent transfer and kernel execution]

ASYNCHRONOUS TASKING

ASYNCHRONOUS TASKING
Ideal case

▪ Ideal scenario: only the master thread is needed for fully asynchronous kernel execution.
▪ LLVM 12 uses helper threads:
  LIBOMP_USE_HIDDEN_HELPER_TASK=TRUE
  LIBOMP_NUM_HIDDEN_HELPER_THREADS=8
▪ Pros:
  – No need of a parallel region
  – Fast turnaround
▪ Cons:
  – Helper threads are actively waiting
  – They can be "noisy"

for (int iw …)
{
  int* array = all_arrays[iw].data();
  // target task
  #pragma omp target nowait \
      map(always, tofrom: array[:100])
  for (int i ...)
  { // operations on array }
}
#pragma omp taskwait
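A runnable sketch of the helper-thread scenario; all_arrays is modeled as a vector of vectors, and the invocation in the comment assumes LLVM 12 with the environment variables shown above.

#include <cstdio>
#include <vector>

// Run with, e.g. (LLVM 12):
//   LIBOMP_USE_HIDDEN_HELPER_TASK=TRUE \
//   LIBOMP_NUM_HIDDEN_HELPER_THREADS=8 ./a.out
int main()
{
  const int num_walkers = 8;
  std::vector<std::vector<int>> all_arrays(num_walkers, std::vector<int>(100, 1));

  // A single (master) thread dispatches all the target tasks;
  // hidden helper threads progress them asynchronously.
  for (int iw = 0; iw < num_walkers; iw++)
  {
    int* array = all_arrays[iw].data();
    #pragma omp target nowait map(always, tofrom: array[:100])
    for (int i = 0; i < 100; i++)
      array[i] += iw;
  }
  #pragma omp taskwait // wait for all the target tasks

  printf("all_arrays[7][0] = %d\n", all_arrays[7][0]); // expect 8
  return 0;
}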

ASYNCHRONOUS TASKING
When threads are available to process CPU tasks

▪ No need of helper threads.
▪ All the threads used for task dispatching are regular OpenMP threads, with all the usual affinity control applied.
▪ Works better when CPU tasks are ongoing as well.

#pragma omp parallel // start workers for tasks
{
  #pragma omp single
  for (int iw …)
  {
    int* array = all_arrays[iw].data();
    // target task
    #pragma omp target nowait \
        map(always, tofrom: array[:100])
    for (int i ...)
    { // operations on array }
  }
  // implicit barrier to wait for all tasks
}
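The same dispatch pattern as a self-contained program; the data layout and update body are illustrative.

#include <cstdio>
#include <vector>

int main()
{
  const int num_walkers = 8;
  std::vector<std::vector<int>> all_arrays(num_walkers, std::vector<int>(100, 1));

  #pragma omp parallel // regular OpenMP threads serve as task workers
  #pragma omp single   // one thread creates all the target tasks
  for (int iw = 0; iw < num_walkers; iw++)
  {
    int* array = all_arrays[iw].data();
    #pragma omp target nowait map(always, tofrom: array[:100])
    for (int i = 0; i < 100; i++)
      array[i] += iw;
  } // the implicit barrier waits for all the target tasks

  printf("all_arrays[7][0] = %d\n", all_arrays[7][0]); // expect 8
  return 0;
}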

ASYNCHRONOUS TASKING
When threads are available to process CPU tasks

▪ Reserve more threads than "for" loop iterations
▪ The target task goes to idle threads.
▪ Each iw iteration remains …

OMP_NUM_THREADS=16
n=8

#pragma omp parallel for
for (int iw = 0; iw < n; iw++)
{ … }
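A complete version of this over-subscription pattern, assuming the same per-iteration target task as on the previous slides.

#include <cstdio>
#include <vector>

// Run with more threads than loop iterations, e.g.
//   OMP_NUM_THREADS=16 ./a.out   (n = 8 below)
int main()
{
  const int n = 8;
  std::vector<std::vector<int>> all_arrays(n, std::vector<int>(100, 1));

  #pragma omp parallel for
  for (int iw = 0; iw < n; iw++)
  {
    int* array = all_arrays[iw].data();
    // The deferred target task can be picked up by one of the
    // threads left idle by the loop schedule.
    #pragma omp target nowait map(always, tofrom: array[:100])
    for (int i = 0; i < 100; i++)
      array[i] += iw;
  } // the implicit barrier waits for the target tasks as well

  printf("all_arrays[7][0] = %d\n", all_arrays[7][0]); // expect 8
  return 0;
}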

ASYNCHRONOUS TASKING DEPENDENCY
Coordinating host and offload computation

▪ Not all features are worth the effort of porting to accelerators.
▪ Use tasking to leverage idle host resources for non-blocking host computation, as Ncores >> Ngpus.
▪ * Performance heavily depends on the runtime implementation.

#pragma omp target map(from: a) \
    depend(out: a) nowait
{
  int sum = 0;
  for (int i = 0; i < 100000; i++)
    sum++;
  a = sum;
}

#pragma omp task
{ // some independent work on CPU }

#pragma omp task depend(in: a) shared(a)
{ // some postprocessing on CPU }

#pragma omp taskwait

https://github.com/ye-luo/openmp-target/blob/master/hands-on/tests/target_task/target_nowait_task.cpp
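A self-contained sketch built from the snippet above; the printf bodies stand in for the independent CPU work and the postprocessing.

#include <cstdio>

int main()
{
  int a = 0;

  // Offload computation; depend(out: a) publishes the result to
  // later tasks without blocking the host thread.
  #pragma omp target map(from: a) depend(out: a) nowait
  {
    int sum = 0;
    for (int i = 0; i < 100000; i++)
      sum++;
    a = sum;
  }

  // Independent host work can run while the device computes.
  #pragma omp task
  { printf("independent CPU work\n"); }

  // Postprocessing waits only on the target task, via the dependency.
  #pragma omp task depend(in: a) shared(a)
  { printf("a = %d\n", a); }

  #pragma omp taskwait
  return 0;
}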
