Advanced OpenMP Tutorial
OpenMP for Heterogeneous Computing

Slides designed by / based on:

Christian Terboven, Michael Klemm, James C. Beyer, Kelvin Li, Bronis R. de Supinski

Advanced OpenMP Tutorial – Device Constructs (Tim Cramer)

Topics

• Heterogeneous device execution model
• Mapping variables to a device
• Accelerated workshare
• Examples

Device Model

• OpenMP supports heterogeneous systems
• Device model:
  – One host device and
  – One or more target devices

(Figure: example configurations – heterogeneous SoC; host and co-processors; host and GPUs)

Terminology

• Device: An implementation-defined (logical) execution unit.
• Device data environment: The storage associated with a device.

The execution model is host-centric: the host device offloads target regions to target devices.

OpenMP Offloading

Host Device:

  main() {
    saxpy();
  }

  void saxpy() {
    int n = 10240;
    float a = 42.0f;
    float b = 23.0f;
    float *x, *y;
    // Allocate and initialize x, y

    // Run SAXPY
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
      y[i] = a*x[i] + y[i];
    }
  }

Without a target construct, the parallel loop still executes on the host device.

OpenMP Offloading (cont.)

Adding a target construct offloads the SAXPY loop to the target device:

  void saxpy() {
    int n = 10240;
    float a = 42.0f;
    float b = 23.0f;
    float *x, *y;
    // Allocate and initialize x, y

    // Run SAXPY on the target device
    #pragma omp target
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
      y[i] = a*x[i] + y[i];
    }
  }

OpenMP Offloading (cont.)

The map clauses make x and y available in the target device data environment:

  #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
  #pragma omp parallel for
  for (int i = 0; i < n; ++i) {
    y[i] = a*x[i] + y[i];
  }

OpenMP Device Constructs

Execute code on a target device:
• omp target [clause[[,] clause]…]
    structured-block
• omp declare target
    [function-definitions-or-declarations]
  omp end declare target

Map variables to a target device:
• map([[map-type-modifier[,]] map-type:] list)
    map-type := alloc | tofrom | to | from | release | delete
    map-type-modifier := always
• omp target data clause[[[,] clause]…]
    structured-block
• omp target enter data clause[[[,] clause]…]
• omp target exit data clause[[[,] clause]…]
• omp target update clause[[[,] clause]…]
• omp declare target
    [variable-definitions-or-declarations]
  omp end declare target

Workshare for acceleration:
• omp teams [clause[[,] clause]…]
    structured-block
• omp distribute [clause[[,] clause]…]
    for-loops

Device Runtime Support

Runtime routines:
• void omp_set_default_device(int dev_num)
• int omp_get_default_device(void)
• int omp_get_num_devices(void)
• int omp_get_num_teams(void)
• int omp_get_team_num(void)
• int omp_is_initial_device(void)
• int omp_get_initial_device(void)

Environment variables:
• Control the default device through OMP_DEFAULT_DEVICE
• Control offloading behaviour through OMP_TARGET_OFFLOAD
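As a minimal sketch (not from the slides) of how these routines can be used, the helper below picks a device number; choose_device is a hypothetical helper of ours, and the stubs only exist so the sketch also builds with a compiler that lacks OpenMP support:

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Stubs so this sketch also builds without an OpenMP compiler;
   a host-only build behaves like a system with zero target devices. */
static int omp_get_num_devices(void)    { return 0; }
static int omp_get_default_device(void) { return 0; }
#endif

/* choose_device() is a hypothetical helper, not an OpenMP routine:
   it returns the default device number if any target device exists,
   or -1 as a "run on the host" sentinel. */
int choose_device(void) {
    if (omp_get_num_devices() > 0)
        return omp_get_default_device();
    return -1;
}
```

At run time the same decision can be steered without code changes via OMP_DEFAULT_DEVICE (a device number) and OMP_TARGET_OFFLOAD (e.g. MANDATORY or DISABLED).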

Offloading Computation

• Use the target construct to
  – transfer control from the host to the target device
• Use the map clause to
  – map variables between the host and target device data environments

  count = 500;
  #pragma omp target map(to: b, c, d) map(from: a)
  {
    #pragma omp parallel for
    for (i = 0; i < count; i++) {
      a[i] = b[i] * c + d;
    }
  }
  a0 = a[0];

• By default, the host waits until the offloaded region has completed
  – Use the nowait clause for asynchronous execution

Asynchronous Offloading

• A host task is generated that encloses the target region.
• The nowait clause indicates that the encountering thread does not wait for the target region to complete (a deferred target task).

  count = 500;
  #pragma omp target map(to: b, c, d) map(from: a) nowait
  {
    #pragma omp parallel for
    for (i = 0; i < count; i++) {
      a[i] = b[i] * c + d;
    }
  }
  do_some_other_work();   // on the host
  a0 = a[0];              // Synchronization missing here!

• The depend clause can be used for synchronization with other tasks.

target task: A mergeable and untied task that is generated by a target, target enter data, target exit data or target update construct.

target Construct

• Transfer control from the host to the target device
• Syntax (C/C++):
    #pragma omp target [clause[[,] clause]…]
    structured-block
• Syntax (Fortran):
    !$omp target [clause[[,] clause]…]
    structured-block
    !$omp end target
• Clauses:
    device(scalar-integer-expression)
    map([always[,]] alloc | to | from | tofrom: list)
    if([target:] scalar-expr)
    private(list)
    firstprivate(list)
    is_device_ptr(list)
    defaultmap(tofrom: scalar)
    nowait
    depend(dependence-type: list)

Tasks and Target Example

All aspects of this example will be explained in the following.

  #pragma omp declare target
  extern void compute(float*, float*, int);
  #pragma omp end declare target

  void vec_mult_async(float* p, float* v1, float* v2, int N)
  {
    int i;

    #pragma omp target enter data map(alloc: v1[:N], v2[:N])

    #pragma omp target nowait depend(out: v1, v2)
    compute(v1, v2, N);

    #pragma omp task
    other_work();  // execute asynchronously on host device

    #pragma omp target map(from: p[0:N]) nowait depend(in: v1, v2)
    {
      #pragma omp parallel for
      for (i = 0; i < N; i++)
        p[i] = v1[i] * v2[i];
    }

    #pragma omp taskwait

    #pragma omp target exit data map(release: v1[:N], v2[:N])
  }

Terminology

• Mapped variable: The corresponding variable in a device data environment to an original variable in a (host) data environment.
• Mappable type: A type that is valid for mapped variables. (Bitwise copyable, plus additional restrictions.)

map Clause

• Map a variable or an array section to a device data environment
• Syntax:
    map([[map-type-modifier[,]] map-type:] list)
• Where map-type is one of:
  – alloc: allocate storage for the corresponding variable
  – to: alloc, and assign the value of the original variable to the corresponding variable on entry
  – from: alloc, and assign the value of the corresponding variable to the original variable on exit
  – tofrom: the default; both to and from
  – delete: the corresponding variable is removed
  – release: the reference count is decremented
• Where map-type-modifier is:
  – always: the mapping operation is always performed
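As an illustrative sketch of how these map types combine (the function and variable names here are ours, not from the slides): to copies data in on entry, from copies results out on exit, and tofrom does both, which suits a scalar reduction result. A compiler without offloading support simply ignores the pragmas and runs the loop on the host with the same result:

```c
/* in is only read on the device (map(to)), out is only written and
   copied back (map(from)), and sum needs its initial value on entry
   and its final value on exit (map(tofrom)). */
float scale_and_sum(const float *in, float *out, int n, float factor) {
    float sum = 0.0f;
    #pragma omp target map(to: in[0:n]) map(from: out[0:n]) map(tofrom: sum)
    {
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; ++i) {
            out[i] = factor * in[i];
            sum += out[i];
        }
    }
    return sum;
}
```

Mapping sum as tofrom (rather than relying on the implicit rule, which changed between 4.0 and 4.5 as noted below) makes the data movement explicit.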

Device Data Environment

The map clauses determine how an original variable in a data environment is mapped to a corresponding variable in a device data environment.

  #pragma omp target \
      map(alloc: ...) \
      map(to: ...)    \
      map(from: ...)
  {
    ...
  }

(Figure: host and device data environments – (1) alloc creates device storage for pA, (2) to copies in on entry, (3) the region executes on the device, (4) from copies out on exit)

map Clause

The array sections for v1, v2, and p are explicitly mapped into the device data environment.

  void vec_mult(float *p, float *v1, float *v2, int N)
  {
    int i;
    init(v1, v2, N);
    #pragma omp target map(to: v1[0:N], v2[0:N]) map(from: p[0:N])
    {
      #pragma omp parallel for
      for (i = 0; i < N; i++)
        p[i] = v1[i] * v2[i];
    }
    output(p, N);
  }

  subroutine vec_mult(p, v1, v2, N)
    real, dimension(*) :: p, v1, v2
    integer :: N, i
    call init(v1, v2, N)
    !$omp target map(to: v1(1:N), v2(1:N)) map(from: p(1:N))
    !$omp parallel do
    do i=1, N
      p(i) = v1(i) * v2(i)
    end do
    !$omp end target
    call output(p, N)
  end subroutine

Note:
• In 4.0, a scalar variable referenced inside a target construct is implicitly mapped as map(tofrom: ...).
• In 4.5, a scalar variable referenced inside a target construct is implicitly firstprivate.

map Clause

• On entry to the target region:
  – Allocate corresponding variables v1, v2, and p in the device data environment.
  – Assign the corresponding variables v1 and v2 the value of their respective original variables.
• On exit from the target region:
  – Assign the original variable p the value of its corresponding variable.
  – Remove the corresponding variables v1, v2, and p from the device data environment.

(The C and Fortran versions of vec_mult are the same as on the previous slide.)

MAP is not necessarily a copy

Shared memory: processor X and accelerator Y share one memory (and caches); the corresponding variable A may occupy the same storage as the original variable.

Distributed memory: the accelerator has its own memory; mapping A creates a separate copy of A in accelerator memory.

• The corresponding variable in the device data environment may share storage with the original variable.
• Writes to the corresponding variable may change the value of the original variable.

Map variables across multiple target regions

• Optimize sharing data between host and device.
• The target data, target enter data, and target exit data constructs map variables but do not offload code.
• Corresponding variables remain in the device data environment for the extent of the target data region.
• Useful to map variables across multiple target regions.
• The target update construct synchronizes an original variable with its corresponding variable.

target data Construct Example

The target data construct maps variables to the device data environment.
• structured mapping – the device data environment is created for the block of code enclosed by the target data construct
• p stays mapped for the whole target data region, while v1 and v2 are mapped at the target construct

  void vec_mult(float* p, float* v1, float* v2, int N)
  {
    int i;
    init(v1, v2, N);

    #pragma omp target data map(from: p[0:N])
    {
      #pragma omp target map(to: v1[:N], v2[:N])
      #pragma omp parallel for
      for (i = 0; i < N; i++)
        p[i] = v1[i] * v2[i];
    }

    output(p, N);
  }

target enter/exit data Construct Example

  void vec_mult(float *p, float *v1, float *v2, int N)
  {
    int i;
    init(v1, v2, N);
    #pragma omp target enter data map(alloc: p[0:N])
    vec_mult1(p, v1, v2, N);
    init_again(v1, v2, N);
    vec_mult2(p, v1, v2, N);
    #pragma omp target exit data map(from: p[0:N])
    output(p, N);
  }

  void vec_mult1(float *p, float *v1, float *v2, int N)
  {
    #pragma omp target map(to: v1[:N], v2[:N])
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
      p[i] = v1[i] * v2[i];
  }

  void vec_mult2(float *p, float *v1, float *v2, int N)
  {
    #pragma omp target map(to: v1[:N], v2[:N])
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
      p[i] = p[i] + (v1[i] * v2[i]);
  }

• The target enter/exit data constructs map variables to the device data environment.
  – unstructured mapping – the device data environment can span more than one function
• v1 and v2 are mapped at each target construct.
• p is allocated and remains undefined in the device data environment by the target enter data map(alloc: ...) construct.
• The value of p in the device data environment is assigned to the original variable on the host by the target exit data map(from: ...) construct.

Map variables to a device data environment

• The host thread executes the data region
• Be careful when using the device clause

  #pragma omp target data device(0) map(alloc: tmp[:N]) map(to: input[:N]) map(res)
  {
    #pragma omp target device(0)                  // runs on the target
    #pragma omp parallel for
    for (i = 0; i < N; i++)
      tmp[i] = some_computation(input[i], i);

    #pragma omp target device(0) map(res)         // runs on the target
    #pragma omp parallel for reduction(+:res)
    for (i = 0; i < N; i++)
      res += final_computation(input[i], tmp[i], i);
  }

target data Construct

• Map variables to a device data environment for the extent of the region.
• Syntax (C/C++):
    #pragma omp target data clause[[[,] clause]…]
    structured-block
• Syntax (Fortran):
    !$omp target data clause[[[,] clause]…]
    structured-block
    !$omp end target data
• Clauses:
    device(scalar-integer-expression)
    map([always[,]] alloc | to | from | tofrom: list)
    if(scalar-expr)
    use_device_ptr(list)

target enter data Construct

• Map variables to a device data environment.
• Syntax (C/C++):
    #pragma omp target enter data clause[[[,] clause]…]
• Syntax (Fortran):
    !$omp target enter data clause[[[,] clause]…]
• Clauses:
    device(scalar-integer-expression)
    map([always[,]] alloc | to: list)
    if(scalar-expr)
    depend(dependence-type: list)
    nowait

target exit data Construct

• Unmap variables from a device data environment.
• Syntax (C/C++):
    #pragma omp target exit data clause[[[,] clause]…]
• Syntax (Fortran):
    !$omp target exit data clause[[[,] clause]…]
• Clauses:
    device(scalar-integer-expression)
    map([always[,]] delete | from | release: list)
    if(scalar-expr)
    depend(dependence-type: list)
    nowait

Synchronize mapped variables

• Synchronize the value of an original variable in a host data environment with a corresponding variable in a device data environment

  #pragma omp target data map(alloc: tmp[:N]) map(to: input[:N]) map(tofrom: res)
  {
    #pragma omp target                            // runs on the target
    #pragma omp parallel for
    for (i = 0; i < N; i++)
      tmp[i] = some_computation(input[i], i);

    update_input_array_on_the_host(input);        // runs on the host

    #pragma omp target update to(input[:N])

    #pragma omp target map(tofrom: res)           // runs on the target
    #pragma omp parallel for reduction(+:res)
    for (i = 0; i < N; i++)
      res += final_computation(input[i], tmp[i], i);
  }

Note: Mapping the variable res with the tofrom map-type is important. In the target data region, the initial value must be available in the target device data environment for the reduction.

target update Construct

• Issue data transfers between host and target devices
• Syntax (C/C++):
    #pragma omp target update clause[[[,] clause]…]
• Syntax (Fortran):
    !$omp target update clause[[[,] clause]…]
• Clauses:
    device(scalar-integer-expression)
    to(list)
    from(list)
    if(scalar-expr)
    nowait
    depend(dependence-type: list)

Map a variable for the whole program

• Indicate that global variables are mapped to a device data environment for the whole program
  – The mapped variables are available for the lifetime of the whole program.
• Use target update to maintain consistency between host and device.

  #define N 1000
  #pragma omp declare target
  float p[N], v1[N], v2[N];
  #pragma omp end declare target

  void vec_mult()
  {
    int i;
    init(v1, v2, N);
    #pragma omp target update to(v1, v2)
    #pragma omp target
    #pragma omp parallel for
    for (i = 0; i < N; i++)
      p[i] = v1[i] * v2[i];
    #pragma omp target update from(p)
    output(p, N);
  }

declare target Construct

• Map variables to a device data environment for the whole program
• Syntax (C/C++):
    #pragma omp declare target
    [variable-definitions-or-declarations]
    #pragma omp end declare target
  or
    #pragma omp declare target (extended-list)
  or
    #pragma omp declare target clause[[[,] clause]…]
• Syntax (Fortran):
    !$omp declare target (extended-list)
  or
    !$omp declare target clause[[[,] clause]…]
• Clauses:
    to(list)
    link(extended-list)

Call functions in a target region

• Function declarations must appear in a declare target construct
• The functions will be compiled for
  – host execution (as usual)
  – target execution (to be invoked from target regions)

  #pragma omp declare target
  float some_computation(float fl, int in) {
    // ... code ...
  }

  float final_computation(float fl, int in) {
    // ... code ...
  }
  #pragma omp end declare target

declare target Construct

The list items of a link clause are not mapped by the declare target directive. Instead, their mapping is deferred until they are mapped by target data, target enter data, or target constructs. They are mapped only for such regions.

  extern int a;
  #pragma omp declare target link(a)

  #pragma omp declare target
  void foo()
  {
    a = 1;
  }
  #pragma omp end declare target

  main() {
    ...
    #pragma omp target map(a)
    {
      foo();
    }
  }

Terminology

• League: the set of thread teams created by a teams construct
• Contention group: the threads of a team in a league and their descendant threads

(Figure: a target device running several teams; threads can synchronize within a team but not across teams)

teams Construct

• The teams construct creates a league of thread teams
  – The master thread of each team executes the teams region
  – The number of teams is specified by the num_teams clause
  – Each team executes with thread_limit threads (or fewer)
  – Threads in different teams cannot synchronize with each other

teams Construct – Restrictions

• A teams construct must be “perfectly” nested in a target construct:
  – No statements or directives outside the teams construct
• Only special OpenMP constructs or routines can be strictly nested inside a teams construct (this restriction was relaxed in 5.0):
  – distribute (see next slides), distribute simd, distribute parallel worksharing-loop, distribute parallel worksharing-loop SIMD
  – parallel regions (parallel for/do, parallel sections)
  – omp_get_num_teams(), omp_get_team_num()

teams Construct

• Syntax (C/C++):
    #pragma omp teams [clause[[,] clause]…]
    structured-block
• Syntax (Fortran):
    !$omp teams [clause[[,] clause]…]
    structured-block
    !$omp end teams
• Clauses:
    num_teams(integer-expression)
    thread_limit(integer-expression)
    default(shared | none)
    private(list)
    firstprivate(list)
    shared(list)
    reduction(reduction-identifier: list)

distribute Construct

• Worksharing among the teams
  – Distribute the iterations of the associated loops across the master threads of the teams executing the region
  – No implicit barrier at the end of the construct

• dist_schedule(kind[, chunk_size])
  – If specified, the scheduling kind must be static
  – Chunks of size chunk_size are distributed in round-robin fashion
  – If no chunk size is specified, chunks are of (almost) equal size; each team receives at most one chunk
  – When no dist_schedule clause is specified, the schedule is implementation defined
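A minimal sketch of distribute with dist_schedule (the function name, team count, and chunk size are illustrative, not from the slides): iterations are divided into chunks of 16 that are dealt round-robin to the master threads of the teams. A compiler without offloading support ignores the pragmas and runs the loop sequentially, producing the same values:

```c
/* Distribute the iterations of one loop across a league of teams.
   Each team's master thread executes the chunks assigned to it. */
void fill_squares(int *out, int n) {
    #pragma omp target teams num_teams(4) map(from: out[0:n])
    #pragma omp distribute dist_schedule(static, 16)
    for (int i = 0; i < n; ++i)
        out[i] = i * i;
}
```

With static chunking, the mapping of chunks to teams is deterministic; since the loop iterations are independent, the result is the same for any schedule.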

distribute Construct

• Syntax (C/C++):
    #pragma omp distribute [clause[[,] clause]…]
    for-loops
• Syntax (Fortran):
    !$omp distribute [clause[[,] clause]…]
    do-loops
• Clauses:
    private(list)
    firstprivate(list)
    lastprivate(list)
    collapse(n)
    dist_schedule(kind[, chunk_size])

SAXPY: on device

  void saxpy(float* restrict y, float* restrict x, float a, int n)
  {
    #pragma omp target map(to: n, a, x[:n]) map(y[:n])
    {
      #pragma omp parallel for
      for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
    }
  }

• How do we run this loop in parallel on massively parallel hardware, which typically has many clusters of execution units (or cores)?
• Chunk loop level 1: distribute big chunks of loop iterations to each cluster (thread blocks, coherency domains, cards, etc.) – to each team
• Chunk loop level 2: workshare the iterations of a distributed chunk across the threads in a team
• Chunk loop level 3: use SIMD parallelism inside each thread

Distribute Parallel SIMD

64 iterations assigned to 2 teams; each team has 4 threads; each thread has 2 SIMD lanes.
• Distribute: iterations 0–31 go to one team, 32–63 to the other.
• Within a team: the 32 iterations are workshared across the 4 threads.
• Within a thread: SIMD parallelism processes 2 iterations per lane step.

(Figure: iterations 0–63 shown at each of the three chunking levels)

SAXPY: Accelerated worksharing

  void saxpy(float* restrict y, float* restrict x, float a, int n)
  {
    #pragma omp target teams map(to: n, a, x[:n]) map(y[:n])
    {
      int block_size = n / omp_get_num_teams();

      // workshare across teams (w/o barrier)
      #pragma omp distribute dist_schedule(static, 1)
      for (int i = 0; i < n; i += block_size) {

        // workshare within a team (w/ barrier)
        #pragma omp parallel for
        for (int j = i; j < i + block_size; j++) {
          y[j] = a*x[j] + y[j];
        }
      }
    }
  }

Combined Constructs

• The distribution patterns can be inconvenient
• Composite constructs exist for typical code patterns:
  – distribute simd
  – distribute parallel for (C/C++)
  – distribute parallel for simd (C/C++)
  – distribute parallel do (Fortran)
  – distribute parallel do simd (Fortran)
  – ... plus additional combinations for teams and target
• This avoids the need to do manual loop blocking

SAXPY: Accelerated worksharing using combined constructs

Use the combined construct to let the compiler partition the loop for you:

  void saxpy(float* restrict y, float* restrict x, float a, int n)
  {
    #pragma omp target teams map(to: n, a, x[:n]) map(y[:n])
    #pragma omp distribute parallel for
    for (int i = 0; i < n; ++i)
      y[i] = a*x[i] + y[i];
  }

  void saxpy(float* restrict y, float* restrict x, float a, int n)
  {
    #pragma omp target teams distribute parallel for map(to: n, a, x[:n]) map(y[:n])
    for (int i = 0; i < n; ++i)
      y[i] = a*x[i] + y[i];
  }

Accessing the device memory

Device memory routines (C/C++ only):
• void* omp_target_alloc(size_t size, int device_num);
• void omp_target_free(void* device_ptr, int device_num);
• int omp_target_is_present(void* ptr, size_t offset, int device_num);
• int omp_target_memcpy(void* dst, void* src, size_t length, size_t dst_offset, size_t src_offset, int dst_device_num, int src_device_num);
• int omp_target_memcpy_rect(void* dst, void* src, size_t element_size, int num_dims, const size_t* volume, const size_t* dst_offsets, const size_t* src_offsets, const size_t* dst_dimensions, const size_t* src_dimensions, int dst_device_num, int src_device_num);
• int omp_target_associate_ptr(void* host_ptr, void* device_ptr, size_t size, size_t device_offset, int device_num);
• int omp_target_disassociate_ptr(void* ptr, int device_num);

Accessing the device memory

Clauses:
• omp target data … use_device_ptr(list) …
• omp target … is_device_ptr(list) …

  void get_dev_cos(double* mem, size_t n)
  {
    double* mem_dev_cpy;
    int h = omp_get_initial_device();  // host
    int t = omp_get_default_device();  // target device

    mem_dev_cpy = omp_target_alloc(sizeof(double)*n, t);

    // dst = mem_dev_cpy; src = mem
    omp_target_memcpy(mem_dev_cpy, mem, sizeof(double)*n, 0, 0, t, h);

    #pragma omp target is_device_ptr(mem_dev_cpy) device(t)
    for (size_t i = 0; i < n; i++)
      mem_dev_cpy[i] = cos(mem_dev_cpy[i]);

    // dst = mem; src = mem_dev_cpy
    omp_target_memcpy(mem, mem_dev_cpy, sizeof(double)*n, 0, 0, h, t);

    omp_target_free(mem_dev_cpy, t);
  }

  void compute()
  {
    float* x;
    cudaMalloc(&x, n);
    #pragma omp target data use_device_ptr(x)
    {
      cudaFunction(x);
      #pragma omp target is_device_ptr(x)
      {
        for (int i = 0; i < n; i++) {
          // ... use x[i] on the device ...
        }
      }
    }
  }

Tasks and Target Example

You should now be able to understand the complete example.

  #pragma omp declare target
  extern void compute(float*, float*, int);
  #pragma omp end declare target

  void vec_mult_async(float* p, float* v1, float* v2, int N)
  {
    int i;

    #pragma omp target enter data map(alloc: v1[:N], v2[:N])

    #pragma omp target nowait depend(out: v1, v2)
    compute(v1, v2, N);

    #pragma omp task
    other_work();  // execute asynchronously on host device

    #pragma omp target map(from: p[0:N]) nowait depend(in: v1, v2)
    {
      #pragma omp parallel for
      for (i = 0; i < N; i++)
        p[i] = v1[i] * v2[i];
    }

    #pragma omp taskwait

    #pragma omp target exit data map(release: v1[:N], v2[:N])
  }

Acknowledgments

• James Beyer (Nvidia)
• Kelvin Li (IBM)
• Eric Stotzer (Texas Instruments)
• many others ...
