Advanced OpenMP Tutorial
OpenMP for Heterogeneous Computing

Slides designed by / based on:

Christian Terboven, Michael Klemm, James C. Beyer, Kelvin Li, Bronis R. de Supinski

Advanced OpenMP Tutorial – Device Constructs (Tim Cramer)

Topics

• Heterogeneous device execution model
• Mapping variables to a device
• Accelerated workshare
• Examples

Device Model

• OpenMP supports heterogeneous systems
• Device model:
  – One host device and
  – One or more target devices

(Figure: example configurations – heterogeneous SoC; host and co-processors; host and GPUs)

Terminology

• Device: An implementation-defined (logical) execution unit.
• Device data environment: The storage associated with a device.

The execution model is host-centric: the host device offloads target regions to target devices.

OpenMP Offloading

Host Device:

  main() {
    saxpy();
  }

  void saxpy() {
    int n = 10240;
    float a = 42.0f;
    float b = 23.0f;
    float *x, *y;
    // Allocate and initialize x, y

    // Run SAXPY
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
      y[i] = a*x[i] + y[i];
    }
  }

Without a target construct, the parallel loop still executes on the host device.

OpenMP Offloading (cont.)

Adding a target construct offloads the SAXPY loop to the target device:

  void saxpy() {
    int n = 10240;
    float a = 42.0f;
    float b = 23.0f;
    float *x, *y;
    // Allocate and initialize x, y

    // Run SAXPY on the target device
    #pragma omp target
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
      y[i] = a*x[i] + y[i];
    }
  }

OpenMP Offloading (cont.)

The map clauses make x and y available in the target device data environment:

  #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
  #pragma omp parallel for
  for (int i = 0; i < n; ++i) {
    y[i] = a*x[i] + y[i];
  }

OpenMP Device Constructs

Execute code on a target device:
• omp target [clause[[,] clause]…]
    structured-block
• omp declare target
    [function-definitions-or-declarations]
  omp end declare target

Map variables to a target device:
• map([[map-type-modifier[,]] map-type:] list)
    map-type := alloc | tofrom | to | from | release | delete
    map-type-modifier := always
• omp target data clause[[[,] clause]…]
    structured-block
• omp target enter data clause[[[,] clause]…]
• omp target exit data clause[[[,] clause]…]
• omp target update clause[[[,] clause]…]
• omp declare target
    [variable-definitions-or-declarations]
  omp end declare target

Workshare for acceleration:
• omp teams [clause[[,] clause]…]
    structured-block
• omp distribute [clause[[,] clause]…]
    for-loops

Device Runtime Support

Runtime routines:
• void omp_set_default_device(int dev_num)
• int omp_get_default_device(void)
• int omp_get_num_devices(void)
• int omp_get_num_teams(void)
• int omp_get_team_num(void)
• int omp_is_initial_device(void)
• int omp_get_initial_device(void)

Environment variables:
• Control the default device through OMP_DEFAULT_DEVICE
• Control offloading behaviour through OMP_TARGET_OFFLOAD
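As a minimal sketch (not from the slides) of how these routines can be used, the helper below picks a device number; choose_device is a hypothetical helper of ours, and the stubs only exist so the sketch also builds with a compiler that lacks OpenMP support:

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Stubs so this sketch also builds without an OpenMP compiler;
   a host-only build behaves like a system with zero target devices. */
static int omp_get_num_devices(void)    { return 0; }
static int omp_get_default_device(void) { return 0; }
#endif

/* choose_device() is a hypothetical helper, not an OpenMP routine:
   it returns the default device number if any target device exists,
   or -1 as a "run on the host" sentinel. */
int choose_device(void) {
    if (omp_get_num_devices() > 0)
        return omp_get_default_device();
    return -1;
}
```

At run time the same decision can be steered without code changes via OMP_DEFAULT_DEVICE (a device number) and OMP_TARGET_OFFLOAD (e.g. MANDATORY or DISABLED).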

Offloading Computation

• Use the target construct to
  – transfer control from the host to the target device
• Use the map clause to
  – map variables between the host and target device data environments

  count = 500;
  #pragma omp target map(to: b, c, d) map(from: a)
  {
    #pragma omp parallel for
    for (i = 0; i < count; i++) {
      a[i] = b[i] * c + d;
    }
  }
  a0 = a[0];

• By default, the host waits until the offloaded region has completed
  – Use the nowait clause for asynchronous execution

Asynchronous Offloading

• A host task is generated that encloses the target region.
• The nowait clause indicates that the encountering thread does not wait for the target region to complete (a deferred target task).

  count = 500;
  #pragma omp target map(to: b, c, d) map(from: a) nowait
  {
    #pragma omp parallel for
    for (i = 0; i < count; i++) {
      a[i] = b[i] * c + d;
    }
  }
  do_some_other_work();   // on the host
  a0 = a[0];              // Synchronization missing here!

• The depend clause can be used for synchronization with other tasks.

target task: A mergeable and untied task that is generated by a target, target enter data, target exit data or target update construct.

target Construct

• Transfer control from the host to the target device
• Syntax (C/C++):
    #pragma omp target [clause[[,] clause]…]
    structured-block
• Syntax (Fortran):
    !$omp target [clause[[,] clause]…]
    structured-block
    !$omp end target
• Clauses:
    device(scalar-integer-expression)
    map([always[,]] alloc | to | from | tofrom: list)
    if([target:] scalar-expr)
    private(list)
    firstprivate(list)
    is_device_ptr(list)
    defaultmap(tofrom: scalar)
    nowait
    depend(dependence-type: list)

Tasks and Target Example

All aspects of this example will be explained in the following.

  #pragma omp declare target
  extern void compute(float*, float*, int);
  #pragma omp end declare target

  void vec_mult_async(float* p, float* v1, float* v2, int N)
  {
    int i;

    #pragma omp target enter data map(alloc: v1[:N], v2[:N])

    #pragma omp target nowait depend(out: v1, v2)
    compute(v1, v2, N);

    #pragma omp task
    other_work();  // execute asynchronously on host device

    #pragma omp target map(from: p[0:N]) nowait depend(in: v1, v2)
    {
      #pragma omp parallel for
      for (i = 0; i < N; i++)
        p[i] = v1[i] * v2[i];
    }

    #pragma omp taskwait

    #pragma omp target exit data map(release: v1[:N], v2[:N])
  }

Terminology

• Mapped variable: The corresponding variable in a device data environment to an original variable in a (host) data environment.
• Mappable type: A type that is valid for mapped variables. (Bitwise copyable, plus additional restrictions.)

map Clause

• Map a variable or an array section to a device data environment
• Syntax:
    map([[map-type-modifier[,]] map-type:] list)
• Where map-type is one of:
  – alloc: allocate storage for the corresponding variable
  – to: alloc, and assign the value of the original variable to the corresponding variable on entry
  – from: alloc, and assign the value of the corresponding variable to the original variable on exit
  – tofrom: the default; both to and from
  – delete: the corresponding variable is removed
  – release: the reference count is decremented
• Where map-type-modifier is:
  – always: the mapping operation is always performed
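As an illustrative sketch of how these map types combine (the function and variable names here are ours, not from the slides): to copies data in on entry, from copies results out on exit, and tofrom does both, which suits a scalar reduction result. A compiler without offloading support simply ignores the pragmas and runs the loop on the host with the same result:

```c
/* in is only read on the device (map(to)), out is only written and
   copied back (map(from)), and sum needs its initial value on entry
   and its final value on exit (map(tofrom)). */
float scale_and_sum(const float *in, float *out, int n, float factor) {
    float sum = 0.0f;
    #pragma omp target map(to: in[0:n]) map(from: out[0:n]) map(tofrom: sum)
    {
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; ++i) {
            out[i] = factor * in[i];
            sum += out[i];
        }
    }
    return sum;
}
```

Mapping sum as tofrom (rather than relying on the implicit rule, which changed between 4.0 and 4.5 as noted below) makes the data movement explicit.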

Device Data Environment

The map clauses determine how an original variable in a data environment is mapped to a corresponding variable in a device data environment.

  #pragma omp target \
      map(alloc: ...) \
      map(to: ...)    \
      map(from: ...)
  {
    ...
  }

(Figure: host and device data environments – (1) alloc creates device storage for pA, (2) to copies in on entry, (3) the region executes on the device, (4) from copies out on exit)

map Clause

The array sections for v1, v2, and p are explicitly mapped into the device data environment.

  void vec_mult(float *p, float *v1, float *v2, int N)
  {
    int i;
    init(v1, v2, N);
    #pragma omp target map(to: v1[0:N], v2[0:N]) map(from: p[0:N])
    {
      #pragma omp parallel for
      for (i = 0; i < N; i++)
        p[i] = v1[i] * v2[i];
    }
    output(p, N);
  }

  subroutine vec_mult(p, v1, v2, N)
    real, dimension(*) :: p, v1, v2
    integer :: N, i
    call init(v1, v2, N)
    !$omp target map(to: v1(1:N), v2(1:N)) map(from: p(1:N))
    !$omp parallel do
    do i=1, N
      p(i) = v1(i) * v2(i)
    end do
    !$omp end target
    call output(p, N)
  end subroutine

Note:
• In 4.0, a scalar variable referenced inside a target construct is implicitly mapped as map(tofrom: ...).
• In 4.5, a scalar variable referenced inside a target construct is implicitly firstprivate.

map Clause

• On entry to the target region:
  – Allocate corresponding variables v1, v2, and p in the device data environment.
  – Assign the corresponding variables v1 and v2 the value of their respective original variables.
• On exit from the target region:
  – Assign the original variable p the value of its corresponding variable.
  – Remove the corresponding variables v1, v2, and p from the device data environment.

(The C and Fortran versions of vec_mult are the same as on the previous slide.)

MAP is not necessarily a copy

Shared memory: processor X and accelerator Y share one memory (and caches); the corresponding variable A may occupy the same storage as the original variable.

Distributed memory: the accelerator has its own memory; mapping A creates a separate copy of A in accelerator memory.

• The corresponding variable in the device data environment may share storage with the original variable.
• Writes to the corresponding variable may change the value of the original variable.

Map variables across multiple target regions

• Optimize sharing data between host and device.
• The target data, target enter data, and target exit data constructs map variables but do not offload code.
• Corresponding variables remain in the device data environment for the extent of the target data region.
• Useful to map variables across multiple target regions.
• The target update construct synchronizes an original variable with its corresponding variable.

target data Construct Example

The target data construct maps variables to the device data environment.
• structured mapping – the device data environment is created for the block of code enclosed by the target data construct
• p stays mapped for the whole target data region, while v1 and v2 are mapped at the target construct

  void vec_mult(float* p, float* v1, float* v2, int N)
  {
    int i;
    init(v1, v2, N);

    #pragma omp target data map(from: p[0:N])
    {
      #pragma omp target map(to: v1[:N], v2[:N])
      #pragma omp parallel for
      for (i = 0; i < N; i++)
        p[i] = v1[i] * v2[i];
    }

    output(p, N);
  }

target enter/exit data Construct Example

  void vec_mult(float *p, float *v1, float *v2, int N)
  {
    int i;
    init(v1, v2, N);
    #pragma omp target enter data map(alloc: p[0:N])
    vec_mult1(p, v1, v2, N);
    init_again(v1, v2, N);
    vec_mult2(p, v1, v2, N);
    #pragma omp target exit data map(from: p[0:N])
    output(p, N);
  }

  void vec_mult1(float *p, float *v1, float *v2, int N)
  {
    #pragma omp target map(to: v1[:N], v2[:N])
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
      p[i] = v1[i] * v2[i];
  }

  void vec_mult2(float *p, float *v1, float *v2, int N)
  {
    #pragma omp target map(to: v1[:N], v2[:N])
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
      p[i] = p[i] + (v1[i] * v2[i]);
  }

• The target enter/exit data constructs map variables to the device data environment.
  – unstructured mapping – the device data environment can span more than one function
• v1 and v2 are mapped at each target construct.
• p is allocated and remains undefined in the device data environment by the target enter data map(alloc: ...) construct.
• The value of p in the device data environment is assigned to the original variable on the host by the target exit data map(from: ...) construct.

Map variables to a device data environment

• The host thread executes the data region
• Be careful when using the device clause

  #pragma omp target data device(0) map(alloc: tmp[:N]) map(to: input[:N]) map(res)
  {
    #pragma omp target device(0)                  // runs on the target
    #pragma omp parallel for
    for (i = 0; i < N; i++)
      tmp[i] = some_computation(input[i], i);

    #pragma omp target device(0) map(res)         // runs on the target
    #pragma omp parallel for reduction(+:res)
    for (i = 0; i < N; i++)
      res += final_computation(input[i], tmp[i], i);
  }

target data Construct

• Map variables to a device data environment for the extent of the region.
• Syntax (C/C++):
    #pragma omp target data clause[[[,] clause]…]
    structured-block
• Syntax (Fortran):
    !$omp target data clause[[[,] clause]…]
    structured-block
    !$omp end target data
• Clauses:
    device(scalar-integer-expression)
    map([always[,]] alloc | to | from | tofrom: list)
    if(scalar-expr)
    use_device_ptr(list)

target enter data Construct

• Map variables to a device data environment.
• Syntax (C/C++):
    #pragma omp target enter data clause[[[,] clause]…]
• Syntax (Fortran):
    !$omp target enter data clause[[[,] clause]…]
• Clauses:
    device(scalar-integer-expression)
    map([always[,]] alloc | to: list)
    if(scalar-expr)
    depend(dependence-type: list)
    nowait

target exit data Construct

• Unmap variables from a device data environment.
• Syntax (C/C++):
    #pragma omp target exit data clause[[[,] clause]…]
• Syntax (Fortran):
    !$omp target exit data clause[[[,] clause]…]
• Clauses:
    device(scalar-integer-expression)
    map([always[,]] delete | from | release: list)
    if(scalar-expr)
    depend(dependence-type: list)
    nowait

Synchronize mapped variables

• Synchronize the value of an original variable in a host data environment with a corresponding variable in a device data environment

  #pragma omp target data map(alloc: tmp[:N]) map(to: input[:N]) map(tofrom: res)
  {
    #pragma omp target                            // runs on the target
    #pragma omp parallel for
    for (i = 0; i < N; i++)
      tmp[i] = some_computation(input[i], i);

    update_input_array_on_the_host(input);        // runs on the host

    #pragma omp target update to(input[:N])

    #pragma omp target map(tofrom: res)           // runs on the target
    #pragma omp parallel for reduction(+:res)
    for (i = 0; i < N; i++)
      res += final_computation(input[i], tmp[i], i);
  }

Note: Mapping the variable res with the tofrom map-type is important. In the target data region, the initial value must be available in the target device data environment for the reduction.

target update Construct

• Issue data transfers between host and target devices
• Syntax (C/C++):
    #pragma omp target update clause[[[,] clause]…]
• Syntax (Fortran):
    !$omp target update clause[[[,] clause]…]
• Clauses:
    device(scalar-integer-expression)
    to(list)
    from(list)
    if(scalar-expr)
    nowait
    depend(dependence-type: list)

Map a variable for the whole program

• Indicate that global variables are mapped to a device data environment for the whole program
  – The mapped variables are available for the lifetime of the whole program.
• Use target update to maintain consistency between host and device.

  #define N 1000
  #pragma omp declare target
  float p[N], v1[N], v2[N];
  #pragma omp end declare target

  void vec_mult()
  {
    int i;
    init(v1, v2, N);
    #pragma omp target update to(v1, v2)
    #pragma omp target
    #pragma omp parallel for
    for (i = 0; i < N; i++)
      p[i] = v1[i] * v2[i];
    #pragma omp target update from(p)
    output(p, N);
  }

declare target Construct

• Map variables to a device data environment for the whole program
• Syntax (C/C++):
    #pragma omp declare target
    [variable-definitions-or-declarations]
    #pragma omp end declare target
  or
    #pragma omp declare target (extended-list)
  or
    #pragma omp declare target clause[[[,] clause]…]
• Syntax (Fortran):
    !$omp declare target (extended-list)
  or
    !$omp declare target clause[[[,] clause]…]
• Clauses:
    to(list)
    link(extended-list)

Call functions in a target region

• Function declarations must appear in a declare target construct
• The functions will be compiled for
  – host execution (as usual)
  – target execution (to be invoked from target regions)

  #pragma omp declare target
  float some_computation(float fl, int in) {
    // ... code ...
  }

  float final_computation(float fl, int in) {
    // ... code ...
  }
  #pragma omp end declare target

declare target Construct

The list items of a link clause are not mapped by the declare target directive. Instead, their mapping is deferred until they are mapped by target data, target enter data, or target constructs. They are mapped only for such regions.

  extern int a;
  #pragma omp declare target link(a)

  #pragma omp declare target
  void foo()
  {
    a = 1;
  }
  #pragma omp end declare target

  main() {
    ...
    #pragma omp target map(a)
    {
      foo();
    }
  }

Terminology

• League: the set of thread teams created by a teams construct
• Contention group: the threads of a team in a league and their descendant threads

(Figure: a target device running several teams; threads can synchronize within a team but not across teams)

teams Construct

• The teams construct creates a league of thread teams
  – The master thread of each team executes the teams region
  – The number of teams is specified by the num_teams clause
  – Each team executes with thread_limit threads (or fewer)
  – Threads in different teams cannot synchronize with each other

teams Construct – Restrictions

• A teams construct must be “perfectly” nested in a target construct:
  – No statements or directives outside the teams construct
• Only special OpenMP constructs or routines can be strictly nested inside a teams construct (this restriction was relaxed in 5.0):
  – distribute (see next slides), distribute simd, distribute parallel worksharing-loop, distribute parallel worksharing-loop SIMD
  – parallel regions (parallel for/do, parallel sections)
  – omp_get_num_teams(), omp_get_team_num()

teams Construct

• Syntax (C/C++):
    #pragma omp teams [clause[[,] clause]…]
    structured-block
• Syntax (Fortran):
    !$omp teams [clause[[,] clause]…]
    structured-block
    !$omp end teams
• Clauses:
    num_teams(integer-expression)
    thread_limit(integer-expression)
    default(shared | none)
    private(list)
    firstprivate(list)
    shared(list)
    reduction(reduction-identifier: list)

distribute Construct

• Worksharing among the teams
  – Distribute the iterations of the associated loops across the master threads of the teams executing the region
  – No implicit barrier at the end of the construct

• dist_schedule(kind[, chunk_size])
  – If specified, the scheduling kind must be static
  – Chunks of size chunk_size are distributed in round-robin fashion
  – If no chunk size is specified, chunks are of (almost) equal size; each team receives at most one chunk
  – When no dist_schedule clause is specified, the schedule is implementation defined
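A minimal sketch of distribute with dist_schedule (the function name, team count, and chunk size are illustrative, not from the slides): iterations are divided into chunks of 16 that are dealt round-robin to the master threads of the teams. A compiler without offloading support ignores the pragmas and runs the loop sequentially, producing the same values:

```c
/* Distribute the iterations of one loop across a league of teams.
   Each team's master thread executes the chunks assigned to it. */
void fill_squares(int *out, int n) {
    #pragma omp target teams num_teams(4) map(from: out[0:n])
    #pragma omp distribute dist_schedule(static, 16)
    for (int i = 0; i < n; ++i)
        out[i] = i * i;
}
```

With static chunking, the mapping of chunks to teams is deterministic; since the loop iterations are independent, the result is the same for any schedule.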

distribute Construct

• Syntax (C/C++):
    #pragma omp distribute [clause[[,] clause]…]
    for-loops
• Syntax (Fortran):
    !$omp distribute [clause[[,] clause]…]
    do-loops
• Clauses:
    private(list)
    firstprivate(list)
    lastprivate(list)
    collapse(n)
    dist_schedule(kind[, chunk_size])

SAXPY: on device

  void saxpy(float* restrict y, float* restrict x, float a, int n)
  {
    #pragma omp target map(to: n, a, x[:n]) map(y[:n])
    {
      #pragma omp parallel for
      for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
    }
  }

• How do we run this loop in parallel on massively parallel hardware, which typically has many clusters of execution units (or cores)?
• Chunk loop level 1: distribute big chunks of loop iterations to each cluster (thread blocks, coherency domains, cards, etc.) – to each team
• Chunk loop level 2: workshare the iterations of a distributed chunk across the threads in a team
• Chunk loop level 3: use SIMD parallelism inside each thread

Distribute Parallel SIMD

64 iterations assigned to 2 teams; each team has 4 threads; each thread has 2 SIMD lanes.
• Distribute: iterations 0–31 go to one team, 32–63 to the other.
• Within a team: the 32 iterations are workshared across the 4 threads.
• Within a thread: SIMD parallelism processes 2 iterations per lane step.

(Figure: iterations 0–63 shown at each of the three chunking levels)

SAXPY: Accelerated worksharing

  void saxpy(float* restrict y, float* restrict x, float a, int n)
  {
    #pragma omp target teams map(to: n, a, x[:n]) map(y[:n])
    {
      int block_size = n / omp_get_num_teams();

      // workshare across teams (w/o barrier)
      #pragma omp distribute dist_schedule(static, 1)
      for (int i = 0; i < n; i += block_size) {

        // workshare within a team (w/ barrier)
        #pragma omp parallel for
        for (int j = i; j < i + block_size; j++) {
          y[j] = a*x[j] + y[j];
        }
      }
    }
  }

Combined Constructs

• The distribution patterns can be inconvenient
• Composite constructs exist for typical code patterns:
  – distribute simd
  – distribute parallel for (C/C++)
  – distribute parallel for simd (C/C++)
  – distribute parallel do (Fortran)
  – distribute parallel do simd (Fortran)
  – ... plus additional combinations for teams and target
• This avoids the need to do manual loop blocking

SAXPY: Accelerated worksharing using combined constructs

Use the combined construct to let the compiler partition the loop for you:

  void saxpy(float* restrict y, float* restrict x, float a, int n)
  {
    #pragma omp target teams map(to: n, a, x[:n]) map(y[:n])
    #pragma omp distribute parallel for
    for (int i = 0; i < n; ++i)
      y[i] = a*x[i] + y[i];
  }

  void saxpy(float* restrict y, float* restrict x, float a, int n)
  {
    #pragma omp target teams distribute parallel for map(to: n, a, x[:n]) map(y[:n])
    for (int i = 0; i < n; ++i)
      y[i] = a*x[i] + y[i];
  }

Accessing the device memory

Device memory routines (C/C++ only):
• void* omp_target_alloc(size_t size, int device_num);
• void omp_target_free(void* device_ptr, int device_num);
• int omp_target_is_present(void* ptr, size_t offset, int device_num);
• int omp_target_memcpy(void* dst, void* src, size_t length, size_t dst_offset, size_t src_offset, int dst_device_num, int src_device_num);
• int omp_target_memcpy_rect(void* dst, void* src, size_t element_size, int num_dims, const size_t* volume, const size_t* dst_offsets, const size_t* src_offsets, const size_t* dst_dimensions, const size_t* src_dimensions, int dst_device_num, int src_device_num);
• int omp_target_associate_ptr(void* host_ptr, void* device_ptr, size_t size, size_t device_offset, int device_num);
• int omp_target_disassociate_ptr(void* ptr, int device_num);

Accessing the device memory

Clauses:
• omp target data … use_device_ptr(list) …
• omp target … is_device_ptr(list) …

  void get_dev_cos(double* mem, size_t n)
  {
    double* mem_dev_cpy;
    int h = omp_get_initial_device();  // host
    int t = omp_get_default_device();  // target device

    mem_dev_cpy = omp_target_alloc(sizeof(double)*n, t);

    // dst = mem_dev_cpy; src = mem
    omp_target_memcpy(mem_dev_cpy, mem, sizeof(double)*n, 0, 0, t, h);

    #pragma omp target is_device_ptr(mem_dev_cpy) device(t)
    for (size_t i = 0; i < n; i++)
      mem_dev_cpy[i] = cos(mem_dev_cpy[i]);

    // dst = mem; src = mem_dev_cpy
    omp_target_memcpy(mem, mem_dev_cpy, sizeof(double)*n, 0, 0, h, t);

    omp_target_free(mem_dev_cpy, t);
  }

  void compute()
  {
    float* x;
    cudaMalloc(&x, n);
    #pragma omp target data use_device_ptr(x)
    {
      cudaFunction(x);
      #pragma omp target is_device_ptr(x)
      {
        for (int i = 0; i < n; i++) {
          // ... use x[i] on the device ...
        }
      }
    }
  }

Tasks and Target Example

You should now be able to understand the complete example.

  #pragma omp declare target
  extern void compute(float*, float*, int);
  #pragma omp end declare target

  void vec_mult_async(float* p, float* v1, float* v2, int N)
  {
    int i;

    #pragma omp target enter data map(alloc: v1[:N], v2[:N])

    #pragma omp target nowait depend(out: v1, v2)
    compute(v1, v2, N);

    #pragma omp task
    other_work();  // execute asynchronously on host device

    #pragma omp target map(from: p[0:N]) nowait depend(in: v1, v2)
    {
      #pragma omp parallel for
      for (i = 0; i < N; i++)
        p[i] = v1[i] * v2[i];
    }

    #pragma omp taskwait

    #pragma omp target exit data map(release: v1[:N], v2[:N])
  }

Acknowledgments

• James Beyer (Nvidia)
• Kelvin Li (IBM)
• Eric Stotzer (Texas Instruments)
• many others ...
