Day 3, Session 2: Parallel Programming with OpenMP

Xavier Martorell

Barcelona Supercomputing Center

Agenda

DAY 3 - SESSION 2
14:00 - 15:00  OpenMP fundamentals, parallel regions
15:00 - 15:30  Worksharing constructs
15:30 - 15:45  Break
15:45 - 16:00  Synchronization mechanisms in OpenMP
16:00 - 17:00  Practical: heat diffusion (Jacobi solver)


DAY 4 - SESSION 1
10:00 - 11:00  Practical: heat diffusion (Gauss-Seidel solver)
11:00 - 11:30  Tasking in OpenMP 3.0
11:30 - 11:45  Break
11:45 - 13:00  Practical: multisort
13:00 - 14:00  Lunch

DAY 4 - SESSION 2
14:00 - 14:15  Tasking in OpenMP 4.0
14:15 - 14:30  Programming using hybrid MPI/OpenMP
14:30 - 17:00  Practical: heat diffusion in MPI/OpenMP

Part I

OpenMP fundamentals, parallel regions

Outline

OpenMP Overview

The OpenMP model

Writing OpenMP programs

Creating Threads

Data-sharing attributes

OpenMP Overview

What is OpenMP?

It is an API extension to the C, C++ and Fortran languages to write parallel programs for shared-memory machines
Current version is 4.0 (July 2013)
  Most compilers support 3.1 (July 2011)... 4.0 support is coming
Supported by most compiler vendors: Intel, IBM, PGI, TI, Sun, Cray, Fujitsu, HP, GCC...
Maintained by the Architecture Review Board (ARB), a consortium of industry and academia
http://www.openmp.org

A bit of history

1997: OpenMP Fortran 1.0
1998: OpenMP C/C++ 1.0
1999: OpenMP Fortran 1.1
2000: OpenMP Fortran 2.0
2002: OpenMP C/C++ 2.0
2005: OpenMP 2.5
2008: OpenMP 3.0
2011: OpenMP 3.1
2013: OpenMP 4.0

Advantages of OpenMP

Mature standard and implementations
  Standardizes the practice of the last 20 years
Good performance and scalability
Portable across architectures
Incremental parallelization
  Maintains the sequential version (mostly)
High level language
  Some people may say a medium level language :-)
Supports both task and data parallelism
Communication is implicit
Support for accelerators (4.0)
Support for error recovery (4.0)
Initial support for task dependences (4.0)

Disadvantages of OpenMP

Communication is implicit
No support for accelerators (introduced in 4.0)
No error recovery capabilities (introduced in 4.0)
Flat memory model
Incremental parallelization creates a false sense of glory/failure
Difficult to compose
Lacks high-level algorithms and structures
Does not run on clusters (although there is research on the topic)

The OpenMP model

OpenMP at a glance

OpenMP components
(Figure: constructs are handled by the compiler; the OpenMP API and environment variables talk to the OpenMP runtime library and its ICVs; the runtime sits on the OS threading libraries, which run the OpenMP executable on the CPUs of an SMP machine.)

Execution model

Fork-join model
OpenMP uses a fork-join model: the master thread spawns a team of threads that joins at the end of the parallel region
Threads in the same team can collaborate to do work

(Figure: the master thread forks a team at each parallel region and joins it at the end; parallel regions may contain nested parallel regions.)

Memory model

OpenMP defines a relaxed memory model
  Threads can see different values for the same variable
  Memory consistency is only guaranteed at specific points
  Luckily, the default points are usually enough
Variables can be shared or private to each thread

Writing OpenMP programs

OpenMP directives syntax

In Fortran
Through a specially formatted comment:
  sentinel construct [clauses]
where sentinel is one of:
  !$OMP, C$OMP or *$OMP in fixed format
  !$OMP in free format

In C/C++
Through a compiler directive:
  #pragma omp construct [clauses]

OpenMP syntax is ignored if the compiler does not recognize OpenMP

We'll be using C/C++ syntax throughout this tutorial

Headers/Macros

C/C++ only
omp.h contains the API prototypes and data type definitions
The _OPENMP macro is defined by OpenMP-enabled compilers
  Allows conditional compilation of OpenMP code (see the sketch below)

Fortran only
The omp_lib module contains the subroutine and function definitions
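A minimal sketch of conditional compilation with _OPENMP (this example is not from the original slides):

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main(void)
{
    #ifdef _OPENMP
    printf("OpenMP enabled (version macro %d), %d processors\n",
           _OPENMP, omp_get_num_procs());
    #else
    printf("Compiled without OpenMP\n");
    #endif
    return 0;
}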

Structured Block

Definition
Most directives apply to a structured block:
  A block of one or more statements
  One entry point, one exit point
  No branching in or out allowed
  Terminating the program is allowed

Creating Threads

The parallel construct

Directive

#pragma omp parallel [clauses]
  structured block

where clauses can be:
  num_threads(expression)
  if(expression)
  shared(var-list)        (coming shortly!)
  private(var-list)
  firstprivate(var-list)
  default(none|shared|private|firstprivate)   (private and firstprivate defaults only in Fortran)
  reduction(var-list)     (we'll see it later)


Specifying the number of threads
The number of threads is controlled by an internal control variable (ICV) called nthreads-var
When a parallel construct is found, a parallel region with a maximum of nthreads-var threads is created
Parallel constructs can be nested, creating nested parallelism
The nthreads-var ICV can be modified through:
  the omp_set_num_threads API call
  the OMP_NUM_THREADS environment variable
Additionally, the num_threads clause causes the implementation to ignore the ICV and use the value of the clause for that region


Avoiding parallel regions
Sometimes we only want to run in parallel under certain conditions
  E.g., enough input data, not already running in parallel, ...
The if clause allows to specify an expression; when it evaluates to false the parallel construct will only use 1 thread
Note that it still creates a new team and data environment

Putting it together

Example

int main(void)
{
    #pragma omp parallel
    ...                     // an unknown number of threads here; use OMP_NUM_THREADS

    omp_set_num_threads(2);
    #pragma omp parallel
    ...                     // a team of two threads here

    #pragma omp parallel num_threads(random()%4+1) if(N>=128)
    ...                     // a team of [1..4] threads here (when N >= 128)
}

API calls

Other useful routines

int omp_get_num_threads()    Returns the number of threads in the current team
int omp_get_thread_num()     Returns the id of the thread in the current team
int omp_get_num_procs()      Returns the number of processors in the machine
int omp_get_max_threads()    Returns the maximum number of threads that will be used in the next parallel region
double omp_get_wtime()       Returns the number of seconds since an arbitrary point in the past
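A small self-contained sketch exercising these routines (the output will vary with the machine and thread count):

#include <stdio.h>
#include <omp.h>

int main(void)
{
    double t0 = omp_get_wtime();
    printf("procs=%d max_threads=%d\n",
           omp_get_num_procs(), omp_get_max_threads());
    #pragma omp parallel
    printf("thread %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());
    printf("elapsed %f s\n", omp_get_wtime() - t0);
    return 0;
}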

Data-sharing attributes

Data environment

A number of clauses are related to building the data environment that the construct will use when executing:
  shared
  private
  firstprivate
  default
  threadprivate
  lastprivate
  reduction


Shared
When a variable is marked as shared, the variable inside the construct is the same as the one outside the construct
In a parallel construct this means all threads see the same variable, but not necessarily the same value
Updating it correctly usually needs some kind of synchronization
  OpenMP has consistency points at synchronizations


Example

int x = 1;
#pragma omp parallel shared(x) num_threads(2)
{
    x++;
    printf("%d\n", x);
}
printf("%d\n", x);

Prints 2 or 3 (three printfs in total)

Private
When a variable is marked as private, the variable inside the construct is a new variable of the same type with an undefined value
In a parallel construct this means all threads have a different variable
It can be accessed without any kind of synchronization


Example

int x = 1;
#pragma omp parallel private(x) num_threads(2)
{
    x++;
    printf("%d\n", x);    // can print anything (twice, same or different)
}
printf("%d\n", x);        // prints 1

Firstprivate
When a variable is marked as firstprivate, the variable inside the construct is a new variable of the same type, initialized to the original value of the variable
In a parallel construct this means all threads have a different variable with the same initial value
It can be accessed without any kind of synchronization


Example

int x = 1;
#pragma omp parallel firstprivate(x) num_threads(2)
{
    x++;
    printf("%d\n", x);    // prints 2 (twice)
}
printf("%d\n", x);        // prints 1

What is the default?
Static/global storage is shared
Heap-allocated storage is shared
Stack-allocated storage inside the construct is private
Others:
  If there is a default clause, what the clause says
    none means that the compiler will issue an error if the attribute is not explicitly set by the programmer
  Otherwise, it depends on the construct
    For the parallel region the default is shared


Example

int x, y;
#pragma omp parallel private(y)
{
    x = ...;    // x is shared
    y = ...;    // y is private
    #pragma omp parallel private(x)
    {
        x = ...;    // x is private
        y = ...;    // y is shared
    }
}

Threadprivate storage

The threadprivate construct

#pragma omp threadprivate(var-list)

Can be applied to:
  Global variables
  Static variables
  Class-static members
Allows to create a per-thread copy of "global" variables
threadprivate storage persists across parallel regions if the number of threads is the same

Threadprivate persistence across nested regions is complex

Example

char *foo()
{
    static char buffer[BUF_SIZE];
    #pragma omp threadprivate(buffer)   // creates one static copy of buffer per thread

    ...                                 // now foo can be called by multiple threads at the same time

    return buffer;                      // foo returns the correct address to its caller
}

Part II

Worksharing constructs

Outline

The worksharing concept

Loop worksharing

The worksharing concept

Worksharings

Worksharing constructs divide the execution of a code region among the threads of a team
  Threads cooperate to do some work
  Better way to split work than using thread-ids
  Lower overhead than using tasks, but less flexible
In OpenMP, there are four worksharing constructs:
  single
  loop worksharing
  section
  workshare

Restriction: worksharings cannot be nested

Loop worksharing

Loop parallelism

The for construct

#pragma omp for [clauses]
for (init-expr; test-expr; inc-expr)

where clauses can be:
  private
  firstprivate
  lastprivate(variable-list)
  reduction(operator:variable-list)
  schedule(schedule-kind)
  nowait
  collapse(n)
  ordered


How does it work?
The iterations of the loop(s) associated to the construct are divided among the threads of the team
Loop iterations must be independent
Loops must follow a form that allows the number of iterations to be computed
Valid data types for induction variables are: integer types, pointers and random access iterators (in C++)
The induction variable(s) are automatically privatized
The default data-sharing attribute is shared

It can be merged with the parallel construct: #pragma omp parallel for

Example

void foo(int *m, int N, int M)
{
    int i, j;
    #pragma omp parallel for private(j)   // the newly created threads cooperate to execute the loop
    for (i = 0; i < N; i++)               // the i variable is automatically privatized
        for (j = 0; j < M; j++)           // j must be explicitly privatized
            ...
}

Example

void foo(std::vector<int> &v)
{
    #pragma omp parallel for               // random access iterators (and pointers) are valid types
    for (std::vector<int>::iterator it = v.begin();
         it < v.end();                     // != cannot be used in the test expression
         it++)
        *it = 0;
}

Removing dependences

Example

x = 0;
for (i = 0; i < n; i++) {
    v[i] = x;
    x += dx;                 // x is updated sequentially: each iteration depends on the previous one
}

Example

#pragma omp parallel for private(x)
for (i = 0; i < n; i++) {
    x = i * dx;              // x is now computed from i, removing the dependence
    v[i] = x;
}

The lastprivate clause

When a variable is declared lastprivate, a private copy is generated for each thread; the value of the variable in the sequentially last iteration of the loop is then copied back to the original variable
A variable can be both firstprivate and lastprivate
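A minimal sketch of lastprivate (the variable name is illustrative):

int i, last;
#pragma omp parallel for lastprivate(last)
for (i = 0; i < n; i++)
    last = i;              // each thread updates its private copy
printf("%d\n", last);      // prints n-1: the value from the sequentially last iteration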

The reduction clause

A very common pattern is where all threads accumulate some values into a single variable
  E.g., n += v[i], our heat program, ...
Using critical or atomic is not good enough, besides being error prone and cumbersome
Instead we can use the reduction clause for basic types
  Valid operators are: +, -, *, |, ||, &, &&, ^, min, max
  User-defined reductions coming in 4.0
The compiler creates a private copy that is properly initialized
At the end of the region, the compiler ensures that the shared variable is properly (and safely) updated
We can also specify reduction variables in the parallel construct

Xavier Martorell (BSC) PATC Parallel Programming Workshop October 15-16, 2014 49 / 143 Private copy initialized here to the identity value

Shared variable updated here with the partial values of each thread

Loop worksharing The reduction clause

Example

i n t vector_sum ( i n t n , i n t v [ n ] ) { i n t i , sum = 0; #pragma omp parallel for reduction ( + :sum) { for ( i=0; i

Xavier Martorell (BSC) PATC Parallel Programming Workshop October 15-16, 2014 50 / 143 Loop worksharing The reduction clause

Example

i n t vector_sum ( i n t n , i n t v [ n ] ) { i n t i , sum = 0; #pragma omp parallel for reduction ( + :sum) { Private copy initialized here to the identity value for ( i=0; i

The schedule clause

The schedule clause determines which iterations are executed by each thread
If no schedule clause is present, the schedule is implementation defined
There are several possible options as schedule:
  STATIC
  STATIC,chunk
  DYNAMIC[,chunk]
  GUIDED[,chunk]
  AUTO
  RUNTIME


Static schedule
The iteration space is broken into chunks of approximately size N/num_threads; these chunks are then assigned to the threads in a round-robin fashion

Static, N schedule (interleaved)
The iteration space is broken into chunks of size N; these chunks are then assigned to the threads in a round-robin fashion

Characteristics of static schedules
  Low overhead
  Good locality (usually)
  Can have load imbalance problems


Dynamic, N schedule
Threads dynamically grab chunks of N iterations until all iterations have been executed. If no chunk is specified, N = 1.

Guided, N schedule
Variant of dynamic. The size of the chunks decreases as the threads grab iterations, but it is at least of size N. If no chunk is specified, N = 1.

Characteristics of dynamic schedules
  Higher overhead
  Not very good locality (usually)
  Can solve imbalance problems


Auto schedule
In this case, the implementation is allowed to do whatever it wishes. Do not expect much of it as of now.

Runtime schedule
The decision is delayed until the program is run, through the sched-var ICV. It can be set with:
  The OMP_SCHEDULE environment variable
  The omp_set_schedule() API call
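A sketch of an imbalanced loop where the schedule matters (the loop body is illustrative, not from the slides):

// iterations get more expensive as i grows, so schedule(static)
// would be imbalanced; dynamic chunks of 16 even the load out
#pragma omp parallel for schedule(dynamic, 16) private(j)
for (i = 0; i < n; i++)
    for (j = 0; j <= i; j++)
        a[i] += b[j];

// with schedule(runtime) the same loop obeys the sched-var ICV, e.g.:
//   OMP_SCHEDULE="guided,8" ./app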

The nowait clause

When a worksharing construct has a nowait clause, the implicit barrier at its end is removed
This allows to overlap the execution of non-dependent loops/tasks/worksharings

Example

#pragma omp for nowait       // first and second loop are independent,
for (i = 0; i < n; i++)      // so threads can start the second loop without waiting
    a[i] = ...;
#pragma omp for
for (i = 0; i < n; i++)
    b[i] = ...;

Example

#pragma omp for nowait       // first and second loops are dependent!
for (i = 0; i < n; i++)      // no guarantees that the previous iteration
    a[i] = ...;              // of the first loop has finished
#pragma omp for
for (i = 0; i < n; i++)
    b[i] = a[i];

Exception: static schedules
If the two (or more) loops have the same static schedule and all have the same number of iterations:

Example

#pragma omp for schedule(static, M) nowait
for (i = 0; i < n; i++)
    a[i] = ...;
#pragma omp for schedule(static, M)
for (i = 0; i < n; i++)
    b[i] = a[i];

The collapse clause

Allows to distribute the work of a set of n nested loops
  Loops must be perfectly nested
  The nest must traverse a rectangular iteration space

Example

#pragma omp for collapse(2)    // the i and j loops are folded and the iterations
for (i = 0; i < N; i++)        // of both are distributed among the threads
    for (j = 0; j < M; j++)
        ...

Xavier Martorell (BSC) PATC Parallel Programming Workshop October 15-16, 2014 59 / 143 Break

Coffee time! :-)

Part III

Basic Synchronizations

Outline

Thread barriers

Exclusive access

Why synchronization?

Mechanisms
Threads need to synchronize to impose some ordering on the sequence of their actions. OpenMP provides different synchronization mechanisms:
  barrier
  critical
  atomic
  taskwait
  ordered
  locks

Thread barriers

Thread Barrier

The barrier construct

#pragma omp barrier

Threads cannot proceed past a barrier point until all threads reach the barrier AND all previously generated work is completed
Some constructs have an implicit barrier at the end
  E.g., the parallel construct

Xavier Martorell (BSC) PATC Parallel Programming Workshop October 15-16, 2014 65 / 143 Forces all foo occurrences too happen before all bar occurrences Implicit barrier at the end of the parallel region

Thread barriers Barrier

Example

#pragma omp parallel { foo ( ) ; #pragma omp barrier bar ( ) ; }

Xavier Martorell (BSC) PATC Parallel Programming Workshop October 15-16, 2014 66 / 143 Implicit barrier at the end of the parallel region

Thread barriers Barrier

Example

#pragma omp parallel { foo ( ) ; Forces all foo occurrences too #pragma omp barrier bar ( ) ; happen before all bar occurrences }

Xavier Martorell (BSC) PATC Parallel Programming Workshop October 15-16, 2014 66 / 143 Forces all foo occurrences too happen before all bar occurrences

Thread barriers Barrier

Example

#pragma omp parallel { foo ( ) ; #pragma omp barrier bar ( ) ; } Implicit barrier at the end of the parallel region

Exclusive access

The critical construct

#pragma omp critical [(name)]
  structured block

Provides a region of mutual exclusion where only one thread can be working at any given time
By default all critical regions are the same, but you can provide them with names
  Only those with the same name synchronize with each other

Xavier Martorell (BSC) PATC Parallel Programming Workshop October 15-16, 2014 68 / 143 Only one thread at a time here

Prints 3!

Exclusive access Critical construct

Example

i n t x =1; #pragma omp parallel num_threads ( 2 ) { #pragma omp critical x ++; } p r i n t f ( "%d\n" , x ) ;

Xavier Martorell (BSC) PATC Parallel Programming Workshop October 15-16, 2014 69 / 143 Prints 3!

Exclusive access Critical construct

Example

i n t x =1; #pragma omp parallel num_threads ( 2 ) { #pragma omp critical x ++; Only one thread at a time here } p r i n t f ( "%d\n" , x ) ;

Xavier Martorell (BSC) PATC Parallel Programming Workshop October 15-16, 2014 69 / 143 Exclusive access Critical construct

Example

i n t x =1; #pragma omp parallel num_threads ( 2 ) { #pragma omp critical x ++; Only one thread at a time here } p r i n t f ( "%d\n" , x ) ; Prints 3!

Example

int x = 1, y = 0;
#pragma omp parallel num_threads(4)
{
    #pragma omp critical(x)
    x++;                    // different names: one thread can update x
    #pragma omp critical(y) // while another updates y
    y++;
}


The atomic construct

#pragma omp atomic
  expression

Provides a special mechanism of mutual exclusion for read & update operations
Only supports simple read & update expressions
  E.g., x += 1, x = x - foo()
Only the read & update part is protected
  foo() is not protected
Usually much more efficient than a critical construct
Not compatible with critical

Xavier Martorell (BSC) PATC Parallel Programming Workshop October 15-16, 2014 71 / 143 Only one thread at a time updates x here

Prints 3!

Exclusive access Atomic construct

Example

i n t x =1; #pragma omp parallel num_threads ( 2 ) { #pragma omp atomic x ++; } p r i n t f ( "%d\n" , x ) ;

Xavier Martorell (BSC) PATC Parallel Programming Workshop October 15-16, 2014 72 / 143 Prints 3!

Exclusive access Atomic construct

Example

i n t x =1; #pragma omp parallel num_threads ( 2 ) { #pragma omp atomic x ++; Only one thread at a time updates x here } p r i n t f ( "%d\n" , x ) ;

Xavier Martorell (BSC) PATC Parallel Programming Workshop October 15-16, 2014 72 / 143 Only one thread at a time updates x here

Exclusive access Atomic construct

Example

i n t x =1; #pragma omp parallel num_threads ( 2 ) { #pragma omp atomic x ++; } p r i n t f ( "%d\n" , x ) ; Prints 3!

Xavier Martorell (BSC) PATC Parallel Programming Workshop October 15-16, 2014 72 / 143 Prints 3,4 or 5 :(

Exclusive access Atomic construct

Example

i n t x =1; #pragma omp parallel num_threads ( 2 ) { #pragma omp critical x ++; #pragma omp atomic x ++; } p r i n t f ( "%d\n" , x ) ;

Xavier Martorell (BSC) PATC Parallel Programming Workshop October 15-16, 2014 73 / 143 Prints 3,4 or 5 :(

Exclusive access Atomic construct

Example

i n t x =1; #pragma omp parallel num_threads ( 2 ) { #pragma omp critical Different threads can update x at x ++; #pragma omp atomic the same time! x ++; } p r i n t f ( "%d\n" , x ) ;

Xavier Martorell (BSC) PATC Parallel Programming Workshop October 15-16, 2014 73 / 143 Exclusive access Atomic construct

Example

i n t x =1; #pragma omp parallel num_threads ( 2 ) { #pragma omp critical x ++; #pragma omp atomic x ++; } p r i n t f ( "%d\n" , x ) ; Prints 3,4 or 5 :(

Part IV

Practical: OpenMP heat diffusion

Heat diffusion

Before you start

Enter the OpenMP directory to do the following exercises
Session3.2-exercise contains the serial version of the Heat application

you can use Tareador on heat-tareador to determine parallelism, and observe differences among the three algorithms

and then annotate the application with OpenMP

Description of the Heat Diffusion app (Hands-on)

Parallel loops
The file solver.c implements the computation of the heat diffusion
1. Annotate the jacobi, redblack, and gauss functions with OpenMP
2. Execute the application with different numbers of processors
   compare the results
   evaluate the performance

Heat Iteration

(Figures: execution of the Jacobi solver and the graphic of temperatures generated; execution of the Red-Black solver.)

Part V

Task Parallelism in OpenMP 3.0

Outline

OpenMP tasks

Task synchronization

The single construct

Task clauses

Common tasking problems

OpenMP tasks

Task parallelism in OpenMP

Task parallelism model
(Figure: a team of threads taking work from a task pool.)
Parallelism is extracted from "several" pieces of code
Supports very unstructured parallelism
  Unbounded loops, recursive functions, ...

What is a task in OpenMP?

Tasks are work units whose execution may be deferred
  they can also be executed immediately
Tasks are composed of:
  code to execute
  a data environment (initialized at creation time)
  internal control variables (ICVs)
Threads of the team cooperate to execute them

Creating tasks

The task construct

#pragma omp task [clauses]
  structured block

where clauses can be:
  shared
  private
  firstprivate (values are captured at creation time)
  default
  if(expression)
  untied

When are tasks created?

Parallel regions create tasks
  One implicit task is created and assigned to each thread
  So all task-concepts make sense inside the parallel region
Each thread that encounters a task construct:
  Packages the code and data
  Creates a new explicit task

Default task data-sharing attributes (when there are no clauses...)

If there is no default clause:
  Implicit rules apply: e.g., global variables are shared
  Otherwise, variables are firstprivate
  The shared attribute is lexically inherited

In practice...

Example

int a;
void foo()
{
    int b, c;
    #pragma omp parallel shared(b)
    #pragma omp parallel private(b)
    {
        int d;
        #pragma omp task
        {
            int e;

            a = ...;    // shared
            b = ...;    // firstprivate
            c = ...;    // shared
            d = ...;    // firstprivate
            e = ...;    // private
        }
    }
}

Tip: default(none) is your friend if you do not see it clearly

List traversal

Example

void traverse_list(List l)
{
    Element e;
    for (e = l->first; e; e = e->next)
        #pragma omp task
        process(e);        // e is firstprivate
}

Task synchronization

There are two main constructs to synchronize tasks:
  barrier
    Remember: all previous work (including tasks) must be completed
  taskwait

Waiting for children

The taskwait construct

#pragma omp taskwait

Suspends the current task until all children tasks are completed
  Just direct children, not descendants (see the sketch below)
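A small sketch of that distinction (function names are illustrative):

#pragma omp task                 // child: the taskwait below waits for it
{
    work_a();
    #pragma omp task             // grandchild: NOT waited for
    work_b();
}
#pragma omp taskwait             // guarantees only that the child has completed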


Example

void traverse_list(List l)
{
    Element e;
    for (e = l->first; e; e = e->next)
        #pragma omp task
        process(e);

    #pragma omp taskwait   // all tasks guaranteed to be completed here
}

Now we need some threads to execute the tasks.

List traversal: completing the picture

Example

List l;

#pragma omp parallel
traverse_list(l);       // this will generate multiple traversals!

We need a way to have a single thread execute traverse_list.

The single construct

Giving work to just one thread

The single construct

#pragma omp single [clauses]
  structured block

where clauses can be:
  private
  firstprivate
  nowait (we'll see it later)
  copyprivate (not today)

Only one thread of the team executes the structured block
There is an implicit barrier at the end

Xavier Martorell (BSC) PATC Parallel Programming Workshop October 15-16, 2014 98 / 143 This program outputs just one “Hello world”

The single construct The single construct

Example

i n t main ( i n t argc , char ∗∗ argv ) { #pragma omp parallel { #pragma omp single { p r i n t f ( "Hello world!\n" ); } } }

Xavier Martorell (BSC) PATC Parallel Programming Workshop October 15-16, 2014 99 / 143 The single construct The single construct

Example

i n t main ( i n t argc , char ∗∗ argv ) { #pragma omp parallel { #pragma omp single { This program outputs just p r i n t f ( "Hello world!\n" ); } one “Hello world” } }

Xavier Martorell (BSC) PATC Parallel Programming Workshop October 15-16, 2014 99 / 143 OneAll threads thread cooperate creates the to tasks execute of the them traversal

The single construct List traversal Completing the picture

Example

L i s t l

#pragma omp parallel #pragma single traverse_list(l );

Xavier Martorell (BSC) PATC Parallel Programming Workshop October 15-16, 2014 100 / 143 All threads cooperate to execute them

The single construct List traversal Completing the picture

Example

L i s t l

#pragma omp parallel #pragma single traverse_list(l ); One thread creates the tasks of the traversal

Xavier Martorell (BSC) PATC Parallel Programming Workshop October 15-16, 2014 100 / 143 One thread creates the tasks of the traversal

The single construct List traversal Completing the picture

Example

L i s t l

#pragma omp parallel #pragma single traverse_list(l ); All threads cooperate to execute them

Task clauses

Task scheduling

How does it work?
Tasks are tied by default
  Tied tasks are always executed by the same thread (not necessarily the creator)
Tied tasks have scheduling restrictions
  Deterministic scheduling points (creation, synchronization, ...)
  Tasks can be suspended/resumed at these points
  Another constraint to avoid problems
Tied tasks may run into performance problems

The untied clause

A task that has been marked as untied has none of the previous scheduling restrictions:
  Can potentially switch to any thread
  Can potentially switch at any moment
  Bad mix with thread-based features: thread-id, threadprivate, critical regions...
  Gives the runtime more flexibility to schedule tasks
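A minimal sketch of the pitfall (function names are illustrative): an untied task may resume on a different thread after a scheduling point, so it must not cache thread-specific state.

#pragma omp task untied
{
    int id = omp_get_thread_num();   // DANGEROUS: at the next scheduling point...
    #pragma omp task
    child();                         // ...this task may resume on another thread,
    use(id);                         // so id may no longer match the executing thread
}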

The if clause

If the expression of an if clause evaluates to false:
  The encountering task is suspended
  The new task is executed immediately
    with its own data environment
    still a different task with respect to synchronization
  The parent task resumes when the new task finishes
This allows implementations to optimize task creation
For very fine grain tasks you may need to do your own if (see the sketch below)
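A common "own if" is a manual recursion cutoff (a sketch; CUTOFF and the helper names are illustrative):

void sort(int *a, int n, int depth)
{
    if (n < 2) return;
    int mid = n / 2;
    if (depth < CUTOFF) {                  // coarse levels: create tasks
        #pragma omp task
        sort(a, mid, depth + 1);
        #pragma omp task
        sort(a + mid, n - mid, depth + 1);
        #pragma omp taskwait
    } else {                               // fine levels: plain recursion,
        sort(a, mid, depth + 1);           // no task-creation overhead at all
        sort(a + mid, n - mid, depth + 1);
    }
    merge(a, mid, n);                      // hypothetical merge step
}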

Common tasking problems

Search problem

Example

int solutions = 0;

void nqueens(int n, int j, int *state)
{
    int i, res;

    if (n == j) {
        /* good solution, count it */
        solutions++;
        return;
    }

    /* try each possible solution for queen j */
    for (i = 0; i < n; i++) {
        state[j] = i;
        if (ok(j+1, state)) {
            nqueens(n, j+1, state);
        }
    }
}

Example (taskified)

int solutions = 0;

void nqueens(int n, int j, int *state)
{
    int i, res;

    if (n == j) {
        /* good solution, count it */
        solutions++;
        return;
    }

    /* try each possible solution for queen j */
    for (i = 0; i < n; i++)
        #pragma omp task    // data scoping: because it is an orphaned task, all
        {                   // variables are firstprivate; but state is a pointer,
                            // so only the pointer is captured (problem #1)
            state[j] = i;
            if (ok(j+1, state)) {
                nqueens(n, j+1, state);
            }
        }
}

Problem #1: Incorrectly capturing pointed data

Problem
firstprivate does not allow to capture data through pointers

Solutions
1. Capture it manually
2. Copy it to an array and capture the array with firstprivate


Example

void nqueens(int n, int j, int *state)
{
    int i, res;

    if (n == j) {
        /* good solution, count it */
        solutions++;
        return;
    }

    /* try each possible solution for queen j */
    for (i = 0; i < n; i++)
        #pragma omp task
        {
            int local_state[n];
            memcpy(local_state, state, sizeof(int) * j);   // capture the pointed data manually
            local_state[j] = i;
            if (ok(j+1, local_state)) {
                nqueens(n, j+1, local_state);
            }
        }
}

Caution! Will state still be valid by the time memcpy is executed?
Problem #2: No, so data can go out of scope.

Problem #2: Out-of-scope data

Problem
Stack-allocated parent data can become invalid before being used by child tasks
  Only if not captured with firstprivate, as in our nqueens example

Solutions
1. Use firstprivate when possible
2. Allocate it in the heap
   Not always easy (we also need to free it)
3. Put additional synchronizations
   May reduce the available parallelism


Example

void nqueens(int n, int j, int *state)
{
    int i, res;

    if (n == j) {
        /* good solution, count it */
        solutions++;        // shared variable: needs protected access
        return;
    }

    /* try each possible solution for queen j */
    for (i = 0; i < n; i++)
        ...
}

Solutions
  Use critical or atomic
  Use a separate counter for each thread and accumulate the partial counts at the end

Reductions for tasks

Example

int solutions = 0;
int mysolutions = 0;
#pragma omp threadprivate(mysolutions)   // use a separate counter for each thread

void start_search()
{
    #pragma omp parallel
    {
        #pragma omp single
        {
            int initial_state[n];
            nqueens(n, 0, initial_state);
        }
        #pragma omp atomic
        solutions += mysolutions;        // accumulate them at the end
    }
}


Example

void nqueens(int n, int j, int *state)
{
    int i, res;

    if (n == j) {
        /* good solution, count it */
        mysolutions++;      // threadprivate counter: no protection needed
        return;
    }

    /* try each possible solution for queen j */
    for (i = 0; i < n; i++)
        ...
}

Part VI

Practical: multisort

Multisort

Before you start

Enter the OpenMP directory to do the following exercises
Session4.1-exercise contains the serial version of the Multisort application

... along with the versions multisort-leaf and multisort-tree

Annotate the application with OpenMP

Generate and analyze the traces you obtain from the application

Break

Bon appétit!*

*Disclaimer: actual food may differ from the image! :-)

Part VII

Task Parallelism in OpenMP 4.0

Outline

Task dependences

Depend clause

Matrix multiply example

Restrictions

Task dependences

Use of task dependences

Matrix multiplication, taskified version
(Figure: the task dependence graph, with chains of tasks T0 - T1 - T2 - T3, T4 - T5 - T6 - T7, ...)

Depend clause

Syntax

Depend clause
The depend keyword takes a dependence type and a list of variables and array regions

where dependence-type can be:
  in
  out
  inout

depend(in: ...)   depend(out: ...)   depend(inout: ...)

Dependence types

Semantics
in: the task only reads from the data specified
  The task will depend on all previously generated sibling tasks that reference at least one of the list items in an out or inout dependence list
out: the task only writes to the data specified
inout: the task reads from and writes to the data
For both out and inout dependence types, the task will depend on all previously generated sibling tasks that reference at least one of the list items in an in, out or inout dependence list
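A minimal sketch of the resulting ordering (function names are illustrative):

#pragma omp task depend(out: x)      // producer
x = compute();

#pragma omp task depend(in: x)       // both consumers wait for the producer,
use1(x);                             // but may run concurrently with each other

#pragma omp task depend(in: x)
use2(x);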

Dependence list

Variables
A named data storage block

Array sections
A designated subset of the elements of an array:
  array_name[lower_bound : length]
  array_name[lower_bound : ]
  array_name[ : length]
  array_name[ : ]
  ... according to the dimensions of the array

Matrix multiply example

Example

void matmul_block(int N, int BS, float *A, float *B, float *C);

// Assume BS divides N perfectly
void matmul(int N, int BS, float A[N][N], float B[N][N], float C[N][N])
{
    int i, j, k;
    for (i = 0; i < N; i += BS)
        for (j = 0; j < N; j += BS)
            for (k = 0; k < N; k += BS)
                #pragma omp task depend(in: A[i:BS][k:BS], B[k:BS][j:BS]) depend(inout: C[i:BS][j:BS])
                matmul_block(N, BS, &A[i][k], &B[k][j], &C[i][j]);
}

Restrictions

List items used in depend clauses of the same task or sibling tasks must indicate identical storage or disjoint storage
List items used in depend clauses cannot be zero-length array sections
A variable that is part of another variable (such as a field of a structure) but is not an array element or an array section cannot appear in a depend clause

Part VIII

Programming using a hybrid MPI/OpenMP approach

MPI+OpenMP programming

Distributed- vs shared-memory programming

Distributed-memory programming
  Separate processes
  Private variables are inaccessible from others
  Point-to-point and collective communication
  Implicit synchronization

Shared-memory programming
  Multiple threads share the same address space
  Explicit synchronization

Hybrid programming

Combining MPI+OpenMP
  Distributed algorithms spread over nodes
  Shared memory for computation within each node

Opportunities

When to use MPI+OpenMP
  Starting from OpenMP and moving to clusters with MPI
  Starting from MPI and exploiting further parallelism inside each node

Improvements
  OpenMP can solve MPI imbalance

Alternatives

MPI + computational kernels in OpenMP
  Use OpenMP directives to exploit parallelism between communication phases
  OpenMP parallel regions end before new communication calls

MPI inside OpenMP constructs
  Call MPI from within for-loops, or tasks
  MPI needs to support multi-threaded mode (see the sketch below)
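A minimal hybrid skeleton illustrating the multi-threaded requirement (a sketch, not from the slides):

#include <mpi.h>

int main(int argc, char **argv)
{
    int provided;
    /* request full thread support so MPI can be called inside OpenMP constructs */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        /* fall back: keep MPI calls outside parallel regions */
    }
    #pragma omp parallel
    {
        /* compute; with MPI_THREAD_MULTIPLE threads may also communicate here */
    }
    MPI_Finalize();
    return 0;
}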

Compiling MPI+OpenMP

The MPI compiler driver needs the proper OpenMP option, e.g.:
  mpicc -fopenmp

Also useful: mpicc -show
  It displays the full command line executed by mpicc to compile your program

Part IX

Practical: MPI+OpenMP heat diffusion

MPI+OpenMP Heat diffusion

Before you start

Enter the Session4.2-exercise directory to do the following exercises

Description of the Heat Diffusion app (Hands-on)

Parallel loops
The file solver.c implements the computation of the heat diffusion
1. Use MPI to distribute the work across nodes
2. Annotate the jacobi, redblack, and gauss functions with OpenMP
3. Execute the application with different numbers of nodes/processors, and compare the results

The End

Thanks for your attention!
