Multicores and multicore programming with OpenMP


1/ 77 Why multicores? the three walls

What is the reason for the introduction of multicores? Uniprocessor performance is leveling off due to the “three walls”:

- ILP wall: Instruction Level Parallelism is near its limits
- Memory wall: caches show diminishing returns
- Power wall: power per chip is getting painfully high

2/ 77 The ILP wall

There are two common approaches to exploit ILP:

- Vector instructions (SSE, AltiVec etc.)
- Out-of-order issue with in-order retirement, speculation, register renaming, branch prediction etc.

Neither of these can generate much concurrency because of:

- irregular memory access patterns
- control-dependent computations
- data-dependent memory accesses

Multicore processors, on the other hand, exploit Thread Level Parallelism (TLP), which can achieve virtually any degree of concurrency.

3/ 77 The Memory wall

The gap between processors and memory speed has increased dramatically. Caches are used to improve memory performance provided that data locality can be exploited. To deliver twice the performance with the same bandwidth, the cache miss rate must be cut in half; this means:

- For dense matrix-matrix multiply or dense LU, a 4x bigger cache
- For sorting or FFTs, a cache the square of its former size
- For sparse or dense matrix-vector multiply, forget it

What is the cost of complicated memory hierarchies?

LATENCY

TLP (that is, multicores) can help overcome this inefficiency by means of multiple streams of execution where memory access latency can be hidden.

4/ 77 The Power wall

ILP techniques rely on the exploitation of higher clock frequencies: processor performance can be improved by a factor k by increasing the frequency by the same factor. Is this a problem? Yes, it is.

P ≃ P_dynamic = C · V² · f

where P_dynamic is the dynamic power, C the capacitance, V the voltage and f the frequency, but

f_max ∼ V

Power consumption and heat dissipation thus grow as f³!


Is there any other way to increase performance without consuming too much power? Yes, with multicores: a k-way multicore is k times faster than a unicore and consumes only k times as much power.

P_dynamic ∝ C

Thus power consumption and heat dissipation grow linearly with the number of cores (i.e., with chip complexity or the number of transistors).

7/ 77 The Power wall

It is even possible to reduce power consumption while still increasing performance. Assume a single-core processor with frequency f and capacitance C. A quad-core running at frequency 0.6 × f will consume roughly 15% less power while delivering 2.4× higher performance.
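To see where these numbers come from (using the relations above, with the voltage scaled down along with the frequency, V' = 0.6 V):

P_single = C · V² · f
P_quad ≃ 4 · C · (0.6 V)² · (0.6 f) = 4 · 0.216 · C · V² · f ≃ 0.86 · P_single

so the quad-core draws roughly 15% less power, while its aggregate throughput is 4 × 0.6 f = 2.4 f, i.e., 2.4 times that of the single core.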

8/ 77 The first Multicore: Power4

Power4 (2001), Power5 (2004) and Power6 (2007)


9/ 77 AMD

Dual-core Opteron (2005) Phenom (2007)

10/ 77 Intel

Clovertown (2007) Dunnington (2008)

11/ 77 Conventional Multicores

What are the problems with all these designs?

- Core-to-core communication. Although cores lie on the same piece of silicon, there is no direct communication channel between them: the only option is to communicate through main memory.
- Shared memory bus. On modern systems, processors are much faster than memory. Example, Intel Woodcrest:
  - at 3.0 GHz, each core can process 3 × 4 (SSE) × 2 (dual issue) = 24 single-precision floating-point values in a nanosecond;
  - at 10.5 GB/s, the memory can provide 10.5/4 ≃ 2.6 single-precision floating-point values in a nanosecond.

One core is 9 times as fast as the memory! Attaching more cores to the same bus only makes the problem worse, unless heavy data reuse is possible.

12/ 77 The future of multicores

TILE64 is a multicore processor manufactured by Tilera. It consists of a mesh network of 64 "tiles", where each tile houses a general-purpose processor, a cache, and a non-blocking router which the tile uses to communicate with the other tiles on the chip.

- 4.5 TB/s on-chip mesh interconnect
- 25 GB/s towards main memory
- no floating-point

13/ 77 Intel Polaris

Intel Polaris 80-core prototype:

- 80 tiles arranged in an 8 × 10 grid
- on-chip mesh interconnect with 1.62 Tb/s bisection bandwidth
- 3-D stacked memory (future)
- consumes only 62 Watts in 275 square millimeters
- each tile has:
  - a router
  - 3 KB instruction memory
  - 2 KB data memory
  - 2 SP FMAC units
  - 32 SP registers

That makes 4 (FLOPS) × 80 (tiles) × 3.16 GHz ≃ 1 TFlop/s. The first TFlop machine was the ASCI Red, made up of 10000 Pentium Pro processors, occupying 250 square meters and drawing 500 kW...

14/ 77 The IBM Cell

The Cell Broadband Engine was released in 2005 by the STI (Sony, Toshiba, IBM) consortium. It is a 9-way multicore processor.

- 1 control core + 8 working cores
- computational power is achieved through the exploitation of two levels of parallelism:
  - vector units
  - multiple cores
- on-chip interconnect bus for core-to-core communications
- caches are replaced by explicitly managed local memories
- performance comes at a price: the Cell is very hard to program

15/ 77 The Cell: architecture

- one Power Processing Element (PPE): this is almost like a PowerPC processor (it lacks some ILP features) and it is almost exclusively meant for control work
- 8 Synergistic Processing Elements (SPEs) (only 6 are available in the PS3)
- one Element Interconnect Bus (EIB): an on-chip ring bus connecting all the SPEs and the PPE
- one Memory Interface Controller (MIC) that connects the EIB to the main memory
- the PPE and the SPEs have different ISAs, thus we have to write different code and use different compilers

16/ 77 The Cell: the SPE

- each SPE has an SPU which is, essentially, a SIMD unit
- 2 in-order, RISC-type (i.e., short) pipelines:
  - one for load/store, shuffle, shifts...
  - one for fixed and floating-point operations. The floating-point unit is a vector unit that can do 4 SP (2 DP) FMACs per clock cycle. At 3.2 GHz this amounts to 25.6 Gflop/s. DP computations are 2 × 7 = 14 times slower

17/ 77 The Cell: the SPE

- each SPE has a 256 KB local store. Data and code have to be explicitly moved into this local memory through the MFC
- communication is managed by a Memory Flow Controller (MFC), which is essentially a DMA engine. The MFC can move data at a speed of 25.6 GB/s
- large register file: 128 entries of 128 bits each

18/ 77 The Cell: the EIB

The Element Interconnect Bus:

- is a bus made of four unidirectional rings: two moving data in one direction and two in the opposite one
- arbitration is token based
- aggregate bandwidth is 102.4 GB/s
- each link to the SPEs, the PPE or the MIC (main memory) is 25.6 GB/s
- it runs at half the system clock (i.e., 1.6 GHz)

19/ 77 Hello world! example

PPE code

#include <stdio.h>
#include <sys/wait.h>
#include <libspe.h>     /* assuming the libspe 1.x API for the elided headers */

extern spe_program_handle_t hello_spu;

int main(void)
{
  speid_t speid[8];
  int status[8];
  int i;

  for (i = 0; i < 8; i++)
    speid[i] = spe_create_thread(0, &hello_spu, NULL, NULL, -1, 0);

  for (i = 0; i < 8; i++) {
    spe_wait(speid[i], &status[i], 0);
    printf("status = %d\n", WEXITSTATUS(status[i]));
  }
  return 0;
}

SPE code

#include <stdio.h>

int main(unsigned long long speid,
         unsigned long long argp,
         unsigned long long envp)
{
  printf("Hello world (0x%llx)\n", speid);
  return 0;
}

20/ 77 Programming the SPE

A simple matrix-vector multiply/update: c = c - A*b.

void spu_code_tile(float *A, float *b, float *c)
{
  int m, n;
  for (n = 0; n < 32; n++)
    for (m = 0; m < 32; m++)
      c[m] -= A[n*32+m] * b[n];
}

- good news: standard C code will compile and run correctly
- bad news: no performance, i.e., 0.24 Gflop/s (< 1% of peak)

21/ 77 Programming the SPE

Taking advantage of vector instructions:

- alias scalar arrays into vector arrays
- use SPU intrinsics

void spu_code_tile(float *A, float *b, float *c)
{
  vector float *Ap = (vector float*)A;
  vector float *cp = (vector float*)c;
  int m, n;
  for (n = 0; n < 32; n++) {
    vector float b_splat = spu_splats(b[n]);
    for (m = 0; m < 8; m++) {
      cp[m] = spu_nmsub(Ap[n*8+m], b_splat, cp[m]);
    }
  }
}

4.5 × faster: 1.08 Gflop/s, i.e., ∼4% of the peak

23/ 77 Programming the SPE

Vectorization + unrolling:

- unroll loops
- reuse registers between iterations
- eliminate the innermost loop completely

void spu_code_tile(float *A, float *b, float *c)
{
  vector float *Ap = (vector float*)A;
  vector float *cp = (vector float*)c;
  vector float c0, c1, c2, c3, c4, c5, c6, c7;
  int n;

  c0 = cp[0]; c4 = cp[4];
  c1 = cp[1]; c5 = cp[5];
  c2 = cp[2]; c6 = cp[6];
  c3 = cp[3]; c7 = cp[7];

  for (n = 0; n < 32; n++) {
    vector float b_splat = spu_splats(b[n]);
    c0 = spu_nmsub(Ap[n*8+0], b_splat, c0);
    c1 = spu_nmsub(Ap[n*8+1], b_splat, c1);
    c2 = spu_nmsub(Ap[n*8+2], b_splat, c2);
    c3 = spu_nmsub(Ap[n*8+3], b_splat, c3);
    c4 = spu_nmsub(Ap[n*8+4], b_splat, c4);
    c5 = spu_nmsub(Ap[n*8+5], b_splat, c5);
    c6 = spu_nmsub(Ap[n*8+6], b_splat, c6);
    c7 = spu_nmsub(Ap[n*8+7], b_splat, c7);
  }

  cp[0] = c0; cp[4] = c4;
  cp[1] = c1; cp[5] = c5;
  cp[2] = c2; cp[6] = c6;
  cp[3] = c3; cp[7] = c7;
}

32 × faster: 7.64 Gflop/s, i.e., ∼30% of peak.

24/ 77 Programming the SPE

Because the SPU has two independent pipelines we can overlap computations with loads/stores. This will drastically improve performance

- compute even iterations
- load/store odd iterations

25/ 77 Programming the SPE

c0 = cp[0]; a0 = Ap[0];
c1 = cp[1]; a1 = Ap[1];
c2 = cp[2]; a2 = Ap[2];
c3 = cp[3]; a3 = Ap[3];
c4 = cp[4]; a4 = Ap[4];
c5 = cp[5]; a5 = Ap[5];
c6 = cp[6]; a6 = Ap[6];
c7 = cp[7]; a7 = Ap[7];

b_splat = spu_splats(b[0]);

for (n = 0; n < 32; n += 2) {
  d_splat = spu_splats(b[n+1]);
  c0 = spu_nmsub(a0, b_splat, c0); e0 = Ap[n*8+ 8];
  c1 = spu_nmsub(a1, b_splat, c1); e1 = Ap[n*8+ 9];
  c2 = spu_nmsub(a2, b_splat, c2); e2 = Ap[n*8+10];
  c3 = spu_nmsub(a3, b_splat, c3); e3 = Ap[n*8+11];
  c4 = spu_nmsub(a4, b_splat, c4); e4 = Ap[n*8+12];
  c5 = spu_nmsub(a5, b_splat, c5); e5 = Ap[n*8+13];
  c6 = spu_nmsub(a6, b_splat, c6); e6 = Ap[n*8+14];
  c7 = spu_nmsub(a7, b_splat, c7); e7 = Ap[n*8+15];

  b_splat = spu_splats(b[n+2]);
  c0 = spu_nmsub(e0, d_splat, c0); a0 = Ap[n*8+16];
  c1 = spu_nmsub(e1, d_splat, c1); a1 = Ap[n*8+17];
  c2 = spu_nmsub(e2, d_splat, c2); a2 = Ap[n*8+18];
  c3 = spu_nmsub(e3, d_splat, c3); a3 = Ap[n*8+19];
  c4 = spu_nmsub(e4, d_splat, c4); a4 = Ap[n*8+20];
  c5 = spu_nmsub(e5, d_splat, c5); a5 = Ap[n*8+21];
  c6 = spu_nmsub(e6, d_splat, c6); a6 = Ap[n*8+22];
  c7 = spu_nmsub(e7, d_splat, c7); a7 = Ap[n*8+23];
}

cp[0] = c0; cp[1] = c1; cp[2] = c2; cp[3] = c3;
cp[4] = c4; cp[5] = c5; cp[6] = c6; cp[7] = c7;

51 × faster: 12.34 Gflop/s, i.e., ∼48% of peak.

26/ 77 Double Buffering

Double buffering is a technique that allows us to overlap communications and computations. It is vital for the Cell and very important for ALL architectures. Example: assume a sequence of n 64 × 64 tiles; for each i we want to compute Ci = Ci + Ai × Bi in single precision.

- The product can be performed on each SPE at peak speed (25.6 Gflop/s)
- The data can be transferred from main memory to local storage at full speed. Because the bus to main memory is shared, we will assume that each SPE gets 1/8 of the full bandwidth, i.e., 25.6/8 = 3.2 GB/s


28/ 77 Double Buffering

A few numbers:

- How long does it take to perform a product? The flop count for a single matrix-matrix product is 2 × n³. At 25.6 Gflop/s it takes 2 × 64³/25.6 = 20480 nanoseconds.
- How long does it take to transfer the data? The amount of data to be transferred is 4 × 64² values. That makes 4 × 4 × 64² bytes; at 3.2 GB/s it takes 4 × 4 × 64²/3.2 = 20480 nanoseconds.

28/ 77 Double Buffering

for (i = 0; i < n; i++) {
  mfc_get(buf);                 /* fetch the data for iteration i      */
  mfc_write_tag_mask(buf);
  mfc_read_tag_status_all();    /* wait for the transfer to complete   */
  compute(buf);
  mfc_put(buf);                 /* write the result back               */
  mfc_write_tag_mask(buf);
  mfc_read_tag_status_all();    /* wait for the write-back to complete */
}

With this naive implementation we can only get half of the peak:

2 × 64³ / (20480 + 20480) = 12.8 Gflop/s

29/ 77 Double Buffering

Since communications are done asynchronously by a DMA engine, we can overlap them with computations. In order to achieve this, we need 2 buffers.

The idea behind double buffering is that we fetch the data for iteration i + 1 while we perform computations for iteration i.

30/ 77 Double Buffering

mfc_get(buf0);
mfc_get(buf1);
mfc_write_tag_mask(buf0);
mfc_read_tag_status_all();
compute(buf0);
for (i = 1; i < n; i++) {
  mfc_put(buf0);                /* write back the result of iteration i-1  */
  mfc_write_tag_mask(buf0);
  mfc_read_tag_status_all();    /* make sure buf0 can be reused            */
  if (i < n-1)
    mfc_get(buf0);              /* prefetch the data for iteration i+1     */
  mfc_write_tag_mask(buf1);
  mfc_read_tag_status_all();    /* wait for the data of iteration i        */
  compute(buf1);                /* compute iteration i while i+1 arrives   */
  swap(buf0, buf1);             /* exchange the roles of the two buffers   */
}
mfc_put(buf0);                  /* write back the last result              */
mfc_write_tag_mask(buf0);
mfc_read_tag_status_all();

31/ 77 Other computing devices: GPUs

AMD/ATI x1900 vs AMD dual-core Opteron:

NVIDIA GeForce 8800 GTX:

- over 330 Gflop/s
- 80 GB/s streaming memory

32/ 77 Other computing devices: GPUs

NVIDIA GeForce 8800 GTX:

16 streaming multiprocessors of 8 thread processors each.

33/ 77 Other computing devices: GPUs

How to program GPUs?

- SPMD programming model
  - coherent branches (i.e., SIMD style) preferred
  - penalty for non-coherent branches (i.e., when different processes take different paths)
- directly with OpenGL/DirectX: not suited for general-purpose computing
- with higher-level GPGPU APIs:
  - AMD/ATI HAL-CAL (Hardware Abstraction Level - Compute Abstraction Level)
  - NVIDIA CUDA: C-like syntax with pointers etc.
  - RapidMind
  - PeakStream

34/ 77 Other computing devices: GPUs

LU on an 8-core Xeon + a GeForce GTX 280:

35/ 77 Other computing devices: FPGAs

Field Programmable Gate Arrays (FPGAs) are boards whose logic gates can be reconfigured on-the-fly in order to accommodate a specific operation.

- Performance: optimal silicon use and a configuration specific to the operation. Also, on-chip memory guarantees fast access to data and low latency due to cache avoidance
- Power: 10% of the power required by a processor
- Flexibility: the board can be reconfigured on-the-fly for a specific operation
- Programmability: hardware vendors provide programming environments that help with the development (although fine tuning requires deep knowledge)

36/ 77 Other computing devices: FPGAs

An example: LU on a Xilinx Virtex-4 vs. an Opteron 2.2 GHz

37/ 77 Mixed-precision arithmetic

On modern systems, single-precision arithmetic has a clear performance advantage over double-precision arithmetic for the following reasons:

- Vector instructions: vector units can normally do twice as many SP operations as DP ones every clock cycle. For example, SSE units can do either 4 SP or 2 DP operations at a time.
- Bus bandwidth: because SP values are half the size of DP ones, the memory transfer rate in values per second is twice as high for SP as for DP.
- Data locality: again, because SP data are half the size of DP data, you can fit twice as many SP values in the cache.

38/ 77 Mixed-precision arithmetic

Performance comparison between single- and double-precision arithmetic for matrix-matrix and matrix-vector product operations on square matrices:

                      Size   SGEMM/DGEMM    Size   SGEMV/DGEMV
AMD Opteron 246       3000      2.00        5000      1.70
Sun UltraSparc-IIe    3000      1.64        5000      1.66
Intel PIII Copp.      3000      2.03        5000      2.09
PowerPC 970           3000      2.04        5000      1.44
Intel Woodcrest       3000      1.81        5000      2.18
Intel XEON            3000      2.04        5000      1.82
Intel Centrino Duo    3000      2.71        5000      2.21

39/ 77 Mixed-precision arithmetic

Is there a way to do computations at the speed of single precision while achieving the accuracy of double precision? Sort of. For some operations it is possible to do the bulk of the computations in single precision and then recover double-precision accuracy by means of an iterative method like Newton's method.

40/ 77 Mixed-precision arithmetic

Newton's method says that, given an approximate root x_k of a function f(x), we can refine it through iterations of the type:

x_{k+1} = x_k − f(x_k) / f'(x_k)
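As a quick numerical illustration (not from the slides), take f(x) = x² − 2, whose root is √2; the iteration becomes x_{k+1} = (x_k + 2/x_k)/2. Starting from x_0 = 1:

x_1 = 1.5,   x_2 ≃ 1.41667,   x_3 ≃ 1.41422,   √2 ≃ 1.414214

so a few iterations are enough to reach high accuracy; this rapid convergence is exactly what the mixed-precision scheme below relies on.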

41/ 77 Mixed-precision arithmetic

Newton's method can be applied, for example, to refine the root of the function f(x) = b − Ax (which is equivalent to solving the linear system Ax = b):

x_{k+1} = x_k + A^{-1} r_k,   where r_k = b − A x_k

This leads to the well-known iterative refinement method:

x_0 ← A^{-1} b              O(n³)    single precision (ε_s)
repeat
  r_k ← b − A x_{k-1}       O(n²)    double precision (ε_d)
  z_k ← A^{-1} r_k          O(n²)    single precision (ε_s)
  x_k ← x_{k-1} + z_k       O(n)     double precision (ε_d)
until convergence

We can perform the expensive O(n³) factorization in single precision and then do the refinement in double precision.
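A minimal C sketch of the scheme (my own, not from the slides). All the helper routines are hypothetical placeholders standing in for, e.g., a single-precision LAPACK factorization/solve and double-precision BLAS kernels:

#include <stdlib.h>

/* Hypothetical helpers, declared only to make the idea concrete: */
float  *sp_copy(const double *a, int len);                               /* copy to single precision   */
void    factor_sp(float *A, int n);                                      /* O(n^3) factorization in SP */
void    solve_sp(const float *A, int n, const double *rhs, double *sol); /* O(n^2) solve in SP         */
void    residual_dp(int n, const double *A, const double *x, const double *b, double *r);
double  norm_dp(int n, const double *r);

void mixed_refinement(int n, const double *A, const double *b,
                      double *x, double tol, int maxit)
{
  float  *As = sp_copy(A, n * n);
  double *r  = malloc(n * sizeof *r);
  double *z  = malloc(n * sizeof *z);

  factor_sp(As, n);                     /* expensive part, done in single precision */
  solve_sp(As, n, b, x);                /* x0 = A^{-1} b                            */

  for (int k = 0; k < maxit; k++) {
    residual_dp(n, A, x, b, r);         /* r = b - A x     (double precision)       */
    if (norm_dp(n, r) < tol) break;     /* stop when the residual is small enough   */
    solve_sp(As, n, r, z);              /* z = A^{-1} r    (single precision)       */
    for (int i = 0; i < n; i++)
      x[i] += z[i];                     /* x = x + z       (double precision)       */
  }
  free(As); free(r); free(z);
}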

42/ 77 Mixed Precision Iterative Refinement: results

[Figure: LU solve on an AMD Opteron 246 (2.0 GHz): Gflop/s vs. problem size (up to 5000) for full-single, mixed-precision and full-double solvers.]

43/ 77 Mixed Precision Iterative Refinement: results

[Figure: Cholesky solve on the Cell Broadband Engine (3.2 GHz): Gflop/s vs. problem size (500-4000) for full-single, mixed-precision and peak-double.]

44/ 77 Mixed Precision Iterative Refinement: results

[Figure: Intel Woodcrest 3.0 GHz: speedup over full double precision (single/double and mixed-precision/double) for test matrices 1-6.]

45/ 77 How to program multicores: OpenMP

OpenMP (Open specifications for Multi Processing) is an Application Program Interface (API) used to explicitly direct multi-threaded, shared-memory parallelism.

- Comprised of three primary API components:
  - Compiler directives (OpenMP is a compiler technology)
  - Runtime library routines
  - Environment variables
- Portable:
  - Specifications for C/C++ and Fortran
  - Already available on many systems (including Linux, Windows, IBM, SGI etc.)
- Full specs: http://openmp.org
- Tutorial: https://computing.llnl.gov/tutorials/openMP/

46/ 77 How to program multicores: OpenMP

OpenMP is based on a fork-join execution model:

- Execution is started by a single thread called the master thread
- when a parallel region is encountered, the master thread spawns a set of threads
- the set of instructions enclosed in the parallel region is executed by all the threads
- at the end of the parallel region all the threads synchronize and terminate, leaving only the master

47/ 77 How to program multicores: OpenMP

Parallel regions and other OpenMP constructs are defined by means of compiler directives:

C/C++:

#include <omp.h>

int main() {
  int var1, var2, var3;

  /* Serial code */

  #pragma omp parallel private(var1, var2) \
                       shared(var3)
  {
    /* Parallel section executed by all threads */
  }

  /* Resume serial code */
}

Fortran:

program hello
  integer :: var1, var2, var3

  ! Serial code

  !$omp parallel private(var1, var2) &
  !$omp&         shared(var3)
  ! Parallel section executed by all threads
  !$omp end parallel

  ! Resume serial code
end program hello

48/ 77 OpenMP: the PARALLEL construct

The PARALLEL construct is the main OpenMP construct; it identifies a block of code that will be executed by multiple threads:

!$OMP PARALLEL [clause ...]
      IF (scalar_logical_expression)
      PRIVATE (list)
      SHARED (list)
      DEFAULT (PRIVATE | SHARED | NONE)
      FIRSTPRIVATE (list)
      REDUCTION (operator: list)
      COPYIN (list)
      NUM_THREADS (scalar-integer-expression)

block

!$OMP END PARALLEL

- The master is a member of the team and has thread number 0
- Starting from the beginning of the region, the code is duplicated and all threads will execute it.
- There is an implied barrier at the end of a parallel section.
- If any thread terminates within a parallel region, all threads in the team will terminate.

49/ 77 OpenMP: the PARALLEL construct

How many threads do we have? The number of threads depends on:

- Evaluation of the IF clause
- Setting of the NUM_THREADS clause
- Use of the omp_set_num_threads() library function
- Setting of the OMP_NUM_THREADS environment variable
- Implementation default - usually the number of CPUs on a node, though it could be dynamic

A short C sketch below illustrates how two of these mechanisms interact.
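The following minimal C example (mine, not from the slides; compile with, e.g., gcc -fopenmp) shows that the num_threads clause takes precedence over a previous omp_set_num_threads() call and over the OMP_NUM_THREADS variable:

#include <stdio.h>
#include <omp.h>

int main(void)
{
  omp_set_num_threads(4);                 /* request 4 threads for subsequent regions */

  #pragma omp parallel
  printf("region 1: thread %d of %d\n",
         omp_get_thread_num(), omp_get_num_threads());

  #pragma omp parallel num_threads(2)     /* the clause takes precedence here */
  printf("region 2: thread %d of %d\n",
         omp_get_thread_num(), omp_get_num_threads());

  return 0;
}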

50/ 77 Hello world example:

program hello

  integer :: nthreads, tid, &
       &     omp_get_num_threads, omp_get_thread_num

  ! Fork a team of threads giving them
  ! their own copies of variables
  !$omp parallel private(tid)

  ! Obtain and print thread id
  tid = omp_get_thread_num()
  write(*,'("Hello from thread ",i2)') tid

  ! Only master thread does this
  if (tid .eq. 0) then
    nthreads = omp_get_num_threads()
    write(*,'("# threads: ",i2)') nthreads
  end if

  ! All threads join master thread and disband
  !$omp end parallel

end program hello

- the PRIVATE clause says that each thread will have its own copy of the tid variable (more later)
- omp_get_num_threads and omp_get_thread_num are runtime library routines

51/ 77 OpenMP: Data scoping

- Most variables are shared by default
- Global variables include:
  - Fortran: COMMON blocks, SAVE and MODULE variables
  - C: file-scope variables, static variables
- Private variables include:
  - loop index variables
  - stack variables in subroutines called from parallel regions
  - Fortran: automatic variables within a statement block
- The OpenMP Data Scope Attribute Clauses are used to explicitly define how variables should be scoped. They include: PRIVATE, FIRSTPRIVATE, LASTPRIVATE, SHARED, DEFAULT, REDUCTION, COPYIN

52/ 77 OpenMP: Data scoping

- PRIVATE(list): a new object of the same type is created for each thread (uninitialized!)
- FIRSTPRIVATE(list): listed variables are initialized according to the value of their original objects prior to entry into the parallel or work-sharing construct
- LASTPRIVATE(list): the value copied back into the original variable object is obtained from the last (sequentially) iteration or section of the enclosing construct
- SHARED(list): only one object exists in memory and all the threads access it
- DEFAULT(SHARED|PRIVATE|NONE): sets the default scoping
- REDUCTION(operator:list): performs a reduction on the variables that appear in its list

A short C sketch of the difference between PRIVATE and FIRSTPRIVATE follows.
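This small C example (mine, not from the slides) makes the scoping rules concrete: both variables get per-thread copies, but only the firstprivate one starts initialized, and the originals are untouched after the region:

#include <stdio.h>
#include <omp.h>

int main(void)
{
  int a = 10, b = 10;

  #pragma omp parallel private(a) firstprivate(b) num_threads(4)
  {
    /* a is uninitialized here; b starts at 10 in every thread */
    a = omp_get_thread_num();
    b = b + omp_get_thread_num();
    printf("thread %d: a = %d, b = %d\n", omp_get_thread_num(), a, b);
  }

  /* the original a and b are untouched: both are still 10 */
  printf("after the region: a = %d, b = %d\n", a, b);
  return 0;
}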

53/ 77 OpenMP: worksharing constructs

- A work-sharing construct divides the execution of the enclosed code region among the members of the team that encounter it
- Work-sharing constructs do not launch new threads

There are three main worksharing constructs:

- DO/for construct: used to parallelize loops
- SECTIONS: used to identify portions of code that can be executed in parallel
- SINGLE: specifies that the enclosed code is to be executed by only one thread in the team

54/ 77 OpenMP: worksharing constructs

The DO/for directive:

!$OMP DO [clause ...]
      SCHEDULE (type [,chunk])
      ORDERED
      PRIVATE (list)
      FIRSTPRIVATE (list)
      LASTPRIVATE (list)
      SHARED (list)
      REDUCTION (operator | intrinsic : list)

do_loop

!$OMP END DO [ NOWAIT ]

This directive specifies that the iterations of the loop immediately following it must be executed in parallel by the team.

55/ 77 OpenMP: worksharing constructs

The SCHEDULE clause in the DO/for construct specifies how the iterations of the loop are assigned to threads:

- STATIC: loop iterations are divided into pieces of size chunk and then statically assigned to threads in a round-robin fashion
- DYNAMIC: loop iterations are divided into pieces of size chunk and dynamically scheduled among the threads; when a thread finishes one chunk, it is dynamically assigned another
- GUIDED: for a chunk size of 1, the size of each chunk is proportional to the number of unassigned iterations divided by the number of threads, decreasing to 1. For a chunk size with value k (greater than 1), the size of each chunk is determined in the same way, with the restriction that the chunks do not contain fewer than k iterations
- RUNTIME: the scheduling decision is deferred until runtime by the environment variable OMP_SCHEDULE

56/ 77 OpenMP: worksharing constructs

Example showing scheduling policies for a loop of size 200
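A small C sketch (my own, standing in for the figure) records which thread executes each of the 200 iterations; different policies can be compared by setting OMP_SCHEDULE (e.g. "static,10", "dynamic,10" or "guided,10") together with schedule(runtime):

#include <stdio.h>
#include <omp.h>

#define N 200

int main(void)
{
  int owner[N];

  /* the schedule is chosen at runtime from the OMP_SCHEDULE variable */
  #pragma omp parallel for schedule(runtime)
  for (int i = 0; i < N; i++)
    owner[i] = omp_get_thread_num();

  for (int i = 0; i < N; i++)
    printf("iteration %3d -> thread %d\n", i, owner[i]);
  return 0;
}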

57/ 77 OpenMP: worksharing constructs

program do_example

  integer :: i, chunk
  integer, parameter :: n=1000, &
       &                chunksize=100
  real(kind(1.d0)) :: a(n), b(n), c(n)

  ! Some sequential code...
  chunk = chunksize

  !$omp parallel shared(a,b,c,chunk) private(i)

  !$omp do schedule(dynamic,chunk)
  do i = 1, n
    c(i) = a(i) + b(i)
  end do
  !$omp end do

!$omp end parallel

end program do_example

58/ 77 OpenMP: worksharing constructs

The SECTIONS directive is a non-iterative work-sharing construct. It specifies that the enclosed section(s) of code are to be divided among the threads in the team.

!$OMP SECTIONS [clause ...]
      PRIVATE (list)
      FIRSTPRIVATE (list)
      LASTPRIVATE (list)
      REDUCTION (operator | intrinsic : list)

!$OMP SECTION

block

!$OMP SECTION

block

!$OMP END SECTIONS [ NOWAIT ]

59/ 77 OpenMP: worksharing constructs

Example of the SECTIONS worksharing construct

program vec_add_sections

  integer :: i
  integer, parameter :: n=1000
  real(kind(1.d0)) :: a(n), b(n), c(n), d(n)

  ! some sequential code

  !$omp parallel shared(a,b,c,d), private(i)

  !$omp sections

  !$omp section
  do i = 1, n
    c(i) = a(i) + b(i)
  end do

  !$omp section
  do i = 1, n
    d(i) = a(i) * b(i)
  end do

  !$omp end sections
  !$omp end parallel

end program vec_add_sections

60/ 77 OpenMP: worksharing constructs

The SINGLE directive specifies that the enclosed code is to be executed by only one thread in the team.

!$OMP SINGLE [clause ...]
      PRIVATE (list)
      FIRSTPRIVATE (list)

block

!$OMP END SINGLE [ NOWAIT ]

61/ 77 OpenMP: synchronization constructs

The CRITICAL directive specifies a region of code that must be executed by only one thread at a time

!$OMP CRITICAL [ name ]

block

!$OMP END CRITICAL

The MASTER directive specifies a region that is to be executed only by the master thread of the team

!$OMP MASTER

block

!$OMP END MASTER

The BARRIER directive synchronizes all threads in the team

!$OMP BARRIER

62/ 77 OpenMP: synchronization constructs

The ATOMIC directive specifies that a specific memory location must be updated atomically, rather than letting multiple threads attempt to write to it. In essence, this directive provides a mini-CRITICAL section.

!$OMP ATOMIC
statement_expression

What is the difference with CRITICAL?

!$omp atomic
x = x + some_function()

With ATOMIC, the function some_function will be evaluated in parallel, since only the update of x is atomic.
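A short C illustration of the difference (my own sketch, not from the slides): with atomic, only the increment of the shared counter is protected, so the possibly expensive computation of the bin index still proceeds in parallel; with critical, the whole statement is serialized:

#include <stdio.h>

int compute_bin(int i) { return (i * i) % 16; }   /* stands for some expensive work */

int main(void)
{
  int hist_atomic[16] = {0}, hist_critical[16] = {0};

  #pragma omp parallel for
  for (int i = 0; i < 100000; i++) {
    int b = compute_bin(i);          /* evaluated concurrently by all threads */
    #pragma omp atomic
    hist_atomic[b]++;                /* only this update is atomic            */
  }

  #pragma omp parallel for
  for (int i = 0; i < 100000; i++) {
    #pragma omp critical
    hist_critical[compute_bin(i)]++; /* the whole statement runs one thread at a time */
  }

  printf("%d %d\n", hist_atomic[0], hist_critical[0]);
  return 0;
}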

63/ 77 OpenMP: worksharing constructs

How to do reductions with OpenMP?

sum = 0
do i=1,n
  sum = sum+a(i)
end do

Here is a wrong way of doing it:

sum = 0
!$omp parallel do shared(sum)
do i=1,n
  sum = sum+a(i)
end do


What is wrong? Concurrent accesses to sum have to be synchronized, otherwise we will end up with a WAW conflict!

65/ 77 OpenMP: worksharing constructs

We could use the CRITICAL construct:

sum = 0
!$omp parallel do shared(sum)
do i=1,n
  !$omp critical
  sum = sum+a(i)
  !$omp end critical
end do

but there’s a more intelligent way

sum = 0
!$omp parallel do reduction(+:sum)
do i=1,n
  sum = sum+a(i)
end do

The reduction clause specifies an operator and one or more list items. For each list item, a private copy is created in each implicit task, and is initialized appropriately for the operator. After the end of the region, the original list item is updated with the values of the private copies using the specified operator.
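The same idea in C, with two reduction operators at once (my own sketch; the max operator requires OpenMP 3.1 or later): each thread works on private copies of sum and maxval, which are combined with + and max at the end of the loop:

#include <stdio.h>

int main(void)
{
  double a[1000], sum = 0.0, maxval = 0.0;

  for (int i = 0; i < 1000; i++)
    a[i] = (i % 7) * 0.5;                 /* some data */

  #pragma omp parallel for reduction(+:sum) reduction(max:maxval)
  for (int i = 0; i < 1000; i++) {
    sum += a[i];
    if (a[i] > maxval) maxval = a[i];
  }

  printf("sum = %f, max = %f\n", sum, maxval);
  return 0;
}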

66/ 77 OpenMP: the task construct

The TASK construct defines an explicit task

!$OMP TASK [clause ...]
      IF (scalar-logical-expression)
      UNTIED
      DEFAULT (PRIVATE | SHARED | NONE)
      PRIVATE (list)
      FIRSTPRIVATE (list)
      SHARED (list)

block

!$OMP END TASK

When a thread encounters a TASK construct, a task is generated (not executed!!!) from the code for the associated structured block. The encountering thread may immediately execute the task, or defer its execution. In the latter case, any thread in the team may be assigned the task.
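A classic C illustration of this behaviour (my own sketch, not from the slides): each recursive call becomes a task that any thread in the team may pick up, and taskwait ensures the two children have completed before their results are combined (a real code would add a cutoff to avoid creating tiny tasks):

#include <stdio.h>

long fib(int n)
{
  long x, y;
  if (n < 2) return n;

  #pragma omp task shared(x) firstprivate(n)
  x = fib(n - 1);                 /* generated here, possibly executed later by another thread */

  #pragma omp task shared(y) firstprivate(n)
  y = fib(n - 2);

  #pragma omp taskwait            /* wait for the two child tasks */
  return x + y;
}

int main(void)
{
  long r;

  #pragma omp parallel
  #pragma omp single              /* one thread generates the initial tasks */
  r = fib(30);

  printf("fib(30) = %ld\n", r);
  return 0;
}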

67/ 77 OpenMP: the task construct

But, then, when are tasks executed? Execution of a task may be assigned to a thread whenever it reaches a task scheduling point:

- the point immediately following the generation of an explicit task
- after the last instruction of a task region
- in taskwait regions
- in implicit and explicit barrier regions

At a task scheduling point a thread can:

- begin execution of a tied or untied task
- resume a suspended task region that is tied to it
- resume execution of a suspended, untied task

68/ 77 OpenMP: the task construct

All the clauses in the TASK construct have the same meaning as for the other constructs except for:

- IF: when the IF clause expression evaluates to false, the encountering thread must suspend the current task region and begin execution of the generated task immediately; the suspended task region may not be resumed until the generated task is completed
- UNTIED: by default a task is tied. This means that, if the task is suspended, its execution may only be resumed by the thread that started it. If, instead, the UNTIED clause is present, any thread can resume its execution

69/ 77 OpenMP: the task construct

Example of the TASK construct:

program example_task

  integer :: i, n
  n = 10

  !$omp parallel
  !$omp master
  do i=1, n
    !$omp task
    call tsub(i)
    !$omp end task
  end do
  !$omp end master
  !$omp end parallel

  stop
end program example_task

subroutine tsub(i)
  integer :: i
  integer :: iam, nt, omp_get_num_threads, &
       &     omp_get_thread_num

  iam = omp_get_thread_num()
  nt = omp_get_num_threads()

  write(*,'("iam:",i2," nt:",i2," i:",i4)') iam,nt,i

  return
end subroutine tsub

result:

iam: 3 nt: 4 i: 3
iam: 2 nt: 4 i: 2
iam: 0 nt: 4 i: 4
iam: 1 nt: 4 i: 1
iam: 3 nt: 4 i: 5
iam: 0 nt: 4 i: 7
iam: 2 nt: 4 i: 6
iam: 1 nt: 4 i: 8
iam: 3 nt: 4 i: 9
iam: 0 nt: 4 i: 10

70/ 77 OpenMP: the task construct

recursive subroutine traverse ( p )
  type node
    type(node), pointer :: left, right
  end type node
  type(node) :: p

  if (associated(p%left)) then
    !$omp task   ! p is firstprivate by default
    call traverse(p%left)
    !$omp end task
  end if
  if (associated(p%right)) then
    !$omp task   ! p is firstprivate by default
    call traverse(p%right)
    !$omp end task
  end if
  call process ( p )
end subroutine traverse

Although the sequential code will traverse the tree in postorder, this is not true for the parallel execution since no synchronizations are performed.

71/ 77 OpenMP: the task construct

recursive subroutine traverse ( p )
  type node
    type(node), pointer :: left, right
  end type node
  type(node) :: p

  if (associated(p%left)) then
    !$omp task   ! p is firstprivate by default
    call traverse(p%left)
    !$omp end task
  end if
  if (associated(p%right)) then
    !$omp task   ! p is firstprivate by default
    call traverse(p%right)
    !$omp end task
  end if
  !$omp taskwait
  call process ( p )
end subroutine traverse

The TASKWAIT construct is a synchronization point which forces a thread to wait until the execution of its child tasks has terminated.

72/ 77 Homework

Write a parallel version of the following function using OpenMP tasks:

function foo()
  integer :: foo
  integer :: a, b, c, x, y

  a = f_a()
  b = f_b()
  c = f_c()
  x = f1(b, c)
  y = f2(a, x)

  foo = y
end function foo


74/ 77 Homework

A possible solution:

!$omp parallel
!$omp single

!$omp task
a = f_a()
!$omp end task

!$omp task if(.false.)
b = f_b()
c = f_c()
!$omp end task

!$omp task
x = f1(b, c)
!$omp end task

!$omp taskwait

!$omp task
y = f2(a, x)
!$omp end task

!$omp end single
!$omp end parallel

75/ 77 Hybrid parallelism

How to exploit parallelism in a cluster of SMPs/Multicores? There are two options:

- Use MPI all over: MPI works on distributed memory systems as well as on shared memory
- Use an MPI/OpenMP hybrid approach: define one MPI task for each node and one OpenMP thread for each core in the node

76/ 77 Hybrid parallelism

program hybrid

use mpi

  integer :: mpi_id, ierr, mpi_nt
  integer :: omp_id, omp_nt, &
       &     omp_get_num_threads, &
       &     omp_get_thread_num

  call mpi_init(ierr)
  call mpi_comm_rank(mpi_comm_world, mpi_id, ierr)
  call mpi_comm_size(mpi_comm_world, mpi_nt, ierr)

  !$omp parallel
  omp_id = omp_get_thread_num()
  omp_nt = omp_get_num_threads()

  write(*,'("Thread ",i1,"(",i1,") within MPI task ",i1,"(",i1,")")') &
       & omp_id,omp_nt,mpi_id,mpi_nt

  !$omp end parallel

  call mpi_finalize(ierr)

end program hybrid

result:

Thread 0(2) within MPI task 0(2)
Thread 0(2) within MPI task 1(2)
Thread 1(2) within MPI task 1(2)
Thread 1(2) within MPI task 0(2)
