Multi-core CPU Computing: Straightforward with OpenMP

Roel Jordans

Parallelization

 Need more performance, and using ILP and/or vectorization wasn't enough? Go multi-core!

Multi-core

 Started with hyper-threading
 Moved on to multi-core
 Utilize these cores with threads?

 Divide the program over multiple cores
 Requires support!
 Pthreads, fork, …
 Abstracted away using OpenMP

Fork-join programming model

[Figure: the fork-join model — an initial (master) thread forks a team of collaborating worker threads, which later join back into the original (master) thread; each thread runs on a CPU, and all CPUs share one memory]

Fork-join example

 Speeding up parts of the application with parallelism
 We use OpenMP to implement these operations

What is OpenMP?

 API for shared-memory parallel programming
 In the form of:
 directives: #pragma omp parallel
 functions: omp_get_num_threads()
 environment variables: OMP_NUM_THREADS=4

 No separate parallel code base for development, maintenance, etc.

 Supported by mainstream C/C++ compilers

A simple example

 The saxpy operation: y = a·x + y

const int n = 10000;
float x[n], y[n], a;
int i;

for (i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];

const int n = 10000;
float x[n], y[n], a;
int i;

#pragma omp parallel for   /* OpenMP: distribute the iterations over a team of threads */
for (i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];

const int n = 10000;
float x[n], y[n], a;
int i;

#pragma omp parallel num_threads(3)   /* creates a team of threads; num_threads(3)
                                          explicitly specifies the number of threads */
{
    #pragma omp for
    for (i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

[Figure: compile-time / run-time flow — C/C++ source with OpenMP passes through a front-end and back-end with OpenMP support into a.out, which links against the OpenMP runtime library at run-time]

 Three essential parts
 Front-end
 Back-end
 Library (the OpenMP RTL)
 Two approaches
 Early / late outlining

Early / Late Outlining

 Parallel regions are put into separate ("outlined") routines
 To be executed in separate threads
 This can be done either in the front-end or in the back-end

#pragma omp parallel for
for (i = 0; i < N; i++)
    …

becomes a call into the runtime:

omp_parallel_for(0, N, N/omp_get_num_threads(), forb)
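A minimal self-contained sketch of this transformation; omp_parallel_for and forb follow the slide's illustration and are not the real libgomp/libomp entry points:

#include <omp.h>

#define N 10000
static float x[N], y[N], a;

/* Outlined loop body: one invocation handles the chunk [lo, hi) */
static void forb(int lo, int hi) {
    for (int i = lo; i < hi; i++)
        y[i] = a * x[i] + y[i];
}

/* Mock runtime entry point, only here to make the sketch compile;
   a real runtime would hand the chunks to a team of threads */
static void omp_parallel_for(int lo, int hi, int chunk,
                             void (*body)(int, int)) {
    for (int c = lo; c < hi; c += chunk)
        body(c, (c + chunk < hi) ? c + chunk : hi);
}

int main(void) {
    omp_parallel_for(0, N, N / omp_get_num_threads(), forb);
    return 0;
}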

 Late outlining was viewed as too intrusive by the LLVM architects

 Early outlining (in the clang front-end) has been implemented

Tiny example

void vzero(float *a, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = 0;
}

$ clang -cc1 -ast-dump -fopenmp vzero.c

`-FunctionDecl 0x969ef90 line:1:6 vzero 'void (float *, int)'
  |-ParmVarDecl 0x9651350 col:19 used a 'float *'
  |-ParmVarDecl 0x96513c0 col:26 used n 'int'
  `-CompoundStmt 0x96a1018
    `-OMPParallelForDirective 0x96a0f50
      |-CapturedStmt 0x969f790
      | |-DeclRefExpr 0x969f4d0 'int' lvalue ParmVar 0x96513c0 'n' 'int'
      | `-DeclRefExpr 0x969f650 'float *' lvalue ParmVar 0x9651350 'a' 'float *'
      |-DeclRefExpr 0x96a0b70 'int' lvalue Var 0x96a0b10 '.omp.iv' 'int'
      |-DeclRefExpr 0x969fae0 <> 'int' lvalue Var 0x969fa80 '.omp.last.iteration' 'int'

Beware of the defaults

 Loop index i is private (the OpenMP default)
 Each thread maintains its own i value and range
 The private variable i becomes undefined after the parallel for
 Everything else is shared (the OpenMP default)
 All threads update y, but at different memory locations
 a, n, and x are read-only (so it is OK to share them)

const int n = 10000;
float x[n], y[n], a;
int i;

#pragma omp parallel for   /* the OpenMP defaults can be changed */
for (i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];
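A minimal sketch of the kind of loop-index mistake meant here, assuming loop-index.c contains something along these lines:

#include <stdio.h>

int main(void) {
    enum { n = 10000 };
    static float x[n], y[n];
    float a = 2.0f;
    int i;

    #pragma omp parallel for
    for (i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];

    /* Bug: i was private inside the construct, so its value is
       undefined here -- a compiler could warn about this read */
    printf("last index: %d\n", i);
    return 0;
}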

 But some compilers don’t detect the error:

$ gcc-4.8 -fopenmp loop-index.c -c
$

Nested loops

#pragma omp parallel for
for (j = 0; j < n; j++)
    for (i = 0; i < m; i++)
        …

Only j, the index of the parallelized loop, is private by default; the inner index i is shared, so the threads race on it.

 We want i and j to be private:
#pragma omp parallel for private(i)
 Or remove the default entirely, so that every variable's sharing attribute must be specified explicitly:
#pragma omp parallel for default(none) …

A more complex example

 Compute π:

π = ∫_0^1 4/(1+x²) dx = 4·(atan(1) − atan(0)) ≈ Σ_{i=0}^{N−1} 4/(1+x_i²)·Δx

with Δx = 1/N and x_i the midpoint of interval i.
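A sketch of the serial version being timed below, assuming the classic midpoint-rule integration code (NUM_STEPS and the variable names are illustrative):

#include <stdio.h>

#define NUM_STEPS 100000000

int main(void) {
    double step = 1.0 / (double)NUM_STEPS;
    double sum = 0.0;

    for (long i = 0; i < NUM_STEPS; i++) {
        double x = (i + 0.5) * step;   /* midpoint of interval i */
        sum += 4.0 / (1.0 + x * x);
    }
    printf("pi = %.15f\n", step * sum);
    return 0;
}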

Processing time: 953.144 ms

Converting to SPMD

Total workload: 100000000 steps
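A sketch of the SPMD conversion, continuing the serial sketch above: each thread accumulates into its own slot of a shared partial_sum array (the name follows the false-sharing discussion below):

#include <omp.h>
#include <stdio.h>

#define NUM_STEPS 100000000
#define NUMTHREADS 4

int main(void) {
    double step = 1.0 / (double)NUM_STEPS;
    double partial_sum[NUMTHREADS];   /* consecutive doubles: they can
                                         end up in one cache line */
    double pi = 0.0;

    omp_set_num_threads(NUMTHREADS);
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        partial_sum[id] = 0.0;
        /* cyclic distribution: thread id takes steps id, id+T, id+2T, … */
        for (long i = id; i < NUM_STEPS; i += NUMTHREADS) {
            double x = (i + 0.5) * step;
            partial_sum[id] += 4.0 / (1.0 + x * x);
        }
    }
    for (int t = 0; t < NUMTHREADS; t++)
        pi += partial_sum[t] * step;
    printf("pi = %.15f\n", pi);
    return 0;
}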

[Figure: the workload divided over thread 1 … thread 4; #define NUMTHREADS 4]

Processing time with 4 threads: 4412 ms? Problem!

Single-thread processing time: 953 ms

False Sharing

 Each thread has its own partial_sum[id]
 Defined as an array, the partial sums are in consecutive memory locations, and these can share a cache line
 Every write by one thread then invalidates that line in the other threads' caches, even though no value is actually shared

Remove false sharing
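A sketch of the fix, replacing the parallel region of the SPMD sketch above: each thread accumulates into a local scalar, so nothing shares a cache line, and only the final merge is protected (by the critical region mentioned below):

#pragma omp parallel
{
    int id = omp_get_thread_num();
    double sum = 0.0;   /* thread-local: lives on this thread's stack */
    for (long i = id; i < NUM_STEPS; i += NUMTHREADS) {
        double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    #pragma omp critical   /* serialize only the final merge */
    pi += sum * step;
}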

Processing time with 4 threads: 253 ms

Single-thread processing time: 953 ms

The compiler directive #pragma omp critical indicates a critical region: only one thread executes it at a time.

An easier way

Reduction directive
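A sketch using the reduction clause, which gives each thread a private copy of sum and combines the copies at the end (continuing the π example above):

double sum = 0.0;

#pragma omp parallel for reduction(+:sum)
for (long i = 0; i < NUM_STEPS; i++) {
    double x = (i + 0.5) * step;
    sum += 4.0 / (1.0 + x * x);
}
pi = step * sum;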

Processing time with 4 threads: 246 ms

Single-thread processing time: 953 ms

There is more

 Compiling with OpenMP is very simple
 Clang and GCC: add the compiler flag -fopenmp
 Optionally add #include <omp.h> (needed for the runtime functions)
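For example (saxpy.c is an illustrative file name):

$ clang -fopenmp saxpy.c -o saxpy
$ OMP_NUM_THREADS=4 ./saxpy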

New additions to OpenMP

 OK, that was only OpenMP 2 so far (2002)
 But there are newer versions
 OpenMP 3 (2008): tasks
 What if SPMD isn't working for your program?
 OpenMP 4.0 (2013):
 More task extensions
 SIMD extensions
 Atomics
 Co-processor / accelerator offloading (next lecture)

OpenMP 3: Tasks

 Fork-join on parallel loops is not always useful
 e.g. producer/consumer
 The initial design was very simple
 The idea was (and is) to augment it with tasking when needed
 Tasks can be nested
 If you feel up to it...

Tasking concept

Who does what when?

 Developer
 Uses a pragma to specify where the tasks are

 OpenMP  Creates new tasks when they are encountered  Moment of execution is up to the runtime system  Can either be immediately or delayed  Completion of tasks can be enforced through task synchronization The tasking construct Task synchronization

Task synchronization

 #pragma omp taskwait explicitly waits for the child tasks
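For example, continuing the sketch above:

#pragma omp task
do_left();
#pragma omp task
do_right();

#pragma omp taskwait   /* block until both child tasks have finished */
puts("both tasks done");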

Example: linked list

Hard to do before tasking:

1: First count the number of iterations
2: Then translate the while loop into a for loop

Example: linked list
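With tasking, the while-style traversal can stay as it is. A minimal sketch (node_t, head, and process are illustrative names):

typedef struct node {
    struct node *next;
    /* … payload … */
} node_t;

void process(node_t *p);          /* hypothetical per-node work */

void traverse(node_t *head) {
    #pragma omp parallel
    {
        #pragma omp single        /* one thread walks the list… */
        for (node_t *p = head; p != NULL; p = p->next) {
            #pragma omp task firstprivate(p)
            process(p);           /* …and every node becomes a task */
        }
    }
}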

OpenMP 4: SIMD annotations

 Handle multiple data elements simultaneously: #pragma omp simd
 Forces vectorization
 No cost function is checked!
 No legality is checked!

Example 1

 Tell the compiler that a loop should be vectorized

#pragma omp simd
for (i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];   /* loop body as in the earlier saxpy example */

 Tell the compiler that the loop may be vectorized with a vector length of at most 8 (it is safe to run up to 8 iterations concurrently)

#pragma omp simd safelen(8)
for (i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];   /* safelen(8): no loop-carried dependence
                                 spans fewer than 8 iterations */

 Tell the compiler that two loops can be flattened into one iteration space:

#pragma omp simd collapse(2)
for (j = 0; j < n; j++)
    for (i = 0; i < m; i++)
        …

 The next memory access will be executed atomically

#pragma omp parallel for
for (long int i = 2; i <= max/2; i++)
    for (long int scale = 2; scale*i < max; scale++) {
        #pragma omp atomic update
        factor_ct[scale*i]++;
    }

 Useful for things like histograms

And there's lots more

 Excellent examples can be found in the official documentation:
http://openmp.org/mp-documents/OpenMP4.0.0.Examples.pdf

 Try it out yourself with assignment 4a!