Multi-core CPU Computing: Straightforward with OpenMP
Roel Jordans

Parallelization
Need more speedup, and using ILP and/or vectorization wasn't enough? Go multi-core!

Multi-core
Started with hyper-threading, then moved on to true multi-core. How do we utilize these cores? With threads.

Dividing the program over multiple cores requires operating system support (pthreads, fork, ...). OpenMP abstracts this away, using the fork-join programming model.
[Figure: fork-join model. The initial (master) thread forks a team of collaborating worker threads and later joins them; each thread runs on its own CPU, and all CPUs share one memory.]

Fork-join example
Speeding up parts of the application with parallelism; we use OpenMP to implement these operations.

What is OpenMP?
An API for shared-memory parallel programming, in the form of:
- compiler directives: #pragma omp parallel
- library functions: omp_get_num_threads()
- environment variables: OMP_NUM_THREADS=4
It requires little additional parallelization effort for development, maintenance, etc.
Supported by mainstream compilers, for C/C++ and Fortran.

A simple example
The saxpy operation: y = a*x + y
const int n = 10000;
float x[n], y[n], a;
int i;

for (i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];

Adding an OpenMP directive parallelizes the loop:

const int n = 10000;
float x[n], y[n], a;
int i;

#pragma omp parallel for    /* OpenMP directive */
for (i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];

This creates a team of threads. The number of threads can also be specified explicitly:

#pragma omp parallel num_threads(3)
{
    #pragma omp for
    for (i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

Compiling OpenMP
Compile-time: C/C++ front-end with OpenMP (clang), then back-end with OpenMP (llvm), producing a.out. Run-time: a.out runs with the OpenMP runtime library. Three essential parts: front-end, back-end, OpenMP RTL library. Two approaches: early / late outlining.

Early / Late Outlining
Parallel regions are put into separate routines, to be executed in separate threads. This can be done either in the front-end or in the back-end:

#pragma omp parallel for
for (i = 0; i < N; i++)
    ...

becomes something like

omp_parallel_for(0, N, N/omp_get_num_threads(), forb)

Late outlining was viewed as too intrusive by the LLVM architects; early outlining (in the clang front-end) has been implemented.

Tiny example

void vzero(float *a, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = 0;
}

$ clang -cc1 -ast-dump -fopenmp vzero.c
`-FunctionDecl 0x969ef90 ...

But some compilers don't detect the error (the compilation succeeds silently):

$ gcc-4.8 -fopenmp loop-index.c -c
$

Nested loops

#pragma omp parallel for
for (j = 0; j < n; j++)
    for (i = 0; i < n; i++)
        ...

We want i and j to be private:

#pragma omp parallel for private(i)

Or change the default, so that the sharing of every variable must be specified explicitly:

#pragma omp parallel for default(none) ...

A more complex example
Compute π:

    ∫₀¹ 4/(1 + x²) dx = 4·(atan(1) − atan(0)) = π
    π ≈ Σ_{i=0}^{N−1} 4/(1 + x_i²) · Δx

Processing time, single thread: 953.144 ms

Converting to SPMD
The total workload (the number of steps, 100000000) is divided over threads 1 to 4:

#define numthreads 4

Processing time with 4 threads: 4412 ms? Problem!
Processing time, single thread: 953 ms.

False sharing
Each thread has its own partial_sum[id]. Defined as an array, the partial sums are in consecutive memory locations, so they can share a cache line, and every update invalidates that line for the other cores.

Remove false sharing
For example, accumulate in a thread-local variable instead of the shared array. Processing time with 4 threads: 253 ms (single thread: 953 ms). A compiler directive can mark the final summation as a critical region (#pragma omp critical).

An easier way
The reduction clause:

#pragma omp parallel for reduction(+:sum)

Processing time with 4 threads: 246 ms (single thread: 953 ms).

There is more
Compiling with OpenMP is very simple: for clang and gcc, add the compiler flag -fopenmp and optionally #include <omp.h>.

New additions to OpenMP
OK, that's only OpenMP 2 so far (2002), but there are newer versions:
- OpenMP 3 (2008): tasks. What if SPMD isn't working for your program?
- OpenMP 4.0 (2013): more task extensions, SIMD extensions, atomics, and co-processor / accelerator offloading (next lecture).

OpenMP 3: Tasks
Fork-join on parallel loops is not always useful, e.g. for producer/consumer patterns. The initial design was very simple; the idea was (is) to augment with tasking when needed. Tasks can be nested, if you feel up to it...

Tasking concept
Who does what, when? The developer uses a pragma to specify where the tasks are. The OpenMP runtime system creates new tasks when they are encountered; the moment of execution is up to the runtime system and can be either immediate or delayed. Completion of tasks can be enforced through task synchronization.

The tasking construct
Task synchronization explicitly waits for child tasks.

Example: linked list
Hard to do before tasking:
1. First count the number of iterations
2. Translate the while into a for

OpenMP 4: SIMD annotations
Handle data simultaneously: #pragma omp simd forces vectorization. No cost function is checked! No legality is checked!
Example 1
Tell the compiler that a loop should be vectorized:

#pragma omp simd
for (i = 0; i < n; i++)
    ...

Tell the compiler that a loop should be vectorized with vector width 8:

#pragma omp simd safelen(8)
for (i = 0; i < n; i++)
    ...

Tell the compiler that two loops can be flattened:

#pragma omp simd collapse(2)
for (j = 0; j < n; j++)
    for (i = 0; i < n; i++)
        ...

Atomics
The next memory access will be executed atomically:

#pragma omp parallel for
for (long int i = 2; i <= max/2; i++)
    for (long int scale = 2; scale*i < max; scale++) {
        #pragma omp atomic update
        factor_ct[scale*i]++;
    }

Useful for things like histograms.

And there's lots more
Excellent examples can be found with the official documentation:
http://openmp.org/mp-documents/OpenMP4.0.0.Examples.pdf
Try it out yourself with assignment 4a!