Multi-core CPU Computing: Straightforward with OpenMP
Roel Jordans

Parallelization
Need more speedup, and using ILP and/or vectorization wasn't enough? Go multi-core!

Multi-core
Started with hyper-threading, then moved on to true multi-core. How do we utilize these cores? With threads.

Dividing the program over multiple cores requires operating system support (pthreads, fork, ...). OpenMP abstracts this away, using the fork-join programming model.
[Figure: fork-join model. The initial (master) thread forks a team of collaborating worker threads and later joins them; each thread runs on its own CPU, and all CPUs share one memory.]

Fork-join example
Speeding up parts of the application with parallelism; we use OpenMP to implement these operations.

What is OpenMP?
An API for shared-memory parallel programming, in the form of:
- compiler directives: #pragma omp parallel
- library functions: omp_get_num_threads()
- environment variables: OMP_NUM_THREADS=4
It requires little additional parallelization effort for development, maintenance, etc.
Supported by mainstream compilers, for C/C++ and Fortran.

A simple example
The saxpy operation: y = a*x + y
const int n = 10000;
float x[n], y[n], a;
int i;

for (i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];

Adding an OpenMP directive parallelizes the loop:

const int n = 10000;
float x[n], y[n], a;
int i;

#pragma omp parallel for    /* OpenMP directive */
for (i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];

This creates a team of threads. The number of threads can also be specified explicitly:

#pragma omp parallel num_threads(3)
{
    #pragma omp for
    for (i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

Compiling OpenMP
Compile-time: C/C++ front-end with OpenMP (clang), then back-end with OpenMP (llvm), producing a.out. Run-time: a.out runs with the OpenMP runtime library. Three essential parts: front-end, back-end, OpenMP RTL library. Two approaches: early / late outlining.

Early / Late Outlining
Parallel regions are put into separate routines, to be executed in separate threads. This can be done either in the front-end or in the back-end:

#pragma omp parallel for
for (i = 0; i < N; i++)
    ...

becomes something like

omp_parallel_for(0, N, N/omp_get_num_threads(), forb)

Late outlining was viewed as too intrusive by the LLVM architects; early outlining (in the clang front-end) has been implemented.

Tiny example

void vzero(float *a, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = 0;
}

$ clang -cc1 -ast-dump -fopenmp vzero.c
`-FunctionDecl 0x969ef90 ...

But some compilers don't detect the error (the compilation succeeds silently):

$ gcc-4.8 -fopenmp loop-index.c -c
$

Nested loops

#pragma omp parallel for
for (j = 0; j < n; j++)
    for (i = 0; i < n; i++)
        ...

We want i and j to be private:

#pragma omp parallel for private(i)

Or change the default, so that the sharing of every variable must be specified explicitly:

#pragma omp parallel for default(none) ...

A more complex example
Compute π:

    ∫₀¹ 4/(1 + x²) dx = 4·(atan(1) − atan(0)) = π
    π ≈ Σ_{i=0}^{N−1} 4/(1 + x_i²) · Δx

Processing time, single thread: 953.144 ms

Converting to SPMD
The total workload (the number of steps, 100000000) is divided over threads 1 to 4:

#define numthreads 4

Processing time with 4 threads: 4412 ms? Problem!
Processing time, single thread: 953 ms.

False sharing
Each thread has its own partial_sum[id]. Defined as an array, the partial sums are in consecutive memory locations, so they can share a cache line, and every update invalidates that line for the other cores.

Remove false sharing
For example, accumulate in a thread-local variable instead of the shared array. Processing time with 4 threads: 253 ms (single thread: 953 ms). A compiler directive can mark the final summation as a critical region (#pragma omp critical).

An easier way
The reduction clause:

#pragma omp parallel for reduction(+:sum)

Processing time with 4 threads: 246 ms (single thread: 953 ms).

There is more
Compiling with OpenMP is very simple: for clang and gcc, add the compiler flag -fopenmp and optionally #include <omp.h>.

New additions to OpenMP
OK, that's only OpenMP 2 so far (2002), but there are newer versions:
- OpenMP 3 (2008): tasks. What if SPMD isn't working for your program?
- OpenMP 4.0 (2013): more task extensions, SIMD extensions, atomics, and co-processor / accelerator offloading (next lecture).

OpenMP 3: Tasks
Fork-join on parallel loops is not always useful, e.g. for producer/consumer patterns. The initial design was very simple; the idea was (is) to augment with tasking when needed. Tasks can be nested, if you feel up to it...

Tasking concept
Who does what, when? The developer uses a pragma to specify where the tasks are. The OpenMP runtime system creates new tasks when they are encountered; the moment of execution is up to the runtime system and can be either immediate or delayed. Completion of tasks can be enforced through task synchronization.

The tasking construct
Task synchronization explicitly waits for child tasks.

Example: linked list
Hard to do before tasking:
1. First count the number of iterations
2. Translate the while into a for

OpenMP 4: SIMD annotations
Handle data simultaneously: #pragma omp simd forces vectorization. No cost function is checked! No legality is checked!
Example 1
Tell the compiler that a loop should be vectorized:

#pragma omp simd
for (i = 0; i < n; i++)
    ...

Tell the compiler that a loop should be vectorized with vector width 8:

#pragma omp simd safelen(8)
for (i = 0; i < n; i++)
    ...

Tell the compiler that two loops can be flattened:

#pragma omp simd collapse(2)
for (j = 0; j < n; j++)
    for (i = 0; i < n; i++)
        ...

Atomics
The next memory access will be executed atomically:

#pragma omp parallel for
for (long int i = 2; i <= max/2; i++)
    for (long int scale = 2; scale*i < max; scale++) {
        #pragma omp atomic update
        factor_ct[scale*i]++;
    }

Useful for things like histograms.

And there's lots more
Excellent examples can be found with the official documentation:
http://openmp.org/mp-documents/OpenMP4.0.0.Examples.pdf
Try it out yourself with assignment 4a!