<<

Lecture 04-06: Programming with OpenMP

Concurrent and Multicore Programming, CSE536

Department of Computer Science and Engineering, Yonghong Yan, [email protected], www.secs.oakland.edu/~yan

1 Topics (Part 1)

• Introduction
• Programming on shared memory system (Chapter 7)
  – OpenMP
• Principles of parallel algorithm design (Chapter 3)
• Programming on shared memory system (Chapter 7)
  – Cilk/Cilkplus (?)
  – PThread, mutual exclusion, locks, synchronizations
• Analysis of parallel program executions (Chapter 5)
  – Performance Metrics for Parallel Systems
    • Execution Time, Overhead, Speedup, Efficiency, Cost
  – Scalability of Parallel Systems
  – Use of performance tools

2 Outline

• OpenMP Introduction
• Parallel Programming with OpenMP
  – OpenMP parallel region and worksharing
  – OpenMP data environment, tasking and synchronization
• OpenMP Performance and Best Practices
• More Case Studies and Examples
• Reference Materials

3 What is OpenMP

• Standard API to write shared memory parallel applications in C, C++, and Fortran
  – Compiler directives, Runtime routines, Environment variables
• OpenMP Architecture Review Board (ARB)
  – Maintains the OpenMP specification
  – Permanent members
    • AMD, Cray, Fujitsu, HP, IBM, Intel, NEC, PGI, Oracle, Microsoft, Texas Instruments, NVIDIA, Convey
  – Auxiliary members
    • ANL, ASC/LLNL, cOMPunity, EPCC, LANL, NASA, TACC, RWTH Aachen University, UH
  – http://www.openmp.org
• Latest version 4.5 released Nov 2015

4 My role with OpenMP

5 “Hello World” Example/1

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    printf("Hello World\n");
    return(0);
}

6 “Hello World” - An Example/2

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    #pragma omp parallel
    {
        printf("Hello World\n");
    } // End of parallel region
    return(0);
}

7 “Hello World” - An Example/3

$ gcc -fopenmp hello.c

$ export OMP_NUM_THREADS=2
$ ./a.out
Hello World
Hello World

$ export OMP_NUM_THREADS=4
$ ./a.out
Hello World
Hello World
Hello World
Hello World
$

(The hello.c source with the parallel region from the previous slide is shown alongside the output.)

8 OpenMP Components

Directives:
• Parallel region
• Worksharing constructs
• Tasking
• Offloading
• Affinity
• Error Handling
• SIMD
• Synchronization
• Data-sharing attributes

Runtime Environment:
• Number of threads
• Thread ID
• Dynamic thread adjustment
• Nested parallelism
• Schedule
• Active levels
• Thread limit
• Nesting level
• Ancestor thread
• Team size
• Locking
• Wallclock timer

Environment Variables:
• Number of threads
• Scheduling type
• Dynamic thread adjustment
• Nested parallelism
• Stacksize
• Idle threads
• Active levels
• Thread limit

9 “Hello World” - An Example/3

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char *argv[]) {
    #pragma omp parallel                             /* directive */
    {
        int thread_id = omp_get_thread_num();        /* runtime routine */
        int num_threads = omp_get_num_threads();     /* runtime routine */

        printf("Hello World from thread %d of %d\n", thread_id, num_threads);
    }
    return(0);
}

10 “Hello World” - An Example/4

Environment Variable

Environment variables are similar to program arguments: they change the configuration of the execution without recompiling the program.

NOTE: the order in which the threads print is not deterministic.

11 The Design Principle Behind

#pragma omp parallel
{
    int thread_id = omp_get_thread_num();
    int num_threads = omp_get_num_threads();

    printf("Hello World from thread %d of %d\n", thread_id, num_threads);
}

• Each printf is a task
• A parallel region claims a set of cores for computation
  – Cores are presented as multiple threads
• Each thread executes a single task
  – The task id is the same as the thread id: omp_get_thread_num()
  – The number of tasks is the same as the total number of threads: omp_get_num_threads()
• 1:1 mapping between tasks and threads
  – Every task/core does similar work in this simple example

12 OpenMP Parallel Computing Solution Stack

[Figure: OpenMP parallel computing solution stack. User layer: End User, Application. Prog. layer (OpenMP API): Directives, Compiler, OpenMP library, Environment variables. System layer: Runtime library, OS/system.]

13 OpenMP Syntax

• Most OpenMP constructs are compiler directives using pragmas.
  – For C and C++, the pragmas take the form: #pragma …
• pragma vs. language
  – A pragma is not part of the language and should not express program logic
  – It provides the compiler/preprocessor additional information on how to process the directive-annotated code

– Similar to #include, #define

14 OpenMP Syntax

• For C and C++, the pragmas take the form:
  #pragma omp construct [clause [clause]…]

• For Fortran, the directives take one of the forms:
  – Fixed form
    *$OMP construct [clause [clause]…]
    C$OMP construct [clause [clause]…]
  – Free form (but works for fixed form too)
    !$OMP construct [clause [clause]…]

• Include file and the OpenMP lib module
  #include <omp.h>
  use omp_lib

15 OpenMP Compiler

• OpenMP: thread programming at a "high level".
  – The user does not need to specify the details
    • Program decomposition, assignment of work to threads
    • Mapping tasks to hardware threads
• User makes strategic decisions
• Compiler figures out the details
  – Compiler flags enable OpenMP (e.g. -openmp, -xopenmp, -fopenmp, -mp)

16 OpenMP Memory Model

• OpenMP assumes a shared memory
• Threads communicate by sharing variables.
• Synchronization protects against data conflicts.
  – Synchronization is expensive.
• Change how data is accessed to minimize the need for synchronization.
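A minimal sketch (not from the slides) of what an unprotected shared-variable update looks like and one possible fix; the critical construct used here is introduced later in these slides:

#include <stdio.h>
#include <omp.h>

int main(void) {
    int counter = 0;

    /* Data race: several threads may read, increment, and write counter
       at the same time, so some increments can be lost. */
    #pragma omp parallel
    counter++;

    counter = 0;

    /* One fix: synchronize the update so only one thread modifies it at a time. */
    #pragma omp parallel
    {
        #pragma omp critical
        counter++;
    }
    printf("counter = %d (now equals the number of threads)\n", counter);
    return 0;
}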

17 OpenMP Fork-Join Execution Model

• The master thread spawns multiple worker threads as needed; together they form a team
• A parallel region is a block of code executed by all threads in a team simultaneously

[Figure: fork-join execution. The master thread forks a team of worker threads at a parallel region and joins them at its end; parallel regions can be nested.]

18 OpenMP Parallel Regions

• In C/C++: a block is a single statement or a group of statements between { }

#pragma omp parallel
{
    id = omp_get_thread_num();
    res[id] = lots_of_work(id);
}

#pragma omp parallel for
for (i = 0; i < N; i++) {
    res[i] = big_calc(i);
    A[i] = B[i] + res[i];
}

• In Fortran: a block is a single statement or a group of statements between directive/end-directive pairs.

C$OMP PARALLEL
10    wrk(id) = garbage(id)
      res(id) = wrk(id)**2
      if(.not.conv(res(id))) goto 10
C$OMP END PARALLEL

C$OMP PARALLEL DO
      do i=1,N
        res(i) = bigComp(i)
      end do
C$OMP END PARALLEL DO

19 Scope of OpenMP Region

A parallel region can span multiple source files.

foo.f (the lexical extent of the parallel region):

C$OMP PARALLEL
      call whoami
C$OMP END PARALLEL

bar.f (the dynamic extent of the parallel region includes the lexical extent):

      subroutine whoami
      external omp_get_thread_num
      integer iam, omp_get_thread_num
      iam = omp_get_thread_num()
C$OMP CRITICAL
      print*, 'Hello from ', iam
C$OMP END CRITICAL
      return
      end

Orphaned directives (such as the CRITICAL above) can appear outside a parallel construct.

20 SPMD Program Models

• SPMD (Single Program, Multiple Data) for parallel regions
  – All threads of the parallel region execute the same code
  – Each thread has a unique ID

• Use the thread ID to diverge the execution of the threads
  – Different threads can follow different paths through the same code

if(my_id == x) { } else { }

• SPMD is by far the most commonly used pattern for structuring parallel programs
  – MPI, OpenMP, CUDA, etc.

21 Modify the Hello World Program so …

• Only one thread prints the total number of threads
  – gcc -fopenmp hello.c -o hello

#pragma omp parallel
{
    int thread_id = omp_get_thread_num();
    int num_threads = omp_get_num_threads();

    if (thread_id == 0)
        printf("Hello World from thread %d of %d\n", thread_id, num_threads);
    else
        printf("Hello World from thread %d\n", thread_id);
}

• Only one thread reads the total number of threads and all threads print that info

int num_threads = 99999;
#pragma omp parallel
{
    int thread_id = omp_get_thread_num();

    if (thread_id == 0)
        num_threads = omp_get_num_threads();

    #pragma omp barrier

    printf("Hello World from thread %d of %d\n", thread_id, num_threads);
}

22 Barrier

#pragma omp barrier

23 OpenMP Master

• Denotes a structured block executed by the master thread
• The other threads just skip it – no synchronization is implied

#pragma omp parallel private (tmp)
{
    do_many_things_together();
    #pragma omp master
    { exchange_boundaries_by_master_only(); }
    #pragma omp barrier
    do_many_other_things_together();
}

24 OpenMP Single

• Denotes a block of code that is executed by only one thread.
  – Could be the master
• A barrier is implied at the end of the single block.

#pragma omp parallel private (tmp)
{
    do_many_things_together();
    #pragma omp single
    { exchange_boundaries_by_one(); }
    do_many_other_things_together();
}

25 Using omp master/single to modify the Hello World Program so …

• Only one thread prints the total number of threads

#pragma omp parallel
{
    int thread_id = omp_get_thread_num();
    int num_threads = omp_get_num_threads();

    printf("Hello World from thread %d of %d\n", thread_id, num_threads);
}

• Only one thread reads the total number of threads and all threads print that info (a sketch using omp single is given below)
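A possible solution sketch for the second variant (not the course's reference solution): omp single lets exactly one thread read the team size, and the implicit barrier at the end of single makes the value visible to every thread before it prints.

#include <stdio.h>
#include <omp.h>

int main(void) {
    int num_threads;                            /* shared by all threads */
    #pragma omp parallel
    {
        int thread_id = omp_get_thread_num();

        #pragma omp single
        num_threads = omp_get_num_threads();    /* exactly one thread writes it */
        /* implicit barrier at the end of single: all threads now see num_threads */

        printf("Hello World from thread %d of %d\n", thread_id, num_threads);
    }
    return 0;
}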

26 Distributing Work Based on Thread ID

cat /proc/cpuinfo

Sequential code:

for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }

OpenMP parallel region (work distributed manually by thread id):

#pragma omp parallel shared (a, b)
{
    int id, i, Nthrds, istart, iend;
    id = omp_get_thread_num();
    Nthrds = omp_get_num_threads();
    istart = id * N / Nthrds;
    iend = (id + 1) * N / Nthrds;
    for (i = istart; i < iend; i++) { a[i] = a[i] + b[i]; }
}

28 OpenMP Worksharing Constructs

• Divides the execution of the enclosed code region among the members of the team
• The "for" worksharing construct splits up loop iterations among the threads in a team
  – Each thread gets one or more "chunks" -> loop chunking

#pragma omp parallel
#pragma omp for
for (i = 0; i < N; i++) {
    work(i);
}

By default, there is a barrier at the end of the "omp for". Use the "nowait" clause to turn off the barrier:
#pragma omp for nowait
"nowait" is useful between two consecutive, independent omp for loops.

29 Worksharing Constructs

Sequential code:

for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }

OpenMP parallel region (manual worksharing):

#pragma omp parallel shared (a, b)
{
    int id, i, Nthrds, istart, iend;
    id = omp_get_thread_num();
    Nthrds = omp_get_num_threads();
    istart = id * N / Nthrds;
    iend = (id + 1) * N / Nthrds;
    for (i = istart; i < iend; i++) { a[i] = a[i] + b[i]; }
}

OpenMP parallel region with a worksharing for construct:

#pragma omp parallel for schedule(static) shared (a, b) private (i)
for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }

30 OpenMP schedule Clause

• schedule ( static | dynamic | guided [, chunk] )
• schedule ( auto | runtime )

static: Distribute iterations in blocks of size "chunk" over the threads in a round-robin fashion
dynamic: Fixed portions of work; the size is controlled by the value of chunk; when a thread finishes one portion, it starts on the next one
guided: Same dynamic behavior as "dynamic", but the size of the portions of work decreases exponentially
auto: The compiler (or runtime system) decides what is best to use; the choice could be implementation dependent
runtime: The iteration scheduling scheme is set at runtime through the environment variable OMP_SCHEDULE

(A small usage sketch follows below.)
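A minimal sketch of how these clauses are written; the loop body, array a, bound N, index i and function work() are illustrative assumptions, not from the slides:

/* Static: iterations handed out in chunks of 4, round-robin, decided up front. */
#pragma omp parallel for schedule(static, 4)
for (i = 0; i < N; i++)
    a[i] = work(i);

/* Dynamic: each thread grabs the next chunk of 4 when it finishes one;
   better when iteration costs vary, at the price of higher overhead. */
#pragma omp parallel for schedule(dynamic, 4)
for (i = 0; i < N; i++)
    a[i] = work(i);

/* Runtime: schedule chosen at run time, e.g. export OMP_SCHEDULE="guided,8" */
#pragma omp parallel for schedule(runtime)
for (i = 0; i < N; i++)
    a[i] = work(i);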

31 OpenMP Sections

• Worksharing construct
• Gives a different structured block to each thread

#pragma omp parallel
#pragma omp sections
{
    #pragma omp section
    x_calculation();
    #pragma omp section
    y_calculation();
    #pragma omp section
    z_calculation();
}

By default, there is a barrier at the end of the "omp sections". Use the "nowait" clause to turn off the barrier.

32 Loop Collapse

• Allows parallelization of perfectly nested loops without using nested parallelism
• The collapse clause on a for/do loop indicates how many loops should be collapsed

!$omp parallel do collapse(2) ...
      do i = il, iu, is
        do j = jl, ju, js
          do k = kl, ku, ks
            .....
          end do
        end do
      end do
!$omp end parallel do

33 Exercise: OpenMP Matrix Multiplication

• Parallel version
• Parallel for version
  – Experiment with different schedule policies and chunk sizes
  – Experiment with collapse(2)
• #pragma omp parallel for

#pragma omp parallel shared (a, b) private (i)
#pragma omp for schedule(static)
for (i = 0; i < N; i++) { ... }

#pragma omp parallel for schedule(static) private (i) num_threads(num_ths)
for (i = 0; i < N; i++) { ... }

(A possible full sketch is given below.)
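One possible shape of the exercise (a sketch with assumed array names a, b, c of size N by N, not the course's reference solution), combining parallel for, a schedule clause, and collapse(2):

#include <omp.h>

void matmul(int N, double a[N][N], double b[N][N], double c[N][N])
{
    int i, j, k;
    /* Collapse the i and j loops into one iteration space and distribute it. */
    #pragma omp parallel for collapse(2) schedule(static) private(k)
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            double sum = 0.0;                 /* private: declared inside the region */
            for (k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] = sum;
        }
    }
}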

34 Barrier

• Barrier: Each thread waits until all threads arrive.

#pragma omp parallel shared (A, B, C) private(id)
{
    id = omp_get_thread_num();
    A[id] = big_calc1(id);
    #pragma omp barrier
    #pragma omp for
    for (i = 0; i < N; i++) { C[i] = big_calc2(i, A); }
    /* implicit barrier at the end of a for worksharing construct */
}

35 OpenMP Data Environment

• Most variables are shared by default
• Global variables are SHARED among threads
  – Fortran: COMMON blocks, SAVE variables, MODULE variables
  – C: File scope variables, static
• But not everything is shared...
  – Stack variables in sub-programs called from parallel regions are PRIVATE
  – Automatic variables defined inside the parallel region are PRIVATE

36 OpenMP Data Environment

double a[size][size], b = 4;
#pragma omp parallel private (b)
{
    ....
}

shared data: a[size][size]
private data: each thread T0, T1, T2, T3 gets its own copy b' = ? (b becomes undefined inside the region)

37 OpenMP Data Environment

      program sort
      common /input/ A(10)
      integer index(10)
C$OMP PARALLEL
      call work (index)
C$OMP END PARALLEL
      print*, index(1)

      subroutine work (index)
      common /input/ A(10)
      integer index(*)
      real temp(10)
      integer count
      save count
      …

A, index and count are shared by all threads. temp is local to each thread.

38 Data Environment: Changing Storage Attributes

• Selectively change the storage attributes of variables in a construct using the following clauses
  – SHARED
  – PRIVATE
  – FIRSTPRIVATE
  – THREADPRIVATE
• The value of a private variable inside a parallel loop and the global value outside the loop can be exchanged with
  – FIRSTPRIVATE and LASTPRIVATE
• The default status can be modified with:
  – DEFAULT (PRIVATE | SHARED | NONE)

(A small combined sketch follows below.)
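A minimal sketch of how these clauses combine on one loop; the array and variable names are illustrative assumptions, not from the slides. default(none) forces every variable to be listed explicitly:

double a[100];
double last = 0.0;
int n = 100, offset = 5;

#pragma omp parallel for default(none) shared(n, a) firstprivate(offset) lastprivate(last)
for (int i = 0; i < n; i++) {
    a[i] = i + offset;   /* offset: each thread gets a private copy initialized to 5 */
    last  = a[i];        /* last: after the loop, holds the value from iteration n-1 */
}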

39 OpenMP Private Clause

• private(var) creates a local copy of var for each thread.
  – The value is uninitialized
  – The private copy is not storage-associated with the original
  – The original is undefined at the end

      IS = 0
C$OMP PARALLEL DO PRIVATE(IS)
      DO J=1,1000
        IS = IS + J
      END DO
C$OMP END PARALLEL DO
      print *, IS

40 OpenMP Private Clause

• private(var) creates a local copy of var for each thread.
  – The value is uninitialized
  – The private copy is not storage-associated with the original
  – The original is undefined at the end

      IS = 0
C$OMP PARALLEL DO PRIVATE(IS)
      DO J=1,1000
        IS = IS + J        ! ✗ IS was not initialized
      END DO
C$OMP END PARALLEL DO
      print *, IS          ! ✗ IS is undefined here

41 Firstprivate Clause

• firstprivate is a special case of private.
  – Initializes each private copy with the corresponding value from the master thread.

      IS = 0
C$OMP PARALLEL DO FIRSTPRIVATE(IS)
      DO 20 J=1,1000
        IS = IS + J
20    CONTINUE
C$OMP END PARALLEL DO
      print *, IS

42 Firstprivate Clause

• firstprivate is a special case of private.
  – Initializes each private copy with the corresponding value from the master thread.

      IS = 0
C$OMP PARALLEL DO FIRSTPRIVATE(IS)
      DO 20 J=1,1000
        IS = IS + J        ! ✔ each thread gets its own IS with an initial value of 0
20    CONTINUE
C$OMP END PARALLEL DO
      print *, IS          ! ✗ regardless of initialization, IS is undefined at this point

43 Lastprivate Clause

• lastprivate passes the value of a private variable from the (sequentially) last iteration to the variable of the master thread

      IS = 0
C$OMP PARALLEL DO FIRSTPRIVATE(IS)
C$OMP& LASTPRIVATE(IS)
      DO 20 J=1,1000
        IS = IS + J        ! ✔ each thread gets its own IS with an initial value of 0
20    CONTINUE
C$OMP END PARALLEL DO
      print *, IS          ! IS is defined as its value at the last iteration (i.e. for J=1000)

Is this code meaningful? (Not as a sum: the printed IS is only the partial sum computed by the thread that happened to execute the last iteration.)

44 OpenMP Reduction

• Here is the correct way to parallelize this code.

      IS = 0
C$OMP PARALLEL DO REDUCTION(+:IS)
      DO 20 J=1,1000
        IS = IS + J
20    CONTINUE
      print *, IS

Reduction does NOT imply firstprivate, so where does the initial 0 come from? Each private reduction copy is initialized with the identity value of the reduction operator (0 for +), as listed on the next slide.

45 Reduction operands/initial-values

• Associative operands can be used with reduction
• The initial values are the ones that make sense mathematically

Operand    Initial value
+          0
*          1
-          0
.AND.      all 1's (.TRUE.)
.OR.       all 0's (.FALSE.)
MAX        smallest representable number
MIN        largest representable number

46 Exercise: OpenMP Sum.c

• Two versions
  – Parallel for with a reduction clause
  – Parallel version, not using "omp for" or the "reduction" clause
(A possible sketch of both versions is given below.)
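A sketch of the two requested versions; the function and variable names are assumptions, not the course's Sum.c. The first uses a reduction clause; the second builds a per-thread partial sum inside a plain parallel region and combines the partials in a critical section.

#include <omp.h>

/* Version 1: parallel for with a reduction clause. */
double sum_reduction(const double *a, int n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Version 2: no "omp for", no reduction; each thread sums its own block. */
double sum_manual(const double *a, int n) {
    double sum = 0.0;
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        int nth = omp_get_num_threads();
        int start = id * n / nth;
        int end = (id + 1) * n / nth;
        double partial = 0.0;
        for (int i = start; i < end; i++)
            partial += a[i];
        #pragma omp critical
        sum += partial;        /* combine partial sums one thread at a time */
    }
    return sum;
}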

47 OpenMP Threadprivate

• Makes global data private to a thread, thus crossing the parallel region boundary
  – Fortran: COMMON blocks
  – C: File scope and static variables
• Different from making them PRIVATE
  – With PRIVATE, global variables are masked.
  – THREADPRIVATE preserves global scope within each thread
• Threadprivate variables can be initialized using COPYIN or by using DATA statements.
(A minimal C sketch follows below; the Fortran COPYIN example is on the next slide.)
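The slides show the Fortran COMMON-block form; this is a minimal C sketch of the same idea (variable name assumed, not from the slides) with a file-scope variable and copyin:

#include <stdio.h>
#include <omp.h>

int counter = 10;                      /* file-scope (global) variable */
#pragma omp threadprivate(counter)     /* each thread gets its own persistent copy */

int main(void) {
    #pragma omp parallel copyin(counter)   /* every copy starts at the master's value 10 */
    {
        counter += omp_get_thread_num();   /* each thread updates only its own copy */
    }

    /* Values persist into the next parallel region (assuming the team stays the same). */
    #pragma omp parallel
    printf("thread %d: counter = %d\n", omp_get_thread_num(), counter);
    return 0;
}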

48 Threadprivate/copyin

• You initialize threadprivate data using a copyin clause.

      parameter (N=1000)
      common/buf/A(N)
C$OMP THREADPRIVATE(/buf/)

C     Initialize the A array
      call init_data(N,A)

C$OMP PARALLEL COPYIN(A)
      … Now each thread sees the threadprivate array A initialized
      … to the global value set in the subroutine init_data()
C$OMP END PARALLEL
      ....
C$OMP PARALLEL
      ... Values of threadprivate variables are persistent across parallel regions
C$OMP END PARALLEL

49 OpenMP Synchronization

• High level synchronization:
  – critical section
  – atomic
  – barrier
  – ordered
• Low level synchronization:
  – flush
  – locks (both simple and nested)

50 Critical Section

• Only one thread at a time can enter a critical section.

C$OMP PARALLEL DO PRIVATE(B)
C$OMP& SHARED(RES)
      DO 100 I=1,NITERS
        B = DOIT(I)
C$OMP CRITICAL
        CALL CONSUME (B, RES)
C$OMP END CRITICAL
100   CONTINUE
C$OMP END PARALLEL DO

51 Atomic

• Atomic is a special case of a critical section that can be used for certain simple statements
• It applies only to the update of a memory location (a C sketch follows the Fortran example below)

C$OMP PARALLEL PRIVATE(B, temp)
      B = DOIT(I)
      temp = big_ugly()
C$OMP ATOMIC
      X = X + temp
C$OMP END PARALLEL
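The same pattern in C, as a sketch (big_ugly_calc and x are illustrative names, not from the slides): the expensive work runs fully in parallel, and only the single update of the shared accumulator is made atomic.

double x = 0.0;
#pragma omp parallel
{
    /* hypothetical expensive, fully parallel computation */
    double tmp = big_ugly_calc(omp_get_thread_num());

    #pragma omp atomic
    x += tmp;            /* only this single update is serialized */
}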

52 OpenMP Tasks

Define a task:
  – C/C++: #pragma omp task
  – Fortran: !$omp task

• A task is generated when a thread encounters a task construct
  – It contains a task region and its data environment
  – Tasks can be nested
• A task region is a region consisting of all code encountered during the execution of a task.
• The data environment consists of all the variables associated with the execution of a given task.
  – It is constructed when the task is generated.

53 Task Completion and Synchronization

• Task completion occurs when the task reaches the end of the task region code
• Multiple tasks are joined to completion through the use of task synchronization constructs
  – taskwait
  – barrier construct
• taskwait constructs:
  – #pragma omp taskwait
  – !$omp taskwait

int fib(int n) {
    int x, y;
    if (n < 2) return n;
    else {
        #pragma omp task shared(x)
        x = fib(n-1);
        #pragma omp task shared(y)
        y = fib(n-2);
        #pragma omp taskwait
        return x + y;
    }
}

54 Example: A Linked List

...... while(my_pointer) {

(void) do_independent_work (my_pointer);

my_pointer = my_pointer->next ; } // End of while loop ......


55

Example: A Linked List with Tasking

my_pointer = listhead;
#pragma omp parallel
{
    #pragma omp single nowait
    {
        while (my_pointer) {
            #pragma omp task firstprivate(my_pointer)
            {
                /* the OpenMP task is specified here (executed in parallel) */
                (void) do_independent_work (my_pointer);
            }
            my_pointer = my_pointer->next;
        }
    } // End of single - no implied barrier (nowait)
} // End of parallel region - implied barrier

56

Ordered

• The ordered construct enforces the sequential order for a block.

#pragma omp parallel private (tmp)
#pragma omp for ordered
for (i = 0; i < N; i++) {
    tmp = do_work(i);          /* runs in parallel */
    #pragma omp ordered
    consume(tmp);              /* executed in the order of the loop iterations */
}

57 OpenMP Synchronization

• The flush construct denotes a sequence point where a thread tries to create a consistent view of memory.
  – All memory operations (both reads and writes) defined prior to the sequence point must complete.
  – All memory operations (both reads and writes) defined after the sequence point must follow the flush.
  – Variables in registers or write buffers must be updated in memory.
• Arguments to flush specify which variables are flushed. No arguments specifies that all thread-visible variables are flushed.

58 A flush example

• Pair-wise synchronization.

      integer ISYNC(NUM_THREADS)
C$OMP PARALLEL DEFAULT (PRIVATE) SHARED (ISYNC)
      IAM = OMP_GET_THREAD_NUM()
      ISYNC(IAM) = 0
C$OMP BARRIER
      CALL WORK()
      ISYNC(IAM) = 1       ! I'm all done; signal this to other threads
C$OMP FLUSH(ISYNC)         ! Make sure other threads can see my write
      DO WHILE (ISYNC(NEIGH) .EQ. 0)
C$OMP FLUSH(ISYNC)         ! Make sure the read picks up a good copy from memory
      END DO
C$OMP END PARALLEL

Note: flush is analogous to a fence in other shared memory APIs.

59 OpenMP Lock Routines

• Simple lock routines: a simple lock is available if it is unset.
  – omp_init_lock(), omp_set_lock(), omp_unset_lock(), omp_test_lock(), omp_destroy_lock()
• Nested locks: a nested lock is available if it is unset, or if it is set but owned by the thread executing the nested lock function
  – omp_init_nest_lock(), omp_set_nest_lock(), omp_unset_nest_lock(), omp_test_nest_lock(), omp_destroy_nest_lock()

60 OpenMP Locks

• Protect resources with locks.

omp_lock_t lck;
omp_init_lock(&lck);
#pragma omp parallel private (tmp, id)
{
    id = omp_get_thread_num();
    tmp = do_lots_of_work(id);
    omp_set_lock(&lck);        /* wait here for your turn */
    printf("%d %d", id, tmp);
    omp_unset_lock(&lck);      /* release the lock so the next thread gets a turn */
}
omp_destroy_lock(&lck);        /* free up storage when done */

61 OpenMP Library Routines

• Modify/check the number of threads
  – omp_set_num_threads(), omp_get_num_threads(), omp_get_thread_num(), omp_get_max_threads()
• Are we in a parallel region?
  – omp_in_parallel()
• How many processors are in the system?
  – omp_get_num_procs()

62 OpenMP Environment Variables

• Set the default number of threads to use. – OMP_NUM_THREADS int_literal

• Control how "omp for schedule(RUNTIME)" loop iterations are scheduled.
  – OMP_SCHEDULE "schedule[, chunk_size]"

63 Outline

• OpenMP Introduction
• Parallel Programming with OpenMP
  – Worksharing, tasks, data environment, synchronization
• OpenMP Performance and Best Practices
• Case Studies and Examples
• Reference Materials

64 OpenMP Performance

• The relative ease of using OpenMP is a mixed blessing
• We can quickly write a correct OpenMP program, but without the desired level of performance.
• There are certain "best practices" to avoid common performance problems.
• Extra work is needed to program with large thread counts

65 Typical OpenMP Performance Issues

• Overheads of OpenMP constructs and thread management, e.g.
  – Dynamic loop schedules have much higher overheads than static schedules
  – Synchronization is expensive, use NOWAIT if possible
  – Large parallel regions help reduce overheads and enable better cache usage and standard optimizations
• Overheads of runtime library routines
  – Some are called frequently
• Load balance
• Cache utilization and false sharing

66 Overheads of OpenMP Directives

[Chart: OpenMP overheads measured with the EPCC microbenchmarks on an SGI Altix 3600. Overhead in cycles vs. number of threads (1 to 256) for PARALLEL, PARALLEL FOR, FOR, BARRIER, SINGLE, CRITICAL, LOCK/UNLOCK, ATOMIC, ORDERED, and REDUCTION; overheads grow with the number of threads, reaching above a million cycles at high thread counts.]

67 OpenMP Best Practices

• Reduce usage of barrier with nowait clause

#pragma omp parallel
{
    #pragma omp for
    for (i = 0; i < n; i++)
        a[i] += b[i];

    #pragma omp for nowait
    for (i = 0; i < n; i++)
        c[i] += d[i];
}

68 OpenMP Best Practices

#pragma omp parallel private(i)
{
    #pragma omp for nowait
    for (i = 0; i < n; i++)
        a[i] += b[i];

    #pragma omp for nowait
    for (i = 0; i < n; i++)
        c[i] += d[i];

    #pragma omp barrier

    #pragma omp for nowait reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += a[i] + c[i];
}

69 OpenMP Best Practices

• Avoid large ordered constructs
• Avoid large critical regions

#pragma omp parallel shared(a,b) private(c,d)
{
    ….
    #pragma omp critical
    {
        a += 2 * c;
        c = d * d;      /* move this statement out of the critical region */
    }
}

70 OpenMP Best Practices

• Maximize Parallel Regions

Multiple small parallel regions:

#pragma omp parallel
{
    #pragma omp for
    for (…) { /* Work-sharing loop 1 */ }
}
opt = opt + N;           // sequential
#pragma omp parallel
{
    #pragma omp for
    for (…) { /* Work-sharing loop 2 */ }
    …
    #pragma omp for
    for (…) { /* Work-sharing loop N */ }
}

One large parallel region:

#pragma omp parallel
{
    #pragma omp for
    for (…) { /* Work-sharing loop 1 */ }

    #pragma omp single nowait
    opt = opt + N;       // sequential

    #pragma omp for
    for (…) { /* Work-sharing loop 2 */ }
    …
    #pragma omp for
    for (…) { /* Work-sharing loop N */ }
}

71 OpenMP Best Practices

• Single parallel region enclosing all work-sharing loops.

#pragma omp parallel private(i, j, k)
{
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            #pragma omp for
            for (k = 0; k < n; k++) {
                …
            }
}

• Address load imbalances
• Use parallel for with dynamic schedules and different chunk sizes

Smith-Waterman Sequence Alignment Algorithm

73 OpenMP Best Practices

• Smith-Waterman Algorithm
  – The default schedule is static with an even split -> load imbalance

#pragma omp for
for (…)
  for (…)
    for (…)
      for (…) { /* compute alignments */ }
#pragma omp critical
{ /* compute scores */ }

74 OpenMP Best Practices

Smith-Waterman Sequence Alignment Algorithm

[Charts: speedup vs. number of threads (2 to 128) for sequence lengths 100, 600, and 1000, compared with the ideal speedup. With the default "#pragma omp for" (static schedule) scaling is poor; with "#pragma omp for schedule(dynamic, 1)" the algorithm reaches 128 threads with 80% efficiency.]

75 OpenMP Best Practices

• Address load imbalances by selecting the best schedule and chunk size
• Avoid selecting a small chunk size when the work in a chunk is small

[Charts: overheads (in cycles) of the OpenMP for construct with static and with dynamic scheduling on an SGI Altix 3600, as a function of the number of threads (1 to 256) and the chunk size (1 to 128, plus the default); dynamic scheduling with small chunk sizes has a much higher overhead than static scheduling.]

76 OpenMP Best Practices

• Pipeline processing to overlap I/O and computations

for (i = 0; i < N; i++) {
    ReadFromFile(i, ...);
    for (j = 0; j < ProcessingNum; j++)
        ProcessChunkOfData(i, j);
    WriteResultsToFile(i);
}

77 OpenMP Best Practices

• Pipeline Processing
  – Pre-fetch the I/O
  – Threads reading or writing files join the computations

#pragma omp parallel
{
    /* pre-fetch the first file */
    #pragma omp single
    { ReadFromFile(0, ...); }

    for (i = 0; i < N; i++) {
        /* pre-fetch the next file; the test deals with the last file */
        #pragma omp single nowait
        { if (i < N - 1) ReadFromFile(i + 1, ...); }

        #pragma omp for schedule(dynamic)
        for (j = 0; j < ProcessingNum; j++)
            ProcessChunkOfData(i, j);
        /* the implicit barrier here ensures the data of file i has been
           processed (and the next file read) before the results are written */

        #pragma omp single nowait
        { WriteResultsToFile(i); }
    }
}

• single vs. master work-sharing
  – master is more efficient but requires thread 0 to be available
  – single is more efficient if the master thread is not available
  – single has an implicit barrier

79

• Real-world shared memory systems have caches between memory and CPU
• Copies of a single data item can exist in multiple caches
• Modification of a shared data item by one CPU leads to outdated copies in the cache of another CPU

[Figure: a shared data item in memory with copies in the caches of CPU 0 and CPU 1. A store into a shared cache line invalidates the other copies of that line, and the system is not able to distinguish between changes within one individual cache line.]

False Sharing / OpenMP Best Practices

• False sharing
  – When at least one thread writes to a cache line while others access it
    • Thread 0: = A[1] (read)
    • Thread 1: A[0] = … (write)
• Solution: use array padding

int a[max_threads];
#pragma omp parallel for schedule(static, 1)
for (int i = 0; i < max_threads; i++)
    a[i] += i;                       /* adjacent elements are written by different threads */

int a[max_threads][cache_line_size];
#pragma omp parallel for schedule(static, 1)
for (int i = 0; i < max_threads; i++)
    a[i][0] += i;                    /* padding gives each thread its own cache line */

81 Exercise: Feel the false sharing with axpy-papi.c (a sketch of such an experiment follows below)
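The exercise file itself is not reproduced here; this is only a sketch (assumed names, no PAPI calls) of how an axpy-style loop can be run with and without heavy false sharing: schedule(static,1) interleaves the elements written by different threads so they keep hitting the same cache lines, while the default static schedule gives each thread one contiguous block.

/* y = a*x + y */
void axpy(int n, double a, const double *x, double *y) {
    /* Chunk size 1: thread t writes y[t], y[t+nthreads], ... so neighbouring
       elements belong to different threads (false sharing on the lines of y). */
    #pragma omp parallel for schedule(static, 1)
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];

    /* Default static schedule: each thread gets one contiguous block,
       so writers (almost) never share a cache line. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}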

82 OpenMP Best Practices

• Data placement policy on NUMA architectures

[Figure: a generic cc-NUMA architecture, processor nodes each with their own local memory, connected by an interconnect.]

• First Touch Policy
  – The process that first touches a page of memory causes that page to be allocated in the node on which the process is running

83

About "First Touch" placement/1

int a[100];
for (i = 0; i < 100; i++)
    a[i] = 0;

[Figure: two NUMA nodes; the serial initialization loop runs on one node. Declaring the array only reserves the virtual memory address; with First Touch, all array elements a[0] … a[99] end up in the memory of the node executing this thread.]

84

About "First Touch" placement/2

#pragma omp parallel for num_threads(2)
for (i = 0; i < 100; i++)
    a[i] = 0;

[Figure: two NUMA nodes; with the parallel initialization, First Touch means both memories each get "their half" of the array (a[0]…a[49] and a[50]…a[99]).]

85

OpenMP Best Practices

• First-touch in practice
  – Initialize data consistently with the computations

#pragma omp parallel for
for (i = 0; i < N; i++) {
    a[i] = 0.0; b[i] = 0.0; c[i] = 0.0;
}
readfile(a, b, c);

#pragma omp parallel for
for (i = 0; i < N; i++) {
    a[i] = b[i] + c[i];
}

86 OpenMP Best Practices

• Privatize variables as much as possible
  – Private variables are stored on the local stack of the thread
  – Private data stays close to the cache

double a[MaxThreads][N][N];
#pragma omp parallel for
for (i = 0; i < MaxThreads; i++) {
    /* each thread works on its own slice a[i][..][..] of the shared array */
}

double a[N][N];
#pragma omp parallel private(a)
{
    /* each thread works on its own private copy of a */
}

OpenMP Best Practices

• CFD application pseudo-code
  – Shared arrays initialized incorrectly (first touch policy)
  – Delays in remote memory accesses are probable causes, through saturation of the interconnect

procedure diff_coeff() {
    array allocation by master thread
    initialization of shared arrays

    PARALLEL REGION {
        loop lower_bn[id], upper_bn[id]
        computation on shared arrays
        …
    }
}

OpenMP Best Practices

• Array privatization
  – Improved the performance of the whole program by 30%
  – Speedup of 10 for the procedure, which is now only 5% of the total time
• Processor stalls are reduced significantly

Stall Cycle Breakdown for Non-Privatized (NP) and Privatized (P) Versions of diff_coeff

[Chart: stall cycles (D-cache stalls, branch misprediction, instruction miss, FLP units, front-end flushes) for the NP and P versions and their difference NP-P.]

89 OpenMP Best Practices

• Avoid Thread Migration
  – It affects data locality
• Bind threads to cores.
• Linux:
  – numactl --cpubind=0 foobar
  – taskset -c 0,1 foobar
• SGI Altix:
  – dplace -x2 foobar

90 OpenMP Source of Errors

• Incorrect use of synchronization constructs
  – Less likely if the user sticks to directives
  – Erroneous use of NOWAIT
• Race conditions (true sharing)
  – Can be very hard to find
• Wrong "spelling" of the sentinel
• Use tools to check for data races.

91 Outline

• OpenMP Introduction
• Parallel Programming with OpenMP
  – Worksharing, tasks, data environment, synchronization
• OpenMP Performance and Best Practices
• Hybrid MPI/OpenMP
• Case Studies and Examples
• Reference Materials

92 Matrix Vector Multiplication (The Sequential Source)

[Code figures: the sequential source and the OpenMP source of the matrix-vector product (for each row i, b(i) is the sum over j of a(i,j)*c(j)); the OpenMP version parallelizes the loop over the rows.]

93

Performance - 2-Socket Nehalem

[Chart: performance in Mflop/s vs. number of threads for the OpenMP matrix-vector product on a 2-socket Nehalem system.]

94

A Two-Socket Nehalem System

[Figure: block diagram of a two-socket Nehalem system; each socket has its own cores, caches, and local memory (cc-NUMA), and the two sockets are connected by an interconnect.]

95

Data Initialization

[Code figure: the initialization loops for the matrix-vector data; the initialization causes the allocation of memory according to the first touch policy.]

96

Exploit First Touch

[Code figure: the data is initialized in parallel with the same access pattern as the computation, so that first touch places each portion of the data in the memory of the node that later uses it.]

97

A 3D Matrix Update

• Observation: there is no data dependency on 'I'
• Therefore we can split the 3D matrix into larger blocks and process these in parallel

[Figure: the 3D matrix with index dimensions I, J, K.]

98

The Idea

• We need to distribute the M iterations over the number of processors
• We do this by controlling the start (IS) and end (IE) value of the inner loop
• Each thread will calculate these values for its portion of the work

99

A 3D Matrix Update

• The loops are correctly nested for serial performance
• Due to a data dependency on J and K, only the inner loop can be parallelized
• This will cause the barrier to be executed (N-1)^2 times

[Figure: data dependency graph, element (j,k) depends on (j-1,k) and (j,k-1).]

100

The Performance

• Scaling is very poor
• The inner loop over I has (as is to be expected) been parallelized

[Chart: performance in Mflop/s vs. number of threads. Dimensions: M=7,500, N=20; footprint: ~24 MByte.]

101

Performance Analyzer Data

• Profiles taken with 10 threads and with 20 threads: some functions scale somewhat, others do not scale at all
• Question: why is __mt_WaitForWork so high in the profile?

102

This is False Sharing at Work!

[Figure: memory access pattern for P=1, 2, 4, and 8 threads; with one thread there is no sharing, and false sharing increases as we increase the number of threads.]

103



 M = 75,000 





 Performance (Mflop/s)  M = 7,500

  Number of threads For a higher value of M, the program scales better 104

RvdP/V1 Getting OpenMP Up To Speed Tutorial IWOMP 2010 – CCS Un. of Tsukuba, June 14, 2010 The first implementaon  The first implementation

               



      



 105

RvdP/V1 Getting OpenMP Up To Speed Tutorial IWOMP 2010 – CCS Un. of Tsukuba, June 14, 2010 OpenMP version  Another Idea: Use OpenMP !



   

           

106

RvdP/V1 Getting OpenMP Up To Speed Tutorial IWOMP 2010 – CCS Un. of Tsukuba, June 14, 2010 How this works  How this works on 2 threads  

 parallel region   

   work sharing   

 parallel region   

   work sharing   

… Thisetc …splits! the operation in a way that… etc is …! similar to our manual implementation

107

Performance

• We have set M=7,500 and N=20
• This problem size does not scale at all when we explicitly parallelize the inner loop over 'I'
• We have tested 4 versions of this program:
  – Inner Loop Over 'I' - our first OpenMP version
  – AutoPar - the automatically parallelized version of 'kernel'
  – OMP_Chunks - the manually parallelized version with our explicit calculation of the chunks
  – OMP_DO - the version with the OpenMP parallel region and work-sharing DO

108

The Performance (M=7,500)

[Chart: performance in Mflop/s vs. number of threads for the Innerloop, AutoPar, OMP_Chunks, and OMP_DO versions. Dimensions: M=7,500, N=20; footprint: ~24 MByte.]

• The auto-parallelizing compiler does really well!

109

Reference Material on OpenMP

• OpenMP Homepage www.openmp.org:
  – The primary source of information about OpenMP and its development.
• OpenMP User's Group (cOMPunity) Homepage
  – www.compunity.org
• Books:
  – Using OpenMP, Barbara Chapman, Gabriele Jost, Ruud van der Pas, Cambridge, MA: The MIT Press, 2007, ISBN: 978-0-262-53302-7
  – Parallel Programming in OpenMP, Rohit Chandra et al., San Francisco, Calif.: Morgan Kaufmann; London: Harcourt, 2000, ISBN: 1-55860-671-8

110 Standard OpenMP Implementation

• Directives are implemented via code modification and insertion of runtime library calls
  – The basic step is outlining of the code in the parallel region
• The runtime library is responsible for managing threads
  – Scheduling loops
  – Scheduling tasks
  – Implementing synchronization
• The implementation effort is reasonable

OpenMP code translation example:

int main(void)
{
    int a, b, c;
    #pragma omp parallel private(c)
    do_sth(a, b, c);
    return 0;
}

is translated (conceptually) into:

_INT32 main()
{
    int a, b, c;
    /* microtask */
    void __ompregion_main1()
    {
        _INT32 __mplocal_c;
        /* shared variables are kept intact; accesses to the
           private variable are substituted */
        do_sth(a, b, __mplocal_c);
    }
    …
    /* OpenMP runtime call */
    __ompc_fork(&__ompregion_main1);
    …
}

Each compiler has custom run-time support. The quality of the runtime system has a major impact on performance.