Programming Parallel Machines

Prof. R. Eigenmann

ECE563, Spring 2013 engineering.purdue.edu/~eigenman/ECE563

Parallelism - Why Bother?

Hardware perspective: parallelism is everywhere
– instruction level
– chip level (multicores)
– co-processors (accelerators, GPUs)
– multi-processor level
– multi-computer level
– distributed system level
Big question: Can all this parallelism be hidden?

Hiding Parallelism - For Whom and Behind What?

 For the end user:
– HPC applications: parallelism is not really apparent. Sometimes, the user needs to tell the system on how many “nodes” the application shall run.
– Collaborative applications and remote resource accesses: the user may want to see the parallelism.
 For the application programmer: we can try to hide parallelism
– using parallelizing compilers
– using parallel libraries or software components
– In reality: parallelism is partially hidden behind a good API

Different Forms of Parallelism

 An operating system has parallel processes to manage the many parallel activities that are going on concurrently.
 A high-performance computing application executes multiple parts of the program in parallel in order to get the job done faster.
 A bank that performs a transaction with another bank uses parallel systems to engage multiple distributed databases.
 A multi-group research team that uses a satellite downlink in Boston, a large computer in San Diego, and a “Cave” for visualization at the Univ. of Illinois needs collaborative parallel systems.

How important is Parallel Programming 2013 in Academia?

 It’s a hot topic again
– There was significant interest in the 1980s and the first half of the 1990s.
– Then the interest in parallelism declined until about 2005.
– Increased funding for HPC and emerging multicore architectures have made parallel programming a central topic.
 Hard problems remain:
– What is the right user model and application programmer interface (API) for high performance and high productivity?
– What are the right programming methodologies?
– How can we build scalable hardware and software systems?
– How can we get the user community to accept parallelism?

How important is Parallel Programming 2013 in Industry?
 Parallelism has become a mainstream technology
– The “multicore challenge” is one of the current big issues.
– Industry’s presence at “Supercomputing” is growing.
– Most large-scale computational applications exist in parallel form (car crash, computational chemistry, airplane/flow simulations, seismic processing, to name just a few).
– Parallel systems sell in many shapes and forms: HPC systems for “number crunching”, parallel servers, multi-processor PCs.
– Students who learn about parallelism find good jobs.
 However, the hard problems pose significant challenges:
– Lack of a good API leads to non-portable programs.
– Lack of a programming methodology leads to high software cost.
– Non-scalable systems limit the benefit from parallelism.
– Lack of acceptance hinders dissemination of parallel systems.

What is so Special About Parallel Programming?
Parallel programming is hard because:
 You have to “think parallel”
– We tend to think step-by-step, which is closer to the way a sequential program is written.
 All algorithms have ordering constraints
– They are a.k.a. data and control dependences, and they are difficult to analyze and express => you need to coordinate the parallel threads using synchronization constructs (e.g., locks).
 Deadlocks may occur
– Incorrectly coordinated threads may cause a program to “hang”.
 Race conditions may occur
– Race conditions occur when dependences are violated. They lead to non-reproducible answers of the program. Most often this is an error.
 Data partitioning, off-loading, and message generation are tedious
– But unavoidable when programming distributed and heterogeneous systems.

So, Why Can’t We Hide Parallelism?

 Could we reuse parallel modules?
– Software reuse is still largely an unsolved problem.
– Parallelism encoded in reusable modules may not be at the right level.
 Parallelizing an inner loop is almost always less efficient than parallelizing an outer loop.
 Could we use parallelizing compilers?
– Despite much progress in parallelizing compilers, there are still many programs on which such tools fail, e.g.
 programs not written in Fortran 77 or C
 irregular programs (e.g., using sparse data structures)

Basic Methods For Creating Parallel Programs
 Write a sequential program, then parallelize it.
– Application-level approach: study the physics behind the code, identify steps that can be executed concurrently, and re-code these steps in parallel form.
– Program-level approach (our focus): analyze the program, find code sections that access disjoint data, and mark these sections for parallel execution.
 Write a parallel program directly (from scratch)
– Advantage: parallelism is not inhibited by the sequential programming style.
– Disadvantages:

 Large, existing applications: huge effort

 If sequential programming is difficult already, this approach makes programming even more difficult.

Where in the Application Can We Find Parallelism?

Coarse-grain parallelism through domain decomposition

Subroutine-level parallelism:

  call A                 parallelcall(A)
  call B                 parallelcall(B)
  call C                 call C
                         waitfor A, B

Loop-level parallelism:

  DO i=1,n               !$OMP PARALLEL DO
    A(i)=B(i)            DO i=1,n
  ENDDO                    A(i)=B(i)
                         ENDDO

Instruction-level parallelism:

  C=A+4.5*B              load A,R0
                         load B,R1
                         mul R1,#4.5
                         add R0,R1
                         store R1,C

How Can We Express This Parallelism?

Parallel Threads:

  new_thread(id, subr, param)
  ...
  wait_thread(id)

Threads may live from the beginning to the end of the program or program section, or be created dynamically, as needed.

Parallel Loops:

A single parallel loop:
  PARALLEL DO i=1,n
    call work(i)
  ENDDO

Multiple inner parallel loops:
  DO i=1,n
    PARALLEL DO j=1,m
      ...
    ENDDO
    PARALLEL DO j=1,m
      ...
    ENDDO
  ENDDO

The outermost loop in the program is parallel:
  PARALLEL DO i=1,n
    ...
  ENDDO

How Can We Express This Parallelism?

Parallel Sections — tasks 1, 2, and 3 are executed concurrently:
  COBEGIN
    task1
  ||
    task2
  ||
    task3
  COEND

SPMD execution — a copy of “do-the-work” is executed by each processor/thread:

  BEGIN (all in parallel)
    ...do-the-work...
  END

The BEGIN/END may be implicit.

How do Parallel Tasks Communicate?

Shared-address-space, global-address-space, or shared-memory model: tasks see each other’s data (unless it is explicitly declared as private).

  task 1:  A=3   (store to the shared data A)
  task 2:  B=A   (load from the shared data A)

Distributed-memory or message-passing model: tasks exchange data through explicit messages.

  task 1:  A=3                  task 2:  receive(X,task1)
           send(A,task2)                 B=X

Generating Parallel Programs Automatically

A quick tour through a parallelizing compiler

The important techniques of parallelizing compilers are also the important techniques for manual parallelization

See ECE 663 Lecture Notes and Slides engineering.purdue.edu/~eigenman/ECE663/Handouts 14 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013

Automatic Parallelization

 The Scope of Auto-Parallelization  Loop Parallelization 101  Most Influential Dependence Removing Transformations  User’s Role in “Automatic” Parallelization  Performance of Parallelizing Compilers

Where Do Parallelization Tools Fit Into the Software Engineering Process?

Two paths from a source program to a parallel program:
• A source-to-source restructurer inserts parallel constructs or directives into the source (e.g., F90 → F90/OpenMP, C → C/OpenMP); the user then tunes the resulting program. Example: the Cetus compiler.
• A parallelizing compiler inserts parallel constructs or directives internally and generates the parallel program directly. Example: the SGI F77 compiler (-apo -mplist option).

The Basics About Parallelizing Compilers

 Loops are the primary source of parallelism in scientific and engineering applications.
 Compilers detect loops that have independent iterations, i.e., iterations that access disjoint data.

Parallelism of loops accessing arrays:

  FOR I = 1 TO n
    A[expression1] = …
    … = A[expression2]
  ENDFOR

The loop is independent if expression1 is different from expression2 for any two different iterations.

17 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Parallel Loop Execution

 Fully-independent loops are executed in parallel. The original loop

  FOR i=1 TO 1000
    …
  ENDFOR

executed on 4 processors may assign 250 iterations to each processor:

  FOR i=1 TO 250     FOR i=251 TO 500     FOR i=501 TO 750     FOR i=751 TO 1000
    …                  …                    …                    …
  ENDFOR             ENDFOR               ENDFOR               ENDFOR

 Usually, any dependence prevents a loop from executing in parallel. Hence, removing dependences is very important.

Loop Parallelization 101
Data Dependence: Definition
A dependence exists between two data references if (in a sequential execution of the program)
– both references access the same storage location,
– and at least one of them is a write reference.

Basic data dependence classification:
– Read after Write (RAW): flow dependence — a true dependence
– Write after Read (WAR): anti dependence — a false, or storage-related, dependence
– Write after Write (WAW): output dependence — a false, or storage-related, dependence
– Read after Read (RAR): input dependence — not a dependence

19 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Data Dependence Classification Examples

TRUE / FLOW / RAW:
  S1: X = …
  S2: … = X
The value read at S2 depends on the value written at S1.

ANTI / WAR:
  S1: … = X
  S2: X = …
The read at S1 must occur before the write at S2.

OUTPUT / WAW:
  S1: X = …
  S2: X = …
The write at S2 must occur after the write at S1 if X is read in a later statement.

INPUT / RAR:
  S1: … = X
  S2: … = X
(just for completeness; RAR is only relevant for certain optimizations)

Automatic Parallelization

 The Scope of Auto-Parallelization  Loop Parallelization 101  Most Influential Dependence Removing Transformations  User’s Role in “Automatic” Parallelization  Performance of Parallelizing Compilers

21 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Dependence-Removing Program Transformations I: Data Privatization

Dependence: elements of work read in iteration i’ were also written in iteration i’-1. Each processor is given a separate version of the private data, so there is no sharing conflict:

  C$OMP PARALLEL DO
  C$OMP+ PRIVATE(work)
  DO i = 1, n
    work[1:n] = …
    …
    … = work[1:n]
  ENDDO

Privatization

Scalar privatization — the loop-carried anti dependence on t is removed by making t private:

  DO i=1,n
    t = A(i)+B(i)
    C(i) = t + t**2
  ENDDO

  !$OMP PARALLEL DO
  !$OMP+PRIVATE(t)
  DO i=1,n
    t = A(i)+B(i)
    C(i) = t + t**2
  ENDDO

Array privatization — the same transformation applied to the array t(1:m):

  DO j=1,n
    t(1:m) = A(j,1:m)+B(j)
    C(j,1:m) = t(1:m) + t(1:m)**2
  ENDDO

  !$OMP PARALLEL DO
  !$OMP+PRIVATE(t)
  DO j=1,n
    t(1:m) = A(j,1:m)+B(j)
    C(j,1:m) = t(1:m) + t(1:m)**2
  ENDDO

Array Privatization — more complicated patterns for array privatization:

  k = 5
  DO j=1,n
    t(1:10) = A(j,1:10)+B(j)
    C(j,iv) = t(k)
    t(11:m) = A(j,11:m)+B(j)
    C(j,1:m) = t(1:m)
  ENDDO

  DO j=1,n
    IF (cond(j)) THEN
      t(1:m) = A(j,1:m)+B(j)
      C(j,1:m) = t(1:m) + t(1:m)**2
    ENDIF
    D(j,1) = t(1)
  ENDDO

Dependence-Removing Program Transformations II: Reduction Recognition

Dependence: the value of sum written in iteration i’-1 is read in iteration i’.

  C$OMP PARALLEL DO
  C$OMP+ REDUCTION (+:sum)
  DO i = 1, n
    …
    sum = sum + A[i]
    …
  ENDDO

Each processor will accumulate partial sums, followed by a combination of these parts at the end of the loop.

Rules for Reductions

• Reduction statements in a loop have the form X = X ⊗ expr,
• where X is either a scalar or an array expression (a[sub], where sub must be the same on the LHS and the RHS),
• ⊗ is a reduction operation, such as +, *, min, max,
• X must not be used in any non-reduction statement in this loop (however, there may be multiple reduction statements for X).
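A minimal C/OpenMP sketch (not from the original slides) illustrating these rules: sum appears only in statements of the form sum = sum + expr, so it can simply be named in a reduction clause.

  #include <stdio.h>

  int main(void) {
      double a[1000], sum = 0.0;
      for (int i = 0; i < 1000; i++) a[i] = i * 0.5;   /* initialize input data */

      #pragma omp parallel for reduction(+:sum)        /* each thread accumulates a private copy of sum */
      for (int i = 0; i < 1000; i++)
          sum = sum + a[i];                            /* reduction statement: X = X + expr */

      printf("sum = %f\n", sum);                       /* private copies have been combined here */
      return 0;
  }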

26 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Reduction Parallelization

The serial reduction loop

  DO j=1,n
    sum = sum + a(j)
    …
  ENDDO

can be parallelized with a per-thread preamble and postamble:

  DO PARALLEL j=1,n
    PREAMBLE:   PRIVATE s=0
    s = s + a(j)
    …
    POSTAMBLE:  ATOMIC: sum = sum + s
  ENDDO

or by making the reduction statement itself atomic:

  DO PARALLEL j=1,n
    ATOMIC: sum = sum + a(j)
    …
  ENDDO

Preamble and postamble are executed exactly once by each participating thread.

Reduction Parallelization II — Scalar Reductions

  DO i=1,n
    sum = sum + A(i)
  ENDDO

Privatized reduction implementation:

  !$OMP PARALLEL PRIVATE(s)
  s=0
  !$OMP DO
  DO i=1,n
    s=s+A(i)
  ENDDO
  !$OMP ATOMIC
  sum = sum+s
  !$OMP END PARALLEL

Expanded reduction implementation:

  DO i=1,num_proc
    s(i)=0
  ENDDO
  !$OMP PARALLEL DO
  DO i=1,n
    s(my_proc)=s(my_proc)+A(i)
  ENDDO
  DO i=1,num_proc
    sum=sum+s(i)
  ENDDO

Remember, OpenMP has a reduction clause; only reduction recognition is needed:

  !$OMP PARALLEL DO
  !$OMP+REDUCTION(+:sum)
  DO i=1,n
    sum = sum + A(i)
  ENDDO

Reduction Parallelization III — Array Reductions (a.k.a. irregular or histogram reductions)

  DIMENSION sum(m)
  DO i=1,n
    sum(expr) = sum(expr) + A(i)
  ENDDO

Privatized implementation:

  DIMENSION sum(m),s(m)
  !$OMP PARALLEL PRIVATE(s)
  s(1:m)=0
  !$OMP DO
  DO i=1,n
    s(expr)=s(expr)+A(i)
  ENDDO
  !$OMP ATOMIC
  sum(1:m) = sum(1:m)+s(1:m)
  !$OMP END PARALLEL

Expanded implementation:

  DIMENSION sum(m),s(m,#proc)
  !$OMP PARALLEL DO
  DO i=1,m
    DO j=1,#proc
      s(i,j)=0
    ENDDO
  ENDDO
  !$OMP PARALLEL DO
  DO i=1,n
    s(expr,my_proc)=s(expr,my_proc)+A(i)
  ENDDO
  !$OMP PARALLEL DO
  DO i=1,m
    DO j=1,#proc
      sum(i)=sum(i)+s(i,j)
    ENDDO
  ENDDO
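A hedged C/OpenMP sketch of the privatized array-reduction scheme above (the names hist, bin, M, and N are made up for illustration): each thread accumulates into a private copy of the histogram and the copies are combined at the end; a critical section is used for the sum-up because OpenMP atomic does not cover whole-array updates in C.

  #include <string.h>

  #define M 64        /* number of histogram bins */
  #define N 100000    /* number of input elements */

  void histogram(const int bin[N], const double a[N], double hist[M]) {
      #pragma omp parallel
      {
          double local[M];                     /* private copy of the reduction array */
          memset(local, 0, sizeof(local));
          #pragma omp for
          for (int i = 0; i < N; i++)
              local[bin[i]] += a[i];           /* the sum(expr) = sum(expr) + A(i) pattern */
          #pragma omp critical                 /* sum-up phase: combine the private copies */
          for (int j = 0; j < M; j++)
              hist[j] += local[j];
      }
  }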

Note, OpenMP 1.0 does not support such array reductions.

Performance Considerations for Reduction Parallelization

 Parallelized reductions execute substantially more code than their serial versions ⇒ overhead if the reduction (n) is small.  In many cases (for large reductions) initialization and sum-up are insignificant.  False sharing can occur, especially in expanded reductions, if multiple processors use adjacent array elements of the temporary reduction array (s).  Expanded reductions exhibit more parallelism in the sum-up operation compared to privatized reductions.  Potential overhead in initialization, sum-up, and memory used for large, sparse array reductions ⇒ compression schemes can become useful.

30 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Dependence-Removing Program Transformations III: Substitution

The loop-carried flow dependence on ind is removed by substituting the closed form of the induction variable:

  ind = k                      PARALLEL DO i=1,n
  DO i=1,n                       A(k+2*i) = B(i)
    ind = ind + 2              ENDDO
    A(ind) = B(i)
  ENDDO

This is the simple case of an induction variable

31 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Generalized Induction Variables

  ind=k                        PARALLEL DO j=1,n
  DO j=1,n                       A(k+(j**2+j)/2) = B(j)
    ind = ind + j              ENDDO
    A(ind) = B(j)
  ENDDO

More complex patterns — a triangular loop nest and coupled induction variables:

  DO i=1,n                     DO i=1,n
    DO j=1,i                     ind1 = ind1 + 1
      ind = ind + 1              ind2 = ind2 + ind1
      A(ind) = B(i)              A(ind2) = B(i)
    ENDDO                      ENDDO
  ENDDO

32 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Rules for Induction Variables

 induction statements in a loop nest have the form iv=iv+expr or iv=iv*expr, where iv is a scalar integer

 expr must be loop-invariant or another induction variable (there must not be cyclic relationships among IVs)

 iv must not be assigned in a non-induction statement
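A small C sketch (not from the slides) of the substitution these rules enable: the serial increment ind = ind + 2 satisfies the rules, so it is replaced by its closed form k + 2*i, which removes the loop-carried flow dependence and lets the loop run in parallel.

  void copy_strided(double *A, const double *B, int n, int k) {
      /* serial form:
       *   int ind = k;
       *   for (int i = 1; i <= n; i++) { ind = ind + 2; A[ind] = B[i]; }
       */
      #pragma omp parallel for
      for (int i = 1; i <= n; i++)
          A[k + 2*i] = B[i];      /* closed form of the induction variable */
  }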

User’s Role: Choice of Compiler Options

Examples of parallelizing compiler options (typically there is a large set of options):
– Optimization levels (will primarily increase compilation time)
 optimize: simple analysis, advanced analysis, data-dependence analysis, locality enhancement, array privatization/reduction
 aggressive: data padding, data layout adjustment
– Subroutine inlining (makes up for lack of interprocedural analysis)
 inline all, specific routines, how to deal with libraries
– Try specific optimizations (may enhance OR degrade performance)
 e.g., recurrence and reduction recognition, loop fusion, tiling

More About Compiler Options

– Limits on amount of optimization:

 e.g., size of optimization data structures, number of optimization variants tried – Make certain assumptions:

 e.g., array bounds are not violated, arrays are not aliased – Machine parameters:

 e.g., cache size, line size, mapping – Listing control

Compiler options can be a substitute for advanced compiler strategies. If the compiler has limited information, the user can help out.

Compiler options are very important for the user to know. Setting good compiler options can make a big performance difference. 35 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 User Tuning: Inspecting the Translated Program

 Source-to-source restructurers:
– Transformed source code is the actual output
– Example: Cetus
=> The output can be a starting point for code tuning
 Code-generating compilers:
– Some have an option for viewing the translated (parallel) code
– Example: SGI f77 -apo -mplist
=> You may modify the source code to make it easier for the compiler to detect parallelism

Compiler Listings

The listing gives many useful clues for improving the performance:
– Loop optimization tables
– Reports about data dependences
– Explanations about applied transformations
– The annotated, transformed code
– Calling tree
– Performance statistics

Example loop nest summary from a listing:
  CLENMO_do#1    loop is parallel
  CLENMO_do#1#1  loop is serial, because it contains I/O statements, and the following variables (may) have loop-carried dependences: VEC[]

The type of reports to be included in the listing can be set through compiler options. 37 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Automatic Parallelization

 The Scope of Auto-Parallelization  Loop Parallelization 101  Most Influential Dependence Removing Transformations  User’s Role in “Automatic” Parallelization  Performance of Parallelizing Compilers

38 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Performance of Parallelizing Compilers

[Chart: speedups (roughly 0–10) of the benchmarks BT, CG, EP, SP, FT, IS, LU, and MG as parallelized by Cetus, OpenUH, ICC, PGI, and Rose, compared with the hand-parallelized versions.]

39 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Tuning Automatically-Parallelized Programs

Why Would We Want to Tune Automatically-Parallelized Code?

Because
 compiler techniques are limited
– E.g., array reductions are parallelized by only few compilers
 compilers may have insufficient information
– E.g., the loop iteration range may be input data
– E.g., variables are defined in other subroutines (no interprocedural analysis)
 Tuning the compiler-parallelized program is generally easier than hand-parallelizing a sequential program.

Methods for Tuning Automatically-Parallelized Programs
 1. Tuning compiler options
– Parallelizers have many more options than standard compilers
 2. Changing the source program
– so that the parallelizer can recognize more opportunities for optimizations
 3. Manually improving the transformed code
– This task is similar to explicit parallel programming.
– The parallelizer must be a source-to-source restructurer.

Tuning Parallelizer Options

 Tuning compiler options may improve performance by 100% or more.  Examples: – compile for specific machine architecture – enable/disable recurrence recognition – padding data structures – enable/disable tiling – set parallelization threshold – set degree of inlining

– strict language standard interpretation 43 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Changing the Source Program

 Inserting directives, such as – explicitly parallel loops (e.g., OpenMP syntax) – properties of data structures (e.g., permutation array) – assert independence  Modifying the source code. Examples: – assigning explicit values to variables – removing pointers and obscure code – removing debug output statements

44 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Manually Improving the Transformed Code

 The basic method: – inspect the most time-consuming loops. If they are not parallelized, find out why; then transform them by hand.  Remember (very important): – The compiler gives hints in its listing, which may tell you where to focus attention. E.g., which variables have data dependences.

Exercise 1: A multi-threaded “Hello world” program
 Write a multithreaded program where each thread prints “hello world”.

  #include “omp.h”
  void main()
  {
    #pragma omp parallel
    {
      int ID = omp_get_thread_num();
      printf(“ hello(%d) ”, ID);
      printf(“ world(%d) \n”, ID);
    }
  }

Sample Output:
  hello(1) hello(0) world(1)
  world(0)
  hello(3) hello(2) world(3)
  world(2)

Exercise 2: A multi-threaded “pi” program
 On the following slide, you’ll see a sequential program that uses numerical integration to compute an estimate of PI.
 Parallelize this program using OpenMP. There are several options (do them all if you have time):

 Do it as an SPMD program using a parallel region only.

 Do it with a work sharing construct.  Remember, you’ll need to make sure multiple threads don’t overwrite each other’s variables.

Our running Example: The PI program — Numerical Integration

Mathematically, we know that:

  ∫ from 0 to 1 of 4.0/(1+x²) dx = π

We can approximate the integral as a sum of rectangles:

  Σ (i=0..N) F(xi)·Δx ≈ π,   where F(x) = 4.0/(1+x²)

and each rectangle has width Δx and height F(xi) in the middle of interval i.

PI Program: The sequential program

static long num_steps = 100000;
double step;
void main () {
  int i; double x, pi, sum = 0.0;

step = 1.0/(double) num_steps;

for (i=1; i<= num_steps; i++){
    x = (i-0.5)*step;
    sum = sum + 4.0/(1.0+x*x);
  }
  pi = step * sum;
}

49 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 PI Program The OpenMP parallel version

#include <omp.h>
static long num_steps = 100000;
double step;
void main ()
{
  int i;
  double x, pi, sum = 0.0;
  step = 1.0/(double) num_steps;
  #pragma omp parallel for reduction(+:sum) private(x)
  for (i=1; i<= num_steps; i++){
    x = (i-0.5)*step;
    sum = sum + 4.0/(1.0+x*x);
  }
  pi = step * sum;
}

OpenMP adds 2 lines of code.

R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 OpenMP PI Program: Parallelized without a reduction clause

#include <omp.h>
static long num_steps = 100000;
double step;
#define NUM_THREADS 2
void main ()
{
  int i;
  double x, pi, sum[NUM_THREADS] = {0.0};
  step = 1.0/(double) num_steps;
  omp_set_num_threads(NUM_THREADS);
  #pragma omp parallel
  {
    double x;
    int i, id;
    id = omp_get_thread_num();
    #pragma omp for
    for (i=0; i< num_steps; i++){
      x = (i+0.5)*step;
      sum[id] += 4.0/(1.0+x*x);
    }
  }
  for (i=0, pi=0.0; i<NUM_THREADS; i++)
    pi += sum[i] * step;
}

OpenMP PI Program: Without the use of a worksharing (omp for) construct => SPMD program

SPMD program: each thread runs the same code; the thread ID selects any thread-specific behavior.

#include <omp.h>
static long num_steps = 100000;
double step;
#define NUM_THREADS 2
void main ()
{
  int i;
  double x, pi, sum[NUM_THREADS] = {0};
  step = 1.0/(double) num_steps;
  omp_set_num_threads(NUM_THREADS);
  #pragma omp parallel
  {
    double x;
    int id, i;
    id = omp_get_thread_num();
    int nthreads = omp_get_num_threads();
    for (i=id; i< num_steps; i=i+nthreads){
      x = (i+0.5)*step;
      sum[id] += 4.0/(1.0+x*x);
    }
  }
  for (i=0, pi=0.0; i<NUM_THREADS; i++)
    pi += sum[i] * step;
}

52 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Performance Tuning Example 1: MDG

 MDG: A Fortran code of the “Perfect Benchmarks”.
 Advanced autoparallelizers may recognize the parallelism in this code.

[Chart: performance on a 4-core machine of the original code, after tuning step 1, and after tuning step 2.]

MDG: Tuning Steps

Step 1: Parallelize the most time-consuming loop. It consumes 95% of the serial execution time. The transformations it takes to parallelize this loop are:
– array privatization
– reduction parallelization

Step 2: Balance the iteration space of this loop.
– The loop nest is “triangular”. Default block partitioning would create an unbalanced assignment of iterations to processors.

Structure of the most time-consuming loop in MDG:

Original:
  c1 = x(1)>0
  c2 = x(1:10)>0
  DO i=1,n
    DO j=i,n
      IF (c1) THEN rl(1:100) = …
      …
      IF (c2) THEN … = rl(1:100)
      sum(j) = sum(j) + …
    ENDDO
  ENDDO

Parallel:
  c1 = x(1)>0
  c2 = x(1:10)>0
  Allocate(xsum(n,1:#proc))
  C$OMP PARALLEL DO
  C$OMP+ PRIVATE (i,j,rl,id)
  C$OMP+ SCHEDULE (STATIC,1)
  DO i=1,n
    id = omp_get_thread_num()
    DO j=i,n
      IF (c1) THEN rl(1:100) = …
      …
      IF (c2) THEN … = rl(1:100)
      xsum(j,id) = xsum(j,id) + …
    ENDDO
  ENDDO
  C$OMP PARALLEL DO
  DO i=1,n
    sum(i) = sum(i) + xsum(i,1:#proc)
  ENDDO

Performance Tuning Example 2: ARC2D
ARC2D: A Fortran code of the “Perfect Benchmarks”.

ARC2D is parallelized very well by available compilers. However, the mapping of the code to the machine could be improved.

[Chart: speedups of ARC2D for the original compiler-parallelized code, after the locality tuning step, and after the granularity tuning step.]

56 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 ARC2D: Tuning Steps

 Step 1: Loop interchanging increases cache locality through stride-1 references  Step 2: Move parallel loops to outer positions  Step 3: Move synchronization points outward  Step 4: Coalesce loops

ARC2D — loop interchanging increases (spatial) cache locality. Before:

  !$OMP PARALLEL DO
  !$OMP+PRIVATE(R1,R2,K,J)
  DO j = jlow, jup
    DO k = 2, kmax-1
      r1 = prss(jminu(j), k) + prss(jplus(j), k) + (-2.)*prss(j, k)
      r2 = prss(jminu(j), k) + prss(jplus(j), k) + 2.*prss(j, k)
      coef(j, k) = ABS(r1/r2)
    ENDDO
  ENDDO
  !$OMP END PARALLEL

After interchanging, the inner loop makes stride-1 references:

  !$OMP PARALLEL DO
  !$OMP+PRIVATE(R1,R2,K,J)
  DO k = 2, kmax-1
    DO j = jlow, jup
      r1 = prss(jminu(j), k) + prss(jplus(j), k) + (-2.)*prss(j, k)
      r2 = prss(jminu(j), k) + prss(jplus(j), k) + 2.*prss(j, k)
      coef(j, k) = ABS(r1/r2)
    ENDDO
  ENDDO
  !$OMP END PARALLEL

ARC2D

Increasing parallel loop granularity through the NOWAIT clause:

  !$OMP PARALLEL
  !$OMP+PRIVATE(LDI,LD2,LD1,J,LD,K)
  DO k = 2+2, ku-2, 1
    !$OMP DO
    DO j = jl, ju
      ld2 = a(j, k)
      ld1 = b(j, k)+(-x(j, k-2))*ld2
      ld = c(j, k)+(-x(j, k-1))*ld1+(-y(j, k-1))*ld2
      ldi = 1./ld
      f(j, k, 1) = ldi*(f(j, k, 1)+(-f(j, k-2, 1))*ld2+(-f(j, k-1, 1))*ld1)
      f(j, k, 2) = ldi*(f(j, k, 2)+(-f(j, k-2, 2))*ld2+(-f(j,k-2, 2))*ld1)
      x(j, k) = ldi*(d(j, k)+(-y(j, k-1))*ld1)
      y(j, k) = e(j, k)*ldi
    ENDDO
    !$OMP END DO NOWAIT
  ENDDO
  !$OMP END PARALLEL

Note that
• the k-loop is executed in sequential order and
• all iterations that have the same value of j are executed by the same thread in all iterations of the k-loop.

Related Technique: Loop Blocking

Loop blocking increases temporal locality:

  DO j=1,m                      DO i1=1,n,block
    DO i=1,n                      DO j=1,m
      B(i,j)=A(i,j)+A(i,j-1)        DO i=i1,min(i1+block-1,n)
    ENDDO                             B(i,j)=A(i,j)+A(i,j-1)
  ENDDO                             ENDDO
                                  ENDDO
                                ENDDO

Step 1: split the inner loop in two (a.k.a. loop “stripmining”).
Step 2: interchange the outer two loops.

The same effect is achieved like this:

  !$OMP PARALLEL
  DO j=1,m
    !$OMP DO SCHEDULE(STATIC,block)
    DO i=1,n
      B(i,j)=A(i,j)+A(i,j-1)
    ENDDO
    !$OMP END DO NOWAIT
  ENDDO
  !$OMP END PARALLEL

ARC2D

Increasing parallel loop granularity through loop coalescing (a.k.a. loop collapsing):

  !$OMP PARALLEL DO
  !$OMP+PRIVATE(n,k,j)
  DO n = 1, 4
    DO k = 2, kmax-1
      DO j = jlow, jup
        q(j, k, n) = q(j, k, n)+s(j, k, n)
        s(j, k, n) = s(j, k, n)*phic
      ENDDO
    ENDDO
  ENDDO
  !$OMP END PARALLEL

  !$OMP PARALLEL DO
  !$OMP+PRIVATE(nk,n,k,j)
  DO nk = 0,4*(kmax-2)-1
    n = nk/(kmax-2) + 1
    k = MOD(nk,kmax-2)+2
    DO j = jlow, jup
      q(j, k, n) = q(j, k, n)+s(j, k, n)
      s(j, k, n) = s(j, k, n)*phic
    ENDDO
  ENDDO
  !$OMP END PARALLEL

Underlying Technique: Loop Coalescing/Collapsing

Loop coalescing:

  PARALLEL DO i=1,n               PARALLEL DO ij=1,n*m
    DO j=1,m                        i = 1 + (ij-1) DIV m
      A(i,j) = B(i,j)      ==>      j = 1 + (ij-1) MOD m
    ENDDO                           A(i,j) = B(i,j)
  ENDDO                           ENDDO
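A hedged C/OpenMP sketch of the same technique (array sizes and names are illustrative): a single loop over ij replaces the nested loops, with i and j recovered by division and modulo; with an OpenMP 3.0 compiler the collapse clause achieves the same effect.

  void coalesced(double *A, const double *B, int n, int m) {
      #pragma omp parallel for
      for (int ij = 0; ij < n * m; ij++) {
          int i = ij / m;                    /* recover the row index */
          int j = ij % m;                    /* recover the column index */
          A[i*m + j] = B[i*m + j];
      }
  }

  void collapsed(double *A, const double *B, int n, int m) {
      #pragma omp parallel for collapse(2)   /* OpenMP 3.0: coalesce the two loops */
      for (int i = 0; i < n; i++)
          for (int j = 0; j < m; j++)
              A[i*m + j] = B[i*m + j];
  }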

The transformation can be beneficial if
• n is small, unknown, or variable
• the loop body is large
• the computation is irregular

Performance Tuning Example 3: EQUAKE
EQUAKE: A C code of the SPEC OMP 2001 benchmarks. EQUAKE is hand-parallelized with relatively few code modifications. It achieves excellent speedup.

[Chart: speedups of EQUAKE for the original sequential code, the initial OpenMP version, and the version with improved memory allocation.]

EQUAKE: Tuning Steps

 Step1: Parallelizing the four most time-consuming loops

 inserted OpenMP pragmas for parallel loops and private data

 array reduction transformation  Step2: A change in memory allocation

64 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 /* malloc w1[numthreads][…] */ subroutine #pragma omp parallel for smvp for (j = 0; j < numthreads; j++) for (i = 0; i < nodes; i++) { w1[j][i] = 0.0; ...; }

#pragma omp parallel private(my_cpu_id,exp,...) { my_cpu_idfor (i = 0; i =< omp_get_thread_num nodes; i++) (); EQUAKE while (...) { #pragma ... omp for for exp(i = 0;= iloop-local < nodes; i ++)computation; while (...) { ...w [...] += exp; exp... = loop-local computation; } w1[my_cpu_id][...] += exp; ... } } #pragma omp parallel for for (j = 0; j < numthreads; j++) { for (i = 0; i < nodes; i++) { w[i] += w1[j][i]; ...;}

65 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 !$OMP PARALLEL PRIVATE(rand, iother, itemp, temp) int my_cpu_id = 1 !$ my_cpu_id = omp_get_thread_num() + 1 Example 4: !$OMP DO DO j=1,npopsiz-1 CALL ran3(1,rand,my_cpu_id,0) GAFORT iother=j+1+DINT(DBLE(npopsiz-j)*rand) !$ IF (j < iother) THEN !$ CALL omp_set_lock(lck(j)) • Parallel shuffle !$ CALL omp_set_lock(lck(iother)) • 20,000 locks !$ ELSE • Deadlock !$ CALL omp_set_lock(lck(iother)) • Parallel random !$ CALL omp_set_lock(lck(j)) !$ END IF number generation itemp(1:nchrome)=iparent(1:nchrome,iother) iparent(1:nchrome,iother)=iparent(1:nchrome,j) subroutine iparent(1:nchrome,j)=itemp(1:nchrome) temp=fitness(iother) shuffle Different executions fitness(iother)=fitness(j) fitness(j)=temp produce different !$ IF (j < iother) THEN results => !$ CALL omp_unset_lock(lck(iother)) asynchronous !$ CALL omp_unset_lock(lck(j)) algorithm, !$ ELSE !$ CALL omp_unset_lock(lck(j)) non-deterministic !$ CALL omp_unset_lock(lck(iother)) parallelism !$ END IF END DO !$OMP END DO !$OMP END PARALLEL R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 !$OMP PARALLEL PRIVATE(rand, iother, itemp, temp) int my_cpu_id = 1 !$ my_cpu_id = omp_get_thread_num() + 1 Atomic !$OMP DO DO j=1,npopsiz-1 CALL ran3(1,rand,my_cpu_id,0) Region iother=j+1+DINT(DBLE(npopsiz-j)*rand) !$ IF (j < iother) THEN !$ CALL omp_set_lock(lck(j)) !$ CALL omp_set_lock(lck(iother)) !$ ELSE This is also referred !$ CALL omp_set_lock(lck(iother)) Atomic { to as Transactional !$ CALL omp_set_lock(lck(j)) !$ END IF Execution Memory itemp(1:nchrome)=iparent(1:nchrome,iother) iparent(1:nchrome,iother)=iparent(1:nchrome,j) by each iparent(1:nchrome,j)=itemp(1:nchrome) thread temp=fitness(iother) appears fitness(iother)=fitness(j) indivisible fitness(j)=temp !$ IF (j < iother) THEN } !$ CALL omp_unset_lock(lck(iother)) !$ CALL omp_unset_lock(lck(j)) !$ ELSE !$ CALL omp_unset_lock(lck(j)) !$ CALL omp_unset_lock(lck(iother)) !$ END IF END DO !$OMP END DO !$OMP END PARALLEL R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Atomic Region vs. Critical Section

                                   AtomicRegion { }                 CriticalSection { }
  Executes indivisibly             ✔                                ✔
  #threads executing in parallel   all, but conflicts force         1
                                   serialization or rollback
  Deterministic execution order    no                               no
  Implementation complexity        high                             low

Parallel Programming Tools and Methodologies

69 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 What Tools Did We Use for Performance Analysis and Tuning?

 Compilers – the starting point for performance tuning was the compiler-parallelized program. – It reports: parallelized loops, data dependences, call graph.  Subroutine and loop profilers – focusing attention on the most time-consuming loops is absolutely essential.  Performance “spreadsheets”: – typically comparing performance differences at the loop level.

70 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Guidelines for Fixing “Performance Bugs”

 The methodology that worked for us: – Use compiler-parallelized code as a starting point – Get loop profile and compiler listing – Inspect time-consuming loops (largest potential for improvement)

 Case 1. Check for parallelism where the compiler could not find it

 Case 2. Improve parallel loops where the speedup is limited

Remember: we are considering a program-level approach to performance tuning, as opposed to an application-level approach. 71 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Performance Tuning

Case 1: if the loop is not parallelized automatically, do this:  Check for parallelism: – read the compiler explanation, if available – a variable may be independent even if the compiler detects dependences (compilers are conservative) – check if conflicting array is privatizable (e.g., compilers don’t perform array privatization well) – check if the conflicting variable is a reduction or induction variable  If you find parallelism, add OpenMP parallel directives, or make the information explicit for the parallelizer

72 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Performance Tuning Case 2: if the loop is parallel but does not perform well, consider several optimization factors:

High overheads are caused by:

Parallelization overhead:
• parallel startup cost
• small loops
• additional parallel code
• over-optimized inner loops
• less optimization for parallel code

Spreading overhead:
• load imbalance
• synchronized sections
• non-stride-1 references
• many shared references (memory bandwidth)
• low cache affinity (increased cache misses)

Parallelization Overheads

 Parallel startup cost – even with efficient runtime libraries and microtasking schemes, fork/join overheads are in the order of 1-10 µs. They are machine-specific and may increase with the number of processors. – the overhead includes the time to

 wakeup helper tasks and communicate data to them (fork), and

 the barrier synchronization at the end of the loop (join)  Small loops – loops with small numbers of iterations or small bodies – generally, if the average execution time of a loop is less than 0.1 ms, you cannot expect good speedup.

 How many statements execute in that time?

74 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Parallelization Overheads continued

 Additional parallel code – parallel code may perform more work than the serial code

 examples: - initialization and final sum of reductions - parallel search may search a larger space than the serial equivalent  Over-optimized inner loops – the compiler may parallelize a subroutine, not knowing that it is called from within a parallel loop  Less optimization for parallel code – compilers are typically more conservative when optimizing parallel code. E.g., Sun’s OpenMP compiler uses -O3, although -O5 is available for serial code 75 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Spreading Overheads

 Load imbalance – uneven numbers of iterations

 triangular loops

 many if statements

 non-uniform memory or disk access times – interruptions

 other programs

 OS processes  Synchronized section – even short critical sections can decrease performance substantially – then there is the cost of calling synchronization primitives

76 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Spreading Overheads continued  Non-stride-1 references – this is already a problem in serial programs – issues in parallel programs:

 several processors read from the same cache line => more cache misses

 several processors write to the same cache line => false sharing even if the accesses go to different data  Many shared references – private data can be kept in the local memory (segment) – bandwidth of shared memory is limited  Low cache affinity – additional cache misses are incurred if processors access different data in consecutive loops 77 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Profiling Techniques

 Simple timing on/off calls before/after loops are useful.  Optimizations often affect other program sections as well. Be sure to monitor the overall program execution time as well.  Both inclusive and exclusive profiles are useful.  Be aware that inserted subroutine calls may inhibit certain compiler optimizations.
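A minimal sketch of the “simple timing calls” mentioned above, using omp_get_wtime(); the measured loop and its arguments are illustrative only.

  #include <omp.h>
  #include <stdio.h>

  void profile_loop(double *a, int n) {
      double t0 = omp_get_wtime();          /* timer "on" before the loop */
      #pragma omp parallel for
      for (int i = 0; i < n; i++)
          a[i] = a[i] * 2.0 + 1.0;          /* the loop being measured */
      double t1 = omp_get_wtime();          /* timer "off" after the loop */
      printf("loop time: %f s\n", t1 - t0); /* also keep monitoring total program time */
  }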

78 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 MDG performance optimization report

(see handout)

79 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Performance Analysis Tool Examples

Analyze each Parallel region

Find serial regions that are hurt by parallelism

Sort or filter regions to navigate to hotspots

80 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Performance Analysis Tool Examples

81 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Performance Analysis Tool Examples

82 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Performance Analysis Tool Examples Program Structure View

Performance Spreadsheet

83 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Introductions to

 OpenMP

 MPI

(see separate slides)

PI Program in MPI

static long num_steps = 100000;
double step;
void main () {
  int i, id, p;
  double x, pi, sum = 0;
  step = 1.0/(double) num_steps;

MPI_Init(…); MPI_Comm_rank (MPI_COMM_WORLD, &id);//rank of each process MPI_Comm_size (MPI_COMM_WORLD, &p);//total number of processes

for (i=id;i< num_steps; i=i+p) { x = (i+0.5)*step; sum += 4.0/(1.0+x*x); }

MPI_Allreduce(…sum…pi…);

MPI_Finalize(); }
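For reference, a complete, hedged C sketch of the same program; the arguments elided in the MPI_Allreduce call above are filled in one plausible way, and the result is printed by rank 0.

  #include <mpi.h>
  #include <stdio.h>

  static long num_steps = 100000;

  int main(int argc, char *argv[]) {
      int id, p;
      double step, x, sum = 0.0, pi = 0.0;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &id);   /* rank of this process */
      MPI_Comm_size(MPI_COMM_WORLD, &p);    /* total number of processes */

      step = 1.0 / (double) num_steps;
      for (long i = id; i < num_steps; i += p) {   /* cyclic distribution of iterations */
          x = (i + 0.5) * step;
          sum += 4.0 / (1.0 + x * x);
      }
      sum *= step;
      /* combine the partial sums of all processes; every rank receives pi */
      MPI_Allreduce(&sum, &pi, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

      if (id == 0) printf("pi = %f\n", pi);
      MPI_Finalize();
      return 0;
  }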

85 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Introduction to Pthreads POSIX Standard, IEEE Std 1003.1c-1995 Creating and starting a new thread: int pthread_create (pthread_t *thread, const pthread_attr_t *attr, void * (*routine)(void*), void* arg)  Returns 0 if successful, or non-zero error code.  thread points to the ID of the newly created thread.  attr specifies the thread attributes to be applied to the new thread. NULL for default attributes.  routine is the name of the function that the thread calls when started. It takes a single parameter (arg), a pointer to void.

R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Introduction to Pthreads

Blocking the calling thread until the specified thread ends: int pthread_join (pthread_t thread, void ** status);
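A minimal sketch (not from the slides) putting pthread_create and pthread_join together in a hello-world style program; the thread count and function names are arbitrary.

  #include <pthread.h>
  #include <stdio.h>

  #define NTHREADS 4

  void *hello(void *arg) {
      long id = (long) arg;                 /* the single argument passed at creation */
      printf("hello from thread %ld\n", id);
      return NULL;
  }

  int main(void) {
      pthread_t tid[NTHREADS];
      for (long i = 0; i < NTHREADS; i++)
          pthread_create(&tid[i], NULL, hello, (void *) i);  /* NULL = default attributes */
      for (int i = 0; i < NTHREADS; i++)
          pthread_join(tid[i], NULL);                        /* block until each thread ends */
      return 0;
  }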

87 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Introduction to Pthreads

More thread management functions  pthread_self : find out own thread ID  pthread_equal : test two thread ID for equality  pthread_detach : set thread to release resources  pthread_exit : exit thread without exiting the process  pthread_cancel : terminates another thread

88

R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Mutexes int pthread_mutex_lock (pthread_mutex_t *mutex); locks the mutex. If already locked, blocks caller. int pthread_mutex_trylock (pthread_mutex_t *mutex); like mutex_lock, but no blocking if locked already int pthread_mutex_unlock (pthread_mutex_t *mutex); wakes 1st thread waiting on mutex
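A small sketch showing the mutex calls above protecting a shared counter; without the lock, the two threads would race on the increment.

  #include <pthread.h>

  static long counter = 0;
  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

  void *increment(void *arg) {
      for (int i = 0; i < 100000; i++) {
          pthread_mutex_lock(&lock);     /* blocks if another thread holds the lock */
          counter++;                     /* critical section */
          pthread_mutex_unlock(&lock);   /* wakes one waiting thread, if any */
      }
      return NULL;
  }

  int main(void) {
      pthread_t t1, t2;
      pthread_create(&t1, NULL, increment, NULL);
      pthread_create(&t2, NULL, increment, NULL);
      pthread_join(t1, NULL);
      pthread_join(t2, NULL);
      return 0;                          /* counter is now exactly 200000 */
  }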

R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Semaphores

 init semaphore to value: int sem_init (sem_t *sem, int pshared, unsigned int value)

 increments sem. Any waiting thread will wake up: int sem_post (sem_t *sem)

 decrements sem. Blocks if sem is 0: int sem_wait (sem_t *sem)
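A hedged sketch of a single-producer/single-consumer handshake with these calls; the buffer handling is a simplified placeholder.

  #include <semaphore.h>
  #include <pthread.h>

  static sem_t items;          /* counts items available to the consumer */
  static int buffer[1024];
  static int in = 0, out = 0;

  void *producer(void *arg) {
      for (int i = 0; i < 1024; i++) {
          buffer[in++] = i;    /* produce an item */
          sem_post(&items);    /* signal: one more item available */
      }
      return NULL;
  }

  void *consumer(void *arg) {
      for (int i = 0; i < 1024; i++) {
          sem_wait(&items);    /* block until an item is available */
          int v = buffer[out++];
          (void) v;            /* consume the item */
      }
      return NULL;
  }

  int main(void) {
      pthread_t p, c;
      sem_init(&items, 0, 0);  /* not shared across processes; initial value 0 */
      pthread_create(&p, NULL, producer, NULL);
      pthread_create(&c, NULL, consumer, NULL);
      pthread_join(p, NULL);
      pthread_join(c, NULL);
      return 0;
  }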

R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Data Sharing in Pthreads

 Parent and child thread share global variables  However, it is not guaranteed that a write to a variable is seen immediately by the other thread  Synchronization functions make the global memory state consistent  For volatile variables, the compiler will generate code that reads from/write to memory at every access. – BUT: the architecture may still not guarantee that the memory state is immediately seen by the other thread.

Understanding memory/consistency models is an advanced topic. Whenever possible, use programming constructs that update the memory state implicitly. • use OpenMP parallel and workshare constructs with implicit barriers • enclose shared variables with synchronization functions in Pthreads 91 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Translating OpenMP into (P)threads

 Most architectures do not have instructions that support the execution of parallel loops directly  A compiler (an OpenMP preprocessor) must translate the source program into a thread- based form.  The thread-based form of the program makes calls to a runtime library that supports the OpenMP execution scheme.
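A hedged, hand-written sketch of what such a translation might look like for a simple parallel loop over b[i] = 2: the loop body is outlined into a routine and a runtime-style function hands each thread a chunk of the iteration space. All names (loop_task, run_parallel_loop) are invented for illustration; real OpenMP runtimes differ, and in particular they keep helper threads alive between loops, as the microtasking scheme below explains.

  #include <pthread.h>

  typedef struct { int lb, ub; double *b; } chunk_t;

  static void *loop_task(void *arg) {          /* outlined loop body */
      chunk_t *c = (chunk_t *) arg;
      for (int i = c->lb; i < c->ub; i++) c->b[i] = 2.0;
      return NULL;
  }

  static void run_parallel_loop(double *b, int n, int nthreads) {
      pthread_t tid[nthreads];
      chunk_t chunk[nthreads];
      int block = (n + nthreads - 1) / nthreads;      /* static block partitioning */
      for (int t = 0; t < nthreads; t++) {
          chunk[t].lb = t * block;
          chunk[t].ub = (t + 1) * block < n ? (t + 1) * block : n;
          chunk[t].b  = b;
          pthread_create(&tid[t], NULL, loop_task, &chunk[t]);
      }
      for (int t = 0; t < nthreads; t++)
          pthread_join(tid[t], NULL);                 /* implicit barrier at the end of the loop */
  }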

Instructions that support direct execution of parallel loops

1. Example architecture: Alliant FX/8 (1980s)
– machine instruction for parallel loop
– HW concurrency bus supports loop scheduling

  Source:               Generated code:
  a=0                   store #0,<a>
  DO i=1,n              load <n>,D6
    b(i) = 2            sub 1,D6
  ENDDO                 load &b,A1
  b=3                   cdoall D6
                          store #2,A1(D7.r)
                        endcdoall
                        store #3,<b>

  (D7 is reserved for the loop variable; it starts at 0.)

Usual Underlying Execution Scheme for Parallel Loops (a.k.a. Microtasking)

2. Microtasking scheme (dates back to early IBM mainframes)

[Diagram: four processors p1–p4. The master executes the sequential sections; at program start it calls init_helper_tasks (creating Pthreads); at each parallel loop it wakes the helpers (wakeup_helpers), and after the loop the helpers sleep again (sleep_helpers).]

Problem: parallel loop startup must be very fast — microtask startup takes a few µs, while Pthreads startup takes 100 µs or more.

Compiler Transformation for the Microtasking Scheme

Original:
  a=0
  C$OMP PARALLEL DO
  DO i=1,n
    b(i) = 2
  ENDDO
  b=3

Transformed:
  call init_microtasking()   // once at program start
  ...
  a=0
  call loop_scheduler(loopsub,i,1,n,b)
  b=3

  subroutine loopsub(mytask,lb,ub,b)
  DO i=lb,ub
    b(i) = 2
  ENDDO
  END

Loop scheduler (master task side): partition the loop iterations, wake up the helpers by setting their flag and the shared data (loopsub, lb, ub, param), call loopsub(id,lb,ub,param) itself, then barrier until all flags are reset and return.
Helper task side: loop — wait for the flag, call loopsub(...), reset the flag.

OpenMP vs MPI — lessons learned from a seismic processing application

 MPI and OpenMP can look like opposite ends of some programming method spectrum  However, even here it is true that – it is at least as important how you use a model than what the model is.  Our lesson will show that you can program in MPI style using OpenMP. In particular, the following are myths: – MPI exploits outermost parallelism, OpenMP exploits inner parallelism – MPI programs have good locality, OpenMP programs do not – MPI threads are very different from OpenMP threads 96 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Case Study of a Large-Scale Application

 Overview of the Application  Basic use of MPI and OpenMP  Issues Encountered  Performance Results

97 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Overview of Seismic

 Representative of modern seismic processing programs used in the search for oil and gas.  20,000 lines of Fortran. C subroutines interface with the operating system.  Available in a serial and a parallel variant.  Parallel code is available in a message- passing and an OpenMP form.  Is part of the SPEC HPC benchmark suite. – SPEC HPC is now retired  Includes 4 data sets: small to x-large. 98 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Seismic: Basic Characteristics

 Program structure: – 240 Fortran and 119 C subroutines.  Main algorithms: – FFT, finite difference solvers  Running time of Seismic (@ 500MFlops): – small data set: 0.1 hours – x-large data set: 48 hours  IO requirement: – small data set: 110 MB – x-large data set: 93 GB 99 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Basic Parallelization Scheme

 Split into p parallel tasks (p = number of processors)

  MPI:                        OpenMP:
  Program Seismic             Program Seismic
  If (rank=0)                 initialization
    initialization            C$OMP PARALLEL
  call main_subroutine()      call main_subroutine()
                              C$OMP END PARALLEL

→ SPMD execution scheme. Initialization done by master thread only 100 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Synchronization and Communication

MPI:
  compute
  communicate (all-to-all send/receive)

OpenMP:
  compute
  communicate:
    copy to shared buffer;
    barrier_synchronization;
    copy from shared buffer
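A hedged C/OpenMP sketch of this shared-buffer “communication” pattern (names such as shared_buf, my_data, and NELEM are illustrative, and the routine is assumed to be called inside a parallel region): each thread copies its data into a shared buffer, all threads meet at a barrier, and then each thread reads its neighbor’s data.

  #include <omp.h>
  #include <string.h>

  #define NELEM 4096

  void exchange(double shared_buf[][NELEM], double my_data[NELEM], int nthreads) {
      int id = omp_get_thread_num();
      memcpy(shared_buf[id], my_data, NELEM * sizeof(double));        /* copy to shared buffer */
      #pragma omp barrier                                             /* all copies are complete */
      int neighbor = (id + 1) % nthreads;
      memcpy(my_data, shared_buf[neighbor], NELEM * sizeof(double));  /* copy from shared buffer */
  }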

101 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Issues: Mixing Fortran and C  Bulk of computation is done Data privatization in OpenMP/C in Fortran #pragma omp thread private (item)  Utility routines are in C: float item; – IO operations void x(){ – data partitioning routines ... = item; – communication/synchronization } operations Data expansion in  OpenMP-related issues: absence a of OpenMP/C compiler – IF C/OpenMP compiler is not float item[num_proc]; available, data privatization void x(){ must be done through int thread; “expansion”. thread = omp_get_thread_num_(); – Mix of Fortran and C is ... = item[thread]; implementation dependent } 102 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Broadcast Common Blocks common /cc/ cdata common /cc/ cdata common /dd/ ddata common /dd/ ddata C initialization: #pragma OMP threadprivate /cc/,/dd/ IF (rank=0) c initialization cdata = ... cdata = ... ddata = ... ddata = ... ENDIF

C copy common blocks: C$OMP PARALLEL call syspcom () C$OMP+COPYIN(/cc/, /dd/) call main_subroutine() call main_subroutine() C$END PARALLEL

OpenMP issue: At the start of the parallel region it is not yet known which common blocks need to be copied in.

Solution: copy-in all common blocks => overhead 103 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 OpenMP Issues: Multithreading IO and malloc IO routines and memory allocation are called within parallel threads, inside C utility routines.  OpenMP requires all standard libraries and intrinsics to be thread-save. However the implementations are not always compliant. → system-dependent solutions need to be found  The same issue arises if standard C routines are called inside a parallel Fortran region or in non- standard syntax. Standard C compilers do not know anything about OpenMP and the thread safety requirement.

104 Note, this is not an issue in MPI

R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 More OpenMP Issues: Processor Affinity

 OpenMP currently does not specify or provide constructs for controlling the binding of threads to

processors.p1 2 3 4 task1 task2 task3 parallel task4 region tasks may migrate as a result of an OS event

 Processors can migrate, causing overhead. This behavior is system-dependent. System-dependent solutions may be available.

105 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Performance Results

Speedups of Seismic on an 8-processor system

[Charts: speedups of the message-passing (PVM/MPI) and OpenMP versions of Seismic on 1, 2, 4, and 8 processors, for the small and medium data sets.]

106 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Programming Vector Architectures

EE663, Spring 2012 Slide 107 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Vector Instructions

A vector instruction operates on a number of data elements at once. Example: vadd va,vb,vc,32 vector operation of length 32 on vector registers va,vb, and vc – va,vb,vc can be  Special cpu registers or memory → classical supercomputers  Regular registers, subdivided into shorter partitions (e.g., 64bit register split 8-way) → multi-media extensions – The operations on the different vector elements can overlap → vector pipelining

EE663, Spring 2012 Slide 108 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Applications of Vector Operations

 Science/engineering applications are typically regular with large loop iteration counts. This was ideal for classical supercomputers, which had long vectors (up to 256; vector pipeline startup was costly).  Graphics applications can exploit “multi- media” register features and instruction sets.

EE663, Spring 2012 Slide 109 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Basic Vector Transformation

  DO i=1,n
    A(i) = B(i)+C(i)        →    A(1:n)=B(1:n)+C(1:n)
  ENDDO

  DO i=1,n
    A(i) = B(i)+C(i)        →    A(1:n)=B(1:n)+C(1:n)
    C(i-1) = D(i)**2             C(0:n-1)=D(1:n)**2
  ENDDO

Here, the triplet notation means vector operation. Notice that this is not necessarily the same meaning as the array notation supported by some languages. EE663, Spring 2012 Slide 110 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Distribution and Vectorization The transformation done on the previous slide involves loop distribution. Loop distribution reorders computation and is thus subject to data dependence constraints. loop distribution DO i=1,n DO i=1,n A(i) = B(i)+C(i) A(i) = B(i)+C(i) ENDDO D(i) = A(i)+A(i-1) DO i=1,n dependence ENDDO D(i) = A(i)+A(i-1) ENDDO

vectorization The transformation is not legal if there is A(1:n)=B(1:n)+C(1:n) a lexical-backward dependence: D(1:n)=A(1:n)+A(0:n-1)

DO i=1,n loop-carried dependence Statement reordering may help A(i) = B(i)+C(i) resolve the problem. However, C(i+1) = D(i)**2 this is not possible if there is a ENDDO dependence cycle. 111

R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Slide 111 Vectorization Needs Expansion

... as opposed to privatization

  DO i=1,n                      DO i=1,n
    t = A(i)+B(i)        →        T(i) = A(i)+B(i)         (expansion)
    C(i) = t + t**2               C(i) = T(i) + T(i)**2
  ENDDO                         ENDDO

  T(1:n) = A(1:n)+B(1:n)                                   (vectorization)
  C(1:n) = T(1:n)+T(1:n)**2

EE663, Spring 2012 Slide 112 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Conditional Vectorization

DO i=1,n IF (A(i) < 0) A(i)=-A(i) ENDDO

conditional vectorization

WHERE (A(1:n) < 0) A(1:n)=-A(1:n)

EE663, Spring 2012 Slide 113 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Stripmining for Vectorization

  DO i=1,n                      DO i1=1,n,32
    A(i) = B(i)          →        DO i=i1,min(i1+31,n)
  ENDDO                             A(i) = B(i)
                                  ENDDO
                                ENDDO          (stripmining)

Stripmining turns a single loop into a doubly-nested loop for two-level parallelism. It also needs to be done by the code-generating compiler to split an operation into chunks of the available vector length.
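A small C sketch of the same transformation with an assumed vector length of 32: the single loop is split into an outer loop over strips and an inner loop over the elements of one strip.

  void stripmined_copy(double *A, const double *B, int n) {
      const int VL = 32;                        /* assumed vector length */
      for (int i1 = 0; i1 < n; i1 += VL) {      /* loop over strips */
          int upper = (i1 + VL < n) ? i1 + VL : n;
          for (int i = i1; i < upper; i++)      /* one strip: candidate for a vector operation */
              A[i] = B[i];
      }
  }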

EE663, Spring 2012 Slide 114 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Programming Accelerators

EE663, Spring 2012 Slide 115 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Examples of Accelerators

Nvidia GPGPU

Intel MIC (Xeon Phi)

116 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Why Accelerators?

At a high level, special-purpose architectures can execute certain program patterns faster or with less energy than general-purpose CPUs. Examples: – vector machines – GRAPE (molecular-dynamics processor) – DSP

– FPGA 117 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Why Accelerators?

More specifically  Heterogeneity as an opportunity 1 fat processor for sequential code + many lean cores for highly parallel code  Tradeoff: fat core/large cache versus many of them  GPUs exist anyway, use them for computation as well  Energy argument  high bandwidth AND low (effective) latency for certain access patterns

118 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Accelerators Today

 GPGPUs have accelerated many applications  Performance varies widely  Emphasis is on science/engineering applications  Fastest supercomputers use GPGPUs  Nvidia GPGPUs dominate – key architecture features: SIMD and multithreading  Intel’s LRB did not become a product. However, MIC (Xeon Phi) has just entered the market.  Programming productivity is poor – programmer needs to understand complex architecture – new programming constructs and techniques 119 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 CUDA Architecture

CPU GPU

120 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Programming Models

 History: – Explicit offloading and optimization: CUDA, OpenCL, OpenACC  Pure Shared Memory model: – Compiler performs the translation (OpenMP research compiler)  Shared Memory plus directives: – OpenACC, OpenMP 4.0, OpenMP (research compiler)

121 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Important Accelerator Programming Techniques

 defining kernels for offloading
 creating coalesced accesses (e.g., loop restructuring for coalescing, data transpose) – this is the most important optimization (note, this was true for vector machines as well)
 managing copy-in/out
 data placement in the accelerator memory hierarchy
 defining thread block size

122 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Programming Complexity Comparison (Jacobi) OpenMP code CUDA code float a[SIZE_2][SIZE_2]; float a[SIZE_2][SIZE_2]; float b[SIZE_2][SIZE_2]; __global__ void main_kernel0(float a[SIZE_2][SIZE_2], float b[SIZE_2][SIZE_2]) { float b[SIZE_2][SIZE_2]; int i,j; int _bid = (blockIdx.x+(blockIdx.y*gridDim.x)); int _gtid = (threadIdx.x+(_bid*blockDim.x)); int main (int argc, char *argv[]) { int i, j, k; i=(_gtid+1); if (i<=SIZE { for (k = 0; k < ITER; k++) { for (j=1; j<=SIZE j ++ ) #pragma omp parallel for private(i, j) for (i = 1; i <= SIZE; i++) a[i][j]=(b[(i-1)][j]+b[(i+1)][j]+b[i][(j-1)]+b[i][(j+1)])/4.0F; } for (j = 1; j <= SIZE; j++) a[i][j] = (b[i - 1][j] + b[i + 1][j] + b[i][j - 1] + b[i][j + 1]) / 4.0f; } #pragma omp parallel for private(i, j) __global__ void main_kernel1(float a[SIZE_2][SIZE_2], float b[SIZE_2][SIZE_2]) { int i,j; int _bid = (blockIdx.x+(blockIdx.y*gridDim.x)); int _gtid = (threadIdx.x+(_bid*blockDim.x)); for (i = 1; i <= SIZE; i++) for (j = 1; j <= SIZE; j++) i=(_gtid+1); if (i<=SIZE { b[i][j] = a[i][j]; for (j=1; j<=SIZE j ++ ) return 0; } b[i][j]=a[i][j]; } } int main(int argc, char * argv[]) { int i,j,k; float * gpu__a; float * gpu__b; gpuBytes=(SIZE_2*SIZE_2)*sizeof (float); CUDA_SAFE_CALL(cudaMalloc(((void * * )( & gpu__a)), gpuBytes)); dim3 dimBlock0(BLOCK_SIZE, 1, 1); dim3 dimGrid0(NUM_BLOCKS, 1, 1); CUDA_SAFE_CALL(cudaMemcpy(gpu__a, a, gpuBytes, cudaMemcpyHostToDevice)); gpuBytes=((SIZE_2*SIZE_2)*sizeof (float)); CUDA_SAFE_CALL(cudaMalloc(((void * * )( & gpu__b)), gpuBytes)); CUDA_SAFE_CALL(cudaMemcpy(gpu__b, b, gpuBytes, cudaMemcpyHostToDevice)); dim3 dimBlock1(BLOCK_SIZE, 1, 1); dim3 dimGrid1(NUM_BLOCKS, 1, 1); for (k=0; k>>((float (*)[SIZE_2])gpu__a, (float (*)[SIZE_2])gpu__b); main_kernel1<<>>((float (*)[SIZE_2)])gpu__a, (float (*)[SIZE_2])gpu__b); } gpuBytes=(SIZE_2*SIZE_2)*sizeof (float); CUDA_SAFE_CALL(cudaMemcpy(b, gpu__b, gpuBytes, cudaMemcpyDeviceToHost)); CUDA_SAFE_CALL(cudaFree(gpu__b)); gpuBytes=(SIZE_2*SIZE_2)*sizeof (float); CUDA_SAFE_CALL(cudaMemcpy(a, gpu__a, gpuBytes, cudaMemcpyDeviceToHost)); CUDA_SAFE_CALL(cudaFree(gpu__a)); fflush(stdout); fflush(stderr); return 0; }

R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 OpenMP Version of Jacobi

float a[SIZE_2][SIZE_2];
float b[SIZE_2][SIZE_2];

int main (int argc, char *argv[]) {
  int i, j, k;
  for (k = 0; k < ITER; k++) {
#pragma omp parallel for private(i, j)
    for (i = 1; i <= SIZE; i++)
      for (j = 1; j <= SIZE; j++)
        a[i][j] = (b[i - 1][j] + b[i + 1][j] + b[i][j - 1] + b[i][j + 1]) / 4.0f;
#pragma omp parallel for private(i, j)
    for (i = 1; i <= SIZE; i++)
      for (j = 1; j <= SIZE; j++)
        b[i][j] = a[i][j];
  }
  return 0;
}

R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 CUDA Version of Jacobi

/* GPU memory allocation and data transfer to GPU */
int main(int argc, char *argv[]) {
  int i, j, k;
  float *gpu__a;
  float *gpu__b;
  gpuBytes = (SIZE_2*SIZE_2)*sizeof(float);
  CUDA_SAFE_CALL(cudaMalloc(((void **)(&gpu__a)), gpuBytes));
  CUDA_SAFE_CALL(cudaMemcpy(gpu__a, a, gpuBytes, cudaMemcpyHostToDevice));
  gpuBytes = ((SIZE_2*SIZE_2)*sizeof(float));
  CUDA_SAFE_CALL(cudaMalloc(((void **)(&gpu__b)), gpuBytes));
  CUDA_SAFE_CALL(cudaMemcpy(gpu__b, b, gpuBytes, cudaMemcpyHostToDevice));
  dim3 dimBlock0(BLOCK_SIZE, 1, 1);
  dim3 dimGrid0(NUM_BLOCKS, 1, 1);
  dim3 dimBlock1(BLOCK_SIZE, 1, 1);
  dim3 dimGrid1(NUM_BLOCKS, 1, 1);

  /* GPU kernel execution */
  for (k = 0; k < ITER; k++) {
    main_kernel0<<<dimGrid0, dimBlock0>>>((float (*)[SIZE_2])gpu__a, (float (*)[SIZE_2])gpu__b);
    main_kernel1<<<dimGrid1, dimBlock1>>>((float (*)[SIZE_2])gpu__a, (float (*)[SIZE_2])gpu__b);
  }

  /* Data transfer back to CPU and GPU memory deallocation */
  gpuBytes = (SIZE_2*SIZE_2)*sizeof(float);
  CUDA_SAFE_CALL(cudaMemcpy(b, gpu__b, gpuBytes, cudaMemcpyDeviceToHost));
  gpuBytes = (SIZE_2*SIZE_2)*sizeof(float);
  CUDA_SAFE_CALL(cudaMemcpy(a, gpu__a, gpuBytes, cudaMemcpyDeviceToHost));
  CUDA_SAFE_CALL(cudaFree(gpu__b));
  CUDA_SAFE_CALL(cudaFree(gpu__a));
  fflush(stdout);
  fflush(stderr);
  return 0;
}

R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Intra-Thread vs. Inter-Thread Locality

[Figure: Common CPU memory model vs. CUDA memory model. On the CPU, threads 0 and 1 access global memory through separate caches, which can lead to false sharing. On the GPU, threads 0–3 access global memory directly, and coalesced global memory accesses reduce overall latency.]

 Intra-thread locality is beneficial to both the OpenMP and CUDA models.  Inter-thread locality plays a critical role in the CUDA model.
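A minimal CUDA sketch of the contrast above (kernel and variable names are illustrative, not from the lecture): in the first kernel each thread's accesses are strided, so a warp's requests scatter across memory; in the second, consecutive threads touch consecutive addresses and the accesses coalesce.

// Strided access: thread t touches in[t*stride]; the 32 requests of a warp
// hit 32 separate memory segments (a similar pattern is harmless on a CPU,
// where each thread's accesses stay within its own cache lines).
__global__ void strided(const float *in, float *out, int stride, int n)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n) out[t] = in[t * stride] * 2.0f;
}

// Coalesced access: thread t touches in[t]; the 32 requests of a warp fall
// on consecutive addresses and are served by a few wide transactions.
__global__ void coalesced(const float *in, float *out, int n)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n) out[t] = in[t] * 2.0f;
}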

R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Effective GPGPU Programming Techniques: Parallel Loop-Swap

In the original OpenMP form (#pragma omp parallel for on the outer i loop), each thread walks a contiguous stretch of the array by itself, so at any time t the threads access widely separated memory locations – good intra-thread locality on a CPU, but uncoalesced on a GPU. Parallel loop-swap interchanges the loop nest and parallelizes the contiguous dimension with schedule(static, 1), so that at time t threads T0–T3 access consecutive elements of global memory and the accesses coalesce.
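A hedged before/after sketch of the transformation in C/OpenMP; the array names, bounds, and loop body are illustrative stand-ins – only the loop structure matters here.

#define N 1024
#define M 1024
float a[N][M], b[N][M];

void before(void)
{
    /* CPU-friendly form: the outer loop is parallel and each thread walks
       a contiguous row (intra-thread locality).  Translated naively to a
       GPU, the threads of a warp access addresses M elements apart. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int k = 0; k < M; k++)
            a[i][k] = b[i][k] * 2.0f;
}

void after_loop_swap(void)
{
    /* Parallel loop-swap: the contiguous dimension k is the one that is
       parallelized, and schedule(static,1) gives consecutive iterations to
       consecutive threads, so at each step threads T0, T1, T2, T3, ...
       touch consecutive elements (coalesced when mapped to the GPU). */
    #pragma omp parallel for schedule(static, 1)
    for (int k = 0; k < M; k++)
        for (int i = 0; i < N; i++)
            a[i][k] = b[i][k] * 2.0f;
}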

R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Effective GPGPU Programming Techniques: Loop Collapsing

With #pragma omp parallel for on the outer loop alone, only as many threads as outer iterations are available, and each thread's inner-loop accesses are uncoalesced. Loop collapsing (#pragma omp for collapse(2) schedule(static, 1)) merges the nest into a single iteration space, so that threads T0–T7 each take one element of the combined space: more threads get work, and neighboring threads access neighboring elements.
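A hedged sketch of the same idea in C/OpenMP, with illustrative array names, bounds, and loop body.

#define N 8           /* few outer iterations -> too few threads by itself */
#define M 4096
float a[N][M];

void before(void)
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++)          /* at most N parallel threads */
        for (int k = 0; k < M; k++)
            a[i][k] = (float)(i + k);
}

void after_collapse(void)
{
    /* The two loops form one N*M iteration space; schedule(static,1) hands
       consecutive (i,k) pairs to consecutive threads, which is what a
       one-thread-per-element GPU kernel does. */
    #pragma omp parallel
    #pragma omp for collapse(2) schedule(static, 1)
    for (int i = 0; i < N; i++)
        for (int k = 0; k < M; k++)
            a[i][k] = (float)(i + k);
}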

R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Effective GPGPU Programming Techniques: Loop Collapsing

The same idea applies to irregular loop nests, such as a sparse-matrix computation whose inner loop runs over the row pointers rptr. The translated GPU code consists of a collapsed loop, in which consecutive threads T0–T7 work on consecutive elements, followed by a reduction loop in which each thread with tid2 < n_rows combines the contributions of its row (for k = rptr[tid2] .. rptr[tid2+1]).
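One way to picture the collapsed one-thread-per-element structure is the CUDA sketch below. It is an illustrative simplification, not the code the research compiler generates: the arrays rowidx, col, val, x, and y are assumed names, rowidx[k] is assumed to hold the row of nonzero k (precomputed from rptr), and the per-row reduction is approximated with atomicAdd.

// CSR sparse matrix-vector product, one GPU thread per nonzero element.
__global__ void spmv_collapsed(int n_nonzeros,
                               const int *rowidx, const int *col,
                               const float *val, const float *x, float *y)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;   // one nonzero per thread
    if (k < n_nonzeros) {
        // consecutive threads read consecutive val[k]/col[k] -> coalesced
        atomicAdd(&y[rowidx[k]], val[k] * x[col[k]]);
    }
}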

R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Effective GPGPU Programming Techniques: Matrix-Transpose

With #pragma omp parallel for schedule(static, 1), threads T0–T3 each handle one value of i and walk the other dimension of A. If the threaded index is not the fastest-varying array dimension, the memory accesses at any time t are strided across global memory. Applying transpose(A)¹ before the loop makes the threaded index the contiguous one, so that at time t the threads access consecutive addresses in global memory.

¹ The OpenMP standard does not include a transpose directive.
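A hedged CUDA sketch of the effect of the data transpose (names and bounds are illustrative): in the first kernel each thread walks a row of A, so at each step the threads of a warp read addresses a full row apart; after the host (or the translator) stores the transpose, the same logical computation reads consecutive addresses at each step.

#define N 1024
#define M 1024

// Before: thread i walks row i of A.  At step k, a warp reads
// A[0][k], A[1][k], ... which are M floats apart (uncoalesced).
__global__ void kernel_rowwise(const float (*A)[M], float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        float s = 0.0f;
        for (int k = 0; k < M; k++)
            s += A[i][k];
        out[i] = s;
    }
}

// After: AT[k][i] = A[i][k] has been stored.  At step k the warp reads
// AT[k][0], AT[k][1], ... which are consecutive (coalesced).
__global__ void kernel_transposed(const float (*AT)[N], float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        float s = 0.0f;
        for (int k = 0; k < M; k++)
            s += AT[k][i];
        out[i] = s;
    }
}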

R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Effective GPGPU Programming Techniques: Managing Copy-in/out

1. Eliminate copy-out of data that are not live out 2. Leave data that is needed in future kernel invocations in the device memory – eliminate copy-out if not needed on the CPU side 3. Narrow the copy-in data range to the minimum needed 4. Copy-in early (overlap copy-in with execution of previous kernel) 5. Pipeline copy-in and execution within same kernel.
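A minimal host-side sketch of techniques 1 and 2, reusing the declarations (SIZE_2, ITER, BLOCK_SIZE, NUM_BLOCKS, a, b, main_kernel0/1) from the CUDA Jacobi slide above; the assumption that only a is live-out on the host is illustrative.

#include <cuda_runtime.h>

void jacobi_on_device(void)
{
    float *d_a, *d_b;
    size_t bytes = (size_t)SIZE_2 * SIZE_2 * sizeof(float);

    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);

    dim3 grid(NUM_BLOCKS, 1, 1), block(BLOCK_SIZE, 1, 1);
    for (int k = 0; k < ITER; k++) {
        /* technique 2: a and b stay resident in device memory between kernels;
           no per-iteration transfers */
        main_kernel0<<<grid, block>>>((float (*)[SIZE_2])d_a, (float (*)[SIZE_2])d_b);
        main_kernel1<<<grid, block>>>((float (*)[SIZE_2])d_a, (float (*)[SIZE_2])d_b);
    }

    /* technique 1: assuming only a is needed on the host afterwards,
       b is not copied back */
    cudaMemcpy(a, d_a, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
}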

Note that techniques 2,4,5 increase the demand on device memory 131 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Effective GPGPU Programming Techniques: Setting Thread Block Size

For a loop
  for i = 1, n
     … = a[i]
there can be at most n (logical) threads.

Choosing a thread block size: if each thread spends 3t on data access and t on computation, then at least 3 thread blocks are needed, in this example, to overcome the memory latency – while one block waits for its data, the other blocks can compute.

Note that different vendors use the terms thread and block differently 132 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Effective GPGPU Programming Techniques: Memory Management

 Accelerators may not provide virtual memory management  Blocking/stripmining may be needed to fit data within available memory space  Side benefit of blocking: overlapping copy-in/out and computation
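A hedged sketch of strip-mining a large array through limited device memory, double-buffered across two CUDA streams so that the transfer of one strip overlaps with the processing of another; process_strip, STRIP, and the launch configuration are illustrative.

#include <cuda_runtime.h>

#define STRIP (1 << 20)                 /* elements per strip (illustrative) */

__global__ void process_strip(float *d, int n)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n) d[t] = d[t] * 2.0f + 1.0f;
}

void process_large_array(float *host, long total)
{
    /* note: for the copies to actually overlap with kernels, the host
       buffer should be pinned (cudaHostAlloc); omitted here */
    float *d_buf[2];
    cudaStream_t s[2];
    for (int i = 0; i < 2; i++) {
        cudaMalloc((void **)&d_buf[i], STRIP * sizeof(float));
        cudaStreamCreate(&s[i]);
    }

    for (long off = 0, strip = 0; off < total; off += STRIP, strip++) {
        int i = (int)(strip % 2);                       /* double buffering */
        int n = (int)((total - off < STRIP) ? (total - off) : STRIP);

        cudaMemcpyAsync(d_buf[i], host + off, n * sizeof(float),
                        cudaMemcpyHostToDevice, s[i]);
        process_strip<<<(n + 255) / 256, 256, 0, s[i]>>>(d_buf[i], n);
        cudaMemcpyAsync(host + off, d_buf[i], n * sizeof(float),
                        cudaMemcpyDeviceToHost, s[i]);
        /* while one stream copies its strip, the other can compute */
    }
    cudaDeviceSynchronize();
    for (int i = 0; i < 2; i++) { cudaStreamDestroy(s[i]); cudaFree(d_buf[i]); }
}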

133 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 SMP Programming Errors

 Shared memory parallel programming is a mixed bag: – It saves the programmer from having to map data onto multiple processors. In this sense, it’s much easier. – It opens up a range of new errors coming from unanticipated shared resource conflicts.

134 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 2 major SMP errors

 Race Conditions

 The outcome of a program depends on the detailed timing of the threads in the team.  Deadlock

 Threads lock up waiting on a locked resource that will never become free.  Livelock

 A termination condition is never reached

135 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Race Conditions

C$OMP PARALLEL SECTIONS
      A = B + C
C$OMP SECTION
      B = A + C
C$OMP SECTION
      C = B + A
C$OMP END PARALLEL SECTIONS

 The result varies unpredictably based on the detailed order of execution for each section.  Wrong answers produced without warning!

136 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Race Conditions: A complicated (silly?) solution

      ICOUNT = 0
C$OMP PARALLEL SECTIONS
      A = B + C
      ICOUNT = 1
C$OMP FLUSH ICOUNT
C$OMP SECTION
 1000 CONTINUE
C$OMP FLUSH ICOUNT
      IF (ICOUNT .LT. 1) GO TO 1000
      B = A + C
      ICOUNT = 2
C$OMP FLUSH ICOUNT
C$OMP SECTION
 2000 CONTINUE
C$OMP FLUSH ICOUNT
      IF (ICOUNT .LT. 2) GO TO 2000
      C = B + A
C$OMP END PARALLEL SECTIONS

 In this example, we choose the assignments to occur in the order A, B, C. – ICOUNT forces this order. – FLUSH so each thread sees updates to ICOUNT – NOTE: you need the flush on each read and each write. 137 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 A More Subtle Race Condition

C$OMP PARALLEL SHARED (X)
C$OMP&         PRIVATE(ID)

C$OMP PARALLEL DO REDUCTION(+:X)
      DO 100 I=1,100
         TMP = WORK(I)
         X = X + TMP
 100  CONTINUE
C$OMP END DO
C$OMP END PARALLEL

 The result varies unpredictably because access to the shared variable TMP is not protected.  Wrong answers produced without warning!  The user probably wanted to make TMP private.

138 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Avoiding Race Conditions

 Easiest solution: don’t access any variable that is written by another parallel thread. – this is always the case for fully parallel loops  More complicated: if data dependences are unavoidable, synchronize them properly. – be aware that this reduces performance  Avoid creating your own synchronization by “waiting for a flag set by the other thread”. – use the provided synchronization primitives instead  (Desirable) race conditions seen in real programs: – parallel shuffle – parallel search
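A small C/OpenMP sketch of the first two points (names are illustrative): the temporary is private because it is declared inside the loop, and the unavoidable update of the shared accumulator uses the provided reduction clause instead of a hand-rolled flag wait.

float avoid_races(const float *work, int n)
{
    float x = 0.0f;

    /* tmp is declared inside the loop body, so it is private to each
       thread; the cross-iteration update of the shared accumulator x is
       handled by the reduction clause (a critical or atomic construct
       would also be a provided, correct primitive). */
    #pragma omp parallel for reduction(+: x)
    for (int i = 0; i < n; i++) {
        float tmp = work[i] * work[i];
        x += tmp;
    }
    return x;
}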

139 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Deadlock

      CALL OMP_INIT_LOCK (LCKA)
C$OMP PARALLEL SECTIONS
C$OMP SECTION
      CALL OMP_SET_LOCK(LCKA)
      IVAL = DOWORK()
      IF (IVAL .GT. TOL) THEN
         CALL ERROR (IVAL)
      ELSE
         CALL OMP_UNSET_LOCK (LCKA)
      ENDIF
C$OMP SECTION
      CALL OMP_SET_LOCK(LCKA)
      CALL USE_B_and_A (RES)
      CALL OMP_UNSET_LOCK(LCKA)
C$OMP END PARALLEL SECTIONS

 If A is locked in the first section and the if statement branches around the unset lock, threads running the other sections deadlock waiting for the lock to be released.  Make sure you release your locks. Always, even in error situations! 140 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Livelock

C$OMP PARALLEL PRIVATE(ID)
      ID = OMP_GET_THREAD_NUM()
      N = OMP_GET_NUM_THREADS()
 1000 CONTINUE
      PHASES(ID) = UPDATE(U, ID)
C$OMP SINGLE
      RES = MATCH (PHASES, N)
C$OMP END SINGLE
      IF (RES*RES .LT. TOL) GO TO 2000
      GO TO 1000
 2000 CONTINUE
C$OMP END PARALLEL

 This shows a race condition and a livelock.  If the square of RES is never smaller than TOL, the program spins endlessly in livelock.

R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Livelock

      ICOUNT = 0
C$OMP PARALLEL PRIVATE (ID)
      ID = OMP_GET_THREAD_NUM()
      N = OMP_GET_NUM_THREADS()
 1000 CONTINUE
      PHASES(ID) = UPDATE(U, ID)
C$OMP BARRIER
C$OMP SINGLE
      RES = MATCH (PHASES, N)
      ICOUNT = ICOUNT + 1
C$OMP END SINGLE
      IF (RES*RES .LT. TOL) GO TO 2000
      IF (ICOUNT .GT. MAX) GO TO 2000
      GO TO 1000
 2000 CONTINUE
C$OMP END PARALLEL

 Solution: – Fix the race with a barrier before the single. This may fix the MATCH operation and fix the livelock error. – Decide on a maximum number of iterations, and use a loop with that number rather than an infinite loop.

R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 SMP (OpenMP) Error Advice

 Are you using threadsafe libraries?  I/O inside a parallel region can interleave unpredictably.  Make sure you understand what your constructors are doing with private objects.  Watch for private variables masking globals.  Understand when shared memory is coherent. When in doubt, use FLUSH.  NOWAIT removes implied barriers. 143 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Navigating through the Danger Zones

 Option 1: Analyze your code to make sure every semantically permitted interleaving of the threads yields the correct results. – This can be prohibitively difficult due to the explosion of possible interleavings. – Tools like Intel’s Thread Checker (Assure) can help.

144 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Navigating through the Danger Zones

 Option 2: Write SMP code that is portable and equivalent to the sequential form. – Use a safe subset of OpenMP. – Follow a set of “rules” for Sequential Equivalence.

145 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Portable Sequential Equivalence

 What is Portable Sequential Equivalence (PSE)?

 A program is sequentially equivalent if its results are the same with one thread and many threads.

 For a program to be portable (i.e. runs the same on different platforms/compilers) it must execute identically when the OpenMP constructs are used or ignored.

146 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Portable Sequential Equivalence

 Advantages of PSE

 A PSE program can run on a wide range of hardware and with different compilers - minimizes software development costs.

 A PSE program can be tested and debugged in serial mode with off-the-shelf tools - even if they don’t support OpenMP.

147 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 2 Forms of Sequential Equivalence

 Two forms of Sequential equivalence based on what you mean by the phrase “equivalent to the single threaded execution”:

 Strong SE: bitwise identical results.

 Weak SE: equivalent mathematically but due to quirks of floating point arithmetic, not bitwise identical.

148 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Strong Sequential Equivalence: Rules – Control data scope with the base language

 Avoid the data scope clauses, except…

 Only use private for scratch variables local to a block (e.g. temporaries or loop control variables) whose global initialization doesn’t matter. – Locate all cases where a shared variable written by one thread is accessed (read or written) by another thread.

 All accesses to the variable must be protected.

 If multiple threads combine results into a single value, enforce sequential order.

 Do not use the reduction clause. 149 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Strong Sequential Equivalence: example

C$OMP PARALLEL PRIVATE(I, TMP)
C$OMP DO ORDERED
      DO 100 I=1,NDIM
         TMP = ALG_KERNEL(I)
C$OMP ORDERED
         CALL COMBINE (TMP, RES)
C$OMP END ORDERED
 100  CONTINUE
C$OMP END PARALLEL

 Everything is shared except I and TMP. These can be private since they are not initialized and they are unused outside the loop.  The summation into RES occurs in the sequential order, so the result from the program is bitwise compatible with the sequential program.  Problem: can be inefficient if threads finish in an order that’s greatly different from the sequential order. 150

R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Weak Sequential equivalence

 For weak sequential equivalence only mathematically valid constraints are enforced.

 Computer floating-point arithmetic is not associative, so reordering the operations of a reduction can change the result.

 In most cases, no particular grouping of floating point operations is mathematically preferred so why take a performance hit by forcing the sequential order? – In most cases, if you need a particular grouping of floating point operations, you have a bad algorithm.  How do you write a program that is portable and satisfies weak sequential equivalence? – Follow the same rules as the strong case, but relax sequential ordering constraints. 151 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Weak equivalence: example

C$OMP PARALLEL PRIVATE(I, TMP)
C$OMP DO
      DO 100 I=1,NDIM
         TMP = ALG_KERNEL(I)
C$OMP CRITICAL
         CALL COMBINE (TMP, RES)
C$OMP END CRITICAL
 100  CONTINUE
C$OMP END PARALLEL

 The summation into RES occurs one thread at a time, but in any order, so the result is not bitwise compatible with the sequential program.  Much more efficient, but some users get upset when low-order bits vary between program runs.

152 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Sequential Equivalence isn’t a Silver Bullet

C$OMP PARALLEL
C$OMP& PRIVATE(I, ID, TMP, RVAL)
      ID = OMP_GET_THREAD_NUM()
      N = OMP_GET_NUM_THREADS()
      RVAL = RAND ( ID )
C$OMP DO
      DO 100 I=1,NDIM
         RVAL = RAND (RVAL)
         TMP = RAND_ALG_KERNEL(RVAL)
C$OMP CRITICAL
         CALL COMBINE (TMP, RES)
C$OMP END CRITICAL
 100  CONTINUE
C$OMP END PARALLEL

 This program follows the weak PSE rules, but it’s still wrong.  In this example, RAND() may not be thread safe. Even if it is, the pseudo-random sequences might overlap, thereby throwing off the basic statistics.

153 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Map Reduce

 Wikipedia: MapReduce is – a programming model for processing large data sets, and – the name of an implementation of the model by Google. MapReduce is typically used to do distributed computing on clusters of computers.

MapReduce libraries have been written in many programming languages. A popular, free implementation is Apache Hadoop.

154 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Why MapReduce?

“MapReduce provides regular programmers the ability to produce parallel distributed programs much more easily, by requiring them to write only the simpler Map() and Reduce() functions, which focus on the logic of the specific problem at hand”

However, you need a good understanding of how map, reduce, and the overall system interact.

Many problems can be expressed as such a two-step algorithm 155 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Map step “The master node takes the input, divides it into smaller sub- problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node.” – Map is performed fully parallel on each subproblem

156 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Reduce step

“ The master node then collects the answers to all the sub- problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.” – Reduce is not fully parallel, but may be expressed as a sequence of parallel combine steps.

157 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 A Simple Example: Parallel Reduction Expressed as MapReduce  Problem: sum all n elements of an array  Master: splits the array into #processor parts; calls map for each processor  Map: sums all elements of the assigned array part; returns the partial sum  Reduce: receives the partial sums; returns their sum
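A minimal shared-memory sketch of this split in C with OpenMP, standing in for a real MapReduce system: the "master" is the splitting loop, map_sum handles one part, and reduce_sum combines the partial sums. All names are illustrative.

#include <stdlib.h>

/* map: sum the assigned part of the array */
static double map_sum(const double *part, long len)
{
    double s = 0.0;
    for (long i = 0; i < len; i++) s += part[i];
    return s;
}

/* reduce: combine the partial sums */
static double reduce_sum(const double *partial, int parts)
{
    double s = 0.0;
    for (int p = 0; p < parts; p++) s += partial[p];
    return s;
}

double mapreduce_sum(const double *a, long n, int parts)
{
    double *partial = malloc(parts * sizeof(double));
    long chunk = (n + parts - 1) / parts;

    /* master: split the array and run map on each part in parallel */
    #pragma omp parallel for
    for (int p = 0; p < parts; p++) {
        long lo = p * chunk;
        long len = (lo + chunk > n) ? (n - lo) : chunk;
        partial[p] = (len > 0) ? map_sum(a + lo, len) : 0.0;
    }

    double result = reduce_sum(partial, parts);
    free(partial);
    return result;
}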

Processors can assume the role of Master (e.g. if the assigned part is larger than a threshold) and engage additional processors, in turn. => leads to a combining tree for the summation. 158 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 A 5-step Process

1. Prepare the Map() input – the "MapReduce system" designates Map processors, assigns the K1 input key value each processor would work on, and provides that processor with all the input data associated with that key value. 2. Run the user-provided Map() code – Map() is run exactly once for each K1 key value, generating output organized by key values K2. MAP 3. "Shuffle" the Map output to the Reduce processors – the MapReduce system designates Reduce processors, assigns the K2 key value each processor would work on, and provides that processor with all the Map- generated data associated with that key value. 4. Run the user-provided Reduce() code – Reduce() is run exactly once for each K2 key value produced by the Map step. REDUCE 5. Produce the final output – the MapReduce system collects all the Reduce output, and sorts it by K2 to produce the final outcome.

159 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 A More Advanced Example: Word Count

function Map(String name, String document):
  count each word in the document
  return the map < word, #occurrences >

function Reduce(String word, List #occurrences):
  sum = Σ #occurrences
  return < word, sum >

• Map returns multiple results, each annotated with a key. In our example, the key identifies a specific word. • The shuffle function combines the values (#occurrences) for each key returned by a map function into a list and calls Reduce(key, list) • shuffle can take a substantial amount of time. It is implemented by the system; the user only needs to write Map and Reduce.

160 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Other MapReduce Problems/Applications

 searching  sorting  web link-graph reversal  web access log stats  inverted index construction  document clustering  machine learning  statistical machine translation

MapReduce has also been demonstrated on some linear algebra problems, such as matrix multiply “Basically wherever you see linear algebra (matrix/vector operations) you can apply Map Reduce” 161 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 OTHER PROGRAMMING MODELS, LANGUAGES, CONCEPTS

162 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Data Flow Parallelism is derived from the availability of data in a data flow graph.  Joule (1996): – Concurrent dataflow programming language for building distributed applications. The order of statements within a block is irrelevant to the operation of the block. Statements are executed whenever possible, based on their inputs. Everything in Joule happens by sending messages. There is no control flow. Instead, the programmer describes the flow of data, making it a data flow programming language.  SISAL (1993): – Uses and implements single-assignment concepts. Produces a data flow graph. (Single assignment removes anti and output dependences, which exposes maximum parallelism)  StarSs (2009): – Hierarchical task-based programming with StarSs, Badia, Ayguade, Labarta (2009) 163 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Global Address Space Languages (GAS)

 UPC, co-array Fortran, HPF  Languages for distributed-memory machines  The user sees a shared address space  There are constructs for data distribution. They can be explicit, user-provided data- distribution directives or array dimensions that indicate the processor ID.  Compilers translate the GAS program into a message-passing form. 164 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Non-Blocking Operations

 Fundamental mechanism for overlapping communication and (multiple) operations.

a := b
…
synch()

The computation of the value b and its transfer to a is initiated at the assignment and completed at the synch() point; it overlaps with the execution of the “…” statements.
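The same idea appears in message passing as non-blocking sends and receives. A minimal MPI sketch (buffer names and the helper routines are illustrative): the transfer is initiated, independent work overlaps with it, and MPI_Waitall plays the role of synch().

#include <mpi.h>

void do_independent_work(void);          /* assumed application routines */
void use_data(double *buf, int n);

void exchange_and_compute(double *sendbuf, double *recvbuf, int n,
                          int partner, MPI_Comm comm)
{
    MPI_Request req[2];

    /* initiate the transfer (corresponds to "a := b" above) */
    MPI_Irecv(recvbuf, n, MPI_DOUBLE, partner, 0, comm, &req[0]);
    MPI_Isend(sendbuf, n, MPI_DOUBLE, partner, 0, comm, &req[1]);

    do_independent_work();               /* overlaps with the communication */

    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);   /* corresponds to synch() */
    use_data(recvbuf, n);                /* recvbuf is valid only after the wait */
}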

 If b is a simple value, this is essentially the same as prefetch  If b is an operation, this involves the spawning of a parallel task 165 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 CSP – Communicating Sequential Processes (Hoare, 1978)

 one of the earliest concepts for expressing parallel activities  center is the process  synchronization and communication constructs are very important – semaphores, monitors, locks, etc.  OCCAM/Transputer

166 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 BSP – Bulk Synchronous Parallelism

 Structuring a parallel computation into phases of full parallelism followed by communication phases. (Timeline: compute – communicate – compute – communicate – …)
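A hedged C/MPI sketch of the superstep structure: a fully parallel local compute phase followed by a communication phase (here MPI_Allgather), which also provides the synchronization separating one superstep from the next. compute_local and the buffer layout are illustrative.

#include <mpi.h>

void compute_local(double *local, int n, const double *global);  /* assumed */

void bsp_style(double *local, int local_n, double *global, int steps,
               MPI_Comm comm)
{
    for (int s = 0; s < steps; s++) {
        /* compute phase: purely local work, fully parallel across processes */
        compute_local(local, local_n, global);

        /* communicate phase: exchange results; the collective also acts as
           the barrier that separates this superstep from the next */
        MPI_Allgather(local, local_n, MPI_DOUBLE,
                      global, local_n, MPI_DOUBLE, comm);
    }
}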

167 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Parallel models differ in the primary concerns they require the user to deal with

 data flow: focus on the data needed and produced by activities  control flow: how control flows and transfers – focus on starting and ending parallel activities  control transfer: focus on how CPU resources are assigned to activities  messaging: focus on communication (and implied synchronization) between parallel activities

168 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 High-level Problem Classes

 Fixed problem to be solved in shortest possible time. – HPC applications  On-demand services to be made available on a continual basis. – Operating systems, internet services  Continuous data streams to be processed. – Image processing

169 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Wikipedia: Concurrent and parallel programming languages

 Actor Model  Coordination Languages  CSP-based  Dataflow  Distributed  Event-driven and hardware description  Functional  GPU languages  Logic programming  Multi-threaded  Object oriented  PGAS  Unsorted 170 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013 Wikipedia: Parallel programming models

 Process interaction  Shared memory  Message passing  Implicit parallelism  Problem decomposition  Task parallelism  Data parallelism

171 R. Eigenmann, Programming Parallel Machines ECE 563 Spring 2013