Optimizing Compilers

Source-to-source (high-level) translation and optimization for multicores

Rudolf (Rudi) Eigenmann

R. Eigenmann, Optimizing compilers Slide 1 I. Motivation and Introduction: Optimizing Compilers are in the Center of the (Software) Universe They translate increasingly advanced human interfaces (programming languages) onto increasingly complex target machines

[Figure, today and tomorrow: a translator maps a problem specification, written in human (programming) languages such as C, C++, Java, and Fortran, onto machine architectures: workstations and multicores today, globally distributed/cloud and HPC resources tomorrow. Building this translator is the grand challenge.]

Processors have multiple cores. Parallelization is a key optimization. R. Eigenmann, Optimizing compilers Slide 2 Issues in Optimizing / Parallelizing Compilers

The goal: we would like to run standard (C, C++, Java, Fortran) programs on common parallel computers. This leads to the following high-level issues:
• How to detect parallelism?
• How to map parallelism onto the machine?
• How to architect a compiler that can do this?

R. Eigenmann, Optimizing compilers Slide 3 Detecting Parallelism

• Program analysis techniques • Data dependence analysis • Dependence removing techniques • Parallelization in the presence of dependences • Runtime dependence detection

R. Eigenmann, Optimizing compilers Slide 4 Mapping Parallelism onto the Machine
• Exploiting parallelism at many levels
  – Multiprocessors and multi-cores (our focus)
  – Distributed computers (clusters or global networks)
  – Heterogeneous architectures
  – Instruction-level parallelism
  – Vector machines
• Exploiting memory organizations
  – Data placement
  – Locality enhancement
  – Data communication

R. Eigenmann, Optimizing compilers Slide 5 Architecting a Compiler

• Compiler generator languages and tools • Internal representations • Implementing analysis and transformation techniques • Orchestrating compiler techniques (when to apply which technique) • Benchmarking and performance evaluation

R. Eigenmann, Optimizing compilers Slide 6 Early, Foundational Parallelizing Compiler Books and Survey Papers

Books:
• Michael Wolfe: High-Performance Compilers for Parallel Computing (1996)
• Utpal Banerjee: several books on data dependence analysis and transformations
• Ken Kennedy, John Allen: Optimizing Compilers for Modern Architectures: A Dependence-based Approach (2001)
• H. Zima and B. Chapman: Supercompilers for Parallel and Vector Computers (1990)
• A. Darte, Y. Robert, and F. Vivien: Scheduling and Automatic Parallelization (2000)

Survey Papers:
• Rudolf Eigenmann and Jay Hoeflinger, Parallelizing and Vectorizing Compilers, Wiley Encyclopedia of Electrical Engineering, John Wiley & Sons, Inc., 2001
• Utpal Banerjee, Rudolf Eigenmann, Alexandru Nicolau, and David Padua, Automatic Program Parallelization, Proceedings of the IEEE, 81(2), February 1993
• David F. Bacon, Susan L. Graham, Compiler Transformations for High-Performance Computing, ACM Computing Surveys, 26(4), December 1994, pages 345-420

R. Eigenmann, Optimizing compilers Slide 7 Course Approach

There are many schools on optimizing compilers. Our approach is performance-driven. We will discuss: – Performance of parallelization techniques – Analysis and Transformation techniques in the Cetus compiler (for multiprocessors/cores) – Additional transformations (for GPGPUs and other architectures) – Compiler infrastructure considerations

R. Eigenmann, Optimizing compilers Slide 8 The Heart of Automatic Parallelization Data Dependence Testing

If a loop does not have data dependences between any two iterations then it can be safely executed in parallel

In many science/engineering applications, loop parallelism is most important. In non-numerical programs, other control structures are also important.

R. Eigenmann, Optimizing compilers Slide 9 Data Dependence Tests: Motivating Examples

Loop Parallelization: can the iterations of this loop be run concurrently?

DO i=1,100,2
  B[2*i] = ...
  ... = B[2*i] + B[3*i]
ENDDO

DD testing is needed to detect parallelism.

Statement Reordering: can these two statements be swapped?

DO i=1,100,2
  B[2*i] = ...
  ... = B[3*i]
ENDDO

DD testing is important not just for detecting parallelism.

A data dependence exists between two data references iff:
• both references access the same storage location, and
• at least one of them is a write access

R. Eigenmann, Optimizing compilers Slide 10 Data Dependence Tests: Concepts

Terms for data dependences between statements of loop iterations:
• Distance (vector): indicates how many iterations apart the source and sink of the dependence are.
• Direction (vector): essentially the sign of the distance. There are different notations: (<,=,>) or (-1,0,+1), meaning the dependence goes from an earlier to a later iteration, stays within the same iteration, or goes from a later to an earlier iteration, respectively.
• Loop-carried (or cross-iteration) versus non-loop-carried (or loop-independent) dependence: indicates whether the dependence crosses iterations or stays within one iteration.
  – For detecting parallel loops, only cross-iteration dependences matter.
  – Loop-independent ("=") dependences are relevant for optimizations such as statement reordering and loop distribution.
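To make distance and direction concrete, here is a small C example; it is not from the slides, and the array names and sizes are illustrative only.

#include <stdio.h>

int main(void) {
    double a[100], b[100];
    for (int i = 0; i < 100; i++) { a[i] = i; b[i] = 0.0; }

    /* S2 reads a[i-2], which S1 wrote two iterations earlier:
       a loop-carried flow dependence with distance 2, direction "<".
       S2 also reads a[i], written by S1 in the same iteration:
       a loop-independent (distance 0, direction "=") dependence.
       Only the cross-iteration dependence prevents parallelization. */
    for (int i = 2; i < 100; i++) {
        a[i] = b[i] + 1.0;        /* S1 */
        b[i] = a[i] + a[i - 2];   /* S2 */
    }
    printf("%f\n", b[99]);
    return 0;
}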

R. Eigenmann, Optimizing compilers Slide 11 Data Dependence Tests: Concepts

• Iteration space graphs: the un-abstracted form of a dependence graph with one node per statement instance.

Example:

DO i=1,n
  DO j=1,m
    a[i,j] = a[i-1,j-2]+b[i,j]
  ENDDO
ENDDO

[Iteration space graph: one node per (i,j) statement instance, i on the horizontal and j on the vertical axis; arrows in execution order show the dependence with distance vector (1,2).]

R. Eigenmann, Optimizing compilers Slide 12 Data Dependence Tests: Formulation of the Data-dependence problem

DO i=1,n
  a[4*i] = . . .
  . . . = a[2*i+1]
ENDDO

The question to answer: can 4*i1 ever be equal to 2*i2+1, with i1, i2 ∈ [1,n]?

Note that the iterations at which the two expressions are equal may differ. To express this fact, we choose the notation i1, i2.

Let us generalize a bit: given
• two subscript functions f and g, and
• loop bounds lower, upper,
does f(i1) = g(i2) have a solution such that lower ≤ i1, i2 ≤ upper?

R. Eigenmann, Optimizing compilers Slide 13 This course would now be finished if:

• the mathematical formulation of the data dependence problem had an accurate and fast solution, and
• there were enough loops in programs without data dependences, and
• dependence-free code could be executed by today’s parallel machines directly and efficiently, and
• engineering these techniques into a production compiler were straightforward.

There are enough hard problems to fill several courses!

R. Eigenmann, Optimizing compilers Slide 14 II. Performance of Basic Automatic Program Parallelization

R. Eigenmann, Optimizing compilers Slide 15 Three Decades of Parallelizing Compilers

A performance study at the beginning of the 1990s (the "Blume study") analyzed the performance of state-of-the-art parallelizers and vectorizers using the Perfect Benchmarks.

William Blume and Rudolf Eigenmann, Performance Analysis of Parallelizing Compilers on the Perfect Benchmarks Programs, IEEE Transactions on Parallel and Distributed Systems, 3(6), November 1992, pages 643--656.

Good reasons for starting two decades back: • We will learn simple techniques first. • We will see how parallelization techniques have evolved • We will see that extensions of the important techniques back then are still the important techniques today.

R. Eigenmann, Optimizing compilers Slide 16 Overall Performance of parallelizers in 1990

Speedup on 8 processors with 4-stage vector units

R. Eigenmann, Optimizing compilers Slide 17 Performance of Individual Techniques

R. Eigenmann, Optimizing compilers Slide 18 Transformations measured in the “Blume Study”
• Scalar expansion
• Reduction parallelization
• Induction variable substitution
• Forward substitution
• Stripmining
• Loop synchronization
• Recurrence substitution
• Loop interchange

R. Eigenmann, Optimizing compilers Slide 19 Scalar Expansion

Original loop (output, flow, and anti dependences on t):

DO j=1,n
  t = a[j]+b[j]
  c[j] = t + t**2
ENDDO

Privatization:

DO PARALLEL j=1,n
  PRIVATE t
  t = a[j]+b[j]
  c[j] = t + t**2
ENDDO

Expansion:

DO PARALLEL j=1,n
  t0[j] = a[j]+b[j]
  c[j] = t0[j] + t0[j]**2
ENDDO

We assume a shared-memory model:
• by default, data is shared, i.e., all processors can see and modify it
• processors share the work of parallel loops
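The same two transformations can be written in C with OpenMP. This is a minimal sketch; the array names and sizes are assumptions for illustration, not taken from the slides.

#include <stdio.h>
#define N 1000

int main(void) {
    double a[N], b[N], c[N], t0[N], t;
    for (int j = 0; j < N; j++) { a[j] = j; b[j] = 2.0 * j; }

    /* Privatization: every thread gets its own copy of t. */
    #pragma omp parallel for private(t)
    for (int j = 0; j < N; j++) {
        t = a[j] + b[j];
        c[j] = t + t * t;
    }

    /* Expansion: the scalar becomes an array indexed by the loop variable,
       so different iterations no longer share storage. */
    #pragma omp parallel for
    for (int j = 0; j < N; j++) {
        t0[j] = a[j] + b[j];
        c[j] = t0[j] + t0[j] * t0[j];
    }
    printf("%f\n", c[N - 1]);
    return 0;
}

Privatization needs no extra memory per iteration, only per thread; expansion uses O(n) extra storage but also works for vectorization, as discussed later.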

R. Eigenmann, Optimizing compilers Slide 20 Parallel Loop Syntax and Semantics in OpenMP

Fortran:
!$OMP PARALLEL PRIVATE(...)
!$OMP DO
DO i = lb, ub
  ...
ENDDO
!$OMP END DO
!$OMP END PARALLEL

C:
#pragma omp parallel for
for (i=lb; i<=ub; i++) {
  ...
}

work (iterations) shared by participating processors (threads)

Same code executed by all participating processors (threads)

R. Eigenmann, Optimizing compilers Slide 21 Reduction Parallelization

Serial loop (loop-carried flow and anti dependences on sum):

DO j=1,n
  sum = sum + a[j]
ENDDO

Privatized reduction implementation:

!$OMP PARALLEL PRIVATE(s)
s = 0
!$OMP DO
DO j=1,n
  s = s + a[j]
ENDDO
!$OMP END DO
!$OMP ATOMIC
sum = sum + s
!$OMP END PARALLEL

Using the OpenMP reduction clause:

!$OMP PARALLEL DO
!$OMP+REDUCTION(+:sum)
DO j=1,n
  sum = sum + a[j]
ENDDO
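For C programmers, here is a sketch of the same two variants with OpenMP; the array size and contents are arbitrary choices for illustration.

#include <stdio.h>
#define N 10000

int main(void) {
    double a[N], sum = 0.0, sum2 = 0.0;
    for (int j = 0; j < N; j++) a[j] = 1.0;

    /* Privatized reduction, written out explicitly: each thread accumulates
       into a private partial sum s and adds it to the shared sum once,
       protected by an atomic update. */
    #pragma omp parallel
    {
        double s = 0.0;
        #pragma omp for
        for (int j = 0; j < N; j++)
            s += a[j];
        #pragma omp atomic
        sum += s;
    }

    /* Equivalent, using the OpenMP reduction clause. */
    #pragma omp parallel for reduction(+:sum2)
    for (int j = 0; j < N; j++)
        sum2 += a[j];

    printf("%f %f\n", sum, sum2);
    return 0;
}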

R. Eigenmann, Optimizing compilers Slide 22 Induction Variable Substitution

Serial loop (flow dependence on ind):

ind = ind0
DO j = 1,n
  a[ind] = b[j]
  ind = ind+k
ENDDO

After induction variable substitution:

ind = ind0
DO PARALLEL j = 1,n
  a[ind0+k*(j-1)] = b[j]
ENDDO

Note, this is the reverse of strength reduction, an important transformation in classical (code-generating) compilers:

real d[20,100]
DO j=1,n
  d[1,j]=0
ENDDO

Address computation without strength reduction:
loop: ...
      R0 ← &d+20*j
      (R0) ← 0
      ...
      jump loop

With strength reduction:
      R0 ← &d
loop: ...
      (R0) ← 0
      ...
      R0 ← R0+20
      jump loop
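Returning to the induction variable itself, here is a minimal C/OpenMP sketch of the substitution; ind0, k, and the array sizes are illustrative assumptions.

#include <stdio.h>
#define N 1000

int main(void) {
    double a[4 * N + 10], b[N];
    for (int j = 0; j < N; j++) b[j] = j;
    for (int i = 0; i < 4 * N + 10; i++) a[i] = 0.0;

    int k = 3, ind0 = 2;

    /* Original: the loop-carried flow dependence on ind serializes the loop. */
    int ind = ind0;
    for (int j = 0; j < N; j++) {
        a[ind] = b[j];
        ind = ind + k;
    }

    /* After induction variable substitution: ind is replaced by its
       closed form ind0 + k*j, so the loop can run in parallel. */
    #pragma omp parallel for
    for (int j = 0; j < N; j++)
        a[ind0 + k * j] = b[j];

    printf("%f\n", a[ind0 + k * (N - 1)]);
    return 0;
}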

R. Eigenmann, Optimizing compilers Slide 23 Forward Substitution

Before:
m = n+1
…
DO j=1,n
  a[j] = a[j+m]
ENDDO

After forward substitution of m:
m = n+1
…
DO j=1,n
  a[j] = a[j+n+1]
ENDDO

Before (dependences):        After (no dependences):
a = x*y                      a = x*y
b = a+2                      b = x*y+2
c = b + 4                    c = x*y + 6

R. Eigenmann, Optimizing compilers Slide 24 Stripmining

[Figure: the iteration space 1..n divided into strips of length strip.]

Before:
DO j=1,n
  a[j] = b[j]
ENDDO

After stripmining:
DO i=1,n,strip
  DO j=i,min(i+strip-1,n)
    a[j] = b[j]
  ENDDO
ENDDO

There are many variants of stripmining (sometimes called loop blocking)
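A C sketch of the stripmined loop; STRIP and the array size are arbitrary choices for illustration.

#include <stdio.h>
#define N 1000
#define STRIP 64

int main(void) {
    double a[N], b[N];
    for (int j = 0; j < N; j++) { a[j] = 0.0; b[j] = j; }

    /* The outer loop walks over strips of STRIP iterations; the inner loop
       covers one strip, and the bound check handles the last, partial strip. */
    for (int i = 0; i < N; i += STRIP) {
        int upper = (i + STRIP < N) ? i + STRIP : N;
        for (int j = i; j < upper; j++)
            a[j] = b[j];
    }
    printf("%f\n", a[N - 1]);
    return 0;
}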

R. Eigenmann, Optimizing compilers Slide 25 Loop Synchronization

Before:
DO j=1,n
  a[j] = b[j]
  c[j] = a[j]+a[j-1]
ENDDO

With loop synchronization:
DOACROSS j=1,n
  a[j] = b[j]
  post(current_iteration)
  wait(current_iteration-1)
  c[j] = a[j]+a[j-1]
ENDDO

R. Eigenmann, Optimizing compilers Slide 26 Recurrence Substitution

DO j=1,n
  a[j] = c0+c1*a[j]+c2*a[j-1]+c3*a[j-2]
ENDDO

call rec_solver(a,n,c0,c1,c2,c3)

Basic idea of the recurrence solver: split the loop

DO j=1,40
  a[j] = a[j] + a[j-1]
ENDDO

into chunks that run in parallel:

DO j=1,10   |  DO j=11,20  |  DO j=21,30  |  DO j=31,40
  a[j] = a[j] + a[j-1] in each chunk
ENDDO       |  ENDDO       |  ENDDO       |  ENDDO

Each chunk starts from a value that is not yet final; the resulting errors are 0, ∆a[10], ∆a[10]+∆a[20], ∆a[10]+∆a[20]+∆a[30], respectively, and are corrected in a second step.

R. Eigenmann, Optimizing compilers Slide 27 Loop Interchange

Original:
DO i=1,n
  DO j=1,m
    a[i,j] = a[i,j]+a[i,j-1]
  ENDDO
ENDDO

After loop interchange:
DO j=1,m
  DO i=1,n
    a[i,j] = a[i,j]+a[i,j-1]
  ENDDO
ENDDO

• stride-1 references increase cache locality
  – read: increase spatial locality
  – write: avoid false sharing
• scheduling of the outer loop is important (consider the original loop nest):
  – cyclic: no locality w.r.t. the i loop
  – block schedule: there may be some locality
  – dynamic scheduling: chunk scheduling desirable
• cache organization is important
• parallelism at the outer position reduces loop fork/join overhead
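Fortran is column-major, so the interchanged nest above has the stride-1 loop innermost; in row-major C the roles of i and j are swapped. Here is a hedged C/OpenMP sketch of the same idea, with sizes chosen only for illustration.

#include <stdio.h>
#define N 400
#define M 300

static double a[N][M + 1];

static void init(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j <= M; j++)
            a[i][j] = i + j;
}

int main(void) {
    /* Original nest: j outermost (it carries the dependence and must stay
       serial), i innermost, which walks column-wise, i.e. not stride-1 in C. */
    init();
    for (int j = 1; j <= M; j++)
        for (int i = 0; i < N; i++)
            a[i][j] = a[i][j] + a[i][j - 1];
    double check = a[N - 1][M];

    /* After interchange: the dependence-free i loop is outermost and parallel,
       and the innermost j loop is stride-1 within each row. */
    init();
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int j = 1; j <= M; j++)
            a[i][j] = a[i][j] + a[i][j - 1];

    printf("results match: %d\n", check == a[N - 1][M]);
    return 0;
}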

R. Eigenmann, Optimizing compilers Slide 28 Effect of Loop Interchange

Example: speedups of the most time-consuming loops in the ARC2D benchmark on 4-core machine loop interchange applied in the process of parallelization

R. Eigenmann, Optimizing compilers Slide 29 Execution Scheme for Parallel Loops

1. The architecture supports parallel loops. Example: Alliant FX/8 (1980s)
– machine instruction for parallel loop
– HW concurrency bus supports loop scheduling

Source:
a=0
! DO PARALLEL
DO i=1,n
  b[i] = 2
ENDDO
b=3

Generated code:
store #0,a
load n,D6
sub 1,D6
load &b,A1
cdoall D6
  store #2,A1(D7.r)
endcdoall
store #3,b

(D7 is reserved for the loop variable; it starts at 0.)

R. Eigenmann, Optimizing compilers Slide 30 Execution Scheme for Parallel Loops

2. Microtasking scheme (dates back to early IBM mainframes)

[Figure: timeline over processors p1-p4: init_helper_tasks at program start, then the master alternates sequential sections with wakeup_helpers / parallel loop / sleep_helpers.]

The problem: parallel loop startup must be very fast.
Microtask startup: about 1 µs; pthreads startup: up to 100 µs.

R. Eigenmann, Optimizing compilers Slide 31 Compiler Transformation and Runtime Function for the Microtasking Scheme

Source:
! DO PARALLEL
a=0
DO i=1,n
  b[i] = 2
ENDDO
b=3

Transformed code:
call init_microtasking()   // once at program start
...
a=0
call loop_scheduler(loopsub,1,n,b)
b=3

subroutine loopsub(mytask,lb,ub,b)
DO i=lb,ub
  b[i] = 2
ENDDO
END

Runtime scheme:
Master task (loop_scheduler):
  partition the loop iterations
  wake up the helpers: write loopsub, lb, ub, sh_var into the control blocks (shared data) and set each helper's flag
  call loopsub(id,lb,ub,sh_var)
  barrier (wait until all flags are reset)
  return

Helper task:
  loop: wait for flag
        call loopsub(...)
        reset flag

R. Eigenmann, Optimizing compilers Slide 32 III. Performance of Advanced Parallelization

R. Eigenmann, Optimizing compilers Slide 33 Manual Improvements of the Perfect Benchmarks (1995)

Same information as on Slide 17. Reference: Rudolf Eigenmann, Jay Hoeflinger, and David Padua, On the Automatic Parallelization of the Perfect Benchmarks, IEEE Transactions on Parallel and Distributed Systems, 9(1), January 1998, pages 5-23.

(a) eliminated file I/O, (b) parallelized random number generator

R. Eigenmann, Optimizing compilers Slide 34 Performance of Individual Techniques in Manually Improved Programs (1995)

Performance loss when disabling individual techniques (Cedar machine) R. Eigenmann, Optimizing compilers Slide 35 Overall Performance of the Cetus and ICC Compilers (2011)

[Figure: overall speedups of the Cetus and ICC compilers; chart labels not recoverable.]

NAS (Class A) Benchmarks on 8-core x86 processor

R. Eigenmann, Optimizing compilers Slide 36 Performance of Individual Cetus Techniques (2011)

[Figure: performance impact of individual Cetus techniques; chart labels not recoverable.]

NAS Benchmarks (Class A) on 8-core x86 processor

R. Eigenmann, Optimizing compilers Slide 37 IV. Analysis and Transformation Techniques

• 1 Data-dependence analysis
• 2 Parallelism enabling transformations
• 3 Techniques for multiprocessors/multicores
• 4 Advanced program analysis
• 5 Dynamic decision making
• 6 Techniques for vector architectures
• 7 Techniques for heterogeneous multicores
• 8 Techniques for distributed-memory machines

R. Eigenmann, Optimizing compilers Slide 38 IV.1 Data Dependence Testing

Earlier, we have considered the simple case of a 1-dimensional array enclosed by a single loop:

DO i=1,n
  a[4*i] = . . .
  . . . = a[2*i+1]
ENDDO

The question to answer: can 4*i1 ever be equal to 2*i2+1 within i1, i2 ∈ [1,n]?

In general: given
• two subscript functions f and g, and
• loop bounds lower, upper,
does f(i1) = g(i2) have a solution such that lower ≤ i1, i2 ≤ upper?

R. Eigenmann, Optimizing compilers Slide 39 DDTests: doubly-nested loops

• Multiple loop indices:

DO i=1,n
  DO j=1,m
    X[a1*i + b1*j + c1] = ...
    ... = X[a2*i + b2*j + c2]
  ENDDO
ENDDO

The dependence problem:

a1*i1 - a2*i2 + b1*j1 - b2*j2 = c2 - c1
1 ≤ i1, i2 ≤ n
1 ≤ j1, j2 ≤ m

Almost all DD tests expect the coefficients ax to be integer constants. Such subscript expressions are called affine. R. Eigenmann, Optimizing compilers Slide 40 DDTests: even more complexity

• Multiple loop indices, multi-dimensional array:

DO i=1,n
  DO j=1,m
    X[a1*i + b1*j + c1, d1*i + e1*j + f1] = ...
    ... = X[a2*i + b2*j + c2, d2*i + e2*j + f2]
  ENDDO
ENDDO

The dependence problem:

a1*i1 - a2*i2 + b1*j1 - b2*j2 = c2 - c1
d1*i1 - d2*i2 + e1*j1 - e2*j2 = f2 - f1
1 ≤ i1, i2 ≤ n
1 ≤ j1, j2 ≤ m

R. Eigenmann, Optimizing compilers Slide 41 Data Dependence Tests: The Simple Case

Note: the variables i1, i2 are integers → Diophantine equations.

The equation a*i1 - b*i2 = c has a solution if and only if gcd(a,b) (evenly) divides c.

in our example this means: gcd(4,2)=2, which does not divide 1 and thus there is no dependence.

If there is a solution, we can test if it lies within the loop bounds. If not, then there is no dependence.

R. Eigenmann, Optimizing compilers Slide 42 Performing the GCD Test • The diophantine equation

a1*i1 + a2*i2 + ... + an*in = c has a solution iff gcd(a1,a2,...,an) evenly divides c

Examples:
15*i + 6*j - 9*k = 12   has a solution (gcd=3)
2*i + 7*j = 3           has a solution (gcd=1)
9*i + 3*j + 6*k = 5     has no solution (gcd=3)

Euclid's algorithm: find gcd(a,b)
  Repeat
    a ← a mod b
    swap a,b
  Until b=0
  → the resulting a is the gcd

For more than two numbers: gcd(a,b,c) = gcd(a,gcd(b,c))
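A small C implementation of Euclid's algorithm and the GCD test, applied to the examples above and to the loop from Slide 13. This is a sketch of the idea, not the code of any particular compiler.

#include <stdio.h>
#include <stdlib.h>

/* Euclid's algorithm for two non-negative numbers. */
static long gcd2(long a, long b) {
    while (b != 0) {
        long t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* GCD test for a1*i1 + a2*i2 + ... + an*in = c:
   an integer solution exists iff gcd(a1,...,an) divides c.
   Returns 1 if a dependence is possible, 0 if independence is proven. */
static int gcd_test(const long *coef, int n, long c) {
    long g = 0;
    for (int k = 0; k < n; k++)
        g = gcd2(g, labs(coef[k]));
    if (g == 0)                 /* all coefficients zero */
        return c == 0;
    return c % g == 0;
}

int main(void) {
    long ex1[] = {15, 6, -9};   /* 15i + 6j - 9k = 12 : gcd 3 divides 12 */
    long ex2[] = {9, 3, 6};     /*  9i + 3j + 6k = 5  : gcd 3 does not divide 5 */
    long ex3[] = {4, -2};       /*  4*i1 - 2*i2 = 1   : the loop from Slide 13 */
    printf("%d %d %d\n",
           gcd_test(ex1, 3, 12),   /* 1: dependence possible */
           gcd_test(ex2, 3, 5),    /* 0: independent */
           gcd_test(ex3, 2, 1));   /* 0: independent */
    return 0;
}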

R. Eigenmann, Optimizing compilers Slide 43 Other Data Dependence Tests

• The GCD test is simple but not accurate • Other tests – Banerjee(-Wolfe) test: widely used test – Power Test: improvement over Banerjee test – Omega test: “precise” test, most accurate for linear subscripts – Range test: handles non-linear and symbolic subscripts – many variants of these tests

R. Eigenmann, Optimizing compilers Slide 44 The Banerjee(-Wolfe) Test

Basic idea: if the total subscript range accessed by ref1 does not overlap with the range accessed by ref2, then ref1 and ref2 are independent.

DO j=1,100
  a[j] = …
  … = a[j+200]
ENDDO

Ranges accessed: a[j]: [1:100], a[j+200]: [201:300] → no overlap → independent
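Below is a minimal C sketch of this range-overlap reasoning for a single loop and affine subscripts of the form a*i+b. The helper names are my own, and the real Banerjee test generalizes this to multiple loops and direction vectors.

#include <stdio.h>

/* Independence check for A[a1*i + b1] (write) and A[a2*i + b2] (read)
   inside DO i = lo, hi.  A dependence requires a1*i1 - a2*i2 = b2 - b1
   for some i1, i2 in [lo,hi].  We bound the left side over the loop
   bounds; if b2 - b1 lies outside [min,max], the references are independent. */
static long term_min(long a, long lo, long hi) { return a >= 0 ? a * lo : a * hi; }
static long term_max(long a, long lo, long hi) { return a >= 0 ? a * hi : a * lo; }

static int banerjee_independent(long a1, long b1, long a2, long b2,
                                long lo, long hi) {
    long c   = b2 - b1;
    long min = term_min(a1, lo, hi) + term_min(-a2, lo, hi);
    long max = term_max(a1, lo, hi) + term_max(-a2, lo, hi);
    return c < min || c > max;
}

int main(void) {
    /* DO j=1,100:  a[j] = ...;  ... = a[j+200]   -> independent */
    printf("%d\n", banerjee_independent(1, 0, 1, 200, 1, 100));
    /* DO j=1,100:  a[j] = ...;  ... = a[j+5]     -> cannot prove independence */
    printf("%d\n", banerjee_independent(1, 0, 1, 5, 1, 100));
    return 0;
}

The first call reproduces the bounds on the next slide: the left-hand side ranges over [-99, 99], and 200 lies outside it.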

R. Eigenmann, Optimizing compilers Slide 45 Mathematical Formulation of the Test – Banerjee’s Inequalities

For the example above, the dependence equation is j1 - j2 = 200. Over the loop bounds, the left-hand side ranges from Min: 1-100 = -99 to Max: 100-1 = 99; since 200 lies outside [-99, 99], there is no solution and thus no dependence.

The general case of a doubly-nested loop and a single subscript, as shown on Slide 40:

a1*i1 - a2*i2 + b1*j1 - b2*j2 = c2 - c1

Assuming positive coefficients,
a1*i1 - a2*i2 ranges over [Min: a1 - a2*n, Max: a1*n - a2], and
b1*j1 - b2*j2 ranges over [Min: b1 - b2*m, Max: b1*m - b2].
If c2 - c1 lies outside the sum of these ranges, the references are independent.

Multiple dimensions: apply test separately on each subscript or linearize

R. Eigenmann, Optimizing compilers Slide 46 Banerjee(-Wolfe) Test continued

Weakness of the test: consider the potential flow dependence in

DO j=1,100
  a[j] = …
  … = a[j+5]
ENDDO

Ranges accessed: a[j]: [1:100], a[j+5]: [6:105]. The ranges overlap, so the test cannot prove independence → no parallelization?

We did not take into consideration that only loop-carried dependences matter for parallelization. A loop-carried flow dependence only exists if a write in some iteration, j1, conflicts with a read in some later iteration, j2 > j1.

R. Eigenmann, Optimizing compilers Slide 47 Using Dependence Direction Information in the Banerjee(-Wolfe) Test

Idea for overcoming the weakness: for loop-carried dependences, make use of the fact that j in ref2 is greater than j in ref1.

Still considering the potential flow dependence from a(j) to a(j+5):

DO j=1,100
  a[j] = …
  … = a[j+5]
ENDDO

Ranges accessed by iteration j1 and any other iteration j2, where j1 < j2: the write a[j] accesses [j1], the read a[j+5] accesses [j1+6:105]. These ranges do not overlap → independent for the “>” direction.

Clearly, this loop has a dependence, but it is an anti-dependence from a[j+5] to a[j]. This is commonly referred to as the Banerjee test with direction vectors.

R. Eigenmann, Optimizing compilers Slide 48 DD Testing with Direction Vectors

Considering direction vectors can increase the complexity of the DD test substantially. For long vectors (corresponding to deeply-nested loops), there are many possible combinations of directions.

A direction vector (d1, d2, …, dn) has one element per loop; each element is one of <, =, > (or *, meaning any direction).

A possible algorithm:
1. try (*,*,…,*), i.e., do not consider directions
2. (if not independent) try (<,*,…,*), (=,*,…,*)
3. (if still not independent) try (<,<,*,…,*), (<,>,*,…,*), (<,=,*,…,*), (=,=,*,…,*), (=,<,*,…,*), . . .
(This forms a tree.)

R. Eigenmann, Optimizing compilers Slide 49 Data-dependence Test Driver

procedure DataDependenceAnalysis( PROG )
  input : Program representing all source files: PROG
  output : Data dependence graph containing dependence arcs, DDG

  // Collect all FOR loops meeting eligibility
  // checks: Canonical, FunctionCall, ControlFlowModifier
  ELIGIBLE_LOOPS = getOutermostEligibleLoops( PROG )
  foreach LOOP in ELIGIBLE_LOOPS
    // Obtain lower bounds, upper bounds and loop steps for this loop
    // and all enclosed loops, i.e. the loop nest.
    // Substitute symbolic information if available.
    LOOP_INFO = collectLoopInformation( LOOP and enclosed nest )
    // Collect all array access expressions appearing within the body of
    // this loop; this includes enclosed loops and non-perfectly nested statements
    ACCESSES = collectArrayAccesses( LOOP and enclosed nest )
    // Traverse all array accesses, test relevant pairs and
    // create a set of dependence arcs for the loop nest
    LOOP_DDG = runDependenceTest( LOOP_INFO, ACCESSES )
    // Add the loop dependence graph to the program-wide DDG
    // (the program-wide DDG is initially empty)
    DDG += LOOP_DDG
  // return the program-wide data dependence graph once all loops are done
  return DDG

R. Eigenmann, Optimizing compilers Slide 50 Data-dependence Test Driver (continued)

procedure runDependenceTest( LOOP_INFO, ACCESSES )
  input : Loop information for the current loop nest, LOOP_INFO
          List of array access expressions, ACCESSES
  output : Loop data dependence graph, LOOP_DDG

  foreach ARRAY_1 in ACCESSES of type write
    // Obtain alias information, i.e. aliases to this array name
    // (alias information in Cetus is generated through points-to analysis)
    ALIAS_SET = getAliases( ARRAY_1 )
    // Collect all expressions/references to the same array from the entire list of accesses
    TEST_LIST = getOtherReferences( ALIAS_SET, ACCESSES )
    foreach ARRAY_2 in TEST_LIST
      // Obtain the common loops enclosing the pair
      COMMON_NEST = getCommonNest( ARRAY_1, ARRAY_2 )
      // A possibly empty set of direction vectors under which
      // a dependence exists is returned by the test
      DV_SET = testAccessPair( ARRAY_1, ARRAY_2, COMMON_NEST, LOOP_INFO )
      foreach DV in DV_SET
        // Create an arc from source to sink
        DEP_ARC = buildDependenceArc( ARRAY_1, ARRAY_2, DV )
        // Build the loop dependence graph by accumulating all arcs
        LOOP_DDG += DEP_ARC
  // All expressions have been tested; return the loop dependence graph
  return LOOP_DDG

R. Eigenmann, Optimizing compilers Slide 51 Data-dependence Test Driver (continued)

procedure testAccessPair( A1, A2, COMMON_NEST, LOOP_INFO )
  input : Pair of array accesses to be tested, A1 and A2
          Nest of common enclosing loops, COMMON_NEST
          Information for these loops, LOOP_INFO
  output : Possibly empty set of direction vectors under which dependence exists, DV_SET

  // Partition the subscripts of the array accesses into dimension pairs
  // (coupled subscripts may be handled)
  PARTITIONS = partitionSubscripts( A1, A2, COMMON_NEST )
  foreach PARTITION in PARTITIONS
    // Depending on the number of loop index variables in the partition,
    // use the corresponding test.
    if ( ZIV )   // zero index variables
      DVs = simpleZIVTest( PARTITION )
    else         // single or multiple loop index variables: SIV, MIV
      // traverse and prune the tree of direction vectors, collect DVs where
      // dependence exists (traversal not shown here)
      foreach DV in DV_TREE using prune
        // In Cetus, the MIV test is performed using the Banerjee or Range test
        DVs += MIVTest( PARTITION, DV, COMMON_NEST, LOOP_INFO )
  // Merge DVs for all partitions
  DV_SET = merge( DVs )
  return DV_SET

R. Eigenmann, Optimizing compilers Slide 52 Non-linear and Symbolic DD Testing

Weakness of most data dependence tests: subscripts and loop bounds must be affine, i.e., linear with integer-constant coefficients

Approach of the Range Test: capture subscript ranges symbolically compare ranges: find their upper and lower bounds by determining monotonicity. Monotonically increasing/decreasing ranges can be compared by comparing their upper and lower bounds.

R. Eigenmann, Optimizing compilers Slide 53 The Range Test Basic idea : 1. Find the range of array accesses made in a given loop iteration j => r(j). 2. If r(j) does not overlap with r(j+1) then there is no cross-iteration dependence

Symbolic comparison of ranges r1 and r2: if max(r1) < min(r2) or min(r1) > max(r2), the ranges do not overlap.

Example: testing independence of the outer loop:

DO i=1,n
  DO j=1,m
    A[i*m+j] = 0
  ENDDO
ENDDO

Range of A accessed in iteration ix:    [ix*m+1 : (ix+1)*m]      (its upper bound: ubx)
Range of A accessed in iteration ix+1:  [(ix+1)*m+1 : (ix+2)*m]  (its lower bound: lbx+1)

ubx < lbx+1 ⇒ no cross-iteration dependence

R. Eigenmann, Optimizing compilers Slide 54 we need powerful expression manipulation and comparison Range Test continued utilities

DO i1=L1,U1 Assume f,g are monotonically increasing w.r.t. all ix: ... find upper bound of access range at loop k, 1

we need Determining monotonicity: consider d = f(...,ik,...) - f(...,ik-1,...) range If d>0 (for all values of ik) then f is monotonically increasing w.r.t. k analysis If d<0 (for all values of ik) then f is monotonically decreasing w.r.t. k

What about symbolic coefficients? • in many cases they cancel out • if not, find their range (i.e., all possible values they can assume at this point in the program), and replace them by the upper or lower bound of the range. R. Eigenmann, Optimizing compilers Slide 55 Handling Non-contiguous Ranges

DO i1=1,u1
  DO i2=1,u2
    A[n*i1+m*i2] = …
  ENDDO
ENDDO

The basic Range Test finds independence of the outer loop if n >= u2 and m=1, but not if n=1 and m >= u1.

Idea: - temporarily (during program analysis) interchange the loops, - test independence, - interchange back

Issues: • legality of loop interchanging, • change of parallelism as a result of loop interchanging

R. Eigenmann, Optimizing compilers Slide 56 Some Engineering Tasks and Questions for DD Test Pass Writers

- Start with the simple case: linear (affine) subscripts, single nests with 1-dim arrays; subscript and loop bounds are integer constants; stride-1 loop, lower bound = 1
- Deal with multiple array dimensions and loop nests
- Add capabilities for non-stride-1 loops and lower bounds ≠ 1
- How to deal with symbolic subscript coefficients and bounds
- Ignore dependences in private variables and reductions
- Generate DD vectors
- Mark parallel loops
- Things to think about:
  -- how to handle loop-variant coefficients
  -- how to deal with private, reduction, induction variables
  -- how to represent DD information
  -- how to display the DD info
  -- how to deal with non-parallelizable loops (I/O ops, function calls, other?)
  -- how to find eligible DO loops?
  -- how to find eligible loop bounds, array subscripts?
  -- what is the result of the pass? Generate DD info or set parallel loop flags?
  -- what symbolic analysis capabilities are needed?

R. Eigenmann, Optimizing compilers Slide 57 Data-Dependence Test, References
• Banerjee/Wolfe test – M. Wolfe and U. Banerjee, "Data Dependence and its Application to Parallel Processing", Int. J. of Parallel Programming, 16(2), pp. 137-178, 1987
• Power Test – M. Wolfe and C.W. Tseng, The Power Test for Data Dependence, IEEE Transactions on Parallel and Distributed Systems, 3(5), pp. 591-601, 1992
• Range Test – William Blume and Rudolf Eigenmann, Non-Linear and Symbolic Data Dependence Testing, IEEE Transactions on Parallel and Distributed Systems, 9(12), pp. 1180-1194, December 1998
• Omega Test – William Pugh, The Omega test: a fast and practical integer programming algorithm for dependence analysis, Proceedings of the 1991 ACM/IEEE Conference on Supercomputing, 1991
• I Test – Xiangyun Kong, David Klappholz, and Kleanthis Psarris, "The I Test: A New Test for Subscript Data Dependence", Proceedings of the 1990 International Conference on Parallel Processing, Vol. II, pp. 204-211, August 1990

R. Eigenmann, Optimizing compilers Slide 58 IV.2 Parallelism Enabling Techniques

R. Eigenmann, Optimizing compilers Slide 59 Advanced Privatization

Scalar privatization (loop-carried anti dependence on t):

DO i=1,n
  t = A[i]+B[i]
  C[i] = t + t**2
ENDDO

!$OMP PARALLEL DO
!$OMP+PRIVATE(t)
DO i=1,n
  t = A[i]+B[i]
  C[i] = t + t**2
ENDDO

Array privatization (loop-carried anti dependence on t[1:m]):

DO j=1,n
  t[1:m] = A[j,1:m]+B[j]
  C[j,1:m] = t[1:m] + t[1:m]**2
ENDDO

!$OMP PARALLEL DO
!$OMP+PRIVATE(t)
DO j=1,n
  t[1:m] = A[j,1:m]+B[j]
  C[j,1:m] = t[1:m] + t[1:m]**2
ENDDO

R. Eigenmann, Optimizing compilers Slide 60 Array Privatization

Examples:

k = 5
DO j=1,n
  t[1:10] = A[j,1:10]+B[j]
  C[j,iv] = t[k]
  t[11:m] = A[j,11:m]+B[j]
  C[j,1:m] = t[1:m]
ENDDO

DO j=1,n
  IF (cond(j))
    t[1:m] = A[j,1:m]+B[j]
    C[j,1:m] = t[1:m] + t[1:m]**2
  ENDIF
  D[j,1] = t[1]
ENDDO

Capabilities needed for Array Privatization:
• array Def-Use analysis
• combining and intersecting subscript ranges
• representing subscript ranges
• representing conditionals under which sections are defined/used
• if ranges are too complex to represent: overestimate Uses, underestimate Defs

R. Eigenmann, Optimizing compilers Slide 61 Array Privatization continued

Array privatization algorithm: • For each loop nest: – iterate from innermost to outermost loop: • for each statement in the loop – Find array definitions; add them to the existing definitions in this loop. – find array uses; if they are covered by a definition, mark this array section as privatizable for this loop, otherwise mark it as upward-exposed in this loop; • aggregate defined and upward-exposed uses (expand from range per-iteration to entire iteration space); record them as Defs and Uses for this loop

R. Eigenmann, Optimizing compilers Slide 62 Some Engineering Tasks and Questions for Privatization Pass Writers

• Start with scalar privatization • Next step: array privatization with simple ranges (contiguous; no range merge) and singly-nested loops • Deal with multiply-nested loops (-> range aggregation) • Add capabilities for merging ranges • Implement advanced range representation (symbolic bounds, non- contiguous ranges) • Deal with conditional definitions and uses (too advanced for this course) • Things to think about – what symbolic analysis capabilities are needed? – how to represent advanced ranges? – how to deal with loop-variant subscript terms? – how to represent private variables?

R. Eigenmann, Optimizing compilers Slide 63 Array Privatization, References • Peng Tu and D. Padua. Automatic Array Privatization. Languages and Compilers for Parallel Computing. Lecture Notes in Computer Science 768, U. Banerjee, D. Gelernter, A. Nicolau, and D. Padua (Eds.), Springer-Verlag, 1994.

• Zhiyuan Li, Array Privatization for Parallel Execution of Loops, Proceedings of the 1992 ACM International Conference on Supercomputing

R. Eigenmann, Optimizing compilers Slide 64 Reduction Parallelization

Scalar reduction (loop-carried flow dependence on sum):

DO i=1,n
  sum = sum + A[i]
ENDDO

Privatized reduction implementation:

!$OMP PARALLEL PRIVATE(s)
s=0
!$OMP DO
DO i=1,n
  s=s+A[i]
ENDDO
!$OMP END DO
!$OMP ATOMIC
sum = sum+s
!$OMP END PARALLEL

Expanded reduction implementation:

DO i=1,num_proc
  s[i]=0
ENDDO
!$OMP PARALLEL DO
DO i=1,n
  s[my_proc]=s[my_proc]+A[i]
ENDDO
DO i=1,num_proc
  sum=sum+s[i]
ENDDO

Note, OpenMP has a reduction clause, so only reduction recognition is needed:

!$OMP PARALLEL DO
!$OMP+REDUCTION(+:sum)
DO i=1,n
  sum = sum + A[i]
ENDDO

R. Eigenmann, Optimizing compilers Slide 65 Parallelizing Array Reductions

Array reductions (a.k.a. irregular or histogram reductions):

REAL sum[m]
DO i=1,n
  sum[expr] = sum[expr] + A[i]
ENDDO

Privatized reduction implementation:

REAL sum(m),s(m)
!$OMP PARALLEL PRIVATE(s)
s[1:m] = 0
!$OMP DO
DO i=1,n
  s[expr]=s[expr]+A[i]
ENDDO
!$OMP ATOMIC
sum[1:m] = sum[1:m]+s[1:m]
!$OMP END PARALLEL

Expanded reduction implementation:

REAL sum(m),s(m,#proc)
!$OMP PARALLEL DO
DO i=1,m
  DO j=1,#proc
    s[i,j]=0
  ENDDO
ENDDO
!$OMP PARALLEL DO
DO i=1,n
  s[expr,my_proc]=s[expr,my_proc]+A[i]
ENDDO
!$OMP PARALLEL DO
DO i=1,m
  DO j=1,#proc
    sum[i]=sum[i]+s[i,j]
  ENDDO
ENDDO
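Here is a C/OpenMP sketch of the expanded (histogram) reduction; the thread-count limit, array sizes, and index function idx are assumptions made only for this illustration.

#include <stdio.h>
#include <omp.h>
#define N 100000
#define M 64

int main(void) {
    enum { MAXT = 256 };
    static double a[N], sum[M], s[MAXT][M];  /* one private histogram per thread */
    static int idx[N];
    for (int i = 0; i < N; i++) { a[i] = 1.0; idx[i] = i % M; }

    int nthreads = omp_get_max_threads();
    if (nthreads > MAXT) nthreads = MAXT;
    omp_set_num_threads(nthreads);

    for (int t = 0; t < nthreads; t++)
        for (int i = 0; i < M; i++)
            s[t][i] = 0.0;

    /* Each thread accumulates into its own contiguous row of s,
       which avoids conflicts and limits false sharing. */
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        #pragma omp for
        for (int i = 0; i < N; i++)
            s[tid][idx[i]] += a[i];
    }

    /* Parallel sum-up across the per-thread copies. */
    #pragma omp parallel for
    for (int i = 0; i < M; i++)
        for (int t = 0; t < nthreads; t++)
            sum[i] += s[t][i];

    printf("%f\n", sum[0]);
    return 0;
}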

Note, OpenMP 1.0 does not support such array reductions R. Eigenmann, Optimizing compilers Slide 66 Recognizing Reductions

Recognition criteria:
1. The loop may contain one or more reduction statements of the form X = X ⊗ expr, where
   • X is either a scalar or an array expression, a[sub] (sub must be the same on the LHS and RHS)
   • ⊗ is a reduction operation, such as +, *, min, max
2. X must not be used in any non-reduction statement of the loop, nor in expr.

R. Eigenmann, Optimizing compilers Slide 67 Reduction Recognition Algorithm

procedure RecognizeSumReductions (L)
  Input : Loop L
  Output: reduction annotations for loop L, inserted in the IR

  REDUCTION = {}   // set of candidate reduction expressions
  REF = {}         // set of non-reduction variables referenced in L
  foreach stmt in L
    localREFs = findREF(stmt)                   // gather all variables referenced in stmt
    if (stmt is AssignmentStatement)
      candidate = lhs_expr(stmt)
      increment = rhs_expr(stmt) - candidate    // symbolic subtraction
      if ( !(baseSymbol(candidate) in findREF(increment)) )   // criterion 1 is satisfied
        REDUCTION = REDUCTION ∪ candidate
        localREFs = findREF(increment)          // all variables referenced in the increment expression
    REF = REF ∪ localREFs                       // collect non-reduction variables for criterion 2
  foreach expr in REDUCTION
    if ( !(baseSymbol(expr) in REF) )           // criterion 2 is satisfied
      if (expr is ArrayAccess AND expr.subscript is loop-variant)
        CreateAnnotation(sum-reduction, ARRAY, expr)
      else
        CreateAnnotation(sum-reduction, SCALAR, expr)

R. Eigenmann, Optimizing compilers Slide 68 Reduction Compiler Passes

Reduction recognition and parallelization passes:

Induction variable recognition recognizes and Reduction recognition annotates reduction Privatization variables Data dependence test Loop parallelization Profitability decision Reduction parallelization performs the reduction transformation compiler passes

R. Eigenmann, Optimizing compilers Slide 69 Performance Considerations for Reduction Parallelization

• Parallelized reductions execute substantially more code than their serial versions Þ overhead if the reduction (n) is small. • In many cases (for large reductions) initialization and sum-up are insignificant. • False sharing can occur, especially in expanded reductions, if multiple processors use adjacent array elements of the temporary reduction array (s). • Expanded reductions exhibit more parallelism in the sum-up operation. • Potential overhead in initialization, sum-up, and memory used for large, sparse array reductions Þ compression schemes can become useful.

R. Eigenmann, Optimizing compilers Slide 70 Induction Variable Substitution

Serial loop (loop-carried flow dependence on ind):

ind = k
DO i=1,n
  ind = ind + 2
  A[ind] = B[i]
ENDDO

After substitution:

Parallel DO i=1,n
  A[k+2*i] = B[i]
ENDDO

This is the simple case of an induction variable

R. Eigenmann, Optimizing compilers Slide 71 Generalized Induction Variables

ind = k
DO j=1,n
  ind = ind + j
  A[ind] = B[j]
ENDDO

After substitution:

Parallel DO j=1,n
  A[k+(j**2+j)/2] = B[j]
ENDDO

Coupled induction variables:

DO i=1,n
  ind1 = ind1 + 1
  ind2 = ind2 + ind1
  A[ind2] = B[i]
ENDDO

Induction variable in a triangular loop nest:

DO i=1,n
  DO j=1,i
    ind = ind + 1
    A[ind] = B[i]
  ENDDO
ENDDO
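To check the closed form A[k+(j**2+j)/2] against the original induction sequence, here is a small C verification sketch; the array names and sizes are illustrative.

#include <stdio.h>
#define N 100

int main(void) {
    static double a[1 + N * (N + 1) / 2 + 1], a2[1 + N * (N + 1) / 2 + 1], b[N + 1];
    for (int j = 1; j <= N; j++) b[j] = j;

    int k = 0;

    /* Original: ind is incremented by the loop variable, so after iteration j
       it equals k + j*(j+1)/2 (a triangular number). */
    int ind = k;
    for (int j = 1; j <= N; j++) {
        ind = ind + j;
        a[ind] = b[j];
    }

    /* After substitution with the closed form k + (j*j + j)/2,
       the loop no longer carries a dependence on ind. */
    #pragma omp parallel for
    for (int j = 1; j <= N; j++)
        a2[k + (j * j + j) / 2] = b[j];

    int same = 1;
    for (int j = 1; j <= N; j++)
        if (a[k + (j * j + j) / 2] != a2[k + (j * j + j) / 2]) same = 0;
    printf("closed form matches: %d\n", same);
    return 0;
}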

R. Eigenmann, Optimizing compilers Slide 72 Recognizing GIVs

• Pattern Matching: – find induction statements in a loop nest of the form iv=iv+expr or iv=iv*expr, where iv is an scalar integer. – expr must be loop-invariant or another induction variable (there must not be cyclic relationships among IVs) – iv must not be assigned in a non-induction statement • Abstract interpretation: find symbolic increments of iv per loop iteration • SSA-based recognition

R. Eigenmann, Optimizing compilers Slide 73 GIV Closed-form Computation and Substitution Algorithm Step1: find the increment rel. to start of loop L FindIncrement(L) inc=0 Loop structure L0: stmt type foreach si of type I,L if type(s )=I inc += exp For j: 1..ub i … else /* L */ inc+= FindIncrement(si) inc_after[s ]=inc S1: iv=iv+exp I i j-1 … inc_into_loop[L]= å1 (inc) ; inc may depend ub S2: loop using iv L return å1 (inc) ; on j … Step 2: substitute IV S3: stmt using iv U … Replace (L,initval) Rof val = initval+inc_into_loop[L]

foreach si of type I,L,U Main: if type(si)=L Replace(si,val) Insert this if type(s )=L,I val=initialval totalinc = FindIncrement(L0) statement i If iv is live-out +inc_into_loop[L] Replace(L0,iv) InsertStatement(“iv = iv+totalinc”) +inc_after[si]

if type(si)=U Substitute(si,iv,val) For coupled GIVs: begin with independent iv. R. Eigenmann, Optimizing compilers Slide 74 Induction Variables, References

• B. Pottenger and R. Eigenmann. Idiom Recognition in the Polaris Parallelizing Compiler. ACM Int. Conf. on Supercomputing (ICS'95), June 1995.

• Mohammad R. Haghighat , Constantine D. Polychronopoulos, Symbolic analysis for parallelizing compilers, ACM Transactions on Programming Languages and Systems (TOPLAS), v.18 n.4, p.477-518, July 1996

• Michael P. Gerlek , Eric Stoltz , Michael Wolfe, Beyond induction variables: detecting and classifying sequences using a demand-driven SSA form, ACM Transactions on Programming Languages and Systems (TOPLAS), v.17 n.1, p.85-122, Jan. 1995

R. Eigenmann, Optimizing compilers Slide 75 Loop Skewing !$OMP PARALLEL DO DO i=1,4 DO set=1,?1,9 DO j=1,6 i = ?max(5-set,1) A[i,j]= A[i-1,j-1] j = ?max(-3+set,1) ENDDO setsize== ? min(4,5-abs(set-5)) ENDDO DO k=0,setsize-1 A(i+k,j+kA[i+k,j+k])=A(i = A[i-1+k,j-1+k,j-1+k)-1+k] ENDDO ENDDO

i Iteration space graph: Shared regions show sets of iterations in the transformed code that can be executed in parallel.

j R. Eigenmann, Optimizing compilers Slide 76 Loop Skewing for the Wavefront Method

DO i=2,n-1
  DO j=2,n-1
    A[i,j] = (A[i+1,j] + A[i-1,j] + A[i,j+1] + A[i,j-1])/4
  ENDDO
ENDDO

The outer loop is serial; the inner loop is parallel.

After skewing, the nest walks over anti-diagonals (the wavefront):

DO j=4, n+n-2
  DOALL i = max(2, j-n+1), min(n-1, j-2)
    A[i, j-i] = (A[i+1, j-i] + A[i-1, j-i] + A[i, j-i+1] + A[i, j-i-1])/4
  ENDDO
ENDDO

[Figure: iteration space with the anti-diagonal wavefronts.]

R. Eigenmann, Optimizing compilers Slide 77 IV.3 Techniques for Multiprocessors: Mapping Parallelism to Shared-memory Machines

R. Eigenmann, Optimizing compilers Slide 78 Loop Fusion and Distribution

Fused loop:
DO i=1,n
  A(i) = B(i)
  C(i) = A(i-1) + D(i)
ENDDO

Distributed (fissioned) loops:
DO i=1,n
  A(i) = B(i)
ENDDO
DO i=1,n
  C(i) = A(i-1)+D(i)
ENDDO

• Loop fusion is the reverse of loop distribution • Fusion reduces the loop fork/join overhead and enhances data affinity • Distribution inserts a barrier synchronization between parallel loops • Both transformations reorder computation • Legality: dependences in fused loop must be lexically forward

R. Eigenmann, Optimizing compilers Slide 79 Loop Distribution Enables Other Techniques DO i=1,n DO i=1,n A(i) = B(i)+A(i-1) • enables A(i) = B(i)+A(i-1) interchange ENDDO DO j=1,m • separates D(i,j)=E(i,j) out partial DOALL j=1,m ENDDO paralleism DO i=1,n ENDDO D(i,j)=E(i,j) ENDDO ENDDO In a program with multiply-nested loops, there can be a large number of possible program variants obtained through distribution and interchanging

R. Eigenmann, Optimizing compilers Slide 80 Enforcing Data Dependence

Criterion for correct transformation and execution of a computation involving a data dependence with direction vector v = (=, …, =, <, …, *):

Let Ls be the outermost loop with a non-"=" DD direction:
– Ls must be executed serially
– the direction at Ls must be "<"

Same rule applies to all dependences

Note that a data dependence is defined with respect to an ordered execution. For autoparallelization, this is the serial program order. User-defined, fully parallel loops by definition do not have cross-iteration dependences. Legality rules for transforming already parallel programs are different.

R. Eigenmann, Optimizing compilers Slide 81 Loop Interchange Legality of Loop interchange and resulting parallelism can be tested with the above rules: After loop interchange, the two conditions must still hold.

DO i=1,n DOALL j=1,m DOALL j=1,m DO i=1,n A(i,j) = A(i-1,j) A(i,j) = A(i-1,j) ENDDO ENDDO ENDDO ENDDO DOALL i=1,n DOALL j=1,m DO j=1,m DO i=1,n A(i,j) = A(i-1,j-1) A(i,j) = A(i-1,j-1) ENDDO ENDDO ENDDO ENDDO

R. Eigenmann, Optimizing compilers Slide 82 Loop Coalescing a.k.a. loop collapsing

Before:
PARALLEL DO i=1,n
  DO j=1,m
    A(i,j) = B(i,j)
  ENDDO
ENDDO

After loop coalescing:
PARALLEL DO ij=1,n*m
  i = 1 + (ij-1) DIV m
  j = 1 + (ij-1) MOD m
  A(i,j) = B(i,j)
ENDDO

Loop coalescing • can increase the number of iterations of a parallel loop à load balancing • adds additional computation à overhead
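A C/OpenMP sketch of coalescing; the second loop shows that OpenMP's collapse clause performs essentially the same transformation. Sizes are arbitrary.

#include <stdio.h>
#define N 40
#define M 30

int main(void) {
    static double a[N][M], b[N][M];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            b[i][j] = i * M + j;

    /* Coalesced by hand: one parallel loop of N*M iterations; the original
       indices are recomputed with division and modulo (the added overhead). */
    #pragma omp parallel for
    for (int ij = 0; ij < N * M; ij++) {
        int i = ij / M;
        int j = ij % M;
        a[i][j] = b[i][j];
    }

    /* The same effect, expressed with OpenMP's collapse clause. */
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            a[i][j] = b[i][j];

    printf("%f\n", a[N - 1][M - 1]);
    return 0;
}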

R. Eigenmann, Optimizing compilers Slide 83 Loop Blocking/Tiling

Before:
DO j=1,m
  DO i=1,n
    B(i,j)=A(i,j)+A(i,j-1)
  ENDDO
ENDDO

After blocking the i loop:
DO PARALLEL i1=1,n,block
  DO j=1,m
    DO i=i1,min(i1+block-1,n)
      B(i,j)=A(i,j)+A(i,j-1)
    ENDDO
  ENDDO
ENDDO

[Figure: the (i,j) iteration space; after blocking, each processor p1-p4 works on one block of i values across all j.]

This is basically the same transformation as stripmining, but followed by loop interchanging.

R. Eigenmann, Optimizing compilers Slide 84 Loop Blocking/Tiling continued

The same effect, expressed in OpenMP:

Before:
DO j=1,m
  DO i=1,n
    B(i,j)=A(i,j)+A(i,j-1)
  ENDDO
ENDDO

After:
!$OMP PARALLEL
DO j=1,m
!$OMP DO
  DO i=1,n
    B(i,j)=A(i,j)+A(i,j-1)
  ENDDO
!$OMP END DO NOWAIT
ENDDO
!$OMP END PARALLEL

[Figure: with a static schedule, each processor p1-p4 keeps the same block of i values across all j iterations.]

R. Eigenmann, Optimizing compilers Slide 85 Choosing the Block Size

The block size must be small enough so that all data references between the use and the reuse fit in cache.

DO j=1,m
  DO k=1,block
    … (r1 data references)
    … = A(k,j) + A(k,j-d)
    … (r2 data references)
  ENDDO
ENDDO

Number of references made between the access A(k,j) and the access A(k,j-d) to the same memory location: (r1+r2+3)*d*block

→ block < cachesize / ((r1+r2+3)*d)

If the cache is shared, all cores use it simultaneously, so the effective cache size appears smaller:
block < cachesize / ((r1+r2+3)*d*num_cores)
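A worked instance of this rule; the cache size, reference counts, and reuse distance below are assumed values for illustration, not measurements.

#include <stdio.h>

int main(void) {
    long cachesize = 512 * 1024 / 8;   /* assume a 512 KB cache, counted in doubles */
    int r1 = 4, r2 = 2, d = 8, num_cores = 8;

    long block_private = cachesize / ((long)(r1 + r2 + 3) * d);
    long block_shared  = cachesize / ((long)(r1 + r2 + 3) * d * num_cores);

    printf("private cache: block < %ld\n", block_private);   /* 910 */
    printf("shared cache:  block < %ld\n", block_shared);    /* 113 */
    return 0;
}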

Reference: Zhelong Pan, Brian Armstrong, Hansang Bae and Rudolf Eigenmann, On the Interaction of Tiling and Automatic Parallelization, First International Workshop on OpenMP (Wompat), 2005.

R. Eigenmann, Optimizing compilers Slide 86 Multi-level Parallelism from Single Loops

DO i=1,n
  A(i) = B(i)
ENDDO

Strip mining for multi-level parallelism:

PARALLEL DO (inter-cluster) i1=1,n,strip
  PARALLEL DO (intra-cluster) i=i1,min(i1+strip-1,n)
    A(i) = B(i)
  ENDDO
ENDDO

[Figure: a clustered machine; each cluster has a memory M shared by several processors P.]

R. Eigenmann, Optimizing compilers Slide 87 References

• High Performance Compilers for Parallel Computing, Michael Wolfe, Addison-Wesley, ISBN 0-8053-2730-4
• Optimizing Compilers for Modern Architectures: A Dependence-based Approach, Ken Kennedy and John R. Allen, Morgan Kaufmann Publishers, ISBN 1558602860

R. Eigenmann, Optimizing compilers Slide 88 IV.4 Advanced Program Analysis

R. Eigenmann, Optimizing compilers Slide 89 Interprocedural Analysis

• Most compiler techniques work intra- procedurally • Ideally, inter-procedural analyses and transformations available • In practice: inter-procedural operation of basic analysis works well • helps but no silver bullet

R. Eigenmann, Optimizing compilers Slide 90 Interprocedural Constant Propagation Making constant values of variables known across subroutine calls

Subroutine A
  j = 150
  call B(j)
END

Subroutine B(m)
  DO k=1,100
    X(k)=X(k+m)
  ENDDO
END

knowing that m>100 allows this loop to be parallelized

R. Eigenmann, Optimizing compilers Slide 91 An Algorithm for Interprocedural Constant Propagation Intra-procedural part: determine jump functions for all subroutines

Subroutine X(a,b,c)
  e = 10
  d = b+2
  call Y(c)
  f = b*2
  call Z(a,d,c,e,f)
END

Jump functions:
JY,1 = c
JZ,1 = a      (jump function of the first parameter)
JZ,2 = b+2
JZ,3 = ⊥      (called bottom, meaning non-constant)
JZ,4 = 10
JZ,5 = ⊥

• Mechanism for finding jump functions: (local) forward substitution and interprocedural MAYMOD information. • Here we assume the compiler supports jump functions of the form P+const (P is a subroutine parameter of the callee).

R. Eigenmann, Optimizing compilers Slide 92 Constant Propagation Algorithm:

Interprocedural Part

1. initialize all formal parameters to the value T (called top = non yet known) 2. for all jump functions: – if it is ^: set formal parameter value to ^ (called bottom = unknown) – if it is constant and the value of the formal parameter is the same constant or T : set the value to this constant 3. put all formal parameters on a work queue 4. repeat: take a parameter from the queue until queue is empty for all jump functions that contain this parameter: • determine the value of the target parameter of this jump function. Set it to this value, or to ^ if it is different from a previously set value. • if the value of the target parameter changes, put this parameter on the queue

R. Eigenmann, Optimizing compilers Slide 93 Examples of Constant Propagation

x = 3 Consider Call SubY(x) what happens if x = 3 x = 3 t = 6 t = 7 Call SubY(x) Call SubY(x) Call SubU(t)

Subroutine SubY(a) Subroutine SubY(a) Subroutine SubY(a) b = a+2 b = a+2 Call SubZ(b) … = ….a… Call SubZ(b)

Subroutine SubU(c) Subroutine SubZ(e) d = c-1 Call SubZ(d) … = … e….

Subroutine SubZ(e)

… = … e….

R. Eigenmann, Optimizing compilers Slide 94 Interprocedural Data-Dependence Analysis • Motivational examples:

Example 1:
DO i=1,n
  call clear(a,i)
ENDDO

Subroutine clear(x,j)
  x(j) = 0
END

Example 2:
DO i=1,n
  a(i) = b(i)
  call dupl(a,i)
ENDDO

Subroutine dupl(x,j)
  x(j) = 2*x(j)
END

Example 3:
DO k=1,m
  DO i=1,n
    a(i,k) = math(i,k)
    call smooth(a(i,k))
  ENDDO
ENDDO

Subroutine smooth(x,j)
  x(j) = (x(j-1)+x(j)+x(j+1))/3
END

R. Eigenmann, Optimizing compilers Slide 95 Interprocedural Data-Dependence Analysis • Overall strategy: – subroutine inlining – move loop into called subroutine – collect array access information in callee and use in the analysis of the caller ® will be discussed in more detail

R. Eigenmann, Optimizing compilers Slide 96 Interprocedural Data-Dependence Analysis • Representing array access information – summary information • [low:high] or [low:high:stride] • sets of the above – exact representation • essentially all loop bound and subscript information is captured – representation of multiple subscripts • separate representation • linearized

R. Eigenmann, Optimizing compilers Slide 97 Interprocedural Data-Dependence Analysis • Reshaping arrays – simple conversion • matching subarray or 2-D®1-D – exact reshaping with div and mod – linearizing both arrays – equivalencing the two shapes • can be used in subroutine inlining Important: reshaping may lose the implicit assertion that array bounds are not violated!

R. Eigenmann, Optimizing compilers Slide 98 Symbolic Analysis

• Expression manipulation techniques – Expression simplification/normalization – Expression comparison – Symbolic arithmetic • Range analysis – Find lower/upper bounds of variable values at a given statement • For each statement and variable, or • Demand-driven, for a given statement and variable

R. Eigenmann, Optimizing compilers Slide 99 Symbolic Range Analysis Example

int foo(int k)
{                              // ranges: []
  int i, j;                    // []
  double a;                    // []
  for ( i=0; i<10; ++i ) {     // [0<=i<=9]
    a=(0.5*i);
  }
                               // [i=10]
  j=(i+k);
                               // [i=10, j=(i+k)]
  return j;
}

R. Eigenmann, Optimizing compilers Slide 100

Find references to the same storage by different names Þ Program analyses and transformations must consider all these names

Simple case: different named variables allocated in same storage location • Fortran Equivalence statement • Same variable passed to subroutine by-reference as two different parameters (can happen in Fortran and C++, but not in C) • Global variable also passed as subroutine parameter

R. Eigenmann, Optimizing compilers Slide 101 Pointer Alias Analysis

• More complex: variables pointed to by named pointers – p=&a; q=&a => *p, *q are aliases – Same variable passed to C subroutines via pointer

• Most complex: pointers between dynamic data structure objects – This is commonly referred to as shape analysis

R. Eigenmann, Optimizing compilers Slide 102 Is Alias Analysis in Parallelizing Compilers Important? • Fortran77: alias analysis is simple/absent – By Fortran rule, aliased subroutine parameters must not be written to – there are no pointers • C programs: alias analysis is a must – Pointers, pointer arithmetic – No Fortran-like rule about subroutine parameters – Without alias information, compilers would have to be very conservative => big loss of parallelism – Classical science/engineering applications do not have dynamic data structures => no shape analysis needed

R. Eigenmann, Optimizing compilers Slide 103 IV.5 Dynamic Decision Support

R. Eigenmann, Optimizing compilers Slide 104 Achilles’ Heel of Compilers

Big compiler limitations: – Insufficient compile-time knowledge • Input data • Architecture parameters (e.g., cache size) • Memory layout – Even if this information is known: Performance models too complex Effect: – Unknown profitability of optimizations – Inconsistent performance behavior – Conservative behavior of compilers – Many compiler options – Users need to experiment with options

R. Eigenmann, Optimizing compilers Slide 105 Multi-version Code

IF (d>n) THEN
  PARALLEL DO i=1,n
    a(i) = a(i+d)
  ENDDO
ELSE
  DO i=1,n
    a(i) = a(i+d)
  ENDDO
ENDIF

Limitations:
• Less readable
• Additional code
• Not feasible for all optimizations
• Combinatorial explosion when trying to apply it to many optimization decisions
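A C/OpenMP sketch of such two-version code for the example above; N and d are illustrative, and in practice d would only be known at run time.

#include <stdio.h>
#define N 100000

int main(void) {
    static double a[2 * N];
    for (int i = 0; i < 2 * N; i++) a[i] = i;
    int d = N;   /* generally unknown at compile time */

    /* The parallel version is only taken when the runtime test proves that
       the written range a[0..N-1] and the read range a[d..N-1+d] cannot overlap. */
    if (d >= N) {
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = a[i + d];
    } else {
        for (int i = 0; i < N; i++)
            a[i] = a[i + d];
    }
    printf("%f\n", a[0]);
    return 0;
}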

R. Eigenmann, Optimizing compilers Slide 106 Profiling

• Gather missing information in a profile run – Compiler instruments code that gathers at runtime information needed for optimization decisions • Use the gathered profile information for improved decision making in a second compiler invocation • Training vs. production data • Initially used for branch prediction. Now increasingly used to guide additional transformations. • Requires a compiler performance model

R. Eigenmann, Optimizing compilers Slide 107 Autotuning – Empirical Tuning

Autotuning: try many optimization variants; pick the best at runtime.

[Figure: a tuning cycle of version generation, search-space navigation, and runtime evaluation.]

• No compiler performance model needed
• Optimization decisions based on true execution time
• Dependence on training data (same as profiling)
• Potentially huge search space
• Whole-program vs. section-level tuning

Many active research projects

R. Eigenmann, Optimizing compilers Slide 108 IV.4 Techniques for Vector Machines

R. Eigenmann, Optimizing compilers Slide 109 Vector Instructions

A vector instruction operates on a number of data elements at once. Example: vadd va,vb,vc,32 vector operation of length 32 on vector registers va,vb, and vc – va,vb,vc can be • Special cpu registers or memory ® classical supercomputers • Regular registers, subdivided into shorter partitions (e.g., 64bit register split 8-way) ® multi-media extensions – The operations on the different vector elements can overlap ® vector pipelining

R. Eigenmann, Optimizing compilers Slide 110 Applications of Vector Operations • Science/engineering applications are typically regular with large loop iteration counts. This was ideal for classical supercomputers, which had long vectors (up to 256; vector pipeline startup was costly). • Graphics applications can exploit “multi- media” register features and instruction sets.

R. Eigenmann, Optimizing compilers Slide 111 Basic Vector Transformation

DO i=1,n
  A(i) = B(i)+C(i)           →   A(1:n)=B(1:n)+C(1:n)
ENDDO

DO i=1,n
  A(i) = B(i)+C(i)           →   A(1:n)=B(1:n)+C(1:n)
  C(i-1) = D(i)**2               C(0:n-1)=D(1:n)**2
ENDDO

The triplet notation is interpreted to mean “vector operation”. Notice that this is not (necessarily) the same meaning as in Fortran 90,

R. Eigenmann, Optimizing compilers Slide 112 Distribution and Vectorization The transformation done on the previous slide involves loop distribution. Loop distribution reorders computation and is thus subject to data dependence constraints. loop DO i=1,n distribution A(i) = B(i)+C(i) DO i=1,n ENDDO A(i) = B(i)+C(i) D(i) = A(i)+A(i-1) DO i=1,n dependence ENDDO D(i) = A(i)+A(i-1) ENDDO vectorization

The transformation is not legal if there is a A(1:n)=B(1:n)+C(1:n) lexical-backward dependence: D(1:n)=A(1:n)+A(0:n-1) DO i=1,n loop-carried A(i) = B(i)+C(i) dependence Statement reordering may help C(i+1) = D(i)**2 resolve the problem. However, this is not possible if there is a dependence ENDDO cycle.

R. Eigenmann, Optimizing compilers Slide 113 Vectorization Needs Expansion

... as opposed to privatization

DO i=1,n DO i=1,n t = A(i)+B(i) expansion T(i) = A(i)+B(i) C(i) = t + t**2 C(i) = T(i) + T(i)**2 ENDDO ENDDO vectorization

T(1:n) = A(1:n)+B(1:n) C(1:n) = T(1:n)+T(1:n)**2

R. Eigenmann, Optimizing compilers Slide 114 Conditional Vectorization

DO i=1,n IF (A(i) < 0) A(i)=-A(i) ENDDO

conditional vectorization

WHERE (A(1:n) < 0) A(1:n)=-A(1:n)

R. Eigenmann, Optimizing compilers Slide 115 Stripmining for Vectorization

Before:
DO i=1,n
  A(i) = B(i)
ENDDO

After stripmining (vector length 32):
DO i1=1,n,32
  DO i=i1,min(i1+31,n)
    A(i) = B(i)
  ENDDO
ENDDO

Stripmining turns a single loop into a doubly-nested loop for two-level parallelism. It also needs to be done by the code-generating compiler to split an operation into chunks of the available vector length.

R. Eigenmann, Optimizing compilers Slide 116 IV.7 Compiling for Heterogeneous Architectures

R. Eigenmann, Optimizing compilers Slide 117 Why Heterogeneous Architectures? • Performance – Fast uniprocessor best for serial code – Many simple cores best for highly parallel code – Special-purpose architectures for accelerating certain code patterns • E.g., math co-processor • Energy – Same arguments hold for power savings

R. Eigenmann, Optimizing compilers Slide 118 Examples of Accelerators

• GPU Accelerators are typically used as co-processors. • nvidia GPGPU • CPU+accelerator = heterogeneous • Shared or distributed address space • IBM Cell • Intel MIC • FPGAs • Crypto processor • Network processor • Video Encoder/decoder

R. Eigenmann, Optimizing compilers Slide 119 Accelerator Architecture

CPU GPU

Example GPGPU: Grid Thread Block M

• Address space is Thread Block 0 separate from CPU Shared Memory

• Complex Memory Registers Registers CUDA hierarchy Memory • Large number of cores Thread 0 ••• Thread K Model Local Local • Multithreaded SIMD Memory Memory execution Global Memory • Optimized for coalesced CPU Texture Memory with a Dedicated Cache (stride-1) accesses Memory Constant Memory with a Dedicated Cache

R. Eigenmann, Optimizing compilers Slide 120 Compiler Optimizations for GPGPUs • Optimizing GPU Global Memory Accesses – Parallel Loop Swap – Loop Collapsing – Matrix Transpose

• Exploiting GPU On-chip Memories

• Optimizing CPU-GPU Data Movement – Resident GPU Variable Analysis – Live CPU Variable Analysis – Memory Transfer Promotion Optimization

R. Eigenmann, Optimizing compilers Slide 121 Parallel Loop-Swap Transformation Memory access at time t k #pragma omp parallel for Thread ID for(i=0; i< N; i++) T0 for(k=0; k

#pragma omp parallel for Thread ID T0 T1 T2 T3 k schedule(static, 1) for(k=0; k

R. Eigenmann, Optimizing compilers Slide 122 Loop Collapsing Transformation k #pragma omp parallel for Thread ID for(i=0; i

#pragma omp parallel Thread ID T0 T1 T2 T3 T4 T5 T6 T7 #pragma omp for collapse(2) schedule(static, 1) for(i=0; i

R. Eigenmann, Optimizing compilers Slide 123 Matrix-Transpose Transformation Memory access at time t i Thread ID A float d[N][M] T0 ... T1 k T2 T3 #kernel function: Global Memory float d[M][N] # pragma omp parallel Thread ID T0 T1 T2 T3 for(k=0; i< N; i++) k for(i=0; k

Global Memory

R. Eigenmann, Optimizing compilers Slide 124 Techniques to Exploit GPU On-chip Memories Caching Strategies

Variable type                                   Caching strategy
R/O shared scalar w/o locality                  SM
R/O shared scalar w/ locality                   SM, CM, Reg
R/W shared scalar w/ locality                   Reg, SM
R/W shared array element w/ locality            Reg
R/O 1-dimensional shared array w/ locality      TM
R/W private array w/ locality                   SM

Reg: Registers CM: Constant Memory SM: Shared Memory TM: Texture Memory

R. Eigenmann, Optimizing compilers Slide 125 Techniques to Optimize Data Movement between CPU and GPU • Resident GPU Variable Analysis – Up-to-date data in GPU global memory: do not copy again from CPU. • Live CPU Variable Analysis – After a kernel finishes: only transfer live CPU variables from GPU to CPU. • Memory Transfer Promotion Optimization – Find optimal points to insert necessary memory transfers

R. Eigenmann, Optimizing compilers Slide 126 GPGPU Performance Relative to CPU

R. Eigenmann, Optimizing compilers Slide 127 [Figure: GPGPU speedups relative to the CPU per benchmark, and the importance of individual optimizations; the chart labels did not survive extraction.]

R. Eigenmann, Optimizing compilers Slide 128 IV.8 Techniques Specific to Distributed-memory Machines

R. Eigenmann, Optimizing compilers Slide 129 Execution Scheme on a Distributed-Memory Machine Typical execution scheme: • All nodes execute the same program M M M M • Program uses node_id to select the P P P P subcomputation to execute on each participating processor and the data to access. For example, mystrip=én/max_nodesù DO i=1,n lb = node_id*mystrip +1 ub = min(lb+mystrip-1,n) . . . how to place DO i=lb,ub and access ENDDO . . . data ? ENDDO how/when to synchronize ? This is called Single-Program-Multiple-Data (SPMD) execution scheme

R. Eigenmann, Optimizing compilers Slide 130 Data Placement

Single owner: • Data is distributed onto the participating processors’ memories

Replication: • Multiple versions of the data are placed on some or all nodes.

R. Eigenmann, Optimizing compilers Slide 131 Data Distribution Schemes

Numbers indicate the node of a 4-processor distributed-memory machine on which each array section is placed.

  block distribution:         1111 2222 3333 4444
  cyclic distribution:        1234 1234 1234 1234 ...
  block-cyclic distribution:  11 22 33 44 11 22 ...
  indexed distribution:       IND(1) IND(2) IND(3) IND(4) IND(5) ...
                              (an index array gives the owning node)
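As a hedged illustration (not from the slide), the owning node of element i of an n-element array on P nodes can be computed as follows for the regular schemes; for the indexed distribution the owner is simply looked up in the index array.

  /* Owner of element i (0-based). P = number of nodes, n = array length,
     B = block size of the block-cyclic scheme. */
  int owner_block(int i, int n, int P)        { int b = (n + P - 1) / P; return i / b; }
  int owner_cyclic(int i, int P)              { return i % P; }
  int owner_block_cyclic(int i, int B, int P) { return (i / B) % P; }
  /* Indexed distribution: owner = IND[i]. */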

Automatic data distribution is difficult because it is a global optimization.

R. Eigenmann, Optimizing compilers Slide 132 Message Generation for single-owner placement

EXAMPLE
  Original loop:
  DO i=1,n
    B(i) = A(i)+A(i-1)
  ENDDO

  After message generation:
  send (A(ub), my_proc+1)
  receive (A(lb-1), my_proc-1)
  DO i=lb,ub
    B(i) = A(i)+A(i-1)
  ENDDO

• lb, ub determine the iterations assigned to each processor.
• The data uses a block distribution that matches the iteration distribution.
• my_proc is the current processor number.
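A hedged C/MPI rendering of the generated communication for this block-distributed stencil (0-based indexing, lb/ub delimiting the owned range with ub exclusive; for brevity the arrays are indexed with global indices):

  #include <mpi.h>

  /* Each node owns A[lb..ub-1] and B[lb..ub-1]. Computing B[i] = A[i] + A[i-1]
     needs one halo element, A[lb-1], from the left neighbor. */
  void stencil_step(double *A, double *B, int lb, int ub, int my_proc, int nprocs) {
      int right = (my_proc + 1 < nprocs) ? my_proc + 1 : MPI_PROC_NULL;
      int left  = (my_proc > 0)          ? my_proc - 1 : MPI_PROC_NULL;
      double halo = 0.0;

      /* send my last owned element to the right, receive my halo from the left */
      MPI_Sendrecv(&A[ub - 1], 1, MPI_DOUBLE, right, 0,
                   &halo,      1, MPI_DOUBLE, left,  0,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      if (my_proc > 0)
          A[lb - 1] = halo;

      int start = (lb > 0) ? lb : 1;   /* the global loop runs over i = 1..n-1 */
      for (int i = start; i < ub; i++)
          B[i] = A[i] + A[i - 1];
  }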

Compilers for languages such as HPF (High-Performance Fortran) have explored these ideas extensively

R. Eigenmann, Optimizing compilers Slide 133 Owner-computes Scheme
In general, the elements accessed by a processor differ from the elements it owns as defined by the data distribution.

  Original loop:
  DO i=1,n
    A(i) = B(i)+B(i-m)
    C(ind(i)) = D(ind2(i))
  ENDDO

  Owner-computes version:
  DO i=1,n
    send/receive what's necessary
    IF I_own(A(i)) THEN
      A(i) = B(i)+B(i-m)
    ENDIF
    send/receive what's necessary
    IF I_own(C(ind(i))) THEN
      C(ind(i)) = D(ind2(i))
    ENDIF
  ENDDO

• Nodes execute those iterations and statements whose LHS they own.
• First, they receive needed RHS elements from remote nodes.
• Nodes must send all elements needed by other nodes.
The example shows the basic idea only; compiler optimizations are needed!

R. Eigenmann, Optimizing compilers Slide 134 Compiler Optimizations for the Raw Owner-computes Scheme
• Eliminate conditional execution
  – combine IF statements with the same condition
  – reduce the iteration space if possible
• Aggregate communication
  – combine small messages into larger ones
  – tradeoff: delaying a message enables aggregation but increases the message latency
• Message prefetch
  – move send operations earlier in order to reduce message latencies
There is a large number of research papers describing such techniques.

R. Eigenmann, Optimizing compilers Slide 135 Message Generation for Virtual Data Replication

[Figure: execution timeline (time runs downward): a fully parallel section with local reads and writes, then a broadcast of the written data, then the next fully parallel section, and so on.]
Optimization: reduce broadcast operations to the necessary point-to-point communication.

Advantages:
• Fully parallel sections with local reads and writes
• Easier message set computation (no partitioning per processor needed)

Disadvantages:
• Not data-scalable
• More write operations necessary (but collective communication can be used)

R. Eigenmann, Optimizing compilers Slide 136 IV.9 Techniques for Instruction-Level Parallelization

R. Eigenmann, Optimizing compilers Slide 137 Implicit vs. Explicit ILP

Implicit ILP: the ISA is the same as for sequential programs.
– Most processors today employ a certain degree of implicit ILP.
– Parallelism detection is done entirely by the hardware.
– The compiler can assist ILP by arranging the code so that detection becomes easier.

R. Eigenmann, Optimizing compilers Slide 138 Implicit vs. Explicit ILP

Explicit ILP: the ISA expresses parallelism.
– Parallelism is detected by the compiler.
– Parallelism is expressed in the form of:
  • VLIW (Very Long Instruction Word): several instructions are packed into one long word.
  • EPIC (Explicitly Parallel Instruction Computing): bundles of (up to three) instructions are issued, and dependence bits can be specified. Used in the Intel/HP IA-64 architecture. The processor also supports predication, early (speculative) loads, prepare-to-branch, and rotating registers.

R. Eigenmann, Optimizing compilers Slide 139 Trace Scheduling (invented for VLIW processors, still useful terminology)

Two big issues must be solved by all approaches:
1. Identifying the instruction sequence that will be inspected for ILP (trace selection). Main obstacle: branches.
2. Reordering the instructions so that machine resources are exploited efficiently (trace compaction).
Together, the two steps constitute trace scheduling.

R. Eigenmann, Optimizing compilers Slide 140 Trace Selection

• It is important to have a large instruction window (block) within which the compiler can find parallelism.
• Branches are the problem: instruction pipelines have to be flushed/squashed at branches.
• Possible remedies:
  – eliminate branches
  – code motion can increase block size
  – a block can contain out-branches with low probability
  – predicated execution

R. Eigenmann, Optimizing compilers Slide 141 Branch Elimination

• Example:

Before:
    comp R0 R1
    bne  L1
    bra  L2
  L1: . . .
  L2: . . .

After branch elimination:
    comp R0 R1
    beq  L2
  L1: . . .
  L2: . . .

R. Eigenmann, Optimizing compilers Slide 142 Code Motion

[Figure: control-flow trees illustrating code motion involving instruction I1 and conditionals c1 and c2; the diagram did not survive extraction.]
Code motion can increase window sizes and eliminate subtrees.
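A hedged source-level analogue of what the slide shows at the instruction level (variable names are illustrative): an operation that appears in both arms of a branch can be moved above the branch, which enlarges the straight-line block and eliminates a duplicated subtree.

  /* Before: t = a * b appears in both arms of the branch. */
  int before_motion(int a, int b, int c1) {
      int t;
      if (c1) { t = a * b; t = t + 1; }
      else    { t = a * b; t = t - 1; }
      return t;
  }

  /* After code motion: the common operation is hoisted above the branch. */
  int after_motion(int a, int b, int c1) {
      int t = a * b;
      if (c1) t = t + 1;
      else    t = t - 1;
      return t;
  }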

R. Eigenmann, Optimizing compilers Slide 143 Predicated Execution

  Source:               Predicated form:
  IF (a>0) THEN         p = a>0       ; assignment of predicate
    b = a               p:  b = a     ; executed if predicate true
  ELSE                  !p: b = -a    ; executed if predicate false
    b = -a
  ENDIF

Predication
• increases the window size for analyzing and exploiting parallelism
• increases the number of instructions “executed”
These are opposite demands!

Compare this technique to conditional vectorization

R. Eigenmann, Optimizing compilers Slide 144 Dependence-removing ILP Techniques

Induction variable example:

  With dependences:      After removal:
  ind = i0               ind = i0
  . . .                  . . .
  ind = ind+1            ind = i0+1
  . . .                  . . .
  ind = ind+1            ind = i0+2

Reduction example:

  With dependences:      After removal:
  sum = sum+expr1        s1 = expr1
  . . .                  . . .
  sum = sum+expr2        s1 = s1+expr2
  . . .                  . . .
  sum = sum+expr3        s2 = expr3
  . . .                  . . .
  sum = sum+expr4        s2 = s2+expr4
                         . . .
                         sum = sum+s1+s2

In each case, every statement on the left depends on the previous one. After the transformation, the statement groups on the right (shown as shaded blocks on the slide) are independent of each other and can be executed as parallel instructions.
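A hedged C rendering of the reduction example (expr1..expr4 stand for arbitrary independent expressions; the function form is illustrative): after splitting, the two partial-sum chains carry no dependence between them and can be issued as parallel instructions, with a single combining add at the end.

  /* Before: one dependence chain through 'sum' serializes all four additions. */
  double reduce_serial(double e1, double e2, double e3, double e4) {
      double sum = 0.0;
      sum = sum + e1;
      sum = sum + e2;
      sum = sum + e3;
      sum = sum + e4;
      return sum;
  }

  /* After: the chains through s1 and s2 are independent of each other;
     only the final statement combines them. */
  double reduce_split(double e1, double e2, double e3, double e4) {
      double s1 = e1;
      double s2 = e3;
      s1 = s1 + e2;
      s2 = s2 + e4;
      return s1 + s2;
  }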

R. Eigenmann, Optimizing compilers Slide 145 Speculative ILP
Speculation is performed by the architecture in various forms:
– Superscalar processors: the compiler only has to deal with the performance model; the ISA is the same as for non-speculative processors.
– Multiscalar processors (research only): the compiler defines tasks that the hardware can try to execute speculatively in parallel. Other than task boundaries, the ISA is the same.

References:
• Task Selection for a Multiscalar Processor, T. N. Vijaykumar and Gurindar S. Sohi, The 31st International Symposium on Microarchitecture (MICRO-31), pp. 81-92, December 1998.
• Reference Idempotency Analysis: A Framework for Optimizing Speculative Execution, Seon-Wook Kim, Chong-Liang Ooi, Rudolf Eigenmann, Babak Falsafi, and T. N. Vijaykumar, In Proc. of PPOPP'01, Symposium on Principles and Practice of Parallel Programming, 2001.

R. Eigenmann, Optimizing compilers Slide 146 Compiler Model of Explicit Speculative Parallel Execution (Multiscalar Processor)
• Overall Execution: speculative threads choose and start the execution of any predicted next thread.
• Data Dependence and Control Flow Violations lead to roll-backs.
• Final Execution: satisfies all cross-segment flow and control dependences.
• Data Access: writes go to thread-private speculative storage; reads read from an ancestor thread or from memory.
• Dependence Tracking: data flow and control flow dependences are detected directly and lead to roll-back; anti and output dependences are satisfied via speculative storage.
• Segment Commit: correctly executed threads (i.e., their final execution) commit their speculative storage to memory, in sequential order.

R. Eigenmann, Optimizing compilers Slide 147