TDDC86 Compiler Optimizations and Code Generation

Optimization and Parallelization of Sequential Programs
Lecture 7
Christoph Kessler, IDA / PELAB, Linköping University, Sweden

Outline

Towards (semi-)automatic parallelization of sequential programs:
 Data dependence analysis for loops
 Some loop transformations
   Loop invariant code hoisting, …, loop fusion, …, loop blocking and tiling
 Static loop parallelization
 Run-time loop parallelization
   Doacross parallelization, inspector-executor method
   Speculative parallelization (as time permits)
 Auto-tuning (later, if time)


Foundations: Control and Data Dependence

 Consider statements S, T in a sequential program (S = T possible)
 Scope of analysis is typically a function, i.e. intra-procedural analysis
 Can be done at arbitrary granularity (instructions, operations, statements, compound statements, program regions)
 Relevant are only the read and write effects on memory (i.e. on program variables) of each operation, and the effect on control flow

 Control dependence S → T: whether T is executed may depend on S (e.g. a condition)
   Implies that the relative execution order "S before T" must be preserved when restructuring the program
   Mostly obvious from the nesting structure in well-structured programs, but more tricky in arbitrary branching code (e.g. assembler code)
   Example:
      S: if (…) {
         …
      T:    …
         …
         }

 Data dependence S → T: statement S may execute (dynamically) before T, a control flow path S … T is possible, both may access the same memory location, and at least one of these accesses is a write
   Means that the execution order "S before T" must be preserved when restructuring the program
   In general, only a conservative over-estimation can be determined statically
   Flow dependence (RAW, read-after-write): S may write a location z that T may read
      Example:
         S: z = … ;
         …
         T: … = ..z.. ;
   Anti dependence (WAR, write-after-read): S may read a location x that T may overwrite
   Output dependence (WAW, write-after-write): both S and T may write the same location
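To make the three kinds of data dependence concrete, here is a small C fragment (an illustrative example of mine, not taken from the slides):

    x = a + b;     /* S1: writes x                                        */
    y = x + 1;     /* S2: reads x  -> flow dependence (RAW) S1 -> S2      */
    x = c * 2;     /* S3: overwrites x read by S2 -> anti dependence
                      (WAR) S2 -> S3; S1 and S3 both write x -> output
                      dependence (WAW) S1 -> S3                           */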


Dependence Graph

 (Data, Control, Program) Dependence Graph: directed graph consisting of all statements as vertices and all (data, control, any) dependences as edges.

Data Dependence Graph

 The data dependence graph for straight-line code ("basic block", no branching) is always acyclic, because the relative execution order of statements is forward only.
 Data dependence graph for a loop:
   Dependence edge S → T if a dependence may exist for some pair of instances (iterations) of S, T
   Cycles possible
   Loop-independent versus loop-carried dependences

 Example (code and dependence graph shown on slide), assuming we know statically that arrays a and b do not intersect.
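As a small substitute for the slide's example, the following C loop (my own sketch, assuming arrays a, b, c, d do not overlap) shows both a loop-independent and a loop-carried dependence:

    for (i = 1; i < n; i++) {
        a[i] = b[i] + 1;       /* S1 */
        c[i] = a[i] * 2;       /* S2: reads a[i] written by S1 in the same
                                  iteration -> loop-independent dependence */
        d[i] = a[i-1] + c[i];  /* S3: reads a[i-1] written by S1 in the
                                  previous iteration -> loop-carried
                                  dependence S1 -> S3                      */
    }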

Example

 (Code, its unrolled iterations, and the data dependence graph with statements S1, S2 are shown on the slide, assuming that we statically know that arrays A, X, Y, Z do not intersect; otherwise there might be further dependences.)

Why Loop Optimization and Parallelization?

Loops are a promising object for program optimizations, including automatic parallelization:
 High execution frequency: most computation is done in (inner) loops, so even small optimizations can have a large impact (cf. Amdahl's Law)
 Regular, repetitive behavior: compact description, relatively simple to analyze statically
 Well researched

Loop Optimizations – General Issues

 Move loop invariant computations out of loops
 Modify the order of iterations or parts thereof

Goals:
 Improve data access locality
 Faster execution
 Reduce loop control overhead
 Enhance possibilities for loop parallelization or vectorization

Only transformations that preserve the program semantics (its input/output behavior) are admissible.
 Conservative (static) criterion: preserve data dependences
 Need data dependence analysis for loops (see TDDC86)

Data Dependence Analysis – A More Formal Introduction

Data Dependence Analysis – Overview

 Important for loop optimizations, vectorization and parallelization, data cache optimizations
 Conservative approximations to disjointness of pairs of memory accesses
   Weaker than data-flow analysis, but generalizes nicely to the level of individual array elements
 Loops, loop nests → iteration space
 Array subscripts in loops → index space
 Dependence testing methods
 Data dependence graph, data + control dependence graph, program dependence graph

Precedence Relation between Statements


Control and Data Dependence; Data Dependence Graph

Dependence Graph


Remark: Target-Level Dependence Graphs

 For VLIR / target code, the dependence edges may be labeled with the latency or delay of the operation
 See the lecture on instruction scheduling for details

Loop Normalization
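The slide's own content for loop normalization is not preserved in this extraction. As a sketch under the usual definition (my own example, assuming integer bounds lo <= hi and step s > 0): normalization rewrites the loop so that its counter runs from 0 upwards in steps of 1, which simplifies iteration spaces and later dependence testing.

    /* Original loop with general lower bound and stride */
    for (i = lo; i <= hi; i += s)
        a[i] = b[i] + 1;

    /* Normalized loop: counter j = 0, 1, 2, ...; i is recomputed from j */
    for (j = 0; j <= (hi - lo) / s; j++) {
        i = lo + j * s;
        a[i] = b[i] + 1;
    }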


Loop Iteration Space

Dependence Distance and Direction
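The figures for these two slides are not reproduced here. As an illustration (my own example): in a singly nested loop, the dependence distance is the difference between the iteration numbers of the two dependent statement instances, and the direction indicates its sign.

    for (i = 2; i < n; i++) {
        a[i] = c[i] + 1;       /* S1: writes a[i]                      */
        b[i] = a[i-2] * 2;     /* S2: reads a[i-2], written by S1 two
                                  iterations earlier                   */
    }
    /* Flow dependence S1 -> S2, loop-carried, with distance 2 and
       direction "<" (from earlier to later iterations).              */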


Example

Dependence Equation System
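The worked example on the slide is not preserved. As a small sketch of how a dependence equation system arises (my own example, assuming arrays a, b, c do not overlap):

    for (i = 0; i < n; i++) {
        a[2*i + 3] = b[i];     /* S1: writes element 2*i + 3 of a */
        c[i] = a[4*i];         /* S2: reads  element 4*i   of a   */
    }

A dependence between the write in iteration i1 and the read in iteration i2 exists iff there are integers 0 <= i1, i2 < n with 2*i1 + 3 = 4*i2, i.e. the linear Diophantine equation 2*i1 - 4*i2 = -3 together with the loop-bound inequalities. This is the dependence equation system that the dependence tests below try to solve or refute.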


Linear Diophantine Equations

Dependence Testing, 1: GCD Test
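As a hedged worked example of the GCD test (my own numbers, continuing the dependence equation derived above): a linear Diophantine equation a*x + b*y = c has an integer solution iff gcd(a, b) divides c.

    Dependence equation:   2*i1 + 3 = 4*i2
    Rewritten:             2*i1 - 4*i2 = -3
    GCD test:              gcd(2, 4) = 2 does not divide 3
                           => no integer solution => no dependence

If the GCD does divide the constant, the test is inconclusive (a dependence may or may not exist within the loop bounds), which is why stronger tests exist.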


For multidimensional arrays?

Survey of Dependence Tests


Loop Transformations and Parallelization

Some Important Loop Transformations

 Loop normalization
 Loop parallelization
 Loop invariant code hoisting
 Loop interchange
 Loop fusion vs. loop distribution / fission
 Strip-mining / loop tiling / blocking vs. loop linearization
 Loop unrolling, unroll-and-jam
 Loop peeling
 Index set splitting, loop unswitching
 Scalar replacement, scalar expansion
 Later / more: cycle shrinking, loop skewing, ...

Loop Invariant Code Hoisting

 Move loop invariant code out of the loop
 Compilers can do this automatically if they can statically find out what code is loop invariant
 Example:

    /* before hoisting */
    for (i = 0; i < 10; i++)
        a[i] = b[i] + c / d;

    /* after hoisting */
    tmp = c / d;
    for (i = 0; i < 10; i++)
        a[i] = b[i] + tmp;


Loop Unrolling

 Can be enforced with compiler options, e.g. -funroll=2
 Example:

    /* before unrolling */
    for (i = 0; i < 50; i++) {
        /* body(i) */
    }

    /* unrolled by a factor of 2 */
    for (i = 0; i < 50; i += 2) {
        /* body(i)   */
        /* body(i+1) */
    }

 A longer loop body may enable further local optimizations (e.g. common subexpression elimination, instruction scheduling, using SIMD instructions)
 Drawback: longer code
 Exercise: formulate the unrolling rule for a statically unknown upper loop limit (see "Loop Unrolling with Unknown Upper Bound" below)

Loop Interchange (1)

 For properly nested loops (statements in the innermost loop body only)
 Example 1 (the loop body accesses a two-dimensional array a; its exact form is not preserved in this extraction):

    /* old iteration order */
    for (i = 0; i < …; i++)
        for (j = 0; j < …; j++)
            /* … a[…][…] … */

    /* new iteration order after interchanging the loop headers */
    for (j = 0; j < …; j++)
        for (i = 0; i < …; i++)
            /* … a[…][…] … */

 Arrays are stored row by row in C and Java, so choosing the order in which the innermost loop traverses consecutive memory locations can improve data access locality in the memory hierarchy (fewer cache misses / page faults)
 (Figure on slide: old vs. new traversal order of a, up to element a[N-1][0])

Foundations: Loop-Carried Data Dependences

 Recall: data dependence S → T, if operation S may execute (dynamically) before operation T, both may access the same memory location, and at least one of these accesses is a write
      S: z = … ;
      …
      T: … = ..z.. ;
 A data dependence S → T is called loop carried by a loop L if the dependence may exist for instances of S and T in different iterations of L
 Example: L: for (i = 1; i < … ; i++) { … }  (loop body and iteration-space figure on slide)

Loop Interchange (2)

 Be careful with loop-carried data dependences!
 Example 2:

    for (j = 1; j < … ; j++)
        for (i = … ; i < … ; i++)
            a[i][j] = … a[i+1][j-1] … ;

 With j as the outer loop, iteration (j, i) reads location a[i+1][j-1], which was written in an earlier iteration (loop-carried flow dependence; see the iteration-space figure on the slide)
 After interchanging the headers (i outer, j inner), iteration (i, j) would read location a[i+1][j-1] before it is overwritten, i.e. it would read the old value: the dependence would be violated, so this interchange is not admissible

Loop Interchange (3)

 Be careful with loop-carried data dependences!
 Example 3:

    for (j = 1; j < … ; j++)
        …   (code and iteration-space figure on slide)

 Generally: interchanging loop headers is only admissible if the loop-carried dependences have the same direction for all loops in the loop nest (all directed along, or all against, the iteration order)

Loop Fusion

 Merge subsequent loops with the same header
 Safe if fusion does not create a (backward) loop-carried dependence between the bodies of the two original loops
 Can improve data access locality and reduces the number of branches (see the sketch below)
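A minimal C sketch of loop fusion (my own example, assuming a, b, c do not overlap); fusion is legal here because the value of b[i] is produced and consumed in the same iteration of the fused loop:

    /* Before fusion: two loops with identical headers */
    for (i = 0; i < n; i++)
        b[i] = a[i] * 2.0;
    for (i = 0; i < n; i++)
        c[i] = b[i] + 1.0;

    /* After fusion: one pass over the data, better locality for b,
       and half the loop-control overhead                            */
    for (i = 0; i < n; i++) {
        b[i] = a[i] * 2.0;
        c[i] = b[i] + 1.0;
    }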

Loop Iteration Reordering

 Example (on slide): the j-loop carries a dependence, so its iteration order must be preserved; loops that carry no dependence may be reordered

Loop Parallelization

 Example (on slide): the j-loop carries a dependence, so its iteration order must be preserved, while a loop that carries no dependence can be parallelized
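As a concrete sketch (my own example in OpenMP notation, which is not necessarily what the slide uses): the outer i-loop carries no dependence and can be parallelized, while the inner j-loop carries a flow dependence and must keep its iteration order.

    /* a[i][j] depends on a[i][j-1]: the dependence is carried by the
       j-loop only, so different i iterations are independent.        */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int j = 1; j < m; j++)
            a[i][j] = a[i][j-1] + b[i][j];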


Strip Mining / Loop Blocking / Tiling
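The slide's own figure is not preserved; a minimal sketch of strip-mining a single loop with tile size S (my own example):

    /* Original loop */
    for (i = 0; i < n; i++)
        a[i] = a[i] + 1;

    /* Strip-mined: an outer loop over tiles of S iterations and an
       inner loop over the iterations within the current tile. The
       iteration order is unchanged, so the transformation is always
       legal; it becomes useful when combined with interchange (tiling). */
    for (ii = 0; ii < n; ii += S) {
        int hi = (ii + S < n) ? ii + S : n;   /* last tile may be partial */
        for (i = ii; i < hi; i++)
            a[i] = a[i] + 1;
    }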

Tiled Matrix-Matrix Multiplication (1)

 Matrix-matrix multiplication C = A x B, here for square (n x n) matrices C, A, B, with n large (~10^3):

    C_ij = Σ_{k=1..n} A_ik · B_kj   for all i, j = 1..n

 Standard algorithm for matrix-matrix multiplication (here without the initialization of the C entries to 0):

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                C[i][j] += A[i][k] * B[k][j];

 (Figure on slide: access patterns of A, B, C along the i, j, k loops)

Tiled Matrix-Matrix Multiplication (2)

 Block each loop by a block size S (choose S so that a block of A, a block of B and a block of C fit in the cache together), then interchange loops
 Code after tiling: see the sketch below (the slide's own code is only partially preserved)
 (Figure on slide: blocked access patterns of A, B, C with tile indices ii, jj, kk)

Remark on Locality Transformations

 An alternative can be to change the data layout rather than the control structure of the program
 Example: store matrix B in transposed form, or, if necessary, consider transposing it first, which may pay off
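The following is a standard six-loop version of the idea stated above (block every loop by S, then interchange so that the tile loops become outermost); the exact loop order on the original slide may differ. For brevity it assumes n is a multiple of S and that C is already initialized to 0.

    for (ii = 0; ii < n; ii += S)
      for (kk = 0; kk < n; kk += S)
        for (jj = 0; jj < n; jj += S)
          /* multiply one S x S block of A with one S x S block of B;
             the three blocks involved fit in cache together */
          for (i = ii; i < ii + S; i++)
            for (k = kk; k < kk + S; k++)
              for (j = jj; j < jj + S; j++)
                C[i][j] += A[i][k] * B[k][j];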

Loop Distribution (a.k.a. Loop Fission)

Loop Fusion
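The slides' own rules are not preserved; as a minimal sketch of loop distribution (my own example), a loop with two statements is split into two loops, which is legal here because the only dependence (S1 to S2 on a[i]) still flows from the first loop to the second:

    /* Before distribution */
    for (i = 0; i < n; i++) {
        a[i] = a[i] + 1.0;       /* S1 */
        b[i] = a[i] * c[i];      /* S2: uses a[i] from S1 */
    }

    /* After distribution: S1's loop runs completely before S2's loop,
       so the dependence S1 -> S2 is still respected. Each loop may now
       be vectorized or parallelized separately.                        */
    for (i = 0; i < n; i++)
        a[i] = a[i] + 1.0;
    for (i = 0; i < n; i++)
        b[i] = a[i] * c[i];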


Loop Nest Flattening / Linearization

Loop Unrolling
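As a hedged sketch of loop nest flattening / linearization (my own example): a perfectly nested loop pair over an n x m iteration space is replaced by a single loop over n*m iterations, with the original indices recomputed from the linear index.

    /* Nested loops */
    for (i = 0; i < n; i++)
        for (j = 0; j < m; j++)
            a[i][j] = 0.0;

    /* Flattened / linearized version */
    for (t = 0; t < n * m; t++) {
        int i = t / m;
        int j = t % m;
        a[i][j] = 0.0;
    }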


Loop Unrolling with Unknown Upper Bound

Loop Unroll-and-Jam
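The slides' own solutions are not preserved. A common scheme (my own sketch, answering the exercise stated with loop unrolling above) unrolls by a factor of 2 and adds an epilogue loop for the possible leftover iteration when the upper bound n is not statically known:

    /* Main unrolled loop: processes pairs of iterations */
    for (i = 0; i + 1 < n; i += 2) {
        /* body(i)   */
        /* body(i+1) */
    }
    /* Epilogue: the at most one remaining iteration */
    for (; i < n; i++) {
        /* body(i) */
    }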


Loop Peeling

Index Set Splitting
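Neither slide's example survives the extraction. A small sketch of both transformations (my own example): loop peeling takes the first (or last) iteration out of the loop, e.g. to remove a special case; index set splitting divides the iteration range at some point p into two loops over disjoint subranges.

    /* Loop peeling: iteration i = 0 is executed separately */
    if (n > 0)
        a[0] = b[0] + c;
    for (i = 1; i < n; i++)
        a[i] = b[i] + c;

    /* Index set splitting at p (0 <= p <= n): same work, two loops */
    for (i = 0; i < p; i++)
        a[i] = b[i] + c;
    for (i = p; i < n; i++)
        a[i] = b[i] + c;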


Loop Unswitching

Scalar Replacement
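The slides' own examples are not preserved; two small sketches of mine. Loop unswitching hoists a loop-invariant condition out of the loop, duplicating the loop for each branch; scalar replacement keeps a repeatedly accessed array element in a scalar (and thus typically in a register) across the loop.

    /* Loop unswitching */
    for (i = 0; i < n; i++) {          /* before: test in every iteration  */
        if (flag) a[i] = b[i] + c[i];
        else      a[i] = b[i] - c[i];
    }

    if (flag)                          /* after: one test, two loops       */
        for (i = 0; i < n; i++) a[i] = b[i] + c[i];
    else
        for (i = 0; i < n; i++) a[i] = b[i] - c[i];

    /* Scalar replacement */
    for (i = 0; i < n; i++)            /* before: s[j] accessed every time */
        s[j] = s[j] + a[i];

    t = s[j];                          /* after: kept in scalar t          */
    for (i = 0; i < n; i++)
        t = t + a[i];
    s[j] = t;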


Scalar Expansion / Array Privatization

Pattern Matching for Algorithm Replacement
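A minimal sketch of scalar expansion (my own example): the scalar temporary t causes anti and output dependences between iterations; expanding it into an array texp (or privatizing it per thread) removes these dependences, so the loop can be parallelized.

    /* Before: t is written and read in every iteration */
    for (i = 0; i < n; i++) {
        t = a[i] + b[i];
        c[i] = t * t;
    }

    /* After scalar expansion: one element of texp per iteration */
    for (i = 0; i < n; i++) {
        texp[i] = a[i] + b[i];
        c[i]    = texp[i] * texp[i];
    }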


Concluding Remarks: Limits of Static Analyzability; Outlook: Runtime Analysis and Parallelization

Remark on static analyzability (1)

 Static dependence information is always a (safe) overapproximation of the real (run-time) dependences
   Finding out the real ones exactly is statically undecidable!
   If in doubt, a dependence must be assumed → this may prevent some optimizations or parallelization
 One main reason for imprecision is aliasing, i.e. the program may have several ways to refer to the same memory location
 Example: pointer aliasing

    void mergesort ( int* a, int n )
    {
        …
        mergesort ( a, n/2 );
        mergesort ( a + n/2, n - n/2 );
        …
    }

   How could a static analysis tool (e.g., a compiler) know that the two recursive calls read and write disjoint subarrays of a?

Remark on static analyzability (2)

 Static dependence information is always a (safe) overapproximation of the real (run-time) dependences
   Finding out the latter exactly is statically undecidable!
   If in doubt, a dependence must be assumed → this may prevent some optimizations or parallelization
 Another reason for imprecision is the presence of statically unknown values that determine whether a dependence exists or not
 Example: unknown dependence distance

    // value of K statically unknown
    for ( i = 0; i < … ; i++ )
        …   // loop body on slide: whether the loop carries a dependence
            // (and with which distance) depends on the unknown value of K

Outlook: Runtime Parallelization
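The runtime-parallelization slides themselves are not preserved in this extraction. As a hedged sketch of one simple run-time technique, multi-versioning (my own example, not necessarily the method presented on the slides): when the dependence distance K is only known at run time, the generated code can test it and choose between a parallel and a sequential version of the loop. The sketch assumes K >= 0 and that a has at least n + K accessible elements.

    void update(double *a, int n, int K)
    {
        if (K == 0 || K >= n) {
            /* No cross-iteration dependence possible: parallel version */
            #pragma omp parallel for
            for (int i = 0; i < n; i++)
                a[i] = 0.5 * a[i + K];
        } else {
            /* A loop-carried dependence may exist: sequential version  */
            for (int i = 0; i < n; i++)
                a[i] = 0.5 * a[i + K];
        }
    }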

Some references on Dependence Analysis, Loop optimizations and Transformations

 H. Zima, B. Chapman: Supercompilers for Parallel and Vector Computers. Addison-Wesley / ACM Press, 1990.
 R. Allen, K. Kennedy: Optimizing Compilers for Modern Architectures. Morgan Kaufmann, 2002.

