TDDC86 Compiler Optimizations and Code Generation

Lecture 7: Optimization and Parallelization of Sequential Programs
Christoph Kessler, IDA / PELAB, Linköping University, Sweden

Outline: Towards (semi-)automatic parallelization of sequential programs
- Data dependence analysis for loops
- Some loop transformations: loop invariant code hoisting, loop unrolling, loop fusion, loop interchange, loop blocking and tiling
- Static loop parallelization
- Run-time loop parallelization: doacross parallelization, inspector-executor method
- Speculative parallelization (as time permits)
- Auto-tuning (later, if time)
Christoph Kessler, IDA, Linköpings universitet, 2011. TDDD56 Multicore and GPU Programming.
Foundations: Control and Data Dependence

Consider statements S, T in a sequential program (S = T is possible).
- Scope of analysis is typically a function, i.e. intra-procedural analysis.
- Can be done at arbitrary granularity (instructions, operations, statements, compound statements, program regions).
- Relevant are only the read and write effects on memory (i.e. on program variables) of each operation, and the effect on control flow.

Data dependence S -> T:
if S may execute (dynamically) before T, both may access the same memory location, a control flow path S ... T is possible, and at least one of these accesses is a write. It means that the execution order "S before T" must be preserved when restructuring the program. In general, only a conservative over-estimation can be determined statically.

- Flow dependence (RAW, read-after-write): S may write a location z that T may read.
  Example:
      S: z = ... ;
      ...
      T: ... = ..z.. ;
- Anti dependence (WAR, write-after-read): S may read a location x that T may overwrite.
- Output dependence (WAW, write-after-write): both S and T may write the same location.

Control dependence S -> T:
if whether T is executed may depend on S (e.g. a condition). Implies that the relative execution order S -> T must be preserved when restructuring the program. Mostly obvious from the nesting structure in well-structured programs, but more tricky in arbitrary branching code (e.g. assembler code).
  Example:
      S: if (...) {
          ...
          T: ...
          ...
      }
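The three dependence kinds can be made concrete in a small fragment (a hypothetical example, not from the slides; the variables x, y, z are illustrative only):

```c
/* Illustration of the three data dependence kinds on concrete
   statements. Any reordering that violates one of the marked
   dependences changes the result. */
int dependence_demo(void) {
    int x = 1, z = 0, y = 0;
    z = x + 1;   /* S1 writes z (and reads x)                          */
    y = z * 2;   /* S2 reads z: flow dependence S1 -> S2 (RAW)         */
    x = 5;       /* S3 overwrites x read by S1: anti dependence (WAR)  */
    z = 7;       /* S4 rewrites z written by S1: output dep. (WAW)     */
    return x + y + z;
}
```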
Dependence Graph Data Dependence Graph
(Data, Control, Program) Dependence Graph: Data dependence graph for straight-line code (”basic Directed graph, consisting of all statements as vertices block”, no branching) is always acyclic, because relative and all (data, control, any) dependences as edges. execution order of statements is forward only. Data dependence graph for a loop: Dependence edge ST if a dependence may exist for some pair of instances (iterations) of S, T Cycles possible Loop-independent versus loop-carried dependences
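A loop-carried dependence of this kind can be sketched as follows (a hypothetical example, not the slides' one):

```c
/* Iteration i reads a[i-1], which iteration i-1 wrote: a loop-carried
   flow dependence with distance 1. The statement depends on itself
   across iterations, so the data dependence graph has a cycle. */
void prefix_sums(int *a, int n) {
    for (int i = 1; i < n; i++)
        a[i] = a[i] + a[i - 1];   /* S -> S, carried by the loop */
}
```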
Example: (assuming we know statically that arrays a and b do not intersect)
Example

(assuming that we statically know that arrays A, X, Y, Z do not intersect; otherwise there might be further dependences)

(Figure with the unrolled iterations and the data dependence graph over statements S1, S2 is not recoverable.)

Why Loop Optimization and Parallelization?

Loops are a promising object for program optimizations, including automatic parallelization:
- High execution frequency: most computation is done in (inner) loops, so even small optimizations can have a large impact (cf. Amdahl's Law).
- Regular, repetitive behavior: compact description, relatively simple to analyze statically.
- Well researched.
Loop Optimizations – General Issues

- Move loop-invariant computations out of loops.
- Modify the order of iterations or parts thereof.

Goals:
- Faster execution
- Improve data access locality
- Reduce loop control overhead
- Enhance possibilities for loop parallelization or vectorization

Only transformations that preserve the program semantics (its input/output behavior) are admissible.
- Conservative (static) criterion: preserve data dependences.
- Need data dependence analysis for loops (see TDDC86).

Data Dependence Analysis: A More Formal Introduction
Data Dependence Analysis – Overview

- Important for loop optimizations, vectorization and parallelization, instruction scheduling, and data cache optimizations.
- Conservative approximations to the disjointness of pairs of memory accesses: weaker than data-flow analysis, but generalizes nicely to the level of individual array elements.
- Topics: loops and loop nests, iteration space, array subscripts in loops, index space, dependence testing methods, data dependence graph, data + control dependence graph, program dependence graph.

Precedence Relation Between Statements
Control and Data Dependence; Data Dependence Graph

Dependence Graph
Remark: Target-Level Dependence Graphs

For VLIR / target code, the dependence edges may be labeled with the latency or delay of the operation. See the lecture on instruction scheduling for details.

Loop Normalization
Loop Iteration Space

Dependence Distance and Direction
Example

Dependence Equation System
Linear Diophantine Equations

Dependence Testing, 1: GCD Test
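The slide's body is not recoverable, so here is the standard formulation as a sketch: for accesses A[a*i + b] (written) and A[c*j + d] (read), the dependence equation a*i - c*j = d - b has an integer solution only if gcd(a, c) divides d - b. The function name and interface below are illustrative only:

```c
#include <stdlib.h>

/* Euclid's algorithm on non-negative operands. */
static int gcd(int a, int b) {
    while (b != 0) { int t = a % b; a = b; b = t; }
    return a;
}

/* GCD test for the subscript pair A[a*i+b] vs. A[c*j+d]:
   returns 0 if a dependence is provably impossible,
   1 if it cannot be ruled out (the conservative answer). */
int gcd_test(int a, int b, int c, int d) {
    int g = gcd(abs(a), abs(c));
    if (g == 0)               /* both subscripts are constants */
        return b == d;
    return (d - b) % g == 0;  /* linear Diophantine equation solvable? */
}
```

Note that a "1" answer does not prove a dependence: the test ignores the loop bounds, so it is a conservative filter, not an exact decision procedure.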
For Multidimensional Arrays?

Survey of Dependence Tests
Loop Transformations and Parallelization
Some Important Loop Transformations

- Loop normalization
- Loop parallelization
- Loop invariant code hoisting
- Loop interchange
- Loop fusion vs. loop distribution / fission
- Strip-mining / loop tiling / blocking vs. loop linearization
- Loop unrolling, unroll-and-jam
- Loop peeling
- Index set splitting, loop unswitching
- Scalar replacement, scalar expansion
- Later: software pipelining
- More: cycle shrinking, loop skewing, ...

Loop Invariant Code Hoisting

Move loop-invariant code out of the loop. Compilers can do this automatically if they can statically find out what code is loop invariant.

Example (before):

    for (i=0; i<10; i++)
        a[i] = b[i] + c / d;

After hoisting:

    tmp = c / d;
    for (i=0; i<10; i++)
        a[i] = b[i] + tmp;
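A runnable version of the hoisted code, with one subtlety the slide does not mention (my addition): if the loop may execute zero times, hoisting c / d speculatively evaluates an expression the original program might never evaluate (e.g. a division by d == 0), so a guard is needed. The function name is illustrative:

```c
/* Hoisted form of the slide's example, guarded with n <= 0 so that
   c / d is evaluated only when the original loop would evaluate it. */
void fill_hoisted(int *a, const int *b, int c, int d, int n) {
    if (n <= 0) return;
    int tmp = c / d;              /* loop-invariant, computed once */
    for (int i = 0; i < n; i++)
        a[i] = b[i] + tmp;
}
```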
Loop Unrolling

- Can be enforced with compiler options, e.g. -funroll=2.
- Example (factor 2, loop limit statically known and divisible by 2):

      // before                      // after unrolling by 2
      for (i=0; i<50; i++) {         for (i=0; i<50; i+=2) {
          ...                            ...   // body for i
      }                                  ...   // body for i+1
                                     }

- A longer loop body may enable further local optimizations (e.g. common subexpression elimination, register allocation, instruction scheduling, use of SIMD instructions).
- Drawback: longer code.
- Exercise: formulate the unrolling rule for a statically unknown upper loop limit.

Loop Interchange (1)

- For properly nested loops (statements in the innermost loop body only): exchanging the loop headers changes the iteration order. (Figure contrasting the old and new iteration order over the array is not recoverable.)
- Can improve data access locality in the memory hierarchy (fewer cache misses / page faults), e.g. for the row-major array layout of C and Java.

Foundations: Loop-Carried Data Dependences

Recall: data dependence S -> T, if operation S may execute (dynamically) before operation T, both may access the same memory location, and at least one of these accesses is a write.

    S: z = ... ;
    ...
    T: ... = ..z.. ;

A data dependence S -> T is called loop-carried by a loop L if the dependence S -> T may exist for instances of S and T in different iterations of L.

Example:

    L: for (i=1; i<N; i++)
           ...

Loop Interchange (2)

Be careful with loop-carried data dependences!

Example 2:

    // before interchange                  // after interchange
    for (j=1; j<N; j++)                    for (i=1; i<N; i++)
        for (i=1; i<N; i++)                    for (j=1; j<N; j++)
            a[i][j] = ...a[i+1][j-1]...;           a[i][j] = ...a[i+1][j-1]...;

Before the interchange, iteration (j,i) reads location a[i+1][j-1], which was written in an earlier iteration. After the interchange, iteration (i,j) reads location a[i+1][j-1], which will be overwritten only in a later iteration: the direction of the dependence is reversed, so this interchange is illegal.

Loop Interchange (3)

Be careful with loop-carried data dependences!
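One possible answer to the unrolling exercise above (a sketch, not the slides' solution): unroll by a factor of 2 and add an epilogue loop that handles the leftover iteration when the bound n is statically unknown. The function name is illustrative:

```c
/* Unroll-by-2 with an unknown upper bound: the main loop handles
   pairs of iterations; the epilogue handles at most one leftover. */
void vec_add_unrolled(int *a, const int *b, int n) {
    int i;
    for (i = 0; i + 1 < n; i += 2) {  /* unrolled main loop */
        a[i]     += b[i];
        a[i + 1] += b[i + 1];
    }
    for (; i < n; i++)                /* epilogue */
        a[i] += b[i];
}
```

For an unroll factor u, the same scheme uses `i + u - 1 < n` in the main loop and up to u - 1 epilogue iterations.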
Example 3: (code not recoverable)

Generally: interchanging loop headers is only admissible if the loop-carried dependences have the same direction for all loops in the loop nest (all directed along, or all against, the iteration order).

Loop Fusion

- Merge subsequent loops with the same header.
- Safe if neither loop carries a (backward) dependence.
- Can improve data access locality and reduces the number of branches.

Loop Iteration Reordering

Example: the j-loop carries a dependence; its iteration order must be preserved.

Loop Parallelization

Example: the j-loop carries a dependence; its iteration order must be preserved.

Strip Mining / Loop Blocking / Tiling

Tiled Matrix-Matrix Multiplication (1)

Matrix-matrix multiplication C = A x B, here for square (n x n) matrices C, A, B with n large (~10^3):

    C_ij = sum_{k=1..n} A_ik * B_kj,   for all i, j = 1..n

Standard algorithm (here without the initialization of the C entries to 0):

    for (i=0; i<n; i++)
        for (j=0; j<n; j++)
            for (k=0; k<n; k++)
                C[i][j] += A[i][k] * B[k][j];

Tiled Matrix-Matrix Multiplication (2)

Block each loop by block size S (choose S so that a block of A, B, and C fit in the cache together), then interchange loops.

Code after tiling:

    for (ii=0; ii<n; ii+=S)
        ...

Remark on Locality Transformations

An alternative can be to change the data layout rather than the control structure of the program. Example: store matrix B in transposed form, or, if necessary, consider transposing it, which may pay off.

Loop Distribution (a.k.a. Loop Fission)

Loop Fusion
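The tiling scheme sketched above, written out as a complete routine; the sizes N and S are illustrative, and S is assumed to divide N for brevity:

```c
#define N 8   /* matrix dimension (illustrative) */
#define S 4   /* tile size; assumed to divide N  */

/* Tiled matrix-matrix multiply, C += A * B: all three loops are
   blocked by S, and the tile loops ii, kk, jj are hoisted outward
   so each S x S tile combination is processed with reuse in cache. */
void matmul_tiled(double A[N][N], double B[N][N], double C[N][N]) {
    for (int ii = 0; ii < N; ii += S)
        for (int kk = 0; kk < N; kk += S)
            for (int jj = 0; jj < N; jj += S)
                /* multiply one tile of A by one tile of B */
                for (int i = ii; i < ii + S; i++)
                    for (int k = kk; k < kk + S; k++)
                        for (int j = jj; j < jj + S; j++)
                            C[i][j] += A[i][k] * B[k][j];
}
```

In practice S would be chosen from the cache size, and leftover (partial) tiles need extra min() bounds when S does not divide N.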
Loop Nest Flattening / Linearization

Loop Unrolling

Loop Unrolling with Unknown Upper Bound

Loop Unroll-and-Jam

Loop Peeling

Index Set Splitting

Loop Unswitching

Scalar Replacement

Scalar Expansion / Array Privatization

Pattern Matching for Algorithm Replacement

Concluding Remarks: Limits of Static Analyzability; Outlook: Runtime Analysis and Parallelization

Remark on Static Analyzability (1)

- Static dependence information is always a (safe) over-approximation of the real (run-time) dependences. Finding out the real ones exactly is statically undecidable!
- If in doubt, a dependence must be assumed; this may prevent some optimizations or parallelization.
- One main reason for imprecision is aliasing, i.e. the program may have several ways to refer to the same memory location.

Example: pointer aliasing. How could a static analysis tool (e.g., a compiler) know that the two recursive calls read and write disjoint subarrays of a?

    void mergesort ( int* a, int n )
    {
        ...
        mergesort ( a, n/2 );
        mergesort ( a + n/2, n - n/2 );
        ...
    }
Remark on Static Analyzability (2)

- Static dependence information is always a (safe) over-approximation of the real (run-time) dependences. Finding out the latter exactly is statically undecidable!
- If in doubt, a dependence must be assumed; this may prevent some optimizations or parallelization.
- Another reason for imprecision are statically unknown values that determine whether a dependence exists or not.

Example: unknown dependence distance. Is there a loop-carried dependence?

    // value of K statically unknown
    for ( i=0; i<n; i++ )
        ...

Outlook: Runtime Parallelization

Some References on Dependence Analysis, Loop Optimizations and Transformations

- H. Zima, B. Chapman: Supercompilers for Parallel and Vector Computers. Addison-Wesley / ACM Press, 1990.
- R. Allen, K. Kennedy: Optimizing Compilers for Modern Architectures. Morgan Kaufmann, 2002.