TDDC86 Compiler Optimizations and Code Generation

Optimization and Parallelization of Sequential Programs
Lecture 7
Christoph Kessler, IDA / PELAB, Linköping University, Sweden

Outline

Towards (semi-)automatic parallelization of sequential programs:
 Data dependence analysis for loops
 Some loop transformations
   Loop invariant code hoisting, …, loop fusion, …, loop blocking and tiling
 Static loop parallelization
 Run-time loop parallelization
   Doacross parallelization, inspector-executor method
   Speculative parallelization (as time permits)
 Auto-tuning (later, if time)


Foundations: Control and Data Dependence

 Consider statements S, T in a sequential program (S = T possible)
 Scope of analysis is typically a function, i.e. intra-procedural analysis
 Can be done at arbitrary granularity (instructions, operations, statements, compound statements, program regions)
 Relevant are only the read and write effects on memory (i.e. on program variables) of each operation, and the effect on control flow

 Control dependence S → T: whether T is executed may depend on S (e.g. a condition)
   Implies that the relative execution order "S before T" must be preserved when restructuring the program
   Mostly obvious from the nesting structure in well-structured programs, but more tricky in arbitrary branching code (e.g. assembler code)
   Example:
      S: if (…) {
         …
      T:    …
         …
         }

 Data dependence S → T: statement S may execute (dynamically) before T, a control flow path S … T is possible, both may access the same memory location, and at least one of these accesses is a write
   Means that the execution order "S before T" must be preserved when restructuring the program
   In general, only a conservative over-estimation can be determined statically
   Flow dependence (RAW, read-after-write): S may write a location z that T may read
      Example:
         S: z = … ;
         …
         T: … = ..z.. ;
   Anti dependence (WAR, write-after-read): S may read a location x that T may overwrite
   Output dependence (WAW, write-after-write): both S and T may write the same location
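To make the three kinds of data dependence concrete, here is a small C fragment (an illustrative example of mine, not taken from the slides):

    x = a + b;     /* S1: writes x                                        */
    y = x + 1;     /* S2: reads x  -> flow dependence (RAW) S1 -> S2      */
    x = c * 2;     /* S3: overwrites x read by S2 -> anti dependence
                      (WAR) S2 -> S3; S1 and S3 both write x -> output
                      dependence (WAW) S1 -> S3                           */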


Dependence Graph

 (Data, Control, Program) Dependence Graph: directed graph consisting of all statements as vertices and all (data, control, any) dependences as edges.

Data Dependence Graph

 The data dependence graph for straight-line code ("basic block", no branching) is always acyclic, because the relative execution order of statements is forward only.
 Data dependence graph for a loop:
   Dependence edge S → T if a dependence may exist for some pair of instances (iterations) of S, T
   Cycles possible
   Loop-independent versus loop-carried dependences

 Example (code and dependence graph shown on slide), assuming we know statically that arrays a and b do not intersect.
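As a small substitute for the slide's example, the following C loop (my own sketch, assuming arrays a, b, c, d do not overlap) shows both a loop-independent and a loop-carried dependence:

    for (i = 1; i < n; i++) {
        a[i] = b[i] + 1;       /* S1 */
        c[i] = a[i] * 2;       /* S2: reads a[i] written by S1 in the same
                                  iteration -> loop-independent dependence */
        d[i] = a[i-1] + c[i];  /* S3: reads a[i-1] written by S1 in the
                                  previous iteration -> loop-carried
                                  dependence S1 -> S3                      */
    }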

Example

 (Code, its unrolled iterations, and the data dependence graph with statements S1, S2 are shown on the slide, assuming that we statically know that arrays A, X, Y, Z do not intersect; otherwise there might be further dependences.)

Why Loop Optimization and Parallelization?

Loops are a promising object for program optimizations, including automatic parallelization:
 High execution frequency: most computation is done in (inner) loops, so even small optimizations can have a large impact (cf. Amdahl's Law)
 Regular, repetitive behavior: compact description, relatively simple to analyze statically
 Well researched

Loop Optimizations – General Issues

 Move loop invariant computations out of loops
 Modify the order of iterations or parts thereof

Goals:
 Improve data access locality
 Faster execution
 Reduce loop control overhead
 Enhance possibilities for loop parallelization or vectorization

Only transformations that preserve the program semantics (its input/output behavior) are admissible.
 Conservative (static) criterion: preserve data dependences
 Need data dependence analysis for loops (see TDDC86)

Data Dependence Analysis – A More Formal Introduction

Data Dependence Analysis – Overview

 Important for loop optimizations, vectorization and parallelization, data cache optimizations
 Conservative approximations to disjointness of pairs of memory accesses
   Weaker than data-flow analysis, but generalizes nicely to the level of individual array elements
 Loops, loop nests → iteration space
 Array subscripts in loops → index space
 Dependence testing methods
 Data dependence graph, data + control dependence graph, program dependence graph

Precedence Relation between Statements


Control and Data Dependence; Data Dependence Graph

Dependence Graph


Remark: Target-Level Dependence Graphs

 For VLIR / target code, the dependence edges may be labeled with the latency or delay of the operation
 See the lecture on instruction scheduling for details

Loop Normalization
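The slide's own content for loop normalization is not preserved in this extraction. As a sketch under the usual definition (my own example, assuming integer bounds lo <= hi and step s > 0): normalization rewrites the loop so that its counter runs from 0 upwards in steps of 1, which simplifies iteration spaces and later dependence testing.

    /* Original loop with general lower bound and stride */
    for (i = lo; i <= hi; i += s)
        a[i] = b[i] + 1;

    /* Normalized loop: counter j = 0, 1, 2, ...; i is recomputed from j */
    for (j = 0; j <= (hi - lo) / s; j++) {
        i = lo + j * s;
        a[i] = b[i] + 1;
    }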


Loop Iteration Space

Dependence Distance and Direction
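The figures for these two slides are not reproduced here. As an illustration (my own example): in a singly nested loop, the dependence distance is the difference between the iteration numbers of the two dependent statement instances, and the direction indicates its sign.

    for (i = 2; i < n; i++) {
        a[i] = c[i] + 1;       /* S1: writes a[i]                      */
        b[i] = a[i-2] * 2;     /* S2: reads a[i-2], written by S1 two
                                  iterations earlier                   */
    }
    /* Flow dependence S1 -> S2, loop-carried, with distance 2 and
       direction "<" (from earlier to later iterations).              */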


Example

Dependence Equation System
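The worked example on the slide is not preserved. As a small sketch of how a dependence equation system arises (my own example, assuming arrays a, b, c do not overlap):

    for (i = 0; i < n; i++) {
        a[2*i + 3] = b[i];     /* S1: writes element 2*i + 3 of a */
        c[i] = a[4*i];         /* S2: reads  element 4*i   of a   */
    }

A dependence between the write in iteration i1 and the read in iteration i2 exists iff there are integers 0 <= i1, i2 < n with 2*i1 + 3 = 4*i2, i.e. the linear Diophantine equation 2*i1 - 4*i2 = -3 together with the loop-bound inequalities. This is the dependence equation system that the dependence tests below try to solve or refute.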


Linear Diophantine Equations

Dependence Testing, 1: GCD Test
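As a hedged worked example of the GCD test (my own numbers, continuing the dependence equation derived above): a linear Diophantine equation a*x + b*y = c has an integer solution iff gcd(a, b) divides c.

    Dependence equation:   2*i1 + 3 = 4*i2
    Rewritten:             2*i1 - 4*i2 = -3
    GCD test:              gcd(2, 4) = 2 does not divide 3
                           => no integer solution => no dependence

If the GCD does divide the constant, the test is inconclusive (a dependence may or may not exist within the loop bounds), which is why stronger tests exist.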


For multidimensional arrays?

Survey of Dependence Tests


Loop Transformations and Parallelization

Some Important Loop Transformations

 Loop normalization
 Loop parallelization
 Loop invariant code hoisting
 Loop interchange
 Loop fusion vs. loop distribution / fission
 Strip-mining / loop tiling / blocking vs. loop linearization
 Loop unrolling, unroll-and-jam
 Loop peeling
 Index set splitting, loop unswitching
 Scalar replacement, scalar expansion
 Later / more: cycle shrinking, loop skewing, ...

Loop Invariant Code Hoisting

 Move loop invariant code out of the loop
 Compilers can do this automatically if they can statically find out what code is loop invariant
 Example:

    /* before hoisting */
    for (i = 0; i < 10; i++)
        a[i] = b[i] + c / d;

    /* after hoisting */
    tmp = c / d;
    for (i = 0; i < 10; i++)
        a[i] = b[i] + tmp;


Loop Unrolling

 Can be enforced with compiler options, e.g. -funroll=2
 Example:

    /* before unrolling */
    for (i = 0; i < 50; i++) {
        /* body(i) */
    }

    /* unrolled by a factor of 2 */
    for (i = 0; i < 50; i += 2) {
        /* body(i)   */
        /* body(i+1) */
    }

 A longer loop body may enable further local optimizations (e.g. common subexpression elimination, instruction scheduling, using SIMD instructions)
 Drawback: longer code
 Exercise: formulate the unrolling rule for a statically unknown upper loop limit (see "Loop Unrolling with Unknown Upper Bound" below)

Loop Interchange (1)

 For properly nested loops (statements in the innermost loop body only)
 Example 1 (the loop body accesses a two-dimensional array a; its exact form is not preserved in this extraction):

    /* old iteration order */
    for (i = 0; i < …; i++)
        for (j = 0; j < …; j++)
            /* … a[…][…] … */

    /* new iteration order after interchanging the loop headers */
    for (j = 0; j < …; j++)
        for (i = 0; i < …; i++)
            /* … a[…][…] … */

 Arrays are stored row by row in C and Java, so choosing the order in which the innermost loop traverses consecutive memory locations can improve data access locality in the memory hierarchy (fewer cache misses / page faults)
 (Figure on slide: old vs. new traversal order of a, up to element a[N-1][0])

Foundations: Loop-Carried Data Dependences

 Recall: data dependence S → T, if operation S may execute (dynamically) before operation T, both may access the same memory location, and at least one of these accesses is a write
      S: z = … ;
      …
      T: … = ..z.. ;
 A data dependence S → T is called loop carried by a loop L if the dependence may exist for instances of S and T in different iterations of L
 Example: L: for (i = 1; i < … ; i++) { … }  (loop body and iteration-space figure on slide)

Loop Interchange (2)

 Be careful with loop-carried data dependences!
 Example 2:

    for (j = 1; j < … ; j++)
        for (i = … ; i < … ; i++)
            a[i][j] = … a[i+1][j-1] … ;

 With j as the outer loop, iteration (j, i) reads location a[i+1][j-1], which was written in an earlier iteration (loop-carried flow dependence; see the iteration-space figure on the slide)
 After interchanging the headers (i outer, j inner), iteration (i, j) would read location a[i+1][j-1] before it is overwritten, i.e. it would read the old value: the dependence would be violated, so this interchange is not admissible

Loop Interchange (3)

 Be careful with loop-carried data dependences!
 Example 3:

    for (j = 1; j < … ; j++)
        …   (code and iteration-space figure on slide)

 Generally: interchanging loop headers is only admissible if the loop-carried dependences have the same direction for all loops in the loop nest (all directed along, or all against, the iteration order)

Loop Fusion

 Merge subsequent loops with the same header
 Safe if fusion does not create a (backward) loop-carried dependence between the bodies of the two original loops
 Can improve data access locality and reduces the number of branches (see the sketch below)
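A minimal C sketch of loop fusion (my own example, assuming a, b, c do not overlap); fusion is legal here because the value of b[i] is produced and consumed in the same iteration of the fused loop:

    /* Before fusion: two loops with identical headers */
    for (i = 0; i < n; i++)
        b[i] = a[i] * 2.0;
    for (i = 0; i < n; i++)
        c[i] = b[i] + 1.0;

    /* After fusion: one pass over the data, better locality for b,
       and half the loop-control overhead                            */
    for (i = 0; i < n; i++) {
        b[i] = a[i] * 2.0;
        c[i] = b[i] + 1.0;
    }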

Loop Iteration Reordering

 Example (on slide): the j-loop carries a dependence, so its iteration order must be preserved; loops that carry no dependence may be reordered

Loop Parallelization

 Example (on slide): the j-loop carries a dependence, so its iteration order must be preserved, while a loop that carries no dependence can be parallelized
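As a concrete sketch (my own example in OpenMP notation, which is not necessarily what the slide uses): the outer i-loop carries no dependence and can be parallelized, while the inner j-loop carries a flow dependence and must keep its iteration order.

    /* a[i][j] depends on a[i][j-1]: the dependence is carried by the
       j-loop only, so different i iterations are independent.        */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        for (int j = 1; j < m; j++)
            a[i][j] = a[i][j-1] + b[i][j];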


Strip Mining / Loop Blocking / Tiling
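The slide's own figure is not preserved; a minimal sketch of strip-mining a single loop with tile size S (my own example):

    /* Original loop */
    for (i = 0; i < n; i++)
        a[i] = a[i] + 1;

    /* Strip-mined: an outer loop over tiles of S iterations and an
       inner loop over the iterations within the current tile. The
       iteration order is unchanged, so the transformation is always
       legal; it becomes useful when combined with interchange (tiling). */
    for (ii = 0; ii < n; ii += S) {
        int hi = (ii + S < n) ? ii + S : n;   /* last tile may be partial */
        for (i = ii; i < hi; i++)
            a[i] = a[i] + 1;
    }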

Tiled Matrix-Matrix Multiplication (1)

 Matrix-matrix multiplication C = A x B, here for square (n x n) matrices C, A, B, with n large (~10^3):

    C_ij = Σ_{k=1..n} A_ik · B_kj   for all i, j = 1..n

 Standard algorithm for matrix-matrix multiplication (here without the initialization of the C entries to 0):

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                C[i][j] += A[i][k] * B[k][j];

 (Figure on slide: access patterns of A, B, C along the i, j, k loops)

Tiled Matrix-Matrix Multiplication (2)

 Block each loop by a block size S (choose S so that a block of A, a block of B and a block of C fit in the cache together), then interchange loops
 Code after tiling: see the sketch below (the slide's own code is only partially preserved)
 (Figure on slide: blocked access patterns of A, B, C with tile indices ii, jj, kk)

Remark on Locality Transformations

 An alternative can be to change the data layout rather than the control structure of the program
 Example: store matrix B in transposed form, or, if necessary, consider transposing it first, which may pay off
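The following is a standard six-loop version of the idea stated above (block every loop by S, then interchange so that the tile loops become outermost); the exact loop order on the original slide may differ. For brevity it assumes n is a multiple of S and that C is already initialized to 0.

    for (ii = 0; ii < n; ii += S)
      for (kk = 0; kk < n; kk += S)
        for (jj = 0; jj < n; jj += S)
          /* multiply one S x S block of A with one S x S block of B;
             the three blocks involved fit in cache together */
          for (i = ii; i < ii + S; i++)
            for (k = kk; k < kk + S; k++)
              for (j = jj; j < jj + S; j++)
                C[i][j] += A[i][k] * B[k][j];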

Loop Distribution (a.k.a. Loop Fission)

Loop Fusion
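The slides' own rules are not preserved; as a minimal sketch of loop distribution (my own example), a loop with two statements is split into two loops, which is legal here because the only dependence (S1 to S2 on a[i]) still flows from the first loop to the second:

    /* Before distribution */
    for (i = 0; i < n; i++) {
        a[i] = a[i] + 1.0;       /* S1 */
        b[i] = a[i] * c[i];      /* S2: uses a[i] from S1 */
    }

    /* After distribution: S1's loop runs completely before S2's loop,
       so the dependence S1 -> S2 is still respected. Each loop may now
       be vectorized or parallelized separately.                        */
    for (i = 0; i < n; i++)
        a[i] = a[i] + 1.0;
    for (i = 0; i < n; i++)
        b[i] = a[i] * c[i];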


Loop Nest Flattening / Linearization

Loop Unrolling
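As a hedged sketch of loop nest flattening / linearization (my own example): a perfectly nested loop pair over an n x m iteration space is replaced by a single loop over n*m iterations, with the original indices recomputed from the linear index.

    /* Nested loops */
    for (i = 0; i < n; i++)
        for (j = 0; j < m; j++)
            a[i][j] = 0.0;

    /* Flattened / linearized version */
    for (t = 0; t < n * m; t++) {
        int i = t / m;
        int j = t % m;
        a[i][j] = 0.0;
    }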


Loop Unrolling with Unknown Upper Bound

Loop Unroll-and-Jam
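The slides' own solutions are not preserved. A common scheme (my own sketch, answering the exercise stated with loop unrolling above) unrolls by a factor of 2 and adds an epilogue loop for the possible leftover iteration when the upper bound n is not statically known:

    /* Main unrolled loop: processes pairs of iterations */
    for (i = 0; i + 1 < n; i += 2) {
        /* body(i)   */
        /* body(i+1) */
    }
    /* Epilogue: the at most one remaining iteration */
    for (; i < n; i++) {
        /* body(i) */
    }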


Loop Peeling

Index Set Splitting
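Neither slide's example survives the extraction. A small sketch of both transformations (my own example): loop peeling takes the first (or last) iteration out of the loop, e.g. to remove a special case; index set splitting divides the iteration range at some point p into two loops over disjoint subranges.

    /* Loop peeling: iteration i = 0 is executed separately */
    if (n > 0)
        a[0] = b[0] + c;
    for (i = 1; i < n; i++)
        a[i] = b[i] + c;

    /* Index set splitting at p (0 <= p <= n): same work, two loops */
    for (i = 0; i < p; i++)
        a[i] = b[i] + c;
    for (i = p; i < n; i++)
        a[i] = b[i] + c;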


Loop Unswitching

Scalar Replacement
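The slides' own examples are not preserved; two small sketches of mine. Loop unswitching hoists a loop-invariant condition out of the loop, duplicating the loop for each branch; scalar replacement keeps a repeatedly accessed array element in a scalar (and thus typically in a register) across the loop.

    /* Loop unswitching */
    for (i = 0; i < n; i++) {          /* before: test in every iteration  */
        if (flag) a[i] = b[i] + c[i];
        else      a[i] = b[i] - c[i];
    }

    if (flag)                          /* after: one test, two loops       */
        for (i = 0; i < n; i++) a[i] = b[i] + c[i];
    else
        for (i = 0; i < n; i++) a[i] = b[i] - c[i];

    /* Scalar replacement */
    for (i = 0; i < n; i++)            /* before: s[j] accessed every time */
        s[j] = s[j] + a[i];

    t = s[j];                          /* after: kept in scalar t          */
    for (i = 0; i < n; i++)
        t = t + a[i];
    s[j] = t;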


Scalar Expansion / Array Privatization

Pattern Matching for Algorithm Replacement
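A minimal sketch of scalar expansion (my own example): the scalar temporary t causes anti and output dependences between iterations; expanding it into an array texp (or privatizing it per thread) removes these dependences, so the loop can be parallelized.

    /* Before: t is written and read in every iteration */
    for (i = 0; i < n; i++) {
        t = a[i] + b[i];
        c[i] = t * t;
    }

    /* After scalar expansion: one element of texp per iteration */
    for (i = 0; i < n; i++) {
        texp[i] = a[i] + b[i];
        c[i]    = texp[i] * texp[i];
    }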


Concluding Remarks: Limits of Static Analyzability; Outlook: Runtime Analysis and Parallelization

Remark on static analyzability (1)

 Static dependence information is always a (safe) overapproximation of the real (run-time) dependences
   Finding out the real ones exactly is statically undecidable!
   If in doubt, a dependence must be assumed → this may prevent some optimizations or parallelization
 One main reason for imprecision is aliasing, i.e. the program may have several ways to refer to the same memory location
 Example: pointer aliasing

    void mergesort ( int* a, int n )
    {
        …
        mergesort ( a, n/2 );
        mergesort ( a + n/2, n - n/2 );
        …
    }

   How could a static analysis tool (e.g., a compiler) know that the two recursive calls read and write disjoint subarrays of a?

Remark on static analyzability (2)

 Static dependence information is always a (safe) overapproximation of the real (run-time) dependences
   Finding out the latter exactly is statically undecidable!
   If in doubt, a dependence must be assumed → this may prevent some optimizations or parallelization
 Another reason for imprecision is the presence of statically unknown values that determine whether a dependence exists or not
 Example: unknown dependence distance

    // value of K statically unknown
    for ( i = 0; i < … ; i++ )
        …   // loop body on slide: whether the loop carries a dependence
            // (and with which distance) depends on the unknown value of K

Outlook: Runtime Parallelization
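The runtime-parallelization slides themselves are not preserved in this extraction. As a hedged sketch of one simple run-time technique, multi-versioning (my own example, not necessarily the method presented on the slides): when the dependence distance K is only known at run time, the generated code can test it and choose between a parallel and a sequential version of the loop. The sketch assumes K >= 0 and that a has at least n + K accessible elements.

    void update(double *a, int n, int K)
    {
        if (K == 0 || K >= n) {
            /* No cross-iteration dependence possible: parallel version */
            #pragma omp parallel for
            for (int i = 0; i < n; i++)
                a[i] = 0.5 * a[i + K];
        } else {
            /* A loop-carried dependence may exist: sequential version  */
            for (int i = 0; i < n; i++)
                a[i] = 0.5 * a[i + K];
        }
    }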

Some references on Dependence Analysis, Loop optimizations and Transformations

 H. Zima, B. Chapman: Supercompilers for Parallel and Vector Computers. Addison-Wesley / ACM Press, 1990.
 R. Allen, K. Kennedy: Optimizing Compilers for Modern Architectures. Morgan Kaufmann, 2002.

