Optimization and Parallelization of Sequential Programs
DF00100 Advanced Compiler Construction / TDDC86 Compiler Optimizations and Code Generation
Lecture 7: Optimization and Parallelization of Sequential Programs
Christoph Kessler, IDA / PELAB, Linköping University, Sweden, 2014

Outline: towards (semi-)automatic parallelization of sequential programs
- Data dependence analysis for loops
- Some loop transformations: loop invariant code hoisting, loop unrolling, loop fusion, loop interchange, loop blocking and tiling
- Static loop parallelization
- Run-time loop parallelization: doacross parallelization, inspector-executor method
- Speculative parallelization (as time permits)
- Auto-tuning (later, if time permits)

Foundations: Control and Data Dependence

Consider statements S and T in a sequential program (S = T is possible).
- The scope of the analysis is typically a function, i.e., the analysis is intra-procedural.
- The analysis can be done at arbitrary granularity: instructions, operations, statements, compound statements, program regions.
- Relevant are only the read and write effects on memory (i.e., on program variables) of each operation, and its effect on control flow.

Control dependence S → T: whether T executes may depend on S (e.g., on a condition evaluated by S). It implies that the relative execution order "S before T" must be preserved when restructuring the program. Control dependences are mostly obvious from the nesting structure in well-structured programs, but more tricky in arbitrary branching code (e.g., assembler code).
Example:
    S: if (…) {
         …
    T:   …
       }

Data dependence S → T: statement S may execute (dynamically) before T, both may access the same memory location, and at least one of these accesses is a write (assuming a control flow path S … T is possible). It means that the execution order "S before T" must be preserved when restructuring the program. In general, only a conservative over-estimation of the dependences can be determined statically. There are three kinds:
- Flow dependence (RAW, read-after-write): S may write a location z that T may read.
  Example:
      S: z = … ;
         …
      T: … = ..z.. ;
- Anti dependence (WAR, write-after-read): S may read a location that T may overwrite.
- Output dependence (WAW, write-after-write): both S and T may write the same location.
(A concrete C sketch of these dependence kinds appears at the end of this section.)

Dependence Graph

A (data, control, program) dependence graph is a directed graph consisting of all statements as vertices and all (data, control, or any) dependences as edges.

Why Loop Optimization and Parallelization?

Loops are a promising target for program optimizations, including automatic parallelization:
- High execution frequency: most computation is done in (inner) loops, so even small optimizations can have a large impact (cf. Amdahl's Law).
- Regular, repetitive behavior: compact description, relatively simple to analyze statically.
- Well researched.
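As a concrete illustration of the dependence kinds defined above, here is a minimal C sketch; the statements S1–S5 and the variables are made up for illustration and are not from the slides.

    #include <stdio.h>

    int main(void) {
        int x = 5, y = 0, z = 0;
        /* S1 */ z = x + 1;   /* S1 writes z                               */
        /* S2 */ y = z * 2;   /* S2 reads z written by S1:
                                 flow dependence S1 → S2 (RAW)             */
        /* S3 */ z = y - 3;   /* S3 overwrites z read by S2:
                                 anti dependence S2 → S3 (WAR);
                                 S3 also rewrites z written by S1:
                                 output dependence S1 → S3 (WAW)           */
        /* S4 */ if (z > 0)
        /* S5 */     y = 0;   /* whether S5 executes depends on S4's
                                 condition: control dependence S4 → S5     */
        printf("y = %d, z = %d\n", y, z);   /* prints: y = 0, z = 9 */
        return 0;
    }

The three data dependences force the order S1, S2, S3: a restructuring compiler may not swap any of these statements.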
Loop Optimizations – General Issues

Move loop-invariant computations out of loops; modify the order of iterations or of parts thereof.
Goals:
- Improve data access locality
- Faster execution
- Reduce loop control overhead
- Enhance the possibilities for loop parallelization or vectorization
Only transformations that preserve the program semantics (its input/output behavior) are admissible. A conservative (static) criterion is to preserve all data dependences. Hence we need data dependence analysis for loops.

Data Dependence Analysis for Loops: A More Formal Introduction

Data Dependence Analysis – Overview
- Establishes a precedence relation between statements.
- Important for loop optimizations, vectorization and parallelization, instruction scheduling, and data cache optimizations.
- Computes conservative approximations to the disjointness of pairs of memory accesses: weaker than data-flow analysis, but it generalizes nicely to the level of individual array elements.
- Covers: loops and loop nests, the iteration space, array subscripts in loops, the index space, dependence testing methods, the data dependence graph, the data + control dependence graph, and the program dependence graph.

Data Dependence Graph
The data dependence graph for straight-line code (a "basic block", no branching) is always acyclic, because the relative execution order of statements is forward only.
The data dependence graph for a loop has a dependence edge S → T if a dependence may exist for some pair of instances (iterations) of S and T. Cycles are possible. Dependences are classified as loop-independent or loop-carried.
[Example figure: a loop and its data dependence graph, assuming we know statically that arrays a and b do not intersect.]

Loop Iteration Space

Example
(assuming that we statically know that arrays A, X, Y, Z do not intersect; otherwise there might be further dependences)
[Figure: the loop with its iterations unrolled, and the resulting data dependence graph over statements S1 and S2.]

Loop Normalization

Dependence Distance and Direction

Dependence Equation System

Linear Diophantine Equations

Dependence Testing, 1: GCD-Test

For Multidimensional Arrays? Survey of Dependence Tests

(Slide titles only; a worked distance/direction example and a GCD-test sketch follow below.)
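As a supplement to "Dependence Distance and Direction" above, here are the standard definitions, stated from general compiler literature rather than recovered from the slide: if the source of a dependence executes in iteration i1 and the sink in iteration i2 of a loop, the dependence distance is i2 - i1, and the dependence direction is its sign, written "<" (loop-carried forward), "=" (loop-independent), or ">". A minimal C sketch with assumed arrays:

    #include <stdio.h>
    #define N 8

    int main(void) {
        int a[N] = {1, 1, 1, 1, 1, 1, 1, 1};
        int b[N];
        for (int i = 1; i < N; i++) {
            /* S1 */ a[i] = a[i - 1] + 1;  /* reads what S1 wrote in
                                              iteration i-1: loop-carried
                                              flow dependence, distance 1,
                                              direction "<"                 */
            /* S2 */ b[i] = a[i] * 2;      /* reads what S1 wrote in the
                                              same iteration: loop-independent
                                              flow dependence, distance 0,
                                              direction "="                 */
        }
        printf("a[%d] = %d, b[%d] = %d\n", N - 1, a[N - 1], N - 1, b[N - 1]);
        /* prints: a[7] = 8, b[7] = 16 */
        return 0;
    }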
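The GCD test itself is only named above, so here is a common formulation from standard compiler literature (the function names below are mine, not from the slides): for a write to X[a*i + b] and a read of X[c*i + d] in the same loop, the dependence equation a*i1 + b = c*i2 + d is a linear Diophantine equation, which has integer solutions iff gcd(a, c) divides d - b. The test ignores the loop bounds, so it can only disprove a dependence, never confirm one.

    #include <stdio.h>
    #include <stdlib.h>

    static int gcd(int x, int y) {
        x = abs(x); y = abs(y);
        while (y != 0) { int t = x % y; x = y; y = t; }
        return x;
    }

    /* Returns 1 if a dependence between X[a*i+b] and X[c*i+d] is possible. */
    static int gcd_test(int a, int b, int c, int d) {
        int g = gcd(a, c);
        if (g == 0)                /* both subscripts are constants */
            return b == d;
        return (d - b) % g == 0;
    }

    int main(void) {
        /* X[2*i] written, X[2*i+1] read: gcd(2,2) = 2 does not divide 1,
           so the accesses are provably independent. */
        printf("%d\n", gcd_test(2, 0, 2, 1));   /* 0 */
        /* X[2*i] written, X[i+3] read: gcd(2,1) = 1 divides 3,
           so a dependence is possible (the test is conservative). */
        printf("%d\n", gcd_test(2, 0, 1, 3));   /* 1 */
        return 0;
    }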
DF00100 Advanced Compiler Construction / TDDC86 Compiler Optimizations and Code Generation
Loop Transformations and Parallelization
Christoph Kessler, IDA, Linköping University, 2014

Some Important Loop Transformations
- Loop normalization
- Loop parallelization
- Loop invariant code hoisting
- Loop interchange
- Loop fusion vs. loop distribution / fission
- Strip-mining / loop tiling / blocking vs. loop linearization
- Loop unrolling, unroll-and-jam
- Loop peeling
- Index set splitting, loop unswitching
- Scalar replacement, scalar expansion
Later: software pipelining. More: cycle shrinking, loop skewing, ...

Loop Invariant Code Hoisting
Move loop-invariant code out of the loop. Compilers can do this automatically if they can statically determine which code is loop invariant.
Example:
    for (i=0; i<10; i++)           tmp = c / d;
        a[i] = b[i] + c / d;   →   for (i=0; i<10; i++)
                                       a[i] = b[i] + tmp;

Loop Unrolling
Unrolling can be enforced with compiler options, e.g. -funroll=2.
Example, unrolled by 2:
    for (i=0; i<50; i++)   →   for (i=0; i<50; i+=2) {
        a[i] = b[i];                a[i]   = b[i];
                                    a[i+1] = b[i+1];
                                }
- Reduces loop overhead (the total number of comparisons, branches, and increments).
- The longer loop body may enable further local optimizations (e.g., common subexpression elimination, register allocation, instruction scheduling, use of SIMD instructions).
- Drawback: longer code.
Exercise: formulate the unrolling rule for a statically unknown upper loop limit. (One possible formulation is sketched at the end of this section.)

Loop Interchange (1)
For properly nested loops (statements occur in the innermost loop body only).
Example 1:
    for (j=0; j<M; j++)            for (i=0; i<N; i++)
        for (i=0; i<N; i++)    →       for (j=0; j<M; j++)
            a[i][j] = 0.0;                 a[i][j] = 0.0;
[Figure: the old iteration order walks a column-wise, a[0][0] … a[N-1][0]; the new order walks it row-wise, a[0][0] … a[0][M-1]; 2D arrays are stored row-wise in C and Java.]
Interchange can improve data access locality in the memory hierarchy (fewer cache misses / page faults).

Foundations: Loop-Carried Data Dependences
Recall: there is a data dependence S → T if operation S may execute (dynamically) before operation T, both may access the same memory location (e.g., S: z = … ;  …  T: … = ..z.. ;), and at least one of these accesses is a write. In general, only a conservative over-estimation can be determined statically.
A data dependence S → T is called loop-carried by a loop L if it holds between instances of S and T in different iterations of L.

Loop Interchange (2)
Be careful with loop-carried data dependences! (An illustrative sketch follows below.)
Example 2:
    for (j=1; j<M; j++)            for (i=0; i<N; i++)
        for (i=0; i<N; i++)    →       for (j=1; j<M; j++)
            a[i][j] …                      a[i][j] …
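As a hypothetical illustration (my own sketch, not the slide's Example 2) of how a loop-carried dependence can make interchange illegal, consider a nest whose dependence has distance vector (1, -1), i.e., direction (<, >):

    #include <stdio.h>
    #define N 4
    #define M 4

    int main(void) {
        double a[N][M];
        for (int i = 0; i < N; i++)          /* initialize */
            for (int j = 0; j < M; j++)
                a[i][j] = i * M + j;

        /* Iteration (i,j) reads a[i-1][j+1], which was written by
           iteration (i-1, j+1): a flow dependence with distance
           vector (1, -1), direction (<, >). */
        for (int i = 1; i < N; i++)
            for (int j = 0; j < M - 1; j++)
                a[i][j] = a[i - 1][j + 1] + 1.0;

        /* Interchanging the i- and j-loops would make the first nonzero
           distance component negative: the sink iteration would run
           before its source, read already-overwritten values, and
           compute different results. */
        printf("a[%d][0] = %g\n", N - 1, a[N - 1][0]);   /* prints 6 */
        return 0;
    }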
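And one possible formulation for the unrolling exercise above (a sketch under my own naming, not the slides' solution): with a statically unknown upper bound n, the unrolled main loop must stop while a full group of iterations remains, and an epilogue loop handles the leftover iterations.

    #include <stdio.h>

    /* Unroll-by-2 version of:  for (i = 0; i < n; i++) a[i] = b[i]; */
    void copy_unrolled2(double *a, const double *b, int n) {
        int i;
        for (i = 0; i + 1 < n; i += 2) {   /* main loop: 2 iterations each */
            a[i]     = b[i];
            a[i + 1] = b[i + 1];
        }
        for (; i < n; i++)                 /* epilogue: the remainder      */
            a[i] = b[i];
    }

    int main(void) {
        double b[5] = {1, 2, 3, 4, 5}, a[5];
        copy_unrolled2(a, b, 5);
        printf("%g %g\n", a[0], a[4]);     /* prints: 1 5 */
        return 0;
    }

In general, for unroll factor u the main loop runs while i + u - 1 < n, and the epilogue executes at most u - 1 iterations.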