Compiler Transformations for High-Performance Computing
School of Electrical and Computer Engineering, N.T.U.A.
Embedded System Design

High-Level Transformations for Embedded Computing
Dimitrios Soudris

License: This educational material is subject to Creative Commons licenses. Educational material, such as images, that is subject to a license of another type must state that license explicitly.

[Figure: Organization of a hypothetical optimizing compiler]

DEPENDENCE ANALYSIS

A dependence is a relationship between two computations that places constraints on their execution order. Dependence analysis identifies these constraints, which are then used to determine whether a particular transformation can be applied without changing the semantics of the computation.

There are two types of dependences: (i) control dependence and (ii) data dependence.

A control dependence exists when one statement determines whether another will be executed, as when a statement appears in one branch of a conditional. Two statements have a data dependence if they cannot be executed simultaneously due to conflicting uses of the same variable.

Types of Data Dependences (1)

Flow dependence (also called true dependence): the second statement reads a value that the first statement writes.
S1: a = c*10
S2: d = 2*a + c

Anti-dependence: the first statement reads a location that the second statement subsequently writes.
S1: e = f*4 + g
S2: g = 2*h

Types of Data Dependences (2)

Output dependence: both statements write the same variable.
S1: a = b*c
S2: a = d+e

Input dependence: two accesses to the same memory location are both reads.

Dependence Graph: nodes represent statements, and arcs represent dependences between computations.

Loop Dependence Analysis

To compute dependence information for loops, the key problem is understanding the use of arrays; scalar variables are relatively easy to manage. To track array behavior, the compiler must analyze the subscript expressions in each array reference. To discover whether there is a dependence in the loop nest, it is sufficient to determine whether any of the iterations can write a value that is read or written by any of the other iterations.

TRANSFORMATIONS

- Data-Flow Based Loop Transformations
- Loop Reordering Transformations
- Loop Restructuring Transformations
- Loop Replacement Transformations
- Memory Access Transformations
- Partial Evaluation
- Redundancy Elimination
- Procedure Call Transformations

Data-Flow Based Loop Transformations (1)

A number of classical loop optimizations are based on data-flow analysis, which tracks the flow of data through a program's variables.

Loop-based Strength Reduction replaces an expression in a loop with one that is equivalent but uses a less expensive operator. Its most common use is on induction-variable expressions (see the first C sketch at the end of this part).

Data-Flow Based Loop Transformations (2)

Loop-invariant Code Motion: when a computation appears inside a loop but its result does not change between iterations, the compiler can move that computation outside the loop. It is most profitable for expensive operations (sketched at the end of this part).

Data-Flow Based Loop Transformations (3)

Loop Unswitching is applied when a loop contains a conditional with a loop-invariant test condition. The loop is replicated inside each branch of the conditional, saving the overhead of conditional branching inside the loop, reducing the code size of the loop body, and possibly enabling the parallelization of a branch of the conditional (sketched below).
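To make the idea concrete, here is a minimal C sketch of loop-based strength reduction applied to an induction-variable expression; the function and variable names are illustrative, not from the slides.

    /* Before: the offset i*4 is recomputed with a multiplication
       on every iteration. */
    void fill_before(int *a, int n) {
        for (int i = 0; i < n; i++)
            a[i] = i * 4;
    }

    /* After strength reduction: a running value t is updated with a
       cheaper addition, replacing the multiplication. */
    void fill_after(int *a, int n) {
        int t = 0;
        for (int i = 0; i < n; i++) {
            a[i] = t;
            t += 4;
        }
    }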
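Next, a minimal sketch of loop-invariant code motion, assuming an expensive invariant computation (here sqrt); again the names are illustrative.

    #include <math.h>

    /* Before: sqrt(x) does not change between iterations, yet it is
       written inside the loop. */
    void add_before(double *a, const double *b, int n, double x) {
        for (int i = 0; i < n; i++)
            a[i] = b[i] + sqrt(x);
    }

    /* After code motion: the invariant, expensive computation is
       hoisted out of the loop and evaluated once. */
    void add_after(double *a, const double *b, int n, double x) {
        double s = sqrt(x);
        for (int i = 0; i < n; i++)
            a[i] = b[i] + s;
    }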
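Finally for this part, a sketch of loop unswitching on an assumed loop-invariant flag; after the transformation each loop body is branch-free.

    /* Before: 'flag' is loop-invariant, but the conditional branch
       executes on every iteration. */
    void combine_before(double *a, const double *b, const double *c,
                        int n, int flag) {
        for (int i = 0; i < n; i++) {
            if (flag)
                a[i] = b[i] + c[i];
            else
                a[i] = b[i] - c[i];
        }
    }

    /* After unswitching: the loop is replicated inside each branch of
       the conditional; each copy is a better candidate for
       vectorization or parallelization. */
    void combine_after(double *a, const double *b, const double *c,
                       int n, int flag) {
        if (flag)
            for (int i = 0; i < n; i++) a[i] = b[i] + c[i];
        else
            for (int i = 0; i < n; i++) a[i] = b[i] - c[i];
    }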
Loop Reordering Transformations

Loop reordering transformations change the relative order of execution of the iterations of a loop nest or nests. They are used to expose parallelism and to improve memory locality.

Loop Reordering Transformations (1)

Loop Interchange can be used to:
- enable vectorization by interchanging an inner, dependent loop with an outer, independent loop;
- improve vectorization by moving the independent loop with the largest range into the innermost position;
- improve parallel performance by moving an independent loop outwards in a loop nest, which increases the granularity of each iteration and reduces the number of barrier synchronizations;
- reduce stride, ideally to stride 1 (the case sketched in C at the end of this part); and
- increase the number of loop-invariant expressions in the inner loop.

Loop Reordering Transformations (2)

Loop Skewing skews the execution of the iterations: the inner loop index is shifted by a multiple (the skewing factor) of the outer loop index, and array subscripts are adjusted to match. Skewing is chiefly useful for enabling loop interchange, after which the parallel iterations along the wavefronts of the iteration space can be exploited.

Loop Reordering Transformations (3)

Loop Reversal changes the direction in which the loop traverses its iteration range. It is often used in conjunction with other iteration-space reordering transformations because it changes the dependence vectors.

Loop Reordering Transformations (4)

Strip Mining splits a loop into a nest of two: an outer loop that steps over strips and an inner loop that executes the iterations within a strip, for instance strips of 64 iterations that run in parallel on 64-element vector hardware. Strip mining is a method of adjusting the granularity of an operation, especially a parallelizable operation.

Loop Reordering Transformations (5)

Tiling (also called blocking) is the multi-dimensional generalization of strip mining. It is primarily used to improve cache re-use (QC) by dividing an iteration space into tiles and transforming the loop nest to iterate over them (sketched at the end of this part).

Loop Reordering Transformations (6)

Loop Distribution (also called loop fission or loop splitting) breaks a single loop into many. It is used to:
- create perfect loop nests;
- create sub-loops with fewer dependences;
- improve instruction-cache and instruction-TLB locality, thanks to shorter loop bodies;
- reduce memory requirements by iterating over fewer arrays; and
- increase register re-use by decreasing register pressure.

Loop Reordering Transformations (7)

Loop Fusion (loop merging) is the inverse of loop distribution. It can improve performance by:
- reducing loop overhead;
- increasing instruction parallelism;
- improving register, vector, data-cache, TLB, or page locality; and
- improving the load balance of parallel loops.
(A C sketch appears at the end of this part.)

Loop Restructuring Transformations

Loop restructuring transformations change the structure of the loop but leave the computations performed by an iteration of the loop body, and their relative order, unchanged.

Loop Restructuring Transformations (1)

Loop Unrolling replicates the body of a loop some number of times, called the unrolling factor u, and iterates by step u instead of step 1. It is a fundamental technique for generating the long instruction sequences required by VLIW machines. Unrolling can improve performance by:
- reducing loop overhead;
- increasing instruction parallelism; and
- improving register, data-cache, or TLB locality.
(Sketched at the end of this part.)

Loop Restructuring Transformations (2)

Software Pipelining also improves instruction parallelism. The operations of a single loop iteration are broken into s stages, and one iteration of the transformed loop performs stage 1 from iteration i, stage 2 from iteration i-1, and so on. Startup code before the loop fills the pipeline for the first s-1 iterations, and cleanup code after the loop drains it for the last s-1 iterations.

Loop unrolling vs. software pipelining: unrolling reduces loop overhead, while pipelining reduces the startup cost within each iteration.
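Here is the promised C sketch of loop interchange for the stride-reduction case; C stores arrays row-major, and the bound N is an assumption of the sketch.

    #define N 1024

    /* Before: the inner loop varies the row index i, so successive
       accesses are N elements apart (stride N). */
    void zero_before(double a[N][N]) {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                a[i][j] = 0.0;
    }

    /* After interchange: the inner loop walks consecutive memory
       locations (stride 1). */
    void zero_after(double a[N][N]) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = 0.0;
    }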
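A sketch of tiling (blocking) applied to matrix multiplication; the tile size T is an assumed, cache-dependent tuning parameter, and C is assumed zero-initialized by the caller.

    #define T 32  /* assumed tile size, tuned to the cache */

    /* The kk/jj loops step over tiles; within a tile, the same block
       of B is re-used while it is still cache-resident, raising QC. */
    void matmul_tiled(int n, double A[n][n], double B[n][n],
                      double C[n][n]) {
        for (int kk = 0; kk < n; kk += T)
            for (int jj = 0; jj < n; jj += T)
                for (int i = 0; i < n; i++)
                    for (int k = kk; k < kk + T && k < n; k++)
                        for (int j = jj; j < jj + T && j < n; j++)
                            C[i][j] += A[i][k] * B[k][j];
    }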
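A sketch of loop fusion; loop distribution is the same transformation read in reverse. Array names are illustrative.

    /* Before: two loops each traverse a[], loading it twice. */
    void separate(double *b, double *c, const double *a, int n) {
        for (int i = 0; i < n; i++) b[i] = 2.0 * a[i];
        for (int i = 0; i < n; i++) c[i] = a[i] + 1.0;
    }

    /* After fusion: one traversal serves both statements, halving the
       loop overhead and re-using a[i] while it sits in a register. */
    void fused(double *b, double *c, const double *a, int n) {
        for (int i = 0; i < n; i++) {
            b[i] = 2.0 * a[i];
            c[i] = a[i] + 1.0;
        }
    }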
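And a sketch of loop unrolling with factor u = 4; the cleanup loop handles the leftover n mod 4 iterations.

    /* Before: one add and one loop test per element. */
    double sum(const double *a, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* After unrolling by u = 4: the loop overhead is paid once per
       four elements, and the four adds expose instruction
       parallelism. */
    double sum_unrolled(const double *a, int n) {
        double s = 0.0;
        int i;
        for (i = 0; i + 3 < n; i += 4)
            s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
        for (; i < n; i++)   /* cleanup for the remaining iterations */
            s += a[i];
        return s;
    }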
Loop Restructuring Transformations (3)

Loop Coalescing combines a loop nest into a single loop, with the original indices computed from the resulting single induction variable (sketched in C at the end of this section).

Loop Restructuring Transformations (4)

Loop Collapsing is a simpler, more efficient, but less general version of coalescing, in which the number of dimensions of the array is actually reduced. Collapsing eliminates the overhead of multiple nested loops and of multi-dimensional array indexing.

Loop Restructuring Transformations (5)

Loop Peeling removes a small number of iterations from the beginning or end of the loop and executes them separately. Peeling has two uses:
- removing dependences created by the first or last few loop iterations, thereby enabling parallelization (sketched at the end of this section); and
- matching the iteration control of adjacent loops to enable fusion.

Loop Replacement Transformations

Loop replacement transformations operate on whole loops and completely alter their structure.

Reduction Recognition: a reduction is an operation that computes a scalar value from an array. Common reductions include computing the sum or the maximum value of the elements of an array (sketched at the end of this section).

Loop Replacement Transformations (2)

Array Statement Scalarization: when a loop is expressed in array notation, the compiler can either convert it into vector operations or scalarize it into one or more serial loops. The conversion is not completely straightforward, however, because array notation requires that the operation be performed as if every value on the right-hand side and every sub-expression on the left-hand side were computed before any assignments are performed.

Memory Access Transformations

A large speed gap separates the CPU from DRAM. Factors affecting memory performance include:
- Re-use, denoted by Q and QC: the ratio of uses of an item to the number of times it is loaded;
- Parallelism: vector machines often divide memory into banks, allowing vector registers to be loaded in a parallel or pipelined fashion;
- Working-set size: if all the memory elements accessed inside a loop do not fit in the data cache, items that will be accessed in later iterations may be flushed, decreasing QC.

Memory-system performance can be improved using loop interchange (6.2.1), loop tiling (6.2.6), loop unrolling (6.3.1), loop fusion (6.2.8), and various optimizations that eliminate register saves at procedure calls (6.8).

Memory Access Transformations (1)

Array Padding is a transformation whereby unused data locations are inserted between the columns of an array or between arrays. Padding is used to ameliorate a number of memory-system conflicts, in particular:
- bank conflicts on vector machines with banked memory;
- cache-set or TLB-set conflicts;
- cache misses in which later references evict the lines loaded by earlier references, precluding re-use; and
- false sharing of cache lines on shared-memory multiprocessors.
(Sketched below.)
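A minimal C sketch of loop coalescing, as referenced above; recovering the indices with a division and a modulo is the standard formulation, and the names and sizes are illustrative.

    #define N 100
    #define M 200

    /* Before: a doubly nested loop over an N x M iteration space. */
    void init_nested(double a[N][M]) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < M; j++)
                a[i][j] = 0.0;
    }

    /* After coalescing: a single loop of N*M iterations; the original
       indices are recomputed from the single induction variable t,
       which makes it easy to split the iterations evenly among
       processors. */
    void init_coalesced(double a[N][M]) {
        for (int t = 0; t < N * M; t++) {
            int i = t / M;
            int j = t % M;
            a[i][j] = 0.0;
        }
    }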
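A sketch of loop peeling removing a dependence created by the first iteration; the access pattern is illustrative, not from the slides.

    /* Before: iteration 0 writes b[0], which every later iteration
       reads, so the loop cannot run in parallel as written. */
    void scale_before(double *b, int n) {
        for (int i = 0; i < n; i++)
            b[i] = b[i] + b[0];
    }

    /* After peeling iteration 0: the remaining loop only reads b[0],
       and its iterations are independent. */
    void scale_after(double *b, int n) {
        if (n > 0) {
            b[0] = b[0] + b[0];          /* peeled first iteration */
            for (int i = 1; i < n; i++)  /* now parallelizable */
                b[i] = b[i] + b[0];
        }
    }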
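A sketch of reduction recognition for a sum. Once the pattern is recognized, the compiler may compute partial sums independently and combine them at the end; for floating point this reorders the additions and is exact only up to rounding.

    /* The recognized pattern: a scalar accumulated over an array. */
    double sum_reduce(const double *a, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* One possible replacement: two interleaved partial sums that can
       proceed independently, combined at the end. */
    double sum_partial(const double *a, int n) {
        double s0 = 0.0, s1 = 0.0;
        int i;
        for (i = 0; i + 1 < n; i += 2) {
            s0 += a[i];
            s1 += a[i + 1];
        }
        if (i < n)            /* odd n: one element left over */
            s0 += a[i];
        return s0 + s1;
    }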
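Finally, a sketch of intra-array padding. The sizes assume a direct-mapped cache whose size divides the row length in bytes, so the example is illustrative rather than portable.

    /* Conflict-prone: with a power-of-two row length, the elements of
       one column can map to a few cache sets and evict one another
       when the column is traversed. */
    double m[256][256];

    /* Padded: one unused element per row shifts the mapping of each
       successive row, spreading the column accesses across sets. */
    double mp[256][256 + 1];

    /* A column traversal of the unpadded array, i.e. the access
       pattern that triggers the conflicts described above. */
    double column_sum(int j) {
        double s = 0.0;
        for (int i = 0; i < 256; i++)
            s += m[i][j];
        return s;
    }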