Optimizing Compilers


Optimizing Compilers
Source-to-source (high-level) translation and optimization for multicores
Rudolf (Rudi) Eigenmann

I. Motivation and Introduction: Optimizing Compilers are in the Center of the (Software) Universe
They translate increasingly advanced human interfaces (programming languages) onto increasingly complex target machines.
• Human (programming) language: today C, C++, Java, Fortran; tomorrow a problem specification language (a grand challenge).
• Target machine architecture: today workstations, multicores, and HPC systems; tomorrow globally distributed/cloud resources (a grand challenge).
Processors have multiple cores. Parallelization is a key optimization.

Issues in Optimizing / Parallelizing Compilers
The Goal:
• We would like to run standard (C, C++, Java, Fortran) programs on common parallel computers.
This leads to the following high-level issues:
• How to detect parallelism?
• How to map parallelism onto the machine?
• How to create a good compiler architecture?

Detecting Parallelism
• Program analysis techniques
• Data dependence analysis
• Dependence removing techniques
• Parallelization in the presence of dependences
• Runtime dependence detection

Mapping Parallelism onto the Machine
• Exploiting parallelism at many levels
– Multiprocessors and multi-cores (our focus)
– Distributed computers (clusters or global networks)
– Heterogeneous architectures
– Instruction-level parallelism
– Vector machines
• Exploiting memory organizations
– Data placement
– Locality enhancement
– Data communication

Architecting a Compiler
• Compiler generator languages and tools
• Internal representations
• Implementing analysis and transformation techniques
• Orchestrating compiler techniques (when to apply which technique)
• Benchmarking and performance evaluation

Early, Foundational Parallelizing Compiler Books and Survey Papers
Books:
• Michael Wolfe, High-Performance Compilers for Parallel Computing (1996)
• Utpal Banerjee, several books on data dependence analysis and transformations
• Ken Kennedy and John Allen, Optimizing Compilers for Modern Architectures: A Dependence-based Approach (2001)
• Hans Zima and Barbara Chapman, Supercompilers for Parallel and Vector Computers (1990)
• Alain Darte, Yves Robert, and Frédéric Vivien, Scheduling and Automatic Parallelization (2000)
Survey Papers:
• Rudolf Eigenmann and Jay Hoeflinger, Parallelizing and Vectorizing Compilers, Wiley Encyclopedia of Electrical Engineering, John Wiley & Sons, Inc., 2001
• Utpal Banerjee, Rudolf Eigenmann, Alexandru Nicolau, and David Padua, Automatic Program Parallelization, Proceedings of the IEEE, 81(2), February 1993
• David F. Bacon, Susan L. Graham, and Oliver J. Sharp, Compiler Transformations for High-Performance Computing, ACM Computing Surveys, 26(4), December 1994, pages 345-420

Course Approach
There are many schools of thought on optimizing compilers. Our approach is performance-driven. We will discuss:
– Performance of parallelization techniques
– Analysis and transformation techniques in the Cetus compiler (for multiprocessors/cores)
– Additional transformations (for GPGPUs and other architectures)
– Compiler infrastructure considerations
The Heart of Automatic Parallelization: Data Dependence Testing
If a loop does not have data dependences between any two iterations, then it can be safely executed in parallel.
In many science/engineering applications, loop parallelism is most important. In non-numerical programs other control structures are also important.

Data Dependence Tests: Motivating Examples
Loop parallelization: can the iterations of this loop be run concurrently?
DO i=1,100,2
  B[2*i] = ...
  ... = B[2*i] + B[3*i]
ENDDO
Statement reordering: can these two statements be swapped?
DO i=1,100,2
  B[2*i] = ...
  ... = B[3*i]
ENDDO
DD testing is needed to detect parallelism, and it is important not just for detecting parallelism.
A data dependence exists between two data references iff:
• both references access the same storage location, and
• at least one of them is a write access.

Data Dependence Tests: Concepts
Terms for data dependences between statements of loop iterations:
• Distance (vector): indicates how many iterations apart the source and sink of the dependence are.
• Direction (vector): essentially the sign of the distance. There are different notations: (<,=,>) or (-1,0,+1), meaning the dependence goes from an earlier to a later iteration, stays within the same iteration, or goes from a later to an earlier iteration.
• Loop-carried (or cross-iteration) versus non-loop-carried (or loop-independent) dependence: indicates whether a dependence exists across iterations or within one iteration.
– For detecting parallel loops, only cross-iteration dependences matter.
– Loop-independent ("=" direction) dependences are relevant for optimizations such as statement reordering and loop distribution.
• Iteration space graphs: the un-abstracted form of a dependence graph, with one node per statement instance. Example:
DO i=1,n
  DO j=1,m
    a[i,j] = a[i-1,j-2]+b[i,j]
  ENDDO
ENDDO
(The iteration space graph plots the iterations in the (i,j) plane and draws an arc of distance (1,2) from the source to the sink of each dependence.)

Data Dependence Tests: Formulation of the Data-Dependence Problem
DO i=1,n
  a[4*i] = ...
  ... = a[2*i+1]
ENDDO
The question to answer: can 4*i1 ever be equal to 2*i2+1, with i1, i2 ∈ [1,n]?
Note that the iterations at which the two expressions are equal may differ. To express this fact, we choose the notation i1, i2.
Let us generalize a bit. Given
• two subscript functions f and g, and
• loop bounds lower, upper:
does f(i1) = g(i2) have a solution such that lower ≤ i1, i2 ≤ upper?
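The generalized question just posed, whether f(i1) = g(i2) has a solution with lower ≤ i1, i2 ≤ upper, can be checked naively by enumeration. The C sketch below is an illustration added here, not part of the original slides; the function names and bounds are assumptions, and real dependence tests (GCD, Banerjee, Omega) reason symbolically instead of enumerating.

```c
#include <stdio.h>
#include <stdbool.h>

/* Subscript functions of the example loop: a write to a[4*i] and a read of a[2*i+1]. */
static long f(long i) { return 4 * i; }      /* write subscript */
static long g(long i) { return 2 * i + 1; }  /* read subscript  */

/* Brute-force dependence test: is there any pair (i1, i2) in
 * [lower, upper] x [lower, upper] with f(i1) == g(i2)? */
static bool depends(long lower, long upper) {
    for (long i1 = lower; i1 <= upper; i1++)
        for (long i2 = lower; i2 <= upper; i2++)
            if (f(i1) == g(i2))
                return true;
    return false;
}

int main(void) {
    /* 4*i1 is always even and 2*i2+1 is always odd, so there is no solution
     * and the loop can be run in parallel. */
    printf("dependence in [1,100]? %s\n", depends(1, 100) ? "yes" : "no");
    return 0;
}
```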
This course would now be finished if:
• the mathematical formulation of the data dependence problem had an accurate and fast solution, and
• there were enough loops in programs without data dependences, and
• dependence-free code could be executed by today's parallel machines directly and efficiently, and
• engineering these techniques into a production compiler were straightforward.
There are enough hard problems to fill several courses!

II. Performance of Basic Automatic Program Parallelization

Three Decades of Parallelizing Compilers
A performance study at the beginning of the 1990s (the "Blume study") analyzed the performance of state-of-the-art parallelizers and vectorizers using the Perfect Benchmarks:
William Blume and Rudolf Eigenmann, Performance Analysis of Parallelizing Compilers on the Perfect Benchmarks Programs, IEEE Transactions on Parallel and Distributed Systems, 3(6), November 1992, pages 643-656.
Good reasons for starting two decades back:
• We will learn simple techniques first.
• We will see how parallelization techniques have evolved.
• We will see that extensions of the important techniques back then are still the important techniques today.

Overall Performance of Parallelizers in 1990
(Figure: speedup on 8 processors with 4-stage vector units.)

Performance of Individual Techniques
(Figure.)

Transformations Measured in the "Blume Study"
• Scalar expansion
• Reduction parallelization
• Induction variable substitution
• Loop interchange
• Forward substitution
• Stripmining
• Loop synchronization
• Recurrence substitution

Scalar Expansion
Original loop, with flow, anti, and output dependences on the scalar t:
DO j=1,n
  t = a[j]+b[j]
  c[j] = t + t**2
ENDDO
Privatization:
DO PARALLEL j=1,n
  PRIVATE t
  t = a[j]+b[j]
  c[j] = t + t**2
ENDDO
Expansion:
DO PARALLEL j=1,n
  t0[j] = a[j]+b[j]
  c[j] = t0[j] + t0[j]**2
ENDDO
We assume a shared-memory model:
• by default, data is shared, i.e., all processors can see and modify it
• processors share the work of parallel loops

Parallel Loop Syntax and Semantics in OpenMP
Fortran:
!$OMP PARALLEL PRIVATE(<private data>)
  <preamble code>
!$OMP DO
  DO i = lb, ub
    <loop body code>
  ENDDO
!$OMP END DO
  <postamble code>
!$OMP END PARALLEL
C:
#pragma omp parallel for
for (i=lb; i<=ub; i++) {
  <loop body code>
}
The work (iterations) of the loop is shared by the participating processors (threads); the preamble and postamble code is executed by all participating processors (threads).

Reduction Parallelization
Original loop, with flow and anti dependences on sum:
DO j=1,n
  sum = sum + a[j]
ENDDO
Parallelized with the OpenMP reduction clause:
!$OMP PARALLEL DO
!$OMP+REDUCTION(+:sum)
DO j=1,n
  sum = sum + a[j]
ENDDO
Parallelized with an explicit private partial sum:
!$OMP PARALLEL PRIVATE(s)
s = 0
!$OMP DO
DO j=1,n
  s = s + a[j]
ENDDO
!$OMP END DO
!$OMP ATOMIC
sum = sum + s
!$OMP END PARALLEL

Induction Variable Substitution
Original loop, with a flow dependence on ind:
ind = ind0
DO j = 1,n
  a[ind] = b[j]
  ind = ind+k
ENDDO
After substitution:
ind = ind0
DO PARALLEL j = 1,n
  a[ind0+k*(j-1)] = b[j]
ENDDO
Note: this is the reverse of strength reduction, an important transformation in classical (code-generating) compilers. For the source
real d[20,100]
DO j=1,n
  d[1,j]=0
ENDDO
a straightforward code generator produces
loop: ...
      R0 ← &d+20*j
      (R0) ← 0
      ...
      jump loop
whereas strength reduction yields
      R0 ← &d
loop: ...
      (R0) ← 0
      ...
      R0 ← R0+20
      jump loop

Forward Substitution
m = n+1
...
DO j=1,n
  a[j] = a[j+m]
ENDDO
becomes
m = n+1
...
DO j=1,n
  a[j] = a[j+n+1]
ENDDO
Similarly, the dependent statement sequence
a = x*y
b = a+2
c = b + 4
becomes
a = x*y
b = x*y+2
c = x*y + 6
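Combining two of the transformations above, the following C/OpenMP sketch (added here for illustration, not from the original slides; the array sizes, stride k, and initial values are assumptions) shows the induction variable substitution example in compilable form: the serial loop carries a flow dependence on ind, while the substituted loop computes the subscript from the loop index and can be marked parallel with the OpenMP syntax shown earlier.

```c
#include <stdio.h>

#define N 1000

int main(void) {
    static double a[8 * N], b[N];
    int ind0 = 3, k = 7;

    for (int j = 0; j < N; j++) b[j] = j;

    /* Serial version: ind is incremented every iteration, creating a
     * loop-carried flow dependence. */
    int ind = ind0;
    for (int j = 0; j < N; j++) {
        a[ind] = b[j];
        ind = ind + k;
    }

    /* After induction variable substitution the subscript is a closed-form
     * function of j, so the iterations are independent and the loop can be
     * executed in parallel. */
    #pragma omp parallel for
    for (int j = 0; j < N; j++) {
        a[ind0 + k * j] = b[j];
    }

    printf("a[%d] = %f\n", ind0 + k * (N - 1), a[ind0 + k * (N - 1)]);
    return 0;
}
```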
Recommended publications
  • Redundancy Elimination Common Subexpression Elimination
Redundancy Elimination
Aim: eliminate redundant operations in dynamic execution.
Why do they occur?
• Loop-invariant code (ex.: constant assignment in a loop)
• The same expression computed more than once (ex.: addressing)
Value numbering is an example; it requires dataflow analysis.
Other optimizations: constant subexpression elimination, loop-invariant code motion, partial redundancy elimination.
Common Subexpression Elimination: replace recomputation of an expression by use of a temp which holds its value.
Ex.
(s1) y := a + b          (s1)  temp := a + b
(s2) z := a + b    →     (s1') y := temp
                         (s2)  z := temp
Illegal? How different from value numbering?
Ex.
(s1) read(i)
(s2) j := i + 1
(s3) k := i
(s4) l := k + 1
Here i + 1 and k + 1 have the same value number, but there is no CSE. Why do we need the temp?
Local and Global
• Local CSE (within a basic block). Ex.
(s1) c := a + b          (s1)  t1 := a + b
(s2) d := m&n            (s1') c := t1
(s3) e := a + b    →     (s2)  d := m&n
(s4) m := 5              (s3)  e := t1
(s5) if (m&n) ...        (s4)  m := 5
                         (s5)  if (m&n) ...
5 instr, 4 ops, 7 vars become 6 instr, 3 ops, 8 vars. Always better?
Method: keep track of expressions computed in the block whose operands have not changed value. CSE hash table: (+, a, b), (&, m, n).
Global CSE example (assumes b is used later):
i := j                   i := j
a := 4*i                 t := 4*i
i := i + 1               a := t
b := 4*i           →     i := i + 1
c := 4*i                 t := 4*i
                         b := t
                         c := t
• Global CSE: an expression e is available at entry to B if on every path p from Entry to B there is an evaluation of e at B' on p whose operands are not redefined between B' and B.
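To make the local-CSE idea concrete in source terms, the following C fragment is a small illustrative sketch (not taken from the excerpt above; the names are invented): the second version computes the common subexpression a + b once into a temporary, mirroring the t1 rewrite shown in the example.

```c
/* Before local CSE: a + b is evaluated twice in the same basic block. */
int before(int a, int b, int m, int n) {
    int c = a + b;
    int d = m & n;
    int e = a + b;      /* redundant: operands a, b unchanged since c */
    return c + d + e;
}

/* After local CSE: a temporary holds the value of the common subexpression. */
int after(int a, int b, int m, int n) {
    int t1 = a + b;     /* single evaluation of the common subexpression */
    int c = t1;
    int d = m & n;
    int e = t1;
    return c + d + e;
}
```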
  • Polyhedral Compilation As a Design Pattern for Compiler Construction
Polyhedral Compilation as a Design Pattern for Compiler Construction. PLISS, May 19-24, 2019. [email protected]
Polyhedra? Example: tiles. How many of you read "Design Patterns"?
Tiles Everywhere:
1. Hardware. Example: Google Cloud TPU (architectural scalability with tiling); Google Edge TPU (edge computing zoo).
2. Data Layout. Example: XLA compiler, tiled data layout; repeated/hierarchical tiling, e.g., BF16 (bfloat16) on Cloud TPU (should be 8x128 then 2x1).
3. Control Flow. Tiling in Halide: tiled schedule: strip-mine (a.k.a. split), permute (a.k.a. reorder); vectorized schedule: strip-mine, vectorize inner loop; non-divisible bounds/extent: strip-mine, shift left/up, redundant computation (also forward substitute/inline operand).
4. Data Flow.
5. Data Parallelism.
Example: Halide for image processing pipelines (https://halide-lang.org), a meta-programming API and domain-specific language (DSL) for loop transformations and numerical computing kernels.
TVM example: scan cell (RNN).
m = tvm.var("m")
n = tvm.var("n")
X = tvm.placeholder((m,n), name="X")
s_state = tvm.placeholder((m,n))
s_init = tvm.compute((1,n), lambda _,i: X[0,i])
s_do = tvm.compute((m,n), lambda t,i: s_state[t-1,i] + X[t,i])
s_scan = tvm.scan(s_init, s_do, s_state, inputs=[X])
s = tvm.create_schedule(s_scan.op)
# Schedule to run the scan cell on a CUDA device
block_x = tvm.thread_axis("blockIdx.x")
thread_x = tvm.thread_axis("threadIdx.x")
xo,xi = s[s_init].split(s_init.op.axis[1], factor=num_thread)
s[s_init].bind(xo, block_x)
s[s_init].bind(xi, thread_x)
xo,xi = s[s_do].split(s_do.op.axis[1], factor=num_thread)
s[s_do].bind(xo, block_x)
s[s_do].bind(xi, thread_x)
print(tvm.lower(s, [X, s_scan], simple_mode=True))
And also TVM for neural networks: https://tvm.ai
Tiling and Beyond
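The strip-mine-and-permute (tiling) schedule mentioned above can also be written out directly in plain C. The sketch below is an illustration added here, not from the PLISS slides; the tile size and matrix dimension are arbitrary, and B is assumed to divide N.

```c
#define N 1024
#define B 64            /* tile size; assumes B divides N */

/* Untiled traversal: walks whole rows, poor reuse of the columns of y. */
void transpose_naive(double *x, const double *y) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            x[i * N + j] = y[j * N + i];
}

/* Tiled (strip-mined and permuted) traversal: ii/jj enumerate tiles,
 * i/j enumerate points inside a tile, so each BxB block stays in cache. */
void transpose_tiled(double *x, const double *y) {
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int i = ii; i < ii + B; i++)
                for (int j = jj; j < jj + B; j++)
                    x[i * N + j] = y[j * N + i];
}
```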
  • Equality Saturation: a New Approach to Optimization
    Logical Methods in Computer Science Vol. 7 (1:10) 2011, pp. 1–37 Submitted Oct. 12, 2009 www.lmcs-online.org Published Mar. 28, 2011 EQUALITY SATURATION: A NEW APPROACH TO OPTIMIZATION ROSS TATE, MICHAEL STEPP, ZACHARY TATLOCK, AND SORIN LERNER Department of Computer Science and Engineering, University of California, San Diego e-mail address: {rtate,mstepp,ztatlock,lerner}@cs.ucsd.edu Abstract. Optimizations in a traditional compiler are applied sequentially, with each optimization destructively modifying the program to produce a transformed program that is then passed to the next optimization. We present a new approach for structuring the optimization phase of a compiler. In our approach, optimizations take the form of equality analyses that add equality information to a common intermediate representation. The op- timizer works by repeatedly applying these analyses to infer equivalences between program fragments, thus saturating the intermediate representation with equalities. Once saturated, the intermediate representation encodes multiple optimized versions of the input program. At this point, a profitability heuristic picks the final optimized program from the various programs represented in the saturated representation. Our proposed way of structuring optimizers has a variety of benefits over previous approaches: our approach obviates the need to worry about optimization ordering, enables the use of a global optimization heuris- tic that selects among fully optimized programs, and can be used to perform translation validation, even on compilers other than our own. We present our approach, formalize it, and describe our choice of intermediate representation. We also present experimental results showing that our approach is practical in terms of time and space overhead, is effective at discovering intricate optimization opportunities, and is effective at performing translation validation for a realistic optimizer.
  • Scalable Conditional Induction Variables (CIV) Analysis
Scalable Conditional Induction Variables (CIV) Analysis
Cosmin E. Oancea, Department of Computer Science, University of Copenhagen, [email protected]
Lawrence Rauchwerger, Department of Computer Science and Engineering, Texas A&M University, [email protected]
Abstract: Subscripts using induction variables that cannot be expressed as a formula in terms of the enclosing-loop indices appear in the low-level implementation of common programming abstractions such as filter, or stack operations and pose significant challenges to automatic parallelization. Because the complexity of such induction variables is often due to their conditional evaluation across the iteration space of loops we name them Conditional Induction Variables (CIV). This paper presents a flow-sensitive technique that summarizes both such CIV-based and affine subscripts to program level, using the same representation. ... its closed-form equivalent k0+2*i, which enables its independent evaluation by all iterations. More importantly, the resulted code, shown in Figure 1(b), allows the compiler to verify that the set of points written by any dis...
Figure 1. Loops with affine and CIV array accesses.
(a) k = k0                  (b) k = k0                    (c) DO i = 1, N
    DO i = 1, N                 DO i = 1, N                     IF (cond(b(i))) THEN
      k = k+2         →           a(k0+2*i) = ..                  civ = civ+1
      a(k) = ..                 ENDDO                             a(civ) = ...
    ENDDO                       k = k0+MAX(2N,0)                ENDIF
                                                              ENDDO
Induction variable substitution rewrites (a) into (b); whether the same can be done for the conditionally updated variable in (c) is the question the paper addresses.
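The following C sketch (illustrative only, not from the paper; the predicate and array sizes are assumptions) restates Figure 1: the first loop's induction variable has the closed form k0 + 2*i and can be substituted away, whereas the conditionally incremented civ in the second loop has no per-iteration formula in terms of i alone, which is exactly what makes CIV subscripts hard to parallelize.

```c
#define N 100

/* (a)/(b): unconditional induction variable, closed form k = k0 + 2*i.
 * The caller must provide arrays large enough for the written range. */
void affine(double *a, const double *b, int k0) {
    for (int i = 1; i <= N; i++)
        a[k0 + 2 * i] = b[i];          /* substituted subscript */
}

/* (c): conditional induction variable (CIV); civ depends on how many earlier
 * iterations satisfied the condition, so there is no closed form in i. */
int filter(double *a, const double *b, int civ0) {
    int civ = civ0;
    for (int i = 1; i <= N; i++) {
        if (b[i] > 0.0) {              /* stands in for cond(b(i)) */
            civ = civ + 1;
            a[civ] = b[i];
        }
    }
    return civ;
}
```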
  • Induction Variable Analysis with Delayed Abstractions1
    Induction Variable Analysis with Delayed Abstractions1 SEBASTIAN POP, and GEORGES-ANDRE´ SILBER CRI, Mines Paris, France and ALBERT COHEN ALCHEMY group, INRIA Futurs, Orsay, France We present the design of an induction variable analyzer suitable for the analysis of typed, low-level, three address representations in SSA form. At the heart of our analyzer stands a new algorithm that recognizes scalar evolutions. We define a representation called trees of recurrences that is able to capture different levels of abstractions: from the finer level that is a subset of the SSA representation restricted to arithmetic operations on scalar variables, to the coarser levels such as the evolution envelopes that abstract sets of possible evolutions in loops. Unlike previous work, our algorithm tracks induction variables without prior classification of a few evolution patterns: different levels of abstraction can be obtained on demand. The low complexity of the algorithm fits the constraints of a production compiler as illustrated by the evaluation of our implementation on standard benchmark programs. Categories and Subject Descriptors: D.3.4 [Programming Languages]: Processors—compilers, interpreters, optimization, retargetable compilers; F.3.2 [Logics and Meanings of Programs]: Semantics of Programming Languages—partial evaluation, program analysis General Terms: Compilers Additional Key Words and Phrases: Scalar evolutions, static analysis, static single assignment representation, assessing compilers heuristics regressions. 1Extension of Conference Paper:
  • Strength Reduction of Induction Variables and Pointer Analysis – Induction Variable Elimination
Loop Optimizations and Pointer Analysis (CS 412/413, Spring 2008, Introduction to Compilers)
Loop optimizations: optimize loops via
– loop-invariant code motion [last time]
– strength reduction of induction variables
– induction variable elimination
Strength Reduction
• Basic idea: replace expensive operations (multiplications) with cheaper ones (additions) in definitions of induction variables.
while (i<10) {                    s = 3*i+1;
  j = 3*i+1;   // <i,3,1>         while (i<10) {
  a[j] = a[j] - 2;        →         j = s;
  i = i+2;                          a[j] = a[j] - 2;
}                                   i = i+2;
                                    s = s+6;
                                  }
• Benefit: it is cheaper to compute s = s+6 than j = 3*i+1
– s = s+6 requires an addition
– j = 3*i requires a multiplication
Induction Variables
• An induction variable is a variable in a loop whose value is a function of the loop iteration number: v = f(i)
• In compilers, this is a linear function: f(i) = c*i + d
• Observation: linear combinations of linear functions are linear functions
– Consequence: linear combinations of induction variables are induction variables
Families of Induction Variables
• Basic induction variable: a variable whose only definition in the loop body is of the form i = i + c, where c is a loop-invariant value
• Derived induction variables: each basic induction variable i defines ...
Representation
• Representation of induction variables in family i by triples:
– Denote the basic induction variable i by <i, 1, 0>
– Denote the induction variable k = i*a + b by the triple <i, a, b>
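Recast as compilable C, the strength-reduction example from the excerpt looks roughly like this (a hedged sketch; the array size and the initial value of i are added assumptions): the multiplication 3*i + 1 inside the loop is replaced by an addition on the new induction variable s.

```c
#define LEN 64

/* Before strength reduction: j = 3*i + 1 needs a multiplication
 * in every iteration. */
void before(int *a) {
    int i = 0;                   /* assumed initial value */
    while (i < 10) {
        int j = 3 * i + 1;       /* induction variable <i, 3, 1> */
        a[j] = a[j] - 2;
        i = i + 2;
    }
}

/* After strength reduction: s tracks 3*i + 1 and is updated with an
 * addition (s += 6, since i advances by 2 and 3*2 == 6). */
void after(int *a) {
    int i = 0;
    int s = 3 * i + 1;
    while (i < 10) {
        int j = s;
        a[j] = a[j] - 2;
        i = i + 2;
        s = s + 6;
    }
}
```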
  • Generalizing Loop-Invariant Code Motion in a Real-World Compiler
    Imperial College London Department of Computing Generalizing loop-invariant code motion in a real-world compiler Author: Supervisor: Paul Colea Fabio Luporini Co-supervisor: Prof. Paul H. J. Kelly MEng Computing Individual Project June 2015 Abstract Motivated by the perpetual goal of automatically generating efficient code from high-level programming abstractions, compiler optimization has developed into an area of intense research. Apart from general-purpose transformations which are applicable to all or most programs, many highly domain-specific optimizations have also been developed. In this project, we extend such a domain-specific compiler optimization, initially described and implemented in the context of finite element analysis, to one that is suitable for arbitrary applications. Our optimization is a generalization of loop-invariant code motion, a technique which moves invariant statements out of program loops. The novelty of the transformation is due to its ability to avoid more redundant recomputation than normal code motion, at the cost of additional storage space. This project provides a theoretical description of the above technique which is fit for general programs, together with an implementation in LLVM, one of the most successful open-source compiler frameworks. We introduce a simple heuristic-driven profitability model which manages to successfully safeguard against potential performance regressions, at the cost of missing some speedup opportunities. We evaluate the functional correctness of our implementation using the comprehensive LLVM test suite, passing all of its 497 whole program tests. The results of our performance evaluation using the same set of tests reveal that generalized code motion is applicable to many programs, but that consistent performance gains depend on an accurate cost model.
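For readers unfamiliar with the baseline transformation being generalized in this project, a standard loop-invariant code motion example in C looks like this (an illustrative sketch added here, not drawn from the thesis itself): the subexpression c * c does not change across iterations, so it is hoisted out of the loop.

```c
/* Before LICM: c * c is recomputed on every iteration even though it is
 * loop-invariant. */
void scale_naive(double *x, int n, double c) {
    for (int i = 0; i < n; i++)
        x[i] = x[i] * (c * c);
}

/* After LICM: the invariant expression is computed once before the loop. */
void scale_hoisted(double *x, int n, double c) {
    double c2 = c * c;             /* hoisted loop-invariant value */
    for (int i = 0; i < n; i++)
        x[i] = x[i] * c2;
}
```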
  • A Tiling Perspective for Register Optimization Fabrice Rastello, Sadayappan Ponnuswany, Duco Van Amstel
A Tiling Perspective for Register Optimization. Łukasz Domagała, Fabrice Rastello, Sadayappan Ponnuswany, Duco van Amstel. Research Report RR-8541, Inria, Project-Team GCG, May 2014, 21 pages. ISSN 0249-6399, ISRN INRIA/RR--8541--FR+ENG. HAL Id: hal-00998915, https://hal.inria.fr/hal-00998915, submitted on 3 Jun 2014.
Abstract: Register allocation is a much studied problem. A particularly important context for optimizing register allocation is within loops, since a significant fraction of the execution time of programs is often inside loop code. A variety of algorithms have been proposed in the past for register allocation, but the complexity of the problem has resulted in a decoupling of several important aspects, including loop unrolling, register promotion, and instruction reordering.
  • Efficient Symbolic Analysis for Optimizing Compilers*
    Efficient Symbolic Analysis for Optimizing Compilers? Robert A. van Engelen Dept. of Computer Science, Florida State University, Tallahassee, FL 32306-4530 [email protected] Abstract. Because most of the execution time of a program is typically spend in loops, loop optimization is the main target of optimizing and re- structuring compilers. An accurate determination of induction variables and dependencies in loops is of paramount importance to many loop opti- mization and parallelization techniques, such as generalized loop strength reduction, loop parallelization by induction variable substitution, and loop-invariant expression elimination. In this paper we present a new method for induction variable recognition. Existing methods are either ad-hoc and not powerful enough to recognize some types of induction variables, or existing methods are powerful but not safe. The most pow- erful method known is the symbolic differencing method as demonstrated by the Parafrase-2 compiler on parallelizing the Perfect Benchmarks(R). However, symbolic differencing is inherently unsafe and a compiler that uses this method may produce incorrectly transformed programs without issuing a warning. In contrast, our method is safe, simpler to implement in a compiler, better adaptable for controlling loop transformations, and recognizes a larger class of induction variables. 1 Introduction It is well known that the optimization and parallelization of scientific applica- tions by restructuring compilers requires extensive analysis of induction vari- ables and dependencies
  • Loop Transformations and Parallelization
Loop transformations and parallelization. Claude Tadonki, LAL/CNRS/IN2P3, University of Paris-Sud, [email protected], December 2010.
Introduction. Most of the time, the most time-consuming part of a program is in its loops. Thus, loop optimization is critical in high performance computing. Depending on the target architecture, the goals of loop transformations are:
• improve data reuse and data locality
• efficient use of the memory hierarchy
• reducing overheads associated with executing loops
• instruction pipelining
• maximize parallelism
Loop transformations can be performed at different levels by the programmer, the compiler, or specialized tools. At a high level, some well-known transformations are commonly considered: loop interchange, loop (node) splitting, loop unswitching, loop reversal, loop fusion, loop inversion, loop skewing, loop fission, loop vectorization, loop blocking, loop unrolling, loop parallelization.
Dependence analysis. Extracting and analyzing the dependencies of a computation from its polyhedral model is a fundamental step toward loop optimization or scheduling.
Definition. For a given variable V and given indexes I1, I2, if the computation of X(I1) requires the value of X(I2), then I1 − I2 is called a dependence vector for variable V. Drawing all the dependence vectors within the computation polytope yields the so-called dependencies diagram.
Example. The dependence vectors are (1, 0), (0, 1), (−1, 1).
Scheduling. Definition. The computation on the entire domain of a given loop can be performed following any valid schedule. A timing function t_V for variable V yields a valid schedule if and only if
t(x) > t(x − d), ∀d ∈ D_V,   (1)
where D_V is the set of all dependence vectors for variable V.
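As a concrete instance of one of the transformations listed above, here is a hedged C sketch of loop interchange (added for illustration, not part of the original slides; the dimensions are arbitrary): for a row-major array, interchanging the i and j loops turns a strided access pattern into a sequential one, improving data locality.

```c
#define ROWS 512
#define COLS 512

/* Column-wise traversal of a row-major array: stride-COLS accesses. */
void init_colwise(double a[ROWS][COLS]) {
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            a[i][j] = 0.0;
}

/* After loop interchange: unit-stride accesses and better cache behavior.
 * Legal here because the iterations are independent. */
void init_rowwise(double a[ROWS][COLS]) {
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            a[i][j] = 0.0;
}
```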
  • Incremental Computation of Static Single Assignment Form
    Incremental Computation of Static Single Assignment Form Jong-Deok Choi 1 and Vivek Sarkar 2 and Edith Schonberg 3 IBM T.J. Watson Research Center, P. O. Box 704, Yorktown Heights, NY 10598 ([email protected]) 2 IBM Software Solutions Division, 555 Bailey Avenue, San Jose, CA 95141 ([email protected]) IBM T.J. Watson Research Center, P. O. Box 704, Yorktown Heights, NY 10598 ([email protected]) Abstract. Static single assignment (SSA) form is an intermediate rep- resentation that is well suited for solving many data flow optimization problems. However, since the standard algorithm for building SSA form is exhaustive, maintaining correct SSA form throughout a multi-pass com- pilation process can be expensive. In this paper, we present incremental algorithms for restoring correct SSA form after program transformations. First, we specify incremental SSA algorithms for insertion and deletion of a use/definition of a variable, and for a large class of updates on intervals. We characterize several cases for which the cost of these algorithms will be proportional to the size of the transformed region and hence poten- tially much smaller than the cost of the exhaustive algorithm. Secondly, we specify customized SSA-update algorithms for a set of common loop transformations. These algorithms are highly efficient: the cost depends at worst on the size of the transformed code, and in many cases the cost is independent of the loop body size and depends only on the number of loops. 1 Introduction Static single assignment (SSA) [8, 6] form is a compact intermediate program representation that is well suited to a large number of compiler optimization algorithms, including constant propagation [19], global value numbering [3], and program equivalence detection [21].
  • Compiler-Based Code-Improvement Techniques
    Compiler-Based Code-Improvement Techniques KEITH D. COOPER, KATHRYN S. MCKINLEY, and LINDA TORCZON Since the earliest days of compilation, code quality has been recognized as an important problem [18]. A rich literature has developed around the issue of improving code quality. This paper surveys one part of that literature: code transformations intended to improve the running time of programs on uniprocessor machines. This paper emphasizes transformations intended to improve code quality rather than analysis methods. We describe analytical techniques and specific data-flow problems to the extent that they are necessary to understand the transformations. Other papers provide excellent summaries of the various sub-fields of program analysis. The paper is structured around a simple taxonomy that classifies transformations based on how they change the code. The taxonomy is populated with example transformations drawn from the literature. Each transformation is described at a depth that facilitates broad understanding; detailed references are provided for deeper study of individual transformations. The taxonomy provides the reader with a framework for thinking about code-improving transformations. It also serves as an organizing principle for the paper. Copyright 1998, all rights reserved. You may copy this article for your personal use in Comp 512. Further reproduction or distribution requires written permission from the authors. 1INTRODUCTION This paper presents an overview of compiler-based methods for improving the run-time behavior of programs — often mislabeled code optimization. These techniques have a long history in the literature. For example, Backus makes it quite clear that code quality was a major concern to the implementors of the first Fortran compilers [18].