Optimization and Parallelization of Sequential Programs

TDDC86 Compiler Optimizations and Code Generation / TDDD56 Multicore and GPU Programming, Lecture 7
Christoph Kessler, IDA / PELAB, Linköping University, Sweden, 2011

Outline
Towards (semi-)automatic parallelization of sequential programs:
- Data dependence analysis for loops
- Some loop transformations: loop invariant code hoisting, loop unrolling, loop fusion, loop interchange, loop blocking and tiling
- Static loop parallelization
- Run-time loop parallelization: doacross parallelization, inspector-executor method, speculative parallelization (as time permits)
- Auto-tuning (later, if time)

Foundations: Control and Data Dependence
- Consider statements S, T in a sequential program (S = T is possible).
- The scope of analysis is typically a function, i.e. intra-procedural analysis.
- The analysis can be done at arbitrary granularity (instructions, operations, statements, compound statements, program regions).
- Relevant are only the read and write effects on memory (i.e. on program variables) of each operation, and its effect on control flow.

Control dependence S → T:
- Whether T is executed may depend on S (e.g. on a condition evaluated in S), as in
      S: if (...) {
            ...
      T:    ...
            ...
         }
- Implies that the relative execution order S → T must be preserved when restructuring the program.
- Mostly obvious from the nesting structure in well-structured programs, but trickier in arbitrary branching code (e.g. assembler code).

Data dependence S → T:
- S may execute (dynamically) before T (i.e. a control flow path S ... T is possible), both may access the same memory location, and at least one of these accesses is a write. Example (a flow dependence):
      S: z = ... ;
         ...
      T: ... = ..z.. ;
- Means that the execution order "S before T" must be preserved when restructuring the program.
- In general, only a conservative over-estimation can be determined statically.
- Flow dependence (RAW, read-after-write): S may write a location z that T may read.
- Anti dependence (WAR, write-after-read): S may read a location x that T may overwrite.
- Output dependence (WAW, write-after-write): both S and T may write the same location.

Dependence Graph
- A (data, control, program) dependence graph is a directed graph consisting of all statements as vertices and all (data, control, any) dependences as edges.

Data Dependence Graph
- The data dependence graph of straight-line code (a "basic block", no branching) is always acyclic, because the relative execution order of statements is forward only.
- Data dependence graph for a loop: there is a dependence edge S → T if a dependence may exist for some pair of instances (iterations) of S and T. Cycles are possible.
- Loop-independent versus loop-carried dependences.
- Example (shown on the slide): assuming we know statically that arrays a and b do not intersect.
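To make the distinction concrete, consider the following small C loop. It is a hypothetical illustration, not taken from the slides; S1 and S2 are just statement labels used in the comments, and the arrays are assumed not to overlap.

      void f(float *a, float *b, int n)   /* assume a and b do not overlap */
      {
          for (int i = 1; i < n; i++) {
              a[i] = b[i-1] + 1.0f;   /* S1: reads b[i-1], written by S2 in iteration i-1 */
              b[i] = a[i] * 2.0f;     /* S2: reads a[i], written by S1 in the same iteration */
          }
      }

The flow dependence S1 → S2 on a[i] is loop-independent (source and sink lie in the same iteration), while the flow dependence S2 → S1 on b[i-1] is loop-carried with distance 1. It is the loop-carried edge that creates a cycle in the data dependence graph and prevents naive parallelization of the loop.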
Example
- (Assuming that we statically know that arrays A, X, Y, Z do not intersect; otherwise there might be further dependences.)
- The slide shows the example loop with its iterations unrolled and the resulting data dependence graph over statements S1 and S2.

Why Loop Optimization and Parallelization?
Loops are a promising target for program optimizations, including automatic parallelization:
- High execution frequency: most computation is done in (inner) loops, so even small optimizations can have a large impact (cf. Amdahl's Law).
- Regular, repetitive behavior: compact description, relatively simple to analyze statically.
- Well researched.

Loop Optimizations - General Issues
- Move loop-invariant computations out of loops.
- Modify the order of iterations or parts thereof.
- Goals: faster execution, improved data access locality, reduced loop control overhead, enhanced possibilities for loop parallelization or vectorization.
- Only transformations that preserve the program semantics (its input/output behavior) are admissible.
- Conservative (static) criterion: preserve data dependences.
- This requires data dependence analysis for loops (see TDDC86).

Data Dependence Analysis: A More Formal Introduction

Data Dependence Analysis - Overview
- A precedence relation between statements.
- Important for loop optimizations, vectorization and parallelization, instruction scheduling, and data cache optimizations.
- Conservative approximations to the disjointness of pairs of memory accesses: weaker than data-flow analysis, but generalizes nicely to the level of individual array elements.
- Loops and loop nests: the iteration space.
- Array subscripts in loops: the index space.
- Dependence testing methods.
- Data dependence graph, data + control dependence graph, program dependence graph.

Control and Data Dependence; Data Dependence Graph / Dependence Graph (detail slides; figures only in this extract).

Remark: Target-Level Dependence Graphs
- For VLIR / target code, the dependence edges may be labeled with the latency or delay of the operation.
- See the lecture on instruction scheduling for details.

Loop Normalization, Loop Iteration Space, Dependence Distance and Direction, Example, Dependence Equation System, Linear Diophantine Equations, Dependence Testing 1: GCD-Test, For Multidimensional Arrays?, Survey of Dependence Tests (detail slides; figures and formulas only in this extract).
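Since the GCD-test slides themselves are not reproduced here, the following is a minimal sketch of the idea in C, using the usual textbook formulation: for a write to X[a1*i + b1] and a read of X[a2*i + b2] inside a loop, a dependence requires an integer solution of the linear Diophantine equation a1*i1 - a2*i2 = b2 - b1, which exists iff gcd(a1, a2) divides b2 - b1. Loop bounds are ignored, so the test is conservative; the function names are illustrative, not from the slides.

      #include <stdlib.h>

      /* Greatest common divisor of |a| and |b|. */
      static int gcd(int a, int b) {
          a = abs(a); b = abs(b);
          while (b != 0) { int t = a % b; a = b; b = t; }
          return a;
      }

      /* GCD test for one-dimensional subscripts:
         write X[a1*i + b1] in some iteration i1, read X[a2*i + b2] in some iteration i2.
         Returns 1 ("dependence possible") unless the test can disprove it. */
      static int gcd_test_may_depend(int a1, int b1, int a2, int b2) {
          int g = gcd(a1, a2);
          if (g == 0)                  /* both subscripts are constants */
              return b1 == b2;
          return (b2 - b1) % g == 0;
      }

For example, with a write to X[2*i] and a read of X[2*i + 1], gcd(2, 2) = 2 does not divide 1, so the two accesses can never touch the same element and no dependence exists; with X[2*i] and X[4*i + 2] the test cannot rule a dependence out and conservatively reports one.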
Loop Transformations and Parallelization

Some important loop transformations
- Loop normalization
- Loop parallelization
- Loop invariant code hoisting
- Loop interchange
- Loop fusion vs. loop distribution / fission
- Strip-mining / loop tiling / blocking vs. loop linearization
- Loop unrolling, unroll-and-jam
- Loop peeling
- Index set splitting, loop unswitching
- Scalar replacement, scalar expansion
- Later: software pipelining
- More: cycle shrinking, loop skewing, ...

Loop Invariant Code Hoisting
- Move loop-invariant code out of the loop.
- Compilers can do this automatically if they can statically find out what code is loop invariant.
- Example:
      for (i=0; i<10; i++)
          a[i] = b[i] + c / d;
  becomes
      tmp = c / d;
      for (i=0; i<10; i++)
          a[i] = b[i] + tmp;

Loop Unrolling
- Can be enforced with compiler options, e.g. -funroll=2.
- Example, unrolled by 2:
      for (i=0; i<50; i++)
          a[i] = b[i];
  becomes
      for (i=0; i<50; i+=2) {
          a[i]   = b[i];
          a[i+1] = b[i+1];
      }
- Reduces loop overhead (total number of comparisons, branches, and increments).
- The longer loop body may enable further local optimizations (e.g. common subexpression elimination, register allocation, instruction scheduling, use of SIMD instructions).
- Drawback: longer code.
- Exercise: formulate the unrolling rule for a statically unknown upper loop limit.

Loop Interchange (1)
- For properly nested loops (statements in the innermost loop body only).
- Example 1:
      for (j=0; j<M; j++)
          for (i=0; i<N; i++)
              a[i][j] = 0.0;
  becomes, after interchange,
      for (i=0; i<N; i++)
          for (j=0; j<M; j++)
              a[i][j] = 0.0;
- With the row-wise storage of 2D arrays in C and Java, the new iteration order traverses the array row by row instead of column by column.
- Can improve data access locality in the memory hierarchy (fewer cache misses / page faults).

Foundations: Loop-Carried Data Dependences
- Recall: there is a data dependence S → T if operation S may execute (dynamically) before operation T, both may access the same memory location, and at least one of these accesses is a write, as in
      S: z = ... ;
         ...
      T: ... = ..z.. ;
- In general, only a conservative over-estimation can be determined statically.
- A data dependence S → T is called loop-carried if ...

Loop Interchange (2)
- Be careful with loop-carried data dependences!
- Example 2:
      for (j=1; j<M; j++)
          for (i=0; i<N; i++)
              ...
  versus
      for (i=0; i<N; i++)
          for (j=1; j<M; j++)
              ...
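The body of Example 2 is not preserved above, so here is a hypothetical loop nest (not from the slides) showing the kind of loop-carried dependence that makes interchange illegal:

      /* Iteration (i, j) reads a[i-1][j+1], which iteration (i-1, j+1) wrote:
         a loop-carried flow dependence with distance vector (1, -1). */
      for (i = 1; i < N; i++)
          for (j = 0; j < M-1; j++)
              a[i][j] = a[i-1][j+1] + 1.0;

In the original order the source iteration (i-1, j+1) runs before the sink (i, j); after interchanging the loops it would run after it, because its j-value is larger and j is now the outer index. The distance vector would become lexicographically negative, the flow dependence would be violated, and interchange is therefore not admissible here. In Example 1 above, by contrast, there are no loop-carried dependences at all, so any iteration order is valid and interchange is safe.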