CSE 501: Implementation of Programming Languages (Course Outline)
Craig Chambers


Course outline

• Models of compilation/analysis
• Standard optimizing transformations
• Basic representations and analyses
• Fancier representations and analyses
• Interprocedural representations, analyses, and transformations
  • for imperative, functional, and OO languages
• Run-time system issues
  • garbage collection
  • compiling dynamic dispatch, first-class functions, ...
• Dynamic (JIT) compilation
• Other program analysis frameworks and tools
  • model checking, constraints, best-effort "bug finders"

Main focus: program analysis and transformation
• how to represent programs?
• how to analyze programs? what to analyze?
• how to transform programs? what transformations to apply?

We study imperative, functional, and object-oriented languages.

Official prerequisites:
• CSE 401 or equivalent
• CSE 505 or equivalent

Reading: Appel's "Modern Compiler Implementation", plus roughly 20 papers from the literature; "Compilers: Principles, Techniques, & Tools" (a.k.a. the Dragon Book) as a reference.

Coursework:
• periodic homework assignments
• major course project
• midterm, final

Why study compilers?

Compilers are the meeting area of programming languages and architectures:
• the capabilities of compilers greatly influence the design of both.

Program representation, analysis, and transformation is widely useful beyond pure compilation:
• software engineering tools
• DB query optimizers, programmable graphics renderers (domain-specific languages and optimizers)
• safety/security checking of code, e.g. in programmable/extensible systems, networks, databases

Cool theoretical aspects, too:
• lattice domains, graph algorithms, computability/complexity

Goals for language implementation

Correctness.

Efficiency:
• of: time, data space, code space
• at: compile-time, run-time

Support expressive, safe language features:
• first-class, higher-order functions
• method dispatching
• exceptions, continuations
• reflection, dynamic code loading
• bounds-checked arrays
• garbage collection
• ...

Support desirable programming environment features:
• fast turnaround
• separate compilation, shared libraries
• source-level debugging
• profiling
• ...

Standard compiler organization

Analysis of the input program (the front-end) feeds synthesis of the output program (the back-end):

    character stream
      → Lexical Analysis → token stream
      → Syntactic Analysis → abstract syntax tree
      → Semantic Analysis → annotated AST
      → Intermediate Code Generation → intermediate form
      → Optimization → intermediate form
      → Code Generation → target language

An interpreter can also execute the annotated AST or the intermediate form directly.

Mixing front-ends and back-ends

Define an intermediate language (e.g. Java bytecode, MSIL, SUIF, WIL, C, C--, ...):
• compile multiple source languages into it (each such compiler may not be much more than a front-end)
• compile to multiple targets from it (each such compiler may not be much more than a back-end)
• or interpret/execute it directly
• or perform other analyses of it

Advantages:
• reuse of front-ends and back-ends
• portable "compiled" code

BUT: designing a portable intermediate language is hard:
• how universal? across input language models? target machine models?
• high-level or low-level?
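As a concrete illustration of what an intermediate form might look like (a sketch for discussion, not code from the course notes; all type and field names here are invented), a three-address-code IR can be modeled with a couple of small record types:

    # Minimal sketch of a three-address-code intermediate representation.
    # Each instruction has an opcode, an optional destination, and up to
    # two operands; a procedure would be a list of basic blocks.
    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Instr:
        op: str                      # e.g. "add", "shl", "load", "store", "br"
        dst: Optional[str] = None    # destination temporary/variable, if any
        src1: Optional[str] = None   # first operand (variable name or constant)
        src2: Optional[str] = None   # second operand, if any

    @dataclass
    class BasicBlock:
        label: str
        instrs: List[Instr] = field(default_factory=list)
        succs: List[str] = field(default_factory=list)   # labels of successor blocks

    # x := a + b ; y := x << 2, written as three-address code
    block = BasicBlock("entry", [
        Instr("add", "x", "a", "b"),
        Instr("shl", "y", "x", "2"),
    ])

Representations like this (and richer ones such as SSA form) are exactly what the analyses below operate on.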
Key questions

• How are programs represented in the compiler?
• How are analyses organized/structured?
  • over what region of the program are analyses performed?
  • what analysis algorithms are used?
• What kinds of optimizations can be performed?
  • which are profitable in practice?
  • how should analyses/optimizations be sequenced/combined?
• How best to compile in the face of:
  • pointers, arrays
  • first-class functions
  • inheritance & message passing
  • parallel target machines
• Other issues:
  • speeding compilation
  • making compilers portable, table-driven
  • supporting tools like debuggers, profilers, garbage collectors

Overview of optimizations

First analyze the program to learn things about it, then transform it based on that information; repeat.

Requirement: don't change the semantics!
• transform the input program into a semantically equivalent but better output program

Analysis determines when transformations are:
• legal
• profitable

Caveat: "optimize" is a misnomer:
• the result is almost never optimal
• some programs are sometimes slowed down on some inputs (although the hope is to speed up most programs on most inputs)

Semantics

Exactly what are the semantics to be preserved? Subtleties include:
• evaluation order
• arithmetic properties like associativity and commutativity
• behavior in "error" cases

Some languages are very precise:
• programmers always know what they're getting.

Others are weaker:
• they allow better performance (but how much?)

Semantics selected by compiler option?

Scope of analysis

Peephole: across a small number of "adjacent" instructions (adjacent in space or time)
• trivial analysis

Local: within a basic block
• simple, fast analysis

Intraprocedural (a.k.a. global): across basic blocks, within a procedure
• analysis more complex: branches, merges, loops

Interprocedural: across procedures, within a whole program
• analysis even more complex: calls, returns
• hard with separate compilation

Whole-program: the analysis examines the whole program in order to prove safety.

A tour of common optimizations/transformations

Arithmetic simplifications:
• constant folding
    x := 3 + 4    ⇒    x := 7
• strength reduction
    x := y * 4    ⇒    x := y << 2

Constant propagation:
    x := 5             x := 5            x := 5
    y := x + 2    ⇒    y := 5 + 2   ⇒    y := 7

Integer range analysis:
• fold comparisons based on known ranges, and eliminate unreachable code; here the bounds check can be removed because index < 10 throughout the loop:
    for(index = 0; index < 10; index++) {
      if index >= 10 goto _error
      a[index] := 0
    }
• more generally, symbolic assertion analysis

Common subexpression elimination (CSE):
    x := a + b         x := a + b
    ...           ⇒    ...
    y := a + b         y := x
• can also eliminate redundant memory references and branch tests

Partial redundancy elimination (PRE):
• like CSE, but the earlier expression is only available along a subset of the possible paths
    if ... then              if ... then
      x := a + b               t := a + b; x := t
    end                 ⇒    else
    ...                        t := a + b
    y := a + b               end
                             ...
                             y := t

Copy propagation:
    x := y             x := y
    w := w + x    ⇒    w := w + y

Dead (unused) assignment elimination:
    x := y ** z
    ... // no use of x        ⇒    (the assignment to x can be removed)
• a common clean-up after other optimizations:
    x := y                  x := y
    w := w + x         ⇒    w := w + y    ⇒    w := w + y
    ... // no use of x

Partial dead assignment elimination:
• like DAE, except the assignment is only used on some later paths

Dead (unreachable) code elimination:
    if false goto _else
    ...
    goto _done
    _else:
    ...
    _done:
• another common clean-up after other optimizations

Pointer/alias analysis:
    p := &x            p := &x           p := &x
    *p := 5       ⇒    *p := 5      ⇒    *p := 5
    y := x + 1         y := 5 + 1        y := 6

    x := 5
    *p := 3       ⇒    y := 6 ???   (legal only if *p cannot refer to x)
    y := x + 1
• augments lots of other optimizations/analyses
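To make the flavor of these scalar optimizations concrete before moving on to loop transformations, here is a minimal sketch (not from the course notes) of local CSE within one basic block; the (dst, op, src1, src2) tuple encoding, and the restriction to pure arithmetic statements, are assumptions made for illustration:

    # Minimal sketch of local common subexpression elimination.
    # Assumptions: each statement is a tuple (dst, op, src1, src2); operands
    # are variable names or constant strings; there are no memory operations
    # or calls, so only assignments to a variable invalidate expressions.
    def local_cse(block):
        available = {}   # (op, src1, src2) -> variable currently holding that value
        out = []
        for dst, op, src1, src2 in block:
            key = (op, src1, src2)
            if key in available:
                # reuse the previously computed value instead of recomputing
                out.append((dst, "copy", available[key], None))
                computed = False
            else:
                out.append((dst, op, src1, src2))
                computed = True
            # dst has been redefined: kill every recorded expression that
            # reads dst or whose value was being held in dst
            available = {k: v for k, v in available.items()
                         if v != dst and dst not in k[1:]}
            if computed and dst not in (src1, src2):
                available[key] = dst
        return out

    # x := a + b ; y := a + b   becomes   x := a + b ; y := copy x
    print(local_cse([("x", "add", "a", "b"),
                     ("y", "add", "a", "b")]))

A production pass would typically use value numbering, handle loads, stores, and calls, and work across basic blocks (global CSE or PRE); this sketch deliberately ignores all of that.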
Loop-invariant code motion:
    for j := 1 to 10               for j := 1 to 10
      for i := 1 to 10        ⇒      t := b[j]
        a[i] := a[i] + b[j]          for i := 1 to 10
                                       a[i] := a[i] + t

Induction variable elimination:
    for i := 1 to 10          ⇒    for p := &a[1] to &a[10]
      a[i] := a[i] + 1               *p := *p + 1
• a[i] takes several instructions to compute; *p takes one

Loop unrolling:
    for i := 1 to N           ⇒    for i := 1 to N by 4
      a[i] := a[i] + 1               a[i]   := a[i]   + 1
                                     a[i+1] := a[i+1] + 1
                                     a[i+2] := a[i+2] + 1
                                     a[i+3] := a[i+3] + 1

Parallelization:
    for i := 1 to 1000        ⇒    forall i := 1 to 1000
      a[i] := a[i] + 1               a[i] := a[i] + 1

Loop interchange, skewing, reversal, ...

Blocking/tiling:
• restructuring loops for better data cache locality
    for i := 1 to 1000                    for i := 1 to 1000 by TILESIZE
      for j := 1 to 1000                    for j := 1 to 1000 by TILESIZE
        for k := 1 to 1000          ⇒         for k := 1 to 1000
          c[i,j] += a[i,k] * b[k,j]             for i' := i to i+TILESIZE
                                                  for j' := j to j+TILESIZE
                                                    c[i',j'] += a[i',k] * b[k,j']

Inlining (here followed by constant propagation and strength reduction):
    l := ...              l := ...           l := ...
    w := 4           ⇒    w := 4        ⇒    w := 4
    a := area(l,w)        a := l * w         a := l << 2
• lots of "silly" optimizations become important after inlining

Interprocedural constant propagation, alias analysis, etc.

Static binding of dynamic calls:
• in imperative languages, for a call through a function pointer: if we can compute the unique target of the pointer, we can replace it with a direct call
• in functional languages, for a call of a computed function: if we can compute the unique value of the function expression, we can replace it with a direct call
• in OO languages, for a dynamically dispatched message: if we can deduce the class of the receiver, we can replace it with a direct call
• other optimizations are possible even if there are several possible targets

Procedure specialization.

Register allocation.

Instruction selection:
    p1 := p + 4      ⇒    ld %g3, [%g1 + 4]
    x := *p1
• particularly important on CISCs

Instruction scheduling:
    ld  %g2, [%g1 + 0]         ld  %g2, [%g1 + 0]
    add %g3, %g2, 1       ⇒    ld  %g5, [%g1 + 4]
    ld  %g2, [%g1 + 4]         add %g3, %g2, 1
    add %g4, %g2, 1            add %g4, %g5, 1
• particularly important with instructions that have delayed results, and on wide-issue machines
• vs. dynamically scheduled machines?

Optimization themes

• Don't compute it if you don't have to
  • dead assignment elimination
• Compute it at compile-time if you can
  • constant folding, loop unrolling, inlining
• Compute it as few times as possible ...

The phase ordering problem

Typically we want to perform a number of optimizations; in what order should the transformations be performed?
• some optimizations create opportunities for other optimizations ⇒ order optimizations using this dependence
• some optimizations are simplified if they can assume another optimization will run later and "clean up"
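The interplay that drives phase ordering can be shown with a toy pipeline (an invented sketch, not the course's code): constant propagation/folding turns expressions into constants, which makes earlier assignments dead, so the two passes are iterated until nothing changes. Statements use the same hypothetical (dst, op, src1, src2) tuple form as above, and only non-negative integer constants and "add"/"const" ops are handled.

    # Toy illustration of phase ordering: constant propagation/folding
    # followed by dead-assignment elimination, repeated to a fixed point.
    def const_prop_fold(block):
        consts, out = {}, []
        for dst, op, src1, src2 in block:
            s1 = consts.get(src1, src1)          # substitute known constants
            s2 = consts.get(src2, src2)
            if op == "add" and str(s1).isdigit() and str(s2).isdigit():
                op, s1, s2 = "const", str(int(s1) + int(s2)), None   # fold
            if op == "const":
                consts[dst] = s1
            else:
                consts.pop(dst, None)            # dst no longer a known constant
            out.append((dst, op, s1, s2))
        return out

    def dead_assign_elim(block, live_out):
        # Backward pass over a single block: keep only assignments whose
        # destination is still needed (toy model; no side effects handled).
        live, out = set(live_out), []
        for dst, op, src1, src2 in reversed(block):
            if dst in live:
                out.append((dst, op, src1, src2))
                live.discard(dst)
                live.update(s for s in (src1, src2)
                            if s is not None and not str(s).isdigit())
        out.reverse()
        return out

    def optimize(block, live_out=("y",)):
        while True:
            new = dead_assign_elim(const_prop_fold(block), live_out)
            if new == block:
                return block
            block = new

    # x := 5 ; t := x + 2 ; y := t + 1   collapses to   y := 8
    print(optimize([("x", "const", "5", None),
                    ("t", "add", "x", "2"),
                    ("y", "add", "t", "1")]))

Running the example reduces the three statements to the single assignment y := 8: folding exposes dead assignments, elimination removes them, and a second round confirms the fixed point, which is exactly the create-opportunities dependence the notes describe.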