Code Generation and Local Optimization

Generating assembly

• How do we convert from three-address code to assembly?
• Seems easy! But easy solutions may not be the best option
• What we will cover:
  • Instruction selection
  • Peephole optimizations
  • "Local" common subexpression elimination
  • "Local" register allocation

Naïve approach

• "Macro-expansion": treat each 3AC instruction separately, generate code in isolation

  MUL A, 4, B  →  LD A, R1
                  MOV 4, R2
                  MUL R1, R2, R3
                  ST R3, B

  ADD A, B, C  →  LD A, R1
                  LD B, R2
                  ADD R1, R2, R3
                  ST R3, C

Why is this bad? (I)

• The macro-expansion of MUL A, 4, B above takes four instructions, but the same effect needs only three:

  LD A, R1
  MULI R1, 4, R3
  ST R3, B

• Too many instructions
• Should use a different instruction type (here, an immediate-operand multiply)

Why is this bad? (II)

• Macro-expanding two dependent additions:

  ADD A, B, C  →  LD A, R1
                  LD B, R2
                  ADD R1, R2, R3
                  ST R3, C
  ADD C, A, E  →  LD C, R4
                  LD A, R5
                  ADD R4, R5, R6
                  ST R6, E

• Redundant load of C (its value was just stored from R3)
• Redundant load of A (it is still in R1)
• Uses a lot of registers

Why is this bad? (III)

• Macro-expanding two additions that compute the same expression:

  ADD A, B, C  →  LD A, R1
                  LD B, R2
                  ADD R1, R2, R3
                  ST R3, C
  ADD A, B, D  →  LD A, R4
                  LD B, R5
                  ADD R4, R5, R6
                  ST R6, D

• Wasting instructions recomputing A + B

How do we address this?

• Several techniques improve the performance of generated code:
  • Instruction selection to choose better instructions
  • Peephole optimizations to remove redundant instructions
  • Common subexpression elimination to remove redundant computation
  • Register allocation to reduce the number of registers used

Instruction selection

• Even a simple instruction may have a large set of possible address modes and combinations:

  + A B C

• Each operand can take many forms: literal, register, memory address, indirect, indexed, etc.
• Dozens of potential combinations!

More choices for instructions

• Auto increment/decrement addressing (especially common in embedded processors such as DSPs)
  • e.g., load from this address, then increment it
  • Why is this useful?
• Three-address instructions
• Specialized registers (condition registers, floating-point registers, etc.)
• "Free" addition in indexed mode:

  MOV (R1)offset, R2

  • Why is this useful?
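To make the selection problem concrete, here is a minimal Python sketch of one such choice: emitting the immediate form MULI instead of a MOV followed by MUL whenever a source operand is a literal. The toy mnemonics come from the slides; the function and helper names are my own illustration, not lecture code.

    # Toy instruction selector for one 3AC multiply: if a source operand
    # is a literal, pick the immediate form (MULI) rather than
    # materializing the literal with MOV and using the register form.
    def select_mul(src1, src2, dst, fresh_reg):
        """Emit toy assembly for `dst = src1 * src2` (a 3AC MUL)."""
        code = []

        def to_reg(operand):
            # Variables must be loaded from memory; a literal that we do
            # not fold into the instruction gets moved into a register.
            r = fresh_reg()
            op = "MOV" if isinstance(operand, int) else "LD"
            code.append(f"{op} {operand}, {r}")
            return r

        r1 = to_reg(src1)
        if isinstance(src2, int):
            # Immediate form: one instruction shorter than MOV + MUL.
            r3 = fresh_reg()
            code.append(f"MULI {r1}, {src2}, {r3}")
        else:
            r2 = to_reg(src2)
            r3 = fresh_reg()
            code.append(f"MUL {r1}, {r2}, {r3}")
        code.append(f"ST {r3}, {dst}")
        return code

    regs = (f"R{i}" for i in range(1, 100))
    print("\n".join(select_mul("A", 4, "B", lambda: next(regs))))

Running it on MUL A, 4, B yields the three-instruction sequence from the "Why is this bad? (I)" slide (modulo register numbering) rather than the four-instruction macro-expansion.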
Peephole optimizations

• Simple optimizations that can be performed by pattern matching
• Intuitively, look through a "peephole" at a small segment of code and replace it with something better
• Example: if the code generator sees ST R, X; LD X, R, eliminate the load
• Can recognize sequences of instructions that can be performed by single instructions:

  LDI R1, R2; ADD R1, 4, R1  →  LDINC R1, R2, 4   // load from address in R1, then increment R1 by 4

Peephole optimizations (continued)

• Constant folding:

  ADD lit1, lit2, Rx              →  MOV lit1 + lit2, Rx
  MOV lit1, Rx; ADD lit2, Rx, Ry  →  MOV lit1 + lit2, Ry

• Strength reduction:

  MUL operand, 2, Rx  →  SHIFTL operand, 1, Rx
  DIV operand, 4, Rx  →  SHIFTR operand, 2, Rx

• Null sequences:

  MUL operand, 1, Rx  →  MOV operand, Rx
  ADD operand, 0, Rx  →  MOV operand, Rx

Peephole optimizations (continued)

• Combining operations:

  JEQ L1; JMP L2; L1: ...  →  JNE L2; ...

• Simplifying:

  SUB 0, operand, Rx  →  NEG operand, Rx

• Special cases (taking advantage of ++/--):

  ADD 1, Rx, Rx  →  INC Rx
  SUB Rx, 1, Rx  →  DEC Rx

• Address-mode operations:

  MOV A, R1; ADD 0(R1), R2, R3  →  ADD @A, R2, R3
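Rewrites like these are mechanical enough to state directly as code. The sketch below is my own illustrative encoding, not the lecture's: instructions are tuples such as ("ADD", "R1", 0, "R2"), and a sliding window applies two of the patterns above until no rule fires.

    # Minimal peephole pass over the toy assembly: a window of one or
    # two instructions, rewritten whenever a pattern matches.
    def peephole(code):
        changed = True
        while changed:                  # iterate until no rule fires
            changed = False
            out, i = [], 0
            while i < len(code):
                a = code[i]
                b = code[i + 1] if i + 1 < len(code) else None
                # ST R, X ; LD X, R  ->  ST R, X  (eliminate the load)
                if b and a[0] == "ST" and b[0] == "LD" \
                     and a[1] == b[2] and a[2] == b[1]:
                    out.append(a); i += 2; changed = True
                # Null sequences: ADD x, 0, R or MUL x, 1, R  ->  MOV x, R
                elif a[0] == "ADD" and a[2] == 0:
                    out.append(("MOV", a[1], a[3])); i += 1; changed = True
                elif a[0] == "MUL" and a[2] == 1:
                    out.append(("MOV", a[1], a[3])); i += 1; changed = True
                else:
                    out.append(a); i += 1
            code = out
        return code

    print(peephole([("ST", "R3", "B"), ("LD", "B", "R3"),
                    ("ADD", "R3", 0, "R4")]))
    # [('ST', 'R3', 'B'), ('MOV', 'R3', 'R4')]

A production pass would carry many more rules and check that the rewritten registers are not live elsewhere, but the loop structure is the same.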
Superoptimization

• Peephole optimization/instruction selection writ large
• Given a sequence of instructions, find a different sequence of instructions that performs the same computation in less time
• Huge body of research, pulling in ideas from all across computer science:
  • Theorem proving
  • Machine learning

Common subexpression elimination

• Goal: remove redundant computation – don't calculate the same expression multiple times

  1: A = B * C      Keep the result of statement 1 in a temporary
  2: E = B * C      and reuse it for statement 2

• Difficulty: how do we know when the same expression will produce the same result?

  1: A = B * C          B is "killed." Any expression using B is no
  2: B = <new value>    longer "available," so we cannot reuse the
  3: E = B * C          result of statement 1 for statement 3

• This becomes harder with pointers (how do we know when B is killed?)

Common subexpression elimination (continued)

• Two varieties of common subexpression elimination (CSE):
  • Local: within a single basic block
    • Easier problem to solve (why?)
  • Global: within a single procedure or across the whole program
    • Intra- vs. inter-procedural
    • More powerful, but harder (why?)
• Will come back to these sorts of "global" optimizations later

CSE in practice

• Idea: keep track of which expressions are "available" during the execution of a basic block
  • Which expressions have we already computed?
• Issue: determining when an expression is no longer available
  • This happens when one of its components is assigned to, or "killed"
• Idea: when we see an expression that is already available, rather than generating code, copy the temporary
• Issue: determining when two expressions are the same

Maintaining available expressions

For each 3AC operation in a basic block:

• Create a name for the expression (based on its lexical representation)
• If the name is not in the available-expression set, generate code and add the name to the set
  • Track the register that holds the result, and any variables used to compute the expression
• If the name is in the available-expression set, generate a move from the register holding its value instead
• If the operation assigns to a variable, kill all expressions that depend on that variable
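Here is a runnable sketch of that loop, under encoding assumptions that are mine rather than the lecture's: operations are tuples (op, src1, src2, dst), temporaries are named "T…", every operation is a "+", and assignments to non-temporaries are stored back to memory.

    def is_temp(v):
        return v.startswith("T")

    def local_cse(block):
        avail = {}   # lexical name, e.g. "A+B" -> register holding its value
        deps = {}    # variable -> names of available expressions using it
        loc = {}     # temporary -> register currently holding it
        code, n = [], 0

        def fresh():
            nonlocal n
            n += 1
            return f"R{n}"

        def operand(x):
            return loc.get(x, x)   # temps live in registers, variables in memory

        for op, s1, s2, dst in block:
            name = f"{s1}{op}{s2}"                # name the expression lexically
            if name in avail:
                r = fresh()                       # already computed:
                code.append(f"MOV {avail[name]} {r}")   # copy, don't recompute
            else:
                r = fresh()
                code.append(f"ADD {operand(s1)} {operand(s2)} {r}")
                avail[name] = r
                for v in (s1, s2):
                    deps.setdefault(v, set()).add(name)
            for dead in deps.pop(dst, set()):     # dst is assigned here: kill
                avail.pop(dead, None)             # every expression using it
            if is_temp(dst):
                loc[dst] = r
            else:
                code.append(f"ST {r} {dst}")      # variables go back to memory
        return code

    # The example block from the next slide:
    block = [("+", "A", "B", "T1"), ("+", "T1", "C", "T2"),
             ("+", "A", "B", "T3"), ("+", "T1", "T2", "C"),
             ("+", "T1", "C", "T4"), ("+", "T3", "T2", "D")]
    print("\n".join(local_cse(block)))

Run on the example that follows, this reproduces the slides' generated code (modulo register numbering).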
Example

  Three-address code    Generated code
  + A B T1              ADD A B R1
  + T1 C T2             ADD R1 C R2
  + A B T3              MOV R1 R3
  + T1 T2 C             ADD R1 R2 R5; ST R5 C
  + T1 C T4             ADD R1 C R4
  + T3 T2 D             ADD R3 R2 R6; ST R6 D

The available-expression set evolves as the block is processed: "A+B" after the first operation; "A+B" and "T1+C" after the second; the third finds "A+B" available and emits only a move; the fourth adds "T1+T2", but its store to C kills "T1+C", so the fifth must recompute T1 + C (making "T1+C" available again); the sixth adds "T3+T2".

Downsides

• What are some downsides to this approach? Consider the two highlighted operations, + T1 T2 C and + T3 T2 D:

  + T1 T2 C    ADD R1 R2 R5; ST R5 C
  + T3 T2 D    ADD R3 R2 R6; ST R6 D

• Because expressions are named lexically, "T3+T2" does not match "T1+T2" even though T3 and T1 hold the same value (both are A+B). The final operation recomputes a sum already sitting in R5, when it could simply be:

  + T3 T2 D    ST R5 D

• This can be handled by an optimization called value numbering, which we will not cover now (although we may get to it later)

Aliasing

• One of the biggest problems in compiler analysis is recognizing aliases – different names for the same location in memory
• Aliases can occur for many reasons:
  • Pointers referring to the same location, arrays referencing the same element, function calls passing the same reference in two arguments, explicit storage overlap (unions)
• Upshot: when talking about "live" and "killed" values in optimizations like CSE, we're talking about particular variable names
• In the presence of aliasing, we may not know which variables get killed when a location is written to

Memory disambiguation

Register allocation

• Simple code generation: use a register for each temporary, load from a variable on each read, and store back to the variable on each write
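For contrast with the CSE sketch above, the naïve strategy in that bullet is easy to write down too. This hypothetical sketch uses the same 3AC tuples: it loads on every read, stores on every write, and never reuses a register.

    # Naive code generation baseline: every read loads from memory,
    # every write stores back, and each value gets a fresh register.
    def naive_codegen(block):
        code, n = [], 0
        for op, s1, s2, dst in block:      # (op, src1, src2, dst) 3AC
            r1, r2, r3 = f"R{n+1}", f"R{n+2}", f"R{n+3}"
            n += 3                         # never reuse a register
            code += [f"LD {s1} {r1}",      # load on every read ...
                     f"LD {s2} {r2}",
                     f"ADD {r1} {r2} {r3}",
                     f"ST {r3} {dst}"]     # ... store on every write
        return code

    print("\n".join(naive_codegen([("+", "A", "B", "C"),
                                   ("+", "A", "B", "D")])))

The output makes the cost visible: two adds that share A + B still take eight instructions and six registers, which is exactly what instruction selection, peephole optimization, CSE, and register allocation work together to avoid.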