Survey on Instruction Selection: An Extensive and Modern Literature Review


Survey on Instruction Selection: An Extensive and Modern Literature Review

Gabriel S. Hjort Blindell
October 4, 2013
School of Information and Communication Technology
KTH Royal Institute of Technology, Stockholm, Sweden

Author: Hjort Blindell, Gabriel S. <[email protected]>
Title: Survey on Instruction Selection
Subtitle: An Extensive and Modern Literature Review
Identification data: TRITA-ICT/ECS R 13:17, ISRN KTH/ICT/ECS/R-13/17-SE, ISSN 1653-6363, ISBN 978-91-7501-898-0
Doc. build version: 2013-10-04 19:32
Research Unit: Software and Computer Systems, School of Information and Communication Technology, KTH Royal Institute of Technology, Forum 105, 164 40 Kista, Sweden

Copyright © 2013 – All rights reserved. This work is licensed under the Creative Commons Attribution (CC BY 3.0) license. To access the full license, please visit http://creativecommons.org/licenses/by/3.0/legalcode.

This document was typeset in LaTeX, using Garamond, Linux Libertine, and Bera Mono as fonts. All figures were produced, either partly or entirely, with the TikZ package.

Abstract*

Instruction selection is one of three optimization problems – the other two are instruction scheduling and register allocation – involved in code generation. The task of the instruction selector is to transform an input program from its target-independent representation into a target-specific form by making best use of the available machine instructions. Hence instruction selection is a crucial component of generating code that is both correct and runs efficiently on a specific target machine. Despite ongoing research since the late 1960s, the last comprehensive survey on this field was written more than 30 years ago. As many new approaches and techniques have appeared since its publication, there is a need for an up-to-date review of the current body of literature; this report addresses that need by presenting an extensive survey and categorization of both dated methods and the state of the art of instruction selection. The report thereby supersedes and extends the previous surveys, and attempts to identify where future research could be directed.

*This research was funded by the Swedish Research Council (VR) (grant 621-2011-6229). I also wish to thank Roberto Castañeda Lozano, Mats Carlsson, and Karl Johansson for their helpful feedback, and SICS (The Swedish Institute of Computer Science) for letting me use their computer equipment to compile this report.

Contents

Abstract iii
Contents iv
List of Figures vii
List of Abbreviations ix
1 Introduction 1
  1.1 What is instruction selection? 2
    1.1.1 Inside the compiler 2
    1.1.2 The interface of the target machine 3
  1.2 Problems of comparing different methods 4
    1.2.1 Machine instruction characteristics 4
    1.2.2 Optimal instruction selection 6
  1.3 The prehistory of instruction selection 7
2 Macro Expansion 8
  2.1 The principle 8
  2.2 Naïve macro expansion 9
    2.2.1 Early approaches 9
    2.2.2 Generating the macros automatically from a machine description 11
    2.2.3 More efficient compilation with tables 13
    2.2.4 Falling out of fashion 14
  2.3 Improving code quality with peephole optimization 15
    2.3.1 Peephole optimization 15
    2.3.2 Combining naïve macro expansion with peephole optimization 17
    2.3.3 Running peephole optimization before instruction selection 20
  2.4 Summary 21
3 Tree Covering 22
  3.1 The principle 22
  3.2 First techniques to use tree pattern matching 23
  3.3 Covering trees bottom-up with LR parsing 25
    3.3.1 The Glanville-Graham approach 26
    3.3.2 Extending grammars with semantic handling 32
    3.3.3 Maintaining multiple parse trees for better code quality 34
  3.4 Covering trees top-down with recursion 34
    3.4.1 First approaches 35
  3.5 A note on tree rewriting and tree covering 37
  3.6 Separating pattern matching and pattern selection 37
    3.6.1 Algorithms for linear-time tree pattern matching 38
    3.6.2 Optimal pattern selection with dynamic programming 43
    3.6.3 Faster pattern selection with offline cost analysis 47
    3.6.4 Lazy state table generation 53
  3.7 Other techniques for covering trees 53
    3.7.1 Approaches based on formal frameworks 54
    3.7.2 Tree automata and series transducers 55
    3.7.3 Rewriting strategies 55
    3.7.4 Genetic algorithms 55
    3.7.5 Trellis diagrams 56
  3.8 Summary 57
    3.8.1 Restrictions that come with trees 58
4 DAG Covering 59
  4.1 The principle 59
  4.2 Optimal pattern selection on DAGs is NP-complete 59
    4.2.1 The proof 60
  4.3 Covering DAGs with tree patterns 62
    4.3.1 Undagging the expression DAG into a tree 62
    4.3.2 Extending dynamic programming to DAGs 64
    4.3.3 Matching tree patterns directly on DAGs 65
    4.3.4 Selecting patterns with integer programming 66
    4.3.5 Modeling entire processors as DAGs 67
  4.4 Covering DAGs with DAG patterns 68
    4.4.1 Splitting DAG patterns into trees 68
  4.5 Reducing pattern selection to a MIS problem 75
    4.5.1 Approaches using MIS and MWIS 76
  4.6 Other DAG-based techniques 76
    4.6.1 Extending means-end analysis to DAGs 76
    4.6.2 DAG-like trellis diagrams 77
  4.7 Summary 77
5 Graph-Based Approaches 79
  5.1 The principle 79
  5.2 Pattern matching using subgraph isomorphism 80
    5.2.1 Ullmann's algorithm 80
    5.2.2 The VF2 algorithm 81
  5.3 Graph-based pattern selection techniques 81
    5.3.1 Unate and binate covering 81
    5.3.2 CP-based techniques 83
    5.3.3 PBQP-based techniques 84
  5.4 Approaches based on processor modeling 90
    5.4.1 Chess 90
    5.4.2 Simulated annealing 91
  5.5 Summary 91
6 Conclusions 92
  6.1 Is instruction selection a closed problem? 92
  6.2 Where to go from here? 93
7 Bibliography 94
A List of Approaches 116
B Publication Diagram 119
C Graph Definitions 120
D Taxonomy 122
  D.1 Common terms 122
  D.2 Machine instruction characteristics 123
  D.3 Scope 124
  D.4 Principles 125
E Document Revisions 126
Index 127

List of Figures

1.1 Overview of a common compiler infrastructure 2
2.1 Language transfer example using Simcmp 9
2.2 IR expression represented as a tree 10
2.3 Binary addition macro in ICL 10
2.4 MIML example 11
2.5 Partial specification of IBM-360 in OMML 12
2.6 Overview of the Davidson-Fraser approach 17
2.7 An extension of the Davidson-Fraser approach 19
2.8 Machine instruction expressed in λ-RTL 19
3.1 Machine description sample for PCC 24
3.2 Straight-forward pattern matching with O(nm) time complexity 25
3.3 Example of an instruction set grammar 29
3.4 Generated state table 29
3.5 Execution walk-through of the Glanville-Graham approach 30
3.6 An instruction set expressed as an affix grammar 33
3.7 String matching state machine 39
3.8 Hoffmann-O'Donnell's bottom-up algorithm for labeling expression trees 39
3.9 Tree pattern matching using Hoffmann-O'Donnell 41
3.10 Preprocessing algorithms of Hoffmann-O'Donnell 41
3.11 Compression of lookup tables 43
3.12 Rule samples from Twig 44
3.13 Computing the costs with dynamic programming 45
3.14 DP-driven instruction selection algorithm with optimal pattern selection 45
3.15 Table-driven algorithm for performing optimal pattern selection and code emission 48
3.16 BURS example 49
3.17 Incorporating costs into states 50
3.18 Trellis diagram example 57
4.1 Reducing SAT to DAG covering 61
4.2 Undagging an expression DAG with a common subexpression 63
4.3 Sharing nonterminals of tree covering on a DAG 64
4.4 CO graph of a simple processor 67
4.5 Splitting DAG pattern at output nodes into multiple tree patterns 74
4.6 Conflict graph example 75
5.1 Time complexities of pattern matching and optimal pattern selection 80
5.2 Unate covering example 82
5.3 Example code which would benefit from global instruction selection 86
5.4 Same as Figure 5.3 but in SSA form 86
B.1 Publication timeline diagram 119
C.1 Example of two simple, directed graphs 120
D.1 Basic block example 124

List of Abbreviations

AI artificial intelligence
APT abstract parse tree (same as AST)
ASIP application-specific instruction-set processor
AST abstract syntax tree
BURS bottom-up rewriting system
CDFG control and data flow graph
CISC complex instruction set computing
CNF conjunctive normal form
CP constraint programming
DAG directed acyclic graph
DP dynamic programming
DSP digital signal processor
GA genetic algorithm
IP integer programming
IR intermediate representation
ISA instruction set architecture
ISE instruction set extension
JIT just-in-time
LL parsing left-to-right, left-most derivation parsing
LR parsing left-to-right, right-most derivation parsing
LALR parsing look-ahead LR parsing
MIS maximum independent set
MWIS maximum weighted independent set
PBQP partitioned Boolean quadratic problem
PO peephole optimisation
QAP quadratic assignment problem
RTL register transfer list
SAT problem satisfiability problem
SIMD single-instruction, multiple-data
SLR parsing simple LR parsing
SSA static single assignment
VLIW very long instruction word
VLSI very-large-scale integration

1 Introduction

Every undertaking of implementing any piece of software – be it a GUI-based Windows application written in C, a high-performance Fortran program for mathematical computations, or a smartphone app written in Java – necessitates a compiler in one form or another.
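
As a concrete, deliberately trivial illustration of what an instruction selector does – here in the style of the naïve macro expansion surveyed in Chapter 2 – consider the following Python sketch. The IR tuples and target mnemonics are invented for illustration and do not come from the survey:

    # Minimal sketch of instruction selection by naive macro expansion
    # (the principle surveyed in Chapter 2). IR ops and target mnemonics
    # are hypothetical.
    IR = [
        ("add", "t1", "a", "b"),   # t1 = a + b
        ("mul", "t2", "t1", "c"),  # t2 = t1 * c
    ]

    # One macro per IR opcode, expanding to target-specific assembly.
    MACROS = {
        "add": lambda d, x, y: [f"ADD {d}, {x}, {y}"],
        "mul": lambda d, x, y: [f"MUL {d}, {x}, {y}"],
    }

    def select(ir):
        asm = []
        for op, dst, x, y in ir:
            asm.extend(MACROS[op](dst, x, y))  # each operation expanded in isolation
        return asm

    print("\n".join(select(IR)))   # ADD t1, a, b / MUL t2, t1, c

Because each operation is expanded in isolation, no combination of operations into richer machine instructions ever happens – which is exactly the weakness that the tree-, DAG-, and graph-covering approaches in Chapters 3 to 5 address.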
Recommended publications
  • Resourceable, Retargetable, Modular Instruction Selection Using a Machine-Independent, Type-Based Tiling of Low-Level Intermediate Code
    Reprinted from Proceedings of the 2011 ACM Symposium on Principles of Programming Languages (POPL'11)

    Resourceable, Retargetable, Modular Instruction Selection Using a Machine-Independent, Type-Based Tiling of Low-Level Intermediate Code
    Norman Ramsey and João Dias, Department of Computer Science, Tufts University
    [email protected] [email protected]

    Abstract: We present a novel variation on the standard technique of selecting instructions by tiling an intermediate-code tree. Typical compilers use a different set of tiles for every target machine. By analyzing a formal model of machine-level computation, we have developed a single set of tiles that is machine-independent while retaining the expressive power of machine code. Using this tileset, we reduce the number of tilers required from one per machine to one per architectural family (e.g., register architecture or stack architecture). Because the tiler is the part of the instruction selector that is most difficult to reason about, our technique makes it possible to retarget an instruction selector with significantly less effort than standard techniques.

    Our compiler infrastructure is built on C--, an abstraction that encapsulates an optimizing code generator so it can be reused with multiple source languages and multiple target machines (Peyton Jones, Ramsey, and Reig 1999; Ramsey and Peyton Jones 2000). C-- accommodates multiple source languages by providing two main interfaces: the C-- language is a machine-independent, language-independent target language for front ends; the C-- run-time interface is an API which gives the run-time system access to the states of suspended computations. C-- is not a universal intermediate language (Conway 1958) or a "write-once, run-anywhere" intermediate language encapsulating a rigidly defined compiler and run-time system (Lindholm and
  • The LLVM Instruction Set and Compilation Strategy
    The LLVM Instruction Set and Compilation Strategy
    Chris Lattner and Vikram Adve, University of Illinois at Urbana-Champaign
    {lattner,vadve}@cs.uiuc.edu

    Abstract: This document introduces the LLVM compiler infrastructure and instruction set, a simple approach that enables sophisticated code transformations at link time, runtime, and in the field. It is a pragmatic approach to compilation, interfering with programmers and tools as little as possible, while still retaining extensive high-level information from source-level compilers for later stages of an application's lifetime. We describe the LLVM instruction set, the design of the LLVM system, and some of its key components.

    1 Introduction
    Modern programming languages and software practices aim to support more reliable, flexible, and powerful software applications, increase programmer productivity, and provide higher level semantic information to the compiler. Unfortunately, traditional approaches to compilation either fail to extract sufficient performance from the program (by not using interprocedural analysis or profile information) or interfere with the build process substantially (by requiring build scripts to be modified for either profiling or interprocedural optimization). Furthermore, they do not support optimization either at runtime or after an application has been installed at an end-user's site, when the most relevant information about actual usage patterns would be available. The LLVM Compilation Strategy is designed to enable effective multi-stage optimization (at compile-time, link-time, runtime, and offline) and more effective profile-driven optimization, and to do so without changes to the traditional build process or programmer intervention. LLVM (Low Level Virtual Machine) is a compilation strategy that uses a low-level virtual instruction set with rich type information as a common code representation for all phases of compilation.
  • 9. Optimization
    9. Optimization
    Marcus Denker

    Roadmap: Introduction > Optimizations in the Back-end > The Optimizer > SSA Optimizations > Advanced Optimizations

    Optimization: The Idea
    > Transform the program to improve efficiency
    > Performance: faster execution
    > Size: smaller executable, smaller memory footprint
    > Tradeoffs: 1) Performance vs. Size, 2) Compilation speed and memory

    No Magic Bullet!
    > There is no perfect optimizer.
    > Example: optimize for simplicity. Let Opt(P) be the smallest program equivalent to P, and let Q be a program with no output that does not stop. What is Opt(Q)? It is "L1 goto L1" – but recognizing this is the halting problem!

    Another way to look at it...
    > Rice (1953): For every compiler there is a modified compiler that generates shorter code.
    > Proof: Assume there is a compiler U that generates the shortest optimized program Opt(P) for all P.
    — Assume P to be a program that does not stop and has no output
    — Opt(P) will be L1 goto L1
    — Halting problem. Thus: U does not exist.
    > There will
  • Value Numbering
    SOFTWARE—PRACTICE AND EXPERIENCE

    Value Numbering
    PRESTON BRIGGS, Tera Computer Company, 2815 Eastlake Avenue East, Seattle, WA 98102
    KEITH D. COOPER and L. TAYLOR SIMPSON, Rice University, 6100 Main Street, Mail Stop 41, Houston, TX 77005

    SUMMARY: Value numbering is a compiler-based program analysis method that allows redundant computations to be removed. This paper compares hash-based approaches derived from the classic local algorithm [1] with partitioning approaches based on the work of Alpern, Wegman, and Zadeck [2]. Historically, the hash-based algorithm has been applied to single basic blocks or extended basic blocks. We have improved the technique to operate over the routine's dominator tree. The partitioning approach partitions the values in the routine into congruence classes and removes computations when one congruent value dominates another. We have extended this technique to remove computations that define a value in the set of available expressions (AVAIL) [3]. Also, we are able to apply a version of Morel and Renvoise's partial redundancy elimination [4] to remove even more redundancies. The paper presents a series of hash-based algorithms and a series of refinements to the partitioning technique. Within each series, it can be proved that each method discovers at least as many redundancies as its predecessors. Unfortunately, no such relationship exists between the hash-based and global techniques. On some programs, the hash-based techniques eliminate more redundancies than the partitioning techniques, while on others, partitioning wins. We experimentally compare the improvements made by these techniques when applied to real programs. These results will be useful for commercial compiler writers who wish to assess the potential impact of each technique before implementation.
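
    The classic hash-based local algorithm that the paper starts from is easy to sketch. The following Python fragment (an illustration under simplifying assumptions, not the authors' code) numbers values in a single basic block, keying each expression by its operator and operand value numbers, so that a repeated key exposes a redundant computation:

        # Sketch of hash-based local value numbering on one basic block.
        # The paper's dominator-tree and partitioning variants go far beyond this.
        def value_number(block):
            vn = {}         # variable -> value number
            table = {}      # (op, vn1, vn2) -> value number
            next_vn = [0]
            redundant = []

            def number_of(var):
                if var not in vn:
                    vn[var] = next_vn[0]; next_vn[0] += 1
                return vn[var]

            for dst, op, a, b in block:
                key = (op, number_of(a), number_of(b))
                if op in ("+", "*"):               # commutative ops: canonicalize
                    key = (op, *sorted(key[1:]))
                if key in table:
                    redundant.append(dst)          # dst recomputes a known value
                    vn[dst] = table[key]
                else:
                    table[key] = next_vn[0]
                    vn[dst] = next_vn[0]; next_vn[0] += 1
            return redundant

        block = [("x", "+", "a", "b"), ("y", "+", "b", "a"), ("z", "*", "x", "c")]
        print(value_number(block))   # ['y'] -- y repeats the value of x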
  • Automatic Isolation of Compiler Errors
    Automatic Isolation of Compiler Errors
    DAVID B. WHALLEY, Florida State University

    This paper describes a tool called vpoiso that was developed to automatically isolate errors in the vpo compiler system. The two general types of compiler errors isolated by this tool are optimization and nonoptimization errors. When isolating optimization errors, vpoiso relies on the vpo optimizer to identify sequences of changes, referred to as transformations, that result in semantically equivalent code and to provide the ability to stop performing improving (or unnecessary) transformations after a specified number have been performed. A compilation of a typical program by vpo often results in thousands of improving transformations being performed. The vpoiso tool can automatically isolate the first improving transformation that causes incorrect output of the execution of the compiled program by using a binary search that varies the number of improving transformations performed. Not only is the illegal transformation automatically isolated, but vpoiso also identifies the location and instant the transformation is performed in vpo. Nonoptimization errors occur from problems in the front end, code generator, and necessary transformations in the optimizer. If another compiler is available that can produce correct (but perhaps more inefficient) code, then vpoiso can isolate nonoptimization errors to a single function. Automatic isolation of compiler errors facilitates retargeting a compiler to a new machine, maintenance of the compiler, and supporting experimentation with new optimizations.

    General Terms: Compilers, Testing
    Additional Key Words and Phrases: Diagnosis procedures, nonoptimization errors, optimization errors

    1. INTRODUCTION
    To increase portability compilers are often split into two parts, a front end and a back end.
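
    The heart of the isolation strategy is an ordinary binary search over how many improving transformations the optimizer is allowed to perform. A rough Python sketch, with a hypothetical `compile_and_run` test harness (not part of vpoiso), looks like this:

        # Sketch of vpoiso-style isolation of the first faulty improving
        # transformation. compile_and_run(limit) is a hypothetical harness:
        # it stops the optimizer after `limit` improving transformations,
        # runs the compiled program, and returns True if the output is correct.
        def isolate_first_bad(total, compile_and_run):
            lo, hi = 0, total      # invariant: `lo` passes, `hi` fails
            assert compile_and_run(0) and not compile_and_run(total)
            while hi - lo > 1:
                mid = (lo + hi) // 2
                if compile_and_run(mid):
                    lo = mid       # still correct after `mid` transformations
                else:
                    hi = mid       # failure already present at `mid`
            return hi              # index of the first transformation that breaks output

    With thousands of transformations per compilation, this narrows the culprit down in a few dozen recompilations.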
  • Language and Compiler Support for Dynamic Code Generation by Massimiliano A
    Language and Compiler Support for Dynamic Code Generation
    by Massimiliano A. Poletto
    S.B., Massachusetts Institute of Technology (1995)
    M.Eng., Massachusetts Institute of Technology (1995)
    Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY, September 1999
    © Massachusetts Institute of Technology 1999. All rights reserved.
    Author: Department of Electrical Engineering and Computer Science, June 23, 1999
    Certified by: M. Frans Kaashoek, Associate Professor of Electrical Engineering and Computer Science, Thesis Supervisor
    Accepted by: Arthur C. Smith, Chairman, Departmental Committee on Graduate Students

    Abstract: Dynamic code generation, also called run-time code generation or dynamic compilation, is the creation of executable code for an application while that application is running. Dynamic compilation can significantly improve the performance of software by giving the compiler access to run-time information that is not available to a traditional static compiler. A well-designed programming interface to dynamic compilation can also simplify the creation of important classes of computer programs. Until recently, however, no system combined efficient dynamic generation of high-performance code with a powerful and portable language interface. This thesis describes a system that meets these requirements, and discusses several applications of dynamic compilation.
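
    The core idea – generating and compiling code while the program runs, exploiting values that only exist at run time – can be mimicked in a few lines of Python. This is a loose analogy of the concept only, not the thesis's system (which dynamically generates machine code from C):

        # Tiny illustration of run-time code generation: specialize a power
        # function once the exponent is known at run time.
        def make_pow(n):
            # Build source text for x**n as straight-line multiplication.
            body = " * ".join(["x"] * n) if n > 0 else "1"
            src = f"def pow_{n}(x):\n    return {body}\n"
            ns = {}
            exec(compile(src, "<generated>", "exec"), ns)   # run-time compilation
            return ns[f"pow_{n}"]

        cube = make_pow(3)
        print(cube(5))   # 125 -- computed as 5 * 5 * 5, no loop or exponent test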
  • Transparent Dynamic Optimization: the Design and Implementation of Dynamo
    Transparent Dynamic Optimization: The Design and Implementation of Dynamo
    Vasanth Bala, Evelyn Duesterwald, Sanjeev Banerjia
    HP Laboratories Cambridge, HPL-1999-78, June 1999
    E-mail: [email protected]
    Keywords: dynamic optimization, compiler, trace selection, binary translation

    Dynamic optimization refers to the runtime optimization of a native program binary. This report describes the design and implementation of Dynamo, a prototype dynamic optimizer that is capable of optimizing a native program binary at runtime. Dynamo is a realistic implementation, not a simulation, that is written entirely in user-level software, and runs on a PA-RISC machine under the HPUX operating system. Dynamo does not depend on any special programming language, compiler, operating system or hardware support. Contrary to intuition, we demonstrate that it is possible to use a piece of software to improve the performance of a native, statically optimized program binary, while it is executing. Dynamo not only speeds up real application programs, its performance improvement is often quite significant. For example, the performance of many +O2 optimized SPECint95 binaries running under Dynamo is comparable to the performance of their +O4 optimized version running without Dynamo.

    Internal Accession Date Only. © Copyright Hewlett-Packard Company 1999

    Contents: 1 Introduction 7; 2 Related Work 9; 3 Overview
  • Optimizing for Reduced Code Space Using Genetic Algorithms
    Optimizing for Reduced Code Space using Genetic Algorithms
    Keith D. Cooper, Philip J. Schielke, and Devika Subramanian
    Department of Computer Science, Rice University, Houston, Texas, USA
    {keith | phisch | devika}@cs.rice.edu

    Abstract: Code space is a critical issue facing designers of software for embedded systems. Many traditional compiler optimizations are designed to reduce the execution time of compiled code, but not necessarily the size of the compiled code. Further, different results can be achieved by running some optimizations more than once and changing the order in which optimizations are applied. Register allocation only complicates matters, as the interactions between different optimizations can cause more spill code to be generated. The compiler for embedded systems, then, must take care to use the best sequence of optimizations to minimize code space. Since much of the code for embedded systems is compiled once and then burned into ROM, the software designer will often tolerate much longer compile times in the hope of reducing the size of the compiled code.

    It is virtually impossible to select the best set of optimizations to run on a particular piece of code. Historically, compiler writers have made one of two assumptions. Either a fixed optimization order is "good enough" for all programs, or giving the user a large set of flags that control optimization is sufficient, because it shifts the burden onto the user. Interplay between optimizations occurs frequently. A transformation can create opportunities for other transformations. Similarly, a transformation can eliminate opportunities for other transformations. These interactions also depend on the program being compiled, and they are often difficult to predict. Multiple applications of the same transformation at different points in the optimization sequence
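
    The genetic-algorithm idea is to treat an optimization sequence as a chromosome and evolve the population toward smaller code. A toy Python sketch follows; the pass names and the `code_size` fitness function are placeholders, not the paper's experimental setup:

        import random

        # Toy genetic search over optimization orderings (illustrative only).
        PASSES = ["cse", "dce", "licm", "copy-prop", "peephole"]

        def code_size(sequence):
            # Placeholder fitness: in reality, compile with this pass order
            # and measure the size of the emitted code.
            return sum(hash((i, p)) % 100 for i, p in enumerate(sequence))

        def evolve(generations=50, pop_size=20, seq_len=8):
            pop = [[random.choice(PASSES) for _ in range(seq_len)]
                   for _ in range(pop_size)]
            for _ in range(generations):
                pop.sort(key=code_size)           # smaller code = fitter
                survivors = pop[: pop_size // 2]
                children = []
                for _ in range(pop_size - len(survivors)):
                    a, b = random.sample(survivors, 2)
                    cut = random.randrange(1, seq_len)
                    child = a[:cut] + b[cut:]     # one-point crossover
                    if random.random() < 0.1:     # mutation: swap in a random pass
                        child[random.randrange(seq_len)] = random.choice(PASSES)
                    children.append(child)
                pop = survivors + children
            return min(pop, key=code_size)

        print(evolve())

    Because each fitness evaluation is a full compile, the long compile times the abstract mentions are exactly the price this search pays.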
  • Automatic Derivation of Compiler Machine Descriptions
    Automatic Derivation of Compiler Machine Descriptions
    CHRISTIAN S. COLLBERG, University of Arizona

    We describe a method designed to significantly reduce the effort required to retarget a compiler to a new architecture, while at the same time producing fast and effective compilers. The basic idea is to use the native C compiler at compiler construction time to discover architectural features of the new architecture. From this information a formal machine description is produced. Given this machine description, a native code-generator can be generated by a back-end generator such as BEG or burg. A prototype automatic Architecture Discovery Tool (called ADT) has been implemented. This tool is completely automatic and requires minimal input from the user. Given the Internet address of the target machine and the command-lines by which the native C compiler, assembler, and linker are invoked, ADT will generate a BEG machine specification containing the register set, addressing modes, instruction set, and instruction timings for the architecture. The current version of ADT is general enough to produce machine descriptions for the integer instruction sets of common RISC and CISC architectures such as the Sun SPARC, Digital Alpha, MIPS, DEC VAX, and Intel x86.

    Categories and Subject Descriptors: D.2.7 [Software Engineering]: Distribution, Maintenance, and Enhancement—Portability; D.3.2 [Programming Languages]: Language Classifications—Macro and assembly languages; D.3.4 [Programming Languages]: Processors—Translator writing systems and compiler generators
    General Terms: Languages
    Additional Key Words and Phrases: Back-end generators, compiler configuration scripts, retargeting

    1. INTRODUCTION
    An important aspect of a compiler implementation is its retargetability. For example, a new programming language whose compiler can be quickly retargeted to a new hardware platform or a new operating system is more likely to gain widespread acceptance than a language whose compiler requires extensive retargeting effort.
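
    The probing idea can be shown in miniature: feed the native C compiler a tiny function and scrape the emitted assembly to learn, say, which instruction implements integer addition. This crude Python sketch is in the spirit of ADT, not its actual code; the compiler name, flags, and the string heuristic are assumptions:

        import subprocess, tempfile, os

        # Crude compiler-probing illustration: compile a one-line C function
        # with the native compiler and scan the assembly. Assumes a `cc`
        # on PATH that accepts -S.
        def probe_add_instruction(cc="cc"):
            src = "int f(int a, int b) { return a + b; }\n"
            with tempfile.TemporaryDirectory() as d:
                c_file = os.path.join(d, "probe.c")
                s_file = os.path.join(d, "probe.s")
                with open(c_file, "w") as f:
                    f.write(src)
                subprocess.run([cc, "-S", "-O", c_file, "-o", s_file], check=True)
                with open(s_file) as f:
                    # Heuristic: the first mnemonic containing "add" is a candidate.
                    for line in f:
                        fields = line.split()
                        if fields and "add" in fields[0]:
                            return line.strip()
            return None

        print(probe_add_instruction())   # e.g. "addl %esi, %edi" on x86-64

    ADT runs a large battery of such probes and distills the results into a BEG machine specification rather than a single scraped line.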
  • CS415 Compilers Register Allocation and Introduction to Instruction Scheduling
    CS415 Compilers: Register Allocation and Introduction to Instruction Scheduling
    These slides are based on slides copyrighted by Keith Cooper, Ken Kennedy & Linda Torczon at Rice University

    Announcement
    • First homework out → Due tomorrow 11:59pm EST. → Sakai submission window open. → pdf format only.
    • Late policy → 20% penalty for first 24 hours late. → Not accepted 24 hours after deadline. If you have personal overriding reasons that prevent you from submitting homework online, please let me know in advance.

    Review: ILOC
    Register allocation on basic blocks in "ILOC" (local register allocation)
    • Pseudo-code for a simple, abstracted RISC machine → generated by the instruction selection process
    • Simple, compact data structures
    • Here: we only use a small subset of ILOC → ILOC simulator at ~zhang/cs415_2016/ILOC_Simulator on ilab cluster

    Naïve Representation: Quadruples

        loadI  2       r1
        loadAI r0 @y   r2
        add    r1 r2   r3
        loadAI r0 @x   r4
        sub    r4 r3   r5

    • table of k x 4 small integers
    • simple record structure
    • easy to reorder
    • all names are explicit
    ILOC is described in Appendix A of EAC

    Review: F – Set of Feasible Registers
    Allocator may need to reserve registers to ensure feasibility
    • Must be able to compute addresses
    • Requires some minimal set of registers, F → F depends on target architecture
    • F contains registers to make spilling work (set F registers "aside", i.e., not available for register assignment)
    Notation: k is the number of registers on the target machine

    Live Range of a Value (Virtual Register)
    A value is live between its definition and its uses
    • Find definitions (x ← …) and uses (y ← … x ...)
    • From definition to last use is its live range → How does a second definition affect this?
    • Can represent live range as an interval [i,j] (in block) → live on exit
    Let MAXLIVE be the maximum, over each instruction i in the block, of the number of values (pseudo-registers) live at i.
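
    As a small worked example of the definitions above, the following Python sketch computes live ranges as intervals and MAXLIVE for the quadruple block shown earlier (an illustration, not course code; unseen definitions like r0 are treated as live on entry):

        # Compute live ranges [def, last use] and MAXLIVE for one basic block
        # of ILOC-like quadruples: (opcode, uses, definition).
        block = [
            ("loadI",  [],           "r1"),   # 0: r1 = 2
            ("loadAI", ["r0"],       "r2"),   # 1: r2 = MEM(r0 + @y)
            ("add",    ["r1", "r2"], "r3"),   # 2
            ("loadAI", ["r0"],       "r4"),   # 3: r4 = MEM(r0 + @x)
            ("sub",    ["r4", "r3"], "r5"),   # 4
        ]

        def live_ranges(block):
            ranges = {}
            for i, (_, uses, dst) in enumerate(block):
                for r in uses:
                    if r in ranges:
                        ranges[r][1] = i      # extend interval to this use
                    else:
                        ranges[r] = [0, i]    # never defined here: live on entry
                ranges.setdefault(dst, [i, i])
            return ranges

        def maxlive(block):
            ranges = live_ranges(block)
            return max(sum(1 for lo, hi in ranges.values() if lo <= i <= hi)
                       for i in range(len(block)))

        print(live_ranges(block))
        print(maxlive(block))   # 4 -- at the add, intervals of r0, r1, r2, r3 overlap

    If MAXLIVE exceeds k (minus the reserved set F), some value must be spilled.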
  • MAO - an Extensible Micro-Architectural Optimizer
    MAO - an Extensible Micro-Architectural Optimizer
    Robert Hundt, Easwaran Raman, Martin Thuresson, Neil Vachharajani
    Google, 1600 Amphitheatre Parkway, Mountain View, CA, 94043

    Abstract: Performance matters, and so does repeatability and predictability. Today's processors' micro-architectures have become so complex as to now contain many undocumented, not understood, and even puzzling performance cliffs. Small changes in the instruction stream, such as the insertion of a single NOP instruction, can lead to significant performance deltas, with the effect of exposing compiler and performance optimization efforts to perceived unwanted randomness.

        .L3:
            movsbl 1(%rdi,%r8,4),%edx
            movsbl (%rdi,%r8,4),%eax
            # ... 6 instructions
            movl   %edx, (%rsi,%r8,4)
            addq   $1, %r8
            nop    # this instruction speeds up the loop by 5%
        .L5:
            movsbl 1(%rdi,%r8,4),%edx
            movsbl (%rdi,%r8,4),%eax
            # ... identical code sequence
            movl   %edx, (%rsi,%r8,4)
            addq   $1, %r8
            cmpl   %r8d, %r9d
            jg     .L3

        Fig. 1. Code snippet with high impact NOP instruction

    This paper presents MAO, an extensible micro-architectural assembly to assembly optimizer, which seeks to address this problem for x86/64 processors. In essence, MAO is a thin wrapper around a common open source assembler infrastructure. It offers basic operations, such as creation or modification of instructions, simple data-flow analysis, and advanced infrastructure, such as loop recognition, and a repeated relaxation algorithm to compute instruction addresses and lengths. This infrastructure enables a plethora of passes for pattern matching, alignment specific optimizations, peep-holes, experiments (such as random insertion of NOPs), and fast prototyping of more sophisticated optimizations.
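
    Of the pieces listed, the relaxation algorithm is easy to show in miniature: branch encodings depend on distances, distances depend on encodings, so sizes are recomputed until a fixpoint. The two-byte/five-byte jump encodings in this Python sketch are a simplification of x86, not MAO's actual pass:

        # Miniature branch-relaxation fixpoint. A jump is 2 bytes if its
        # displacement fits in a signed byte, else 5; others are fixed-size.
        def relax(instrs):
            # instrs: list of ("insn", byte_size) or ("jmp", target_index)
            size = [2 if op == "jmp" else arg for op, arg in instrs]  # optimistic
            changed = True
            while changed:
                changed = False
                offset = [0]
                for s in size:
                    offset.append(offset[-1] + s)
                for i, (op, arg) in enumerate(instrs):
                    if op == "jmp":
                        disp = offset[arg] - offset[i + 1]   # relative to next insn
                        need = 2 if -128 <= disp <= 127 else 5
                        if need > size[i]:
                            size[i] = need                   # grow, never shrink
                            changed = True
            return size

        prog = [("insn", 100), ("jmp", 3), ("insn", 200), ("insn", 1)]
        print(relax(prog))   # [100, 5, 200, 1] -- a 200-byte hop needs the long form

    Since sizes only grow, the iteration terminates; the same monotonic scheme appears in most assemblers.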
  • Lecture Notes on Instruction Selection
    Lecture Notes on Instruction Selection
    15-411: Compiler Design
    Frank Pfenning
    Lecture 2, August 28, 2014

    1 Introduction
    In this lecture we discuss the process of instruction selection, which typically turns some form of intermediate code into a pseudo-assembly language in which we assume to have infinitely many registers called "temps". We next apply register allocation to the result to assign machine registers and stack slots to the temps before emitting the actual assembly code. Additional material regarding instruction selection can be found in the textbook [App98, Chapter 9].

    2 A Simple Source Language
    We use a very simple source language where a program is just a sequence of assignments terminated by a return statement. The right-hand side of each assignment is a simple arithmetic expression. Later in the course we describe how the input text is parsed and translated into some intermediate form. Here we assume we have arrived at an intermediate representation where expressions are still in the form of trees and we have to generate instructions in pseudo-assembly. We call this form IR Trees (for "Intermediate Representation Trees"). We describe the possible IR trees in a kind of pseudo-grammar, which should not be read as a description of the concrete syntax, but the recursive structure of the data.

        Programs     ~s ::= s1; ...; sn    sequence of statements
        Statements   s  ::= x = e          assignment
                          | return e       return, always last
        Expressions  e  ::= c              integer constant
                          | x              variable
                          | e1 ⊕ e2        binary operation
        Binops       ⊕  ::= + | − | ∗ | / | ...

    3 Abstract Assembly Target Code
    For our very simple source, we use an equally simple target.
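
    The overall flow the notes describe can be sketched compactly, under assumptions of my own: IR trees encoded as nested Python tuples, and a recursive translation that emits pseudo-assembly into an unbounded pool of temps (not the lecture's exact algorithm):

        import itertools

        # Sketch: instruction selection over IR trees into pseudo-assembly
        # with unlimited temps. An expression is an int constant, a variable
        # name, or (op, e1, e2).
        temp = (f"t{i}" for i in itertools.count())
        OPS = {"+": "ADD", "-": "SUB", "*": "MUL", "/": "DIV"}

        def munch_exp(e, code):
            """Emit instructions computing e; return the operand holding it."""
            if isinstance(e, int):
                t = next(temp)
                code.append(f"MOV {t}, ${e}")
                return t
            if isinstance(e, str):
                return e                  # variables already live in temps
            op, e1, e2 = e
            t1, t2 = munch_exp(e1, code), munch_exp(e2, code)
            t = next(temp)
            code.append(f"{OPS[op]} {t}, {t1}, {t2}")
            return t

        # x = (a + 3) * b; return x
        code = []
        x = munch_exp(("*", ("+", "a", 3), "b"), code)
        code.append(f"RET {x}")
        print("\n".join(code))
        # MOV t0, $3
        # ADD t1, a, t0
        # MUL t2, t1, b
        # RET t2

    Register allocation then maps the unbounded temps t0, t1, ... onto real machine registers and stack slots, exactly as the introduction above outlines.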