Optimizing Memory Accesses for Spatial Computation

Total Pages: 16

File Type: PDF, Size: 1020 KB

Optimizing Memory Accesses For Spatial Computation

Mihai Budiu and Seth C. Goldstein
Computer Science Department, Carnegie Mellon University
{mihaib,seth}@cs.cmu.edu

Abstract

In this paper we present the internal representation and optimizations used by the CASH compiler for improving the memory parallelism of pointer-based programs. CASH uses an SSA-based representation for memory, which compactly summarizes both control-flow and dependence information.

In CASH, memory optimization is a four-step process: (1) first an initial, relatively coarse, representation of memory dependences is built; (2) next, unnecessary memory dependences are removed using dependence tests; (3) third, redundant memory operations are removed; (4) finally, parallelism is increased by pipelining memory accesses in loops. While the first three steps above are very general, the loop pipelining transformations are particularly applicable for spatial computation, which is the primary target of CASH.

The redundant memory removal optimizations presented are: load/store hoisting (subsuming partial redundancy elimination and common-subexpression elimination), load-after-store removal, store-before-store removal (dead store removal) and loop-invariant load motion.

One of our loop pipelining transformations is a new form of loop parallelization, called loop decoupling. This transformation separates independent memory accesses within a loop body into several independent loops, which are allowed dynamically to slip with respect to each other. A new computational primitive, a token generator, is used to dynamically control the amount of slip, allowing maximum freedom while guaranteeing that no memory dependences are violated.

1 Introduction

One of the main bottlenecks to increasing the performance of programs is that many compiler optimizations break down in the face of memory references. For example, while SSA is an IR that is widely recognized as enabling many efficient and powerful optimizations, it cannot be easily applied in the presence of pointers. In this paper we present an intermediate representation, Pegasus, that enables efficient and powerful optimizations in the presence of pointers. Pegasus also increases available parallelism by supporting aggressive predication without giving up the features of SSA.

One of Pegasus' main design goals is to support spatial computation. Spatial computation refers to the direct execution of high-level language programs in hardware. Each operation in the program is implemented as a hardware operator. Data flows from one operation to another along wires that connect the operations.

Pegasus is a natural intermediate representation for spatial computation because its semantics are similar to that of an asynchronous circuit. Pegasus is an executable IR which unifies predication, static single assignment, may-dependences, and data-flow semantics. The most powerful aspect of Pegasus is that it allows the essential information in a program to be represented directly. Thus, many optimizations are reduced to a local rewriting of the graph. This makes Pegasus a particularly scalable representation, yielding efficient compilation even when compiling entire programs to circuits.

In this paper we describe optimizations which increase memory parallelism, eliminate redundant memory references, and increase the pipeline parallelism found in loops. We show how Pegasus' explicit representation of both control and data dependences allows these optimizations to be concisely expressed.

2 Example

We motivate our memory representation with the following example:

    void f(unsigned *p, unsigned a[], int i) {
        if (p) a[i] += *p;
        else   a[i] = 1;
        a[i] <<= a[i+1];
    }

This program uses a[i] as a temporary variable. We have compiled this program at the highest optimization level using seven different compilers:

• gcc 3.2 for Intel P4, -O3,
• Sun WorkShop 6 update 2 for Sparc, -xO5,
• DEC CC V5.6-075 for Alpha, -O4,
• MIPSpro Compiler Version 7.3.1.2m for SGI, -O4,
• SGI ORC version 1.0.0 for Itanium, -O4,
• IBM AIX cc version 3.6.6.0, -O3,
• CASH.

Only CASH and the AIX compiler removed all the useless memory accesses made for the intermediate result stored in a[i] (two stores and one load). The other compilers retain the intermediate redundant stores to a[i], which are immediately overwritten by the last store in the function. The CASH result is even more surprising in light of the simplicity of the analysis it performs: the optimizations consist only of reachability computations in a DAG and localized code transformations (term-rewriting). The point of this comparison is not to show the superiority of CASH; rather, we want to point out that optimizations which are complicated enough to perform in a traditional representation (so that most compilers forgo them) are very easy to express, implement and reason about when using a better program representation.
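To make the comparison concrete, the following hand-written sketch (ours; none of the compilers above emits C source) shows a function that is behaviorally equivalent to f once the redundant memory accesses are removed: the two intermediate stores to a[i] and the load that re-reads them disappear, leaving a single store to a[i]. The name f_optimized is invented for illustration.

    /* Behaviorally equivalent to f after redundant-memory removal:
     * only the final store to a[i] remains.                        */
    void f_optimized(unsigned *p, unsigned a[], int i) {
        unsigned t = p ? a[i] + *p : 1;   /* value the temporary a[i] would have held */
        a[i] = t << a[i + 1];             /* the single surviving store               */
    }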
The strength of CASH originates from Pegasus, its internal program representation. The optimization occurs in three steps, detailed in Figure 1. (We describe Pegasus in Section 3 and details of these optimizations in Section 5.)¹

Figure 1A depicts a slightly simplified form of the program representation before any optimizations. Each operation is represented by a node in a graph. Edges represent value flow between nodes. Pegasus uses predication [20]; throughout this paper, dotted lines represent predicate values, while dashed lines represent may dependences for memory access operations; these lines carry token values. Each memory operation has a predicate input, indicating whether it ought to be executed, and a token input, indicating that the side-effects of the prior dependent operations have occurred. At the beginning of the compilation process the token dependences track the original program control-flow. The nodes denoted by "V" represent operations that combine multiple input tokens into a single token for each output; these originally represent joins in the program control-flow. The "*" node is the token input, indicating that side-effects in the previous part of the program have completed.

CASH first proves that a[i] and a[i+1] access disjoint memory regions, and thus commute (nodes 3/5, 2/5 and 5/6). By removing the token edges between these memory operations the program is transformed as in Figure 1B. Notice that a new combine operator is inserted at the end of the program; its execution indicates that all prior program side-effects have occurred.

In Figure 1B, the load from a[i] labeled 4 immediately follows the two possible stores (2 and 3). No complicated dataflow or control-flow analysis is required to deduce this fact: it is indicated by the tokens flowing directly from the stores to the load. As such, the load can be removed and replaced with the value of the executed store. This replacement is shown in Figure 1C, where the load has been replaced by a multiplexor 7, drawn as a trapezoid. The multiplexor is controlled by the predicates of the two stores to a[i], 2 and 3: the store that executes (i.e., has a "true" predicate at run-time) is the one that will forward the stored value through the multiplexor.

As a result of the load elimination, in Figure 1C the store 6 to a[i] immediately follows (and post-dominates) the other two stores, 2 and 3, to the same address; in consequence, 2 and 3 are useless, and can be completely removed, as dead code. Again, the post-dominance test does not involve any control-flow analysis: it follows immediately from the representation, because (1) the stores immediately follow each other, as indicated by the tokens connecting them; (2) they occur at the same address; and (3) each predicate of an earlier store implies the predicate of the latter one, which is constant "true" (shown as 1), indicating the post-dominance. The post-dominance is determined by only elementary boolean manipulations. The transformation from C to D is accomplished in two steps: (a) the predicates of the prior stores are "and"-ed with the negation of the predicate of the latter store (i.e., the prior stores should occur only if the latter one doesn't overwrite them) and (b) stores with a constant "false" predicate are completely removed from the program as dead code.
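The C-to-D step relies only on predicate algebra. The following schematic sketch is ours, not code from CASH, and every identifier in it is invented; it simply restates the rule for store-before-store removal as a one-line boolean function:

    #include <stdbool.h>

    /* A store that is post-dominated by a later store to the same address
     * may only survive when the later store does not overwrite it.        */
    static bool surviving_predicate(bool earlier, bool later) {
        return earlier && !later;          /* step (a): and with the negation */
    }

    /* With the constant-true predicate of store 6, both earlier stores fold
     * to constant-false predicates and are deleted as dead code (step (b)):
     *   surviving_predicate(p != 0, true) == false   -- store 2
     *   surviving_predicate(p == 0, true) == false   -- store 3              */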
2.1 Related Work

The explicit representation of memory dependences between program operations has been suggested numerous times in the literature; for example in Pingali's Dependence Flow Graph [21], or Steensgaard's adaptation to Value-Dependence Graphs [24]. As Steensgaard has observed, this type of representation is actually a generalization of SSA designed to handle value flow through memory. Other researchers have explored the use of SSA for handling memory dependences, e.g., [7, 8, 14, 5, 6, 13].

The original contributions of this work are:

• We apply the memory access optimization techniques in the context of a spatial compiler, which translates C programs to hardware circuits.
• We show how to use predication and SSA together for memory code hoisting (subsuming partial redundancy elimination and global common-subexpression elimination for memory operations), removing loads after stores, dead store removal, and loop-invariant code motion for memory operations; our algorithms rely on boolean manipulation of the controlling predicates (Section 5).

¹ We have elided the address computation part.

Figure 1: Program transformations [dotted lines represent predicates, while dashed lines represent tokens]: (A) removal of two extraneous dependences between a[i] and a[i+1], (B) load from a[i] bypasses directly stored value(s), (C) store to a[i] kills prior stores to the same address, (D) final result.
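The abstract above also introduces loop decoupling, which splits independent memory accesses in a loop body into separate loops that may slip dynamically with respect to one another. The sketch below is our own rough source-level intuition, not the paper's transformation: the real optimization operates on the Pegasus graph, and a token generator, rather than source rewriting, bounds the slip.

    /* Before: one loop containing two independent memory access streams
     * (restrict asserts that a and b do not overlap).                    */
    void before(int n, int *restrict a, int *restrict b) {
        for (int i = 0; i < n; i++) {
            a[i] = 2 * a[i];   /* stream 1: touches only a[] */
            b[i] = b[i] + 1;   /* stream 2: touches only b[] */
        }
    }

    /* After, conceptually: each stream becomes its own loop. At source level
     * this looks like plain loop fission; what loop decoupling adds is that,
     * in the spatial implementation, the two loops execute concurrently and a
     * token generator dynamically limits how far one may slip ahead of the
     * other, so no memory dependence can be violated.                        */
    void after(int n, int *restrict a, int *restrict b) {
        for (int i = 0; i < n; i++) a[i] = 2 * a[i];
        for (int i = 0; i < n; i++) b[i] = b[i] + 1;
    }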
Recommended publications
  • Loop Transformations and Parallelization
Loop transformations and parallelization. Claude Tadonki, LAL/CNRS/IN2P3, University of Paris-Sud. [email protected]. December 2010.

Introduction. Most of the time, the most time-consuming part of a program is in loops, so loop optimization is critical in high performance computing. Depending on the target architecture, the goals of loop transformations are: improving data reuse and data locality; efficient use of the memory hierarchy; reducing overheads associated with executing loops; instruction pipelining; maximizing parallelism. Loop transformations can be performed at different levels by the programmer, the compiler, or specialized tools. At a high level, some well-known transformations are commonly considered: loop interchange, loop (node) splitting, loop unswitching, loop reversal, loop fusion, loop inversion, loop skewing, loop fission, loop vectorization, loop blocking, loop unrolling, loop parallelization.

Dependence analysis. Extracting and analyzing the dependences of a computation from its polyhedral model is a fundamental step toward loop optimization or scheduling. Definition: for a given variable V and given indexes I1, I2, if the computation of X(I1) requires the value of X(I2), then I1 − I2 is called a dependence vector for variable V. Drawing all the dependence vectors within the computation polytope yields the so-called dependence diagram. Example: the dependence vectors are (1, 0), (0, 1), (−1, 1).

Scheduling. Definition: the computation on the entire domain of a given loop can be performed following any valid schedule. A timing function t_V for variable V yields a valid schedule if and only if

    t(x) > t(x − d)  for all d ∈ D_V,    (1)

where D_V is the set of all dependence vectors for variable V. (A worked instance of this condition is sketched below.)
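As a worked instance of condition (1) (our own example, reusing the dependence vectors listed in the excerpt): for a linear schedule t(x) = c·x, condition (1) reduces to c·d > 0 for every dependence vector d in D_V, so each candidate can be checked vector by vector.

    % Dependence vectors from the example above.
    \[ D_V = \{(1,0),\ (0,1),\ (-1,1)\} \]
    % t(i,j) = i + 2j, i.e. c = (1,2): every dot product is positive.
    \[ (1,2)\cdot(1,0) = 1 > 0, \quad (1,2)\cdot(0,1) = 2 > 0, \quad
       (1,2)\cdot(-1,1) = 1 > 0 \quad\Rightarrow\ \text{valid schedule.} \]
    % t(i,j) = 2i + j, i.e. c = (2,1): the third vector violates the condition.
    \[ (2,1)\cdot(-1,1) = -1 < 0 \quad\Rightarrow\ \text{not a valid schedule.} \]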
  • Foundations of Scientific Research
2012. FOUNDATIONS OF SCIENTIFIC RESEARCH. N. M. Glazunov, National Aviation University, 25.11.2012. CONTENTS: Preface (3); Introduction (4); 1. General notions about scientific research (SR) (6); 1.1. Scientific method (9); 1.2. Basic research (10); 1.3. Information supply of scientific research (12); 2. Ontologies and upper ontologies (16); 2.1. Concepts of Foundations of Research Activities; 2.2. Ontology components; 2.3. Ontology for the visualization of a lecture; 3. Ontologies of object domains (19); 3.1. Elements of the ontology of spaces and symmetries; 3.1.1. Concepts of electrodynamics and classical gauge theory; 4. Examples of Research Activity (21); 4.1. Scientific activity in arithmetics, informatics and discrete mathematics; 4.2. Algebra of logic and functions of the algebra of logic; 4.3. Function of the algebra of logic; 5. Some Notions of the Theory of Finite and Discrete Sets (25); 6. Algebraic Operations and Algebraic Structures (26); 7. Elements of the Theory of Graphs and Nets (42); 8. Scientific activity on the example "Information and its investigation" (55); 9. Scientific research in Artificial Intelligence (59); 10. Compilers and compilation (69); 11. Objective, Concepts and History of Computer security (93); 12. Methodological and categorical apparatus of scientific research (114); 13. Methodology and methods of scientific research (116); 13.1. Methods of theoretical level of research; 13.1.1. Induction; 13.1.2. Deduction; 13.2. Methods of empirical level of research; 14. Scientific idea and significance of scientific research (119); 15. Forms of scientific knowledge organization and principles of SR (121); 15.1. Forms of scientific knowledge; 15.2.
  • Compiler Transformations for High-Performance Computing
Compiler Transformations for High-Performance Computing. DAVID F. BACON, SUSAN L. GRAHAM, AND OLIVER J. SHARP. Computer Science Division, University of California, Berkeley, California 94720. In the last three decades a large number of compiler transformations for optimizing programs have been implemented. Most optimizations for uniprocessors reduce the number of instructions executed by the program using transformations based on the analysis of scalar quantities and data-flow techniques. In contrast, optimizations for high-performance superscalar, vector, and parallel processors maximize parallelism and memory locality with transformations that rely on tracking the properties of arrays using loop dependence analysis. This survey is a comprehensive overview of the important high-level program restructuring techniques for imperative languages such as C and Fortran. Transformations for both sequential and various types of parallel architectures are covered in depth. We describe the purpose of each transformation, explain how to determine if it is legal, and give an example of its application. Programmers wishing to enhance the performance of their code can use this survey to improve their understanding of the optimizations that compilers can perform, or as a reference for techniques to be applied manually. Students can obtain an overview of optimizing compiler technology. Compiler writers can use this survey as a reference for most of the important optimizations developed to date, and as a bibliographic reference for the details of each optimization.
  • Compiler Construction
    Compiler construction PDF generated using the open source mwlib toolkit. See http://code.pediapress.com/ for more information. PDF generated at: Sat, 10 Dec 2011 02:23:02 UTC Contents Articles Introduction 1 Compiler construction 1 Compiler 2 Interpreter 10 History of compiler writing 14 Lexical analysis 22 Lexical analysis 22 Regular expression 26 Regular expression examples 37 Finite-state machine 41 Preprocessor 51 Syntactic analysis 54 Parsing 54 Lookahead 58 Symbol table 61 Abstract syntax 63 Abstract syntax tree 64 Context-free grammar 65 Terminal and nonterminal symbols 77 Left recursion 79 Backus–Naur Form 83 Extended Backus–Naur Form 86 TBNF 91 Top-down parsing 91 Recursive descent parser 93 Tail recursive parser 98 Parsing expression grammar 100 LL parser 106 LR parser 114 Parsing table 123 Simple LR parser 125 Canonical LR parser 127 GLR parser 129 LALR parser 130 Recursive ascent parser 133 Parser combinator 140 Bottom-up parsing 143 Chomsky normal form 148 CYK algorithm 150 Simple precedence grammar 153 Simple precedence parser 154 Operator-precedence grammar 156 Operator-precedence parser 159 Shunting-yard algorithm 163 Chart parser 173 Earley parser 174 The lexer hack 178 Scannerless parsing 180 Semantic analysis 182 Attribute grammar 182 L-attributed grammar 184 LR-attributed grammar 185 S-attributed grammar 185 ECLR-attributed grammar 186 Intermediate language 186 Control flow graph 188 Basic block 190 Call graph 192 Data-flow analysis 195 Use-define chain 201 Live variable analysis 204 Reaching definition 206 Three address
  • Codegeneratoren
Codegeneratoren. Sebastian Geiger. Twitter: @lanoxx, Email: [email protected]. January 26, 2015. Abstract: This is a very short summary of the topics that are taught in "Codegeneratoren". Code generators are the backend part of a compiler that deal with the conversion of the machine-independent intermediate representation into machine-dependent instructions and the various optimizations on the resulting code. The first two sections deal with prerequisites that are required for the later parts. Section 1 discusses computer architectures in order to give a better understanding of how processors work and why different optimizations are possible. Section 2 discusses some of the higher level optimizations that are done before the code generator does its work. Section 3 discusses instruction selection, which concerns the actual conversion of the intermediate representation into machine-dependent instructions. The later chapters all deal with different kinds of optimizations that are possible on the resulting code. I have written this document to help myself study for this subject. If you find this document useful or if you find errors or mistakes then please send me an email. Contents: 1. Computer architectures (2); 1.1 Microarchitectures (2); 1.2 Instruction Set Architectures (3); 2. Optimizations (3); 2.1 Static Single Assignment (3); 2.2 Graph structures and memory dependencies (4); 2.3 Machine independent optimizations (4); 3. Instruction selection (5); 4. Register allocation (6); 5. Coalescing (7); 6. Optimal register allocation using integer linear programming (7); 7. Interprocedural register allocation (8); 7.1 Global register allocation at link time (annotation based) (8); 7.2 Register allocation across procedure and module boundaries (web) (8); 8. Instruction Scheduling (9); 8.1 Phase ordering problem ...
  • Handling Irreducible Loops: Optimized Node Splitting Vs. DJ-Graphs
Handling Irreducible Loops: Optimized Node Splitting vs. DJ-Graphs. Sebastian Unger (DResearch Digital Media Systems, Otto-Schmirgal-Str. 3, 10319 Berlin, Germany) and Frank Mueller (North Carolina State University, CS Dept., Box 8206, Raleigh, NC 27695-7534, fax: +1.919.515.7925, [email protected]). Abstract. This paper addresses the question of how to handle irreducible regions during optimization, which has become even more relevant for contemporary processors since recent VLIW-like architectures highly rely on instruction scheduling. The contributions of this paper are twofold. First, a method of optimized node splitting to transform irreducible regions of control flow into reducible regions is derived. This method is superior to approaches previously published since it reduces the number of replicated nodes by comparison. Second, three methods that handle regions of irreducible control flow are evaluated with respect to their impact on compiler optimizations: traditional and optimized node splitting as well as loop analysis through DJ graphs. Measurements show improvements of 1-40% for these methods of handling irreducible loops over the unoptimized case. 1 Introduction. Compilers heavily rely on recognizing loops for optimizations. Most loop optimizations have only been formulated for natural loops with a single entry point (header), the sink of the backedge(s) of such a loop. Multiple entry loops cause irreducible regions of control flow, typically not recognized as loops by traditional algorithms. These regions may result from goto statements or from optimizations that modify the control flow. As a result, loop transformations and optimizations to exploit instruction-level parallelism cannot be applied to such regions so that opportunities for code improvements may be missed. (A minimal example of an irreducible region is sketched below.)
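As a hedged illustration of the kind of multiple-entry loop the excerpt refers to (our own minimal example, not taken from the paper), a goto that jumps into the middle of a loop body gives the loop two entry points, so it is not a natural (reducible) loop:

    /* The loop body can be entered through the normal header or through the
     * goto, creating an irreducible region in the control-flow graph.       */
    int sum_tail(const int a[], int n, int enter_mid) {
        int s = 0, i = 0;
        if (enter_mid && n > 1) {
            i = 1;
            goto middle;           /* second entry point into the loop */
        }
        for (i = 0; i < n; i++) {
    middle:
            s += a[i];
        }
        return s;
    }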
  • Data Locality on Manycore Architectures Duco Van Amstel
Data Locality on Manycore Architectures. Duco van Amstel. To cite this version: Duco van Amstel. Data Locality on Manycore Architectures. Distributed, Parallel, and Cluster Computing [cs.DC]. Université Grenoble-Alpes, 2016. English. tel-01358312v1. HAL Id: tel-01358312, https://hal.inria.fr/tel-01358312v1. Submitted on 31 Aug 2016 (v1), last revised 14 Apr 2017 (v2). HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. THESIS submitted for the degree of Doctor of the Université Grenoble Alpes, specialty: Computer Science (ministerial decree of 7 August 2006), presented by Duco van Amstel. Thesis supervised by Fabrice Rastello and co-advised by Benoît Dupont de Dinechin, prepared at Inria and the MSTII doctoral school. "Optimisation de la Localité des Données sur Architectures Manycœurs" (Data Locality on Manycore Architectures). Thesis publicly defended on 18/07/2016 before a jury composed of: François Bodin, Professor, IRISA, President; Albert Cohen, Research Director, Inria, Reviewer; Benoît Dupont de Dinechin, Technical Director, Kalray S.A., thesis co-advisor; François Irigoin, Research Director, MINES ParisTech, Reviewer; Fabrice Rastello, Research Scientist, Inria, thesis advisor; Ponnuswamy Sadayappan, Professor, Ohio State University, Examiner. Data Locality on Manycore Architectures, a dissertation by Duco Yalmar van Amstel. Acknowledgments. The work which is related in this manuscript is not that of a single person.
  • Software Thread Integration for Instruction Level Parallelism
ABSTRACT. SO, WON. Software Thread Integration for Instruction Level Parallelism. (Under the direction of Associate Professor Alexander G. Dean). Multimedia applications require a significantly higher level of performance than previous workloads of embedded systems. They have driven digital signal processor (DSP) makers to adopt high-performance architectures like VLIW (Very-Long Instruction Word) or EPIC (Explicitly Parallel Instruction Computing). Despite many efforts to exploit instruction-level parallelism (ILP) in the application, the speed is a fraction of what it could be, limited by the difficulty of finding enough independent instructions to keep all of the processor's functional units busy. This dissertation proposes Software Thread Integration (STI) for Instruction Level Parallelism. STI is a software technique for interleaving multiple threads of control into a single implicitly multithreaded one. We use STI to improve the performance on ILP processors by merging parallel procedures into one, increasing the compiler's scope and hence allowing it to create a more efficient instruction schedule. STI is essentially procedure jamming with intraprocedural code motion transformations which allow arbitrary alignment of instructions or code regions. This alignment enables code to be moved to use available execution resources better and improve the execution schedule. Parallel procedures are identified by the programmer with either annotations in conventional procedural languages or graph analysis for stream coarse-grain dataflow programming languages. We use the method of procedure cloning and integration for improving program run-time performance by integrating parallel procedures via STI. This defines a new way of converting parallelism at the thread level to the instruction level.
  • CS 610: Loop Transformations Swarnendu Biswas
CS 610: Loop Transformations. Swarnendu Biswas. Semester 2020-2021-I, CSE, IIT Kanpur. Content influenced by many excellent references, see References slide for acknowledgements. Copyright Information: "The instructor of this course owns the copyright of all the course materials. This lecture material was distributed only to the students attending the course CS 610: Programming for Performance of IIT Kanpur, and should not be distributed in print or through electronic media without the consent of the instructor. Students can make their own copies of the course materials for their use." https://www.iitk.ac.in/doaa/data/FAQ-2020-21-I.pdf Enhancing Program Performance. Fundamental issues: • Adequate fine-grained parallelism • Exploit vector instruction sets (SSE, AVX, AVX-512) • Multiple pipelined functional units in each core • Adequate parallelism for SMP-type systems • Keep multiple asynchronous processors busy with work • Minimize cost of memory accesses. Role of a Good Compiler: try to extract performance automatically; optimize memory access latency • Code restructuring optimizations • Prefetching optimizations • Data layout optimizations • Code layout optimizations. Loop Optimizations • Loops are one of the most commonly used constructs in HPC programs • The compiler performs many loop optimization techniques automatically • In some cases source code modifications enhance the optimizer's ability to transform code. Reordering Transformations • A reordering
  • An Architecture-Compiler Co-Design for Automatic Parallelization of Irregular Programs
HELIX-RC: An Architecture-Compiler Co-Design for Automatic Parallelization of Irregular Programs. Simone Campanoni, Kevin Brownell, Svilen Kanev, Timothy M. Jones+, Gu-Yeon Wei, David Brooks. Harvard University, Cambridge, MA, USA; +University of Cambridge, Cambridge, UK. Abstract. Data dependences in sequential programs limit parallelization because extracted threads cannot run independently. Although thread-level speculation can avoid the need for precise dependence analysis, communication overheads required to synchronize actual dependences counteract the benefits of parallelization. To address these challenges, we propose a lightweight architectural enhancement co-designed with a parallelizing compiler, which together can decouple communication from thread execution. Simulations of these approaches, applied to a processor with 16 Intel Atom-like cores, show an average of 6.85× performance speedup for six SPEC CINT2000 benchmarks. 1. Introduction. ... overheads to support misspeculation and must therefore target relatively large loops to amortize penalties. In contrast to existing parallelization solutions, we propose an alternate strategy that instead targets small loops, which are much easier to analyze via state-of-the-art control and data flow analysis, significantly improving accuracy. Furthermore, this ease of analysis enables transformations that simply re-compute shared variables in order to remove a large fraction of actual dependences. This strategy increases TLP and reduces core-to-core communication. Such optimizations do not readily translate to TLS because the complexity of TLS-targeted code typically spans multiple procedures in larger loops. Finally, our data shows parallelizing small hot loops yields high program coverage and produces meaningful speedups for the non-numerical programs in the SPEC CPU2000 suite.
  • Code Optimization Based on Source to Source Transformations Using Profile Guided Metrics Youenn Lebras
Code optimization based on source to source transformations using profile guided metrics. Youenn Lebras. To cite this version: Youenn Lebras. Code optimization based on source to source transformations using profile guided metrics. Automatic Control Engineering. Université Paris-Saclay, 2019. English. NNT: 2019SACLV037. tel-02443231. HAL Id: tel-02443231, https://tel.archives-ouvertes.fr/tel-02443231. Submitted on 17 Jan 2020. HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. Code optimization based on source to source transformations using profile guided metrics. Doctoral thesis of the Université Paris-Saclay, prepared at the Université de Versailles-Saint-Quentin-en-Yvelines, doctoral school no. 580, Information and Communication Sciences and Technologies (STIC). Doctoral specialty: "Programming: models, algorithms, languages, architecture". Thesis presented and defended at Versailles on 3 July 2019 by Youenn Lebras. Composition of the jury: Anthony Scemama, IR CNRS, HdR, Université de Toulouse, Reviewer; Angelo Steffenel, McF HdR, Université de Reims Champagne-Ardenne, Reviewer; Michel Masella, CEA DRF, HdR, Saclay, Examiner; Denis Barthou, PR, Université de Bordeaux, President; Sophie Robert, McF, Université d'Orléans, Examiner; William Jalby, PR, Université de Versailles Saint-Quentin, Thesis advisor; Andres S.
  • Compilers Optimization
Compilers Optimization. Yannis Smaragdakis, U. Athens (original slides by Sam Guyer @ Tufts). What we already saw: Lowering, from language-level constructs to machine-level constructs; at this point we could generate machine code; the output of lowering is a correct translation. What's left to do? Map from the lower-level IR to the actual ISA; maybe some register management (could be required); pass off to the assembler. Why have a separate assembler? It handles "packing the bits". Assembly: addi <target>, <source>, <value>; machine encoding: 0010 00ss ssst tttt iiii iiii iiii iiii. But first: the compiler "understands" the program; the IR captures program semantics; lowering is a semantics-preserving transformation. Why not do others? Compiler optimizations. Oh great, now my program will be optimal! Sorry, it's a misnomer. What is an "optimization"? Code transformations that improve some metric. Metrics: performance (time, instructions, cycles; are these metrics equivalent?); memory (memory hierarchy, i.e., reduce cache misses; reduce memory usage); code size; energy. (A small before/after sketch of such a transformation follows below.) Big picture [compiler pipeline diagram]: chars → Scanner → tokens → Parser → AST → semantic checker → checked AST → IR lowering → tuples → Optimizations → instruction selection → assembly (∞ registers) → register allocation → assembly (k registers) → Assembler → machine code; errors are reported by the scanner, parser, and semantic checker.
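As a concrete, hand-written illustration of a code transformation that improves the metrics named above without changing program meaning (our own sketch, not from the slides; the function names are invented), loop-invariant code motion hoists a computation out of a loop so that fewer instructions execute per iteration:

    /* Before: scale * 4 is recomputed on every iteration.               */
    void add_scaled_before(int n, int scale, int *x) {
        for (int i = 0; i < n; i++)
            x[i] = x[i] + scale * 4;
    }

    /* After: the loop-invariant expression is hoisted out of the loop
     * (loop-invariant code motion); the result is identical but each
     * iteration does less work.                                         */
    void add_scaled_after(int n, int scale, int *x) {
        int k = scale * 4;          /* hoisted invariant */
        for (int i = 0; i < n; i++)
            x[i] = x[i] + k;
    }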