Approximate Loop Unrolling Marcelino Rodriguez-Cancio, Benoit Combemale, Benoit Baudry


To cite this version: Marcelino Rodriguez-Cancio, Benoit Combemale, Benoit Baudry. Approximate Loop Unrolling. CF 2019 - ACM International Conference on Computing Frontiers, Apr 2019, Alghero, Sardinia, Italy. pp.94-105, 10.1145/3310273.3323841. hal-02407868

HAL Id: hal-02407868
https://hal.inria.fr/hal-02407868
Submitted on 12 Dec 2019

Marcelino Rodriguez-Cancio, Vanderbilt University, USA
Benoit Combemale, University of Toulouse / Inria, France
Benoit Baudry, KTH, Sweden

Abstract

We introduce Approximate Unrolling, a compiler loop optimization that reduces execution time and energy consumption, exploiting code regions that can endure some approximation and still produce acceptable results. Specifically, this work focuses on counted loops that map a function over the elements of an array. Approximate Unrolling transforms loops similarly to Loop Unrolling. However, unlike its exact counterpart, our optimization does not unroll loops by adding exact copies of the loop's body. Instead, it adds code that interpolates the results of previous iterations.

CCS Concepts • Software and its engineering → Just-in-time compilers.

Keywords approximate computing; compiler optimizations

ACM Reference Format: Marcelino Rodriguez-Cancio, Benoit Combemale, and Benoit Baudry. 2019. Approximate Loop Unrolling. In Proceedings of the 16th conference on Computing Frontiers (CF '19), April 30-May 2, 2019, Alghero, Italy. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3310273.3323841

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. © 2019 Association for Computing Machinery. ACM ISBN 978-1-4503-6685-4/19/05...$15.00

1 Introduction

Approximate computing exploits the fact that some computations can be made less precise and still produce good results [25]. Previous works in the field have shown that when small reductions in accuracy are acceptable, significant improvements in execution times and energy consumption can be achieved [40]. Opportunities for approximation exist across the whole stack, and previous works have explored this concept in both hardware [6, 33, 36] and software [3, 28, 30, 34].

In this work, we describe Approximate Unrolling, a novel compiler loop optimization that uses the ideas of approximate computing to reduce execution times and energy consumption of loops. The key motivation of Approximate Unrolling relies on the following observation: data such as time series, sound, video, and images are frequently represented as arrays where contiguous slots contain similar values. As a consequence of this neighboring similarity, computations producing such data are usually locally smooth functions. Our technique exploits this observation: it searches for loops where functions are mapped to contiguous array slots in each iteration. If the function's computation is expensive, we substitute it with less costly interpolations of the values assigned to nearby array slots. In exchange for losses in accuracy, we obtain a faster and less energy-consuming loop.

Approximate Unrolling combines static analysis and code transformations in order to improve performance in exchange for small accuracy losses. Static analysis has two objectives: to determine whether the loop's structure fits our transformations (counted loops mapping computations to consecutive array slots), and to estimate whether the transformation could actually provide some performance improvement (the computation mapped to the array elements is expensive). The code is transformed using one of two possible strategies. The first, which we call Nearest Neighbor (NN), approximates the value of one array slot by copying the value of the preceding slot. The second, called Linear Interpolation (LI), transforms the code so that the value of a slot is the average of the previous and next slots. A key novel aspect of these static transformations is that they ensure accuracy for the non-approximated iterations of the loop, preventing errors from accumulating. This is a unique feature that is not found in other approximate loop optimization techniques.

We have implemented a prototype of Approximate Unrolling inside the OpenJDK C2 compiler. This design choice prevents phase ordering problems and allows us to reuse internal representations that we need for our analysis. An implementation in a production-ready compiler is also key to assessing the relevance of our optimization on already highly optimized code, such as the code produced by the JVM.

We experimented with this implementation using four real-world Java libraries. The objectives of our experiments are to determine whether Approximate Unrolling is able to provide a good performance vs. accuracy trade-off and to compare Approximate Unrolling with Loop Perforation [34], the state-of-the-art approximate computing technique for loop optimization.

Our results show that Approximate Unrolling is able to reduce the execution time and CPU energy consumption of the generated x86 code by around 50% to 110% while keeping accuracy at acceptable levels. Compared with Loop Perforation, Approximate Unrolling preserved accuracy better in 76% of the cases and raised fewer fatal exceptions than Loop Perforation (2 vs. 7).

The contributions of this work are: (i) Approximate Unrolling, an approximate loop optimization. (ii) An efficient implementation of Approximate Unrolling inside the OpenJDK Hotspot VM. (iii) An empirical assessment of the effectiveness of Approximate Unrolling to trade accuracy for time and energy savings. (iv) An empirical demonstration of Approximate Unrolling's novel contribution to the state of the art, as an effective complement to Loop Perforation.

2 Approximate Unrolling

In this section we describe the main contribution of our paper. We use the example of Listing 1 to illustrate the kind of loops targeted by the optimization as well as the process for the approximate transformation.

    for (int i = 0; i < N; i++) {
        phase = phase + phaseInc;
        double v = Math.sin(phase) * ampl;
        A[i] = v;
    }

Listing 1. Illustrative Example: A loop mapping a sine wave into an array.

    for (int i = 0; i < N; i += 2) {
        phase = phase + phaseInc;
        double v = Math.sin(phase) * ampl;
        A[i] = v;
        // UNROLLED ITERATION:
        phase = phase + phaseInc;
        double v1 = Math.sin(phase) * ampl;
        A[i + 1] = v1;
    }

Listing 2. Approximate Unrolling first unrolls the loop as Loop Unrolling would normally do. In this listing, the Illustrative Example is unrolled using Loop Unrolling.

Approximate Unrolling replaces the unrolled iteration's original instructions with others that interpolate results. We perform this transformation in such a way that errors do not accumulate as the loop runs. We currently implement two interpolation strategies: Linear Interpolation (LI) and Nearest Neighbor (NN). A policy determines which strategy, if any, is likely to optimize each loop. In the following paragraphs, we discuss each step.

Like normal unrolling, Approximate Unrolling can unroll more than one loop iteration. For simplicity, we only show the transformation where 1 iteration out of 2 is unrolled/approximated.

    int i;
    for (i = 0; i < N - 1; i += 2) {
        phase = phase + phaseInc;
        double v = Math.sin(phase) * ampl;
        A[i] = v;
        phase = phase + phaseInc;
        // v1 = Math.sin(phase) * ampl; REMOVED
        A[i + 1] = A[i];
    }
    // Guard Loop
    for (int j = i; j < N; j++) {
        phase = phase + phaseInc;
        double v = Math.sin(phase) * ampl;
        A[j] = v;
    }

Listing 3. After unrolling, our optimization approximates the unrolled iteration. The loop of Listing 2, transformed using nearest neighbor interpolation.

    // Initial iteration peeled:
    phase = phase + phaseInc;
    double v0 = Math.sin(phase) * ampl;
    A[0] = v0;
    int i = 1;
    if (N > 2)
        for (i = 2; i < N - 1; i += 2) {
            phase = phase + phaseInc;
            // v = Math.sin(phase) * ampl; REMOVED
            phase = phase + phaseInc; // Exact iteration:
            double v = Math.sin(phase) * ampl;
            A[i] = v;
            // Approximate iteration:
            A[i - 1] = (A[i] + A[i - 2]) * 0.5f;
        }
    for (int j = i; j < N; j++) { // Guard Loop
        phase = phase + phaseInc;
        double v = Math.sin(phase) * ampl;
        A[j] = v;
    }

Listing 4. The loop of Listing 2 transformed by Approximate Unrolling using Linear Interpolation.

Step 1. [...] body, and (iii) the indexing expression of the array assignment is value-dependent on the loop's induction variable. The example in Listing 1 is indeed a 'for' loop where (i) the induction variable 'i' is incremented by a constant stride, i.e. 'i++'; (ii) 'A[i]' is an array assignment inside the loop's body; and (iii) the indexing expression of 'A[i]' is value-dependent on the induction variable 'i'.

Step 2. Approximation Policy: In compiler optimization, it is often possible to predict whether an optimization will have a beneficial impact on speed. This estimate is called a policy in C2's terms. Approximate Unrolling [...]
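
The two interpolation strategies described in this section can be tried out as ordinary Java code. The following is a minimal sketch, not the C2 implementation: the class and method names are hypothetical, the phase is assumed to start at 0, and the index bookkeeping (which slot is exact, where the guard loop starts) is our own choice made so that every slot is written exactly once for any array length. It applies the exact loop, the Nearest Neighbor variant, and the Linear Interpolation variant to the sine-wave example and reports the largest deviation from the exact result.

```java
// Sketch only: names are hypothetical and not taken from the paper or from C2.
final class ApproxUnrollSketch {

    // Exact loop (as in Listing 1): one Math.sin call per array slot.
    static double[] exact(int n, double phaseInc, double ampl) {
        double[] a = new double[n];
        double phase = 0.0; // assumption: initial phase is 0
        for (int i = 0; i < n; i++) {
            phase = phase + phaseInc;
            a[i] = Math.sin(phase) * ampl;
        }
        return a;
    }

    // Nearest Neighbor: even slots are exact, each odd slot copies its predecessor.
    static double[] nearestNeighbor(int n, double phaseInc, double ampl) {
        double[] a = new double[n];
        double phase = 0.0;
        int i = 0;
        for (; i + 1 < n; i += 2) {
            phase = phase + phaseInc;
            a[i] = Math.sin(phase) * ampl; // exact iteration
            phase = phase + phaseInc;      // phase still advances, Math.sin is skipped
            a[i + 1] = a[i];               // approximate iteration
        }
        for (; i < n; i++) {               // guard loop for a leftover slot
            phase = phase + phaseInc;
            a[i] = Math.sin(phase) * ampl;
        }
        return a;
    }

    // Linear Interpolation: slot 0 is peeled; each approximated slot i is the
    // mean of the exact slots i - 1 and i + 1.
    static double[] linearInterpolation(int n, double phaseInc, double ampl) {
        double[] a = new double[n];
        if (n == 0) {
            return a;
        }
        double phase = phaseInc;
        a[0] = Math.sin(phase) * ampl;     // peeled initial iteration
        int i = 1;                         // i is the next slot to approximate
        for (; i + 1 < n; i += 2) {
            phase = phase + phaseInc;      // skipped slot: phase advances only
            phase = phase + phaseInc;
            a[i + 1] = Math.sin(phase) * ampl;   // exact iteration
            a[i] = (a[i + 1] + a[i - 1]) * 0.5;  // approximate iteration
        }
        for (; i < n; i++) {               // guard loop for a leftover slot
            phase = phase + phaseInc;
            a[i] = Math.sin(phase) * ampl;
        }
        return a;
    }

    // Largest absolute difference between two arrays of equal length.
    static double maxAbsError(double[] expected, double[] actual) {
        double max = 0.0;
        for (int i = 0; i < expected.length; i++) {
            max = Math.max(max, Math.abs(expected[i] - actual[i]));
        }
        return max;
    }

    public static void main(String[] args) {
        int n = 1000;
        double phaseInc = 0.1;
        double ampl = 1.0;
        double[] reference = exact(n, phaseInc, ampl);
        System.out.println("NN max error: "
                + maxAbsError(reference, nearestNeighbor(n, phaseInc, ampl)));
        System.out.println("LI max error: "
                + maxAbsError(reference, linearInterpolation(n, phaseInc, ampl)));
    }
}
```

Because the phase is advanced in every iteration, including the approximated ones, the exact slots hold the same values as in the original loop. This is one way to see why the error of an approximated slot does not carry over into later iterations.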