Chapter 10 Scalar Optimizations
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
CSE 582 – Compilers
CSE P 501 – Compilers Optimizing Transformations Hal Perkins Autumn 2011 11/8/2011 © 2002-11 Hal Perkins & UW CSE S-1 Agenda A sampler of typical optimizing transformations Mostly a teaser for later, particularly once we’ve looked at analyzing loops 11/8/2011 © 2002-11 Hal Perkins & UW CSE S-2 Role of Transformations Data-flow analysis discovers opportunities for code improvement Compiler must rewrite the code (IR) to realize these improvements A transformation may reveal additional opportunities for further analysis & transformation May also block opportunities by obscuring information 11/8/2011 © 2002-11 Hal Perkins & UW CSE S-3 Organizing Transformations in a Compiler Typically middle end consists of many individual transformations that filter the IR and produce rewritten IR No formal theory for order to apply them Some rules of thumb and best practices Some transformations can be profitably applied repeatedly, particularly if others transformations expose more opportunities 11/8/2011 © 2002-11 Hal Perkins & UW CSE S-4 A Taxonomy Machine Independent Transformations Realized profitability may actually depend on machine architecture, but are typically implemented without considering this Machine Dependent Transformations Most of the machine dependent code is in instruction selection & scheduling and register allocation Some machine dependent code belongs in the optimizer 11/8/2011 © 2002-11 Hal Perkins & UW CSE S-5 Machine Independent Transformations Dead code elimination Code motion Specialization Strength reduction -
The LLVM Instruction Set and Compilation Strategy
The LLVM Instruction Set and Compilation Strategy Chris Lattner Vikram Adve University of Illinois at Urbana-Champaign lattner,vadve ¡ @cs.uiuc.edu Abstract This document introduces the LLVM compiler infrastructure and instruction set, a simple approach that enables sophisticated code transformations at link time, runtime, and in the field. It is a pragmatic approach to compilation, interfering with programmers and tools as little as possible, while still retaining extensive high-level information from source-level compilers for later stages of an application’s lifetime. We describe the LLVM instruction set, the design of the LLVM system, and some of its key components. 1 Introduction Modern programming languages and software practices aim to support more reliable, flexible, and powerful software applications, increase programmer productivity, and provide higher level semantic information to the compiler. Un- fortunately, traditional approaches to compilation either fail to extract sufficient performance from the program (by not using interprocedural analysis or profile information) or interfere with the build process substantially (by requiring build scripts to be modified for either profiling or interprocedural optimization). Furthermore, they do not support optimization either at runtime or after an application has been installed at an end-user’s site, when the most relevant information about actual usage patterns would be available. The LLVM Compilation Strategy is designed to enable effective multi-stage optimization (at compile-time, link-time, runtime, and offline) and more effective profile-driven optimization, and to do so without changes to the traditional build process or programmer intervention. LLVM (Low Level Virtual Machine) is a compilation strategy that uses a low-level virtual instruction set with rich type information as a common code representation for all phases of compilation. -
ICS803 Elective – III Multicore Architecture Teacher Name: Ms
ICS803 Elective – III Multicore Architecture Teacher Name: Ms. Raksha Pandey Course Structure L T P 3 1 0 4 Prerequisite: Course Content: Unit-I: Multi-core Architectures Introduction to multi-core architectures, issues involved into writing code for multi-core architectures, Virtual Memory, VM addressing, VA to PA translation, Page fault, TLB- Parallel computers, Instruction level parallelism (ILP) vs. thread level parallelism (TLP), Performance issues, OpenMP and other message passing libraries, threads, mutex etc. Unit-II: Multi-threaded Architectures Brief introduction to cache hierarchy - Caches: Addressing a Cache, Cache Hierarchy, States of Cache line, Inclusion policy, TLB access, Memory Op latency, MLP, Memory Wall, communication latency, Shared memory multiprocessors, General architectures and the problem of cache coherence, Synchronization primitives: Atomic primitives; locks: TTS, ticket, array; barriers: central and tree; performance implications in shared memory programs; Chip multiprocessors: Why CMP (Moore's law, wire delay); shared L2 vs. tiled CMP; core complexity; power/performance; Snoopy coherence: invalidate vs. update, MSI, MESI, MOESI, MOSI; performance trade-offs; pipelined snoopy bus design; Memory consistency models: SC, PC, TSO, PSO, WO/WC, RC; Chip multiprocessor case studies: Intel Montecito and dual-core, Pentium4, IBM Power4, Sun Niagara Unit-III: Compiler Optimization Issues Code optimizations: Copy Propagation, dead Code elimination , Loop Optimizations-Loop Unrolling, Induction variable Simplification, Loop Jamming, Loop Unswitching, Techniques to improve detection of parallelism: Scalar Processors, Special locality, Temporal locality, Vector machine, Strip mining, Shared memory model, SIMD architecture, Dopar loop, Dosingle loop. Unit-IV: Control Flow analysis Control flow analysis, Flow graph, Loops in Flow graphs, Loop Detection, Approaches to Control Flow Analysis, Reducible Flow Graphs, Node Splitting. -
Analysis of Program Optimization Possibilities and Further Development
View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Elsevier - Publisher Connector Theoretical Computer Science 90 (1991) 17-36 17 Elsevier Analysis of program optimization possibilities and further development I.V. Pottosin Institute of Informatics Systems, Siberian Division of the USSR Acad. Sci., 630090 Novosibirsk, USSR 1. Introduction Program optimization has appeared in the framework of program compilation and includes special techniques and methods used in compiler construction to obtain a rather efficient object code. These techniques and methods constituted in the past and constitute now an essential part of the so called optimizing compilers whose goal is to produce an object code in run time, saving such computer resources as CPU time and memory. For contemporary supercomputers, the requirement of the proper use of hardware peculiarities is added. In our opinion, the existence of a sufficiently large number of optimizing compilers with real possibilities of producing “good” object code has evidently proved to be practically significant for program optimization. The methods and approaches that have accumulated in program optimization research seem to be no lesser valuable, since they may be successfully used in the general techniques of program construc- tion, i.e. in program synthesis, program transformation and at different steps of program development. At the same time, there has been criticism on program optimization and even doubts about its necessity. On the one hand, this viewpoint is based on blind faith in the new possibilities bf computers which will be so powerful that any necessity to worry about program efficiency will disappear. -
Fast-Path Loop Unrolling of Non-Counted Loops to Enable Subsequent Compiler Optimizations∗
Fast-Path Loop Unrolling of Non-Counted Loops to Enable Subsequent Compiler Optimizations∗ David Leopoldseder Roland Schatz Lukas Stadler Johannes Kepler University Linz Oracle Labs Oracle Labs Austria Linz, Austria Linz, Austria [email protected] [email protected] [email protected] Manuel Rigger Thomas Würthinger Hanspeter Mössenböck Johannes Kepler University Linz Oracle Labs Johannes Kepler University Linz Austria Zurich, Switzerland Austria [email protected] [email protected] [email protected] ABSTRACT Non-Counted Loops to Enable Subsequent Compiler Optimizations. In 15th Java programs can contain non-counted loops, that is, loops for International Conference on Managed Languages & Runtimes (ManLang’18), September 12–14, 2018, Linz, Austria. ACM, New York, NY, USA, 13 pages. which the iteration count can neither be determined at compile https://doi.org/10.1145/3237009.3237013 time nor at run time. State-of-the-art compilers do not aggressively optimize them, since unrolling non-counted loops often involves 1 INTRODUCTION duplicating also a loop’s exit condition, which thus only improves run-time performance if subsequent compiler optimizations can Generating fast machine code for loops depends on a selective appli- optimize the unrolled code. cation of different loop optimizations on the main optimizable parts This paper presents an unrolling approach for non-counted loops of a loop: the loop’s exit condition(s), the back edges and the loop that uses simulation at run time to determine whether unrolling body. All these parts must be optimized to generate optimal code such loops enables subsequent compiler optimizations. Simulat- for a loop. -
Compiler Optimization for Configurable Accelerators Betul Buyukkurt Zhi Guo Walid A
Compiler Optimization for Configurable Accelerators Betul Buyukkurt Zhi Guo Walid A. Najjar University of California Riverside University of California Riverside University of California Riverside Computer Science Department Electrical Engineering Department Computer Science Department Riverside, CA 92521 Riverside, CA 92521 Riverside, CA 92521 [email protected] [email protected] [email protected] ABSTRACT such as constant propagation, constant folding, dead code ROCCC (Riverside Optimizing Configurable Computing eliminations that enables other optimizations. Then, loop Compiler) is an optimizing C to HDL compiler targeting level optimizations such as loop unrolling, loop invariant FPGA and CSOC (Configurable System On a Chip) code motion, loop fusion and other such transforms are architectures. ROCCC system is built on the SUIF- applied. MACHSUIF compiler infrastructure. Our system first This paper is organized as follows. Next section discusses identifies frequently executed kernel loops inside programs on the related work. Section 3 gives an in depth description and then compiles them to VHDL after optimizing the of the ROCCC system. SUIF2 level optimizations are kernels to make best use of FPGA resources. This paper described in Section 4 followed by the compilation done at presents an overview of the ROCCC project as well as the MACHSUIF level in Section 5. Section 6 mentions our optimizations performed inside the ROCCC compiler. early results. Finally Section 7 concludes the paper. 1. INTRODUCTION 2. RELATED WORK FPGAs (Field Programmable Gate Array) can boost There are several projects and publications on translating C software performance due to the large amount of or other high level languages to different HDLs. Streams- parallelism they offer. -
Register Allocation and Method Inlining
Combining Offline and Online Optimizations: Register Allocation and Method Inlining Hiroshi Yamauchi and Jan Vitek Department of Computer Sciences, Purdue University, {yamauchi,jv}@cs.purdue.edu Abstract. Fast dynamic compilers trade code quality for short com- pilation time in order to balance application performance and startup time. This paper investigates the interplay of two of the most effective optimizations, register allocation and method inlining for such compil- ers. We present a bytecode representation which supports offline global register allocation, is suitable for fast code generation and verification, and yet is backward compatible with standard Java bytecode. 1 Introduction Programming environments which support dynamic loading of platform-indep- endent code must provide supports for efficient execution and find a good balance between responsiveness (shorter delays due to compilation) and performance (optimized compilation). Thus, most commercial Java virtual machines (JVM) include several execution engines. Typically, there is an interpreter or a fast com- piler for initial executions of all code, and a profile-guided optimizing compiler for performance-critical code. Improving the quality of the code of a fast compiler has the following bene- fits. It raises the performance of short-running and medium length applications which exit before the expensive optimizing compiler fully kicks in. It also ben- efits long-running applications with improved startup performance and respon- siveness (due to less eager optimizing compilation). One way to achieve this is to shift some of the burden to an offline compiler. The main question is what optimizations are profitable when performed offline and are either guaranteed to be safe or can be easily validated by a fast compiler. -
Generalizing Loop-Invariant Code Motion in a Real-World Compiler
Imperial College London Department of Computing Generalizing loop-invariant code motion in a real-world compiler Author: Supervisor: Paul Colea Fabio Luporini Co-supervisor: Prof. Paul H. J. Kelly MEng Computing Individual Project June 2015 Abstract Motivated by the perpetual goal of automatically generating efficient code from high-level programming abstractions, compiler optimization has developed into an area of intense research. Apart from general-purpose transformations which are applicable to all or most programs, many highly domain-specific optimizations have also been developed. In this project, we extend such a domain-specific compiler optimization, initially described and implemented in the context of finite element analysis, to one that is suitable for arbitrary applications. Our optimization is a generalization of loop-invariant code motion, a technique which moves invariant statements out of program loops. The novelty of the transformation is due to its ability to avoid more redundant recomputation than normal code motion, at the cost of additional storage space. This project provides a theoretical description of the above technique which is fit for general programs, together with an implementation in LLVM, one of the most successful open-source compiler frameworks. We introduce a simple heuristic-driven profitability model which manages to successfully safeguard against potential performance regressions, at the cost of missing some speedup opportunities. We evaluate the functional correctness of our implementation using the comprehensive LLVM test suite, passing all of its 497 whole program tests. The results of our performance evaluation using the same set of tests reveal that generalized code motion is applicable to many programs, but that consistent performance gains depend on an accurate cost model. -
Efficient Symbolic Analysis for Optimizing Compilers*
Efficient Symbolic Analysis for Optimizing Compilers? Robert A. van Engelen Dept. of Computer Science, Florida State University, Tallahassee, FL 32306-4530 [email protected] Abstract. Because most of the execution time of a program is typically spend in loops, loop optimization is the main target of optimizing and re- structuring compilers. An accurate determination of induction variables and dependencies in loops is of paramount importance to many loop opti- mization and parallelization techniques, such as generalized loop strength reduction, loop parallelization by induction variable substitution, and loop-invariant expression elimination. In this paper we present a new method for induction variable recognition. Existing methods are either ad-hoc and not powerful enough to recognize some types of induction variables, or existing methods are powerful but not safe. The most pow- erful method known is the symbolic differencing method as demonstrated by the Parafrase-2 compiler on parallelizing the Perfect Benchmarks(R). However, symbolic differencing is inherently unsafe and a compiler that uses this method may produce incorrectly transformed programs without issuing a warning. In contrast, our method is safe, simpler to implement in a compiler, better adaptable for controlling loop transformations, and recognizes a larger class of induction variables. 1 Introduction It is well known that the optimization and parallelization of scientific applica- tions by restructuring compilers requires extensive analysis of induction vari- ables and dependencies -
The Effect of Code Expanding Optimizations on Instruction Cache Design
May 1991 UILU-EN G-91-2227 CRHC-91-17 Center for Reliable and High-Performance Computing THE EFFECT OF CODE EXPANDING OPTIMIZATIONS ON INSTRUCTION CACHE DESIGN William Y. Chen Pohua P. Chang Thomas M. Conte Wen-mei W. Hwu Coordinated Science Laboratory College of Engineering UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN Approved for Public Release. Distribution Unlimited. UNCLASSIFIED____________ SECURITY ¿LASSIPlOvriON OF t h is PAGE REPORT DOCUMENTATION PAGE 1a. REPORT SECURITY CLASSIFICATION 1b. RESTRICTIVE MARKINGS Unclassified None 2a. SECURITY CLASSIFICATION AUTHORITY 3 DISTRIBUTION/AVAILABILITY OF REPORT 2b. DECLASSIFICATION/DOWNGRADING SCHEDULE Approved for public release; distribution unlimited 4. PERFORMING ORGANIZATION REPORT NUMBER(S) 5. MONITORING ORGANIZATION REPORT NUMBER(S) UILU-ENG-91-2227 CRHC-91-17 6a. NAME OF PERFORMING ORGANIZATION 6b. OFFICE SYMBOL 7a. NAME OF MONITORING ORGANIZATION Coordinated Science Lab (If applicable) NCR, NSF, AMD, NASA University of Illinois N/A 6c ADDRESS (G'ty, Staff, and ZIP Code) 7b. ADDRESS (Oty, Staff, and ZIP Code) 1101 W. Springfield Avenue Dayton, OH 45420 Urbana, IL 61801 Washington DC 20550 Langley VA 20200 8a. NAME OF FUNDING/SPONSORING 8b. OFFICE SYMBOL ORGANIZATION 9. PROCUREMENT INSTRUMENT IDENTIFICATION NUMBER 7a (If applicable) N00014-91-J-1283 NASA NAG 1-613 8c. ADDRESS (City, State, and ZIP Cod*) 10. SOURCE OF FUNDING NUMBERS PROGRAM PROJECT t a sk WORK UNIT 7b ELEMENT NO. NO. NO. ACCESSION NO. The Effect of Code Expanding Optimizations on Instruction Cache Design 12. PERSONAL AUTHOR(S) Chen, William, Pohua Chang, Thomas Conte and Wen-Mei Hwu 13a. TYPE OF REPORT 13b. TIME COVERED 14. OATE OF REPORT (Year, Month, Day) Jl5. -
Functional Array Programming in Sac
Functional Array Programming in SaC Sven-Bodo Scholz Dept of Computer Science, University of Hertfordshire, United Kingdom [email protected] Abstract. These notes present an introduction into array-based pro- gramming from a functional, i.e., side-effect-free perspective. The first part focuses on promoting arrays as predominant, stateless data structure. This leads to a programming style that favors compo- sitions of generic array operations that manipulate entire arrays over specifications that are made in an element-wise fashion. An algebraicly consistent set of such operations is defined and several examples are given demonstrating the expressiveness of the proposed set of operations. The second part shows how such a set of array operations can be defined within the first-order functional array language SaC.Itdoes not only discuss the language design issues involved but it also tackles implementation issues that are crucial for achieving acceptable runtimes from such genericly specified array operations. 1 Introduction Traditionally, binary lists and algebraic data types are the predominant data structures in functional programming languages. They fit nicely into the frame- work of recursive program specifications, lazy evaluation and demand driven garbage collection. Support for arrays in most languages is confined to a very resricted set of basic functionality similar to that found in imperative languages. Even if some languages do support more advanced specificational means such as array comprehensions, these usualy do not provide the same genericity as can be found in array programming languages such as Apl,J,orNial. Besides these specificational issues, typically, there is also a performance issue. -
Loop Transformations and Parallelization
Loop transformations and parallelization Claude Tadonki LAL/CNRS/IN2P3 University of Paris-Sud [email protected] December 2010 Claude Tadonki Loop transformations and parallelization C. Tadonki – Loop transformations Introduction Most of the time, the most time consuming part of a program is on loops. Thus, loops optimization is critical in high performance computing. Depending on the target architecture, the goal of loops transformations are: improve data reuse and data locality efficient use of memory hierarchy reducing overheads associated with executing loops instructions pipeline maximize parallelism Loop transformations can be performed at different levels by the programmer, the compiler, or specialized tools. At high level, some well known transformations are commonly considered: loop interchange loop (node) splitting loop unswitching loop reversal loop fusion loop inversion loop skewing loop fission loop vectorization loop blocking loop unrolling loop parallelization Claude Tadonki Loop transformations and parallelization C. Tadonki – Loop transformations Dependence analysis Extract and analyze the dependencies of a computation from its polyhedral model is a fundamental step toward loop optimization or scheduling. Definition For a given variable V and given indexes I1, I2, if the computation of X(I1) requires the value of X(I2), then I1 ¡ I2 is called a dependence vector for variable V . Drawing all the dependence vectors within the computation polytope yields the so-called dependencies diagram. Example The dependence vectors are (1; 0); (0; 1); (¡1; 1). Claude Tadonki Loop transformations and parallelization C. Tadonki – Loop transformations Scheduling Definition The computation on the entire domain of a given loop can be performed following any valid schedule.A timing function tV for variable V yields a valid schedule if and only if t(x) > t(x ¡ d); 8d 2 DV ; (1) where DV is the set of all dependence vectors for variable V .