Improving Tiling, Reducing Compilation Time, and Extending the Scope of Polyhedral Compilation
Mohamed Riyadh Baghdadi


To cite this version: Mohamed Riyadh Baghdadi. Improving tiling, reducing compilation time, and extending the scope of polyhedral compilation. Programming Languages [cs.PL]. Université Pierre et Marie Curie - Paris VI, 2015. English. NNT: 2015PA066368. tel-01270558.

HAL Id: tel-01270558, https://tel.archives-ouvertes.fr/tel-01270558. Submitted on 8 Feb 2016. HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Doctoral thesis of Université Pierre et Marie Curie, École Doctorale Informatique, Télécommunications et Électronique de Paris. Presented by Riyadh Baghdadi, co-supervised by Albert Cohen and Sven Verdoolaege, defended on 25/09/2015 before a jury composed of:

Reviewers:
M. Cédric Bastoul, Professor, University of Strasbourg
M. Paul H. J. Kelly, Professor, Imperial College London

Examiners:
M. Stef Graillat, Professor, University Pierre et Marie Curie (Paris 6)
M. Cédric Bastoul, Professor, University of Strasbourg
M. Paul H. J. Kelly, Professor, Imperial College London
M. J. Ramanujam, Professor, Louisiana State University
M. Albert Cohen, Senior Research Scientist, INRIA, France
M. Sven Verdoolaege, Associate Researcher, Polly Labs and KU Leuven

Contents

List of Figures
List of Tables
Nomenclature

1 Introduction and Background
1.1 Introduction
1.2 Background and Notations
1.2.1 Maps and Sets
1.2.2 Additional Definitions
1.2.3 Polyhedral Representation of Programs
1.2.4 Data Locality
1.2.5 Loop Transformations
1.3 Outline

2 Tiling Code With Memory-Based Dependences
2.1 Introduction and Related Work
2.2 Sources of False Dependences
2.3 Motivating Example
2.4 Live Range Non-Interference
2.4.1 Live Ranges
2.4.2 Construction of Live Ranges
2.4.3 Live Range Non-Interference
2.4.4 Live Range Non-Interference and Tiling: a Relaxed Permutability Criterion
2.4.5 Illustrative Examples
2.4.6 Parallelism
2.5 Effect of 3AC on the Tilability of a Loop
2.6 Experimental Evaluation
2.7 Experimental Results
2.7.1 Evaluating the Effect of Ignoring False Dependences on Tiling
2.7.2 Performance Evaluation for Polybench-3AC
2.7.3 Reasons for Slowdowns
2.7.4 Performance Evaluation for Polybench-PRE
2.8 Conclusion and Future Work

3 Scheduling Large Programs
3.1 Introduction
3.2 Definitions
3.3 Example of Statement Clustering
3.4 Clustering Algorithm
3.5 Using Offline Clustering with the Pluto Affine Scheduling Algorithm
3.5.1 Correctness of Loop Transformations after Clustering
3.6 Clustering Heuristics
3.6.1 Clustering Decision Validity
3.6.2 SCC Clustering
3.6.3 Basic-Block Clustering
3.7 Experiments
3.8 Related Work
3.9 Conclusion and Future Work

4 The PENCIL Language
4.1 Introduction
4.2 PENCIL Language
4.2.1 Overview of PENCIL
4.2.2 PENCIL Definition as a Subset of C99
4.2.3 PENCIL Extensions to C99
4.3 Detailed Description
4.3.1 static, const and restrict
4.3.2 Description of the Memory Accesses of Functions
4.3.3 Function Attribute
4.3.4 Pragma Directives
4.3.5 Builtin Functions
4.4 Examples of PENCIL and non-PENCIL Code
4.4.1 BFS Example
4.4.2 Image Resizing Example
4.4.3 Recursive Data Structures and Recursive Function Calls
4.5 Polyhedral Compilation of PENCIL Code
4.6 Related Work

5 Evaluating PENCIL
5.1 Introduction
5.2 Benchmarks
5.2.1 Image Processing Benchmark Suite
5.2.2 Rodinia and SHOC Benchmark Suites
5.2.3 VOBLA DSL for Linear Algebra
5.2.4 SpearDE DSL for Data-Streaming Applications
5.3 Discussion of the Results
5.4 Conclusion and Future Work

6 Conclusions and Perspectives
6.1 Contributions
6.2 Future Directions

Bibliography
A EBNF Grammar for PENCIL
B Personal Publications
Index

List of Figures

1.1 Graphical representation of a set
1.2 Graphical representation of a map
1.3 Example of a kernel
1.4 Iteration domain of …
2.1 A code with false dependences that prevent tiling
2.2 Gemm kernel
2.3 Gemm kernel after applying PRE
2.4 Three nested loops with loop-invariant code
2.5 Three nested loops after loop-invariant code motion
2.6 Tiled version of the gemm kernel
2.7 One loop from the mvt kernel
2.8 Examples where {S1 → S2} and {S3 → S4} do not interfere
2.9 Examples where {S1 → S2} and {S3 → S4} interfere
2.10 An example illustrating the adjacency concept
2.11 Live ranges for … and … in mvt
2.12 Example where tiling is not possible
2.13 Live ranges for the example in Figure 2.12
2.14 Comparing the number of private copies of scalars and arrays created to enable parallelization, tiling or both
2.15 Original gesummv kernel
2.16 gesummv in 3AC form
2.17 Original gemm kernel
2.18 gemm after applying PRE
2.19 2mm-3AC (2mm in 3AC form)
2.20 Expanded version of 2mm-3AC
2.21 gesummv kernel after applying PRE
2.22 The effect of 3AC transformation on tiling
2.23 Speedup against non-tiled sequential code for Polybench-3AC
2.24 2mm-3AC optimized while false dependences are ignored
2.25 Optimizing the expanded version of 2mm-3AC
2.26 Speedup against non-tiled sequential code for Polybench-PRE
2.27 Innermost parallelism in the expanded version of trmm
3.1 Illustrative example for statement clustering
3.2 Original program dependence graph
3.3 Dependence graph after clustering
3.4 Flow dependence graph
3.5 Illustrative example for basic-block clustering
3.6 Speedup in scheduling time obtained after applying statement clustering
3.7 Execution time when clustering is enabled normalized to the execution time when clustering is not enabled (lower means faster)
4.1 High level overview of our vision for the use of PENCIL
4.2 General 2D convolution
4.3 Code extracted from Dilate (image processing)
4.4 Example of code containing a … statement
A.1 PENCIL syntax as an EBNF (continued over several pages)

List of Tables

2.1 A table showing which 3AC kernels were tiled
2.2 A table showing which PRE kernels were tiled
3.1 The factor of reduction in the number of statements and dependences after applying statement clustering
5.1 Statistics about patterns of non static-affine code in the image processing benchmark
5.2 Effect of enabling support for individual PENCIL features on the ability to generate code and on gains in speedups
5.3 Speedups of the OpenCL code generated by PPCG over OpenCV
5.4 Selected benchmarks from Rodinia and SHOC
5.5 PENCIL features that are useful for SHOC and Rodinia benchmarks
5.6 Speedups for the OpenCL code generated by PPCG for the selected Rodinia and SHOC benchmarks
5.7 Gains in performance obtained when support for the assume builtin is enabled in VOBLA
5.8 Speedups obtained with PPCG over highly optimized BLAS libraries

Nomenclature

Roman Symbols
3AC Three-Address Code
ABF Adaptive Beamformer
API Application Program Interface
AST Abstract Syntax Tree
BFS Breadth-First Search
C11 C Standard 2011
C99 C Standard 1999
CUDA Compute Unified Device Architecture
DSL Domain-Specific Language
EBNF Extended Backus-Naur Form
GCC GNU Compiler Collection
GFLOPS Giga FLoating-point Operations Per Second
GPU Graphics Processing Unit
HMPP Hybrid Multicore Parallel Programming
Recommended publications
  • Redundancy Elimination / Common Subexpression Elimination
Redundancy Elimination. Aim: eliminate redundant operations in dynamic execution. Why do they occur? Loop-invariant code (e.g., a constant assignment inside a loop), and the same expression computed more than once (e.g., addressing arithmetic). Value numbering is an example; it requires dataflow analysis. Related optimizations: constant subexpression elimination, loop-invariant code motion, partial redundancy elimination.

Common Subexpression Elimination: replace the recomputation of an expression by a use of a temporary which holds its value.

Before:
    (s1) y := a + b
    (s2) z := a + b

After:
    (s1)  temp := a + b
    (s1') y := temp
    (s2)  z := temp

When would this be illegal? Why is the temporary needed? And how does this differ from value numbering? Consider:

    (s1) read(i)
    (s2) j := i + 1
    (s3) k := i
    (s4) l := k + 1

Here i + 1 and k + 1 receive the same value number (since k = i), but no textual CSE applies.

Local CSE (within a basic block):

Before:
    (s1) c := a + b
    (s2) d := m & n
    (s3) e := a + b
    (s4) m := 5
    (s5) if (m & n) ...

After:
    (s1)  t1 := a + b
    (s1') c := t1
    (s2)  d := m & n
    (s3)  e := t1
    (s4)  m := 5
    (s5)  if (m & n) ...

That is 5 instructions, 4 operations and 7 variables before, versus 6 instructions, 3 operations and 8 variables after. Always better? Method: keep track of the expressions computed in the block whose operands have not changed value, e.g. in a CSE hash table containing (+, a, b) and (&, m, n). Note that (s5) cannot reuse the value computed at (s2), because m is redefined at (s4).

Global CSE example (assumes b is used later):

Before:
    i := j
    a := 4*i
    i := i + 1
    b := 4*i
    c := 4*i

After:
    i := j
    t := 4*i
    a := t
    i := i + 1
    t := 4*i
    b := t
    c := t

Global CSE: an expression e is available at entry to block B if, on every path p from Entry to B, there is an evaluation of e at some block B' on p whose operands are not redefined between B' and B.
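To make the rewrite concrete, here is a minimal compilable C sketch of the same local CSE, with hypothetical function and variable names; the rewrite is only legal because neither a nor b is redefined between the two evaluations of a + b.

    #include <stdio.h>

    /* Before CSE: a + b is computed twice in the same basic block. */
    static int before_cse(int a, int b, int m, int n) {
        int c = a + b;
        int d = m & n;
        int e = a + b;  /* redundant: a and b are unchanged since c was computed */
        return c + d + e;
    }

    /* After local CSE: the redundant computation is replaced by a temporary. */
    static int after_cse(int a, int b, int m, int n) {
        int t1 = a + b; /* single evaluation of the common subexpression */
        int c = t1;
        int d = m & n;
        int e = t1;     /* reuse instead of recomputation */
        return c + d + e;
    }

    int main(void) {
        /* Both versions compute the same result. */
        printf("%d %d\n", before_cse(1, 2, 12, 10), after_cse(1, 2, 12, 10));
        return 0;
    }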
  • Polyhedral Compilation As a Design Pattern for Compiler Construction
Polyhedral Compilation as a Design Pattern for Compiler Construction. PLISS, May 19-24, 2019.

Polyhedra? Example: tiles. (How many of you read "Design Patterns"?) Tiles are everywhere:

1. Hardware. Example: Google Cloud TPU (architectural scalability with tiling); Google Edge TPU (the edge-computing zoo).
2. Data layout. Example: the XLA compiler's tiled data layout, with repeated/hierarchical tiling, e.g. BF16 (bfloat16) on Cloud TPU (8x128 tiles, then 2x1).
3. Control flow.
4. Data flow.
5. Data parallelism.

Tiling in Halide, a meta-programming API and domain-specific language (DSL) for loop transformations and numerical computing kernels, used for image processing pipelines (https://halide-lang.org). A tiled schedule is obtained by strip-mining (a.k.a. split) and permuting (a.k.a. reorder); a vectorized schedule strip-mines and vectorizes the inner loop. Non-divisible bounds/extents are handled by strip-mining plus shifting left/up with redundant computation (also forward substitution/inlining of the operand).

TVM example (https://tvm.ai, also used for neural networks): a scan cell (RNN), scheduled to run on a CUDA device:

    m = tvm.var("m")
    n = tvm.var("n")
    X = tvm.placeholder((m, n), name="X")
    s_state = tvm.placeholder((m, n))
    s_init = tvm.compute((1, n), lambda _, i: X[0, i])
    s_do = tvm.compute((m, n), lambda t, i: s_state[t-1, i] + X[t, i])
    s_scan = tvm.scan(s_init, s_do, s_state, inputs=[X])
    s = tvm.create_schedule(s_scan.op)
    # Schedule to run the scan cell on a CUDA device
    block_x = tvm.thread_axis("blockIdx.x")
    thread_x = tvm.thread_axis("threadIdx.x")
    xo, xi = s[s_init].split(s_init.op.axis[1], factor=num_thread)
    s[s_init].bind(xo, block_x)
    s[s_init].bind(xi, thread_x)
    xo, xi = s[s_do].split(s_do.op.axis[1], factor=num_thread)
    s[s_do].bind(xo, block_x)
    s[s_do].bind(xi, thread_x)
    print(tvm.lower(s, [X, s_scan], simple_mode=True))

Tiling and beyond.
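Since tiling is the recurring theme of both this deck and the thesis above, a minimal C sketch of a tiled schedule may help; the transpose body, the 1024x1024 size and the tile size of 32 are illustrative assumptions (the tile size is chosen to divide N so the sketch needs no clean-up loops).

    #include <stdio.h>

    #define N 1024
    #define TILE 32 /* illustrative tile size; divides N */

    static double A[N][N], B[N][N];

    /* Original loop nest: reads B column-wise, with poor locality. */
    static void transpose(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                A[i][j] = B[j][i];
    }

    /* Tiled loop nest: strip-mine i and j into tile loops (ii, jj) and
     * point loops, then permute, so one TILE x TILE block of A and B is
     * finished before moving to the next block. */
    static void transpose_tiled(void) {
        for (int ii = 0; ii < N; ii += TILE)
            for (int jj = 0; jj < N; jj += TILE)
                for (int i = ii; i < ii + TILE; i++)
                    for (int j = jj; j < jj + TILE; j++)
                        A[i][j] = B[j][i];
    }

    int main(void) {
        B[3][5] = 1.0;
        transpose_tiled();
        printf("%g\n", A[5][3]); /* prints 1 */
        transpose();             /* same result, different traversal order */
        return 0;
    }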
  • The LLVM Instruction Set and Compilation Strategy
The LLVM Instruction Set and Compilation Strategy. Chris Lattner and Vikram Adve, University of Illinois at Urbana-Champaign, {lattner,vadve}@cs.uiuc.edu.

Abstract: This document introduces the LLVM compiler infrastructure and instruction set, a simple approach that enables sophisticated code transformations at link time, runtime, and in the field. It is a pragmatic approach to compilation, interfering with programmers and tools as little as possible, while still retaining extensive high-level information from source-level compilers for later stages of an application's lifetime. We describe the LLVM instruction set, the design of the LLVM system, and some of its key components.

1 Introduction. Modern programming languages and software practices aim to support more reliable, flexible, and powerful software applications, increase programmer productivity, and provide higher level semantic information to the compiler. Unfortunately, traditional approaches to compilation either fail to extract sufficient performance from the program (by not using interprocedural analysis or profile information) or interfere with the build process substantially (by requiring build scripts to be modified for either profiling or interprocedural optimization). Furthermore, they do not support optimization either at runtime or after an application has been installed at an end-user's site, when the most relevant information about actual usage patterns would be available. The LLVM Compilation Strategy is designed to enable effective multi-stage optimization (at compile-time, link-time, runtime, and offline) and more effective profile-driven optimization, and to do so without changes to the traditional build process or programmer intervention. LLVM (Low Level Virtual Machine) is a compilation strategy that uses a low-level virtual instruction set with rich type information as a common code representation for all phases of compilation.
  • Expression Rematerialization for VLIW DSP Processors with Distributed Register Files
Expression Rematerialization for VLIW DSP Processors with Distributed Register Files. Chung-Ju Wu, Chia-Han Lu, and Jenq-Kuen Lee, Department of Computer Science, National Tsing-Hua University, Hsinchu 30013, Taiwan. {jasonwu,chlu}@pllab.cs.nthu.edu.tw, [email protected].

Abstract: Spill code is the overhead of memory load/store behavior if the available registers are not sufficient to map live ranges during the process of register allocation. In the past, works have been proposed to reduce spill code for the unified register file. For reducing power and cost in the design of VLIW DSP processors, distributed register files and multi-bank register architectures are being adopted to eliminate the amount of read/write ports between functional units and registers. This presents new challenges for devising compiler optimization schemes for such architectures. This paper aims at addressing the issues of reducing spill code via rematerialization for a VLIW DSP processor with distributed register files. Rematerialization is a strategy for the register allocator to determine if it is cheaper to recompute a value than to use a memory load/store. In this paper, we propose a solution that exploits the characteristics of distributed register files, where there is a chance to balance or split live ranges. By heuristically estimating the register pressure for each register file, we treat the register files as optional spill locations rather than spilling to memory. The choice of spill location might preserve an expression result and keep the value alive in a different register file. This increases the possibility of doing expression rematerialization, which is effectively able to reduce spill code.
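A source-level C analogy of the allocator's choice may clarify the trade-off (hypothetical names; real rematerialization happens inside the register allocator, not in the source): when a value is needed again across a region of high register pressure, the allocator can park it in memory and reload it, or recompute it from operands that are still in registers, whichever is cheaper on the target.

    /* Spill-style: the value is parked in a memory slot across the
     * high-pressure region and reloaded afterwards. */
    static long with_spill(long base, long i, const long pressure[8]) {
        long t = base + 4 * i;  /* value computed once */
        volatile long slot = t; /* stand-in for a spill to the stack */
        long acc = 0;
        for (int k = 0; k < 8; k++) acc += pressure[k]; /* high pressure */
        return acc + slot;      /* stand-in for the reload */
    }

    /* Rematerialization-style: recompute base + 4*i from operands that
     * are still live, instead of paying a store/load pair. */
    static long with_remat(long base, long i, const long pressure[8]) {
        long acc = 0;
        for (int k = 0; k < 8; k++) acc += pressure[k]; /* high pressure */
        return acc + (base + 4 * i); /* recomputed, not reloaded */
    }

    int main(void) {
        const long p[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        return with_spill(100, 2, p) == with_remat(100, 2, p) ? 0 : 1;
    }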
  • Analysis of Program Optimization Possibilities and Further Development
Theoretical Computer Science 90 (1991) 17-36, Elsevier. Analysis of Program Optimization Possibilities and Further Development. I.V. Pottosin, Institute of Informatics Systems, Siberian Division of the USSR Acad. Sci., 630090 Novosibirsk, USSR.

1. Introduction. Program optimization has appeared in the framework of program compilation and includes special techniques and methods used in compiler construction to obtain a rather efficient object code. These techniques and methods constituted in the past, and constitute now, an essential part of the so-called optimizing compilers, whose goal is to produce an object code that is efficient at run time, saving such computer resources as CPU time and memory. For contemporary supercomputers, the requirement of the proper use of hardware peculiarities is added. In our opinion, the existence of a sufficiently large number of optimizing compilers with real possibilities of producing "good" object code has evidently proved to be practically significant for program optimization. The methods and approaches that have accumulated in program optimization research seem to be no less valuable, since they may be successfully used in the general techniques of program construction, i.e. in program synthesis, program transformation and at different steps of program development. At the same time, there has been criticism of program optimization and even doubts about its necessity. On the one hand, this viewpoint is based on blind faith in the new possibilities of computers, which will be so powerful that any necessity to worry about program efficiency will disappear.
  • User-Directed Loop-Transformations in Clang
User-Directed Loop-Transformations in Clang. Michael Kruse and Hal Finkel, Argonne Leadership Computing Facility, Argonne National Laboratory, Argonne, USA.

Abstract: Directives for the compiler such as pragmas can help programmers to separate an algorithm's semantics from its optimization. This keeps the code understandable and easier to optimize for different platforms. Simple transformations such as loop unrolling are already implemented in most mainstream compilers. We recently submitted a proposal to add generalized loop transformations to the OpenMP standard. We are also working on an implementation in LLVM/Clang/Polly to show its feasibility and usefulness. The current prototype allows applying patterns common to matrix-matrix multiplication optimizations.

Index Terms: OpenMP, Pragma, Loop Transformation, C/C++, Clang, LLVM, Polly

I. MOTIVATION. Almost all processor time is spent in some kind of loop. ... Only #pragma unroll has broad support. #pragma ivdep, made popular by icc and Cray to help vectorization, is mimicked by other compilers as well, but with different interpretations of its meaning. However, no compiler allows applying multiple transformations on a single loop systematically. In addition to straightforward trial-and-error execution-time optimization, code transformation pragmas can be useful for machine-learning assisted autotuning. The OpenMP approach is to make the programmer responsible for the semantic correctness of the transformation. This unfortunately makes it hard for an autotuner which only measures the timing difference without understanding the code. Such an autotuner would therefore likely suggest transformations that make the program return wrong results or crash.
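As a concrete illustration, the sketch below sticks to loop pragmas that mainstream Clang already accepts (#pragma clang loop with vectorize(enable) and unroll_count(4)); the generalized, composable transformations proposed in the paper go beyond this. The function and the fixed size N are hypothetical.

    #define N 1024

    /* User-directed optimization: ask Clang to vectorize the loop and
     * unroll it by a factor of 4, without touching the loop body. */
    void scale(float *restrict a, const float *restrict b, float s) {
    #pragma clang loop vectorize(enable) unroll_count(4)
        for (int i = 0; i < N; i++)
            a[i] = s * b[i];
    }

    int main(void) {
        static float a[N], b[N];
        scale(a, b, 2.0f);
        return 0;
    }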
  • Register Allocation and Method Inlining
Combining Offline and Online Optimizations: Register Allocation and Method Inlining. Hiroshi Yamauchi and Jan Vitek, Department of Computer Sciences, Purdue University, {yamauchi,jv}@cs.purdue.edu.

Abstract: Fast dynamic compilers trade code quality for short compilation time in order to balance application performance and startup time. This paper investigates the interplay of two of the most effective optimizations, register allocation and method inlining, for such compilers. We present a bytecode representation which supports offline global register allocation, is suitable for fast code generation and verification, and yet is backward compatible with standard Java bytecode.

1 Introduction. Programming environments which support dynamic loading of platform-independent code must provide support for efficient execution and find a good balance between responsiveness (shorter delays due to compilation) and performance (optimized compilation). Thus, most commercial Java virtual machines (JVM) include several execution engines. Typically, there is an interpreter or a fast compiler for initial executions of all code, and a profile-guided optimizing compiler for performance-critical code. Improving the quality of the code produced by a fast compiler has the following benefits. It raises the performance of short-running and medium-length applications which exit before the expensive optimizing compiler fully kicks in. It also benefits long-running applications with improved startup performance and responsiveness (due to less eager optimizing compilation). One way to achieve this is to shift some of the burden to an offline compiler. The main question is what optimizations are profitable when performed offline and are either guaranteed to be safe or can be easily validated by a fast compiler.
  • Elimination of Memory-Based Dependences for Loop-Nest Optimization and Parallelization
Elimination of Memory-Based Dependences for Loop-Nest Optimization and Parallelization: Evaluation of a Revised Violated Dependence Analysis Method on a Three-Address Code Polyhedral Compiler. Konrad Trifunovic (1), Albert Cohen (1), Razya Ladelsky (2), and Feng Li (1). (1) INRIA Saclay - Île-de-France and LRI, Paris-Sud 11 University, Orsay, France. (2) IBM Haifa Research, Haifa, Israel.

Abstract: In order to preserve the legality of loop nest transformations and parallelization, data dependences need to be analyzed. Memory dependences come in two varieties: they are either data-flow dependences or memory-based dependences. While data-flow dependences must be satisfied in order to preserve the correct order of computations, memory-based dependences are induced by the reuse of a single memory location to store multiple values. Memory-based dependences reduce the degrees of freedom for loop transformations and parallelization. While systematic array expansion techniques exist to remove all memory-based dependences, the overhead on memory footprint and the detrimental impact on register-level reuse can be catastrophic. Much care is needed when eliminating memory-based dependences, and this is particularly essential for polyhedral compilation on a three-address code representation like the GRAPHITE pass of GCC. We propose and evaluate a technique allowing a compiler to ignore some memory-based dependences while checking the legality of a given affine transformation. This technique does not involve any array expansion. When it is not sufficient to expose data parallelism, it can be used to compute the minimal set of scalars and arrays that should be privatized.
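Chapter 2 of the thesis above attacks the same problem, so a tiny C example of a memory-based dependence is useful here (function names are illustrative). The scalar t makes every iteration of the i loop write and read one shared location, inducing output and anti dependences that forbid parallelization even though no value flows between iterations; privatizing t removes them at a fraction of the memory cost that systematic array expansion would pay.

    #define N 1000

    /* Memory-based (false) dependences: all iterations of i reuse the
     * single location t, so they cannot safely run in parallel. */
    void sum_rows_false_deps(double A[N][N], double out[N]) {
        double t;
        for (int i = 0; i < N; i++) {
            t = 0.0;
            for (int j = 0; j < N; j++)
                t += A[i][j];
            out[i] = t;
        }
    }

    /* Privatized version: each iteration owns its t, so only true (flow)
     * dependences remain and the i loop is parallel. */
    void sum_rows_private(double A[N][N], double out[N]) {
        for (int i = 0; i < N; i++) {
            double t = 0.0; /* private copy per iteration */
            for (int j = 0; j < N; j++)
                t += A[i][j];
            out[i] = t;
        }
    }

    int main(void) {
        static double A[N][N], out[N];
        A[2][3] = 4.0;
        sum_rows_private(A, out);
        return out[2] == 4.0 ? 0 : 1;
    }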
  • Generalizing Loop-Invariant Code Motion in a Real-World Compiler
Imperial College London, Department of Computing. Generalizing Loop-Invariant Code Motion in a Real-World Compiler. Author: Paul Colea. Supervisor: Fabio Luporini. Co-supervisor: Prof. Paul H. J. Kelly. MEng Computing Individual Project, June 2015.

Abstract: Motivated by the perpetual goal of automatically generating efficient code from high-level programming abstractions, compiler optimization has developed into an area of intense research. Apart from general-purpose transformations which are applicable to all or most programs, many highly domain-specific optimizations have also been developed. In this project, we extend such a domain-specific compiler optimization, initially described and implemented in the context of finite element analysis, to one that is suitable for arbitrary applications. Our optimization is a generalization of loop-invariant code motion, a technique which moves invariant statements out of program loops. The novelty of the transformation is due to its ability to avoid more redundant recomputation than normal code motion, at the cost of additional storage space. This project provides a theoretical description of the above technique which is fit for general programs, together with an implementation in LLVM, one of the most successful open-source compiler frameworks. We introduce a simple heuristic-driven profitability model which manages to successfully safeguard against potential performance regressions, at the cost of missing some speedup opportunities. We evaluate the functional correctness of our implementation using the comprehensive LLVM test suite, passing all of its 497 whole-program tests. The results of our performance evaluation using the same set of tests reveal that generalized code motion is applicable to many programs, but that consistent performance gains depend on an accurate cost model.
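The core idea, hoisting more than classical loop-invariant code motion does by trading storage for recomputation, can be sketched in C as follows (f, the sizes and the kernel are hypothetical assumptions). Classical LICM hoists c*c out of both loops; the generalized transformation also hoists f(j), which is invariant in i but not in j, into a temporary array computed once.

    #include <math.h>

    #define NI 512
    #define NJ 512

    static double f(int j) { return sin(0.1 * j); } /* pure function of j */

    /* Before: f(j) is evaluated NI times for every j. */
    void kernel(double out[NI][NJ], double c) {
        for (int i = 0; i < NI; i++)
            for (int j = 0; j < NJ; j++)
                out[i][j] = c * c + f(j);
    }

    /* After generalized code motion: c*c is hoisted as usual, and f(j)
     * is precomputed into a table, buying NI*NJ - NJ fewer evaluations
     * of f for NJ doubles of extra storage. */
    void kernel_gcm(double out[NI][NJ], double c) {
        double cc = c * c;    /* classical LICM */
        static double fj[NJ]; /* the extra storage the transformation pays */
        for (int j = 0; j < NJ; j++)
            fj[j] = f(j);
        for (int i = 0; i < NI; i++)
            for (int j = 0; j < NJ; j++)
                out[i][j] = cc + fj[j];
    }

    int main(void) {
        static double out1[NI][NJ], out2[NI][NJ];
        kernel(out1, 3.0);
        kernel_gcm(out2, 3.0);
        return out1[7][9] == out2[7][9] ? 0 : 1;
    }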
  • A Tiling Perspective for Register Optimization Fabrice Rastello, Sadayappan Ponnuswany, Duco Van Amstel
A Tiling Perspective for Register Optimization. Łukasz Domagała, Fabrice Rastello, Sadayappan Ponnuswany, Duco van Amstel. [Research Report] RR-8541, Inria, May 2014, 21 pages. Project-Team GCG. ISSN 0249-6399, ISRN INRIA/RR--8541--FR+ENG. HAL Id: hal-00998915, https://hal.inria.fr/hal-00998915, submitted on 3 Jun 2014.

Abstract: Register allocation is a much studied problem. A particularly important context for optimizing register allocation is within loops, since a significant fraction of the execution time of programs is often spent inside loop code. A variety of algorithms have been proposed in the past for register allocation, but the complexity of the problem has resulted in a decoupling of several important aspects, including loop unrolling, register promotion, and instruction reordering.
  • Efficient Symbolic Analysis for Optimizing Compilers*
Efficient Symbolic Analysis for Optimizing Compilers. Robert A. van Engelen, Dept. of Computer Science, Florida State University, Tallahassee, FL 32306-4530.

Abstract: Because most of the execution time of a program is typically spent in loops, loop optimization is the main target of optimizing and restructuring compilers. An accurate determination of induction variables and dependencies in loops is of paramount importance to many loop optimization and parallelization techniques, such as generalized loop strength reduction, loop parallelization by induction variable substitution, and loop-invariant expression elimination. In this paper we present a new method for induction variable recognition. Existing methods are either ad hoc and not powerful enough to recognize some types of induction variables, or powerful but not safe. The most powerful method known is the symbolic differencing method, as demonstrated by the Parafrase-2 compiler on parallelizing the Perfect Benchmarks. However, symbolic differencing is inherently unsafe, and a compiler that uses this method may produce incorrectly transformed programs without issuing a warning. In contrast, our method is safe, simpler to implement in a compiler, better adaptable for controlling loop transformations, and recognizes a larger class of induction variables.

1 Introduction. It is well known that the optimization and parallelization of scientific applications by restructuring compilers requires extensive analysis of induction variables and dependencies.
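Induction variable recognition is what licenses rewrites such as the strength reduction sketched below (names and sizes are illustrative): j is a derived induction variable, linear in the basic induction variable i, so the per-iteration multiplication can be replaced by an addition. Recognizing such variables safely, including through symbolic analysis, is the subject of the paper.

    #define N 256

    /* Before: j is recomputed from i by a multiplication every iteration. */
    void fill(int *a, int base) {
        for (int i = 0; i < N; i++) {
            int j = 4 * i + base; /* derived induction variable */
            a[j] = i;
        }
    }

    /* After strength reduction: the multiplication becomes an increment. */
    void fill_sr(int *a, int base) {
        int j = base;
        for (int i = 0; i < N; i++) {
            a[j] = i;
            j += 4; /* same sequence of values, cheaper update */
        }
    }

    int main(void) {
        static int a1[4 * N + 4], a2[4 * N + 4];
        fill(a1, 3);
        fill_sr(a2, 3);
        return a1[43] == a2[43] ? 0 : 1;
    }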
  • A General Compilation Algorithm to Parallelize and Optimize Counted Loops with Dynamic Data-Dependent Bounds Jie Zhao, Albert Cohen
A General Compilation Algorithm to Parallelize and Optimize Counted Loops with Dynamic Data-Dependent Bounds. Jie Zhao and Albert Cohen, INRIA & DI, École Normale Supérieure, 45 rue d'Ulm, 75005 Paris. IMPACT 2017 - 7th International Workshop on Polyhedral Compilation Techniques, Jan 2017, Stockholm, Sweden, pp. 1-10. HAL Id: hal-01657608, https://hal.inria.fr/hal-01657608, submitted on 7 Dec 2017.

Abstract: We study the parallelizing compilation and loop nest optimization of an important class of programs where counted loops have a dynamically computed, data-dependent upper bound. Such loops are amenable to a wider set of transformations ... iterating until a dynamically computed, data-dependent upper bound. Such bounds are loop invariants, but often recomputed in the immediate vicinity of the loop they control; for example, their definition may take place in the immediately enclosing loop.
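The loop class in question looks like the C sketch below (names are illustrative): the inner loop is a counted loop whose trip count m is known on entry, but m is computed in the immediately enclosing loop and depends on the data. This defeats purely static affine modeling, yet m is invariant while the inner loop runs and can therefore be treated as a symbolic parameter when scheduling and parallelizing it.

    /* Counted inner loop with a dynamically computed, data-dependent
     * upper bound defined in the immediately enclosing loop. */
    void process(int n, const int row_len[], double *data, double scale) {
        long off = 0;
        for (int i = 0; i < n; i++) {
            int m = row_len[i];         /* data-dependent bound, invariant in j */
            for (int j = 0; j < m; j++) /* trip count known on loop entry */
                data[off + j] *= scale;
            off += m;
        }
    }

    int main(void) {
        int lens[3] = {2, 0, 3};
        double data[5] = {1, 1, 1, 1, 1};
        process(3, lens, data, 2.0);
        return data[4] == 2.0 ? 0 : 1;
    }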