Delicm: Scalar Dependence Removal at Zero Memory Cost

DeLICM: Scalar Dependence Removal at Zero Memory Cost Michael Kruse Tobias Grosser Inria ETH Zürich Paris, France Zürich, Switzerland [email protected] [email protected] Abstract 1 Introduction Increasing data movement costs motivate the integration Advanced high-level loop optimizations, often using poly- of polyhedral loop optimizers in the standard flow (-O3) hedral program models [31], have been shown to effectively of production compilers. While polyhedral optimizers have improve data locality for image processing [22], iterated been shown to be effective when applied as source-to-source stencils [4], Lattice-Boltzmann computations [23], sparse- transformation, the single static assignment form used in matrices [30] and even provide some of the leading automatic modern compiler mid-ends makes such optimizers less effec- GPU code generation approaches [5, 32]. tive. Scalar dependencies (dependencies carried over a single While polyhedral optimizers have been shown to be ef- memory location) are the main obstacle preventing effective fective in research, but are rarely used in production, at optimization. We present DeLICM, a set of transformations least partly because they are disabled in the most common which, backed by a polyhedral value analysis, eliminate prob- optimization levels. because none of the established frame- lematic scalar dependences by 1) relocating scalar memory works is enabled by default in a standard optimization set- references to unused array locations and by 2) forwarding ting (e.g. -O3). With GCC/graphite [26], IBM XL-C [8], and computations that otherwise cause scalar dependences. Our LLVM/Polly [15] there are three polyhedral optimizers that experiments show that DeLICM effectively eliminates de- work directly in production C/C++/FORTRAN compilers and pendencies introduced by compiler-internal canonicalization are shipped to millions of users. While all provide good per- passes, human programmers, optimizing code generators, formance on various benchmarks, one of the last roadblocks or inlining – without the need for any additional memory that prevents enabling polyhedral optimizers by default is allocation. As a result, polyhedral loop optimizations can that they generally run as pre-processing pass before the be better integrated into compiler pass pipelines which is standard -O3 pass pipeline. essential for metaprogramming optimization. Most compilers require canonicalization passes for their detection of single-entry-single-exit regions, loops and CCS Concepts • Software and its engineering → Com- iteration count to work properly. For instance, natural loops pilers; need recognizable headers, pre-headers, backedges and Keywords Polyhedral Framework, Scalar Dependence, latches. Since a polyhedral optimizer also makes use of LLVM, Polly these structures, running the compiler’s canonicalization passes is necessary before the polyhedral analyzer. This ACM Reference Format: introduces changes in the intermediate representation that Michael Kruse and Tobias Grosser. 2018. DeLICM: Scalar appear as runtime performance noise when enabling Dependence Removal at Zero Memory Cost. In Proceedings of 2018 IEEE/ACM International Symposium on Code Generation and polyhedral optimization. At the same time, applying loop Optimization (CGO’18). ACM, New York, NY, USA, 13 pages. optimizations – which often increase code size – early https://doi.org/10.1145/3168815 means that no inlining has yet been performed and consequently only smaller loop regions are exposed or Permission to make digital or hard copies of all or part of this work for future inlining opportunities are prevented. personal or classroom use is granted without fee provided that copies Therefore, the better position for a polyhedral optimizer is are not made or distributed for profit or commercial advantage and that later in the pass pipeline, after all inlining and canonicaliza- copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must tion has taken place. At this position the IR can be analyzed be honored. Abstracting with credit is permitted. To copy otherwise, or without having to modify it and heuristic inlining decisions republish, to post on servers or to redistribute to lists, requires prior specific are not influenced. However, when running late in the pass permission and/or a fee. Request permissions from [email protected]. pipeline, various earlier passes introduce IR changes that CGO’18, February 24–28, 2018, Vienna, Austria polyhedral optimizers do not handle well. © 2018 Copyright held by the owner/author(s). Publication rights licensed For instance, Listing 1 works entirely on arrays and is to the Association for Computing Machinery. ACM ISBN 978-1-4503-5617-6/18/02...$15.00 easier to optimize than Listings 2 to4. This is because of the https://doi.org/10.1145/3168815 temporaries a and c, which baked-in the assumption that CGO’18, February 24–28, 2018, Vienna, Austria Michael Kruse and Tobias Grosser for (int i = 0; i < N; i += 1) { multiple independent instances of the scalars, one for each T: C[i] = 0; SIMD lane. An automatic parallelizer can privatize the tem- for (int k = 0; k < K; k += 1) poraries per thread. However, general purpose loop sched- S: C[i] += A[i] * B[k]; uling algorithms such as Feautrier [12, 13] or Pluto [9] do } not take privatization opportunities into account. These are fundamentally split into an analysis and a scheduling part. Listing 1. Running example: sums-of-products. Analysis extracts dependences to determine which execution orders are possible, and the scheduler uses this information double a; to find a new ordering that does not violate any dependence. for (int i = 0; i < N; i += 1) { The analysis result must be correct even for transformations T: C[i] = 0; a = A[i]; such as loop distribution and -reversal which are not enabled for (int k = 0; k < K; k += 1) by these transformation-specific techniques. Hence, to allow S: C[i] += a * B[k]; maximal scheduling freedom, it is desirable to model pro- } grams with a minimal number of scalar dependencies. As Listing 2. Sum-of-products after hoisting the load of A[i]. the semantics of Listings 1 to4 are identical, a good compiler should be able to compile them to the same optimized double c; program, after all loop optimizations have been applied. for (int i = 0; i < N; i += 1) { In this article we present a novel memory dependence T: c = 0; removal approach that reduces a polyhedral’s compiler sen- for (int k = 0; k < K; k += 1) sitivity on the shape and style of the input code and enables S: c += A[i] * B[k]; effective high-level loop optimizations as part of a standard U: C[i] = c; optimization pipeline. Our contributions are: } • A polyhedral value analysis that provides, at statement instance granularity, information about the liveness Listing 3. Sum-of-products after promotion of C[i] to a and the source statement of individual data values. scalar. • Greedy Value Coalescing (Section 2): Map promoted scalars to array elements that are unused to handle double c; cases such as Listings 3 and4. for (int i = 0; i < N; i += 1) { • Operand Tree Forwarding (Section 3): Move instruc- T: c = 0; tion and memory reads to the statements where they for (int k = 0; k < K; k += 1) are used to handle cases such as Listing 2. S: C[i] = (c += A[i] * B[k]); • An implementation (Section 4) of these techniques in } LLVM/Polly. Listing 4. Sum-of-products after partial promotion of C[i]. • An evaluation (Section 5) showing that polyhedral loop optimizations still can be effective in the presence of scalar dependencies. the i-loop is the outer loop. Although the data-flow dependencies are the same, new false dependencies are introduced Sources of Scalar Dependencies Scalar memory depen- that prevent the scalars a or c from being overwritten before dencies arise for several reasons and not necessarily intro- all statements using that value have been executed. For List- duced by compiler passes. ing 3, this means that the dynamic statement instance T ¹iº, statement T at loop iteration i, cannot execute before the Single Static Assignment (SSA) -based intermediate previous iteration i − 1 has finished. In this case, statement representations used in modern compilers cause instance U ¹i − 1º must execute before T ¹iº, represented by memory to be promoted to virtual registers, like the a c the (anti-)dependencies scalar temporaries or from the examples. Values that are defined in conditional statements are i 2 f 1;:::; N − 1 g : U ¹i − 1º ! T ¹iº . forwarded through φ-nodes. 8 We call dependencies that are induced by a single mem- Loop-Invariant Code Motion (LICM) moves instruc- ory location such as a or c, scalar dependencies. Such false tions that evaluate to the same value in every loop dependencies effectively serialize the code into the order it iteration before the loop, and instructions whose re- is written, inhibiting or at least complicating most advanced sult is only used after the loop, after the loop. The loop optimizations including tiling, strip-mining, and paral- transformation of Listing 1 to Listing 2 is an example lelization. Certain specific optimizations can cope well with for this, but also the temporary promotion of memory scalar dependencies. A vectorizer can, for example, introduce location to a scalar as in Listing 3 or4. DeLICM: Scalar Dependence Removal at Zero Memory Cost CGO’18, February 24–28, 2018, Vienna, Austria c = 0 c += A[0][0]c += A[0][1]c += A[0][2] c = 0 c += A[1][0]c += A[1][1]c += A[1][2] c = 0 c += A[2][0]c += A[2][1]c += A[2][2] Θ : 0 0 1 S¹0; 0º 2 S¹0; 1º 3 S¹0; 2º 4 S¹0; 2º 5 0 6 S¹1; 0º 7 S¹1; 1º 8 S¹1; 2º 9 S¹1; 2º 10 0 11 S¹2; 0º 12 S¹2; 1º 13 S¹2; 2º 14 ..

Delicm: Scalar Dependence Removal at Zero Memory Cost

State Model Syllabus for Undergraduate Courses in Science (2019-2020)

073-080.Pdf (568.3Kb)

PERL – a Register-Less Processor

Instruction Pipelining in Computer Architecture Pdf

Utilizing Parametric Systems for Detection of Pipeline Hazards

Pipeliningpipelining

Processor Design：System-On-Chip Computing For

Operand Registers and Explicit Operand Forwarding James Balfour, R

A CUDA Program • a CUDA Program Consists of a Mixture of the Following Two Parts

Pipelining Design Techniques

Realizing High IPC Through a Scalable Memory-Latency Tolerant Multipath Microarchitecture

ECE 341 Lecture # 13 Lecture Topics Types of Pipeline Hazards