Loop Transformations: Convexity, Pruning and Optimization

Loop Transformations: Convexity, Pruning and Optimization Louis-Noel¨ Pouchet Uday Bondhugula Cedric´ Bastoul The Ohio State University IBM T.J. Watson Research Center University of Paris-Sud 11 [email protected] [email protected] [email protected] Albert Cohen J. Ramanujam P. Sadayappan Nicolas Vasilache INRIA Louisiana State University The Ohio State University Reservoir Labs, Inc. [email protected] [email protected] [email protected] [email protected] Abstract One reason for this widening gap is due to the way loop nest High-level loop transformations are a key instrument in mapping optimizers attempt to decompose the global optimization problem computational kernels to effectively exploit resources in modern into simpler sub-problems. Recent results in feedback-directed op- processor architectures. However, determining appropriate compo- timization [33] indicate that oversimplification of this decomposi- sitions of loop transformations to achieve this remains a signif- tion is largely responsible for the failure. The polyhedral compiler icantly challenging task; current compilers may achieve signifi- framework provides a convenient and powerful abstraction to apply cantly lower performance than hand-optimized programs. To ad- composite transformations in one step [20, 24]. However, the space dress this fundamental challenge, we first present a convex char- of affine transformations is extremely large. Linear optimization in acterization of all distinct, semantics-preserving, multidimensional such a high-dimensional space raises complexity issues and its ef- affine transformations. We then bring together algebraic, algorith- fectiveness is limited by the lack of accurate profitability models. mic, and performance analysis results to design a tractable opti- This paper makes fundamental progresses in the understanding mization algorithm over this highly expressive space. The frame- of polyhedral loop nest optimization. It is organized as follows. work has been implemented and validated experimentally on a Section 2 formalizes the optimization problem and contributes representative set of benchmarks run on state-of-the-art multi-core a complete, convex characterization of all distinct, semantics- platforms. preserving, multidimensional affine transformations. Focusing on one critical slice of the transformation space, Section 3 pro- Categories and Subject Descriptors D 3.4 [Programming lan- poses a general framework to reason about all distinct, semantics- guages]: Processor — Compilers; Optimization preserving, multidimensional statement interleavings. Section 4 presents a complete and tractable optimization algorithm driven by General Terms Algorithms; Performance the iterative evaluation of these interleavings. Section 5 presents experimental data. Related work is discussed in Section 6. Keywords Compilation; Compiler Optimization; Parallelism; Loop Transformations; Affine Scheduling 2. Problem Statement and Formalization 1. Introduction The formal semantics of a language allows the definition of pro- Loop nest optimization continues to drive much of the ongoing gram transformations and equivalence criteria to validate these research in the fields of optimizing compilation [8, 27, 33], high- transformations. Denotational semantics can be used as an abstrac- level hardware synthesis [22], and adaptive library generation [19, tion to decide on the functional equivalence of two programs. But 47]. Loop nest optimization attempts to map the proper granularity operational semantics is preferred to express program transforma- of independent computation to a complex hierarchy of memory, tions themselves, especially in the case of transformations impact- computing, and interconnection resources. Despite four decades ing the detailed interaction with particular hardware. Program op- of research and development, it remains a challenging task for timization amounts to searching for a particular point in a space of compiler designers and a frustrating experience for programmers semantics-preserving transformations. When considering optimiz- — the performance gap, between expert-written code for a given ing compilation as a whole, the space has no particular property machine and that optimized and parallelized automatically by a that makes it amenable to any formal characterization or optimiza- compiler, is widening with newer generations of hardware. tion algorithm. As a consequence, compilers decompose the problem into sub-problems that can be formally defined, and for which complexity results and effective optimization algorithms can be de- rived. The ability to encompass a large set of program transformations Permission to make digital or hard copies of all or part of this work for personal or into a single, well understood optimization problem is the strongest classroom use is granted without fee provided that copies are not made or distributed asset of the polyhedral compilation model. In the polyhedral model, for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute transformations are validated against dependence relations among to lists, requires prior specific permission and/or a fee. dynamic instances of individual statements of a loop nest. The de- POPL’11, January 26–28, 2011, Austin, Texas, USA. pendence relation is the denotational abstraction that defines func- Copyright c 2011 ACM 978-1-4503-0490-0/11/01. $10.00 tional equivalence. This dependence relation may not be statically computable, but (conservative) abstractions may be (finitely) rep- in Figure 2). But on the AMD machine, distributing all statements resented through systems of affine inequalities, or unions of para- and individually tiling them performs best, 1.23× better than 2(b). metric polyhedra [2, 16, 22]. Two decades of work in polyhedral The efficient execution of a computation kernel on a modern compilation has demonstrated that the main loop nest transforma- multi-core architecture requires the synergistic operation of many tions can be modeled and effectively constructed as multidimen- hardware resources, via the exploitation of thread-level parallelism; sional affine schedules, mapping dynamic instances of individual the memory hierarchy, including prefetch units, different cache lev- statements to lexicographically sorted vectors of integers (or ratio- els, memory buses and interconnect; and all available computa- nal numbers) [6, 8, 17, 20, 24, 38]. Optimization in the polyhedral tional units, including SIMD units. Because of the very complex model amounts to selecting a good multidimensional affine sched- interplay between all these components, it is currently infeasible ule for the program. to guarantee at compile-time which set of transformatios leads to In this paper, we make the following contributions: maximal performance. To maximize performance, one must carefully tune for the • We take this well defined optimization problem, and provide trade-off between the different levels of parallelism and the us- a complete, convex formalization for it. This convex set of age of the local memory components. This can be performed via semantics-preserving, distinct, affine multidimensional sched- loop fusion and distribution choices, that drive the success of subse- ules opens the door to more powerful linear optimization algo- quent optimizations such as vectorization, tiling or parallelization. rithms. It is essential to adapt the fusion structure to the program and target • We propose a decomposition of the general optimization prob- machine. However, the state-of-the-art provides only rough models lem over this convex polyhedron into sub-problems of much of the impact of loop transformations on the actual execution. To lower complexity, introducing the powerful fusibility concept. achieve performance portability across a wide variety of architec- The fusibility relation generalizes the legality conditions for tures, empirical evaluation of several fusion choices is an alterna- loop fusion, abstracting a large set of multidimensional, affine, tive, but requires the construction and traversal of a search space fusion-enabling transformations. of all possible fusion/distribution structures. To make the problem • We demonstrate that exploring fusibility opportunities in a loop sound and tractable, we develop a complete framework to reason nest reduces to the enumeration of total preorders, and to the and optimize in the space of all semantics-preserving combinations existence of pairwise compatible loop permutations. We also of loop structures. study the transitivity of the fusibility relation. 2.2 Background and notation • Based on these results, we design a multidimensional affine scheduling algorithm targeted at achieving portable high per- The polyhedral model is a flexible and expressive representation formance across modern processor architectures. for loop nests. It captures the linear algebraic structure of statically predictable control flow, where loop bounds and conditionals are • Our approach has been fully implemented in a source-to-source affine functions of the surrounding loop iterators and invariants compiler and validated on representative benchmarks. (a.k.a. parameters, unknown at compilation time) [15, 20]; but it is not restricted to static control flow [2, 4]. 2.1 Loop optimization challenge The set of all executed instances of each statement is called an Let us illustrate the challenge of loop

Loop Transformations: Convexity, Pruning and Optimization

Expression Rematerialization for VLIW DSP Processors with Distributed Register Files ?

User-Directed Loop-Transformations in Clang

Elimination of Memory-Based Dependences For

Generalizing Loop-Invariant Code Motion in a Real-World Compiler

A Tiling Perspective for Register Optimization Fabrice Rastello, Sadayappan Ponnuswany, Duco Van Amstel

Efficient Symbolic Analysis for Optimizing Compilers*

Using Static Single Assignment SSA Form

A General Compilation Algorithm to Parallelize and Optimize Counted Loops with Dynamic Data-Dependent Bounds Jie Zhao, Albert Cohen

Loop Transformations and Parallelization

Compiler-Based Code-Improvement Techniques

Foundations of Scientific Research

Autotuning for Automatic Parallelization on Heterogeneous Systems