Compiler Transformations for High-Performance Computing

DAVID F. BACON, SUSAN L. GRAHAM, AND OLIVER J. SHARP

Computer Sc~ence Diviszon, Unwersbty of California, Berkeley, California 94720

In the last three decades a large number of compiler transformations for optimizing programs have been implemented. Most optimization for uniprocessors reduce the number of instructions executed by the program using transformations based on the analysis of scalar quantities and data-flow techniques. In contrast, optimization for high-performance superscalar, vector, and parallel processors maximize parallelism and memory locality with transformations that rely on tracking the properties of arrays using loop dependence analysis. This survey is a comprehensive overview of the important high-level program restructuring techniques for imperative languages such as C and Fortran. Transformations for both sequential and various types of parallel architectures are covered in depth. We describe the purpose of each traneformation, explain how to determine if it is legal, and give an example of its application. Programmers wishing to enhance the performance of their code can use this survey to improve their understanding of the optimization that compilers can perform, or as a reference for techniques to be applied manually. Students can obtain an overview of technology. Compiler writers can use this survey as a reference for most of the important optimizations developed to date, and as a bibliographic reference for the details of each optimization. Readers are expected to be familiar with modern computer architecture and basic program compilation techniques.

Categories and Subject Descriptors: D.1.3 [Programming Techniques]: Concurrent Programming; D.3.4 [Programming Languages]: Processors—compilers; optimization; 1.2.2 [Artificial Intelligence]: Automatic Programming-program transformation

General Terms: Languages, Performance

Additional Key Words and Phrases: Compilation, dependence analysis, locality, multiprocessors, optimization, parallelism, superscalar processors, vectorization

INTRODUCTION As optimizing compilers become more Optimizing compilers have become an es- effective, programmers can become less sential component of modern high-perfor- concerned about the details of the under- mance computer systems. In addition to lying machine architecture and can em- translating the input program into ma- ploy higher-level, more succinct, and chine language, they analyze it and ap- more intuitive programming constructs ply various transformations to reduce its and program organizations. Simultane- running time or its size. ously, hardware designers are able to

This research has been sponsored in part by the Defense Advanced Research Projects Agency (DARPA) under contract DABT63-92-C-O026, by NSF grant CDA-8722788, and by an IBM Resident Study Program Fellowship to David Bacon, Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and]or specific permission. @ 01994 0360-0300/94/1200-0345 $03.50

ACM Computing Surveys, Vol. 26, No. 4, December 1994 346 * David F. Bacon et al.

CONTENTS imperative languages such as Fortran and C for high-performance architec- tures, including superscalar, vector, and various classes of multiprocessor

INTRODUCTION machines. 1. SOURCE LANGUAGE Most of the transformations we de- 2 TRANSFORMATION ISSUES scribe can be applied to anyofthe Algol- 2 1 Correctness 22 Scope family languages; many are applicable to 3 TARGET ARCHITECTURES functional, logic, distributed, and object- 31 Characterizing Performance oriented languages as well. These other 32 Model Architectures languages raise additional optimization 4 COMPILER ORGANIZATION issues that space does not permit us 5 DEPENDENCE ANALYSIS 51 Types of Dependences to cover in this survey. The references 52 Representing Dependences include some starting points for investi- 5,3 Loop Dependence Analysis gation of optimizations for LISP and 5.4 Subscript Analysis functional languages [Appel 1992; Clark 6. TRANSFORMATIONS and Peyton-Jones 1985; Kranz et al. 61 Data-Flow-Based Loop Transformations 6.2 Loop Reordering 1986], object-oriented languages [Cham- 6.3 Loop Restructuring bers and Ungar 1989], and the set-based 6.4 Loop Replacement Transformations language SETL [Freudenberger et al. 6.5 Memory Access Transformations 1983]. 6.6 Partial Evaluation 6.7 Redundancy Ehmmatlon We have also restricted the discussion 68 Procedure Call Transformations to higher-level transformations that re- 7. TRANSFORMATIONS FOR PARALLEL quire some program analysis. Thus we MACHINES exclude peephole optimizations and most 71 Data Layout machine-level optimizations. We use the 72 Exposing Coarse-Grained Parallelism 73 Computation PartltIOnmg term optimization as shorthand for opti- 74 Communication Optimization mizing transformation. 75 SIMD Transformations Finally, because of the richness of the 76 VLIWTransformatlons topic, we have not given a detailed de- 8 TRANSFORMATION FRAMEWORKS scription of intermediate program repre- 8.1 Unified Transformation sentations and analysis techniques. 8.2 Searching the Transformation Space 9. COMPILER EVALUATION We make use of a number of different 9.1 Benchmarks machine models, all based on a hypothet- 9.2 Code Characterlstlcs ical superscalar processor called S-DLX. 9.3 Compiler Effectiveness The Appendix details the machine mod- 94 Instruction-Level Parallehsm CONCLUSION els, presenting a simplified architecture APPENDIX: MACHINE MODELS and instruction set that we use when we A.1 Superscalar DLX need to discuss machine code. While all A 2 Vector DLX assembly language examples are com- A3 Multiprocessors ACKNOWLEDGMENTS mented, the reader will need to refer to REFERENCES the Armendix to understand some details of the’ ~xamples (such as cycle counts). We assume a basic familiarity with ~ro~am compilation issues. Readers un- employ designs that yield greatly im- ~am~liar with program flow analysis or proved performance because they need other basic compilation techniques may only concern themselves with the suit- consult Aho et al. [1986]. ability of the designas a compiler target, not with its suitability as a direct pro- 1. SOURCE LANGUAGE grammer interface. In this survey we describe transforma- All of the high-level examples in this tions that optimize programs written in survey are written in a language similar

ACM Computing Surveys, Vol. 26, No. 4, December 1994 Compiler Transformations “ 347 to Fortran 90, with minor variations and Programmers unfamiliar with Fortran extensions; most examples use only fea- should also note that all arguments are tures found in Fortran 77. We have cho- passed by reference. sen to use Fortran because it is the de When describing compilation for vector facto standard of the high-performance machines, we sometimes use array nota- engineering and scientific computing tion when the mapping to hardware reg- community. Fortran has also been the isters is clear. For instance, the loop input language for a number of research doalli=l,64 projects studying parallelization [Allen et a[i] = a[i] + c al. 1988a; Balasundaram et al. 1989; end do all Polychronopoulos et al. 1989]. It was cho- sen by these projects not only because of could be implemented with a scalar-vec- its ubiquity among the user community, tor add instruction, assuming a machine but also because its lack of pointers and with vector registers of length 64. This its static memory model make it more would be written in Fortran 90 array amenable to analysis. It is not yet clear notation as what effect the recent introduction of a[l :64] = a[l :64] + c pointers into Fortran 90 will have on optimization. or in vector machine assembly language The optimizations we have presented as are not specific to Fortran—in fact, many commercial compilers use the same in- LF F2, c(R30) ;Ioad c into register F2 termediate language and optimizer for ADDI R8, R30, #a ;Ioad addr. of a into R8 both C and Fortran. The presence of un- LV VI, R8 ;Ioad vector a[l :64] to restricted pointers in C can reduce oppor- V1 tunities for optimization because it is ADDSV VI, F2, VI ;add scalar to vector Sv VI. R8 ;store vector in a[l :64] impossible to determine which variables may be referenced by a pointer. The pro- cess of determining which references may Array notation will not be used when point to the same storage locations is loop bounds are unknown, because there called [Banning 1979; Choi is no longer an obvious correspondence et al. 1993; Cooper and Kennedy 1989; between the source code and the fixed- Landi et al. 1993]. length vector operations that perform the The only changes to Fortran in our computation. To use vector operations, examples are that array subscripting is the compiler must perform the transfor- denoted by square brackets to distin- mation called strip-mining, which is dis- guish it from function calls; and we use cussed in Section 6.2.4. do all loops to indicate textually that all the iterations of a loop may be executed 2. TRANSFORMATION ISSUES concurrently. To make the structure of For a compiler to apply an optimization the computation more explicit, we will to a program, it must do three things: generally express loops as iterations rather than in Fortran 90 array notation. (1) Decide upon a part of the program to We follow the Fortran convention that optimize and a particular transfor- arrays are stored contiguously in mem- mation to apply to it. ory in column-major form. If the array is (2) Verify that the transformation either traversed so as to visit consecutive loca- does not change the meaning of the tions in the linear memory, the first (that program or changes it in a restricted is, leftmost) subscript varies the fastest. way that is acceptable to the user. In traversing the two-dimensional array (3) Transform the program. 
declared as a[n, m], the following array locations are contiguous: a[n – 1, 3], In this survey we concentrate on the a[n, 31, a[l, 41, a[2, 41. last step: transformation of the program.

ACM Computmg Surveys, Vol. 26, No 4, December 1994 348 ● David F. Bacon et al.

However, in Section 5 we introduce de- tions in the two executions produces the pendence analysis techniques that are same result. used for deciding upon and verifying many loop transformations. Nondeterminism can be introduced by Step (1) is the most difficult and poorly language constructs (such as Ada’s se- understood; consequently compiler de- lect statement), or by calls to system or sign is in many ways a black art. Be- library routines that return information cause analysis is expensive, engineering about the state external to the program issues often constrain the optimization (such as Unix time( ) or read()). strategies that the compiler can perform. In some cases it is straightforward to Even for relatively straightforward cause executions to be identical, for in- uniprocessor target architectures, it is stance by entering the same inputs, In possible for a sequence of optimizations other cases there may be support for de- to slow a program down. For example, an terminism in the programming environ- attempt to reduce the number of instruc- ment—for instance, a compilation option tions executed may actually degrade per- that forces select statements to evaluate formance by making less efficient use of their guards in a deterministic order. As the cache. a last resort, the programmer may have As processor architectures become to replace nondeterministic system calls more complex, (1) the number of dimen- temporarily with deterministic opera- sions in which optimization is possible tions in order to compare two versions. increases and (2) the decision process is To illustrate many of the common greatly complicated. Some progress has problems encountered when trying to op- been made in systematizing the applica- timize a program and yet maintain cor- tion of certain families of transforma- rectness, Figure l(a) shows a subroutine, tions, as discussed in Section 8. However, Figure l(b) is a transformed version of it. all optimizing compilers embody a set of The new version may seem to have the heuristic decisions as to the transforma- same semantics as the original, but it tion orderings likely to work best for the violates Definition 2.1.1 in the following target machine(s). ways:

a Overflow. If b[k] is a very large num- 2.1 Correctness ber, and a[l ] is negative, then chang- When a program is transformed by the ing the order of the additions so that compiler, the meaning of the program C = b [k] + 100000.0 is executed before should remain unchanged. The easiest the loop body could cause an overflow way to achieve this is to require that the to occur in the transformed program transformed program perform exactly the that did not occur in the original. Even same operations as the original, in ex- if the original program would have actly the order imposed by the semantics overflowed, the transformation causes of the language. However, such a strict the exception to happen at a different interpretation leaves little room for im- point. This situation complicates de- provement. The following is a more bugging, since the transformation is practical definition. not visible to the programmer. Finally, if there had been a print statement be- Definition 2.1.1 A transformation is tween the assignment and use of C, legal if the original and the transformed the transformation would actually programs produce exactly the same out- change the output of the program. put for all identical executions. * Different results. Even if no overflow Two executions of a program are iden- occurs, the values that are left in the tical executions if they are supplied with array a may be slightly different be- the same input data and if every corre- cause the order of the additions has sponding pair of nondeterministic opera- been changed. The reason is that float-

ACM Computmg Surveys, Vol 26, No 4, December 1994 Compiler Transformations ● 349

subroutine tricky (a, b,n, m,k) ● Memory fault. If k > m but n < 1, the integer n, m, k reference to b[k] is illegal. The refer- real a[m], b[m] ence would not be evaluated in the original program because the loop body doi=l, n is never executed, butit would occurin a[i] = b[k] + a[i] + 100000.0 the call shown in Figure l(c) to the end do transformed code in Figure l(b). return e Different results. a and b may be com- (a) original program pletely or partially aliased to one an- other, changing the values assigned to a in the transformed program. Figure subroutine tricky(a,b,n,m,k) l(d) shows how this might occur: If the integer n, m, k call is to the original subroutine, b[k] real a[m], b[m], C is changed when i = 1, since k = n and b [n] is aliased to a[l ]. In the trans- C = b[k] + 100000.0 formed version, the old value of b [k] is doi=n, i,-1 used for all i, since it is read before the a[i] = a[i] + C loop. Even if the reference to b[k] were end do moved back inside the loop, the trans- return formed version would still give differ- (b) transformed program ent results because the loop traversal order has been reversed.

k = m+l As a result of these problems, slightly n=O different definitions are used in practice. call tricky(a,b,n,m,k) When bitwise identical results are de- (c) possible call to tricky sired, the following definition is used:

Definition 2.1.2 A transformation is equivalence (a[l], b[n]) legal if, for all semantically correct pro- gram executions, the original and the k=n transformed programs produce exactly call tricky(a,b,n,m,k) the same output for identical executions. (d) possible call to tricky Languages typically have many rules Figure 1. Incorrect program transformations. that are stated but not enforced; for in- stance in Fortran, array subscripts must remain within the declared bounds, but ing-point numbers are approximations this rule is generally not enforced at ofrealnumbers ,andtheorder inwhich compile- or run-time. A program execu- the approximations are applied(round- tion is correct if it does not violate the ing) can affect the result. However, for rules of the language. Note that correct- a sequence of commutative and asso- ness is a property of a program execu- ciative integer operations, if no order tion, not of the program itself, since the of evaluation can cause an exception, same program may execute correctly un- then all evaluation orders are equiva- der some inputs and incorrectly under lent. We call operations that are alge- others. braically but not computationally Fortran solves the problem of parame- commutative (or associative) semicom- ter aliasing by declaring calls that alias mutative (or semiassociative) opera- scalars or parts of arrays illegal, as in tions. These issues do not arise with Figure l(d). In practice, many programs Boolean operations, since they do not make use of aliasing anyway. While this cause exceptions or compute approxi- may improve performance, it sometimes mate values. leads to very obscure bugs.

ACM Computing Surveys, Vol. 26, No. 4, December 1994 350 “ David F. Bacon et al.

For languages and standards that de- niques. The advantage for analysis is fine a specific semantics for exceptions, that there is only one entry point, so meeting Definition 2.1.2 tightly con- control transfer need not be considered strains the permissible transforma- in tracking the behavior of the code. tions. If the compiler may assume that * Innermost loop. To target high-perfor- exceptions are only generated when the mance architectures effectively, com- program is semantically incorrect, a pilers need to focus on loops. Most of transformed program can produce differ- the transformations discussed in this ent results when an exception occurs and survey are based on loop manipulation. still be legal. When exceptions have a Many of them have been studied or well-defined semantics, as in the IEEE widely applied only in the context of floating-point standard [American Na- the innermost loop. tional Standards Institute 1987], many transformations will not be legal under 0 Perfect loop nest. A loop nest is a set of this definition. loops one inside the next. The nest is However, demanding bitwise identical called a perfect nest if the body of ev- results may not be necessary. An alterna- ery loop other than the innermost con- tive correctness criterion is: sists of only the next loop in the nest. Because a perfect nest is more easily Definition 2.1.3 A transformation is summarized and reorganized, several legal if, for all semantically correct ex- transformations apply only to perfect ecutions of the original program, the nests. original and the transformed programs e General loop nest. Any loop nesting, perform equivalent operations for identi- perfect or not. cal executions. All permutations of semi- commutative operations are considered ● Procedure. Some optimizations, mem- equivalent. ory access transformations in particu- lar, yield better improvements if they Since we cannot predict the degree to are applied to an entire procedure at which transformations of semicommuta- once. The compiler must be able to tive operations change the output, we manage the interactions of all the ba- must use an operational rather than an sic blocks and control transfers within observational definition of equivalence. the procedure. The standard and rather Generally, in practice, programmers ob- confusing term for procedure-level op- serve whether the numeric results differ timization in the literature is global by more than a certain tolerance, and if optimization. they do, force the compiler to employ 0 InterProcedural. Considering several Definition 2.1.2. proc~dures together often exp~ses more opportunities for optimization; in par- 2.2 Scope ticular, procedure call overhead is of- ten significant, and can sometimes be Optimizations can be applied to a pro- reduced or eliminated with inter- gram at different level~ of granularity. procedural analysis. As the scope of the transformation is en- larged, the cost of analysis generally in- We will generally confine our attention creases. Some useful gradations of to optimizations beyond the basic block complexity are: level. * Statement. Arithmetic expressions are the main source of potential optimiza- tion within a statement. 3. TARGET ARCHITECTURES 0 Basic block (straight-line code). This is In this survey we discuss compilation the focus of early optimization tech- techniques for high-performance archi-

ACM Computmg Surveys, Vol. 26, No 4, December 1994 Compiler Transformations ● 351 tectures, which for our purposes are su- ware detects that the result of a vector perscalar, vector, SIMD, shared-memory multiply is used by a vector add, and multiprocessor, and distributed-memory uses a strategy called chaining in which multiprocessor machines. These architec- the results from the multiplier’s pipeline tures have in common that they all use are fed directly into the adder. Other parallelism in some form to improve compound operations are sometimes performance. implemented as well. The structure of an architecture dic- Use of such compound operations may tates how the compiler must optimize allow an application to run up to twice as along a number of different (and some- fast. It is therefore important to organize times competing) axes. The compiler must the code so as to use multiply-adds wher- attempt to: ever possible. e maximize use of computational re- sources (processors, functional units, 3.1 Characterizing Performance vector units), . minimize the number of operations There are a number of variables that we performed, have used to help describe the perfor- mance of the compiled code quantita- ● minimize use of memory bandwidth (register, cache, network), and tively: - minimize the size of total memory e S is the hardware speed of a single required. processor in operations per second. While optimization for scalar CPUS has Typically, speed is measured either in concentrated on minimizing the dynamic millions of instructions per second instruction count (or more precisely, the (MIPS) or in millions of floating-point number of machine cycles required), operations per second (megaflops). CPU-intensive applications are often at e P is the number of processors. least as dependent upon the performance 0 F is the number of operations executed of the memory system as they are on the by a program. performance of the functional units. 0 In particular, the distance in memory T is the time in seconds to run a pro- between consecutively accessed elements gram. of an array can have a major perfor- e U = F/ST is the utilization of the ma- mance impact. This distance is called the chine by a program; a utilization of 1 is stride. If a loop is accessing every fourth ideal, but real programs typically have element of an array, it is a stride-4 loop. significantly lower utilizations. A su- If every element is accessed in order, it is perscalar machine can issue several in- a stride-1 loop. Stride-1 access is desir- structions per cycle, but if the program able because it maximizes memory local- does not contain a mixture of opera- ity and therefore the efficiency of the tions that matches the mixture of func- cache, translation lookaside buffer (TLB), tional units, utilization will be lower. and paging systems; it also eliminates Similarly, a vector machine cannot be bank conflicts on vector machines. utilized fully unless the sizes of all of Another key to achieving peak perfor- the arrays in the program are an exact mance is the paired use of multiply and multiple of the vector length. add operations in which the result from 0 Q measures reuse of operands that are the multiplier can be fed into the adder. stored in memory. It ii the ratio of the For instance, the IBM RS/6000 has a number of times the operand is refer- multiply-add instruction that uses the enced during computation to the num- multiplier and adder in a pipelined fash- ber of times it is loaded into a register. 
ion; one multiply-add can be issued each As the value of Q rises, more reuse is cycle. The Cray Y-MP C90 does not have being achieved, and less memory band- a single instruction; instead the hard- width is consumed. Values of Q below

ACM Computing Surveys, Vol. 26, No. 4, December 1994 352 e David F. Bacon et al.

1 indicate that redundant loads are level intermediate language, and object occurring. code. First, optimizations are applied to a 0 Qc is an analogous quantity that mea- high-level intermediate language (HIL) sures reuse of a cache line in memory. that is semantically very close to the It is the ratio of the number of words original source program. Because all the that are read out of the cache from semantic information of the source m-o- that particular line to the number of gram is available, higher-level tran~for- times the cache fetches that line from mations are easier to apply. For instance, memory. array references are clearly distinguish- able, instead of being a sequence of low- level address calculations. 3.2 Model Architectures Next, the program is translated to In the Appendix we present a series of low-level intermediate form (LIL), essen- model architectures that we will use tially an abstract machine language, and throughout this survey to demonstrate optimized at the LIL level. Address com- the effect of’ various compiler transforma- putations provide a major source of tions. The architectures include a super- potential optimization at this level. For scalar CPU (S-DLX), a vector CPU instance, two references to a[5, 3] and (V-DLX), a shared-memory multiproces- a[7, 3] both require the computation of sor (sMX), and a distributed-memory the address of column 3 of array a. multiprocessor (dMX). We assume that Finally, the program is translated to the reader is familiar with basic princi- object code, and machine-specific opti- ples of modern computer architecture, in- mization are performed. Some optimiza- cluding RISC design, pipelining, caching, tion, like code collocation, rely on profile and instruction-level parallelism. Our information obtained by running the generic architectures are based on DLX, program. These optimizations are of- an idealized RISC architecture intro- ten implemented as binary-to-binary duced by Hennessey and Patterson translations. [1990]. The high-level optimization phase be- gins by doing all of the large-scale re- structuring that will significantly change 4. COMPILER ORGANIZATION the organization of the program. This in- Figure 2 shows the design of a hypotheti- cludes various procedure-restructuring cal compiler for a superscalar or vector optimlizations, as well as scalarization, machine. It includes most of the general- which converts data-parallel constructs purpose transformations covered in this into loops. Doing this restructuring at survey. Because compilers for parallel the beginning allows the rest of the com- machines are still a very active area of piler to work on individual procedures in research, we have excluded these ma- relative isolation, chines from the design. Next, high-level data-flow optimiza- The purpose of this compiler design is tion, partial evaluation, and redundancy to give the reader (1) an idea of how the elimination are performed. These opti- various types of transformations fit to- mization simplify the program as much gether and (2) some of the ordering is- as possible, and remove extraneous code sues that arise. This organization is by that could reduce the effectiveness of no means definitive: different architec- subsequent analysis. tures dictate different designs, and opin- The main focus in compilers for high- ions differ on how best to order the performance architectures is loop opti- transformations. mization. 
A sequence of transformations Optimization takes place in three dis- is performed to convert the loops into a tinct phases, corresponding to three dif- form that is more amenable to optimiza- ferent representations of the program: tion: where possible, perfect loop nests high-level intermediate language, low- are created; subscript expressions are

ACM Computmg Surveys, Vol 26, No 4, December 1994 Compiler Transformations ● 353

F % Source Program Perfect Loop Ne.os lnductton Vartables I Loop Reordering Loop Intercharrga Loop Skewing Loop Reversal Loop Tthng Loop Coalescing Strip Mining Cycle Shnnkma “FTail-recursion Ehmmatlon Procedure Cull Graph Procedure Anwkmons 1

t Low-level IL Generation High-Level Dataflow Optimization ConstantPropagationand Folding Low-lsvel Inferresdlate Copy Props ation Language (UL) Common.%%sxpreswonElimination Low-level Dataflow Optimization Loop-invariant Code MotIon Constant Propagation and Folding Loop Unswtfching Copy Propagation Stranoth Reduction Common Subaxpression Elimination Loq-mvanant Code Motion of lnd.Var. Exprs. Reassociation Elimination

Partial Evaluation Algebralc Slmplficatron Low-level Redundancy Eliminatiorr Short-circuiting &%%%%%X:?duct’On =

Redundancy Elimination UnreachableCode Elimmation UsaleasCede Elimmation Dead Variable Eltminat!on

! LooIz Preparation I I Cross-Call Re Ister Allocation Loop Distribution = Loop Normalization Assemblv L@muaee 1 Loop Peeling Scalar Expanaion Micro-optimization Peephok Optimization Depeedsnce Vectors t Superoptlmlzatlon Loop Preparation II 1 ForwardSubstation Array Padding Array Alignment Idiom and ReductionRecognition LOODCollaosmo

Erectable Progrom 1 Instruction Ceche Optimization Code Co-locatiom Displacement Minimization Cache Conflict Avoidance

t Final Opttmtzed Co&

Pigure 2. Organization of a hypothetical optimizing compiler.

ACM Computmg Surveys, Vol. 26, No. 4, December 1994 354 “ David F. Bacon et al.

rewritten in terms of the induction vari- There is a control dependence between ables; and so on. Then the loop iterations statement 1 and statement 2, written S1 are reordered to maximize parallelism 4s2, when statement S1 determines and locality. A postprocessing phase or- whether S’z will be executed. For exam- ganizes the resulting loops to minimize ple: loop overhead. 1 After the , the pro- if (a = 3) then 2 b=10 gram is converted to low-level intermedi- end if ate language (LIL). Many of the data-flow and redundancy elimination optimiza- Two statements have a data dependence tion that were applied to the HIL are if they cannot be executed simultane- reapplied to the LIL in order to eliminate ously due to conflicting uses of the same inefficiencies in the component expres- variable. There are three types of data sions generated from the high-level con- dependence: flow dependence (also structs. Additionally, induction variable called true dependence), antidependence, optimization are performed, which are and output dependence. Sb has a flow often important for high performance on dependence on S~ (denoted by S~ - S’d) uniprocessor machines. A final LIL opti- when S~ must be executed first because mization pass applies procedure call it writes a value that is read by S’d. For optimizations. example: Code generation converts LIL into assembly language. This phase of the 3 a=c=lo compiler is responsible for instruction 4 d=2*a+c selection, , and SG has an antidependence on S~ (denoted . The compiler applies by S~ * Sc ) when Sc writes a variable low-level optimizations during this phase that is read by S~: to improve further the performance of the code. 5 e= f*4+g Finally, object code is generated by the 6 g=z.h assembler, and the object files are linked into an executable. After profiling, pro- An antidependence does not constrain file-based cache optimization can be ap- execution as tightly as a flow depen- plied to the executable program itself. dence. As before, the code will execute correctly if So is delayed until after S~ 5, DEPENDENCE ANALYSIS completes. An alternative solution is to Among the various forms of analysis used use two memory locations g~ and g~ to by optimizing compilers, the one we rely hold the values read in S~ and written in on most heavily in this survey is depen- SG, respectively. If the write by SG com- dence analysis [Banerjee 1988b; Wolfe pletes first, the old value will still be 1989b]. This section introduces depen- available in g~. dence analysis, its terminology, and the An output dependence holds when both underlying theory. statements write the same variable: A dependence is a relationship be- 7 a=b*c tween two computations that places 8 a=d i-e constraints on their execution order. De- pendence analysis identifies these con- We denote this condition by writing ST straints, which are then used to deter- ~ Sg. Again, as with an antidependence, mine whether a particular transforma- storage replication can allow the state- tion can be applied without changing the ments to execute concurrently. In this semantics of the computation. short example there is no intervening use of a and no control transfer between 5.1 Types of Dependence the two assignments, so the computation There are two kinds of dependence: con- in ST is redundant and can actually be trol dependence and data dependence. eliminated.

ACM Computing Surveys, Vol. 26, No 4, December 1994 Compiler Transformations ● 355

The fourth possible relationship, called doi=2, n an input dependence, holds when two ac- 1 a[i] = a[il + c cesses to the same memory location are 2 b[i] = a[i-1] * b [i] both reads. Although an input depen- end do dence imposes no ordering constraints, Figure 3. Loop-carried dependence. the compiler can make note of it for the purposes of optimizing data placement on multiprocessors. We denote an unspecified type of de- pendence by SI = S2. Another common 5.3 Loop Dependence Analysis notation for _dependences uses S~ 8S4 for fjl~ + S4, S58SG for S5 ++ SG, and S7 ~“sa So far we have examined dependence in for S7 + Sa, the context of straight-line code with con- In the case of data dependence, when ditionals—analyzing loops is a more we write X * Y we are being somewhat complicated problem. In straight-line imprecise: the individual variable refer- code, each statement is executed at most ences within a statement generate the once, so the dependence arcs described so dependence, not the statement as a far capture all the possible dependence whole. In the output dependence example relationships. In loops, each statement above, b, C, d, and e can all be read from may be executed many times, and for memory in any order, and the results of many transformations it is necessary b * c and d + e can be executed as soon as to describe dependence that exist be- their operands have been read from tween iterations, called loop-carried memory. S7 * S~ actually means that dependence. the store of the value b * c into a must A simple example of loop-carried de- precede the store of the value d + e into pendence is shown in Figure 3. There is a. When there is a potential ambiguity, no dependence between S’l and S2 with- we will distinguish between different in any single iteration of the loop, but variable references within statements. there is one between two successive iter- ations. When i = k, S2 reads the value of a[k – 1] written by S1 in iteration k – 1. To compute dependence information for 5.2 Representing Dependence loops, the key problem is to understand the use of arrays; scalar variables are To capture the dependence information relatively easy to manage. To track array for a piece of code, the compiler creates a behavior, the compiler must analyze the dependence graph; typically each node in subscript expressions in each array the graph represents one statement. An reference. arc between two nodes indicates that To discover whether there is a depen- there is a dependence between the com- dence in the loop nest, it is sufficient to putations they represent. determine whether any of the iterations Because it can be cumbersome to ac- can write a value that is read or written count for both control and data depen- by any of the other iterations. dence during analysis, sometimes Depending on the language, loop incre- compilers convert control dependence ments may be arbitrary expressions. into data dependence using a technique However, the dependence analysis algo- called if-conversion [Allen et al. 1983]. rithms may require that the loops have If-conversion introduces additional only unit increments. When they do not, boolean variables that encode the condi- the compiler may be able to normalize tional predicates; every statement whose them to fit the requirements of the anal- execution depends on the conditional is ysis, as described in Section 6.3.6. For then modified to test the boolean vari- the remainder of this section we will as- able. 
In the transformed code, data de- sume that all loops are incremented by 1 pendence subsumes control dependence. for each iteration.

ACM Computing Surveys, Vol. 26, No 4, December 1994 356 e David F. Bacon et al.

do il = 11, u1

do id = [d, Ud 1 a[fl(il, . ...id). fm($l, ($l, . .. ..d)] = .. .

2 . ..= a[gl(il )..., id), . . ..9m(& !... , id)] end do

end do end do

Figure 4. GeneraI loop nest.

Figure 4 shows a generalized perfect In other words, there is a dependence nest of d loops. The body of the loop nest when the values of the subscripts are the reads and writes elements of the m- same in different iterations. If no such 1 dimensional array a. The function f, and and J exist, the two references are inde- g, map the current values of the 100P pendent across all iterations of the loop. iteration variables to integers that index In the case of an output dependence the ith dimension of a. The generalized caused by the same write in different loop can give rise to any type of data iterations, the condition is simply Vp: dependence: for instance, two different fp(~) = fp(w. iterations may write the same element of For example, suppose that we are at- a, creating an output dependence. tempting to describe the behavior of the An iteration can be uniquely named by loop in Figure 5(a). Each iteration of the a vector of d elements 1 = (ii, . . . , id), inner loop writes the element a[i, j]. There where each index falls within the itera- is a dependence if any other iteration tion range of its corresponding loop in reads or writes that same element. In the nesting (that is, 1P < iP < u ). The this case, there are many pairs of itera- outermost loop corresponds to t L e left- tions that depend on each other. Con- most index. sider iterations 1 = (1,3) and J = (2, 2). We wish to discover what loop-carried Iteration 1 occurs first, and writes the dependence exist between the two refer- value a[l, 3]. This value is read in itera- ences to a, and to describe somehow those tion J, so there is a flow dependence dependence that exist. Clearly, a refer- from iteration 1 to iteration J. Extend- ence in iteration J can depend only on ing the notation for dependence, we another reference in iteration 1 that was write I + J. executed before it, not after it. We for- When X - Y, we define the depen- malize the notion of “before” with the < dence distance as Y–X=(yl– relation: xl, ..., y~ – x~). In Figure 5(a), the de- pendence distance J – I = (1,– 1).When I< Jiff3p:(iP

ACM Computmg Surveys, Vol. 26, No, 4, December 1994 Compiler Transformations “ 357

doi=2, n We will use the general term depen- do j = 1, n-1 dence vector to encompass both distance a[i, j] = a[i, j] + a[i-l, j+l] and direction vectors. In Figure 5 the end do direction vector for loop (a) is ( <, >), end do and the direction vectors for loop (b) are (a) {(1, -l)} {(=, <),(<, =),(<, >)}. Notethatadi- rection vector entry of < corresponds to a distance vector entry that is greater doi=l, n than zero. do j = 2, n-1 The dependence behavior of a loop is a[jl=(a[jl + a[j-11 + a[j+il)/3 described by the set of dependence vec- end do tors for each pair of possibly conflicting end do references. These can be summarized into (b) {(0,1),(1,0),(1,-1)} a single loop direction vector, at the ex- pense of some loss of information (and Figure 5. Distance vectors. potential for optimization). The depen- dence of the loop in Figure 5(b) can be summarized as ( < , *). The symbol # V (the first nonzero element of the dis- denotes both a < and > direction. and tance vector must be positive). A nega- * denotes < , > and = . tive element in the distance vector means A particular dependence between that the dependence in the corresponding statements S’l and S2 is denoted by writ- loop is on a higher-numbered iteration. If ing the dependence vector above the ar- the first nonzero element was negative, (<) this would indicate a dependence on a row, for example SI -+ S2. future iteration, which is impossible. Burke and Cytron [1986] present a The reader should note that distance framework for dependence analysis that vectors describe dependence among iter- defines a hierarchy of dependence vec- ations, not among array elements. The tors and allows flow and antide~en- operations on array elements create the dences to be treated symmetrically: An dependence, but the distance vectors de- antidependence is simply a flow depen- scribe dependence among iterations. For dence with an impossible dependence instance, the loop nest that updates the vector (V + O). one-dimensional array a in Figure 5(b) has dependence described by the set of 5.4 Subscript Analysis two-element distance vectors {(O, 1), (1, O), In discussing the analysis of loop-carried (1, - l)}. dependence, we omitted an important de- In some cases it is not possible to de- tail: how the compiler decides whether termine the exact dependence distance at two array references might refer to the compile-time, or the dependence distance same element in different iterations. In may vary between iterations; but there is examining a loop nest, first the compiler enough information to partially charac- tries to prove that different iterations are terize the dependence. A direction vector independent by applying various tests to (introduced by Wolfe [1989b]) is com- the subscript expressions. These tests monly used to describe such depen- rely on the fact that the expressions are dence. almost always linear. When dependence For a dependence I * J, we define the are found, the compiler tries to describe direction vector W = (w ~, ..., w~) where them with a direction or distance vector. If the subscript expressions are too com- if iP < jp < plex to analyze, the compiler assumes . if ip = jp the statements are fully dependent on ‘P= — one another such that no change in exe- > ifip >jP. I cution order is permitted.

ACM Computing Surveys, Vol 26, No 4, December 1994 358 ● David F. Bacon et al.

There are a large variety of tests, all of do i = 1, n-l which can prove independence in some a[i, i] = a[i, i+l] cases. It is infeasible to solve the problem end do directly, even for linear subscript expres- Figure 6. Coupled subscripts. sions, because finding dependence is equivalent to the NP-complete problem of finding integer solutions to systems of the presence of coupled subscript linear Diophantine equations [Banerjee expressions. et al. 1979]. Two general and approxi- Section 9.2 discusses a number of stud- mate tests are the GCD [Towle 1976] and ies that examine the subscript ex- Banerjee’s inequalities [Banerjee 1988a]. pressions in scientific applications and Additionally, there are a large number evaluate the effectiveness of different of exact tests that exploit some subscript dependence tests. characteristics to determine whether a particular type of dependence exists. One 6. TRANSFORMATIONS of the less expensive exact tests is the Single-Index Test [Banerjee 1979; Wolfe This section catalogs general-purpose 1989b]. The Delta Test [Goff et al. 1991] program transformations; those transfor- is a more general strategy that examines mations that are only applicable on par- some combinations of dimensions. Other allel machines are discussed in Section 7. tests that are more expensive to evaluate The primary emphasis is on loops, since consider the multiple subscript dimen- that is generally where most of the exe- sions simultaneously, such as the A-test cution time is spent. For each transfor- [Li et al. 1990], multidimensional GCD mation, we provide an example, discuss [Banerjee 1988a], and the power test the benefits and shortcomings, identify [Wolfe and Tseng 1992]. The omega test any variants, and provide citations. [Pugh 1992] uses a linear programming A major goal of optimizing compilers algorithm to solve the dependence equa- for high-performance architectures is to tions. The SUIF compiler project at Stan- discover and exploit parallelism in loops. ford has had success applying a series of We will indicate when a loop can be exe- exact tests, starting with the cheaper cuted in parallel by using a do all loop ones [Maydan et al. 1991]. They use a instead of a do loop. The iterations of a general algorithm (Fourier-Motzkin vari- do all loop can be executed in any order, able elimination [Dantzig and Eaves or all at once. 1974]) as a backup. A standard reference on compilers in One advantage of the multiple-dimen- general is the “Red Dragon” book, to sion exact tests is their ability to handle which we refer for some of the most com- coupled subscript expressions [Goff et al. mon examples [Aho et al. 1986]. We draw 1991]. Two expressions are coupled if the also on previous summaries [Allen and same variable appears in both of them. Cocke 1971; Kuck 1977; Padua and Wolfe Figure 6 shows a loop that has no depen- 1986; Rau and Fisher 1993; Wolfe 1989b]. dence between iterations. The state- Because some transformations were al- ment reads the values to the right of the ready familiar to programmers who ap- diagonal and updates the diagonal. A test plied them manually, often we cite only that only examines one dimension of the the work of researchers who have subscript expressions at a time, however, systematized and automated the imple- will not detect the independence because mentation of these transformations. 
Ad- there are many pairs of iterations where ditionally, we omit citations to works that the expression i in one is equal to the are restricted to basic blocks when global expression i + 1 in the other. The more optimization techniques exist. Even so, general exact tests would discover that the origin of some optimizations is murky. the iterations are independent despite For instance, Ershov’s ALPHA compiler

ACM Computing Surveys, Vol 26, No 4, December 1994 Compiler Transformations o 359

[Ershov 1966] performed interprocedural doi=l, n constant propagation, albeit in a limited a[i] = a[i] + c*i form, in 1964! end do (a) original loop

6.1 Data-Flow-Based Loop Transformations

A number of classical loop optimizations T=c are based on data-flow analysis, which doi=l, n tracks the flow of data through a pro- a[i] = a[i] + T gram’s variables [Muchnick and Jones T=T+c 1981]. These optimizations are summa- end do rized by Aho et al. [1986]. (b) after strength reduction

Figure 7. Strength reduction example, 6.1.1 Loop-Based Strength Reduction

Reduction in strength replaces an ex- pression in a loop with one that is equiv- Table 1. Strength Reductions (the variable c IS loop invariant; x may vary between Iterations) alent but uses a less expensive operator [Allen 1969; Allen et al. 1981]. Figure Expression Initialization Use Update 7(a) shows a loop that contains a multi- cxi T=c T T=T+c plication. Figure 7(b) is a transformed c’ T=c T T=Txc version of the loop where the multiplica- (-1)’ T=–1 T T=–T tion has been replaced by an addition. z/c T = ~/C XXT Table 1 shows how strength reduction can be applied to a number of operations. The operation in the first column is as- Figure 8(a–c). This special case is most sumed to appear within a loop that iter- important on architectures in which inte- ates over i from 1 to n. When the loop is ger multiply operations take more cycles transformed, the compiler initializes a than integer additions. (Current exam- temporary variable T with the expression ples include the SPARC [Sun Microsys- in the second column, The operation tems 1991] and the Alpha [Sites 1992].) within the loop is replaced by the expres- Strength reduction may also make other sion in the third column, and the value of optimizations possible, in particular the T is updated each iteration with the value elimination of induction variables, as is in the fourth. shown in the next section. A variable whose value is derived from The term strength reduction is also the number of iterations that have been applied to operator substitution in a non- executed by an enclosing loop is called an loop context, like replacing x x 2 with induction variable. The loop control vari- x + x. These optimizations are covered in able of a do statement is the most com- Section 6.6.7. mon kind of induction variable, but other variables may also be induction 6.1.2 induction Variable Elimination variables. The most common use of strength re- Once strength reduction has been per- duction, often implemented as a special formed on induction variable expres- case, is strength reduction of induction sions, the compiler can often eliminate variable expressions [Aho et al. 1986; the original variable entirely. The loop Allen 1969; Allen et al. 1981; Cocke and exit test must then be expressed in terms Schwartz 1970]. of one of the strength-reduced induction Strength reduction can be applied to variables; the optimization is called lin- products involving induction variables by ear function test replacement [Allen 1969; converting them to references to an Aho et al. 1986]. The replacement not equivalent running sum, as shown in only reduces the number of operations in

ACM Computmg Surveys, Vol. 26, No. 4, December 1994 Davidl? Bacon et al.

do i= l,n a[i] = a[i] + sqrt(x) n doi= 1, end do a[il = a[i] + c (a) original loop end do (a) original code if (n > O) C = sqrt(x) do i = l,n LF F4 , C(R30) ;load c Into F4 a[i] = a[i] + C LU R8 , n(R30) ;load n into R8 end do #1 ;set i (R9) to 1 LI R9 , (b) after code motion ADDI R12, R30, #a ;Ri2=address(a[l]) Loop:14ULTI R1O, R9, #4 ;RiO=i*4 Figure9. Loop-invariant code motion. ADDI RIO,R12,R1O ;RiO=address(a[i+l]) LF F5, -4(R1O) ;load a[i] into F5 ADDF F5, F5, F4 ;a[i] :=a[i]+c SF -4(R1O), F5 ;store new a[il SLT Ril, R9, R8 ;Ril = i

LF F4, c(R30) ;load c into F4 LU R8, n(R30) ;load n into R8 6.1.3 Loop-lnvarianf Code Motion LI R9, #l ;set i (R9) to 1 When a computation appears inside a ADDI RI0,R30, #a ;RIO=address(a[i]) 100P, but its result does not change be- Loop : LF F5, (R1O) ;load a[i] into F5 tween iterations, the compiler can move ADDF F5, F5, F4 ;a[i]:=a[i]+c that computation outsidethe loop [Ahoet SF (R1O), F5 ;store new a[i] al, 1986; Cocke and Schwartz 1970]. ADDI R9, R9, #1 ;i=i+l Code motion canbe applied at a high ADDI R1O, Ri0,#4 ;RIO=address(a[i+l]) level to expressions in the source code, or SLT Rll, R9, R8 ;Rll = i

ACM Computmg Sumeys,Vol, 26,N0 4, December 1994 Compiler Transformations ● 361

do i=2, n Conditionals that are candidates for a[i] = a[i] + c unswitching can be detected during the if (x < 7) then analysis for code motion, which identifies b[i] = a[i] * c [i] loop-invariant values. else In Figure 10(a) the variable x is loop b[i] = a[i-1] * b[i-1] invariant, allowing the loop to be end if unstitched and the true branch to be end do executed in parallel, as shown in Figure (a) original loop 10(b). Note that, as with loop-invariant code motion, if there is any chance that the condition evaluation will cause an if (n > 1) then exception, it must be guarded by a test if (x < 7) then that the loop will be executed. do all i=2, n In a loop nest where the inner loop has a[i] = a[i] + c unknown bounds, if code is generated b[i] = a[i] * c[i] straightforwardly there will be a test be- end do all fore the body of the inner loop to deter- else mine whether it should be executed at do i=2, n all. The test for the inner loop will be a[il = a[i] + c repeated every time the outer loop is exe- b[i] = a[i-1] * b[i-1] cuted. When the compiler uses an inter- end do mediate representation of the program end if that exposes the test explicitly, unswitch- end if ing can be applied to move the test out- (b) after unswitching side of the outer loop. The RS/6000 XL C/Fortran compiler uses unswitching for Figure IO. . this purpose [0’Brien et al. 1990].

6.2 Loop Rerxdering more general term referring to any transfo~mation that moves a- comput~- In this section we describe transforma- tion to an earlier point in the program tions that change the relative order of [Aho et al. 1986]. While loop-invariant execution of the iterations of a loop nest code motion is one particularly common or nests. These transformations are pri- form of hoisting, there are others: an marily used to expose parallelism and expression can be evaluated earlier to improve memory locality. reduce register pressure or avoid arith- Some compilers only apply reordering metic unit latency, or a load instruction transformations to perfect loop nests (see might be moved upward to reduce the Section 6.2.7). To increase the opportuni- effect of memory access latency. ties for optimization, such a compiler can sometimes apply loop distribution to ex- tract perfect loop nests from an imperfect 6.1.4 Loop Unswitching nesting. Loop unswitching is applied when a loop The compiler determines whether a contains a conditional with a loop- loop can be executed in parallel by exam- invariant test condition. The loop is then ining the loop-carried dependence. The replicated inside each branch of the obvious case is when all the dependence conditional, saving the overhead of condi- distances for the loop are O (direction tional branching inside the loop, reduc- =), meaning that there are no depen- ing the code size of the loop body, and dence carried across iterations by the possibly enabling the parallelization of a loop. Figure n(a) is an example; the dis- branch of the conditional [Allen and tance vector for the loop is (O, 1), so the Cocke 1971]. outer loop is parallelizable.

ACM Computing Surveys, Vol. 26, No. 4, December 1994 362 - David F. Bacon et al.

doi=l, n double precision a[*] doj=2, n a[i, j] = a[i, j-1] + c do i = 1, 1024* stride, stride end do a[i] = a[i] + c end do end do (a) outer loop is parallelizable (a) loop with varying stride

doi=i, n I etride \ Cache I TLB I Relative \ doj=l, n Misses Misses Speed (%) a[i, j] = a[i-l, j] + a[i-l, j+l] 1 64 2 100 end do 2 128 4 83 end do (b) inner loop is parallelizable.

Figure Il. Dependence conditions forparallelizing t I 1 loops. 16 I 1024 I 32 I 23 64 I 1024 I 128 I 19 256 I 1024 512 12 More generally, the pth loop in a loop 512 I 1024 1024 8 nest is parallelizable if for every distance (b) effect of stride vector V=(vl, . . ..vP. v~), v~), Figure 12. Predicted effect of stride on perfor- up =ov3qo. mance of an IBM RS/6000 for the above loop. Array elements are double precision (8 bytes): miss In Figure 1 l(b), the distance vectors rates are per 1024 iterations. Beyond stride 16, are {(1, 0), (1, – l)}, so the inner loop is TLB misses dominate [IBM 1992], parallelizable. Both references on the right-hand side of the expression read elements of a from row i — 1, which was loop nest to increase the granularity of written during the previous iteration of each iteration and reduce the number the outer loop. Therefore the elements of of barrier synchronizations; row i may be calculated and written in any order. . reduce stride, ideally to stride 1; and ● increase the number of loop-invariant 6.2.1 expressions in the inner loop.

Loop interchange exchanges the position Care must be taken that these benefits of two loops in a perfect loop nest, gener- do not cancel each other out, For in- ally moving one of the outer loops to the stance, an interchange that improves innermost position [Allen and Kennedy register reuse may change a stride-1 ac- 1984; 1987; Wolfe 1989b]. Interchange is cess pattern to a stride-n access pattern one of the most powerful transformations with much lower overall performance due and can improve performance in many to increased cache misses. Figure 12 ways. demonstrates the dramatic effect of dif- Loop interchange may be performed to: ferent strides on an IBM RS /6000. In Figure 13(a), the inner l’oop accesses e enable vectorization by interchanging array a with stride n (recall that we are an inner, dependent loop with an outer, assuming the Fortran convention of col- independent loop; umn-maj or storage order). By inter- * improve vectorization by moving the changing the loops, we convert the inner independent loop with the largest range loop to stride-1 access, as shown in into the innermost position; Figure 13(b). e improve parallel performance by mov- For a large array in which less than ing an in-depende-nt loop outwa~d in a one column fits in the cache, this opti-


    do i = 1, n
      do j = 1, n
        total[i] = total[i] + a[i, j]
      end do
    end do
    (a) original loop nest

    do j = 1, n
      do i = 1, n
        total[i] = total[i] + a[i, j]
      end do
    end do
    (b) interchanged loop nest

Figure 13. Loop interchange.

    do i = 2, n
      do j = 1, n-1
        a[i, j] = a[i-1, j+1]
      end do
    end do
    (a)

Figure 14. Original loop (a); original traversal order (b); traversal order after interchange (c). [iteration-space diagrams]

For a large array in which less than one column fits in the cache, this optimization reduces the number of cache misses on a from n^2 to n^2 * elementsize/linesize, or n^2/16 with 4-byte elements and the 64-byte lines of S-DLX. However, the original loop allows total[i] to be placed in a register, eliminating the load/store operations in the inner loop (see scalar replacement in Section 6.5.4). So the optimized version increases the number of load/store operations for total from 2n to 2n^2. If a fits in the cache, the original loop is better.

On a vector architecture, the transformed loop enables vectorization by eliminating the dependence on total[i] in the inner loop.

Interchanging loops is legal when the altered dependences are legal and when the loop bounds can be switched. If two loops p and q in a perfect loop nest of d loops are interchanged, each dependence vector V = (v_1, ..., v_p, ..., v_q, ..., v_d) in the original loop nest becomes V' = (v_1, ..., v_q, ..., v_p, ..., v_d) in the transformed loop nest. If V' is lexicographically positive, then the dependence relationships of the original loop are satisfied.

A loop nest of just two loops can be interchanged unless it has a dependence vector of the form (<, >). Figure 14(a) shows a loop nest with the dependence (1, -1), giving rise to the loop-carried dependence shown in Figure 14(b). The order in which the iterations are executed is shown by the dotted line. The traversal order after interchange is shown in Figure 14(c): some iterations are executed before iterations that they depend on, so the interchange is illegal.

Switching the loop bounds is straightforward when the iteration space is rectangular, as in the loop nest in Figure 13. In this case the bounds of the inner loop are independent of the indices of the containing loop, and the two can simply be exchanged. When the iteration space is not rectangular, computing the bounds is more complex. Triangular spaces are often used by programmers, and trapezoidal spaces are introduced by loop skewing (discussed in the next section). Further techniques are necessary to manage imperfectly nested loops. Some of the variations are discussed in detail by Wolfe [1989b] and Wolf and Lam [1991].

6.2.2 Loop Skewing

Loop skewing is an enabling transformation that is primarily useful in combination with loop interchange [Lamport 1974; Muraoka 1971; Wolfe 1989b]. Skewing was invented to handle wavefront computations, so called because the updates to the array propagate like a wave across the iteration space.


    do i = 2, n-1
      do j = 2, m-1
        a[i, j] = (a[i-1, j] + a[i, j-1] + a[i+1, j] + a[i, j+1]) / 4
      end do
    end do
    (a) original code; dependences {(1, 0), (0, 1)}

    (b) original iteration space   (c) skewed iteration space   [iteration-space diagrams]

    do i = 2, n-1
      do j = i+2, i+m-1
        a[i, j-i] = (a[i-1, j-i] + a[i, j-1-i] + a[i+1, j-i] + a[i, j+1-i]) / 4
      end do
    end do
    (d) skewed code; dependences {(1, 1), (0, 1)}

    do j = 4, n+m-2
      do i = max(2, j-m+1), min(n-1, j-2)
        a[i, j-i] = (a[i-1, j-i] + a[i, j-1-i] + a[i+1, j-i] + a[i, j+1-i]) / 4
      end do
    end do
    (e) skewed and interchanged code; dependences {(1, 0), (1, 1)}

Figure 15. Loop skewing.

In Figure 15(a) we show a typical wavefront computation. Each element is computed by averaging its four nearest neighbors. While neither of the loops is parallelizable in its original form, each array diagonal ("wavefront") could be computed in parallel. The iteration space and dependences are shown in Figure 15(b), with the dotted lines indicating the wavefronts.

Skewing is performed by adding the outer loop index multiplied by a skew factor, f, to the bounds of the inner iteration variable, and then subtracting the same quantity from every use of the inner iteration variable inside the loop. Because it alters the loop bounds but then adjusts the uses of the corresponding index variables to compensate, skewing does not change the meaning of the program and is always legal.

The original loop nest in Figure 15(a) can be interchanged, but neither loop can be parallelized, because there is a dependence on both the inner loop (0, 1) and on the outer loop (1, 0). The result when f = 1 is shown in Figure 15(c-d). The transformed code is equivalent to the original, but the effect on the iteration space is to align the diagonal wavefronts of the original loop nest so that for a given value of j, all iterations in i can be executed in parallel. To expose this parallelism, the skewed loop nest must also be interchanged. After skew and interchange, the loop nest has distance vectors {(1, 0), (1, 1)}. The first dependence allows the inner loop to be parallelized because the corresponding dependence distance is 0. The second dependence allows the inner loop to be parallelized because it is a dependence on previous iterations of the outer loop.

Skewing can expose parallelism for a nest of two loops with a set of distance vectors {V_k} only if a suitable skew factor can be found. When skewing by f, the original distance vector (v_1, v_2) becomes (v_1, f*v_1 + v_2). For any dependence where v_2 < 0, the goal is to find f such that f*v_1 + v_2 >= 1. The correct skew factor f is computed by taking the maximum of f_k = ceil((1 - v_2)/v_1) over all the dependences [Kennedy et al. 1993].

Interchanging of skewed loops is complicated by the fact that their bounds depend on the iteration variables of the enclosing loops. For two loops with bounds i1 = l1, u1 and i2 = l2, u2, where l2 and u2 are expressions independent of i1, the skewed inner loop has bounds i2 = f*i1 + l2, f*i1 + u2. After interchange, the bounds are

    do i2 = f*l1 + l2, f*u1 + u2
      do i1 = max(l1, ceil((i2 - u2)/f)), min(u1, floor((i2 - l2)/f))
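As a rough C sketch of skew (f = 1) followed by interchange, analogous to Figure 15(e) but for a simpler two-neighbor wavefront (everything below is illustrative and assumes a row-major n x m array):

    static int imax(int x, int y) { return x > y ? x : y; }
    static int imin(int x, int y) { return x < y ? x : y; }

    void wavefront(double *a, int n, int m)
    {
        /* original form: neither loop is parallel
         *   for (int i = 1; i < n; i++)
         *     for (int j = 1; j < m; j++)
         *       a[i*m + j] = 0.5 * (a[(i-1)*m + j] + a[i*m + (j-1)]);
         */
        for (int w = 2; w <= (n - 1) + (m - 1); w++) {          /* w = i + j: one wavefront */
            for (int i = imax(1, w - (m - 1)); i <= imin(n - 1, w - 1); i++) {
                int j = w - i;                                  /* recover the column index */
                a[i*m + j] = 0.5 * (a[(i-1)*m + j] + a[i*m + (j-1)]);
            }
        }
    }

For a fixed wavefront w, every i iteration reads only values from wavefront w-1, so the inner loop is parallelizable.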


An alternative method for handling wavefront computations is supernode partitioning [Irigoin and Triolet 1988].

6.2.3 Loop Reversal

Reversal changes the direction in which the loop traverses its iteration range [Wedel 1975]. It is often used in conjunction with other iteration space reordering transformations because it changes the dependence vectors [Wolfe 1989b].

As an optimization in its own right, reversal can reduce loop overhead by eliminating the need for a compare instruction on architectures without a compound compare-and-branch (such as the Alpha [Sites 1992]). The loop is reversed so that the iteration variable runs down to zero, allowing the loop to end with a branch-if-not-equal-to-zero instruction (BNEZ, on S-DLX).

Reversal can also eliminate the need for temporary arrays in implementing Fortran 90 array statements (see Section 6.4.3).

If loop p in a nest of d loops is reversed, then for each dependence vector V, the entry v_p is negated. The reversal is legal if each resulting vector V' is lexicographically positive, that is, when v'_p = 0 or there exists q < p such that v'_q > 0. For instance, the inner loop of a loop nest with direction vectors {(<, =), (<, >)} can be reversed, because the resulting dependences are all still lexicographically positive.

Figure 16 shows how reversal can enable loop interchange: the original loop nest (a) has the distance vector (1, -1) that prevents interchange because the resulting distance vector (-1, 1) is not lexicographically positive; the reversed loop nest (b) can legally be interchanged.

    do i = 1, n
      do j = 1, n
        a[i, j] = a[i-1, j+1] + 1
      end do
    end do
    (a) original loop nest: distance vector (1, -1). Interchange is not possible.

    do i = 1, n
      do j = n, 1, -1
        a[i, j] = a[i-1, j+1] + 1
      end do
    end do
    (b) inner loop reversed: distance vector (1, 1). Loops may be interchanged.

Figure 16. Loop reversal.

    do i = 1, n
      a[i] = a[i] + c
    end do
    (a) original loop

    TN = (n/64)*64
    do TI = 1, TN, 64
      a[TI:TI+63] = a[TI:TI+63] + c
    end do
    do i = TN+1, n
      a[i] = a[i] + c
    end do
    (b) after strip mining

    R9 = address of a[TI]
        LV    V1, R9        ; V1 <- a[TI:TI+63]
        ADDSV V1, F8, V1    ; V1 <- V1 + c (F8 = c)
        SV    V1, R9        ; a[TI:TI+63] <- V1
    (c) vector assembly code for the update of a[TI:TI+63]

Figure 17. Strip mining.

6.2.4 Strip Mining

Strip mining is a method of adjusting the granularity of an operation, especially a parallelizable operation [Abu-Sufah et al. 1981; Allen 1983; Loveman 1977].

An example is shown in Figure 17. The strip-mined computation is expressed in array notation, and is equivalent to a do all loop. Cleanup code is needed if the iteration length is not evenly divisible by the strip length.

One of the most common uses of strip mining is to choose the number of independent computations in the innermost loop of a nest.

On a vector machine, for example, the serial loop can then be converted into a series of vector operations, each vector comprising a single "strip" [Cray Research 1988]. Strip mining is also used for SIMD compilation [Weiss 1991], for combining send operations in a loop on distributed-memory multiprocessors [Hiranandani et al. 1992], and for limiting the size of compiler-generated temporary arrays [Abu-Sufah 1979; Wolfe 1989b].

Strip mining often requires other transformations to be performed first. Loop distribution (see Section 6.2.7) can expose simple loops from within an original loop nest that is too complex to strip mine. Loop interchange can be used to move a parallelizable loop into the innermost position of a loop nest or to maximize the length of the strip.

The examples given above demonstrate how strip mining can create a larger unit of work out of smaller ones. The transformation can also be used in the reverse direction, as shown in Section 7.4.
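A plain C sketch of strip mining, corresponding to Figure 17(b) (the strip length and function name are illustrative):

    #define STRIP 64

    void add_scalar(double *a, double c, int n)
    {
        int i;
        for (i = 0; i + STRIP <= n; i += STRIP)
            for (int k = i; k < i + STRIP; k++)   /* one strip: independent, vectorizable */
                a[k] = a[k] + c;
        for (; i < n; i++)                        /* cleanup for leftover iterations */
            a[i] = a[i] + c;
    }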

    do i = 1, n
    1 a[i+k] = b[i]
    2 b[i+k] = a[i] + c[i]
    end do
    (a) because of the write to a, S1 δ S2; because of the write to b, S2 δ S1.

    do TI = 1, n, k
      do all i = TI, TI+k-1
    1   a[i+k] = b[i]
    2   b[i+k] = a[i] + c[i]
      end do all
    end do
    (b) k iterations can be performed in parallel because that is the minimum dependence distance.

    (c) iteration space when n = 6 and k = 2  [diagram]

Figure 18. Cycle shrinking.

6.2.5 Cycle Shrinking

Cycle shrinking is essentially a specialization of strip mining. When a loop has dependences that prevent it from being executed in parallel (that is, converted to a do all), the compiler may still be able to expose some parallelism if the dependence distance is greater than one. In this case cycle shrinking will convert a serial loop into an outer serial loop and an inner parallel loop [Polychronopoulos 1987a]. Cycle shrinking is primarily useful for exposing fine-grained parallelism.

For instance, in Figure 18(a), a[i+k] is written in iteration i and read in iteration i+k; the dependence distance is k. Consequently the first k iterations can be performed in parallel provided that none of the subsequent iterations is allowed to begin until the first k are complete. The same is then done for the next k iterations, as shown in Figure 18(b). The iteration space dependences are shown in Figure 18(c): each group of k iterations is dependent only on the previous group.

The result is potentially a speedup by a factor of k, but k is likely to be small (usually 2 or 3), so this optimization is normally limited to exposing parallelism that can be exploited at the instruction level (for instance by loop unrolling). Note that k must be constant within the loop and must at least be known to be positive at compile time.
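A hedged C sketch of cycle shrinking for the loop of Figure 18, assuming the dependence distance K is a compile-time constant and that an OpenMP-style pragma is available for the inner block (names and the pragma are illustrative, and a must have at least n+K elements):

    #define K 4

    void shrink(double *a, double *b, const double *c, int n)
    {
        for (int t = 0; t < n; t += K) {              /* outer loop stays serial */
            int hi = (t + K < n) ? t + K : n;
            #pragma omp parallel for                  /* each block of K iterations is independent */
            for (int i = t; i < hi; i++) {
                a[i + K] = b[i];                      /* writes land K iterations ahead ...   */
                b[i + K] = a[i] + c[i];               /* ... so no conflict within the block  */
            }
        }
    }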


6.2.6 Loop Tiling

Tiling is the multidimensional generalization of strip mining. Tiling (also called blocking) is primarily used to improve cache reuse (Q_C) by dividing an iteration space into tiles and transforming the loop nest to iterate over them [Abu-Sufah et al. 1981; Gannon et al. 1988; Lam et al. 1991; Wolfe 1989a]. However, it can also be used to improve processor, register, TLB, or page locality.

The need for tiling is illustrated by the loop in Figure 19(a) that assigns a the transpose of b. With the j loop innermost, access to b is stride-1, while access to a is stride-n. Interchanging does not help, since it makes access to b stride-n. By iterating over subrectangles of the iteration space, as shown in Figure 19(b), the loop uses every cache line fully. The inner two loops of a matrix multiply have this structure; tiling is critical for achieving high performance in dense matrix multiplication.

A pair of adjacent loops can be tiled if they can legally be interchanged. After tiling, the outer pair of loops can be interchanged to improve locality across tiles, and the inner loops can be exchanged to exploit inner-loop parallelism or register locality.

    do i = 1, n
      do j = 1, n
        a[i, j] = b[j, i]
      end do
    end do
    (a) original loop

    do TI = 1, n, 64
      do TJ = 1, n, 64
        do i = TI, min(TI+63, n)
          do j = TJ, min(TJ+63, n)
            a[i, j] = b[j, i]
          end do
        end do
      end do
    end do
    (b) tiled loop

Figure 19. Loop tiling.
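A C sketch of the same tiling (tile size and names are illustrative; C is row-major, so the stride argument is mirrored but the structure is identical):

    #define TILE 64

    void transpose(double *a, const double *b, int n)
    {
        for (int ii = 0; ii < n; ii += TILE)
            for (int jj = 0; jj < n; jj += TILE)
                /* visit one TILE x TILE subrectangle at a time so the
                 * cache lines of both a and b are fully used */
                for (int i = ii; i < ii + TILE && i < n; i++)
                    for (int j = jj; j < jj + TILE && j < n; j++)
                        a[i * n + j] = b[j * n + i];
    }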

    do i = 1, n
      a[i] = a[i] + c
      x[i+1] = x[i]*7 + x[i+1] + a[i]
    end do
    (a) original loop

    do all i = 1, n
      a[i] = a[i] + c
    end do all
    do i = 1, n
      x[i+1] = x[i]*7 + x[i+1] + a[i]
    end do
    (b) after loop distribution

Figure 20. Loop distribution.

6.2.7 Loop Distribution

Distribution (also called loop fission) breaks a single loop into many. Each of the new loops has the same iteration space as the original, but contains a subset of the statements of the original loop [Kuck 1977; Kuck et al. 1981; Muraoka 1971].

Distribution is used to

● create perfect loop nests;
● create subloops with fewer dependences;
● improve instruction cache and instruction TLB locality due to shorter loop bodies;
● reduce memory requirements by iterating over fewer arrays; and
● increase register reuse by decreasing register pressure.

Figure 20 is an example in which distribution removes dependences and allows part of a loop to be run in parallel.

Distribution may be applied to any loop, but all statements belonging to a dependence cycle (called a π-block [Kuck 1977]) must be placed in the same loop, and if S1 δ S2 in the original loop, then the loop containing S1 must precede the loop that contains S2. If the loop contains control flow, applying if-conversion (see Section 5.2) can expose greater opportunities for distribution. An alternative is to use a control dependence graph [Kennedy and McKinley 1990].

A specialized version of this transformation is distribution by name, originally called horizontal distribution of name partition [Abu-Sufah et al. 1981]. Rather than performing full dependence analysis on the loop, the statements are partitioned into sets that reference mutually exclusive variables. These statements are guaranteed to be independent. When the arrays in question are large, distribution by name can increase cache locality.

Note that the loop in Figure 20 cannot be distributed using fission by name, since both statements reference a.
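A C sketch of the distribution in Figure 20 (names are illustrative; the recurrence on x stays serial while the update of a becomes freely parallelizable):

    void distribute(double *a, double *x, double c, int n)
    {
        /* original:
         *   for (int i = 0; i < n; i++) {
         *       a[i] = a[i] + c;
         *       x[i + 1] = x[i] * 7 + x[i + 1] + a[i];
         *   }
         */
        for (int i = 0; i < n; i++)      /* no loop-carried dependence: parallel/vector */
            a[i] = a[i] + c;
        for (int i = 0; i < n; i++)      /* serial recurrence on x in its own loop */
            x[i + 1] = x[i] * 7 + x[i + 1] + a[i];
    }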

6.2.8 Loop Fusion

The inverse transformation of distribution is fusion (also called jamming) [Ershov 1966]. It can improve performance by

● reducing loop overhead;
● increasing instruction parallelism;
● improving register, vector [Wolfe 1989b], data cache, TLB, or page [Abu-Sufah 1979] locality; and
● improving the load balance of parallel loops.

Figure 21. Two loops containing S1 and S2 cannot be fused when S2 δ S1 in the fused loop. [iteration-space diagram]

In Figure 20, distribution enables the parallelization of part of the loop. However, fusing the two loops improves register and cache locality since a[i] need only be loaded once. Fusion also increases instruction parallelism by increasing the ratio of floating-point operations to integer operations in the loop and reduces loop overhead by a factor of two. With large n, the distributed loop should run faster on a vector machine, while the fused loop should be better on a superscalar machine.

For two loops to be fused, they must have the same loop bounds; when the bounds are not identical, it is sometimes possible to make them identical by peeling (described in Section 6.3.5) or by introducing conditional expressions into the body of the loop. Two loops with the same bounds may be fused if there do not exist statements S1 in the first loop and S2 in the second such that they have a dependence S2 δ S1 in the fused loop. The reason this would be incorrect is that before fusing, all instances of S1 execute before any S2. After fusing, corresponding instances are executed together. If any instance of S1 has a dependence on (i.e., must be executed after) any subsequent instance of S2, the fusion alters execution order illegally, as shown in Figure 21.
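A small C sketch of a legal fusion (illustrative names; the only dependence in the fused body runs forward from the first statement to the second, so the transformation is safe):

    void fuse(double *a, double *b, double c, int n)
    {
        /* before fusion:
         *   for (int i = 0; i < n; i++) a[i] = a[i] + c;
         *   for (int i = 0; i < n; i++) b[i] = b[i] * a[i];
         */
        for (int i = 0; i < n; i++) {
            a[i] = a[i] + c;       /* a[i] stays in a register ... */
            b[i] = b[i] * a[i];    /* ... for immediate reuse here */
        }
    }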

6.3 Loop Restructuring

This section describes loop transformations that change the structure of the loop, but leave the computations performed by an iteration of the loop body and their relative order unchanged.

6.3.1 Loop Unrolling

Unrolling replicates the body of a loop some number of times called the unrolling factor (u) and iterates by step u instead of step 1. The benefits of unrolling have been studied on several different architectures [Dongarra and Hinds 1979]; it is a fundamental technique for generating the long instruction sequences required by VLIW machines [Ellis 1986].

Unrolling can improve the performance by

● reducing loop overhead;
● increasing instruction parallelism; and
● improving register, data cache, or TLB locality.

In Figure 22, we show all three of these improvements in an example. Loop overhead is cut in half because two iterations are performed before the test and branch at the end of the loop. Instruction parallelism is increased because the second assignment can be performed while the results of the first are being stored and the loop variables are being updated. If array elements are assigned to registers (either directly or by using scalar replacement, as described in Section 6.5.4), register locality will improve because a[i] and a[i+1] are used twice in the loop body, reducing the number of loads per iteration from 3 to 2.


    do i = 2, n-1
      a[i] = a[i] + a[i-1] * a[i+1]
    end do
    (a) original loop

    do i = 2, n-2, 2
      a[i] = a[i] + a[i-1] * a[i+1]
      a[i+1] = a[i+1] + a[i] * a[i+2]
    end do
    if (mod(n-2, 2) = 1) then
      a[n-1] = a[n-1] + a[n-2] * a[n]
    end if
    (b) loop unrolled twice

Figure 22. Loop unrolling.

If the target machine has double- or multiword loads, unrolling often allows several loads to be combined into one.

The if statement at the end of Figure 22(b) is the loop epilogue that must be generated when it is not known at compile time whether the number of iterations of the loop will be an exact multiple of the unrolling factor u. If u > 2, the loop epilogue is itself a loop.

Unrolling has the advantage that it can be applied to any loop, and can be done profitably at both the high and the low levels. Some compilers also perform loop rerolling prior to unrolling because programs often contain loops that were unrolled by hand for a different target architecture.

Figure 23(a-c) shows the effects of unrolling in more detail. The source code in Figure 23(a) is translated to the assembly code in Figure 23(b), which takes 6 cycles per result on S-DLX. After unrolling 3 times, the code requires 8 cycles per iteration, or 2 2/3 cycles per result, as shown in Figure 23(c). The original loop stalls for one cycle waiting for the load, and for two cycles waiting for the ADDF to complete. In the unrolled loop some of these cycles are filled.

Most compilers for high-performance machines will unroll at least the innermost loop of a nesting. Outer loop unrolling is not as universal because it yields replicated instances of the inner loops. To avoid the additional control overhead, the compiler can often fuse the copies back together, yielding the same loop structure that appeared in the original code. This combination of transformations is sometimes referred to as unroll-and-jam [Callahan et al. 1988].

Loop quantization [Nicolau 1988] is another approach to unrolling that avoids replicated inner loops. Rather than creating multiple copies and then subsequently eliminating them, quantization adds additional statements to the innermost loop directly. The iteration ranges are changed, but the structure of the loop nest remains the same. However, unlike straightforward unrolling, quantization changes the order of execution of the loop and is not always a legal transformation.

6.3.2 Software Pipelining

Another technique to improve instruction parallelism is software pipelining [Lam 1988]. In hardware pipelining, instruction execution is broken into stages, such as Fetch, Execute, and Write-back. The first instruction is fetched in the first clock cycle. In the next cycle, the second instruction is fetched while the first is executed, and so on. Once the pipeline has been filled, the machine will complete 1 instruction per cycle.

In software pipelining, the operations of a single loop iteration are broken into s stages, and a single iteration performs stage 1 from iteration i, stage 2 from iteration i-1, etc. Startup code must be generated before the loop to initialize the pipeline for the first s-1 iterations, and cleanup code must be generated after the loop to drain the pipeline for the last s-1 iterations.

Figure 23(d) shows how software pipelining improves the performance of a simple loop. The depth of the pipeline is s = 3. The software-pipelined loop produces one new result every iteration, or 4 cycles per result.
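A source-level C sketch of unrolling with an explicit epilogue (the unrolling factor and names are illustrative; a production compiler would also reschedule the unrolled body):

    void saxpy_unrolled(float *y, const float *x, float a, int n)
    {
        int i;
        for (i = 0; i + 3 < n; i += 4) {   /* main body: four results per trip */
            y[i]     += a * x[i];
            y[i + 1] += a * x[i + 1];
            y[i + 2] += a * x[i + 2];
            y[i + 3] += a * x[i + 3];
        }
        for (; i < n; i++)                 /* epilogue: leftover iterations */
            y[i] += a * x[i];
    }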


    do i = 1, n
      a[i] = a[i] + c
    end do
    (a) the initial loop

    (b) S-DLX assembly code for the initial loop (6 cycles per result)
    (c) after unrolling 3 times and rescheduling (loop prologue and epilogue omitted)
    (d) after software pipelining (s = 3)
    (e) after combining software pipelining (s = 3) with unrolling (u = 2)

Figure 23. Increasing instruction parallelism in loops. [assembly listings for panels (b)-(e) appear in the original figure]


Finally, Figure 23(e) shows the result of combining software pipelining (s = 3) with unrolling (u = 2). The loop takes 5 cycles per iteration, or 2 1/2 cycles per result. Unrolling alone achieves 2 2/3 cycles per result. If software pipelining is combined with unrolling by u = 3, the resulting loop would take 6 cycles per iteration, or two cycles per result, which is optimal because only one memory operation can be initiated per cycle.

Figure 24 illustrates the difference between unrolling and software pipelining: unrolling reduces overhead, while pipelining reduces the startup cost of each iteration.

Figure 24. Loop unrolling vs. software pipelining: (a) loop unrolling; (b) software pipelining. [timing diagrams of overlapped execution]

Perfect pipelining combines unrolling by loop quantization with software pipelining [Aiken and Nicolau 1988a; 1988b]. A special case of software pipelining is predictive commoning, which is applied to memory operations. If an array element written in iteration i-1 is read in iteration i, then the first element is loaded outside of the loop, and each iteration contains one load and one store. The IBM RS/6000 XL C/Fortran compiler [O'Brien et al. 1990] performs this optimization.

If there are no loop-carried dependences, the length of the pipeline is the length of the dependence chain. If there are multiple, independent dependence chains, they can be scheduled together subject to resource availability, or loop distribution can be applied to put each chain into its own loop. The scheduling constraints in the presence of loop-carried dependences and conditionals are more complex; the details are discussed in Aiken and Nicolau [1988a] and Lam [1988].

6.3.3 Loop Coalescing

Coalescing combines a loop nest into a single loop, with the original indices computed from the resulting single induction variable [Polychronopoulos 1987b; 1988]. Coalescing can improve the scheduling of the loop on a parallel machine and may also reduce loop overhead.

In Figure 25(a), for example, if n and m are slightly larger than the number of processors P, then neither of the loops schedules as well as the outer parallel loop, since executing the last n - P iterations will take the same time as the first P. Coalescing the two loops ensures that P iterations can be executed every time except during the last (nm mod P) iterations, as shown in Figure 25(b).

    do all i = 1, n
      do all j = 1, m
        a[i, j] = a[i, j] + c
      end do all
    end do all
    (a) original loop

    do all T = 1, n*m
      i = (T-1) / m + 1
      j = MOD(T-1, m) + 1
      a[i, j] = a[i, j] + c
    end do all
    (b) coalesced loop

    real TA[n*m]
    equivalence (TA, a)
    do all T = 1, n*m
      TA[T] = TA[T] + c
    end do all
    (c) collapsed loop

Figure 25. Loop coalescing vs. collapsing.

Coalescing itself is always legal since it does not change the iteration order of the loop. The iterations of the coalesced loop may be parallelized if all the original loops were parallelizable. A criterion for parallelizability in terms of dependence vectors is discussed in Section 6.2.
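A C sketch of coalescing the same nest into one flat parallel loop (the pragma and names are illustrative assumptions, not part of the original example):

    void coalesced(double *a, double c, int n, int m)
    {
        #pragma omp parallel for            /* one flat space of n*m independent updates */
        for (int t = 0; t < n * m; t++) {
            int i = t / m;                  /* recover the original row index */
            int j = t % m;                  /* recover the original column index */
            a[i * m + j] = a[i * m + j] + c;
        }
    }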

The complex subscript calculations introduced by coalescing can often be simplified to reduce the overhead of the coalesced loop [Polychronopoulos 1987b].

6.3.4 Loop Collapsing

Collapsing is a simpler, more efficient, but less general version of coalescing in which the number of dimensions of the array is actually reduced. Collapsing eliminates the overhead of multiple nested loops and multidimensional array indexing.

Collapsing is used not only to increase the number of parallelizable loop iterations, but also to increase vector lengths and to eliminate the overhead of a nested loop (also called the Carry optimization [Allen and Cocke 1971; IBM 1991]). The collapsed version of the loop discussed in the previous section is shown in Figure 25(c).

Collapsing is best suited to loop nests that iterate over memory with a constant stride. When more complex indexing is involved, coalescing may be a better approach.

6.3.5 Loop Peeling

When a loop is peeled, a small number of iterations are removed from the beginning or end of the loop and executed separately. If only one iteration is peeled, a common case, the code for that iteration can be enclosed within a conditional. For a larger number of iterations, a separate loop can be introduced. Peeling has two uses: for removing dependences created by the first or last few loop iterations, thereby enabling parallelization; and for matching the iteration control of adjacent loops to enable fusion.

    do i = 2, n
      b[i] = b[i] + b[2]
    end do
    do all i = 3, n
      a[i] = a[i] + c
    end do all
    (a) original loops

    if (2 <= n) then
      b[2] = b[2] + b[2]
    end if
    do all i = 3, n
      b[i] = b[i] + b[2]
      a[i] = a[i] + c
    end do all
    (b) after peeling one iteration from the first loop and fusing the resulting loops

Figure 26. Loop peeling.

The loop in Figure 26(a) is not parallelizable because of a flow dependence between iteration i = 2 and iterations i = 3 ... n. Peeling off the first iteration allows the rest of the loop to be parallelized and fused with the following loop, as shown in Figure 26(b).

Since peeling simply breaks a loop into sections without changing the iteration order, it can be applied to any loop.

6.3.6 Loop Normalization

Normalization converts all loops so that the induction variable is initially 1 (or 0) and is incremented by 1 on each iteration [Allen and Kennedy 1987]. This transformation can expose opportunities for fusion and simplify inter-loop dependence analysis, as shown in Figure 27. It can also help to reveal which loops are candidates for peeling followed by fusion.

    do i = 1, n
      a[i] = a[i] + c
    end do
    do i = 2, n+1
      b[i] = a[i-1] * b[i]
    end do
    (a) original loops

    do i = 1, n
      a[i] = a[i] + c
    end do
    do i = 1, n
      b[i+1] = a[i] * b[i+1]
    end do
    (b) after normalization, the two loops can be fused

Figure 27. Loop normalization.

The most important use of normalization is to permit the compiler to apply subscript analysis tests, many of which require normalized iteration ranges.
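A C sketch of the peeling-plus-fusion of Figure 26 (0-based indices and names are illustrative; the peeled element is the one that created the dependence):

    void peeled(double *a, double *b, double c, int n)
    {
        if (n >= 2)
            b[1] = b[1] + b[1];             /* peeled first iteration */
        #pragma omp parallel for            /* b[1] is now read-only in the loop */
        for (int i = 2; i < n; i++) {
            b[i] = b[i] + b[1];
            a[i] = a[i] + c;
        }
    }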


6.3.7 Loop Spreading

Spreading takes two serial loops and moves some of the computation from the second to the first so that the bodies of both loops can be executed in parallel [Girkar and Polychronopoulos 1988].

An example is shown in Figure 28: the two loops in the original program (a) cannot be fused because they have different bounds, and there would be a dependence S2 δ S1 in the fused loop due to the write to a in S1 and the read of a in S2.

    do i = 1, n/2
    1 a[i+1] = a[i+1] + a[i]
    end do
    do i = 1, n-3
    2 b[i+1] = b[i+1] + b[i] + a[i+3]
    end do
    (a) original loops

    do i = 1, n/2
      COBEGIN
        a[i+1] = a[i+1] + a[i]
        if (i > 3) then
          b[i-2] = b[i-2] + b[i-3] + a[i]
        end if
      COEND
    end do
    do i = n/2-3, n-3
      b[i+1] = b[i+1] + b[i] + a[i+3]
    end do
    (b) after spreading

Figure 28. Loop spreading.

By executing the statement S2 three iterations later within the new loop, it is possible to execute the two statements in parallel, which we have indicated textually with the COBEGIN/COEND compound statement in Figure 28(b).

The number of iterations by which the body of the second loop must be delayed is the maximum dependence distance between any statement in the second loop and any statement in the first loop, plus 1. Adding one ensures that there are no dependences within an iteration. For this reason, there must not be any scalar dependences between the two loop bodies.

Unless the loop bodies are large, spreading is primarily beneficial for exposing instruction-level parallelism. Depending on the amount of instruction parallelism achieved, the introduction of a conditional may negate the gain due to spreading. The conditional can be removed by peeling the first few (in this case 3) iterations from the first loop, at the cost of additional loop overhead.

6.4 Loop Replacement Transformations

This section describes loop transformations that operate on whole loops and completely alter their structure.

6.4.1 Reduction Recognition

A reduction is an operation that computes a scalar value from an array. Common reductions include computing either the sum or the maximum value of the elements in an array. In Figure 29(a), the sum of the elements of a is accumulated in the scalar s. The dependence vector for the loop is (1), or (<). While a loop with direction vector (<) must normally be executed serially, reductions can be parallelized if the operation performed is associative. Commutativity provides additional opportunities for reordering.

In Figure 29(b), the reduction has been vectorized by using vector adds (the inner do all loop) to compute TS; the final result is computed from TS using a scalar loop. For semicommutative and semiassociative operators like floating-point multiplication, the validity of the transformation depends on the language semantics and the programmer's intent, as described in Section 2.1.


    do i = 1, n
      s = s + a[i]
    end do
    (a) a sum reduction loop

    real TS[64]
    TS[1:64] = 0.0
    do TI = 1, n, 64
      TS[1:64] = TS[1:64] + a[TI:TI+63]
    end do
    do TI = 1, 64
      s = s + TS[TI]
    end do
    (b) loop transformed for vectorization

Figure 29. Reduction recognition.

Provided that bitwise identical results are not required, the partial sums can be computed in parallel.

Maximum parallelism is achieved by computing the reduction with a tree: pairs of elements are summed, then pairs of these results are summed, and so on. The number of serial steps is reduced from O(n) to O(log n).

Operations such as and, or, min, and max are truly associative, and their reduction can be parallelized under all circumstances.

6.4.2 Loop Idiom Recognition

Parallel architectures often provide specialized hardware that the compiler can take advantage of. Frequently, for example, SIMD machines support reduction directly in the processor interconnection network. Some parallel machines, such as the Connection Machine CM-2 [Thinking Machines 1989], include hardware not only for reduction but for parallel prefix operations, allowing a loop with a body of the form a[i] = a[i-1] + a[i] to be parallelized. The parallelization of a more general class of linear recurrences is described by Chen and Kuck [1975], by Kuck [1977; 1978], and by Wolfe [1989b]. Blelloch [1989] describes compilation strategies for recognizing and exploiting parallel prefix operations.

The compiler can recognize and convert other idioms that are specially supported by hardware or software. For instance, the CMAX Fortran preprocessor converts loops implementing vector and matrix operations into calls to assembly-coded BLAS (Basic Linear Algebra Subroutines) [Sabot and Wholey 1993]. The VAX Fortran compiler converts string copy loops into block transfer instructions [Harris and Hobbs 1994].

6.4.3 Array Statement Scalarization

When a loop is expressed in array notation, the compiler can either convert it into vector operations or scalarize it into one or more serial loops [Wolfe 1989b]. However, the conversion is not completely straightforward because array notation requires that the operation be performed as if every value on the right-hand side and every subexpression on the left-hand side were computed before any assignments are performed.

The example in Figure 30 shows a computation in array notation (a), its "obvious" (but incorrect) conversion to serial form (b), and a correct conversion (c). The reason that Figure 30(b) is not correct is that in the original code, every element of a is to be incremented by the value of the previous element. The increments are to happen as if they were all performed simultaneously; in the incorrect version, each element is incremented by the updated value of the previous element.

The general solution is to introduce a temporary array T and to have a separate loop that writes the values back into a, as shown in Figure 30(c). The temporary array can then be eliminated if the two loops can be legally fused, namely, when there is no dependence S2 δ S1 in the fused loop, where S1 is the assignment to the temporary, and S2 is the assignment to the original array.
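Anticipating the loop-reversal trick shown in Figure 30(d) below, a C sketch of a correct scalarization of a(2:n-1) = a(2:n-1) + a(1:n-2) without a temporary (0-based indexing and names are illustrative):

    void scalarize(double *a, int n)
    {
        /* running the loop backwards preserves the "all reads happen
         * before any store" semantics of the array statement */
        for (int i = n - 2; i >= 1; i--)
            a[i] = a[i] + a[i - 1];
    }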


    a[2:n-1] = a[2:n-1] + a[1:n-2]
    (a) initial array language expression

    do i = 2, n-1
      a[i] = a[i] + a[i-1]
    end do
    (b) incorrect scalarization

    do i = 2, n-1
    1 T[i] = a[i] + a[i-1]
    end do
    do i = 2, n-1
    2 a[i] = T[i]
    end do
    (c) correct scalarization

    do i = n-1, 2, -1
      a[i] = a[i] + a[i-1]
    end do
    (d) reversing both loops allows fusion and eliminates the need for temporary array T

    a[2:n-1] = a[2:n-1] + a[1:n-2] + a[3:n]
    (e) array expression requiring a temporary

Figure 30. Array statement scalarization.

In this case there is an antidependence, but it can be removed by reversing the loops, enabling fusion, and eliminating the temporary, as shown in Figure 30(d). However, the array language statement in Figure 30(e) requires a temporary since an antidependence exists regardless of the direction of the loop.

6.5 Memory Access Transformations

High-performance applications are as frequently memory limited as they are compute limited. In fact, for the last fifteen years CPU speeds have doubled every three to five years, while DRAM speeds have doubled about once every decade (DRAMs, or dynamic random access memory chips, are the low-cost, low-power memory chips used for main memory in most computers).

As a result, optimization of the use of the memory system has become steadily more important. Factors affecting memory performance include:

● Reuse, denoted by Q and Q_C, the ratio of uses of an item to the number of times it is loaded (described in Section 3);
● Parallelism. Vector machines often divide memory into banks, allowing vector registers to be loaded in a parallel or pipelined fashion. Superscalar machines often support double- or quadword load and store instructions;
● Working Set Size. If all the memory elements accessed inside of a loop do not fit in the data cache, then items that will be accessed in later iterations may be flushed, decreasing Q_C. If more variables are simultaneously live than there are available registers (that is, the register pressure is high), then loads and stores will have to spill values into memory, decreasing Q. If more pages are accessed in a loop than there are entries in the TLB, then the TLB may thrash.

Since the registers are the top of the memory hierarchy, efficient register usage is absolutely crucial to high performance. Until the late seventies, register allocation was considered the single most important problem in compiler optimization for which there was no adequate solution. The introduction of techniques based on graph coloring [Chaitin 1982; Chaitin et al. 1981; Chow and Hennessy 1990] yielded very efficient global (within a procedure) register allocation. Most compilers now make use of some variant of graph coloring in their register allocator.

In general, memory optimizations are inhibited by low-level coding that relies on particular storage layouts. Fortran EQUIVALENCE statements are the most obvious example. C and Fortran 77 both specify the storage layout for each type of declaration. Programmers relying on a particular storage layout may defeat a wide range of important optimizations.


Optimizations covered in other sections that also can improve memory system performance are loop interchange (6.2.1), loop tiling (6.2.6), loop unrolling (6.3.1), loop fusion (6.2.8), and various optimizations that eliminate register saves at procedure calls (6.8).

6.5.1 Array Padding

    real a[8,512]
    do i = 1, 512
      a[1, i] = a[1, i] + c
    end do
    (a) original code

    real a[9,512]
    do i = 1, 512
      a[1, i] = a[1, i] + c
    end do
    (b) padding a eliminates memory bank conflicts on V-DLX

Figure 31. Array padding.

Padding is a transformation whereby unused data locations are inserted between the columns of an array or between arrays. Padding is used to ameliorate a number of memory system conflicts, in particular:

● bank conflicts on vector machines with banked memory [Burnett and Coffman, Jr. 1970];
● cache set or TLB set conflicts;
● cache miss jamming [Bacon et al. 1994]; and
● false sharing of cache lines on shared-memory multiprocessors.

Bank conflicts on vector machines like the Cray and our hypothetical V-DLX can be caused when the program indexes through an array dimension that is not laid out contiguously in memory, leading to a nonunit stride whose value we will define to be s. If s is an exact multiple of the number of banks (B), the bandwidth of the memory system will be reduced by a factor of B (8, on V-DLX) because all memory accesses are to the same bank. The number of memory banks is usually a power of two.

In general, if an array will be accessed with stride s, the array should be padded by the smallest p such that gcd(s + p, B) = 1. This will ensure that B successive accesses with stride s + p will all address different banks. An example is shown in Figure 31(a): the loop accesses memory with stride 8, so all memory references will be to the first bank. After padding, successive iterations access memory with stride 9, so they go to successive banks, as shown in Figure 31(b).

Cached memory systems, especially those that are set-associative, are less sensitive to low power-of-two strides. However, large power-of-two strides will cause extremely poor performance due to cache set and TLB set conflicts. The problem is that addresses are mapped to sets simply by using a range of bits from the middle of the address. For instance, on S-DLX, the low 6 bits of an address are the byte offset within the line, and the next 8 bits are the cache set. Figure 12 shows the effect of stride on the performance of a cache-based superscalar machine.

A number of researchers have noted that other strides yield reduced cache performance. Bailey [1992] has quantified this by noting that because many words fit into a cache line, if the number of sets times the number of words per set is almost exactly divisible by the stride (say by n +/- e), then for a limited number of references r, every nth memory reference will map to the same set. If the ceiling of r/n is greater than the associativity of the cache, then the later references will evict the lines loaded by the earlier references, precluding reuse.

Set and bank conflicts can be caused by a bad stride over a single array, or by a loop that accesses multiple arrays that all align to the same set or bank. Thus padding can be inserted between columns of an array (intraarray padding), or between arrays (interarray padding).
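A C sketch of intraarray padding (the sizes, pad amount, and names are illustrative assumptions): an extra, unused column keeps successive rows from mapping to the same cache set or bank when the leading dimension is a power of two.

    #define N   512
    #define PAD 1

    /* original declaration: double a[N][N];            */
    double a[N][N + PAD];   /* padded leading dimension */

    void add_column(double c, int j)
    {
        for (int i = 0; i < N; i++)     /* stride is now N+PAD doubles, not N */
            a[i][j] = a[i][j] + c;
    }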


A further performance artifact, called cache miss jamming, can occur on machines that allow processing to continue during a cache miss: if cache misses are spread nonuniformly across the loop iterations, the asynchrony of the processor will not be exploited, and performance will be reduced. Jamming typically occurs when several arrays are accessed with the same stride and when all have the same alignment relative to cache line boundaries (that is, the same low address bits). Bacon et al. [1994] describe cache miss jamming in detail and present a unified framework for inter- and intraarray padding to handle set conflicts and jamming.

The disadvantages of padding are that it increases memory consumption and makes the subscript calculations for operations over the whole array more complex, since the array has "holes." In particular, padding reduces the benefits of loop collapsing (see Section 6.3.4).

6.5.2 Scalar Expansion

Loops often contain variables that are used as temporaries within the loop body. Such variables will create an antidependence from S2 to S1 from one iteration to the next, and will have no other loop-carried dependences. Allocating one temporary for each iteration removes the dependence and makes the loop a candidate for parallelization [Padua et al. 1980; Wolfe 1989b], as shown in Figure 32. If the final value of c is used after the loop, c must be assigned the value of T[n].

    do i = 1, n
      c = b[i]
      a[i] = a[i] + c
    end do
    (a) original loop

    real T[n]
    do all i = 1, n
      T[i] = b[i]
      a[i] = a[i] + T[i]
    end do all
    (b) after scalar expansion

Figure 32. Scalar expansion.

Scalar expansion is a fundamental technique for vectorizing compilers, and was performed by the Burroughs Scientific Processor [Kuck and Stokes 1982] and Cray-1 [Russell 1978] compilers. An alternative for parallel machines is to use private variables, where each processor has its own instance of the variable; these may be introduced by the compiler (see Section 7.1.3) or, if the language supports private variables, by the programmer.

If the compiler vectorizes or parallelizes a loop, scalar expansion must be performed for any compiler-generated temporaries in the loop. To avoid creating unnecessarily large temporary arrays, a vectorizing compiler can perform scalar expansion after strip mining, expanding the temporary to the size of the vector strip.

Scalar expansion can also increase instruction-level parallelism by removing dependences.

6.5.3 Array Contraction

After transformation of a loop nest, it may be possible to contract scalars or arrays that have previously been expanded. It may also be possible to contract other arrays due to interchange or the use of redundant storage allocation by the programmer [Wolfe 1989b].

If the iteration variable of the pth loop in a loop nest is being used to index the kth dimension of an array x, then dimension k may be removed from x if (1) loop p is not parallel, (2) all distance vectors V involving x have v_p = 0, and (3) x is not used subsequently (that is, x is dead after the loop). The latter two conditions are true for compiler-expanded variables unless the loop structure of the program was changed after expansion. In particular, loop distribution can inhibit array contraction by causing the second condition to be violated.
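A C sketch of the scalar expansion of Figure 32 (the temporary array, pragma, and names are illustrative; on a parallel machine a loop-local private variable achieves the same effect):

    void expand(double *a, const double *b, double *T, int n)
    {
        /* original, with a shared temporary t:
         *   double t;
         *   for (int i = 0; i < n; i++) { t = b[i]; a[i] = a[i] + t; }
         */
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            T[i] = b[i];          /* each iteration has its own temporary */
            a[i] = a[i] + T[i];
        }
    }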


    real T[n, n]
    do i = 1, n
      do all j = 1, n
        T[i, j] = a[i, j]*3
        b[i, j] = T[i, j] + b[i, j]/T[i, j]
      end do all
    end do
    (a) original code

    real T[n]
    do i = 1, n
      do all j = 1, n
        T[j] = a[i, j]*3
        b[i, j] = T[j] + b[i, j]/T[j]
      end do all
    end do
    (b) after array contraction

Figure 33. Array contraction.

Contraction reduces the amount of storage consumed by compiler-generated temporaries, as well as reducing the number of cache lines referenced. Other methods for reducing storage consumption by temporaries are strip mining (see Section 6.2.4) and dynamic allocation of temporaries, either from the heap or from a static block of memory reserved for temporaries.

6.5.4 Scalar Replacement

Even when it is not possible to contract an array into a scalar, a similar optimization can be performed when a frequently referenced array element is invariant within the innermost loop or loops. In this case, the array element can be loaded into a scalar (and presumably therefore a register) before the inner loop and, if it is modified, stored after the inner loop [Callahan et al. 1990].

Replacement multiplies Q for the array element by the number of iterations in the inner loop(s). It can also eliminate unnecessary subscript calculations, although that optimization is often done by loop-invariant code motion (see Section 6.1.3). Loop interchange can be used to enable or improve scalar replacement; Carr [1993] examines the combination of scalar replacement, interchange, and unroll-and-jam in the context of cache optimization.

    do i = 1, n
      do j = 1, n
        total[i] = total[i] + a[i, j]
      end do
    end do
    (a) original code

    do i = 1, n
      T = total[i]
      do j = 1, n
        T = T + a[i, j]
      end do
      total[i] = T
    end do
    (b) after scalar replacement

Figure 34. Scalar replacement.

An example of scalar replacement is shown in Figure 34; for a discussion of the interaction between replacement and loop interchange, see Section 6.2.1.
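The same transformation in a C sketch (names are illustrative):

    void row_sums(double *total, const double *a, int n)
    {
        for (int i = 0; i < n; i++) {
            double t = total[i];          /* load once before the inner loop */
            for (int j = 0; j < n; j++)
                t += a[i * n + j];
            total[i] = t;                 /* store once after the inner loop */
        }
    }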

6.5.5 Code Collocation

Code collocation improves memory access behavior by placing related code in close proximity. The earliest work rearranged code (often at the granularity of a procedure) to improve paging behavior [Ferrari 1976; Hatfield and Gerald 1971].

More recent strategies focus on improving cache behavior by placing the most frequent successor to a basic block (or the most frequent callee of a procedure) immediately adjacent to it in instruction memory [Hwu and Chang 1989; Pettis and Hansen 1990].

An estimate is made of the frequency with which each arc in the control flow graph will be traversed during program execution (using either profiling information or static estimates). Procedures are grouped together using a greedy algorithm that always takes the pair of procedures (or procedure groups) with the largest number of calls between them. Within a procedure, basic blocks can be grouped in the same way (although the direction of the control flow must be taken into account), or a top-down algorithm can be used that starts from the procedure entry node. Basic blocks with a frequency estimate of zero can be moved to a separate page to increase locality further. However, accessing that page may require long displacement jumps to be introduced (see the next subsection), creating the potential for performance loss if the basic blocks in question are actually executed.

Procedure inlining (see Section 6.8.5) can also affect code locality, and has been studied both in conjunction with [Hwu and Chang 1989] and independently of [McFarling 1991] code positioning. Inlining often improves performance by reducing overhead and increasing locality, but if a procedure is called more than once in a loop, inlining will often increase the number of cache misses because the procedure body will be loaded more than once.

6.5.6 Displacement Minimization

The target of a branch or a jump is usually specified relative to the current value of the program counter (PC). The largest offset that can be specified varies among architectures; it can be as few as 4 bits. If control is transferred to a location outside of the range of the offset, a multi-instruction sequence or long-format instruction is required to perform the jump. For instance, the S-DLX instruction BEQZ R4, error is only legal if error is within 2^15 bytes. Otherwise, the instruction must be replaced with:

    BNEZ R4, cont        ;reversed test
    LI   R8, error       ;get low bits
    LUI  R8, error>>16   ;get high bits
    JR   R8              ;jump to target
    cont:

This sequence requires three extra instructions. Given the cost of long displacement jumps, the code should be organized to keep related sections close together in memory, in particular those sections executed most frequently [Szymanski 1978].

Displacement minimization can also be applied to data. For instance, a base register may be allocated for a Fortran common block or group of blocks:

    common /big/ q, r, x[20000], y, z

If the array x contains word-sized elements, the common block is larger than the amount of memory indexable by the offset field in the load instruction (2^16 bytes on S-DLX). To address y and z, multiple-instruction sequences must be used in a manner analogous to the long-jump sequences above. The problem is avoided if the layout of big is:

    common /big/ q, r, y, z, x[20000]

6.6 Partial Evaluation

Partial evaluation refers to the general technique of performing part of a computation at compile time. Most of the classical optimizations based on data-flow analysis are either a form of partial evaluation or of redundancy elimination (described in Section 6.7). Loop-based data-flow optimizations are described in Section 6.1.

6.6.1 Constant Propagation

Constant propagation [Callahan et al. 1986; Kildall 1973; Wegman and Zadeck 1991] is one of the most important optimizations that a compiler can perform, and a good optimizing compiler will apply it aggressively. Typically programs contain many constants; by propagating them through the program, the compiler can do a significant amount of precomputation. More important, the propagation reveals many opportunities for other optimizations. In addition to obvious possibilities such as dead-code elimination, loop optimizations are affected because constants often appear in their induction ranges. Knowing the range of the loop, the compiler can be much more accurate in applying the loop optimizations that, more than anything else, determine performance on high-speed machines.


    n = 64
    c = 3
    do i = 1, n
      a[i] = a[i] + c
    end do
    (a) original code

    do i = 1, 64
      a[i] = a[i] + 3
    end do
    (b) after constant propagation

Figure 35. Constant propagation.

    t = i*4
    s = t
    print *, a[s]
    r = t
    a[r] = a[r] + c
    (a) original code

    t = i*4
    print *, a[t]
    a[t] = a[t] + c
    (b) after copy propagation

Figure 36. Copy propagation.

Figure 35 shows a simple example of constant propagation. On V-DLX, the resulting loop can be converted into a single vector operation because the loop is the same length as the hardware vector registers. The original loop would have to be strip mined before vectorization (see Section 6.2.4), increasing the overhead of the loop.

6.6.2 Constant Folding

Constant folding is a companion to constant propagation: when an expression contains an operation with constant values as operands, the compiler can replace the expression with the result. For example, x = 3**2 becomes x = 9. Typically constants are propagated and folded simultaneously [Aho et al. 1986].

Note that constant folding may not be legal (under Definition 2.1.2 in Section 2.1) if the operations performed at compile time are not identical to those that would have been performed at run time; a common source of such problems is the rounding of floating-point numbers during input or output between phases of the compilation process [Clinger 1990; Steele and White 1990].

6.6.3 Copy Propagation

Optimizations such as induction variable elimination (6.1.2) and common-subexpression elimination (6.7.4) may cause the same value to be copied several times. The compiler can propagate the original name of the value and eliminate redundant copies [Aho et al. 1986].

Copy propagation reduces register pressure and eliminates redundant register-to-register move instructions. An example is shown in Figure 36.

6.6.4 Forward Substitution

Forward substitution is a generalization of copy propagation. The use of a variable is replaced by its defining expression, which must be live at that point. Substitution can change the dependence relation between variables [Wolfe 1989b] or improve the analysis of subscript expressions in loops [Allen and Kennedy 1987; Kuck et al. 1981].

For instance, in Figure 37(a) the loop cannot be parallelized because an unknown element of a is being written. After forward substitution, as shown in Figure 37(b), the subscript expression is in terms of the loop-bound variable, and it is straightforward to determine that the loop can be implemented as a parallel reduction (described in Section 6.4.1).

The use of variables like np1 is a common Fortran idiom that was developed when compilers did not perform aggressive optimization. The idiom is recommended as "good programming style" in a number of Fortran programming texts!


    np1 = n+1
    do i = 1, n
      a[np1] = a[np1] + a[i]
    end do
    (a) original code

    do all i = 1, n
      a[n+1] = a[n+1] + a[i]
    end do all
    (b) after forward substitution

Figure 37. Forward substitution.

    x * 0 = 0        x + 0 = x
    0 / x = 0        x / 1 = x
    x * 1 = x

Figure 38. Algebraic identities used in expression simplification.

    Expression         Reduced Expression     Data types
    x * 2              x + x
    i * 2**c           i << c                 integer
    (a, 0) + (b, 0)    (a + b, 0)             complex
    len(s1 & s2)       len(s1) + len(s2)      string

Table 2. Identities used in strength reduction (& is the string concatenation operator).

Forward substitution is generally performed on array subscript expressions at the same time as loop normalization (Section 6.3.6). For efficient subscript analysis techniques to work, the array subscripts must be linear functions of the induction variables.

6.6.5 Reassociation

Reassociation is a technique for increasing the number of common subexpressions in a program [Cocke and Markstein 1980; Markstein et al. 1994]. It is generally applied to address calculations within loops when performing strength reduction on induction variable expressions (see Section 6.1.1). Address calculations generated by array references consist of several multiplications and additions. Reassociation applies the associative, commutative, and distributive laws to rewrite these expressions in a canonical sum-of-products form.

Forward substitution is usually performed where possible in the address calculations to increase the number of potential common subexpressions.

6.6.6 Algebraic Simplification

The compiler can simplify arithmetic expressions by applying algebraic rules to them. A particularly useful example is the set of algebraic identities. For instance, the statement x = (y*1 + 0)/1 can be transformed into x = y if x and y are integers. Figure 38 illustrates some of the commonly applied rules.

Floating-point arithmetic can be problematic to simplify; for example, if x is an IEEE floating-point number with value NaN (not a number), then x * 0 = x, instead of 0.

6.6.7 Strength Reduction

The identities in Table 2 are called strength reductions because they replace an expensive operator with an equivalent less expensive operator. Section 6.1.1 discusses the application of strength reduction to operations that appear inside a loop.

The two entries in the table that refer to multiplication, x * 2 = x + x and i * 2**c = i << c, can be generalized. Multiplication by any integer constant can be performed using only shift and add instructions [Bernstein 1986]. Other transformations are conditional on the values of the affected variables; for example, i / 2**c = i >> c if and only if i is nonnegative [Steele 1977].

It is also possible to convert exponentiation to multiplication in the evaluation of polynomials, using the identity

    a_n x^n + a_{n-1} x^(n-1) + ... + a_1 x + a_0 = (a_n x^(n-1) + a_{n-1} x^(n-2) + ... + a_1) x + a_0.
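A small C sketch of strength reduction of multiplication by a constant (the constant and function name are illustrative; most compilers perform this rewriting automatically):

    static inline int times10(int i)
    {
        /* i*10 = i*8 + i*2 = (i << 3) + (i << 1) */
        return (i << 3) + (i << 1);
    }

As noted above, the corresponding division rewrite i >> c is valid only for nonnegative i.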


6.6.8 I/O Format Compilation

Most languages provide fairly elaborate facilities for formatting input and output. The formatting specifications are in effect a formatting sublanguage that is generally "interpreted" at run time, with a correspondingly high cost for character input and output.
Formatted writes can be converted almost directly into calls to the run-time routines that implement the various format styles. These calls are then likely candidates for inline substitution. Figure 39 shows two I/O statements and their compiled equivalents. Idiom recognition has been performed to convert the implied do loop into an aggregate operation.
Note that in Fortran, a format statement is analogous to a procedure definition, which may be invoked by any number of read or write statements. The same trade-off as with procedure inlining applies: the formatted I/O can be expanded inline for higher efficiency, or encapsulated as a procedure for code compactness (inlining is described in Section 6.8.5).
Format compilation is done by the VAX Fortran compiler [Harris and Hobbs 1994] and by the Gnu C compiler [Free Software Foundation 1992]. Format compilation is further complicated in C by the fact that printf and scanf are library functions and may be redefined by the programmer.

      write(6,100) c[i]
      read(7,100) (d(j), j = 1, 100)
  100 format (A1)
(a) original code

      call putchar(c[i], 6)
      call fgets(d, 100, 7)
(b) after format compilation

Figure 39. Format compilation.
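An analogous C-level illustration (our sketch, not one of the survey's figures): a compiler that understands simple format strings can replace an interpreted printf call with direct character output.

    #include <stdio.h>

    /* before: the format string "%c\n" is interpreted at run time */
    void emit(char c) {
        printf("%c\n", c);
    }

    /* after format compilation: the format has been compiled away */
    void emit_compiled(char c) {
        putchar(c);
        putchar('\n');
    }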

6.6.9 Superoptimization

A superoptimizer [Massalin 1987] represents the extreme of optimization, seeking to replace a sequence of instructions with the optimal alternative. It does an exhaustive search, beginning with a single instruction. If all single-instruction sequences fail, two-instruction sequences are searched, and so on.
A randomly generated instruction sequence is checked by executing it with a small number of test inputs that were run through the original sequence. If it passes these tests, a thorough verification procedure is applied.
Superoptimization at compile time is practical only for short sequences (on the order of a dozen instructions). It can also be used by the compiler designer to find optimal sequences for commonly occurring idioms. It is particularly useful for eliminating conditional branches in short instruction sequences, where the pipeline stall penalty may be larger than the cost of executing the operations themselves [Granlund and Kenner 1992].
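For example, a branch-free instruction sequence of the kind a superoptimizer might find can be written at the C level as follows (our illustration; it assumes two's-complement integers and a comparison that yields 0 or 1):

    /* straightforward minimum: contains a conditional branch */
    int min_branch(int x, int y) {
        return (x < y) ? x : y;
    }

    /* branch-free equivalent: the mask is all ones when x < y, else zero */
    int min_branchless(int x, int y) {
        int mask = -(x < y);            /* 0 or -1 */
        return y ^ ((x ^ y) & mask);    /* selects x when mask is -1 */
    }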

6.7 Redundancy Elimination

There are many optimizations that improve performance by identifying redundant computations and removing them [Morel and Renvoise 1979]. We have already covered one such transformation, loop-invariant code motion, in Section 6.1.3. There a computation was being performed repeatedly when it could be done once.
Redundancy-eliminating transformations remove two other kinds of computations: those that are unreachable and those that are useless. A computation is unreachable if it is never executed; removing it from the program will have no semantic effect on the instructions executed. Unreachable code is created by programmers (most frequently with conditional debugging code), or by transformations that have left "orphan" code behind. A computation is useless if none of the outputs of the program are dependent on it.

6.7.1 Unreachable-Code Elimination

Most compilers perform unreachable-code elimination [Allen and Cocke 1971; Aho et al. 1986]. In structured programs, there are two primary ways for code to become unreachable. If a conditional predicate is known to be true or false, one branch of the conditional is never taken, and its code can be eliminated. The other common source of unreachable code is a loop that does not perform any iterations.
In an unstructured program that relies on goto statements to transfer control, unreachable code is not obvious from the program structure but can be found by traversing the control flow graph of the program.
Both unreachable and useless code are often created by constant propagation, described in Section 6.6.1. In Figure 40(a), the variable debug is a constant. When its value is propagated, the conditional expression becomes if (0 > 1). This expression is always false, so the body of the conditional is never executed and can be eliminated, as shown in Figure 40(b). Similarly, the body of the do loop is never executed and is therefore removed.
Unreachable-code elimination can in turn allow another iteration of constant propagation to discover more constants; for this reason some compilers perform constant propagation more than once.
Unreachable code is also known as dead code, but that name is also applied to useless code, so we have chosen to use the more specific term.
Sometimes considered as a separate step, redundant-control elimination removes control constructs such as loops and conditionals when they become redundant (usually as a result of constant propagation). In Figure 40(b), the loop and conditional control expressions are not used, and we can remove them from the program, as shown in (c).

      integer c, n, debug
      debug = 0
      n = 0
      a = b+7
      if (debug > 1) then
        c = a+b+d
        print *, 'Warning -- total is ', c
      end if
      call foo(a)
      do i=1, n
        a[i] = a[i] + c
      end do
(a) original code

      integer c, n, debug
      debug = 0
      n = 0
      a = b+7
      if (0 > 1) then
      end if
      call foo(a)
      do i=1, 0
      end do
(b) after constant propagation and unreachable code elimination

      integer c, n, debug
      debug = 0
      n = 0
      a = b+7
      call foo(a)
(c) after redundant control elimination

      integer c, n, debug
      a = b+7
      call foo(a)
(d) after useless code elimination

      a = b+7
      call foo(a)
(e) after dead variable elimination

Figure 40. Redundancy elimination.

6.7.2 Useless-Code Elimination

Useless code is often created by other optimizations, like unreachable-code elimination. When the compiler discovers that the value being computed by a statement is not necessary, it can remove the code.

This can be done for any nonglobal variable that is not live immediately after the defining statement. Live-variable analysis is a well-known data-flow problem [Aho et al. 1986]. In Figure 40(c), the values computed by the assignment statements are no longer used; they have been eliminated in (d).

6.7.3 Dead-Variable Elimination

After a series of transformations, particularly loop optimizations, there are often variables whose value is never used. The unnecessary variables are called dead variables; eliminating them is a common optimization [Aho et al. 1986]. In Figure 40(d), the variables c, n, and debug are no longer used and can be removed; (e) shows the code after the variables are pruned.

6.7.4 Common-Subexpression Elimination

In many cases, a set of computations will contain identical subexpressions. The redundancy can arise in both user code and in address computations generated by the compiler. The compiler can compute the value of the subexpression once, store it, and reuse the stored result [Aho et al. 1977; 1986; Cocke 1970]. Common-subexpression elimination is an important transformation and is almost universally performed. While it is generally a good idea to perform common-subexpression elimination wherever possible, the compiler must consider the current register pressure and the cost of recomputing. If storing the temporary value(s) forces additional spills to memory, the transformation can actually deoptimize the program.
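A small C sketch of the transformation (ours, not one of the survey's figures), using the kind of address-like subexpression the text mentions:

    /* before CSE: the subexpression i*stride + j is evaluated twice */
    void update(double *a, int i, int j, int stride, double c) {
        a[i*stride + j] = a[i*stride + j] + c;
    }

    /* after CSE: the common subexpression is computed once and reused */
    void update_cse(double *a, int i, int j, int stride, double c) {
        int t = i*stride + j;
        a[t] = a[t] + c;
    }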

6.7.5 Short Circuiting

Short circuiting is an optimization that can be performed on boolean expressions. It is based on the observation that the value of many binary boolean operations can be determined from the value of the first operand [Arden et al. 1962]. For example, the control expression in

      if ((a = 1) and (b = 2)) then
        c = 5
      end if

is known to be false if a does not equal 1, regardless of the value of b. Short circuiting would convert this expression to:

      if (not (a = 1)) goto 10
      if (not (b = 2)) goto 10
      c = 5
  10  continue

Note that if any of the operands in the boolean expression have side-effects, short circuiting can change the results of the evaluation. The alteration may or may not be legal, depending on the language semantics. Some language definitions, including C and many dialects of Lisp, address this problem by requiring short circuiting of boolean expressions.

6.8 Procedure Call Transformations

The optimizations described in the next several sections attempt to reduce the overhead of procedure calls in one of four ways:

● eliminating the call entirely;
● eliminating execution of the called procedure's body;
● eliminating some of the entry/exit overhead; and
● avoiding some steps in making a procedure call when the behavior of the called procedure is known or can be altered.

6.8.1 A Calling Convention for S-DLX

To demonstrate the procedure call optimizations, we first define a calling convention for S-DLX. Table 3 shows how the registers are used.
In general, each called procedure is responsible for ensuring that the values in registers R16-R25 are preserved across the call. The stack begins at the top of memory and grows downward. There is no explicit frame pointer; instead, the stack pointer is decremented by the size s of the procedure's frame at entry and left unchanged during the call.


Table 3. S-DLX Registers and Their Usage

Number      Usage
R0          Always zero; writes are ignored
R1          Return value when returning from a procedure call
R2..R7      The first six words of the arguments to the procedure call
R8..R15     8 caller-save registers, used as temporary registers by the callee
R16..R25    10 callee-save registers; these registers must be preserved across a call
R26..R29    Reserved for use by the operating system
R30         Stack pointer
R31         Return address during a procedure call
F0..F3      The first four floating-point arguments to the procedure call
F4..F17     14 caller-save floating-point registers
F18..F31    14 callee-save floating-point registers

The value R30 + s serves as a virtual frame pointer that points to the base of the stack frame, avoiding the use of a second dedicated register. For languages that cannot predict the amount of stack space used during execution of a procedure, an additional general-purpose register can be used as a frame pointer, or the size of the frame can be stored at the top of the stack (R30 + 0).
On entering a procedure, the return address is in R31. The first six words of the procedure arguments appear in registers R2-R7, and the rest of the argument data is on the stack. Figure 41 shows the layout of the stack frame for a procedure invocation. A similar convention is followed for floating-point registers, except that only four are reserved for arguments.
Calling a procedure is a four-step process:
(1) The values of any of the registers R1-R15 that contain live values are saved. If the values of any global variables that might be used by the callee are in a register and have been modified, the copy of those variables in memory is updated.
(2) The arguments are stored in the designated registers and, if necessary, on the stack.
(3) A linked jump is made to the target procedure; the CPU leaves the address of the next instruction in R31.
(4) Upon return, the saved registers are restored, and the registers holding global variables are reloaded.
Execution of a procedure consists of six steps:
(1) Space is allocated on the stack for the procedure invocation.
(2) The values of registers that will be modified during procedure execution (and that must be preserved across the call) are saved on the stack. If the procedure makes any procedure calls itself, the saved registers should include the return address, R31.
(3) The procedure body is executed.
(4) The return value (if any) is stored in R1, and the registers that were saved in step 2 are restored.
(5) The frame is removed from the stack.
(6) Control is transferred to the return address.
To demonstrate the structure of a procedure and the calling convention, Figure 42 shows a simple function and its compiled code. The function (foo) and the function that it calls (max) each take two integer arguments, so they do not need to pass arguments on the stack. The stack frame for foo is three words, which are used to save the return address (R31) and register R16 during the call to max, and to hold the local variable d.


Figure 41. Stack frame layout. The caller's stack frame sits at high addresses and holds the arguments (arg n ... arg 1); below it, the current frame (of the given frame size) holds locals and temporaries, saved registers, and the arguments for called procedures, with SP pointing at its low end. The stack grows downward toward low addresses.

R31 must be preserved because it is overwritten by the jump-and-link (JAL) instruction; R16 must be preserved because it is used to hold c across the call.
The procedure first allocates the 12 bytes for the stack frame and saves R31 and R16; then the parameters are loaded into registers. The value of d is calculated in the temporary register R9. Then the addresses of the arguments are stored in the argument registers, and a jump to max is made. On return from max, the return value has been computed into the return value register (R1). After removing the stack frame and restoring the saved registers, the procedure jumps back to its caller through R31.

6.8.2 Leaf Procedure Optimization

A leaf procedure is one that does not call any other procedures; the name comes from the fact that these procedures are leaves in the static call graph. The simplest optimization for leaf procedures is that they do not need to save and restore the return address (R31). Additionally, if the procedure does not have any local variables allocated to memory, the compiler does not need to create a stack frame.
Figure 43(a) shows the function max (called by the previous example function foo), its original compiled code (b), and the code after leaf procedure optimization (c). After eliminating the save/restore of R31, there is no need to allocate a stack frame. Eliminating the code that deallocates the frame also allows the function to return directly if x < y.

6.8.3 Cross-Call Register Allocation

Separate compilation reduces the amount of information available to the compiler about called procedures. However, when both callee and caller are available, the compiler can take advantage of the register usage of the callee to optimize the call.
If the callee does not need (or can be restricted not to use) all the temporary registers (R8-R15), the caller can leave values in the unused registers throughout execution of the callee. Additionally, move instructions for parameters can be eliminated.
To perform this optimization on a procedure f, registers must first have been allocated for each of the procedures that f calls. This can be done by performing register allocation on procedures ordered by a depth-first postorder traversal of the call graph [Chow 1988].
For example, in Figure 43(c) max uses only R8-R10. Figure 44 shows foo after cross-call register allocation. R31 is saved in R11 instead of on the stack, and the return is a jump to the saved address in R11. Additionally, R12 is used instead of R16, allowing the save and restore of R16 to be eliminated. Only d remains in the stack frame.


      integer function foo(c, b)
      integer c, b
      integer d, e
      d = c+b
      e = max(b, d)
      foo = e+c
      return
      end
(a) source code for function foo

foo: SUBI R30, #12      ;adjust SP
     SW   8(R30), R31   ;save retaddr
     SW   4(R30), R16   ;save R16
     LW   R16, (R2)     ;R16=c
     LW   R8, (R3)      ;R8=b
     ADD  R9, R16, R8   ;R9=d=c+b
     SW   (R30), R9     ;save d
     MOV  R2, R3        ;arg1=addr(b)
     ADDI R3, R30, #0   ;arg2=addr(d)
     JAL  max           ;call max; R1=e
     ADD  R1, R1, R16   ;R1=e+c
     LW   R16, 4(R30)   ;restore R16
     LW   R31, 8(R30)   ;restore retaddr
     ADDI R30, #12      ;restore SP
     JR   R31           ;return
(b) compiled code for foo

Figure 42. Function foo and its compiled code.

      integer function max(x, y)
      integer x, y
      if (x > y) then
        max = x
      else
        max = y
      end if
      return
      end
(a) source code for function max

max: SUBI R30, #4       ;adjust SP
     SW   (R30), R31    ;save retaddr
     LW   R8, (R2)      ;R8=x
     LW   R9, (R3)      ;R9=y
     SGT  R10, R8, R9   ;R10=(x > y)
     BEQZ R10, Else     ;x <= y
     MOV  R1, R8        ;max=x
     J    Ret
Else:MOV  R1, R9        ;max=y
Ret: LW   R31, (R30)    ;restore retaddr
     ADDI R30, #4       ;restore SP
     JR   R31           ;return
(b) original compiled code of max

max: LW   R8, (R2)      ;R8=x
     LW   R9, (R3)      ;R9=y
     SGT  R10, R8, R9   ;R10=(x > y)
     BEQZ R10, Else     ;x <= y
     MOV  R1, R8        ;max=x
     JR   R31           ;return
Else:MOV  R1, R9        ;max=y
     JR   R31           ;return
(c) max after leaf optimization

Figure 43. Leaf procedure optimization.

foo: SUBI R30, #4       ;adjust SP
     MOV  R11, R31      ;save retaddr
     LW   R12, (R2)     ;R12=c
     LW   R8, (R3)      ;R8=b
     ADD  R9, R12, R8   ;R9=d
     SW   (R30), R9     ;save d
     MOV  R2, R3        ;arg1=addr(b)
     ADDI R3, R30, #0   ;arg2=addr(d)
     JAL  max           ;call max
     ADD  R1, R1, R12   ;R1=e+c
     ADDI R30, #4       ;restore SP
     JR   R11           ;return

Figure 44. Cross-call register allocation for foo.


6.8.4 Parameter Promotion

When a parameter is passed by reference, the address calculation is done by the caller, but the load of the parameter value is done by the callee. This wastes an instruction, since most address calculations can be handled with the offset(Rn) format of load instructions.
More importantly, if the operand is already in a register in the caller, it must be spilled to memory and reloaded by the callee. If the callee modifies the value, it must then be stored. Upon return to the caller, if the compiler cannot prove that the callee did not modify the operand, it must be loaded again. Thus, as many as two unnecessary loads and two unnecessary stores can be introduced.
An unmodified reference parameter can be passed by value, and a modified reference parameter can be passed by value-result. Figure 45(a) shows max after this transformation has been applied. Figure 45(b) shows the corresponding modified form of foo.
Since d can now be held in a register, there is no longer a need for a stack frame, so frame collapsing can be applied, and the stack frame is eliminated. In general, when the compiler can statically identify all the callers of a leaf procedure, it can expand their stack frames to include enough space for both procedures. The leaf procedure simply uses the caller's stack frame without doing any new allocation of its own.
Parameter promotion is particularly important for languages such as Fortran, in which all argument passing is by reference. However, the argument-passing semantics of value-result are not the same as for arguments passed by reference, in particular in the event of aliasing. Interprocedural analysis may be necessary to determine the legality of the transformation.

max: SGT  R10, R2, R3   ;R10=(x > y)
     BEQZ R10, Else     ;x <= y
     MOV  R1, R2        ;max=x
     JR   R31           ;return
Else:MOV  R1, R3        ;max=y
     JR   R31           ;return
(a) max after parameter promotion: x and y are passed by value in R2 and R3

foo: MOV  R11, R31      ;save retaddr
     LW   R12, (R2)     ;R12=c
     LW   R2, (R3)      ;R2=b
     ADD  R3, R12, R2   ;R3=d
     JAL  max           ;call max
     ADD  R1, R1, R12   ;R1=e+c
     JR   R11           ;return
(b) foo after parameter promotion on max

Figure 45. Parameter promotion.
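The effect of parameter promotion can also be seen at the source level; in this C sketch (ours, not one of the survey's figures) pass-by-pointer stands in for Fortran's pass-by-reference:

    /* before: x and y are passed by reference, so the caller must
       spill them to memory and the callee must load them */
    int max_ref(int *x, int *y) {
        return (*x > *y) ? *x : *y;
    }

    /* after parameter promotion: the unmodified parameters
       are passed by value in registers */
    int max_val(int x, int y) {
        return (x > y) ? x : y;
    }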
6.8.5 Procedure Inlining

Procedure inlining (also known as procedure integration) replaces a procedure call with a copy of the body of the called procedure [Allen and Cocke 1971; Ball 1979; Scheifler 1977]. Each occurrence of a formal parameter is replaced with a version of the corresponding actual parameter, modified to reflect the calling convention of the language. Renaming may also be required for the local variables of the inlined procedure if they conflict with the calling procedure's variable names or if the procedure is inlined more than once in the same caller.
Inlining can almost always be performed, except when the procedure in question is recursive. Even recursive routines may benefit from a finite number of inlining steps, particularly when a constant argument allows the compiler to determine the number of recursive calls that will be performed. For Fortran programs, incompatible common-block usages between caller and callee can make inlining more complex and in practice often prevent it.
When a call is inlined, all the overhead for the invocation is eliminated. The stack frames for the caller and callee are allocated together, and the transfer of control is eliminated. This is particularly important for the return (JR R31), since a jump through a register may incur a higher pipeline penalty than a jump to a fixed address.


      do i=1, n
        call f(a, i)
      end do

      subroutine f(x, j)
      dimension x[*]
      x[j] = x[j] + c
      return
(a) original code

      do all i=1, n
        a[i] = a[i] + c
      end do all
(b) after inlining

Figure 46. Procedure inlining.

foo: LW   R12, (R2)     ;R12=c
     LW   R2, (R3)      ;R2=b
     ADD  R3, R12, R2   ;R3=d
max: SGT  R10, R2, R3   ;R10=(x > y)
     BEQZ R10, Else     ;x <= y
     MOV  R1, R2        ;max=x
     J    Ret           ;"return" to foo
Else:MOV  R1, R3        ;max=y
Ret: ADD  R1, R1, R12   ;R1=e+c
     JR   R31           ;return

Figure 47. max inlined into foo.

Another reason for inlining is to improve compiler analysis and optimization. In many compilers, a loop containing a procedure call cannot be parallelized because its read-write behavior is unknown. After the call is inlined, the compiler may be able to prove loop independence, thereby allowing vectorization or parallelization. Additionally, register usage may be improved, constants propagated more accurately, and more redundant operations eliminated.
An alternative to inlining is to perform interprocedural analysis. The advantage of interprocedural analysis is that it can be applied uniformly, since it does not cause code expansion the way inlining does. However, interprocedural analysis can be costly and increases the complexity of the compiler.
Inlining also affects the instruction cache behavior of the program [McFarling 1991]. The change can be favorable, because locality is improved by eliminating the transfer of control. On the other hand, if a loop body is made much larger, it may no longer fit in the cache and therefore cause additional memory accesses. Furthermore, if the loop contains multiple calls to the same procedure, multiple copies of the procedure will be loaded into the cache.
The primary disadvantage of inlining is that it increases code size, in the worst case exponentially. However, in practice it is simple to control the size increase by selective application of inlining (for example, to small leaf procedures, or to procedures that are only called from a few places). The result can be a dramatic improvement in execution speed. Figure 46 shows a source-level example of inlining; Figure 47 shows the assembler output after the procedure max is inlined into foo.
Ignoring cache effects, assume the following: t_p is the time to execute the entire procedure, t_b is the time to execute just the body of the procedure, n is the number of times it is called, and T is the total execution time of the program. Then

    t_s = n(t_p - t_b)

is the time saved by inlining, and

    I = t_s / T

is the fraction of the total run time saved by inlining.
Some recent studies [Cooper et al. 1991; 1992] have demonstrated that realizing the theoretical benefits of inlining may require some modification of the compiler, because inlined code has different characteristics than human-written code. Peculiarities of a language or a compiler may reduce the effectiveness of inlining or produce unexpected results.

Cooper et al. examined several commercial Fortran compilers and uncovered the following potential problems: (1) increased demand for registers; (2) larger numbers of local variables, sometimes exceeding a compiler's limit for the number of analyzable variables; and (3) loss of information due to the Fortran convention that procedure parameters may be assumed to be unaliased.

6.8.6 Procedure Cloning

Procedure cloning [Cooper et al. 1993] is a technique for improving optimization across procedure call boundaries. The call sites of the procedure being cloned are divided into groups, and a specialized version of the procedure is created for each group. The specialization often provides opportunities for better optimization, particularly due to improved constant propagation.
In Figure 48, cloning procedure f with p replaced by the constant 2 allows reduction in strength. The real-valued exponentiation is replaced by a multiplication, which is usually at least 10 times faster.

      call f(a, n, 2)

      subroutine f(x, n, p)
      real x[*]
      integer n, p
      do i=1, n
        x[i] = x[i]**p
      end do
(a) original code

      call F_2(a, n)

      subroutine F_2(x, n)
      real x[*]
      integer n
      do i=1, n
        x[i] = x[i]*x[i]
      end do
(b) after cloning

Figure 48. Procedure cloning.

6.8.7 Loop Pushing

Loop pushing (also called loop embedding [Hall et al. 1991]) moves a loop nest from the caller to a cloned version of the called procedure. If a compiler does not perform vectorization or parallelization across procedure calls directly, pushing is a less general way of achieving a similar effect.
For example, pushing is done by the CMAX Fortran preprocessor for the Thinking Machines CM-5. CMAX converts Fortran 77 programs to Fortran 90, attempting to discover data-parallel operations in the process [Sabot and Wholey 1993].
Pushing not only allows the parallelization of the loop in Figure 49, it also eliminates the overhead of all but one of the procedure calls.
If there are other statements in the loop, distribution is a prerequisite to pushing. In this case, however, the dependence analysis for distribution must be interprocedural. If there are no other statements in the loop, the transformation is always legal.

      do i=1, n
        call f(x, i)
      end do

      subroutine f(a, j)
      real a[*]
      a[j] = a[j] + c
      return
(a) original loop and procedure

      call F_2(x)

      subroutine F_2(a)
      real a[*]
      do all i=1, n
        a[i] = a[i] + c
      end do all
      return
(b) after loop pushing

Figure 49. Loop pushing.

Procedure inlining (see Section 6.8.5) is a different way of achieving a very similar effect, and does not require interprocedural analysis.

6.8.8 Tail Recursion Elimination

Tail recursion is a particularly common form of recursion. A function is recursive if it invokes itself, directly or indirectly. It is tail recursive if its last act is to call itself and return the value of the recursive call without performing any further processing.
When a function is tail recursive, it is unnecessary to invoke a separate instance with its own stack frame. The recursion can be eliminated; the current invocation will not be using its frame any longer, so the call can be replaced by a jump to the top of the procedure. Figure 50 shows an example of a tail-recursive function (a) and the result after the recursion is eliminated (b). A function that is not tail recursive is shown in Figure 50(c): it uses the result of the recursive call as an operand to the addition, so there is computation that must be performed after the recursive call returns.
Recursive programs can be transformed automatically into tail-recursive versions that can be executed iteratively [Burstall and Darlington 1977], but this is not commonly performed by existing compilers for imperative languages. Some languages prevent tail recursion elimination by requiring clean-up code to be executed after a procedure is finished. The semantics of C++, for example, demand that before a procedure returns it must call a destructor on each stack-allocated local object variable. Other languages, like Scheme [Rees et al. 1986], require that all tail-recursive procedures be identified and that they be executed without consuming stack space proportional to the recursion depth.

      recursive logical function inarray(a, x, i, n)
      real x, a[n]
      integer i, n

      if (i > n) then
        inarray = .FALSE.
      else if (a[i] = x) then
        inarray = .TRUE.
      else
        inarray = inarray(a, x, i+1, n)
      end if
      return
(a) A tail-recursive procedure

      logical function inarray(a, x, i, n)
      real x, a[n]
      integer i, n

  1   if (i > n) then
        inarray = .FALSE.
      else if (a[i] = x) then
        inarray = .TRUE.
      else
        i = i+1
        goto 1
      end if
      return
(b) After tail recursion elimination

      recursive integer function sumarray(a, x, i, n)
      real x, a[n]
      integer i, n

      if (i = n) then
        sumarray = a[i]
      else
        sumarray = a[i] + sumarray(a, x, i+1, n)
      end if
      return
(c) A procedure that is not tail-recursive

Figure 50. Tail recursion elimination.

6.8.9 Function Memoization

Memoization is an optimization that is applied to side-effect-free procedures (that is, procedures that do not change the state of the program, also called referentially transparent). In such cases it is possible to cache the results of recent invocations. When the procedure is called again with the same arguments, the cached result is used instead of recomputing it [Abelson and Sussman 1985; Michie 1968].
Figure 51 shows a simple example of memoization. If f is often called with the same arguments and f also takes a nontrivial amount of time to run, then memoization will substantially increase performance. If not, it will degrade performance and consume memory.


      y = f(i)
(a) original function call

      logical f_UNCACHED[n]
      real f_CACHE[n]
      do i=1, n
        f_UNCACHED[i] = .true.
      end do
      ...
      if (f_UNCACHED[i]) then
        f_CACHE[i] = f(i)
        f_UNCACHED[i] = .false.
      end if
      y = f_CACHE[i]
(b) code augmented for memoization

Figure 51. Function memoization.

This example assumes that f's parameter is confined to the range 1...n, and that n is not extremely large. A more sophisticated memoization scheme would hash the arguments and use a cache size that makes a sensible trade-off between reuse and memory consumption. For functions that return dynamically allocated objects, storage management must be considered as well.
For c calls and a hit rate of r, the time to execute the c calls is

    T = c(r t_h + (1 - r) t_m)

where t_h is the time to retrieve the memoized result from the cache (a hit), and t_m is the time to call f plus the overhead of discovering that it is not in the cache (a miss). With a high hit rate and a large difference between t_h and t_m, memoization can significantly improve performance.

7. TRANSFORMATIONS FOR PARALLEL MACHINES

All the transformations discussed so far are applicable to a wide variety of computer organizations. In this section we describe transformations that are specific to parallel architectures.
Highly efficient parallel applications have been developed by manually applying the transformations discussed here to sequential code. Duplicating these successes by automatically transforming the code within a compiler is an active area of research. While a great deal of work has been done, automatic parallelization of full-scale applications with existing compilers does not consistently yield significant speedups. Experiments that measure the effectiveness of parallelizing compilers are presented in Section 9.3.
Because the parallelization of sequential code is proving to be difficult, languages like HPF Fortran [High Performance Fortran Forum 1993] provide constructs that allow the programmer to declare data decomposition strategies and to expose parallelism. The compiler is then responsible for generating communication and synchronization code, using various optimizations (discussed below) to reduce the resulting overhead. As of yet, however, there are few real applications that have been written in these languages, and there is little experimental data.
In this section we make extensive use of our model shared-memory and distributed-memory multiprocessors sMX and dMX, which are described fully in the Appendix. Both machines are constructed using S-DLX processors. The programming environment initializes the global variable Pnum to be the number of processors in the machine and Pid to be the number of the local processor (starting with 0).

7.1 Data Layout

One of the key decisions that a compiler must make in targeting a multiprocessor is the decomposition of data across the processors. The need to distribute data and manage it is obvious in a distributed-memory programming model, where communication is explicit. The performance of code under a shared-memory model is also dependent on avoiding unnecessary communication: the fact that communication is performed implicitly does not eliminate its cost.

ACM Computmg Surveys, Vol 26, No 4, December 1994 Compiler Transformations ● 393 avoiding unnecessary communication: doallj=l, n the fact that communication is performed doalli=l, n a[i, j] = a[i, j] + c implicitly does not eliminate its cost. end do all end do all 7.1.1 Regular Array Decomposition (a) original loop

The most common strategy for decompos- ing an array on a parallel machine is to 12345878 map each dimension to a set of proces- 2 sors in a regular pattern. The four com- 3 monly used patterns are discussed in the 40123 5 following sections. s

7

8 “!IIII Serial Decomposition (b) ( block, sertd) decornpos,t;on of a on four Serial decomposition is the degenerate processors case, where an entire dimension is allo- cated to a single processor. Figure 52(a) call FORK(Pnum) shows a simple loop nest that adds c to do j = (n/Pnuro)*Pid+i, rnin( (n/Pnum)*(Pid+i) , n) each element of a two-dimensional array, doi=l, n the decomposition of the array onto pro- a[i, j] = a[i, j] + c cessors (b), and the loop transformed into end do code for sMX (c). Each column is mapped end do call JOIN() serially, meaning that all the data in (c) corresponding ( block, serial) scheduling of that column will be placed on a single the loop, using block size n/Pnurn processor. As a result, the loop that iter- ates over the columns (the inner loop) is unchanged. *2345678 1

2 o 1 3 Block Decomposition 14

A block decomposition divides the ele- 8 6 ments into one group of adj scent ele- 2 3 1 ments per processor. Using the previous 8 E13 example, the rows of array a are divided (d) ( block, block) decomposition of a on four between the four processors, as shown in processors Figure 52(b). Because the rows have been divided across the processors, the outer Figure 52. Serial and block decompositions of an loop in Figure 52(c) has been modified so array on the ehared-memory multiprocessor sMX. that each processor iterates only over the local portion of each row. Figure 52(d) shows the decomposition across the processors, each of which has if block scheduling is applied across both its own private address space. the horizontal and vertical dimensions. As a result, naming a nonlocal array A major difference between compila- element requires a mapping function tion for shared- and distributed-memory from the original, global name space to a machines is that shared-memory ma- processor number and local array indices chines provide a global name space, so within that processor. The examples that any particular array element can be shown here are all for shared memory, named by any processor using the same and therefore sidestep these complexi- subscripts as in the sequential code. On ties, which are discussed in Section 7.3. the other hand, on a distributed-memory However, the same decomposition princi- machine, arrays must usually be divided ples are used for distributed memory.
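As an illustration of such a mapping function, the following C sketch (ours, not code from the survey; it assumes 0-based indexing and that n is a multiple of Pnum) maps a global element index to an owning processor and a local index for block and cyclic decompositions:

    /* block decomposition: element i (0-based) of an n-element array */
    int block_owner(int i, int n, int Pnum) {
        return i / (n / Pnum);      /* processor that holds element i */
    }
    int block_local(int i, int n, int Pnum) {
        return i % (n / Pnum);      /* index within that processor's piece */
    }

    /* cyclic decomposition: successive elements go to successive processors */
    int cyclic_owner(int i, int Pnum) { return i % Pnum; }
    int cyclic_local(int i, int Pnum) { return i / Pnum; }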


(a) (cyclic, serial) decomposition of a on four processors

      call FORK(Pnum)
      do j = Pid+1, n, Pnum
        do i=1, n
          a[i, j] = a[i, j] + c
        end do
      end do
      call JOIN()
(b) corresponding (cyclic, serial) scheduling of loop

Figure 53. Cyclic decomposition on sMX.

      do j=1, n
        do i=1, j
          total = total + a[i, j]
        end do
      end do
(a) loop

(b) elements of a read by the loop

Figure 54. Triangular iteration space.

The advantage of block decomposition is that adjacent elements are usually on the same processor. Because a loop that computes a value for a given element often uses the values of neighboring elements, the blocks have good locality and reduce the amount of communication.

Cyclic Decomposition

A cyclic decomposition assigns successive elements to successive processors. A cyclic allocation of columns to processors for the sequential code in Figure 52(a) is shown in Figure 53(a). Figure 53(b) shows the transformed code for sMX.
Cyclic decomposition has the opposite effect of blocking: it has poor locality for neighbor-based communication, but spreads load more evenly. Cyclic decomposition works well for loops like the one in Figure 54, as long as the array is large enough so that each processor handles many columns. A common algorithm that results in this type of load balance is LU factorization of matrices. A block decomposition would work poorly because the iterations most costly to execute would be clustered together and computed by a single processor. With a cyclic decomposition, the expensive iterations are spread across the processors.

Block-Cyclic Decomposition

A block-cyclic decomposition combines the two strategies; the elements are divided into many adjacent groups, usually an integer multiple of the number of processors available. Each group of elements is assigned to a processor cyclically. Figure 55(a) shows an array whose columns have been allocated to two processors in a block-cyclic fashion, and Figure 55(b) gives the code to implement the mapping on sMX. Block-cyclic is a compromise between the locality of block decomposition and the load balancing of cyclic decomposition.

Irregular Computations

Some loops, like the one shown in Figure 56, have highly irregular execution behavior. When the variance of execution times is high, none of the static regular decomposition strategies may be sufficient to yield good performance. A variety of dynamic algorithms make scheduling decisions at run time based on the historical behavior of the computation [Lucco 1992; Polychronopoulos and Kuck 1987; Tang and Yew 1990].


(a) (block-cyclic, serial) decomposition of a on two processors

      call FORK(Pnum)
      do k = 1, n, 2*Pnum
        do j = k + 2*Pid, k + 2*Pid + 1
          do i=1, n
            a[i, j] = a[i, j] + c
          end do
        end do
      end do
      call JOIN()
(b) corresponding (block-cyclic, serial) scheduling of loop, using block size 2

Figure 55. Block-cyclic decomposition on sMX.

      do i=1, n
        if (mask[i] = 1) then
          a[i] = expensive_function()
        end if
      end do
(a) example loop

(b) iteration versus cost for the above loop

Figure 56. Irregular execution behavior.

7.1.2 Automatic Decomposition and Alignment

The decomposition of an array determines how the elements are distributed across a set of processors; in a block decomposition, for example, the array might be divided into contiguous 10 x 10 subarrays. The alignment of the array establishes the specific elements that are in each of those subarrays.
Languages like HPF Fortran rely on the programmer to declare how arrays are to be aligned and what style of decomposition to use; the language includes annotations like BLOCK and CYCLIC. A variety of research projects have investigated whether the strategy can be determined automatically without programmer assistance. Programmers can find it difficult to choose a good set of declarations, particularly since the optimal decomposition can change during the course of program execution.
The general strategy for automating decomposition and alignment is to express the behavior of the program in a representation that either captures communication explicitly or allows it to be computed. The compiler applies operations that reflect different decomposition and alignment choices; the goal is to maximize parallelism and minimize communication. The approaches include the Alignment-Distribution Graph [Chatterjee et al. 1993a; 1993b], the Component Affinity Graph [Li and Chen 1990; 1991], constraints [Gupta 1992], and affine transformations [Anderson and Lam 1993].
In addition to presenting an algorithm to find the optimal mapping of arrays to processors for one- and two-dimensional arrays, Mace [1985; 1987] shows that finding optimal decompositions is an NP-complete problem.

7.1.3 Scalar Privatization

The goal behind privatization is to increase the amount of parallelism and to avoid unnecessary communication. When a scalar is used within a loop solely as a scratch variable, each processor can be given a private copy so the use of the scalar need not involve any communication. The transformation is safe if there are no loop-carried dependences involving the scalar; before the scalar is used in the loop body, its value is always updated within the same iteration.


      do i=1, n
        c = b[i]
        a[i] = a[i] + c
      end do

Figure 57. Scalar value suitable for privatization.

Figure 57 shows a loop that could benefit from privatization. The scalar value c is simply a scratch value; if it is privatized, the loop is revealed to have independent iterations. If the arrays are allocated to separate processors before the loop is executed, no further communication is necessary. In this very simple example the variable could also have been eliminated entirely, but in more complex loops temporary variables are repeatedly updated and referenced.
Scalar privatization is analogous to scalar expansion (see Section 6.5.2); in both cases, spurious dependences are eliminated by replicating the variable. The transformation is widely incorporated into parallel compilers [Allen et al. 1988b; Cytron and Ferrante 1987].

7.1.4 Array Privatization

Array privatization is similar to scalar privatization; each processor is given its own copy of the entire array. The effect of privatizing an array is to reduce communication requirements and to expose additional parallelism by removing unnecessary dependences caused by storage reuse.
Parallelization studies of real application programs have found that privatization is necessary for high performance [Eigenmann et al. 1991; Singh and Hennessy 1991]. Despite its importance, verifying that privatization of an array is legal is much more difficult than the equivalent test on a scalar value.
Traditionally, data-flow analysis [Muchnick and Jones 1981] treats an entire array as a single value. This lack of precision prevents most of the loop optimizations discussed in this survey and led to the development of dependence analysis. Feautrier [1988; 1991] extends dependence analysis to support array expansion, essentially the same optimization as privatization. The analysis is expensive because it requires the use of parametric integer programming. Other privatization research has extended data-flow analysis to consider the behavior of individual array elements [Li 1992; Maydan et al. 1993; Tu and Padua 1993].
In addition to determining whether privatization is legal, the compiler must also check whether the values that are left in an array are used by subsequent computations. If so, after the array is privatized the compiler must add code to preserve those final values by performing a copy-out operation from the private copies to a global array.

7.1.5 Cache Alignment

On shared-memory processors like sMX, when one processor modifies storage, the cache line in which it resides is flushed by any other processors holding a copy. The flush is necessary to ensure proper synchronization when multiple processors are updating a single shared variable. If several processors are continually updating different variables or array elements that reside in the same cache line, however, the constant flushing of cache lines may severely degrade performance. This problem is called false sharing.
When false sharing occurs between parallel loop iterations accessing different columns of an array or different structures, it can be addressed by aligning each of the columns or structures to a cache line boundary. False sharing of scalars can be addressed by inserting padding between them so that each shared scalar resides in its own cache line (padding is discussed in Section 6.5.1). A further optimization is to place shared variables and their corresponding lock variables in the same cache line, which will cause the lock acquisition to act as a prefetch of the data [Torrellas et al. 1994].


7.2 Exposing Coarse-Grained Parallelism

Transferring data can be an expensive operation on a machine with asynchronously executing processors. One way to avoid the inefficiency of frequent communication is to divide the program into subcomputations that will execute for an extended period of time without generating message traffic. The following optimizations are designed to identify such computations or to help the compiler create them.

7.2.1 Procedure Call Parallelization

One potential source of coarse-grained parallelism in imperative languages is the procedure call. In order to determine whether a call can be performed as an independent task, the compiler must use interprocedural analysis to examine its behavior.
Because of its importance to all forms of optimization, interprocedural analysis has received a great deal of attention from compiler researchers. One approach is to propagate data-flow and dependence information through call sites [Banning 1979; Burke and Cytron 1986; Callahan et al. 1986; Myers 1981]. Another is to summarize the array access behavior of a procedure and use that summary to evaluate whether a call to it can be executed in parallel [Balasundaram 1990; Callahan and Kennedy 1988a; Li and Yew 1988; Triolet et al. 1986].

7.2.2 Split

A more comprehensive approach to program decomposition summarizes the data usage behavior of an arbitrary block of code in a symbolic descriptor [Graham et al. 1993]. As in the previous section, the summary identifies independence between computations (between procedure calls or between different loop nests). Additionally, the descriptor is used in applying the split transformation.
Split is unusual in that it is not applied in isolation to a computation like a loop nest. Instead, split considers the relationship of pairs of computations. Two loop nests, for example, may have dependences between them that prevent the compiler from executing both simultaneously on different processors. However, it is often the case that only some of the iterations conflict. Split uses memory summarization to identify the iterations in one loop nest that are independent of the other one, placing those iterations in a separate loop. Applying split to an application exposes additional sources of concurrency and pipelining.

7.2.3 Graph Partitioning

Data-flow languages [Ackerman 1982; McGraw 1985; Nikhil 1988] expose parallelism explicitly. The program is converted into a graph that represents basic computations as nodes and the movement of data as arcs between nodes. Data-flow graphs can also be used as an internal representation to express the parallel structure of programs written in imperative languages.
One of the major difficulties with data-flow languages is that they expose parallelism at the level of individual arithmetic operations. As it is impractical to use software to schedule such a small amount of work, early projects focused on developing architectures that embed data-flow execution policies into hardware [Arvind and Culler 1986; Arvind et al. 1980; Dennis 1980]. Such machines have not proven to be successful commercially, so researchers began to develop techniques for compiling data-flow languages on conventional architectures. The most common approach is to interpret the data-flow graph dynamically, executing a node representing a computation when all of its operands are available. To reduce scheduling overhead, the data-flow graph is generally transformed by gathering many simple computations into larger blocks that are executed atomically [Anderson and Hudak 1990; Hudak and Goldberg 1985; Mirchandaney et al. 1988; Sarkar 1989].

7.3 Computation Partitioning

In addition to decomposing data as discussed in Section 7.1, parallel compilers

must allocate the computations of a program to different processors. Each processor will generally execute a restricted set of iterations from a particular loop. The transformations in this section seek to enforce correct execution of the program without introducing unnecessary overhead.

7.3.1 Guard Introduction

After a loop is parallelized, not all the processors will compute the same iterations or send the same slices of data. The compiler can ensure that the correct computations are performed by introducing conditional statements (known as guards) into the program [Callahan and Kennedy 1988b].
Figure 58(a) shows an example of a sequential loop that we will transform for parallel execution. Figure 58(b) shows the parallel version of the loop. The code computes upper and lower bounds that the guards use to prevent computation on array elements that are not stored locally within the processor. The bounds depend on the size of the array, the number of processors, and the Pid of the processor.
We have made a number of simplifying assumptions: the value of n is an even multiple of Pnum, and the arrays are already distributed across the processors with a block decomposition. We have also chosen an example that does not require interprocessor communication. To parallelize the loops encountered in practice, compilers must introduce communication statements and much more complex guard and induction variable expressions.
dMX does not have a global name space, so each processor's instance of an array is separate. In Figure 58(b), the compiler could take advantage of the isolation of each processor to remap the array elements. Instead of the original array of size n, each processor could declare a smaller array of n/Pnum elements and modify the loop to iterate from 1 to n/Pnum. Although the remapping would simplify the loop bounds, we preserve the original element indices to maintain a closer relationship between the sequential code and its transformed version.

      do i=1, n
        a[i] = a[i] + c
        b[i] = b[i] + c
      end do
(a) original loop

      LBA = (n/Pnum)*Pid + 1
      UBA = (n/Pnum)*(Pid + 1)
      LBB = (n/Pnum)*Pid + 1
      UBB = (n/Pnum)*(Pid + 1)
      do i=1, n
        if (LBA <= i .and. i <= UBA) a[i] = a[i] + c
        if (LBB <= i .and. i <= UBB) b[i] = b[i] + c
      end do
(b) parallelized loop with guards introduced

      LB = (n/Pnum)*Pid + 1
      UB = (n/Pnum)*(Pid + 1)
      do i=1, n
        if (LB <= i .and. i <= UB) then
          a[i] = a[i] + c
          b[i] = b[i] + c
        end if
      end do
(c) after guard combination

      LB = (n/Pnum)*Pid + 1
      UB = (n/Pnum)*(Pid + 1)
      do i=LB, UB
        a[i] = a[i] + c
        b[i] = b[i] + c
      end do
(d) after bounds reduction

Figure 58. Computation partitioning.

7.3.2 Redundant Guard Elimination

In the same way that the compiler can use code hoisting to optimize computation, it can reduce the number of guards

by hoisting them to the earliest point where they can be correctly computed. Hoisting often reveals that identical guards have been introduced and that all but one can be removed. In Figure 58(b), the two guard expressions are identical because the arrays have the same decomposition. When the guards are hoisted to the beginning of the loop, the compiler can eliminate the redundant guard. The second pair of bound computations becomes useless and is also removed. The final result is shown in Figure 58(c).

7.3.3 Bounds Reduction

In Figure 58(c), the guards control which iterations of the loop perform computation. Since the desired set of iterations is a contiguous range, the compiler can achieve the same effect by changing the induction expressions to reduce the loop bounds [Koelbel 1990]. Figure 58(d) shows the result of the transformation.

7.4 Communication Optimization

An important part of compilation for distributed-memory machines is the analysis of an application's communication needs and the introduction of explicit message-passing operations into it. Before a statement is executed on a given processor, any data it relies on that is not available locally must be sent. A simple-minded approach is to generate a message for each nonlocal data item referenced by the statement, but the resulting overhead will usually be unacceptably high. Compiler designers have developed a set of optimizations that reduce the time required for communication.
The same fundamental issues arise in communication optimization as in optimization for memory access (described in Section 6.5): maximizing reuse, minimizing the working set, and making use of available parallelism in the communication system. However, the problems are magnified: while the ratio of one memory access to one computation on S-DLX is 16:1, the ratio of the number of cycles required to send one word to the number of cycles required for a single operation on dMX is about 500:1.
Like vectors on vector processors, communication operations on a distributed-memory machine are characterized by a startup time, t_s, and a per-element cost, t_b, the time required to send one byte once the message has been initiated. On dMX, t_s = 10 μs and t_b = 100 ns, so sending one 10-byte message costs 11 μs while sending two 5-byte messages costs 21 μs. To avoid paying the startup cost unnecessarily, the optimizations in this section combine data from multiple messages and send them in a single operation.

7.4.1 Message Vectorization

Analysis can often determine the set of data items transferred in a loop. Rather than sending each element of an array in an individual message, the compiler can group many of them together and send them in a single block transfer. Because this is analogous to the way a vector processor interacts with memory, the optimization is called message vectorization [Balasundaram et al. 1990; Gerndt 1990; Zima et al. 1988].
Figure 59(a) shows a sample loop that multiplies each element of array a by the mirror element of array b. Figure 59(b) is an inefficient parallel version for processors 0 and 1 on dMX. To simplify the code, we assume that the arrays have already been allocated to the processors using a block decomposition: the lower half of each array is on processor 0 and the upper half on processor 1.
Each processor begins by computing the lower and upper bounds of the range that it is responsible for. During each iteration, it sends the element of b that the other processor will need and waits to receive the corresponding message. Note that Fortran's call-by-reference semantics convert the array reference implicitly into the address of the corresponding element, which is then used by the low-level communication routine to extract the bytes to be sent or received.


When the message arrives, the iteration proceeds.
Figure 59(c) is a much more efficient version that handles all communication in a single message before the loop begins executing. Each processor computes the upper and lower bounds for both itself and for the other processor so as to place the incoming elements of b properly.
Message-passing libraries and network hardware often perform poorly when their internal buffer sizes are exceeded. The compiler may be able to perform additional communication optimization by using strip mining to reduce the message length, as shown in Figure 59(d).

      do i=1, n
        a[i] = a[i] + b[n+1-i]
      end do
(a) original loop

      LB = Pid * (n/2) + 1
      UB = LB + (n/2)
      otherPid = 1 - Pid
      do i=LB, UB
        call SEND(b[i], 4, otherPid)
        call RECEIVE(b[n+1-i], 4)
        a[i] = a[i] + b[n+1-i]
      end do
(b) parallel loop

      LB = Pid * (n/2) + 1
      UB = LB + (n/2)
      otherPid = 1 - Pid
      otherLB = otherPid * (n/2) + 1
      otherUB = otherLB + (n/2)
      call SEND(b[LB], (n/2)*4, otherPid)
      call RECEIVE(b[otherLB], (n/2)*4, otherPid)
      do i=LB, UB
        a[i] = a[i] + b[n+1-i]
      end do
(c) parallel loop with vectorized messages

      do j = LB, UB, 256
        call SEND(b[j], 256*4, otherPid)
        call RECEIVE(b[otherLB+(j-LB)], 256*4, otherPid)
        do i = j, j+255
          a[i] = a[i] + b[n+1-i]
        end do
      end do
(d) after strip mining messages (assuming array size is a multiple of 256)

Figure 59. Message vectorization.

7.4.2 Message Coalescing

Once message vectorization has been performed, the compiler can further reduce the frequency of communication by grouping together messages that send overlapping or adjacent data. The Fortran D compiler [Tseng 1993] uses Regular Sections [Callahan and Kennedy 1988a], an array summarization strategy, to describe the array slice in each message. When two slices being sent to the same processor overlap or cover contiguous ranges of the array, the associated messages are combined.

7.4.3 Message Aggregation

Sending a message is generally much more expensive than performing a block copy locally on a processor. Therefore it is worthwhile to aggregate messages being sent to the same processor even if the data they contain is unrelated [Tseng 1993]. A simple aggregation strategy is to identify all the messages directed at the same target processor and copy the various strips of data into a single buffer. The target processor performs the reverse operation.
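A minimal sketch of this aggregation strategy, written against the dMX message library described in the Appendix; the two source arrays x and y, the strip lengths, the packing buffer buf, and the variable target are assumptions made for illustration only.

      ! Sender: two unrelated strips bound for the same processor are
      ! packed into one buffer and shipped as a single message.
      real x(100), y(50), buf(150)
      buf(1:100)   = x(1:100)
      buf(101:150) = y(1:50)
      call SEND(buf, 150*4, target)

      ! Receiver: one RECEIVE followed by a local unpack (the reverse copy).
      real x(100), y(50), buf(150)
      call RECEIVE(buf, 150*4)
      x(1:100) = buf(1:100)
      y(1:50)  = buf(101:150)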

7.4.4 Collective Communication

Many parallel architectures and message-passing libraries offer special-purpose communication primitives such as broadcast, hardware reduction, and scatter-gather. Compilers can improve performance by recognizing opportunities to exploit these operations, which are often highly efficient. The idea is analogous to idiom and reduction recognition on sequential machines. Li and Chen [1991] present compiler techniques that rely on pattern matching to identify opportunities for using collective communication.
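As an illustration, a loop of point-to-point SENDs from one processor can be replaced by the BROADCAST primitive of the dMX library defined in the Appendix; the array coeff, its size, and the assumption that receivers still post a RECEIVE are for illustration only.

      ! Point-to-point version: processor 0 sends the same block to every
      ! other processor, paying the message startup cost Pnum-1 times.
      if (Pid .eq. 0) then
        do p = 1, Pnum-1
          call SEND(coeff, 64*4, p)
        end do
      else
        call RECEIVE(coeff, 64*4)
      end if

      ! Collective version: a single BROADCAST replaces the loop of SENDs
      ! (receivers are assumed to still post a RECEIVE).
      if (Pid .eq. 0) then
        call BROADCAST(coeff, 64*4)
      else
        call RECEIVE(coeff, 64*4)
      end if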


7.4.5 Message Pipelining

Another important optimization is to pipeline parallel computations by overlapping communication and computation. Studies have demonstrated that many applications perform very poorly without pipelining [Rogers 1991]. Many message-passing systems allow the processor to continue executing instructions while a message is being sent. Some support fully nonblocking send and receive operations. In either case, the compiler has the opportunity to arrange for useful computation to be performed while the network is delivering messages.

A variety of algorithms have been developed to discover opportunities for pipelining and to move message transfer operations so as to maximize the amount of resulting overlap [Rogers 1991; Koelbel and Mehrotra 1991; Tseng 1993].

7.4.6 Redundant Communication Elimination

To avoid sending messages wherever possible, the compiler can perform a variety of transformations to eliminate redundant communication. Many of the optimizations covered earlier can also be used on messages. If a message is sent within the body of a loop but the data does not change from one iteration to the next, the SEND can be hoisted out of the loop. When two messages contain the same data, only one need be sent.

Messages offer further opportunities for optimization. If the contents of a message are subsumed by a previous communication, the message need not be sent; this situation is often created when SENDs are hoisted in order to maximize pipelining opportunities. If a message contains data, some of which has already been sent, the overlap can be removed to reduce the amount of data transferred. Another possibility is that a message is being sent to a collection of processors, some of which previously received the data. The list of recipients can then be pruned, reducing the amount of communication traffic. These optimizations are used by the PTRAN II compiler [Gupta et al. 1993] to reduce overall message traffic.
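A minimal sketch of the loop-invariant SEND hoisting just mentioned, again using the dMX SEND call from the Appendix; the array c, the scalar s, the trip count, and otherPid are assumed for illustration only.

      ! Before: c is not modified inside the loop, yet it is re-sent on
      ! every iteration.
      do i = 1, 100
        call SEND(c, 64*4, otherPid)
        s = s + c(1) * i
      end do

      ! After: the loop-invariant SEND is hoisted, so the data crosses
      ! the network only once.
      call SEND(c, 64*4, otherPid)
      do i = 1, 100
        s = s + c(1) * i
      end do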
7.5 SIMD Transformations

SIMD architectures exhibit much more regular behavior than MIMD machines, eliminating many problematic synchronization issues. In addition to the alignment and decomposition strategies for distributed-memory systems (see Section 7.1.2), the regularity of SIMD interconnection networks offers additional opportunities for optimization. The compiler can use very accurate cost models to estimate the performance of a particular layout.

Early SIMD compilation work targeted IVTRAN [Millstein and Muntz 1975], a Fortran dialect for the Illiac IV that provided layout and alignment declarations. The compiler provided a parallelizing module called the Paralyzer [Presberg and Johnson 1975] that used an early form of dependence analysis to identify independent loops and applied linear transformations to optimize communication.

The Connection Machine Convolution Compiler [Bromley et al. 1991] targets the topology of the SIMD architecture explicitly with a pattern-matching strategy. The compiler focuses on computations that update array elements based on their neighbors. It finds the pattern of neighbors that are needed to compute a given element, called the stencil. The cross, a common stencil, represents a computation that updates each array element using the values of its neighbors to the north, east, west, and south. Stencils are aggregated into larger patterns, or multistencils, using optimizations analogous to loop unrolling and strip mining. The multistencils are mapped to the hardware so as to minimize communication.

SIMD compilers developed at Compass [Knobe and Natarajan 1993; Knobe et al. 1988; 1990] construct a preference graph based on the computations being performed. The graph represents alignment requests that would yield the least communication overhead. The compiler satisfies every request when possible, using a greedy algorithm to find a solution when preferences are in conflict.
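For concreteness, a cross-stencil update of the kind these compilers recognize can be written as the following loop nest; the array names, the 0.25 weighting, and the interior-only bounds are assumptions for illustration.

      ! Cross stencil: each interior element is recomputed from its
      ! north, south, east, and west neighbors.
      integer, parameter :: n = 64
      real a(n, n), anew(n, n)
      integer i, j
      do j = 2, n-1
        do i = 2, n-1
          anew(i, j) = 0.25 * (a(i-1, j) + a(i+1, j) &
                             + a(i, j-1) + a(i, j+1))
        end do
      end do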


Wholey [1992] begins by computing a detailed cost function for the computation. The function is the input to a hill-climbing algorithm that searches the space of possible allocations of data to processors. It begins with two processors, then four, and so forth; each time the number of processors is doubled, the algorithm uses the most efficient allocation from the previous stage as a starting point.

In addition to general-purpose mapping algorithms, the compiler can use specific transformations like loop flattening [von Hanxleden and Kennedy 1992] to avoid idle time due to nonuniform computations.
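Loop flattening is not illustrated elsewhere in this excerpt, so the following is a minimal sketch of the idea under assumed names: a nest whose inner trip count m(i) varies from row to row is collapsed into a single loop, so processors assigned short rows do not sit idle.

      ! Nonuniform nest: the inner trip count m(i) differs from row to row.
      do i = 1, n
        do j = 1, m(i)
          b(i, j) = 2.0 * b(i, j)
        end do
      end do

      ! Flattened form: a single loop over all (i, j) pairs.  start(i) is
      ! assumed to hold the prefix sum m(1) + ... + m(i-1), with start(1) = 0.
      total = start(n) + m(n)
      do k = 1, total
        i = 1
        do while (k .gt. start(i) + m(i))   ! recover the row for iteration k
          i = i + 1                         ! (linear scan kept only for clarity)
        end do
        j = k - start(i)
        b(i, j) = 2.0 * b(i, j)
      end do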
7.6 VLIW Transformations

The Very Long Instruction Word (VLIW) architecture is another strategy for executing instructions in parallel. A VLIW processor [Colwell et al. 1988; Fisher et al. 1984; Floating Point Systems 1979; Rau et al. 1989] exposes its multiple functional units explicitly; each machine instruction is very large (on the order of 512 bits) and controls many independent functional units. Typically the instruction is divided into 32-bit pieces, each of which is routed to one of the functional units.

With traditional processors, there is a clear distinction between low-level scheduling done via instruction scheduling and high-level scheduling between processors. This distinction is blurred in VLIW architectures. They rely on the compiler to expose the parallel operations explicitly and schedule them at compile time. Thus the task of generating object code incorporates high-level resource allocation and scheduling decisions that occur only between processors on a more conventional parallel architecture.

Trace scheduling [Ellis 1986; Fisher 1981; Fisher et al. 1984] is an analysis strategy that was developed for VLIW architectures. Because VLIW machines require more parallelism than is typically available within a basic block, compilers for these machines must look through branches to find additional instructions to execute. The multiple paths of control require the introduction of speculative execution: computing some results that may turn out to be unused (ideally at no additional cost, by functional units that would otherwise have gone unused).

The speculation can be handled dynamically by the hardware [Sohi and Vajapeyam 1990], but that complicates the design significantly. The trace-scheduling alternative is to have the compiler do it by identifying paths (or traces) through the control-flow graph that might be taken. The compiler guesses which way the branches are likely to go and hoists code above them accordingly. Superblock scheduling [Hwu et al. 1993] is an extension of trace scheduling that relies on program profile information to choose the traces.

8. TRANSFORMATION FRAMEWORKS

Given the many transformations that compiler writers have available, they face a daunting task in determining which ones to apply and in what order. There is no single best order of application; one transformation can permit or prevent a second from being applied, or it can change the effectiveness of subsequent changes. Current compilers generally use a combination of heuristics and a partially fixed order of applying transformations.

There are two basic approaches to this problem: unifying the transformations in a single mechanism, and applying search techniques to the transformation space. In fact there is often a degree of overlap between these two approaches.

8.1 Unified Transformation

A promising strategy is to encode both the characteristics of the code being transformed and the effect of each transformation; then the compiler can quickly search the space of possible sets of transformations to find an efficient solution.


One framework that is being actively investigated is based on unimodular matrix theory [Banerjee 1991; Wolf and Lam 1991]. It is applicable to any loop nest whose dependences can be described with a distance vector; a subset of the loops that require a direction vector can also be handled. The transformations that a unimodular matrix can describe are interchange, reversal, and skew.

The basic principle is to encode each transformation of the loop in a matrix and apply it to the dependence vectors of the loop. The effect on the dependence pattern of applying the transformation can be determined by multiplying the matrix and the vector. The form of the product vector reveals whether the transformation is valid. To model the effect of applying a sequence of transformations, the corresponding matrices are simply multiplied.

Figure 60 shows a loop nest that can be transformed with unimodular matrices:

      do i = 2, 10
        do j = 1, 10
          a[i, j] = a[i-1, j] + a[i, j]
        end do
      end do

Figure 60. Unimodular transformation example.

The distance vector that describes the loop is D = (1, 0), representing the dependence of iteration i on i-1 in the outer loop. Because of the dependence, it is not legal to reverse the outer loop of the nest. The reversal transformation is

      R = [ -1  0 ]
          [  0  1 ]

The product is

      R D = P1 = (-1, 0)

This demonstrates that the transformation is not legal, because P1 is not lexicographically positive.

We can also test whether the two loops can be interchanged; the interchange transformation is

      I = [ 0  1 ]
          [ 1  0 ]

Applying that to D yields P2 = (0, 1). In this case, the resulting vector is lexicographically positive, showing that the transformation is legal.

Any loop nest whose dependences are all representable by a distance vector can be transformed via skewing into a canonical form called a fully permutable loop nest. In this form, any two loops in the nest can be interchanged without changing the loop semantics. Once in this canonical form, the compiler can decompose the loop into the granularity that matches the target architecture [Wolf and Lam 1991].

Sarkar and Thekkath [1992] describe a framework for transforming perfect loop nests that includes unimodular transformations, tiling, coalescing, and parallel loop execution. The transformations are encoded in an ordered sequence. Rules are provided for mapping the dependence vectors and loop bounds, and the transformation to the loop body is described by a template.

Pugh [1991] describes a more ambitious (and time-consuming) technique that can transform imperfectly nested loops and can do most of the transformations possible through a combination of statement reordering, interchange, fusion, skewing, reversal, distribution, and parallelization. It views the transformation problem as that of finding the best schedule for a set of operations in a loop nest. A method is given for generating and testing candidate schedules.
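The legality test used in the Figure 60 example above (multiply the transformation matrix by each distance vector and check that the result is lexicographically positive) is easy to state as code. The following small program is only a sketch of that check for a 2-deep nest, with the reversal and interchange matrices hard-wired as data; it is not taken from any of the cited systems.

      program unimodular_check
        integer :: d(2), r(2,2), t(2,2), p(2)
        data d / 1, 0 /            ! distance vector D = (1, 0)
        data r / -1, 0, 0, 1 /     ! reversal of the outer loop (column order)
        data t /  0, 1, 1, 0 /     ! loop interchange (column order)

        p = matmul(r, d)
        print *, 'reversal:    P =', p, '  legal =', lexpos(p)
        p = matmul(t, d)
        print *, 'interchange: P =', p, '  legal =', lexpos(p)
      contains
        logical function lexpos(v)
          integer, intent(in) :: v(2)
          ! Lexicographically positive: the first nonzero component is positive.
          if (v(1) /= 0) then
            lexpos = v(1) > 0
          else
            lexpos = v(2) > 0
          end if
        end function lexpos
      end program unimodular_check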

8.2 Searching the Transformation Space

Wang [1991] and Wang and Gannon [1989] describe a parallelization system that uses heuristic search techniques from artificial intelligence to find a program transformation sequence. The target machine is represented by a set of features that describe the type, size, and speed of the processors, memory, and interconnect. The heuristics are organized hierarchically. The main functions are: description of parallelism in the program and in the machine; matching of program parallelism to machine parallelism; and control of restructuring.

9. COMPILER EVALUATION

Researchers are still trying to find a good way to evaluate the effectiveness of compilers. There is no generally agreed-upon way to determine the best possible performance of a particular program on a particular machine, so it is difficult to determine how well a compiler is doing. Since some applications are better structured than others for a given architecture or a given compiler, measurements for a particular application or group of applications will not necessarily predict how well another application will fare.

Nevertheless, a wide variety of measurement studies do exist that seek to evaluate how applications behave, how well they are being compiled, and how well they could be compiled. We have divided these studies into several groups.

9.1 Benchmarks

Benchmarks have received by far the most attention since they measure delivered performance, yielding results that are used to market machines. They were originally developed to measure machine speed, not compiler effectiveness. The Livermore Loops [McMahon 1986] is one of the early benchmark suites; it sought to compare the performance of supercomputers. The suite consists of a set of small loops based on the most time-consuming inner loops of scientific codes.

The SPEC benchmark suite [Dixit 1992] includes both scientific and general-purpose applications intended to be representative of an engineering/scientific workload. The SPEC benchmarks are widely used as indicators of machine performance, but are essentially uniprocessor benchmarks.

As architectures have become more complex, it has become obvious that the benchmarks measure the combined effectiveness of the compiler and the target machine. Thus two compilers that target the same machine can be compared by using the SPEC ratings of the generated code.

For parallel machines, the contribution of the compiler is even more important, since the difference between naive and optimized code can be many orders of magnitude. Parallel benchmarks include SPLASH (Stanford Parallel Applications for Shared Memory) [Singh et al. 1992] and the NASA Numerical Aerodynamic Simulation (NAS) benchmarks [Bailey et al. 1991].

The Perfect Club [Berry et al. 1989] is a benchmark suite of computationally intensive programs that is intended to help evaluate serial and parallel machines.

9.2 Code Characteristics

A number of studies have focused on the applications themselves, examining their source code for programmer idioms or profiling the behavior of the compiled executable.

Knuth [1971] carried out an early and influential study of Fortran programs. He studied 440 programs comprising 250,000 lines (punched cards). The most important effect of this study was to dramatize the fact that the majority of the execution time of a program is usually spent in a very small proportion of the code. Other interesting statistics are that 95% of all the do loops incremented their index variable by 1, and 40% of all do loops contained only one statement.

Shen et al. [1990] examined the subscript expressions that appeared in a set of Fortran mathematical libraries and scientific applications. They applied various dependence tests to these expressions. The results demonstrate how the tests compare to one another, showing that one of the biggest problems was unknown variables. These variables were caused by procedure calls and by the use of an array element as an index value into another array. Coupled subscripts also caused problems for tests that examine a single array dimension at a time (coupled subscripts are discussed in Section 5.4).


A study by Petersen and Padua [1993] evaluates approximate dependence tests against the omega test [Pugh 1992], finding that the latter does not expose substantially more parallelism in the Perfect Club benchmarks.

9.3 Compiler Effectiveness

As we mentioned above, researchers find it difficult to evaluate how well a compiler is doing. They have come up with four approaches to the problem:

● examine the compiler output by hand to evaluate its ability to transform code;
● measure the performance of one compiler against another;
● compare the performance of executables compiled with full optimization against little or no optimization; and
● compare an application compiled for parallel execution against the sequential version running on one processor.

Nobayashi and Eoyang [1989] compare the performance of supercomputer compilers from Cray, Fujitsu, Alliant, Ardent, and NEC. The compilers were applied to various loops from the Livermore and Argonne test suites that required restructuring before they could be computed with vector operations.

Carr [1993] applies a set of loop transformations (unroll-and-jam, loop interchange, tiling, and scalar replacement) to various scientific applications to find a strategy that optimizes cache utilization. Wolf [1992] uses inner loops from the Perfect Club and from one of the SPEC benchmarks to compare the performance of hand-optimized to compiler-optimized code. The focus is on improving locality to reduce traffic between memory and the cache.

Relatively few studies have been performed to test the effectiveness of real compilers in parallelizing real programs. However, the results of those that have are not encouraging.

One study of four Perfect benchmark programs compiled on the Alliant FX/8 produced speedups between 0.9 (that is, a slowdown) and 2.36 out of a potential 32; when the applications were tuned by hand, the speedups ranged from 5.1 to 13.2 [Eigenmann et al. 1991]. In another study of 12 Perfect benchmarks compiled with the KAP compiler for a simulated machine with unlimited parallelism, 7 applications had speedups of 1.4 or less; two applications had speedups of 2-4; and the rest were sped up 10.3, 66, and 77. All but three of these applications could have been sped up by a factor of 10 or more [Padua and Petersen 1992].

Lee et al. [1985] study the ability of the Parafrase compiler to parallelize a mixture of small programs written in Fortran. Before compilation, while loops were converted to do loops, and code for handling error conditions was removed. With 32 processors available, 4 out of 15 applications achieved 30% efficiency, and 2 achieved 10% efficiency; the other 9 out of 15 achieved less than 10% efficiency. Out of 89 loops, 59 were parallelized, most of them loops that initialized array data structures. Some coding idioms that are amenable to improved analysis were identified.

Some preliminary results have also been reported for the Fortran D compiler [Hiranandani et al. 1993] on the Intel iPSC/860 and Thinking Machines CM-5 architectures.

9.4 Instruction-Level Parallelism

In order to evaluate the potential gain from instruction-level parallelism, researchers have engaged in a number of studies to measure an upper bound on how much parallelism is available when an unlimited number of functional units is assumed. Some of these studies are discussed and evaluated in detail by Rau and Fisher [1993].

Early studies [Tjaden and Flynn 1970] were pessimistic in their findings, measuring a maximum level of parallelism on the order of two or three, a result that was called the Flynn bottleneck.

The main reason for the low numbers was that these studies did not look beyond basic blocks.

Parallelism can be exploited across basic block boundaries, however, on machines that use speculative execution. Instead of waiting until the outcome of a conditional branch is known, these architectures begin executing the instructions at either or both potential branch targets; when the conditional is evaluated, any computations that are rendered invalid or useless must be discarded.

When the basic block restriction is relaxed, there is much more parallelism available. Riseman and Foster [1972] assumed replicated hardware to support speculative execution. They found that there was significant additional parallelism available, but that exploiting it would require a great deal of hardware. For the programs studied, on the order of 2^16 arithmetic units would be necessary to expose 10-way parallelism.

A subsequent study by Nicolau and Fisher [1984] sought to measure the parallelism that could be exploited by a VLIW architecture using aggressive compiler analysis. The strategy was to compile a set of scientific programs into an intermediate form, simulate execution, and apply a scheduling algorithm retroactively that used branch and address information revealed by the simulation. The study revealed significant amounts of parallelism: a factor of tens or hundreds.

Wall [1991] took trace data from a number of applications and measured the amount of parallelism under a variety of models. His study considered the effect of architectural features such as varying numbers of registers, branch prediction, and jump prediction. It also considered the effects of alias analysis in the compiler. The results varied widely, yielding as much as 60-way parallelism under the most optimistic assumptions. The average parallelism on an ambitious combination of architecture and compiler support was 7.

Butler et al. [1991] used an optimizing compiler on a group of programs from the SPEC89 suite. The resulting code was executed on a variety of simulated machines. The range includes unrealistically omniscient architectures that combine data-flow strategies for scheduling with perfect branch prediction, as well as restricted models with limited hardware. Under the most optimistic assumptions, they were able to execute 17 instructions per cycle; more practical models yielded between 2.0 and 5.8 instructions per cycle. Without looking past branches, few applications could execute more than 2 instructions per cycle.

Lam and Wilson [1992] argue that one major reason why studies of instruction-level parallelism differ markedly in their results is branch prediction. The studies that perform speculative execution based on prediction yield much less parallelism than the ones that have an oracle with perfect knowledge about branches. The authors speculate that executing branches locally will not yield large degrees of parallelism; they conduct experiments on a number of applications and find similar results to other studies (between 4.2 and 9.2). They argue that compiler optimizations that consider the global flow of control are essential.

Finally, Larus [1993] takes a different approach than the others, focusing strictly on the parallelism available in loops. He examines six codes (three integer programs from SPEC89 and three scientific applications), collects a trace annotated with semantic information, and runs it through a simulator that assumes a uniform shared-memory architecture with infinite processors. Two of the scientific codes, the ones that were primarily array manipulators, showed a potential speedup in the range of 65-250. The potential speedup of the other codes was 1.7-3, suggesting that the techniques developed for parallelizing scientific codes will not work well in parallelizing other types of programs.

CONCLUSION

We have described a large number of transformations that can be used to improve program performance on sequential and parallel machines. Numerous studies have demonstrated that these transformations, when properly applied, are sufficient to yield high performance. However, current optimizing compilers lack an organizing principle that allows them to choose how and when the transformations should be applied. Finding a framework that unifies many transformations is an active area of research. Such a framework could simplify the problem of searching the space of possible transformations, making it easier and therefore cheaper to build high-quality optimizing compilers. A key part of this problem is the development of a cost function that allows the compiler to evaluate the effectiveness of a particular set of transformations.

Despite the absence of a strategy for unifying transformations, compilers have proven to be quite successful in targeting sequential and superscalar architectures. On parallel architectures, however, most high-performance applications currently rely on the programmer rather than the compiler to manage parallelism. Compilers face a large space of possible transformations to apply, and parallel machines exact a very high cost of failure when optimization algorithms do not discover a good reorganization strategy for the application.

Because efforts to parallelize traditional languages automatically have met with little success, the focus of research has shifted to compiling languages in which the programmer shares the responsibility for exposing parallelism and choosing a data layout.

Another developing area is the optimization of nonscientific applications. Like most researchers working on high-performance architectures, compiler writers have focused on code that relies on loop-based manipulation of arrays. Other kinds of programs, particularly those written in an object-oriented programming style, rely heavily on pointers. Optimization of pointer manipulation and of object-oriented languages is emerging as another focus of compiler research.

APPENDIX: MACHINE MODELS

A.1 Superscalar DLX

A superscalar processor has multiple functional units and can issue more than one instruction per clock cycle. Current examples of superscalar machines are the DEC Alpha [Sites 1992], HP PA-RISC [Hewlett-Packard 1992], IBM RS/6000 [Oehler and Blasgen 1991], and Intel Pentium [Alpert and Avnon 1993].

S-DLX is a simplified superscalar RISC architecture. It has four independent functional units for integer, load/store, branch, and floating-point operations. In every cycle, the next two instructions are considered for execution. If they are for different functional units, and there are no dependences between the instructions, they are both initiated. Otherwise, the second instruction is deferred until the next cycle. S-DLX does not reorder the instructions.

Most operations complete in a single cycle. When an operation takes more than one cycle, subsequent instructions that use results from multicycle instructions are stalled until the result is available. Because there is no instruction reordering, when an instruction is stalled no instructions are issued to any of the functional units.

S-DLX has a 32-bit word, 32 general-purpose registers (GPRs, denoted by Rn), and 32 floating-point registers (FPRs, denoted by Fn). The value of R0 is always 0. The FPRs can be used as double-precision (64-bit) register pairs. For the sake of simplicity we have not included the double-precision floating-point instructions.

Memory is byte addressable with a 32-bit virtual address. All memory references are made by load and store instructions between memory and the registers. Data is cached in a 64-kilobyte 4-way set-associative write-back cache composed of 1024 64-byte cache lines. Figure 61 is a block diagram of the architecture and its primary datapaths.

Table A.1 describes the instruction set. All instructions are 32 bits and must be word-aligned.


[Figure 61 (block diagram): branch, integer, load/store, and floating-point units; 32 integer and 32 floating-point registers; address translation; and a 64 KB, 4-way set-associative cache with 1024 lines of 64 bytes each, connected to main memory over 32- and 64-bit datapaths.]

Figure 61. S-DLX functional diagram.

Table A.1. The S-DLX Instruction Set

Example Instr.      Name                   Meaning                        Similar instructions
LW R1, 30(R2)       Load word              R1 ← Memory[30+R2]             Load float (LF)
SW 500(R4), R3      Store word             Memory[500+R4] ← R3            Store float (SF)
LI R1, #666         Load immediate         R1 ← 666
LUI R1, #666        Load upper immediate   R1(16..31) ← 666
MOV R1, R2          Move register          R1 ← R2
ADD R1, R2, R3      Add                    R1 ← R2 + R3                   Subtract (SUB)
MULT R1, R2, R3     Multiply               R1 ← R2 × R3                   Divide (DIV)
ADDI R1, R2, #3     Add immediate          R1 ← R2 + 3                    SUBI, MULTI, DIVI
SLL R1, R2, R3      Shift left logical     R1 ← R2 << R3
...                 ...                    ...                            ...
..., F2, F3         Multiply-add float     F1 ← F1 + (F2 × F3)
EQF F1, F2          Test equal float       If (F1 = F2) FPCR ← 1,         LTF, GTF, LEF, GEF, NEF
                                           else FPCR ← 0

The immediate operand field is 16 bits. The address for load and store instructions is computed by adding the 16-bit immediate operand to the register. To create a full 32-bit constant, the low 16 bits must first be set with a load immediate (LI) instruction that clears the high 16 bits; then the high 16 bits must be set with a load upper immediate (LUI) instruction.

The program counter is the special register PC. Jumps and branches are relative to PC + 4; jumps have a 26-bit signed offset, and branches have a 16-bit signed offset.


Integer branches test the GPRs for zero; floating-point branches test a special floating-point condition register (FPCR).

There is no branch delay slot. Because the number of instructions executed per cycle varies on a superscalar machine, it does not make sense to have a fixed number of delay slots. The number of delay slots is also heavily dependent on the pipeline depth, which may vary from one chip generation to the next. Instead, static branch prediction is used: forward branches are always predicted as not taken, and backward branches are predicted as taken. If the prediction is wrong, there is a one-cycle delay.

Floating-point divides take 14 cycles, and all other floating-point operations take four cycles, except when the result is used by a store operation, in which case they take three cycles. A load that hits in the cache takes two cycles; if it misses in the cache it takes 16 cycles. Integer multiplication takes two cycles. All other instructions take one cycle. When a load or store is followed by an integer instruction that modifies the address register, the instructions may be issued in the same cycle. If a floating-point store is followed by an operation that modifies the register being stored, the instructions may also be issued in the same cycle.

A.2 Vector DLX

For vectorizing transformations, we will use a version of DLX extended to include vector support. This new architecture, V-DLX, has eight vector registers, each of which holds a vector consisting of up to 64 floating-point numbers. The vector functional units perform all their operations on data in the vector and scalar registers.

We discuss only the functional units that perform floating-point addition, multiplication, and division, though vector machines typically have units to perform integer and logical operations as well. V-DLX issues only one scalar or one vector instruction per cycle, but nonconflicting scalar and vector instructions can overlap each other.

A special register, the vector length register (VLR), controls the number of quantities that are loaded, operated upon, or stored in any vector instruction. By software convention, the VLR is normally set to 64, except when handling the last few iterations of a loop.

The vector operations are described in Table A.2. They include arithmetic operations on vector registers and load/store operations between vector registers and memory. Vector operations take place either between two vector registers or between a vector register and a scalar register. In the latter case, the scalar value is extended across the entire vector. All vector computations have a vector register as the destination.

Table A.2. V-DLX Vector Instructions

Example Instr.       Name                        Meaning                               Similar
LV V1, R1            Load vector register        V1 ← VLR words at Memory[R1]
LVWS V1, (R1, R2)    Load vector with stride     V1 ← every R2th word from Memory[R1]
SV V1, R1            Store vector register       Memory[R1] ← VLR words from V1
SUBV V1, V2, V3      Vector-vector subtraction   V1[1..VLR] ← V2[1..VLR] - V3[1..VLR]
SUBVS V1, V2, F1     Vector-scalar subtraction   V1[1..VLR] ← V2[1..VLR] - F1
SUBSV V1, F1, V2     Scalar-vector subtraction   V1[1..VLR] ← F1 - V2[1..VLR]          DIVSV
SVLR R1              Set vector length register  VLR ← R1

The speed of a vector operation depends on the depth of the pipeline in its implementation. The first result appears after some number of cycles (called the startup time). After the pipeline is full, one result is computed per clock cycle. In the meantime, the processor can continue to execute other instructions.
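The VLR convention described above is what a vectorizing compiler's strip-mined loops look like at the source level. The following fragment is a sketch of that strip mining for the 64-element vector registers of V-DLX; the array names and sizes are assumed, and the length computation handles the final short strip.

      ! Strip mine a long loop into chunks of at most 64 iterations,
      ! matching the V-DLX vector register length; each inner loop is a
      ! candidate for a single vector instruction sequence.
      integer, parameter :: n = 1000
      real a(n), b(n)
      integer i, is, len
      do is = 1, n, 64
        len = min(64, n - is + 1)     ! last strip may be shorter (VLR < 64)
        do i = is, is + len - 1
          a(i) = a(i) + b(i)
        end do
      end do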


Table A.3 gives the startup times for the vector operations in V-DLX. These times should not be compared directly to the cycle times for operations on S-DLX because vector machines typically have higher clock speeds than microprocessors, although this gap is closing.

Table A.3. Startup Times in Cycles on V-DLX

Operation       Start-Up Time
...             ...
Vector Load     12

A large factor in the speed of vector architectures is their memory system. V-DLX has eight memory banks. After the load latency, data is supplied at the rate of one word per clock cycle, provided that the stride with which the data is accessed does not cause bank conflicts (see Section 6.5.1).

Current examples of vector machines are the Cray C-90 [Oed 1992] and IBM ES 9000 Model 900 VF [Gibson and Rao 1992].

A.3 Multiprocessors

When multiple processors are employed to execute a program, many additional issues arise. The most obvious issue is how much of the underlying machine architecture to expose to the programmer. At one extreme, the programmer can make explicit use of hardware-supported operations by programming with locks, fork/join primitives, barrier synchronizations, and message send and receive. These operations are typically provided as system calls or library routines. System call semantics are usually not defined by the language, making it difficult for the compiler to optimize them automatically.

High-level languages for large-scale parallel processing provide primitives for expressing parallelism in one of two ways: control parallel or data parallel. Fortran 90 array section expressions are examples of explicitly data-parallel operations. APL [Iverson 1962] also contains a wide variety of data-parallel array operators. Examples of control-parallel operations are cobegin/coend blocks and doacross loops [Cytron 1986].

For both shared- and distributed-memory multiprocessors, we assume that the programming environment initializes the variables Pnum to be the total number of processors and Pid to be the number of the local processor. Processors are numbered starting with 0.

A.3.1 Shared-Memory DLX Multiprocessor

sMX is our prototypical shared-memory parallel architecture. It consists of 16 S-DLX processors and 512 MB of shared main memory. The processors and memory system are connected together by a bus. Each processor has an intelligent cache controller that monitors the bus (a snoopy cache). The caches are the same as on S-DLX, except that they contain 256 KB of data. The bandwidth of the bus is 128 megabytes/second.

The processors share the memory units; a program running on a processor can access any memory element, and the system ensures that the values are maintained consistently across the machine. Without caching, consistency is easy to maintain; every memory reference is handled by the main memory controller. However, performance would be very poor because memory latencies are already too high on sequential machines to run well without a cache; having many processors share a common bus would make the problem much worse by increasing memory access latency and introducing bandwidth restrictions. The solution is to give each processor a cache that is smart enough to resolve reference conflicts.

A snoopy cache implements a sharing protocol that maintains consistency while still minimizing bus traffic. A processor that modifies a cache line invalidates all other copies and is said to own that cache line. There are a variety of cache coherency protocols [Stenstrom 1990; Eggers and Katz 1989], but the details are not relevant to this survey.


From the compiler writer's perspective, the key issue is the time it takes to make a memory reference. Table A.4 summarizes the latency of each kind of memory reference in sMX.

Table A.4. Memory Reference Latency in sMX

Type of Memory Reference                  Cycles
Read value available in local cache       1
Read value owned by other processor       21
Read value nobody owns                    20
Write value owned locally                 1
Write value owned by other processor      18
Write value nobody owns                   22

Typically, shared-memory multiprocessors implement a number of synchronization operations. FORK(n) starts identical copies of the current program running on n processors; because it must copy the stack of the forking processor to a private memory for each of the n-1 other processors, it is a very expensive operation. JOIN() desynchronizes with the previous FORK and makes the processor available for other FORK operations (except for the original forking processor, which proceeds serially). BARRIER() performs a barrier synchronization, which is a synchronization point in the program where each processor waits until all processors have arrived at that point.

A.3.2 Distributed-Memory DLX Multiprocessor

dMX is our hypothetical distributed-memory multiprocessor. The machine consists of 64 S-DLX processors (numbered 0 through 63) connected in an 8 x 8 mesh. The network bandwidth of each link in the mesh is 10 MB per second. Each processor has an associated network processor that manages communication; the network processor has its own pool of memory and can communicate without involving the CPU. Having a separate processor manage the network allows applications to send a message asynchronously and continue executing while the message is sent. Messages that pass through a processor en route to some other destination in the mesh are handled by the network processor without interrupting the CPU.

The latency of a message transfer is ten microseconds plus 1 microsecond per 10 bytes of message (we have simplified the machine by declaring that all remote processors have the same latency). Communication is supported by a message library that provides the following calls:

● SEND(buffer, nbytes, target)
  Send nbytes from buffer to the processor target. If the message fits in the network processor's memory, the call returns after the copy (1 microsecond/10 bytes). If not, the call blocks until the message is sent. Note that this raises the potential for deadlock, but we will not concern ourselves with that in this simplified model.

● BROADCAST(buffer, nbytes)
  Send the message to every other processor in the machine. The communication library uses a fast broadcasting algorithm, so the maximum latency is roughly twice that of sending a message to the furthest edge of the mesh.

● RECEIVE(buffer, nbytes)
  Wait for a message to arrive; when it does, up to nbytes of it will be put in the buffer. The call returns the total number of bytes in the incoming message; if that value is greater than nbytes, RECEIVE guarantees that subsequent calls will return the rest of that message first.
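As a small illustration of this library and of the Pid/Pnum convention introduced above, the following sketch has processor 0 distribute a scalar to all processors and then has each processor exchange one boundary strip with a neighbor; the array names, the strip size, and the even/odd neighbor pairing are assumptions made for illustration only.

      ! Sketch of dMX message-library usage; assumes the environment has
      ! set Pid (this processor) and Pnum (total processors) as described.
      real bound, edge(64), ghost(64)
      integer neighbor

      ! Processor 0 broadcasts a scalar; all other processors receive it.
      if (Pid .eq. 0) then
        call BROADCAST(bound, 4)
      else
        call RECEIVE(bound, 4)
      end if

      ! Pair up even/odd neighbors (assumed numbering) and swap one
      ! 64-word boundary strip in each direction.
      if (mod(Pid, 2) .eq. 0) then
        neighbor = Pid + 1
      else
        neighbor = Pid - 1
      end if
      call SEND(edge, 64*4, neighbor)
      call RECEIVE(ghost, 64*4)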
ACKNOWLEDGMENTS

We thank Michael Burke, Manish Gupta, Monica Lam, and Vivek Sarkar for their assistance with various parts of this survey. We thank John Boyland, James Demmel, Alex Dupuy, John Hauser, James Larus, Dick Muntz, Ken Stanley, the members of the IBM Advanced Development Technology Institute, and the anonymous referees for their valuable comments on earlier drafts.

REFERENCES

ABELSON, H. AND SUSSMAN, G. J. 1985. Structure and Interpretation of Computer Programs. MIT Press, Cambridge, Mass.


ABU-SUFAH, W 1979 Improving the performance of Reduction of operator strength, In Program virtual memory computers. Ph D., thesis, Tech Flow Analysis: Theory and ApphcatLons, S. S. Rep 78-945, Univ of Illinois at Urbana- Muchnik and N. D. Jones, Eds., Prentice-Hall, Champaign Englewood Cliffs, N, J,, 79-101, ABU-SUFAEL W., Kuzw, D ,J., AND LAWRIE, D 1981 ALLEN, J R., KENNEDY, K., PORTERFIELD, C., AND On the performance enhancement of paging WARREN, J 1983 Conversion of control depen- systems through program analysis and trans- dence to data dependence In Conference Record formations IEEE Trans Comput C-30, 5 of the 10th ACM Symposium on Principles of (May), 341-356 Programming Languages (Austin, Tex., Jan,). ACM Press, New York, 177-189. ACKRRiWMJ, W B. 1982. Data flow languages Com- puter 15, 2 (Feb ), 15-25 Z%PERT. D AND AVNON, D 1993 Architecture of the Pentium microprocessor IEEE Micro 13, 3 AHo, A. V,, JOHNSON, S C.. AND ULLMAN, J D 1977 (June), 11-21 Code generation for expressions with common subexpressions J ACM 24, 1 (Jan), 146–160 AMERICAN NATIONAI, STANDARDS INSTITUTE 1987 An American National standard, IEEE standard AHo, A V., SETHI, R, AND ULLMAN, J D. 1986 for binary floating-point arithmetic SIGPLAN Compder~. Pnnccples, Techruqucs, and Tools Not 22, 2 (Feb.), 9-25 Addmon-Wesley, Reading, Mass ANDERSON, J. M. AND LAM, M. S 1993 Global opti- AIKDN, A. AND NICOLAU, A. 1988a Optimal loop mlzatlons for parallelism and locahty on scal- parallehzatlon. In Proceechngs of the SIGPLAN able parallel machmes. In Proceedz ngs of Conference on Progrczmmlng Language Desgn the SIGPLAN Conference on Programming and Implementutz on (Atlanta, Ga , June) SIG- Language Design and Implementation (Al- PLAN Not 23, 7 (July), 308-317 buquerque, New Mexico, June) SIGPL4N Not. AIKEN, A. AND NICOLAU. A, 1988b. Perfect pipehn- 28, 6, 112-125 mg: A new loop parallelization technique. In ANDERSON, S. AND HUDAK, P 1990 Compilation of Proceedings of the 2nd European Symposrum Haskell array comprehensions for scientific on Programmmg Lecture Notes in Computer computing. In Proceedings of the SIGPLAN Science, vol 300 Springer-Verlag, Berlin, Conference on Programming Language Design 221-235. and Implementation (White Plains, N.Y , June). ALLEN, J, R, 1983. Dependence analysis for sub- SIGPLAN Not 25, 6, 137-149 scripted variables and Its apphcatlon to pro- APPEL, A. W. 1992. Compdmg wLth Contmuatzons. gram transformations. Ph.D. thesis, Computer Cambridge Umverslty Press, Cambridge, Eng- Science Dept., Rice Umverslty, Houston, Tex land. ALLEN, F. E, 1969. Program optlmlzatlon. In An- ARDEN, B. W., GALLER, B, A,, AND GRAHAM, R, M, nual Rev~ew m Automatw Programmmg 5. In- 1962. An algorithm for translating Boolean ex- ternational Tracts in Computer Science and pressions, J, ACM 9, 2 (Apr,), 222-239, Technology and then- Apphcatlon. vol. 13 Perg- AKWND AND CULLER, D. E. 1986. Dataflow architec- amon Press, Oxford, England, 239–307 tures In Annual Reznew of Computer Sctence. ALLEN, F. E. AND COCKE, J. 1971. A catalogue of Vol. 1. J. F, Traub, B. J. Grosz, B, W. Lampson, optimizing transformations. In Design and 0p- and N, J. Nilsson, Eds, Annual Rewews, Palo tmawatbon of Compzlers, R Rustm, Ed Pren- Mto, Cahf, 225-253 tice-Hall, Englewood Chffs, N. J., 1–30. ,&VZND, KAT13AIL, V , AND PINGALI, K. 1980. A ALLEN, J. R. AND KENNEDY, K. 1987, Automatic dataflow architecture with tagged tokens. Tech. translation of Fortran programs to vector form. Rep TM-174, Laboratory for Computer Sci- ACM Trans. Program Lang. Syst. 
9, 4 (Oct.). ence, Massachusetts Institute of Technology, 491-542, Cambridge, Mass. ALLEN, J. R. AND KENNEDY, K. 1984, Automatic loop BACON, D. F., CHOW, J.-H., Ju, D R, MUTHUKUMAR, interchange In Proceedings of the SIGPLAN K , AND SARKAR, V. 1994. A compiler framework Symposium on Compiler Construction for restructuring data declarations to enhance (Montreal, Quebec, June). SIGPJ5AN Not. 19, cache and TLB effectiveness. In Proceedings of 6, 233-246. CASCON ’94 (Toronto, Ontario, Oct.). ALLEN, F, E., BURKE, M,, CHARLES, P., CYTRON, R., BAILEY, D. H, 1992, On the behawor of cache memo- AND FERRANTE, J. 1988a. An overwew of the ries with stnded data access, Tech. Rep. RNR- PTRAN analysis system for multiprocessing, J. 92-015, NASA Ames Research Center, Moffett Parall Distrib. Comput. 5, 5 (Oct.), 617-640. Field, Calif., (May), ALLEN, F. E,, BURKE, M., CYTRON, R., FERKANTE, J., BAILEY, D. H., BARSZCZ, E., BARTON, J. T., HSIEH, W., AND SARKAR, V. 1988b. A framework BROWNING, D. S., CARTER, R. L., DAGUM, L,, for determining useful parallelism. In Proceed- FATOOHI, R. A., FR~DERICKSON, P. O., LASINSKZ, ings of the ACM International Conference on T. A., SCHREIBER, R. S., SIMON, H. D., Supercomputu-zg (St. Malo, France. July). ACM VENKATAKRISHNAN, V., AND WEERATUNGA, S. K. Press, New York, 207–215. 1991. The NAS parallel benchmarks. Int. J. ALLEN, F. E , COCKE, J., AND KENNEDY, K. 1981, Supercomput. Appl, 5, 3 (Fall), 63-73,


BALASUNDARAM, V. 1990. A mechanism for keeping BLELLOCH, G, 1989, Scans as primitive parallel op- useful internal information in parallel pro- erations. IEEE Trans. Comput. C-38, 11 (Nov.), gramming tools: The Data Access Descriptor, 1526-1538. J. Parall. DtstrLb. Comput. 9, 2 (June), BROMLEY, M., HELLER, S., MCNERNEY, T., AND 154-170. STEELE, G. L., JR. 1991. Fortran at ten gi BALASUNDARAM, V., Fox, G., KENNEDY, K., AND gaflops: The Connection Machine convolution KREMER, U. 1990. An interactive environment compiler. In Proceedings of the SIGPLAN Con- for data partitioning and distribution. In Pro- ference on Programmmg Language Destgn and ceedings of the 5th Dlstnbuted Memory Corn- Implementation (Toronto, Ontano, June). SIG- puter Conference (Charleston, South Carolina, PLAN Not. 26, 6, 145-156. Apr.). IEEE Computer Society Press, Los BURKE, M. AND CYTRON, R. 1986. Interprocedural Alamitos, Calif. dependence analysis and parallelization. In BALASUNDARAM, V., KENNEDY, K., KREMER, U., Proceedings of the SIGPLAN Symposium on MCKINLEY, K., AND SUBHLOK, J. 1989. The Compiler Construction (Palo Alto, Calif., June). ParaScope editor: An interactive parallel pro- SZGPLANNot. 21, 7 (July), 162-175. Extended gramming tool. In Proceedings of Superconz- version available as IBM Thomas J. Watson putmg ’89 (Reno, Nev., Nov.). ACM Press, New Research Center Tech. Rep. RC 11794. York, 540-550. BURNETT, G. J. AND COFFMAN, E. G., JR. 1970. A BALL, J. E. 1979. Predicting the effects of optimiza- study of interleaved memory systems. In Pro- tion on a procedure body. In Proceedings of the ceedings of the Spring Joint AFIPS Computer SIGPLAN Symposium on Compiler Construc- Conference. vol. 36. AFIPS, Montvale, N.J., tion (Denver, Color., Aug.). SIGPLAN Not. 14, 467-474. 8, 214-220. BURSTALL, R. M. AND DARLINGTON, J. 1977. A trans- formation system for developing recursive pro- BANERJEE, U. 1991. Unimodular transformations of grams. J. ACM 24, 1 (Jan.), 44-67. double loops. In Aduances in Languages and Compilers for Parallel Processing, A. Nicolau, BUTLER) M., YEH, T., PATT, Y., ALsuP, M., SCALES, Ed. Research Monographs in Parallel and Dis- H., AND SHEBANOW, M. 1991, Single instruction tributed Computing, MIT Press, Cambridge, stream parallelism is greater than two. In Pro- Mass., Chap. 10. ceedings of the 18th Annual International Sym- posium on Computer Architecture (Toronto, BANERJEE, U. 1988a. Dependence Analysis for Su- Ontario, May). SIGARCH Comput. Arch. Neuw percomputing. Kluwer Academic Publishers, 19, 3, 276-286. Boston, Mass. CALLAHAN, D. AND KENNEDY, K. 1988a. Analysis of BANEKJEE, U. 1988b. An introduction to a formal interprocedural side-effects in a parallel pro- theory of dependence analysis. J. Supercom- gramming environment J. Parall. Distrib put. 2, 2 (Oct.), 133-149. Cornput. 5, 5 (Oct.), 517-550. BANERJEE, U. 1979. Speedup of ordinary programs, CALLAHAN, D. AND KENNEDY, K. 1988b. Compihng Ph.D. thesis, Tech. Rep. 79-989, Computer Sci- programs for distributed-memory multiproces- ence Dept., Univ. of Illinois at Urbana- sors. J. Supercompzd. 2, 2 (Oct.), 151–169. Champaign. CALLAHAN, D., CARR, S., AND KENNEDY, K. 1990. BANERJEE, U., CHEN, S. C., KUCK, D. J., AND TOWLE, Improving register allocation for subscripted R. A. 1979. Time and parallel processor bounds variables. In Proceedings of the SIGPLAN Con- for Fortran-like loops. IEEE Trans. Comput. ference on Programming Language Design and C-28, 9 (Sept.), 660-670. Implementation (White Plains, N.Y., June). BANNING, J. P. 1979. 
An efficient way to find the SIGPLAN Not. 25, 6, 53-65. side-effects of procedure calls and the aliases of CALLAHAN,D , COCKE, J., AND KENNEDY, K. 1988. variables. In Conference Record of the 6th ACM Estimating interlock and improving balance for Symposium on Principles of Programming Lan - pipelined architectures J. Parall. Distrib. guages (San Antonio, Tex., Jan.). ACM Press, Comput. 5, 4 (Aug.), 334-358. 29-41. CALLAHAN, D., COOPER, K., KENNEDY, K., AND BERNSTEIN, R. 1986. Multiplication by integer con- TORCZON, L. 1986. Interprocedural constant stants. Softw. Pratt. Exper. 16, 7 (July), propagation. In Proceedings of the SIGPLAN 641-652. Sympos~um on CompJer Construction (Palo Alto, Cahf., June). SIGPLAN Not. 21, 7 (July), BERRY, M., CHEN, D., Koss, P., KUCK, D., Lo, S., 152-161. PANG, Y., POINTER, L., ROLOFF, R., SAMEH, A., CLEMENTI, E , CHIN. S , SCHNEI~ER, D , Fox, G , CARR, S. 1993. Memory-hierarchy management. ME SSINA, P., WALKER, D., HSIUNG, C., Ph.D. thesis, R]ce Umversity, Houston, Tex. SCHWARZMEIER, J., LUE, K., ORSZAG, S., S~IDL, CHAITIN, G. J. 1982. Register allocation and spilling F., JOHNSON, O., GOODRUM, R., AND MARTIN, J. via graph coloring. In Proceedings of the SIG- 1989. The Perfect Club benchmarks: Effective PLAN Symposium on Compiler Construction performance evaluation of supercomputers. Znt. (Boston, Mass., June). SIGPLAN Not 17, 6, J. Supercomput. Appl. 3, 3 (Fall), 5-40. 98-105.


CFIMTIN, G. J,, AUSLANDER, M. A., CHANDRA, A. K., 221-228. Also available as IBM Tech Rep. RC COCKE, J., HOPKINS, M. E., AND MARKSTEIN, 8111, Feb. 1980. P. W. 1981. Register allocation via coloring. COCKE, J. AND SCHWARTZ, J. T. 1970. Programming Comput. Lang. 6, 1 (Jan.), 47-57. languages and their compders (preliminary CHAMBERS, C. .MXD UNGAR, D. 1989. Customization: notes). 2nd ed. Courant Inst. of Mathemalzcal Optimizing compiler technology for SELF, a Sciences, New York Umverslty, New York. dynamically-typed object-oriented program- COLWELL, R. P,, NIX, R, P,, ODONNEL, J. J., ming language. In proceedings of the SIG- PAPWORTH, D. B., AND RODMAN, P, K. 1988. A PLAN Conference on Programming Language VLIW architecture for a trace scheduling com- Design and Implementation (Portland, Ore., piler. IEEE Trans. Comput. C-37, 8 (Aug.), June). SIGPLAN Not. 24, 7 (July), 146-160. 967-979. CEATTERJEE, S., GILBERT, J. R., AND SCHREIBER, R. COOPER, K, D, AND KENNEDY, K, 1989, Fast inter- 1993a. The alignment-distribution graph. In procedural ahas analysie, In Conference Record Proceedings of the 6th International Workshop of the 16th ACM Symposmm on Prmc~ples of on Languages and Compilers for Parallel Com- Programmmg Languages (Austin, Tex., Jan.). puting. Lecture Notes in Computer Science, ACM Press, New York, 49-59. vol. 768. 234–252. COOPER, K, D., HALL, M. W,, AND KENNEDY, K. 1993. CHATTERJEE, S., GILBERT, J. R., SCHREIBER, R., AND A methodology for procedure cloning. Comput. TENG, S.-H. 1993b. Automatic array alignment Lang. 19, 2 (Apr.), 105-117. in data-parallel programs. In Conference Record COOPER, K. D., HALL, M. W., AND TORCZON, L. 1992. of the 20th ACM Sympo.mum on Prmczples of Unexpected side effects of inline substitution Programmmg Languages (Charleston, S. Car- A case study. ACM Lett. Program. Lang. Syst olina, Jan.). ACM Press, New York, 16–28. 1, 1 (Mar.), 22-32. CHEN, S. C. AND KUCK, D. J. 1975. Time and paral- COOPER, K D., HALL, M. W., AND TORCZON, L 1991. lel processor bounds for linear recurrence sys- An experiment with inline substitution. Softw tems. IEEE Trans. Comput. C-24 7 (July), Pratt. Exper. 21, 6 (June), 581-601 701-717. CRAY RESEARCH 1988 CFT77 Reference Manual. CHOI, J.-D., BURKE, M., AND CARINI, P. 1993. Effi- Publication SR-0018-C. Cray Research, Inc.. cient flow-sensitive interprocedural computa- Eagan, Minn. tion of pointer-induced aliases and side effects. CYTRON, R. 1986. Doacross: Beyond vectorizatlon In Conference Record of the 20th ACM Sympo- for multiprocessors. In Proceedings of the Inter- sium on Principles of Programming Languages national Conference on Parallel Processing (St. (Charleston, S. Carolina, Jan.). ACM Press, Charles, Ill., Aug.). IEEE Computer Society, New York, 232-245. Washington, D. C., 836-844. CHOW, F C 1988. Minimizing register usage CYTRON, R. AND FERRANTE, J. 1987. What’s in a penalty at procedure calls. In Proceedings of name? -or- the value of renaming for paral- the SIGPLAN Conference on Programming lehsm detection and storage allocation. Pro- Language Design and Implementation (Atlanta, ceedings of the Internat~onal Conference on Ga., June). SIGPLAN Not. 23, 7 (July), 85-94. Parallel Procewng (University Park, Penn., CHOW, F. C. AND HENNESSEY, J. L. 1990. The prior- Aug.), Pennsylvania State University Press, ity-based coloring approach to register alloca- University Park. Pa., 19-27. tion. ACM Trans. Program. Lang. Syst. 12, 4 DANTZIG, G. B. AND EAVES, B. C. 1974. Fourier- (Oct.), 501-536. 
Motzkin ehmination and sits dual with applica- CLARKR, C. D. AND PEYTON-JONES, S. L. 1985. Strict- tion to integer programming. In Combinatorial ness analysis—a practical approach. In Func- Programming: Methods and Applications tional Programming Languages and Computer (Versailles, France, Sept.). B. D. Roy, Ed. D. Architecture. Lecture Notes in Computer Sci- Reidel, Boston, Mass., 93-102.

ence, vol. 201. Springer-Verlag, Berlin, 35–49. DENNIS, J B 1980 Data flow supercomputers CLINGER, W. D. 1990 How to read floating point Computer 13, 11 (Nov.), 48-56. numbers accurately. In proceedings of the SIG- DIXIT, K. M. 1992. New CPU benchmarks from PLAN Conference on Programmmg Language SPEC. In Digest of Papers, Spring COMPCON Deslogn and Implementation (White Plains, 1992, 37th IEEE Computer Society Interna- NY., June). SIGPLAN Not. 25, 6, 92-101. tional Conference (San Francisco, Calif., Feb.). COCKE, J. 1970. Global common subexpression elim- IEEE Computer Society Press, Los Alamitos, ination. In proceedings of the ACM Symposmm Calif., 305-310 on Compder OptLmizatzon (July). SIGPLAN DONGARRA, J. AND HIND, A. R. 1979. Unrolling loops Not. 5, 7, 20–24. m Fortran. Softzo, Pratt. Exper. 9, 3 (Mar.), 219-226. COCKR, J. AND MARKSTEIN, P. 1980. Measurement of program improvement algorithms. In Proceed- EGGERS, S. J. AND KATZ, R. H. 1989 Evaluating the ings of the IFIP Congress (Tokyo, Japan, Oct.). performance of four snooping cache coherency North-Holland, Amsterdam, Netherlands, protocols. In proceedings of the 16th Annual


Received November 1993; final revision accepted October 1994